On Fri, 21 Jun 1996 11:31:32 EDT, Bob Metcalfe wrote:
By the way, there are reports from two days ago that 400,000 people lost their Internet access for 13 hours. Sounds like an outage approaching "collapse." Was that just a Netcom thing that NANOG has no interest in? Netcom is not talking very much about what happened. Any clues/facts out there? Were any NAPs involved?
The news reports said that numerous external routes somehow got into Netcom's internal backbone routing, and the extra load set off a chain reaction that took everything down. Apparently it was mainly confined to Netcom's network; whether by design or by dumb luck, I don't know.

We currently hang off of AGIS in San Jose, and for about four hours after Netcom came back, we were up and down. Couldn't tell from here whether it was AGIS, MAE West, or something else, or whether Netcom coming back had anything to do with it.

I watched this outage from the periphery, and was completely blown away by the non-reaction to it. Official statements from Netcom (essentially confirming Bob's numbers above) were quoted on the Reuters newswire, and on the front page of the San Jose Mercury News Business section the next day (although the editor played down the impact a little, mixed a one-hour AOL email outage into the same story, and turned it into "outages affect online services").

On the other hand, Netcom has said essentially nothing to its subscriber base about the outage, and I've seen only a little mention of it around the net. Am I looking in the wrong places -- or is there no good way to communicate about these sorts of things yet? (I've signed up to the outage discussion list, as Sean suggested.)

My impression is starting to be that most Netcom subscribers didn't really notice the difference between normal Internet operations and the 13-20 hour outage, and/or didn't have the diagnostic capabilities to tell. There were technically oriented folks who could see that something was going on, but even for them, it was hard to tell what.

I'm wearing two hats for the next set of questions -- the first as a technical manager for an ISP growing an international backbone, and the second as someone who's concerned about marketing the Internet (and my company) to the public.
Can other big parts of the backbone fall down and take 13 (or more) hours to get back up? Or is the rest of the net engineered more redundantly than Netcom? Should I build two backbones, each with separate technologies? Was this a foreshock of the coming Metcalfean Big One, or just lousy procedures at one of the bigger ISPs?

Inquiring minds want to know. Right now, there appear to be just a few of them (thankfully?). And now is the time to develop communications and publicity strategies for this sort of thing -- along with the engineering to hopefully prevent such outages.

-- Pete Kaminski kaminski@nanospace.com