On Wed, Jan 24, 2001 at 11:45:08PM -0800, Sean Donelan wrote:
> That's a bit unfair.
> There have been a number of lengthy outages.
> AS7007 router configuration problem: April 25 1997, lasted 2 hours
> AOL (ANS router configuration problem): Aug 7 1996, lasted 19 hours
> ATT frame-relay switch errors: April 13-14 1998, lasted 26 hours
> BBN standard power failure: October 11 1996, lasted about 12 hours (off and on)
> NETCOM router configuration error: June 20 1996, lasted 13 hours
> Sprint database problems: September 3 1996, lasted 5 hours
> NSI root server corruption (operational error): July 16 1997, lasted 4 hours
> PacBell configuration problems: January 30-31 1997, lasted 48 hours
> UUNET frame-relay problems: July 1 1997, lasted over 24 hours
> UUNET cisco/bay router problems: November 7 1997, lasted 5 hours
> Worldcom frame-relay switch errors: August 1999, lasted 9-10 days
Your point isn't lost on me, but I think there are a couple of distinctions to make here. I do freely admit, however, that I'm not familiar with all of the outages you listed.

1. I think it might be prudent to weed out the 2-5 hour outages here. While that's still an excessively long time to recover from a change that should have been monitored and tested properly in the first place (and still probably cause for firing in some shops), I can at least conceive of it taking that amount of time. Too long, yes, but not quite in the jaw-dropping category.

2. Several of the remaining group that I'm familiar with (namely the AT&T and Netcom outages) involved problems which cascaded out to the entire network, and therefore could not simply be undone by backing out the change on that router/switch/etc. I've been through an outage or two which didn't gain such notoriety but still took several hours after identifying the problem just to go out and reboot enough boxes to settle the network down.

In the MS DNS case, I don't feel the same point applies. By most accounts, the issue appears to have been a change to a single router affecting one or two subnets, and reversing it was simply a matter of backing out said change.

Again, don't get me wrong. Your point is taken, and I've certainly caused and felt my share of pain with (sometimes unnecessarily) lengthy outages. I agree with Randy about cutting the folks involved some slack, since we've all been and will be there at some point. But I do see it the other way as well: this amount of time to resolve a local issue caused by a procedurally implemented change (I'll give them the benefit of the doubt) on a critical network device is surely due some scrutiny.

-c