In message <Pine.LNX.4.30.0101242038380.5951-100000@anime.net>, Dan Hollis writes:
On Wed, 24 Jan 2001, Eric A. Hall wrote:
At 6:30 p.m. Tuesday (PST), a Microsoft technician made a configuration change to the routers on the edge of Microsoft's Domain Name Server network. At approximately 5 p.m. Wednesday (PST), Microsoft removed the changes to the router configuration and immediately saw a massive improvement in the DNS network.
So basically, it took microsoft 23 hours to fix a router configuration.
There's a story (possibly apocryphal) about the time that Steinmetz was called in as a consultant to repair problems with some massive piece of electrical machinery. After poking around for a while, some staring, and much thinking, he adjusted one screw, and solved the problem. He then proceeded to write out a bill for $1000. The company was outraged. "$1000 for adjusting one screw? You're crazy!" Steinmetz agreed, took back the bill, and tore it up. He then wrote out a new bill:

	Adjust one screw			$1
	Knowing which screw to adjust		$999

Remember the other half of Jim Duncan's post from last night:

	There were clearly some mistakes made, but it is also the case
	that there were a _lot_ of different things going on that
	contributed to the problem or complicated its resolution.

He *worked* this problem; this is a first-hand statement, not conjecture by those who weren't there.

Let me put forth a blatant generalization of my own: *all* major failures are due to complex causes. The proof is simple: if you're small and hence presumably clueless (the "mom and pop" ISPs another poster sneeringly referred to), your problems don't cause major failures for the rest of the net. If you're big (and hence presumably clueful), you solve the simple problems quickly and they don't become major failures. Finding and fixing *the* root cause is hard when you're in the midst of a swamp full of other alligators and you don't know which one is (currently) biting you in the rear.

I'd love to see a detailed description of what went wrong, and I hope that those in the know will be allowed to post it or present it in Atlanta. But I'm willing to wager that it wasn't just (a) a single router configuration change, (b) brain-damage in Microsoft's DNS code, (c) malicious activity aimed at Microsoft, (d) RAMEN-induced misbehavior, or (e) any other single cause.

		--Steve Bellovin, http://www.research.att.com/~smb
I'd love to see a detailed description of what went wrong, and I hope that those in the know will be allowed to post it or present it in Atlanta.
was it not don knuth's turing award lecture which consisted of a detailed analysis of some recent bugs in his code?
[ On Thursday, January 25, 2001 at 09:30:19 (-0500), Steven M. Bellovin wrote: ]
Subject: Re: MS explains
I'd love to see a detailed description of what went wrong, and I hope that those in the know will be allowed to post it or present it in Atlanta. But I'm willing to wager that it wasn't just (a) a single router configuration change, (b) brain-damage in Microsoft's DNS code, (c) malicious activity aimed at Microsoft, (d) RAMEN-induced misbehavior, or (e) any other single cause.
But there was a single root cause that made the problem possible in the first place... Analysis of the actual event will no doubt be interesting to some, but for all their users out there on the Internet, all that matters is that proper deployment of their servers would never have allowed the situation to occur in the first place.

-- 
	Greg A. Woods
	+1 416 218-0098   VE3TCP   <gwoods@acm.org>   <robohack!woods>
	Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
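Woods's point, that placing every authoritative nameserver behind a single network segment creates one failure domain, is the same advice RFC 2182 gives on secondary-server placement, and it can be checked mechanically. A minimal sketch (the function name and all IP addresses below are illustrative, not Microsoft's actual 2001 addresses) that flags a nameserver set whose addresses all fall inside one routed prefix:

```python
import ipaddress

def single_prefix_risk(ns_ips, prefix_len=24):
    """Return True if every nameserver address falls inside a single
    /prefix_len network -- i.e. one router or configuration change
    could make all of them unreachable at once."""
    nets = {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in ns_ips
    }
    return len(nets) == 1

# Illustrative addresses only (documentation ranges, RFC 5737).
clustered = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
diverse = ["192.0.2.10", "198.51.100.20", "203.0.113.30"]

print(single_prefix_risk(clustered))  # True: a single point of failure
print(single_prefix_risk(diverse))    # False: survives one prefix outage
```

A /24 is only a rough proxy for a shared failure domain; a real audit would compare announced BGP prefixes and physical sites, but the idea is the same: resolvers can only route around a failure if the NS set isn't all behind it.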
participants (3)
	- Randy Bush
	- Steven M. Bellovin
	- woods@weird.com