Thus spake "Eric Gauthier" <eric@roxanne.org>
> Anyone have any idea what really happened: http://www.boston.com/dailyglobe2/330/science/Got_paper_+.shtml
I can't speak to exactly what happened because of NDA, but I think I can help NANOGers understand the environment and why this happens in general.
> I know someone who worked on it, but I've avoided asking what really happened so I don't freak out the day the ambulance drives me up to their emergency room :) The other day, I did forward the article over to our medical school in the hopes that they might "check" their network for similar "issues" before something happens :)
I see a lot of Fortune 500 networks in my job, and I'd say at least 75% of them are in the same state: a house of cards standing only because new cards are added so slowly. Any major event, whether a new bandwidth-hungry application or a parity error in a router, can bring the whole thing down, and there's no way to bring it back up again in its existing state. No matter how many PowerPoint slides you send to the CIO, it's always a complete shock when the company ends up in the proverbial handbasket and you're looking at several days of downtime to do 4+ years of maintenance and design changes. And, what's worse, nobody learns the lesson and this repeats every 2-5 years, with varying degrees of public visibility. This is a bit of culture shock for most ISPs, because an ISP exists to serve the network, and proper design is at least understood, if not always adhered to. In the corporate world, however, the network and support staff are an expense to be minimized, and capital or headcount is almost never available to fix things that are "working" today.
> I don't know which scares me more: that the hospital messed up spanning-tree so badly (which means they likely had it turned off) that it imploded their entire network, or that it took them 4 days to figure it out.
It didn't take 4 days to figure out what was wrong -- that's usually apparent within an hour or so. What takes 4 days is having to reconfigure or replace every part of the network without any documentation or advance planning. My nightmares aren't about having a customer crater like this -- that's an expectation. My nightmare is when it happens to the entire Fortune 100 on the same weekend, because it's only pure luck that it doesn't.
S