Don't be so fast to point the finger. Generally speaking, blame looks obvious in the initial news reports but tends to diminish with retrospective, fact-based assessment.

For example: it's "obvious" that serious net sites need multihoming. But what if your multihomed bits go through the same pipe (or worse, through the same fiber)? Who do you blame when you find out? Worse, in terms of blame: who can you go to beforehand who actually knows where that can happen?

I well remember this slide from Sean Donelan's talk at NANOG23:

---------------------------
http://www.nanog.org/mtg-0110/ppt/donelan_files/v3_document.htm

What Didn't Work - Diversity and Avoidance

* Equipment in the World Trade Center primarily served tenants in complex (shared fate)
* SONET ring through WTC tower 1 and alternate path through WTC tower 2
* Damage to 140 West Street central office and surrounding underground infrastructure
* Backup circuit routed through same facility
* “Advanced” data circuits (ISDN/DSL) concentrated in a few central offices
---------------------------

The real answer, found elsewhere in Sean's talk, is that the design of the net has always encouraged redundancy as an engineering principle. Stress situations are where that pays off, even though it can't cover every possible eventuality (and, as has already been noted, redundant equipment also fails, and it creates more complex failure modes).

The net had problems on 9/11, especially around the WTC, but Sean's slides document remarkable resiliency even in that area.

The power went off at a key spot in the San Francisco infrastructure today. But as far as I know, even though it was mentioned in the Chron article, Craigslist stayed online because they have a distributed and redundant system (which is not to say impervious to all failure modes).

Some shortcomings are obvious, but all I am saying is, before rushing to cast blame, it's a good idea to try to collect some facts.
fh

-------------------
http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2007/07/24/BAG9NR67253.DTL&tsp1

Power restored in San Francisco

Marisa Lagos and Matthew B. Stannard, Chronicle Staff Writers
Tuesday, July 24, 2007

(07-24) 16:57 PDT SAN FRANCISCO -- Between 30,000 and 50,000 Pacific Gas and Electric Co. customers in San Francisco and the northern Peninsula lost power for several hours this afternoon after what witnesses described as an explosion under a manhole cover on Mission Street, the utility said.

Brian Swanson, a spokesman for the utility, said power failures were reported throughout wide swaths of the east side of San Francisco, including downtown and at PG&E's own office on Beale Street near the Ferry Building.

The outage first occurred at about 1:50 p.m., and electricity flickered on and off at least five times before power was restored at about 4 p.m.

PG&E officials said the source of the power outage was an underground failure. Standing at a manhole in a plaza at 560 Mission St. in San Francisco, where witnesses reported hearing an explosion, Swanson said it could have been the source of the outage, but officials were still investigating.

The incident recalled an August 2005 explosion in an underground vault at Post and Kearny streets that critically injured a woman who was walking by. At the time, PG&E blamed high levels of moisture in the attached high-voltage chambers and said it was checking the safety of about 1,000 other high-voltage chambers.

Swanson said today's incident -- in which no one was injured -- was caused by some sort of fault in the line.

"It is completely unrelated to what happened two years ago," he said.

Witnesses said they heard an explosion at about 1:50 p.m., then saw flames coming from the manhole.

Actor Torino Von Jones, 32, said he was filming a Fruit of the Loom commercial down the block at the time.

"We were standing over there waiting for the camera cue when we heard a big explosion," he said. "Flames came up taller than I am, and I'm 6-foot-2."

"Naturally, when you hear an explosion, you think the worst," Von Jones said. Nevertheless, he hurried back to work.

"We're Fruit of the Loom -- we've got to make this commercial."

The outage briefly affected some Muni buses and trains, but all were back to normal by 3 p.m., a spokeswoman said.

Workers at several downtown and South of Market offices were reportedly sent home for the day following the outage.

Additionally, the datacenter 365 Main -- which hosts Web sites including Craigslist and Yelp -- lost power.

------ mail forwarded, original message follows ------

To: nanog@merit.edu
From: sethm@rollernet.us <Seth Mattinen>
Subject: Re: San Francisco Power Outage
Date: Tue, 24 Jul 2007 15:54:08 -0700

Jonathan Lassoff wrote:
Just a heads up to anyone on list that PG&E has just sustained a large outage in San Francisco that has caused a few hiccups (network, electrical, infrastructural, etc.) around the city.
I've confirmed that customers in 365 Main and parts of telecom 1 have both sustained brief blackouts. No word yet from 200 Paul.
Anyone in the area that could use a hand with anything, I'll probably be wrapping up fixes for my stuff soon, and would be glad to help however I can.
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where are the generators? UPS? Testing said combination of UPS and generators? What if it was important?

I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.

If you do accept this as a good reason for failure, why?

~Seth