FWD: Explanation for the recent major downtime

15 Sep 2005

      My personal website is hosted with DreamHost.  They sent this out to 
their customers today.  Of interest to NANOG is the bit about the N+1 
redundant genset system: two generators failed in quick succession, 
which took down the UPS and left the entire data center dark.  Something 
to consider in your data center design...

jc

DreamHost Announcement Team wrote:
...
On Monday, September 12 the greater Los Angeles area experienced a major power outage 
affecting large sections of the city, including our main data center.  The power outage began 
shortly before 1pm PST and continued until about 4:30pm PST.  Our data center is equipped 
with a redundant backup power system with both battery UPS systems and diesel generators, 
but the backup failed and our entire data center was powered down.
We have previously covered much of this information on our official weblog 
(http://blog.dreamhost.com/) but many of you have not seen that information so we will 
summarize the events here.
When the grid power to our building was cut, the UPS system kicked in and kept everything in 
the building up and running.  The five generators also fired up and began providing power.  
The building needs four generators to operate at full power so the system is designed to 
tolerate a single failure.  Unfortunately, two of the five generators failed within minutes of each 
other.  We receive our power from the building housing our data center and they also manage 
the redundant power system.  We do not know the exact reason for the generator failures at 
this time.  We have received some vague explanations that we have not found to be satisfactory.  
Regardless, the remaining three generators were not sufficient to meet the building's power 
needs.  That caused the emergency electrical systems to transfer into a “load shedding 
mode” and the building’s UPS system to turn itself off, which prevented permanent damage to 
the UPS and related equipment.  That shut everything down, including emergency lighting, and 
the building was evacuated.
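
The N+1 arithmetic described above can be sketched in a few lines.  This is only an illustration using the figures from the announcement (five generators installed, four needed at full power); the function names are hypothetical and nothing here reflects the building's actual control logic.

```python
def tolerated_failures(installed: int, required: int) -> int:
    """Number of units that can fail while the rest still carry full load."""
    return max(installed - required, 0)


def can_carry_full_load(installed: int, required: int, failed: int) -> bool:
    """True if the surviving units still meet the full-load requirement."""
    return installed - failed >= required


# Figures from the announcement: five generators installed, four needed.
INSTALLED, REQUIRED = 5, 4

for failed in range(3):
    ok = can_carry_full_load(INSTALLED, REQUIRED, failed)
    status = "full load OK" if ok else "load shedding"
    print(f"{failed} failed -> {INSTALLED - failed} running: {status}")
```

With one unit of spare capacity the design tolerates exactly one failure; the second failure leaves three generators against a requirement of four, which is the case that triggered load shedding.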
About 15 minutes later, one of the generators was started up to power emergency lighting and 
a couple of our senior technicians made their way into the (still evacuated) building and down 
to our data center to assess the damage.  Since the backup power had failed, our own data 
center power remained off until the main grid power came back.  We then proceeded to slowly 
power up our equipment.  Servers (and all computers) consume significantly more power when 
booting up than when up and running so there is some risk of overloading the power circuits if 
too many of them are flipped on at once.  Keeping that in mind, we powered everything on as 
quickly as possible.  At that time the majority of our services were fully back up and running 
but some services were still down and we began the process of systematically verifying all 
services and making any necessary repairs and adjustments.  Whenever a large number of 
servers suddenly lose power, a small percentage of them will not come back up properly, 
and when you have several hundred servers it takes a while to verify all of them.
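
The power-up caution described above (servers draw more while booting than while running) can be sketched as a simple batching calculation.  This is a rough illustration only: the wattage figures and the circuit limit are made-up assumptions, not numbers from the announcement, and the sketch assumes each batch finishes booting before the next one is switched on.

```python
def plan_power_on_batches(servers: int, boot_watts: int,
                          run_watts: int, circuit_limit_watts: int) -> list[int]:
    """Split servers into batches so the circuit limit is never exceeded.

    Each batch draws boot_watts per server while booting; servers from
    earlier batches are assumed to have settled to run_watts each.
    """
    batches = []
    running = 0
    remaining = servers
    while remaining > 0:
        headroom = circuit_limit_watts - running * run_watts
        batch = min(remaining, headroom // boot_watts)
        if batch <= 0:
            raise ValueError("circuit cannot carry any more servers")
        batches.append(batch)
        running += batch
        remaining -= batch
    return batches


# Hypothetical figures: 30 servers, 400 W while booting, 250 W once running,
# on a circuit limited to 10,000 W.
print(plan_power_on_batches(30, 400, 250, 10_000))
```

Switching on all 30 hypothetical servers at once would draw 12,000 W at boot, which is why the sketch staggers them even though the steady-state load of 7,500 W fits comfortably.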
Once our own access to our servers was restored our staff continued working into the night to 
restore as much service as possible and to respond to as many of your support cases as 
possible.  Some of our staff continued working all the way through the night and we were able 
to restore almost everything that first night.
Tuesday (September 13) started off early with all of us addressing the residual issues.  At 
around noon that day one of our core routers experienced an internal failure stemming from 
damage previously sustained during the power outage.  Our routers handle all of the Internet 
traffic coming in and out of our network and they are set up in a redundant way to minimize 
network disruption when a failure does occur.  In this case, the main CPU of the router (called 
the 'supervisor') died and the secondary one took over.  Everything continued working almost as 
it should have, but there is a remaining router issue that we are still working with Cisco support 
on.  That issue is responsible for the slower than normal performance of our network and it will 
be resolved absolutely as soon as possible.
During this outage, our off-network Emergency Status Page (http://status.dreamhost.com/) 
proved to be an invaluable resource for disseminating information among our customers.  That 
status page remained up throughout the power outage and was updated regularly as we 
received new information.  Unfortunately, not everyone knows about it and we will be working 
to improve that situation in the coming days.  Those bloggers among you who did know to 
check the status page were extra helpful in passing along the information to other 
dreamhosters who were still in the dark.  Thank you to everyone who helped out with that!
This announcement will be followed by another explaining what went wrong with our processes 
and what we plan to do to address them.  That will come in the next few days.
We will be continuing to provide more detailed information on our official weblog found here:  
http://blog.dreamhost.com/
Also, everyone who has not bookmarked our Emergency Status Page should do so now.  That 
page is found here:
http://status.dreamhost.com/
We will be improving on the basic page we have there to provide as useful an avenue of 
information as possible.
If you have any additional questions about this outage, please let us know.  We will be happy to 
address all of your questions or concerns.
The Un-Happy DreamHost Powerless Team

jc dill
