FWD: Explanation for the recent major downtime
My personal website is hosted with DreamHost. They sent this out to their customers today. Of interest to NANOG is the bit about the N+1 redundant genset system having 2 generators quickly fail, and in doing so having the UPS fail and the entire data center go dark. Something to consider in your data center design...

jc

DreamHost Announcement Team wrote:
On Monday, September 12 the greater Los Angeles area experienced a major power outage affecting large sections of the city, including our main data center. The power outage began shortly before 1pm PST and continued until about 4:30pm PST. Our data center is equipped with a redundant backup power system with both battery UPS systems and diesel generators, but the backup failed and our entire data center was powered down.
We have previously covered much of this information on our official weblog (http://blog.dreamhost.com/), but many of you have not seen it, so we will summarize the events here.
When the grid power to our building was cut, the UPS system kicked in and kept everything in the building up and running. The five generators also fired up and began providing power. The building needs four generators to operate at full power so the system is designed to tolerate a single failure. Unfortunately, two of the five generators failed within minutes of each other. We receive our power from the building housing our data center and they also manage the redundant power system. We do not know the exact reason for the generator failures at this time. We have received some vague explanations that we have not found to be satisfactory. Regardless, the remaining three generators were not sufficient to meet the building's power needs and that caused the emergency electrical systems to transfer into a “load shedding mode” and the building’s UPS system to turn itself off, thus preventing permanent UPS and related equipment damage. That shut everything down, including emergency lighting, and the building was evacuated.
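The redundancy arithmetic described above (five generators, four needed for full load, so one failure is tolerated and a second is not) can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes every generator contributes equal capacity, which the announcement does not state.

```python
def tolerated_failures(total_units: int, required_units: int) -> int:
    """Number of generators that can fail while the rest still carry full load.

    Assumes all units have equal capacity (an assumption, not from the announcement).
    """
    return max(total_units - required_units, 0)


def can_carry_full_load(total_units: int, required_units: int, failed_units: int) -> bool:
    """True if the surviving generators still meet the full-load requirement."""
    return total_units - failed_units >= required_units


# Figures from the announcement: 5 generators installed, 4 needed at full power.
print(tolerated_failures(5, 4))        # one failure is survivable
print(can_carry_full_load(5, 4, 1))    # four units left: full load is met
print(can_carry_full_load(5, 4, 2))    # three units left: load shedding follows
```

With two failures the check fails, which matches the point at which the building's systems went into load shedding.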
About 15 minutes later, one of the generators was started up to power emergency lighting and a couple of our senior technicians made their way into the (still evacuated) building and down to our data center to assess the damage. Since the backup power had failed, our own data center power remained off until the main grid power came back. We then proceeded to slowly power up our equipment. Servers (and all computers) consume significantly more power when booting up than when up and running, so there is some risk of overloading the power circuits if too many of them are flipped on at once. Keeping that in mind, we powered everything on as quickly as possible. At that point the majority of our services were fully back up and running, but some services were still down, and we began the process of systematically verifying all services and making any necessary repairs and adjustments. Whenever a large number of servers suddenly loses power, a certain small percentage of them will not come back up properly, and when you have several hundred servers it takes a while to verify all of them.
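The staggered power-up described above can be sketched as a simple batching calculation: each batch is sized so that the servers already running (at their steady draw) plus the servers currently booting (at their higher boot draw) stay within the circuit limit. The function name and all amperage figures below are hypothetical and are not taken from the announcement; this is a rough model, not a description of how DreamHost actually sequenced its equipment.

```python
def plan_power_on(n_servers: int, boot_amps: float, steady_amps: float,
                  circuit_amps: float) -> list[int]:
    """Return batch sizes for powering on servers without exceeding one circuit.

    Simplified model: a server draws boot_amps while booting and steady_amps
    afterwards, and each batch finishes booting before the next one starts.
    """
    batches = []
    running = 0
    while running < n_servers:
        # Headroom left after the servers that are already up and running.
        headroom = circuit_amps - running * steady_amps
        batch = min(n_servers - running, int(headroom // boot_amps))
        if batch <= 0:
            raise ValueError("circuit cannot carry the boot draw of another server")
        batches.append(batch)
        running += batch
    return batches


# Hypothetical figures: 10 servers, 3 A while booting, 2 A steady, 24 A circuit.
print(plan_power_on(10, 3.0, 2.0, 24.0))
```

In practice the limiting figure would come from the circuit's rated continuous load and measured draw per machine, so any real plan would need those numbers rather than the placeholders used here.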
Once our own access to our servers was restored our staff continued working into the night to restore as much service as possible and to respond to as many of your support cases as possible. Some of our staff continued working all the way through the night and we were able to restore almost everything that first night.
Tuesday (September 13) started off early with all of us addressing the residual issues. At around noon that day one of our core routers experienced an internal failure stemming from damage previously sustained during the power outage. Our routers handle all of the Internet traffic coming in and out of our network and they are set up in a redundant way to minimize network disruption when a failure does occur. In this case, the main cpu of the router (called the 'supervisor') died and the secondary one took over. Everything continued working almost as it should have, but there is a remaining router issue that we are still working with Cisco support on. That issue is responsible for the slower than normal performance of our network and it will be resolved absolutely as soon as possible.
During this outage, our off-network Emergency Status Page (http://status.dreamhost.com/) proved to be an invaluable resource for disseminating information among our customers. That status page remained up throughout the power outage and was updated regularly as we received new information. Unfortunately, not everyone knows about it and we will be working to improve that situation in the coming days. Those bloggers among you that did know to check the status page were extra helpful in passing along the information to other dreamhosters who were still in the dark. Thank you to everyone who helped out with that!
This announcement will be followed by another explaining what went wrong with our processes and what we plan to do to address them. That will come in the next few days.
We will be continuing to provide more detailed information on our official weblog found here: http://blog.dreamhost.com/
Also, everyone who has not bookmarked our Emergency Status Page should do so now. That page is found here: http://status.dreamhost.com/
We will be improving the basic page we have there to make it as useful a source of information as possible.
If you have any additional questions about this outage, please let us know. We will be happy to address all of your questions or concerns.
The Un-Happy DreamHost Powerless Team
On Thu, 15 Sep 2005, jc dill wrote:
My personal website is hosted with DreamHost. They sent this out to their customers today. Of interest to NANOG is the bit about the N+1 redundant genset system having 2 generators quickly fail, and in doing so having the UPS fail and the entire data center go dark. Something to consider in
For some reason this immediately made me think of "VAXen, my children, just don't belong some places."

http://www.crash.com/fun/texts/vaxen-dont.html

We appear to have progressed, marginally, from those days. :-)

==========================================================
Chris Candreva -- chris@westnet.com -- (914) 967-7816
WestNet Internet Services of Westchester
http://www.westnet.com/
participants (2): Christopher X. Candreva, jc dill