Tornados in Ashburn
John Starta <john@starta.org> writes:
http://www.washingtonpost.com/wp-dyn/articles/A29911-2004Sep17.html
Printer-friendly version for your signin-bypassing pleasure: http://www.washingtonpost.com/ac2/wp-dyn/A29911-2004Sep17?language=printer

I was a little closer to the Ashburn one than I really wanted to be - I was able to see it in the distance to the north (heading south to north, as would be expected given that it came from the eastern edge of a northbound hurricane remnant) as I drove along the Greenway.

Notwithstanding an incident report sent out by Equinix at 2012 stating "The Chiller plant is fully functional", our temperature graphs indicate that there was a cooling issue at Equinix Ashburn F from 1815 until 1855; the start time corresponds with the Chantilly/Dulles/Ashburn tornado being in the area of Equinix.

Another tenant at Ashburn F states that there were AC power disturbances. I cannot speak to that; as far as I can tell (no special instrumentation in my installations), my power was fine.

The reason that I bring this up is that I believe a report that is posted two hours after the event and glosses over potentially serious operational anomalies by stating that everything is cool (in the present tense) does not serve anyone's best interests. I understand and accept the two-hour delay from the start of the incident, but I expect scrupulous honesty in after-action assessments, not a marketing-driven assertion that everything is Just Fine.

I encourage the powers that be at Equinix to make public (or at least send to its customers) a revised statement that truthfully reflects what happened Friday night.

---Rob (KE4DJT, spotter FXN16)
I was in the building last night when the weather went bad here :( It was definitely scramble mode. When the power went out to the welcome area, it obviously got silent, and I could hear the sounds of the magnetic doors releasing. (I do not like that sound.)

I saw a loss of HVAC but not a loss of power to the floor (by floor I mean customer machines). I did, however, talk to a friend of mine last night who administers VoIP stuff here, and he said that he lost power to a few devices, but not all. To my understanding, none of our devices lost power, just HVAC.

I expect a full report of the events will be released in a few days, but not today (well, not in detail anyway). To fully understand what happened and in what order, someone's going to have a lot of digging to do through a lot of data to get the real story.

As of right now this area is still without power, but Equinix has 40,000 gallons of fuel, a 500-gallon-an-hour burn rate, trucks coming with fuel, and an ETA of midnight for the transformer being fixed in the area.

Dre G.

On Sat, 18 Sep 2004, Robert E. Seastrom wrote:
John Starta <john@starta.org> writes:
http://www.washingtonpost.com/wp-dyn/articles/A29911-2004Sep17.html
Printer-friendly version for your signin-bypassing pleasure: http://www.washingtonpost.com/ac2/wp-dyn/A29911-2004Sep17?language=printer
I was a little closer to the Ashburn one than I really wanted to be - I was able to see it in the distance to the north (heading south to north, as would be expected given that it came from the eastern edge of a northbound hurricane remnant) as I drove along the Greenway.
Notwithstanding an incident report sent out by Equinix at 2012 stating "The Chiller plant is fully functional", our temperature graphs indicate that there was a cooling issue at Equinix Ashburn F from 1815 until 1855; the start time corresponds with the Chantilly/Dulles/Ashburn tornado being in the area of Equinix.
Another tenant at Ashburn F states that there were AC power disturbances. I cannot speak to that; as far as I can tell (no special instrumentation in my installations), my power was fine.
The reason that I bring this up is that I believe a report that is posted two hours after the event and glosses over potentially serious operational anomalies by stating that everything is cool (in the present tense) does not serve anyone's best interests. I understand and accept the two-hour delay from the start of the incident, but I expect scrupulous honesty in after-action assessments, not a marketing-driven assertion that everything is Just Fine.
I encourage the powers that be at Equinix to make public (or at least send to its customers) a revised statement that truthfully reflects what happened Friday night.
---Rob (KE4DJT, spotter FXN16)
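As a sanity check on the fuel figures Dre quotes above (40,000 gallons on hand, a 500-gallon-per-hour burn rate), the implied generator runtime is simple arithmetic. A back-of-the-envelope sketch, using only the numbers reported in the message (which I have not independently verified):

```python
# Estimated generator runtime from the figures quoted in the message above.
# Both inputs are as reported, not independently verified.
fuel_on_hand_gal = 40_000   # gallons of diesel reported on site
burn_rate_gph = 500         # reported burn rate, gallons per hour

runtime_hours = fuel_on_hand_gal / burn_rate_gph
print(f"Runtime on stored fuel: {runtime_hours:.0f} hours "
      f"(about {runtime_hours / 24:.1f} days)")
# Works out to 80 hours, i.e. a bit over three days without a refuel.
```

Plenty of margin for a midnight transformer ETA, assuming the burn rate holds and the fuel trucks show up.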
Robert E. Seastrom wrote:
John Starta <john@starta.org> writes:
http://www.washingtonpost.com/wp-dyn/articles/A29911-2004Sep17.html
Printer-friendly version for your signin-bypassing pleasure: http://www.washingtonpost.com/ac2/wp-dyn/A29911-2004Sep17?language=printer
I was a little closer to the Ashburn one than I really wanted to be - I was able to see it in the distance to the north (heading south to north, as would be expected given that it came from the eastern edge of a northbound hurricane remnant) as I drove along the Greenway.
Notwithstanding an incident report sent out by Equinix at 2012 stating "The Chiller plant is fully functional", our temperature graphs indicate that there was a cooling issue at Equinix Ashburn F from 1815 until 1855; the start time corresponds with the Chantilly/Dulles/Ashburn tornado being in the area of Equinix.
Some additional graphs from F building for last 24 hours: http://www.deliver3.com/ash/
Another tenant at Ashburn F states that there were AC power disturbances. I cannot speak to that; as far as I can tell (no special instrumentation in my installations), my power was fine.
The reason that I bring this up is that I believe a report that is posted two hours after the event and glosses over potentially serious operational anomalies by stating that everything is cool (in the present tense) does not serve anyone's best interests. I understand and accept the two-hour delay from the start of the incident, but I expect scrupulous honesty in after-action assessments, not a marketing-driven assertion that everything is Just Fine.
Or even acknowledgement that the incident existed. When we called in our temperature spike, we were told "hrm, that's odd, we'll send somebody over to look".
I encourage the powers that be at Equinix to make public (or at least send to its customers) a revised statement that truthfully reflects what happened Friday night.
---Rob (KE4DJT, spotter FXN16)
From the NWS:
A tornadic thunderstorm moved into eastern Loudoun County from western Fairfax County in the vicinity of the Washington Dulles International Airport. This tornado passed within one half mile of the National Weather Service forecast office in Sterling. This prompted the weather forecast office staff on duty to seek shelter in the safe room constructed in the office. The tornado traveled north from Dulles Airport... just west of Route 28 into portions of Ashburn. The tornado produced some damage on the America Online campus off of Waxpool Rd and more extensive damage to the north in the Beaumeade Corporate Park. Many trees were snapped and uprooted along the path of the tornado in the corporate park. Additionally... three roofs were blown off of buildings and one wall collapsed on one building. The tornado also tumbled two automobiles into the side of a building and turned over a tractor trailer. Based on the damage produced in the corporate park... the tornado reached a maximum intensity of F2 on the Fujita scale.
I was at Dulles airport at the time, and the result was chaos. Everyone had to go into the basement of the terminal building, and many people experienced flight delays (mine was about 5 hours).

Regards,
Marshall Eubanks

On Sep 18, 2004, at 7:19 PM, jmalcolm@uraeus.com wrote:
From the NWS:
A tornadic thunderstorm moved into eastern Loudoun County from western Fairfax County in the vicinity of the Washington Dulles International Airport. This tornado passed within one half mile of the National Weather Service forecast office in Sterling. This prompted the weather forecast office staff on duty to seek shelter in the safe room constructed in the office. The tornado traveled north from Dulles Airport... just west of Route 28 into portions of Ashburn. The tornado produced some damage on the America Online campus off of Waxpool Rd and more extensive damage to the north in the Beaumeade Corporate Park. Many trees were snapped and uprooted along the path of the tornado in the corporate park. Additionally... three roofs were blown off of buildings and one wall collapsed on one building. The tornado also tumbled two automobiles into the side of a building and turned over a tractor trailer. Based on the damage produced in the corporate park... the tornado reached a maximum intensity of F2 on the Fujita scale.
T.M. Eubanks
e-mail: marshall.eubanks@telesuite.com
http://www.telesuite.com
On Sat, 18 Sep 2004, Robert E. Seastrom wrote:
The reason that I bring this up is that I believe a report that is posted two hours after the event and glosses over potentially serious operational anomalies by stating that everything is cool (in the present tense) does not serve anyone's best interests. I understand and accept the two-hour delay from the start of the incident, but I expect scrupulous honesty in after-action assessments, not a marketing-driven assertion that everything is Just Fine.
I have no inside information; I haven't worked for Equinix in over three years. Regardless of the company, these things are always written by the marketing/legal departments in the end. In a sole proprietorship, one person may do it all.

You have to learn how to read the reports. The fact they sent out a report is a good indication there were problems. The fact they mentioned cooling is a good indication there were cooling problems. The fact they didn't mention other things (e.g. no earthquakes, no tsunami, no volcano) is a good indication those other things weren't an issue. It's just how marketing/legal departments think.

Despite marketing departments, engineers know there will be failures. An N+1 design means two faults will result in an interruption. An N+2 design means three faults will result in an interruption. And so on.

I agree it's frustrating when companies won't tell their paying customers what's happening. I'm not sure it's always dishonesty; a lot of the time the company doesn't know what's happening either. Most companies are honest in their reporting, as far as what they say. But there is a lot of "spin."
Despite marketing departments, engineers know there will be failures. An N+1 design means two faults will result in an interruption. An N+2 design means three faults will result in an interruption. And so on.
Only caveat here (that I want to add) is this:

1) No matter what the company, no matter what the design, N+x doesn't necessarily mean >x failures have to occur at all, or even simultaneously.

2) Just because a design is believed to be N+x or yN doesn't mean all single points of failure are really eliminated. N+x or yN implies that the failures they planned for have to be >(y-1)N or >x. It doesn't mean that they have planned for every possible failure mode. For example, static transfer switches can and do fail. Even when they are in pairs, the coupling mechanisms and paralleling mechanisms often don't work and aren't easy to repair/bypass in an emergency.

3) Many new systems [say, datacenters built/upgraded in the last 5 years] haven't been around long enough to really test 99.999% and above levels of availability... many new systems won't start showing problems for 5-10 years.

Specifically in Equinix's case:

1) Good that they [seemed] to have maintained partial power.

2) Good that they restored cooling [power to the blowers?] relatively quickly. By the graph someone posted and their message, it looks like their chillers were on an unaffected system, but their blowers weren't [as in, were affected].

3) Good that they seemed to be able to bring together enough knowledgeable folks quickly to resolve the problems that did occur relatively quickly.

4) SLA credits. Depending on your contract, even possible breach unless they can prove >x or >(y-1)N failures had occurred in their physical plant. The latter is only useful if you want to get out of Equinix/Ash or reduce your commits to it.

Deepak Jain
AiNET
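Sean's N+x arithmetic, and Deepak's caveat about it, can be made concrete with a tiny helper. This is a hypothetical sketch (the function and the numbers are mine, not anything from Equinix's design documents): an N+x plant needs N units to carry the load and keeps x spares, so it rides through up to x unit failures and is interrupted by failure x+1 — provided the failures are of the independent kind the design anticipated.

```python
def survives(n_required: int, spares: int, failures: int) -> bool:
    """In an N+x design, n_required units carry the load and `spares`
    extra units stand by; service survives as long as at least
    n_required units remain healthy.  This models only independent
    unit failures -- not the common-mode failures (e.g. a static
    transfer switch) that Deepak points out can defeat the design."""
    return (n_required + spares) - failures >= n_required

# N+1: one fault is absorbed, two cause an interruption.
assert survives(4, 1, 1) and not survives(4, 1, 2)
# N+2: two faults absorbed, three cause an interruption.
assert survives(4, 2, 2) and not survives(4, 2, 3)
```

The point of Deepak's caveat, in these terms: the model counts faults, but a single common-mode event can take out the spares and the primaries at once.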
Deepak Jain wrote:
Specifically in Equinix's case:
1) Good that they [seemed] to have maintained partial power.
2) Good that they restored cooling [power to the blowers?] relatively quickly. By the graph someone posted and their message, it looks like their chillers were on an unaffected system, but their blowers weren't [as in, were affected].
3) Good that they seemed to be able to bring together enough knowledgeable folks quickly to resolve the problems that did occur relatively quickly.
I would have to agree. We have a setup in this facility, and even with the quick temperature spike, we didn't skip a beat. Can't ask for much more than that. It seems to me like things worked nearly as they should have, and if they didn't, the contingency plans were effective.

-david

----------------------------------------------------
David A. Ulevitch - Founder, EveryDNS.Net
http://david.ulevitch.com -- http://everydns.net
----------------------------------------------------
On Sat, 18 Sep 2004, Deepak Jain wrote:
3) Many new systems [say datacenters built/upgraded in the last 5 years] haven't been around long enough to really test 99.999% and above levels of availability... many new systems won't start showing problems for 5-10 years.
Past performance is not a guarantee of future results. Sometimes you get lucky. My residence with no UPS, no backup generator, no surge protection hasn't lost power in almost 5 years even during the California rolling blackouts. Nevertheless I wouldn't recommend using my residence as co-location. The 5 9s is a bit of a myth and causes some creative statistics. There are datacenters over 5 years old which have met 100% scheduled availability. They are rare and probably exceeded their design expectations. All of them I know about are private data centers, not co-location, and all the owners have backup data centers because they know one day they will have a problem. On the other hand, there are many private data centers worse than professionally operated co-location facilities.
1) Good that they [seemed] to have maintained partial power.
It would be interesting to find out what happened to the two UPSes that apparently failed. Was it something that exceeded the design, e.g. a lightning strike greater than X joules? Or something else? Equinix tests the heck out of their systems, but there is always the potential for a problem.
2) Good that they restored cooling [power to the blowers?] relatively quickly. By the graph someone posted and their message, it looks like their chillers were on an unaffected system, but their blowers weren't [as in, were affected].
The initial spike looks normal, although a bit bigger than is comfortable. Chiller plants and compressors take several minutes to reset and restart when the backup generators come online. The storm may have had some impact on the recovery because the temperature appears to take a long time to stabilize.
3) Good that they seemed to be able to bring together enough knowledgeable folks quickly to resolve the problems that did occur relatively quickly.
Yep, whatever the problem, restoration that quickly tends to indicate their team was on the ball. Stuff will always fail. The real test is how quickly it is fixed.
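For reference on the "five 9s" Sean mentions above, the downtime budgets behind the nines are just arithmetic. A quick sketch, not tied to any particular facility:

```python
# Yearly downtime budget implied by an availability target of k nines.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

for nines in (3, 4, 5):
    unavailability = 10 ** -nines            # e.g. five nines -> 1e-5
    downtime_min = MINUTES_PER_YEAR * unavailability
    print(f"{nines} nines: {downtime_min:8.1f} minutes of downtime per year")
# Five nines allows roughly 5.3 minutes of downtime per year;
# three nines allows roughly 8.8 hours.
```

Which is why a single 40-minute cooling excursion, if it had caused an outage, would have blown a five-nines budget for the better part of a decade.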
Sean Donelan <sean@donelan.com> writes:
1) Good that they [seemed] to have maintained partial power.
It would be interesting to find out what happened to the two UPSes that apparently failed. Was it something that exceeded the design, e.g. a lightning strike greater than X joules? Or something else? Equinix tests the heck out of their systems, but there is always the potential for a problem.
Where did you hear this? If it was posted to NANOG, I missed it.
2) Good that they restored cooling [power to the blowers?] relatively quickly. By the graph someone posted and their message, it looks like their chillers were on an unaffected system, but their blowers weren't [as in, were affected].
The initial spike looks normal, although a bit bigger than is comfortable. Chiller plants and compressors take several minutes to reset and restart when the backup generators come online. The storm may have had some impact on the recovery because the temperature appears to take a long time to stabilize.
If this is to be expected and normal, then a statement to that effect ("Some customers may note a transient temperature spike of as much as 10 degrees C on their equipment due to designed-in characteristics of an unplanned transfer of the chiller plant to backup power") in the customer announcement would have gone a long way towards allaying fears and creating positive spin. A statement that the "chillers are OK", when your inlet temperature has just spiked 9 degrees and is currently sitting six degrees high, is simply disingenuous.

Anyway, based on my information (including a couple of phone calls at the time), suggesting that everything was nominal would be an overly charitable assessment of the situation.
3) Good that they seemed to be able to bring together enough knowledgeable folks quickly to resolve the problems that did occur relatively quickly.
Yep, whatever the problem, restoration that quickly tends to indicate their team was on the ball. Stuff will always fail. The real test is how quickly it is fixed.
Absolutely. In case it was not clear in my original message, let me state for the record:

1) I don't have a problem with facilities being screwed up due to Acts of God that are outside of the design parameters of the facility. If an Airbus on short final to Runway 19R at Dulles magically fell out of the sky on top of Equinix, that would just be spectacularly bad luck, not Equinix's fault.

1a) In the words of a friend of mine who grew up in Texas, regarding tornadoes: "The odds of being in the path are actually quite low; the consequences of being in the path are extremely high." An F2 tornado, while perhaps not impressive to our friends from the Great Plains, is capable of causing substantial damage.

1b) No substitute for site diversity if your project is important enough to justify the cost.

2) Under the circumstances, I think the Equinix staff did an excellent job of bringing things under control quickly. I'm sure glad this happened during the day and not at night or on a weekend when, due to cost-cutting measures, they have maybe one tech, two max, on duty.

3) I believe that the statements made by Equinix to its customers so far are outside the acceptable and expectable envelope of positive spin to which Sean alluded in a previous message. We're paying customers, and when things go south we deserve frankness and full disclosure, not a pep talk.

---Rob
On Sun, 19 Sep 2004, Robert E. Seastrom wrote:
1b) No substitute for site diversity if your project is important enough to justify the cost.
And even when you have site diversity, Murphy and Mother Nature can still get you. The federal National Finance Center in New Orleans, LA, shut down due to Hurricane Ivan; their backup call center is in Cumberland, MD. Ivan swept through both of them.

http://www.federalnewsradio.com/index.php?nid=22&sid=136393
participants (10)

- David A. Ulevitch
- Deepak Jain
- Dre G.
- jmalcolm@uraeus.com
- John Starta
- Marshall Eubanks
- Matt Levine
- Robert E. Seastrom
- Robert E.Seastrom
- Sean Donelan