Sean Donelan <sean@donelan.com> writes:
1) Good that they [seemed] to have maintained partial power.
It would be interesting to find out what happened to the two UPSes that apparently failed. Was it something that exceeded the design, e.g. a lightning strike greater than X joules? Or something else? Equinix tests the heck out of their systems, but there is always the potential for a problem.
Where did you hear this? If it was posted to NANOG, I missed it.
2) Good that they restored cooling [power to the blowers?] relatively quickly. By the graph someone posted and their message, it looks like their chillers were on an unaffected system, but their blowers weren't [as in, were affected].
The initial spike looks normal, although a bit bigger than is comfortable. Chiller plants and compressors take several minutes to reset and restart when the backup generators come online. The storm may have had some impact on the recovery because the temperature appears to take a long time to stabilize.
If this is to be expected and normal, then a statement to that effect ("Some customers may note a transient temperature spike of as much as 10 degrees C on their equipment due to designed-in characteristics of an unplanned transfer of the chiller plant to backup power") in the customer announcement would have gone a long way towards allaying fears and creating positive spin. A statement that the "chillers are OK", when your inlet temperature has just spiked 9 degrees and is currently sitting 6 degrees high, is simply disingenuous. Anyway, based on my information (including a couple of phone calls at the time), suggesting that everything was nominal would be an overly charitable assessment of the situation.
3) Good that they seemed able to quickly bring together enough knowledgeable folks to resolve the problems that did occur in relatively short order.
Yep, whatever the problem, restoration that quickly tends to indicate their team was on the ball. Stuff will always fail. The real test is how quickly it is fixed.
Absolutely. In case it was not clear in my original message, let me state for the record:
1) I don't have a problem with facilities being screwed up due to Acts of God that are outside of the design parameters of the facility. If an Airbus on short final to Runway 19R at Dulles magically fell out of the sky on top of Equinix, that would just be spectacularly bad luck, not Equinix's fault.
1a) In the words of a friend of mine who grew up in Texas, regarding tornadoes: "The odds of being in the path are actually quite low; the consequences of being in the path are extremely high". An F2 tornado, while perhaps not impressive to our friends from the Great Plains, is capable of causing substantial damage.
1b) No substitute for site diversity if your project is important enough to justify the cost.
2) Under the circumstances, I think the Equinix staff did an excellent job of bringing things under control quickly. I'm sure glad this happened during the day and not at night or on a weekend when, due to cost-cutting measures, they have maybe one tech, two max, on duty.
3) I believe that the statements made by Equinix to its customers so far are outside the acceptable and expectable envelope of positive spin to which Sean alluded in a previous message. We're paying customers, and when things go south we deserve frankness and full disclosure, not a pep talk.
---Rob