On Sat, 18 Sep 2004, Deepak Jain wrote:
> 3) Many new systems [say datacenters built/upgraded in the last 5 years] haven't been around long enough to really test 99.999% and above levels of availability... many new systems won't start showing problems for 5-10 years.
Past performance is not a guarantee of future results. Sometimes you get lucky. My residence, with no UPS, no backup generator, and no surge protection, hasn't lost power in almost 5 years, even during the California rolling blackouts. Nevertheless, I wouldn't recommend using my residence as co-location.

The "five 9s" figure is a bit of a myth and leads to some creative statistics. There are datacenters over 5 years old that have met 100% scheduled availability. They are rare and probably exceeded their design expectations. All of the ones I know about are private data centers, not co-location, and all of their owners have backup data centers because they know that one day they will have a problem. On the other hand, there are many private data centers that are worse than professionally operated co-location facilities.
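For a sense of scale, the downtime budget implied by an availability figure is simple arithmetic. Below is a minimal sketch in Python, assuming a 365-day year and counting every minute of the year as scheduled time. The function name is made up for illustration and is not taken from any facility's SLA.

```python
# Assumption: a 365-day year (leap years ignored) with no excluded
# maintenance windows. Real SLAs often define "scheduled" time differently.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent, years=1.0):
    """Return the downtime budget, in minutes, implied by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR * years


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        per_year = allowed_downtime_minutes(target)
        five_years = allowed_downtime_minutes(target, years=5)
        print(f"{target}%: {per_year:.2f} min/year, {five_years:.2f} min over 5 years")
```

At 99.999% the budget comes to roughly 5.26 minutes per year, so one short outage can use up several years' worth of allowance.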
> 1) Good that they [seemed] to have maintained partial power.
It would be interesting to find out what happened to the two UPSes that apparently failed. Was it something that exceeded the design, e.g. a lightning strike greater than X joules? Or something else? Equinix tests the heck out of their systems, but there is always the potential for a problem.
> 2) Good that they restored cooling [power to the blowers?] relatively quickly. By the graph someone posted and their message, it looks like their chillers were on an unaffected system, but their blowers weren't [as in, were affected].
The initial spike looks normal, although a bit bigger than is comfortable. Chiller plants and compressors take several minutes to reset and restart when the backup generators come online. The storm may have had some impact on the recovery, since the temperature appears to have taken a long time to stabilize.
> 3) Good that they seemed to be able to bring together enough knowledgeable folks quickly to resolve the problems that did occur relatively quickly.
Yep, whatever the problem, restoration that quickly tends to indicate their team was on the ball. Stuff will always fail. The real test is how quickly it is fixed.