On Mon, Jan 15, 2024 at 9:55 AM, William Herrin <bill@herrin.us> wrote:

On Mon, Jan 15, 2024 at 6:08 AM Mike Hammett <nanog@ics-il.net> wrote:

Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees before mitigations started having an effect. What should be expected in the aftermath?

Hi Mike,

A decade or so ago I maintained a computer room with a single air conditioner because the boss wouldn't go for n+1. It failed in exactly this manner several times.

And in the early 2000s I worked at a (very crappy) ISP/Colo provider which had its primary location in a small brick garage. It *did* have redundant AC, in the form of two large window units stuck into a hole which had been hacked through the brick wall. They were redundant: there were two of them, and they were on separate circuits. What more could you ask for?!

At 2 AM one morning I'm awakened from my slumber by a warning page from the monitoring system (WhatsUp Gold. Remember WhatsUp Gold?) letting me know that the temperature is out of range. This is a fairly common occurrence, so I ack it and go back to sleep. A short while later I'm awakened again, and this time it's a critical alert and the temperature is really high.

So, I grumble, get dressed, and drive over to the location. I open the door, and, yes, it really *is* hot. This is because the AC units have been vibrating over the years, and the entire row of bricks above them has popped out. There is now an even larger hole in the wall, and both AC units are lying outside, still running.

'Twas not a good day...
W

After the monitoring system detected the overheat, it would be brought under control with a combination of a spot cooler and powering down to a minimal configuration. But of course it takes time to get people there and set up the mitigations, during which the heat continues to rise.

The main thing I noticed was a modest uptick in spinning drive failures for the couple of months that followed. If there was any other consequence, it was at a rate where I'd have had to be carefully measuring before and after to detect it.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/