"Hypothetical" Datacenter Overheating
Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees before mitigations started having an effect. What are normal QA procedures on your behalf? What is the facility likely to be doing? What should be expected in the aftermath? ----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP
On Mon, Jan 15, 2024 at 6:08 AM Mike Hammett <nanog@ics-il.net> wrote:
Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees before mitigations started having an effect. What should be expected in the aftermath?
Hi Mike,

A decade or so ago I maintained a computer room with a single air conditioner because the boss wouldn't go for n+1. It failed in exactly this manner several times. After the overheat was detected by the monitoring system, it would be brought under control with a combination of spot cooler and powering down to a minimal configuration. But of course it takes time to get people there and set up the mitigations, during which the heat continues to rise.

The main thing I noticed was a modest uptick in spinning drive failures for the couple months that followed. If there was any other consequence it was at a rate where I'd have had to be carefully measuring before and after to detect it.

Regards,
Bill Herrin

-- William Herrin bill@herrin.us https://bill.herrin.us/
On Mon, Jan 15, 2024 at 9:55 AM, William Herrin <bill@herrin.us> wrote:
On Mon, Jan 15, 2024 at 6:08 AM Mike Hammett <nanog@ics-il.net> wrote:
Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees before mitigations started having an effect. What should be expected in the aftermath?
Hi Mike,
A decade or so ago I maintained a computer room with a single air conditioner because the boss wouldn't go for n+1. It failed in exactly this manner several times.
And in the early 2000s I worked at a (very crappy) ISP/Colo provider which had their primary location in a small brick garage. It *did* have redundant AC — in the form of two large window units, stuck into a hole which had been hacked through the brick wall. They were redundant — there were two of them, and they were on separate circuits. What more could you ask for?!

At 2AM one morning I'm awakened from my slumber by a warning page from the monitoring system (Whatsup Gold. Remember Whatsup Gold?) letting me know that the temperature is out of range. This is a fairly common occurrence, so I ack it and go back to sleep. A short while later I'm awakened again, and this time it's a critical alert and the temperature is really high. So, I grumble, get dressed, and drive over to the location. I open the door, and, yes, it really *is* hot. This is because the AC units have been vibrating over the years, and the entire row of bricks above has popped out. There is now an even larger hole in the wall, and both AC units are lying outside, still running.

'Twas not a good day….

W

After the overheat was detected by the monitoring system, it would be
brought under control with a combination of spot cooler and powering down to a minimal configuration. But of course it takes time to get people there and set up the mitigations, during which the heat continues to rise.
The main thing I noticed was a modest uptick in spinning drive failures for the couple months that followed. If there was any other consequence it was at a rate where I'd have had to be carefully measuring before and after to detect it.
Regards, Bill Herrin
-- William Herrin bill@herrin.us https://bill.herrin.us/
Once upon a time, Izaac <izaac@setec.org> said:
On Tue, Jan 16, 2024 at 08:37:09AM -0800, Warren Kumari wrote:
ISP/Colo provider
The good ole days. When one stacked modems with two pencils in between them and box fans blew through the gaps.
The back-room ISP I started at was at least owned by a company with their own small machine shop, so we had them make plates we could mount two Sportsters (sans top) to and slide them into card cages. 20 modems in a 5U cage! -- Chris Adams <cma@cmadams.net>
I remember those days --- I think we bought cages from someone and pulled the boards out of the modems to mount them.

-----Original Message-----
From: "Chris Adams" <cma@cmadams.net>
Sent: Tuesday, January 16, 2024 1:30pm
To: nanog@nanog.org
Subject: Re: "Hypothetical" Datacenter Overheating

Once upon a time, Izaac <izaac@setec.org> said:
On Tue, Jan 16, 2024 at 08:37:09AM -0800, Warren Kumari wrote:
ISP/Colo provider
The good ole days. When one stacked modems with two pencils in between them and box fans blew through the gaps.
The back-room ISP I started at was at least owned by a company with their own small machine shop, so we had them make plates we could mount two Sportsters (sans top) to and slide them into card cages. 20 modems in a 5U cage! -- Chris Adams <cma@cmadams.net>
On 1/16/24 10:33, Shawn L via NANOG wrote:
I remember those days --- I think we bought cages from someone and pulled the boards out of the modems to mount them.
We made our own. And then we had to deal with all the wall warts. We rigged up a power supply with a big snake of barrel jacks. Livingston Portmaster 2s, of course. -- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV
On Tue, 2024-01-16 at 10:44 -0800, Jay Hennigan wrote:
We made our own. And then we had to deal with all the wall warts. We rigged up a power supply with a big snake of barrel jacks.
Luxury. We had a hamster in a hamster wheel for each modem. Ah, the old days. Regards, K. -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karl Auer (kauer@biplane.com.au) http://www.biplane.com.au/kauer
For completeness' sake: at the first commercial ISP to sell individual dial-up to the public, The World, we had six of those typical desktop 2400bps modems (I forget the brand, though I still have them, and a photo too, I think) sitting on a file cabinet in an office space in Brookline, MA, plugged into a Sun 4/280. I bought those modems from a local computer retail store on my own personal credit card, thinking maybe others would like to try this internet thing, and we'd just gotten access to a T1.

On January 17, 2024 at 09:28 kauer@biplane.com.au (Karl Auer) wrote:
On Tue, 2024-01-16 at 10:44 -0800, Jay Hennigan wrote:
We made our own. And then we had to deal with all the wall warts. We rigged up a power supply with a big snake of barrel jacks.
Luxury. We had a hamster in a hamster wheel for each modem.
Ah, the old days.
Regards, K.
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karl Auer (kauer@biplane.com.au) http://www.biplane.com.au/kauer
-- -Barry Shein Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD The World: Since 1989 | A Public Information Utility | *oo*
On Tue, Jan 16, 2024 at 10:30 AM Chris Adams <cma@cmadams.net> wrote:
The back-room ISP I started at was at least owned by a company with their own small machine shop, so we had them make plates we could mount two Sportsters (sans top) to and slide them into card cages. 20 modems in a 5U cage!
The ISP where I worked bought wall-mounted wire shelves and routed the serial cable through the shelf to stably hold the modems vertically in place and apart from each other. Worked pretty well. The open shelving let air move up past them, keeping them cool, and offered plenty of tie points for cable management.

Regards,
Bill Herrin

-- William Herrin bill@herrin.us https://bill.herrin.us/
and none in the other two facilities you operate in that same building had any failures.

----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP

----- Original Message -----
From: sronan@ronan-online.com
To: "Mike Hammett" <nanog@ics-il.net>
Cc: "NANOG" <nanog@nanog.org>
Sent: Monday, January 15, 2024 9:14:49 AM
Subject: Re: "Hypothetical" Datacenter Overheating

I'm more interested in how you lose six chillers all at once.

Shane
and none in the other two facilities you operate in that same building had any failures.
Quoting directly from their outage ticket updates:

CH2 does not have chillers, cooling arrangement is DX CRACs manufactured by another company. CH3 has Smart chillers but are water cooled not air cooled so not susceptible to cold ambient air temps as they are indoor chillers.
On Mon, Jan 15, 2024 at 10:19 AM Mike Hammett <nanog@ics-il.net> wrote:
and none in the other two facilities you operate in that same building had any failures.
----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP

From: sronan@ronan-online.com
To: "Mike Hammett" <nanog@ics-il.net>
Cc: "NANOG" <nanog@nanog.org>
Sent: Monday, January 15, 2024 9:14:49 AM
Subject: Re: "Hypothetical" Datacenter Overheating
I’m more interested in how you lose six chillers all at once.
Shane
On Jan 15, 2024, at 9:11 AM, Mike Hammett <nanog@ics-il.net> wrote:
Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees before mitigations started having an effect. What are normal QA procedures on your behalf? What is the facility likely to be doing? What should be expected in the aftermath?
----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP
Well right, which came well after the question was posited here.

----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP

----- Original Message -----
From: "Tom Beecher" <beecher@beecher.cc>
To: "Mike Hammett" <nanog@ics-il.net>
Cc: sronan@ronan-online.com, "NANOG" <nanog@nanog.org>
Sent: Thursday, January 18, 2024 9:00:34 AM
Subject: Re: "Hypothetical" Datacenter Overheating
Well right, which came well after the question was posited here.
Wasn't pooh-poohing the question, just sharing the information as I didn't see that cited otherwise in this thread.

On Thu, Jan 18, 2024 at 10:15 AM Mike Hammett <nanog@ics-il.net> wrote:
Well right, which came well after the question was posited here.
----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP

From: "Tom Beecher" <beecher@beecher.cc>
To: "Mike Hammett" <nanog@ics-il.net>
Cc: sronan@ronan-online.com, "NANOG" <nanog@nanog.org>
Sent: Thursday, January 18, 2024 9:00:34 AM
Subject: Re: "Hypothetical" Datacenter Overheating
and none in the other two facilities you operate in that same building had
any failures.
Quoting directly from their outage ticket updates :
CH2 does not have chillers, cooling arrangement is DX CRACs manufactured
by another company. CH3 has Smart chillers but are water cooled not air cooled so not susceptible to cold ambient air temps as they are indoor chillers.
On Mon, Jan 15, 2024 at 10:19 AM Mike Hammett <nanog@ics-il.net> wrote:
and none in the other two facilities you operate in that same building had any failures.
----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP

From: sronan@ronan-online.com
To: "Mike Hammett" <nanog@ics-il.net>
Cc: "NANOG" <nanog@nanog.org>
Sent: Monday, January 15, 2024 9:14:49 AM
Subject: Re: "Hypothetical" Datacenter Overheating
I’m more interested in how you lose six chillers all at once.
Shane
On Jan 15, 2024, at 9:11 AM, Mike Hammett <nanog@ics-il.net> wrote:
Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees before mitigations started having an effect. What are normal QA procedures on your behalf? What is the facility likely to be doing? What should be expected in the aftermath?
----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP
Easy. Climate change. Lol!

-mel

On Jan 15, 2024, at 7:17 AM, sronan@ronan-online.com wrote:

I'm more interested in how you lose six chillers all at once.

Shane
On 1/15/24 07:21, Mel Beckman wrote:
Easy. Climate change. Lol!
It was -8°F in Chicago yesterday.
On Jan 15, 2024, at 7:17 AM, sronan@ronan-online.com wrote:
I’m more interested in how you lose six chillers all at once.
-- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV
My sarcasm generator is clearly set incorrectly :) -mel
On Jan 15, 2024, at 10:33 AM, Jay Hennigan <jay@west.net> wrote:
On 1/15/24 07:21, Mel Beckman wrote:
Easy. Climate change. Lol!
It was -8°F in Chicago yesterday.
On Jan 15, 2024, at 7:17 AM, sronan@ronan-online.com wrote:
I’m more interested in how you lose six chillers all at once.
-- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV
On Mon, Jan 15, 2024 at 7:14 AM <sronan@ronan-online.com> wrote:
I’m more interested in how you lose six chillers all at once.
Extreme cold. If the transfer temperature is too low, they can reach a state where the refrigerant liquefies too soon, damaging the compressor.

Regards,
Bill Herrin

-- William Herrin bill@herrin.us https://bill.herrin.us/
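To make that failure mode concrete, this is roughly the check a low-ambient lockout performs before letting a compressor run. A minimal sketch in Python; the thresholds, sensor values, and function names are illustrative assumptions, not any vendor's actual controls:

    MIN_OUTDOOR_F = 20.0           # hypothetical lockout threshold
    MIN_HEAD_PRESSURE_PSI = 200.0  # hypothetical minimum safe head pressure

    def compressor_allowed(outdoor_temp_f, head_pressure_psi):
        """Return True only when it looks safe to run the compressor."""
        if outdoor_temp_f < MIN_OUTDOOR_F:
            return False  # too cold: refrigerant condenses too soon, protect the compressor
        if head_pressure_psi < MIN_HEAD_PRESSURE_PSI:
            return False  # head pressure collapsing: same failure mode
        return True

    print(compressor_allowed(-8.0, 150.0))  # False: the scenario in this thread
    print(compressor_allowed(75.0, 260.0))  # True: normal warm-weather operation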
On Mon, Jan 15, 2024 at 7:14 AM <sronan@ronan-online.com> wrote:
I’m more interested in how you lose six chillers all at once.

Extreme cold. If the transfer temperature is too low, they can reach a state where the refrigerant liquefies too soon, damaging the compressor.

Regards, Bill Herrin

Our 70-ton Tranes here have kicked out on 'freeze warning' before; there's a strainer in the water loop at the evaporator that can clog, restricting flow enough to allow freezing to occur if the chiller is actively cooling. It's so strange to have an overheating data center in subzero (F) temps. The flow sensor in the water loop can sometimes get too cold and not register the flow as well.
On Mon, Jan 15, 2024 at 10:14:49AM -0500, sronan@ronan-online.com wrote:
I’m more interested in how you lose six chillers all at once.
Because you're probably mistaking an air handling unit for a chiller. I usually point people at this to get us on the same page: https://www.youtube.com/watch?v=1cvFlBLo4u0

If you are not so mistaken, it is important to realize that neither roofs nor utility risers are infinitely wide to accommodate the six independent cooling towers and refrigerant lines that would thus be required for a single floor. They're sharing something somewhere.
In this case, they actually have EXTREMELY LARGE utility risers and yes they have SIX Independent Chillers, which support just this single floor of a multi-floor building. Shane On Tue, Jan 16, 2024 at 10:25 AM Izaac <izaac@setec.org> wrote:
On Mon, Jan 15, 2024 at 10:14:49AM -0500, sronan@ronan-online.com wrote:
I’m more interested in how you lose six chillers all at once.
Because you're probably mistaking an air handling unit for a chiller.
I usually point people at this to get us on the same page: https://www.youtube.com/watch?v=1cvFlBLo4u0
If you are not so mistaken, it is important to realize that neither roofs nor utility risers are infinitely wide to accommodate the six independent cooling towers and refrigerant lines that would thus be required for a single floor. They're sharing something somewhere.
On 1/15/24 10:14, sronan@ronan-online.com wrote:
I’m more interested in how you lose six chillers all at once.
According to a post on a support forum for one of the clients in that space: "We understand the issue is due to snow on the roof affecting the cooling equipment."

Never overlook the simplest single points of failure. Snow on cooling tower fan blades... failed fan motors are possible or even likely at that point, assuming the airflow wasn't simply blocked. Conceptually it's much like buying multiple providers for redundancy and then finding they're all in the same cable or conduit.
On Mon, 2024-01-15 at 08:08 -0600, Mike Hammett wrote:
Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees
Major double-take there for this non-US reader, until I realised you just had to mean Fahrenheit. Regards, K. -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karl Auer (kauer@biplane.com.au) http://www.biplane.com.au/kauer
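For anyone else pausing on the units: C = (F - 32) * 5/9, so the figures in this thread convert roughly as below. A trivial sketch; nothing here comes from the original posts beyond the temperatures themselves:

    def f_to_c(f):
        """Convert degrees Fahrenheit to degrees Celsius."""
        return (f - 32.0) * 5.0 / 9.0

    for f in (120, 130, -8):
        print(f"{f} F is about {f_to_c(f):.1f} C")
    # 120 F ~ 48.9 C, 130 F ~ 54.4 C, -8 F ~ -22.2 C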
Someone I talked to while on scene today said their area got to 130 and cooked two core routers.

----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP

----- Original Message -----
From: "Mike Hammett" <nanog@ics-il.net>
To: "NANOG" <nanog@nanog.org>
Sent: Monday, January 15, 2024 8:08:25 AM
Subject: "Hypothetical" Datacenter Overheating
On 16/01/2024 01:32, Mike Hammett wrote:
Someone I talked to while on scene today said their area got to 130 and cooked two core routers.
We've lost one low-end switch. I'm very glad it wasn't two core routers! We're still looking into what recourse we have against the datacenter operator. Ray
350 Cermak Chicago is a "historic" building which means you can't change the visible outside. Someone had long discussions about the benefits of outside air economizers, but can't change the windows. Need to hide HVAC plant (as much as possible).

I would design all colos to look like 375 Pearl St (formerly Verizon, formerly AT&T) New York. Vents and concrete.

Almost all the windows visible on the outside of 350 Cermak Chicago are "fake." They are enclosed on the inside (with fake indoor decor) because the 1912 glass panes aren't very weatherproof. But they preserve the look and feel of the neighborhood :-)

350 Cermak rebuilt as a colo is over 20 years old. It will be interesting to read the final root cause analysis.

Of course, as always, networks and data centers should not depend on a single POP. Diversify the redundancy, because something will always fail.

There are multiple POP/IXP in major cities. And multiple cities with POPs and IXPs.
Free air cooling loops, maybe? (Not direct free air cooling with air exchange, but the version with something much like an air handler outside: a coil and a fan running cold outside air over the coil, carrying the water/glycol that would normally be the loop off of the chiller.) The primary use of them is cost savings by using less energy to cool when it's fairly cold out, but it can also prevent low-temperature issues on compressors by not running them when it's cold. I'd expect it would not require the same sort of facade changes, as it could be on the roof and might only need water/glycol lines into the space; depending on cooling tower vs. air-cooled and chiller location, it could also potentially use the same piping (which I think is the traditional use).

I'm also fairly curious to see the root cause analysis, and I'm hoping someone is at least looking at some mechanism to allow transferring chiller capacity between floors if they had multiple floors and only had the failure on one floor.

This sort of mass failure seems to point towards either design issues (like equipment selection/configuration vs temperature range for the location), systemic maintenance issues, or some sort of single failure point that could take all the chillers out, none of which I'd be happy to see in a data center.

Anyone have any idea what the total cost of this incident is likely to reach (dead equipment, etc.)?

On 1/16/2024 4:08 PM, Sean Donelan wrote:
350 Cermak Chicago is a "historic" building which means you can't change the visible outside. Someone had long discussions about the benefits of outside air economizers, but can't change the windows. Need to hide HVAC plant (as much as possible).
I would design all colos to look like 375 Pearl St (formerly Verizon, formerly AT&T) New York. Vents and concrete.
Almost all the windows visible on the outside of 350 Cermak Chicago are "fake." They are enclosed on the inside (with fake indoor decor) because the 1912 glass panes aren't very weatherproof. But they preserve the look and feel of the neighborhood :-)
350 Cermak rebuilt as a colo is over 20 years old. It will be interesting to read the final root cause analysis.
Of course, as always, networks and data centers should not depend on a single POP. Diversify the redundancy, because something will always fail.
There are multiple POP/IXP in major cities. And multiple cities with POPs and IXPs.
This sort of mass failure seems to point towards either design issues (like equipment selection/configuration vs temperature range for the location), systemic maintenance issues, or some sort of single failure point that could take all the chillers out, none of which I'd be happy to see in a data center.
If these chillers are connected to BACnet or similar network, then I wouldn't rule out the possibility of an attack.
If these chillers are connected to BACnet or similar network, then I wouldn't rule out the possibility of an attack.
Don't insinuate something like this without evidence. Completely unreasonable and inappropriate. On Wed, Jan 17, 2024 at 8:31 AM Lamar Owen <lowen@pari.edu> wrote:
This sort of mass failure seems to point towards either design issues (like equipment selection/configuration vs temperature range for the location), systemic maintenance issues, or some sort of single failure point that could take all the chillers out, none of which I'd be happy to see in a data center.
If these chillers are connected to BACnet or similar network, then I wouldn't rule out the possibility of an attack.
On 1/17/24 20:06, Tom Beecher wrote:
If these chillers are connected to BACnet or similar network, then I wouldn't rule out the possibility of an attack.
Don't insinuate something like this without evidence. Completely unreasonable and inappropriate.
I wasn't meaning to insinuate anything; it's as much of a reasonable possibility as any other these days. Perhaps I should have worded it differently: "if my small data centers' chillers were connected to some building management network such as BACnet and all of them went down concurrently, I would be investigating my building management network for signs of intrusion in addition to checking other items, such as shared points of failure in things like chilled water pumps, electrical supply, emergency shut-off circuits, chiller/closed-loop configurations for various temperature, pressure, and flow set points, etc." A bit more wordy, but it doesn't have the same implication. But I would think it unreasonable, if I were to find myself in this situation in my own operations, to rule out any possibility that can explain simultaneous shutdowns.

And this week we did have a chiller go out on freeze warning, but the DC temp never made it quite up to 80F before the outside temperature rose back into double digits and the chiller restarted.
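For what it's worth, the concurrent-shutdown check described above is easy to automate against a building management system. A minimal sketch, where read_point() is a stand-in for whichever BACnet/BMS client is actually in use; the chiller names, point name, and threshold are hypothetical:

    # Hypothetical watchdog: alert when several chillers report "off" at once.
    # read_point() abstracts whatever BACnet/BMS client is actually deployed;
    # chiller names and the "status" point are illustrative only.

    CHILLERS = ["CH-1", "CH-2", "CH-3", "CH-4", "CH-5", "CH-6"]

    def count_running(read_point):
        """Count chillers whose status point reads as running (nonzero)."""
        return sum(1 for ch in CHILLERS if read_point(ch, "status") > 0)

    def check(read_point, min_running=2):
        running = count_running(read_point)
        if running < min_running:
            # Real life: page someone, then pull BMS/controller logs to see
            # who or what commanded the units off (mechanical fault vs. network).
            print(f"ALARM: only {running} of {len(CHILLERS)} chillers running")

    # Simulate the failure being discussed: every status point reads 0.
    check(lambda chiller, point: 0.0)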
----- Original Message -----
From: "Tom Beecher" <beecher@beecher.cc> To: "Lamar Owen" <lowen@pari.edu> Cc: nanog@nanog.org Sent: Wednesday, January 17, 2024 8:06:07 PM Subject: Re: "Hypothetical" Datacenter Overheating
If these chillers are connected to BACnet or similar network, then I wouldn't rule out the possibility of an attack.
Don't insinuate something like this without evidence. Completely unreasonable and inappropriate.
WADR, horsecrap. It's certainly one of many possible root causes which someone doing an AAR on an event like this should be thinking about, and looking for in their evaluation of the data they see. He didn't *accuse* anyone, which would be out of bounds. Cheers, -- jra -- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://www.bcp38.info 2000 Land Rover DII St Petersburg FL USA BCP38: Ask For It By Name! +1 727 647 1274
It's certainly one of many possible root causes which someone doing an AAR on an event like this should be thinking about, and looking for in their evaluation of the data they see.
And I'm sure they are and will. By the time that post was made, the vendor had shared multiple updates about what the actual cause seemed to be, which were very plausible. An unaffiliated 3rd party stating 'maybe an attack!' when there has been no observation or information shared that even remotely points to that simply spreads FUD for no reason. I respectfully disagree. On Sun, Jan 21, 2024 at 1:22 AM Jay R. Ashworth <jra@baylink.com> wrote:
----- Original Message -----
From: "Tom Beecher" <beecher@beecher.cc> To: "Lamar Owen" <lowen@pari.edu> Cc: nanog@nanog.org Sent: Wednesday, January 17, 2024 8:06:07 PM Subject: Re: "Hypothetical" Datacenter Overheating
If these chillers are connected to BACnet or similar network, then I wouldn't rule out the possibility of an attack.
Don't insinuate something like this without evidence. Completely unreasonable and inappropriate.
WADR, horsecrap.
It's certainly one of many possible root causes which someone doing an AAR on an event like this should be thinking about, and looking for in their evaluation of the data they see.
He didn't *accuse* anyone, which would be out of bounds.
Cheers, -- jra -- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://www.bcp38.info 2000 Land Rover DII St Petersburg FL USA BCP38: Ask For It By Name! +1 727 647 1274
----- Original Message -----
From: "Tom Beecher" <beecher@beecher.cc>
It's certainly one of many possible root causes which someone doing an AAR on an event like this should be thinking about, and looking for in their evaluation of the data they see.
And I'm sure they are and will.
By the time that post was made, the vendor had shared multiple updates about what the actual cause seemed to be, which were very plausible. An unaffiliated 3rd party stating 'maybe an attack!' when there has been no observation or information shared that even remotely points to that simply spreads FUD for no reason.
I didn't see any of them in the thread, which was the only thing I was paying attention to, so those are facts not in evidence to *me*. I didn't see an exclamation point in his comment, which seemed relatively measured to me.

Cheers, -- jra

-- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://www.bcp38.info 2000 Land Rover DII St Petersburg FL USA BCP38: Ask For It By Name! +1 727 647 1274
On Wed, Jan 17, 2024 at 12:07:42AM -0500, Glenn McGurrin via NANOG wrote:
Free air cooling loops maybe? (Not direct free air cooling with air exchange, the version with something much like an air handler outside with a coil and an fan running cold outside air over the coil with the water/glycol that would normally be the loop off of the chiller) the primary use of them is cost savings by using less energy to cool when it's fairly cold out, but it can also prevent low temperature issues on compressors by not running them when it's cold. I'd expect it would not require the same sort of facade changes as it could be on the roof and depending only need water/glycol lines into the space, depending on cooling tower vs air cooled and chiller location it could also potentially use the same piping (which I think is the traditional use).
You're looking for these: https://en.wikipedia.org/wiki/Thermal_wheel

Basically, an aluminum honeycomb wheel. One half of its housing is an air duct "outside" while the other half is an air duct that's "inside." Cold outside air blows through the straws and cools the metal. Wheel rotates slowly. That straw is now "inside." Inside air blows through it and deposits heat onto the metal. Turn turn turn.

A surprisingly effective way to lower heating/cooling costs. Basically "free," as you just need to turn it on the bearing. Do you get deposits in the comb? Yes, if you don't filter properly. Do you get condensation in the comb? Yeah. Treat it with desiccants.
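A rough worked example of why the wheel is so effective: under the usual sensible-effectiveness definition for balanced airflows, the room-side air leaving the wheel approaches the outdoor temperature in proportion to the wheel's effectiveness. The numbers below (95F return air, -8F outdoors, 75% effectiveness) are assumptions for illustration, not measurements from any site:

    def process_air_out_f(t_return_f, t_outdoor_f, effectiveness):
        """Room-side air temperature after a heat-recovery wheel with balanced
        airflows and the given sensible effectiveness (0..1)."""
        return t_return_f - effectiveness * (t_return_f - t_outdoor_f)

    print(process_air_out_f(95.0, -8.0, 0.75))  # ~17.8 F leaving the wheel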
I'm actually referring to something like this: https://www.chiltrix.com/server-room-chiller.html

(I've not yet had a system where they have made sense; I mostly deal with either places where I have no say in the HVAC or very small server rooms, but I've thought these were an interesting concept since I first saw them years ago.)

Quoting from the page:

The Chiltrix SE Server Room Edition adds a "free cooling" option to CX34. Server rooms need cooling all year, even when it is cold outside. If you operate in a northern area with cold winters, this option is for you. When outdoor temperatures drop below 38F, the CX34 glycol-water loop is automatically extended through a special water-to-air heat exchanger to harvest outdoor cold ambient conditions to pre-cool the glycol-water loop so that the CX34 variable speed compressor can drop to a very slow speed and consume less power. This can save about 50% off of its already low power consumption without lowering capacity.

At and below 28F, the CX34 chiller with Free Cooling SE add-on will turn off the compressor entirely and still be able to maintain its rated cooling capacity using only the variable speed pump and fan motors. At this point, the CX34 achieves a COP of >41 and EER of >141. Enjoy the savings of 2 tons of cooling for less than 75 watts. The colder it gets, the less water flow rate is needed, allowing the VSD pump power draw to drop under 20 watts.

Depending on location, for some customers free cooling mode can be active up to 3 months per year during the daytime and up to 5 months per year at night.

On 1/17/2024 3:10 PM, Izaac wrote:
On Wed, Jan 17, 2024 at 12:07:42AM -0500, Glenn McGurrin via NANOG wrote:
Free air cooling loops maybe? (Not direct free air cooling with air exchange, the version with something much like an air handler outside with a coil and an fan running cold outside air over the coil with the water/glycol that would normally be the loop off of the chiller) the primary use of them is cost savings by using less energy to cool when it's fairly cold out, but it can also prevent low temperature issues on compressors by not running them when it's cold. I'd expect it would not require the same sort of facade changes as it could be on the roof and depending only need water/glycol lines into the space, depending on cooling tower vs air cooled and chiller location it could also potentially use the same piping (which I think is the traditional use).
You're looking for these: https://en.wikipedia.org/wiki/Thermal_wheel
Basically, an aluminum honeycomb wheel. One half of its housing is an air duct "outside" while the other half is an air duct that's "inside." Cold outside air blows through the straws and cools the metal. Wheel rotates slowly. That straw is now "inside." Inside air blows through it and deposits heat onto the metal. Turn turn turn.
A surprisingly effective way to lower heating/cooling costs. Basically "free," as you just need to turn it on the bearing. Do you get deposits in the comb? Yes, if you don't filter properly. Do you get condensation in the comb? Yeah. Treat it with desiccants.
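The staging described in the Chiltrix quote above (pre-cooling below 38F, compressor fully off at or below 28F) reduces to a simple mode selection. A minimal sketch built on those two published setpoints; the mode names and structure are otherwise assumptions for illustration:

    def cooling_mode(outdoor_temp_f):
        """Pick a cooling mode from outdoor temperature, per the quoted setpoints."""
        if outdoor_temp_f <= 28.0:
            return "free cooling only (compressor off; pumps and fans only)"
        if outdoor_temp_f < 38.0:
            return "free-cooling pre-cool plus compressor at reduced speed"
        return "mechanical cooling (compressor carries the load)"

    for t in (-8.0, 30.0, 45.0):
        print(t, "->", cooling_mode(t))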
Great question. Start powering off the non-essential equipment and get natural air going in a flow. And then start praying and watching the utility's GIS outage map. lol

-- Later, Joe

On Mon, Jan 15, 2024 at 2:26 PM Mike Hammett <nanog@ics-il.net> wrote:
Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees before mitigations started having an effect. What are normal QA procedures on your behalf? What is the facility likely to be doing? What should be expected in the aftermath?
----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP
participants (19)

- bzs@theworld.com
- Chris Adams
- Glenn McGurrin
- Izaac
- Jay Hennigan
- Jay R. Ashworth
- JoeSox
- Karl Auer
- Lamar Owen
- Mel Beckman
- Mike Hammett
- Ray Bellis
- Sean Donelan
- Shane Ronan
- Shawn L
- sronan@ronan-online.com
- Tom Beecher
- Warren Kumari
- William Herrin