Re: "Hypothetical" Datacenter Overheating
At 09:08 AM 2024-01-15, Mike Hammett wrote:
Let's say that hypothetically, a datacenter you're in had a cooling failure and escalated to an average of 120 degrees before mitigations started having an effect. What are normal QA procedures on your behalf? What is the facility likely to be doing? What should be expected in the aftermath?
One would hope they would have had disaster recovery plans to bring in outside cold air, and would have executed them quickly, rather than hoping the chillers got repaired.

All our owned facilities have large outside air intakes, automatic dampers and air mixing chambers in case of mechanical cooling failure, because cooling systems are often not designed to run well in extreme cold. All of these can be manually run in case of controls failure, but people tell me I'm a little obsessive about backup plans for backup plans.

You will start to see premature failure of equipment over the coming weeks/months/years.

Coincidentally, we have some gear in a data centre in the Chicago area that is experiencing that sort of issue right now... :-(
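(As an aside, the damper math behind an air-mixing chamber like Clayton describes is a simple mass-weighted average, so the required outside-air fraction follows directly. A minimal sketch; the function name and all temperatures are hypothetical examples, not figures from this thread.)

def outside_air_fraction(t_return_c, t_outside_c, t_target_c):
    """Fraction of outside air to blend with hot return air so the
    mixed supply lands on t_target_c. Ignores humidity and density
    differences, which a real control system would not."""
    return (t_return_c - t_target_c) / (t_return_c - t_outside_c)

# Example: 35 C return air, -20 C outside air, 15 C target supply.
f = outside_air_fraction(35.0, -20.0, 15.0)
print("Open outside-air dampers to ~{:.0%}".format(f))  # ~36%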
Coincidence indeed.... ;-)

-----
Mike Hammett
Intelligent Computing Solutions
Midwest Internet Exchange
The Brothers WISP
I think we're beyond "hypothetical" at this point, Mike ... ;)
Our Zayo circuit just came up 30 minutes ago, and it routes through 350 E Cermak. Chillers were all messed up. No hypothetical there. :-) It was down for over 16 hours!
Something worth a thought: as much as devices don't like being too hot, they also don't like having their temperature change too quickly. Parts can expand and shrink at different rates depending on their composition.

A rule of thumb is a few degrees per hour of change, but YMMV; it depends on the equipment. Sometimes the manufacturer's specs include this.

Throwing open the windows on a winter day to try to rapidly bring the room down to a "normal" temperature may do more harm than good.

It might be worthwhile figuring out in advance what is reasonable, with buy-in, rather than in a panic, because, from personal experience, someone will be screaming in your ear JUST OPEN ALL THE WINDOWS WHADDYA STUPID?
--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
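(To make that rule of thumb concrete: at a hypothetical cap of 3 F per hour, a controlled return from 120 F to a normal room temperature is an all-day affair. A minimal sketch under that assumed limit; check your own equipment's specs for the real number.)

def cooldown_hours(start_f, target_f, max_rate_f_per_hr):
    """Hours a controlled cool-down takes at a capped rate of change."""
    return abs(start_f - target_f) / max_rate_f_per_hr

hours = cooldown_hours(start_f=120.0, target_f=70.0, max_rate_f_per_hr=3.0)
print("Controlled cool-down: {:.1f} hours".format(hours))  # ~16.7 hours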
On Tue, 16 Jan 2024 at 08:51, <bzs@theworld.com> wrote:
A rule of thumb is a few degrees per hour of change, but YMMV; it depends on the equipment. Sometimes the manufacturer's specs include this.
Is this common sense, or do you have a reference for it, like a paper showing what damage occurs at what rate of temperature change?

I regularly bring fine electronics, say an iPhone, through significant temperature gradients, as do most people who live in places where inside and outside can be wildly different temperatures, with no particular observable effect. The iPhone does go into 'thermometer' mode when it overheats, though.

Manufacturers such as Juniper and Cisco describe humidity, storage and operating temperatures, but do not define a temperature change rate. Does NEBS have an opinion on this, or is this just a hunch of yours?

--
++ytti
On Mon, Jan 15, 2024 at 11:08 PM Saku Ytti <saku@ytti.fi> wrote:
Is this common sense, or do you have a reference for it, like a paper showing what damage occurs at what rate of temperature change?
It's uncommon sense. You have a computer room humidified to 40% and you inject cold air below the dew point. The surfaces in the room will get wet.

See also: https://en.wikipedia.org/wiki/Thermal_stress

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
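(The dew-point half of that is easy to sanity-check with the Magnus approximation. A small sketch; the a/b constants are one common Magnus-Tetens parameterization, and the 49 C / 40% inputs just restate the thread's scenario, so treat the output as a rough estimate.)

import math

def dew_point_c(temp_c, rh_pct):
    """Dew point via the Magnus approximation (roughly +/-0.4 C
    over ordinary room conditions)."""
    a, b = 17.62, 243.12
    g = (a * temp_c) / (b + temp_c) + math.log(rh_pct / 100.0)
    return (b * g) / (a - g)

# Room overheated to ~49 C (120 F) at 40% RH: air or surfaces cooled
# below this temperature will start condensing moisture out of the air.
print("Dew point: {:.1f} C".format(dew_point_c(49.0, 40.0)))  # ~31.8 C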
On Tue, 16 Jan 2024 at 11:00, William Herrin <bill@herrin.us> wrote:
You have a computer room humidified to 40% and you inject cold air below the dew point. The surfaces in the room will get wet.
I think humidity and condensation are well understood, and indeed documented by NEBS and vendors as verboten. I am more interested in temperature changes that are not condensing and causing water damage. We could theorise that some solder will expand/contract too fast and break, or various other scenarios one might guess without context, yet electronics often have to experience large temperature gradients and appear to survive. When you turn these things on, various parts rapidly heat from ambient to 80-90°C. So I have some doubts whether this is actually a problem you need to consider, in the absence of condensation.

--
++ytti
On 16/01/2024 at 10:50:13 PM, Saku Ytti <saku@ytti.fi> wrote:

So I have some doubts whether this is actually a problem you need to consider, in the absence of condensation.
Here’s some manufacturer specs: https://www.dell.com/support/manuals/en-nz/poweredge-r6515/per6515_ts_pub/environmental-specifications?guid=guid-debd273c-0dc8-40d8-abbc-be059a0ce59c&lang=en-us 3rd section, “Maximum temperature gradient”.
From memory, the management cards alarm when the gradient is exceeded, too.
--
Nathan Ward
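(A minimal sketch of the sort of gradient alarm Nathan describes, assuming a hypothetical read_inlet_temp_c() sensor poll; a real deployment would read the sensor via IPMI, Redfish or SNMP, and the 20 C/hr limit is the ASHRAE rate quoted later in the thread, not a universal figure.)

import time

MAX_GRADIENT_C_PER_HR = 20.0  # example limit; check your vendor's specs

def read_inlet_temp_c():
    """Hypothetical sensor read; wire up to IPMI/Redfish/SNMP in practice."""
    raise NotImplementedError

def watch(poll_seconds=300):
    prev = read_inlet_temp_c()
    while True:
        time.sleep(poll_seconds)
        cur = read_inlet_temp_c()
        rate = (cur - prev) * 3600.0 / poll_seconds  # degrees C per hour
        if abs(rate) > MAX_GRADIENT_C_PER_HR:
            print("ALARM: inlet temp changing at {:+.1f} C/hr".format(rate))
        prev = cur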
On Tue, 16 Jan 2024 at 12:22, Nathan Ward <lists+nanog@daork.net> wrote:
Here’s some manufacturer specs: https://www.dell.com/support/manuals/en-nz/poweredge-r6515/per6515_ts_pub/environmental-specifications?guid=guid-debd273c-0dc8-40d8-abbc-be059a0ce59c&lang=en-us
3rd section, “Maximum temperature gradient”.
Thanks. It seems quite a few in the compute context quote ASHRAE gradients, but in the networking-kit context they are very rarely quoted (except indirectly via NEBS), though I wouldn't intuitively expect their tolerances to be significantly different.

--
++ytti
For gear climate controls and rate of change, look at the ASHRAE ratings on the equipment. Most vendors publish the rating on their data sheets (sometimes they include the ASHRAE rating), and it gives the required operating conditions as well as acceptable rates of change.

Most well-run data centres follow these recommendations; this "hypothetical" data centre normally does as well, but it appears to have missed some maintenance tasks. I have equipment in another provider's suite in the same building which hasn't been affected.

ASHRAE (the American Society of Heating, Refrigerating and Air-Conditioning Engineers) has a committee, Technical Committee 9.9, that covers Mission Critical Facilities, Data Centers, Technology Spaces and Electronic Equipment.

2021 Data Center Cooling Resiliency Brief
https://tpc.ashrae.org/Documents?cmtKey=fd4a4ee6-96a3-4f61-8b85-43418dfa988d

2016 ASHRAE Data Center Power Equipment Thermal Guidelines and Best Practices
https://tpc.ashrae.org/Documents?cmtKey=fd4a4ee6-96a3-4f61-8b85-43418dfa988d

2020 Cold Weather Shipping Acclimation and Best Practices (included because it is fitting this time of year)
https://tpc.ashrae.org/FileDownload?idx=809784d5-911b-4e9a-a2da-ff3ab6ff9eea

BTW, it hit -50°C (-58°F) in Alberta last week and you aren't hearing about the data centres in that province going offline. The record low for Chicago was -27°F, set in 1985; this building wasn't a data centre at that time, and only became one in 1999, so they would have known how cold it could get there when they did the initial system planning and should have accounted for it.

Rob

Robert Mercier
CTO, Next Dimension Inc.
Tel: 1-800-461-0585 ext 421
RMercier@nextdimensioninc.com
https://www.nextdimensioninc.com
Others have pointed to references, and I found some more; it's all pretty boring, but perhaps one should embrace the general point that some equipment may not like abrupt temperature changes.

But phones (well, modern mobile phones) don't generally have moving parts. So the issue is more likely with things like hard drives, the kind with fast-spinning platters and heads flying microns above those platters, or even just fans and similar gear.

It leads me to another question. IN THE BEFORE DAYS you had to wait to remove a removable disk until you were sure it was spun down, maybe 30 seconds or so. If you lifted one and felt that gyroscopic pull, you may well have toasted it. Today we have spinning disks in laptops you can toss across the room or whatever, so clearly they solved that problem. And they're apparently a lot more tolerant of temperature and other environmental changes. So the tolerances may be much greater than one is ever likely to run into.

I'm still not sure I'd be comfortable opening the windows to let sub-freezing air into a 120F room with petabytes of spinning rust.

P.S. Please don't tell me what an SSD is. Yes, they're probably much more tolerant of environmental changes.
--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
On Wed, 17 Jan 2024 at 03:18, <bzs@theworld.com> wrote:
Others have pointed to references, and I found some more; it's all pretty boring, but perhaps one should embrace the general point that some equipment may not like abrupt temperature changes.
Can you share them? The only one I've found is:
https://www.ashrae.org/file%20library/technical%20resources/bookstore/supple...

It quotes 20°C/h, which is a much higher rate than almost anyone has the ability to produce in their DC ambient, but it gives no explanation of where the number comes from.

I believe in reality there is immense complexity here:

- Gradient depends on the processes and materials used in manufacturing (pre- and post-RoHS builds will certainly have different gradients)
- Gradient has directionality, unlike the ASHRAE quote, because devices are engineered to go from 20°C to 90°C in a very short moment when turned on, but there was less engineering pressure for similar cooling rates
- Gradient has positionality: a 20°C change between one pair of points does not carry the same risk as between another pair

And likely no one knows this well, because no one has had to know it well, because it's not expensive enough to derisk.

But what we do know well:

- ASHRAE quotes a rate which you are unlikely to be able to hit
- Devices that travel with you regularly see 50°C instant ambient gradients, in both directions, multiple times a day
- Devices see large, fast gradients when turned on, but slower ones when turned off
- Compute people quote ASHRAE; networking people appear not to. Perhaps, as you say, spindles are ultimately the reason for the limits to exist

I think generally we have a bias: we like to identify risks and then add them as organisational knowledge, but ultimately all these new rules and exceptions increase cost and complexity and reduce efficiency and productivity, so we should be very critical about them. It is fine to realise risks, and to use realised risks as data to analyse whether avoiding them makes sense. It's very easy to build poorly defined rules on top of poorly defined rules and arrive at high-cost, low-efficiency operations.

This 'few centigrade per hour' is an exceedingly palatable rule of thumb; it sounds good, unless you stop to think about it. I would not recommend spending any time or money derisking gradients. I would hope that the rules that derisk condensation are enough to cover gradients as well, and I would re-evaluate after sufficient realised risks.

--
++ytti
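(Putting the rates in that argument side by side, with illustrative numbers; the two-minute power-on ramp and the ten-second pocket-to-street swing are guesses, not measurements.)

ashrae_limit = 20.0                        # C/hr, from the ASHRAE supplement
power_on = (90.0 - 20.0) / (120 / 3600.0)  # 20 C -> 90 C in ~2 minutes
doorway = 50.0 / (10 / 3600.0)             # 50 C ambient swing in ~10 seconds

print("ASHRAE ambient limit: {:8.0f} C/hr".format(ashrae_limit))
print("Component power-on:   {:8.0f} C/hr".format(power_on))  # ~2100
print("Phone taken outside:  {:8.0f} C/hr".format(doorway))   # ~18000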
Good thing there are no windows at this “hypothetical” location :)
participants (10)

- Bryan Holloway
- bzs@theworld.com
- Clayton Zekelman
- Jason Canady
- Mike Hammett
- Nathan Ward
- Robert Mercier
- Saku Ytti
- sronan@ronan-online.com
- William Herrin