What to expect after a cooling failure
As some may know, yesterday 151 Front St suffered a cooling failure after Enwave's facilities were flooded. One of the suites that we're in recovered quickly, but the other took much longer, and some of our gear shut down automatically due to overheating. We remotely shut down many redundant and non-essential systems in the hotter suite, and remotely transferred some others to the cooler suite, to ensure that at least a minimal set of core systems kept running in the hotter suite. We waited until temperatures returned to normal, then brought everything back online. The entire event lasted from approx. 18:45 until 01:15. Apparently the ambient temperature exceeded 43 degrees Celsius at one point on the cool side of the cabinets in the hotter suite.

For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?

Thanks

--
Erik Levinson
CTO, Uberflip
416-900-3830
1183 King Street West, Suite 100
Toronto ON M6K 3C5
www.uberflip.com
Hello,

In my experience with heat events, the only components that really degrade quickly from overheating are hard drives. If you had them spun down, they should be fine. CPUs, memory, and motherboards will be fine. The only other parts I can think of with possible issues are PSUs, but if they were powered off they should be fine as well. Maybe melted wires, but I don't think it was hot enough for that.

Thanks

On Tue, Jul 9, 2013 at 9:28 PM, Erik Levinson <erik.levinson@uberflip.com> wrote:
For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?
--
--------------------
Bryan Tong
Nullivex LLC | eSited LLC
(507) 298-1624
Thanks. I should also mention that most of the gear was still on, but we had turned off many VMs on physical servers within the first 2.5 hours, so CPU and hard drive / I/O load was around zero on those servers. Most of the servers in the hotter suite had fans running at over 75%, vs. about 35% in the cooler suite, and ambient temperature was down to 32 degrees Celsius within four hours.

--
Erik Levinson
CTO, Uberflip
416-900-3830
1183 King Street West, Suite 100
Toronto ON M6K 3C5
www.uberflip.com
----- Original Message -----
From: "Erik Levinson" <erik.levinson@uberflip.com>
For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?
If the HDDs were spinning while above their rated maximum ambient intake temperature, *especially* if they're not *right out front in the intake path* (is anything not built that way anymore? Yeah; the back side of 45-drive Supermicro racks, among other things), you should probably plan on a preemptive replacement cycle, or at the very least pay *very* close attention to smartd/smartctl and keep a good stock of pre-trayed replacements.

Remember that you may fall into the RAID Hole if you wait for failures, and hence lose data that isn't backed up -- if more drives in a RAID group fail *during rebuilds*, you're essentially screwed. If your RAID groups were properly dispersed across drive build dates, this will probably be *slightly* less dangerous, but still.

Also watch bearing-type fans.

Cheers,
-- jra
--
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com         2000 Land Rover DII
St Petersburg FL USA               #natog                      +1 727 647 1274
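[A minimal sketch of the kind of post-event SMART sweep suggested above, assuming smartmontools is installed and the disks are plain ATA drives visible as /dev/sd*; the device glob is an assumption, and drives behind RAID controllers usually need smartctl's -d passthrough options instead:]

    #!/usr/bin/env python3
    # Sketch: sweep SMART attributes after a thermal event and flag drives
    # that are developing reallocated or pending sectors.
    import glob
    import subprocess

    WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable")

    for dev in sorted(glob.glob("/dev/sd?")):
        try:
            out = subprocess.run(["smartctl", "-A", dev],
                                 capture_output=True, text=True,
                                 timeout=30).stdout
        except (OSError, subprocess.TimeoutExpired):
            continue
        for line in out.splitlines():
            fields = line.split()
            # ATA attribute rows have 10 columns: fields[1] is the
            # attribute name, fields[9] is the raw value.
            if len(fields) >= 10 and fields[1] in WATCH and fields[9] != "0":
                print(dev, fields[1], "raw =", fields[9], "-- watch this drive")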
On 7/9/13, Erik Levinson <erik.levinson@uberflip.com> wrote:
For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?
Realistically... you had a single short-lived stress event. There are likely to be some random component failures in the future, but it is unlikely that you will be able to attribute them to a short-lived stress event of that magnitude -- on average there might be a small increase over normal failure rates.

The bigger concern may be that /a lot of different components/ were subjected to the same kind of abuse at the same time, including sets of components that are supposed to form a redundant pair and not fail simultaneously.

So I wouldn't necessarily be so concerned about premature failures. I would be more concerned that redundant components exposed to the same stress event can no longer safely be assumed to fail independently: the chance of a correlated failure in the future may be greatly increased, reducing the effective redundancy/risk reduction you have today. That applies mainly to mechanical devices such as HDDs.
Thanks
--
-JH
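[To put toy numbers on the correlated-failure point above -- every probability here is an illustrative assumption, not a measured rate:]

    # A mirrored pair looks very safe when drive failures are independent,
    # much less so once a shared stress event correlates them.
    p_normal = 0.03     # assumed annual failure probability per drive
    p_stressed = 0.10   # assumed probability after a shared heat event
    c = 0.02            # assumed chance the pair now fails together

    independent = p_normal ** 2
    correlated = c * p_stressed + (1 - c) * p_stressed ** 2

    print("independent pair loss: %.2f%%" % (100 * independent))  # 0.09%
    print("correlated pair loss:  %.2f%%" % (100 * correlated))   # ~1.18%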
Honestly, I think your hardware will be fine. Just like everyone else said, keep an eye on your hard drives; they are by far the most sensitive. Anything not mechanical, if it didn't melt, you're good.

One data center we had equipment in sat at 153F (about 67C) for about a week, and all we saw were drive failures, and even those were fairly sparse -- 1 out of 10, I would say.

Thanks
--
--------------------
Bryan Tong
Nullivex LLC | eSited LLC
(507) 298-1624
On 09/07/13 20:28, Erik Levinson wrote:
For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?
While others have already covered what to look out for in terms of systems and drives, I haven't seen anyone mention things like your UPS batteries. Were they also heat-soaked? At one place I worked, we lost a whole bank of batteries when the UPS room overheated. I think that was somewhere around a $95,000 replacement and required rush delivery of a lot of SLA (sealed lead-acid) batteries from all over the place.
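[On the battery point, a commonly cited rule of thumb -- a rough guide, not a vendor spec -- is that VRLA battery service life roughly halves for every ~10 C of sustained operation above 25 C. A short excursion matters far less than sustained heat, but the shape of the curve shows why a heat-soaked battery room gets expensive; the halving interval below is an assumption in the often-quoted 8-10 C range:]

    # Rule-of-thumb derating sketch; halving_c is an assumed value.
    def derated_life(rated_years, avg_temp_c, ref_c=25.0, halving_c=10.0):
        return rated_years / (2 ** ((avg_temp_c - ref_c) / halving_c))

    print(derated_life(10, 25))   # 10.0 years at the reference temperature
    print(derated_life(10, 35))   # 5.0 years sustained at 35 C
    print(derated_life(10, 43))   # ~2.9 years sustained at 43 C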
On Tue, 9 Jul 2013, Erik Levinson wrote:
For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?
I have experience with a different kind of event that might be of interest to a wider audience. When the fire suppression system went off at a site, we had a lot of instant hard drive failures. I don't have exact numbers, but let's say 5-10% of all HDDs in the room died more or less instantly. Supposedly this was because of the air-pressure shock when the inert fire-suppression gas was released: the vents weren't big enough to release the over-pressurized air outside. I did some research, and there are forum posts etc. about these kinds of events happening in other places.

So, the takeaways were: RAID is an uptime tool, not a substitute for backups; and get a qualified ventilation/fire-suppression systems engineer to inspect your sites with this in mind.

--
Mikael Abrahamsson
email: swmike@swm.pp.se
* Erik Levinson <erik.levinson@uberflip.com>: [cooling failure]
For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?
We had a similar event this January (temperatures were a bit higher at 49°C; the duration was a bit shorter, 10am to 3pm). In the two days after the event, two of our HP servers had drives go from "OK" to "Predictive Failure", which is the SmartArray controller's way of signaling high error rates. Two weeks after, a single DIMM developed an uncorrectable ECC error, causing a server reboot. Three weeks after, a single PSU failed. In our opinion, the disk problems were caused by the cooling failure, while the ECC error and the failed PSU were probably unrelated.

I believe your hardware will be fine, but it wouldn't be a bad idea to check that you have current maintenance contracts/warranty for your servers, or some other way of obtaining replacement drives at reasonably short notice.

Cheers
Stefan
Numbers from memory and filed off a bit for anonymity, but...

A site I was consulting with had statistically large numbers of x86 servers (say, 3,000), SPARC enterprise gear (100), NetApp units (60), and NetApp drives (5,000+) go through a roughly 42C excursion. It was much hotter at ceiling level, but fortunately the ceilings were high (20 feet) -- within about 1C of the (wet-pipe) sprinkler head fuse temperature... (shudder)

Both NetApp and x86 server PSUs had significantly increased failure rates for the next year: in rough numbers, 10% failed within the year, and about 2% were instant fails. Hard drives also had a significantly higher failure rate for the next year, also in the 10% range. No change in the rate of motherboard, CPU, or RAM failures was noted that I recall.

George William Herbert

Sent from my iPhone
Ugly. If the batteries in the facility's power distribution system were affected by the heat, then their life is likely significantly shortened, both in terms of their capacity to supply power in the event of an outage and in terms of shelf life.

Lorell

On Jul 9, 2013, at 8:28 PM, "Erik Levinson" <erik.levinson@uberflip.com> wrote:

For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?
Another failure I've seen connected to overheating events is AC power supply failures.

On 07/09/2013 10:28 PM, Erik Levinson wrote:
For those who have gone through such events in the past, what can one expect in terms of long-term impact...should we expect some premature component failures? Does anyone have any stats to share?
This has been a very interesting thread.

Google pointed me to this Dell document, which specs an expanded operating temperature range for some of their servers *** based on the amount of time spent at the elevated temperature, as a percentage of annual operating hours ***:

ftp://ftp.dell.com/Manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-r710_User%27s%20Guide4_en-us.pdf

I mention it because the "1% of annual operating hours" allowance at 45 C is two degrees higher than the 43 C stated as reached in the original email. It would seem that Dell recognizes there may be situations, such as this one, where the "continuous operation" limit (35 C) is briefly exceeded.

Tony Patti
CIO
S. Walter Packaging Corp.
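[For scale, a quick back-of-envelope on that allowance, assuming "1% of annual operating hours" means 1% of 8,760 hours and taking the event duration from the original post:]

    annual_hours = 365 * 24              # 8760
    allowance_h = 0.01 * annual_hours    # 87.6 hours/year permitted at up to 45 C
    event_h = 6.5                        # approx 18:45 until 01:15
    print("used %.1f of %.1f hours (%.1f%% of the annual budget)"
          % (event_h, allowance_h, 100 * event_h / allowance_h))  # ~7.4%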
participants (11): Bryan Tong, Daniel Taylor, Erik Levinson, George Herbert, Jake Khuon, Jay Ashworth, Jimmy Hess, Lorell Hathcock, Mikael Abrahamsson, Stefan Förster, Tony Patti