HPE SAS Solid State Drives - Critical Firmware Upgrade Required
I do not normally post about firmware bugs, but I have this nightmare scenario running through my head of someone with a couple of mirrored HPE SSD arrays and all the drives going POOF! simultaneously. Even with an off-site backup, that could be disastrous. So if you have HPE SSDs, check this announcement.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us
-- TTFN, patrick
Since this is an SSD manufacturer problem, does it impact other servers that might have SSDs from the same manufacturer? HP hasn't said who the manufacturer is?
Geoff
On 11/26/19 1:45 PM, Patrick W. Gilmore wrote:
I do not normally post about firmware bugs, but I have this nightmare scenario running through my head of someone with a couple of mirrored HPE SSD arrays and all the drives going POOF! simultaneously. Even with an off-site backup, that could be disastrous. So if you have HPE SSDs, check this announcement.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us
Looking at a handful of images and listings online, it appears at least some (?) are Samsung - for example, HP 816562-B21 is just a rebadged Samsung MZ-ILS4800. Unknown whether it affects only the HPE digitally signed firmware or all firmware, though.
On Tue, Nov 26, 2019 at 3:58 PM <nanog08@mulligan.org> wrote:
Since this is an SSD manufacturer problem, does it impact other servers that might have SSDs from the same manufacturer?
HP hasn't said who the manufacturer is?
Geoff
On 11/26/19 1:45 PM, Patrick W. Gilmore wrote:
I do not normally post about firmware bugs, but I have this nightmare scenario running through my head of someone with a couple of mirrored HPE SSD arrays and all the drives going POOF! simultaneously. Even with an off-site backup, that could be disastrous. So if you have HPE SSDs, check this announcement.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us
-- Alex Buie
Technical Support Expert, Level 3 - Networking
Datto, Inc.
475-288-4550 (o) 585-653-8779 (c)
www.datto.com
Hey Patrick,
I do not normally post about firmware bugs, but I have this nightmare scenario running through my head of someone with a couple of mirrored HPE SSD arrays and all the drives going POOF! simultaneously. Even with an off-site backup, that could be disastrous. So if you have HPE SSDs, check this announcement.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us
A couple of years back a lot of folks had this problem with many vendors' optics. One particular vendor's microcontroller was commonly used across many optics vendors, and it had a bug: after 2**31 hundredths of a second (roughly 248 days) it started writing uptime into the memory location holding temperature, and many systems, Cisco and Juniper included, did not react well to optic temperatures reaching the maximum possible values.

So say you had done a large network-wide upgrade 2**31 hundredths of a second ago, with enough time between upgrades to ensure that everything worked before continuing on to the redundant parts. Then, like a house of cards, you would suddenly lose all legs from all devices, no matter how much redundancy was built in.

It just goes to show that a focus on MTBF is usually not a great investment: it is hard to predict what will bring you down, and we tend to bias toward thinking it is some physical problem, solved by redundant hardware design, when it probably is not; it is probably something related to software or the operator, and hard to predict or prepare for. A focus on MTTR will have a much more predictable ROI.

I can't really point a finger at HP here; these are common bugs and an easy thing for a human to miss. Perhaps static analysis, or stronger compile-time guarantees from the compiler, should have covered this.

-- ++ytti
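As an illustration of the time scale involved, here is a minimal sketch (plain Python arithmetic, not the vendor's firmware code) of how long a signed 32-bit counter of 1/100-second ticks lasts before it overflows; that window is why every box upgraded in the same maintenance cycle hits the bug at nearly the same moment, regardless of redundancy.

# Minimal sketch, not the vendor's code: lifetime of a signed 32-bit
# counter that ticks once every 1/100 s.
MAX_TICKS = 2**31 - 1                 # largest value a signed 32-bit int holds
seconds = MAX_TICKS / 100             # each tick is a hundredth of a second
print(f"overflow after {seconds:,.0f} s = {seconds / 86400:.2f} days")
# prints: overflow after 21,474,836 s = 248.55 days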
I’ve been bitten by these sorts of issues before, so I tend to swap one OEM drive in every RAID-1 pair with a retail drive from (if possible) a different vendor. When I re-purpose servers, I try to use drives from two different vendors in each array. That way, if a drive barfs for any intrinsic reason, things keep working. This can impact performance, but is cheap insurance.
paul
On Nov 26, 2019, at 3:45 PM, Patrick W. Gilmore <patrick@ianai.net> wrote:
I do not normally post about firmware bugs, but I have this nightmare scenario running through my head of someone with a couple of mirrored HPE SSD arrays and all the drives going POOF! simultaneously. Even with an off-site backup, that could be disastrous. So if you have HPE SSDs, check this announcement.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us
-- TTFN, patrick
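Paul's mixed-vendor RAID-1 idea above lends itself to a quick audit. Below is a rough sketch (assumed Linux md sysfs paths, sdX-style member names only, hypothetical helper name, not a supported tool) that flags a mirror whose members all report the same vendor and model, i.e. a mirror that one vendor-wide firmware bug could take out in a single stroke.

# Rough sketch under the assumptions above: read each md member's
# vendor/model from sysfs and warn when they are all identical.
import pathlib
import sys

def member_models(md: str = "md0") -> dict:
    models = {}
    for member in pathlib.Path(f"/sys/block/{md}/slaves").iterdir():
        disk = member.name.rstrip("0123456789")   # sda1 -> sda (sdX names only)
        dev = pathlib.Path(f"/sys/block/{disk}/device")
        vendor = (dev / "vendor").read_text().strip()
        model = (dev / "model").read_text().strip()
        models[member.name] = f"{vendor} {model}"
    return models

if __name__ == "__main__":
    md = sys.argv[1] if len(sys.argv) > 1 else "md0"
    models = member_models(md)
    for name, ident in sorted(models.items()):
        print(f"{md}/{name}: {ident}")
    if len(set(models.values())) < 2:
        print(f"warning: all members of {md} share one vendor/model")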
On Thu, 28 Nov 2019 at 01:25, Paul Nash <paul@nashnetworks.ca> wrote:
I’ve been bitten by these sorts of issues before, so I tend to swap one OEM drive in every RAID-1 pair with a retail drive from (if possible) a different vendor. When I re-purpose servers, I try to use drives from two different vendors in each array. That way, if a drive barfs for any intrinsic reason, things keep working.
I think the problem here is that it adds complexity and cost, may impact support and thus contracts, and it is not clear that it has ever saved you an outage. It becomes belief engineering. I think it would be more useful to figure out how you can replace that device, data and all, in the shortest possible window, so that every unexpected failure mode is covered with the smallest possible outage.
-- ++ytti
participants (5)
- Alex Buie
- nanog08@mulligan.org
- Patrick W. Gilmore
- Paul Nash
- Saku Ytti