Hey Patrick,
I do not normally post about firmware bugs, but I have this nightmare scenario running through my head of someone with a couple of mirrored HPE SSD arrays and all the drives going POOF! simultaneously. Even with an off-site backup, that could be disastrous. So if you have HPE SSDs, check this announcement.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us
A couple of years back a lot of folks had this problem with many vendors' optics. One particular vendor's microcontroller was commonly used by many optics vendors, and it had a bug: after 2**31 hundredths of a second of uptime, it started writing the uptime to the memory location of the temperature. Many systems, including Cisco and Juniper, didn't react well to optic temperatures reaching the maximum possible values.

So say you had done large network-wide upgrades 2**31 hundredths of a second ago, with enough time between upgrades to ensure that everything worked before continuing on the redundant parts. Then you'd suddenly lose, like a house of cards, all legs from all devices, no matter how much redundancy was built in.

It just goes to show that focusing on MTBF is usually not a great investment. It's hard to predict what will bring you down, and we tend to be biased toward thinking it will be some physical problem, solved by redundant HW design, when it probably is not; it's probably something related to software or the operator, and hard to predict or prepare for. A focus on MTTR will have a much more predictable ROI.

I can't really point the finger at HP here; these are common bugs and an easy thing for a human to miss. Perhaps static analysis, or more complexity in the compiler and compile-time guarantees, should have covered this.

-- ++ytti
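
For anyone wanting to put a number on the 2**31 hundredths-of-a-second figure, here is a rough Python sketch of the arithmetic. It assumes the uptime counter behaves like a signed 32-bit integer ticking 100 times per second; the post only gives the 2**31 threshold, so the signedness and the helper names (to_int32, days_until_overflow) are illustrative assumptions, not details of the actual microcontroller firmware.

```python
# Sketch only: models a hypothetical signed 32-bit uptime counter that
# ticks once per centisecond. Not taken from any real firmware.

CENTISECONDS_PER_SECOND = 100
SECONDS_PER_DAY = 86_400
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Wrap an integer the way a signed 32-bit register would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def days_until_overflow() -> float:
    """Days of uptime before the counter reaches 2**31 centiseconds."""
    return 2**31 / CENTISECONDS_PER_SECOND / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"overflow after about {days_until_overflow():.2f} days")
    print("last good tick:", to_int32(INT32_MAX))
    print("next tick wraps to:", to_int32(INT32_MAX + 1))
```

The point of the sketch is that the threshold is a fixed uptime of a little over eight months, so devices that were rebooted in the same maintenance window reach it at nearly the same moment.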