If folks have not read it, I would suggest reading Normal Accidents by Charles Perrow. The "it can't happen" is almost guaranteed to happen. ;-) And when it does, it'll often interact in ways we can't predict or sometimes even understand.
As for pulling the plug to test stuff: I recall a demo at NetApp in the early 2000s. They were talking about their fault tolerance and how great it was. So I walked up to their demo array and said, "So, it shouldn't be a problem if I pulled this drive right here?" Before I could, the salesperson or tech guy (I can't remember which) told me to stop. He didn't want to risk it. That right there said loads about their confidence in their own system.
At 03:08 PM 7/2/2012, George Herbert wrote:
Late reply, but:
On Sat, Jun 30, 2012 at 12:30 AM, Lynda <shrdlu@deaddrop.org> wrote:
... Second, and more important. I *was* a "computer science guy" in a past life, and this is nonsense. You can have astonishingly large software projects that just continue to run smoothly, day in, day out, and they don't hit the news, because they don't break. There are data centers that don't hit the news, in precisely the same way.
I really need to write the book on IT reliability I keep meaning to.
There's reliability, which is backward-looking and statistical and can be 100% for a given service or datacenter, and then there's dependability, the forward-predicted outage risk. People often *assert* that dependability equals the prior reliability record, but in reality you often have a number of latent failures (and latent cascade paths) that you do not understand, did not identify previously, and are not aware of.
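To put rough numbers on that gap, here is a minimal sketch of how little a clean record bounds forward risk. It is an illustration only: the function names are hypothetical, and it assumes each period (say, a month) fails independently with the same probability, which is exactly the kind of assumption that latent common-mode failures break, so real risk may well be worse than this.

```python
def failure_rate_upper_bound(clean_periods: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the per-period failure probability
    after observing `clean_periods` periods with zero failures.

    Assumption (for illustration only): periods are independent and
    identically distributed. Solves (1 - p) ** n = 1 - confidence for p.
    """
    if clean_periods <= 0:
        raise ValueError("clean_periods must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_periods)


def prob_at_least_one_failure(per_period_rate: float, periods: int) -> float:
    """Probability of one or more failures over `periods` future periods."""
    return 1.0 - (1.0 - per_period_rate) ** periods


if __name__ == "__main__":
    # Hypothetical example: three years (36 months) with a perfect record.
    bound = failure_rate_upper_bound(36)
    print(f"per-month failure rate could still be as high as {bound:.3f}")
    print(f"chance of an outage next year at that rate: "
          f"{prob_at_least_one_failure(bound, 12):.2f}")
```

Under these assumptions, a 100% record over 36 months is still statistically consistent with a monthly failure probability of roughly 8%, which is why a prior record alone cannot stand in for a dependability estimate.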
I've had or had to respond to over a billion dollars of cumulative IT disaster loss over my consulting career so far; I have NEVER seen anyone who did it perfectly, even the best pros. And I include myself in that list.
Looking at other fields like aerospace and nuclear engineering, what is done in IT is not anywhere close to the same level of QA and engineering analysis and testing. We cannot assert better results with less work.
"Oh, that never happens", except I've had my stuff in three locations that had catastrophic generator failures. "Oh, that never happens" when you're doing power maintenance and the best-rated electrical company in California, in conjunction with the generator vendor and a couple of independent power EEs, mis-balance the maintenance generator loads between legs and blow the generators and datacenter. "Oh, that never happens" that the datacenter burns (or starts to burn and then gets flooded). "Oh, that never happens" that the FM-200 goes off or preaction breaks and water leaks. "Oh, that never happens" that well maintained and monitored and triple-redundant AC units all trip offline due to a common mode failure over the course of a weekend and the room gets up to 106 degrees. Oh thank god the next thing didn't go wrong in THAT situation, because the spot temperature meters indicated that the ceiling height of that particular room peaked at 1 degree short of the temp at which the sprinkler heads are supposed to discharge, so we nearly lost that room to flooding rather than just a 10% disk and 15% power supply attrition over the next year...
Don't be so confident in the infrastructure. It's not engineered or built or maintained well enough to actually support that assertion. The same can be said of the application software and application architecture and integration.
-- -george william herbert george.herbert@gmail.com
Greg D. Moore http://greenmountainsoftware.wordpress.com/ CEO QuiCR: Quick, Crowdsourced Responses. http://www.quicr.net