On Mon, 26 Nov 2001, Christian Kuhtz wrote:
> Now, if lack of infrastructure reliability can harm human life you may feel differently, but that isn't the case for most of us at the present time.
I've designed software and networks used for public safety and emergencies. And yes, people have died on my watch. It is a somewhat different mindset, but not that different. A lot of "good engineering practice" applies to any engineering activity, including software engineering. It's not even a matter of cost. A typical hospital spends less on its emergency power system than an Internet/telco hotel does. The major difference is that the hospital staff knows (more or less) what to do when the generators don't work. The big secret is that most "life safety" systems fail regularly. Most of the time it doesn't matter, because the "big one" doesn't coincide with the failure.
Faults will happen. And nothing matters as much as how you prepare for when they do.
Mean Time To Repair is a bigger contributor to Availability calculations than the Mean Time To Failure. It would be great if things never failed. But some people are making their systems so complicated chasing the Holy Grail of 100% uptime that they can't figure out what happened when one does fail. Murphy's revenge: The more reliable you make a system, the longer it will take you to figure out what's wrong when it breaks.
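
To put rough numbers on the MTTR point, here is a minimal sketch using the usual steady-state formula, Availability = MTTF / (MTTF + MTTR). The function names and the hour figures are made up for illustration; they are not measurements from any real system.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours, mttr_hours):
    """Expected hours of downtime in a year at that availability."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical scenarios: (MTTF hours, MTTR hours)
scenarios = {
    "baseline":           (1000, 4),
    "twice as reliable":  (2000, 4),
    "repaired 4x faster": (1000, 1),
}

for name, (mttf, mttr) in scenarios.items():
    print(f"{name:20s} availability={availability(mttf, mttr):.6f} "
          f"downtime={downtime_hours_per_year(mttf, mttr):.1f} h/yr")
```

With these illustrative figures, cutting the repair time from four hours to one improves availability more than doubling the time between failures does. Which of the two is cheaper to achieve depends on the system, but knowing what to do when it breaks is usually the cheaper lever.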