Wandering off the subject of BT's misfortune ... Sean Donelan wrote:
On Mon, 26 Nov 2001, Christian Kuhtz wrote:
[...]
Faults will happen. And nothing matters as much as how you prepare for when they do.
Mean Time To Repair is a bigger contributor to Availability calculations than the Mean Time To Failure. It would be great if things never failed.
And Mean Time To Fault Detected (Accurately) is usually the biggest sub-contributor within Repair but that's kinda your point.
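For what it's worth, a back-of-the-envelope sketch of that arithmetic, assuming the usual steady-state formula Availability = MTTF / (MTTF + MTTR) and treating repair time as detection time plus fix time. All the numbers and names below are made up for illustration; they do not come from any real network.

```python
# Hypothetical figures, purely illustrative (hours).
HOURS_PER_YEAR = 8760.0


def availability(mttf, mttr):
    """Steady-state availability: fraction of time the system is up."""
    return mttf / (mttf + mttr)


def downtime_per_year(mttf, mttr):
    """Expected hours of downtime per year for the given MTTF and MTTR."""
    return (1.0 - availability(mttf, mttr)) * HOURS_PER_YEAR


mttf = 2000.0            # assumed mean time to failure
detect, fix = 3.0, 1.0   # assumed split of repair time: detection vs. actual fix

# Repair time dominated by slow fault detection.
baseline = downtime_per_year(mttf, detect + fix)

# Same failure rate and same fix time, but the fault is detected in 30 minutes.
faster_detection = downtime_per_year(mttf, 0.5 + fix)

print(f"baseline:         {baseline:.1f} h/year down")
print(f"faster detection: {faster_detection:.1f} h/year down")
```

Note that the formula only sees the ratio of MTTR to MTTF, so halving MTTR buys the same availability as doubling MTTF. The point made above is that repair time, and detection time in particular, tends to be the part you can actually do something about.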
But some people are making their systems so complicated chasing the Holy Grail of 100% uptime, they can't figure out what happened when it does fail.
Similar people pursue the creation of a perpetuum mobile. A strange and somewhat congruent example I stumbled into recently: http://www.sce.carleton.ca/netmanage/perpetum.shtml. Overall simplicity of the system, including its failure detection mechanisms, and real redundancy are the most reliable tools for availability. Of course, popping just a few layers out, profit and politics are elements of most systems.
Murphy's revenge: The more reliable you make a system, the longer it will take you to figure out what's wrong when it breaks.
Hmm.