On Sat, 30 September 2000, Alex Bligh wrote:
Excluding force majeure from availability is not totally useless (provided you can still compare apples with apples), on the basis that (historically at least) many other things are likely to cause more worry than failure of your internet/telecoms service in the event of war, asteroid strike, etc.
I also think it's reasonable to exclude force majeure in an SLA. When I wrote SLAs I excluded all sorts of stuff in the force majeure clause. My point wasn't to suggest it is unreasonable to do so, but customers should understand that a provider includes only certain risks in its SLA. Insurance companies are very good for covering the other risks.

What drove me crazy was that I couldn't get the information I needed to do my job from providers. Fiber cuts happen. I don't want a 99.999% network guarantee; I want an accurate map of my circuits on your network so I can plan where to put my backup circuits. If I could outsource the reliability of my network, it would be great. But the fact is, I was the one who had to live with the fallout when my network failed, not the provider. At most large providers, my "dedicated account team" rarely lasted a fiscal quarter, let alone a fiscal year.

I sometimes think folks, and I was once one of them, are just looking for the provider which promises the highest availability number (100% service guarantee) without realizing some providers achieve such high numbers by excluding a lot of disruptive events. If everyone included and excluded the same things, you could compare the numbers. But not everyone does. So customers need to understand what the numbers mean. It is possible for a provider with a 98% availability guarantee to have better actual performance than a provider promising 100% availability.
Finally, the availability number is meaningless unless there is a clear way of measuring what period it applies to. Five nines availability over a day is completely different to five nines availability over a year, if there is a fixed MTTR (think about it).
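To put rough numbers on the measurement-period point, here is a minimal Python sketch. The 5-minute repair time is a hypothetical figure chosen only for illustration, and the function names are my own; the sketch simply compares the downtime budget that five nines allows over a day with what it allows over a year.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def downtime_budget(availability, window_seconds):
    """Seconds of downtime permitted in the window at the given availability."""
    return (1.0 - availability) * window_seconds


def window_availability(outage_seconds, window_seconds):
    """Availability measured over one window containing the given outage time."""
    return 1.0 - min(outage_seconds, window_seconds) / window_seconds


FIVE_NINES = 0.99999
ASSUMED_MTTR = 5 * 60  # hypothetical: every failure takes 5 minutes to repair

day_budget = downtime_budget(FIVE_NINES, SECONDS_PER_DAY)    # about 0.86 seconds
year_budget = downtime_budget(FIVE_NINES, SECONDS_PER_YEAR)  # about 315 seconds

# A single 5-minute outage fits inside the yearly budget,
# but it is hundreds of times larger than the daily budget.
day_result = window_availability(ASSUMED_MTTR, SECONDS_PER_DAY)    # about 0.9965
year_result = window_availability(ASSUMED_MTTR, SECONDS_PER_YEAR)  # above 0.99999

print(f"daily budget:  {day_budget:.3f} s, one outage gives {day_result:.6f}")
print(f"yearly budget: {year_budget:.2f} s, one outage gives {year_result:.6f}")
```

Under these assumptions the same single outage meets a five nines guarantee measured over the year and misses it badly when measured over the day it happened.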
Ah, Mean Time To Repair. For availability purposes, MTTR is frequently a bigger contributor than MTBF (Mean Time Between Failures). Can you repair one half of a redundant system before the second half fails?
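As a rough illustration of why repair time matters so much, the usual steady-state formula is availability = MTBF / (MTBF + MTTR). The sketch below uses made-up MTBF and MTTR figures, and it assumes the two halves of a redundant pair fail independently, which real systems often do not.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_pair_availability(mtbf_hours, mttr_hours):
    """Availability of two identical components where either one is enough.

    Assumes independent failures: the pair is down only when both are down.
    """
    unavailable = 1.0 - availability(mtbf_hours, mttr_hours)
    return 1.0 - unavailable ** 2


# Hypothetical figures, for illustration only.
baseline = availability(1000, 24)       # fails every 1000 h, 24 h to repair
double_mtbf = availability(2000, 24)    # twice as reliable, same repair time
faster_repair = availability(1000, 4)   # same reliability, 4 h to repair

pair_slow_repair = redundant_pair_availability(1000, 24)
pair_fast_repair = redundant_pair_availability(1000, 4)

print(f"baseline:      {baseline:.5f}")
print(f"double MTBF:   {double_mtbf:.5f}")
print(f"faster repair: {faster_repair:.5f}")
print(f"pair, 24 h repair: {pair_slow_repair:.7f}")
print(f"pair,  4 h repair: {pair_fast_repair:.7f}")
```

With these particular numbers, cutting the repair time from 24 hours to 4 improves availability more than doubling the MTBF does, and the shorter repair window also leaves less time for the second half of the pair to fail.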
I.e., availability numbers are *not* useless - but they generally aren't comparable without looking at the contract, and the system, in depth.
If you look at what happened in the disk drive market, for a while Seagate and the others waged the battle of MTBF. You could buy a 400,000 hour MTBF drive or an 800,000 hour MTBF drive or a 1.2 million hour MTBF drive. Was there a difference between the drives? In some cases, yes. In most cases, you were actually buying the same drive with different extended warranties.

Did the difference in MTBFs mean the disk drives never failed? No. If the drive did fail, did you get data back because it was a 1.2 million hour MTBF drive? No. If you didn't have a backup, did Seagate lose their job, or did the computer operator lose theirs? Does RAID help? Yes. Does RAID completely eliminate disk failures? No.

If you look at network backbones, almost everyone uses the same vendors, supplying essentially the same equipment, and nearly the same network design. So why would different providers have different availability numbers? Is it just an accident of the statistical series, where some providers had their failures earlier but everyone will end up the same in Year Infinity? Or are there real differences, besides price, between providers?
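For the disk drive MTBF figures mentioned above, a small Python sketch shows why even the largest number does not mean "never fails". It assumes a constant failure rate (an exponential model), which is a simplification, and the 1,000-drive fleet size is a hypothetical figure.

```python
import math

HOURS_PER_YEAR = 8760


def annual_failure_probability(mtbf_hours):
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(mtbf_hours, drive_count):
    """Expected number of failures per year across a fleet of drives."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


FLEET_SIZE = 1000  # hypothetical fleet, for illustration only

for mtbf in (400_000, 800_000, 1_200_000):
    p = annual_failure_probability(mtbf)
    n = expected_failures_per_year(mtbf, FLEET_SIZE)
    print(f"MTBF {mtbf:>9,} h: {p:.2%} per drive per year, "
          f"about {n:.1f} failures per {FLEET_SIZE} drives")
```

Under this model even a 1.2 million hour MTBF still implies several failed drives a year in a fleet of a thousand, so the backup and the RAID set matter regardless of which number is on the data sheet.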