Statistical Games Providers Play (RE: availability and resiliency)
Whenever providers start throwing numbers around, you've lost the battle. I always suggest people talk to their insurance agents, not their technical people. Insurance agents are very good at understanding risk, and how much to spend mitigating that risk. Sometimes it is cheaper to buy a new computer every few years than to try to build the perfect protection around it.

The used car dealer is almost never the best source of information about car insurance. Don't expect any better from a salesperson at a provider. You are going to end up with an expensive undercoat and fabric protection package.

As far as I know, the system with the highest publicly stated reliability and availability is FEDWIRE. Fedwire exceeds everything I've seen at NORAD, NASA, or any service provider (carrier, internet, web hosting, etc). Fedwire has five-way redundancy of some systems. It also has the full faith and credit of the US Treasury backing up its service guarantee. Even so, a software error in 1985 resulted in a $23 billion (with a "B") accounting imbalance.

If I take Fedwire as the upper limit, what about providers whose claims exceed what Fedwire delivers? They aren't lying, but you need to understand the numbers. And if any provider does think their system exceeds Fedwire, I would love a tour of your facility.

Due to their history as a regulated monopoly, telephone companies have developed interesting ways to calculate reliability. For example, some telephone companies ignore events which exceed the design parameters of the network. In other words, they don't include Mother's Day in their calculations. Some telephone companies also don't include disruptions due to Acts of God or Force Majeure in their reported numbers. I chuckle whenever I hear someone say "carrier-grade."

Availability statistics are much like flood and storm statistics. A once-in-100-year flood does not mean it will flood only once in any 100-year period. You can have back-to-back floods.
And you can have back-to-back computer failures. Nor does an availability figure limit the length of an outage. You could have a 43-minute failure in Year 1, and no failures in Years 2-5. Or an 86-minute failure in Year 1 and no failures in Years 2-10. Or even an 86-minute failure in Year 1, and an 86-minute failure in Year 2, and no failures in Years 3-20. Remember in statistics when you calculated the series to infinity? If you are still around in Year Infinity, then you can discuss X 9's of availability.

Asking a provider how many 9's of reliability they provide, or the MTBF of their systems, is really a red herring. What you really want to know, and what you should ask, is:

When a failure does occur (and it will):
    how will you respond?
    how will you keep me informed?
    what do I need to do?
and, after you understand those answers:
    how often would I expect this?

No matter how many 9's you have, there is always a .1, .01, .001, .0001, etc. chance of failure. Murphy is exceedingly good at his job.

Ok, if you are still reading, and you still want to build a system as reliable as Fedwire, let's talk. Fedwire has shown it can be done; however, expect to pay as much as Fedwire. On the other hand, if you are willing to settle for just a little less, the price drops dramatically. It's a lot cheaper to build a system to meet a certain level of design risk and buy insurance to cover the excess. It may double the price to add another "9" of reliability, but cost only 10% to cover the risk with insurance.

I am not a lawyer, banker, insurance agent, doctor, or indian chief. You should always consult a licensed professional for advice.
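
The outage scenarios in the message above can be checked with a few lines of arithmetic. This is only a sketch: it assumes 365-day years, and the function name `availability` is made up for illustration. It suggests that all three scenarios average out to the same long-run figure (about 8.6 minutes of downtime per year), even though the customer's experience in any single year differs a great deal.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes 365-day years


def availability(outage_minutes, years):
    """Fraction of time in service over the whole period.

    outage_minutes: list of outage lengths, in minutes
    years: length of the measurement period, in years
    """
    total_minutes = years * MINUTES_PER_YEAR
    return 1 - sum(outage_minutes) / total_minutes


# The three scenarios described in the message.
scenarios = {
    "43 min in Year 1, none in Years 2-5": ([43], 5),
    "86 min in Year 1, none in Years 2-10": ([86], 10),
    "86 min in Years 1 and 2, none in Years 3-20": ([86, 86], 20),
}

for label, (outages, years) in scenarios.items():
    print(f"{label}: {availability(outages, years):.4%}")
```

Each line prints roughly 99.998%, which is the point: the headline number says nothing about when the outages land or how long any one of them lasts.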
Sean et al.,

Excluding force majeure from availability is not totally useless (provided you can still compare apples with apples), on the basis that (historically at least) many other things are likely to cause more worry than failure of your internet/telecoms service in the event of war, asteroid strike, etc.

However, I've seen force majeure clauses which exclude, for instance, weather (drizzle?), actions by telecoms providers (mmm.. didn't even exclude the contracting party and its suppliers), etc. These are about as useful as excluding actions by backhoe operators and train derailments. Part of the problem is that no one ever reads these clauses, often titled (at least in the UK) 'Acts of God'. Many of us do not consider backhoe operators to be God.

Also, it's reasonably common to exclude actions by the customer or failures of their equipment. Given that many /system/ faults are still down to customer power etc., this may give telecoms elements a higher availability than the system as a whole, which is what you were referring to with FEDWIRE. (I.e., when users look at their systems they need to combine availability data and carefully consider whether the probabilities of failures of particular elements are or are not independent.) Some ISPs exclude the tail circuit from their availability figures in its entirety.

Finally, the availability number is meaningless unless there is a clear way of measuring what period it applies to. Five nines availability over a day is completely different from five nines availability over a year, if there is a fixed MTTR (think about it).

I.e., availability numbers are *not* useless, but they generally aren't comparable without looking at the contract, and the system, in depth.

--
Alex Bligh
VP Core Network, XO Communications - http://www.xo.com/
(formerly Nextlink Inc, Concentric Network Corporation
GX Networks, Xara Networks)
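
Two of the points in the reply above lend themselves to a quick calculation: the measurement window, and combining elements into a system figure. The sketch below is illustrative only. The 5-minute MTTR and the three element availabilities are hypothetical numbers, not taken from the thread, and the series formula assumes element failures are independent, which is exactly the assumption the reply says needs checking.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumes a 365-day year


def downtime_budget(availability, window_seconds):
    """Downtime (in seconds) permitted by an availability target over a window."""
    return (1 - availability) * window_seconds


def series_availability(parts):
    """Availability of elements that must ALL work, assuming independent failures."""
    result = 1.0
    for a in parts:
        result *= a
    return result


FIVE_NINES = 0.99999
MTTR = 5 * 60  # hypothetical fixed repair time: 5 minutes, in seconds

for name, window in [("day", SECONDS_PER_DAY), ("year", SECONDS_PER_YEAR)]:
    budget = downtime_budget(FIVE_NINES, window)
    fits = MTTR <= budget
    print(f"per {name}: budget {budget:.3f}s, one {MTTR}s repair fits: {fits}")

# Hypothetical elements: provider core, tail circuit, customer power.
elements = [0.99999, 0.999, 0.9995]
print(f"system as a whole: {series_availability(elements):.4%}")
```

Under these assumptions a single 5-minute repair fits inside a yearly five-nines budget but can never fit inside a daily one, and the combined system figure comes out lower than its weakest element.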
Participants (2): Alex Bligh, Sean Donelan