Statistical Games Providers Play (RE: availability and resiliency)

30 Sep 2000

      Whenever providers start throwing numbers around, you've lost the battle.

I always suggest people should talk to their insurance agents, not their
technical people.  Insurance agents are very good at understanding risk,
and how much to spend mitigating that risk.  Sometimes it is cheaper to
buy a new computer every few years than to try to build the perfect
protection around it.

The used car dealer is almost never the best source of information
about car insurance.  Don't expect any better from a salesperson at
a provider.  You are going to end up with an expensive undercoating and
fabric protection package.

As far as I know, the system with the highest publicly stated reliability
and availability is FEDWIRE.  Fedwire exceeds everything I've seen at
NORAD, NASA, or any service provider (carrier, internet, web hosting, etc).
Fedwire has five-way redundancy of some systems.  It also has the full
faith and credit of the US Treasury backing up its service guarantee.  A
software error in 1985 resulted in a $23 billion (with a "B") accounting
imbalance.

If I take Fedwire as the upper limit, I need to ask: what about providers
whose claims exceed what Fedwire delivers?  They aren't lying, but
you need to understand the numbers.  And if any provider does think their
system exceeds Fedwire, I would love a tour of your facility.

Due to their history as a regulated monopoly, telephone companies have
developed interesting ways to calculate reliability.  For example, some
telephone companies ignore events which exceed the design parameters of
the network.  Or in other words, they don't include Mother's Day in their
calculations.  Some telephone companies also don't include disruptions
due to Acts Of God or Force Majeure in their reported numbers.  I chuckle
whenever I hear someone say "carrier-grade."

Availability statistics are much like flood and storm statistics. A once
in 100 year flood does not mean it will flood only once in any 100 year
period.  You can have back to back floods.  And you can have back to back
computer failures.  Nor does it limit the length of an outage. You could
have a 43 minute failure in Year 1, and no failures in Years 2-5.  Or an
86 minute failure in Year 1 and no failures in Years 2-10.  Or even an 86
minute failure in Year 1, and an 86 minute failure in Year 2, and no failures
in Years 3-20.  Remember in statistics when you calculated the series to
infinity.  If you are still around in Year Infinity, then you can discuss
X 9's of availability.
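
To make the arithmetic concrete, here is a rough sketch (the function name
and the 365-day year are my assumptions, not anyone's official method) that
averages each of the outage patterns above over its own window.  All three
come out to the same long-run number, even though Year 1 looks very
different in each.

```python
# Sketch only: averages total outage minutes over a multi-year window.
# Assumes a 365-day year (525,600 minutes) and ignores leap years.
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(outage_minutes, years):
    """Fraction of time the service was up over the whole window."""
    total_minutes = years * MINUTES_PER_YEAR
    return 1 - sum(outage_minutes) / total_minutes


# The three outage patterns described in the text above.
scenarios = {
    "43 min in Year 1, clean through Year 5": ([43], 5),
    "86 min in Year 1, clean through Year 10": ([86], 10),
    "86 min in Years 1 and 2, clean through Year 20": ([86, 86], 20),
}

for label, (outages, years) in scenarios.items():
    print(f"{label}: {availability(outages, years):.6%}")
```

The same average says nothing about when the outage lands or how long any
single outage lasts, which is the point.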

Asking a provider how many 9's of reliability they provide, or the MTBF of
their systems, is really a red herring.  What you really want to know, and
what you should ask, is:

   When a failure does occur (and it will):
     how will you respond?
     how will you keep me informed?
     what do I need to do?
and after you understand those answers
     how often would I expect this?

No matter how many 9's you have, there is always a .1, .01, .001, .0001, etc.
chance.  Murphy is exceedingly good at his job.
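
For reference, a quick sketch of what each count of 9's allows in expected
downtime per year.  The helper name is made up for illustration, and it
assumes a 365-day year and a simple long-run average, nothing more.

```python
# Sketch only: converts "N nines" into average allowed downtime per year.
# Assumes a 365-day year; "nines" counts the 9's in the percentage
# (2 -> 99%, 3 -> 99.9%, and so on).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines):
    """Average minutes of downtime per year implied by N nines."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for nines in range(2, 7):
    minutes = downtime_minutes_per_year(nines)
    print(f"{nines} nines: about {minutes:,.2f} minutes per year")
```

These are averages over a long horizon, not a cap on any one year.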

Ok, if you are still reading, and you still want to build a system as
reliable as Fedwire, let's talk.  Fedwire has shown it can be done; however,
expect to pay as much as Fedwire.  On the other hand, if you are willing
to settle for just a little less, the price drops dramatically.  It's a lot
cheaper to build a system to meet a certain level of design risk, and buy
insurance to cover the excess.  It may double the price to add another "9"
of reliability, but only 10% to cover the risk with insurance.
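
A back-of-the-envelope sketch of that trade-off.  The $1,000,000 base cost
is purely hypothetical, and the 2x and 10% figures are just the rough ratios
from the paragraph above, not quotes from any vendor or insurer.

```python
# Sketch only: compares two ways to cover the same residual risk.
# All dollar figures are hypothetical; the multiplier and premium rate
# are the illustrative ratios from the text, not real pricing.


def cost_of_extra_nine(base_cost, multiplier=2.0):
    """Build cost if adding another 9 doubles the price."""
    return base_cost * multiplier


def cost_with_insurance(base_cost, premium_rate=0.10):
    """Build cost plus an insurance premium covering the excess risk."""
    return base_cost * (1 + premium_rate)


base = 1_000_000  # hypothetical cost of the system as designed
print(f"Engineer another 9: ${cost_of_extra_nine(base):,.0f}")
print(f"Insure the excess:  ${cost_with_insurance(base):,.0f}")
```

Your own multiplier and premium will differ; ask the insurance agent.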

I am not a lawyer, banker, insurance agent, doctor, or indian chief.  You
should always consult a licensed professional for advice.

Sean Donelan
