Re: Statistical Games Providers Play (RE: availability and resiliency)

1 Oct 2000

      On Sat, 30 September 2000, Alex Bligh wrote:
...
> Excluding force-majeure from availability is not totally unuseful
> (provided you can still compare apples with apples), on the basis
> that (historically at least) many other things are likely to
> cause more worry than failure of your internet/telecoms service
> in the event of war/asteroid strike etc.

I also think it's reasonable to exclude force majeure in an SLA. When I
wrote SLAs I excluded all sorts of stuff in the force majeure clause.
My point wasn't to suggest it is unreasonable to do so, but customers
should understand a provider includes only certain risks in their SLA.
Insurance companies are very good for covering the other risks.

What drove me crazy was I couldn't get the information I needed to do
my job from providers.  Fiber cuts happen. I don't want a 99.999% network
guarantee, I want an accurate map of my circuits on your network so I
can plan where to put my backup circuits.  If I could outsource the
reliability of my network, it would be great.  But the fact is, I was
the one who had to live with the fallout when my network failed, not the
provider.  At most large providers, my "dedicated account team" rarely
lasted a fiscal quarter let alone a fiscal year.

I sometimes think folks, and I was once one of them, are just looking for
the provider which promises the highest availability number (100% service
guarantee) without realizing some providers are achieving such high numbers
by excluding a lot of disruptive events.  If everyone included and excluded
the same things, you could compare the numbers.  But not everyone does.  So
customers need to understand what the numbers mean.  It is possible for a
provider with a 98% availability guarantee to have better actual performance
than a provider promising 100% availability.
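
A rough sketch of how that can happen, using made-up outage logs for two
hypothetical providers. The causes, hours, and exclusion lists below are
assumptions for illustration only, not data from any real SLA.

```python
# Hypothetical outage logs; none of these figures come from a real provider.
HOURS_PER_YEAR = 365 * 24

def availability(outage_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period the service was up."""
    return 1.0 - sum(outage_hours) / period_hours

def reported(outages, excluded_causes):
    """Availability as the SLA counts it: excluded causes are ignored."""
    return availability([h for cause, h in outages if cause not in excluded_causes])

def experienced(outages):
    """Availability as the customer lives it: every outage counts."""
    return availability([h for _, h in outages])

# (cause, hours down) over one year
provider_a = [("fiber cut", 30.0), ("scheduled maintenance", 40.0)]
provider_b = [("fiber cut", 6.0), ("scheduled maintenance", 8.0), ("router failure", 10.0)]

a_exclusions = {"fiber cut", "scheduled maintenance"}  # assumed SLA carve-outs
b_exclusions = set()                                   # assumed: nothing excluded

for name, log, excl in (("A", provider_a, a_exclusions), ("B", provider_b, b_exclusions)):
    print(f"provider {name}: reported {reported(log, excl):.3%}, "
          f"experienced {experienced(log):.3%}")
```

With these invented numbers, provider A reports a perfect year while its
customers saw about 99.2%, and provider B comfortably meets a 98% guarantee
while delivering about 99.7%.
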
...
> Finally, the availability number is meaningless unless there is
> a clear way of measuring what period it applies to. Five nines
> availability over a day is completely different to five nines
> availability over a year, if there is a fixed MTTR (think about
> it).

Ah, Mean Time To Repair.  For availability purposes, MTTR is frequently
a bigger contributor than MTBF (Mean Time Between Failure).  Can you
repair one half of a redundant system before the second half fails?
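
The arithmetic behind both points can be sketched as follows. The four-minute
outage, the MTBF/MTTR hours, and the constant-failure-rate model are all
assumptions chosen for illustration; real repair and failure times are
messier.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def downtime_budget_seconds(availability, period_days):
    """Downtime allowed within the measurement window at the target availability."""
    return (1.0 - availability) * period_days * SECONDS_PER_DAY

def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run availability of a single repairable unit: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def second_failure_during_repair(mtbf_hours, mttr_hours):
    """Chance the surviving half fails before the repair completes.
    Assumes independent failures at a constant rate (exponential model)."""
    return 1.0 - math.exp(-mttr_hours / mtbf_hours)

FIVE_NINES = 0.99999
ASSUMED_MTTR_SECONDS = 4 * 60  # assumption: each outage lasts four minutes

for label, days in (("day", 1), ("year", 365)):
    budget = downtime_budget_seconds(FIVE_NINES, days)
    verdict = "fits" if ASSUMED_MTTR_SECONDS <= budget else "blows"
    print(f"five nines over a {label}: {budget:.2f} s allowed; one outage {verdict} the budget")

# Halving MTTR buys the same availability as doubling MTBF (hypothetical hours).
print(steady_state_availability(1000, 4))
print(steady_state_availability(2000, 4), steady_state_availability(1000, 2))
```

Five nines allows under a second of downtime per day but a little over five
minutes per year, so the same four-minute outage breaks the daily figure and
fits inside the yearly one. The last two lines show why repair time matters
as much as failure rate: cutting MTTR in half moves the availability exactly
as far as doubling MTBF.
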
...
> I.e., availability numbers are *not* useless - but they generally
> aren't comparable without looking at the contract and the system
> in depth.

If you look at what happened in the disk drive market, for a while Seagate
and the others waged the battle of MTBF.  You could buy a 400,000 hour
MTBF drive or an 800,000 hour MTBF drive or a 1.2 million hour MTBF drive.
Was there a difference between the drives?  In some cases, yes.  In most
cases, you were actually buying the same drive with different extended
warranties.

Did the difference in MTBFs mean the disk drives never failed?  No.  If
the drive did fail, did you get your data back because it was a 1.2 million
hour MTBF drive? No.  If you didn't have a backup, did Seagate lose their job,
or did the computer operator lose theirs?  Does RAID help? Yes.  Does RAID
completely eliminate disk failures? No.
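
To put the MTBF ratings in perspective, here is a minimal sketch that turns
each one into a yearly failure probability. It assumes a constant failure
rate (the usual exponential model behind MTBF figures) and a hypothetical
fleet of 1000 drives running around the clock; neither assumption comes from
any vendor's data sheet.

```python
import math

HOURS_PER_YEAR = 365 * 24

def annual_failure_probability(mtbf_hours, powered_hours=HOURS_PER_YEAR):
    """Chance a drive fails within a year of continuous use.
    Assumes a constant failure rate (exponential model), which ignores
    infant mortality and wear-out."""
    return 1.0 - math.exp(-powered_hours / mtbf_hours)

FLEET_SIZE = 1000  # hypothetical number of drives in service

for mtbf in (400_000, 800_000, 1_200_000):
    p = annual_failure_probability(mtbf)
    print(f"MTBF {mtbf:>9,} h: {p:.2%} per drive per year, "
          f"about {p * FLEET_SIZE:.0f} failures in {FLEET_SIZE} drives")
```

Under that model the three ratings work out to roughly 2%, 1%, and 0.7% per
drive per year. Even the best rating still means a handful of dead drives
every year in a fleet that size, so the backup question does not go away.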

If you look at network backbones, almost everyone uses the same vendors,
supplying essentially the same equipment, and nearly the same network
design.  So why would different providers have different availability
numbers?  Is it just an accident of the statistical series, where some providers
had their failures earlier but everyone will end up the same in Year Infinity?
Or are there real differences, besides price, between providers?

Sean Donelan