Re: Statistical Games Providers Play (RE: availability and resiliency)
On Sat, 30 September 2000, Alex Bligh wrote:
Excluding force majeure from availability is not totally useless (provided you can still compare apples with apples), on the basis that, historically at least, in the event of war, asteroid strike, etc., many other things are likely to cause more worry than the failure of your internet/telecoms service.
I also think it's reasonable to exclude force majeure in an SLA. When I wrote SLAs I excluded all sorts of stuff in the force majeure clause. My point wasn't to suggest it is unreasonable to do so, but customers should understand a provider includes only certain risks in its SLA. Insurance companies are very good for covering the other risks.

What drove me crazy was that I couldn't get the information I needed to do my job from providers. Fiber cuts happen. I don't want a 99.999% network guarantee; I want an accurate map of my circuits on your network so I can plan where to put my backup circuits. If I could outsource the reliability of my network, it would be great. But the fact is, I was the one who had to live with the fallout when my network failed, not the provider. At most large providers, my "dedicated account team" rarely lasted a fiscal quarter, let alone a fiscal year.

I sometimes think folks, and I was once one of them, are just looking for the provider that promises the highest availability number (100% service guarantee) without realizing some providers achieve such high numbers by excluding a lot of disruptive events. If everyone included and excluded the same things, you could compare the numbers. But everyone doesn't, so customers need to understand what the numbers mean. It is possible for a provider with a 98% availability guarantee to have better actual performance than a provider promising 100% availability.
Finally, the availability number is meaningless unless there is a clear way of measuring what period it applies to. Five nines availability over a day is completely different to five nines availability over a year, if there is a fixed MTTR (think about it).
Ah, Mean Time To Repair. For availability purposes, MTTR is frequently a bigger contributor than MTBF (Mean Time Between Failures). Can you repair one half of a redundant system before the second half fails?
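To make the measurement-window point concrete, here is a rough back-of-the-envelope sketch in Python. All numbers (the 4-hour MTTR, the 400,000-hour MTBF) are hypothetical, just to show how the downtime budget works out.

# Illustrative sketch: how MTTR and the measurement window interact
# with an availability target. All numbers are hypothetical.

HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365

def allowed_downtime(availability, window_hours):
    """Downtime budget (in hours) for a given availability over a window."""
    return (1 - availability) * window_hours

def steady_state_availability(mtbf_hours, mttr_hours):
    """Classic steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

five_nines = 0.99999
print(allowed_downtime(five_nines, HOURS_PER_DAY) * 3600)   # ~0.86 seconds per day
print(allowed_downtime(five_nines, HOURS_PER_YEAR) * 60)    # ~5.3 minutes per year

# With a fixed 4-hour MTTR, a single failure exceeds the daily five-nines
# budget by roughly four orders of magnitude; averaged over a year, the
# same failure profile only needs an MTBF of about 400,000 hours.
print(steady_state_availability(400_000, 4))                 # ~0.99999

The same 4-hour outage that is a rounding error against an annual budget makes a daily five-nines guarantee unachievable, which is why the measurement period matters as much as the number of nines.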
I.e. availability numbers are *not* useless -- but they generally aren't comparable without looking at the contract, and the system, in depth.
If you look at what happened in the disk drive market, for a while Seagate and the others waged the battle of MTBF. You could buy a 400,000-hour MTBF drive, an 800,000-hour MTBF drive, or a 1.2 million-hour MTBF drive. Was there a difference between the drives? In some cases, yes. In most cases, you were actually buying the same drive with different extended warranties. Did the difference in MTBFs mean the disk drives never failed? No. If the drive did fail, did you get your data back because it was a 1.2 million-hour MTBF drive? No. If you didn't have a backup, did anyone at Seagate lose their job, or did the computer operator lose theirs? Does RAID help? Yes. Does RAID completely eliminate disk failures? No.

If you look at network backbones, almost everyone uses the same vendors, supplying essentially the same equipment, and nearly the same network design. So why would different providers have different availability numbers? Is it just an accident of the statistical series, some providers had their failures earlier but everyone will end up the same in Year Infinity? Or are there real differences, besides price, between providers?
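As a rough illustration of the point that a big MTBF does not mean drives never fail: converting MTBF into an annual failure probability (assuming a simple exponential failure model and an invented 1,000-drive fleet) shows that even the 1.2-million-hour drive still fails regularly at scale.

import math

HOURS_PER_YEAR = 24 * 365

def annual_failure_prob(mtbf_hours):
    """P(a given unit fails within a year), assuming exponential failures."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

for mtbf in (400_000, 800_000, 1_200_000):
    p = annual_failure_prob(mtbf)
    # Expected failures per year in a hypothetical 1,000-drive fleet
    print(f"MTBF {mtbf:>9,} h: {p:.2%} per drive/year, ~{1000 * p:.0f} of 1,000 drives")

Roughly 22, 11, and 7 failures a year out of 1,000 drives, respectively, in expectation: the MTBF marketing number changes how often you restore from backup, not whether you need one.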
On 30 Sep 2000, Sean Donelan wrote:
If you look at network backbones, almost everyone uses the same vendors, supplying essentially the same equipment, and nearly the same network design. So why would different providers have different availability numbers? Is it just an accident of the statistical series, some providers had their failures earlier but everyone will end up the same in Year Infinity? Or are there real differences, besides price, between providers?
There appear to be two major and some minor variants in backbone engineering and architecture, the major ones being the UUNET design and the Sprint/Qwest design for circuit layout and aggregation/hierarchy. There are a lot more variants in the routing architecture (IGP setup, BGP setup, et al.), and depending on the failure mode, some are better than others for a subset of failures and vice versa. The hierarchical UUNET design, for example, is fairly dense in terms of volume, with a small time diameter per region for a network of that size, which allows for some local optimizations. And if you get two circuits into two regions, some failures in one region can be isolated and compartmentalized without a major spillover into the neighboring regions, which would not be the case in a large flat network. As always, with good engineering, comparable reliability can be established, given appropriate amounts of money being thrown at the problem. /vijay
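A minimal sketch of the "two circuits into two regions" arithmetic, assuming regional failures really are independent (which is exactly what the compartmentalized design is trying to buy you); the per-circuit availability numbers are made up.

# Hypothetical per-circuit availabilities; the point is the arithmetic,
# not the numbers.
a_region1 = 0.995   # circuit homed into region 1
a_region2 = 0.995   # circuit homed into region 2

# If regional failures are independent (the compartmentalization described
# above), the combined service is down only when both circuits are down:
combined = 1 - (1 - a_region1) * (1 - a_region2)
print(f"{combined:.6f}")   # 0.999975

# In a large flat network a single event can take out both paths at once;
# the two circuits are then correlated and the combined number is little
# better than a single circuit's 0.995.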
<snip>
If you look at network backbones, almost everyone uses the same vendors, supplying essentially the same equipment, and nearly the same network design. So why would different providers have different availability numbers? Is it just an accident of the statistical series, some providers had their failures earlier but everyone will end up the same in Year Infinity? Or are there real differences, besides price, between providers?
While most backbones run similar equipment, the service contract arrangements with vendors probably vary across the board. When talking about edge (customer aggregation) devices rather than the core, there are some providers who spare every type of card in every single hub, some who designate spare centers in geographic proximity to many hubs, some who pay Vendor X some $$ for X-hour response to RMAs, and others who rely on boilerplate RMA procedures. It is surprising how large a role these factors play in restoral times on edge devices and how much they affect MTTR. On top of that, a little brains in the NOC along with decent documentation on spares and resources is another factor that can affect MTTR.

The core and backbone should always have redundancy built in (both network and hardware) and should continue to provide some level of service should a failure occur -- no matter what the ultimate design is. But like Vijay said, designing the network and reducing the number of dependencies (whether protocol, hardware or backbone) will also decrease MTTR. -dave
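A toy breakdown of how the sparing and RMA choices listed above feed into MTTR, and MTTR into edge availability. Every figure here is invented purely for illustration.

# MTTR decomposed into detect + obtain spare + swap (hours).
# Values are hypothetical, just to show how the pieces add up.
STRATEGIES = {
    "spares in every hub":   {"detect": 0.5, "get_spare": 0.5, "swap": 1.0},
    "regional spare center": {"detect": 0.5, "get_spare": 4.0, "swap": 1.0},
    "boilerplate RMA":       {"detect": 0.5, "get_spare": 48.0, "swap": 1.0},
}

MTBF_HOURS = 50_000   # hypothetical edge line card MTBF

for name, parts in STRATEGIES.items():
    mttr = sum(parts.values())
    availability = MTBF_HOURS / (MTBF_HOURS + mttr)
    print(f"{name:24s} MTTR {mttr:5.1f} h -> availability {availability:.5f}")

With the same hardware and the same MTBF, the boilerplate-RMA shop loses roughly an order of magnitude more availability per failure than the shop with spares on site, which is the "restoral times on edge devices" effect in numbers.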