Limits of reliability or is 99.999999999% realistic
SLAs are the point where commercial & operational worlds collide. The numbers being offered should:

- Reflect the targeted quality of the service
- Tie in with meaningful damages for not achieving them

In a market where I can offer 99.7% or 100% and the difference is a whole day's service credit, I know what I'd be offering. No question.

If on the other hand, the expectation is that I pay out a year's contract value in cash upfront every time I miss a target in a month, then I'd only want to offer a target that will *always* be achieved.

In the market at the moment, the situation is much closer to the first scenario than the second - i.e. damages for SLAs do not mean anything to either:

- the buyer, as compensation for not receiving the service that they have contracted for; or
- the seller, as a motivation to work within the targets, because capped service credit agreements do not touch the bottom line

Today, in the IP market, it is irrelevant whether services come with 99.0 or 99.99999 SLAs. At some point the market needs to address the responsibility that if providers are to offer a service of a certain "guaranteed" quality, then they should stand by that guarantee with their money and give this guarantee meaning.

I don't think this is the case at the moment, and that's why we even see 100% SLAs in the market - because the level of pay-outs on SLAs doesn't matter to the seller.

No, I'm not suggesting that sellers offer guarantees for consequential losses. Simply ones that:

1. give the buyer the peace of mind that if the service they've contracted for is below par, they will have enough money put back in their pocket to have replaced their service, like-for-like in the market, to cover themselves over the period.
2. enable the market players to truly differentiate their service offerings by quality, rather than marketing.

Toby [std. disclaimer of responsibility here]

Date: 25 Nov 2000 20:24:48 -0800
From: Sean Donelan <sean@donelan.com>
Subject: Limits of reliability or is 99.999999999% realistic

<snipped for brevity>

But back to my question. What is the real requirement? Amazon.COM had system problems on Friday, and their site was unusable for 30 minutes, definitely not 99.999%. But what did that really mean? The FAA loses its radar for several hours in various parts of the country. What did that really mean? Essentially every system given as an example of "high-availability, high-reliability" I've looked at doesn't hold up under close examination.

Is 99.999% just F.U.D. created by consultants?

Instead of pretending we can build systems which will never fail, should we work on a realistic understanding of what can be delivered?
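For reference, the arithmetic behind the "nines" in Sean's question can be sketched in a few lines of Python. This is a minimal illustration, not anything from the original posts: the function names are hypothetical, and it assumes availability is measured as uptime over a fixed calendar period (365 days by default), which real SLAs may define differently.

```python
# Hypothetical helpers: availability is assumed to mean uptime / total time
# over a fixed period. Real SLAs may exclude maintenance windows, etc.

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct, period_days=365.0):
    """Minutes of downtime permitted per period at the given availability."""
    return (1.0 - availability_pct / 100.0) * period_days * MINUTES_PER_DAY


def availability_after_outage(outage_minutes, period_days=365.0):
    """Availability (in percent) for a period containing this much downtime."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - outage_minutes / total_minutes)


if __name__ == "__main__":
    for pct in (99.0, 99.7, 99.9, 99.99, 99.999):
        print(f"{pct}% allows {downtime_budget_minutes(pct):.2f} min/year")
    # A single 30-minute outage, measured over a year and over a 30-day month.
    print(f"30 min in a year:  {availability_after_outage(30):.4f}%")
    print(f"30 min in a month: {availability_after_outage(30, 30):.4f}%")
```

Under these assumptions 99.999% allows a little over five minutes of downtime per year, so a single 30-minute outage already exceeds a year's five-nines budget.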
Hello;

This reminds me of arguments that we had in my former work involving deep spacecraft. In spacecraft work, JPL has found that, if you did not strive for "5 sigma's" or even "6 sigma's" of reliability, then there would always be something you hadn't counted on that would drive reliability to zero.

In other words, planning for very high reliability makes you do the engineering which gives you the redundancy which makes it possible to withstand unexpected events without (too much) failure. To the extent SLA's reflect that, they should be useful, regardless of how sound the statistics are.

--
Regards
Marshall Eubanks

T.M. Eubanks
Multicast Technologies, Inc
10301 Democracy Lane, Suite 410
Fairfax, Virginia 22030
Phone : 703-293-9624 Fax : 703-293-9609
e-mail : tme@on-the-i.com tme@multicasttech.com
http://www.on-the-i.com http://www.buzzwaves.com
On Mon, Nov 27, 2000 at 11:07:34AM -0500, Marshall Eubanks wrote:
In spacecraft work, JPL has found that, if you did not strive for "5 sigma's" or even "6 sigma's" of reliability, then there would always be something you hadn't counted on that would drive reliability to zero.
NASA's needs don't reflect our needs. If my router fails, I can get to it to fix it in seconds, and if I need to replace it I don't have to wait five years to launch and another year to reach the target. If I have to have a replacement overnighted, I'm still only down a day, not a decade. Also, NASA's budget for a POP on Mars is a little higher than my budget for a POP in Colorado Springs. Hell, their budget for a single lander is probably higher than my budget for an entire data center.
Also, NASA's budget for a POP on Mars is a little higher than my budget for a POP in Colorado Springs. Hell, their budget for a single lander is probably higher than my budget for an entire data center.
It depends on where the lander goes, and how big the data center is. The development and operations costs for the Mars Pathfinder rover were 25 million. I don't have the numbers splitting that into development vs. operations, but in another example, the total mission cost of the Lunar Prospector, including all of the instrument development costs, was 63 million. Amortize the development costs over all the missions the instruments will be reused on (and saddle the Prospector with its share of the instrument costs it got to reuse), and the "built" mission costs work out to be under 20 million. I admit, the Lunar Prospector's landing was a bit spectacular, but it wasn't that expensive.

You can build nice data centers for these costs, I admit, but the costs aren't as different as you might think.

regards,
Ted Hardie
2000-11-27-11:07:34 Marshall Eubanks:
[...] In other words, planning for very high reliability makes you do the engineering which gives you the redundancy which makes it possible to withstand unexpected events without (too much) failure. To the extent SLA's reflect that, they should be useful, regardless of how sound the statistics are.
A reasonable and good observation to keep in mind from an engineering point of view, but I think the essence of the current complaint with SLAs is that they are completely decoupled from engineering. They seem to show up only with providers whose service is so poorly run that they never, ever approach delivering the claimed levels, and the SLA itself carries no weight since the penalties for failure (if they can be extracted at all) are too small to benefit the customer, or to influence the provider.

In today's internet world they're just marketing drivel.

Naturally such strong statements beg for a counterexample; please, someone, tell us about providers that offer SLAs with big enough payoffs to provide some sort of incentive, who deliver on the service levels they boast about. Please!

-Bennett
The reason for this perception is that non-engineering folks don't understand the engineering behind almost perfect uptimes. Typical Joe executive just knows he believes his business is mission critical and must always be available.

If you give someone two router ports running HSRP, connect an Alteon or similar one-public-to-many-private IP switch to each of these, put at least 2 servers behind each switch, and provide instant generator power backup and multiple links thru different providers to the net, this can provide the 99.bazillion 9s level that customers desire.

Some SLAs are meaningful. Just make it so that as time goes on in the outage, an increasing portion of the customer's costs are waived. Nothing talks like money in business or politics.

Brian
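The redundancy argument above can be put in numbers with the standard series/parallel availability formulas. The sketch below is illustrative only: the 99% per-component figure, the simplified layout, and the function names are assumptions, and it treats failures as independent, which is exactly the assumption that shared power, shared conduits, or a common software bug tends to break.

```python
from math import prod


def series(*availabilities):
    """Every component must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities):
    """At least one component must be up (independent failures assumed)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def site_availability(component=0.99):
    """Simplified model of the layout described above (an assumption):
    two identical legs, each a router port, a switch, and two servers,
    fronted by two upstream links through different providers."""
    servers = parallel(component, component)
    leg = series(component, component, servers)  # router, switch, servers
    legs = parallel(leg, leg)
    uplinks = parallel(component, component)
    return series(legs, uplinks)


if __name__ == "__main__":
    print(f"single component: {0.99:.4%}")
    print(f"redundant site:   {site_availability(0.99):.4%}")
```

Even with mediocre 99% components the modelled site comes out well above 99.9%, but only as long as the failures really are independent.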
IMHO, based on my previous professional services life, for people for whom uptime really matters, link costs tend to be negligible. For people who really needed their stuff to work and would be SLA candidates, a carrot bigger than 'your service is free for x yrs' would have to be presented. Especially for outages costing millions of dollars in lost revenue, not to mention lawsuits etc.

Just look at what happens to Amazon when their service craps out for 30 minutes: you get an interview on CNNfn right away. Just look at the exposure.

Unless you've got an SLA with teeth, and penalties several multiples or magnitudes of the cost of the service, they're pointless IMHO and just yet another marketing tool a la "free installation" or "1st mth is free" type deals.

Cheers,
Chris

--
Christian Kuhtz <ck@arch.bellsouth.net> -wk, <ck@gnu.org> -hm
Sr. Architect, Engineering & Architecture, BellSouth.net, Atlanta, GA, U.S.
"I speak for myself only."
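The two proposals in this part of the thread, a credit that grows the longer the outage lasts and a penalty worth several multiples of the service fee, can be sketched as a simple tiered schedule. Everything specific here is hypothetical: the tier thresholds, the fractions, and the `cap_multiple` parameter are made up for illustration and do not come from any real SLA.

```python
# Hypothetical escalating schedule: (minimum outage hours, payout as a
# fraction of the monthly fee). These numbers are illustrative only.
TIERS = [
    (1.0, 0.10),   # 1 hour or more: 10% of the monthly fee
    (4.0, 0.50),   # 4 hours or more: half the monthly fee
    (24.0, 1.0),   # a full day: the whole monthly fee
    (72.0, 3.0),   # three days: three times the monthly fee
]


def sla_payout(outage_hours, monthly_fee, cap_multiple=1.0):
    """Payout for one outage, capped at cap_multiple times the monthly fee.

    cap_multiple=1.0 models a capped service credit; a larger value models
    an SLA whose penalties exceed the cost of the service.
    """
    fraction = 0.0
    for min_hours, tier_fraction in TIERS:
        if outage_hours >= min_hours:
            fraction = tier_fraction
    return min(fraction, cap_multiple) * monthly_fee


if __name__ == "__main__":
    fee = 1000.0
    for hours in (0.5, 2, 30, 100):
        capped = sla_payout(hours, fee)
        uncapped = sla_payout(hours, fee, cap_multiple=12.0)
        print(f"{hours:>5} h outage: capped {capped:8.2f}, "
              f"12x cap {uncapped:8.2f}")
```

With the cap at one month's fee, a 30-hour outage and a 100-hour outage cost the provider the same amount; raising the cap is what makes the longer outage hurt more.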
All--

In a perfect world, your provider says the service is always on, the service is always on. In the real world, we have to deal with outages -- some kids vandalise a phone box, new tech trips over some cables, some idiot telco misimplements MPLS and brings your service down for a day.... These things happen, and sometimes we all just have to suck it down and deal with it. But if it happens continuously, you have to ask your provider for some assurance that it won't keep happening. This is what SLAs are for.

In my experience, a company that delivers reasonable levels of service has no need for SLAs with their clients. The service is up, everybody is happy. SLAs are like your parent telling you to do the dishes again "and get it right this time, or else you go to bed with no jell-o!"

Which is why, in looking for a vendor, I ask for an SLA. If they have one as a standard offering, then I know that they've messed up a lot in the past, and will probably be messing up more than I like in the future.

It's like the kid who never did his homework coming up to the teacher on the last day of school asking for extra credit. NO YOU FOOL! You should have done it right the last 180 days!

Anyway, just my thoughts.

On Mon, 27 Nov 2000 Toby_Williams@enron.net wrote:
Date: Mon, 27 Nov 2000 10:14:43 +0000
From: Toby_Williams@enron.net
To: nanog@nanog.org
Subject: Limits of reliability or is 99.999999999% realistic

SLAs are the point where commercial & operational worlds collide. The numbers being offered should:

- Reflect the targeted quality of the service
- Tie in with meaningful damages for not achieving them

In a market where I can offer 99.7% or 100% and the difference is a whole day's service credit, I know what I'd be offering. No question.

If on the other hand, the expectation is that I pay out a year's contract value in cash upfront every time I miss a target in a month, then I'd only want to offer a target that will *always* be achieved.

In the market at the moment, the situation is much closer to the first scenario than the second - i.e. damages for SLAs do not mean anything to either:

- the buyer, as compensation for not receiving the service that they have contracted for; or
- the seller, as a motivation to work within the targets, because capped service credit agreements do not touch the bottom line

Today, in the IP market, it is irrelevant whether services come with 99.0 or 99.99999 SLAs. At some point the market needs to address the responsibility that if providers are to offer a service of a certain "guaranteed" quality, then they should stand by that guarantee with their money and give this guarantee meaning.

I don't think this is the case at the moment, and that's why we even see 100% SLAs in the market - because the level of pay-outs on SLAs doesn't matter to the seller.

No, I'm not suggesting that sellers offer guarantees for consequential losses. Simply ones that:

1. give the buyer the peace of mind that if the service they've contracted for is below par, they will have enough money put back in their pocket to have replaced their service, like-for-like in the market, to cover themselves over the period.
2. enable the market players to truly differentiate their service offerings by quality, rather than marketing.

Toby [std. disclaimer of responsibility here]

Date: 25 Nov 2000 20:24:48 -0800
From: Sean Donelan <sean@donelan.com>
Subject: Limits of reliability or is 99.999999999% realistic

<snipped for brevity>

But back to my question. What is the real requirement? Amazon.COM had system problems on Friday, and their site was unusable for 30 minutes, definitely not 99.999%. But what did that really mean? The FAA loses its radar for several hours in various parts of the country. What did that really mean? Essentially every system given as an example of "high-availability, high-reliability" I've looked at doesn't hold up under close examination.

Is 99.999% just F.U.D. created by consultants?

Instead of pretending we can build systems which will never fail, should we work on a realistic understanding of what can be delivered?
participants (8)
- Bennett Todd
- Brian W.
- Christian Kuhtz
- hardie@equinix.com
- Marshall Eubanks
- mdevney@teamsphere.com
- Shawn McMahon
- Toby_Williams@enron.net