So I've embarked on the no-doubt-futile task of trying to interpret SLAs as empirically-verifiable technical specifications, rather than as marketing blather. And there's something that I'm finding particularly puzzling:
In most SLAs, there seem to be two separate guarantees proffered: one concerning "network availability" and one concerning "packet loss." Now, if I were to put my engineer hat on, and try to _imagine_ what the difference might be, I might imagine "network availability" to have something to do with layer-2 link status being presented as "up," while packet loss would be the percentage of packets dropped. But when I actually read SLAs, "network availability" is generally defined as the portion of the month that the path from the customer's local loop to the transit or peering routers was "available" to transmit packets. Packet loss, on the other hand, is generally defined as the portion of packets which are lost while crossing that exact same piece of network.
Now, what am I missing here? Is this one of those Heisenberg things, where "network availability" is the time the network _could have_ delivered a packet _when you weren't actually doing so_, while "packet loss" is the time the network _couldn't_ deliver a packet when you _were_ actually doing so?
Is "network availability" inherently unmeasurable on a network that's less than 100% utilized?
Am I over-thinking this?
Seriously, though, I know there are people who don't consider SLAs to be fantasy-fiction, and some of them must not be innumerate, and some subset of those must be on NANOG, and the intersection set might be equal to or greater than one, right? Can anybody explain this to me in a way I can translate into code, while still taking myself seriously?
-Bill
On Jul 29, 2009, at 12:34 AM, Bill Woodcock wrote:
Am I over-thinking this?
Yes. Not because you are coming to strange conclusions, but because (as you say in your first sentence) you are trying to put empirical / objective meaning to marketing blather.
I had a simple way to fix this. I defined a network as "down" with more than X% packet loss (usually with X in the 2-5 range, depending on other deal parameters). IMHO, a network with 5% packet loss -is- down. I don't know about you, but none of my customers will use my service if they have 5% loss. TCP is finicky! This receives the strongest credit because you cannot use the service.
Below X, you are not "down", just degraded, and therefore the link has some utility, but not 100% utility. This receives a credit, but not as strong a credit as being unable to use a link.
Oh, and, of course, if there is no light on the fiber, then we are (obviously) "down" as well.
Make sense? Or am I over-thinking it? :)
--
TTFN,
patrick
P.S. Now you get to think about things like "packet loss to / from where?" and whether the last mile should count.
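Patrick's two-tier scheme translates fairly directly into code. What follows is only a minimal sketch: the names `LinkState` and `classify_interval` are made up for illustration, and the 5% default for X is an assumption picked from the 2-5 range he mentions, not a value from any real contract.

```python
from enum import Enum


class LinkState(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


def classify_interval(loss_pct: float, has_light: bool = True,
                      down_threshold_pct: float = 5.0) -> LinkState:
    """Classify one measurement interval by its packet loss.

    down_threshold_pct is the "X" from the message above; 5.0 is an
    assumed default, since the message only gives a 2-5 range.
    """
    if not has_light:
        # No light on the fiber: down regardless of measured loss.
        return LinkState.DOWN
    if loss_pct > down_threshold_pct:
        # More than X% loss: unusable, strongest credit tier.
        return LinkState.DOWN
    if loss_pct > 0.0:
        # Some loss but not above X: degraded, smaller credit tier.
        return LinkState.DEGRADED
    return LinkState.UP
```

The P.S. still applies: the function says nothing about where `loss_pct` is measured from or to.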
Am I over-thinking this?
Yes, I think so. Often a large component of an SLA is related to the cost of compliance versus the cost of the penalty imposed. If it is cheaper to pay the occasional penalty, rather than construct the network to meet the SLA, then the network operator will often make a purely sales/marketing decision to use the SLA without including engineering/OPS in the discussion.
Also, the wording often refers to unplanned downtime so that any planned downtime doesn't get counted in the non-availability measure. And sometimes you find some allowance for packet drop during a limited time period so that if you drop a thousand packets, it doesn't count if it happens during the peak hour of the day or if all packets are dropped within a few minutes.
Another limitation that I have seen refers to "core" network or "core" PoPs, meaning the part of the network in the major market area (generally the USA and Western Europe) but not covering network or PoPs in "fringe" areas.
I don't believe that there is any hard science behind SLAs, and I suspect that most engineering/OPS teams don't even know which SLAs are actually being given to customers. There are engineering targets that are sometimes referred to as SLAs, but they are not the Service Level Agreement that is in signed customer contracts.
All that aside, it would be interesting to see some standards for measuring and reporting things like "network availability" from an engineering point of view.
--Michael Dillon
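The planned-downtime carve-out is easy to show in numbers. The sketch below assumes planned maintenance is removed from the measurement period altogether (contracts differ on this); the function name and its parameters are hypothetical.

```python
def availability_excluding_planned(total_minutes: int,
                                   unplanned_down_minutes: int,
                                   planned_down_minutes: int = 0) -> float:
    """Availability in percent, with planned maintenance excluded from
    the measured period instead of being counted as downtime.

    This exclusion rule is an assumption made for illustration.
    """
    measured = total_minutes - planned_down_minutes
    if measured <= 0:
        raise ValueError("no measured time left after planned downtime")
    return 100.0 * (measured - unplanned_down_minutes) / measured
```

For a 30-day month (43,200 minutes) with 432 minutes of unplanned outage, this gives 99.0%. Four further hours of planned maintenance in the same month change the result only slightly, because they shrink the measured period rather than adding to the downtime.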
Bill,
To be brief, but hopefully not too fleeting, the majority of the standards orgs (ITU, MEF) use packet loss to derive availability.
Loss% = the % of packets which were transmitted but not received by the destination host.
As for availability, loss is measured across some time period. If during that period at least X% of the transmitted packets were NOT lost, then the network is said to be available. Typically a 20% figure is used, e.g. if 20% of the packets transmitted during a 5-minute period were received then the network is said to be 100% Available for that 5-minute time period. Some Carriers have taken this to the extreme to say that if at least 1 packet was successfully transmitted then the network was 100% Available for the time period.
Loss is a measure of the network's usability; Availability is .......?? (Meaningless??) What utility does a network have that is "Available" yet sustaining a loss rate which renders it inoperable?
Rich
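The loss-derived availability Rich describes can be sketched as below. This is an illustration under stated assumptions, not the ITU or MEF procedure itself: the function names are invented, each interval is a hypothetical `(sent, received)` packet count, and the 20% delivery threshold is the figure quoted in his message.

```python
def interval_available(sent: int, received: int,
                       min_delivered_pct: float = 20.0) -> bool:
    """An interval is 'available' if at least min_delivered_pct of the
    transmitted packets arrived."""
    if sent == 0:
        # Nothing was sent, so nothing was measured. Counting this as
        # available is an assumption, and is the ambiguity Bill raises.
        return True
    return 100.0 * received / sent >= min_delivered_pct


def availability_from_loss(intervals: list[tuple[int, int]],
                           min_delivered_pct: float = 20.0) -> float:
    """Percentage of (sent, received) intervals counted as available."""
    if not intervals:
        raise ValueError("no intervals measured")
    up = sum(interval_available(s, r, min_delivered_pct)
             for s, r in intervals)
    return 100.0 * up / len(intervals)


def overall_loss_pct(intervals: list[tuple[int, int]]) -> float:
    """Percentage of all transmitted packets that were not received."""
    sent = sum(s for s, _ in intervals)
    received = sum(r for _, r in intervals)
    return 100.0 * (sent - received) / sent if sent else 0.0


# Four hypothetical 5-minute intervals.
sample = [(1000, 1000), (1000, 250), (1000, 100), (1000, 1000)]
print(availability_from_loss(sample))  # 75.0
print(overall_loss_pct(sample))        # 41.25
```

With these made-up counts the network is 75% "available" while losing 41.25% of its packets, which is the mismatch the message points at.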
I think the desired goal here is to separate the access SLA from the backbone SLA. That is, consider a simple picture:
Network Cloud------Provider Edge Router-----Local Loop-----Customer Router
Network availability is the % of the time the customer router and provider edge router can communicate, and is designed to measure if the local loop is up. For instance, let's say the provider edge router loses all its uplinks to the Network Cloud: your local loop is up and functioning, but you have 100% packet loss to all destinations.
The "packet loss" SLA kicks in on a per-destination basis. Everything is up and working, but the provider has a full circuit and is dropping 20% of the packets on that link. You catch it, you get a credit.
I think the technical reason why these are separate has to do with the expectations. If my local loop is dropping 0.5% of the packets due to errors, it is broken and must be fixed. If some random destination on the Internet is dropping 0.5% of the packets, well, that's a normal day in the life of the network.
Plus, if your local loop takes errors then you get a credit. However, if there's a full link in the backbone but none of your packets take it, and thus you are unaffected, you don't.
Now, having said all that, and having been one of the people who've attempted to communicate sane, rational, technical ideas to marketing and legal, the chance that anything sane made it in the actual contract is, well, nil.
--
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
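Leo's split between an access guarantee and a per-destination loss guarantee might be checked separately, roughly as follows. Everything here is hypothetical: the function name, the destination labels, and both thresholds (99.9% loop availability, 1% per-destination loss) are assumptions for illustration only.

```python
def evaluate_sla(loop_up_fraction: float,
                 loss_by_destination: dict[str, float],
                 min_availability: float = 0.999,
                 max_loss: float = 0.01) -> tuple[bool, list[str]]:
    """Return (availability_breached, destinations_over_loss_limit).

    loop_up_fraction: share of the period in which the customer router
        and provider edge router could communicate.
    loss_by_destination: measured loss fraction per destination.
    Both thresholds are assumed values, not contract terms.
    """
    availability_breached = loop_up_fraction < min_availability
    lossy = sorted(dest for dest, loss in loss_by_destination.items()
                   if loss > max_loss)
    return availability_breached, lossy


# The provider edge router loses all uplinks: the loop stays up, yet
# every destination shows 100% loss.
print(evaluate_sla(1.0, {"dest-a": 1.0, "dest-b": 1.0}))
# (False, ['dest-a', 'dest-b'])
```

In that scenario only the packet-loss guarantee is breached, which is why the two are written as separate clauses.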
Now, having said all that, and having been one of the people who've attempted to communicate sane, rational, technical ideas to marketing and legal, the chance that anything sane made it in the actual contract is, well, nil.
I disagree. If someone takes the trouble to publish a technical document describing a sane technical way to measure a network SLA, and they also provide code for measuring/calculating the SLA, then there is a good chance that the industry will pick it up. Look at 95th percentile billing. Dave Rand at Abovenet thought it up, probably to simplify the billing process and keep billing overhead costs down. Then UUNet picked it up and suddenly just about everyone was offering a 95th percentile billing model. -- Michael Dillon
We use the BRIX active measurement system (BRIX is now owned by EXFO), which gathers round-trip time, packet loss, and jitter randomly every minute, 24x7x365, for our major backbone links to calculate SLAs. "Network Availability" can be measured empirically using BRIX-calculated values of packet loss, and expressed in terms of the number of 9's, which BRIX will also calculate over any time period for which BRIX historical data is being kept. BRIX historical data is kept in an embedded Oracle database. BRIX usually runs on a Solaris SMP server.
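For reference, the "number of 9's" mentioned in the BRIX message is a simple function of the availability fraction. The sketch below is generic arithmetic, not BRIX's own calculation (which is not described in the thread); the function names and the 30-day month are assumptions.

```python
import math


def nines(availability: float) -> float:
    """Number of nines for an availability fraction, e.g. 0.999 -> 3."""
    if not 0.0 <= availability < 1.0:
        raise ValueError("availability must be in [0, 1)")
    return -math.log10(1.0 - availability)


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = 30 * 24 * 60) -> float:
    """Downtime budget for a period (default: an assumed 30-day month)."""
    return (1.0 - availability) * period_minutes
```

Three nines over a 30-day month leaves a budget of roughly 43 minutes of downtime.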
On Wed, Jul 29, 2009 at 12:34 AM, Bill Woodcock<woody@pch.net> wrote:
Am I over-thinking this?
The SLAs I've looked at promise me that if their service is hard down for a week (with no ambiguity whatsoever) they'll credit my bill for upwards of 2% of the $50k/year or so I spend on the Internet connection for my multi-million dollar online service.
So yeah, you're overthinking it. When they start coupling those SLAs with some sort of serious business loss insurance, then paying attention to the SLA and carefully examining what constitutes failure may make some kind of sense at a technical level.
Regards,
Bill Herrin
--
William D. Herrin ................ herrin@dirtside.com bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
William Herrin wrote:
On Wed, Jul 29, 2009 at 12:34 AM, Bill Woodcock<woody@pch.net> wrote:
Am I over-thinking this?
The SLAs I've looked at promise me that if their service is hard down for a week (with no ambiguity whatsoever) they'll credit my bill for upwards of 2% of the $50k/year or so I spend on the Internet connection for my multi-million dollar online service.
I'm really surprised anyone considers this an SLA, or anything special in a business contract. I automatically expect to get a credit of 1.923% if the service were not provided for a period of 168 hours, no questions asked and no SLA required.
When service is simply not provided, there's nothing special about not having to pay for it. I don't know of any business where you can have a contract that requires you to pay your monthly/annual fee for services when said services are not provided. If you have a housekeeping or lawn service that is supposed to come once a week, and you have an annual contract with them for this service at $50/week, and they miss a week (provide no service) you don't pay them anyway for that missed week. You don't need an SLA in your contract with them to have this right to withhold payment for the period of time when the services are not provided *at all*.
An SLA comes into play when a service is degraded below the quality you contracted for. What credit do they give you when you have 168 hours of degraded service, e.g. 50% of the service level you specified in your RFQ? That's where your SLA comes in. The SLA specifies at what point your service is considered "degraded" (how much below the contracted service level, and how long of a time period is required before it is considered below grade) and what $credit you may receive when you are provided some service, but not to the level specified in your contract.
jc
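The 1.923% figure is one week out of a 52-week year (168 / 8736 hours). A small sketch of that pro-rata arithmetic, using the $50k/year from the quoted message; the function name and the 52-week default are assumptions made to match the figure.

```python
def pro_rata_credit(annual_fee: float, hours_down: float,
                    hours_per_year: float = 52 * 168) -> float:
    """Credit owed if the fee is refunded strictly in proportion to the
    time the service was not provided.

    hours_per_year defaults to 52 weeks of 168 hours, which reproduces
    the 1.923% above; a 365-day year (8760 hours) gives about 1.918%.
    """
    return annual_fee * hours_down / hours_per_year


credit = pro_rata_credit(50_000, 168)
print(round(credit, 2))                 # 961.54
print(round(100 * credit / 50_000, 3))  # 1.923
```

On these numbers, a credit of "upwards of 2%" for a week of hard downtime is almost exactly the plain pro-rata refund.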
On Wed, Jul 29, 2009 at 4:19 PM, JC Dill<jcdill.lists@gmail.com> wrote:
William Herrin wrote:
The SLAs I've looked at promise me that if their service is hard down for a week (with no ambiguity whatsoever) they'll credit my bill for upwards of 2% of the $50k/year or so I spend on the Internet connection for my multi-million dollar online service.
An SLA comes into play when a service is degraded below the quality you contracted for. What credit do they give you when you have 168 hours of degraded service, e.g. 50% of the service level you specified in your RFQ? That's where your SLA comes in. The SLA specifies at what point your service is considered "degraded" (how much below the contracted service level, and how long of a time period is required before it is considered below grade) and what $credit you may receive when you are provided some service, but not to the level specified in your contract.
Hi JC,
Perhaps you miss my point: what the ISP is offering to pay me as a result of a failure to deliver adequate service is so much less than my loss for the same as to render the payment meaningless.
I'm gonna terminate the contract for nonperformance and hire someone who can get the job done long before it's worth my time to chase you for an SLA-based service credit. And we both know it. The only way I ever chase you for an SLA credit is if I'm playing the blame game instead of doing my job for my customers.
Regards,
Bill Herrin
--
William D. Herrin ................ herrin@dirtside.com bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
On 7/29/09, William Herrin <herrin-nanog@dirtside.com> wrote:
Perhaps you miss my point: what the ISP is offering to pay me as a result of a failure to deliver adequate service is so much less than my loss for the same as to render the payment meaningless. I'm gonna terminate the contract for nonperformance and hire someone who can get the job done long before it's worth my time to chase you for an SLA-based service credit. And we both know it. The only way I ever chase you for an SLA credit is if I'm playing the blame game instead of doing my job for my customers.
Actually, SLA credits are useful in cases where the circuit is not the only path between two sites. If, for example, you have 12 OC192 links running across the US, but your peak traffic on them doesn't exceed 80Gb combined, having an OC192 down for a day or two won't really hurt you. There's no reason to cancel the circuit, since the rest of your links are carrying the traffic just fine; but since one of the links failed to meet its SLA, you might as well push the vendor to give you the SLA credit back. It saves you some money, you have no lost customers, and you have no other impact to your business.
It's not about playing the "blame game", it's about giving the vendor an incentive to try to run their system a bit more reliably.
Now, for single-homed customers depending on that one link, I agree, an SLA is largely meaningless compared to the impact of being down. But there are many cases where the SLA is meaningful, and collecting SLA credits is worth it, without there being a corresponding massive loss in revenue associated with the outage.
Matt
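Matt's headroom argument is quick to check. The sketch below uses the OC-192 line rate of 9.95328 Gb/s (usable payload is somewhat lower) and assumes traffic can be spread evenly over the surviving links; the function name is invented for illustration.

```python
OC192_GBPS = 9.95328  # OC-192 line rate; usable payload is a bit lower


def survives_failures(links: int, peak_gbps: float, failed: int = 1,
                      per_link_gbps: float = OC192_GBPS) -> bool:
    """True if the remaining links can still carry the peak load,
    assuming traffic rebalances evenly (a simplifying assumption)."""
    remaining_capacity = (links - failed) * per_link_gbps
    return remaining_capacity >= peak_gbps


print(survives_failures(12, 80.0))            # True: 11 links, ~109.5 Gb/s
print(survives_failures(12, 80.0, failed=4))  # False: 8 links, ~79.6 Gb/s
```

With 12 links and an 80 Gb/s peak, up to three simultaneous failures still leave enough raw capacity, so a single outage costs nothing except the credit worth claiming.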
JC Dill wrote:
William Herrin wrote:
The SLAs I've looked at promise me that if their service is hard down for a week (with no ambiguity whatsoever) they'll credit my bill for upwards of 2% of the $50k/year or so I spend on the Internet connection for my multi-million dollar online service.
I'm really surprised anyone considers this an SLA, or anything special in a business contract. I automatically expect to get a credit of 1.923% if the service were not provided for a period of 168 hours, no questions asked and no SLA required.
When service is simply not provided, there's nothing special about not having to pay for it.
Read your contract closely and you'll find that, except for an explicit SLA clause (which will cost you extra), they make no guarantee that the circuit will work at all and you'll still owe them money.
On top of that, the SLA payouts are usually capped at an amount _less_ than the price increase due to demanding an SLA. If your circuit costs $2k/mo, and it's down for an entire month, you'll probably still owe them at least $1500 for that non-service -- and you could buy a non-SLA service for the same $1500/mo.
(Savvy customers who are spending big bucks know how to negotiate these terms to be more favorable, but most customers aren't savvy unless they've already been burned by this.)
S
--
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking
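Stephen's example works out as follows. This is a sketch only: it assumes the credit is pro-rated by outage time and then capped, and the $500 cap is inferred from his "$2k/mo, still owe at least $1500" numbers rather than taken from any contract.

```python
def amount_owed(monthly_price: float, outage_fraction: float,
                credit_cap: float) -> float:
    """Monthly bill after an SLA credit that is pro-rated by outage
    time but capped at credit_cap (an assumed credit structure)."""
    credit = min(monthly_price * outage_fraction, credit_cap)
    return monthly_price - credit


# $2k/mo circuit, down the entire month, credit capped at $500.
print(amount_owed(2000, 1.0, 500))  # 1500.0
```

With the cap no larger than the SLA premium, the customer pays at least the non-SLA price even for a month with no service at all.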
Stephen Sprunk wrote:
Read your contract closely and you'll find that, except for an explicit SLA clause (which will cost you extra), they make no guarantee that the circuit will work at all and you'll still owe them money.
I am not a lawyer. However, over the years many lawyers have told me you can't have a legally enforceable contract that says (in essence) you owe me money even if I give you absolutely nothing in exchange (or vice versa). A legally enforceable contract must *always* have an exchange of consideration - I give you something (money, labor, tangible property, intangible property) in exchange for something you give me.
Many businesses try this type of crap all the time, but (according to the above mentioned lawyers) it's not worth the paper it is written on. They make these clauses hoping the other party doesn't know their rights. However, contract law (e.g. the UCC) trumps unenforceable and illegal clauses in your contracts (this is why we *have* civil laws regarding civil contracts, otherwise there would be no point in civil laws at all).
But please don't take my word for it, ask your own lawyer to review your contract and give you an opinion about the legality and enforceability of clauses of this type, in your particular contract.
jc
Indeed, that's why some companies have contracts managers with experience of thieving gits who try to rip you off on SLAs.
We have indeed been burned, and so our contracts worth any money now have real good incentives for the vendors to come up with the goods and make what they sell us work. Even so, sometimes important stuff gets dropped because the vendor refuses to be bound by it, and then we get screwed over it.
--
Leigh
participants (12)
- Andreas, Rich
- Bill Woodcock
- Holmes,David A
- JC Dill
- Leigh Porter
- Leo Bicknell
- Matthew Petach
- Michael Dillon
- Net
- Patrick W. Gilmore
- Stephen Sprunk
- William Herrin