I've gotten myself into an argument with a provider about the definition of 'industry-standard 95th percentile method.' To me, this means the following: a) take the number of bytes xfered over a 5 minute period, and determine the rate for both the inbound and outbound. Store this in your favorite data store. b) at billing time, presumably on the first of the month or some other monthly increment, take all the samples, sort them from greatest to least, hacking off the top 5% of samples. Actually, this is done twice: once for inbound, once for outbound. Then, take the higher of those two, and multiply it by your favorite $ multiple (i.e., $500 per megabit per second, or $1 per kilobit per second, etc.). I think that most people agree with the above; the issue we are running into is one rogue provider who is billing this at in + out, not the greater of in or out. How is everyone else doing it? Specifically, larger folks (UU, Sprint, CW, Exodus/FGC, GX, Qwest, L3) Thanks!
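A minimal sketch of the method Alex describes, in Python; the sample values, the handling of the 5% cutoff, and the dollar rate are illustrative assumptions, not any provider's actual billing code:

```python
def percentile_95(samples):
    """Sort descending, discard the top 5% of samples, return the next value."""
    ordered = sorted(samples, reverse=True)
    cut = int(len(ordered) * 0.05)  # number of samples to hack off the top
    return ordered[cut]

def monthly_charge(in_samples, out_samples, dollars_per_mbps):
    """Bill on the greater of the inbound and outbound 95th percentiles."""
    billable_bps = max(percentile_95(in_samples), percentile_95(out_samples))
    return billable_bps / 1_000_000 * dollars_per_mbps
```

With a month of 5-minute samples (in bits per second) for each direction, `monthly_charge` implements exactly the max(95th-in, 95th-out) reading of the method, as opposed to the in + out variant disputed below.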
Hi Alex, I work as an engineer in the product development group at Telseon. I'm curious about what feedback you get, especially what method the other providers you list use to calculate the 95th percentile. For what it's worth, I agree with you and the method you mention. I'd be surprised if others in that league are doing it differently. Telseon doesn't bill using the 95th percentile method, though. We let the customer adjust their bandwidth on the fly and bill them for what they provision. Since they can reprovision via a web interface they can jump around (5 megs one day, 500 megs the next). Thanks in advance for any info you garner. -Sean Morrison On Thu, 19 Apr 2001, Alex Rubenstein wrote:
I've gotten myself into an argument with a provider about the definition of 'industry-standard 95th percentile method.'
To me, this means the following:
a) take the number of bytes xfered over a 5 minute period, and determine rate for both the inbound and outbound. Store this in your favorite data-store.
b) at billing time, presumably on the first of the month or some other monthly increment, take all the samples, sort them from greatest to least, hacking off the top 5% of samples. Actually, this is done twice, once for inbound, once for outbound. Then, take the higher of those two, and multiply it by your favorite $ multiple (ie, $500 per megabit per second, or $1 per kilobit per second, etc).
I think that most people agree with the above; the issue we are running into is one rogue provider who is billing this at in + out, not the greater of in or out.
How is everyone else doing it? Specifically, larger folks (UU, Sprint, CW, Exodus/FGC, GX, Qwest, L3)
Thanks!
On Thu, 19 Apr 2001, Alex Rubenstein wrote:
How is everyone else doing it? Specifically, larger folks (UU, Sprint, CW, Exodus/FGC, GX, Qwest, L3)
FWIW, Abovenet handles it exactly like you mentioned - Max(95%(out),95%(in)), and I'm fairly certain UUnet does as well. At least, the weekly transfer stats they mail us seem to indicate that, but we don't have a contract that would make use of transfer stats, so it is possible that it varies, but I don't think so. Andy xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Andy Dills 301-682-9972 Xecunet, LLC www.xecu.net xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Dialup * Webhosting * E-Commerce * High-Speed Access
On Thu, 19 Apr 2001, Andy Dills wrote:
On Thu, 19 Apr 2001, Alex Rubenstein wrote:
How is everyone else doing it? Specifically, larger folks (UU, Sprint, CW, Exodus/FGC, GX, Qwest, L3)
FWIW, Abovenet handles it exactly like you mentioned - Max(95%(out),95%(in)), and I'm fairly certain UUnet does as well. At least, the weekly transfer stats they mail us seem to indicate that, but we don't have a contract that would make use of transfer stats, so it is possible that it varies, but I don't think so.
Sorry to followup my own post, but I just received my weekly utilization report. Here is the statement at the bottom of that email: "The statistics for Tiered and Burstable customers are derived by taking a sample of the customer traffic every five minutes on the in and out packets sent/received on the customer's UUNET connection. These statistics are then aggregated over a (daily, weekly, monthly) period where the top (20%, 5%, 1%) of the traffic is discarded to arrive at the (P80, P95, P99) flow statistic. Metered customer's statistics are derived by calculating the total in and out octets on the customer's UUNET connection." Everything makes sense until the last line. "Calculating the total in and out octets" could imply either max or sum, so it probably depends on the contract, which seems to be the bottom line to this thread anyhow. Andy xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Andy Dills 301-682-9972 Xecunet, LLC www.xecu.net xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Dialup * Webhosting * E-Commerce * High-Speed Access
[ On Thursday, April 19, 2001 at 12:13:34 (-0400), Andy Dills wrote: ]
Subject: Re: What does 95th %tile mean?
Everything makes sense until the last line. "Calculating the total in and out octets" could imply either max or sum, so it probably depends on the contract, which seems to be the bottom line to this thread anyhow.
When I see the words "total" and "and" used like that it can only mean that addition is the operation of choice. Inserting the missing word "of" in there might help: Calculating the total of in and out octets ... -- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <woods@robohack.ca> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
There's a nice description/example of the 95th percentile usage calculation on uunet's page in the burstable access section at http://www.uu.net/ca/products/uudirect/burstable/ On Thu, 19 Apr 2001, Greg A. Woods wrote:
[ On Thursday, April 19, 2001 at 12:13:34 (-0400), Andy Dills wrote: ]
Subject: Re: What does 95th %tile mean?
Everything makes sense until the last line. "Calculating the total in and out octets" could imply either max or sum, so it probably depends on the contract, which seems to be the bottom line to this thread anyhow.
When I see the words "total" and "and" used like that it can only mean that addition is the operation of choice.
Inserting the missing word "of" in there might help:
Calculating the total of in and out octets ...
-- Sebastien Berube Systems Administrator sberube@zeroknowledge.com "Why do you necessarily have to be wrong just because a few million people think you are?"
AT&T's policy for measured burstable service looks something like this: The Provider Access Router is polled every 5 minutes for total octets in and total octets out. Data is divided by 300 (the number of seconds in a 5-minute period), giving two averages (one in, one out) for the previous 5-minute period. These averages become data points, which are tracked over the course of the customer's monthly billing cycle. The top 5% of the data points are disregarded (be they IN or OUT). We bill at the 95% level of usage. Michelle Truman
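The counter-to-rate step described above can be sketched like this (a hedged illustration only; a real poller must also handle SNMP counter wraps, which this ignores):

```python
POLL_INTERVAL = 300  # seconds in a 5-minute period

def rates_from_counters(octet_counts):
    """Turn successive octet-counter readings into average bps data points."""
    rates = []
    for prev, cur in zip(octet_counts, octet_counts[1:]):
        delta = cur - prev  # octets transferred during this 5-minute period
        rates.append(delta * 8 / POLL_INTERVAL)  # octets -> bits -> bits/sec
    return rates
```

Run once over the "in" counter series and once over the "out" series, these rates are the data points the billing cycle then sorts and trims.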
How is everyone else doing it? Specifically, larger folks (UU, Sprint, CW, Exodus/FGC, GX, Qwest, L3)
I believe (but am not positive) that Exodus does billing on 95in + 95out, where 95th in and 95th out are calculated separately for the month and then added together. What also might be interesting is to see if the burstable rates charged by the various providers differ based on the calculation type. For example, if you did the calculation as 95in+95out, then you could charge less per "byte" than someone who did Max(95in,95out) - but end up charging the same customer more each month... Since most of the people signing the contracts don't have a clue about how burstable is calculated, I can see marketing at a provider saying "go with us because our rates are cheaper" when in fact they are more expensive... Just my $0.02... Eric :)
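Eric's pricing point is easy to make concrete. Here is a toy comparison with entirely made-up per-Mbps rates and usage figures:

```python
# Hypothetical month: the two directions' 95th percentiles, in Mbps.
in_95, out_95 = 80, 60

# Provider A bills max(95in, 95out) at $500/Mbps.
max_bill = max(in_95, out_95) * 500

# Provider B bills 95in + 95out at a "cheaper" $350/Mbps.
sum_bill = (in_95 + out_95) * 350

# B's quoted rate is lower, but B's invoice is larger.
```

So a lower advertised per-Mbps rate under the in + out calculation can still produce the bigger monthly bill, which is exactly the marketing trap described above.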
[ On Thursday, April 19, 2001 at 12:35:29 (-0400), Eric Gauthier wrote: ]
Subject: Re: What does 95th %tile mean?
I can see marketing at a provider saying "go with us because our rates are cheaper" when in fact they are more expensive... Just my $0.02...
Any person in any department, marketing or otherwise, who actually does that should be put up on fraud charges ASAP (well, if the customer signs, I guess). That's not a case of "buyer beware"; that's flat-out lying. Of course, if the "engineers" tell marketing that story and then marketing just passes it on, well, you've got to be sure you get the right culprit. -- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <woods@robohack.ca> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
I know one company in Europe that uses the in + out model. Thomas ----- Original Message ----- From: "Alex Rubenstein" <alex@corp.nac.net> To: <nanog@merit.edu> Sent: Thursday, April 19, 2001 10:09 AM Subject: What does 95th %tile mean?
I've gotten myself into an argument with a provider about the definition of 'industry-standard 95th percentile method.'
To me, this means the following:
a) take the number of bytes xfered over a 5 minute period, and determine rate for both the inbound and outbound. Store this in your favorite data-store.
b) at billing time, presumably on the first of the month or some other monthly increment, take all the samples, sort them from greatest to least, hacking off the top 5% of samples. Actually, this is done twice, once for inbound, once for outbound. Then, take the higher of those two, and multiply it by your favorite $ multiple (ie, $500 per megabit per second, or $1 per kilobit per second, etc).
I think that most people agree with the above; the issue we are running into is one rogue provider who is billing this at in + out, not the greater of in or out.
How is everyone else doing it? Specifically, larger folks (UU, Sprint, CW, Exodus/FGC, GX, Qwest, L3)
Thanks!
Isn't in+out a more fair representation of usage? I've always assumed that this was the standard to be honest. Thank god I'm not the billing person. I think Exodus does in+out. -M At 03:06 PM 4/19/2001 -0400, Thomas Kernen wrote:
I know one company in Europe that uses the in + out model.
Thomas
----- Original Message ----- From: "Alex Rubenstein" <alex@corp.nac.net> To: <nanog@merit.edu> Sent: Thursday, April 19, 2001 10:09 AM Subject: What does 95th %tile mean?
I've gotten myself into an argument with a provider about the definition of 'industry-standard 95th percentile method.'
To me, this means the following:
a) take the number of bytes xfered over a 5 minute period, and determine rate for both the inbound and outbound. Store this in your favorite data-store.
b) at billing time, presumably on the first of the month or some other monthly increment, take all the samples, sort them from greatest to least, hacking off the top 5% of samples. Actually, this is done twice, once for inbound, once for outbound. Then, take the higher of those two, and multiply it by your favorite $ multiple (ie, $500 per megabit per second, or $1 per kilobit per second, etc).
I think that most people agree with the above; the issue we are running into is one rogue provider who is billing this at in + out, not the greater of in or out.
How is everyone else doing it? Specifically, larger folks (UU, Sprint, CW, Exodus/FGC, GX, Qwest, L3)
Thanks!
Regards, -- Martin Hannigan hannigan@fugawi.net Fugawi Networks Founder/Director of Implementation Boston, MA http://www.fugawi.net Ph: 617.742.2693 Fax: 617.742.2300
[ On Thursday, April 19, 2001 at 16:07:37 (-0400), Martin Hannigan wrote: ]
Subject: Re: What does 95th %tile mean?
Isn't in+out a more fair representation of usage? I've always assumed that this was the standard to be honest. Thank god I'm not the billing person. I think Exodus does in+out.
Either (in+out) or MAX(in,out) should be an equally fair measure of usage, at least from the customer's perspective. The difference is in the pricing, and if both the customers and the vendors are not equally aware of the particular computation used by each other then it's impossible to know what's competitive and what's a rip-off (accidental or otherwise). That's true of any form of usage-based billing too -- i.e. for either bulk throughput pricing (octets per period), or Nth percentile pricing. Some ISPs have unbalanced in/out loads though, and those that do can usually afford to sell whichever direction they've got in surplus at a lower price. A wise ISP might attract more wise customers by offering separate pricing structures for in and out traffic, or they might offer "free" services in whichever direction they can (e.g. a primarily access-only provider offering to host mailing lists, FTP archives, etc.; or hosting providers offering to provide access POPs for charity groups, etc.). -- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <woods@robohack.ca> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
The 95% reading always struck me as a randomly generated number in any case. Take an extreme example - a customer operates a wire such that both in and out are at line rate for five minutes, and then both in and out are idle for five minutes, continually. Depending on the synchronization between the burst pattern and the sampling system, and the sampling technique itself, the 95% reading can be zero, half the line rate, or the line rate, and all answers are equally valid in some sense. While real situations do not exhibit such a large range of potential variability (i.e. 100%), there is still a hefty level of variation in a 95% reading due to the interactions between the time base of the traffic, the time base of the meter engine, and the sampling technique used by the meter engine. It leads to the situation where the provider confidently asserts that the 95% value was x kbps, the customer confidently asserts y kbps, and both readings are equally valid, with both measurements using the _same_ measurement technique. How is the consequent billing dispute resolved _fairly_?
On Fri, Apr 20, 2001 at 08:03:02AM +1000, Geoff Huston wrote:
Depending on the synchronization between the burst pattern and the sampling system, and the sampling technique itself, the 95% reading can be zero, half the line rate, or the line rate, and all answers are equally valid in some sense.
They don't take a one-second sample every five minutes, they take the five-minute average rate measured by their router. Unless they're insane, or their routers don't support that. I dunno who makes routers that don't support that, though.
They don't take a one-second sample every five minutes, they take the five-minute average rate measured by their router.
Unless they're insane, or their routers don't support that. I dunno who makes routers that don't support that, though.
Sorry, perhaps I didn't make the extreme example sufficiently clear: In the extreme I cited, (full rate for 5 minutes, idle for five minutes, repeated), the five minute average rate oscillates between zero and full line rate. The period of oscillation is 10 minutes (i.e. five minutes for the five minute rate to decay from line rate to zero and five minutes to build back to line rate). Now if you sample every five minutes, and the sample point is synchronized to the peak and trough of the five minute rate you will get successive readings of 'line rate', zero, 'line rate', zero, etc. The 95% sample value will be 'line rate'. If you change _nothing_ except shift the sample point two and one half minutes forward in time the sample points will consistently produce outcomes of 'half line rate', 'half line rate', ..., and the 95% point is 'one half of line rate'. Same algorithm, same raw data, different 95% answers, both valid, yet one is twice as large as the other. Great outcome for a billing system isn't it? (The comment in my earlier note about getting a zero reading requires using something other than a 5 minute average data rate. The point I'm trying to make in this posting is that even if you do the 'right' thing and collect interface data readings every five minutes and do the first order differentials yourself to get the five minute data rates, the 95% 'answer' is still variable.)
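Geoff's extreme case is easy to reproduce numerically. The sketch below (illustrative line rate, same top-5%-discard rule discussed earlier in the thread) shows the two sampling phases yielding 95th-percentile answers that differ by exactly 2x on identical traffic:

```python
LINE_RATE = 100_000_000  # bps; an arbitrary illustrative line rate

def traffic(second):
    """Per-second rate: full line rate for 300 s, then idle for 300 s, repeating."""
    return LINE_RATE if (second % 600) < 300 else 0

def five_min_avg(end_second):
    """The trailing 5-minute average rate, as a poller would compute it."""
    return sum(traffic(s) for s in range(end_second - 300, end_second)) / 300

def p95(samples):
    ordered = sorted(samples, reverse=True)
    return ordered[int(len(ordered) * 0.05)]

DAY = 86_400
# Sample every 300 s: first in phase with the bursts, then shifted 150 s.
aligned = [five_min_avg(t) for t in range(300, DAY + 1, 300)]
shifted = [five_min_avg(t) for t in range(450, DAY + 1, 300)]
# p95(aligned) comes out at the full line rate; p95(shifted) at half of it.
```

Same algorithm, same traffic, and the billable number depends entirely on where the sample clock happens to sit relative to the bursts.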
Same algorithm, same raw data, different 95% answers, both valid, yet one is twice as large as the other. Great outcome for a billing system isn't it?
Any billing scheme based upon statistical sampling will, with some probability, err in favor of one party or the other randomly. But it is important that the customer understands that he is being billed based upon statistical sampling and thus there are no "exact" measurements. I've looked at other ways and can't find any better. Billing based upon NetFlow, for example, is still statistical sampling since NetFlow loses a percentage of flows. For example, one of my VIP2-50's says: 368351628 flows exported in 12278484 udp datagrams 33838 flows failed due to lack of export packet 269989 export packets were dropped enqueuing for the RP 108825 export packets were dropped due to IPC rate limiting Billing based upon total bytes transferred tends to create similar problems. Do you bill based upon bytes transferred per day? Per month? If so, it's still statistical sampling if you have some amount of 'paid bandwidth'. And you can't collect this data from interfaces because interface rates include local traffic, which (for example) grossly overbills customers with newsfeeds. I think there would be a market for a device with two GE interfaces that accounted for everything that passed through the two interfaces in a reliable and configurable way. It would have to be capable of fault-tolerant operation with multiple units. It would have to be free too. ;) DS
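Running the quoted VIP2-50 counters through the arithmetic (only the figures from the post above, nothing else assumed):

```python
# Flow-export loss implied by the quoted NetFlow counters.
exported = 368_351_628
lost = 33_838 + 269_989 + 108_825  # failed + dropped enqueuing + rate-limited

loss_fraction = lost / (exported + lost)
# Works out to roughly 0.11% of flows lost, i.e. about 99.9% accounted for.
```

(Strictly, the two "export packets were dropped" counters count datagrams rather than flows, and each dropped datagram can carry multiple flow records, so this is a lower bound on the flow loss.)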
At 4/20/01 10:18 AM, David Schwartz wrote:
Billing based upon total bytes transferred tends to create similar problems. Do you bill based upon bytes transferred per day? Per month? If so, it's still statistical sampling if you have some amount of 'paid bandwidth'.
I think it's the last part of this statement, about 'paid bandwidth', which is the bit that may make your statistical sampling comment apply, but I'm unsure if your 'paid bandwidth' is the same as the one that's in my head. In general (minus 'paid bandwidth', and taking the view that all bytes passed between the customer and the provider have the same billable value) bytes-transferred systems are more reliable, if you take as your yardstick of 'reliability' that the same algorithm applied to the same raw data should yield the same result. As long as both parties can agree (precisely) when the measurement interval starts and stops, of course. Of course, if you then want to complicate the picture by attaching different billing rates to different packets, then once more the complexity rises and the accuracy tends to drop.
Along the lines of this thread, What do you folks use for GATHERING said billing stats? I use a combination of MRTG and Cricket, which works pretty well, what are other folks using? I'm always willing to look at other folks' solutions to problems, as none of us, well, except Mr. Frazier, knows everything:). todd S4R www.s4r.com On Fri, 20 Apr 2001, Geoff Huston wrote:
At 4/20/01 10:18 AM, David Schwartz wrote:
Billing based upon total bytes transferred tends to create similar problems. Do you bill based upon bytes transferred per day? Per month? If so, it's still statistical sampling if you have some amount of 'paid bandwidth'.
I think it's the last part of this statement, about 'paid bandwidth', which is the bit that may make your statistical sampling comment apply, but I'm unsure if your 'paid bandwidth' is the same as the one that's in my head.
In general (minus 'paid bandwidth' and taking the view that all bytes passed between the customer and the provider have the same billable value) byte transferred systems are more reliable if you take as your yardstick of 'reliability' that the same algorithm applied to the same raw data should yield the same result. As long as both parties can agree (precisely) when the measurement interval starts and stops, of course.
Of course if you then want to complicate the picture by attaching different billing rates to different packets, then once more the complexity rises and the accuracy tends to drop.
[ On Thursday, April 19, 2001 at 17:55:28 (-0700), Todd Suiter wrote: ]
Subject: RE: What does 95th %tile mean?
Along the lines of this thread, What do you folks use for GATHERING said billing stats? I use a combination of MRTG and Cricket, which works pretty well, what are other folks using? I'm always willing to look at other folks' solutions to problems, as none of us, well, except Mr. Frazier, knows everything:).
Neither MRTG nor Cricket (nor anything with RRDtool or anything similar underlying it), in their standard released form, is truly suitable for accounting purposes since they both can introduce additional averaging errors. You need to keep all of the original sample data. The best tool will depend on what type of device is being queried, but in general something like Cricket could provide a decent framework that already draws pretty pictures for visualisation. All you'd need to do is introduce a second call in the collector to send the current samples to some other recording mechanism so that you can save the original sample data separate from the Cricket RRDs. You could simply drop the samples along with a timestamp into a flat file for later processing, or you could immediately insert them into some kind of database. Cricket would be a good starting point because it already has the ability to do not just SNMP queries but also to take data from any program. It's also got a half-decent configuration framework. One thing I should point out is that from an auditing perspective it's fairly important to try to record the time that the counter sample was actually taken. This sample timestamp can be used to assure anyone looking at the data that even if samples are missing the counter deltas between samples are still being used to properly calculate the average rate over the actual sample period. -- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <woods@robohack.ca> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
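A minimal sketch of the recording scheme Greg describes (function names are hypothetical): keep the raw (timestamp, counter) pairs, and derive rates from the actual elapsed time between samples, so a missed poll widens the interval rather than corrupting the rate:

```python
def record_sample(log, timestamp, octet_counter):
    """Append the raw sample; nothing is averaged away at collection time."""
    log.append((timestamp, octet_counter))

def rates(log):
    """Average bps between consecutive samples, using the real sample times."""
    out = []
    for (t0, c0), (t1, c1) in zip(log, log[1:]):
        out.append((c1 - c0) * 8 / (t1 - t0))  # octets -> bits over actual gap
    return out
```

For example, if one 5-minute poll is missed, the next delta is simply divided by 600 seconds instead of 300, and the computed average for that gap remains honest.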
On Thu, 19 Apr 2001, Greg A. Woods wrote:
Neither MRTG nor Cricket (nor anything with RRDtool or anything similar underlying it), in their standard released form, are truly suitable for accounting purposes since they both can introduce additional averaging errors. You need to keep all of the original sample data.
This actually works pretty well: http://www.seanadams.com/95/ There was a very similar discussion just weeks ago on the datacenter mailing list as well; you all might want to peek at the archives... Charles
[ On Friday, April 20, 2001 at 00:52:39 (-0400), Charles Sprickman wrote: ]
Subject: RE: What does 95th %tile mean?
On Thu, 19 Apr 2001, Greg A. Woods wrote:
Neither MRTG nor Cricket (nor anything with RRDtool or anything similar underlying it), in their standard released form, are truly suitable for accounting purposes since they both can introduce additional averaging errors. You need to keep all of the original sample data.
This actually works pretty well:
If you read that page carefully you'll note that he's using a modified version of MRTG that doesn't average its samples. As it says: This is a patch to add 95th percentile metering to MRTG. This is not as simple a feature as one might think. MRTG normally saves only one day worth of 5-minute samples. It is not possible to accurately calculate the 95th percentile without having all of the samples for a one month period. In order to calculate the 95th percentile for a 30-day period, it is necessary to save an entire 30 days worth of the 5-minute samples. MRTG does not do that by default, nor does Cricket, nor will any tool using RRDtool as an underlying database.
There was a very similar discussion just weeks ago on the datacenter mailinglist as well, you all might want to peek at the archives...
Perhaps you should look again at who posted to that discussion.... :-) -- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <woods@robohack.ca> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
If you read that page carefully you'll note that he's using a modified version of MRTG that doesn't average its samples. As it says:
No. The only change he makes to MRTG is to display the 95th percentile values he computes.
This is a patch to add 95th percentile metering to MRTG. This is not as simple a feature as one might think. MRTG normally saves only one day worth of 5-minute samples. It is not possible to accurately calculate the 95th percentile without having all of the samples for a one month period. In order to calculate the 95th percentile for a 30-day period, it is necessary to save an entire 30 days worth of the 5-minute samples.
Right, so he has to make backups of MRTG's log file in order to have sufficient history data.
MRTG does not do that by default, nor does Cricket, nor will any tool using RRDtool as an underlying database.
MRTG does not do *what* by default? DS
[ On Friday, April 20, 2001 at 00:52:39 (-0400), Charles Sprickman wrote: ]
Subject: RE: What does 95th %tile mean?
On Thu, 19 Apr 2001, Greg A. Woods wrote:
Neither MRTG nor Cricket (nor anything with RRDtool or anything similar underlying it), in their standard released form, are truly suitable for accounting purposes since they both can introduce additional averaging errors. You need to keep all of the original sample data.
This actually works pretty well:
If you read that page carefully you'll note that he's using a modified version of MRTG that doesn't average its samples. As it says:
This is a patch to add 95th percentile metering to MRTG. This is not as simple a feature as one might think. MRTG normally saves only one day worth of 5-minute samples. It is not possible to accurately calculate the 95th percentile without having all of the samples for a one month period. In order to calculate the 95th percentile for a 30-day period, it is necessary to save an entire 30 days worth of the 5-minute samples.
MRTG does not do that by default, nor does Cricket, nor will any tool using RRDtool as an underlying database.
You need to use the "old" MRTG without RRDTOOL to avoid the averaging. It maintains an accurate timestamp for the previous sample so that the data stored in the table is accurate even if there was some jitter in the collection interval. You still do need to maintain backup logs so that you have the entire month's worth of 5-minute samples. I have tried arguing against the "corrections" that RRDTOOL makes to data, but the only suggested "fix" is to lie to RRDTOOL about the timestamp. I understand that the old MRTG database is "wrong" since the timestamp it stores in the database is not the actual sample collection time. However, for most of the things I want to do, I prefer to know what the real data was at the collection point closest to the time of interest instead of what the data "should" have been if it was collected at precisely the right time. As for the original topic, we used Alex's (max(in,out)) definition of 95th percentile billing. I always thought that the in+out method was a little "sleazy" since the explanation is usually buried in some fine print and people who aren't careful can be easily tricked into making an invalid provider comparison. For the journal of meaningless statistics, we found that over time, the average (mean) usage for our "typical" colocation customer was 69-72% of the 95% value. The 95% measure definitely isn't the answer to all problems. It does address some problems that "actual usage" doesn't though. Mainly, if you bill based on actual usage, customers can get very nervous that things like smurf attacks that are out of their control will send their bill through the roof. Depending on your customers and your business model, there are other ways to deal with the problem though. FWIW, I expect that the 95% model will slowly be phased out as the industry matures. -dpm -- David P. Maynard, CTO OutServ.net, Inc. -- The e-Business Operations Solution [TM] EMail: dmaynard@outserv.net, Tel: +1 512 977 8918, Fax: +1 512 977 0986 --
On Thu, Apr 19, 2001, Greg A. Woods wrote:
Neither MRTG nor Cricket (nor anything with RRDtool or anything similar underlying it), in their standard released form, are truly suitable for accounting purposes since they both can introduce additional averaging errors. You need to keep all of the original sample data.
The best tool will depend on what type of device is being queried, but in general something like Cricket could provide a decent framework that already draws pretty pictures for visualisation. All you'd need to do is introduce a second call in the collector to send the current samples to some other recording mechanism so that you can save the original sample data separate from the Cricket RRDs. You could simply drop the samples along with a timestamp into a flat file for later processing, or you could immediately insert them into some kind of database. Cricket would be a good starting point because it already has ability to do not just SNMP queries but also the ability to take data from any program. It's also got a half-decent configuration framework.
A little bit of math will show that it's actually very feasible to collect minutely or 5-minutely data from a router and store it in a database. I've done *that* before, and it works very well.
One thing I should point out is that from an auditing perspective it's fairly important to try and record the time that the counter sample was actually taken. This sample timestamp can be used to assure anyone looking at the data that even if samples are missing the counter deltas between samples are still being used to properly calculate the average rate over the actual sample period.
What? You would consider storing samples without a timestamp? *grin* Adrian -- Adrian Chadd "Two hundred and thirty-three thousand <adrian@creative.net.au> times the speed of light. Dear holy <censored> <censored>"
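Adrian's "little bit of math" goes roughly like this (the 32-bytes-per-row figure is an assumption about the storage format):

```python
# How many 5-minute samples a month of data is, per interface, and
# roughly what storing them raw would cost.
samples_per_day = 24 * 60 // 5            # 288 samples per interface per day
samples_per_month = samples_per_day * 30  # 8,640 per interface per month

# At, say, 32 bytes per row (timestamp plus in/out octet counters), even a
# thousand interfaces comes to well under 300 MB per month.
bytes_per_month_1000_ifaces = samples_per_month * 32 * 1000
```

Even by 2001 disk standards, keeping every raw sample for a whole billing period is a trivially small amount of data, which is the point Adrian and Greg are both making against pre-averaged RRD storage.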
[ On Friday, April 20, 2001 at 10:30:19 (+1000), Geoff Huston wrote: ]
Subject: RE: What does 95th %tile mean?
In general (minus 'paid bandwidth' and taking the view that all bytes passed between the customer and the provider have the same billable value) byte transferred systems are more reliable if you take as your yardstick of 'reliability' that the same algorithm applied to the same raw data should yield the same result. As long as both parties can agree (precisely) when the measurement interval starts and stops, of course.
For almost any value of N < 100, there is absolutely no synchronisation of sampling periods required when calculating an Nth percentile bandwidth usage, at least not with "normal" Internet usage. For many users even if N==100 you can still get a reasonably fair measure of peak usage by ignoring any rate which matches the maximum line rate (obviously this fails if the user does actually have a 90th percentile, or so, usage equal to the line rate). Nth percentile metering is simply a statistically fair way to find an agreeable peak rate usage that is not the maximum line rate. My guess is that the shorter the counter sample period the closer the customer will want the value of N to approach 90 (or even less! :-). -- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <woods@robohack.ca> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
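A sketch of the N==100 variant Greg mentions, where samples pinned at the configured line rate are ignored; the function name and the fallback behavior when every sample is at line rate are my assumptions:

```python
def peak_below_line_rate(samples, line_rate):
    """Peak usage, ignoring samples that merely saturated the line."""
    under = [s for s in samples if s < line_rate]
    # If every sample sat at line rate, the only honest answer is line rate.
    return max(under) if under else line_rate
```

As Greg notes, this fails for a user whose genuine sustained usage actually equals the line rate, which is why the percentile trim is the more robust general mechanism.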
On Thu, Apr 19, 2001 at 05:18:02PM -0700, David Schwartz wrote:
I've looked at other ways and can't find any better. Billing based upon NetFlow, for example, is still statistical sampling since NetFlow loses a percentage of flows. For example, one of my VIP2-50's says:
So did you calculate how much you are losing? Going by your own counters, it's about a tenth of one percent of all flows. That means you capture more than 99.8% of all flows. Not that bad. Furthermore, NetFlow gives you the ability to offer value-added (billing) services to your customers. For example ...
368351628 flows exported in 12278484 udp datagrams 33838 flows failed due to lack of export packet 269989 export packets were dropped enqueuing for the RP 108825 export packets were dropped due to IPC rate limiting
Billing based upon total bytes transferred tends to create similar problems. Do you bill based upon bytes transferred per day? Per month? If so, it's still statistical sampling if you have some amount of 'paid bandwidth'.
And you can't collect this data from interfaces because interface rates include local traffic, which (for example) grossly overbills customers with newsfeeds.
... you may easily deduct News traffic from being billed. BTW: tell me how do you exclude News Traffic if you count the 95th %ile? Billing based upon total bytes transferred is IMHO very fair and attractive from the customer's point of view, and tends to be a nightmare from an ISP's perspective, especially if you don't just count bytes but are looking at the IP addresses involved. Maybe a mixture of byte-counting and port speed would give a fair billing model. BTW, that's also the model under which you pay for power in Germany.
DS
Arnold -- Arnold Nipper / nIPper consulting mailto:arnold@nipper.de Heilbronner Str. 34b Phone: +49 700 NIPPER DE D-76131 Karlsruhe Mobile: +49 172 2650958 Germany Fax: +49 180 505255469743
... you may easily deduct News traffic from being billed. BTW: tell me how do you exclude News Traffic if you count the 95th %ile?
Why should you? packets are packets; does your upstream provider charge you less for news? Are you magically capable of moving news across your own network cheaper than web traffic?
Billing based upon total bytes transferred is IMHO very fair and attractive from the customer's point of view, and tends to be a nightmare from an ISP's perspective, especially if you don't just count bytes but are looking at the IP addresses involved.
Agreed.
On Fri, Apr 20, 2001 at 09:20:08AM -0400, Alex Rubenstein wrote:
... you may easily deduct News traffic from being billed. BTW: tell me how do you exclude News Traffic if you count the 95th %ile?
Why should you? packets are packets; does your upstream provider charge you less for news? Are you magically capable of moving news across your own network cheaper than web traffic?
Of course packets are packets. But if I'm running a news server I get the packet once and hopefully sell it several times. Especially taking into account that news is of high volume, you might charge less for news. Of course, as soon as the customer gets the feed from someone else there is no reason not to charge for it. Arnold -- Arnold Nipper / nIPper consulting mailto:arnold@nipper.de Heilbronner Str. 34b Phone: +49 700 NIPPER DE D-76131 Karlsruhe Mobile: +49 172 2650958 Germany Fax: +49 180 505255469743
On Fri, 20 Apr 2001, Alex Rubenstein wrote:
... you may easily deduct News traffic from being billed. BTW: tell me how do you exclude News Traffic if you count the 95th %ile?
Why should you? packets are packets; does your upstream provider charge you less for news? Are you magically capable of moving news across your own network cheaper than web traffic?
Yes, they pay less for news than other traffic (possibly including web). Why? Because they suck N packets of news from their provider's news server, but can then provide customers*N worth of news packets out to their subscribers. I think in areas of heavy byte-level billing, some providers have been known to sell connectivity at the cost of local-loop + byte usage and required their users to use a Squid WWW cache, making their profit off of the savings caused by the cache.
Billing based upon total bytes transferred is IMHO very fair and attractive from the customer's point of view, and tends to be a nightmare from an ISP's perspective, especially if you don't just count bytes but are looking at the IP addresses involved.
Agreed.
If fair = reflects the ISP's costs of operation (which is the only useful meaning of fair I can come up with here): It's not fair unless the load is equally distributed (i.e. uncorrelated between many customers). In North America, ISPs pay for pipes, not bytes. The pipes are provisioned for peak usage, not average. Like electrical power, if you fail at providing for peak load you cause very poor service.
... you may easily deduct News traffic from being billed. BTW: tell me how do you exclude News Traffic if you count the 95th %ile?
Why should you? packets are packets; does your upstream provider charge you less for news? Are you magically capable of moving news across your own network cheaper than web traffic?
One of my upstreams does not charge me the traffic cost for a newsfeed I obtain from them. This is because the cost to move a packet from their Santa Clara news server to my Santa Clara router is basically zero. On the other hand, they do charge me for traffic they have to carry over their WAN. Suppose I have a DS3 with 10Mbps paid for and a 95% algorithm for over-usage. If I get a full newsfeed from the computer next to the router my DS3 goes to, should I pay the same cost as someone who transfers 15Mbps constantly across the country on their backbone? That doesn't strike me as fair. Surely there is some cost associated with providing me with news, since they need to get news to their news server. But if they charged me the full cost of 14Mbps, we simply wouldn't get a newsfeed from them, and they would have lost that business. DS
On Fri, Apr 20, 2001 at 09:52:15PM -0700, David Schwartz wrote:
providing me with news, since they need to get news to their news server. But if they charged me the full cost of 14Mbps, we simply wouldn't get a newsfeed from them, and they would have lost that business.
So you'd get your newsfeed from somebody else, transfer it across the WAN, and get charged for it anyway.
On Fri, Apr 20, 2001 at 09:52:15PM -0700, David Schwartz wrote:
providing me with news, since they need to get news to their news server. But if they charged me the full cost of 14Mbps, we simply wouldn't get a newsfeed from them, and they would have lost that business.
So you'd get your newsfeed from somebody else, transfer it across the WAN, and get charged for it anyway.
Hardly. We would have obtained our news feed from someone else who has a news server at a facility where we already colocate a router. But having this provider offer us a newsfeed at a fair price allowed us to save a cross-connect fee. DS
On Thu, Apr 19, 2001 at 05:18:02PM -0700, David Schwartz wrote:
So did you calculate, how much you are losing. It's less than 1% of a 1% of all flows. That means you catch up more than 99.99% of all flows. Not that bad. Furthermore NetFlow gives you the ability to offer value added (billing) services to your customers. For example ...
368351628 flows exported in 12278484 udp datagrams 33838 flows failed due to lack of export packet 269989 export packets were dropped enqueuing for the RP 108825 export packets were dropped due to IPC rate limiting
I get 3% loss. (269989+108825)*100/12278484 ≈ 3%
Billing based upon total bytes transferred tends to create similar problems. Do you bill based upon bytes transferred per day? Per month? If so, it's still statistical sampling if you have some amount of 'paid bandwidth'.
And you can't collect this data from interfaces because interface rates include local traffic, which (for example) grossly overbills customers with newsfeeds.
... you may easily deduct News traffic from being billed. BTW: tell me how do you exclude News Traffic if you count the 95th %ile?
Our NetFlow accounting ignores the news flows as it ignores all local flows. We 'smear' each flow over the time period in the flow packet (start/end) as if its bytes were smoothly distributed over that time interval. We configure our routers to flush active flows every five minutes, rather than the default of 30, so the maximum time error is of the same order of magnitude as the sampling interval. We disclose this method to our customers, and they understand that it is statistical sampling. We also offer them a 'throttling' alternative that guarantees them either that they won't pay for any excess or that their excess will be capped at some particular dollar amount.
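[Editor's note: the flow 'smearing' David describes above can be sketched as follows. This is an illustrative reconstruction, not his actual code; the flow tuple format and the 5-minute bucket size are assumptions matching his description of flushing active flows every five minutes.]

```python
def smear_flows(flows, bucket_secs=300):
    """Distribute each flow's bytes evenly over its lifetime,
    crediting every 5-minute bucket the flow overlaps with its
    proportional share. Flows are (start_sec, end_sec, bytes)."""
    buckets = {}
    for start, end, nbytes in flows:
        duration = max(end - start, 1)   # guard against zero-length flows
        rate = nbytes / duration         # bytes per second, smoothly distributed
        t = start
        while t < end:
            bucket = t - (t % bucket_secs)              # bucket this instant falls in
            span = min(end, bucket + bucket_secs) - t   # overlap with this bucket
            buckets[bucket] = buckets.get(bucket, 0) + rate * span
            t += span
    return buckets
```

A 600-second flow of 600 bytes straddling two buckets credits 300 bytes to each, so the maximum time error stays on the order of the flush interval, as David notes.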
Billing based upon total bytes transferred is IMHO very fair and attractive from the customer's point of view, and tends to be a nightmare from an ISP's perspective, especially if you don't just count bytes but are looking at the IP addresses involved.
With a flat 'per byte' rate? I'm not sure how attractive people would find that. Plus, that costs the ISP the opportunity to charge people for bandwidth that they're not using (because we had to provide it to them anyway, in case they needed it).
Maybe a mixture of byte-counting and port speed would give a fair billing model. BTW, that's also the model under which you pay for power in Germany. Arnold Nipper / nIPper consulting mailto:arnold@nipper.de
You can do that. A DS3 costs X dollars per month plus Y dollars per gigabyte transferred (or Y per inbound gigabyte and Z per outbound gigabyte if necessary). The 'problem' with that is a customer that alternates between 45Mbps and 0Mbps is charged the same as a customer that uses 22.5Mbps all the time, despite the very different costs to support them. You can fix all these problems by making the scheme more complex and thus (arguably) fairer. But my experience has been that customers tend to want simplicity. I do agree that you don't get much simpler than port cost plus byte cost. DS
"ds" == David Schwartz <davids@webmaster.com> writes: Any billing scheme based upon statistical sampling will, with some probability, err in the favor of one party or the other randomly. But it is important that the customer understands that he is being billed based upon statistical sampling and thus there are no "exact" measurements.
For methods based on "classic" NetFlow I disagree, see below.
I've looked at other ways and can't find any better. Billing based upon NetFlow, for example, is still statistical sampling since NetFlow loses a percentage of flows. For example, one of my VIP2-50's says:
368351628 flows exported in 12278484 udp datagrams 33838 flows failed due to lack of export packet 269989 export packets were dropped enqueuing for the RP 108825 export packets were dropped due to IPC rate limiting
Yes, and in addition you may lose flows on the path between the exporting router and the accounting postprocessor, or due to resource shortage in the postprocessor. However the way you handle this is that you don't bill for flows whose accounting records you have lost, so you always err in favor of your customer. This gives you the right incentive to dimension your accounting infrastructure so that loss is minimized. As long as the loss rate is in the ballpark you showed, the lost revenue probably doesn't justify the effort (VIP upgrades) to fix this. Of course your sampling problem occurs if the provider uses SAMPLED NetFlow and multiplies the actually measured traffic rates with the sampling interval. Regards, -- Simon.
However the way you handle this is that you don't bill for flows whose accounting records you have lost, so you always err in favor of your customer. This gives you the right incentive to dimension your accounting infrastructure so that loss is minimized. As long as the loss rate is in the ballpark you showed, the lost revenue probably doesn't justify the effort (VIP upgrades) to fix this.
Simon.
That is nonsense. If Burger King couldn't bill an average of 3% of their customers due to billing error, they'd raise their prices 3%. The net amount paid by their customers would still be the same and their total revenue would still be the same. They'd still be just as competitive. They'd just be billing based upon, you guessed it, statistical sampling. If you pay for it, you have to bill for it, somehow. DS
On Sat, Apr 21, 2001 at 08:52:47AM -0700, David Schwartz wrote:
However the way you handle this is that you don't bill for flows whose accounting records you have lost, so you always err in favor of your customer. This gives you the right incentive to dimension your accounting infrastructure so that loss is minimized. As long as the loss rate is in the ballpark you showed, the lost revenue probably doesn't justify the effort (VIP upgrades) to fix this.
Simon.
That is nonsense.
Keep cool ..
If Burger King couldn't bill an average of 3% of their customers due to billing error, they'd raise their prices 3%. The net amount paid by their customers would still be the same and their total revenue would still be the same. They'd still be just as competitive. They'd just be billing based upon, you guessed it, statistical sampling.
Even if I lose 3% of all flows that does not mean that I also lose 3% of valuable data. It depends on which flows have been thrown away. In the worst case you may lose nearly 100%, in the best case you lose almost nothing. It would be interesting to know which algorithm is chosen for throwing away flows. The observation I made years ago was that 30% - 40% of all IP accounting records amounted to just a few bytes. At that time disk space and computing power were more limited, so I decided to just throw them away. And I'm quite sure that our company did not lose one buck. Furthermore: you will never bill byte by byte. That means a customer has to pay x $ per Gig. If he used 2.1 Gig he has to pay for 3 Gig; if he used 2.9 Gig he also pays for 3 Gig. Of course the better you know what you are missing or discarding, the better your CFO will feel.
If you pay for it, you have to bill for it, somehow.
DS
-- Arnold
If Burger King couldn't bill an average of 3% of their customers due to billing error, they'd raise their prices 3%. The net amount paid by their customers would still be the same and their total revenue would still be the same. They'd still be just as competitive. They'd just be billing based upon, you guessed it, statistical sampling.
Even if I lose 3% of all flows that does not mean that I also lose 3% of valuable data. It depends on which flows have been thrown away. In the worst case you may lose nearly 100%, in the best case you lose almost nothing. It would be interesting to know which algorithm is chosen for throwing away flows.
Logic would dictate that during the heaviest traffic, when the router is working the hardest, it's more likely to lose flows. But that's just what you'd expect, not necessarily what actually happens.
Furthermore: you will never bill byte by byte. That means a customer has to pay x $ per Gig. If he used 2.1 Gig he has to pay for 3 Gig, if he used 2.9 Gig he also pays for 3 Gig.
In other words, a customer that uses 2.99 gig pays less than a customer that uses 3.00 gigs. So that 3% loss could make a big difference. Yes, it will make a difference less often, but when it does, it will be a bigger difference. In fact, I would expect that a 3% loss will, on average, result in 3% less revenue from overusage. Again, the billing is based upon statistical sampling. Really. DS
In the extreme I cited (full rate for 5 minutes, idle for five minutes, repeated), the five minute average rate oscillates between zero and full line rate. The period of oscillation is 10 minutes (i.e. five minutes for the five minute rate to decay from line rate to zero and five minutes to build back to line rate).
I do think you are confused, Geoff. The 5-minute average is not being sampled every five minutes. The raw number of octets is being sampled every five minutes, and divided by the time since the previous sample (5 minutes). Then, the 95th percentile is taken of that. So, if you have 5-min FULL, 5-min IDLE, continuously all day, then a given 5 minute burst either falls entirely within one period, in which case that period has maximum utilization, or it falls half in one and half in another, in which case you have half utilization. Or somewhere in between. But in no case would the measured utilization for the greater of the two periods be less than half. So your 95th percentile will fall somewhere between 50% and 100%, not between 0% and 100%. I think that real-world traffic is extremely unlikely to meet this precise synchronization case, and that it is not terribly worth worrying about. If it were to become a source of billing disputes between providers and customers, then presumably we would work out a better system. --jhawk
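[Editor's note: John's bound — that octet counters sampled every five minutes can never report less than half utilization for a strict 5-min-on/5-min-off load, whatever the phase offset — can be checked with a small simulation. The line rate and sample counts below are made up for illustration.]

```python
def five_min_averages(offset_secs, line_rate=10_000_000, n_samples=12):
    """Traffic alternates: full line rate for 300 s, idle for 300 s.
    Return the average bit rate over successive 300 s measurement
    windows beginning at offset_secs (the sampler's phase)."""
    def busy(t):
        return (t % 600) < 300   # first half of each 10-minute cycle is busy
    rates = []
    for i in range(n_samples):
        start = offset_secs + i * 300
        # sum the bits accumulated second by second within this window
        bits = sum(line_rate for t in range(start, start + 300) if busy(t))
        rates.append(bits / 300)
    return rates
```

With zero offset the windows read full rate and zero alternately; with a 150-second offset every window reads exactly half rate. In no phase does the busier window drop below half the line rate, which is John's point.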
[ On Friday, April 20, 2001 at 09:54:25 (+1000), Geoff Huston wrote: ]
Subject: Re: What does 95th %tile mean?
Now if you sample every five minutes, and the sample point is synchronized to the peak and trough of the five minute rate you will get successive readings of 'line rate', zero, 'line rate', zero, etc. The 95% sample value will be 'line rate'.
Yeah, but synchronizing your usage that exactly with the sample taker is almost impossible, at least with the average Internet pipe. Not to mention but even if you managed to synchronise your usage profile to the sample taker in that way then you really did use "line rate" bandwidth for a significant period of time and that's what you should fairly be billed at.
If you change _nothing_ except shift the sample point two and one half minutes forward in time the sample points will consistently produce outcomes of 'half line rate', 'half line rate', ..., and the 95% point is 'one half of line rate'.
(in this case the bill would be more in line with what the customer might actually perceive as his usage, but not in line with the ISPs perception of that usage) What does it matter though when it's almost impossible for the customer to adjust their usage profile in such an accurate way? "Statistically" the sample period will measure the peak usage and if one treats the 95th percentile rate as the actual bandwidth used over a longer billing period then the bill will be fair for both parties. Of course if your billing period is insanely short (eg. 1 hour) and doesn't cover the peaks and valleys in actual use over at least several days then you'll have screwy bills too, but they should still be fair. The point is that in the real world of *Internet* usage Nth percentile measurements, if properly done, are indeed fair, especially for aggregated pipes where significant bursts that would raise the rate can't be affected by one or two users doing a couple of big downloads. -- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <woods@robohack.ca> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
[ On Friday, April 20, 2001 at 08:03:02 (+1000), Geoff Huston wrote: ]
Subject: Re: What does 95th %tile mean?
The 95% reading always struck me as a randomly generated number in any case.
Huh? It's a simple and mathematically sound and highly repeatable and auditable way of drawing a line on the usage graph that says something like: If you were to have had a fixed-rate connection this is the bandwidth that you would have required over the previous billing period in order to have obtained effectively the same level of performance as you actually enjoyed over that period. The only trick (from the customer P.O.V.) is in understanding that this is what you're buying and in realising that if you use it then you will pay for it. It probably works best for links that have aggregated traffic (eg. for 1st and 2nd tier providers).
Depending on the synchronization between the burst pattern and the sampling system, and the sampling technique itself, the 95% reading can be zero, half the line rate, or the line rate, and all answers are equally valid in some sense.
Perhaps you need to learn that the "bit rate" values used in deriving an N'th percentile value are first calculated by counting the number of octets that crossed an interface since the last sample was taken and dividing by the amount of time since that last sample was taken (and then adjusting with a multiplier for different units, eg. octets vs. bits or whatever). In other words the bit rate values are taken as the average rate over the specified sample time. No data is thrown away or ignored -- every single byte is counted and every count is critical to finding the correct N'th percentile value. There's absolutely nothing in the way of synchronisation required and indeed there's no such thing as a "burst pattern" when you consider that at any given instant in time an octet will cross a (to pick a specific example) 10-mbit interface at ten megabits per second! How else can you imagine measuring the bit rate utilisation of a fixed-rate pipe? The same N'th percentile measurement can always be calculated from either end of a pipe so long as the sample interval is the same at both ends, and so long as the pipe has no (measurable) loss. If there's measurable loss then you'd better measure it and take it into account or else you will end up with unfair billing. In fact the very same octet-count measurements are needed for any kind of usage-based billing. The only difference with N'th percentile metering is that the sample time needs to be short enough to catch user-noticeable bursts (i.e. to avoid averaging out bursts that, were they flattened out to the average rate, would be noticeable to the user). For most currently used IP services this might be somewhere between 5 seconds and 60 seconds. For straight bulk throughput billing you only need to sample often enough to avoid missing counter roll-over or counter reset events. -- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <woods@robohack.ca> Planix, Inc. 
<woods@planix.com>; Secrets of the Weird <woods@weird.com>
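[Editor's note: the counter arithmetic Greg describes — the delta of octet counts divided by elapsed time, watching for counter wrap — might look like the sketch below. A 32-bit SNMP-style octet counter is assumed; this handles at most one rollover per interval, which is why Greg says to sample often enough not to miss wrap events.]

```python
COUNTER_MAX = 2**32  # a 32-bit ifInOctets-style counter wraps here

def bit_rate(prev_octets, curr_octets, interval_secs):
    """Average bits per second over one sample interval, computed
    from raw octet counter readings, tolerating a single rollover
    between the two samples."""
    delta = curr_octets - prev_octets
    if delta < 0:                # counter wrapped once since last sample
        delta += COUNTER_MAX
    return delta * 8 / interval_secs
```

A 10 Mbit/s link saturates a 32-bit octet counter in under an hour, so a 5-minute sample interval is comfortably within Greg's "often enough" requirement for such a link.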
Isn't in+out a more fair representation of usage? I've always assumed that this was the standard to be honest. Thank god I'm not the billing person. I think Exodus does in+out.
-M
It depends upon the cost model for your provider. For most providers, outbound bandwidth is at more of a premium, so it doesn't make sense to charge you for more of the expensive bandwidth just because you use more of the cheap bandwidth. DS
When you purchase a DS1, you're purchasing 1.5Mb/s. That means 1.5Mb/s in BOTH directions. If the circuit was supposed to be billed as 3Mb/s, they would claim a 3Mb/s line rate. Ethernet, ATM, blah blah blah work the same way. ADSL and cable modems are the strange media that are not SYMMETRIC. IMHO, 1Mb/s means 1Mb/s IN, OUT or BOTH. --- John Fraizer EnterZone, Inc On Thu, 19 Apr 2001, Martin Hannigan wrote:
Isn't in+out a more fair representation of usage? I've always assumed that this was the standard to be honest. Thank god I'm not the billing person. I think Exodus does in+out.
-M
At 03:06 PM 4/19/2001 -0400, Thomas Kernen wrote:
I know one company in Europe that uses the in + out model.
Thomas
----- Original Message ----- From: "Alex Rubenstein" <alex@corp.nac.net> To: <nanog@merit.edu> Sent: Thursday, April 19, 2001 10:09 AM Subject: What does 95th %tile mean?
I've gotten myself into an argument with a provider about the definition of 'industry-standard 95th percentile method.'
To me, this means the following:
a) take the number of bytes xfered over a 5 minute period, and determine rate for both the inbound and outbound. Store this in your favorite data-store.
b) at billing time, presumably on the first of the month or some other monthly increment, take all the samples, sort them from greatest to least, hacking off the top 5% of samples. Actually, this is done twice, once for inbound, once for outbound. Then, take the higher of those two, and
multiply
it by your favorite $ multiple (ie, $500 per megabit per second, or $1 per kilobit per second, etc).
I think that most people agree with the above; the issue we are running into is one rogue provider who is billing this at in + out, not the greater of in or out.
How is everyone else doing it? Specifically, larger folks (UU, Sprint, CW, Exodus/FGC, GX, Qwest, L3)
Thanks!
Regards,
-- Martin Hannigan hannigan@fugawi.net Fugawi Networks Founder/Director of Implementation Boston, MA http://www.fugawi.net Ph: 617.742.2693 Fax: 617.742.2300
participants (22)
- Adrian Chadd
- Alex Rubenstein
- Alex Rubenstein
- Andy Dills
- Arnold Nipper
- Charles Sprickman
- David P. Maynard
- David Schwartz
- Eric Gauthier
- Geoff Huston
- Greg Maxwell
- John Fraizer
- John Hawkinson
- Martin Hannigan
- Michelle T
- Sean Morrison
- Sebastien Berube
- Shawn McMahon
- Simon Leinen
- Thomas Kernen
- Todd Suiter
- woods@weird.com