How do you (not how do I) calculate 95th percentile?
I am wondering what other people are doing for 95th percentile calculations these days. Not how you gather the data, but how often you check the counter? Do you use averages or maximums over time periods to create the buckets used for the 95th percentile calculation?

A lot of smaller folks check the counter every 5 min and use that same value for the 95th percentile. Most of us larger folks need to check more often to prevent 32-bit counters from rolling over too often. Are you larger folks averaging the retrieved values over a larger period? Using the maximum within a larger period? Or just using your saved values?

This is curiosity only. A few years ago we compared the same data and the answers varied wildly. It would appear from my latest check that it is becoming more standardized on 5-minute averages, so I'm asking here on NANOG as a reality check.

Note: I have AboveNet, Savvis, Verio, etc. calculations. I'm wondering if there are any other odd combinations out there.

Reply to me offlist. If there is interest I'll summarize the results without identifying the source.

-- 
Jo Rhett
senior geek
SVcolo : Silicon Valley Colocation
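[For a baseline, a minimal sketch of the calculation being asked about, assuming the common convention of one datapoint per 5-minute interval; the function name and sampling convention are illustrative, and as this thread shows the convention is exactly the part that varies.]

    def ninety_fifth_percentile(samples_bps):
        """Return the 95th-percentile rate from per-interval samples
        (bits/sec): sort, discard the top 5%, bill on the next value."""
        ordered = sorted(samples_bps)
        index = max(int(len(ordered) * 0.95) - 1, 0)
        return ordered[index]

    # Example: a 30-day month polled every 5 minutes gives 8640 samples;
    # the top 432 are discarded and the next-highest value sets the bill.
    # print(ninety_fifth_percentile(samples))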
Jo Rhett wrote:
I am wondering what other people are doing for 95th percentile calculations these days. Not how you gather the data, but how often you check the counter? Do you use averages or maximums over time periods to create the buckets used for the 95th percentile calculation?
We use maximums, every 5 minutes.
A lot of smaller folks check the counter every 5 min and use that same value for the 95th percentile. Most of us larger folks need to check more often to prevent 32-bit counters from rolling over too often.
Actually, a lot of people do 5 minutes... and I would say that larger companies don't check them more often because they are using 64-bit counters, as should anyone with over about 100Mbps of traffic.

Are you larger folks averaging the retrieved values over a larger period? Using the maximum within a larger period? Or just using your saved values?
In our setup, as with a lot of people likely, any data that is older than 30 days is averaged. However, we store the exact maximums for the most recent 30 days.
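[A sketch of that kind of retention scheme, with assumed shapes; RRDtool-style consolidation is one common way to implement it, and the hourly granularity here is illustrative.]

    from statistics import mean

    def consolidate_hourly(five_min_datapoints):
        """Collapse exact 5-minute datapoints (kept for the most recent
        30 days) into hourly averages for long-term storage."""
        return [mean(five_min_datapoints[i:i + 12])
                for i in range(0, len(five_min_datapoints), 12)]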
This is curiosity only. A few years ago we compared the same data and the answers varied wildly. It would appear from my latest check that it is becoming more standardized on 5-minute averages, so I'm asking here on NANOG as a reality check.
Note: I have AboveNet, Savvis, Verio, etc. calculations. I'm wondering if there are any other odd combinations out there.
Reply to me offlist. If there is interest I'll summarize the results without identifying the source.
-- 
------------------------------------------------------
Tom Sands
Chief Network Engineer
Rackspace Managed Hosting
(210)447-4065
------------------------------------------------------
On Wed, Feb 22, 2006 at 12:50:34PM -0600, Tom Sands wrote:
A lot of smaller folks check the counter every 5 min and use that same value for the 95th percentile. Most of us larger folks need to check more often to prevent 32-bit counters from rolling over too often.
Actually, a lot of people do 5 minutes... and I would say that larger companies don't check them more often because they are using 64-bit counters, as should anyone with over about 100Mbps of traffic.
Counter size is an incomplete reason for choosing a polling interval. If you need a 5 minute average and poll your routers once every five minutes, what happens if an SNMP packet gets lost?

In the best case, a retransmission over Y seconds sees it through, but now you've got 300+Y seconds in what was supposed to be a 300 second average...your next datapoint will also now be a 300-Y average unless you schedule it into the future.

In the worst case, you've lost the datapoint entirely. This loses not just the one datapoint ending in that five minute span, but also the next datapoint. Sure, you can synthesize two 5 minute averages from one 10 minute average (presuming your counters wouldn't roll), but this is still a loss of data - one of those two datapoints should have been higher than the other.

At a place of previous employ, we solved this problem by using a 30 second (!) polling interval and a home-written polling engine (in C, linking to the UCD-SNMP library, now net-snmp) that did its best to emit and receive as many queries in as short a space of time as it could, without flooding the monitored devices. In those circumstances, we could lose several datapoints and still construct valid 5-minute averages from the pieces (combinations of 30-, 60-, 90-second and longer averages, weighting each by the number of seconds it represents within the 300-second span). Our operations staff also enjoyed seeing the graphical response to changes in traffic balancing within half a minute... better, faster feedback, and another factor that makes 'counter size' a bad indicator for polling interval.
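[For illustration, a minimal sketch of that reconstruction, assuming each surviving piece is an average tagged with the number of seconds it covers; the names and data shapes are hypothetical, not taken from the engine described.]

    def five_minute_average(pieces):
        """Rebuild one 300-second average from surviving sub-interval
        averages. Each piece is (seconds_covered, average_bps); pieces
        are weighted by the seconds they represent within the span."""
        total_seconds = sum(sec for sec, _ in pieces)
        if total_seconds == 0:
            return None  # the entire 300-second span was lost
        return sum(sec * rate for sec, rate in pieces) / total_seconds

    # Example: one 30-second poll lost, leaving a 60-second average that
    # spans the gap plus eight intact 30-second averages.
    # avg = five_minute_average([(60, 1.2e8)] + [(30, r) for r in rates])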
In our setup, as with a lot of people likely, any data that is older than 30 days is averaged. However, we store the exact maximums for the most recent 30 days.
You keep no record? What do you do if a customer challenges their bill? Synthesize 5 minute datapoints out of the larger averages?

I recommend keeping the 5 minute averages in perpetuity, even if that means having an operator burn the data to CD and store it in a safe (not under his desk in the pizza boxes, nor under his soft drink as a coaster).

-- 
David W. Hankins                "If you don't do it right the first time,
Software Engineer                    you'll just have to do it again."
Internet Systems Consortium, Inc.               -- Jack T. Hankins
David W. Hankins wrote:
On Wed, Feb 22, 2006 at 12:50:34PM -0600, Tom Sands wrote:
A lot of smaller folks check the counter every 5 min and use that same value for the 95th percentile. Most of us larger folks need to check more often to prevent 32-bit counters from rolling over too often.
Actually, a lot of people do 5 minutes... and I would say that larger companies don't check them more often because they are using 64-bit counters, as should anyone with over about 100Mbps of traffic.
Counter size is an incomplete reason for choosing a polling interval.
Possibly incomplete, but a reason for some nonetheless, if all they can do is 32-bit counters.
If you need a 5 minute average and poll your routers once every five minutes, what happens if an SNMP packet gets lost?
No one said it was "needed", just what is done... and I agree with your reasons for more frequent polling, rather than doing it because of counter roll.
In the best case, a retransmission over Y seconds sees it through, but now you've got 300+Y seconds in what was supposed to be a 300 second average...your next datapoint will also now be a 300-Y average unless you schedule it into the future.
In the worst case, you've lost the datapoint entirely. This loses not just the one datapoint ending in that five minute span, but also the next datapoint. Sure, you can synthesize two 5 minute averages from one 10 minute average (presuming your counters wouldn't roll), but this is still a loss of data - one of those two datapoints should have been higher than the other.
In our setup, as with a lot of people likely, any data that is older than 30 days is averaged. However, we store the exact maximums for the most recent 30 days.
You keep no record? What do you do if a customer challenges their bill? Synthesize 5 minute datapoints out of the larger averages?
This isn't for customer billing. We don't bill customers on Mbps, but rather on total volume of GB transferred. That is an easy number to collect and doesn't depend on 5 minute intervals being successful. Right up until someone clears the counters ;)
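[A minimal sketch of that sort of volume accounting, under assumed names; the wrap handling is the standard trick, and note that a cleared counter is indistinguishable from a wrapped one, which is exactly the caveat above.]

    COUNTER_MAX = 2**64  # 64-bit ifHCInOctets/ifHCOutOctets

    def octet_delta(prev, curr, counter_max=COUNTER_MAX):
        """Bytes transferred between two readings of a monotonic counter.
        A missed poll costs nothing here, as long as the counter neither
        wraps twice nor gets cleared between the surviving readings."""
        if curr >= prev:
            return curr - prev
        # Counter wrapped -- or was cleared, which looks identical.
        return (counter_max - prev) + curr

    # total_gb = sum(octet_delta(a, b)
    #                for a, b in zip(readings, readings[1:])) / 1e9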
I recommend keeping the 5 minute averages in perpetuity, even if that means having an operator burn the data to CD and store it in a safe (not under his desk in the pizza boxes, nor under his soft drink as a coaster).
-- 
------------------------------------------------------
Tom Sands
Chief Network Engineer
Rackspace Managed Hosting
(210)447-4065
------------------------------------------------------
On Feb 22, 2006, at 10:12 AM, Jo Rhett wrote:
A lot of smaller folks check the counter every 5 min and use that same value for the 95th percentile. Most of us larger folks need to check more often to prevent 32-bit counters from rolling over too often. Are you larger folks averaging the retrieved values over a larger period? Using the maximum within a larger period? Or just using your saved values?
Most people are using 64-bit counters. This avoids the wrapping problem (assuming you don't have 100GE and poll more than once every 5 years :-)).
This is curiosity only. A few years ago we compared the same data and the answers varied wildly. It would appear from my latest check that it is becoming more standardized on 5-minute averages, so I'm asking here on NANOG as a reality check.
Yup, 5 min seems to be the accepted time.
Note: I have AboveNet, Savvis, Verio, etc. calculations. I'm wondering if there are any other odd combinations out there.
Reply to me offlist. If there is interest I'll summarize the results without identifying the source.
-- 
Jo Rhett
senior geek
SVcolo : Silicon Valley Colocation
(I did this fast, and, who knows; I could be off by an order or two of magnitude)
Most people are using 64-bit counters. This avoids the wrapping problem (assuming you don't have 100GE and poll more than once every 5 years :-)).
2^64 is 18,446,744,073,709,551,616 bytes.

100 GE (100,000,000,000 bits/sec) is 12,500,000,000 bytes/sec.

It would take 1,475,739,525 seconds, or 46.79 years for a counter wrap.

-- 
Alex Rubenstein, AR97, K2AHR, alex@nac.net, latency, Al Reuben
Net Access Corporation, 800-NET-ME-36, http://www.nac.net
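[For anyone checking the arithmetic, a short sketch that also covers the 32-bit case implied earlier in the thread; the 365-day year is an assumption chosen to match the 46.79 figure.]

    def wrap_seconds(counter_bits, link_bps):
        """Seconds until an octet counter wraps at a sustained line rate."""
        bytes_per_sec = link_bps / 8
        return (2 ** counter_bits) / bytes_per_sec

    # 64-bit counter on 100 GE: ~1,475,739,526 seconds, ~46.8 years.
    print(wrap_seconds(64, 100e9) / (365 * 86400))

    # 32-bit counter at a sustained 100 Mbps: ~343.6 seconds -- barely
    # longer than one 5-minute poll, hence 64-bit counters (or faster
    # polling) above roughly 100 Mbps.
    print(wrap_seconds(32, 100e6))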
Doh! You are 100% correct. I didn't take into account the fact that the counters are ifInOctets/ifOutOctets, i.e. *octets*, NOT *bits*. The point is that 64-bit counters are not likely to roll :-)

Warren

On Feb 22, 2006, at 12:24 PM, Alex Rubenstein wrote:
(I did this fast, and, who knows; I could be off by an order or two of magnitude)
Most people are using 64-bit counters. This avoids the wrapping problem (assuming you don't have 100GE and poll more than once every 5 years :-)).
2^64 is 18,446,744,073,709,551,616 bytes.
100 GE (100,000,000,000 bits/sec) is 12,500,000,000 bytes/sec.
It would take 1,475,739,525 seconds, or 46.79 years for a counter wrap.
-- 
Alex Rubenstein, AR97, K2AHR, alex@nac.net, latency, Al Reuben
Net Access Corporation, 800-NET-ME-36, http://www.nac.net