Hi All, I just wanted to know what is Link capacity upgrade threshold in terms of % of link utilization? Just to get an idea... thanks, Devang Patel
I consider a circuit nearing capacity at 80-85%. Depending on the circuit we start the process of increasing capacity around 70%. There are almost always telco issues, in-building issues, not enough physical ports on the provider end, and other such things that slow you down. Justin
On Sat, Aug 29, 2009 at 11:50 PM, devang patel<devangnp@gmail.com> wrote:
I just wanted to know what is Link capacity upgrade threshold in terms of % of link utilization? Just to get an idea...
If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade. If your 95th percentile utilization is at 95% it's time to finish the upgrade. If your average or median utilization is at 80% capacity then as often as not it's time for your boss to fire you and replace you with someone who can do the job. Slight variations depending on the resource. Use absolute peak instead of 95th percentile for modem bank utilization -- under normal circumstances a modem bank should never ring busy. And a gig-e can run a little closer to the edge (percentage-wise) before folks notice slowness than a T1 can. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
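A minimal sketch of that rule of thumb, in Python; the sample values, the 1 Gbps link speed and the percentile_95/upgrade_advice helpers are illustrative only:

```python
# Sketch: apply the 80% / 95% rule of thumb above to a set of 5-minute
# utilization samples.  Thresholds, samples and link speed are illustrative.

def percentile_95(samples):
    """Billing-style 95th percentile: sort and discard the top 5% of samples."""
    ordered = sorted(samples)
    index = max(int(len(ordered) * 0.95) - 1, 0)
    return ordered[index]

def upgrade_advice(samples_bps, link_bps):
    ratio = percentile_95(samples_bps) / link_bps
    if ratio >= 0.95:
        return "finish the upgrade"
    if ratio >= 0.80:
        return "start planning the upgrade"
    return "no action yet"

# Hypothetical 5-minute samples (bits/s) on a 1 Gbps link.
samples = [650e6, 700e6, 820e6, 790e6, 880e6, 910e6, 760e6, 830e6]
print(upgrade_advice(samples, 1e9))   # -> start planning the upgrade
```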
On Sun, 30 Aug 2009, William Herrin wrote:
If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade. If your 95th percentile utilization is at 95% it's time to finish the upgrade.
I now see why people at the IETF spoke as if "core network congestion" were something natural. If your MRTG graph is showing 95% load as a 5 minute average, you're most likely congesting/buffering at some point during that 5 minute interval. Whether that is acceptable in your network (it's not in mine) is up to you. Also, a gig link on a Cisco will show approximately 93-94% of a gig with imix traffic in the values presented via SNMP (around 930-940 megabit/s as seen in "show int") before it's full, because of IFG, ethernet header overhead etc. So personally, I consider a gig link "in desperate need of upgrade" when it's showing around 850-880 megs of traffic in mrtg. -- Mikael Abrahamsson email: swmike@swm.pp.se
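Where a figure in the 93-94% range can come from: byte counters generally do not see the preamble and inter-frame gap, so a saturated GigE reports less than 1000 Mbit/s. A hedged back-of-the-envelope, assuming an imix-like 400-byte average frame and counters that do include the L2 header and CRC:

```python
# Rough estimate of what a saturated GigE reports, assuming the byte counters
# include the L2 header and CRC but not the preamble (8 B) or inter-frame gap
# (12 B).  The 400-byte average frame size is an assumed imix-like figure.

LINE_RATE_BPS = 1_000_000_000
UNCOUNTED_PER_FRAME = 8 + 12          # preamble + IFG, bytes per frame
avg_frame_bytes = 400                 # assumption; smaller frames mean more overhead

counted_fraction = avg_frame_bytes / (avg_frame_bytes + UNCOUNTED_PER_FRAME)
print(f"Counters at saturation: ~{counted_fraction * LINE_RATE_BPS / 1e6:.0f} Mbit/s "
      f"({counted_fraction:.1%} of line rate)")
# ~950 Mbit/s at 400 B frames; a smaller average frame size pushes it toward 930.
```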
On Aug 30, 2009, at 1:23 AM, Mikael Abrahamsson wrote:
On Sun, 30 Aug 2009, William Herrin wrote:
If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade. If your 95th percentile utilization is at 95% it's time to finish the upgrade.
I now see why people at the IETF spoke in a way that "core network congestion" was something natural.
If your MRTG graph is showing 95% load in 5 minute average, you're most likely congesting/buffering at some time during that 5 minute interval. If this is acceptable or not in your network (it's not in mine) that's up to you.
Also, a gig link on a Cisco will do approx 93-94% of imix of a gig in the values presented via SNMP (around 930-940 megabit/s as seen in "show int") before it's full, because of IFG, ethernet header overhead etc.
I've heard this said many times. I've also seen 'sho int' say 950,000,000 bits/sec and not see packets get dropped. I was under the impression "show int" showed -every- byte leaving the interface. I could make an argument that IFG would not be included, but things like ethernet headers better be. Does this change between IOS revisions, or hardware, or is it old info, or ... what? -- TTFN, patrick P.S. I agree that without perfect conditions (e.g. using an Ixia to test link speeds), you should upgrade WAAAAAY before 90-something percent. microbursts are real, and buffer space is small these days. I'm just asking what the counters -actually- show.
So personally, I consider a gig link "in desperate need of upgrade" when it's showing around 850-880 megs of traffic in mrtg.
-- Mikael Abrahamsson email: swmike@swm.pp.se
On Sun, Aug 30, 2009 at 01:03:35PM -0400, Patrick W. Gilmore wrote:
Also, a gig link on a Cisco will do approx 93-94% of imix of a gig in the values presented via SNMP (around 930-940 megabit/s as seen in "show int") before it's full, because of IFG, ethernet header overhead etc.
I've heard this said many times. I've also seen 'sho int' say 950,000,000 bits/sec and not see packets get dropped. I was under the impression "show int" showed -every- byte leaving the interface. I could make an argument that IFG would not be included, but things like ethernet headers better be.
Does this change between IOS revisions, or hardware, or is it old info, or ... what?
Actually Cisco does count layer 2 header overhead in its snmp and show int results, it is Juniper who does not (for most platforms at any rate) due to their hw architecture. I did some tests regarding this a while back on j-nsp, you'll see different results for different platforms and depending on whether you're looking at the tx or rx. Also you'll see different results for vlan overhead and the like, which can further complicate things. That said, "show int" is an epic disaster for a significantly large percentage of the time. I've seen more bugs and false readings on that thing than I can possibly count, so you really shouldn't rely on it for rate readings. The problem is extra special bad on SVIs, where you might see a reading that is 20% high or low from reality at any given second, even on modern code. I'm not aware of any major issues detecting drops though, so you should at least be able to detect them when they happen (which isn't always at line rate). If you're on a 6500/7600 platform running anything SXF+ try "show platform hardware capacity interface" to look for interfaces with lots of drops globally. -- Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
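Given how unreliable the "show int" load average can be, one alternative is to compute the rate yourself from successive octet-counter readings (e.g. ifHCOutOctets). A minimal sketch, with invented counter values and poll interval:

```python
# Sketch: derive a rate from two readings of a 64-bit octet counter (e.g.
# ifHCOutOctets) instead of trusting the "show int" load average.  Counter
# values and the 30-second interval below are invented for illustration.

COUNTER_MAX = 2**64        # use 2**32 if only 32-bit counters are available

def rate_bps(octets_then, octets_now, seconds):
    """Bits per second between two polls, tolerating a single counter wrap."""
    delta = (octets_now - octets_then) % COUNTER_MAX
    return delta * 8 / seconds

print(f"{rate_bps(1_234_567_890, 1_609_567_890, 30) / 1e6:.1f} Mbit/s")  # 100.0
```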
On 30/08/2009 13:04, Randy Bush wrote:
the normal snmp and other averaging methods *really* miss the bursts.
Definitely. For fun and giggles, I recently turned on 30 second polling on some kit and it turned up all sorts of interesting peculiarities that were completely blotted out in a 5 minute average. In order to get a really good idea of what's going on at a microburst level, you would need to poll as often as it takes to fill the buffer of the port in question. This is not feasible in the general case, which is why we resort to hacks like QoS to make sure that when there is congestion, it is handled semi-sensibly. There's a lot to the saying that QoS really means "Quantity of Service", because quality of service only ever becomes a problem if there is a shortfall in quantity. Nick
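For a sense of those timescales, a rough calculation of how quickly an oversubscribed port fills its buffer; the buffer size used here is hypothetical, since real sizes vary enormously by platform:

```python
# How long an oversubscribed port takes to fill its buffer, i.e. the timescale
# at which you would have to sample to "see" a microburst.  The 1 MB buffer
# figure is hypothetical; real sizes vary enormously by platform.

def buffer_fill_ms(buffer_bytes, ingress_bps, egress_bps):
    """Milliseconds to fill the buffer when ingress exceeds egress."""
    return buffer_bytes * 8 / (ingress_bps - egress_bps) * 1000

# 1 MB of port buffer, 2 Gbps offered into a 1 Gbps port:
print(f"{buffer_fill_ms(1_000_000, 2e9, 1e9):.1f} ms")   # ~8 ms
```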
Nick Hilliard wrote:
Definitely. For fun and giggles, I recently turned on 30 second polling on some kit and it turned up all sorts of interesting peculiarities that were completely blotted out in a 5 minute average.
Would RMON History and Alarms help? I've always considered rolling them out to some of my kit to catch microbursts. Poggs
What system were you using to monitor link usage? Shane On Aug 30, 2009, at 8:26 AM, Nick Hilliard wrote:
On 30/08/2009 13:04, Randy Bush wrote:
the normal snmp and other averaging methods *really* miss the bursts.
Definitely. For fun and giggles, I recently turned on 30 second polling on some kit and it turned up all sorts of interesting peculiarities that were completely blotted out in a 5 minute average.
In order to get a really good idea of what's going on at a microburst level, you would need to poll as often as it takes to fill the buffer of the port in question. This is not feasible in the general case, which is why we resort to hacks like QoS to make sure that when there is congestion, it is handled semi-sensibly.
There's a lot to the saying that QoS really means "Quantity of Service", because quality of service only ever becomes a problem if there is a shortfall in quantity.
Nick
On Sun, 30 Aug 2009, Nick Hilliard wrote:
In order to get a really good idea of what's going on at a microburst level, you would need to poll as often as it takes to fill the buffer of the port in question. This is not feasible in the general case, which is why we resort to hacks like QoS to make sure that when there is congestion, it is handled semi-sensibly.
Or some enterprising vendor could start recording utilisation stats? regards, -- Paul Jakma paul@jakma.org Key ID: 64A2FF6A Fortune: Try to value useful qualities in one who loves you.
On Tue, Sep 01, 2009 at 11:55:45AM +0100, Paul Jakma wrote:
On Sun, 30 Aug 2009, Nick Hilliard wrote:
In order to get a really good idea of what's going on at a microburst level, you would need to poll as often as it takes to fill the buffer of the port in question. This is not feasible in the general case, which is why we resort to hacks like QoS to make sure that when there is congestion, it is handled semi-sensibly.
Or some enterprising vendor could start recording utilisation stats?
do any router vendors provide something akin to hardware latches to keep track of highest buffer fill levels? poll as frequently/infrequently as you like... -- Aaron J. Grier | "Not your ordinary poofy goof." | agrier@poofygoof.com
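A software sketch of that latch idea, with made-up queue depths: the fast path keeps a high-water mark and a slow poller reads and clears it, so the poll rate no longer has to keep up with the bursts:

```python
# Software sketch of the latch being asked for: the fast path records the peak
# queue depth, and a slow poller reads and clears it, so the poll rate no
# longer has to keep up with the bursts.  Depth values below are made up.

class HighWaterLatch:
    def __init__(self):
        self._peak = 0

    def observe(self, queue_depth):
        """Called per enqueue; in hardware this would be the latch update."""
        if queue_depth > self._peak:
            self._peak = queue_depth

    def read_and_clear(self):
        """Called by the (much slower) poller; returns the peak since last read."""
        peak, self._peak = self._peak, 0
        return peak

latch = HighWaterLatch()
for depth in (10, 250, 40, 900, 30):   # hypothetical per-packet queue depths
    latch.observe(depth)
print(latch.read_and_clear())           # -> 900
```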
Another approach to collecting buffer utilization is to infer such utilization from other variables. Active measurement of round trip times (RTT), packet loss, and jitter on a link-by-link basis is a reliable way of inferring interface queuing which leads to packet loss. A link that runs with good values on all 3 measures (low RTT, little or no packet loss, low jitter with small inter-packet arrival variation) can be deemed not a candidate for bandwidth upgrades. The key to active measurement is random measurement of the links so as to catch the bursts. The BRIX active measurement product (now owned by EXFO) is a good active measurement tool which randomizes probe data so as to, over time, collect a randomized sample of link behavior.
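A minimal sketch of that inference, with made-up probe results: derive loss and jitter (inter-arrival variation) from a series of active RTT probes, the quantities named above:

```python
# Minimal sketch of the inference described above: summarize loss and jitter
# (inter-arrival variation) from a series of active probes.  The probe results
# are made up; a real deployment would randomize probe timing as noted.

from statistics import mean

def summarize(rtts_ms):
    """rtts_ms: per-probe RTTs in milliseconds, None for a lost probe."""
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    deltas = [abs(a - b) for a, b in zip(answered, answered[1:])]
    jitter = mean(deltas) if deltas else 0.0
    return mean(answered), jitter, loss_pct

rtt, jitter, loss = summarize([12.1, 12.3, 40.7, 12.2, None, 12.4])
print(f"avg RTT {rtt:.1f} ms, jitter {jitter:.1f} ms, loss {loss:.1f}%")
# Rising RTT and jitter together with loss is the signature of a filling queue.
```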
do any router vendors provide something akin to hardware latches to keep track of highest buffer fill levels? poll as frequently/infrequently as you like...
Without getting into each permutation of a device's architecture, aren't buffer fills really just buffer drops? There are means to determine this. Lots of vendors have configurable buffer pools for inter-device traffic levels that record high water levels as well. Deepak Jain AiNET
Holmes,David A wrote:
runs with good values on all 3 measures (low RTT, little or no packet loss, low jitter with small inter-packet arrival variation) can be deemed not a candidate for bandwidth upgrades. The key to active
Sounds great, unless you don't own the router on the other side of the link, which is subject to ICMP filtering, has a loaded RE, etc. If you pass the traffic through the routers to a reliable server, you'll be monitoring multiple links/routers and not just a single one. Jack
Date: Sun, 30 Aug 2009 21:04:15 +0900 From: Randy Bush <randy@psg.com>
If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade.
s/80/60/
the normal snmp and other averaging methods *really* miss the bursts.
s/60/40/ If you need to carry large TCP flows, say 2Gbps on a 10GE, dropping even a single packet due to congestion is unacceptable. Even with fast recovery, the average transmission rate will take a noticeable dip on every drop and even a drop rate under 1% will slow the flow dramatically. The point is, what is acceptable for one traffic profile may be unacceptable for another. Mail and web browsing are generally unaffected by light congestion. Other applications are not so forgiving. -- R. Kevin Oberman, Network Engineer Energy Sciences Network (ESnet) Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab) E-mail: oberman@es.net Phone: +1 510 486-8634 Key fingerprint:059B 2DDF 031C 9BA3 14A4 EADA 927D EBB3 987B 3751
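Rough numbers behind that point, using the widely cited Mathis et al. approximation BW ~= MSS * C / (RTT * sqrt(p)); the 50 ms RTT and the constant C are assumptions, and the takeaway is that a single 2 Gbps flow can only tolerate loss on the order of 1e-8:

```python
# Rough numbers behind the point above, using the widely cited Mathis et al.
# approximation BW ~= MSS * C / (RTT * sqrt(p)).  RTT and C are assumptions.

MSS_BITS = 1460 * 8       # bytes -> bits
C = 1.22                  # constant from the approximation (assumed value)
RTT_S = 0.050             # an assumed 50 ms path

def loss_budget(target_bps):
    """Maximum loss probability a single TCP flow can tolerate at target_bps."""
    return (MSS_BITS * C / (RTT_S * target_bps)) ** 2

print(f"A 2 Gbps flow tolerates p < {loss_budget(2e9):.1e}")   # ~2e-08
```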
If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade.
s/80/60/
the normal snmp and other averaging methods *really* miss the bursts.
s/60/40/
What is this "upgrade" thing you all speak of? When your links become saturated, shouldn't you solve the problem by deploying DPI-based application-discriminatory throttling and start double-dipping your customers? After all, it's their fault for using up more bandwidth than your flawed business model told you they will use. (If you're not familiar with Bell Canada, it's OK if you don't get the joke).
On Sun, 30 Aug 2009, Randy Bush wrote:
If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade.
s/80/60/
the normal snmp and other averaging methods *really* miss the bursts.
Agreed. Internet traffic is very bursty. If you care about your customer experience, upgrade at the 60-65% level. Especially if an interface towards a customer is similar in bandwidth to your backbone links... Best Regards, Janos Mohacsi
If we're talking about just max capacity, I would agree with most of the statements that 80+% is in the right range, likely with a very fine line before you actually start seeing a performance impact. Operationally, at least in our network, I'd never run anything at that level. Providers that are redundant for each other don't normally operate above 40-45%, in order to accommodate a failure. Other links that have a backup, but don't actively load share, normally run up to about 60-70% before being upgraded. By the time the upgrade is complete, it could be close to 80%. -------------------------------------------------------------------------------- Tom Sands Rackspace Hosting William Herrin wrote:
On Sat, Aug 29, 2009 at 11:50 PM, devang patel<devangnp@gmail.com> wrote:
I just wanted to know what is Link capacity upgrade threshold in terms of % of link utilization? Just to get an idea...
If your 95th percentile utilization is at 80% capacity, it's time to start planning the upgrade. If your 95th percentile utilization is at 95% it's time to finish the upgrade.
If you average or median utilizations are at 80% capacity then as often as not it's time for your boss to fire you and replace you with someone who can do the job.
Slight variations depending on the resource. Use absolute peak instead of 95th percentile for modem bank utilization -- under normal circumstances a modem bank should never ring busy. And a gig-e can run a little closer to the edge (percentage-wise) before folks notice slowness than a T1 can.
Regards, Bill Herrin
participants (19)
- Aaron J. Grier
- Deepak Jain
- devang patel
- Erik L
- Holmes,David A
- Jack Bates
- Justin Wilson - MTIN
- Kevin Oberman
- Mikael Abrahamsson
- Mohacsi Janos
- Nick Hilliard
- Patrick W. Gilmore
- Paul Jakma
- Peter Hicks
- Randy Bush
- Richard A Steenbergen
- Shane Ronan
- Tom Sands
- William Herrin