If your 95th percentile utilization is at 80% capacity... s/80/60/ s/60/40/
I would suggest that the reason each of you has a different number is that there's a different best number for each case. Looking for any single number to fit all cases, rather than understanding the underlying process, is unlikely to yield good results.

First, different people have different requirements. Some people need the lowest possible cost, some need the lowest cost per volume of bits delivered, some need the lowest cost per unit of burst capacity, some need low latency, some need low jitter, some want good customer service, some want flexible payment terms, and undoubtedly there are a thousand other possible qualities.

Second, this is a binary digital network. It's never 80% full, it's never 60% full, and it's never 40% full. It's always exactly 100% full or exactly 0% full. If SNMP tells you that you've moved 800 megabits in a second on a one-gigabit pipe, then, modulo any bad implementations of SNMP, your pipe was 100% full for eight-tenths of that second. SNMP does not "hide" anything. Applying a percentile function to your data, on the other hand, does hide data: it discards all of your data except a single point, irreversibly. So if you want to know anything about your network, you won't be looking at percentiles.

Having your circuit be 100% full is a good thing, presuming you're paying for it and the traffic has some value to you. Having it be 100% full as much of the time as possible is a good thing, because that gives you a high ratio of value to cost. Dropping packets, on the other hand, is likely to be a bad thing, both because each packet putatively had value, and because many dropped packets will be resent, and a resent packet is one you've paid for twice, one which has precluded the sending of another new, paid-for packet in that timeframe.
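A hypothetical sketch of the two points above (the function names and sample values are mine, not from the post): a counter delta puts a lower bound on how long the link sat at exactly 100%, while a percentile keeps a single sample and discards the rest.

```python
# Illustration only: a one-second SNMP byte-counter delta tells you how long
# the link must have been at line rate, because at any instant the link is
# exactly 100% full or exactly 0% full.

def saturated_seconds(bits_moved: float, line_rate_bps: float,
                      interval_s: float = 1.0) -> float:
    """Minimum time (seconds) the link spent at 100% during the interval."""
    return min(interval_s, bits_moved / line_rate_bps)

# 800 megabits moved in one second on a one-gigabit pipe:
print(saturated_seconds(800e6, 1e9))  # 0.8 -- full for eight-tenths of that second

# A percentile, by contrast, throws away every sample but one:
samples = [100e6, 800e6, 950e6, 200e6, 400e6]  # made-up per-second bit counts
samples.sort()
p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
print(p95)  # one surviving number; the other four are discarded irreversibly
```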
The cost of not dropping packets is not having buffers overflow, and the cost of not having buffers overflow is either having deep buffers, which means high latency, or having customers with a predictable flow of traffic. Which brings me to item three.

In my experience, the single biggest contributor to buffer overflow is having in-feeding (or downstream customer) circuits whose burst capacity is too close to that of the out-feeding (or upstream transit) circuits. Let's say that your outbound circuit is a gigabit, you have two inbound circuits which are each a gigabit and run at 100% utilization 10% of the time, and you have a megabit of buffer memory allocated to the outbound circuit. 1% of the time, both of the inbound circuits will be at 100% utilization simultaneously. When that happens, data flows in at two gigabits per second while draining out at only one, which will fill the buffer in a millisecond if it persists. And, just like Rosencrantz and Guildenstern flipping coins, such a run will inevitably persist longer than you'd desire, frequently enough.

On the other hand, if you have twenty inbound circuits of 100 megabits each, transmitting at 100% of capacity 10% of the time each, you're looking at exactly the same amount of data, however it arrives _much more predictably_: the full 2-gigabit inflow would only occur 0.000000000000000001% of the time (one chance in 10^20), rather than 1% of the time. And it would be proportionally unlikely to persist for the longer periods of time necessary to overflow the buffer.

Thus Kevin's ESnet customers, who are much more likely to be 10Gb or 40Gb downstream circuits feeding into his 40Gb upstream circuits, are much more likely to overflow buffers than a consumer Internet provider who's feeding 1Mb circuits into a gigabit circuit, even if the aggregation ratio of the latter is hundreds of times higher.
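The coin-flipping argument above can be checked with a back-of-the-envelope binomial calculation (a sketch of mine, not from the post; it assumes each inbound circuit bursts independently to 100% for 10% of the time):

```python
# Probability that instantaneous inflow exceeds the outbound circuit, given
# n inbound circuits that each independently burst to full rate 10% of the time.
from math import comb

def p_inflow_exceeds(n_circuits: int, circuit_bps: float,
                     outbound_bps: float, p_on: float = 0.1) -> float:
    # This many simultaneous bursts still fit in the outbound pipe:
    threshold = int(outbound_bps // circuit_bps)
    # Binomial tail: probability that more than `threshold` burst at once.
    return sum(comb(n_circuits, k) * p_on**k * (1 - p_on)**(n_circuits - k)
               for k in range(threshold + 1, n_circuits + 1))

# Two 1 Gb/s inbound circuits into a 1 Gb/s outbound: both must burst at once.
print(p_inflow_exceeds(2, 1e9, 1e9))      # ~0.01 -- 1% of the time

# Twenty 100 Mb/s circuits into the same 1 Gb/s outbound carry the same total
# traffic, but the full 2 Gb/s inflow needs all twenty at once: 0.1**20.
print(0.1**20)                            # one chance in 10**20
print(p_inflow_exceeds(20, 100e6, 1e9))   # ~7e-7: even partial overflow is rare
```

Note the bonus the aggregation buys: with twenty small circuits, even the probability of *any* inflow beyond a gigabit (eleven or more bursting at once) is under one in a million.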
So, in summary: Your dropped packet counters are the ones to be looking at as a measure of goodput, more than your utilization counters. And keep the size of your aggregation pipes as much bigger than the size of the pipes you aggregate into them as you can afford to. As always, my apologies to those of you for whom this is unnecessarily remedial, for using NANOG bandwidth and a portion of your Sunday morning. -Bill
So, in summary: Your dropped packet counters are the ones to be looking at
as a measure of goodput, more than your utilization counters.
Indeed. Capacity upgrades are best gauged by drop rates; bit-rates without this context are largely useless. When you're only aware of the RX side though, in the absence of an equivalent to BECN, what's the best way to track this? Do any of the Ethernet OAM standards expose this data? Similarly, could anyone share experiences with transit link upgrades to accommodate bursts? In the past, any requests to transit providers have been answered w/ the need for significant increases to 95%ile commits. While this makes sense from a sales perspective, there's a strong (but insufficient) engineering argument against it.
On Tue, 1 Sep 2009, Kevin Graham wrote:
Indeed. Capacity upgrades are best gauged by drop rates; bit-rates without this context are largely useless.
If you're dropping packets, you're already over the cliff. Our job as an ISP is to forward the packets our customers send to us; how is that compatible with upgrading links when they're so full that you're not only buffering but actually DROPPING packets? -- Mikael Abrahamsson email: swmike@swm.pp.se
Mikael Abrahamsson wrote:
If you're dropping packets, you're already over the cliff. Our job as an ISP is to forward the packets our customers send to us; how is that compatible with upgrading links when they're so full that you're not only buffering but actually DROPPING packets?
Many ISPs don't even watch the drop rate. Packets can easily start dropping long before you reach an 80% average mark, or may not drop until 90% utilization. Dropped packets are a safeguard measurement indicating that bursts of traffic have already filled the buffers. One could argue that setting QoS with larger buffers and monitoring the buffer usage is better than waiting for the drop. Jack
On Wed, Sep 02, 2009 at 08:39:20AM +0200, Mikael Abrahamsson wrote:
On Tue, 1 Sep 2009, Kevin Graham wrote:
Indeed. Capacity upgrades are best gauged by drop rates; bit-rates without this context are largely useless.
If you're dropping packets, you're already over the cliff. Our job as ISP is to forward the packets our customers send to us, how is that compatible with upgrading links when they're so full that you're not only buffering but you're actually DROPPING packets?
By all means watch your traffic utilization and plan your upgrades in a timely fashion, but watching for dropped packets can help reveal unexpected issues, such as all of those routers out there that don't actually do line rate, depending on your particular traffic profile or pattern of traffic between ports.

Personally I find the whole argument over what % of utilization should trigger an upgrade to be little more than a giant exercise in penis waving. People throw out all kinds of numbers, 80%, 60%, 50%, 40%, I've even seen someone claim 25%, but in the end I find more value in a quick reaction time to ANY unexpected event than I do in adhering to some arbitrary rule about when to upgrade. I'd rather see someone leave their links 80% full but have enough spare parts and a competent enough operations staff that they can turn around an upgrade in a matter of hours, than see them upgrade something unnecessarily at 40% and then not be able to react to an unplanned issue on a different port.

And honestly, I've peered with a lot of networks who claim to do preemptive upgrades at numbers like 40%, but I've never actually seen it happen. In fact, the relationship between the marketing claim of upgrading at x percent and the number of weeks you have to run the port congested before the other side gets budgetary approval to sign the PO for the optics they need to do the upgrade but don't own seems to be inversely proportional. :) -- Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
What SNMP MIB records drops? I poked around for a short time, and I'm thinking that generically the drops fall into the errors counter. Hopefully that's not the case. Frank
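For what it's worth, IF-MIB does keep drops separate from errors: ifInDiscards/ifOutDiscards count packets discarded (typically for lack of buffer space), distinct from ifInErrors/ifOutErrors. A minimal sketch of turning two polls into a drop rate follows; the SNMP fetch itself is left abstract (an assumption: you'd get the counters via snmpget or an SNMP library), and it handles the 32-bit Counter32 wrap.

```python
# IF-MIB (RFC 2863) counts drops separately from errors:
#   ifInDiscards  = 1.3.6.1.2.1.2.2.1.13    ifInErrors  = 1.3.6.1.2.1.2.2.1.14
#   ifOutDiscards = 1.3.6.1.2.1.2.2.1.19    ifOutErrors = 1.3.6.1.2.1.2.2.1.20

COUNTER32_MAX = 2**32

def drop_rate(prev: int, curr: int, interval_s: float) -> float:
    """Discards per second between two polls of a 32-bit Counter32."""
    delta = curr - prev
    if delta < 0:                 # counter wrapped (assume at most once per poll)
        delta += COUNTER32_MAX
    return delta / interval_s

print(drop_rate(1000, 1600, 300))        # 600 discards over 5 min = 2.0/sec
print(drop_rate(2**32 - 100, 500, 300))  # wrap handled: still 2.0/sec
```

Poll often enough that a busy counter can't wrap more than once between samples, or the computed rate silently understates the real one.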
participants (6)
- Bill Woodcock
- Frank Bulk
- Jack Bates
- Kevin Graham
- Mikael Abrahamsson
- Richard A Steenbergen