> If your 95th percentile utilization is at 80% capacity...
> s/80/60/
> s/60/40/
I would suggest that the reason each of you has a different number is that there's a different best number for each case. Looking for any single number to fit all cases, rather than understanding the underlying process, is unlikely to yield good results.

First, different people have different requirements. Some people need the lowest possible cost, some need the lowest cost per volume of bits delivered, some need the lowest cost per unit of burst capacity, some need low latency, some need low jitter, some want good customer service, some want flexible payment terms, and undoubtedly there are a thousand other possible qualities.

Second, this is a binary digital network. It's never 80% full, it's never 60% full, and it's never 40% full. It's always exactly 100% full or exactly 0% full. If SNMP tells you that you've moved 800 megabits in a second on a one-gigabit pipe, then, modulo any bad implementations of SNMP, your pipe was 100% full for eight-tenths of that second. SNMP does not "hide" anything. Applying a percentile function to your data, on the other hand, does hide data: it discards all of your data except a single point, irreversibly. So if you want to know anything about your network, you won't be looking at percentiles. (A small sketch at the end of this note makes that concrete.)

Having your circuit be 100% full is a good thing, presuming you're paying for it and the traffic has some value to you. Having it be 100% full as much of the time as possible is a good thing, because that gives you a high ratio of value to cost. Dropping packets, on the other hand, is likely to be a bad thing, both because each packet putatively had value, and because many dropped packets will be resent; a resent packet is one you've paid for twice, and one that has precluded the sending of another new, paid-for packet in that timeframe. The cost of not dropping packets is not letting buffers overflow, and the cost of not letting buffers overflow is either deep buffers, which means high latency, or customers with a predictable flow of traffic.

Which brings me to item three. In my experience, the single biggest contributor to buffer overflow is having in-feeding (or downstream customer) circuits whose burst capacity is too close to that of the out-feeding (or upstream transit) circuits. Let's say that your outbound circuit is a gigabit, you have two inbound circuits that are a gigabit each and run at 100% utilization 10% of the time each, and you have a megabit of buffer memory allocated to the outbound circuit. 1% of the time, both of the inbound circuits will be at 100% utilization simultaneously. When that happens, data flows in at two gigabits per second against a one-gigabit drain, so the buffer accumulates at a net gigabit per second and overflows in about a millisecond, if the condition persists. And, just like Rosencrantz and Guildenstern flipping coins, such a run will inevitably persist longer than you'd like, frequently enough.

On the other hand, if you have twenty inbound circuits of 100 megabits each, each transmitting at 100% of capacity 10% of the time, you're looking at exactly the same amount of data, but it arrives _much more predictably_: the simultaneous 2-gigabit inflow would occur only 0.1^20 of the time (one part in 10^20), rather than 1% of the time, and it would be proportionally unlikely to persist long enough to overflow the buffer.
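To put the SNMP-versus-percentile point in concrete terms, here's a rough Python sketch; the per-second counter values are invented for illustration, and the simplistic percentile calculation is mine rather than anything a real billing system does:

# A rough sketch, not from the discussion above: contrast the 95th
# percentile, which keeps a single sample, with the per-second view of
# how long the line was actually 100% full.  Sample values are made up.

LINK_BPS = 1_000_000_000   # a one-gigabit pipe

# hypothetical per-second SNMP counter deltas, in bits
samples = [800_000_000, 950_000_000, 120_000_000, 1_000_000_000, 400_000_000]

# the percentile keeps exactly one data point and discards the rest
p95 = sorted(samples)[min(len(samples) - 1, int(0.95 * len(samples)))]

# the binary view: 800 megabits moved in one second on a one-gigabit pipe
# means the pipe was 100% full for 0.8 of that second and 0% full otherwise
seconds_full = sum(s / LINK_BPS for s in samples)

print(f"95th percentile: {p95} bits/second (one surviving point)")
print(f"time at 100% full: {seconds_full:.1f} of {len(samples)} seconds")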
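And the aggregation arithmetic sketched the same way, with the same one-gigabit outbound, one-megabit buffer, and ten-percent duty-cycle figures; treating the feeders' bursts as independent is my simplifying assumption, and the names are mine:

# Another rough sketch: how fast the outbound buffer fills when every
# feeder bursts at once, and how often that coincidence happens with a
# few big feeders versus many small ones.

OUT_BPS     = 1_000_000_000   # outbound aggregation circuit, bits/second
BUFFER_BITS = 1_000_000       # buffer allocated to the outbound port
DUTY        = 0.10            # fraction of time each feeder runs at 100%

def overflow_time(n_feeders, feeder_bps):
    """Seconds until the buffer overflows if every feeder bursts at once."""
    excess = n_feeders * feeder_bps - OUT_BPS   # net accumulation rate
    return BUFFER_BITS / excess if excess > 0 else float("inf")

def all_burst_probability(n_feeders):
    """Chance (assuming independent feeders) that all burst simultaneously."""
    return DUTY ** n_feeders

for n, bps in ((2, 1_000_000_000), (20, 100_000_000)):
    print(f"{n:>2} feeders of {bps // 1_000_000} Mb/s: "
          f"all burst {all_burst_probability(n):.0e} of the time, "
          f"buffer gone in {overflow_time(n, bps) * 1000:.1f} ms")

Both configurations overflow equally fast once the coincidence happens; what changes, by eighteen orders of magnitude, is how often the coincidence happens.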
Thus Kevin's ESnet customers, whose downstream circuits are more likely to be 10 Gb/s or 40 Gb/s feeding into his 40 Gb/s upstream circuits, are much more likely to overflow buffers than a consumer Internet provider feeding 1 Mb/s circuits into a gigabit circuit, even if the aggregation ratio of the latter is hundreds of times higher.

So, in summary: your dropped-packet counters, more than your utilization counters, are the ones to look at as a measure of goodput. And keep your aggregation pipes as much bigger than the pipes you aggregate into them as you can afford.

As always, my apologies to those of you for whom this is unnecessarily remedial, for using NANOG bandwidth and a portion of your Sunday morning.

-Bill