Jason Frisvold writes:
I'm working on a system to alert when a bandwidth augmentation is needed. I've looked at using both true averages and 95th percentile calculations. I'm wondering what everyone else uses for this purpose?
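As a point of reference, here is a minimal Python sketch of how a true average and a billing-style 95th percentile can be computed from the same set of 5-minute utilization samples. The sample values are invented, and the "discard the top 5%" convention shown is only one of several in common use:

    # Sketch: true average vs. 95th percentile over per-interval
    # utilization samples (e.g. 5-minute averages in Mb/s).
    import math

    samples_mbps = [12, 15, 40, 180, 220, 250, 90, 30, 310, 35, 20, 18]

    def true_average(samples):
        return sum(samples) / float(len(samples))

    def percentile_95(samples):
        # Billing-style convention: sort the samples, discard the top
        # 5%, and report the highest remaining value. With a month of
        # 5-minute samples (~8640) this throws away the busiest ~432
        # intervals; with only 12 samples nothing gets discarded.
        ordered = sorted(samples)
        rank = int(math.ceil(0.95 * len(ordered)))   # 1-based rank
        return ordered[max(rank - 1, 0)]

    # Here the average is ~102 Mb/s while the 95th percentile is 310 Mb/s.
    print("true average:    %6.1f Mb/s" % true_average(samples_mbps))
    print("95th percentile: %6.1f Mb/s" % percentile_95(samples_mbps))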
We use a "secret formula", aka rules of thumb, based on perceived quality expectations/customer access capacities, and cost/revenue considerations. In the bad old days of bandwidth crunch (ca. 1996), we scheduled upgrades of our transatlantic links so that relief would come when peak-hour average packet loss exceeded 5% (later 3%). At that time the general performance expectation was that Internet performance is mostly crap anyway, if you need to transfer large files, "at 0300 AM" is your friend; and upgrades were incredibly expensive. With that rule, link utilization was 100% for most of the (working) day. Today, we start thinking about upgrading from GbE to 10GE when link load regularily exceeds 200-300 Mb/s (even when the average load over a week is much lower). Since we run over dark fibre and use mid-range routers with inexpensive ports, upgrades are relatively cheap. And - fortunately - performance expectations have evolved, with some users expecting to be able to run file transfers near Gb/s speeds, >500 Mb/s videoconferences with no packet loss, etc. An important question is what kind of users your links aggregate. A "core" link shared by millions of low-bandwidth users may run at 95% utilization without being perceived as a bottleneck. On the other hand, you may have an campus access shared by users with fast connections (I hear GbE is common these days) on both sides. In that case, the link may be perceived as a bottleneck even when utilization graphs suggest there's a lot of headroom. In general, I think utilization rates are less useful as a basis for upgrade planning than (queueing) loss and delay measurements. Loss can often be measured directly at routers (drop counters in SNMP), but queueing delay is hard to measure in this way. You could use tools such as SmokePing (host-based) or Cisco IP SLA or Juniper RPM (router-based) to do this. (And if you manage to link your BSS and OSS, then you can measure the rate at which customers run away for an even more relevant metric :-)
We're talking about anything from a T1 to an OC-12 here. My guess is that the calculation needs to be slightly different based on the transport, but I'm not 100% sure.
Probably not on the type of transport - PDH/SDH/Ethernet behave essentially the same. But the rules will differ across bandwidth ranges. Again, it is important to look not just at link capacities in isolation, but also at their relation to the capacities of the access links they aggregate.

-- Simon.
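A toy illustration of that last point, under the made-up criterion that a link deserves a closer look whenever a single access link could fill its remaining busy-hour headroom (the numbers are invented, not anyone's rule):

    # Toy numbers only: the same utilization figure means different
    # things depending on the access links behind it.
    def single_user_can_fill_headroom(link_mbps, busy_hour_mbps, access_mbps):
        headroom = link_mbps - busy_hour_mbps
        return access_mbps >= headroom

    # 10G core link at 95%, aggregating 20 Mb/s subscribers: not flagged.
    print(single_user_can_fill_headroom(10000, 9500, 20))     # False
    # GbE campus link at 30%, GbE-attached users on both sides: flagged.
    print(single_user_can_fill_headroom(1000, 300, 1000))     # True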