Bottlenecks and link upgrades
On Wed, 12 Aug 2020 at 10:35, Hank Nussbacher <hank@interall.co.il> wrote:
At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?
I've worked for employers where the policy has been anywhere from 50% to 80%, and I know that isn't the complete range. Most do not subscribe to any single simple rule but act more tactically. Personally, if the link is in a growth market, you should upgrade really early; 50% seems late, and the cost is negligible if you anticipate growth to continue. If it's not a growth market, the cost may become less than negligible. Sometimes networks congest their edge interfaces in particular strategically, due to poor incentives: a wholesale arm with largely irrelevant revenue might see some benefit from strategic congestion while significantly hurting the money-printing mobile arm, reducing the company-wide bottom line even as it improves the wholesale arm's bottom line. -- ++ytti
On 12/Aug/20 09:44, Saku Ytti wrote:
Personally, if the link is in a growth market, you should upgrade really early; 50% seems late, and the cost is negligible if you anticipate growth to continue. If it's not a growth market, the cost may become less than negligible.
The problem you have is "what is a growth market", especially over time as it stabilizes and sees new entrants, but growth is now in a phase where you need massive scale to keep playing. You then shift from "sales are guaranteed Day 1" to "build it and hope for the best". Many commercial ISPs get fearful at that point, because of the temptation to link capacity to guaranteed sales.
Sometimes networks congest their edge interfaces in particular strategically, due to poor incentives: a wholesale arm with largely irrelevant revenue might see some benefit from strategic congestion while significantly hurting the money-printing mobile arm, reducing the company-wide bottom line even as it improves the wholesale arm's bottom line.
I know a few :-). Mark.
On 12/Aug/20 09:31, Hank Nussbacher wrote:
At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?
We start the process at 50% utilization, and work toward completing the upgrade by 70% utilization. The period between 50% - 70% is just internal paperwork. Mark.
Just my curiosity. May I ask how link capacity loading is measured? What does a 50%, 70%, or 90% capacity loading mean? Is the load sampled and measured instantaneously, or averaged over a certain period of time (and if so, at what granularity)? These are questions that have bothered me for a long time; I don't know if it's appropriate to ask about them here, by the way. I take care of radio access network performance at work, and have found many things in the transport network that are unknown to me. Thanks and best regards, Taichi On Wed, Aug 12, 2020 at 3:54 PM Mark Tinka <mark.tinka@seacom.com> wrote:
On 12/Aug/20 09:31, Hank Nussbacher wrote:
At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?
We start the process at 50% utilization, and work toward completing the upgrade by 70% utilization.
The period between 50% - 70% is just internal paperwork.
Mark.
On 12/Aug/20 17:08, m.Taichi wrote:
Just my curiosity. May I ask how link capacity loading is measured? What does a 50%, 70%, or 90% capacity loading mean? Is the load sampled and measured instantaneously, or averaged over a certain period of time (and if so, at what granularity)?
These are questions that have bothered me for a long time; I don't know if it's appropriate to ask about them here, by the way. I take care of radio access network performance at work, and have found many things in the transport network that are unknown to me.
For this, we look at simple 5-minute SNMP data over the period. Nothing too fancy. It's stable. Mark.
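To make that 5-minute figure concrete: it is typically derived from two readings of the interface octet counters (e.g. IF-MIB ifHCInOctets/ifHCOutOctets) taken roughly 300 seconds apart. A minimal sketch of the arithmetic, with invented example numbers, not a description of any particular NMS:

```python
# Sketch only: convert two successive SNMP octet-counter readings into the
# average utilization for that polling interval.
COUNTER64_MAX = 2 ** 64

def utilization_pct(prev_octets, curr_octets, interval_s, link_bps):
    """Average utilization over one polling interval, in percent."""
    delta = curr_octets - prev_octets
    if delta < 0:                      # counter wrapped between the two polls
        delta += COUNTER64_MAX
    return 100.0 * (delta * 8) / (interval_s * link_bps)

# Example: a 10 Gbps link that moved 75 GB of traffic in a 5-minute window.
print(utilization_pct(0, 75_000_000_000, 300, 10_000_000_000))   # -> 20.0
```

Any percentage quoted in this thread (50%, 70%, 95%) is therefore a statistic over many such interval averages, not an instantaneous reading.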
m Taichi writes:
Just my curiosity. May I ask how link capacity loading is measured? What does a 50%, 70%, or 90% capacity loading mean? Is the load sampled and measured instantaneously, or averaged over a certain period of time (and if so, at what granularity)?
Very good question! With tongue in cheek, one could say that measured instantaneously, the load on a link is always either zero or 100% of the link rate... ISPs typically sample link load in 5-minute intervals and look at graphs that show load (at this 5-minute sampling resolution) over ~24 hours, or at longer-term graphs where the resolution has been "downsampled"; downsampling usually smooths out short-term peaks.
From my own experience, upgrade decisions are made by looking at those graphs and checking whether peak traffic (possibly ignoring "spikes" :-) crosses the threshold repeatedly.
At some places this might be codified in terms of percentiles, e.g. "the Nth percentile of the M-minute utilization samples exceeds X% of link capacity over a Y-day period". I doubt that anyone uses such rules to automatically issue upgrade orders, but maybe to generate alerts like "please check this link, we might want to upgrade it". I'd be curious whether other operators have such alert rules, and what N/M/X/Y they use - might well be different values for different kinds of links.
-- Simon.
PS. We use the "stare at graphs" method, but if we had automatic alerts, I guess it would be something like "the 95th percentile of 5-minute samples exceeds 50% over 30 days".
PPS. My colleagues remind me that we do alert on output queue drops.
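Simon's N/M/X/Y rule is easy to prototype. A minimal sketch of his PS example ("the 95th percentile of 5-minute samples exceeds 50% over 30 days"), using a nearest-rank percentile; the sample data is synthetic and nothing here reflects any particular operator's tooling:

```python
import math
import random

def percentile(samples, pct):
    """Nearest-rank percentile of a list of utilization percentages."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def needs_review(samples, n=95, x=50.0):
    """True if the Nth percentile of the samples exceeds X% utilization."""
    return percentile(samples, n) > x

# 30 days of 5-minute samples = 30 * 24 * 12 = 8640 values (synthetic here).
month = [random.uniform(20, 65) for _ in range(8640)]
if needs_review(month):
    print("please check this link, we might want to upgrade it")
```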
On 13/Aug/20 11:56, Simon Leinen wrote:
I'd be curious whether other operators have such alert rules, and what N/M/X/Y they use - might well be different values for different kinds of links.
In our NMS, we use alerts to tell us about links that hit a threshold. But yes, this is based on 5-minute samples, not percentile data. The alerts are somewhat redundant for long-term planning; they are more useful when problems happen out of the blue. Mark.
With tongue in cheek, one could say that measured instantaneously, the load on a link is always either zero or 100% of the link rate...
Actually, that's a first-class observation!
-- Ing. Etienne-Victor Depasquale Assistant Lecturer Department of Communications & Computer Engineering Faculty of Information & Communication Technology University of Malta Web. https://www.um.edu.mt/profile/etiennedepasquale
When I worked for an ISP, it was about 70%; I'm not sure if that is the case with other ones. On 8/12/2020 3:31 AM, Hank Nussbacher wrote:
At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?
Thanks, Hank
Caveat: The views expressed above are solely my own and do not express the views or opinions of my employer
On Wed, 12 Aug 2020, Hank Nussbacher wrote:
At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?
Thanks, Hank
Caveat: The views expressed above are solely my own and do not express the views or opinions of my employer
Why upgrade when you can legislate the problem instead? Charter tries to convince FCC that broadband customers want data caps. https://arstechnica.com/tech-policy/2020/08/charter-tries-to-convince-fcc-th... Ted
On 12.08.2020 09:31, Hank Nussbacher wrote:
At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?
Hi, Wouldn't it be better to measure the basic performance like packet drop rates and queue sizes? These days live video is needed and these parameters are essential to the quality. Queues are building up in milliseconds and people are averaging over minutes to estimate quality. If you are measuring queue delay with high-frequency one-way-delay measurements, you would then be able to advise better on what the consequences of a highly loaded link are. We are running a research project on end-to-end quality and the enclosed image is yesterday's report on queue size (h_ddelay) in ms. It shows stats on delays between some peers. I would have looked at the trends on the involved links to see if an upgrade is necessary - 421 ms might be too much if it happens often. Best regards, Olav Kvittem
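To make Olav's measurement idea concrete: one common way to estimate queueing delay from high-frequency one-way-delay probes is to subtract the minimum delay seen on the path, which removes propagation delay and any constant clock offset. A sketch of that idea; the probe values are invented, and this is not necessarily how his project computes h_ddelay:

```python
def queue_delay_ms(owd_samples_ms):
    """Estimate per-probe queueing delay from one-way delay samples (ms)."""
    baseline = min(owd_samples_ms)            # "empty queue" delay for the path
    return [owd - baseline for owd in owd_samples_ms]

# Example: probes across a path whose base one-way delay is ~12 ms.
probes = [12.1, 12.0, 12.3, 55.7, 433.0, 30.2, 12.2]
print(max(queue_delay_ms(probes)))            # worst queueing delay seen: 421.0 ms
```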
On 13/Aug/20 12:23, Olav Kvittem via NANOG wrote:
Wouldn't it be better to measure the basic performance like packet drop rates and queue sizes?
These days live video is needed and these parameters are essential to the quality.
Queues are building up in milliseconds and people are averaging over minutes to estimate quality.
If you are measuring queue delay with high-frequency one-way-delay measurements
you would then be able to advise better on what the consequences of a highly loaded link are.
We are running a research project on end-to-end quality and the enclosed image is yesterday's report on
queue size (h_ddelay) in ms. It shows stats on delays between some peers.
I would have looked at the trends on the involved links to see if an upgrade is necessary -
421 ms might be too much if it happens often.
I'm confident everyone (even the cheapest CFO) knows the consequences of congesting a link and choosing not to upgrade it. Optical issues, dirty patch cords, faulty line cards and wrong configurations will most likely lead to packet loss. Link congestion due to insufficient bandwidth will most certainly lead to packet loss. It's great to monitor packet loss, latency, pps, etc. But packet loss at 10% link utilization is not a foreign occurrence. No amount of bandwidth upgrades will fix that. Mark.
Mark Tinka wrote on 13/08/2020 11:31:
It's great to monitor packet loss, latency, pps, etc. But packet loss at 10% link utilization is not a foreign occurrence. No amount of bandwidth upgrades will fix that.
you could easily have 10% utilization and see packet loss due to insufficient bandwidth if you have egress << ingress and proportionally low buffering, e.g. UDP or iSCSI from a 40G/100G port with egress to a low-buffer 1G port. This sort of thing is less likely in the imix world, but it can easily happen with high-capacity CDN nodes injecting content where the receiving port is small and subject to bursty traffic. Nick
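A back-of-envelope illustration of Nick's point, with invented numbers: a burst arriving at 100 Gbps but leaving through a 1 Gbps port grows the egress queue at roughly the difference of the two rates, so a small buffer overflows in tens of microseconds, long before a 5-minute average shows anything:

```python
def time_to_overflow_us(buffer_bytes, ingress_bps, egress_bps):
    """How long a line-rate burst can be absorbed before the egress buffer fills."""
    fill_rate_bps = ingress_bps - egress_bps      # net rate at which the queue grows
    return buffer_bytes * 8 / fill_rate_bps * 1e6

# A 1G port with 512 KB of packet buffer receiving a 100 Gbps burst (illustrative):
print(round(time_to_overflow_us(512 * 1024, 100e9, 1e9), 1))   # ~42.4 microseconds
```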
On 13/Aug/20 13:00, Nick Hilliard wrote:
you could easily have 10% utilization and see packet loss due to insufficient bandwidth if you have egress << ingress and proportionally low buffering, e.g. UDP or iSCSI from a 40G/100G port with egress to a low-buffer 1G port.
This sort of thing is less likely in the imix world, but it can easily happen with high capacity CDN nodes injecting content where the receiving port is small and subject to bursty traffic.
Indeed. The smaller the capacity gets toward egress, the closer you are getting to an end-user, in most cases. End-user link upgrades will always be the weakest link in the chain, as the incentive is more on their side than on yours, as their provider. Your final egress port's buffer sizing notwithstanding, of course. Mark.
Hi Mark, Just comments on your points below. On 13.08.2020 12:31, Mark Tinka wrote:
On 13/Aug/20 12:23, Olav Kvittem via NANOG wrote:
Wouldn't it be better to measure the basic performance like packet drop rates and queue sizes ?
These days live video is needed and these parameters are essential to the quality.
Queues are building up in milliseconds and people are averaging over minutes to estimate quality.
If you are measuring queue delay with high-frequency one-way-delay measurements
you would then be able to advise better on what the consequences of a highly loaded link are.
We are running a research project on end-to-end quality and the enclosed image is yesterday's report on
queue size (h_ddelay) in ms. It shows stats on delays between some peers.
I would have looked at the trends on the involved links to see if an upgrade is necessary -
421 ms might be too much if it happens often.
I'm confident everyone (even the cheapest CFO) knows the consequences of congesting a link and choosing not to upgrade it.
Optical issues, dirty patch cords, faulty line cards and wrong configurations will most likely lead to packet loss. Link congestion due to insufficient bandwidth will most certainly lead to packet loss.
Sure, but I guess the loss rate depends on the nature of the traffic.
It's great to monitor packet loss, latency, pps, etc. But packet loss at 10% link utilization is not a foreign occurrence. No amount of bandwidth upgrades will fix that.
I guess that having more reports would support the judgements better. A basic question is: what is the effect on the perceived quality of the customers? And the relation between that and the 5-minute load is not known to me. Actually, one good indicator of the congestion loss rate is of course the SNMP OutputDiscards counter. Curves for queueing delay, link load and discard rate are surprisingly different. Regards, Olav
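A minimal sketch of watching the counter Olav mentions, IF-MIB ifOutDiscards (.1.3.6.1.2.1.2.2.1.19), using the net-snmp command-line tools from Python. The hostname, community string and ifIndex are placeholders, and this is not a description of anyone's actual monitoring system:

```python
import subprocess
import time

HOST, COMMUNITY, IFINDEX = "router1.example.net", "public", 17   # placeholders
OID = f".1.3.6.1.2.1.2.2.1.19.{IFINDEX}"      # ifOutDiscards for that interface

def out_discards():
    # -Oqv makes snmpget print only the value, so it parses as a bare integer
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, OID],
        capture_output=True, text=True, check=True)
    return int(result.stdout.strip())

prev = out_discards()
while True:
    time.sleep(300)                            # same 5-minute cadence as the graphs
    curr = out_discards()
    if curr > prev:
        print(f"{HOST} ifIndex {IFINDEX}: {curr - prev} output discards in 5 minutes")
    prev = curr
```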
On 13/Aug/20 13:44, Olav Kvittem wrote:
Sure, but I guess the loss rate depends on the nature of the traffic.
Packet loss is packet loss. Some applications are more sensitive to it (live video, live voice, for example), while others are less so. However, packet loss always manifests badly if left unchecked.
I guess that having more reports would support the judgements better.
For sure, yes. Any decent NMS can provide a number of data points so you aren't shooting in the dark.
A basic question is: what is the effect on the perceived quality of the customers?
Depends on the application. Gamers tend to complain the most, so that's a great indicator. Some customers that think bandwidth solves all problems will perceive their inability to attain their advertised contract as a problem, if packet loss is in the way. Generally, other bad things, including unruly human beings :-).
And the relation between that and the 5-minute load is not known to me.
For troubleshooting, being able to have a tighter resolution is more important. 5-minute averages are for day-to-day operations, and long-term planning.
Actually, one good indicator of the congestion loss rate is of course the SNMP OutputDiscards counter.
Curves for queueing delay, link load and discard rate are surprisingly different.
Yes, that then gets into the guts of the router hardware and its design. In such cases, your 100Gbps link is peaking and causing packet loss because you haven't understood that the forwarding chip behind it is only good for 60Gbps, for example. Mark.
Is it possible to do and is anyone monitoring metrics such as max queue length in 5 minutes intervals? Might be a better metric than average load in 5 minutes intervals. Regards Baldur
I suppose it would depend on if your hardware has an OID for what you want to monitor. -- Mike Hammett, Intelligent Computing Solutions
I expect my hardware does not have such a metric, but maybe it should. Max queue length tells us how full the link is with respect to microbursts.
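Whether or not the hardware exposes a queue-depth OID, Baldur's point can be partially addressed in the collector: when rolling high-frequency samples up into 5-minute bins, keep the per-bin maximum alongside the average so microbursts survive the downsampling. A sketch with synthetic one-second utilization samples; the input format is an assumption, not a vendor feature:

```python
def rollup(samples, per_bin=300):
    """samples: one reading per second; returns (average, maximum) per 5-minute bin."""
    bins = []
    for i in range(0, len(samples), per_bin):
        chunk = samples[i:i + per_bin]
        bins.append((sum(chunk) / len(chunk), max(chunk)))
    return bins

# A quiet link with one 3-second microburst: the average hides it, the max does not.
per_second = [5.0] * 300
per_second[100:103] = [98.0, 99.0, 97.0]
avg, peak = rollup(per_second)[0]
print(f"avg {avg:.1f}%  max {peak:.1f}%")      # avg ~5.9%, max 99.0%
```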
It is possible to gather a lot of information about buffers and queues, at least with the vendors we work with. That can be very helpful in a lot of ways. :) On Thu, Aug 13, 2020 at 9:21 AM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
Is it possible to do and is anyone monitoring metrics such as max queue length in 5 minutes intervals? Might be a better metric than average load in 5 minutes intervals.
Regards
Baldur
On Thu, Aug 13, 2020, at 12:31, Mark Tinka wrote:
I'm confident everyone (even the cheapest CFO) knows the consequences of congesting a link and choosing not to upgrade it.
I think you're over-confident.
It's great to monitor packet loss, latency, pps, etc. But packet loss at 10% link utilization is not a foreign occurrence. No amount of bandwidth upgrades will fix that.
That, plus the fact that by the time delay becomes an indication of congestion, it's way too late to start an upgrade. That event should not occur.
On 15/Aug/20 01:45, Radu-Adrian Feurdean wrote:
I think you're over-confident.
If you can resist the "let me make a plan" offer that CFOs would want you to give them, you can be confident :-). Because when it hits the fan, the CFO will say, "But Feurdean said he would make a plan. If he thought the situation was urgent, he didn't make it known clearly enough". Better to say, "CFO, if you don't do this upgrade, the network breaks", and walk away. Don't accept risk on behalf of someone else, because at the end of the day, no one will blame the network... only those who operate it. Mark.
Wouldn't it be better to measure the basic performance like packet drop rates and queue sizes ?
Those values should be a standard part of monitoring and data collection, but whether they happen to MATTER or not in a given situation very much depends. The traffic profile traversing the link may be such that the observed drop % and buffer depths are acceptable for that traffic, and there is no need for further tuning or changes. In other scenarios it may not be, in which case either network or application adjustments are warranted. There is rarely a one-size-fits-all answer when it comes to these things.
There is rarely a one-size-fits-all answer when it comes to these things.
Absolutely true: every application has characteristic QoS parameters. Unfortunately, it seems that 5-minute averages of data rates through links are the one-size-fits-all answer ... which doesn't fit all. Etienne
-- Ing. Etienne-Victor Depasquale Assistant Lecturer Department of Communications & Computer Engineering Faculty of Information & Communication Technology University of Malta Web. https://www.um.edu.mt/profile/etiennedepasquale
On Wed, Aug 12, 2020 at 12:33 AM Hank Nussbacher <hank@interall.co.il> wrote:
At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?
Hi Hank, As others have noted, the answer is rarely that simple. First, what is your consumption? 90th or 95th percentile, usually; after all, 100% between 9 and 5 is 100%, not 33%, but 100% for two minutes is not 100%. It gets more complicated if any kind of QoS is in play, because capacity-wise QoS essentially gives you not a single fixed-speed line but many interdependent variable-speed lines.
Next, capacity is not the only question. Here are some of the other factors:
1) A residential customer on the cheapest plan does not merit as clean a channel as a high-paying business customer you'd like to keep milking.
2) Upgrades can take months of planning, so the capacity now is beside the point. You'll use your best-guess projection for the capacity at the time an upgrade can be complete.
3) Some upgrades tend to be significantly more expensive than others, lit service to dark fiber for example. It's pretty ordinary to run closer to the limit before making an expensive upgrade than a modest one.
4) A dirty link merits replacement sooner than a clean one. If the higher-capacity service also clears up packet loss, you'll want to trigger the decision at a lower consumption threshold.
5) Switching a single path to two paths is more valuable than switching two paths to three. It has priority at a lower level of consumption.
Regards,
Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
On Wed, Aug 12, 2020, at 09:31, Hank Nussbacher wrote:
At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?
Some reflections about link capacity:
- At 90% and over, you should panic.
- Between 80% and 90%, you should be (very) scared.
- Between 70% and 80%, you should be worried.
- Between 60% and 70%, you should seriously consider speeding up the upgrades that you effectively started at 50%, and started planning at 40%.
Of course, that differs from one ISP to another. Some only upgrade after several months with at least 4 hours a day, every day (or almost), at over 95%. Others deploy 10x the expected capacity and upgrade well before 40%.
Beyond a pure percentage, you might want to account for how long you can stay below a certain threshold. If you want to keep a certain link's 95th-percentile peaks below 70%, then first get an understanding of your traffic growth and try to project when you will reach that number. You have to decide whether you care about the occasional peak, or the consistent peak, or somewhere in between, like weekdays vs weekends, etc. Now you know how much lead time you will have.
Then consider how long it will take you to upgrade that link. If it's a matter of adding a couple of crossconnects, then you might just need a week. If you have to ship and install optics, modules, or a card, then add another week. If you have to get a sales order signed by senior management, add another week. If you have to put it through legal and finance, add a month. (kidding) If you are doing your annual re-negotiation, well... good luck.
It's always good to ask your circuit vendors what the lead times are, then double it and add 5.
And sometimes, if you need a low-latency connection, traffic utilization levels might not even be something you look at.
Louie
Peering Coordinator at a start-up ISP
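Louie's lead-time arithmetic can be reduced to a couple of lines: estimate how many months of runway remain before the peak crosses the target ceiling at the observed growth rate, and compare that with how long the upgrade will take. All figures below are invented examples:

```python
import math

def months_until(current_pct, ceiling_pct, monthly_growth):
    """Months before current_pct compounds past ceiling_pct."""
    return math.log(ceiling_pct / current_pct) / math.log(1 + monthly_growth)

current_peak, ceiling, growth = 48.0, 70.0, 0.04   # 48% today, growing 4% per month
lead_time_months = 3                               # quotes, legal, delivery, turn-up

runway = months_until(current_peak, ceiling, growth)
print(f"~{runway:.1f} months of runway vs {lead_time_months} months of lead time")
if runway <= lead_time_months:
    print("start the upgrade now")
```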
I've seen the weekly profiles of traffic sourced from caches for the major global services (video, social media, search and general) for a specific metro area. For all services, the weekly profile is a repetition of the daily profile, within +/- 20%. That is: the weekly profile is obtained from the daily profile, within +/- 20% of the average daily profile height. Given this regularity, as suggested by Louie Lee, it seems that growth projections are meaningful. That is, the weekly profile data seem to provide a sound empirical basis for link upgrades. Since I'm not an operator, my comments need to be sprinkled with a pinch of salt :) Cheers, Etienne
-- Ing. Etienne-Victor Depasquale Assistant Lecturer Department of Communications & Computer Engineering Faculty of Information & Communication Technology University of Malta Web. https://www.um.edu.mt/profile/etiennedepasquale
No plan survives contact with the enemy. Your carefully made growth projection was fine until the brass made a deal with some major customer, which caused a traffic spike. Or any of the infinite other events that could, and eventually will, happen to you. One hard thing, that almost everyone will get wrong at some point, is simulating load in the event multiple outages take some links out, causing excessive traffic to reroute onto links that previously seemed fine. Regards, Baldur
On Sat, Aug 15, 2020, at 11:35, Baldur Norddahl wrote:
No plan survives contact with the enemy. Your carefully made growth projection was fine until the brass made a deal with some major customer, which caused a traffic spike.
Capacity planning also includes keeping an eye on what is being sold and what is being prepared. Having traffic more than double within a 48-hour timespan (peaks of N Gbps until day X, peaks of 2.5*N Gbps after day X+2) was handled successfully because the correct information ("partner X will change delivery system") arrived 4 months in advance. If, on the other hand, you keep selling 200 Mbps and 500 Mbps connections over an already-used 1 Gbps port while pretending that "everything's gonna be alright", then you should confront your enemy.
Or any infinite other events that could and eventually will happen to you.
Among which you try to protect yourself against the most realistic ones.
One hard thing, that almost everyone will get wrong at some point, is simulating load in the event multiple outages take some links out, causing excessive traffic to reroute onto links that previously seemed fine.
You should scale the network to absorb a certain degree of "surprise"/damage, and clearly explain that beyond that level, service will be degraded (or even absent) and there is nothing that can, or will, be done immediately. Every network fails at a certain moment in time. You just need to make sure you know how to get it working again within a reasonable time frame. Or have a good run-away plan (sometimes this is the best solution).
+1. You can't foresee everything, but no plan means foreseeing nothing; that's flying blindfolded. Cheers, Etienne
-- Ing. Etienne-Victor Depasquale Assistant Lecturer Department of Communications & Computer Engineering Faculty of Information & Communication Technology University of Malta Web. https://www.um.edu.mt/profile/etiennedepasquale
On 15/Aug/20 12:32, Etienne-Victor Depasquale wrote:
+1
You can't foresee everything, but no plan means foreseeing nothing, = blindfold.
In the absence of guidance from your Sales team on a forecast, keep the 50% threshold trigger, and standardize on lead times if urgent feasibilities don't immediately pass. The more you do this, the more you will encourage better planning on the Sales side; it just happens automatically. The worst thing you can do for yourself and your team is to try to be the hero. Mark.
On 15/Aug/20 11:35, Baldur Norddahl wrote:
No plan survives contact with the enemy. Your carefully made growth projection was fine until the brass made a deal with some major customer, which caused a traffic spike. Or any of the infinite other events that could, and eventually will, happen to you.
That's why your operations teams cannot work separately from the Sales teams. If a big deal is in the pipeline, there should be someone operational to do a simple feasibility check to see if the segment in question will handle the traffic. If not, defer to standard lead times to deliver. Or even extended ones if the deal is larger than usual.
One hard thing, that almost everyone will get wrong at some point, is simulating load in the event multiple outages take some links out, causing excessive traffic to reroute onto links that previously seemed fine.
So rather than simulate, insure, I say. By insure, I mean upgrade each and every backbone link when it hits 50%, and you'll have less to worry about when things start crumbling all over the place. Mark.
On 15/Aug/20 10:47, Etienne-Victor Depasquale wrote:
I've seen the weekly profiles of traffic sourced from caches for the major global services (video, social media, search and general) for a specific metro area.
For all services, the weekly profile is a repetition of the daily profile, within +/- 20%. That is: the weekly profile is obtained from the daily profile within +/- 20% of the average daily profile height.
Given this regularity, as suggested by Louie Lee, it seems that growth projections are meaningful. That is, the weekly profile data seem to provide a sound empirical basis for link upgrades.
Since I'm not an operator, my comments need to be sprinkled with a pinch of salt :)
Provided your NMS has been stable over a reasonable period of time, you can extract historical data over 1 year or more and see how linearly things grew. It's sometimes difficult to see the growth rate when you are close to the daily action. Mark.
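One simple way to do what Mark describes, pulling a year of history out of the NMS and looking at the underlying trend rather than the daily noise, is a straight-line fit over daily peaks. The history below is synthetic; a real analysis would of course use the operator's own data:

```python
import numpy as np

days = np.arange(365)
daily_peak = 35 + 0.05 * days + np.random.normal(0, 3, size=365)   # synthetic history

slope, intercept = np.polyfit(days, daily_peak, 1)       # linear trend fit
print(f"growth: ~{slope * 30:.1f} percentage points of utilization per month")
print(f"projected peak six months out: {intercept + slope * (365 + 180):.0f}%")
```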
On Sat, Aug 15, 2020, at 02:39, Louie Lee wrote:
get an understanding of your traffic growth and try to project when you will reach that number. You have to decide whether you care about the occasional peak, or the consistent peak, or somewhere in between, like weekday vs weekends, etc. Now you know how much lead time you will have.
Get an understanding, and try to make a plan on the longer term (like 2-3 years) if you can. If you're reaching some important milestones (e.g need to buy expensive hardware), make a presentation for the management. You will definitely need adjustments, during the timespan covered (some things will need to be done sooner, others may leave you some extra time) but it should reduce the amount of surprise. That is valid if you have visibility. If you don't (that may happen), the cheatsheet I described previously is a good start. It could be applied at $job[-1], where I applied it to grow the network from almost zero to 35 Gbps, and it is kind of applied at $job[$now] where long term visibility is kind of missing and we need to be ready for rapid capacity variations.
And sometimes, if you need a low latency connection, traffic utilization levels might not even be something you look at.
This goes back to the "understand your traffic" chapter. All the traffic (since sometimes there may be a mix, e.g. regular eyeball traffic + voice traffic).
Participants (16)
- Baldur Norddahl
- Daniel
- Etienne-Victor Depasquale
- Hank Nussbacher
- Louie Lee
- m.Taichi
- Mark Tinka
- Mike Hammett
- Nick Hilliard
- Olav Kvittem
- Radu-Adrian Feurdean
- Saku Ytti
- Simon Leinen
- Ted Hatfield
- Tom Beecher
- William Herrin