Latency/Packet Loss on ASR1006
Hi,

We have an ASR1006 that has the following cards:

1 x ESP40
1 x SIP40
4 x SPA-1X10GE-L-V2
1 x 6TGE
1 x RP2

We've been having latency and packet loss during peak periods. Everything is fine until we reach 50% utilization in the output of 'show platform hardware qfp active datapath utilization summary'. Literally: at 47% all good, at 48% all good, at 49% latency to the next hop goes from 1 ms to 15-20 ms, at 50% we see 1-2% packet loss and 30-40 ms latency, and at 53% we see 60-70 ms latency and 8-10% packet loss.

Is this expected? Can the ESP40 only really push 20G before it starts to have performance issues?

--- Colin Legendre
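For anyone chasing the same symptom, the usual first stop is the QFP counters themselves. These are read-only IOS-XE show commands on the ASR1000 (exact output fields vary by release, so treat the annotations as approximate):

  show platform hardware qfp active datapath utilization summary
  show platform hardware qfp active statistics drop

The first shows per-direction packets/bits per second plus the processing load percentage quoted above; the second lists global QFP drop counters by cause (TailDrop, Ipv4NoRoute, and so on), which helps distinguish "the QFP is out of cycles" from "a specific feature is discarding packets".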
https://www.cisco.com/c/en/us/support/docs/routers/asr-1000-series-aggregati...

It has been many years since I used an ASR1000, but honestly, you have an ESP40 in a box with 10x10G interfaces? That's a very underpowered processor for that job. The ESP40 was designed for a box that would have 1G interfaces and perhaps a couple of 10s. The ASR1000 is a CPU-based box, everything goes back to the processor, and remember that Cisco math means half duplex, not full.
On Fri, 26 Nov 2021 at 21:37, Tony Wicks <tony@wicks.co.nz> wrote:
> The ASR1000 is a CPU based box, everything goes back to the processor and remember cisco math means half duplex not full.
I'm not sure what a CPU-based box means here. The ASR1k isn't using general-purpose cores like PQ3, Intel, or AMD. Like the CRS-1 and nPower, the ASR1k has Cisco-made forwarding logic using cores from Tensilica (CPP10/popey, I believe, was 40 x Tensilica DI 570T; the next iteration was 64 cores). -- ++ytti
I mean a router without ASIC-based forwarding, like a Juniper MX or Nokia 7750. The advantage of the 1k is that you don't need a services card for CGNAT, but the large disadvantage is that everything passes through the ESP processor, and this often leads to disappointing results under load.
On Sat, 27 Nov 2021 at 13:32, Tony Wicks <tony@wicks.co.nz> wrote:
> I mean a router without ASIC based forwarding like a Juniper MX or Nokia 7750.
I think the ASR1k NPU is a perfect analogue of the Juniper MX Trio or the Nokia 7750 FP; they all fall under the very common description of an NPU. We could dive deep and explain why the 7750 and MX are vastly different, in their decision to use many small cores versus a few large ones, but ultimately they all easily fall under the NPU definition. -- ++ytti
Hi,

we see similar problems on an ASR1006-X with ESP100 and MIP100. At about ~45 Gbit/s of traffic (on ~30k PPPoE sessions and ~700k CGN sessions) the QFP utilization skyrockets from ~45% straight to ~95% :( I don't know if it's the CGN sessions or the traffic/packets causing the load increase; the datasheet says it supports something like 10M sessions, but maybe not if you really intend to push packets through it? We have not seen such spikes with far higher pps but a lower CGN session count, when we had DDoS attacks against end customers.

Fiona
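To separate the two suspects (session count versus packet rate), these are the counters worth watching while the spike happens. Command names are from memory and vary somewhat by IOS-XE release, so verify them against your box:

  show pppoe summary
  show ip nat translations total
  show platform hardware qfp active feature nat datapath stats

The first two give session counts; the last one (where supported) shows NAT work at the QFP level, which correlates better with forwarding load than the control-plane totals do.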
In the past we had packet loss issues due to the SIP's PLIM buffer. The following docs may provide some guidance:

https://www.cisco.com/c/en/us/support/docs/routers/asr-1000-series-aggregati...
https://www.cisco.com/c/en/us/td/docs/interfaces_modules/shared_port_adapter...

-- Tassos
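A quick way to tell ingress (PLIM/SPA) trouble apart from QFP exhaustion is to look for overruns and climbing error counters on the physical interfaces, for example (interface name illustrative):

  show interfaces TenGigabitEthernet0/0/0 | include rate|errors|overrun

Overruns that grow while QFP utilization is still moderate point at the PLIM buffering and oversubscription issues the documents above describe, rather than at the ESP running out of cycles.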
Thanks, will look into this.

--- Colin Legendre
On 11/26/2021 1:09 PM, Colin Legendre wrote:
> Is this expected... the ESP40 can only really push 20G and then starts to have performance issues?
I haven't experienced that across about a dozen ASR 1ks, though I just checked and we are not pushing any of our ESPs over 50% currently (the closest we have is an ESP40 doing 18 Gbps). However, I'm pretty sure we've pushed older ESPs (5s, 10s, and 20s) to ~75% or so in the past.

Given the components you have, I would have expected your router to handle 40 Gbps input and 40 Gbps output. That could either be 40 Gbps into the 6-port card [and 40 Gbps out of the four 1-port cards], or 40 Gbps of input spread across the 6-port and 1-port cards [that is then output across both cards as well].

Despite other comments, I think your components are well matched. The only non-obvious thing here is that the 6-port card only has a ~40 Gbps connection to the backplane, so you cannot use all six ports at full bandwidth. I think this router is well suited to handle 20-30 Gbps of customer demand doing standard destination-based routing (if you're doing traffic shaping, NAT, tunnelling, or something else more involved than extended ACLs, you may need something beefier at those traffic levels).
That's what I thought. Our total inbound bandwidth from upstreams is about 20G at max, so that really is the total bandwidth. We are also terminating about 1800 PPPoE sessions on the router, with policing set on them, as well as shaping on a couple of our major downstream links.

Is anyone interested in making a few $ and taking a look for us, to see if we are really hitting capacity, or whether some sort of tuning could be done to help us eke out a little bit more from this device before upgrading?

--- Colin Legendre
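Worth noting: per-session policing and egress shaping are both QFP features, so every one of those 1800 sessions adds per-packet work. A minimal sketch of the general kind of MQC policy involved, with hypothetical names and rates (this is generic IOS-XE QoS syntax, not the actual config from this router):

  policy-map POLICE-100M
   class class-default
    police cir 100000000
  !
  policy-map SHAPE-2G
   class class-default
    shape average 2000000000
  !
  interface TenGigabitEthernet0/1/0
   service-policy output SHAPE-2G

'show policy-map interface <name>' then shows per-policy drop counters, which is a cheap way to check whether some of the "loss" is actually the shaper or policer doing its job before blaming the ESP.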
Suggestion: move this thread to cisco-nsp, where you might find more assistance.

Regards,
Hank

On 07/12/2021 17:32, Blake Hudson wrote:
> Despite other comments, I think your components are well matched.
We had a similar issue about 4 years ago. We were seeing packet loss and drops getting progressively worse, and the router was falling over when reaching about 70% of usage. We could see the interface reliability go down, and input errors due to overruns on the interfaces. Cisco blamed it on microbursts that could not be handled under load:

"We were able to replicate this scenario in our lab as well. QFP under high load generated input errors and overruns which in turn led to unicast failures/ drops/ latency. The issue is not consistent with QFP % utilization as sometimes with even 80%+ traffic, we do not see the drops."

They recommended removing traffic or upgrading the ESP. One of our guys disabled NBAR on the router and the problem disappeared. I would suggest taking a look at what features you are using and, if you can, trying to disable them to see if any makes an impact. We then upgraded ESPs and all has been fine since.

Brian
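For anyone wanting to run the same check: NBAR protocol discovery is enabled per interface, so it is easy both to find and to switch off (interface name illustrative; confirm syntax on your release):

  show ip nbar protocol-discovery

to see whether, and where, discovery is running, and

  interface TenGigabitEthernet0/0/0
   no ip nbar protocol-discovery

to disable it on a given interface.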
Thanks for this. Turned off netflow export and it dropped our QFP load from 44% to 18%. Ugh.

--- Colin Legendre
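That drop is consistent with unsampled NetFlow being processed as QFP feature work. The comparison is easy to reproduce with the command from the top of the thread, run before and after removing the flow configuration:

  show platform hardware qfp active datapath utilization summary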
NBAR was not enabled, just netflow export, and that was enough.

--- Colin Legendre
If you still need netflow to gain some visibility into what's happening, you could check the sampling rate of the netflow export. Usually 1/1000, i.e. 0.1%, is good; maybe for you even 1/1,000,000 could be good enough. If 100% was used, then indeed there are some real-time performance penalties. Not many people need an accurate 100% of netflow exports; if you need 100% accuracy, then you need dedicated hardware. 0%, or totally disabled, is also often very good enough if you don't need visibility. 😊

Netflow is useful in my opinion, but maybe not for every case.

Jean
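To make the sampling suggestion concrete, here is a minimal sketch of sampled Flexible NetFlow on IOS-XE. All names, the collector address, and the 1-in-1000 rate are illustrative, and older 'ip flow-export' style configurations differ; verify against your release:

  flow exporter FE-COLLECTOR
   destination 192.0.2.10
   transport udp 2055
  !
  flow monitor FM-SAMPLED
   exporter FE-COLLECTOR
   record netflow ipv4 original-input
  !
  sampler FS-1K
   mode random 1 out-of 1000
  !
  interface TenGigabitEthernet0/0/0
   ip flow monitor FM-SAMPLED sampler FS-1K input

With 1-in-1000 random sampling the QFP does flow accounting on only a fraction of packets, which is normally plenty for capacity-planning visibility while avoiding the kind of load Colin saw.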
participants (9): Blake Hudson, Brian Turnbow, Colin Legendre, Fiona Weber, Hank Nussbacher, Jean St-Laurent, Saku Ytti, Tassos, Tony Wicks