10G switch drops traffic for a split second
I recently upgraded my core network from 1G to 10G, and since the upgrade I have noticed that during peak traffic (1500 Mbps, 100,000 pps) my 10G switch seems to drop traffic for a split second across all ports and all VLANs. I immediately replaced the switch with a different brand/model and the problem persists.
Sometimes traffic drops to zero, other times it drops to 50%. The problem is very random but seems to occur much more frequently during high PPS (pushing high traffic with iperf does not induce the problem).
Could this be MTU? I've tried flow control, hard-coding duplex, STP on/off, etc.
I'm at a loss. Any ideas?
TJ Trout Volt Broadband
What model switch? What's the config look like, all L2 or L3 as well?
Luke Guillory Network Operations Manager Tel: 985.536.1212 Fax: 985.536.0300 Email: lguillory@reservetele.com Reserve Telecommunications 100 RTC Dr Reserve, LA 70084
Without more detail, I'm grasping at straws here, but see this recent thread about QoS and microbursts on the juniper-nsp list: https://puck.nether.net/pipermail/juniper-nsp/2016-November/033692.html
Do you have ports with different speeds connected?
Another idea: Are you using Spanning Tree Protocol and seeing lots of TCNs?
If you have congestion on outgoing interfaces, you are most likely running out of packet buffer space on your switch. Campus-class switches especially have small buffers, 4 MB or so, which can run out during high bursts and interface congestion.
With some switches you could alleviate the problem by moving congested interfaces to ports with a separate buffer pool, but you have to check with your switch vendor or the documentation whether your switches have shared or split buffer pools.
Or just replace your switches with ones that have deeper buffers.
Tomi
On Tue, 29 Nov 2016, TJ Trout wrote:
Could this be MTU? I've tried flow control, hard code duplex, stp on/off etc
As others have pointed out, you probably have a switch with small buffers. If you also have flow control and you have something that triggers flow control to turn off packet forwarding, your small-buffer-switch might fill up all (shared) buffers on that port and now you're dropping traffic to all ports. So trying to find if you have something where flow control is enabled and is being triggered might be something worthwhile to do, and also perhaps just turn off flow control on all ports to make sure. -- Mikael Abrahamsson email: swmike@swm.pp.se
Luke;
All L2, no L3. Only 4 VLANs. 2 peers trunked to a router, which trunks back to 2 devices (microwave backhauls).
Chuck;
All ports are 10G, except the 2 peers, which are 1G and trunk back to a 10G port for the router WAN.
No TCNs.
Brian;
I have tried an IBM G8124 and a Ubiquiti ES-16-XG; both show the same exact drops across all ports, which makes me think it's a config issue. MTU, FC, something.
Andrew;
I have tried with FC disabled, but I will try that one more time.
Mikael;
Is it possible to overrun the buffers of a 320 Gbps backplane switch with only 1.5 Gbps of traffic? I think the switch is rated for 140M PPS and I'm only pushing 100k PPS.
Yes, it is absolutely possible to overrun the buffers. Any kind of backpressure (FC) from hosts, or 10G->1G transitions, can easily cause it. Even if you're not over 1G when averaged over a 10s window, if the 10G sender attempts to send too many frames back to back (like, say, sendfile() API type calls), BOOM, you're dropping frames in the switch.
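To put rough numbers on the 10G->1G case described above, here is a minimal back-of-the-envelope sketch in Python. The 4 MB buffer is the ballpark figure mentioned earlier in the thread for campus-class switches; the burst sizes and the function names `peak_queue_bytes` and `dropped_bytes` are illustrative assumptions, not measurements from the network in question. The model ignores framing overhead and assumes the whole buffer is available to the one congested egress port.

```python
def peak_queue_bytes(burst_bytes, in_bps, out_bps):
    """Peak egress queue depth for a burst arriving at in_bps and draining at out_bps."""
    if in_bps <= out_bps:
        return 0.0  # the egress port keeps up, so nothing queues
    # While the burst is arriving, the queue grows at (in_bps - out_bps), so the
    # peak depth is the fraction of the burst that the slow port could not drain.
    return burst_bytes * (in_bps - out_bps) / in_bps


def dropped_bytes(burst_bytes, in_bps, out_bps, buffer_bytes):
    """Bytes lost when the peak queue depth exceeds the available buffer."""
    return max(0.0, peak_queue_bytes(burst_bytes, in_bps, out_bps) - buffer_bytes)


if __name__ == "__main__":
    TEN_G, ONE_G = 10e9, 1e9
    BUFFER = 4_000_000  # assumed: 4 MB, all of it usable by one port
    for burst in (2_000_000, 4_000_000, 8_000_000):  # hypothetical burst sizes
        peak = peak_queue_bytes(burst, TEN_G, ONE_G)
        lost = dropped_bytes(burst, TEN_G, ONE_G, BUFFER)
        print(f"burst={burst} B  peak queue={peak:.0f} B  dropped={lost:.0f} B")
```

Under these assumptions an 8 MB burst takes only 6.4 ms to arrive at 10 Gbps and averages out to 64 Mbps over a full second, yet it overflows the buffer by roughly 3.2 MB. That is why a link can look nearly idle on a graph and still drop frames.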
I plan on disabling FC on everything tonight. I've done that before, but I want to be sure.
Is there anything that can be done about the transition from the 2 x 1G peers trunking to the 10G router? Should I be rate limiting the VLAN for the peers at 1G so the 10G router isn't trying to send more than 1G?
This thread reminded me of a blog post that struck me as useful 5 years ago, and again today. Measuring throughput, when dealing with buffers and troubleshooting errors and packet loss, must be done at a sub-one-second sampling rate.
http://blog.serverfault.com/2011/06/27/per-second-measurements-dont-cut-it/
Beckman
Peter Beckman Internet Guy beckman@angryox.com http://www.angryox.com/
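A small Python sketch of why the sampling interval matters. It takes per-interval byte counts (as you might derive from interface counter deltas) and compares the average rate over the whole window with the peak rate in any single interval. The sample data and the function names `rates_mbps` and `summarize` are made up for illustration; how you actually collect counters at this granularity is platform specific and not shown.

```python
def rates_mbps(byte_counts, interval_s):
    """Convert per-interval byte counts into per-interval rates in Mbps."""
    return [count * 8 / interval_s / 1e6 for count in byte_counts]


def summarize(byte_counts, interval_s):
    """Return (average Mbps over the whole window, peak Mbps in any one interval)."""
    window_s = interval_s * len(byte_counts)
    average = sum(byte_counts) * 8 / window_s / 1e6
    return average, max(rates_mbps(byte_counts, interval_s))


if __name__ == "__main__":
    # Hypothetical one-second window sampled every 10 ms:
    # 99 quiet intervals plus a single interval containing a burst.
    samples = [100_000] * 99 + [1_250_000]
    average, peak = summarize(samples, 0.01)
    print(f"1 s average: {average:.1f} Mbps, worst 10 ms interval: {peak:.1f} Mbps")
```

With this made-up data the one-second average comes out at about 89 Mbps, while the single burst interval runs at about 1000 Mbps, which is line rate on a 1G port. A poller that only reports per-second or per-minute averages would never show the burst.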
Yeah, you also have to look for not-so-obvious things like MAC Pause frames sent/received, QoS counters, all sorts of VERY platform-specific stuff. Right royal pain, especially since some platforms do not expose these statistics at all.
-- "Genius might be described as a supreme capacity for getting its possessors into trouble of all kinds." -- Samuel Butler
Here is the video from Facebook on Monitoring, managing and troubleshooting large scale networks they did last year on the subject as well. https://www.youtube.com/watch?v=BRY9xwg5nAU
Luke Guillory
On Tue, 29 Nov 2016, TJ Trout wrote:
Is it possible to over run the buffers of a 320gbps backplane switch with only 1.5gbps traffic? I think the switch is rated for 140m PPS and I'm only pushing 100k PPS
If your switch is the typical small-buffered switch that has become more and more common over the past few years, then the entire switch might only have enough buffer to hold packets for 0.1ms or less. So if some device signals "stop sending" via flow control for 0.1ms, then depending on the implementation you might start seeing packet drops on all ports until that device allows traffic to flow again.
-- Mikael Abrahamsson email: swmike@swm.pp.se
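As a sanity check on the order of magnitude, the hold time of a buffer is just its size divided by the rate at which it fills. The sketch below is an illustration only: it combines the roughly 4 MB buffer size and the 320 Gbps backplane figure that appear elsewhere in this thread, which is an assumption on my part and not necessarily how the 0.1ms figure above was derived. The function name `buffer_hold_time_ms` is hypothetical.

```python
def buffer_hold_time_ms(buffer_bytes, fill_rate_bps):
    """Milliseconds of traffic a buffer can absorb when filling at fill_rate_bps."""
    return buffer_bytes * 8 / fill_rate_bps * 1e3


if __name__ == "__main__":
    BUFFER = 4_000_000  # assumed: about 4 MB of shared packet buffer
    for label, rate in (("320 Gbps (full backplane)", 320e9),
                        ("10 Gbps (one port at line rate)", 10e9),
                        ("1.5 Gbps (reported peak load)", 1.5e9)):
        print(f"{label}: {buffer_hold_time_ms(BUFFER, rate):.2f} ms")
```

With these assumed numbers the buffer holds about 0.1 ms of traffic at full backplane rate, about 3.2 ms of a single 10G port at line rate, and about 21 ms at the 1.5 Gbps load reported in the original post. In every case that is far below one second, so a brief stall on one egress port is enough to exhaust a shared pool.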
I always disabled flow control on the theory that VoIP & flow control are incompatible. just out of curiosity - anyone have it enabled? if so, why? Lee
Generally speaking, allowing any ethernet switch to *send* PAUSE frames is a very bad idea, causing external head-of-line blocking and congestion spreading.
OTOH, a decent use case for flow control is subrate services. For example, a 622 Mbps microwave link with gigabit ethernet interfaces ultimately needs to use flow control to properly inform the connected equipment that this is only a "622M ethernet" link and not a gigabit one.
M.
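For the subrate case, a rough Python sketch of the arithmetic involved. It relies on two facts from the IEEE 802.3 PAUSE definition: the pause time is a 16-bit count of quanta, and one quantum is 512 bit times. Everything else (the function names, and treating the link as simply "paused" or "sending at line rate") is a simplifying assumption for illustration, not a description of how any particular microwave radio behaves.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum is 512 bit times
MAX_PAUSE_QUANTA = 0xFFFF  # pause_time is a 16-bit field


def max_pause_seconds(link_bps):
    """Longest pause a single PAUSE frame can request on a link of the given speed."""
    return MAX_PAUSE_QUANTA * PAUSE_QUANTUM_BITS / link_bps


def paused_fraction(link_bps, service_bps):
    """Share of time the sender must stay paused to average service_bps."""
    return 1 - service_bps / link_bps


if __name__ == "__main__":
    GIGE = 1e9
    print(f"max single pause on GigE: {max_pause_seconds(GIGE) * 1e3:.2f} ms")
    print(f"paused share for 622M over GigE: {paused_fraction(GIGE, 622e6):.1%}")
```

On gigabit ethernet a single PAUSE frame can hold off the sender for at most about 33.6 ms, and to average 622 Mbps the attached equipment has to be held off roughly 38% of the time. So the radio keeps sending PAUSE frames continuously under load, which is acceptable toward a router with deep queues but is exactly the behaviour that spreads congestion when it lands on a small-buffered switch.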
participants (9)
- Chuck Anderson
- Lee
- Luke Guillory
- Marian Ďurkovič
- Michael Loftis
- Mikael Abrahamsson
- Peter Beckman
- TJ Trout
- Tomi Hakala