Linux Router: TCP slow, UDP fast
Hi All,

I'm losing the will to live with this networking headache! Please feel free to point me at a Linux list if NANOG isn't suitable. I'm at a loss where else to ask.

I've diagnosed some traffic oddities and, after lots of head-scratching, reading and trial and error, I can say with certainty that: with and without shaping, and over different bandwidth providers, using the e1000 driver for an Intel PRO/1000 MT Dual Port Gbps NIC (82546EB), I can replicate full, expected throughput with UDP but consistently get only 300kbps-600kbps throughput _per connection_ for outbound TCP (I couldn't find a tool I trusted to replicate ICMP traffic). Multiple connections are cumulative, each adding roughly 300kbps-600kbps. Inbound is slightly erratic at holding a consistent speed but manages 15Mbps as expected, a far cry from 300kbps-600kbps.

The router is a quad core sitting at no load and there's very little traffic being forwarded back and forth. The NIC's kernel parameters are at their 'built-in' defaults. NAPI is not enabled, though (enabling it requires a reboot, which is a problem as this box is in production).

The only other change to the box is that over Christmas iptables (ip_conntrack and its associated modules, mainly) was built into the kernel. There's no sign of packet loss on any tests and I upped the conntrack max connections suitably for the amount of RAM. Has anyone come across iptables without any rules loaded causing throughput issues?

I've also changed the following kernel parameters with no luck:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 2500
echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

It feels to me like a buffer limit is being reached per connection. The throughput spikes at around 1.54Mbps and then TCP backs off to about 300kbps-600kbps. What am I missing? Is NAPI that essential for such low traffic? A very similar build moved far higher throughput on cheap NICs. MTU is 1500, txqueuelen is 1000.

Any help would be massively appreciated!

Chris
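For reference, the tunables and conntrack sizing mentioned above can be checked in one pass rather than poking /proc entries individually; a minimal sketch, assuming a 2.6-era kernel with the sysctl utility installed (nothing here is specific to this box):

  # Show the current values of the tunables mentioned in the post
  sysctl net.core.rmem_max net.core.wmem_max \
         net.ipv4.tcp_rmem net.ipv4.tcp_wmem \
         net.ipv4.tcp_no_metrics_save net.core.netdev_max_backlog \
         net.ipv4.tcp_window_scaling

  # Set a value at runtime (equivalent to echoing into the /proc entry)
  sysctl -w net.core.netdev_max_backlog=2500

  # Confirm the conntrack table limit actually took effect
  sysctl -a 2>/dev/null | grep conntrack_max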
Try enabling window scaling:

echo 1 > /proc/sys/net/ipv4/tcp_window_scaling

or, if you really want it disabled, configure a larger minimum window size:

net.ipv4.tcp_rmem = 64240 87380 16777216

HTH,
Lee
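To make either change survive a reboot it needs to go into /etc/sysctl.conf as well; a sketch, assuming a stock sysctl setup:

  # Re-enable window scaling immediately
  sysctl -w net.ipv4.tcp_window_scaling=1

  # Persist it (and/or the larger tcp_rmem minimum) across reboots
  echo "net.ipv4.tcp_window_scaling = 1" >> /etc/sysctl.conf
  sysctl -p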
On Sat, 14 Feb 2009, Lee wrote:
Try enabling window scaling:

echo 1 > /proc/sys/net/ipv4/tcp_window_scaling

or, if you really want it disabled, configure a larger minimum window size:

net.ipv4.tcp_rmem = 64240 87380 16777216
Without window scaling, you're limited to a 64k window size anyway. Chris, what is the round-trip delay between the machines involved in your TCP session?

--
Mikael Abrahamsson    email: swmike@swm.pp.se
I'm losing the will to live with this networking headache ! Please feel free to point me at a Linux list if NANOG isn't suitable. I'm at a loss where else to ask.
The linux-net mailing list might indeed be more appropriate.
With and without shaping and over different bandwidth providers using the e1000 driver for an Intel PRO/1000 MT Dual Port Gbps NIC (82546EB) I can replicate full, expected throughput with UDP but consistently only get 300kbps - 600kbps throughput _per connection_ for outbound TCP
I've seen this behavior as the result of duplex mismatches. (The TCP settings are end-system matters and do not affect how the router forwards traffic.)
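A duplex mismatch is quick to rule out from the Linux side; a sketch, with eth0/eth1 standing in for whatever the router's interfaces are actually called (the switch-port counters at the other end are worth checking too):

  # Negotiated speed, duplex and auto-negotiation state per interface
  ethtool eth0
  ethtool eth1

  # Errors that typically accompany a mismatch (late collisions, CRC/alignment errors)
  ethtool -S eth0 | egrep -i 'collision|crc|align|error'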
Thanks loads for the quick replies. I'll try and respond individually.

Lee > I recently disabled tcp_window_scaling and it didn't solve the problem. I don't know enough about it. Should I enable it again? The settings that differ from the defaults are copied in my first post.

Mike > Strangely, I'm not seeing any errors on either the ingress or egress NICs:

RX packets:3371200609 errors:0 dropped:0 overruns:0 frame:0
TX packets:3412500706 errors:0 dropped:0 overruns:0 carrier:0

The only errors I see anywhere are similar on both NICs. Both connect to the same model of switch with the same default config:

rx_long_byte_count: 1396158525465
rx_csum_offload_good: 3341342496
rx_csum_offload_errors: 89459

and it may be worth noting that flow control is on. Are these a reasonable level of pause frames to be seeing? They seem to be higher on non-routing boxes.

Total bytes (TX): 2466202288
Unicast packets (TX): 3436389971
Multicast packets (TX): 213310
Broadcast packets (TX): 4952902
Single Collision Frames (TX): 0
Late Collisions (TX): 0
Excessive Collisions (TX): 0
Transmitted Pause Frames (TX): 27806

Florian > They're running without obvious errors. Auto-negotiation has settled on 1Gbps, full duplex. Can auto-negotiation cause these symptoms, do you think?

Thanks again,

Chris
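For reference, the flow-control state and the pause/checksum counters quoted above come straight out of the driver; roughly the commands used to collect them, with eth0 as a placeholder for the interface in question:

  # Current pause-frame (flow control) settings
  ethtool -a eth0

  # Driver statistics, filtered for pause frames and checksum-offload errors
  ethtool -S eth0 | egrep -i 'pause|csum'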
Hello, Chris,

So, as it seems you have a problem with TCP and not UDP, maybe this is something to do with TCP segmentation offloading. It could be a total shot in the dark, but can you see what ethtool -k <devname> says? Then you can have a look at 'man ethtool' and turn the appropriate offloads on/off.
-- Best regards, Nickola
Thanks, Nickola. What's your opinion on these settings? Do you recommend switching off TCP segmentation offload?

Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on

Thanks again,

Chris
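If it does come to testing with the offloads disabled, they can be toggled at runtime with ethtool; a sketch, assuming eth0 (some drivers briefly reset the link when offload settings change, so a quiet moment is wise):

  # Show current offload settings
  ethtool -k eth0

  # Disable TCP segmentation offload (and generic segmentation offload) for a test
  ethtool -K eth0 tso off
  ethtool -K eth0 gso off

  # Put them back afterwards if it makes no difference
  ethtool -K eth0 tso on
  ethtool -K eth0 gso on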
On 2/14/09, Chris <chris@ghostbusters.co.uk> wrote:
Thanks loads for the quick replies. I'll try and respond individually. Lee > I recently disabled tcp_window_scaling and it didn't solve the problem. I don't know enough about it. Should I enable it again ? Settings differing from defaults are copied in my first post.
I don't know if the TCP window size makes any difference when the box is acting as a router. But when UDP works as expected and each additional TCP connection gets 300-600kbps, the first thing I'd look at is the window size. If it were a duplex mismatch, additional TCP connections would make things worse instead of each getting its own 300-600kbps of bandwidth.
The only other change to the box is that over Christmas IPtables (ip_conntrack and its associated modules mainly) was loaded into the kernel
If all else fails, backing out recent changes usually works :)

Regards,
Lee
Thanks very much, Lee. My head's whirring. Am I right in thinking that by turning on scaling (which I just did) the window size is then set automatically? I'll do some more reading. I'm looking at TSO too, as above, mentioned by Nickola. I'll maybe risk changing it with ethtool during a quiet network moment. I've just discovered the netstat -s command, which gives loads more info than anything else I've come across. Any pointers about window size or TSO from the output appreciated :-)

Thanks again,

Chris
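As a side note, netstat -s is verbose; the counters most relevant to a per-connection throughput ceiling are the retransmission and buffer-pruning ones, which plain grep will pull out (nothing here is specific to this box):

  # Retransmissions and timeouts
  netstat -s | egrep -i 'retrans|timeout'

  # Receive-buffer pressure (pruned/collapsed segments point at window/buffer limits)
  netstat -s | egrep -i 'prune|collapse|overflow'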
On 2/14/09, Chris <chris@ghostbusters.co.uk> wrote:
Thanks very much, Lee. My head's whirring. Am I right in thinking that by turning on scaling (which I just did) the window size is then set automatically?
No. Scaling just allows you to have a window size larger than 64KB. These might help:

http://www-didc.lbl.gov/TCP-tuning/troubleshooting.html
http://www-didc.lbl.gov/TCP-tuning/linux.html

Regards,
Lee
Hi Mikael, I just realised that I didn't respond to your post. The RTTs vary massively because the router is forwarding from websites on the LAN to visitors worldwide. Is that what you meant? Disabling TSO didn't work, unfortunately.

Thanks again,

Chris
On Sat, 14 Feb 2009, Chris wrote:
The RTTs vary massively because the router is forwarding from websites on the LAN to visitors worldwide. Is that what you meant?
And your TCP speed when testing is always 300-600kbps regardless of the RTT between the boxes you're testing with? Without TCP window scaling turned on on the boxes doing TCP with each other, you're always limited to 64k/RTT bytes/s of transfer speed. Changing window scaling on the Linux router will of course not change the behaviour of the traffic going through it, only the TCP sessions the router itself takes part in.

--
Mikael Abrahamsson    email: swmike@swm.pp.se
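To put rough numbers on that 64k/RTT ceiling: with the maximum unscaled window of 65535 bytes, the per-connection limit works out as below. The RTTs are illustrative values, not measurements from Chris's network:

  # throughput ceiling = window size / round-trip time
  awk 'BEGIN {
    window = 65535                       # bytes: the largest window without scaling
    split("20 80 150 300", rtt_ms, " ")  # example round-trip times in milliseconds
    for (i = 1; i <= 4; i++) {
      rtt = rtt_ms[i] / 1000
      printf "RTT %3d ms -> at most %.2f Mbit/s per connection\n", rtt_ms[i], window / rtt * 8 / 1e6
    }
  }'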
I'm looking at TSO too as above, mentioned by Nickola. I'll maybe risk changing it with ethtool during a quiet network moment.
Turning off offloading might be something to try, indeed. Regarding the negotiation issue, can you look at the other end of the link and check what it's saying? Looking at "netstat -s" statistics at the endpoint (not the router) could be illuminating, too. I haven't got any expertise in this area, but TCP problems can often be diagnosed by looking at tcpdump/packet captures and analysing them using tcptrace (and the special xplot variant, which can plot tcptrace output).
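A capture-based look needs surprisingly little. A sketch of the workflow Florian describes, with the interface, test host and file names as placeholders:

  # Capture TCP headers on the router (or, better, the end host) during one slow transfer
  tcpdump -i eth0 -s 96 -w slow-transfer.pcap 'tcp and host 192.0.2.10'

  # Summarise the connections and generate time-sequence graph files (*.xpl)
  tcptrace -l -G slow-transfer.pcap

  # View the graphs with xplot (or the xplot variant packaged for tcptrace)
  xplot *.xpl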
participants (5)

- Chris
- Florian Weimer
- Lee
- Mikael Abrahamsson
- Nickola Kolev