TCP Window Scaling issue
Hello,
I know this isn't precisely on topic, but I'm having an issue that I could use some assistance with.
I'm currently seeing a very interesting issue with a single server. File transfers from Server A to Server B are relatively slow and use little of the circuit. On closer inspection, the TCP window size stays at the default 65535 and window scaling isn't negotiated. What's interesting is that this affects only a single server, and only when traffic goes over the WAN circuit. Testing from Server A to any server on its own network shows it negotiating window scaling just fine.
Below I'll try to draw out a better picture of what is happening. The letters represent the servers in question, and the .# represents which subnet each is on, to show whether the traffic traverses the WAN circuit.
Server A.1 -> Server B.2 = No window scaling
Server A.1 -> Server C.1 = Window scaling
Server B.2 -> Server A.1 = Window scaling
Server C.1 -> Server B.2 = Window scaling
The net result is that when window scaling is properly in use I'm seeing about 30-40 Mbps of bandwidth usage; without scaling I'm only seeing 2.8 Mbps.
Any thoughts?
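For context on the numbers above: a single TCP connection can keep at most one receive window of unacknowledged data in flight per round trip, so its throughput is capped at roughly window / RTT. The thread never states the round-trip time of the WAN circuit, so the sketch below treats it as unknown and works backwards from the reported 2.8 Mbps. The function names are made up for illustration, and the result only holds if the 65535-byte window is the sole limit on the transfer.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-flow TCP throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def implied_rtt_seconds(window_bytes: int, throughput_bps: float) -> float:
    """RTT at which the given window would cap throughput at throughput_bps."""
    return window_bytes * 8 / throughput_bps


UNSCALED_WINDOW = 65535  # largest window expressible without the scaling option

# Work backwards from the 2.8 Mbps reported without window scaling.
rtt = implied_rtt_seconds(UNSCALED_WINDOW, 2.8e6)
print(f"implied RTT: {rtt * 1000:.0f} ms")

# Window that would be needed to sustain 40 Mbps at that same (assumed) RTT.
needed = 40e6 * rtt / 8
print(f"window for 40 Mbps: {needed / 1024:.0f} KiB")
```

Under that assumption the implied RTT comes out at a bit under 190 ms, and sustaining 40 Mbps would need a window far larger than 64 KiB, which is only possible when window scaling is negotiated.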
Zach Hill <zach.reborn@gmail.com> wrote:
What's interesting is this is only affecting a single server and only when traffic is going over the WAN circuit. Testing from Server A to any server on its network shows it is negotiating window scaling just fine.
Check your firewall isn't buggering about with TCP options.
Tony.
-- 
f.anthony.n.finch <dot@dotat.at> http://dotat.at/
South German Bight, East Humber: Northeasterly 4 or 5. Slight, occasionally moderate. Mainly fair. Moderate or good.
Hi Tony. No firewall in the way. Physical flow is as below.
Server A -> Nexus 7k -> 3845 router -> Sprint MPLS -> 3845 router -> Cisco 3750x stack -> Server B
On Thu, Jul 24, 2014 at 12:25 PM, Tony Finch <dot@dotat.at> wrote:
Zach Hill <zach.reborn@gmail.com> wrote:
What's interesting is this is only affecting a single server and only when traffic is going over the WAN circuit. Testing from Server A to any server on its network shows it is negotiating window scaling just fine.
Check your firewall isn't buggering about with TCP options.
Tony.
On 14-07-24 12:30 PM, Zach Hill wrote:
Hi Tony. No firewall in the way.
Physical flow is as below.
Server A -> Nexus 7k -> 3845 router -> Sprint MPLS -> 3845 router -> Cisco 3750x stack -> Server B
I blame the cloud.
Dump the actual packets as they leave Server A and arrive at Server B (and vice-versa!). Does it get modified en route?
M.
-- 
Michael Brown            | The true sysadmin does not adjust his behaviour
Systems Administrator    | to fit the machine. He adjusts the machine
michael@supermathie.net  | until it behaves properly. With a hammer,
                         | if necessary. - Brian
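When comparing the SYN as it leaves Server A with the SYN as it arrives at Server B, the field to look at is the window scale option (kind 3) in the TCP options. The following is a minimal sketch of pulling that option out of a raw TCP header; the helper names, ports, and sample option bytes are invented for illustration and are not taken from the captures in this thread.

```python
import struct

TCP_OPT_EOL, TCP_OPT_NOP, TCP_OPT_WSCALE = 0, 1, 3


def window_scale_from_tcp_header(header: bytes):
    """Return the window scale shift found in a raw TCP header, or None."""
    data_offset = (header[12] >> 4) * 4   # total header length in bytes
    options = header[20:data_offset]      # options follow the fixed 20 bytes
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break                         # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                         # malformed option
        if kind == TCP_OPT_WSCALE and length == 3:
            return options[i + 2]
        i += length
    return None


def build_syn(options: bytes) -> bytes:
    """Build a minimal SYN header (made-up ports, zero checksum) for testing."""
    padded = options + b"\x00" * (-len(options) % 4)
    offset_words = (20 + len(padded)) // 4
    fixed = struct.pack("!HHIIBBHHH", 40000, 445, 0, 0,
                        offset_words << 4, 0x02, 65535, 0, 0)
    return fixed + padded


# Hypothetical SYN as sent (MSS 1460, NOP, window scale 7) and as received
# after something on the path stripped the window scale option.
sent = build_syn(b"\x02\x04\x05\xb4\x01\x03\x03\x07")
received = build_syn(b"\x02\x04\x05\xb4")
print(window_scale_from_tcp_header(sent), window_scale_from_tcp_header(received))
```

If the value differs between the two capture points, something in between is rewriting the options; if it is already missing at the sender, the host itself is not offering scaling.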
Hi Michael,
Let me set up another packet capture at each side to see if the initial packets are being modified at all.
Thanks,
On Thu, Jul 24, 2014 at 12:39 PM, Michael Brown <michael@supermathie.net> wrote:
On 14-07-24 12:30 PM, Zach Hill wrote:
Hi Tony. No firewall in the way.
Physical flow is as below.
Server A -> Nexus 7k -> 3845 router -> Sprint MPLS -> 3845 router -> Cisco 3750x stack -> Server B
I blame the cloud.
Dump the actual packets as they leave Server A and arrive at Server B (and vice-versa!). Does it get modified en route?
M.
Also, just to reiterate: I would lean more heavily on something fishy in the WAN cloud if all traffic from Site 1 to Site 2 were failing to negotiate TCP window scaling properly. However, it's only Server A that is seeing this, and Server A is able to properly TCP window scale for any local traffic.
On Thu, Jul 24, 2014 at 12:47 PM, Zach Hill <zach.reborn@gmail.com> wrote:
Hi Michael,
Let me set up another packet capture at each side to see if the initial packets are being modified at all.
Thanks,
On Thu, Jul 24, 2014 at 9:51 AM, Zach Hill <zach.reborn@gmail.com> wrote:
Also just to reiterate I would lean more heavily on something fishy in the WAN cloud if all traffic from Site 1 to Site 2 were not seeing tcp window scaling properly, however it's only for Server A that is seeing this. Server A is able to properly TCP window scale for any local traffic.
Remember, the WAN cloud is just that, a cloud; it's not likely to be a single link underneath it all. One bad link, bad port, or bad device in the cloud can affect just a sub-portion of the traffic, depending on the 5-tuple hashing that takes place.
An interesting test would be to give Server A a different address (a secondary address should be fine; all you need to do is source packets from a different source address) and see if your scaling suddenly reappears. If it does, it's definitely down to the 5-tuple hashing happening within The Cloud(tm).
Matt
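The 5-tuple hashing described above can be sketched as follows. Real routers use vendor-specific hash functions; SHA-256 here is only a stand-in, and the addresses, ports, and path count are invented for illustration. The point is that a given flow always maps to the same path, while changing only the source address can move it to another one.

```python
import hashlib


def pick_path(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Map a flow's 5-tuple to one of n_paths, as an ECMP-style hash might."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


N_PATHS = 4  # assumed number of parallel paths inside the provider network

# The same flow always lands on the same path...
primary = pick_path("10.0.1.10", "10.0.2.20", 6, 50000, 445, N_PATHS)
# ...while sourcing from a secondary address may land on a different one.
secondary = pick_path("10.0.1.11", "10.0.2.20", 6, 50000, 445, N_PATHS)
print(primary, secondary)
```

This is why a fault on one internal link can hit a single server's traffic while every other host between the same two sites looks healthy.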
*First round of packet captures*
Here are the snippets from a packet capture.
First is the SYN from Server A to Server B: http://i.imgur.com/E5cu4ev.png
Here is the SYN from Server B back: http://i.imgur.com/RRSAl8G.png
Second test, from Server C to Server B:
First is the SYN from Server C to Server B: http://i.imgur.com/Jc2K6bT.png
and the SYN from Server B to Server C: http://i.imgur.com/pbvx9jJ.png
I guess I'm at a loss as to why in scenario 1 neither side is sending window scaling at all. Is it because Server A isn't attempting or initializing it?
I'm in the process of setting up a VM that I can SPAN for a capture from the source of Server A. This will allow me to compare packets at each side.
*Second round of packet captures*
Now I just don't even know what is going on... Is this quantum physics now? Did the state just change by me looking at it?
Here are some new screencaps. The only change that's been made was a SPAN port enabled on the Nexus 7k, sourced at Server A, with my new tcpdump capture server as the destination.
Site 1 captures:
1 http://i.imgur.com/K5r7FaG.png
2 http://i.imgur.com/wfnfLyi.png
Site 2 captures:
1 http://i.imgur.com/vpY2lnh.png
2 http://i.imgur.com/UyL3V6L.png
Now they are both communicating a window size. Speed is still slow at 400-450 KBps.
On Thu, Jul 24, 2014 at 1:23 PM, Matthew Petach <mpetach@netflight.com> wrote:
On Thu, Jul 24, 2014 at 9:51 AM, Zach Hill <zach.reborn@gmail.com> wrote:
Also just to reiterate I would lean more heavily on something fishy in the WAN cloud if all traffic from Site 1 to Site 2 were not seeing tcp window scaling properly, however it's only for Server A that is seeing this. Server A is able to properly TCP window scale for any local traffic.
Remember, the WAN cloud is just that, a cloud; it's not likely to be a single link underneath it all; so one bad link/bad port/bad device in the cloud can affect just a sub-portion of the traffic, depending on the 5-tuple hashing that takes place.
An interesting test would be to give server A a different address (secondary address should be fine, all you need to do is source packets from a different source address) and see if your scaling suddenly reappears. If it does, it's definitely down to the 5-tuple hashing happening within The Cloud(tm).
Matt
On Thu, 24 Jul 2014 14:33:56 -0400, Zach Hill said:
First is the SYN from Server A to Server B http://i.imgur.com/E5cu4ev.png
Was this captured with tcpdump on Server A on its way out, or on Server B on its way in, or at some other point using a span port? The answer matters if we're suspecting that something along the way is stomping the option....
All are from SPAN ports at each end. So for the second round of packet captures Site 1 is from a SPAN port off the NIC of Server A. Site 2 is from a SPAN port off the NIC of the MPLS router.
The first round of packet captures are only from the SPAN port off the MPLS router at Site 2.
On Thu, Jul 24, 2014 at 3:08 PM, <Valdis.Kletnieks@vt.edu> wrote:
On Thu, 24 Jul 2014 14:33:56 -0400, Zach Hill said:
First is the SYN from Server A to Server B http://i.imgur.com/E5cu4ev.png
Was this captured with tcpdump on Server A on its way out, or on Server B on its way in, or at some other point using a span port? The answer matters if we're suspecting that something along the way is stomping the option....
On Thu, Jul 24, 2014 at 12:13 PM, Zach Hill <zach.reborn@gmail.com> wrote:
All are from SPAN ports at each end. So for the second round of packet captures Site 1 is from a SPAN port off the NIC of Server A. Site 2 is from a SPAN port off the NIC of the MPLS router.
The first round of packet captures are only from the SPAN port off the MPLS router at Site 2.
I have to dash out for a few hours, but the short answer is that the first round of packet captures is too far from the host to matter.
The second set is better, but it would still be best to compare with tcpdumps from device A itself, to see what it thinks it's sending out versus what is seen upstream of it. Can you grab tcpdumps from Server A itself?
Thanks!
Matt
I don't have root access to that server, but I should be able to get it and then grab some tcpdumps.
On Thu, Jul 24, 2014 at 3:18 PM, Matthew Petach <mpetach@netflight.com> wrote:
On Thu, Jul 24, 2014 at 12:13 PM, Zach Hill <zach.reborn@gmail.com> wrote:
All are from SPAN ports at each end. So for the second round of packet captures Site 1 is from a SPAN port off the NIC of Server A. Site 2 is from a SPAN port off the NIC of the MPLS router.
The first round of packet captures are only from the SPAN port off the MPLS router at Site 2.
I have to dash out for a few hours; but the short answer is the first round of packet captures are too far from the host to matter.
The second set are doing better, but it would still be best to compare with tcpdumps from the device A itself, to see what it thinks it's sending out, vs what is seen upstream of it. Can you grab tcpdumps from server A itself?
Thanks!
Matt
On 14-07-24 12:25 PM, Tony Finch wrote:
Zach Hill <zach.reborn@gmail.com> wrote:
What's interesting is this is only affecting a single server and only when traffic is going over the WAN circuit. Testing from Server A to any server on its network shows it is negotiating window scaling just fine.
Check your firewall isn't buggering about with TCP options.
Tony.
This, exactly. I diagnosed this issue a while back with our Checkpoint firewall - it didn't understand TCP window scaling so it would blindly zero out the field and cause nightmares.
M.
-- 
Michael Brown            | The true sysadmin does not adjust his behaviour
Systems Administrator    | to fit the machine. He adjusts the machine
michael@supermathie.net  | until it behaves properly. With a hammer,
                         | if necessary. - Brian
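To illustrate why a zeroed window scale value hurts: with scaling in use, the real receive window is the 16-bit window field left-shifted by the shift count exchanged in the SYN (RFC 7323 limits the shift to 14). The sketch below shows one way a middlebox rewrite could shrink the window the peer believes it has. The 1 MiB target and the shift of 7 are hypothetical values, and this is not a reconstruction of what the Checkpoint firewall actually did.

```python
def effective_window(window_field: int, shift: int) -> int:
    """Receive window in bytes: the 16-bit header field shifted by the scale."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field is 16 bits")
    if not 0 <= shift <= 14:
        raise ValueError("RFC 7323 limits the shift count to 14")
    return window_field << shift


# Hypothetical receiver that wants a 1 MiB window and offered a shift of 7.
shift_sent = 7
window_field = (1 << 20) >> shift_sent  # value placed in the 16-bit field

as_intended = effective_window(window_field, shift_sent)
# If a middlebox zeroed the shift in the SYN, the peer reads the field unscaled.
as_seen_by_peer = effective_window(window_field, 0)
print(as_intended, as_seen_by_peer)
```

In this example the peer would believe the window is 128 times smaller than the receiver intended, which throttles the sender accordingly.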
participants (5)
- Matthew Petach
- Michael Brown
- Tony Finch
- Valdis.Kletnieks@vt.edu
- Zach Hill