Hello
I have a customer that believes my network has an ECN problem. We do not, we just move packets. But how do I prove it?
Is there a tool that checks for ECN trouble? Ideally something I could run on the NLNOG Ring network.
I believe it likely that it is the destination that has the problem.
Regards, Baldur
On Nov 11, 2019, at 05:01, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
Hello
I have a customer that believes my network has an ECN problem. We do not, we just move packets. But how do I prove it?
Are you saying that none of your routers support ECN or that you think ECN only applies to endpoints?
Is there a tool that checks for ECN trouble? Ideally something I could run on the NLNOG Ring network.
I believe it likely that it is the destination that has the problem.
I’d say start by asking the reporter to provide a PCAP of the problem, then review the packet trace for clues about which tap points in your network to investigate, i.e. where ECN marking is (or should be) occurring and where the opposite is happening.
Owen
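A possible starting point for such a capture, sketched here purely for illustration (the interface name is an assumption), is a tcpdump filter that matches only ECN-relevant packets: anything with a non-zero ECN field in the IP header, or with the TCP ECE/CWR flags set:

$ tcpdump -nvi eth0 'ip[1] & 0x3 != 0 or tcp[13] & 0xc0 != 0'

The -v flag makes tcpdump print the TOS byte and TTL on each packet, which is what turns out to matter later in this thread.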
Hello
I have a customer that believes my network has an ECN problem. We do not, we just move packets. But how do I prove it?
Is there a tool that checks for ECN trouble? Ideally something I could run on the NLNOG Ring network.
I believe it likely that it is the destination that has the problem.
Hi Baldur
I believe I may be that customer :)
First of all, thank you for looking into the issue! We've been having great fun over on the ecn-sane mailing list trying to figure out what's going on. I'll summarise below, but see this thread for the discussion and debugging details: https://lists.bufferbloat.net/pipermail/ecn-sane/2019-November/000527.html
The short version is that the problem appears to come from a combination of the ECMP routing in your network and Cloudflare's heavy use of anycast. Specifically, a router in your network appears to be doing ECMP by hashing on the packet header, *including the ECN bits*. This breaks TCP connections with ECN because the TCP SYN (with no ECN bits set) ends up taking a different path than the rest of the flow (which is marked as ECT(0)). When the destination is anycasted, this means that the data packets go to a different server than the SYN did. This second server doesn't recognise the connection, and so replies with a TCP RST. To fix this, simply exclude the ECN bits (or the whole TOS byte) from your router's ECMP hash.
For a longer exposition, see below. You should be able to verify this from somewhere else in the network, but if there's anything else you want me to test, do let me know. Also, would you mind sharing the router make and model that does this? We're trying to collect real-world examples of network problems caused by ECN and this is definitely an interesting example.
-Toke
The long version:
From my end I can see that I have two paths to Cloudflare; which is taken appears to be based on a hash of the packet header, as can be seen by varying the source port:
$ traceroute -q 1 --sport=10000 104.24.125.13
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.357 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  4.707 ms
 3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.283 ms
 4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.667 ms
 5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.406 ms
 6  104.24.125.13 (104.24.125.13)  1.322 ms
$ traceroute -q 1 --sport=10001 104.24.125.13
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.293 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  3.430 ms
 3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.194 ms
 4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.297 ms
 5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.805 ms
 6  149.6.142.130 (149.6.142.130)  6.925 ms
 7  104.24.125.13 (104.24.125.13)  1.501 ms
This is fine in itself. However, the problem stems from the fact that the ECN bits in the IP header are also included in the ECMP hash (-t sets the TOS byte; -t 1 ends up as ECT(0) on the wire and -t 2 is ECT(1)):
$ traceroute -q 1 --sport=10000 104.24.125.13 -t 1
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.336 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  6.964 ms
 3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.056 ms
 4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.512 ms
 5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.313 ms
 6  104.24.125.13 (104.24.125.13)  1.210 ms
$ traceroute -q 1 --sport=10000 104.24.125.13 -t 2
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
 1  _gateway (10.42.3.1)  0.339 ms
 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  2.565 ms
 3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.301 ms
 4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.339 ms
 5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.570 ms
 6  149.6.142.130 (149.6.142.130)  6.888 ms
 7  104.24.125.13 (104.24.125.13)  1.785 ms
So why is this a problem? The TCP SYN packet first needs to negotiate ECN, so it is sent without any ECN bits set in the header; after negotiation succeeds, the data packets will be marked as ECT(0). But because that becomes part of the ECMP hash, those packets will take another path. And since the destination is anycasted, that means they will also end up at a different endpoint. This second endpoint won't recognise the connection, and replies with a TCP RST.
This is clearly visible in tcpdump; notice the different TOS values, and that the RST packet has a different TTL than the SYN-ACK:
12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], cksum 0xf2ff (incorrect -> 0x0853), seq 3345293502, win 64240, options [mss 1460,sackOK,TS val 4248691972 ecr 0,nop,wscale 7], length 0
12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], cksum 0x9f4a (correct), seq 1936951409, ack 3345293503, win 29200, options [mss 1400,nop,nop,sackOK,nop,wscale 10], length 0
12:21:47.823479 IP (tos 0x0, ttl 64, id 25688, offset 0, flags [DF], proto TCP (6), length 40)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [.], cksum 0xf2eb (incorrect -> 0x503e), seq 1, ack 1, win 502, length 0
12:21:47.823665 IP (tos 0x2,ECT(0), ttl 64, id 25689, offset 0, flags [DF], proto TCP (6), length 117)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [P.], cksum 0xf338 (incorrect -> 0xc1d4), seq 1:78, ack 1, win 502, length 77: HTTP, length: 77
        GET / HTTP/1.1
        Host: 104.24.125.13
        User-Agent: curl/7.66.0
        Accept: */*
12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [R], cksum 0x3a65 (correct), seq 1936951410, win 0, length 0
The fix is to stop hashing on the ECN bits when doing ECMP. You could keep hashing on the diffserv part of the TOS field if you want, but I think it would also be fine to just exclude the TOS field entirely from the hash.
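To make the failure mode concrete, here is a small illustrative sketch in Python (a toy hash, not any vendor's actual ECMP implementation; the path names and flow tuple are borrowed from the trace above only for flavour). With the ECN bits included in the hash key, the Not-ECT SYN and the ECT(0) data packets of the same connection can be steered to different next hops; masking the two ECN bits off keeps the whole connection on one path:

import hashlib

def ecmp_next_hop(src, dst, proto, sport, dport, tos, paths, include_ecn=True):
    # Toy hash over the header fields: whatever goes into the key selects the path.
    if not include_ecn:
        tos &= 0xFC  # keep the 6 DSCP bits, drop the 2 ECN bits
    key = f"{src}|{dst}|{proto}|{sport}|{dport}|{tos}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return paths[digest % len(paths)]

paths = ["path-via-cogent", "path-via-he"]
flow = ("10.42.3.130", "104.24.125.13", 6, 34420, 80)  # values from the tcpdump above

# The SYN carries ECN field 00 (Not-ECT); later data packets carry ECT(0) = 0x02.
print("SYN :", ecmp_next_hop(*flow, tos=0x00, paths=paths))
print("data:", ecmp_next_hop(*flow, tos=0x02, paths=paths))  # may pick the other path

# With the ECN bits excluded from the key, both always hash to the same next hop.
assert ecmp_next_hop(*flow, tos=0x00, paths=paths, include_ecn=False) == \
       ecmp_next_hop(*flow, tos=0x02, paths=paths, include_ecn=False)

The final assertion is the point: once the bottom two bits of the TOS byte are excluded from the key, the SYN and the data packets always hash identically, which is exactly the fix described above.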
This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.
It is certainly odd, but it's definitely a "thing."
https://archive.nanog.org/meetings/nanog37/presentations/matt.levine.pdf
On Wed, Nov 13, 2019 at 10:24 AM Matt Corallo <nanog@as397444.net> wrote:
This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.
as one of the authors of that talk, it definitely is "a thing", has been for years and years and years, and indeed, mostly works.
t
On Wed, Nov 13, 2019 at 12:18 PM Hunter Fuller <hf0002+nanog@uah.edu> wrote:
It is certainly odd, but it's definitely a "thing."
https://archive.nanog.org/meetings/nanog37/presentations/matt.levine.pdf
I am testing disabling our use of ECMP as it is not strictly necessary and we are moving to a new platform anyway. Waiting for feedback from the customer to hear if this fixes the issue.
In any case, is it not recommended that users of anycast proxy packets that arrive at the wrong place, to avoid this kind of issue?
Regards, Baldur
On Wed, Nov 13, 2019 at 6:35 PM Todd Underwood <toddunder@gmail.com> wrote:
as one of the authors of that talk, it definitely is "a thing", has been for years and years and years, and indeed, mostly works.
t
On Wed, 13 Nov 2019, Baldur Norddahl wrote:
In any case, is it not recommended that users of anycast proxy packets that arrive at the wrong place, to avoid this kind of issue?
In typical anycast deployments there is no feasible way to figure out where the "right place" is.
It would be very interesting if you could share what equipment you're using that is doing ECMP hashing based on ECN bits. That vendor needs to fix that or people should avoid their devices.
-- Mikael Abrahamsson email: swmike@swm.pp.se
ZTE M6000-S V3.00.20(3.40.1)
We are moving away from this platform so I can not be bothered with requesting a fix. In the past they have made fixes for us, so I believe they would also fix this issue if we asked them to do so.
Also I would like to state that I have not personally verified that the equipment is doing hashing based on the ECN bits. I just turned off ECMP so the customer can test. If it works we will either let ECMP stay off or move the customer to the new platform.
Regards, Baldur
On Wed, Nov 13, 2019 at 7:30 PM Mikael Abrahamsson <swmike@swm.pp.se> wrote:
On Wed, 13 Nov 2019, Baldur Norddahl wrote:
In any case, is it not recommended that users of anycast proxy packets that arrive at the wrong place, to avoid this kind of issue?
In typical anycast deployments there is no feasible way to figure out where the "right place" is.
It would be very interesting if you could share what equipment you're using that is doing ECMP hashing based on ECN bits. That vendor needs to fix that or people should avoid their devices.
-- Mikael Abrahamsson email: swmike@swm.pp.se
Baldur Norddahl <baldur.norddahl@gmail.com> writes:
I am testing disabling our use of ECMP as it is not strictly necessary and we are moving to a new platform anyway. Waiting for feedback from the customer to hear if this fixes the issue.
Which I can confirm that it does. Thank you for the speedy resolution! :)
-Toke
On Wed, 13 Nov 2019 at 18:27, Matt Corallo <nanog@as397444.net> wrote:
This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.
Not true. Hash result should indicate a discrete flow; more importantly, a discrete flow should not result in two unique hash numbers. Using the whole TOS byte breaks this promise and thus breaks ECMP.
Platforms allow you to configure which bytes are part of the hash calculation; the whole TOS byte should not be used, as a discrete flow SHOULD have unique ECN bits during congestion. Toke has diagnosed the problem correctly; the solution is to remove TOS from the ECMP hash calculation.
-- ++ytti
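For reference, a short illustrative Python sketch (editorial, with AF21 chosen only as an example DSCP) of how the former TOS byte splits into DSCP and ECN, and of the mask that would keep the DiffServ part in the hash input while dropping the ECN bits:

# The old TOS byte is DSCP (top 6 bits) + ECN (bottom 2 bits).  A single TCP
# flow legitimately cycles through several ECN codepoints over its lifetime,
# so a flow hash must not depend on them.
ECN_MEANING = {
    0b00: "Not-ECT (e.g. the SYN)",
    0b01: "ECT(1)",
    0b10: "ECT(0) (typical data packets)",
    0b11: "CE (set by an AQM under congestion)",
}

def hashable_tos(tos: int) -> int:
    """Keep the DSCP bits, zero the ECN bits before feeding the flow hash."""
    return tos & 0xFC

for ecn, meaning in ECN_MEANING.items():
    tos = (18 << 2) | ecn  # DSCP 18 (AF21) combined with each ECN codepoint
    print(f"tos=0x{tos:02x}  hash input=0x{hashable_tos(tos):02x}  {meaning}")

All four lines print the same hash input even though the ECN codepoint changes over the life of the flow, which is why the ECN bits must never help select the path.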
Hello,
On Wed, Nov 13, 2019 at 8:35 PM Saku Ytti <saku@ytti.fi> wrote:
On Wed, 13 Nov 2019 at 18:27, Matt Corallo <nanog@as397444.net> wrote:
This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.
Not true. Hash result should indicate a discrete flow; more importantly, a discrete flow should not result in two unique hash numbers. Using the whole TOS byte breaks this promise and thus breaks ECMP.
Platforms allow you to configure which bytes are part of the hash calculation; the whole TOS byte should not be used, as a discrete flow SHOULD have unique ECN bits during congestion. Toke has diagnosed the problem correctly; the solution is to remove TOS from the ECMP hash calculation.
In fact I believe everything beyond the 5-tuple is just a bad idea to base your hash on. Here are some examples (not quite as straightforward as the TOS/ECN case here):
TTL: https://mailman.nanog.org/pipermail/nanog/2018-September/096871.html
IPv6 flow label: https://blog.apnic.net/2018/01/11/ipv6-flow-label-misuse-hashing/ https://pc.nanog.org/static/published/meetings/NANOG71/1531/20171003_Jaeggli... https://www.youtube.com/watch?v=b0CRjOpnT7w
Lukas
On Wed, 13 Nov 2019 at 22:57, Lukas Tribus <lists@ltri.eu> wrote:
In fact I believe everything beyond the 5-tuple is just a bad idea to base your hash on. Here are some examples (not quite as straightforward as the TOS/ECN case here):
ACK.
TTL: https://mailman.nanog.org/pipermail/nanog/2018-September/096871.html
IPv6 flow label: https://blog.apnic.net/2018/01/11/ipv6-flow-label-misuse-hashing/ https://pc.nanog.org/static/published/meetings/NANOG71/1531/20171003_Jaeggli... https://www.youtube.com/watch?v=b0CRjOpnT7w
It is unfortunate the IPv6 flow label is so poorly specified; had it been specified clearly, it could have been very, very good for the Internet. Crucially, the sender should be able to instruct transit HOW to hash: there should be flags in the flow label, set by the sender, indicating whether the flow label must be used for the hash exclusively, not at all, or together with whatever the host otherwise uses. This would give the sender control over what constitutes a discrete flow.
Something like this https://ytti.github.io/flow-label/draft-ytti-v6ops-flow-label.html would have been nice, but it is unclear whether it would be possible to deliver after the fact.
-- ++ytti
On Wed, Nov 13, 2019 at 11:36 AM Saku Ytti <saku@ytti.fi> wrote:
On Wed, 13 Nov 2019 at 18:27, Matt Corallo <nanog@as397444.net> wrote:
This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.
Not true. Hash result should indicate a discrete flow; more importantly, a discrete flow should not result in two unique hash numbers. Using the whole TOS byte breaks this promise and thus breaks ECMP.
Yes true. Equal Cost MultiPath (ECMP) consistency over the life of a TCP connection is not a promise. Anycasters would love it to be, but it's not. ECMP's only promise is that packets for a particular connection will tend to prefer a particular path, so that throughput doesn't suffer overly much from the packet reordering you'd get by round-robining the packets on different links. Choosing an alternate path during congestion is a perfectly reasonable thing for ECMP to do.
Don't blame the network. This is Cloudflare choosing not to handle the anycast spray corner case because it happens rarely enough, with symptoms obscure enough, that they only occasionally get called on the carpet. Their BGP announcements make the claim that they're ready for your packet at any of their sites, but they're not.
Regards, Bill Herrin
-- William Herrin bill@herrin.us https://bill.herrin.us/
* Saku Ytti
Not true. Hash result should indicate a discrete flow; more importantly, a discrete flow should not result in two unique hash numbers. Using the whole TOS byte breaks this promise and thus breaks ECMP.
Platforms allow you to configure which bytes are part of the hash calculation; the whole TOS byte should not be used, as a discrete flow SHOULD have unique ECN bits during congestion. Toke has diagnosed the problem correctly; the solution is to remove TOS from the ECMP hash calculation.
Agreed. This also goes for the other bits, so the whole byte must be excluded.
For example, the OpenSSH client will by default change the code point from zero (during authentication) to af21/cs1 (when it enters an interactive/non-interactive session). I have experienced this breaking IPv6 SSH sessions to an anycasted SSH server instance that was reached through old Juniper DPC cards with ECMP enabled. The symptom was that authentication went fine, only for the connection to be reset immediately after (unless the default IPQoS config was changed).
The «solution» was to simply disable ECMP for all IPv6 traffic, since I could not figure out how to make the Juniper exclude the DiffServ byte from the ECMP hash calculation.
Tore
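As a client-side illustration of the OpenSSH behaviour described above (the host name below is made up, and this assumes an OpenSSH version that supports the IPQoS option), pinning both the interactive and the non-interactive code point to the same value keeps the flow's TOS byte constant for the whole session:

# ~/.ssh/config (illustrative host)
Host anycast-ssh.example.net
    IPQoS cs0 cs0

The first argument applies to interactive traffic and the second to bulk transfers; using one value for both avoids the mid-session DSCP change that triggered the resets in this case.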
On Thu, Nov 14, 2019 at 12:25 AM Matt Corallo <nanog@as397444.net> wrote:
This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.
Errrrrr. I really don't think that there is any sort of spec that covers that :-P
Using Anycast for TCP is incredibly common - the DNS root servers for one obvious example. More TCP-centric well-known examples are Fastly and LinkedIn - LinkedIn in particular did a really good podcast on their experience with this.
There is also a good NANOG talk from the ~2000s (?) on people using TCP anycast for long lived (serving ISO files, which were long-lived in those days) flows, and how reliable it is - perhaps that's the talk Todd mentioned?
W
-- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf
RFC 7094 (https://tools.ietf.org/html/rfc7094) describes the pitfalls & risks of using TCP with an anycast address. It recognizes that there are valid use cases for it, though. Specifically, section 3.1 says this:
Most stateful transport protocols (e.g., TCP), without modification, do not understand the properties of anycast; hence, they will fail probabilistically, but possibly catastrophically, when using anycast addresses in the presence of "normal" routing dynamics. ... This can lead to a protocol working fine in, say, a test lab but not in the global Internet.
On Thu, Nov 14, 2019 at 12:25 AM Matt Corallo <nanog@as397444.net> wrote:
This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least),
No. We have been doing anycast TCP for more than _thirty years_, most of that time on a global scale, without operational problems.
There were people who seemed gray-bearded at the time, who were scared of anycast because it used IP addresses _non uniquely_ and that wasn’t how they’d intended them to be used, and these kids these days, etc. What you’re seeing is residuum of their pronouncements on the matter, carrying over from the mid-1990s.
It’s very true that anycast can be misused and abused in a myriad of ways, leading to unexpected or unpleasant results, but no more so than other routing techniques. We and others have published on many or most of the potential issues and their solutions over the years. That RFC has never actually been a comprehensive source of information on the topic, and it contains a lot of scare-mongering.
-Bill
On Thu, Nov 14, 2019 at 1:10 AM Bill Woodcock <woody@pch.net> wrote:
No. We have been doing anycast TCP for more than _thirty years_, most of that time on a global scale, without operational problems.
Hi Bill,
Not to put too fine a point on it, but Baldur and Toke's scenario in which anycast TCP failed, the one which started this thread, should probably be classed as an operational problem.
It is possible to build an anycast TCP that works. YOU have not done so. And Cloudflare certainly has not.
Regards, Bill Herrin
-- William Herrin bill@herrin.us https://bill.herrin.us/
RFC 7094 (https://tools.ietf.org/html/rfc7094) describes the pitfalls & risks of using TCP with an anycast address.
and two decades of operational experience are that prudent deployments just work. randy
On Fri, Nov 15, 2019 at 1:54 AM Randy Bush <randy@psg.com> wrote:
RFC 7094 (https://tools.ietf.org/html/rfc7094) describes the pitfalls & risks of using TCP with an anycast address.
and two decades of operational experience are that prudent deployments just work.
I agree with Bill/Randy here... this does just work if you understand your local topology and manage change properly.
RFC 7094 (https://tools.ietf.org/html/rfc7094) describes the pitfalls & risks of using TCP with an anycast address.
and two decades of operational experience are that prudent deployments just work.
I agree with Bill/Randy here... this does just work if you understand your local topology and manage change properly.
agree, but would extend ... sometimes s/local// i.e. casting from your edge dumps directly to peers, keeping it off your backbone. but the topo set you have to keep in mind can be large. lots of good research lit on catchment topology of anycasted dns, which is very non-local. randy
participants (16)
- Anoop Ghanwani
- Baldur Norddahl
- Bill Woodcock
- Christopher Morrow
- Hunter Fuller
- Lukas Tribus
- Matt Corallo
- Mikael Abrahamsson
- Owen DeLong
- Randy Bush
- Saku Ytti
- Todd Underwood
- Toke Høiland-Jørgensen
- Tore Anderson
- Warren Kumari
- William Herrin