I want to share a little bit of our journey in tracking down the TCP RSTs that impacted some of our customers for almost ten weeks.

Almost immediately after we turned up two new Arista border routers in late July, we started receiving a trickle of complaints from customers about their inability to access certain websites (mostly B2B). All the packet captures showed the standard TCP SYN/SYN-ACK pair, then a TCP RST from the website after the client sent a TLS/SSL Client Hello. As the reports continued to come in, we built a Google Doc to keep track, and it became clear that most of the sites were hosted by Incapsula/Imperva, with a few on Sucuri and Fastly. Knowing that Incapsula provides DoS protection, we attempted to work with them (providing websites, source/destination IPs, traceroutes, and packet captures) to find out why their hosts were issuing our customers a TCP RST, but we made little progress. We moved some of the affected customers to different IP addresses, but that didn't resolve the issue. We also asked our customer to work with the website to see if they would be willing to open a ticket with Incapsula. In the meantime, customers were getting frustrated! They couldn't visit Incapsula-hosted healthcare websites, financial firms, product dealers, etc. Over the weeks, a few of those customers purchased or borrowed different routers, and some of those no longer had website issues. More than a few discovered that the websites worked fine from home or from their mobile phone/hotspot, but not from their Internet connection with us. You can guess where they were applying pressure! Still, we didn't know why a small handful of companies, known for DoS protection, were issuing TCP RSTs to just some of our customers.

Earlier this week we received four or five more websites from yet another affected customer, most of them hosted by Fastly. By this time we had been able to replicate the issue in our lab. Feeling desperate to make some tangible progress, I reached out to the Fastly NOC. In less than 12 hours they provided some helpful feedback, pointing out that a single traceroute to a Fastly site was hitting two of their POPs (they use anycast), and because they don't sync state between POPs, the second POP would naturally issue a TCP RST (sidebar: fascinating blog article on Fastly's infrastructure here: https://www.fastly.com/blog/building-and-scaling-fastly-network-part-2-balancing-requests). In subsequent email exchanges, the Fastly NOC suggested that it appeared we were "spraying flows" (that is, packets belonging to a single client session were egressing our network via different paths). Because Fastly is also present with us at an IX (though they weren't advertising their anycast IPs there at the time), they suggested we look at how our traffic egresses our network (IX versus transit) and at our routers' outbound load-balancing/hashing schemes.

The IX turned out to be a red herring, so I turned my attention to our transit. Each of our border routers has two BGP sessions over two circuits to transit provider POP A and two BGP sessions over two circuits to transit provider POP B, for a total of four BGP sessions per border router and eight BGP sessions altogether. Starting with our core router, I confirmed that its ECMP hashing was consistent, such that Fastly-bound traffic always went to border router 1 or border router 2.

Then I looked at the ECMP hashing scheme on our border routers and noticed something unique: by default, Arista also hashes on TTL.

IPv4 hash fields:
  Source IPv4 Address is ON
  Protocol is ON
  Time-To-Live is ON
  Destination IPv4 Address is ON

Since the source and destination IPs and the protocol weren't changing, perhaps the TTL was not consistent? I opened the first packet trace in Wireshark and, jackpot: the TTL was 128 on the SYN but 127 on the TLS/SSL Client Hello. I adjusted the Arista's load-balancing profile not to use TTL, and immediately the MTR I had running in the background changed and all the sites on the lab machine that couldn't load before were now loading.

Fastly also pointed me to an article by Joel Jaeggli (https://blog.apnic.net/2018/01/11/ipv6-flow-label-misuse-hashing/) that discusses IPv6 flow-label misuse in hashing, so we removed the flow label from the border routers' IPv6 hash fields, too.

I reviewed the packet traces today and noticed that the TTL values remained consistent at 128 *behind* the router CPE. In packet captures on the WAN interface of the router CPE, the SYN still has a TTL of 128, but the TLS Client Hello is properly decremented to 127. So it appears that some router CPE (and there were a variety of makes and models) do something special with certain packets and don't decrement the TTL.

This explains why:
- our customers had issues with all their devices behind their router CPE
- the issue remained regardless of what public IP address their router CPE obtained via DHCP or was assigned
- some customers who changed their router CPE no longer had the issue; they got lucky with a router that doesn't adjust/reset the TTL
- customers who used our managed Wi-Fi router did not see the issue, because that model apparently doesn't manipulate the TTL, at least not in an inconsistent way

Lesson learned: review a device's hashing mechanism before going into production.

For those interested, the links below my signature point to packet traces showing the inconsistent TTL values.

Thanks again to the fantastic group of folks at the Fastly NOC who so ably pointed us in the right direction!

Frank

https://www.premieronline.net/~fbulk/example1.pcapng
https://www.premieronline.net/~fbulk/example2.pcapng
https://www.premieronline.net/~fbulk/example3.pcapng
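To make the failure mode concrete, here is a minimal Python sketch of ECMP next-hop selection. It is not Arista's actual hash; the next-hop names and addresses are illustrative only. The point is simply that once TTL is one of the hashed fields, a SYN arriving with TTL 128 and a Client Hello arriving with TTL 127 can be steered onto different egress paths (the "flow spraying" Fastly described), while dropping TTL from the hash keeps the whole flow on one path.

import hashlib

NEXT_HOPS = ["transit-A-1", "transit-A-2", "transit-B-1", "transit-B-2"]  # illustrative

def pick_next_hop(src_ip, dst_ip, proto, ttl=None):
    # Hash the selected header fields and map the digest onto a next hop.
    key = f"{src_ip}|{dst_ip}|{proto}"
    if ttl is not None:               # TTL included in the hash (the default at issue)
        key += f"|{ttl}"
    digest = hashlib.sha256(key.encode()).digest()
    return NEXT_HOPS[int.from_bytes(digest[:4], "big") % len(NEXT_HOPS)]

# SYN leaves the CPE with TTL 128, the Client Hello with TTL 127 (example addresses):
print(pick_next_hop("203.0.113.10", "198.51.100.7", "tcp", ttl=128))
print(pick_next_hop("203.0.113.10", "198.51.100.7", "tcp", ttl=127))  # frequently a different path

# With TTL removed from the hash fields, every packet of the flow agrees:
print(pick_next_hop("203.0.113.10", "198.51.100.7", "tcp"))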
On Sat, Sep 1, 2018 at 2:51 PM, <frnkblk@iname.com> wrote:
pointing out that a single traceroute to a Fastly site was hitting two of their POPs (they use anycast) and because they don’t sync state between POPs the second POP would naturally issue a TCP RST (sidebar: fascinating blog article on Fastly’s infrastructure here: https://www.fastly.com/blog/building-and-scaling-fastly-network-part-2-balan...).
Oh for Pete's sake. If they're going to attempt Anycast TCP with a unicast protocol stack they should at least have the sense to suppress the RSTs. Better yet, do the job right and build an anycast TCP stack as described here: https://bill.herrin.us/network/anycasttcp.html

Regards,
Bill Herrin

--
William Herrin ................ herrin@dirtside.com bill@herrin.us
Dirtside Systems ......... Web: <http://www.dirtside.com/>
On Sat, Sep 1, 2018 at 4:00 PM, William Herrin <bill@herrin.us> wrote:
On Sat, Sep 1, 2018 at 2:51 PM, <frnkblk@iname.com> wrote:
pointing out that a single traceroute to a Fastly site was hitting two of their POPs (they use anycast) and because they don’t sync state between POPs the second POP would naturally issue a TCP RST (sidebar: fascinating blog article on Fastly’s infrastructure here: https://www.fastly.com/blog/building-and-scaling-fastly-network-part-2-balan...).
Better yet, do the job right and build an anycast TCP stack as described here: https://bill.herrin.us/network/anycasttcp.html
BTW, for anyone concerned about an explosion in state management overhead, the TL;DR version is: the anycast node which first accepts the TCP connection encodes its identity in the TCP sequence number, where all the other nodes can statelessly find it in the subsequent packets. The exhaustive details of how that actually works are covered in the paper at the URL above, which you'll have to read despite its length if you want to understand.

Regards,
Bill Herrin

--
William Herrin ................ herrin@dirtside.com bill@herrin.us
Dirtside Systems ......... Web: <http://www.dirtside.com/>
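As a rough illustration of that TL;DR (and emphatically not the actual scheme from the paper), the following Python toy assumes the node ID simply occupies the top few bits of the ISN; it ignores sequence-number wrap, the randomness concerns raised later in this thread, and the additional hooks the paper describes.

import os

NODE_BITS = 4                                   # up to 16 anycast nodes (assumed)

def choose_isn(node_id):
    # Accepting node picks an ISN whose top NODE_BITS encode its identity;
    # the remaining bits are random.
    random_part = int.from_bytes(os.urandom(4), "big") >> NODE_BITS
    return ((node_id << (32 - NODE_BITS)) | random_part) & 0xFFFFFFFF

def node_from_ack(ack):
    # Any other node can statelessly recover the accepting node's identity
    # from the client's acknowledgment number, provided the server has sent
    # less than 2**(32 - NODE_BITS) bytes and no wrap has occurred.
    return (ack >> (32 - NODE_BITS)) & ((1 << NODE_BITS) - 1)

isn = choose_isn(node_id=5)
client_ack = (isn + 1 + 1400) % 2**32           # client has ACKed SYN plus 1400 bytes
print(node_from_ack(client_ack))                # 5: forward the packet to node 5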
On 9/1/18, William Herrin <bill@herrin.us> wrote:
On Sat, Sep 1, 2018 at 4:00 PM, William Herrin <bill@herrin.us> wrote:
On Sat, Sep 1, 2018 at 2:51 PM, <frnkblk@iname.com> wrote:
pointing out that a single traceroute to a Fastly site was hitting two of their POPs (they use anycast) and because they don’t sync state between POPs the second POP would naturally issue a TCP RST (sidebar: fascinating blog article on Fastly’s infrastructure here: https://www.fastly.com/blog/building-and-scaling-fastly-network-part-2-balan...).
Better yet, do the job right and build an anycast TCP stack as described here: https://bill.herrin.us/network/anycasttcp.html
BTW, for anyone concerned about an explosion in state management overhead, the TL;DR version is: the anycast node which first accepts the TCP connection encodes its identity in the TCP sequence number where all the other nodes can statelessly find it in the subsequent packets. The exhaustive details for how that actually works are covered in the paper at the URL above, which you'll have to read despite its length if you want to understand.
An explosion in state management would be the least of my worries :)  I got as far as your "Third hook:" and thought of this: https://www.jwz.org/doc/worse-is-better.html

I had it much easier with anycast in an enterprise setting. With anycast servers in data centers A & B, just make sure no site has an equal cost path to A and B. Any link/router/whatever failure & the user can just re-try.

Lee
On Sat, Sep 1, 2018 at 6:11 PM, Lee <ler762@gmail.com> wrote:
On 9/1/18, William Herrin <bill@herrin.us> wrote:
On Sat, Sep 1, 2018 at 4:00 PM, William Herrin <bill@herrin.us> wrote:
Better yet, do the job right and build an anycast TCP stack as described here: https://bill.herrin.us/network/anycasttcp.html
An explosion in state management would be the least of my worries :) I got as far as your Third hook: and thought of this https://www.jwz.org/doc/worse-is-better.html
Hi Lee,

On a brief tangent: geographic routing would drastically simplify the Internet core, reducing both cost and complexity. You'd need to carry only nearby specific routes and a few broad aggregates for destinations far away. It will never be implemented, never, because no cross-ocean carriers are willing to have their bandwidth stolen when the algorithm decides it likes their path better than a paid one. Even though the algorithm gets the packets where they're going, and does so simply, it does so in a way that's too often incorrect.

Then again, I don't really understand the MIT/New Jersey argument in Richard's worse-is-better story. The MIT guy says that a routine should handle a common non-fatal exception. The Jersey guy says that it's OK for the routine to return a try-again error and expect the caller to handle it. Since it's trivial to build another layer that calls the routine in a loop until it returns success or a fatal error, it's more a philosophical argument than a practical one. As long as a correct result is consistently achieved in both cases, what's the difference?

Richard characterized the Jersey argument as, "It is slightly better to be simple than correct." I just don't see that in the Jersey argument. Every component must be correct. The system of components as a whole must be complete. It's slightly better for a component to be simple than complete. That's the argument I read, and it makes sense to me.

Honestly, the idea that software is good enough even with known corner cases that do something incorrect... I don't know how that survives in a world where security-conscious programming is not optional.
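For what it's worth, the "another layer that calls the routine in a loop" amounts to a retry wrapper. A minimal Python sketch, with an illustrative exception name and a bounded attempt count so it cannot retry forever:

import time

class TryAgain(Exception):
    """Transient, non-fatal failure; the operation may be retried."""

def with_retries(routine, attempts=5, delay=0.1):
    # Call the routine until it succeeds or raises a fatal (non-TryAgain) error.
    for i in range(attempts):
        try:
            return routine()
        except TryAgain:
            if i == attempts - 1:
                raise                     # give up; let the caller decide
            time.sleep(delay)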
I had it much easier with anycast in an enterprise setting. With anycast servers in data centers A & B, just make sure no site has an equal cost path to A and B. Any link/ router/ whatever failure & the user can just re-try.
You've delicately balanced your network to achieve the principle that even when routing around failures the anycast sites are not equidistant from any other site. That isn't simplicity. It's complexity hidden in the expert selection of magic numbers.

Even were that achievable in a network as chaotic as the Internet, is it simpler than four trivial tweaks to the TCP stack plus a modestly complex but fully automatic user-space program that correctly reroutes the small percentage of packets that went astray?

Regards,
Bill Herrin

--
William Herrin ................ herrin@dirtside.com bill@herrin.us
Dirtside Systems ......... Web: <http://www.dirtside.com/>
On 9/1/18, William Herrin <bill@herrin.us> wrote:
On Sat, Sep 1, 2018 at 6:11 PM, Lee <ler762@gmail.com> wrote:
On 9/1/18, William Herrin <bill@herrin.us> wrote:
On Sat, Sep 1, 2018 at 4:00 PM, William Herrin <bill@herrin.us> wrote:
Better yet, do the job right and build an anycast TCP stack as described here: https://bill.herrin.us/network/anycasttcp.html
An explosion in state management would be the least of my worries :) I got as far as your Third hook: and thought of this https://www.jwz.org/doc/worse-is-better.html
Hi Lee,
On a brief tangent: Geographic routing would drastically simplify the Internet core, reducing both cost and complexity. You'd need to carry only nearby specific routes and a few broad aggregates for destinations far away. It will never be implemented, never, because no cross-ocean carriers are willing to have their bandwidth stolen when the algorithm decides it likes their path better than a paid one. Even though the algorithm gets the packets where they're going, and does so simply, it does so in a way that's too often incorrect.
Then again, I don't really understand the MIT/New Jersey argument in Richard's worse-is-better story.
The "New Jersey" description is more of a caricature than a valid description: "I have intentionally caricatured the worse-is-better philosophy to convince you that it is obviously a bad philosophy and that the New Jersey approach is a bad approach." I mentally did a 's/New Jersey/Microsoft/' and it made a lot more sense.
The MIT guy says that a routine should handle a common non-fatal exception. The Jersey guy says that it's ok for the routine to return a try-again error and expect the caller to handle it. Since its trivial to build another layer that calls the routine in a loop until it returns success or a fatal error, it's more a philosophical argument than a practical one. As long as a correct result is consistently achieved in both cases, what's the difference?
That it's not always a trivial matter to build another layer. That your retry layer needs at least a counter or timeout value so it doesn't retry forever & those values need to be user configurable, so the re-try layer isn't quite as trivial as it appears at first blush.
Richard characterized the Jersey argument as, "It is slightly better to be simple than correct." I just don't see that in the Jersey argument. Every component must be correct. The system of components as a whole must be complete. It's slightly better for a component to be simple than complete. That's the argument I read and it makes sense to me.
Yes, I did a lot of interpreting also. Then I hit on s/New Jersey/Microsoft/ and it made a lot more sense to me.
Honestly, the idea that software is good enough even with known corner cases that do something incorrect... I don't know how that survives in a world where security-conscious programming is not optional.
Agreed. I substituted "soft-fail or fail-closed: user has to retry" for doing something incorrect.
I had it much easier with anycast in an enterprise setting. With anycast servers in data centers A & B, just make sure no site has an equal cost path to A and B. Any link/ router/ whatever failure & the user can just re-try.
You've delicately balanced your network to achieve the principle that even when routing around failures the anycast sites are not equidistant from any other site. That isn't simplicity. It's complexity hidden in the expert selection of magic numbers.
^shrug^ it seemed simple to me. And it was real easy to explain, which is why I thought of that "worse is better" paper. I took the New Jersey approach & did what was basically a hack. You took the MIT approach and created a general solution .. which is not so easy to explain :)
Even were that achievable in a network as chaotic as the Internet, is it simpler than four trivial tweaks to the TCP stack plus a modestly complex but fully automatic user-space program that correctly reroutes the small percentage of packets that went astray?
Your four trivial tweaks to the TCP stack are kernel patches - right? Which seems not at all trivial to me, but if you've got a group of people that can support & maintain that - good for you!

Regards,
Lee
William Herrin <bill@herrin.us> writes:
BTW, for anyone concerned about an explosion in state management overhead, the TL;DR version is: the anycast node which first accepts the TCP connection encodes its identity in the TCP sequence number where all the other nodes can statelessly find it in the subsequent packets.
I didn't see a security section in your document. Did you consider the side effects of this sequence number abuse?

Bjørn
On Sun, Sep 2, 2018 at 6:06 AM, Bjørn Mork <bjorn@mork.no> wrote:
William Herrin <bill@herrin.us> writes:
I didn't see a security section in your document. Did you consider the side effects of this sequence number abuse?
Hi Bjørn,

In the "issues and criticisms" section.

Regards,
Bill Herrin

--
William Herrin ................ herrin@dirtside.com bill@herrin.us
Dirtside Systems ......... Web: <http://www.dirtside.com/>
William Herrin <bill@herrin.us> writes:
On Sun, Sep 2, 2018 at 6:06 AM, Bjørn Mork <bjorn@mork.no> wrote:
William Herrin <bill@herrin.us> writes:
I didn't see a security section in your document. Did you consider the side effects of this sequence number abuse?
Hi Bjørn,
In the "issues and criticisms" section.
I can see the effect on SYN cookies being discussed there, but I don't think that covers all concerns with respect to more predictable sequence numbers. See RFC 6528, including its references.

Bjørn
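For context, RFC 6528 generates ISN = M + F(localip, localport, remoteip, remoteport, secretkey), where M is a clock that ticks every 4 microseconds and F is a keyed hash. A rough Python sketch is below (real stacks use their own PRF and key handling); reserving ISN bits to carry a node identity necessarily gives up some of that per-connection unpredictability, which is the concern here.

import hashlib, hmac, os, time

SECRET_KEY = os.urandom(16)       # per-boot secret, illustrative

def rfc6528_isn(local_ip, local_port, remote_ip, remote_port):
    m = int(time.monotonic() * 1_000_000 / 4) & 0xFFFFFFFF     # 4-microsecond clock
    msg = f"{local_ip}|{local_port}|{remote_ip}|{remote_port}".encode()
    f = int.from_bytes(hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()[:4], "big")
    return (m + f) & 0xFFFFFFFF

print(hex(rfc6528_isn("192.0.2.1", 443, "203.0.113.10", 51515)))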
On Sun, Sep 2, 2018 at 6:49 AM, Bjørn Mork <bjorn@mork.no> wrote:
William Herrin <bill@herrin.us> writes:
On Sun, Sep 2, 2018 at 6:06 AM, Bjørn Mork <bjorn@mork.no> wrote:
William Herrin <bill@herrin.us> writes:
I didn't see a security section in your document. Did you consider the side effects of this sequence number abuse?
In the "issues and criticisms" section.
I can see the effect on syn cookies being disussed there, but I don't think that covers all concerns wrt more predicatable sequence numbers.
See RFC6528, including its references.
Thanks Bjørn,

I've added several notes in "issues and criticisms" based on that information.

Regards,
Bill Herrin

--
William Herrin ................ herrin@dirtside.com bill@herrin.us
Dirtside Systems ......... Web: <http://www.dirtside.com/>
I would love this as a blog post to link for folks who are not NANOG members.

-Garrett

On Sat, Sep 1, 2018, 11:52 <frnkblk@iname.com> wrote:
On Sat, 1 Sep 2018 at 21:06, Garrett Skjelstad <garrett@skjelstad.org> wrote:
I would love this as a blog post to link folks that are not nanog members.
-Garrett
Hi Garrett,

It is available via the NANOG list archives: https://mailman.nanog.org/pipermail/nanog/2018-September/096871.html

I've shared this story with non-list members using that URL. Thanks for the write-up, Frank!

Cheers,
James.
On 09/02/2018 10:24 AM, James Bensley wrote:
It is available via the NANOG list archives: https://mailman.nanog.org/pipermail/nanog/2018-September/096871.html
But why did the TLS Client Hello have a lower TTL than the TCP SYN? Do you have any information on that?
hey,
But why did the TLS Client Hello have a lower TTL than the TCP SYN? Do you have any information on that?
Consumer CPEs are typically some BCM reference design where the initial TCP handshake is handled by the Linux kernel and everything following (including NAT) is handled in the SoC. I've seen those systems not decrement TTL at all, decrement TTL before checking if the packet is destined to itself, etc. This case is weird, as typically the hardware path is the faulty part, not the kernel.

--
tarko
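A quick way to spot the symptom in captures like the ones Frank linked is to group packets by directed TCP 4-tuple and flag any flow whose packets carry more than one TTL. This sketch assumes Scapy is installed; the filename refers to the first example trace but is otherwise just a placeholder.

from collections import defaultdict
from scapy.all import rdpcap, IP, TCP

ttls = defaultdict(set)
for pkt in rdpcap("example1.pcapng"):        # e.g. the first trace linked above
    if IP in pkt and TCP in pkt:
        flow = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
        ttls[flow].add(pkt[IP].ttl)

for flow, seen in sorted(ttls.items()):
    if len(seen) > 1:
        print(flow, "inconsistent TTLs:", sorted(seen))   # e.g. [127, 128]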
I think it would be a good idea to repost this on reddit.com/r/networking

Tim

Sent from ProtonMail mobile

-------- Original Message --------
On Sep 2, 2018, 10:43 PM, Tarko Tikan wrote:
hey,
But why did the TLS Client Hello have a lower TTL than the TCP SYN? Do you have any information on that?
Consumer CPEs are typically some BCM reference design where initial TCP handshake is handled by linux kernel and everything following (including NAT) is handled in SOC.
I've seen those systems not decrement TTL at all, decrement TTL before checking if packet is destined to itself etc. This case is weird as typically the hardware part is faulty, not the kernel.
-- tarko
Glad we could help, Frank.

On Sat, Sep 1, 2018 at 11:54 <frnkblk@iname.com> wrote:
Can you share the Arista model and EOS version of the devices you installed on which TTL hashing was enabled by default?

On Sat, Sep 1, 2018 at 2:51 PM, <frnkblk@iname.com> wrote:
participants (11)
- Bjørn Mork
- frnkblk@iname.com
- Garrett Skjelstad
- James Bensley
- Lee
- nanog@jack.fr.eu.org
- Ryan Landry
- Tarko Tikan
- Timothy Manito
- Tom Beecher
- William Herrin