In case it is useful for anyone else, underlying issue looks to be this:
Cisco CSCws27022: ECN bits being included as part of ECMP hash on IPv6 TCP flows (Workaround: Do not use ECMP)
Appears to be platform specific, affecting Cisco Catalyst C9K UADP ASIC (C9500-32C)
Another work-around might be to configure "ip cef load-sharing algorithm original"
Tim:>
On Tue, Mar 25, 2025 at 4:33 PM Tim Durack <tdurack@gmail.com> wrote:
Very helpful, thanks! Will post my own short story once complete...
On Tue, Mar 25, 2025 at 4:24 PM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
Tim Durack <tdurack@gmail.com> writes:
Toke,
Resurrecting an old thread, did you ever write this one up?
Hi Tim
Thank you for the reminder! No, I never did get around to writing anything at the time. However, now that you reminded me, I collected my old notes and posted this:
https://blog.tohojo.dk/2025/03/ecn-ecmp-and-anycast-a-cocktail-of-broken-con...
I believe I have a customer reporting a similar problem: IPv6 TCP with ECN, probably combined with ECMP, resulting in RSTs coming back from anycast services (Cloudflare).
Tricky one to debug, looking for similar reports...
Hoping the above is helpful :)
-Toke
-- Tim:>
These types of issues are endemic to TCP anycast, unfortunately. If your services depend on anycast, you really want to make sure that your hashing algo network-wide uses the TCP/IP five-tuple, and the five-tuple *only*; I’m assuming that’s what changing to the “original” load-sharing CEF algorithm accomplished in your case.
My personal war story involved connections to a database server that would change DSCP values between the TCP handshake and subsequent packets, resulting in RSVP-TE putting the packets on different paths... to a PE router that incorporated the inbound destination MAC (read: inbound interface) into the hash algo, with no option to disable. Different inbound interfaces = different dest MAC = different anycast destinations = immediate SEV1 impact as all our database connections broke at once.
The solution*? Reconfigure each L3 interface on the router to the same MAC address. :P
-Chris
* An alternate solution involved screaming into the vendor support void, which eventually did get results... a year or so later.
** As I’m typing this, I’ve realized that an iptables rule on the server to undo the DSCP change would have fixed it; putting that thought onto the shelf if needed later.
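To make the hashing point concrete, here is a minimal Python sketch of why ECN bits in the hash input can split a single TCP flow across ECMP paths. It is a toy model under stated assumptions, not the UADP ASIC's or CEF's actual algorithm: the `Packet` type, `pick_path`, `count_split_flows`, the SHA-256 based hash, and the documentation-range addresses are all hypothetical. It relies only on the fact that an ECN-capable connection sends its SYN as Not-ECT while later data segments are typically marked ECT(0), so the ECN field changes mid-flow while the five-tuple does not.

```python
import hashlib
from typing import NamedTuple

# ECN codepoints (lower 2 bits of the IPv6 Traffic Class / IPv4 TOS byte).
NOT_ECT = 0b00
ECT0 = 0b10


class Packet(NamedTuple):
    src: str
    dst: str
    proto: int
    sport: int
    dport: int
    tclass: int  # Traffic Class byte: DSCP (upper 6 bits) + ECN (lower 2 bits)


def split_tclass(tclass: int) -> tuple[int, int]:
    """Return (dscp, ecn) from a Traffic Class byte."""
    return tclass >> 2, tclass & 0b11


def pick_path(pkt: Packet, n_paths: int, include_ecn: bool) -> int:
    """Toy ECMP decision: hash the chosen fields, then take modulo n_paths."""
    fields = [pkt.src, pkt.dst, pkt.proto, pkt.sport, pkt.dport]
    if include_ecn:
        # Models the behaviour described in the bug report: ECN leaks into the hash.
        fields.append(split_tclass(pkt.tclass)[1])
    digest = hashlib.sha256(repr(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def count_split_flows(n_flows: int, n_paths: int, include_ecn: bool) -> int:
    """Count flows whose SYN and data segments land on different paths."""
    split = 0
    for i in range(n_flows):
        syn = Packet("2001:db8::1", "2001:db8:ffff::53", 6, 10000 + i, 443, NOT_ECT)
        data = syn._replace(tclass=ECT0)  # same five-tuple, only ECN differs
        if pick_path(syn, n_paths, include_ecn) != pick_path(data, n_paths, include_ecn):
            split += 1
    return split


if __name__ == "__main__":
    print("five-tuple only:", count_split_flows(1000, 4, include_ecn=False))
    print("five-tuple + ECN:", count_split_flows(1000, 4, include_ecn=True))
```

With the five-tuple alone, the SYN and the data segments always pick the same path. Once ECN is part of the input, a large share of flows would be expected to move to a different path after the handshake, and with anycast that can mean a different server with no state for the connection, hence the RST.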
On Dec 2, 2025, at 09:12, Tim Durack via NANOG <nanog@lists.nanog.org> wrote:
In case it is useful for anyone else, underlying issue looks to be this:
Cisco CSCws27022: ECN bits being included as part of ECMP hash on IPv6 TCP flows (Workaround: Do not use ECMP)
Appears to be platform specific, affecting Cisco Catalyst C9K UADP ASIC (C9500-32C)
Another work-around might be to configure "ip cef load-sharing algorithm original"
Tim:>
_______________________________________________ NANOG mailing list https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/KSVJBYJY...
Even if it is unicasted, performance is destroyed due to reordering. All modern TCP stacks use cubic for congestion control, which considers reorder a packet loss. However, technically speaking, QoS is of least concern, because if you change QoS you expect reordering. Of course this excludes the ECN bits: only the 6 most significant bits of the field (the DSCP) can legitimately be considered part of the hash.
On Tue, 2 Dec 2025 at 21:44, Chris Woodfield via NANOG <nanog@lists.nanog.org> wrote:
These types of issues are endemic to TCP anycast, unfortunately. If your services depend on anycast, you really want to make sure that your hashing algo network-wide uses the TCP/IP five-tuple, and the five-tuple *only*; I’m assuming that’s what changing to the “original” load-sharing CEF algorithm accomplished in your case.
My personal war story involved connections to a database server that would change DSCP values between the TCP handshake and subsequent packets, resulting in RSVP-TE putting the packets on different paths... to a PE router that incorporated the inbound destination MAC (read: inbound interface) into the hash algo, with no option to disable. Different inbound interfaces = different dest MAC = different anycast destinations = immediate SEV1 impact as all our database connections broke at once.
The solution*? Reconfigure each L3 interface on the router to the same MAC address. :P
-Chris
* An alternate solution involved screaming into the vendor support void, which eventually did get results... a year or so later.
** As I’m typing this, I’ve realized that an iptables rule on the server to undo the DSCP change would have fixed it; putting that thought onto the shelf if needed later.
_______________________________________________ NANOG mailing list https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/UW4PU2VV...
-- ++ytti
Saku Ytti via NANOG <nanog@lists.nanog.org> writes:
Even if it is unicasted, performance is destroyed due to reordering. All modern TCP stacks use cubic for congestion control, which considers reorder a packet loss.
This is not quite true any longer. Linux implements RACK-TLP (RFC8985) which prevents short-term reordering (such as that caused by ECMP) from being interpreted as a congestion event.
It seems Windows does too, these days: https://techcommunity.microsoft.com/blog/networkingblog/algorithmic-improvem...
-Toke
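As a rough illustration of the difference, the following Python sketch contrasts the classic duplicate-ACK threshold with a simplified time-based rule in the spirit of RFC 8985. It is not the Linux implementation: the `Segment` type, both function names, and the fixed reordering window are assumptions made for the example, and the real algorithm adapts the window and carries additional heuristics. The sketch assumes reordering has already been observed on the connection, so a non-zero window (here a quarter of the RTT) is in effect.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int
    xmit_ts: float  # send time in seconds
    acked: bool = False


def dupack_lost(segments, dupthresh=3):
    """Classic rule: a segment is lost once `dupthresh` later segments are acked."""
    lost = []
    for seg in segments:
        if seg.acked:
            continue
        later_acked = sum(1 for s in segments if s.acked and s.seq > seg.seq)
        if later_acked >= dupthresh:
            lost.append(seg.seq)
    return lost


def rack_lost(segments, now, rtt, reo_wnd):
    """Time-based rule: a segment is lost only if a later-sent segment was
    delivered AND at least rtt + reo_wnd has passed since it was sent."""
    acked_ts = [s.xmit_ts for s in segments if s.acked]
    if not acked_ts:
        return []
    newest_acked_ts = max(acked_ts)
    return [
        s.seq
        for s in segments
        if not s.acked
        and s.xmit_ts < newest_acked_ts
        and now - s.xmit_ts >= rtt + reo_wnd
    ]


def demo_segments():
    """Five segments sent 1 ms apart; segment 0 took a slower ECMP path,
    so segments 1..3 are acknowledged before it."""
    segs = [Segment(seq=i, xmit_ts=i * 0.001) for i in range(5)]
    for s in segs[1:4]:
        s.acked = True
    return segs


if __name__ == "__main__":
    rtt, reo_wnd = 0.040, 0.010
    print("dupack rule:", dupack_lost(demo_segments()))
    print("time-based, 45 ms:", rack_lost(demo_segments(), 0.045, rtt, reo_wnd))
    print("time-based, 60 ms:", rack_lost(demo_segments(), 0.060, rtt, reo_wnd))
```

In this toy scenario the duplicate-ACK rule declares segment 0 lost as soon as three later segments are acknowledged, whereas the time-based rule gives it until the RTT plus the reordering window has elapsed, which is the slack that lets a mildly reordered segment arrive without triggering a congestion response.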
Thank you. I wasn't aware of that improvement. It is an absolutely massive improvement in observed behaviour: penalties appear to be no more than 50%, which is a pretty tough ask if you still want to keep good performance in the common case, which is legitimate packet loss.
The justification behind the fix is still a bit hilarious, as it appears as if Dropbox and/or Samsung strategically reorder under some design they have. Despite it being 'fixed', you would still not want to give away 50% of your investment to congestion control technicalities, and we can still fairly argue that if your design strategically reorders, it is a broken design.
On Wed, 3 Dec 2025 at 14:26, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
Saku Ytti via NANOG <nanog@lists.nanog.org> writes:
Even if it is unicasted, performance is destroyed due to reordering. All modern TCP stacks use cubic for congestion control, which considers reorder a packet loss.
This is not quite true any longer. Linux implements RACK-TLP (RFC8985) which prevents short-term reordering (such as that caused by ECMP) from being interpreted as a congestion event.
It seems Windows does too, these days: https://techcommunity.microsoft.com/blog/networkingblog/algorithmic-improvem...
-Toke
-- ++ytti
participants (4)
- Chris Woodfield
- Saku Ytti
- Tim Durack
- Toke Høiland-Jørgensen