On Sun, 22 Mar 2020 at 09:41, Mark Tinka <mark.tinka@seacom.mu> wrote:
> We weren't as successful (MX480 ingress/egress devices transiting a CRS core).
So you're not even talking about multivendor, as both ends are JNPR? Or are you confusing entropy label with FAT? Transit doesn't know anything about FAT, FAT is PW specific and is only signalled between end-points. Entropy label applies to all services and is signalled to the adjacent device. Transit just sees a label stack that is one label longer, with the hope (not the promise) that it uses the additional label for hashing.
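To illustrate what transit actually sees, a toy sketch in Python (sha256 stands in for whatever hash the silicon really uses, and all label values are made up; this is only to show the mechanism, not real forwarding code):

import hashlib

def transit_pick_member(label_stack, members):
    # Transit hashes the label stack only; it never has to parse the payload.
    digest = hashlib.sha256(str(label_stack).encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % members

MEMBERS = 4
transport, pw = 16001, 299776            # made-up label values

# Without FAT/EL: every packet of the PW carries the same labels,
# so the whole PW hashes onto a single LAG/ECMP member.
print("no flow label :", transit_pick_member([transport, pw], MEMBERS))

# With FAT/EL: the ingress PE hashes the customer flow and pushes the result
# as one extra label; distinct flows now look distinct to transit.
for flow_label in (712, 4391, 90210):    # made-up per-flow label values
    print("flow label", flow_label, ":",
          transit_pick_member([transport, pw, flow_label], MEMBERS))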
> In the end, we updated our policy to avoid running LAGs in the backbone and go ECMP instead. Even with l2vpn payloads, that spreads a lot more evenly.
You really should be doing CW+FAT. And looking at your other email, dear god, don't do per-packet outside some unique application where you control the TCP stack :). Modern Windows, Linux and MacOS TCP stacks treat out-of-order delivery as packet loss. This is not inherent to TCP: if you can change the TCP congestion control you can make reordering entirely irrelevant, but in most cases we of course do not control the TCP algo, so per-packet will not work one bit. Like OP, you should enable adaptive.

This thread is conflating a few different balancing issues, so I'll take the opportunity to classify them.

1. Bad hashing implementation

1.1 Insufficient amount of hash-results

Think, say, 6500/7600: what if you only have 8 hash-results and 7 interfaces? One interface owns two of the eight results, so it inherently carries 2x the traffic of the others.

1.2 Bad algorithm

Different hashes have different use-cases, and we often reach for a golden-hammer hash (much like we tend to use the wrong hashes for password hashing, like SHA: SHA is designed to be fast in HW, which is the opposite of what a password hash wants, which is to be slow). Equally, since day 1 of ethernet silicon we've had CRC in the silicon, and it has been grandfathered in as the load-balancing hash ever since. But CRC goals are completely different from load-balancing hash goals: CRC does not try to have, and does not need, good diffusion quality, whereas a load-balancing hash needs perfect diffusion and nothing else matters. CRC has terrible diffusion, so instead of implementing a proper good-diffusion hash in silicon, vendors do stuff like rot(crcN(x), crcM(x)), which greatly improves diffusion but is still very poor compared to hash algos designed for perfect diffusion. Poor diffusion means you get different flow counts on your egressInts.

As I can't do math, I did a monte-carlo simulation to see what kind of bias we should expect even with _perfect_ diffusion (a rough sketch of that simulation is further below). Here we have 3 egressInts and we run monte-carlo until we stop getting a worse bias (of course if we wait for the heat death of the universe, we will eventually see every flow on a single int, even with perfect diffusion). But in a normal situation, if you see worse bias than this, you should blame the poor diffusion quality of the vendor's algo; if you see this bias or lower, it's probably not diffusion you should blame.

Flows | MaxBias | Example flow count per int
1k    | 6.9%    | 395, 341, 264
10k   | 2.2%    | 3490, 3396, 3114
100k  | 0.6%    | 33655, 32702, 33643
1M    | 0.2%    | 334969, 332424, 332607

2. Elephant flows

Even if we assume perfect diffusion, so each egressInt gets exactly the same number of flows, the flows may still be wildly different bps, and there is nothing we can do by tuning the hash algo to fix this. The prudent fix is a mapping table between hash-result and egressInt, so that we can inject bias: not a fair distribution of hash-results to egressInts, but fewer hash-results pointing at the congested egressInt. This is easy and ~free to implement in HW. JNPR does it, and NOK is happy to implement it should a customer want it. It of course also fixes bad algorithmic diffusion, so it's a really great tool to have in your toolbox and I think everyone should be running this feature.
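To make the remap idea concrete, a rough Python sketch (not JNPR's or anyone else's actual implementation; the bucket count, the measurement input and the rebalance rule are all made up for illustration):

import random

BUCKETS = 256                       # hash-results the silicon exposes (made up)
LINKS = 3

# Default mapping: fair round-robin, bucket -> egressInt
bucket_to_link = [b % LINKS for b in range(BUCKETS)]

def rebalance(bucket_bps):
    """Re-point buckets from the hottest egressInt towards the coldest.

    bucket_bps: measured bps per hash bucket; elephant flows show up as a few
    very heavy buckets. The hash itself is untouched, so no flow is ever split.
    """
    link_bps = [0.0] * LINKS
    for b, bps in enumerate(bucket_bps):
        link_bps[bucket_to_link[b]] += bps
    target = sum(link_bps) / LINKS
    hot = max(range(LINKS), key=lambda l: link_bps[l])
    cold = min(range(LINKS), key=lambda l: link_bps[l])
    # Move the lightest buckets off the hot link until it is near the target;
    # moving light buckets first avoids overshooting the cold link.
    for b in sorted((b for b in range(BUCKETS) if bucket_to_link[b] == hot),
                    key=lambda b: bucket_bps[b]):
        if link_bps[hot] <= target:
            break
        bucket_to_link[b] = cold
        link_bps[hot] -= bucket_bps[b]
        link_bps[cold] += bucket_bps[b]
    return bucket_to_link

# e.g. one 9 Gbps elephant on bucket 42 plus background noise:
bps = [random.uniform(1e6, 5e6) for _ in range(BUCKETS)]
bps[42] = 9e9
rebalance(bps)

The point is that the hash result of any given flow never changes, so no flow gets reordered; you only change how many hash-results point at each egressInt.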
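And going back to 1.2, the simulation I mean is roughly this shape (a sketch only, using numpy, with perfect diffusion modelled as each flow independently landing on a random egressInt; exact numbers will wander around the table above from run to run):

import numpy as np

def worst_bias(flows, links=3, trials=5000, seed=None):
    rng = np.random.default_rng(seed)
    # counts[t][i] = how many of `flows` flows landed on link i in trial t
    counts = rng.multinomial(flows, [1 / links] * links, size=trials)
    ideal = flows / links
    # MaxBias: largest per-link deviation from an even split, as % of all flows
    bias = np.abs(counts - ideal).max(axis=1) / flows * 100
    worst = int(bias.argmax())
    return bias[worst], counts[worst]

for flows in (1_000, 10_000, 100_000, 1_000_000):
    b, c = worst_bias(flows)
    print(f"{flows:>9} flows: MaxBias {b:.1f}%  per-link {list(c)}")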
3. Incorrect key recovery

Balancing is a promise that we know which keys identify a flow. In the common case this is a simple problem, but there is a lot of complexity, particularly in MPLS transit. The naive/simple problem everyone knows about is a pseudowire flow in transit being parsed as an IPv4/IPv6 flow when the DMAC starts with 4 or 6.

Some vendors (JNPR, Huawei) do additional checks, like perhaps the IP checksum or IP packet length, but this actually makes the situation worse: the problem triggers far less often, but when it does trigger it is so much more exotic, as now you have an underlying frame where, by luck, the supposed IP packet length is also correct. So you can end up in weird situations where the end-customer's network works perfectly, then they implement IPSEC from all hosts to a concentrator, still riding over your backbone, and suddenly one customer host stops working after enabling IPSEC while everything else keeps working. The chance that this trouble ticket ever even lands on your table is low, and the possibility that you'd blame the backbone based on the problem description is negligible. The customer will just end up renumbering the host or replacing its DMAC or something, and no one will ever know why it was broken.

So it's crucial not to do payload heuristics in MPLS transit, as it cannot be done correctly by design. FAT and entropy labels solve this problem correctly, moving the hash-result generation to the edge, where you still can do it correctly.

--
++ytti