Re: LAG/ECMP and 'exact-route'

15 Aug 2025

      Hey LJ,
...
*   the packet itself, often parsed into at least:
     *   IP source / dest
     *   layer4 source / dest
     *   TOS
     *   Entropy / flow labels
  *   ALL of the metadata that might be used to make ANY of the hashing decisions...
     *   at least some systems use the input interface as an input into the hash.
     *   some use TOS, some don’t.
     *   some have different hash generator algorithms, so you have to know which one
     *   usually there’s some additional hash seed for more entropy (such as the router-ID) – you have to know if this is in play and if so what it is.
     *   you need to know exactly which NPU(s) are in the forwarding path, because there’s no guarantee that they use the same algorithms.
Ingress interface is also a common hash key. Also, for tunneling (MPLS,
GRE, IPIP, GTP) you may look at the inner headers as well. And for ICMP
error packets, such as those PMTUD relies on, you should actually hash
on the embedded packet, not the outer headers. This is rarely if ever
implemented (despite being relatively simple to implement), which
breaks PMTUD in ECMP cases and causes customers to implement weird
workarounds
(https://blog.cloudflare.com/path-mtu-discovery-in-practice/).
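
To make the ICMP point concrete, here is a minimal sketch of hashing an
ICMP error on its embedded packet. It is not any vendor's algorithm: the
names (FlowKey, Packet, hash_key, pick_next_hop) are hypothetical, SHA-256
stands in for whatever hash the hardware really uses, and the seed and
ingress interface are only shown as optional extra inputs. It assumes the
hash is not symmetric, so the embedded source and destination are swapped
to match the direction the ICMP error is travelling.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FlowKey:
    """Fields a hypothetical ECMP hash consumes; real platforms differ."""
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int = 0
    dst_port: int = 0


@dataclass(frozen=True)
class Packet:
    outer: FlowKey
    # For ICMP errors: the header of the original packet quoted in the payload.
    embedded: Optional[FlowKey] = None


def hash_key(pkt: Packet) -> FlowKey:
    """Return the key to hash on.

    For an ICMP error, use the embedded packet with source and destination
    swapped, so the error hashes like the flow's own packets heading the
    same way as the error.
    """
    e = pkt.embedded
    if e is None:
        return pkt.outer
    return FlowKey(e.dst_ip, e.src_ip, e.proto, e.dst_port, e.src_port)


def pick_next_hop(pkt: Packet, next_hops: Sequence[str],
                  seed: int = 0, ingress_if: str = "") -> str:
    """Choose one next hop. SHA-256 is a stand-in, not a real NPU hash."""
    k = hash_key(pkt)
    material = (f"{seed}|{ingress_if}|{k.src_ip}|{k.dst_ip}|"
                f"{k.proto}|{k.src_port}|{k.dst_port}")
    digest = hashlib.sha256(material.encode()).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]


# A TCP flow from a client to a service address behind ECMP.
flow = Packet(FlowKey("192.0.2.10", "198.51.100.1", 6, 40000, 443))

# ICMP "fragmentation needed" from a transit router to the service address,
# quoting a packet the service sent to the client.
icmp = Packet(
    outer=FlowKey("203.0.113.7", "198.51.100.1", 1),
    embedded=FlowKey("198.51.100.1", "192.0.2.10", 6, 443, 40000),
)

hops = ["server-a", "server-b", "server-c", "server-d"]

# Both land on the same next hop because they hash on the same key.
assert pick_next_hop(flow, hops, seed=42) == pick_next_hop(icmp, hops, seed=42)
```

Note that if the ingress interface is part of the hash and the ICMP error
arrives on a different interface than the flow, the two can still diverge.
That is one more input an 'exact-route' style command would have to be
told correctly.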

Anyhow, if you have to know which NPU you are using, you misunderstood
the assignment. Such an implementation will work once, when it gets
written, and over time it will become wrong, because different people
maintain the EZchip/LS and the RE hash-code command. It is guaranteed
to feed bad information to the user. This is basically where Cisco is
today: there is code (cef exact-route), but it doesn't talk to HW, and
it gives results that people use but that are not correct.

I know that Juniper MX (not PTX) injects the packet into the HW lookup
engine, runs the normal ucode and yoinks the answer. So it will be
correct, and no one has to maintain it separately. I understood from
other contributors that this is how the Arista implementation works
too, but it also appears to have platform gaps.

Of course, even if this is implemented correctly, the points you make
mean there is an extremely large risk that users simply do not give
the right set of keys. They'll still get results and again end up
confidently working with bad data.

For these reasons RFC5837 is so much better: the far-end system simply
tells you where it received the frame, removing all guesswork and
fragility. So it might be best if the standard case were that users
use RFC5837 to glean this information, and the 'exact-route' solution
on the NOS is the exception case, for when you simply do not have the
ability to generate those packets for real right now.
...
So.  Anyway.  In my newfound role as head apologist for people who build big systems... the main reason that these commands don’t exist on most systems is not because we don’t know how to implement them, and not because we don’t see the value in implementing them.  It’s because the cost to implement (and maintain!!!) them is actually really high, and people have decided (with their wallets) that they want other things more than they want this.
If this were true, you would have implemented RFC5837. The real reason
why things are not implemented is that no one dangled a fat RFQ gated
by the request. This is how features get implemented, even when they
are absolutely stupid features which should not be implemented, and
customers should instead be educated about why what they ask for
introduces fragility that cannot be justified because superior options
already exist. But doing things the right way and having a good
business case may not always go hand in hand.
These absolutely stupid features increase technical debt and cause
fragility for all users, but of course they help win that RFQ, so
they get implemented.

-- 
  ++ytti

Saku Ytti