RE: LAG/ECMP and 'exact-route'

14 Aug 2025

This is something that's been asked for as a feature more times than I can count, on more platforms than I can count either.  First, I'll assert that this is completely possible. Second, I'll assert that it's not even really "hard" -- but it is a lot of work.

The way almost all of these mechanisms work is by identifying a set of key fields from the packet and the associated metadata, then applying some hash function to those fields, and then using that result to index into a table of possible next hops.

This is relatively simple if (for example) you’re ONLY doing “pick a member link from this etherchannel / LAG bundle”.  If you have 5 members in the bundle, you hash the key fields, take that result modulo 5, and use the remainder as the index into a table with entries 0 through 4; the entry at that index is the port you go out.
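As a rough illustration of that single-level case, here is a minimal Python sketch.  CRC32 is only a stand-in for whatever hash the forwarding hardware really uses, and the interface names and flow values are made up for the example.

```python
import zlib

def pick_lag_member(key_fields: tuple, members: list) -> str:
    """Hash the key fields, reduce modulo the member count, index the table."""
    # Serialize the fields deterministically. CRC32 is an assumption here,
    # standing in for the platform's real (usually undocumented) hash.
    key = "|".join(str(f) for f in key_fields).encode()
    index = zlib.crc32(key) % len(members)
    return members[index]

# Hypothetical 5-member bundle and a hypothetical flow
# (source IP, destination IP, protocol, source port, destination port).
members = ["et-0/0/0", "et-0/0/1", "et-0/0/2", "et-0/0/3", "et-0/0/4"]
flow = ("192.0.2.1", "198.51.100.7", 6, 49152, 443)

print(pick_lag_member(flow, members))
```

The same flow always lands on the same member, which is the whole point: no reordering within a flow.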

HOWEVER, this starts to get really complex as soon as you start dealing with multiple levels of recursion.  As a much more complex example you might have a network where you’re doing Inter-AS VPN, and you might be using something like MPLS-TE within it.  Now your resolution “tree” could include at least the following decisions that need to be made:

  1.  Packet arrives, I need to do a FIB lookup to identify which ASBR exit point to use.  So that’s one hash calculation and an index into the group for the ASBR.
  2.  Once I have that, I have multiple tunnels/paths to get to that ASBR.  So I need to do another hash (no guarantee that it’s the same number of parallel paths) and do another index into another table.  This tells me which tunnel I’m going to use (and the encap/label for that tunnel...).  Once I have THAT...
  3.  Now I might find out that the output interface for my tunnel is actually a bundle/LAG, which means I need to do (yet) another hash calculation, and index into (yet) another table to pick which actual ethernet interface I send the packet on.  (I also have to get the encap for this, to figure out source/destination MAC addresses for this particular link...)
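
The three steps above can be sketched as a walk down a tree of groups, with one hash and one table index per level.  This is only a toy model: the group names, seeds, and ports are hypothetical, CRC32 stands in for the real hash, and the encap/label lookups at each level are left out.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Group:
    name: str
    seed: int      # per-level seed; real hardware may use a different hash per level
    choices: list  # each entry is either another Group or a port name (str)

def resolve(root: Group, key: bytes):
    """Walk the resolution tree; return the final port and the choices made."""
    path = []
    node = root
    while isinstance(node, Group):
        # One hash calculation and one table index per level of recursion.
        index = zlib.crc32(key, node.seed) % len(node.choices)
        path.append((node.name, index))
        node = node.choices[index]
    return node, path

# Hypothetical tree: ASBR exit -> tunnel toward that ASBR -> LAG member.
lag1 = Group("lag-ae1", 3, ["et-0/0/0", "et-0/0/1"])
lag2 = Group("lag-ae2", 3, ["et-1/0/0", "et-1/0/1", "et-1/0/2"])
tunnels_a = Group("tunnels-to-asbr-a", 2, [lag1, lag2])
tunnels_b = Group("tunnels-to-asbr-b", 2, [lag2, lag1, lag1])
fib_entry = Group("asbr-exit", 1, [tunnels_a, tunnels_b])

key = b"192.0.2.1|198.51.100.7|6|49152|443"
port, path = resolve(fib_entry, key)
print(port, path)
```

Note that each level can have a different number of choices, so the modulo changes at every step.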

SO.... tired yet?  Because we ain’t done...

If it’s possible for the packet to arrive on any kind of virtual interface (like an MPLS/GRE tunnel, or a pseudowire, or whatever) then I probably have to do some extra digging.  Let’s use GRE: in this case, somewhere in the forwarding code I have to make sure that I’m using the correct source and destination IP addresses for my hashing, because hashing on the tunnel addresses isn’t going to have NEARLY as much entropy as hashing on the actual end-station IP source/dest.  UNLESS, of course, someone really does want to hash the whole tunnel together, which means now I have to implement both forwarding flows, AND the knob to select which to use.  See?  Fun.
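
A small sketch of that knob, assuming a simplified packet model (the IpHeader class and the hash_on_inner flag are invented for illustration, not taken from any platform):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IpHeader:
    src: str
    dst: str
    inner: Optional["IpHeader"] = None  # set when this header carries a GRE payload

def hash_key(pkt: IpHeader, hash_on_inner: bool) -> tuple:
    """Pick the address pair that feeds the hash, depending on the knob."""
    if hash_on_inner and pkt.inner is not None:
        return (pkt.inner.src, pkt.inner.dst)
    return (pkt.src, pkt.dst)

# Two different end-station flows riding the same hypothetical GRE tunnel.
flow_a = IpHeader("203.0.113.1", "203.0.113.2", inner=IpHeader("10.0.0.1", "10.0.1.1"))
flow_b = IpHeader("203.0.113.1", "203.0.113.2", inner=IpHeader("10.0.0.2", "10.0.1.9"))

# Outer hashing: both flows produce the same key, so they share one path.
print(hash_key(flow_a, False) == hash_key(flow_b, False))
# Inner hashing: the keys differ, so the flows can spread across paths.
print(hash_key(flow_a, True) == hash_key(flow_b, True))
```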

All of this has to be kept in sync by the control plane across what might be over a hundred discrete forwarding chips/NPUs.  If any of those chips don’t get the memo, instead of a router you have a great big doorstop.

SO... To figure out exactly what the output interface is for any given packet, you need some combination of (at least) the following info:

  *   the packet itself, often parsed into at least:
     *   IP source / dest
     *   layer4 source / dest
     *   TOS
     *   Entropy / flow labels
  *   ALL of the metadata that might be used to make ANY of the hashing decisions...
     *   at least some systems use the input interface as an input into the hash.
     *   some use TOS, some don’t.
     *   some have different hash generator algorithms, so you have to know which one
     *   usually there’s some additional hash seed for more entropy (such as the router-ID) – you have to know if this is in play and if so what it is.
     *   you need to know exactly which NPU(s) are in the forwarding path, because there’s no guarantee that they use the same algorithms.
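
To make the list above concrete, here is a hedged sketch of how those per-platform choices change the answer for the very same packet.  The HashConfig fields, the packet dictionary layout, and the use of CRC32 are all assumptions made for the example.

```python
import zlib
from dataclasses import dataclass

@dataclass(frozen=True)
class HashConfig:
    seed: int = 0                      # e.g. something derived from the router-ID
    use_tos: bool = False              # some platforms include TOS, some don't
    use_input_interface: bool = False  # some include the input interface

def build_key(pkt: dict, cfg: HashConfig) -> bytes:
    """Assemble the hash key from whichever fields this config says are in play."""
    parts = [pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"]]
    if cfg.use_tos:
        parts.append(pkt["tos"])
    if cfg.use_input_interface:
        parts.append(pkt["input_interface"])
    return "|".join(str(p) for p in parts).encode()

def member_index(pkt: dict, cfg: HashConfig, n_members: int) -> int:
    # CRC32 seeded with cfg.seed stands in for the NPU's real algorithm.
    return zlib.crc32(build_key(pkt, cfg), cfg.seed) % n_members

pkt = {"src": "192.0.2.1", "dst": "198.51.100.7", "proto": 6,
       "sport": 49152, "dport": 443, "tos": 0, "input_interface": "et-0/0/0.0"}

# Same packet, two hypothetical NPU configurations: the results need not agree.
print(member_index(pkt, HashConfig(), 5))
print(member_index(pkt, HashConfig(seed=0x0A000001, use_tos=True), 5))
```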

Once you’ve collected all of this information, and assuming that you either constantly maintain or can query from each NPU the current state of its local resolution tree(s), you can compute which output interface will be chosen for that given packet.  But to tell the control plane how to do all this math, you’re going to need a VERY long CLI input, or you’re going to have to feed your command the hex from the raw packet and let it do the decoding for you.  And don’t forget that you have to explicitly tell it things like “what input interface are we simulating” and “are we using the entropy label or not” and all that.
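
For the "feed it the hex" option, the decoding step might look roughly like this.  It is a minimal sketch that assumes the dump starts at an IPv4 header with no Ethernet or MPLS in front of it, and it handles only TCP and UDP ports; the function name and the sample packet are made up.

```python
import socket
import struct

def decode_ipv4(hex_dump: str) -> dict:
    """Pull the usual hash key fields out of a raw IPv4 packet given as hex."""
    raw = bytes.fromhex(hex_dump)
    version_ihl, tos = raw[0], raw[1]
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_len = (version_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    proto = raw[9]
    fields = {
        "src": socket.inet_ntoa(raw[12:16]),
        "dst": socket.inet_ntoa(raw[16:20]),
        "tos": tos,
        "proto": proto,
    }
    if proto in (6, 17):  # TCP or UDP: ports are the first four bytes of the L4 header
        sport, dport = struct.unpack("!HH", raw[header_len:header_len + 4])
        fields["sport"], fields["dport"] = sport, dport
    return fields

# Hypothetical packet: 192.0.2.1:49152 -> 198.51.100.7:443, TCP, checksum left zero.
sample = "450000280000400040060000c0000201c6336407c00001bb"
print(decode_ipv4(sample))
```

Everything that is not in the packet (input interface, seed, whether the entropy label is used) would still have to be supplied separately.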

So.  Anyway.  In my newfound role as head apologist for people who build big systems... the main reason that these commands don’t exist on most systems is not because we don’t know how to implement them, and not because we don’t see the value in implementing them.  It’s because the cost to implement (and maintain!!!) them is actually really high, and people have decided (with their wallets) that they want other things more than they want this.

</apologies>

--lj

-----Original Message-----
From: Tom Beecher via NANOG <nanog@lists.nanog.org>
Sent: Thursday, August 14, 2025 12:31 PM
To: North American Network Operators Group <nanog@lists.nanog.org>
Cc: Steinar.Rimestad@altibox.no; Tom Beecher <beecher@beecher.cc>
Subject: Re: LAG/ECMP and 'exact-route'

> Unfortunately the 'show forwarding-options load-balance' doesn't allow
> giving MPLS label stack to it which greatly limits utility for SP
> networks.

I *think* that you can use the packet-dump option to paste a packet in hex and it will give the proper result with the label stack considered. It was a couple years ago I tried this, and my memory is fuzzy if it did work correctly or not. Even if it did work it's obviously clunky as hell to have to slap a hex decode in there.

I did find a note that I asked for an ER to allow label IDs on the CLI, but I can't find anything further if they said yes/no. I'll ask.

On Thu, Aug 14, 2025 at 2:26 AM Saku Ytti via NANOG <nanog@lists.nanog.org> wrote:
> Thanks Nitzan, that was what I was thinking, that is quite recent (to
> me) and I suspect it is syntactical sugar for 'jsim'?
>
> Unfortunately the 'show forwarding-options load-balance' doesn't allow
> giving MPLS label stack to it which greatly limits utility for SP
> networks.
>
> Steinar, in your experience does the bundle-hash give correct results?
> Is it actually injecting packets to ezchip/lightspeed and getting
> results from the HW (cef exact-route is not doing this at least).
>
> Thanks to Pedro Prado for sharing that Arista has a command for this,
> and indeed in Arista like in Juniper packet is actually injected to
> the hardware to get the result.
>
> I think none of them allow giving MPLS stack though? So mostly useful
> for cloudy people, not SP people. RFC5837 would more reliably give us
> the correct answer.
>
> On Thu, 14 Aug 2025 at 09:10, Nitzan Tzelniker via NANOG
> <nanog@lists.nanog.org> wrote:
>
>> For JUNOS I think that you are looking for
>>
>> user@lab> show forwarding-options load-balance ?
>> Possible completions:
>>   destination-address  Destination IP address
>>   destination-port     Destination port
>>   family               Layer 3 family
>>   ingress-interface    Ingress Logical Interface
>>   packet-dump          Raw packet dump in hex without '0x'
>>   source-address       Source IP address
>>   source-port          Source port
>>   tos                  Type of Service field
>>   transport-protocol   Transport layer protocol
>>
>> Nitzan
>>
>> On Tue, Aug 12, 2025 at 5:58 PM Saku Ytti via NANOG
>> <nanog@lists.nanog.org> wrote:
>>
>>> Hey-o,
>>>
>>> Which platform/software has a command to show which interface will
>>> be used for forwarding with given keys?
>>>
>>> ASR9k has a cef exact-route, and I see references to this in c-nsp,
>>> reddit and cisco.com forums, stressing how useful debugging tool
>>> it has been. Despite it not actually working, since it's just RE
>>> software, it doesn't talk to the EZchip/lightspeed, unless it has
>>> been fixed in the past couple of years, certainly hasn't worked in
>>> the timeline of various forums finding it useful.
>>>
>>> MX has 'jsim'
>>>
>>> https://www.juniper.net/documentation/en_US/day-one-books/TW_MX3D_PacketWalkthrough.pdf
>>>
>>> which I think actually works, but it is quite involved. I have some
>>> (false?) memory that I saw in some release note this being a bit
>>> more productised into CLI command, but I'm failing to find
>>> anything to support this memory.
>>>
>>> There is also RFC5837, which is actually implemented in QFX5k, but
>>> not for TTL exceeded, we've opened ER to get it supported on MX
>>> and PTX and for TTL exceeded. This RFC will allow programmatic
>>> platform agnostic discovery of the actual interface used, without
>>> relying on platform specific magic. So please do ask your vendors
>>> to implement it.
>>>
>>> --
>>> ++ytti
>>> _______________________________________________
>>> NANOG mailing list
>>> https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/65IZUIUM3WTM56W3CLM6HOGK2T7DCEKF/
>> _______________________________________________
>> NANOG mailing list
>> https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/HHWSKHAH2RWUUZN5XMLUCOKMCCLXCK77/
>
> --
> ++ytti
> _______________________________________________
> NANOG mailing list
> https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/DMC65GBTZVZXSWB3NBCCOO7YRBWAXLGS/
_______________________________________________
NANOG mailing list
https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/YQQFFLNG...

LJ Wobker (lwobker)