On Fri, Jun 5, 2020 at 9:50 PM William Herrin <bill@herrin.us> wrote:
On Fri, Jun 5, 2020 at 6:08 PM Yang Yu <yang.yu.list@gmail.com> wrote:
On Fri, Jun 5, 2020 at 10:39 AM William Herrin <bill@herrin.us> wrote:
Speaking of which, did anyone ever implement FIB compression? I seem to remember the calculations looked really favorable for the leaf node use case (like James's) where the router sits at the edge with a small number of more or less equivalent upstream transits. The FIB is the expensive memory. The RIB sits in the cheap part of the hardware.
fib optimize => using LPM table for LEM https://www.arista.com/en/um-eos/eos-section-28-11-ipv4-commands#ww1173031
Cool. So for folks who want a nutshell version about FIB compression, here it is:
[...]
the same. FIB compression eliminates the implicit reject and instead routes the unroutable packets to a more or less random next hop. If that next hop is also using FIB compression, it may route them right back to you, creating a routing loop until the packet's TTL expires.
The commercially available implementations do not work as you describe and fortunately do not carry that (or really, any) risk.

On platforms where the number of FIB entries is limited but the prefix length doesn't affect that limit (classic TCAM), it is possible to combine adjacent entries (e.g. two /24s) with the same FEC (next hop) into fewer entries. This is probably what most people think of as "FIB compression". Maybe it's used somewhere, maybe it's not.

It's also possible to suppress the installation into the FIB of routes whose prefix falls completely within a covering prefix with the same FEC. Doing so is computationally inexpensive, useful on almost any FIB lookup structure, and significantly helpful (on the order of 2x) even on very well-connected routers. Arista implements this in the feature Yang linked to with the URL containing "fib-compression", though the actual command is better named: "ip fib compression redundant-specifics filter"

Also, on the Broadcom Jericho chip (used by the Arista 7500R/7280R and Cisco NCS 5502/5508), there is a longest-prefix-match (LPM) table and a separate, much larger exact-match (LEM) table, both of which can be used for IP forwarding. (The LPM table is sort of like TCAM but not exactly -- for now, just consider it a limited resource in the same way TCAM has been historically.) Neither table can independently hold the global table. It is possible to optimize the use of these resources by installing certain prefix lengths into LEM to preserve LPM space. It is also possible to do the reverse, expanding mid-sized prefixes that would otherwise land in LPM into multiple LEM entries, reducing the number of LPM entries needed -- essentially striking an optimum balance. That is the other feature Yang linked to.

As also mentioned, all of this works as advertised with basically no limitations. It's been running at Netflix (my employer) for years.

Current production "switch" chips, e.g.
Jericho2, contain significantly more LPM capacity than is needed to hold the global table, and can be paired with additional off-board memory (Broadcom calls this KBP) for future-proofing or VRF scale needs. You can buy either option depending on your needs (e.g. the Arista 7280R3 is available in "K" and non-"K" models).

The aforementioned LEM/LPM feature was a useful bridge into this world of bigger tables in cheaper chips, but it's not needed in new hardware.

James's original question was about using cheaper L3 devices. At this point, for new installs, even if you're limited to buying used gear, you have options that don't involve any config gymnastics.

Regards,
Ryan Woolley
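P.S. For anyone who wants to see the two techniques concretely, here's a toy sketch of adjacent-entry merging and redundant-specifics filtering. The prefixes and next-hop names are invented for illustration, and of course a real router does this against hardware lookup structures, not Python dicts -- this only models the route-table math:

```python
import ipaddress

def merge_same_nexthop(fib):
    """Combine adjacent prefixes that share a FEC (next hop).

    Two adjacent /24s pointing at the same next hop become one /23.
    This is safe because any more-specific route with a different
    next hop still wins on longest-prefix match.
    """
    groups = {}
    for prefix, nh in fib.items():
        groups.setdefault(nh, []).append(ipaddress.ip_network(prefix))
    merged = {}
    for nh, nets in groups.items():
        # collapse_addresses merges adjacent and contained networks.
        for net in ipaddress.collapse_addresses(nets):
            merged[str(net)] = nh
    return merged

def filter_redundant_specifics(fib):
    """Drop routes whose longest covering prefix forwards identically.

    A /24 that falls completely inside a covering route with the same
    next hop contributes nothing to forwarding, so it need not occupy
    a FIB entry.
    """
    routes = {ipaddress.ip_network(p): nh for p, nh in fib.items()}
    kept = {}
    for prefix, nh in routes.items():
        covers = [c for c in routes if c != prefix and prefix.subnet_of(c)]
        if covers:
            best = max(covers, key=lambda c: c.prefixlen)
            if routes[best] == nh:
                continue  # redundant more-specific: suppress it
        kept[str(prefix)] = nh
    return kept

# Invented example table: a default to transit-A plus a few /24s.
fib = {
    "0.0.0.0/0":       "transit-A",
    "192.0.2.0/24":    "transit-A",  # redundant: default already goes to A
    "198.51.100.0/24": "transit-B",
    "198.51.101.0/24": "transit-B",  # adjacent to the /24 above, same FEC
}
print(merge_same_nexthop(fib))
# -> {'0.0.0.0/0': 'transit-A', '198.51.100.0/23': 'transit-B'}
print(filter_redundant_specifics(fib))
# -> {'0.0.0.0/0': 'transit-A', '198.51.100.0/24': 'transit-B',
#     '198.51.101.0/24': 'transit-B'}
```

Note that merging only ever shrinks the table within one FEC group, while the redundant-specifics filter compares each route against just its longest covering prefix -- the one a packet would actually fall through to -- which is why neither can create the routing loop described above.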