On Fri, Jun 5, 2020 at 9:50 PM William Herrin <bill@herrin.us> wrote:
On Fri, Jun 5, 2020 at 6:08 PM Yang Yu <yang.yu.list@gmail.com> wrote:
On Fri, Jun 5, 2020 at 10:39 AM William Herrin <bill@herrin.us> wrote:
Speaking of which, did anyone ever implement FIB compression? I seem to remember the calculations looked really favorable for the leaf node use case (like James's) where the router sits at the edge with a small number of more or less equivalent upstream transits. The FIB is the expensive memory. The RIB sits in the cheap part of the hardware.
fib optimize => using LPM table for LEM https://www.arista.com/en/um-eos/eos-section-28-11-ipv4-commands#ww1173031
Cool. So for folks who want a nutshell version about FIB compression, here it is:
[...]
the same. FIB compression eliminates the implicit reject and instead routes the unroutable packets to a more or less random next hop. If that next hop is also using FIB compression, it may route them right back to you, creating a routing loop until the packet's TTL expires.
The commercially available implementations do not work as you describe and fortunately do not carry that (or really, any) risk.

On platforms where the number of FIB entries is limited but the prefix length doesn't affect that limit (classic TCAM), it is possible to combine adjacent entries (e.g. two /24s) with the same FEC (next hop) into fewer entries. This is probably what most people think of as "FIB compression". Maybe it's used somewhere, maybe it's not.

It's also possible to suppress the installation into the FIB of routes whose prefix falls completely within a covering prefix with the same FEC. Doing so is computationally inexpensive, useful on almost any FIB lookup structure, and significantly helpful (on the order of 2x) even on very well-connected routers. Arista implements this in the feature Yang linked to with the URL containing "fib-compression", though the actual command is better named: "ip fib compression redundant-specifics filter"

Also, on the Broadcom Jericho chip (used by the Arista 7500R/7280R and Cisco NCS 5502/5508), there is a longest-prefix-match (LPM) table and a separate, much larger exact-match (LEM) table, both of which can be used for IP forwarding. (The LPM table is sort of like TCAM but not exactly -- for now, just consider it a limited resource in the same way TCAM has been historically.) Neither table can independently hold the global table. It is possible to optimize the use of these resources by installing certain prefix lengths into LEM to preserve LPM space. It is also possible to do the reverse, expanding mid-sized prefixes that would otherwise land in LPM into multiple LEM entries, reducing the number of LPM entries needed -- essentially striking an optimum balance. That is the other feature Yang linked to.

As also mentioned, all of this works as advertised with basically no limitations. It's been running at Netflix (my employer) for years.

Current production "switch" chips, e.g.
Jericho2, contain significantly more LPM capacity than is needed to hold the global table, and can be paired with additional off-board memory (Broadcom calls this KBP) for future-proofing or VRF scale needs. You can buy either option depending on your needs (e.g. the Arista 7280R3 is available in "K" and non-"K" models).

The aforementioned LEM/LPM feature was a useful bridge into this world of bigger tables in cheaper chips, but it's not needed in new hardware.

James's original question was about using cheaper L3 devices. At this point, for new installs, even if you're limited to buying used gear, you have options that don't involve any config gymnastics.

Regards,
Ryan Woolley
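P.S. For anyone who wants to see the two techniques concretely, here's a toy sketch of adjacent-entry merging and redundant-specifics filtering. The prefixes and next-hop names are invented for illustration, and of course a real router does this against hardware lookup structures, not Python dicts -- this only models the route-table math:

```python
import ipaddress

def merge_same_nexthop(fib):
    """Combine adjacent prefixes that share a FEC (next hop).

    Two adjacent /24s pointing at the same next hop become one /23.
    This is safe because any more-specific route with a different
    next hop still wins on longest-prefix match.
    """
    groups = {}
    for prefix, nh in fib.items():
        groups.setdefault(nh, []).append(ipaddress.ip_network(prefix))
    merged = {}
    for nh, nets in groups.items():
        # collapse_addresses merges adjacent and contained networks.
        for net in ipaddress.collapse_addresses(nets):
            merged[str(net)] = nh
    return merged

def filter_redundant_specifics(fib):
    """Drop routes whose longest covering prefix forwards identically.

    A /24 that falls completely inside a covering route with the same
    next hop contributes nothing to forwarding, so it need not occupy
    a FIB entry.
    """
    routes = {ipaddress.ip_network(p): nh for p, nh in fib.items()}
    kept = {}
    for prefix, nh in routes.items():
        covers = [c for c in routes if c != prefix and prefix.subnet_of(c)]
        if covers:
            best = max(covers, key=lambda c: c.prefixlen)
            if routes[best] == nh:
                continue  # redundant more-specific: suppress it
        kept[str(prefix)] = nh
    return kept

# Invented example table: a default to transit-A plus a few /24s.
fib = {
    "0.0.0.0/0":       "transit-A",
    "192.0.2.0/24":    "transit-A",  # redundant: default already goes to A
    "198.51.100.0/24": "transit-B",
    "198.51.101.0/24": "transit-B",  # adjacent to the /24 above, same FEC
}
print(merge_same_nexthop(fib))
# -> {'0.0.0.0/0': 'transit-A', '198.51.100.0/23': 'transit-B'}
print(filter_redundant_specifics(fib))
# -> {'0.0.0.0/0': 'transit-A', '198.51.100.0/24': 'transit-B',
#     '198.51.101.0/24': 'transit-B'}
```

Note that merging only ever shrinks the table within one FEC group, while the redundant-specifics filter compares each route against just its longest covering prefix -- the one a packet would actually fall through to -- which is why neither can create the routing loop described above.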