On Mon, Oct 2, 2023 at 6:21 AM tim@pelican.org <tim@pelican.org> wrote:
On Monday, 2 October, 2023 09:39, "William Herrin" <bill@herrin.us> said:
That depends. When the routing table outgrows the FIB, routers don't immediately die. Instead, their performance degrades, just like what happens with oversubscription elsewhere in the system.
With a TCAM-based router, the least specific routes get pushed off the TCAM (out of the fast path) up to the main CPU. As a result, the PPS (packets per second) degrades really fast.
With a DRAM+SRAM cache system, the least used routes fall out of the cache. They haven't actually been pushed out of the fast path, but the fast path gets a little bit slower. The PPS degrades, but not as sharply as with a TCAM-based router.
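As a toy illustration of the cache case (my own sketch in Python, keyed by exact prefix for brevity rather than doing a real longest-prefix match): a small LRU cache sits in front of the full table. A route evicted from the cache is still reachable via the slower full-table lookup, so throughput sags gradually instead of falling off the cliff you get when packets are punted to the CPU:

    # Toy sketch only: an LRU cache (the "SRAM") in front of a complete
    # table (the "DRAM"). Lookups are keyed by exact prefix to keep the
    # example short; a real FIB does longest-prefix matching.
    from collections import OrderedDict

    class CachedFib:
        def __init__(self, cache_size):
            self.full_table = {}        # complete FIB: slow lookups
            self.cache = OrderedDict()  # fast path: LRU eviction
            self.cache_size = cache_size

        def install(self, prefix, next_hop):
            self.full_table[prefix] = next_hop

        def lookup(self, prefix):
            if prefix in self.cache:                # fast-path hit
                self.cache.move_to_end(prefix)
                return self.cache[prefix], "fast"
            next_hop = self.full_table[prefix]      # slow full-table walk
            self.cache[prefix] = next_hop           # promote into cache...
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)      # ...evicting the LRU route
            return next_hop, "slow"

    fib = CachedFib(cache_size=2)
    for p in ("10.0.0.0/8", "192.0.2.0/24", "198.51.100.0/24"):
        fib.install(p, "ge-0/0/0")
    print(fib.lookup("10.0.0.0/8"))       # slow: cold lookup, now cached
    print(fib.lookup("10.0.0.0/8"))       # fast: cache hit
    print(fib.lookup("192.0.2.0/24"))     # slow, cached
    print(fib.lookup("198.51.100.0/24"))  # slow, evicts 10.0.0.0/8
    print(fib.lookup("10.0.0.0/8"))       # slow again, but still reachable

Every lookup still succeeds; the least-used routes just take the slow path, which is why the degradation is gradual.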
Spit-balling here, is there a possible design for not-Tier-1 providers where routing optimality (which is probably not a word) degrades rather than packet-shifting performance?
If the FIB is full, can we start making controlled and/or smart decisions about what to install, rather than falling back on either of the simple overflow behaviours above?
For starters, as long as you have *somewhere* you can point a default at in the worst case, even if it's far from the *best* route, you make damn sure you always install a default.
Then you could have knobs for what other routes you discard when you run out of space. Receiving a covering /16? Maybe you can drop the /24s, even if they have a different next hop - routing will be sub-optimal, but it will work. (I know, previous discussions around traffic engineering and whether the originating network must / does do that in practice...)
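As a rough sketch of what such knobs might do (hypothetical policy in Python; compress_fib and its rules are illustrative, not a shipping feature): install the default first, then work from least- to most-specific, shedding any route an installed prefix already covers, even when the next hop differs:

    # Hypothetical "controlled overflow" knobs, sketched in Python.
    # compress_fib and its rules are illustrative, not a vendor feature.
    import ipaddress

    def compress_fib(rib, fib_size):
        """rib: {prefix string: next hop}. Returns the routes installed."""
        routes = {ipaddress.ip_network(p): nh for p, nh in rib.items()}
        default = ipaddress.ip_network("0.0.0.0/0")
        fib = {}
        if default in routes:           # rule 1: always install a default
            fib[default] = routes.pop(default)
        # rule 2: walk least-specific first; shed any route an installed
        # prefix already covers, even when the next hop differs.
        for prefix in sorted(routes, key=lambda n: n.prefixlen):
            if any(inst != default and prefix.subnet_of(inst) for inst in fib):
                continue                # a covering route already handles it
            if len(fib) < fib_size:
                fib[prefix] = routes[prefix]
        return {str(p): nh for p, nh in fib.items()}

    rib = {"0.0.0.0/0": "upstream",      "198.51.0.0/16": "peer-3",
           "198.51.100.0/24": "peer-2",  "203.0.113.0/24": "peer-1"}
    print(compress_fib(rib, fib_size=3))
    # 198.51.100.0/24 is shed: 198.51.0.0/16 covers it, so forwarding
    # still works, just via a sub-optimal next hop.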
The problem with this approach is that you now have non-deterministic routing. Depending on the state of FIB compression, packets *may* flow out interfaces that are not what the RIB thinks they will be. This is a good recipe for routing micro-loops that come and go as your FIB compression size ebbs and flows.

Taking your example:

RTR-A----------RTR-B---------RTR-C

RTR-A is announcing a /16 to RTR-B. RTR-C is announcing a /24 from within the /16 to RTR-B, which is passing it along to RTR-A.

If RTR-B's FIB fills up, and compression falls back to "drop the /24, since I see a /16", packets destined to the /24 arriving from RTR-A will reach RTR-B, which will check its FIB and send them back towards RTR-A... which will send them back to RTR-B, until the TTL is exceeded.

BTW, this scenario holds true even when it's a default route coming from RTR-A, so saying "well, OK, but we can do FIB compression easily as long as we have a default route to fall back on" still leads to packets ping-ponging on your upstream interface towards your default if you ever drop from your FIB a more specific that is destined downstream of you.

You're better off doing the filtering at the RIB end of things, so that RTR-B no longer passes the /24 to RTR-A; sure, routing breaks at that point, but at least you haven't filled up the RTR-A-to-RTR-B link with packets ping-ponging back and forth.

Your routing protocols *depend* on packets being forwarded along the interfaces the RIB thinks they'll be going out in order for loop-free routing to occur. If the FIB decisions are made independently of the RIB state, your routing protocols might as well just give up and go home, because no matter how many times they run Dijkstra, the path to the destination isn't going to match where the packets ultimately end up going.

You could of course fix this issue by propagating the decisions made by the FIB compression algorithm back up into the RIB; at least then, the network engineer being paged at 3am to figure out why a link is full will instead be paged to figure out why routes aren't showing up in the routing table that policy *says* should be showing up.
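To make the failure mode concrete, here's a small Python simulation of that ping-pong. The routers and next hops follow the example above; the prefixes (198.51.0.0/16 and 198.51.100.0/24) and the lpm() helper are illustrative assumptions, not anything from a real implementation:

    # Simulating the ping-pong above. Prefixes and the lpm() helper are
    # illustrative; the topology and next hops follow the example.
    import ipaddress

    def lpm(fib, dst):
        """Longest-prefix match: most specific route covering dst wins."""
        matches = [p for p in fib if dst in ipaddress.ip_network(p)]
        return fib[max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)]

    fibs = {
        # RTR-A originates the /16 and learned the /24 via RTR-B (from RTR-C).
        "RTR-A": {"198.51.0.0/16": "local", "198.51.100.0/24": "RTR-B"},
        # RTR-B's compression dropped the /24 ("I see a covering /16");
        # its remaining /16 route points back at RTR-A.
        "RTR-B": {"198.51.0.0/16": "RTR-A"},
    }

    dst, hop, ttl = ipaddress.ip_address("198.51.100.1"), "RTR-A", 8
    while ttl:
        nxt = lpm(fibs[hop], dst)
        print(f"{hop} -> {nxt} (ttl={ttl})")
        hop, ttl = nxt, ttl - 1
    print("TTL exceeded: the packet never left the RTR-A / RTR-B link")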
Understand which routes your customers care about / where most of your traffic goes? Set the "FIB-preference" on those routes as you receive them, to give them the greatest chance of getting installed.
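A sketch of what that preference knob might look like (a hypothetical attribute, not a real routing-policy keyword): rank candidate routes by preference and install from the top of the list when space runs out:

    # Hypothetical "FIB-preference" knob, sketched in Python: when space
    # runs out, routes are installed in preference order, so the prefixes
    # carrying most of your traffic win. Names and values are illustrative.
    def install_by_preference(candidates, fib_size):
        """candidates: (prefix, next_hop, preference) tuples; higher wins."""
        ranked = sorted(candidates, key=lambda r: r[2], reverse=True)
        return {prefix: nh for prefix, nh, _ in ranked[:fib_size]}

    candidates = [
        ("0.0.0.0/0",       "upstream", 100),  # default: always preferred
        ("203.0.113.0/24",  "cdn-peer",  90),  # where customer traffic goes
        ("198.51.100.0/24", "peer-2",    10),  # barely used: first to go
    ]
    print(install_by_preference(candidates, fib_size=2))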
Not being a hardware designer, I have little idea how feasible this is - I suspect it depends on the rate of churn, the complexity of FIB updates, etc. But it feels like there could be a way to build something other than "shortest -> punt to CPU" or "LRU -> punt to CPU".
Or is everyone who could make use of this already doing the same filtering at the RIB level, and not trying to fit a quart RIB into a pint FIB in the first place?
The sane ones who care about the sanity of their network engineers certainly do. ^_^;
Thanks, Tim.
Thanks! Matt