Re: maximum ipv4 bgp prefix length of /24 ?
Among the issues: Suppose the FIB has all the /24 components to make a /20, so it programs a /20. Then one of the /24's changes nexthop. It now has to undo all that compression by reinstalling some of the routes and figuring out the minimum set of /21, /22, /23, /24 to make it happen. Then, to avoid a transient, it needs to make before break. Quite a bit of FIB programming needs to happen just to modify a single /24. Then the next /24 in the set also modifies its nexthop. And so on for 100,000 routes. All because a peer link flapped. Affecting convergence.

Then you need to buy a line card that can hold all the individual routes, because you can't always compress, because not all the routes in your compressed set have the same nexthop during a transient. Finally, it's all nicely compressed. Now what? You have lots of empty slots in your FIB.

I'm sure lots of nerds can come up with transient reduction algorithms, but I'd rather not.

Kind Regards,
Jakob

---------------------------------------------------------------
Date: Sat, 30 Sep 2023 20:04:29 -0700
From: Owen DeLong <owen@delong.com>

Not sure why you think FIB compression is a risk or will be a mess. It's a pretty straightforward task.

Owen
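[Editor's note: below is a minimal Python sketch, not any vendor's FIB code, of the re-decomposition Jakob describes, where an installed /20 has to be replaced by the changed /24 plus the smallest set of prefixes covering the rest. The 10.0.0.0/20 and the nexthop labels are made-up illustration values.]

```python
import ipaddress

aggregate = ipaddress.ip_network("10.0.0.0/20")   # previously installed as one compressed entry
moved     = ipaddress.ip_network("10.0.3.0/24")   # the /24 whose nexthop just changed

# address_exclude() yields the minimal covering set for "the /20 minus the /24":
# one /21, one /22, one /23 and one /24 here -- exactly the splice-up Jakob mentions.
remainder = sorted(aggregate.address_exclude(moved))

print("withdraw :", aggregate)
print("install  :", moved, "-> new nexthop")
for net in remainder:
    print("install  :", net, "-> old nexthop")
```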
On Sun, Oct 1, 2023 at 5:40 PM Jakob Heitz (jheitz) via NANOG <nanog@nanog.org> wrote:
Among the issues: Suppose the FIB has all the /24 components to make a /20, so it programs a /20. Then one of the /24's changes nexthop. It now has to undo all that compression
Yeah... all this stuff is on the same level of complexity as implementing a B-Tree. Standard task on the road to an undergraduate computer science degree. Compared to decoding a BGP update message, where nearly everything is variable length and you have to nibble away at the current field to find the start of the next field, this is a cakewalk. It doesn't actually get complicated until you want to do more than just joining adjacent address blocks. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
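[Editor's note: as an illustration of the "joining adjacent address blocks" case Bill calls a cakewalk, here is a minimal Python sketch that groups routes by nexthop and merges adjacent or covered prefixes within each group. The prefixes and nexthop names are invented for illustration; real implementations operate on tries, not Python lists.]

```python
import ipaddress
from collections import defaultdict

rib = [
    ("192.0.2.0/25",    "nh-A"),
    ("192.0.2.128/25",  "nh-A"),   # sibling of the /25 above -> one /24
    ("198.51.100.0/24", "nh-B"),
    ("198.51.101.0/24", "nh-B"),   # adjacent /24s -> one /23
    ("203.0.113.0/24",  "nh-C"),   # no adjacent partner, stays as-is
]

by_nexthop = defaultdict(list)
for prefix, nh in rib:
    by_nexthop[nh].append(ipaddress.ip_network(prefix))

fib = []
for nh, nets in by_nexthop.items():
    for merged in ipaddress.collapse_addresses(nets):  # joins siblings, drops covered routes
        fib.append((merged, nh))

for net, nh in sorted(fib):
    print(net, "->", nh)
# 5 RIB entries become 3 FIB entries; nothing is lost, because every merged
# block pointed at the same nexthop to begin with.
```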
While I did allude to some of the complexity, my main point is that FIB compression does not let you get by with less FIB memory, because you must be prepared for transients during which the FIB needs to be stored mostly uncompressed anyway. All it does is increase convergence time.

Kind Regards,
Jakob
On Sun, Oct 1, 2023 at 9:55 PM Jakob Heitz (jheitz) <jheitz@cisco.com> wrote:
my main point is that FIB compression does not allow you to install a FIB with less memory.
Hi Jakob,

The math disagrees. It's called "oversubscription," and we use it all over the place in network engineering.

There are only a handful of route patterns that'd result in no compression at all. They'd have to be intentionally created, and that'd be a hacking challenge in and of itself. The patterns in question don't align with the distribution of addresses on the Internet. If you're at 80% FIB after compression, a compression transient could plausibly bump you to 85%. The odds of a natural transient bumping you to 100% are infinitesimal. If you try to run at 95% after compression... well, I'm sure someone will try it, but that's PEBKAC, not compression's fault.

FIB compression ranges from 30% in simple core scenarios to more than 90% in edge scenarios with advanced compression. Even keeping reasonable slack for transients, you're going to get some bang for your buck. All it means is that you have to keep an eye on your FIB size as well, since it's no longer the same as your RIB size.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
William Herrin wrote on 02/10/2023 08:56:
All it means is that you have to keep an eye on your FIB size as well, since it's no longer the same as your RIB size.
the point Jakob is making is that when using FIB compression, the FIB size depends on both RIB size and RIB complexity. I.e. what was previously a deterministic 1:1 ratio between RIB and FIB - which is straightforward to handle from an operational point of view - becomes non-deterministic. The difficulty with this is that if you end up with a FIB overflow, your router will no longer route.

That said, there are cases where FIB compression makes a lot of sense, e.g. leaf sites, etc. Conversely, it's not a generally appropriate technology for a dense dfz core device. It's a tool in the toolbox, one of many.

Nick
On Mon, Oct 2, 2023 at 1:18 AM Nick Hilliard <nick@foobar.org> wrote:
The difficulty with this is that if you end up with a FIB overflow, your router will no longer route.
Hi Nick,

That depends. When the FIB gets too big, routers don't immediately die. Instead, their performance degrades. Just like what happens with oversubscription elsewhere in the system.

With a TCAM-based router, the least specific routes get pushed off the TCAM (out of the fast path) up to the main CPU. As a result, the PPS (packets per second) degrades really fast.

With a DRAM+SRAM cache system, the least used routes fall out of the cache. They haven't actually been pushed out of the fast path, but the fast path gets a little bit slower. The PPS degrades, but not as sharply as with a TCAM-based router.
That said, there are cases where FIB compression makes a lot of sense, e.g. leaf sites, etc. Conversely, it's not a generally appropriate technology for a dense dfz core device. It's a tool in the toolbox, one of many.
The case for FIB compression deep in the core is... not as obvious as the case near the edge, for sure. But I wouldn't discount it on any installation that has a reasonably defined notion of "upstream," as opposed to installations where the only sorts of interfaces are either lateral or downstream.

Look at it this way: here are some numbers from last Friday's BGP report:

BGP routing table entries examined: 930281
Prefixes after maximum aggregation (per Origin AS): 353509
Deaggregation factor: 2.63
Unique aggregates announced (without unneeded subnets): 453312

Obviously adjacent routes to the same AS aren't always going to have the same next hop. But I'll bet you that they do more often than not, even deep in the core. Even if only half the adjacent routes from the same AS have the same next hop when found deep in the core, according to these numbers that's still a 30% compression. If you keep a 10% slack for transients, you still have a 20% net gain in your equipment's capability versus no compression.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
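[Editor's note: the arithmetic behind those percentages, spelled out. The "half of adjacent routes share a nexthop" factor is Bill's assumption, not measured data.]

```python
entries   = 930_281   # BGP routing table entries examined
max_agg   = 353_509   # prefixes after maximum aggregation (per origin AS)

best_case = 1 - max_agg / entries          # if every same-origin merge were possible
realistic = best_case / 2                  # assume only half of them share a nexthop
net_gain  = realistic - 0.10               # keep ~10% slack for transients

print(f"best-case compression : {best_case:.0%}")   # ~62%
print(f"half realizable       : {realistic:.0%}")   # ~31%, the "still a 30% compression"
print(f"after 10% slack       : {net_gain:.0%}")    # ~21%, the "20% net gain"
```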
That depends. When the FIB gets too big, routers don't immediately die. Instead, their performance degrades. Just like what happens with oversubscription elsewhere in the system.
If you consider blackholing traffic because the relevant next-hops aren't present in the FIB to be looked up as "degradation".... I guess?

If you keep a 10% slack for transients, you still have a 20% net gain in your equipment's capability versus no compression.
This ignores whatever % of something else you have to give up to get effective compression.
On Mon, Oct 2, 2023 at 6:05 AM Tom Beecher <beecher@beecher.cc> wrote:
That depends. When the FIB gets too big, routers don't immediately die. Instead, their performance degrades. Just like what happens with oversubscription elsewhere in the system.
If you consider blackholing traffic because the relevant next-hops aren't present in the FIB to be looked up as "degradation".... I guess?
Come on man, go re-read the post. The two paragraphs you cut literally explained what happens -instead of- routes dropping out of the FIB or being black holed. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
Come on man, go re-read the post. The two paragraphs you cut literally explained what happens -instead of- routes dropping out of the FIB or being black holed.
Ok
On Monday, 2 October, 2023 09:39, "William Herrin" <bill@herrin.us> said:
That depends. When the FIB gets too big, routers don't immediately die. Instead, their performance degrades. Just like what happens with oversubscription elsewhere in the system.
With a TCAM-based router, the least specific routes get pushed off the TCAM (out of the fast path) up to the main CPU. As a result, the PPS (packets per second) degrades really fast.
With a DRAM+SRAM cache system, the least used routes fall out of the cache. They haven't actually been pushed out of the fast path, but the fast path gets a little bit slower. The PPS degrades, but not as sharply as with a TCAM-based router.
Spit-balling here, is there a possible design for not-Tier-1 providers where routing optimality (which is probably not a word) degrades rather than packet-shifting performance? If the FIB is full, can we start making controlled and/or smart decisions about what to install, rather than either of the simple overflow conditions?

For starters, as long as you have *somewhere* you can point a default at in the worst case, even if it's far from the *best* route, you make damn sure you always install a default.

Then you could have knobs for what other routes you discard when you run out of space. Receiving a covering /16? Maybe you can drop the /24s, even if they have a different next hop - routing will be sub-optimal, but it will work. (I know, previous discussions around traffic engineering and whether the originating network must / does do that in practice...)

Understand which routes your customers care about / where most of your traffic goes? Set the "FIB-preference" on those routes as you receive them, to give them the greatest chance of getting installed.

Not a hardware designer, I have little idea as to how feasible this is - I suspect it depends on the rate of churn, complexity of FIB updates, etc. But it feels like there could be a way to build something other than "shortest -> punt to CPU" or "LRU -> punt to CPU".

Or is everyone who could make use of this already doing the same filtering at the RIB level, and not trying to fit a quart RIB into a pint FIB in the first place?

Thanks,
Tim.
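[Editor's note: a hand-wavy sketch of the selection logic Tim is floating, assuming hypothetical policy classes ("default", "customer", "preferred", "other") and a deliberately tiny FIB. None of this corresponds to an actual vendor knob.]

```python
import ipaddress

FIB_SLOTS = 4  # tiny on purpose

candidates = [
    # (prefix, nexthop, policy class) -- all values invented for illustration
    ("0.0.0.0/0",       "upstream-1", "default"),    # always wins a slot
    ("198.51.100.0/24", "cust-7",     "customer"),   # routes we bill for
    ("203.0.113.0/24",  "peer-3",     "preferred"),  # operator-tagged hot prefixes
    ("192.0.2.0/24",    "upstream-2", "other"),
    ("192.0.2.0/25",    "upstream-2", "other"),      # more-specific, cheapest to shed
]

rank = {"default": 0, "customer": 1, "preferred": 2, "other": 3}

# Within the lowest class, shed the longest (most specific) prefixes first, on
# the theory that a covering route or the default still delivers the traffic.
ordered = sorted(
    candidates,
    key=lambda r: (rank[r[2]], ipaddress.ip_network(r[0]).prefixlen),
)

installed, spilled = ordered[:FIB_SLOTS], ordered[FIB_SLOTS:]
print("installed:", [(p, nh) for p, nh, _ in installed])
print("not installed (falls back to covering route/default):", [p for p, _, _ in spilled])
```

[As Matthew points out further down the thread, shedding covered more-specifics this way can ping-pong packets between neighbours, so treat this strictly as a thought experiment.]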
Seems like we've reached the limits of a priori speculation. At this point I'd like to see data demonstrating that it's at least viable from a statistical perspective. If someone is motivated to demonstrate this, a "backtest" against historical data would be the next step. Later, one could design the study to reveal "transient degradation" (loops, drops, etc.) and their probability, though the duration would be more a function of the implementation. It would be best to "backtest" the status quo as a control because it too has transient degradation, for a more apples-to-apples comparison.

I'm not sufficiently motivated (nor knowledgeable in statistics) to take this on. I see this more in the domain of vendors to determine the best approach for their implementation.
On Mon, Oct 2, 2023 at 6:40 AM Joshua Miller <contemno@gmail.com> wrote:
At this point I'd like to see data demonstrating that it's at least viable from a statistical perspective.
https://conferences.sigcomm.org/sigcomm/2013/papers/sigcomm/p111.pdf https://yangtonghome.github.io/uploads/MAoFIBC.pdf More where those came from if you google "BGP FIB compression paper." Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
Then you could have knobs for what other routes you discard when you run out of space. Receiving a covering /16? Maybe you can drop the /24s, even if they have a different next hop - routing will be sub-optimal, but it will work. (I know, previous discussions around traffic engineering and whether the originating network must / does do that in practice...)
What you are describing is exactly what the /24 convention is doing already, just with different mask lengths. By and large, RIB/FIB size can be effectively managed today by thoughtful use of policies. If a point is reached that doesn't work anymore, it's _probably_ time to re-evaluate the hardware or the design.
On Mon, Oct 2, 2023 at 6:21 AM tim@pelican.org <tim@pelican.org> wrote:
On Monday, 2 October, 2023 09:39, "William Herrin" <bill@herrin.us> said:
That depends. When the FIB gets too big, routers don't immediately die. Instead, their performance degrades. Just like what happens with oversubscription elsewhere in the system.
With a TCAM-based router, the least specific routes get pushed off the TCAM (out of the fast path) up to the main CPU. As a result, the PPS (packets per second) degrades really fast.
With a DRAM+SRAM cache system, the least used routes fall out of the cache. They haven't actually been pushed out of the fast path, but the fast path gets a little bit slower. The PPS degrades, but not as sharply as with a TCAM-based router.
Spit-balling here, is there a possible design for not-Tier-1 providers where routing optimality (which is probably not a word) degrades rather than packet-shifting performance?
If the FIB is full, can we start making controlled and/or smart decisions about what to install, rather than either of the simple overflow conditions?
For starters, as long as you have *somewhere* you can point a default at in the worst case, even if it's far from the *best* route, you make damn sure you always install a default.
Then you could have knobs for what other routes you discard when you run out of space. Receiving a covering /16? Maybe you can drop the /24s, even if they have a different next hop - routing will be sub-optimal, but it will work. (I know, previous discussions around traffic engineering and whether the originating network must / does do that in practice...)
The problem with this approach is you now have non-deterministic routing. Depending on the state of FIB compression, packets *may* flow out interfaces that are not what the RIB thinks they will be. This can be a good recipe for routing micro-loops that come and go as your FIB compression size ebbs and flows.

Taking your example:

RTR-A----------RTR-B---------RTR-C

RTR-A is announcing a /16 to RTR-B.
RTR-C is announcing a /24 from within the /16 to RTR-B, which is passing it along to RTR-A.

If RTR-B's FIB compression fills up, and falls back to "drop the /24, since I see a /16", packets destined to the /24 arriving from RTR-A will reach RTR-B, which will check its FIB, and send them back towards RTR-A....which will send them back to RTR-B, until TTL is exceeded.

BTW, this scenario holds true even when it's a default route coming from RTR-A, so saying "well, OK, but we can do FIB compression easily as long as we have a default route to fall back on" still leads to packet-ping-ponging on your upstream interface towards your default if you ever drop a more specific from your FIB that is destined downstream of you. You're better off doing the filtering at the RIB end of things, so that RTR-B no longer passes the /24 to RTR-A; sure, routing breaks at that point, but at least you haven't filled up the RTR-A to RTR-B link with packets ping-ponging back and forth.

Your routing protocols *depend* on packets being forwarded along the interfaces the RIB thinks they'll be going out in order for loop-free routing to occur. If the FIB decisions are made independent of the RIB state, your routing protocols might as well just give up and go home, because no matter how many times they run Dijkstra, the path to the destination isn't going to match where the packets ultimately end up going.

You could of course fix this issue by propagating the decisions made by the FIB compression algorithm back up into the RIB; at least then, the network engineer being paged at 3am to figure out why a link is full will instead be paged to figure out why routes aren't showing up in the routing table that policy *says* should be showing up.

Understand which routes your customers care about / where most of your traffic goes? Set the "FIB-preference" on those routes as you receive them, to give them the greatest chance of getting installed.
Not a hardware designer, I have little idea as to how feasible this is - I suspect it depends on the rate of churn, complexity of FIB updates, etc. But it feels like there could be a way to build something other than "shortest -> punt to CPU" or "LRU -> punt to CPU".
Or is everyone who could make use of this already doing the same filtering at the RIB level, and not trying to fit a quart RIB into a pint FIB in the first place?
The sane ones who care about the sanity of their network engineers certainly do. ^_^;
Thanks, Tim.
Thanks! Matt
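[Editor's note: Matthew's RTR-A/RTR-B/RTR-C ping-pong can be reproduced with a toy longest-prefix-match walk-through. The addresses and the TTL below are illustrative only.]

```python
import ipaddress

fib = {
    # RTR-A originates the /16 itself and learned the more-specific /24 via RTR-B.
    "RTR-A": [("203.0.0.0/16", "local"), ("203.0.113.0/24", "RTR-B")],
    # RTR-B shed the /24 toward RTR-C (FIB full); only the covering /16 toward RTR-A survives.
    "RTR-B": [("203.0.0.0/16", "RTR-A")],
}

def next_hop(router, dst):
    """Longest-prefix match over the router's FIB."""
    matches = [(ipaddress.ip_network(p), nh) for p, nh in fib[router]
               if ipaddress.ip_address(dst) in ipaddress.ip_network(p)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

node, ttl = "RTR-A", 8
dst = "203.0.113.1"   # lives behind RTR-C
while ttl:
    nh = next_hop(node, dst)
    print(f"{node} -> {nh} (ttl={ttl})")
    node, ttl = nh, ttl - 1
print("TTL exceeded; the packet never reached RTR-C.")
```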
On 02/10/2023 19:24, Matthew Petach wrote:
The problem with this approach is you now have non-deterministic routing.
Depending on the state of FIB compression, packets *may* flow out interfaces that are not what the RIB thinks they will be. This can be a good recipe for routing micro-loops that come and go as your FIB compression size ebbs and flows.
Had NOT considered the looping - that's what you get for writing in public without thinking it all the way through *blush*. Thanks for poking holes appropriately, Tim.
On 10/2/23 20:44, Tim Franklin wrote:
Had NOT considered the looping - that's what you get for writing in public without thinking it all the way through *blush*.
Thanks for poking holes appropriately,
Like I said, it's going to be a messy experiment - for probably a decade, at least.

As Saku has highlighted as well, vendors are likely to find a lot of sludge in this experiment that they will never be able to share with us... either because it will be too complicated for us to understand in a coherent way, or because the experiment changes so rapidly that it makes no sense to tell us about issues which will quickly become obsolete.

Many lessons will be learned, but ultimately, one would be naive to think this black box will just work.

All I want is a "set routing-options fib-compression disable" present for Christmas.

Mark.
On Oct 2, 2023, at 12:19, Mark Tinka <mark@tinka.africa> wrote:
On 10/2/23 20:44, Tim Franklin wrote:
Had NOT considered the looping - that's what you get for writing in public without thinking it all the way through *blush*.
Thanks for poking holes appropriately,
Like I said, it's going to be a messy experiment - for probably a decade, at least.
That’s only (possibly) true _IF_ the compression being done is lossy compression. There are several FIB compression techniques available that don’t involve discarding data in this way (think data deduplication vs. data removal).
Many lessons will be learned, but ultimately, one would be naive to think this black box will just work.
Agreed that if the vendors start implementing lossy compression, there should be a switch to turn it off or at least reduce its aggressiveness back to lossless compression. However, I have no reason to think that lossless compression wouldn’t “just work”.
All I want is a "set routing-options fib-compression disable" present for Christmas.
Best of luck with that. Probably won’t get it unless/until a rev or two after they break something with it. Owen
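[Editor's note: one reading of Owen's "deduplication vs. removal" distinction is nexthop/adjacency sharing, where every prefix stays in the FIB but identical nexthop data is stored only once. A minimal sketch, with invented prefixes and adjacency data.]

```python
routes = {
    "192.0.2.0/24":    ("eth0", "10.1.1.1", "aa:bb:cc:00:00:01"),
    "198.51.100.0/24": ("eth0", "10.1.1.1", "aa:bb:cc:00:00:01"),
    "203.0.113.0/24":  ("eth1", "10.2.2.2", "aa:bb:cc:00:00:02"),
}

adjacency_table = []   # each distinct nexthop stored exactly once
adjacency_index = {}   # nexthop tuple -> slot in adjacency_table
fib = {}               # prefix -> adjacency slot

for prefix, nh in routes.items():
    if nh not in adjacency_index:
        adjacency_index[nh] = len(adjacency_table)
        adjacency_table.append(nh)
    fib[prefix] = adjacency_index[nh]

print("prefix entries   :", len(fib))              # 3, unchanged -- nothing was removed
print("adjacency entries:", len(adjacency_table))  # 2, deduplicated
```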
On Mon, Oct 2, 2023 at 11:46 AM Tim Franklin <tim@pelican.org> wrote:
On 02/10/2023 19:24, Matthew Petach wrote:
The problem with this approach is you now have non-deterministic routing.
Depending on the state of FIB compression, packets *may* flow out interfaces that are not what the RIB thinks they will be. This can be a good recipe for routing micro-loops that come and go as your FIB compression size ebbs and flows.
Had NOT considered the looping - that's what you get for writing in public without thinking it all the way through *blush*.
Thanks for poking holes appropriately, Tim.
No worries--if this were easy, we would have been doing it decades ago without thinking twice.

To William's point to Tom--we are perhaps using the term "compression" in incompatible ways at times during this conversation. There is a difference between what the papers William cited are doing, which is finding more optimal ways of storing the full structure in memory, and what I think the general thread here is talking about, which is 'proxy-aggregation' of a form--reducing the actual number of entries in the forwarding table, regardless of the *method* of storage.

"FIB compression" of the form talked about in the papers William cited is already being done; we don't store whole string representations of the routing table in memory and look them up sequentially, we store them in binary tries, which are faster and take up less space (i.e., compressed), but they still encode and represent the *whole* set of prefixes in the forwarding table. "FIB-count-reduction" would be a more accurate term for what we're tossing about here, and that's where dragons lie, because that's where your FIB and RIB no longer represent the same set of information.

And while Jon is right that it can help struggling ISPs stave off expensive upgrades, it does so at the cost of potentially increased troubleshooting nightmares when packets stop going where the RIB expects them to go, and network engineers are left scratching their heads trying to figure out why. ^_^;

As Mark just said--sane ISPs push their vendor for a knob to disable it, so that they can return to the land of deterministic lookups for the sanity of their engineers. ;)

Thanks!
Matt
On Mon, Oct 2, 2023 at 12:27 PM Matthew Petach <mpetach@netflight.com> wrote:
There is a difference between what the papers William cited are doing, which is finding more optimal ways of storing the full structure in memory with what I think the general thread here is talking about, which is 'proxy-aggregation' of a form--reducing the actual number of entries in the forwarding table, regardless of the *method* of storage.
"FIB compression" of the form talked about in the papers William cited is already being done; we don't store whole string representations of the routing table in memory, and look them up sequentially, we store them in binary tries, which are faster and take up less space (e, compressed), but they still encode and represent the *whole* set of prefixes in the forwarding table.
Hi Matthew,

Either you're misreading the papers or I quoted the wrong ones. Apologies if it's the latter.

The RIB is stored in a trie (a radix tree) in essentially all router implementations. From there it's processed into a FIB whose design depends on the hardware implementing it. For a software implementation that's another trie, sometimes with a hash-based cache of current activity. For a TCAM, it's an array ordered from most specific to least specific. (TCAMs are special hardware that processes every array element simultaneously instead of linearly.)

These structures don't change with FIB compression. What changes is the number of routes stored in them and the complexity of the software that translates the RIB into a FIB.
As Mark just said--sane ISPs push their vendor for a knob to disable it, so that they can return back the land of deterministic lookups for the sanity of their engineers. ;)
Just so it doesn't get lost - I endorse that idea as well. With a very few exceptions, I think the operator should have sufficient knobs to control how their equipment does its work. The developer doesn't (and can't) know all of the situations the operator will encounter, so can't plan his software to be smart enough to always do the right thing on its own. The exceptions deal with knobs where the operator is not just likely shoot themselves in the foot but likely shoot other people too. That doesn't apply here. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
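[Editor's note: a toy illustration of the TCAM layout Bill describes: entries ordered from most specific to least specific, so the first match is the longest match. The loop below merely emulates what the hardware does in parallel; prefixes and nexthops are invented.]

```python
import ipaddress

routes = [
    ("0.0.0.0/0",      "upstream"),
    ("203.0.113.0/24", "peer-1"),
    ("203.0.112.0/20", "peer-2"),
]

# Program the "TCAM": most specific first, so first-match == longest-match.
tcam = sorted(((ipaddress.ip_network(p), nh) for p, nh in routes),
              key=lambda e: e[0].prefixlen, reverse=True)

def lookup(dst):
    addr = ipaddress.ip_address(dst)
    for net, nh in tcam:          # first hit wins
        if addr in net:
            return nh
    return None

print(lookup("203.0.113.9"))   # peer-1 (the /24 sits above the /20)
print(lookup("203.0.112.9"))   # peer-2
print(lookup("192.0.2.1"))     # upstream (default catches the rest)
```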
On 02/10/2023 14:19, tim@pelican.org wrote:
If the FIB is full, can we start making controlled and/or smart decisions about what to install, rather than either of the simple overflow conditions?
There is a project [1] that makes use of sFlow to install the top N prefixes by traffic, presuming there is a default route for the remainder.

Chris

[1] https://github.com/sflow-rt/active-routes
On Oct 2, 2023, at 01:18, Nick Hilliard <nick@foobar.org> wrote:
William Herrin wrote on 02/10/2023 08:56:
All it means is that you have to keep an eye on your FIB size as well, since it's no longer the same as your RIB size.
the point Jakob is making is that when using FIB compression, the FIB size depends on both RIB size and RIB complexity. I.e. what was previously a deterministic 1:1 ratio between RIB and FIB - which is straightforward to handle from an operational point of view - becomes non-deterministic. The difficulty with this is that if you end up with a FIB overflow, your router will no longer route.
It was never 1:1 if you have more than one path for any route. The FIB only contains the chosen path(s) to any destination even without fib compression.
That said, there are cases where FIB compression makes a lot of sense, e.g. leaf sites, etc. Conversely, it's not a generally appropriate technology for a dense dfz core device. It's a tool in the toolbox, one of many.
Even at a dense DFZ core device, there are a large number of single-origin consecutive /24s in the table which can be fib compressed with no loss. For a long time, someone was teaching up and coming operators in Asia that they should always announce everything as disaggregated /24s to guard against route hijacking. This unfortunate practice persists still, making FIB compression quite practical even at core nodes. Owen
Nick
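[Editor's note: a rough way to test Owen's claim against real data would be to count /24s whose sibling /24 is announced by the same origin AS, since each such pair folds losslessly into a /23. The sample routes and origin ASNs in this sketch are fabricated; feed in a real table dump for meaningful numbers.]

```python
import ipaddress

table = [
    ("203.0.112.0/24", 64501), ("203.0.113.0/24", 64501),   # same origin, siblings
    ("198.51.100.0/24", 64502), ("198.51.101.0/24", 64503), # siblings, different origins
    ("192.0.2.0/24",   64504),
]

origin = {ipaddress.ip_network(p): asn for p, asn in table}

mergeable = 0
for net, asn in origin.items():
    if net.prefixlen != 24:
        continue
    # The other /24 under the same /23 is this route's sibling.
    sibling = next(n for n in net.supernet(prefixlen_diff=1).subnets(prefixlen_diff=1)
                   if n != net)
    if origin.get(sibling) == asn and net < sibling:   # count each pair once
        mergeable += 1

print(f"{mergeable} /23-mergeable pair(s) out of {len(table)} routes")
```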
On a related note, I'm working on a project to handle FIB overflow in such a way as to cause the least disruption in the network. I welcome suggestions either on or off list. Kind Regards, Jakob
On Mon, 2 Oct 2023, Jakob Heitz (jheitz) via NANOG wrote:
While I did allude to some of the complexity, my main point
is that FIB compression does not allow you to install a FIB with less memory.
Because you must be prepared for transients during which the FIB needs to store
mostly uncompressed anyway.
All it does is to increase convergence time.
In my experience, that's incorrect. FIB compression allows you to run with full tables on devices that can no longer accommodate full [uncompressed] tables in whatever is used to store FIB. Obviously, it does require more RAM due to the complexities of keeping track of more state. But RAM is typically far more easily upgraded (cheaply) than TCAM. Even if it does increase convergence time, that's a compromise many are willing to make if the alternative is toss your current gear and replace it with newer gear years sooner. ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route StackPath, Sr. Neteng | therefore you are _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
First, no, a transient where all route disaggregation disappears from the global table is extraordinarily unlikely.

Second, as I understand it, each update cycle results in rebuilding the FIB from scratch rather than figuring out how to splice and dice it, so the computation required to cope with that single /24 change isn’t exactly as described.

Finally, unless that /24 change ends up changing the exit interface or next-hop on the exit interface (which decreases in probability with topological distance), it wouldn’t actually change the FIB content.

The process of downloading a new FIB to line cards from the RE is very fast. Break before make may be a tolerable risk. It won’t lose a statistically significant number of packets in a single event. If events are occurring frequently enough to be an issue, then I think the route flapping is a bigger problem than the lost packets.

YMMV

Owen
participants (13)

- Chris Hills
- Delong.com
- Jakob Heitz (jheitz)
- Jon Lewis
- Joshua Miller
- Mark Tinka
- Matthew Petach
- Nick Hilliard
- Owen DeLong
- Tim Franklin
- tim@pelican.org
- Tom Beecher
- William Herrin