Partial vs Full tables
I have been doing a lot of research recently on operating networks with partial tables and a default to the rest of the world. It seems like an easy enough approach for regional networks where you have maybe only one upstream transit and some peering.

I come to NANOG to get feedback from others who may be doing this. We have 3 upstream transit providers and PNI and public peers in 2 locations. It'd obviously be easy to transition to doing partial routes for just the peers, etc., but I'm not sure where to draw the line on the transit providers. I've thought of straight preferencing one over another. I've also thought of using BGP filtering and community magic to basically allow Transit AS + 1 additional AS (the transit's direct customers) as specific routes, with summarization to default for the rest. I'm sure there are other approaches I haven't thought of as well....

And before I get asked why not just run full tables: I'm looking at regional approaches to being able to use smaller, less powerful routers (or even layer 3 switches) to run some areas of the network where we can benefit from summarization and full tables are really overkill.

James W. Breeden
Managing Partner
Arenal Group: Arenal Consulting Group | Acilis Telecom | Pines Media
PO Box 1063 | Smithville, TX 78957
Email: james@arenalgroup.co | office 512.360.0000 | www.arenalgroup.co
On Thu, Jun 4, 2020 at 8:04 PM James Breeden <James@arenalgroup.co> wrote:
I have been doing a lot of research recently on operating networks with partial tables and a default to the rest of the world. Seems like an easy enough approach for regional networks where you have maybe only 1 upstream transit and some peering.
I come to NANOG to get feedback from others who may be doing this. We have 3 upstream transit providers and PNI and public peers in 2 locations. It'd obviously be easy to transition to doing partial routes for just the peers, etc, but I'm not sure where to draw the line on the transit providers.
Why draw a line? Just take their directly connected routes + default. If you don't like the traffic mix, filter or play with local-pref until you are happy.

I've thought of straight preferencing one over another. I've thought of using BGP filtering and community magic to basically allow Transit AS + 1 additional AS (Transit direct customer) as specific routes, with summarization to default for the rest. I'm sure there are other thoughts that I haven't had about this as well....
And before I get asked why not just run full tables, I'm looking at regional approaches to being able to use smaller, less powerful routers (or even layer3 switches) to run some areas of the network where we can benefit from summarization and full tables are really overkill.
It is a smart approach and used by many. I would just be sure your ACL / policing needs are met too.
* James Breeden
I come to NANOG to get feedback from others who may be doing this. We have 3 upstream transit providers and PNI and public peers in 2 locations. It'd obviously be easy to transition to doing partial routes for just the peers, etc, but I'm not sure where to draw the line on the transit providers. I've thought of straight preferencing one over another. I've thought of using BGP filtering and community magic to basically allow Transit AS + 1 additional AS (Transit direct customer) as specific routes, with summarization to default for the rest. I'm sure there are other thoughts that I haven't had about this as well....
We started taking defaults from our transits and filtering most of the DFZ over three years ago. No regrets; it's one of the best decisions we ever made. It vastly reduced both convergence time and CapEx.

Transit providers worth their salt typically include BGP communities you can use to selectively accept more-specific routes that you are interested in. You could, for example, accept routes learned by your transits from IXes in your geographic vicinity.

Here's a PoC where we used communities to filter out all routes except those learned by our primary transit provider anywhere in Scandinavia, while using defaults for everything else: https://www.redpill-linpro.com/sysadvent/2016/12/09/slimming-routing-table.h...

(Note that we went away from the RIB->FIB filtering approach described in the post; what we have in production is traditional filtering on the BGP sessions.)

Tore
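A transit import policy along the lines Tore describes might look roughly like this in Junos (a hedged sketch: the community name and the value 64496:1000 are made-up placeholders; real transits document their own informational communities):

```
policy-options {
    /* Placeholder community; substitute the transit's documented
       "learned in region X" informational community. */
    community TRANSIT-REGIONAL members 64496:1000;
    policy-statement TRANSIT-PARTIAL-IN {
        /* Accept the default route the transit sends us. */
        term default {
            from {
                route-filter 0.0.0.0/0 exact;
            }
            then accept;
        }
        /* Accept more-specifics tagged as learned in our region. */
        term regional {
            from community TRANSIT-REGIONAL;
            then accept;
        }
        /* Drop the rest of the DFZ. */
        term everything-else {
            then reject;
        }
    }
}
```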
On Fri, 5 Jun 2020 at 10:48, Tore Anderson <tore@fud.no> wrote:
We started taking defaults from our transits and filtering most of the DFZ over three years ago. No regrets, it's one of the best decisions we ever made. Vastly reduced both convergence time and CapEx.
Is this verbatim? I don't think there is ever a use case for carrying a default route in dynamic routing.

In eBGP there should be some reliable indicator of the operator's network being up, like their own aggregate route; they have an incentive to originate that correctly, as it affects their own services and products. So recurse a static default to that route. Otherwise you cannot know how the operator originates default: they may just blindly generate it at the edge, and if the edge becomes disconnected from the core, you'll blackhole. With the static-route solution, since no sane operator would have the aggregate generated by the edge routers (self-preservation instinct), you'd be able to converge instead of blackholing.

In the internal network, instead of having a default route in iBGP or the IGP, you should have the same loopback address on every full-DFZ router and advertise that loopback in the IGP. Then non-full-DFZ routers should static-route default to that loopback, always reaching the IGP-closest full-DFZ router.

--
++ytti
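The internal-network design Saku describes could be sketched like this in Junos (hypothetical addresses throughout; 192.0.2.255 stands in for the shared anycast loopback):

```
/* On every full-DFZ router: advertise a shared anycast loopback
   address into the IGP in addition to the router's unique loopback. */
interfaces {
    lo0 {
        unit 0 {
            family inet {
                address 198.51.100.7/32;    /* unique loopback */
                address 192.0.2.255/32;     /* shared anycast loopback */
            }
        }
    }
}

/* On every non-full-DFZ router: a static default recursing to the
   anycast address, so traffic always follows the IGP to the closest
   full-DFZ router. */
routing-options {
    static {
        route 0.0.0.0/0 {
            next-hop 192.0.2.255;
            resolve;
        }
    }
}
```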
* Saku Ytti
On Fri, 5 Jun 2020 at 10:48, Tore Anderson <tore@fud.no> wrote:
We started taking defaults from our transits and filtering most of the DFZ over three years ago. No regrets, it's one of the best decisions we ever made. Vastly reduced both convergence time and CapEx.
Is this verbatim?
I do not understand this question, sorry.
you cannot know how the operator originates default
Sure you can, you just ask them. (We did.) Tore
On Fri, 5 Jun 2020 at 11:23, Tore Anderson <tore@fud.no> wrote:
Sure you can, you just ask them. (We did.)
And is it the same now? Did some Ytti 'fix' the config last night? Or was there a NOS change which doesn't do conditional routes? Or did they misunderstand their implementation, and it doesn't actually work like they think it does? I personally always design my reliance on other people's clue to be as little as operationally feasible. -- ++ytti
* Saku Ytti
On Fri, 5 Jun 2020 at 11:23, Tore Anderson <tore@fud.no> wrote:
Sure you can, you just ask them. (We did.)
And is it the same now? Some Ytti didn't 'fix' the config last night? Or NOS change which doesn't do conditional routes? Or they misunderstood their implementation and it doesn't actually work like they think it does. I personally always design my reliance to other people's clue to be as little as operationally feasible.
The way they answered the question showed that they had already considered this particular failure case and engineered their implementation accordingly. That is good enough for us.

Incorrect origination of a default route is, after all, just one of the essentially infinite ways our transit providers can screw up our services. Therefore it would make no sense to me to entrust the delivery of our business-critical packets to a transit provider, yet at the same time not trust them to originate a default route reliably.

If we did not feel we could trust our transit provider, we would simply find another one. There are plenty to choose from.

Tore
Saku-
In internal network, instead of having a default route in iBGP or IGP, you should have the same loopback address in every full DFZ router and advertise that loopback in IGP. Then non fullDFZ routers should static route default to that loopback, always reaching IGP closest full DFZ router.
Just because a DFZ-role device can advertise its loopback unconditionally in the IGP doesn't mean that device actually has a valid eBGP or iBGP session to another DFZ router. It may be contrived, but could this not be a possible way to blackhole nearby PEs..?

We currently take a full RIB and I am currently doing full FIB. I'm currently choosing to create a default aggregate for downstream default-only connectors based on something like:

    from {
        protocol bgp;
        as-path-group transit-providers;
        route-filter 0.0.0.0/0 prefix-length-range /8-/10;
        route-type external;
    }

Of course there is something functionally equivalent for v6. I have time-series data on the count of routes contributing to the aggregate, which helps a bit with peace of mind about default being pulled when it shouldn't be. Like all tricks of this type, I recognize this is susceptible to default being synthesized when it shouldn't be.

I'm considering an approach similar to Tore's blog where at some point I keep the full RIB but selectively populate the FIB. Tore, care to comment on why you decided to filter the RIB as well?

-Michael
Hey Michael,

On Fri, 5 Jun 2020 at 19:37, Michael Hare <michael.hare@wisc.edu> wrote:
Just because DFZ role device can advertise loopback unconditionally in IGP doesn't mean the DFZ actually has a valid eBGP or iBGP session to another DFZ. It may be contrived but could this not be a possible way to blackhole nearby PEs..?
We currently take a full RIB and I am currently doing full FIB. I'm currently choosing to create a default aggregate for downstream default-only connectors based on something like
The comparison isn't between full or default, the comparison is between static default or dynamic default. Of course with any default scenario there are more failure modes you cannot route around. But if you need default, you should not want to use dynamic default. -- ++ytti
On Fri, Jun 5, 2020 at 9:49 AM Saku Ytti <saku@ytti.fi> wrote:
The comparison isn't between full or default, the comparison is between static default or dynamic default. Of course with any default scenario there are more failure modes you cannot route around. But if you need default, you should not want to use dynamic default.
It's a little more nuanced than that. You probably don't want to accept a default from your transit, but you may want to pin defaults (or a set of broad routes, as I did) to "representative" routes you do accept from your transit. By "pin" I mean tell BGP that 0.0.0.0/0 is reachable via some address inside a representative route you've picked that is NOT the next hop. That way the default goes away if your transit loses the representative route, and the default pinned to one of your other transits takes over.

You can craft and tune an effective solution here, but there has to be an awful lot of money at stake before the manpower is cheaper than just buying a better router.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
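In Junos terms, Bill's "pinning" might be sketched as a recursive static route (hypothetical addresses; 203.0.113.1 stands for an address inside a representative prefix learned from one transit):

```
/* Hypothetical sketch: a static default recursing to an address inside
   a "representative" BGP route from the transit. If the transit
   withdraws the representative route, recursion fails and this default
   is removed; a second default pinned to another transit (configured
   with a worse preference) would then take over. */
routing-options {
    static {
        route 0.0.0.0/0 {
            next-hop 203.0.113.1;   /* inside the representative prefix */
            resolve;                /* allow recursion via BGP routes */
        }
    }
}
```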
On Fri, 5 Jun 2020 at 20:20, William Herrin <bill@herrin.us> wrote:
It's a little more nuanced than that. You probably don't want to accept a default from your transit but you may want to pin defaults (or a set of broad routes as I did) to "representative" routes you do accept from your transit. By "pin" I mean tell BGP that 0.0.0.0/0 is reachable by some address inside a representative route you've picked that is NOT the next hop. That way the default goes away if your transit loses the representative route and the default pinned to one of your other transits takes over.
That is a great idea. Get all the utility of default with fewer risks. -- ++ytti
On Fri, Jun 05, 2020 at 10:20:00AM -0700, William Herrin wrote:
On Fri, Jun 5, 2020 at 9:49 AM Saku Ytti <saku@ytti.fi> wrote:
The comparison isn't between full or default, the comparison is between static default or dynamic default. Of course with any default scenario there are more failure modes you cannot route around. But if you need default, you should not want to use dynamic default.
It's a little more nuanced than that. You probably don't want to accept a default from your transit but you may want to pin defaults (or a set of broad routes as I did) to "representative" routes you do accept from your transit. By "pin" I mean tell BGP that 0.0.0.0/0 is reachable by some address inside a representative route you've picked that is NOT the next hop. That way the default goes away if your transit loses the representative route and the default pinned to one of your other transits takes over.
I do the above using routes to *.root-servers.net to contribute to the aggregate 0/0.
On 5/Jun/20 18:49, Saku Ytti wrote:
The comparison isn't between full or default, the comparison is between static default or dynamic default. Of course with any default scenario there are more failure modes you cannot route around. But if you need default, you should not want to use dynamic default.
I've found this to be easier to do if your network is reasonably "centralized", i.e., there is one or two (or small handful) of "entry and exit" points. With a stretchy, relatively flat network that neither has a definite entry nor exit point, it's a bit difficult to decide which failure mode should take the default route away. Mark.
On Tue, Jun 9, 2020, at 08:04, Mark Tinka wrote:
On 5/Jun/20 18:49, Saku Ytti wrote:
The comparison isn't between full or default, the comparison is between static default or dynamic default. Of course with any default scenario there are more failure modes you cannot route around. But if you need default, you should not want to use dynamic default.
I've found this to be easier to do if your network is reasonably "centralized", i.e., there is one or two (or small handful) of "entry and exit" points.
With a stretchy, relatively flat network that neither has a definite entry nor exit point, it's a bit difficult to decide which failure mode should take the default route away.
A strong case to take the default away is when the PE the customer is connected to has become entirely isolated from the rest of the network. This can happen as a result of multiple fiber cuts, or the classic "oops, all the diverse fibers went through that one duct".

One trick is to have each PE originate a default which depends on a route that comes from another PE (any other PE). This way, a PE that for whatever reason has become entirely disconnected from the Autonomous System will cease advertising default. Make PEs with an odd-numbered loopback address depend on "ROUTE A" and PEs with an even-numbered loopback depend on "ROUTE B", where A is originated only by even-numbered PEs and B only by odd-numbered PEs. More advanced sharding strategies can be imagined, and many additional failure cases too.

Back to basics: as Ytti suggested earlier in the thread, it might be more sensible to generate your own default route based on a 'stable anchor prefix' coming from the ISP rather than accepting the default your ISP originates towards you. As an example: any NTT customers requesting to receive a default route from AS 2914 will, in addition to 0.0.0.0/0, also receive a route announcement for 129.250.0.0/16 (2001:418::/32 in IPv6), and if any customer loses visibility of 129.250.0.0/16 via the direct Customer<>NTT sessions, one probably doesn't want to point default in that direction.

- If you originate defaults to your customers: try to make it so that the default is withdrawn if the node has become isolated.
- If you want to point default at a service provider: anchor it to a stable prefix rather than their 0.0.0.0/0 route.

The above two suggestions may seem at odds with each other :-)

Kind regards,
Job
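Job's odd/even trick might be sketched in Junos as a generated route whose only permitted contributor is the anchor route from the opposite shard (a hypothetical sketch; 10.255.0.1/32 stands in for "ROUTE A"):

```
/* On an odd-numbered PE: originate 0.0.0.0/0 only while "ROUTE A"
   (originated by the even-numbered PEs) is present in the routing
   table. If this PE becomes isolated from the AS, ROUTE A disappears,
   the generated default loses its contributing route, and the default
   is withdrawn. */
policy-options {
    policy-statement DEFAULT-IF-ROUTE-A {
        term anchor {
            from {
                route-filter 10.255.0.1/32 exact;
            }
            then accept;
        }
        term everything-else {
            then reject;
        }
    }
}
routing-options {
    generate {
        route 0.0.0.0/0 policy DEFAULT-IF-ROUTE-A;
    }
}
```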
On 11/Jun/20 00:41, Job Snijders wrote:
Back to basics: as Ytti suggested earlier in the thread, it might be more sensible to generate your own default route based on a 'stable anchor prefix' coming from the ISP rather than accepting the default your ISP originates towards you.
This, for me, makes a bit more sense, especially when the number of customers asking for a default far outweighs those that don't; for a large network, designing this can be quite tricky.

Mark.
* Michael Hare
I'm considering an approach similar to Tore's blog where at some point I keep the full RIB but selectively populate the FIB. Tore, care to comment on why you decided to filter the RIB as well?
Not «as well», «instead». In the end I felt that running in production with the RIB and the FIB perpetually out of sync was too much of a hack, something that I would likely come to regret at a later point in time. That approach never made it out of the lab.

For example, simple RIB lookups like «show route $dest» would not have given truthful answers, which would likely have confused colleagues. Even though we filter on the BGP sessions towards our transits, we still get all the routes in our RIB and can look them up explicitly if we need to (e.g., in JunOS: «show route hidden $dest»).

Tore
On Fri, Jun 5, 2020 at 10:30 AM Tore Anderson <tore@fud.no> wrote:
In the end I felt that running in production with the RIB and the FIB perpetually out of sync was too much of a hack, something that I would likely come to regret at a later point in time. That approach never made it out of the lab.
Speaking of which, did anyone ever implement FIB compression? I seem to remember the calculations looked really favorable for the leaf-node use case (like James's) where the router sits at the edge with a small number of more or less equivalent upstream transits. The FIB is the expensive memory; the RIB sits in the cheap part of the hardware.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
On Fri, Jun 5, 2020 at 10:39 AM William Herrin <bill@herrin.us> wrote:
Speak of which, did anyone ever implement FIB compression? I seem to remember the calculations looked really favorable for the leaf node use case (like James') where the router sits at the edge with a small number of more or less equivalent upstream transits. The FIB is the expensive memory. The RIB sits in the cheap part of the hardware.
fib optimize => use the LPM table for LEM: https://www.arista.com/en/um-eos/eos-section-28-11-ipv4-commands#ww1173031

FIB compression => install only one FIB entry for compressible routes sharing a next hop: https://eos.arista.com/eos-4-21-3f/fib-compression/

The feature itself works as intended; version/platform/config compatibility needs some consideration.
On Fri, Jun 5, 2020 at 6:08 PM Yang Yu <yang.yu.list@gmail.com> wrote:
On Fri, Jun 5, 2020 at 10:39 AM William Herrin <bill@herrin.us> wrote:
Speak of which, did anyone ever implement FIB compression? I seem to remember the calculations looked really favorable for the leaf node use case (like James') where the router sits at the edge with a small number of more or less equivalent upstream transits. The FIB is the expensive memory. The RIB sits in the cheap part of the hardware.
fib optimize => using LPM table for LEM https://www.arista.com/en/um-eos/eos-section-28-11-ipv4-commands#ww1173031
Cool. So for folks who want a nutshell version of FIB compression, here it is:

Your router has routing information bases (RIB) and a forwarding information base (FIB). The FIB is what picks where to move the packet: for each packet you look up the destination address in the FIB and then send the packet to the next hop that you found there. This is the expensive part of the router hardware because it has to do this for every packet at line speed.

When we talk about the BGP table, we're talking about the RIB. That's what has AS paths, communities, distances, etc. It's stored in ordinary DRAM and computed by an ordinary CPU. Before the router can route packets, the data in the RIB is reduced to a forwarding table that's stored in the FIB. Normally, this means you take every route in the RIB, pick the "best" one, figure out the next hop, and then store the result in the FIB.

FIB compression replaces that process with one that's more selective about which RIB routes get installed in the FIB. The simplest version works like this:

1. Figure out which next hop has the most routes pointing to it.
2. Add a default route to that next hop.
3. Remove all the now-unnecessary more-specific routes that go to the same next hop.

If you have two upstream transit services, you've probably just cut your FIB table size by more than half, which means you don't need as much of the pricey part of the router. The actual algorithm gets more complex, the computational cost goes up, and the reduction in FIB routes improves.

The down side is something you don't usually think about: default is implicitly routed to reject (respond with ICMP unreachable). 0.0.0.0/0 -> reject. You don't see it in your configuration but it's there all the same. FIB compression eliminates the implicit reject and instead routes the unroutable packets to a more or less random next hop. If that next hop is also using FIB compression, it may route them right back to you, creating a routing loop until the packet's TTL expires.

Happy weekend everybody,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
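Bill's simple three-step version can be modeled in a few lines of Python (a toy illustration, not any vendor's implementation; prefixes and next-hop names are invented):

```python
# Toy model of simple FIB compression: find the next hop with the most
# routes, install a default toward it, and drop the specifics it covers.
from collections import Counter

def compress_fib(rib):
    """rib: dict mapping prefix -> next hop. Returns a smaller FIB."""
    if not rib:
        return {}
    # Step 1: next hop with the most routes pointing to it.
    busiest, _ = Counter(rib.values()).most_common(1)[0]
    # Steps 2 and 3: default to that next hop, drop its specifics.
    fib = {prefix: nh for prefix, nh in rib.items() if nh != busiest}
    fib["0.0.0.0/0"] = busiest
    return fib

rib = {
    "192.0.2.0/24": "transit-A",
    "198.51.100.0/24": "transit-A",
    "203.0.113.0/24": "transit-A",
    "100.64.0.0/24": "transit-B",
}
fib = compress_fib(rib)
# Four RIB routes shrink to two FIB entries: a default via transit-A
# plus the lone transit-B specific.
```

Note the caveat Bill raises: packets that used to hit the implicit reject now match the synthetic default instead.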
On Fri, Jun 5, 2020 at 9:50 PM William Herrin <bill@herrin.us> wrote:
On Fri, Jun 5, 2020 at 6:08 PM Yang Yu <yang.yu.list@gmail.com> wrote:
On Fri, Jun 5, 2020 at 10:39 AM William Herrin <bill@herrin.us> wrote:
Speak of which, did anyone ever implement FIB compression? I seem to remember the calculations looked really favorable for the leaf node use case (like James') where the router sits at the edge with a small number of more or less equivalent upstream transits. The FIB is the expensive memory. The RIB sits in the cheap part of the hardware.
fib optimize => using LPM table for LEM https://www.arista.com/en/um-eos/eos-section-28-11-ipv4-commands#ww1173031
Cool. So for folks who want a nutshell version about FIB compression, here it is:
[...]
the same. FIB compression eliminates the implicit reject and instead routes the unroutable packets to a more or less random next hop. If that next hop is also using FIB compression, it may route them right back to you, creating a routing loop until the packet's TTL expires.
The commercially available implementations do not work as you described, and fortunately do not carry that (or really, any) risk.

On platforms where the number of FIB entries is limited but the prefix length doesn't affect that limit (classic TCAM), it is possible to combine adjacent entries (e.g. /24s) with the same FEC (next hop) into fewer entries. This is probably what most people think of as "FIB compression". Maybe it's used somewhere, maybe it's not.

It's also possible to suppress the installation into the FIB of routes whose prefix falls completely within a covering prefix with the same FEC. Doing so is computationally inexpensive, useful on almost any FIB lookup structure, and significantly helpful (on the order of 2x) on even very well-connected routers. This is implemented by Arista in the feature that Yang linked to with the URL containing "fib-compression", but the actual command is better named: "ip fib compression redundant-specifics filter".

Also, on the B'com Jericho chip (used by the Arista 7500R/7280R and Cisco NCS 5502/5508), there is a longest-prefix-match (LPM) table and a separate, much larger, exact-match (LEM) table, both of which can be used for IP forwarding. (The LPM is sort of like TCAM but not exactly; for now, just consider it a limited resource in the same way as TCAM has been historically.) Neither of these can independently hold the global table. It is possible to optimize the use of these resources by installing certain prefix lengths into LEM to preserve LPM space. It is also possible to do the reverse, expanding mid-sized prefixes that would otherwise end up in LPM into multiple LEM entries, to reduce the number of LPM entries needed, basically creating an optimum balance. That is essentially the other feature Yang linked. As also mentioned, all of this works as advertised with basically no limitations. It's been running at Netflix (my employer) for years.

Current production "switch" chips, e.g. Jericho2, contain significantly more LPM than is needed to hold the global table, and can be paired with additional off-board memory (B'com calls this KBP) for futureproofing or VRF-scale needs. You can buy either option depending on your needs (e.g. the Arista 7280R3 is available in a "K" and non-"K" model). The aforementioned LEM/LPM feature was a useful bridge into this world of bigger tables in cheaper chips, but it's not needed in new hardware.

James's original question was about using cheaper L3 devices. At this point, for new installs, even if you're limited to buying used gear, you have options that don't involve any config gymnastics.

Regards,
Ryan Woolley
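The "redundant specifics" idea, suppressing a route when an already-installed covering prefix forwards to the same FEC, can be modeled with Python's ipaddress module (a toy illustration, not Arista's implementation; prefixes and next-hop names are invented):

```python
# Toy model of redundant-specifics filtering: skip installing a route
# when a covering (less-specific) route already in the FIB points at
# the same next hop, since longest-prefix match would forward the
# packet identically either way.
import ipaddress

def filter_redundant_specifics(rib):
    """rib: dict mapping prefix -> next hop. Returns the filtered FIB."""
    fib = {}
    # Install least-specific routes first so covers precede specifics.
    for prefix in sorted(rib, key=lambda p: ipaddress.ip_network(p).prefixlen):
        net, nh = ipaddress.ip_network(prefix), rib[prefix]
        redundant = any(
            nh == inst_nh and net.subnet_of(ipaddress.ip_network(inst_prefix))
            for inst_prefix, inst_nh in fib.items()
        )
        if not redundant:
            fib[prefix] = nh
    return fib

rib = {
    "10.0.0.0/8": "transit-A",
    "10.1.0.0/16": "transit-A",  # redundant: /8 cover has the same FEC
    "10.2.0.0/16": "transit-B",  # kept: next hop differs from the cover
}
fib = filter_redundant_specifics(rib)
# The /8 and the transit-B /16 survive; the transit-A /16 is suppressed.
```

This is O(n^2) for clarity; a real implementation would walk a prefix trie.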
On Mon, 8 Jun 2020 at 00:55, Ryan Woolley <rwoolleynanog@gmail.com> wrote:
order of 2x) on even very-well-connected routers. This is implemented by Arista in the feature that Yang linked to with the URL containing "fib-compression", but the actual command is better named: "ip fib compression redundant-specifics filter"
I'll take my imagination boat from the dry docks and sail to 2035. A lot of people still run ANET Jericho; it is the new CAT6500 PFC3. The DFZ won't fit it anymore without redundant-specifics filtering. Are we at all concerned that someone in the DFZ could advertise the minimum set of prefixes needed to force decompression? If we are, how do we protect against it; if we are not, why not? -- ++ytti
On 08.06.2020 08.04, Saku Ytti wrote:
On Mon, 8 Jun 2020 at 00:55, Ryan Woolley <rwoolleynanog@gmail.com> wrote:
order of 2x) on even very-well-connected routers. This is implemented by Arista in the feature that Yang linked to with the URL containing "fib-compression", but the actual command is better named: "ip fib compression redundant-specifics filter"

I'll take my imagination boat from the dry docks and sail to 2035. Lot of people still run Jericho ANET, it is the new CAT6500 PFC3. DFZ won't fit it anymore without redundant-specifics. Are we at all concerned that someone in the DFZ advertises a minimum set of prefixes needed to force decompression and if we are, how do we protect from it, if we are not, why are we not?
I imagine that is not so easily done. You can only get away with announcing prefixes that you own, which for most people will limit the amount of damage you could do. Someone who has unfiltered access to announce any prefix can already today announce 16 million /24s and crash just about any router out there.

Regards,
Baldur
On Sun, Jun 7, 2020 at 11:07 PM Saku Ytti <saku@ytti.fi> wrote:
I'll take my imagination boat from the dry docks and sail to 2035. Lot of people still run Jericho ANET, it is the new CAT6500 PFC3. DFZ won't fit it anymore without redundant-specifics. Are we at all concerned that someone in the DFZ advertises a minimum set of prefixes needed to force decompression and if we are, how do we protect from it, if we are not, why are we not?
Limit announcements to /24: 2^24 max routes. Subtract 0.0.0.0/8, 10.0.0.0/8, 127.0.0.0/8, 224.0.0.0/3 and some other reserved networks that don't (or at least aren't supposed to) show up in the DFZ. That leaves around 14M routes in the table at full disaggregation to /24.

Current TCAM-based equipment supports 1M-2M routes. The tech readily scales 7x just by throwing hardware at it (no redesign). Trie-based equipment already supports 14M routes with sufficient DRAM and CPU (4 gigs and 2 cores is more than sufficient for a 1 Gbps router at the current 800k routes).

And that's the worst case. The IPv4 table will surely saturate and stabilize long before 14M routes. No crisis to avert; just keep up with your upgrade schedules.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
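Bill's arithmetic checks out when counted in /24-equivalents (his "some other reserved networks" would shave off a little more):

```python
# Worst-case IPv4 DFZ size at full disaggregation to /24.
# A /8 contains 2**16 /24s; 224.0.0.0/3 contains 2**21 of them.
total_slash24s = 2 ** 24                 # every possible /24
reserved = 3 * 2 ** 16 + 2 ** 21         # 0/8, 10/8, 127/8, 224.0.0.0/3
dfz_worst_case = total_slash24s - reserved
print(dfz_worst_case)                    # prints 14483456, i.e. "around 14M"
```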
On Mon, Jun 08, 2020 at 07:14:01PM +0100, Nick Hilliard wrote:
William Herrin wrote on 08/06/2020 18:53:
4 gigs and 2 cores is more than sufficient for a 1 gbps router at the current 800k routes
1gbps is residential access speed. Is this still useful in the dfz?
Yes, it is. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "The strain of anti-intellectualism has been a constant thread winding its way through our political and cultural life, nurtured by the false notion that democracy means that 'my ignorance is just as good as your knowledge.'"-Asimov
On Mon, Jun 8, 2020 at 11:14 AM Nick Hilliard <nick@foobar.org> wrote:
William Herrin wrote on 08/06/2020 18:53:
4 gigs and 2 cores is more than sufficient for a 1 gbps router at the current 800k routes

1gbps is residential access speed. Is this still useful in the dfz?
Not really the point. You can get 50-100 Gbps out of an x86 running DPDK by throwing more cores at it, without appreciably changing the memory and CPU needed for the BGP load. My little 4-gig generic Linux VMs that connect my leaf node to the Internet are the ones I have reliable information on, so that's what I shared.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
I still know many ISPs that don't even come close to needing 1G of capacity and they serve hundreds of customers. I'd say it's still relevant.

-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com
Midwest-IX
http://www.midwest-ix.com

----- Original Message -----
From: "Nick Hilliard" <nick@foobar.org>
To: "William Herrin" <bill@herrin.us>
Cc: "NANOG" <nanog@nanog.org>
Sent: Monday, June 8, 2020 1:14:01 PM
Subject: Re: Partial vs Full tables

William Herrin wrote on 08/06/2020 18:53:
4 gigs and 2 cores is more than sufficient for a 1 gbps router at the current 800k routes

1gbps is residential access speed. Is this still useful in the dfz?
Nick
Does anyone have a non-paywalled version of that FIB Compression page?

-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com
Midwest-IX
http://www.midwest-ix.com

----- Original Message -----
From: "Yang Yu" <yang.yu.list@gmail.com>
To: "William Herrin" <bill@herrin.us>
Cc: "Tore Anderson" <tore@fud.no>, nanog@nanog.org
Sent: Friday, June 5, 2020 8:07:52 PM
Subject: Re: Partial vs Full tables

On Fri, Jun 5, 2020 at 10:39 AM William Herrin <bill@herrin.us> wrote:
Speak of which, did anyone ever implement FIB compression? I seem to remember the calculations looked really favorable for the leaf node use case (like James') where the router sits at the edge with a small number of more or less equivalent upstream transits. The FIB is the expensive memory. The RIB sits in the cheap part of the hardware.
fib optimize => using LPM table for LEM https://www.arista.com/en/um-eos/eos-section-28-11-ipv4-commands#ww1173031 FIB compression => install only 1 entry into FIB for compressable routes with shared nexthop https://eos.arista.com/eos-4-21-3f/fib-compression/ The feature itself works as intended. version/platform/config compatibility needs some considerations.
On 5/Jun/20 18:30, Michael Hare via NANOG wrote:
I'm considering an approach similar to Tore's blog where at some point I keep the full RIB but selectively populate the FIB. Tore, care to comment on why you decided to filter the RIB as well?
We do this in the Metro, where FIB is limited, but prefer to avoid eBGP Multi-Hop for downstream customers that want a full feed. Mark.
Maybe instead of transit + 1, you use communities to just allow all customer prefixes, regardless of how deep they are. Obviously that community would need to be supported by that provider.

I've been wondering a similar thing for how to take advantage of the 150k - 250k hardware routes the CRS317 now has in v7 beta. That many routes should cover the peering tables for most operators, maybe even a transit's customers.

-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com
Midwest-IX
http://www.midwest-ix.com

----- Original Message -----
From: "James Breeden" <James@arenalgroup.co>
To: nanog@nanog.org
Sent: Thursday, June 4, 2020 10:00:51 PM
Subject: Partial vs Full tables

I have been doing a lot of research recently on operating networks with partial tables and a default to the rest of the world. Seems like an easy enough approach for regional networks where you have maybe only 1 upstream transit and some peering.

I come to NANOG to get feedback from others who may be doing this. We have 3 upstream transit providers and PNI and public peers in 2 locations. It'd obviously be easy to transition to doing partial routes for just the peers, etc, but I'm not sure where to draw the line on the transit providers. I've thought of straight preferencing one over another. I've thought of using BGP filtering and community magic to basically allow Transit AS + 1 additional AS (Transit direct customer) as specific routes, with summarization to default for the rest. I'm sure there are other thoughts that I haven't had about this as well....

And before I get asked why not just run full tables, I'm looking at regional approaches to being able to use smaller, less powerful routers (or even layer3 switches) to run some areas of the network where we can benefit from summarization and full tables are really overkill.

James W. Breeden
Managing Partner
Arenal Group: Arenal Consulting Group | Acilis Telecom | Pines Media
PO Box 1063 | Smithville, TX 78957
Email: james@arenalgroup.co | office 512.360.0000 | www.arenalgroup.co
Agree with Mike on looking at communities first. Depending on the provider, that could be a very nice tool, or completely worthless. For your planned idea on smaller "regional" nodes, you could do something like "default || (customer && specific cities/states/regions/countries)". I would definitely make sure you consider what your fallback options are in case of partitions, as Bill mentioned in another reply. The fewer routes you have to start with, the harder that gets though.
On Thu, Jun 4, 2020 at 8:02 PM James Breeden <James@arenalgroup.co> wrote:
Hi James,

When I was at the DNC in 2007, we considered APNIC-region /8s lower priority than ARIN-region ones (for obvious reasons), so I got some extra life out of our router by pinning most APNIC /8s to a few stable announcements, preferring one transit to the other with a fallback static route. This worked in the short term but I wouldn't want to do it as a long term solution.

As a more generic approach: filter distant (long AS path) routes, because there's a higher probability that they're reachable from any transit with about the same efficiency.

Any time you summarize routes, you WILL lose connectivity during network partitions, which defeats part of the purpose of having BGP with multiple transits. Partitions are rare but they can persist for days (*cough* cogent *cough*). So that's a risk you should plan for.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
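[Editor's note: the "filter distant routes" heuristic Bill describes can be sketched in a few lines of Python. This is purely illustrative, not anyone's production policy; the prefixes, AS numbers, and the threshold of 4 hops are invented for the example.]

```python
# Sketch of "filter distant (long AS path) routes": install specifics only
# for nearby destinations and let the default route cover the rest.
# MAX_AS_PATH_LEN = 4 is an arbitrary assumption, not a recommendation.

MAX_AS_PATH_LEN = 4

def keep_in_fib(route):
    """Keep a specific route only if its AS path is short (a 'nearby' network)."""
    return len(route["as_path"]) <= MAX_AS_PATH_LEN

routes = [
    {"prefix": "192.0.2.0/24",    "as_path": [64496, 64497]},
    {"prefix": "198.51.100.0/24", "as_path": [64496, 64497, 64498, 64499, 64500]},
]

installed = [r["prefix"] for r in routes if keep_in_fib(r)]
print(installed)  # only the short-path route survives as a specific
```

In a real deployment this decision would live in router policy (e.g. an as-path regex), with the trade-off Bill notes: any specific you drop is reachable only via default, so partitions behind that default become invisible to you.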
We started filtering certain mixes of long and specific routes on transit, at least while some upgrades to our edge capability are in progress. We are a mix of transit providers, and public/private peering at our edge.

Shortly after filtering, we started occasionally finding destinations that were unreachable over the Internet (generally /24) due to:
- We filtered them on transit, probably due to long paths
- They were filtered from all of our transits, so their /24 was not in our table
- We did not receive their /24 on peering
- However, we did receive a covering prefix on peering
- Lastly, that actual destination network with the /24 no longer was connected to the network we received a covering route from, like a datacenter network that used to host them and SWIPed them their /24 to make it portable.

A 3rd party SaaS netflix platform's BGP/netflow/SNMP collectors were impacted by this, which was one of the first instances we encountered of this problem.

We now have some convoluted scripting and routing policy in place, trying to proactively discover prefixes that may be impacted by this and then explicitly accepting that prefix or ASN on transit. It is not a desirable solution, but this seems like it could become more common over time with v4 prefix sales/swaps/deaggregation (with covering prefixes left in place); as well as increased TE where parties announce aggregates and specifics from disjoint locations.

Our long term solution will be taking full tables again.

Ryan
On Fri, Jun 5, 2020 at 8:12 PM Ryan Rawdon <ryan@u13.net> wrote:
Shortly after filtering, we started occasionally finding destinations that were unreachable over the Internet (generally /24) due to:
I have observed this too. I know of no router that can do this, but you need the router to automatically accept any prefix on your transit link that is covered by anything received from your peers.
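[Editor's note: no router offers that knob, but the check itself is mechanical. A rough offline sketch using Python's ipaddress module — the peer-learned aggregates here are made-up documentation prefixes — of "accept a transit-learned specific if a peer-learned less-specific covers it":]

```python
import ipaddress

# Hypothetical covering aggregates learned from peers (invented for illustration).
peer_routes = [ipaddress.ip_network("203.0.113.0/24"),
               ipaddress.ip_network("198.51.100.0/22")]

def covered_by_peer(prefix: str) -> bool:
    """True if some peer-learned route covers this transit-learned prefix,
    i.e. a specific we should accept on transit to avoid the covering-route
    trap Ryan described."""
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(p) for p in peer_routes)

# A /24 learned on transit that a peer /22 covers -> accept it as a specific.
# covered_by_peer("198.51.100.0/24") -> True
```

A script like this could periodically regenerate a prefix-list for the transit sessions, which is roughly the "convoluted scripting and routing policy" Ryan mentions, just made explicit.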
On Jun 5, 2020, at 2:11 PM, Ryan Rawdon <ryan@u13.net> wrote:
A few clarifications to my previous e-mail below:
We started filtering certain mixes of long and specific routes on transit, at least while some upgrades to our edge capability are in progress. We are a mix of transit providers, and public/private peering at our edge.
Shortly after filtering, we started occasionally finding destinations that were unreachable over the Internet (generally /24) due to: - We filtered them on transit, probably due to long paths - They were filtered from all of our transits, so their /24 was not in our table - We did not receive their /24 on peering - However, we did receive a covering prefix on peering - Lastly, that actual destination network with the /24 no longer was connected to the network we received a covering route from, like a datacenter network that used to host them and SWIPed them their /24 to make it portable.
- Each of the criteria above is necessary but not sufficient alone; the whole list is required for the reachability failure mode I was describing
A 3rd party SaaS netflix platform’s BGP/netflow/SNMP collectors were impacted by this, which was one of the first instances we encountered of this problem.
- I meant Netflow, not Netflix…
We now have some convoluted scripting and routing policy in place, trying to proactively discover prefixes that may be impacted by this and then explicitly accepting that prefix or ASN on transit. It is not a desirable solution, but this seems like it could become more common over time with v4 prefix sales/swaps/deaggregation (with covering prefixes left in place); as well as increased TE where parties announce aggregates and specifics from disjoint locations.
Our long term solution will be taking full tables again.
Ryan
Hello,

Some time ago we had a similar discussion on this list. At the time I shared a small study we did at LACNIC, but we had it only in Spanish. Here is the version in English (BGP: To filter or not to filter by prefix size. That is the question): https://aaaa.acostasite.com/2019/07/bgp-to-filter-or-not-to-filter-by.html

Alejandro
On 6/4/20 11:00 PM, James Breeden wrote:
And before I get asked why not just run full tables, I'm looking at regional approaches to being able to use smaller, less powerful routers (or even layer3 switches) to run some areas of the network where we can benefit from summarization and full tables are really overkill.
One caveat that may or may not play into this is if you use uRPF (loose) on your transit links. -- inoc.net!rblayzor XMPP: rblayzor.AT.inoc.net PGP: https://pgp.inoc.net/rblayzor/
On Wed, Jun 10, 2020 at 9:25 AM Robert Blayzor <rblayzor.bulk@inoc.net> wrote:
One caveat that may or may not play into this is if you use uRPF (loose) on your transit links.
Hi Robert, The answer is "no," you're not running reverse-path filtering on a BGP speaker, not even in loose mode, because that's STUPID. At the very best you're tying up router resources on a very large filtering table without measurable benefit. More likely you're blackholing packets you failed to think about, like ICMPs from routers on peering lans whose route is intentionally not introduced to the Internet at large. I suppose the customers don't really need pmtud or traceroute... Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
On Wed, Jun 10, 2020 at 9:43 AM William Herrin <bill@herrin.us> wrote:
The answer is "no," you're not running reverse-path filtering on a BGP speaker, not even in loose mode, because that's STUPID.
Sorry, it'd be pre-coffee if I drank coffee and I was overly harsh here. Let me back up:

The most basic spoofing protection is: don't accept remote packets pretending to be from my IP address.

Strict mode URPF extends this to networks: don't accept packets on interfaces where I know for sure the source host isn't in that direction. It works fine in network segments whose structure requires routes to be perfectly symmetrical: on every interface, the packet for every source can only have been from one particular next hop, the same one that advertises acceptance of packets with that destination. The use of BGP breaks the symmetry requirement so close to always that you may as well think of it as always. Even with a single transit or a partial table. Don't use strict mode URPF on BGP speakers.

Loose mode URPF is... broken. It was a valiant attempt to extend reverse path filtering into networks with asymmetry but I've yet to discover a use where there wasn't some faulty corner case. If you think you want to use loose mode RPF, trust me: you've already passed the point where any RPF was going to be helpful to you. Time to set it aside and solve the problem a different way.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
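[Editor's note: the strict-vs-loose distinction Bill walks through can be modeled with a toy FIB. A sketch with invented interface names and documentation prefixes — not how any router implements the check:]

```python
import ipaddress

# Toy FIB: prefix -> egress interface (hypothetical names).
FIB = {
    "192.0.2.0/24": "transitC",
    "198.51.100.0/24": "transitD",
}

def lookup(src):
    """Longest-prefix match for a source address; None if no route."""
    addr = ipaddress.ip_address(src)
    for pfx, ifname in sorted(FIB.items(),
                              key=lambda kv: -ipaddress.ip_network(kv[0]).prefixlen):
        if addr in ipaddress.ip_network(pfx):
            return ifname
    return None

def strict_urpf_pass(src, in_iface):
    # Strict: the route back to the source must point out the arrival interface.
    return lookup(src) == in_iface

def loose_urpf_pass(src, in_iface):
    # Loose: any route to the source suffices, regardless of interface.
    return lookup(src) is not None

# A packet sourced from 192.0.2.1 arriving on transitD (asymmetric return
# path, routine with BGP multihoming): strict drops it, loose permits it.
```

This is exactly the asymmetry problem: with multiple transits, the best route back to a source routinely differs from the interface the packet arrived on, so strict mode drops legitimate traffic.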
Disagree with Bill here. Whether to use uRPF in any mode (loose or strict) will depend on the complexity of the network. In general, I never use uRPF on transit links; I use pure filters to ensure accurate filtering is in place. uRPF may be used internally in either mode to great advantage, and I've done it both ways.

If you are looking for corner cases, avoid networking. I do not know of a protocol or a technique that I cannot find a corner case for.

- Brian
On Wed, Jun 10, 2020 at 11:20 AM Brian Johnson <brian.johnson@netgeek.us> wrote:
Disagree with Bill here. It will depend on the complexity of the network as to use of uRPF in any mode (loose or strict). In general, I never use uRPF on transit links and use pure filters to ensure accurate filters are in place. uRPF may be used internally in either mode to great advantage and I’ve done it both ways.
Hi Brian, Do you know and understand what you broke? It's one thing to make a judgement call. Quite another to wave your hands and say, "Oh well, nobody complained so it must be OK."
If you are looking for corner cases, avoid networking. I do not know of a protocol or a technique that I cannot find a corner case for.
Not sure what you're saying here. Corner cases aren't a bad thing. They're just the point where a technology or technique is most likely to break. If you want reliability, you're supposed to identify the corner cases and then you're supposed to game out what happens in those corner cases. A result will be acceptable or unacceptable and if it's unacceptable you don't do that. If you haven't identified and gamed the corner cases then (A) you can't prove your stuff is reliable and (B) it probably isn't.

With RPF, the corner cases you're looking for are: what situations would cause a packet to come from the wrong interface? For example, if you had some sort of routing loop where router A thought it could get to a destination via router B but router B thinks that destination unreachable so it returns the packet to its default route at router A. RPF then drops the packet because router B isn't an acceptable source. That's a corner case for RPF but it's an acceptable case because the packet would be dropped regardless.

Another corner case with strict RPF is that your best route to a destination is transit C but a packet with that source arrives from transit D. That's broken, it causes significant problems for the network and as a result it constrains you to not use strict RPF in network scenarios where that's possible.

Loose mode RPF tries to overcome the limitation by saying: as long as there's a route announced from D we'll accept packets from D even if C is the best route.

So loose mode changes the nature of the corner cases you're looking for. Instead of looking for situations where the packet came from somewhere other than the best route, you're looking for situations where the packet came from an interface that advertises no return route at all. What are these situations?

You may have gotten a packet from a reciprocal peer whose customer has told them not to advertise their route to you. Your peer isn't policy-routing deep in their core, so no matter what their customer instructs, their packets to you will follow your peer's preferred path. If you use loose RPF there, you will black-hole that network's packets.

You may have gotten an ICMP TTL expired from a router on a distant peering lan whose operator doesn't announce the lan's route for security purposes. After all, those routers don't need to be reached from the Internet. If you lose that packet, traceroute fails to reveal the hop.

You may have gotten an ICMP fragmentation needed message from a router on the same distant peering lan. If you drop that packet, path MTU discovery fails and everything beyond that router is unreachable with TCP.

So you might want to consider whether any of these corner cases is activated by the way you use loose mode RPF.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
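[Editor's note: the PMTUD corner case above is worth making concrete. A toy sketch — prefixes invented, and deliberately simplified to "pass if any carried route covers the source" — showing loose uRPF discarding an ICMP whose source is an unannounced peering LAN:]

```python
import ipaddress

# Routes we actually carry (hypothetical). The distant peering LAN
# 203.0.113.0/24 is deliberately NOT announced by its operator, so it
# never appears here.
ROUTES = [ipaddress.ip_network("198.51.100.0/22")]

def loose_urpf_pass(src):
    """Loose uRPF: pass only if some carried route covers the source."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in ROUTES)

# An ICMP "fragmentation needed" arrives sourced from a router on the
# unannounced peering LAN. Loose uRPF drops it, and path MTU discovery
# silently breaks for everything beyond that router.
icmp_src = "203.0.113.17"
```

The packet is perfectly legitimate and simplex (no reply expected), which is exactly why the drop is invisible until someone's TCP sessions hang.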
Am I correct in assuming loose mode RPF only drops packets from unannounced address space in the global routing table? And the downside of doing so is that sometimes we do receive packets from that address space, usually back scatter from traceroute or other ICMP messages.

Currently about 25% of the routable address space is not advertised in the DFZ. Loose mode RPF could filter this. Is there any data on how much traffic actually arrives from this space?

Regards,

Baldur
On Wed, Jun 10, 2020 at 3:02 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
Am I correct in assuming loose mode RPF only drops packets from unannounced address space in the global routing table?
Actually, I'm not sure since my plan around RPF is "10 foot pole." Is "loose mode" really just filtering packets the current routing table deems to be bogons? If it's not tied in any way to the actual routing paths then it seems poorly named.
And the downside of doing so is that sometimes we do receive packets from that address space, usually back scatter from traceroute or other ICMP messages.
Those "other" ICMP messages are kinda important since TCP fails if they're discarded. If it's just a bogon filter then by definition only simplex communications can be impacted since there's known to be no way for duplex communication to occur. PMTUD and traceroute responses are examples: a router telling a host information but expecting no response. SNMP traps are simplex though it's not obvious to me how that would matter here. What else can you think of that's simplex? Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
Once upon a time, William Herrin <bill@herrin.us> said:
On Wed, Jun 10, 2020 at 3:02 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
Am I correct in assuming loose mode RPF only drops packets from unannounced address space in the global routing table?
Actually, I'm not sure since my plan around RPF is "10 foot pole." Is "loose mode" really just filtering packets the current routing table deems to be bogons? If it's not tied in any way to the actual routing paths then it seems poorly named.
I think it's just named that because it was an extension of uRPF; it's the same mechanism, just stops one step sooner (loose uRPF looks up the source IP in the FIB to see if it exists, while strict mode then also looks at the source interface to see if it matches the FIB next-hop). Loose mode does also make dropping bad traffic easier - for example, if you have a BGP-triggered remote blackhole, not only will you drop traffic destined to the IP, but from the source (at least, depending on the router and config - some treat null routes as "valid path" for loose uRPF and some do not).
PMTUD and traceroute responses are examples: a router telling a host information but expecting no response.
The only typical potentially-valid sources that a router with a full table wouldn't have that I can see is some peering networks, where the peering fabric space is not announced in BGP. You should never see PMTU issues there, since everybody properly operating on the peering fabric should have the same MTU (or they'll potentially have BGP issues anyway). And while TTL expired messages could also come from a peering IP, that seems a super corner case (especially since peering is usually closer rather than farther away). I've seen enough providers that drop hops in traceroute that I can only assume nobody really cares about that case either. -- Chris Adams <cma@cmadams.net>
On Thu, Jun 11, 2020 at 12:01:38AM +0200, Baldur Norddahl wrote:
Am I correct in assuming loose mode RPF only drops packets from unannounced address space in the global routing table? And the downside of doing so is that sometimes we do receive packets from that address space, usually back scatter from traceroute or other ICMP messages.
uRPF absolutely kills the pps performance of your hardware due to the packet having to be recirculated to do the check (at least this is the case on every platform that I've ever tested it on). Use ACLs to protect your edge.

-b
On Thu, Jun 11, 2020 at 9:08 AM brad dreisbach <bradd@us.ntt.net> wrote:
uRPF absolutely kills the pps performance or your hardware due to the packet having to be recirculated to do the check(at least this is the case on every platform that ive ever tested it on). use acl's to protect your edge.
Hi Brad, Don't the ACLs generally live in a partition of the TCAM too? So you're going from two constant-time TCAM lookups per packet (route, acls) to three (route, urpf, acls)? Not rhetorical; getting close to the edge of my knowledge here. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
This is just my experience so do whatever you want with that. The only time we have ever noticed any sort of operational downside of using uRPF loose was when NTT's router in NYC thought a full table was only 500,000 routes a few years back. That is a fairly real consideration though. =)
Hey Drew,
The only time we have ever noticed any sort of operational downside of using uRPF loose was when NTTs router in NYC thought a full table was only 500,000 routes a few years back.
If NTT is 2914, this can no longer happen and it is difficult to see 2914 ever going back to uRPF.

In typical implementations today an ACL is much cheaper than uRPF, so we've migrated to ACLs. uRPF's value proposition is mostly on CLI-jockey networks; if configurations are generated, ACLs are the superior solution for most use cases anyhow.

In your particular defect, it doesn't seem to matter whether uRPF was or was not enabled; whether it was dropped by a uRPF/loose failure or a lookup failure seems uninteresting (we do not default route).

--
++ytti
Yeah, as I mentioned this was a few years ago. =)
On Jun 10, 2020, at 6:40 PM, brad dreisbach <bradd@us.ntt.net> wrote:
uRPF absolutely kills the pps performance of your hardware due to the packet having to be recirculated to do the check (at least this is the case on every platform that I've ever tested it on). Use ACLs to protect your edge.
-b
Completely agree for edge scenario 100%. And now for Bill to talk down to me…. :/
On 6/10/20 6:01 PM, Baldur Norddahl wrote:
Am I correct in assuming loose mode RPF only drops packets from unannounced address space in the global routing table? And the downside of doing so is that sometimes we do receive packets from that address space, usually back scatter from traceroute or other ICMP messages.
Currently about 25% of the routable address space is not advertised in the DFZ. Loose mode RPF could filter this. Is there any data on how much traffic actually arrives from this space?
Loose mode RPF will essentially drop traffic received on an interface if the router does not have any route for the source. (It will not match a default or a discard route, at least in IOS-XR.) As Bill has pointed out, this may drop traffic from some peering networks that are not in the global routing table. Though one could argue that if a packet needs to be fragged, that typically happens closer to the edges rather than on the transit/peering links. -- inoc.net!rblayzor XMPP: rblayzor.AT.inoc.net PGP: https://pgp.inoc.net/rblayzor/
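For concreteness, a minimal IOS-XR-style sketch of loose-mode uRPF on an edge interface (the interface name is hypothetical). In line with the behavior described above, a default route does not satisfy the check unless the `allow-default` keyword is added:

```
interface HundredGigE0/0/0/0
 ipv4 verify unicast source reachable-via any
 ipv6 verify unicast source reachable-via any
!
```

`reachable-via any` is loose mode (any route to the source passes); `reachable-via rx` would be strict mode.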
Hi all,
No one has mentioned it, but you can also use an ACL combined with uRPF. You could even go so far as permitting everything and just using uRPF for RTBH purposes. Brian
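The uRPF-for-RTBH idea Brian mentions is essentially source-based remote-triggered blackholing (RFC 5635): run loose-mode uRPF at the edge, then blackhole a source by advertising a host route whose next hop resolves to a discard route. A rough IOS-style sketch; all addresses are illustrative documentation space, and the tag is an assumption:

```
! On every edge router: a next-hop that resolves to discard
ip route 192.0.2.1 255.255.255.255 Null0
!
! Loose-mode uRPF on the edge interface
interface TenGigabitEthernet0/0/0
 ip verify unicast source reachable-via any
!
! On the trigger router: blackhole traffic FROM 203.0.113.66
! (redistributed into iBGP; loose uRPF now resolves that source
! to Null0, so every edge router drops packets sourced from it)
ip route 203.0.113.66 255.255.255.255 192.0.2.1 tag 666
```

This is what makes loose mode attractive even on networks that otherwise filter with ACLs: it gives you a network-wide, source-based drop triggered by a single iBGP advertisement.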
On Jun 10, 2020, at 3:05 PM, William Herrin <bill@herrin.us> wrote:
On Wed, Jun 10, 2020 at 11:20 AM Brian Johnson <brian.johnson@netgeek.us> wrote:
Disagree with Bill here. Whether to use uRPF in any mode (loose or strict) depends on the complexity of the network. In general, I never use uRPF on transit links; I use pure filters there to ensure accurate filtering is in place. uRPF may be used internally in either mode to great advantage, and I've done it both ways.
Hi Brian,
Do you know and understand what you broke? It's one thing to make a judgement call. Quite another to wave your hands and say, "Oh well, nobody complained so it must be OK."
Hi Bill, I fully understand that I have not “broken” anything. If I run uRPF in loose mode on my internal links, I will not forward packets from a source that doesn’t exist in my routing tables to the rest of the internet. Just say thank you and we can stop this silliness.
If you are looking for corner cases, avoid networking. I do not know of a protocol or a technique that I cannot find a corner case for.
Not sure what you're saying here. Corner cases aren't a bad thing. They're just the point where a technology or technique is most likely to break. If you want reliability, you're supposed to identify the corner cases and then you're supposed to game out what happens in those corner cases. A result will be acceptable or unacceptable and if it's unacceptable you don't do that. If you haven't identified and gamed the corner cases then (A) you can't prove your stuff is reliable and (B) it probably isn't.
Actually corner cases are the exception to any rule. You will never solve for all of the corner cases unless you have dictatorial control of the entire system. I can only control what I send. I can say that sending packets from unknown sources from a network I do control is a no-no, ICMP response or not.
With RPF, the corner cases you're looking for are: what situations would cause a packet to come from the wrong interface? For example, if you had some sort of routing loop where router A thought it could get to a destination via router B but router B thinks that destination unreachable so it returns the packet to its default route at router A. RPF then drops the packet because router B isn't an acceptable source. That's a corner case for RPF but it's an acceptable case because the packet would be dropped regardless.
So a non-problem corner-case. Interesting... I never thought of a non-problem as a problem.
Another corner case with strict RPF is that your best route to a destination is transit C but a packet with that source arrives from transit D. That's broken; it causes significant problems for the network, and as a result it constrains you not to use strict RPF in network scenarios where that's possible.
See my original response where I say not to use it in that scenario.
Loose mode RPF tries to overcome the limitation by saying: as long as there's a route announced from D we'll accept packets from D even if C is the best route.
Saying that a route to the source has to be in the routing table on an internal network, something I can control, is a valid way to stop spoofing. uRPF cannot, and should never be used to, make routing decisions and does not do that anyway. The rest of your comments are not reasonable since I already said to not use uRPF on TRANSIT LINKS. Happy to expand that to PEERING LINKS. For internal networking, you CAN use uRPF to great effect, corner case arguments notwithstanding. Not everyone runs a large multi-national tier-1/2 network. Some of us run the thousands of eye-ball networks and just say thank you when we don’t allow spoofing. BTW... RPF and uRPF are significantly different things. :) <SNIP>
Regards, Bill Herrin
-- William Herrin bill@herrin.us https://bill.herrin.us/
On Thu, Jun 11, 2020 at 6:25 AM Brian Johnson <brian.johnson@netgeek.us> wrote:
I fully understand that I have not “broken” anything.
Handwaving, la la la, only sunshine in the sky. Got it. -Bill -- William Herrin bill@herrin.us https://bill.herrin.us/
You are a dismissive little twit aren’t you. :/
On Thu, Jun 11, 2020 at 9:35 AM Brian Johnson <brian.johnson@netgeek.us> wrote:
You are a dismissive little twit aren’t you. :/
Someone stood up and said, "Nope, nothing I did could possibly have broken anything." I'm pretty sure that someone was you but feel free to call me on it if I'm mistaken. Look, at the risk of doing further offense, it's like I said: it's one thing to educate yourself about a topic and then make a judgement call about what's acceptable. It's quite another to remain willfully ignorant in service of your preferred view. I just got through describing specific scenarios where loose urpf fails when you responded that no, it doesn't break anything. If you'd said, "no, that breakage is a small price worth paying," I'd have debated the merits with you or simply let it stand as a contrary opinion. Refusing to acknowledge the breakage is worth only dismissal. Regards, Bill -- William Herrin bill@herrin.us https://bill.herrin.us/
Wow. Full distorted vision of reality mode here… uRPF doesn’t “break” anything. I stand by that. It’s not a religious position. It’s an operational experience. One that I have multitudes of real world examples of it working to SOLVE issues. You seem to be willfully ignorant about how real networks use tools that you dislike to solve problems. This is way more of a problem with you disliking uRPF than me telling you that I like it for some applications. Now I remember why I usually never post on this list now. I will just dismiss your opinions going forward instead of trying to point out that you aren’t the only measure of a network. Thanks Bill.
On 10/Jun/20 19:31, William Herrin wrote:
Sorry, it'd be pre-coffee if I drank coffee and I was overly harsh here. Let me back up:
The most basic spoofing protection is: don't accept remote packets pretending to be from my IP address.
Strict mode URPF extends this to networks: don't accept packets on interfaces where I know for sure the source host isn't in that direction. It works fine in network segments whose structure requires routes to be perfectly symmetrical: on every interface, the packet for every source can only have been from one particular next hop, the same one that advertises acceptance of packets with that destination. The use of BGP breaks the symmetry requirement so close to always that you may as well think of it as always. Even with a single transit or a partial table. Don't use strict mode URPF on BGP speakers.
Loose mode URPF is... broken. It was a valiant attempt to extend reverse path filtering into networks with asymmetry but I've yet to discover a use where there wasn't some faulty corner case. If you think you want to use loose mode RPF, trust me: you've already passed the point where any RPF was going to be helpful to you. Time to set it aside and solve the problem a different way.
We don't run Loose Mode on peering routers because they don't carry a full table. If anyone sent the wrong packets that way, they wouldn't be able to leave the box anyway. We do run Loose Mode on transit routers, no issues thus far. We do run Strict Mode on customer-facing links that are stub-homed to us (DIA). We also run Loose Mode on customer-facing links that buy transit (BGP). But mostly, BCP-38 deployed at the edge (peering, transit and customer routers) also goes a long way in protecting the network. Mark.
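Mark's deployment (strict mode on stub DIA customers, loose mode on BGP transit customers) maps to per-interface knobs. A hedged Junos-style sketch, with hypothetical interface names:

```
interfaces {
    ge-0/0/0 {
        unit 0 {
            family inet {
                rpf-check;              /* strict mode: stub DIA customer */
            }
        }
    }
    ge-0/0/1 {
        unit 0 {
            family inet {
                rpf-check {
                    mode loose;         /* loose mode: BGP transit customer */
                }
            }
        }
    }
}
```

Strict mode is safe on the stub links because routing there is guaranteed symmetric; the BGP customers get loose mode precisely because multihoming makes asymmetry likely.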
Mark (and others),

I used to run loose uRPF on peering/transit links for AS3128 because I used to think that tightening the screws was always the "right thing to do". I instrumented at 60s granularity with vendor J uRPF drop counters on these links. Drops during steady state [BGP converged] were few [Kbps]. Drops during planned maintenance ran at much higher rates for a few minutes.

What was happening: I advertise a handful of routes to transit/peers from multiple ASBRs. Typically my ASBR sees 800K FIB and a few million RIB routes, and we all know that takes a good amount of time to churn. For planned maintenance of 'ASBR A' [cold-boot upgrades], if recovery didn't include converging my inbound routes before doing eBGP advertising, I'd be tossing packets due to loose uRPF. Remember, during this time 'ASBR B' in my AS is happily egressing traffic. As soon as 'ASBR A' advertises my dozen or so prefixes via eBGP, I start to see return traffic well before 'ASBR A' has converged: no more-specific return route yet, other than maybe default, for a few minutes if unlucky. The result is a network-wide bit bucket despite 'ASBR B' functioning just fine.

Maybe everyone else already converges inbound before advertising eBGP and I made a rookie mistake, but what about unplanned events? For me the summary is that I was causing more collateral damage than good [verified by time-series data], so I turned off loose uRPF. YMMV.

-Michael
On 12/Jun/20 04:01, Michael Hare wrote:
To be honest, we haven't seen this. We've got plenty of peering and transit exit/entry points, each just about dedicated to its own router, across multiple cities in Africa (peering) and Europe (peering + transit). We also only do about 10% - 15% of traffic via transit (remember, we don't run any kind of uRPF on our peering routers).

We originate our aggregates from deep within the core, never from the transit, peering or edge routers.

We did experience some slowness with the ASR9001 some years back during convergence for the transit network that was connected to that router in Amsterdam, but it had been slowing down for years. Since swapping it out for MX480's and/or MX204's, that problem has since gone away.

Mark.
participants (24)
- Alejandro Acosta
- Baldur Norddahl
- brad dreisbach
- Brian Johnson
- Brian Turnbow
- Ca By
- Chris Adams
- Chuck Anderson
- Drew Weaver
- James Breeden
- Job Snijders
- Joe Greco
- Mark Tinka
- Michael Hare
- Mike Hammett
- Nick Hilliard
- Robert Blayzor
- Ryan Rawdon
- Ryan Woolley
- Saku Ytti
- Tom Beecher
- Tore Anderson
- William Herrin
- Yang Yu