On Jun 5, 2020, at 2:11 PM, Ryan Rawdon <ryan@u13.net> wrote:
On Jun 4, 2020, at 11:00 PM, James Breeden <James@arenalgroup.co> wrote:
I have been doing a lot of research recently on operating networks with partial tables and a default to the rest of the world. Seems like an easy enough approach for regional networks where you have maybe only 1 upstream transit and some peering.
I come to NANOG to get feedback from others who may be doing this. We have 3 upstream transit providers and PNI and public peers in 2 locations. It'd obviously be easy to transition to doing partial routes for just the peers, etc, but I'm not sure where to draw the line on the transit providers. I've thought of straight preferencing one over another. I've thought of using BGP filtering and community magic to basically allow Transit AS + 1 additional AS (Transit direct customer) as specific routes, with summarization to default for the rest. I'm sure there are other thoughts that I haven't had about this as well....
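To make the "Transit AS + 1 additional AS" idea concrete, here is a minimal sketch of the acceptance test in Python. It is not router configuration and not anyone's production policy. The function name, the `max_hops_beyond_transit` parameter, and the ASNs (taken from the documentation range) are all made up for illustration. It assumes the AS path arrives as a list of integers with the transit's ASN first.

```python
def accept_specific(as_path, transit_asn, max_hops_beyond_transit=1):
    """Return True if a route learned from transit should be kept as a specific.

    Hypothetical helper: keeps routes originated by the transit itself or by
    an AS directly behind it; everything else would follow the default route.
    """
    # Collapse AS-path prepending so a repeated ASN counts as one hop.
    collapsed = []
    for asn in as_path:
        if not collapsed or collapsed[-1] != asn:
            collapsed.append(asn)

    # The path must start at the transit provider we learned it from.
    if not collapsed or collapsed[0] != transit_asn:
        return False

    # Number of distinct ASes beyond the transit itself.
    return len(collapsed) - 1 <= max_hops_beyond_transit


# Illustrative paths only (ASNs from the documentation range).
print(accept_specific([64500, 64501, 64501], 64500))  # direct customer, prepended
print(accept_specific([64500, 64501, 64502], 64500))  # two ASes beyond transit
```

On a real router the same intent would normally be expressed as an AS-path regex or a provider-set community match in the import policy, with a default route covering whatever is rejected.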
And before I get asked why not just run full tables, I'm looking at regional approaches to being able to use smaller, less powerful routers (or even layer3 switches) to run some areas of the network where we can benefit from summarization and full tables are really overkill.
A few clarifications to my previous e-mail below:
We started filtering certain mixes of long and specific routes on transit, at least while some upgrades to our edge capability are in progress. We have a mix of transit providers and public/private peering at our edge.
Shortly after filtering, we started occasionally finding destinations that were unreachable over the Internet (generally /24) due to:
  - We filtered them on transit, probably due to long paths
  - They were filtered from all of our transits, so their /24 was not in our table
  - We did not receive their /24 on peering
  - However, we did receive a covering prefix on peering
  - Lastly, that actual destination network with the /24 no longer was connected to the network we received a covering route from, like a datacenter network that used to host them and SWIPed them their /24 to make it portable.
- Each of the criteria above is necessary but not sufficient alone; the whole list is required for the reachability failure mode I was describing
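The failure mode described above comes down to longest-prefix matching. A rough sketch in Python, with made-up prefixes and "transit"/"peer" labels standing in for real next hops (none of these values come from our network):

```python
import ipaddress


def best_route(table, dest):
    """Longest-prefix match over a dict of prefix string -> route source."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, source in table.items():
        net = ipaddress.ip_network(prefix)
        # Keep the most specific (longest) prefix that contains the address.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, source)
    return None if best is None else (str(best[0]), best[1])


# Illustrative tables only. The /24 is reachable via transit, while the
# covering /16 comes from a peer that no longer hosts the /24.
full_table = {
    "0.0.0.0/0": "transit",
    "10.20.0.0/16": "peer",
    "10.20.30.0/24": "transit",
}
# Same table after the /24 has been filtered on every transit.
filtered_table = {p: s for p, s in full_table.items() if p != "10.20.30.0/24"}

print(best_route(full_table, "10.20.30.1"))      # follows the /24 via transit
print(best_route(filtered_table, "10.20.30.1"))  # falls to the /16 via the peer
```

With the /24 filtered, the covering /16 from the peer is still more specific than the default, so the default toward transit never gets used and the traffic goes to a network that can no longer deliver it.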
A 3rd party SaaS netflix platform’s BGP/netflow/SNMP collectors were impacted by this, which was one of the first instances we encountered of this problem.
- I meant Netflow, not Netflix…
We now have some convoluted scripting and routing policy in place, trying to proactively discover prefixes that may be impacted by this and then explicitly accepting that prefix or ASN on transit. It is not a desirable solution, but this seems like it could become more common over time with v4 prefix sales/swaps/deaggregation (with covering prefixes left in place); as well as increased TE where parties announce aggregates and specifics from disjoint locations.
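As a rough idea of what the discovery step might look like (this is a simplified sketch, not our actual scripting), the Python below flags prefixes that were filtered on transit, are absent from peering, and are covered by a less specific prefix learned from peering. The function name and sample prefixes are hypothetical. It cannot tell whether the covering network still delivers to the more specific, so the output is only a list of candidates to check or to accept explicitly on transit.

```python
import ipaddress


def at_risk_prefixes(filtered_on_transit, accepted_from_peers):
    """Return (filtered prefix, covering peer prefix) pairs worth checking."""
    peers = [ipaddress.ip_network(p) for p in accepted_from_peers]
    at_risk = []
    for prefix in filtered_on_transit:
        net = ipaddress.ip_network(prefix)
        # If peering already carries the exact prefix, nothing is lost.
        if net in peers:
            continue
        # Covering prefixes of the same address family learned from peers.
        covers = [c for c in peers
                  if c.version == net.version and net.subnet_of(c)]
        if covers:
            # Report the most specific cover, which is what forwarding uses.
            longest = max(covers, key=lambda c: c.prefixlen)
            at_risk.append((str(net), str(longest)))
    return at_risk


# Illustrative input only.
filtered = ["10.20.30.0/24", "10.99.0.0/24", "10.50.0.0/24"]
from_peers = ["10.20.0.0/16", "10.50.0.0/24", "10.50.0.0/16"]
print(at_risk_prefixes(filtered, from_peers))
```

Each candidate would still need a reachability test, or a comparison of origin ASNs between the specific and its cover, before deciding to re-accept the prefix or the whole ASN on transit.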
Our long term solution will be taking full tables again.
Ryan