"Tactical" /24 announcements
How does the community feel about using /24 originations in BGP as a tactical advantage against potential bgp hijackers? All of our allocations are larger and those prefixes we announce for clients as well usually are. But we had a request recently to originate everything as distinct /24 prefixes, to reduce the effect of a potential bgp hijack. It seemed a little bit like a tragedy of the commons situation. Is this seen as route table pollution, or a necessary evil in today's world? How many routers out there today would be affected if everyone did this? Are there any big networks that drop or penalize announcements like this?
I prefer the approach of disaggregating only when needed, not as a preventative measure. There are tools that can help with automating this disaggregation (ARTEMIS can do this, for example).
Chris
It's route table pollution, if you ask me. In today's world we have many IXPs and several tier-1 operators that support RPKI ROV, so once you have issued ROAs for the supernet of the IP space in question, that will already significantly reduce the effects of a BGP hijack.
Best regards,
Martijn
On Mon, 9 Aug 2021 at 19:07, Martijn Schmidt via NANOG <nanog@nanog.org> wrote:
It's route table pollution if you ask me.. in today's world we have many IXPs and several tier-1 operators that support RPKI ROV, so when you have issued ROAs for the supernet of the IP space in question it'll already significantly reduce the effects of a BGP hijack.
Not just a route table.
- RIB scale
- FIB scale
- Configuration scale
We just recently learned of an IOS-XR prefix-set limit of 300001 when a particular customer AS-SET expanded to a higher number of prefixes.
The problem with this scaling is that it doesn't reflect an increase of revenue; it reflects an increase of cost: CAPEX to upgrade devices without winning new business, and OPEX to manage those upgrades. So it is a somewhat selfish solution to a problem.
--
++ytti
On Mon, Aug 9, 2021 at 8:48 AM Billy Croan <BCroan@unrealservers.net> wrote:
How does the community feel about using /24 originations in BGP as a tactical advantage against potential bgp hijackers? How many routers out there today would be affected if everyone did this?
Hi Billy, I did some math on this years ago and it worked out to about 8.5 million IPv4 routes. That's 10 times the current table size, more than any big-iron router can handle today. If everybody did it, it'd crash the Internet.
Is this seen as route table pollution, or a necessary evil in today's world?
Pollution. And it won't save you from a hijack either, since your adversary's /24 routes will compete and win for at least part of the Internet.
Are there any big networks that drop or penalize announcements like this?
Not in an automated way. Which is bad news for you if you do this because it means getting folks to -undo- the restrictions they manually enforce on your specific address space is nearly impossible. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
On Mon, Aug 9, 2021 at 9:24 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
William Herrin wrote:
I did some math on this years ago and it worked out to about 8.5 million IPv4 routes.
It should be 14M.
Doubtful. Like I said, I did the math. The question I asked at the time was:
If:
- IPv6 fails to overtake IPv4, and
- IPv4 continues to be divided and redistributed to progressively higher-value uses, and
- the /24 public Internet announcement boundary holds,
then what will the terminal size of the IPv4 Internet BGP table be?
There are 2^24 = 16.8M /24s in the IPv4 address space.
- Many of these are reserved for non-unicast uses, e.g. 224/3, 0/8.
- Many of the unicast addresses are reserved for non-public uses, e.g. 10/8, 127/8.
- Some portion of the assigned address space is used off-Internet in valuable enough ways that its owners are unlikely to release it for use on the Internet.
- Some portion of the address space won't be disaggregated to /24 because their owners won't find it convenient.
- Some portion of the address space will have overlapping announcements (/24s and the overlapping /20, that sort of thing).
I no longer have the exact formula, but I made some reasonable assumptions for each of these factors and it worked out to 8.5 million as the probable terminal size of the IPv4 table. There were error bands too, I forget what they were, but nothing came even close to 14M.
Regards, Bill Herrin
--
William Herrin bill@herrin.us https://bill.herrin.us/
----- On Aug 9, 2021, at 9:22 AM, Masataka Ohta mohta@necom830.hpcl.titech.ac.jp wrote: Hi,
It should be 14M.
Just for fun, I did the math. A total of 16,777,216 /24s fit in 32 bits. Take away all the reserved space as per IANA (this is 1,266,696 /24s, see below), and we end up with 16,777,216 - 1,266,696 = 15,510,520 potential /24 advertisements.
The largest FIB table I have seen (hi Jim!) was 3,563,546 routes in hardware. This was in a lab environment, of course.
Thanks,
Sabri
https://www.iana.org/assignments/iana-ipv4-special-registry/iana-ipv4-specia...
Subnet              Number of /24s
0.0.0.0/8                    65536
10.0.0.0/8                   65536
100.64.0.0/10                16384
127.0.0.0/8                  65536
169.254.0.0/16                 256
172.16.0.0/12                 4096
192.0.0.0/24                     1
192.0.2.0/24                     1
192.31.196.0/24                  1
192.52.193.0/24                  1
192.88.99.0/24                   1
192.168.0.0/16                 256
192.175.48.0/24                  1
198.18.0.0/15                  512
198.51.100.0/24                  1
203.0.113.0/24                   1
240.0.0.0/4                1048576
Total reserved           1,266,696
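The arithmetic above can be re-checked mechanically. The following is only a sketch: the prefix list is copied from the table in this message, the helper name `count_slash24s` is made up for illustration, and the final subtraction of 224.0.0.0/4 reflects the multicast objection raised in the replies rather than anything in the table itself.

```python
import ipaddress

# Prefixes copied from the table in the message above.
RESERVED = [
    "0.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8",
    "169.254.0.0/16", "172.16.0.0/12", "192.0.0.0/24", "192.0.2.0/24",
    "192.31.196.0/24", "192.52.193.0/24", "192.88.99.0/24",
    "192.168.0.0/16", "192.175.48.0/24", "198.18.0.0/15",
    "198.51.100.0/24", "203.0.113.0/24", "240.0.0.0/4",
]


def count_slash24s(prefix: str) -> int:
    """Number of /24s inside an IPv4 prefix of length 24 or shorter."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > 24:
        raise ValueError(f"{prefix} is longer than a /24")
    return 2 ** (24 - net.prefixlen)


total_slash24s = 2 ** 24
reserved_slash24s = sum(count_slash24s(p) for p in RESERVED)
remaining = total_slash24s - reserved_slash24s

# Multicast space (224.0.0.0/4) is not in the table; subtracting it as well
# is what the replies in this thread suggest.
without_multicast = remaining - count_slash24s("224.0.0.0/4")

print(total_slash24s)     # 16777216
print(reserved_slash24s)  # 1266696
print(remaining)          # 15510520
print(without_multicast)  # 14461944
```

This reproduces the 1,266,696 and 15,510,520 figures, and lands near 14.5 million once multicast is removed as well.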
On Mon, Aug 9, 2021 at 10:31 AM Sabri Berisha <sabri@cluecentral.net> wrote:
Just for fun, I did the math. A total of 16,777,216 /24s fit in 32 bits. Take away all the reserved space as per IANA (this is 1,266,696 /24s, see below), and we end up with 16,777,216 - 1,266,696 = 15,510,520 potential /24 advertisements.
Howdy, It's not that simple. For example, 224/4 is not a 'reserved' space but it can't appear in the unicast BGP table either. That alone is a million routes unaccounted for in your math. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
Sabri Berisha wrote:
Just for fun, I did the math. A total of 16,777,216 /24s fit in 32 bits. Take away all the reserved space as per IANA (this is 1,266,696 /24s, see below),
240.0.0.0/4 1048576
I think we should also take away multicast addresses of
224.0.0.0/4 1048576
because multicast routes cannot be aggregated and must be treated as /32s. Anyway,
The largest FIB table I have seen (hi Jim!) was 3,563,546 routes in hardware. This was in a lab environment, of course.
For /24s, these days, having a 16M-entry SRAM (a simple one, not TCAM) is trivially easy. Masataka Ohta
Bill said,
Is this seen as route table pollution, or a necessary evil in today's world?
Pollution. And it won't save you from a hijack either, since your adversary's /24 routes will compete and win for at least part of the Internet.
I agree, of course, that moving to announce every /24 would pollute the net. Note that if you use ROAs, you'll also have to make corresponding /24 ROAs, and I don't know whether this would also have a problematic impact on the RPKI infrastructure. Not good.
But:
- Assuming the /24 has a proper ROA, and ROV is reasonably deployed, this _would_ protect most of the traffic sent to the /24 from a hijacker announcing a /24 (and even more if the hijack is of a shorter prefix, of course).
- As long as ROV isn't _very_ widely deployed, it would often fail to protect against the hijack without such a measure (a competing /24), so this will remain necessary (if you wish to prevent hijacks).
We've done some relevant simulations, and have also proposed a simple extension to ROV, called ROV++, which protects against such sub-prefix hijacks without requiring a competing /24 announcement, and which is effective already with modest adoption (of ROV++) by BGP routers. (Mixed ROV / ROV++ adoption should also help, but we didn't do these simulations yet.) See: https://www.ndss-symposium.org/ndss-paper/rov-improved-deployable-defense-ag...
tl;dr: ROV++ routers would blackhole subprefix traffic rather than send it on a route which would be hijacked (i.e., if the route is to a neighbor AS that announced the legit prefix _and_ the hijacked subprefix). Simple. [And no, I'm not happy with the resulting disconnections, but it's better than a hijack, imho.]
best, Amir
--
Amir Herzberg
Comcast professor of Security Innovations, Computer Science and Engineering, University of Connecticut
Homepage: https://sites.google.com/site/amirherzberg/home
`Applied Introduction to Cryptography' textbook and lectures: https://sites.google.com/site/amirherzberg/applied-crypto-textbook
Yes, it is bad practice. Yes, it's polluting the route table. If the # of /24s involved is not ridiculously large (say, <64?) then I would go ahead, as long as IRR and/or RPKI are also updated.
Obviously if everyone did it (i.e. advertising /24s exclusively) then our FIBs would collectively balloon to a grotesquely unmanageable size, at least on platforms that can't auto-aggregate that back down. Thankfully, everyone isn't doing it.
I, too, would vastly prefer no-one did this, but I have two customers that demand it from time to time... and we've even done it for our own allocation sometimes, and there's no robust, never mind bullet-proof, technical argument why I can't do that for them (or for ourselves). OTOH robust arguments exist for why it's a good thing to do - sometimes, and temporarily.
¯\_(ツ)_/¯
-Adam
Adam Thompson
Consultant, Infrastructure Services
100 - 135 Innovation Drive
Winnipeg, MB, R3T 6A8
(204) 977-6824 or 1-800-430-6404 (MB only)
athompson@merlin.mb.ca
www.merlin.mb.ca
On 09/08/2021 18:47, Billy Croan wrote:
How does the community feel about using /24 originations in BGP as a tactical advantage against potential bgp hijackers?
All of our allocations are larger and those prefixes we announce for clients as well usually are. But we had a request recently to originate everything as distinct /24 prefixes, to reduce the effect of a potential bgp hijack. It seemed a little bit like a tragedy of the commons situation.
Is this seen as route table pollution, or a necessary evil in today's world? How many routers out there today would be affected if everyone did this? Are there any big networks that drop or penalize announcements like this?
In addition to what everyone else said, announcing /24s will not help you one bit since ASNs announce /25s, /26s, /27s, etc. Attached is a 7800+ line text file sorted by ASN with prefixes being announced that are more specific than /24 (only /25+/26+/27 listed). This is based on http://www.ris.ripe.net/dumps/riswhoisdump.IPv4.gz from about a month ago. That dump lists all the IPv4 prefixes seen in the collective of latest RIS table dumps, together with origin AS and number of peers that passed the routes to RIS. So good luck with announcing /24s. Regards, Hank
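A filter like the one Hank describes could be sketched roughly as follows. This is not his script. It assumes each data line of the dump is three whitespace-separated fields in the order origin AS, prefix, number of RIS peers, and that comment lines start with `%`; check both against the actual file before relying on it. The sample lines are invented and use documentation prefixes and ASNs.

```python
import ipaddress
from collections import defaultdict


def longer_than_24(lines, min_len=25, max_len=27):
    """Group IPv4 prefixes with min_len <= length <= max_len by origin AS.

    Assumed line layout (not verified here): "<origin> <prefix> <peers>".
    """
    by_origin = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blanks and comment lines
        fields = line.split()
        if len(fields) != 3:
            continue  # not a data line in the assumed layout
        origin, prefix, peers = fields
        try:
            net = ipaddress.ip_network(prefix, strict=False)
            seen_by = int(peers)
        except ValueError:
            continue  # unparseable prefix or peer count
        if net.version == 4 and min_len <= net.prefixlen <= max_len:
            by_origin[origin].append((str(net), seen_by))
    return dict(by_origin)


# Hypothetical sample in the assumed layout.
sample = [
    "% comment line",
    "64500\t198.51.100.0/25\t12",
    "64500\t198.51.100.0/24\t300",
    "64501\t203.0.113.64/26\t3",
    "64501\t203.0.113.0/28\t1",
]
result = longer_than_24(sample)
for origin in sorted(result):
    print(origin, result[origin])
```

The peer count is worth keeping in the output, since it gives a rough idea of how far each longer-than-/24 prefix actually propagated.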
Folks can announce longer than /24 masks all day. They're unlikely to propagate very far though, since most won't accept longer than /24 from the world at large.
To the OP, there are some valid reasons to strategically deaggregate here and there, but a blanket "yolo my entire allocation into /24s" seems to be a pretty ill-considered request.
Dear team,
I have resorted to more specific announcements during hijacks, though with only one purpose in mind: to buy us a bit of time while the upstreams and peers put blocks in place to thwart the hijack as close to the source as possible. The more specifics are an imperfect solution, since they don't always propagate as widely or as quickly as the hijacks, but they buy us a bit of time.
The more important part of that solution is to network with fellow network operators. This is my go-to solution for everything from hijacking to DDoS to "what the heck is that?!" :)
Be well,
Rabbi Rob.
--
Rabbi Rob Thomas, Team Cymru
"It is easy to believe in freedom of speech for those with whom we agree." - Leo McKern
On 8/9/21 19:38, Tom Beecher wrote:
Folks can announce longer than 24 masks all day. They're unlikely to propagate very far though, since most won't accept longer than 24 from the world at large.
Been waiting for the day when /27's, /28's and /29's are going to make it into the DFZ, as was promised 5 or more years ago :-). Mark.
On 10/08/2021 12:31, Mark Tinka wrote:
Been waiting for the day when /27's, /28's and /29's are going to make it into the DFZ, as was promised 5 or more years ago :-).
2914 permit you to leak prefixes as specific as a /28 between your own ports with them. Someone once referred to it as a 'sneaky backhaul', I believe. Given that there's no default in 2914, I guess that counts? :D -- I'm really not being serious. A nice feature by NTT, but please let's never make it OK to populate the _actual_ DFZ with an IPv4 prefix longer than a /24. -- Tom
On 8/11/21 12:07, Tom Hill wrote:
2914 permit you to leak prefixes as specific as a /28 between your own ports with them. Someone once referred to it as a 'sneaky backhaul', I believe. Given that there's no default in 2914, I guess that counts? :D
I suppose some arrangement between you and your provider is alright. But since I (and, I'm guessing, several other operators that purchase from NTT) will not accept anything longer than a /24 from them, it only serves internal forwarding within NTT for their customers that leak /25s or longer into them. I was wondering why we haven't seen this take off in the global DFZ :-). Mark.
man. 9. aug. 2021 22.13 skrev Grzegorz Janoszka <grzegorz@janoszka.pl>:
On 2021-08-09 17:47, Billy Croan wrote:
How does the community feel about using /24 originations in BGP as a tactical advantage against potential bgp hijackers?
RPKI is more effective than a competing /24. Unless they hijack your ASN as well.
You will usually get an AS path length advantage even if they do hijack your ASN. Regards, Baldur
On 2021-08-09 22:39, Baldur Norddahl wrote:
man. 9. aug. 2021 22.13 skrev Grzegorz Janoszka <grzegorz@janoszka.pl>:
On 2021-08-09 17:47, Billy Croan wrote:
How does the community feel about using /24 originations in BGP as a tactical advantage against potential bgp hijackers?
RPKI is more effective than a competing /24. Unless they hijack your ASN as well.
You will usually get an as path length advantage even if they do hijack your asn.
Unless your RPKI is set to allow /24 (i.e. the ROA's maxLength is 24) but you normally advertise a /21 or something shorter... then RPKI works to the hijacker's advantage. You could argue this is no different than before RPKI, which is true... except that now that RPKI exists, people are tempted to use it to automate configuration and take humans out of the loop. I imagine there are quite a few RPKI-enabled prefixes (those configured to allow too-long advertisements) that are easier to hijack now than they were before RPKI existed. -Rob
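To make the maxLength point concrete, here is a minimal sketch of route origin validation in the style of RFC 6811. It is not any router's or validator's implementation. The `ROA` tuple, the `validate` function, the 10.0.0.0/21 prefix and the AS numbers (taken from the documentation range) are all illustrative assumptions.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    prefix: str      # authorized prefix
    max_length: int  # longest prefix length the ROA permits
    asn: int         # authorized origin AS


def validate(route_prefix, origin_asn, roas):
    """Return "valid", "invalid" or "not-found" for one announcement."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if route.version != roa_net.version:
            continue
        if not route.subnet_of(roa_net):
            continue  # this ROA does not cover the route
        covered = True
        if route.prefixlen <= roa.max_length and origin_asn == roa.asn:
            return "valid"
    return "invalid" if covered else "not-found"


OWNER, ATTACKER = 64500, 64501  # documentation ASNs, for illustration only

loose = [ROA("10.0.0.0/21", 24, OWNER)]  # maxLength longer than what is announced
tight = [ROA("10.0.0.0/21", 21, OWNER)]  # maxLength matches the announcement

# Sub-prefix hijack with a forged origin AS (attacker prepends the owner's ASN):
print(validate("10.0.0.0/24", OWNER, loose))     # valid
print(validate("10.0.0.0/24", OWNER, tight))     # invalid
# Sub-prefix hijack without forging the origin:
print(validate("10.0.0.0/24", ATTACKER, loose))  # invalid
```

With the loose ROA, the forged-origin /24 validates and also wins on longest-prefix match against the legitimate /21, which is the scenario Rob describes. With maxLength equal to the announced length, the same announcement is rejected by networks doing ROV.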
This will break the internet at scale. No.
Ms. Lady Benjamin PD Cannon of Glencoe, ASCE
6x7 Networks & 6x7 Telecom, LLC
CEO
lb@6by7.net
"The only fully end-to-end encrypted global telecommunications company in the world."
FCC License KJ6FJJ
Sent from my iPhone via RFC1149.
On Mon, 9 Aug 2021 at 17:47, Billy Croan <BCroan@unrealservers.net> wrote:
Are there any big networks that drop or penalize announcements like this?
It's possible you could get your peering request denied for this. I have put *reasonable* prefix aggregation into peering requirements for some years now. If you are a small eyeball network with 8192 IP addresses and originate 32 /24's, that is *not* reasonable. On Mon, 9 Aug 2021 at 17:47, Billy Croan <BCroan@unrealservers.net> wrote:
to originate everything as distinct /24 prefixes, to reduce the effect of a potential bgp hijack.
Some men just want to see the world burn. lukas
On 10/08/2021 07:15, Lukas Tribus wrote:
Are there any big networks that drop or penalize announcements like this? It's possible you could get your peering request denied for this. I have put *reasonable* prefix aggregation into peering requirements for some years now. If you are a small eyeball network with 8192 IP addresses and originate 32 /24's, that is *not* reasonable.
It is quite an issue when a network tries to programmatically filter out the /24 more-specific advertisements made from an allocated, e.g., /20. Such anti-disaggregation/save-my-TCAM efforts really do not work, and will spawn all manner of support tickets. I'm saying this in the hope that it may prevent someone from reading this thread and concluding that it may be a good idea to try. It is not.
Speaking to your peers is good, as I think you're encouraging there. I would of course default to asking them if they've read from the Good Book of RPKI. :)
I also often find that very outdated "Good Security Practice" is as much to blame for this as anything else, and so when we do talk to our peers and/or customers, we should always be asking the question: "who told you this was a good idea?"
-- Tom
On Wed, 11 Aug 2021 at 12:24, Tom Hill <tom@ninjabadger.net> wrote:
On 10/08/2021 07:15, Lukas Tribus wrote:
Are there any big networks that drop or penalize announcements like this? It's possible you could get your peering request denied for this. I have put *reasonable* prefix aggregation into peering requirements for some years now. If you are a small eyeball network with 8192 IP addresses and originate 32 /24's, that is *not* reasonable.
It is quite an issue when a network tries to programmatically filter out the /24 more-specific advertisements made from an allocated, e.g., /20.
Such anti-disaggregation/save-my-TCAM efforts really do not work, and will spawn all manner of support tickets. I'm saying this in the hope that it may prevent someone from reading this thread and concluding that it may be a good idea to try. It is not.
For the record: I did not suggest anything like this. Denying peering requests due to lack of *reasonable* prefix aggregation does not mean installing fancy, impossible-to-maintain prefix-lists on transit ingress. I agree with you here, that would be very bad.
This save-my-TCAM effort is successful when the peer on the other side actually realizes that there are consequences to decisions like this and reverts it, which is a long shot, sure, but at least I'm not encouraging this. I don't get to dictate other people's configurations. I do get to decide who is directly exchanging traffic with my network and who isn't.
lukas
On Wed, 11 Aug 2021, Tom Hill wrote:
On 10/08/2021 07:15, Lukas Tribus wrote:
Are there any big networks that drop or penalize announcements like this? It's possible you could get your peering request denied for this. I have put *reasonable* prefix aggregation into peering requirements for some years now. If you are a small eyeball network with 8192 IP addresses and originate 32 /24's, that is *not* reasonable.
It is quite an issue when a network tries to programmatically filter out the /24 more-specific advertisements made from an allocated, e.g., /20.
Such anti-disaggregation/save-my-TCAM efforts really do not work, and will spawn all manner of support tickets. I'm saying this in the hope that it may prevent someone from reading this thread and concluding that it may be a good idea to try. It is not.
What sort of hands-on experience is this opinion based on?
I've done this manually in the past (quite some time ago), and done properly, it works fine.
At least one major network hardware vendor has implemented it as a feature. Turn it on, and the "deaggregates" with the same next-hop as an aggregate are not programmed into the FIB. The savings will vary depending on the device's connectivity, but I've seen >40%.
----------------------------------------------------------------------
Jon Lewis, MCP :)           | I route
StackPath, Sr. Neteng       | therefore you are
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
On 11/08/2021 14:09, Jon Lewis wrote:
What sort of hands-on experience is this opinion based on?
Having an upstream provider that did it, in a very aggressive fashion.
I've done this manually in the past (quite some time ago), and done properly, it works fine.
At least one major network hardware vendor has implemented it as a feature. Turn it on, and the "deaggregates" with same next-hop as an aggregate are not programmed into the FIB. The savings will vary depending on the device's connectivity, but I've seen >40%.
Limiting the pruning to cases with the same next-hop does indeed sound like it would be safer than what I've seen done in the past. Doing this with per-peer prefix-lists would not (certainly not in classic IOS) provide you with the ability to compare the next-hop before rejecting those more-specific prefixes, and likely why the attempts to do this caused the problems that I'm referring to. I'm glad to hear a vendor has implemented a useful knob. Which vendor? -- Tom
On Thu, Aug 12, 2021 at 7:44 AM Tom Hill <tom@ninjabadger.net> wrote:
On 11/08/2021 14:09, Jon Lewis wrote:
At least one major network hardware vendor has implemented it as a feature. Turn it on, and the "deaggregates" with same next-hop as an aggregate are not programmed into the FIB. The savings will vary depending on the device's connectivity, but I've seen >40%.
Limiting the pruning to cases with the same next-hop does indeed sound like it would be safer than what I've seen done in the past.
Hi Tom,
To be clear, Jon was talking about pruning it from the FIB, not the RIB. You can always safely prune overlapping routes with the same next hop from the Forwarding Information Base because the FIB lookup will still select the same next hop regardless. This is valuable because the main cost driver is carrying the routes in the FIB table that's consulted for every packet handled.
If you prune the routes from the Routing Information Base instead, for any widely accepted size (i.e. /24 or shorter netmask) you break the Internet. Just because it's the same next hop for you doesn't mean the routes actually share fate. The routers past you need both routes to figure out their own position in a valid path. And it doesn't save you much anyway because the RIB is only consulted when routes change and need to be reprocessed into FIB entries. A $10 virtual server can handle today's BGP RIB with ease and equipment with only a little more power could handle one much larger. It's the FIB which drives the limits.
Regards, Bill Herrin
--
William Herrin bill@herrin.us https://bill.herrin.us/
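The FIB-side pruning described here can be sketched as follows. This is not any vendor's feature. It assumes a more-specific is dropped only when its nearest covering route in the RIB has the same next hop; the function names, peer labels and prefixes are invented for illustration.

```python
import ipaddress


def nearest_covering(net, rib):
    """Closest less-specific prefix of `net` that is present in the RIB."""
    for plen in range(net.prefixlen - 1, -1, -1):
        candidate = net.supernet(new_prefix=plen)
        if candidate in rib:
            return candidate
    return None


def compress_fib(rib):
    """Build a FIB that omits more-specifics made redundant by a covering
    route with the same next hop. The RIB itself is left untouched."""
    fib = {}
    for net, next_hop in rib.items():
        parent = nearest_covering(net, rib)
        if parent is not None and rib[parent] == next_hop:
            continue  # lookup falls through to an ancestor with the same next hop
        fib[net] = next_hop
    return fib


def lookup(table, address):
    """Longest-prefix match; returns the next hop or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    return table[max(matches, key=lambda n: n.prefixlen)]


rib = {
    ipaddress.ip_network(prefix): next_hop
    for prefix, next_hop in {
        "10.0.0.0/16": "peer-1",
        "10.0.1.0/24": "peer-1",    # redundant: same next hop as the /16
        "10.0.2.0/24": "peer-2",    # kept: different next hop
        "10.0.2.128/25": "peer-1",  # kept: nearest covering route is via peer-2
    }.items()
}
fib = compress_fib(rib)

print(len(rib), len(fib))  # 4 3
for address in ("10.0.1.5", "10.0.2.5", "10.0.2.200", "10.0.9.9"):
    print(address, lookup(rib, address), lookup(fib, address))
```

Every lookup returns the same next hop from the full RIB and from the compressed FIB, which is why the pruning is safe on the forwarding side, while the RIB still holds all four routes to advertise onward.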
On 12/08/2021 17:59, William Herrin wrote:
If you prune the routes from the Routing Information Base instead, for any widely accepted size (i.e. /24 or shorter netmask) you break the Internet.
How does this break the Internet? I would think it would just result in sub-optimal routing (provided there is a covering larger prefix) but everything should continue to work. Clue me in, please. -Hank
On Thu, Aug 12, 2021 at 12:43 PM Hank Nussbacher <hank@interall.co.il> wrote:
On 12/08/2021 17:59, William Herrin wrote:
If you prune the routes from the Routing Information Base instead, for any widely accepted size (i.e. /24 or shorter netmask) you break the Internet.
How does this break the Internet? I would think it would just result in sub-optimal routing (provided there is a covering larger prefix) but everything should continue to work. Clue me in, please.
Hi Hank, I think you're right, it could result in sub-optimal routing and in particular, in your AS not being used for these subprefixes (the traffic will go instead to a competing provider who sent the subprefix), hence, as you said, sub-optimal routing. I think some people (maybe Bill included) may consider the resulting harm to routing to be sufficiently severe to consider this `breaking'. It becomes a judgement call, I guess. Cheers, Amir
-Hank
On 8/12/21 19:17, Amir Herzberg wrote:
Hi Hank, I think you're right, it could result in sub-optimal routing and in particular, in your AS not being used for these subprefixes (the traffic will go instead to a competing provider who sent the subprefix), hence, as you said, sub-optimal routing.
Incorrect - you hold the full table in RIB, which is what you offer to your downstream customers. It's the FIB which wouldn't have the route, but would still be able to forward the traffic to a router that knows better. Downstream customers are none the wiser. Mark.
On Thu, Aug 12, 2021 at 9:41 AM Hank Nussbacher <hank@interall.co.il> wrote:
On 12/08/2021 17:59, William Herrin wrote:
If you prune the routes from the Routing Information Base instead, for any widely accepted size (i.e. /24 or shorter netmask) you break the Internet.
How does this break the Internet? I would think it would just result in sub-optimal routing (provided there is a covering larger prefix) but everything should continue to work. Clue me in, please.
A originates 10.0.0.0/16 to paid transit C
B originates 10.0.1.0/24 also to paid transit C
C offers both routes to D.
D discards 10.0.1.0/24 from the RIB based on same-next-hop
You peer with A and D. You receive only 10.0.0.0/16 since A doesn't originate 10.0.1.0/24 and D has discarded it.
You send packets for 10.0.1.0/24 to A (the shortest path for 10.0.0.0/16), stealing A's paid transit to C to get to B.

Unless A filters C-bound packets purportedly from 10.0.1.0/24. B doesn't currently transit for A so from B's perspective that's not an allowed path. In which case, your path to 10.0.1.0/24 is black holed.

D broke the Internet. If packets from you reach A at all, they do so through an unpermitted path.

Regards, Bill Herrin
-- William Herrin bill@herrin.us https://bill.herrin.us/
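The forwarding decision in this scenario comes down to longest-prefix match. A rough sketch, assuming a toy routing table keyed by prefix string with neighbor labels as next hops (`longest_match`, `full_view` and `pruned_view` are illustrative names, not from the thread):

```python
import ipaddress


def longest_match(table, address):
    """Return the next hop of the most specific prefix covering address."""
    addr = ipaddress.ip_address(address)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in table.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# What "you" would hold if D had passed the /24 along.
full_view = {"10.0.0.0/16": "A", "10.0.1.0/24": "D"}
# What "you" actually hold once D has discarded the /24 from its RIB.
pruned_view = {"10.0.0.0/16": "A"}

print(longest_match(full_view, "10.0.1.5"))
print(longest_match(pruned_view, "10.0.1.5"))
```

With the /24 present the packet follows the more specific route via D; without it, the only match is A's /16, which is the path the message describes as unpermitted or black holed.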
On Thu, Aug 12, 2021 at 10:19 AM William Herrin <bill@herrin.us> wrote:
On Thu, Aug 12, 2021 at 9:41 AM Hank Nussbacher <hank@interall.co.il> wrote:
On 12/08/2021 17:59, William Herrin wrote:
If you prune the routes from the Routing Information Base instead, for any widely accepted size (i.e. /24 or shorter netmask) you break the Internet.
How does this break the Internet? I would think it would just result in sub-optimal routing (provided there is a covering larger prefix) but everything should continue to work. Clue me in, please.
A originates 10.0.0.0/16 to paid transit C B originates 10.0.1.0/24 also to paid transit C C offers both routes to D. D discards 10.0.1.0/24 from the RIB based on same-next-hop You peer with A and D. You receive only 10.0.0.0/16 since A doesn't originate 10.0.1.0/24 and D has discarded it. You send packets for 10.0.1.0/24 to A (the shortest path for 10.0.0.0/16), stealing A's paid transit to C to get to B. Unless A filters C-bound packets purportedly from 10.0.1.0/24.
I mashed this sentence together wrong. I meant to say: "Unless A filters packets from peers which would use their paid transit," a common policy restriction placed on settlement-free peering.
B doesn't currently transit for A so from B's perspective that's not an allowed path. In which case, your path to 10.0.1.0/24 is black holed.
D broke the Internet. If packets from you reach A at all, they do so through an unpermitted path.
-- William Herrin bill@herrin.us https://bill.herrin.us/
On Thu, Aug 12, 2021 at 1:22 PM William Herrin <bill@herrin.us> wrote:
On Thu, Aug 12, 2021 at 9:41 AM Hank Nussbacher <hank@interall.co.il> wrote:
On 12/08/2021 17:59, William Herrin wrote:
If you prune the routes from the Routing Information Base instead, for any widely accepted size (i.e. /24 or shorter netmask) you break the Internet.
How does this break the Internet? I would think it would just result in sub-optimal routing (provided there is a covering larger prefix) but everything should continue to work. Clue me in, please.
A originates 10.0.0.0/16 to paid transit C B originates 10.0.1.0/24 also to paid transit C C offers both routes to D. D discards 10.0.1.0/24 from the RIB based on same-next-hop You peer with A and D. You receive only 10.0.0.0/16 since A doesn't originate 10.0.1.0/24 and D has discarded it. You send packets for 10.0.1.0/24 to A (the shortest path for 10.0.0.0/16), stealing A's paid transit to C to get to B.
Unless A filters C-bound packets purportedly from 10.0.1.0/24. B
doesn't currently transit for A so from B's perspective that's not an allowed path. In which case, your path to 10.0.1.0/24 is black holed.
Bill, I beg to respectfully differ, knowing that I'm just a researcher and not working `for real' like you guys, so pls take no offence.

I don't think A would be right to filter these packets to 10.0.1.0/24; A has announced 10.0.0.0/16 so should route to that (entire) prefix, or A is misleading its peers. If A doesn't filter, though, then B receives a packet from A to 10.0.1.0/24. Unless B is filtering based on the specific IP prefixes of A - which seems to me unlikely - then B has no way to know that this packet is from `you' rather than from A itself (or a customer of A). So B will carry this traffic, imho. So A is just paying for the traffic since it announced the prefix.

Such situations, to the best of my knowledge, actually happen on the Internet when a subprefix is filtered for different reasons. We observed it happens with ROV, in our ROV++ simulations, but I'll refrain from attaching the URL again so as not to be `plugging' that paper (and since I'm too lazy to look it up hhh).

Have a great day and I'll be happy to learn if I'm wrong. Amir
On Thu, Aug 12, 2021 at 10:39 AM Amir Herzberg <amir.lists@gmail.com> wrote:
On Thu, Aug 12, 2021 at 1:22 PM William Herrin <bill@herrin.us> wrote:
A originates 10.0.0.0/16 to paid transit C B originates 10.0.1.0/24 also to paid transit C
Bill, I beg to respectfully differ, knowing that I'm just a researcher and not working `for real' like you guys, so pls take no offence.
Hi Amir, Why would I take offense? How do any of us learn except by trying to poke holes in claims to see what holds up and what doesn't?
I don't think A would be right to filter these packets to 10.0.1.0/24; A has announced 10.0.0.0/16 so should route to that (entire) prefix, or A is misleading its peers.
The alternative is that A has to disaggregate 10.0.0.0/16 into at least 8 prefixes on the -possibility- that some jackass might filter the one /24 that B announces. If trying to filter one route results in 7 extra routes being added to the table, that's net badness. Filtering may not even be intentional on A's part. If A's peering router only receives A's customer-originated routes (a common enough architecture) then the peering router won't even have a route to B while B's route only arrives from C. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
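The "at least 8 prefixes" figure can be checked with Python's standard `ipaddress` module. This is only a worked illustration of the arithmetic; the helper name `disaggregate_around` is made up for this example.

```python
import ipaddress


def disaggregate_around(supernet, hole):
    """Prefixes needed to cover supernet while leaving out hole."""
    outer = ipaddress.ip_network(supernet)
    inner = ipaddress.ip_network(hole)
    # address_exclude yields the minimal set of networks covering
    # everything in outer except inner.
    return sorted(outer.address_exclude(inner), key=lambda n: n.prefixlen)


pieces = disaggregate_around("10.0.0.0/16", "10.0.1.0/24")
for net in pieces:
    print(net)
print(len(pieces))  # 8 prefixes, one each of /17 through /24
```

So carving a single /24 out of a /16 replaces one announcement with eight, before counting B's own /24.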
On Thu, Aug 12, 2021 at 7:39 PM Amir Herzberg <amir.lists@gmail.com> wrote:
Bill, I beg to respectfully differ, knowing that I'm just a researcher and not working `for real' like you guys, so pls take no offence.
I don't think A would be right to filter these packets to 10.0.1.0/24; A has announced 10.0.0.0/16 so should route to that (entire) prefix, or A is misleading its peers.
You are right that it is wrong but it happens. Some years back I tried a setup where we wanted to reduce the size of the routing table. We dropped everything but routes received from peers and added a default to one of our IP transit providers. This should have been ok because either we had a route to a peer or the packet would go to someone who had the full routing table, yes? So we got complaints. One was a company who would advertise a /20 on a peering with us. But somewhere else far away they had a site from where they would announce a /24 from the same prefix. With no internal routing between the peering site with the /20 to the other site with the /24. We therefore lost the ability to communicate with that /24. You see variants of this. For example a large telco has a /16 from which they many years ago allocated a /24 to a multihomed customer. This customer left but took their /24 with them. This fact will seldom make the large telco split up their /16. They will keep it as a /16 but will no longer route to that /24. The question is also if we really would want a large telco to explode a large subnet due to this case. Regards, Baldur
On Thu, Aug 12, 2021 at 4:32 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
On Thu, Aug 12, 2021 at 7:39 PM Amir Herzberg <amir.lists@gmail.com> wrote:
Bill, I beg to respectfully differ, knowing that I'm just a researcher and not working `for real' like you guys, so pls take no offence.
I don't think A would be right to filter these packets to 10.0.1.0/24; A has announced 10.0.0.0/16 so should route to that (entire) prefix, or A is misleading its peers.
You are right that it is wrong but it happens. Some years back I tried a setup where we wanted to reduce the size of the routing table. We dropped everything but routes received from peers and added a default to one of our IP transit providers. This should have been ok because either we had a route to a peer or the packet would go to someone who had the full routing table, yes?
Baldur, thanks, but, sorry, this isn't the same, or I miss something. If I get you right, you dropped all announcements from _providers_ except making one provider your default gateway (essentially, 0.0.0.0/0). But this is very different from what I understood from what Bill wrote. Your change could (and, from what you say next, did) cause a problem if one of these announcements you dropped from providers was a legit subprefix of a prefix announced by one of your peers, causing you to route to the peer traffic whose destination is in the subprefix. But let me be concrete using what you wrote:
So we got complaints. One was a company who would advertise a /20 on a peering with us. But somewhere else far away they had a site from where they would announce a /24 from the same prefix. With no internal routing between the peering site with the /20 to the other site with the /24. We therefore lost the ability to communicate with that /24.
exactly; but this is since you incorrectly dropped the subprefix announcement which you evidently received from one of your providers. If this analysis is correct, you could have solved the problem - reducing the FIB while avoiding such loss of connectivity - if you retained (only) the announcements from your providers which were to subprefixes of announcements you got from your peers. A bit of scripting required, of course... I'm sure you can do it 100 times faster and better than me :)
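The "bit of scripting" could look roughly like the sketch below. It assumes you can export the prefixes learned from peers and from providers as plain lists of strings; the function name `exceptions_to_keep` and that input format are assumptions for illustration, not a description of any router's configuration language.

```python
import ipaddress


def exceptions_to_keep(peer_prefixes, provider_prefixes):
    """Provider routes that are strict subprefixes of some peer route.

    These are the routes that must be retained alongside the default
    route so that more-specific announcements still win.
    """
    peers = [ipaddress.ip_network(p) for p in peer_prefixes]
    keep = []
    for prefix in provider_prefixes:
        net = ipaddress.ip_network(prefix)
        # Strict subprefix: covered by a peer route but not identical to it.
        if any(net != peer and net.subnet_of(peer) for peer in peers):
            keep.append(prefix)
    return keep


peer_routes = ["10.0.0.0/20"]
provider_routes = ["10.0.1.0/24", "10.1.0.0/16", "10.0.0.0/20"]
print(exceptions_to_keep(peer_routes, provider_routes))
```

Everything not returned here would be covered by the default route towards the provider, so dropping it does not redirect traffic onto a peer's covering prefix.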
You see variants of this. For example a large telco has a /16 from which they many years ago allocated a /24 to a multihomed customer. This customer left but took their /24 with them. This fact will seldom make the large telco split up their /16. They will keep it as a /16 but will no longer route to that /24. The question is also if we really would want a large telco to explode a large subnet due to this case.
No way, agreed! But, as I explained, it's also unnecessary; I mean, that's exactly why we do `most specific' routing. Just don't kill the subprefix announcement!

btw... yes, this is a possible issue with ROV, when sometimes there's a ROA for a prefix (say /16) but no ROA for a (legitimately announced) subprefix (e.g. /20). We show such a case in our 2015 ROV paper, and also measured how many such issues exist; it appears their number is much reduced now, based on more recent measurements. (ah and here, our ROV++ doesn't help; in fact, it would make disconnection even more likely than with ROV, since ROV protection against subprefix hijacks is rather weak).

Regards,
-- Amir Herzberg Comcast professor of Security Innovations, Computer Science and Engineering, University of Connecticut
Homepage: https://sites.google.com/site/amirherzberg/home
`Applied Introduction to Cryptography' textbook and lectures: https://sites.google.com/site/amirherzberg/applied-crypto-textbook
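To make the ROA-for-the-/16-only case concrete, here is a simplified sketch of route origin validation in the spirit of RFC 6811. The function name `rov_state`, the tuple layout for validated ROA payloads, and the AS number are assumptions for illustration; special handling such as AS 0 is deliberately left out.

```python
import ipaddress


def rov_state(prefix, origin_asn, vrps):
    """Classify a route as 'valid', 'invalid' or 'not-found'.

    vrps: list of (roa_prefix, max_length, asn) tuples.
    Simplified: IPv4 only, no AS 0 handling.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in vrps:
        roa_net = ipaddress.ip_network(roa_prefix)
        if route.subnet_of(roa_net):
            covered = True
            # A covering ROA only matches if the origin agrees and the
            # announcement is no longer than the ROA's maxLength.
            if asn == origin_asn and route.prefixlen <= max_length:
                return "valid"
    return "invalid" if covered else "not-found"


# One ROA for the /16 with maxLength 16; AS 64500 is a documentation ASN.
vrps = [("10.0.0.0/16", 16, 64500)]
print(rov_state("10.0.0.0/16", 64500, vrps))
print(rov_state("10.0.0.0/20", 64500, vrps))
```

The /20 comes out invalid even though the same AS originates it, because the only covering ROA stops at /16. A network doing ROV would drop that legitimate subprefix.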
On Fri, Aug 13, 2021 at 3:54 AM Amir Herzberg <amir.lists@gmail.com> wrote:
On Thu, Aug 12, 2021 at 4:32 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
On Thu, Aug 12, 2021 at 7:39 PM Amir Herzberg <amir.lists@gmail.com> wrote:
Bill, I beg to respectfully differ, knowing that I'm just a researcher and not working `for real' like you guys, so pls take no offence.
I don't think A would be right to filter these packets to 10.0.1.0/24; A has announced 10.0.0.0/16 so should route to that (entire) prefix, or A is misleading its peers.
You are right that it is wrong but it happens. Some years back I tried a setup where we wanted to reduce the size of the routing table. We dropped everything but routes received from peers and added a default to one of our IP transit providers. This should have been ok because either we had a route to a peer or the packet would go to someone who had the full routing table, yes?
Baldur, thanks, but, sorry, this isn't the same, or I miss something.
I think it is exactly the same? Our peer is advertising a prefix for which they will not route all addresses covered. Is that route not then a lie? Should they not have exploded the prefix so they could avoid covering the part of the prefix they will not accept traffic to? (ps: not arguing they should!)
If I get you right, you dropped all announcements from _providers_ except making one provider your default gateway (essentially, 0.0.0.0/0). But this is very different from what I understood from what Bill wrote. Your change could (and, from what you say next, did) cause a problem if one of these announcements you dropped from providers was a legit subprefix of a prefix announced by one of your peers, causing you to route to the peer traffic whose destination is in the subprefix.
Your understanding is correct. But this is just the way we ended up in that situation. It does not change the fact that we got a route from a peer that we believed we could use, but it turns out part of that announcement was a lie.

Consider that everyone filters received routes. The most common is to filter at the /24 level but nowhere is there an RFC stating that /24 is anything special. So what if I were to filter at a different level, say /20? The same thing would happen: we would drop the "/24 exception route" and use the route that is a lie.

Regards, Baldur
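A small sketch of the point about filter length, assuming a plain list of received prefixes (the function name `accept` and the sample prefixes are illustrative only):

```python
import ipaddress


def accept(prefixes, max_len):
    """Keep only prefixes no longer than max_len bits."""
    return [
        p for p in prefixes
        if ipaddress.ip_network(p).prefixlen <= max_len
    ]


received = ["10.0.0.0/16", "10.0.1.0/24", "10.0.1.128/25"]
print(accept(received, 24))  # the conventional cut-off
print(accept(received, 20))  # a stricter, non-conventional cut-off
```

With the /24 cut-off the "exception route" survives and wins the longest-prefix match. With a /20 cut-off it is dropped, and traffic for 10.0.1.0/24 follows the covering /16 instead.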
On Fri, Aug 13, 2021 at 9:49 AM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
Our peer is advertising a prefix for which they will not route all addresses covered. Is that route not then a lie? Should they not have exploded the prefix so they could avoid covering the part of the prefix they will not accept traffic to? (ps: not arguing they should!)
Hi Baldur, You do understand the consequence of the position you're taking? You're saying that when an ISP provides a /24 to a customer for multihoming, a common practice throughout the history of the commercial Internet, that ISP MUST also disaggregate the announcement for the supernet that /24 is a part of, exploding the size of the BGP table. If they don't, the overlapping announcement is a "lie" because they don't always have a route to the /24. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
On Fri, Aug 13, 2021 at 12:50 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
On Fri, Aug 13, 2021 at 3:54 AM Amir Herzberg <amir.lists@gmail.com> wrote:
On Thu, Aug 12, 2021 at 4:32 PM Baldur Norddahl < baldur.norddahl@gmail.com> wrote:
On Thu, Aug 12, 2021 at 7:39 PM Amir Herzberg <amir.lists@gmail.com> wrote:
Bill, I beg to respectfully differ, knowing that I'm just a researcher and not working `for real' like you guys, so pls take no offence.
I don't think A would be right to filter these packets to 10.0.1.0/24; A has announced 10.0.0.0/16 so should route to that (entire) prefix, or A is misleading its peers.
You are right that it is wrong but it happens. Some years back I tried a setup where we wanted to reduce the size of the routing table. We dropped everything but routes received from peers and added a default to one of our IP transit providers. This should have been ok because either we had a route to a peer or the packet would go to someone who had the full routing table, yes?
Baldur, thanks, but, sorry, this isn't the same, or I miss something.
I think it is exactly the same? Our peer is advertising a prefix for which they will not route all addresses covered. Is that route not then a lie? Should they not have exploded the prefix so they could avoid covering the part of the prefix they will not accept traffic to? (ps: not arguing they should!)
I think it isn't the same. This scenario, of an AS selling part of its IP block, e.g., 10.0.1.0/24, and continuing to announce the block, e.g., 10.0.0.0/16, is the classical example used (e.g. by me) to explain the `most specific' rule. Due to `most specific', it is considered, imho, legit to continue to announce 10.0.0.0/16; if 10.0.1.0/24 is reachable, traffic will route to it anyway due to `more specific', so no problem; and if 10.0.1.0/24 isn't reachable, then anyway no harm done... By dropping a legit 10.0.1.0/24 announcement, you may - and in the cited example, did - break connectivity, imho. And quite unnecessarily, too.
If I get you right, you dropped all announcements from _providers_ except making one provider your default gateway (essentially, 0.0.0.0/0). But this is very different from what I understood from what Bill wrote. Your change could (and, from what you say next, did) cause a problem if one of these announcements you dropped from providers was a legit subprefix of a prefix announced by one of your peers, causing you to route to the peer traffic whose destination is in the subprefix.
Your understanding is correct. But this is just the way we ended up in that situation. Does not change the fact that we got a route from a peer that we believed we could use, but turns out part of that announcement was a lie.
Was not a lie, as I explained.
Consider that everyone filters received routes. The most common is to filter at the /24 level but nowhere is there an RFC stating that /24 is anything special.
Oh that's a point I was quite annoyed with for years - who said one should drop prefixes longer than /24 ??? And I searched for it, and indeed found no RFC. But I did find it mentioned in some BCPs! Unfortunately and stupidly, I didn't save these sources, but I did a quick google now and found https://nsrc.org/workshops/2018/linx103-bgp/networking/peering-ixp/en/presen... But that was years ago, and indeed, I also found mention in RFC 7454:
6.1.3. Prefixes That Are Too Specific (https://www.rfc-editor.org/rfc/rfc7454.html#section-6.1.3)
Most ISPs will not accept advertisements beyond a certain level of specificity (and in return, they do not announce prefixes they consider to be too specific). That acceptable specificity is decided for each peering between the two BGP peers. Some ISP communities have tried to document acceptable specificity. This document does not make any judgement on what the best approach is, it just notes that there are existing practices on the Internet and recommends that the reader refer to them. As an example, the RIPE community has documented that, at the time of writing of this document, IPv4 prefixes longer than /24 and IPv6 prefixes longer than /48 are generally neither announced nor accepted in the Internet [20] [21]. These values may change in the future.
I also did an experiment that seemed to confirm that most ISPs filter announcements more specific than /24. I think that the NANOG (or in general, operators) community may do well to state the `/24 rule' clearly in a BCP, preferably an RFC. A mismatch in the most-specific rule can definitely allow different problems (and attacks). As mentioned above, RIPE has essentially done this (although could be more explicit). I've seen a similar /48 rule for IPv6, btw. Theoretically, universal adoption of RPKI (incl ROV) could provide an alternative solution to the disconnections, but will not protect against explosion of the routing tables.
So what if I was to filter at a different level, say /20 ? The same thing would happen, we would drop the "/24 exception route" and use the route that is a lie.
Not a lie, but yes, this will lead to loss of connectivity; hence, it's important to standardize this.
Regards,
Baldur
I think that the NANOG (or in general, operators) community may do well to state the `/24 rule' clearly in a BCP, preferably an RFC.
https://datatracker.ietf.org/doc/html/rfc7454
6.1.3. Prefixes That Are Too Specific (https://datatracker.ietf.org/doc/html/rfc7454#section-6.1.3)
Most ISPs will not accept advertisements beyond a certain level of specificity (and in return, they do not announce prefixes they consider to be too specific). That acceptable specificity is decided for each peering between the two BGP peers. Some ISP communities have tried to document acceptable specificity. This document does not make any judgement on what the best approach is, it just notes that there are existing practices on the Internet and recommends that the reader refer to them. As an example, the RIPE community has documented that, at the time of writing of this document, IPv4 prefixes longer than /24 and IPv6 prefixes longer than /48 are generally neither announced nor accepted in the Internet [20] [21]. These values may change in the future.
Tom, I also referred to the same text from 7454! But Baldur is right: the text does NOT clearly state that announcements more specific than /24 should be filtered. If you allow different operators to filter at different lengths, you can get disconnections. We never like standards to be fixed to some number, but this seems necessary (to me).
-- Amir Herzberg Comcast professor of Security Innovations, Computer Science and Engineering, University of Connecticut
Homepage: https://sites.google.com/site/amirherzberg/home
`Applied Introduction to Cryptography' textbook and lectures: https://sites.google.com/site/amirherzberg/applied-crypto-textbook
On Fri, Aug 13, 2021 at 5:05 PM Tom Beecher <beecher@beecher.cc> wrote:
I think that the NANOG (or in general, operators) community may do well to
state the `/24 rule' clearly in a BCP, preferably an RFC.
https://datatracker.ietf.org/doc/html/rfc7454
6.1.3 <https://datatracker.ietf.org/doc/html/rfc7454#section-6.1.3>.
Prefixes That Are Too Specific Most ISPs will not accept advertisements beyond a certain level of specificity (and in return, they do not announce prefixes they consider to be too specific). That acceptable specificity is decided for each peering between the two BGP peers. Some ISP communities have tried to document acceptable specificity. This document does not make any judgement on what the best approach is, it just notes that there are existing practices on the Internet and recommends that the reader refer to them. As an example, the RIPE community has documented that, at the time of writing of this document, IPv4 prefixes longer than /24 and IPv6 prefixes longer than /48 are generally neither announced nor accepted in the Internet [20 <https://datatracker.ietf.org/doc/html/rfc7454#ref-20>] [21 <https://datatracker.ietf.org/doc/html/rfc7454#ref-21>]. These values may change in the future.
On Fri, Aug 13, 2021 at 4:54 PM Amir Herzberg <amir.lists@gmail.com> wrote:
On Fri, Aug 13, 2021 at 12:50 PM Baldur Norddahl < baldur.norddahl@gmail.com> wrote:
On Fri, Aug 13, 2021 at 3:54 AM Amir Herzberg <amir.lists@gmail.com> wrote:
On Thu, Aug 12, 2021 at 4:32 PM Baldur Norddahl < baldur.norddahl@gmail.com> wrote:
On Thu, Aug 12, 2021 at 7:39 PM Amir Herzberg <amir.lists@gmail.com> wrote:
Bill, I beg to respectfully differ, knowing that I'm just a researcher and working `for real' like you guys, so pls take no offence.
I don't think A would be right to filter these packets to 10.0.1.0/24; A has announced 10.0.0.0/16 so should route to that (entire) prefix, or A is misleading its peers.
You are right that it is wrong but it happens. Some years back I tried a setup where we wanted to reduce the size of the routing table. We dropped everything but routes received from peers and added a default to one of our IP transit providers. This should have been ok because either we had a route to a peer or the packet would go to someone who had the full routing table, yes?
Baldur, thanks, but, sorry, this isn't the same, or I miss something.
I think it is exactly the same? Our peer is advertising a prefix for which they will not route all addresses covered. Is that route not then a lie? Should they not have exploded the prefix so they could avoid covering the part of the prefix they will not accept traffic to? (ps: not arguing they should!)
I think it isn't the same. This scenario, of an AS selling part of its IP block, e.g., 10.0.1.0/24, and continuing to announce the block, e.g., 10.0.0.0/16, is the classical example used (e.g. by me) to explain the `most specific' rule. Due to `most specific', it is considered, imho, legit to continue to announce 10.0.0.0/16; if 10.0.1.0/24 is reachable, traffic will route to it anyway due to `more specific', so no problem; and if 10.0.1.0/24 isn't reachable, then anyway no harm done...
By dropping a legit 10.0.1.0/24 announcement, you may - and in the cited example, did - break connectivity, imho. And quite unnecessarily, too.
If I get you right, you dropped all announcements from _providers_ except making one provider your default gateway (essentially, 0.0.0.0/0). But this is very different from what I understood from what Bill wrote. Your change could (and, from what you say next, did) cause a problem if one of these announcements you dropped from providers was a legit subprefix of a prefix announced by one of your peers, causing you to route to the peer traffic whose destination is in the subprefix.
Your understanding is correct. But this is just the way we ended up in that situation. Does not change the fact that we got a route from a peer that we believed we could use, but turns out part of that announcement was a lie.
Was not a lie, as I explained.
Consider that everyone filters received routes. The most common is to filter at the /24 level but nowhere is there a RFC stating that /24 is anything special.
Oh that's a point I was quite annoyed with for years - who said one should drop prefixes longer than /24 ??? And I searched for it, and indeed found no RFC. But I did find it mentioned in some BCPs! Unfortunately and stupidly, I didn't save these sources, but I did a quick google now and found
https://nsrc.org/workshops/2018/linx103-bgp/networking/peering-ixp/en/presen...
But that was years ago, and indeed, I also found mention in RFC 7454:
6.1.3 <https://www.rfc-editor.org/rfc/rfc7454.html#section-6.1.3>. Prefixes That Are Too Specific
Most ISPs will not accept advertisements beyond a certain level of specificity (and in return, they do not announce prefixes they consider to be too specific). That acceptable specificity is decided for each peering between the two BGP peers. Some ISP communities have tried to document acceptable specificity. This document does not make any judgement on what the best approach is, it just notes that there are existing practices on the Internet and recommends that the reader refer to them. As an example, the RIPE community has documented that, at the time of writing of this document, IPv4 prefixes longer than /24 and IPv6 prefixes longer than /48 are generally neither announced nor accepted in the Internet [20 <https://www.rfc-editor.org/rfc/rfc7454.html#ref-20>] [21 <https://www.rfc-editor.org/rfc/rfc7454.html#ref-21>]. These values may change in the future.
I also did an experiment that seemed to confirm that most ISPs filter announcements more specific than /24.
I think that the NANOG (or in general, operators) community may do well to state the `/24 rule' clearly in a BCP, preferably an RFC. A mismatch in the most-specific rule can definitely allow different problems (and attacks). As mentioned above, RIPE has essentially done this (although could be more explicit). I've seen a similar /48 rule for IPv6, btw.
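To make the "/24 rule" concrete, here is a minimal sketch of a prefix-length filter in Python. The limits (/24 for IPv4, /48 for IPv6) come from the RFC 7454 text quoted above; the function name `accept_prefix` and its parameters are made up for illustration and do not correspond to any router's configuration syntax.

```python
import ipaddress

def accept_prefix(prefix, max_v4=24, max_v6=48):
    """Return True if the prefix is no more specific than the configured limit.

    max_v4/max_v6 are hypothetical knobs; the defaults follow the values
    mentioned in RFC 7454 section 6.1.3.
    """
    net = ipaddress.ip_network(prefix)
    limit = max_v4 if net.version == 4 else max_v6
    return net.prefixlen <= limit

if __name__ == "__main__":
    for p in ["192.0.2.0/24", "192.0.2.128/25", "2001:db8::/48", "2001:db8:0:1::/64"]:
        print(p, "accept" if accept_prefix(p) else "reject")
    # A network that filters at /20 instead drops a /24 everyone else accepts.
    print("10.0.1.0/24 with /20 limit:", accept_prefix("10.0.1.0/24", max_v4=20))
```

Two networks running this check with different `max_v4` values end up with different views of the same announcements, which is the mismatch described above.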
Theoretically, universal adoption of RPKI (incl ROV) could provide an alternative solution to the disconnections, but will not protect against explosion of the routing tables.
So what if I was to filter at a different level, say /20 ? The same thing would happen, we would drop the "/24 exception route" and use the route that is a lie.
Not a lie, but yes, this will lead to loss of connectivity; hence, it's important to standardize this.
Regards,
Baldur
Tom Beecher wrote:
6.1.3 <https://datatracker.ietf.org/doc/html/rfc7454#section-6.1.3>.
at the time of writing of this document, IPv4 prefixes longer than /24 and IPv6 prefixes longer than /48 are generally neither announced nor accepted in the Internet
That's why, unlike IPv4, IPv6 is hopeless as rfc2374 was abandoned. Masataka Ohta
On Fri, Aug 13, 2021 at 10:53 PM Amir Herzberg <amir.lists@gmail.com> wrote:
I think it isn't the same.
I am still not sure but maybe I misunderstood what you originally said. It is probably not important.
I think that the NANOG (or in general, operators) community may do well to state the `/24 rule' clearly in a BCP, preferably an RFC. A mismatch in the most-specific rule can definitely allow different problems (and attacks). As mentioned above, RIPE has essentially done this (although could be more explicit). I've seen a similar /48 rule for IPv6, btw.
I am not sure how big a problem this is. We only had this one case that I described and it was easily fixed by allowing that one prefix from our transit. The peer also offered to fix their announcement. But we did not run with it for very long because we only reduced our routing table to debug a different problem. Maybe we could have a community or other mechanism to mark the few routes that can not be dropped in exchange for a default route. For all the stub networks out there we should be able to aggressively filter routes without much harm. Regards, Baldur
Baldur Norddahl wrote:
For all the stub networks out there we should be able to aggressively filter routes without much harm.
Stub networks, which, by definition, do not have transit traffic over them, can not filter routes for transit traffic at all. But, if both of two stub networks with address ranges of 131.112.32.0/24 and 131.112.33.0/24 advertise 131.112.32.0/23, the result will be disastrous for the networks. As such, even stub networks should advertise exact address ranges of them. Masataka Ohta
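The overlap in this example can be checked mechanically. The sketch below uses Python's standard `ipaddress` module to list the address space an announcement covers beyond what the announcer actually holds; the helper name `foreign_space` is hypothetical, and it assumes the held range lies inside the announced one.

```python
import ipaddress

def foreign_space(advertised, owned):
    """List the parts of `advertised` that are not part of `owned`.

    Assumes `owned` is contained in `advertised`; address_exclude()
    raises ValueError otherwise.
    """
    adv = ipaddress.ip_network(advertised)
    own = ipaddress.ip_network(owned)
    return [str(n) for n in adv.address_exclude(own)]

if __name__ == "__main__":
    # Each stub announces the /23 but holds only one /24 of it.
    print(foreign_space("131.112.32.0/23", "131.112.32.0/24"))
    print(foreign_space("131.112.32.0/23", "131.112.33.0/24"))
    # Announcing the exact range covers nothing foreign.
    print(foreign_space("131.112.32.0/24", "131.112.32.0/24"))
```

With both stubs announcing the /23, each one claims the other's /24, so traffic for either /24 can land at the wrong network.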
Every major vendor has at some point implemented RIB->FIB (really BGP->RIB->FIB) filtering; on Redback/Ericsson routers we did it around 2013/2014 (@Jakob Heitz ;-)). Route compression is a more complex topic: aggregating is not difficult, effectively disaggregating on changes is. MS Research published a white paper in the early 2010s; Volta, in the late 2010s, implemented route aggregation quite effectively on top of the FRR BGP stack (full BGP table into Trident2-class silicon); to my memory, Spotify did a similar implementation with Arista. Most importantly, these days service layer APIs (at least on Cisco and Juniper) allow you to run best path off box and reinject the results back into the RIB, where the routes computed would still be subject to best route selection and hence reasonably safe wrt loops. So if you feel adventurous, write your own compression code; the toolset is there. Cheers, Jeff
On Aug 14, 2021, at 06:21, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Baldur Norddahl wrote:
For all the stub networks out there we should be able to aggressively filter routes without much harm.
Stub networks, which, by definition, do not have transit traffic over them, can not filter routes for transit traffic at all.
But, if both of two stub networks with address ranges of 131.112.32.0/24 and 131.112.33.0/24 advertise 131.112.32.0/23, the result will be disastrous for the networks.
As such, even stub networks should advertise exact address ranges of them.
Masataka Ohta
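Jeff's point that aggregating is easy while disaggregating on changes is the hard part can be shown with a toy example. This is only a sketch under strong assumptions: a single address family, no nested or overlapping routes (the code refuses them), and a made-up `aggregate` helper. It is not how Volta, Arista, or any other implementation mentioned in the thread works.

```python
import ipaddress
from collections import defaultdict

def aggregate(rib):
    """Merge adjacent prefixes that share a next hop.

    rib maps prefix strings to next-hop labels. Overlapping input is
    rejected because merging per next hop is only safe without nesting.
    """
    nets = sorted(ipaddress.ip_network(p) for p in rib)
    for a, b in zip(nets, nets[1:]):
        if a.overlaps(b):
            raise ValueError(f"{a} overlaps {b}; not handled by this sketch")
    by_next_hop = defaultdict(list)
    for prefix, next_hop in rib.items():
        by_next_hop[next_hop].append(ipaddress.ip_network(prefix))
    return {
        str(net): next_hop
        for next_hop, group in by_next_hop.items()
        for net in ipaddress.collapse_addresses(group)
    }

if __name__ == "__main__":
    rib = {"10.0.0.0/24": "X", "10.0.1.0/24": "X"}
    print(aggregate(rib))      # the two /24s merge into one /23
    rib["10.0.1.0/24"] = "Y"   # one route changes its next hop
    print(aggregate(rib))      # the /23 must be split again
```

A single next-hop change undoes the aggregate, so a real implementation has to track which installed entries depend on which routes and update them incrementally rather than recompute everything.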
Jeff Tantsura wrote:
where the routes computed would still be subject to best route selection and hence reasonably safe wrt loops.
As Baldur said:
For all the stub networks out there we should be able to aggressively filter routes without much harm.
thanks to IRRs and RPKI, whatever wrong things stub networks might do, they only harm the stub networks and their peers and are "reasonably safe" for rest of us. So? Masataka Ohta
----- On Aug 12, 2021, at 10:38 AM, Amir Herzberg amir.lists@gmail.com wrote: Hi,
I don't think A would be right to filter these packets to 10.0.1.0/24; A has announced 10.0.0.0/16 so should route to that (entire) prefix, or A is misleading its peers.
This is what it boils down to. If you don't want to route it, don't advertise it. Thanks, Sabri
On Thu, 12 Aug 2021, William Herrin wrote:
On Thu, Aug 12, 2021 at 9:41 AM Hank Nussbacher <hank@interall.co.il> wrote:
On 12/08/2021 17:59, William Herrin wrote:
If you prune the routes from the Routing Information Base instead, for any widely accepted size (i.e. /24 or shorter netmask) you break the Internet.
How does this break the Internet? I would think it would just result in sub-optimal routing (provided there is a covering larger prefix) but everything should continue to work. Clue me in, please.
A originates 10.0.0.0/16 to paid transit C.
B originates 10.0.1.0/24 also to paid transit C.
C offers both routes to D.
D discards 10.0.1.0/24 from the RIB based on same-next-hop.
You peer with A and D. You receive only 10.0.0.0/16 since A doesn't originate 10.0.1.0/24 and D has discarded it.
You send packets for 10.0.1.0/24 to A (the shortest path for 10.0.0.0/16), stealing A's paid transit to C to get to B.
Unless A filters C-bound packets purportedly from 10.0.1.0/24. B doesn't currently transit for A so from B's perspective that's not an allowed path. In which case, your path to 10.0.1.0/24 is black holed.
D broke the Internet. If packets from you reach A at all, they do so through an unpermitted path.
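The scenario can be replayed with a small longest-prefix-match lookup. This is a sketch only: the tables hold just the routes named in the example, next hops are plain labels, and the `lookup` helper is made up for illustration.

```python
import ipaddress

def lookup(table, dst):
    """Return the next hop of the most specific route covering dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in table.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# What "you" would learn if D passed the /24 along.
d_keeps_specific = {"10.0.0.0/16": "A", "10.0.1.0/24": "D"}
# What "you" learn once D has discarded the /24 from its RIB.
d_drops_specific = {"10.0.0.0/16": "A"}

if __name__ == "__main__":
    print(lookup(d_keeps_specific, "10.0.1.5"))  # goes towards D, then C, then B
    print(lookup(d_drops_specific, "10.0.1.5"))  # falls back to A's /16
```

Once the /24 is gone, the only covering route is A's /16, so the packets are handed to A whether or not A is willing to carry them to B.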
A originated the /16 and should be prepared to deal with all bits to IPs within it. What's worse is when A originates/advertises the /16 to C. A also advertises the /24(s) only to other transits D, E, and F. C's peers that don't see the subnets send traffic to C that C then has to send out via transit to reach D, E, or F. I've been C :( We asked A to make it stop. ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route StackPath, Sr. Neteng | therefore you are _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
On 8/12/21 19:19, William Herrin wrote:
A originates 10.0.0.0/16 to paid transit C.
B originates 10.0.1.0/24 also to paid transit C.
C offers both routes to D.
D discards 10.0.1.0/24 from the RIB based on same-next-hop.
Yeah, discarding from RIB is not the idea. It's discarding from FIB. RIB is always globally converged. Mark.
On Thu, 12 Aug 2021, Tom Hill wrote:
On 11/08/2021 14:09, Jon Lewis wrote:
What sort of hands-on experience is this opinion based on?
Having an upstream provider that did it, in a very aggressive fashion.
Odds are, they did it wrong, and you had no control and limited, if any, visibility into what they did. Obviously, if you're going to blindly filter routes based on prefix-length, you need to point default at something that doesn't...and if you're acting as a transit provider, you're likely no longer able to provide "full routes" to customers from devices doing this or fed the "not so full table" from devices doing it. It can work though, and on pretty much any platform.
Limiting the pruning to cases with the same next-hop does indeed sound like it would be safer than what I've seen done in the past.
Doing this with per-peer prefix-lists would not (certainly not in classic IOS) provide you with the ability to compare the next-hop before rejecting those more-specific prefixes, and likely why the attempts to do this caused the problems that I'm referring to.
I'm glad to hear a vendor has implemented a useful knob. Which vendor?
Arista. They call it FIB compression. They mention it's a trade-off, more memory and CPU utilization (keeping track of things) in exchange for being able to keep hardware that might otherwise be out of FIB space able to cope with full tables. ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route StackPath, Sr. Neteng | therefore you are _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
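The same-next-hop idea discussed above can be sketched as follows. This is an illustration of the general principle only, not Arista's algorithm: it assumes a single address family, treats next hops as plain labels, and the `compress_fib` name is made up. The RIB is left untouched; only the set of entries installed in the FIB shrinks.

```python
import ipaddress

def compress_fib(rib):
    """Skip a more-specific route when the nearest installed covering route
    already forwards to the same next hop.

    rib maps prefix strings to next-hop labels; the return value is the
    subset that would be installed in the FIB.
    """
    routes = sorted(
        ((ipaddress.ip_network(p), nh) for p, nh in rib.items()),
        key=lambda r: r[0].prefixlen,
    )
    fib = {}
    for net, next_hop in routes:
        # Find the longest already-installed prefix that covers this one.
        parent = None
        for installed in fib:
            if net.subnet_of(installed):
                if parent is None or installed.prefixlen > parent.prefixlen:
                    parent = installed
        if parent is not None and fib[parent] == next_hop:
            continue  # forwarding is unchanged without this entry
        fib[net] = next_hop
    return {str(n): nh for n, nh in fib.items()}

if __name__ == "__main__":
    rib = {"10.0.0.0/16": "C", "10.0.1.0/24": "C", "10.0.2.0/24": "E"}
    print(compress_fib(rib))
```

Because the result depends on how next hops happen to line up, the number of installed entries can jump when the topology changes, which matches the warning about non-deterministic FIB consumption further down the thread.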
On 12/08/2021 18:09, Jon Lewis wrote:
Having an upstream provider that did it, in a very aggressive fashion.
Odds are, they did it wrong, and you had no control and limited, if any, visibility into what they did. Obviously, if you're going to blindly filter routes based on prefix-length, you need to point default at something that doesn't...and if you're acting as a transit provider, you're likely no longer able to provide "full routes" to customers from devices doing this or fed the "not so full table" from devices doing it.
Yes. This is precisely why I wrote my initial email, and perhaps I wasn't specific enough, but it was a fairly generic warning against "bright ideas" that don't include the proper scrutiny (or _do_ include unnecessary amounts of arrogance).
Arista. They call it FIB compression. They mention it's a trade-off, more memory and CPU utilization (keeping track of things) in exchange for being able to keep hardware that might otherwise be out of FIB space able to cope with full tables.
Ah, thank you, noted. -- Tom
Jon Lewis wrote on 12/08/2021 18:09:
Arista. They call it FIB compression. They mention it's a trade-off, more memory and CPU utilization (keeping track of things) in exchange for being able to keep hardware that might otherwise be out of FIB space able to cope with full tables.
it also causes non-deterministic fib resource consumption. On most edge deployments this won't matter, but it wouldn't be hard to cook up a topology that could fail in interesting ways. Overall fib compression is a net win, but you need to be careful with it. Nick
On Thu, 12 Aug 2021, Nick Hilliard wrote:
Jon Lewis wrote on 12/08/2021 18:09:
Arista. They call it FIB compression. They mention it's a trade-off, more memory and CPU utilization (keeping track of things) in exchange for being able to keep hardware that might otherwise be out of FIB space able to cope with full tables.
it also causes non-deterministic fib resource consumption. On most edge deployments this won't matter, but it wouldn't be hard to cook up a topology that could fail in interesting ways. Overall fib compression is a net win, but you need to be careful with it.
Yeah...changes to the network could suddenly run such a box out of FIB resources, and you could easily be wrong when predicting how much longer a box has for its "full routes" days...but the alternatives are "don't do full routes" or replace the box much sooner. In that respect, it's somewhat remarkable that Arista even developed the feature. "We can sell them newer hardware with larger FIB capabilities, or offer a software update that extends the life of the gear they've already bought." What company chooses the latter? :) ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route StackPath, Sr. Neteng | therefore you are _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
On 8/12/21 19:57, Jon Lewis wrote:
Yeah...changes to the network could suddenly run such a box out of FIB resources, and you could easily be wrong when predicting how much longer a box has for its "full routes" days...but the alternatives are "don't do full routes" or replace the box much sooner. In that respect, it's somewhat remarkable that Arista even developed the feature. "We can sell them newer hardware with larger FIB capabilities, or offer a software update that extends the life of the gear they've already bought." What company chooses the latter? :)
There was a time when vendors were actually run by engineers :-). I recall asking for the feature from Cisco around 2011/2012, for the ME3600X/3800X, and that's how it arrived. The team developing that breed of box were excited about the prospect of its success, since I began working with them to develop it in 2009. So they rolled out as many features as I could help them make sense of, and BGP-SD was one of them. The good news is it made it into IOS XE, and became available for a ton of other platforms, the ASR920 included. Nowadays, one wonders who's actually running the show at vendor-land... Mark.
On 8/12/21 19:30, Nick Hilliard wrote:
it also causes non-deterministic fib resource consumption. On most edge deployments this won't matter, but it wouldn't be hard to cook up a topology that could fail in interesting ways. Overall fib compression is a net win, but you need to be careful with it.
We only needed it on boxes with a small FIB, to begin with. We don't use this on large routers with millions of FIB slots to spare. The challenge on small boxes is that at some point, even with the FIB filtering, you will run out of FIB slots, and then weird things start happening. At that point, dumping the box and going for something larger (say, dropping a 20,000 FIB box and going for a 256,000 FIB box) is your only real option. It's still cheaper than a large box with lots more FIB, and you can continue the benefits of FIB filtering without causing weird problems in the network. Mark.
On 8/11/21 12:24, Tom Hill wrote:
Such anti-disaggregation/save-my-TCAM efforts really do not work, and will spawn all manner of support tickets. I'm saying this in the hope that it may prevent someone from reading this thread and concluding that it may be a good idea to try. It is not.
We've been doing this on low-FIB platforms (mainly for our Metro, where we want to hold a full table in RIB and a few internal routes in FIB) since about 2014, and it works - as Scar in "The Lion King" would say - rather well. The only downside is when your IGP and LDP database grows larger than the available FIB. But that's another issue; although not an insignificant one. Mark.
participants (22)
- Adam Thompson
- Amir Herzberg
- Baldur Norddahl
- Billy Croan
- Chris Cummings
- Grzegorz Janoszka
- Hank Nussbacher
- Jeff Tantsura
- Jon Lewis
- Lady Benjamin Cannon of Glencoe, ASCE
- Lukas Tribus
- Mark Tinka
- Martijn Schmidt
- Masataka Ohta
- Nick Hilliard
- Rabbi Rob Thomas
- Robert McKay
- Sabri Berisha
- Saku Ytti
- Tom Beecher
- Tom Hill
- William Herrin