Request comment: list of IPs to block outbound
The following list is what I'm thinking of using for blocking traffic between an edge router acting as a firewall and an ISP/upstream. This table is limited to address blocks only; TCP/UDP port filtering, and IP protocol filtering, is a separate discussion. This is for an implementation of BCP-38 recommendations.

I'm trying to decide whether the firewall should just blackhole these addresses in the routing table, or use rules in NFTABLES against source and destination addresses, or some combination. If NFTABLES, the best place to put the blocks (inbound and outbound) would be in the FORWARD chain, both inbound and outbound. (N.B. for endpoint boxes, they go into the OUTPUT chain.)

In trying to research what would constitute "best practice", the papers I found were outdated, potentially incomplete (particularly with reference to IPv6), or geared toward other applications. This table currently does not have exceptions -- some may need to be added as a specific "allow" route or list.

The Linux rp_filter knob is effective for endpoint servers and workstations, and I turn it on religiously (easy because it's the default). For a firewall router without blackhole routes, it's less effective because, for incoming packets, a source address matching one of your inside netblocks will pass. A subset of the list would be useful in endpoint boxes to relieve pressure on the upstream edge router -- particularly if a ne'er-do-well successfully hijacks the endpoint box to participate in a DDoS flood.

IPv4

Address block       Scope            Description
0.0.0.0/8           Software         Current network (only valid as source address).
10.0.0.0/8          Private network  Used for local communications within a private network.
100.64.0.0/10       Private network  Shared address space for communications between a service provider and its subscribers when using a carrier-grade NAT.
127.0.0.0/8         Host             Used for loopback addresses to the local host.
169.254.0.0/16      Subnet           Used for link-local addresses between two hosts on a single link when no IP address is otherwise specified, such as would have normally been retrieved from a DHCP server.
172.16.0.0/12       Private network  Used for local communications within a private network.
192.0.0.0/24        Private network  IETF Protocol Assignments.
192.0.2.0/24        Documentation    Assigned as TEST-NET-1, documentation and examples.
192.88.99.0/24      Internet         Reserved. Formerly used for IPv6 to IPv4 relay.
192.168.0.0/16      Private network  Used for local communications within a private network.
198.18.0.0/15       Private network  Used for benchmark testing of inter-network communications between two separate subnets.
198.51.100.0/24     Documentation    Assigned as TEST-NET-2, documentation and examples.
203.0.113.0/24      Documentation    Assigned as TEST-NET-3, documentation and examples.
224.0.0.0/4         Internet         In use for IP multicast.
240.0.0.0/4         Internet         Reserved for future use.
255.255.255.255/32  Subnet           Reserved for the "limited broadcast" destination address.

IPv6

Address block    Usage            Purpose
::/0             Routing          Default route.
::/128           Software         Unspecified address.
::1/128          Host             Loopback address to local host.
::ffff:0:0/96    Software         IPv4 mapped addresses.
::ffff:0:0:0/96  Software         IPv4 translated addresses.
64:ff9b::/96     Global Internet  IPv4/IPv6 translation.
100::/64         Routing          Discard prefix.
2001::/32        Global Internet  Teredo tunneling.
2001:20::/28     Software         ORCHIDv2.
2001:db8::/32    Documentation    Addresses used in documentation and example source code.
2002::/16        Global Internet  The 6to4 addressing scheme.
fc00::/7         Private network  Unique local address.
fe80::/10        Link             Link-local address.
ff00::/8         Global Internet  Multicast address.
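(Purely for illustration, a minimal sketch of the two approaches being weighed, assuming a Linux edge box with nftables; the table, chain and set names are arbitrary and only a handful of the prefixes above are shown:)

    # Option 1: blackhole the prefixes in the routing table
    ip route add blackhole 192.0.2.0/24
    ip -6 route add blackhole 2001:db8::/32

    # Option 2: match source and destination explicitly in the FORWARD path
    nft add table inet bcp38
    nft add set inet bcp38 bogons4 '{ type ipv4_addr; flags interval; }'
    nft add element inet bcp38 bogons4 '{ 0.0.0.0/8, 10.0.0.0/8, 192.0.2.0/24, 240.0.0.0/4 }'
    nft add chain inet bcp38 forward '{ type filter hook forward priority 0; policy accept; }'
    nft add rule inet bcp38 forward ip saddr @bogons4 counter drop
    nft add rule inet bcp38 forward ip daddr @bogons4 counter drop

(An equivalent ip6_addr set would cover the IPv6 rows.)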
Hi,
sorry - but why would you want to block Teredo / 6to4?

Florian Brandstetter
President & Founder
W // https://www.globalone.io
On 10/13/19 9:08 AM, Florian Brandstetter wrote:
Hi,
sorry - but why would you want to block Teredo?
I know nothing about Teredo tunneling.
In computer networking, Teredo is a transition technology that gives full IPv6 connectivity for IPv6-capable hosts that are on the IPv4 Internet but have no native connection to an IPv6 network. Unlike similar protocols such as 6to4, it can perform its function even from behind network address translation (NAT) devices such as home routers.
Teredo operates using a platform independent tunneling protocol that provides IPv6 (Internet Protocol version 6) connectivity by encapsulating IPv6 datagram packets within IPv4 User Datagram Protocol (UDP) packets. Teredo routes these datagrams on the IPv4 Internet and through NAT devices. Teredo nodes elsewhere on the IPv6 network (called Teredo relays) receive the packets, un-encapsulate them, and pass them on.
Are you saying that Teredo should come off the list? Is this useful between an ISP and an edge firewall fronting an internal network? Would I see inbound packets with a source address in the 2001::/32 netblock?
sorry - but why would you want to block 6to4?
In my research, this is marked as deprecated. Would I see packets with a source address in the 2002::/16 netblock?
On 10/13/19 3:36 PM, Stephen Satchell wrote:
Are you saying that Teredo should come off the list? Is this useful between an ISP and an edge firewall fronting an internal network? Would I see inbound packets with a source address in the 2001::/32 netblock?
If you are running services which are "generally available to the public", you can absolutely expect to see these. Anyone stuck behind an IPv6-hostile NAT44 is likely to end up using Teredo as the "transition mechanism of last resort". It usually works, albeit with poor performance, in almost all situations unless the IPv6-hostile network has actively blocked it in their IPv4 ruleset.

I personally use Teredo somewhat frequently. Yes, I could set up a similar tunneling mechanism to a network I control and get "production" addressing and probably better quality of service, but Teredo is as simple as "apt-get install miredo". It's also available on stock Windows, albeit (I think) disabled by default.

If your network only talks to specific, known destinations, then it's up to you. Your network; your rules. It's certainly unlikely you'll ever see any publicly accessible services of consequence being hosted in 2001::/32, if only because the addressing tends to be somewhat transient and NAT hole punching unreliable for inbound, unsolicited data.
In my research, this is marked as deprecated. Would I see packets with a source address in the 2002::/16 netblock?
In theory, this is just as legitimate as Teredo. In practice, it is indeed deprecated, and almost anyone who can set up 6to4 can get a "production" tunnel to someone like HE.net or likely has 6rd available from their native IPv4 provider. It can also be tricky to prevent reflection-type attacks using 6to4 address space.

IIRC, Windows used to set up 6to4 by default if it found it had what it believed to be publicly routable IPv4 connectivity, but I think this may now be disabled. Some consumer routers did the same. It was handy because you got a full /48, allowing non-NAT addressing of subtended networks and even prefix delegation if you wanted it.

While this probably falls under the same justifications as the above, in practice I'd say 6to4 is probably all but dead in terms of legitimate uses on the public Internet of today. I haven't personally run 6to4 in over a decade. 6to4 was a neat idea, but I think it's dead, Jim.

-- 
Brandon Martin
On 10/13/19 8:58 AM, Stephen Satchell wrote:
In trying to research what would constitute "best practice", the papers I found were outdated, potentially incomplete (particularly with reference to IPv6), or geared toward other applications. This table currently does not have exceptions -- some may need to be added as a specific "allow" route or list.
On Sun, Oct 13, 2019 at 8:58 AM Stephen Satchell <list@satchell.net> wrote:
The following list is what I'm thinking of using for blocking traffic between an edge router acting as a firewall and an ISP/upstream. This table is limited to address blocks only; TCP/UDP port filtering, and IP protocol filtering, is a separate discussion. This is for an implementation of BCP-38 recommendations.
BCP-38 as it applies to outbound traffic is more about blocking SOURCE IP addresses. You should block everything whose source IP address is not within your assigned address space.
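(As an aside, a hedged sketch of that source-address rule on a Linux/nftables edge, reusing the hypothetical inet bcp38 table from the earlier sketch; "wan0" and 198.51.100.0/24 are placeholders for the upstream interface and the locally assigned space:)

    # BCP-38 egress: only traffic sourced from our own prefixes may leave toward the upstream
    nft add set inet bcp38 ourprefixes '{ type ipv4_addr; flags interval; }'
    nft add element inet bcp38 ourprefixes '{ 198.51.100.0/24 }'
    nft add rule inet bcp38 forward oifname "wan0" ip saddr != @ourprefixes counter drop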
100.64.0.0/10 Private network Shared address space for communications between a service provider and its subscribers when using a carrier-grade NAT.
This space is set aside for your ISP to use, like RFC1918 but for ISPs. It is not specifically CGNAT. Unless you are an ISP using this space, you should not block destinations in this space.
224.0.0.0/4 Internet In use for IP multicast. 240.0.0.0/4 Internet Reserved for future use. 255.255.255.255/32 Subnet Reserved for the "limited broadcast" destination address.
This can be covered with a single rule: 224.0.0.0/3. (That one prefix spans 224.0.0.0 through 255.255.255.255, so it takes in 224.0.0.0/4, 240.0.0.0/4, and 255.255.255.255/32.)
IPv6 Address block Usage Purpose ::/0 Routing Default route.
The current IPv6 Internet is 2000::/3, not ::/0 and that won't change in the foreseeable future. You can tighten your filter to allow just that. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
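(Again only a sketch, not from Bill's message: on a Linux/nftables edge, the "allow just 2000::/3" idea could be expressed as follows, using the same hypothetical inet bcp38 table and forward chain as above; any per-site exceptions would need to be accepted before the final drop:)

    # forward IPv6 only when both source and destination are in currently allocated global unicast space
    nft add rule inet bcp38 forward ip6 saddr 2000::/3 ip6 daddr 2000::/3 accept
    # drop the rest of IPv6 (link-local is never forwarded by the kernel anyway)
    nft add rule inet bcp38 forward meta nfproto ipv6 counter drop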
On Sun, 13 Oct 2019 at 19:29, William Herrin <bill@herrin.us> wrote:
The current IPv6 Internet is 2000::/3, not ::/0 and that won't change in the foreseeable future. You can tighten your filter to allow just that.
Only do this if this isn't a CLI-jockey network now or in the future. -- ++ytti
Subject: Re: Request comment: list of IPs to block outbound Date: Sun, Oct 13, 2019 at 09:24:39AM -0700 Quoting William Herrin (bill@herrin.us):
100.64.0.0/10 Private network Shared address space for communications between a service provider and its subscribers when using a carrier-grade NAT.
This space is set aside for your ISP to use, like RFC1918 but for ISPs. It is not specifically CGNAT. Unless you are an ISP using this space, you should not block destinations in this space.
I have a hard time finding text that prohibits me from running machines on 100.64/10 addresses inside my network. It is just more RFC1918 space, a /10 unwisely spent on stalling IPv6 deployment. /Måns, guilty. -- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE SA0XLR +46 705 989668 It's OKAY -- I'm an INTELLECTUAL, too.
On 10/22/19 10:54 PM, Måns Nilsson wrote:
I have a hard time finding text that prohibits me from running machines on 100.64/10 addresses inside my network.
I think you are free to use RFC 6598 — Shared Address Space — in your network. Though you should be aware of the caveats of doing so.
It is just more RFC1918 space, a /10 unwisely spent on stalling IPv6 deployment.
My understanding is that RFC 6598 — Shared Address Space — is *EXPLICITLY* /not/ a part of RFC 1918 — Private Internet (Space). And I do mean /explicitly/.

The explicit nature of RFC 6598 is on purpose so that there is no chance that it will conflict with RFC 1918. This is important because it means that RFC 6598 can /safely/ be used for Carrier Grade NAT by ISPs without any fear of conflicting with any potential RFC 1918 IP space that clients may be using.

RFC 6598 ∉ RFC 1918 and RFC 1918 ∉ RFC 6598. RFC 6598 and RFC 1918 are mutually exclusive of each other.

Yes, you can run RFC 6598 in your home network. But you have nobody to complain to if (when) your ISP starts using RFC 6598 Shared Address Space to support Carrier Grade NAT and you end up with an IP conflict.

Aside from that caveat, sure, use RFC 6598.

-- 
Grant. . . . unix || die
On 10/22/19 10:11 PM, Grant Taylor via NANOG wrote:
The explicit nature of RFC 6598 is on purpose so that there is no chance that it will conflict with RFC 1918. This is important because it means that RFC 6598 can /safely/ be used for Carrier Grade NAT by ISPs without any fear of conflicting with any potential RFC 1918 IP space that clients may be using.
RFC 6598 ∉ RFC 1918 and RFC 1918 ∉ RFC 6598 RFC 6598 and RFC 1918 are mutually exclusive of each other.
Yes, you can run RFC 6598 in your home network. But you have nobody to complain to if (when) your ISP starts using RFC 6598 Shared Address Space to support Carrier Grade NAT and you end up with an IP conflict.
Aside from that caveat, sure, use RFC 6598.
So, to the reason for the comment request, you are telling me not to blackhole 100.64/10 in the edge router downstream from an ISP as a general rule, and to accept source addresses from this netblock. Do I understand you correctly? FWIW, I think I've received this recommendation before. The current version of my NetworkManager dispatcher-d-bcp38.sh script has the creation of the blackhole route already disabled; i.e., the netblock is not quarantined.
On 2019-10-22 22:38 -0700, Stephen Satchell wrote:
So, to the reason for the comment request, you are telling me not to blackhole 100.64/10 in the edge router downstream from an ISP as a general rule, and to accept source addresses from this netblock. Do I understand you correctly?
Depends. If your network is a typical home network, connected via a normal residential ISP, then you should very much expect to need to talk to 100.64/10, and even be assigned addresses from that block. On the other hand, if you have a fixed public address block, be it PI or PA space, reachable from the world, then you shouldn't see any traffic from addresses within the CGNAT block. So, at home I don't block such addresses. But at work (a department within a university, connected to the Swedish NREN), I do block the CGNAT addresses on our border links.
FWIW, I think I've received this recommendation before. The current version of my NetworkManager dispatcher-d-bcp38.sh script has the creation of the blackhole route already disabled; i.e., the netblock is not quarantined.
If this is a laptop which you may someday connect to some guest network somewhere in the world, then not blocking 100.64/10 is the right thing to do. Nor should you block RFC 1918 addresses in that situation. (Assuming you actually want to communicate with the rest of the world. :-) /Bellman
On 10/22/19 11:38 PM, Stephen Satchell wrote:
So, to the reason for the comment request, you are telling me not to blackhole 100.64/10 in the edge router downstream from an ISP as a general rule, and to accept source addresses from this netblock. Do I understand you correctly?
It depends.

I think that 100.64/10 is /only/ locally significant and would /only/ be used within your ISP /if/ they use 100.64/10. If they don't use it, then you are probably perfectly safe considering 100.64/10 as a Bogon and treating it accordingly.

Even in ISPs that use 100.64/10, I'd expect minimal traffic to / from it. Obviously you'll need to talk to a gateway in the 100.64/10 space. You /may/ need to talk to DNS servers and the likes therein. I've not heard of ISPs making any other service available via CGN Bypass.

That being said, I have heard of CDNs working with ISPs to make CDN services available via CGN bypass. My limited experience with that still uses globally routed IPs on the CDN equipment with custom routing in the ISPs. So you still aren't communicating with 100.64/10 IPs directly. But my ignorance of CDNs using 100.64/10 doesn't preclude such from being done.

The simple rules that I've used are:

1) Don't use 100.64/10 in your own network. Or if you do, accept the consequences /if/ it becomes a problem.
2) Don't filter 100.64/10 /if/ your external IP from your ISP is a 100.64/10 IP.
3) Otherwise, treat 100.64/10 like a bogon.
FWIW, I think I've received this recommendation before. The current version of my NetworkManager dispatcher-d-bcp38.sh script has the creation of the blackhole route already disabled; i.e., the netblock is not quarantined.
I suspect things like NetworkManager are somewhat at a disadvantage in that they are inherently machine local and don't have visibility beyond the directly attached network segments. As such, they can't /safely/ filter something that may be on the other side of a router. Thus they play it safe and don't do so. -- Grant. . . . unix || die
On 10/23/19 8:18 AM, Grant Taylor via NANOG wrote:
I suspect things like NetworkManager are somewhat at a disadvantage in that they are inherently machine local and don't have visibility beyond the directly attached network segments. As such, they can't /safely/ filter something that may be on the other side of a router. Thus they play it safe and don't do so.
You are 100 percent correct about NetworkManager. The facility only manages interfaces (including VPN and bridges). What I've done is added the ability to install and remove null routes when the upstream interface comes on-line and goes off-line.

So this is only the first stage of filtering. Using NetFilter (in the CentOS 8 case, NFTABLES), I will be adding rules to implement my policies on each system I have: what exactly will be accepted, what will be forwarded, what will be rejected, and what will be ignored.

What adding the null routes does is let me use the FIB test commands so that the firewall files don't have to know the exact configuration of networking, or have monster lists that have to be maintained. Consider that one suggestion from this group is to look at using https://www.team-cymru.com/bogon-reference-http.html and doing periodic updates of the null routes based on the information there. (With caution.) This is specific to Linux.

The idea is to let the computer do all the bookkeeping work, so I don't have to, even if I have automation to "help". The first application of this work will be to replace my existing firewall router with up-to-date software and comprehensive rules to handle NAT and DNAT, on a local network with quite a number of VLANs.
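(Roughly, and only as a sketch rather than the actual dispatcher script, the null-route-plus-FIB-test idea looks like this; the prefixes and the inet bcp38/forward names are placeholders:)

    # null routes installed/removed by the dispatcher as the upstream interface comes and goes
    ip route add blackhole 192.0.2.0/24
    ip -6 route add blackhole 2001:db8::/32

    # one generic nftables rule then drops any forwarded packet whose source address
    # has no usable route back out (which includes everything covered by the blackhole
    # routes above), so the ruleset never has to carry the list itself
    nft add rule inet bcp38 forward fib saddr . iif oif missing counter drop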
Subject: Re: Request comment: list of IPs to block outbound Date: Tue, Oct 22, 2019 at 11:11:27PM -0600 Quoting Grant Taylor via NANOG (nanog@nanog.org):
On 10/22/19 10:54 PM, Måns Nilsson wrote:
It is just more RFC1918 space, a /10 unwisely spent on stalling IPv6 deployment.
My understanding is that RFC 6598 — Shared Address Space — is *EXPLICITLY* /not/ a part of RFC 1918 — Private Internet (Space). And I do mean /explicitly/.
I understand the reasoning. I appreciate the need. I just do not agree with the conclusion to waste a /10 on beating a dead horse. A /24 would have been a more appropriate way of moving the cost of ipv6 non-deployment to those responsible. (Put in RFC timescale, 6598 is 3000+ RFCen later than the v6 specification. That is a few human-years. There are no excuses for non-compliance except cheapness.) Easing the operation of CGN at scale serves no purpose except stalling necessary change. It is like installing an electric blanket to cure the chill from bed-wetting. -- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE SA0XLR +46 705 989668 I'm a nuclear submarine under the polar ice cap and I need a Kleenex!
On 10/23/19 12:16 AM, Måns Nilsson wrote:
I understand the reasoning. I appreciate the need. I just do not agree with the conclusion to waste a /10 on beating a dead horse. A /24 would have been a more appropriate way of moving the cost of ipv6 non-deployment to those responsible. (Put in RFC timescale, 6598 is 3000+ RFCen later than the v6 specification. That is a few human-years. There are no excuses for non-compliance except cheapness.)
For better or worse, I think IPv6 deployment is one of those things that will likely be completed about the time the spam problem is resolved. It's always going to be moving forward.

I don't know if consuming 4+ million IPs for CGN support is warranted or not. The CGNs that I've had experience … working with … (let's be polite) … in my day job have all been with providers having way more than a /24 worth of clients behind them. As such, they would need to have many (virtual) CGN appliances to deal with each of the /24 private networks. Would a /16 be better? Maybe. That is 1/64th of what's allocated now.

I personally would rather people use 100.64/10 instead of squatting on other globally routed IPs that they think they will never need to communicate with. (I've seen a bunch of people squat on DoD IP space behind CGN. I think such practice is adding insult to injury and should be avoided.)
Easing the operation of CGN at scale serves no purpose except stalling necessary change. It is like installing an electric blanket to cure the chill from bed-wetting.
Much like humans can move passenger planes, even an electric blanket can /eventually/ overcome a cold, wet bed. -- Grant. . . . unix || die
On Wed, 23 Oct 2019 09:09:05 -0600, Grant Taylor via NANOG said:
Easing the operation of CGN at scale serves no purpose except stalling necessary change. It is like installing an electric blanket to cure the chill from bed-wetting.
Much like humans can move passenger planes, even an electric blanket can /eventually/ overcome a cold, wet bed.
Unless somebody gets electrocuted first.
Hi, On Sun, Oct 13, 2019 at 08:58:17AM -0700, Stephen Satchell wrote:
The following list is what I'm thinking of using for blocking traffic between an edge router acting as a firewall and an ISP/upstream. This
fe80::/10 Link Link-local address.
most people allow that range, as blocking it will drop NA/NS packets exchanged with the upstream router, which in turn can delay the establishment of the BGP session (provided there is one over IPv6).

best

Enno

-- 
Enno Rey https://theinternetprotocol.blog Twitter: @Enno_Insinuator
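(Illustrative only: one way to keep neighbour discovery working while still filtering fe80::/10 on a Linux/nftables box, again using a hypothetical inet bcp38 table; ND is addressed to the router itself, so the exception belongs on the input path rather than forward:)

    # accept the ICMPv6 neighbour-discovery types before any rule that drops fe80::/10
    nft add chain inet bcp38 input '{ type filter hook input priority 0; policy accept; }'
    nft add rule inet bcp38 input icmpv6 type '{ nd-neighbor-solicit, nd-neighbor-advert, nd-router-solicit, nd-router-advert }' accept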
On 10/13/19 9:58 AM, Stephen Satchell wrote:
The Linux rp_filter knob is effective for endpoint servers and workstations, and I turn it on religiously (easy because it's the default).
I think it's just as effective on routers as it is on servers and workstations.
For a firewall router without blackhole routes, it's less effective because, for incoming packets, a source address matching one of your inside netblocks will pass.
I'm not following that statement. Is incoming a reference to packets from the Internet to your LAN? Or is incoming a reference to packets coming in any interface, thus possibly including from your LAN to the Internet? Even without blackhole (reject) routes, a packet from the Internet spoofing a LAN IP will be rejected by rp_filter because it's coming in an interface that is not an outgoing interface for the purported source IP.
A subset of the list would be useful in endpoint boxes to relieve pressure on the upstream edge router -- particularly if a ne'er-do-well successfully hijacks the endpoint box to participate in a DDoS flood.
rp_filtering will filter packets coming in from the internal endpoint that's been compromised if the packets spoof a source from anywhere but the local LAN. (No comment about spoofing different LAN IPs.)

I've been exceedingly happy with rp_filter and blackhole (reject) routes. I've taken this to another level where I have multiple routing tables and rules that cascade across tables. One of the later rules is a routing table for any and all bogons & RFC 3330. I am still able to access specific networks that fall into RFC 3330 on internal lab networks without a problem because those prefixes are found in routing tables that are searched before the bogon table that black holes (rejects) the packets. IMHO it works great. (I really should do a write up of that.)

I think you should seriously re-consider using rp_filter on a router.

-- 
Grant. . . . unix || die
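(A rough sketch of that cascading-tables idea, with made-up table numbers, prefixes and gateway rather than Grant's actual configuration: lab prefixes resolve from an earlier table, the bogon reject table comes later, and everything else falls through to main.)

    # table 100: a lab network that happens to live inside special-use space
    ip route add 198.51.100.0/28 via 192.168.1.1 table 100
    # table 250: reject routes for the bogon list (one example entry shown)
    ip route add unreachable 198.51.100.0/24 table 250

    # rules are evaluated in preference order, so the lab table wins over the bogon table
    ip rule add pref 100 to 198.51.100.0/28 lookup 100
    ip rule add pref 200 to 198.51.100.0/24 lookup 250
    ip rule add pref 300 lookup main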
On Mon, 14 Oct 2019 at 03:38, Grant Taylor via NANOG <nanog@nanog.org> wrote:
I think you should seriously re-consider using rp_filter on a router.
rp_filter is one of the most expensive features in modern routers; you should only use it if PPS performance is not important. If PPS performance is important, ACL is much faster. ACL is also applicable to more scenarios, such as BGP customers. -- ++ytti
❦ 14 octobre 2019 09:14 +03, Saku Ytti <saku@ytti.fi>:
I think you should seriously re-consider using rp_filter on a router.
rp_filter is one of the most expensive features in modern routers; you should only use it if PPS performance is not important. If PPS performance is important, ACL is much faster. ACL is also applicable to more scenarios, such as BGP customers.
How much performance impact should we expect with uRPF? Thanks. -- Make input easy to proofread. - The Elements of Programming Style (Kernighan & Plauger)
On Mon, 14 Oct 2019 at 09:30, Vincent Bernat <bernat@luffy.cx> wrote:
How much performance impact should we expect with uRPF?
Depends on the platform, but often it's a 2nd lookup, so potentially a 50% decrease in performance. On some platforms it means FIB duplication. And ultimately it doesn't really offer anything over an ACL, which is, in comparison, a much cheaper feature.

I would encourage people to toolise this; then the ACL generation is no cost or complexity. And you can use ACLs for many BGP customers too: as you create a 'perfect' prefix-list for a customer, you can reference the same prefix-list in the ACL, without actually needing the customer to announce that prefix, as it's entirely valid to originate traffic from an allowable prefix without advertising the prefix (to you).

-- ++ytti
Hello! On Tue, Oct 15, 2019 at 12:46 PM Saku Ytti <saku@ytti.fi> wrote:
On Mon, 14 Oct 2019 at 09:30, Vincent Bernat <bernat@luffy.cx> wrote:
How much performance impact should we expect with uRPF?
Depends on the platform, but often it's a 2nd lookup, so potentially a 50% decrease in performance. On some platforms it means FIB duplication. And ultimately it doesn't really offer anything over an ACL, which is, in comparison, a much cheaper feature.

I would encourage people to toolise this; then the ACL generation is no cost or complexity. And you can use ACLs for many BGP customers too: as you create a 'perfect' prefix-list for a customer, you can reference the same prefix-list in the ACL, without actually needing the customer to announce that prefix, as it's entirely valid to originate traffic from an allowable prefix without advertising the prefix (to you).
This has the potential to break things, because it requires symmetry and perfect IRR accuracy. Just because the prefix would be rejected by BGP does not mean there is not a legitimate announcement for it in the DFZ (which is the exact difference between uRPF loose mode and the ACL approach).

For BGP customers where I control the announced IP space (it's mine, the customer has a private ASN and the only reason for BGP is so he can multi-home to different nodes of my network), sure. For real "IP Transit" where the customer may itself have multiple downstream ASNs, there is no guarantee that everyone in the chain will update the IRR records 24 - 48 hours before actually sourcing traffic from a new prefix (or enabling that new downstream as-path). Some other transit may just allow prefixes "manually" (for example, because of LOAs or inetnum objects, as opposed to route objects), so *a valid announcement is in the DFZ*, you are just not accepting it on your customer's BGP session. In fact, maybe my downstream customer just wants to send traffic to my network, but not receive any, so I don't actually have to include that customer in my AS-macro (an exotic use-case for sure, just trying to point out that there will always be asymmetry).

Routing, BGP and the IRR data are asymmetric by definition and neither real-time nor 100% accurate. That's not a problem for BGP and strict ingress prefix-lists, but it is a problem for ingress ACLing, because the latter effectively blackholes traffic, while uRPF loose mode does not (if there is an announcement for it in the DFZ).

So I don't think ACLs can replace uRPF loose mode in the DFZ, and frankly I find this proposal to be a bit dangerous. If my transit provider did this without telling me, and I turned up a new transit customer with an incomplete IRR record, causing an immediate partial outage for them, I would be *very* surprised (along with some other emotions).

cheers,
lukas
On Fri, 18 Oct 2019 at 20:15, Lukas Tribus <lists@ltri.eu> wrote:
This has the potential to break things, because it requires symmetry and perfect IRR accuracy. Just because the prefix would be rejected by BGP does not mean there is not a legitimate announcement for it in the DFZ (which is the exact difference between uRPF loose mode and the ACL approach).
It's interesting to also think, when is good time to break things.

CustomerA buys transit from ProviderB and ProviderA.
CustomerA gets new prefix, but does not appropriately register it.
ProviderB doesn't filter anything, so it works. ProviderA does filter and does not accept this new prefix. Neither Provider has ACL.
Some time passes, and ProviderB connection goes down, the new prefix, which is now old prefix, experiences total outage. CustomerA is not happy.

Would it have been better, if ProviderA would have ACLd the traffic from CustomerA? Forcing the problem to be evident when the prefix is young and not in production. Or was it better that it broke later on?

-- ++ytti
Hello, On Fri, Oct 18, 2019 at 7:40 PM Saku Ytti <saku@ytti.fi> wrote:
It's interesting to also think, when is good time to break things.
CustomerA buys transit from ProviderB and ProviderA
CustomerA gets new prefix, but does not appropriately register it.
ProviderB doesn't filter anything, so it works. ProviderA does filter and does not accept this new prefix. Neither Provider has ACL.
Some time passes, and ProviderB connection goes down, the new prefix, which is now old prefix experiences total outage. CustomerA is not happy.
Would it have been better, if ProviderA would have ACLd the traffic from CustomerA? Forcing the problem to be evident when the prefix is young and not in production. Or was it better that it broke later on?
That's an orthogonal problem, and its solution hopefully doesn't require a traffic-impacting ingress ACL.

I'm saying this breaks valid configurations because even with textbook IRR registrations there is a race condition between the IRR registration (not a route-object, but a new AS in the AS-MACRO), the ACL update and the BGP turn-up of a new customer (on AS further down).

Here's an environment for the examples below:

Customer C1 uses existing transits Provider P11 and P12 (meaning C1 is actually a production network; dropping traffic sourced by it in the DFZ is very bad; P11 and P12 are otherwise irrelevant).
Customer C1 is about to turn up a BGP session to Provider P13.
Provider P13 is a Tier2 and buys transit from Tier1 Providers P1 and P2.
Provider P2 deploys ingress ACLs depending on IRR data, based on P13's AS-MACRO.

Example 1:

P13's AS-MACRO is updated last-minute because:
- provisioning was last minute, OR
- provisioning was wrong initially, OR
- it's an emergency turn-up

Whatever the case, IRR records are corrected only 60 minutes before the turn-up, and C1 is aware traffic towards C1 will completely converge only after an additional 24 hours (but that's accepted, because $reasons; maybe C1 just needs TX bandwidth - in a hypothetical emergency turn-up, for example).

At the turn-up of C1_P13, traffic with as-path C1_P13_P2 is dropped, because the ingress ACL at P2 wasn't updated yet (it is updated only once every night). P13 expected prefixes not getting accepted at P2 on the BGP session, but never would have imagined that traffic sourced from valid prefixes present in the DFZ would be dropped.

Example 2:

Just as in example 1, C1 turns up BGP with P13, but the provisioning was "normal". P13's AS-MACRO was updated correctly 36 hours before the turn-up. However, at P2 the nightly cronjob for IRR updates (prefix-lists and ACL ingress filters) failed. It is monitored and a ticket about the failing cronjob was raised, however they either:
- did not recognize the severity, because "worst-case some new prefixes are not allowed in ingress tomorrow"
- were unable to fix it in just a few hours
- did fix it, but did not trigger a subsequent full rerun ("it will run next time", or "it could not complete anyway before the next run")
- maybe the node was actually just unreachable for a regular maintenance, so automation could not connect this time around
- or maybe automation just couldn't connect to the $node, because someone regenerated the SSH key by mistake this morning

Whatever the case, the point is: for internal problems at P2, the ACL wasn't updated during the night like it usually is. And at the turn-up of C1_P13, C1_P13_P2 traffic is again dropped on the floor.

When you reject a BGP prefix, you don't blackhole traffic; with an ingress ACL you do. That is a big difference, and because of this, you *but more importantly every single downstream ASN* need to account for race conditions and failures in the entire process; that includes the immediate resolution thereof, which is not required for BGP strict prefix-lists and uRPF loose mode.

Is this deployed like this in a production transit network? How does this network handle a failure like in example 2? How do its downstream customers handle race conditions like in example 1?

For the record: I'm imagining myself operating P13 and getting blamed in both examples for partially blackholing C1's traffic at the turn-up.

Thanks,
Lukas
On Fri, 18 Oct 2019 at 23:45, Lukas Tribus <lists@ltri.eu> wrote: Hey Lukas,
I'm saying this breaks valid configurations because even with textbook IRR registrations there is a race condition between the IRR registration (not a route-object, but a new AS in the AS-MACRO), the ACL update and the BGP turn-up of a new customer (on AS further down).
I'm not proposing an answer, I'm asking a question. Could it be that the utter disinterest in working BGP filters is a consequence of it not actually mattering in turn-ups in the typical case? And would the examples be the same if we were not so disinterested in having proper BGP filters in place?

If in the common case we did ACLs, would we evolve different mechanisms to ensure correctness of filtering before the fact? Perhaps a common API to query the state of filters in provider networks? Perhaps a maintenance window to turn up new transit with the option to fall back immediately and complain about their configurations?
Is this deployed like this in a production transit network? How does this network handle a failure like in example 2? How do its downstream customers handle race conditions like in example 1?
Yes, I've run BGP prefix-list == firewall filter (the same prefix-list verbatim referred to in BGP and the firewall) for all transit customers in one network for a decade+. Few problems were had, and the majority of customers were happy after the logic behind it was explained to them. But this was a tier2 in Europe; data quality is high in Europe compared to other markets, so it doesn't communicate much of the global state of affairs. I would not feel comfortable doing something like this in a Tier1 for the US+Asia markets.

But there is also no particular reason why we couldn't get there, if we as a community decided it is what we want, it would fix not just unexpected BGP filter outages but also several dos and security issues, due to killing spoofing. It would give us incentive to do BGP filtering properly.

-- ++ytti
Hello,
Is this deployed like this in a production transit network? How does this network handle a failure like in example 2? How do its downstream customers handle race conditions like in example 1?
Yes, I've run BGP prefix-list == firewall filter (the same prefix-list verbatim referred to in BGP and the firewall) for all transit customers in one network for a decade+. Few problems were had, and the majority of customers were happy after the logic behind it was explained to them. But this was a tier2 in Europe; data quality is high in Europe compared to other markets, so it doesn't communicate much of the global state of affairs. I would not feel comfortable doing something like this in a Tier1 for the US+Asia markets.
Ok, that is a very different message than what I interpreted from your initial post about this: just enable it, it's free, nothing will happen and your customers won't notice.
But there is also no particular reason why we couldn't get there, if we as a community decided it is what we want, it would fix not just unexpected BGP filter outages but also several dos and security issues, due to killing spoofing. It would give us incentive to do BGP filtering properly.
I agree this is something that should be discussed, but getting there is probably a very long road. Just look at the sorry state of BGP filtering itself. And this requires even more precision, automation, carefulness and *process changes*.

I just want to emphasize that when I buy IP transit and my provider does this *without telling me beforehand*, I will be very surprised and very unhappy (as I'm probably discovering this configuration because of a partial outage).

Lukas
On Sun, 20 Oct 2019 at 15:22, Lukas Tribus <lists@ltri.eu> wrote:
I agree this is something that should be discussed, but getting there is probably a very long road. Just look at the sorry state of BGP filtering itself. And this requires even more precision, automation, carefulness and *process changes*.
BGP is broken, because it can be. If it could not be, it would not be. This would make BGP filters a market-driven fact, instead of a nice thing some nerds care about. The transition would invariably cause some gray hairs, but the Internet is robust against technical and non-technical problems. -- ++ytti
-----Original Message----- From: NANOG <nanog-bounces@nanog.org> On Behalf Of Lukas Tribus Sent: Friday, October 18, 2019 9:45 PM To: Saku Ytti <saku@ytti.fi> Cc: nanog@nanog.org Subject: Re: Request comment: list of IPs to block outbound
Hello,
On Fri, Oct 18, 2019 at 7:40 PM Saku Ytti <saku@ytti.fi> wrote:
It's interesting to also think, when is good time to break things.
CustomerA buys transit from ProviderB and ProviderA
CustomerA gets new prefix, but does not appropriately register it.
ProviderB doesn't filter anything, so it works. ProviderA does filter and does not accept this new prefix. Neither Provider has ACL.
Some time passes, and ProviderB connection goes down, the new prefix, which is now old prefix experiences total outage. CustomerA is not happy.
Would it have been better, if ProviderA would have ACLd the traffic from CustomerA? Forcing the problem to be evident when the prefix is young and not in production. Or was it better that it broke later on?
That's an orthogonal problem, and its solution hopefully doesn't require a traffic-impacting ingress ACL.
I'm saying this breaks valid configurations because even with textbook IRR registrations there is a race condition between the IRR registration (not a route-object, but a new AS in the AS-MACRO), the ACL update and the BGP turn-up of a new customer (on AS further down).
Here's an environment for the examples below:
Customer C1 uses existing transits Provider P11 and P12 (meaning C1 is actually a production network; dropping traffic sourced by it in the DFZ is very bad; P11 and P12 is otherwise irrelevant). Customer C1 is about to turn-up a BGP session to Provider P13. Provider P13 is a Tier2 and buys transit from Tier1 Providers P1 and P2 Provider P2 deploys ingress ACLs depending on IRR data, based on P13's AS- MACRO.
I still think that ACL rules should go hand in hand with eBGP prefixes by default, but the ACLs should be updated based on advertised and accepted eBGP prefixes automatically (so not dependent on external data). If the IRR data accuracy and AS-MACROs get solved, the filtering problem would be solved as well.

If such a mechanism was enabled by default in all vendors' implementations, it would address the double-lookup problem of uRPF while accomplishing the same thing, and even address the source IP spoofing problem.

3 simple rules:

Rule 1) If you are advertising a prefix,
Then allow it as a source prefix in your egress ACL,
And allow it as a destination prefix in your ingress ACL.
(Because why do you advertise a prefix? Well, you expect to send traffic sourced from IPs covered by that prefix and you expect to get a response back, right?)
And as a result:
- Traffic sourced from IPs you haven't advertised via a particular link would be blocked at egress from your AS (on that link) -boundary A1
- Traffic destined to IPs you haven't advertised via a particular link will be blocked at ingress to your AS (on that link)

Rule 2) If you are accepting a prefix,
Then allow it as a source in your ingress ACL,
And allow it as a destination in your egress ACL.
(Because why do you accept a prefix? Well, you expect to send traffic towards IPs covered by that prefix and you'd want those IPs to be able to respond back, right?)
And as a result:
- Traffic sourced from IPs you haven't accepted via a particular link would be blocked at ingress to your AS (on that link) -boundary A2
- Traffic destined to IPs you haven't accepted via a particular link would be blocked at egress from your AS (on that link) -required because there's already an egress ACL blocking everything.

Rule 3) If the interface can't be uniquely identified based on the IPs used for the eBGP session, warn the operator about the condition.

The obvious drawback especially for TCAM based systems is the scale, so not only we'd need to worry if our FIB can hold 800k prefixes, but also if the filter memory can hold the same amount -in addition to whatever additional filtering we're doing at the edge (comb filters for DoS protection etc...)

adam
On Mon, 21 Oct 2019 at 23:14, <adamv0025@netconsultings.com> wrote:
The obvious drawback especially for TCAM based systems is the scale, so not only we'd need to worry if our FIB can hold 800k prefixes, but also if the filter memory can hold the same amount -in addition to whatever additional filtering we're doing at the edge (comb filters for DoS protection etc...)
This is actually a somewhat cheap problem, if you optimise for it. That is, rules are somewhat expensive, but N prefixes per rule are not, when designed with that requirement. Certainly the BOM effect can be entirely ignored. However, this is of course only true if that was a design goal; it won't help in a situation where HW is in place and does not scale there. Just pointing out that there are no technical or commercial problems getting there, should we so want. -- ++ytti
From: Saku Ytti <saku@ytti.fi> Sent: Tuesday, October 22, 2019 11:54 AM
On Mon, 21 Oct 2019 at 23:14, <adamv0025@netconsultings.com> wrote:
The obvious drawback especially for TCAM based systems is the scale, so not only we'd need to worry if our FIB can hold 800k prefixes, but also if the filter memory can hold the same amount -in addition to whatever additional filtering we're doing at the edge (comb filters for DoS protection etc...)
This is actually a somewhat cheap problem, if you optimise for it. That is, rules are somewhat expensive, but N prefixes per rule are not, when designed with that requirement. Certainly the BOM effect can be entirely ignored. However, this is of course only true if that was a design goal; it won't help in a situation where HW is in place and does not scale there. Just pointing out that there are no technical or commercial problems getting there, should we so want.
Well sure, if BGP prefix = ACL prefix had been true from the get-go, both scaling problems would have been catered for in unison and we wouldn't even notice. People here would be asking for recommendations on a new/replacement edge router that can support 1M routes and filter entries... But the reality is that long filters can significantly decrease the performance of modern NPUs/PFEs (the ones supporting 100G interfaces). adam
On 19 Oct 2019, at 04:42, Saku Ytti <saku@ytti.fi> wrote:
On Fri, 18 Oct 2019 at 20:15, Lukas Tribus <lists@ltri.eu> wrote:
This has the potential to brake things, because it requires symmetry and perfect IRR accuracy. Just because the prefix would be rejected by BGP does not mean there is not a legitimate announcement for it in the DFZ (which is the exact difference between uRPF loose mode and the ACL approach).
It's interesting to also think, when is good time to break things.
CustomerA buys transit from ProviderB and ProviderA
CustomerA gets new prefix, but does not appropriately register it.
ProviderB doesn't filter anything, so it works. ProviderA does filter and does not accept this new prefix. Neither Provider has ACL.
Some time passes, and ProviderB connection goes down, the new prefix, which is now old prefix experiences total outage. CustomerA is not happy.
Would it have been better, if ProviderA would have ACLd the traffic from CustomerA? Forcing the problem to be evident when the prefix is young and not in production. Or was it better that it broke later on?
Having been through this exact situation recently (made worse by the fact that it was caused by provider B's upstreams not having updated their filters, and not provider B itself), I would suggest it's 100 times better for it to happen right at the start rather than randomly down the track.
participants (15)
- adamv0025@netconsultings.com
- Brandon Martin
- Chris Jones
- Enno Rey
- Florian Brandstetter
- Grant Taylor
- Lukas Tribus
- Måns Nilsson
- Saku Ytti
- Seth Mattinen
- Stephen Satchell
- Thomas Bellman
- Valdis Klētnieks
- Vincent Bernat
- William Herrin