Re: Starting to Drop Invalids for Customers
Mark,
Thanks for bringing this up again. I remember this from nearly 3 years ago when Randy brought it up. A bug was filed, but it disappeared in the woodwork. I have now given it the high priority tag that it should have had initially. Sorry about the mess up.
In the meantime, you may be able to signal the validation state in iBGP once it is validated at the network edge. For an iBGP neighbor, use a configuration like this:
neighbor 192.0.2.1 announce rpki state
Regards, Jakob.
Date: Sat, 11 Jan 2020 23:12:27 +0200
From: Mark Tinka <mark.tinka@seacom.mu>
On 10/Jan/20 16:15, Lukas Tribus wrote:
Thanks for sharing all this. Regarding those 2 platforms specifically, what release are you using here that does not blow up?
On the ASR920, we are on 16(11)01a. On the ME3600X, we are on 15.6(2)SP6.
IIRC you had some RPKI related crash bugs at some point in time?
Yes, that was the first time we were deploying RPKI in 2014 and the code back then crashed the ME3600X. No such problem this time around.
- there is no ROA, so prefixes are supposed to be UNKNOWN on all nodes
- but IOS-XE prefers VALID over UNKNOWN (changing best path selection)
- iBGP is *always* VALID (even if it's really UNKNOWN), eBGP is showing UNKNOWN, so iBGP is preferred over eBGP which breaks a lot of assumptions and "hot potato" concepts (possible temporary routing loops, other than of course different egress behavior)
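For readers following along, the effect of the three points above can be sketched with a toy best-path comparison. This is only an illustration under stated assumptions: the `Path` fields, the two key functions, and the shortened tie-break order (local-pref, AS-path length, eBGP over iBGP) are hypothetical simplifications, not the actual IOS XE algorithm.

```python
from dataclasses import dataclass

# Lower rank wins. Ranking VALID ahead of UNKNOWN mirrors the behaviour
# described in the thread; the numbers themselves are an assumption.
RPKI_RANK = {"valid": 0, "unknown": 1, "invalid": 2}


@dataclass(frozen=True)
class Path:
    name: str
    rpki_state: str   # "valid", "unknown" or "invalid"
    local_pref: int
    as_path_len: int
    is_ebgp: bool


def plain_key(p: Path):
    # Simplified ordering: higher local-pref, shorter AS path, eBGP over iBGP.
    return (-p.local_pref, p.as_path_len, 0 if p.is_ebgp else 1)


def rpki_first_key(p: Path):
    # Same ordering, but the RPKI state is compared before anything else.
    return (RPKI_RANK[p.rpki_state],) + plain_key(p)


def best(paths, key):
    return min(paths, key=key)


# No ROA exists for the prefix, so both paths should really be "unknown".
# The iBGP copy is nevertheless treated as "valid" on the receiving router.
ebgp = Path("eBGP from customer", "unknown", 100, 1, True)
ibgp = Path("iBGP via peer", "valid", 100, 2, False)

print(best([ebgp, ibgp], plain_key).name)       # eBGP from customer
print(best([ebgp, ibgp], rpki_first_key).name)  # iBGP via peer
```

With the RPKI state compared first, the iBGP copy wins even though it has the longer AS path and the locally learned eBGP path would otherwise be chosen.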
So your timing on this is ominous. In the last day or so, we had an issue with a customer on one of our ASR1006 edge routers that fell victim to this IOS XE stupidity. An alternate path toward them learned from a peer was sent back to the edge router they are connected to, which chose it over the local one because, well, it was an iBGP route. We didn't notice this issue with this customer since enabling ROV on this box weeks ago, which means the alternate route became available in the last 2 - 3 days, e.g., perhaps they turned up an alternative provider, or changed their routing toward them for us to see another path. Since this IOS XE stupidity is not configurable, what we've decided to do is disable ROV on all ASR1006 boxes for now. This is not a big issue for us. We've only got 2 customers using them as these boxes only carry non-Ethernet customers. While this should be an issue for our ASR920 and ME3600X routers also, it isn't because we run BGP-SD on those, i.e., even if the RIB will have all iBGP routes marked as Valid, they won't be installed in FIB, whereas the eBGP routes learned locally from the customer will. Having to create a ROA to solve this, while feasible, is inappropriate for a solution, especially when Juniper do it correctly. Randy and I complained to Cisco about this years back, and AFAIK, it was only fixed in IOS XR. That this is still going on in 2020 is silly, especially when it's clear that they are in violation of the RFC.
Apparently there is an IOS feature "Announce RPKI Validation State to Neighbors" to transmit the *real* RPKI state in iBGP (so as opposed to defaulting to VALID for all iBGP neighbors), I'm not sure if that fixes this problem or not. It doesn't really address the root cause (which is: unwanted and not configurable interference with the best path selection algorithm) - but it can at least hide its symptoms.
I've not tried communicating RPKI state between routers via BGP communities. One of the reasons I like RPKI is because it is a feature that works on each router independent of another. Each router has a discrete RTR session to a validator, and can make its own RPKI decisions without any regard for the rest of the network. And yet all routers in the network can do this and equally have a converged RPKI state, without ever speaking to one another. So the idea of having routers co-ordinate RPKI information through communities is one I am not so keen on, if I'm honest. Not only do you need to worry about inter-op issues between vendors, there is potential for problems when code changes over time. I'd rather not deal with that, especially since what Cisco are doing with IOS XE is simply a broken implementation. That said, if there is anyone out there who has done this and sees it as a solution to the problem, I'm sure this list would like to hear about it.
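The per-router decision described above can be sketched as the route origin validation procedure from RFC 6811, run locally against the VRPs a router has received over its RTR session. This is a minimal sketch: the `VRP` tuple, the `validate` function, and the sample data (documentation prefixes and ASNs) are assumptions made for illustration, not any vendor's code.

```python
import ipaddress
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """One validated ROA payload, as a router would learn it over RTR."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Return "valid", "invalid" or "notfound" for a single route."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # A VRP covers the route if the route lies inside the VRP prefix.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # It matches if the origin AS agrees and the route is not more
        # specific than the VRP's maximum length allows.
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    return "invalid" if covered else "notfound"


# Hypothetical VRP table; every router fed by the same validator ends up
# with the same table and therefore reaches the same result on its own.
vrps = [VRP("192.0.2.0/24", 24, 64500)]

print(validate("192.0.2.0/24", 64500, vrps))     # valid
print(validate("192.0.2.0/24", 64501, vrps))     # invalid (wrong origin AS)
print(validate("192.0.2.0/25", 64500, vrps))     # invalid (longer than max length)
print(validate("198.51.100.0/24", 64500, vrps))  # notfound (no covering VRP)
```

Nothing in this procedure depends on what a neighbouring router concluded, which is the property being argued for here.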
RPKI implementations should not touch best path selection. Dropping RPKI invalids is the real use-case here, and if someone wants to loc-pref based on RPKI status we should allow it (even if it doesn't make a lot of sense), but having the RPKI implementation intervene in the best path selection without the possibility to disable it is ... frustrating.
Agreed. At least, if IOS XE had a knob that could "set rpki [valid|notfound|invalid]" this would somewhat help. But alas, they don't :-(. You can only match on existing RPKI state. You cannot manually set RPKI state in IOS XE routing policy. I mean, how dumb is that? It's pretty presumptuous of Cisco to automatically apply policy for you re: RPKI in IOS XE, but then show they can do it right in IOS XR. Unreal!
How much do you rely on "hot potato" routing for peers/transit and customers? How does that work for you with RPKI unknowns?
As above, we've disabled RPKI on all ASR1006 edge routers. It affects only 2 of our customers, so not an issue. I'll go through a few bottles of wine and some music this weekend as I summon up the energy to write to Cisco again to fix this. If I don't, I'll just keep sending money Juniper's way. Simple, and simpler! Mark.
On 13/Jan/20 21:53, Jakob Heitz (jheitz) wrote:
Mark,
Thanks for bringing this up again. I remember this from nearly 3 years ago when Randy brought it up. A bug was filed, but it disappeared in the woodwork. I have now given it the high priority tag that it should have had initially. Sorry about the mess up.
Many thanks, Jakob, for bumping this. Much appreciated, as I was dreading running this through my account team :-). Most grateful if you can keep us (or me, whichever you prefer) posted on the progress of this fix. I am willing to test code to verify things.
In the meantime, you may be able to signal the validation state in iBGP once it is validated at the network edge. For an iBGP neighbor, use a configuration like this: neighbor 192.0.2.1 announce rpki state
So the majority of our peering and customer edge lives on Juniper. We don't run RPKI on our (Cisco) route reflectors either. So considering that this issue affects only 2 of our customers, we don't feel it justifies enabling this feature across the backbone for the moment, as a lot more testing and care would be needed, which I cannot currently dedicate time to given the only benefit would be to fix 2 non-Ethernet customers. But again, I am more than happy to help support the fixing of this bug in IOS and IOS XE, and would be okay to test when you're ready. Thanks. Mark.
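For context on what signalling the validation state in iBGP involves on the wire, the sketch below encodes and decodes the origin validation state extended community defined in RFC 8097 (non-transitive opaque type 0x43, sub-type 0x00, five reserved octets, one state octet). That the `announce rpki state` knob uses this community is an assumption here, not something confirmed in the thread, and the function names are hypothetical.

```python
import struct

# Layout per RFC 8097: type, sub-type, 5 reserved (zero) octets, state.
_FORMAT = "!BB5xB"
EXT_TYPE_NONTRANSITIVE_OPAQUE = 0x43
SUBTYPE_ORIGIN_VALIDATION = 0x00

STATE_TO_CODE = {"valid": 0, "notfound": 1, "invalid": 2}
CODE_TO_STATE = {code: state for state, code in STATE_TO_CODE.items()}


def encode_state(state: str) -> bytes:
    """Build the 8-octet extended community for a validation state."""
    return struct.pack(
        _FORMAT,
        EXT_TYPE_NONTRANSITIVE_OPAQUE,
        SUBTYPE_ORIGIN_VALIDATION,
        STATE_TO_CODE[state],
    )


def decode_state(community: bytes) -> str:
    """Read the validation state back out of an 8-octet community."""
    if len(community) != 8:
        raise ValueError("extended communities are 8 octets long")
    ext_type, subtype, code = struct.unpack(_FORMAT, community)
    if (ext_type, subtype) != (
        EXT_TYPE_NONTRANSITIVE_OPAQUE,
        SUBTYPE_ORIGIN_VALIDATION,
    ):
        raise ValueError("not an origin validation state community")
    if code not in CODE_TO_STATE:
        # Rejecting unknown codes is this sketch's choice, not a rule
        # taken from the RFC.
        raise ValueError(f"unrecognised validation state {code}")
    return CODE_TO_STATE[code]


print(encode_state("invalid").hex())            # 4300000000000002
print(decode_state(encode_state("notfound")))   # notfound
```

Because the community is non-transitive, it is meant to stay inside the AS, which is why it is discussed here only for iBGP neighbors.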
Hello,
On Tue, 14 Jan 2020 at 07:21, Mark Tinka <mark.tinka@seacom.mu> wrote:
On 13/Jan/20 21:53, Jakob Heitz (jheitz) wrote:
Mark,
Thanks for bringing this up again. I remember this from nearly 3 years ago when Randy brought it up. A bug was filed, but it disappeared in the woodwork. I have now given it the high priority tag that it should have had initially. Sorry about the mess up.
Many thanks, Jakob, for bumping this. Much appreciated, as I was dreading running this through my account team :-).
Most grateful if you can keep us (or me, whichever you prefer) posted on the progress of this fix. I am willing to test code to verify things.
I'm also very interested to follow the progress here. Is there a BugID you guys can share? Thank you, Lukas
Lukas,
CSCvc84848
Will keep you in the loop too, Lukas.
Regards, Jakob.
-----Original Message-----
From: Lukas Tribus <lists@ltri.eu>
Sent: Monday, February 3, 2020 12:43 AM
To: Mark Tinka <mark.tinka@seacom.mu>; Jakob Heitz (jheitz) <jheitz@cisco.com>
Cc: nanog@nanog.org
Subject: Re: Starting to Drop Invalids for Customers
Hello,
On Tue, 14 Jan 2020 at 07:21, Mark Tinka <mark.tinka@seacom.mu> wrote:
On 13/Jan/20 21:53, Jakob Heitz (jheitz) wrote:
Mark,
Thanks for bringing this up again. I remember this from nearly 3 years ago when Randy brought it up. A bug was filed, but it disappeared in the woodwork. I have now given it the high priority tag that it should have had initially. Sorry about the mess up.
Many thanks, Jakob, for bumping this. Much appreciated, as I was dreading running this through my account team :-).
Most grateful if you can keep us (or me, whichever you prefer) posted on the progress of this fix. I am willing to test code to verify things.
I'm also very interested to follow the progress here. Is there a BugID you guys can share? Thank you, Lukas
participants (3)
- Jakob Heitz (jheitz)
- Lukas Tribus
- Mark Tinka