On 10/Jan/20 16:15, Lukas Tribus wrote:
Thanks for sharing all this. Regarding those 2 platforms specifically, what release are you using here that does not blow up?
On the ASR920, we are on 16(11)01a. On the ME3600X, we are on 15.6(2)SP6.
IIRC you had some RPKI related crash bugs at some point in time?
Yes, that was the first time we were deploying RPKI in 2014 and the code back then crashed the ME3600X. No such problem this time around.
- there is no ROA, so prefixes are supposed to be UNKNOWN on all nodes
- but IOS-XE prefers VALID over UNKNOWN (changing best path selection)
- iBGP is *always* VALID (even if it's really UNKNOWN), eBGP is showing UNKNOWN, so iBGP is preferred over eBGP which breaks a lot of assumptions and "hot potato" concepts (possible temporary routing loops, other than of course different egress behavior)
So your timing on this is ominous. In the last day or so, we had an issue with a customer on one of our ASR1006 edge routers that fell victim to this IOS XE stupidity. An alternate path toward them learned from a peer was sent back to the edge router they are connected to, which chose it over the local one because, well, it was an iBGP route.

We didn't notice this issue with this customer since enabling ROV on this box weeks ago, which means the alternate route became available in the last 2 - 3 days, e.g., perhaps they turned up an alternative provider, or changed their routing toward them for us to see another path.

Since this IOS XE stupidity is not configurable, what we've decided to do is disable ROV on all ASR1006 boxes for now. This is not a big issue for us. We've only got 2 customers using them as these boxes only carry non-Ethernet customers.

While this should be an issue for our ASR920 and ME3600X routers also, it isn't because we run BGP-SD on those, i.e., even if the RIB will have all iBGP routes marked as Valid, they won't be installed in FIB, whereas the eBGP routes learned locally from the customer will.

Having to create a ROA to solve this, while feasible, is inappropriate for a solution, especially when Juniper do it correctly. Randy and I complained to Cisco about this years back, and AFAIK, it was only fixed in IOS XR. That this is still going on in 2020 is silly, especially when it's clear that they are in violation of the RFC.
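To make the failure mode above concrete, here is a rough Python sketch of a toy best-path comparison. It is not Cisco's algorithm: the names (`Path`, `best_path`, `STATE_RANK`) are made up for illustration, and it assumes the validation state is compared ahead of local preference and AS-path length, which the thread does not spell out. It only models the two behaviours described here: VALID is preferred over UNKNOWN, and iBGP-learned paths are treated as VALID regardless of ROAs.

```python
from dataclasses import dataclass

# Lower rank wins. Ordering taken from the thread: VALID is preferred over UNKNOWN.
STATE_RANK = {"valid": 0, "unknown": 1, "invalid": 2}


@dataclass(frozen=True)
class Path:
    name: str
    learned_via: str       # "ebgp" or "ibgp"
    local_pref: int = 100
    as_path_len: int = 1


def effective_state(path, real_state, ibgp_assumed_valid):
    # Behaviour reported in the thread: iBGP paths show up as VALID
    # even when no ROA exists for the prefix.
    if ibgp_assumed_valid and path.learned_via == "ibgp":
        return "valid"
    return real_state


def best_path(paths, real_state="unknown",
              state_in_bestpath=True, ibgp_assumed_valid=True):
    def key(p):
        if state_in_bestpath:
            rank = STATE_RANK[effective_state(p, real_state, ibgp_assumed_valid)]
        else:
            rank = 0  # validation state plays no part in the comparison
        # Assumed order: state, then local-pref (higher wins),
        # then AS-path length (shorter wins), then eBGP over iBGP.
        return (rank, -p.local_pref, p.as_path_len,
                0 if p.learned_via == "ebgp" else 1)
    return min(paths, key=key)


# Hypothetical paths for a customer prefix with no ROA.
customer = Path("customer-ebgp", "ebgp", as_path_len=1)
alternate = Path("peer-via-ibgp", "ibgp", as_path_len=2)

print(best_path([customer, alternate]).name)
print(best_path([customer, alternate], state_in_bestpath=False).name)
```

With the defaults, the iBGP-learned alternate wins despite the longer AS path; taking the state out of the comparison, or not defaulting iBGP to VALID, puts the locally learned customer route back on top.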
Apparently there is an IOS feature "Announce RPKI Validation State to Neighbors" to transmit the *real* RPKI state in iBGP (as opposed to defaulting to VALID for all iBGP neighbors). I'm not sure if that fixes this problem or not. It doesn't really address the root cause (which is: unwanted and not configurable interference with the best path selection algorithm), but it can at least hide its symptoms.
I've not tried communicating RPKI state between routers via BGP communities. One of the reasons I like RPKI is because it is a feature that works on each router independent of another. Each router has a discrete RTR session to a validator, and can make its own RPKI decisions without any regard for the rest of the network. And yet all routers in the network can do this and equally have a converged RPKI state, without ever speaking to one another.

So the idea of having routers co-ordinate RPKI information through communities is one I am not so keen on, if I'm honest. Not only do you need to worry about inter-op issues between vendors, there is potential for problems when code changes over time. I'd rather not deal with that, especially since what Cisco are doing with IOS XE is simply a broken implementation.

That said, if there is anyone out there who has done this and sees it as a solution to the problem, I'm sure this list would like to hear about it.
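For anyone who wants to see why every router lands on the same answer without talking to the others, here is a minimal Python sketch of origin validation along the lines of RFC 6811. It assumes each router holds the same set of validated ROA payloads from its validator; the `VRP` tuple and `validate` function are illustrative names, not any vendor's API, and the prefixes and ASNs are documentation values.

```python
import ipaddress
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """One validated ROA payload, as a router would receive it over RTR."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # A VRP covers the route if the route sits inside the VRP prefix.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # A covering VRP matches if the length is allowed and the origin AS agrees.
        # AS 0 never matches any route.
        if (route.prefixlen <= vrp.max_length
                and vrp.asn == origin_asn
                and vrp.asn != 0):
            return "valid"
    # Covered but never matched: invalid. Not covered at all: notfound.
    return "invalid" if covered else "notfound"


vrps = [VRP("192.0.2.0/24", 24, 64496)]
print(validate("192.0.2.0/24", 64496, vrps))
print(validate("192.0.2.0/24", 64511, vrps))
print(validate("198.51.100.0/24", 64496, vrps))
```

The result depends only on the route and the VRP set, so two routers fed by the same validator data reach the same state for the same route with no state carried in BGP.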
RPKI implementations should not touch best path selection. Dropping RPKI invalids is the real use-case here, and if someone wants to set loc-pref based on RPKI status we should allow it (even if it doesn't make a lot of sense), but having the RPKI implementation intervene in the best path selection without the possibility to disable it is ... frustrating.
Agreed. At least, if IOS XE had a knob that could "set rpki [valid|notfound|invalid]" this would somewhat help. But alas, they don't :-(. You can only match on existing RPKI state. You cannot manually set RPKI state in IOS XE routing policy. I mean, how dumb is that? It's pretty presumptuous of Cisco to automatically apply policy for you re: RPKI in IOS XE, but then show they can do it right in IOS XR. Unreal!
How much do you rely on "hot potato" routing for peers/transit and customers? How does that work for you with RPKI unknowns?
As above, we've disabled RPKI on all ASR1006 edge routers. It affects only 2 of our customers, so not an issue. I'll go through a few bottles of wine and some music this weekend as I summon up the energy to write to Cisco again to fix this. If I don't, I'll just keep sending money Juniper's way. Simple, and simpler!

Mark.