New subject: Starting to Drop Invalids for Customers

13 Jan 2020

      Mark,

Thanks for bringing this up again.
I remember this from nearly 3 years ago when Randy brought it up.
A bug was filed, but it disappeared into the woodwork.
I have now given it the high-priority tag that it should have had initially.
Sorry about the mess-up.

In the meantime, you may be able to signal the validation state in iBGP
once it is validated at the network edge.
For an iBGP neighbor, use a configuration like this:
   neighbor 192.0.2.1 announce rpki state

Regards,
Jakob.

Date: Sat, 11 Jan 2020 23:12:27 +0200
From: Mark Tinka <mark.tinka@seacom.mu>

On 10/Jan/20 16:15, Lukas Tribus wrote:
...
> Thanks for sharing all this. Regarding those 2 platforms specifically,
> what release are you using here that does not blow up?
On the ASR920, we are on 16(11)01a.

On the ME3600X, we are on 15.6(2)SP6.
...
> IIRC you had some RPKI related crash bugs at some point in time?
Yes, that was the first time we were deploying RPKI in 2014 and the code
back then crashed the ME3600X.

No such problem this time around.
...
> - there is no ROA, so prefixes are supposed to be UNKNOWN on all nodes
> - but IOS-XE prefers VALID over UNKNOWN (changing best path selection)
> - iBGP is *always* VALID (even if it's really UNKNOWN), eBGP is
>   showing UNKNOWN, so iBGP is preferred over eBGP which breaks a lot of
>   assumptions and "hot potato" concepts (possible temporary routing
>   loops, other than of course different egress behavior)
So your timing on this is ominous.

In the last day or so, we had an issue with a customer on one of our
ASR1006 edge routers that fell victim to this IOS XE stupidity. An
alternate path toward them learned from a peer was sent back to the edge
router they are connected to, which chose it over the local one because,
well, it was an iBGP route. We didn't notice this issue with this
customer since enabling ROV on this box weeks ago, which means the
alternate route became available in the last 2 - 3 days, e.g., perhaps
they turned up an alternative provider, or changed their routing toward
them for us to see another path.
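
To make the failure concrete, here is a rough Python model of what happened on that box. It is only a sketch under assumptions: the `Path` class, the `best_path` function and the path names are made up for illustration, and the thread does not say where IOS XE places the validation-state comparison in its decision process, so this model simply puts it first.

```python
from dataclasses import dataclass

# Lower rank wins. This ordering models the behaviour reported in the thread;
# it is not taken from any vendor's code.
STATE_RANK = {"valid": 0, "notfound": 1, "invalid": 2}


@dataclass(frozen=True)
class Path:
    name: str          # hypothetical label for the path
    learned_via: str   # "ebgp" or "ibgp"
    rpki_state: str    # state that origin validation would compute
    local_pref: int = 100


def effective_state(path, ibgp_defaults_to_valid):
    """State that the best-path step sees for this path."""
    if path.learned_via == "ibgp" and ibgp_defaults_to_valid:
        return "valid"  # the behaviour complained about in this thread
    return path.rpki_state


def best_path(paths, ibgp_defaults_to_valid):
    def key(p):
        # Assumed order: validation state, higher local-pref, eBGP over iBGP.
        return (
            STATE_RANK[effective_state(p, ibgp_defaults_to_valid)],
            -p.local_pref,
            0 if p.learned_via == "ebgp" else 1,
        )
    return min(paths, key=key)


# No ROA exists for the customer prefix, so both paths are really "notfound".
customer = Path("direct-from-customer", "ebgp", "notfound")
via_peer = Path("via-peer-on-other-router", "ibgp", "notfound")

print(best_path([customer, via_peer], ibgp_defaults_to_valid=True).name)
print(best_path([customer, via_peer], ibgp_defaults_to_valid=False).name)
```

With the iBGP default in place the model picks the path via the peer; with the real state carried on both paths, the usual eBGP-over-iBGP tie-break keeps traffic on the directly connected customer link.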

Since this IOS XE stupidity is not configurable, what we've decided to
do is disable ROV on all ASR1006 boxes for now. This is not a big issue
for us. We've only got 2 customers using them as these boxes only carry
non-Ethernet customers.

While this should be an issue for our ASR920 and ME3600X routers also,
it isn't because we run BGP-SD on those, i.e., even if the RIB will have
all iBGP routes marked as Valid, they won't be installed in FIB, whereas
the eBGP routes learned locally from the customer will.

Having to create a ROA to solve this, while feasible, is not an
appropriate solution, especially when Juniper do it correctly.

Randy and I complained to Cisco about this years back, and AFAIK, it was
only fixed in IOS XR. That this is still going on in 2020 is silly,
especially when it's clear that they are in violation of the RFC.
...
> Apparently there is an IOS feature "Announce RPKI Validation State to
> Neighbors" to transmit the *real* RPKI state in iBGP (as opposed to
> defaulting to VALID for all iBGP neighbors). I'm not sure if that
> fixes this problem or not. It doesn't really address the root cause
> (which is: unwanted and not configurable interference with the best
> path selection algorithm), but it can at least hide its symptoms.
I've not tried communicating RPKI state between routers via BGP communities.

One of the reasons I like RPKI is because it is a feature that works on
each router independent of another. Each router has a discrete RTR
session to a validator, and can make its own RPKI decisions without any
regard for the rest of the network. And yet all routers in the network
can do this and equally have a converged RPKI state, without ever
speaking to one another.
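
As an illustration of why that works, the per-route decision is a pure function of the route and the validated ROA data the router has loaded. The sketch below follows the valid / invalid / notfound rules of RFC 6811 as I understand them; `validate_origin`, the tuple layout and the example prefixes and AS numbers are assumptions for illustration, not any router's implementation.

```python
import ipaddress


def validate_origin(prefix, origin_as, vrps):
    """Classify a route as 'valid', 'invalid' or 'notfound'.

    vrps is an iterable of (roa_prefix, max_length, asn) tuples, standing in
    for the validated data a router receives from its validator.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in vrps:
        roa = ipaddress.ip_network(roa_prefix)
        # A VRP covers the route if the route sits inside the ROA prefix.
        if route.version != roa.version or not route.subnet_of(roa):
            continue
        covered = True
        # It matches if the origin AS agrees and the route is not too specific.
        if asn == origin_as and route.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "notfound"


# Documentation prefixes and AS numbers, used here purely as examples.
vrps = [("192.0.2.0/24", 24, 64500)]

print(validate_origin("192.0.2.0/24", 64500, vrps))    # covered and matching
print(validate_origin("192.0.2.0/25", 64500, vrps))    # covered, too specific
print(validate_origin("203.0.113.0/24", 64500, vrps))  # no covering entry
```

Two routers holding the same validated data reach the same answer for the same route, which is why no state needs to be exchanged between them.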

So the idea of having routers co-ordinate RPKI information through
communities is one I am not so keen on, if I'm honest. Not only do you
need to worry about inter-op issues between vendors, there is potential
for problems when code changes over time. I'd rather not deal with that,
especially since what Cisco are doing with IOS XE is simply a broken
implementation.

That said, if there is anyone out there who has done this and sees it as
a solution to the problem, I'm sure this list would like to hear about it.
...
> RPKI implementations should not touch best path selection. Dropping
> RPKI invalids is the real use-case here, and if someone wants to
> loc-pref based on RPKI status we should allow it (even if it doesn't
> make a lot of sense), but having the RPKI implementation intervene in
> the best path selection without the possibility to disable it is ...
> frustrating.
Agreed.

At least, if IOS XE had a knob that could "set rpki
[valid|notfound|invalid]" this would somewhat help. But alas, they don't
:-(.

You can only match on existing RPKI state. You cannot manually set RPKI
state in IOS XE routing policy. I mean, how dumb is that?
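
For contrast, the behaviour being asked for is policy that only acts on the computed state when the operator says so. A minimal sketch in Python, not IOS XE syntax; `import_policy` and the `local_pref_by_state` parameter are hypothetical names used only for this example.

```python
def import_policy(route, rpki_state, local_pref_by_state=None):
    """Return the route to accept, or None to drop it.

    route is a dict with at least 'prefix' and 'local_pref'.
    local_pref_by_state is an optional, operator-supplied mapping such as
    {"valid": 120}; when it is omitted, preference is left untouched.
    """
    if rpki_state == "invalid":
        return None  # dropping invalids is the core use case
    accepted = dict(route)  # copy so the caller's route is not modified
    if local_pref_by_state and rpki_state in local_pref_by_state:
        accepted["local_pref"] = local_pref_by_state[rpki_state]
    return accepted


route = {"prefix": "192.0.2.0/24", "local_pref": 100}

print(import_policy(route, "invalid"))
print(import_policy(route, "notfound"))
print(import_policy(route, "valid", {"valid": 120}))
```

Nothing here changes preference unless the operator passes a mapping, which is the opposite of the built-in, non-configurable preference discussed in this thread.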

It's pretty presumptuous of Cisco to automatically apply policy for you
re: RPKI in IOS XE, but then show they can do it right in IOS XR. Unreal!
...
> How much do you rely on "hot potato" routing for peers/transit and
> customers? How does that work for you with RPKI unknowns?
As above, we've disabled RPKI on all ASR1006 edge routers. It affects
only 2 of our customers, so not an issue.

I'll go through a few bottles of wine and some music this weekend as I
summon up the energy to write to Cisco again to fix this. If I don't,
I'll just keep sending money Juniper's way. Simple, and simpler!

Mark.
