Re: Cogent RPKI invalid filtering

26 Apr 2021

Hi Robert, NANOG,

On Mon, Apr 26, 2021 at 09:29:27AM -0400, Robert Blayzor via NANOG wrote:
...
> According to Cloudflare's isbgpsafeyet.com, Cogent has been considered "safe"
> and is filtering invalids.
> But I have found that to be untrue (mostly). It appears that some days they
> filter IPv4, sometimes not, and IPv6 invalids are always coming through. I
> know it's Cogent, but curious as to what others are seeing.

[ Disclaimer: I'm not affiliated with the companies referenced in the
     above message. But as I love talking about RPKI, I'd like to share
     some perspective based on my own experience with both small and
     large scale RPKI deployments. ]

TL;DR - RPKI Route Origin Validation (ROV) is incrementally deployed
inside networks, and incrementally across the Default-Free Zone. This
means right now (and for years to come), operators will see RPKI invalid
routes spill through the cracks of the global routing system.
This is expected and unavoidable.

Details ---

There are a few caveats to consider when using the isbgpsafeyet.com
testing utility to determine whether a network is doing RPKI ROV with
'invalid == reject' EBGP policies. The isbgpsafeyet.com beacon prefixes
are anycasted from many vantage points; this 'skews' the testing results
in some ways.  Imagine the prefixes being anycasted from a (hypothetical)
100 POPs: that essentially amounts to 100 attempts to propagate RPKI
invalid routes into the default-free zone. Only a single route (out of
the 100)
needs to slip past any potential 'invalid == reject' barriers between
the testsite and the visitor. The Cloudflare test essentially goes out
of its way to circumvent RPKI filters, but at the same time is easily
fooled in the presence of default routes (0.0.0.0/0 + ::/0).
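
To put rough numbers on the anycast effect described above, here is a
minimal sketch. It assumes, purely for illustration, that each of the
announcements independently has the same probability of finding an
unfiltered path towards the visitor. Both the independence and the
probability values are assumptions rather than measurements (real paths
share upstreams and are correlated), and the function name is made up.

```python
def leak_probability(p_single: float, announcements: int) -> float:
    """Chance that at least one of `announcements` attempts gets through.

    p_single is the ASSUMED probability that a single anycast
    announcement slips past every 'invalid == reject' barrier between
    the test site and the visitor. Attempts are ASSUMED independent.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if announcements < 0:
        raise ValueError("announcements must not be negative")
    # P(at least one leak) = 1 - P(every attempt is filtered)
    return 1.0 - (1.0 - p_single) ** announcements


if __name__ == "__main__":
    # Hypothetical values, not measurements.
    for p in (0.01, 0.05):
        chance = leak_probability(p, 100)
        print(f"p={p:.2f}, 100 POPs -> {chance:.1%} chance of a leak")
```

Under these assumptions, even a 1% chance per announcement adds up to
roughly a 63% chance that at least one of the 100 invalid routes reaches
the visitor.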

One way to get a broader sense of how one's local internet connection is
impacted by RPKI is to compare traceroutes to 103.21.244.15 with
traceroutes to 1.1.1.1 - if the first trace takes a bit of a detour
compared to the latter, it might indicate that only one (or a few)
routers in a global IP backbone are not RPKI-capable.

In addition to the CF test, I recommend also testing similar but
alternative tools, such as https://sg-pub.ripe.net/jasper/rpki-web-test/
The ripe.net test is *not* anycasted and is single-homed behind a
transit-free carrier; this too skews the results in some way.

Another test can be done by pinging the RIPE RIS "Resource Certification
(RPKI) Routing Beacons" at the bottom of this page:
https://www.ripe.net/analyse/internet-measurements/routing-information-servi...

Yet another way of measuring to what degree RPKI ROV has been
deployed in an individual AS or the DFZ as a whole is by looking at BGP
data. The NLNOG RING LG (AS 199036, http://lg.ring.nlnog.net/summary/lg01/ipv4)
receives tens of full table feeds from various BGP speakers around the
planet. Every few hours a script takes a snapshot of the LG's Local RIB
and applies the RFC 6811 Origin Validation procedure to all paths, and
for a select few ASNs stores the list of prefixes.
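
The origin validation step can be sketched roughly as follows. This is a
minimal illustration of the RFC 6811 procedure, not the script that runs
against the LG. The VRP class, the validate() function and the example
prefixes and ASNs (taken from the documentation ranges) are all made up
for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Iterable, Optional


@dataclass(frozen=True)
class VRP:
    """Validated ROA Payload: (prefix, max length, origin ASN)."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: Optional[int],
             vrps: Iterable[VRP]) -> str:
    """Return 'valid', 'invalid' or 'not-found' for a single route.

    origin_asn is None when no single origin AS can be derived from
    the AS_PATH (for example when it ends in an AS_SET).
    """
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        # IPv4 VRPs never cover IPv6 routes, and vice versa.
        if vrp_net.version != route.version:
            continue
        # "Covered": the route's prefix lies within the VRP's prefix.
        if not route.subnet_of(vrp_net):
            continue
        covered = True
        # "Matched": covered, not more specific than the max length,
        # and the same (non-zero) origin AS.
        if (origin_asn is not None and origin_asn != 0
                and origin_asn == vrp.asn
                and route.prefixlen <= vrp.max_length):
            return "valid"
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    # Hypothetical VRPs using documentation prefixes and ASNs.
    vrps = [VRP("192.0.2.0/24", 24, 64496),
            VRP("2001:db8::/32", 48, 64497)]
    for prefix, asn in [("192.0.2.0/24", 64496),
                        ("192.0.2.0/25", 64496),
                        ("198.51.100.0/24", 64496)]:
        print(prefix, f"AS{asn}", validate(prefix, asn, vrps))
```

Note that a route is only 'invalid' when at least one VRP covers it and
none matches; a route without any covering VRP is 'not-found' and is
unaffected by an 'invalid == reject' policy.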

Cecilia Testart et al. did a thorough study using similar methodology:
https://www.caida.org/publications/papers/2020/filter_not_filter/filter_not_...
This paper is a fun Friday afternoon read!

Below is the current top ten "RPKI invalid distributor" ASNs as seen
from AS 199036:

   RPKI invalid routes | Transiting Autonomous System
   --------------------+-----------------------------
                 2,224 | AS6461 - Zayo
                 2,094 | AS3320 - Deutsche Telekom
                 1,989 | AS8220 - Colt
                 1,976 | AS5511 - Orange
                 1,924 | AS6762 - Telecom Italia
                 1,613 | AS1273 - Vodafone
                   573 | AS6453 - Tata
                   436 | AS6939 - Hurricane Electric
                   425 | AS6830 - Liberty Global
                   355 | AS3491 - PCCW
     (rough estimates as of April 26th, 2021)

Cogent (AS 174) isn't even in the global top ten RPKI Invalids
distributors! :-) Banana for scale: in 2018-2019 the top ten was
distributing between 5,000 and 6,000 unique RPKI invalid routes.

Many in the community deploying RPKI consider an RPKI deployment
'functionally complete' when a transit network dives below propagating
~ 30% of the total of DFZ invalids (and manages to stay there).

The gap of ~ 1,600 prefixes between Zayo/Deutsche Telekom and the group
of ASNs propagating fewer than 600 is the difference between not
rejecting invalids on any EBGP session, and rejecting invalids on most
EBGP sessions.
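
For those who want to check the arithmetic behind that gap, a small
sketch using the counts from the table above. The variable and function
names are made up; the numbers are the same rough estimates as of
April 26th, 2021.

```python
# Counts copied from the top ten table (rough estimates).
invalids = {
    "AS6461 - Zayo": 2224,
    "AS3320 - Deutsche Telekom": 2094,
    "AS8220 - Colt": 1989,
    "AS5511 - Orange": 1976,
    "AS6762 - Telecom Italia": 1924,
    "AS1273 - Vodafone": 1613,
    "AS6453 - Tata": 573,
    "AS6939 - Hurricane Electric": 436,
    "AS6830 - Liberty Global": 425,
    "AS3491 - PCCW": 355,
}

# The group of ASNs propagating fewer than 600 invalid routes.
low_group = {name: n for name, n in invalids.items() if n < 600}
largest_in_low_group = max(low_group.values())


def gap_to_low_group(name: str) -> int:
    """Difference between one ASN and the top of the sub-600 group."""
    return invalids[name] - largest_in_low_group


if __name__ == "__main__":
    for name in ("AS6461 - Zayo", "AS3320 - Deutsche Telekom"):
        print(f"{name}: {gap_to_low_group(name):,} more than the low group")
```

Measured against Tata at the top of the lower group, the gaps are 1,651
(Zayo) and 1,521 (Deutsche Telekom), hence "~ 1,600".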

How does one end up deploying RPKI ROV on most, but not all EBGP sessions?

In the last few years HUNDREDS of RPKI-related software defects have
been uncovered in BGP implementations. Some bugs are cosmetic in nature,
other bugs are of the "if you enable RPKI, the entire router crashes"
severity level. When bugs are identified and fixed, it'll take
additional time for the QA process to complete and deployment to be
scheduled. On top of that some operators only have one or two software
maintenance windows per router per year. Sometimes workarounds are
available, but those often aren't as seamless or proactive as one
would want them to be.

At that point a backbone operator has to make a choice: do they proceed
to deploy RPKI on all remaining (non-crashing) routers, or
rollback/postpone/cancel their plans for RPKI ROV?

As RPKI ROV is an optional incrementally deployable mechanism, many
backbone operators arrived at the conclusion that a 95% deployment
offers more benefits than no RPKI deployment at all. :-)

Simply put: in any sufficiently large network, there will always be a
bunch of routers that (temporarily) can't do RPKI ROV for some reason or
technical caveat. It is quite rare (and unlikely) to see a global
transit provider propagate zero RPKI invalid routes at all times.

Kind regards,

Job

Job Snijders