Hi Robert, NANOG,

On Mon, Apr 26, 2021 at 09:29:27AM -0400, Robert Blayzor via NANOG wrote:
> According to Cloudflares isbgpsafeyet.com, Cogent has been considered "safe" and is filtering invalids.
>
> But I have found that to be untrue (mostly). It appears that some days they filter IPv4, sometimes not, and IPv6 invalids are always coming through. I know it's Cogent, but curious as to what others are seeing.

[ Disclaimer: I'm not affiliated with the companies referenced in the above message. But as I love talking about RPKI, I'd like to share some perspective based on my own experience with both small and large scale RPKI deployments. ]

TL;DR - RPKI Route Origin Validation (ROV) is incrementally deployed inside networks, and incrementally across the Default-Free Zone. This means right now (and for years to come), operators will see RPKI invalid routes spill through the cracks of the global routing system. This is expected and unavoidable.

Details
-------

There are a few caveats to consider when using the isbgpsafeyet.com testing utility to determine whether a network is doing RPKI ROV with 'invalid == reject' EBGP policies.

The isbgpsafeyet.com beacon prefixes are anycasted from many vantage points, which 'skews' the test results in some ways. Imagine the prefixes being anycasted from (hypothetically) 100 POPs: this essentially amounts to 100 attempts to propagate RPKI invalid routes into the default-free zone. Only a single route (out of the 100) needs to slip past any potential 'invalid == reject' barriers between the test site and the visitor.

The Cloudflare test essentially goes out of its way to circumvent RPKI filters, but at the same time is easily fooled in the presence of default routes (0.0.0.0/0 + ::/0).

To get a broader sense of how one's local internet connection is impacted by RPKI, compare traceroutes to 103.21.244.15 with traceroutes to 1.1.1.1 - if the first trace takes a bit of a detour compared to the second, it might indicate that only one (or a few) routers in a global IP backbone are not RPKI-capable.

In addition to the CF test, I recommend also trying similar but alternative tools, such as https://sg-pub.ripe.net/jasper/rpki-web-test/

The ripe.net test is *not* anycasted and is single-homed behind a transit-free carrier; this too skews the results in some way.
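
As a rough illustration of the anycast caveat above, here is a toy model in Python. It assumes each of the anycast announcements independently has the same small chance of slipping past the filters on its way to the visitor. Both the independence and the 5% figure are assumptions made up for this sketch, not measurements.

```python
def p_any_leak(p_single: float, n_origins: int) -> float:
    """Chance that at least one of n_origins invalid announcements
    reaches the visitor, given each one leaks with probability p_single.

    Assumes the announcements leak independently of each other, which
    real BGP paths do not (they share upstreams); illustration only.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_origins < 0:
        raise ValueError("n_origins must not be negative")
    # Complement of "every single announcement was filtered".
    return 1.0 - (1.0 - p_single) ** n_origins


if __name__ == "__main__":
    # Hypothetical 5% leak chance per announcement.
    for n in (1, 10, 100):
        print(f"{n:>3} origins -> {p_any_leak(0.05, n):.3f}")
```

Even with a small per-announcement leak chance, the result gets close to 1 as the number of origins grows, which is why a heavily anycasted test prefix can show up as reachable behind a network that rejects invalids on nearly all of its sessions.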

Another test can be done by pinging the RIPE RIS "Resource Certification (RPKI) Routing Beacons" listed at the bottom of this page: https://www.ripe.net/analyse/internet-measurements/routing-information-servi...

And yet another way of measuring to what degree RPKI ROV has been deployed in an individual AS or the DFZ as a whole is by looking at BGP data.

The NLNOG RING LG (AS 199036, http://lg.ring.nlnog.net/summary/lg01/ipv4) receives tens of full table feeds from various BGP speakers around the planet. Every few hours a script takes a snapshot of the LG's Local RIB and applies the RFC 6811 Origin Validation procedure to all paths, and for a select few ASNs stores the list of prefixes.

Cecilia Testart et al. did a thorough study using similar methodology: https://www.caida.org/publications/papers/2020/filter_not_filter/filter_not_... This paper is a fun Friday afternoon read!

Below is the current top ten "RPKI invalid distributor" ASNs as seen from AS 199036:

    RPKI invalid routes | Transiting Autonomous System
    --------------------+-----------------------------
                  2,224 | AS6461 - Zayo
                  2,094 | AS3320 - Deutsche Telekom
                  1,989 | AS8220 - Colt
                  1,976 | AS5511 - Orange
                  1,924 | AS6762 - Telecom Italia
                  1,613 | AS1273 - Vodafone
                    573 | AS6453 - Tata
                    436 | AS6939 - Hurricane Electric
                    425 | AS6830 - Liberty Global
                    355 | AS3491 - PCCW
    (rough estimates as of April 26th, 2021)

Cogent (AS 174) isn't even in the global top ten RPKI Invalids distributors! :-)

Banana for scale: in 2018-2019 the top ten was distributing between 5,000 and 6,000 unique RPKI invalid routes. Many in the community deploying RPKI consider an RPKI deployment 'functionally complete' when a transit network dives below propagating ~ 30% of the total of DFZ invalids (and manages to stay there).

The gap of ~ 1,600 prefixes between Zayo/Deutsche Telekom - and the group of ASNs propagating fewer than 600 - is the difference between not rejecting invalids on any EBGP session, and rejecting invalids on most EBGP sessions.
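
For readers who haven't looked at it closely, the RFC 6811 Origin Validation procedure mentioned above boils down to a small decision function. Below is a minimal Python sketch, not the script that runs against the LG. The `VRP` tuple and the `validate` function are names invented for this example, the prefixes and ASNs come from the documentation ranges, and AS_SET origins (which RFC 6811 treats as an origin of "NONE") are left out for brevity.

```python
import ipaddress
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """Validated ROA Payload: prefix, max length and authorized origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Return 'valid', 'invalid' or 'not-found' for one route."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # Only compare IPv4 with IPv4 and IPv6 with IPv6.
        if vrp_net.version != route.version:
            continue
        # A VRP "covers" the route if the route lies inside its prefix.
        if not route.subnet_of(vrp_net):
            continue
        covered = True
        # A covering VRP "matches" if the route is not more specific than
        # max_length and the origin AS is the same. AS 0 never matches.
        if (route.prefixlen <= vrp.max_length
                and vrp.asn == origin_asn
                and vrp.asn != 0):
            return "valid"
    # Covered by at least one VRP but matched by none: invalid.
    return "invalid" if covered else "not-found"


# Documentation prefixes and ASNs, purely illustrative.
EXAMPLE_VRPS = [
    VRP("192.0.2.0/24", 24, 64496),
    VRP("2001:db8::/32", 48, 64496),
]

if __name__ == "__main__":
    for prefix, asn in [
        ("192.0.2.0/24", 64496),     # matching origin and length
        ("192.0.2.0/24", 64497),     # wrong origin AS
        ("192.0.2.0/25", 64496),     # more specific than max_length
        ("198.51.100.0/24", 64496),  # no covering VRP at all
    ]:
        print(prefix, asn, validate(prefix, asn, EXAMPLE_VRPS))
```

Counting the 'invalid' results per transiting ASN in the AS_PATH is roughly how one would arrive at a table like the one above.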

How does one end up deploying RPKI ROV on most, but not all EBGP sessions?

In the last few years HUNDREDS of RPKI-related software defects have been uncovered in BGP implementations. Some bugs are cosmetic in nature, other bugs are of the "if you enable RPKI, the entire router crashes" severity level.

When bugs are identified and fixed, it takes additional time for the QA process to complete and for deployment to be scheduled. On top of that, some operators only have one or two software maintenance windows per router per year. Sometimes workarounds are available, but those often aren't as seamless or proactive as one would want them to be.

At that point a backbone operator has to make a choice: do they proceed to deploy RPKI on all remaining (non-crashing) routers, or roll back/postpone/cancel their plans for RPKI ROV?

As RPKI ROV is an optional, incrementally deployable mechanism, many backbone operators arrived at the conclusion that a 95% deployment offers more benefits than no RPKI deployment at all. :-)

Simply put: in any sufficiently large network, there will always be a bunch of routers that (temporarily) can't do RPKI ROV for some reason or technical caveat. It is quite rare (and unlikely) to see a global transit provider propagate zero RPKI invalid routes at all times.

Kind regards,

Job