Hello,

On Mon, 26 Jul 2021 at 11:40, Mark Tinka <mark@tinka.africa> wrote:
> I can count, on my hands, the number of RPKI-related outages that we have experienced, and all of them have turned out to be a misunderstanding of how ROAs work, either by customers or some other network on the Internet. The good news is that all of those cases were resolved within a few hours of notifying the affected party.
That's good, but the understanding of operational issues in RPKI systems in the wild is underwhelming; we are bound to repeat the mistakes of DNS all over again. Yes, in theory a complete failure of an RTR server does not have big negative effects on networks. But a failure of RPKI validation behind a separate RTR server can leave outdated VRPs on the routers, just as RTR server bugs can, which is why monitoring not only availability but also the actual freshness of the data is *very* necessary. Here are some examples (both from the operator's POV and actual failure scenarios):

https://mailman.nanog.org/pipermail/nanog/2020-August/208982.html
> we are at fault for not deploying the validation service in a redundant setup and for failing at monitoring the service. But we did so because we thought it not to be too important, because a failed validation service should simply lead to no validation, not a crashed router.
In this case an RTR client bug crashed the router. But the point is that it is not widely understood that setting up RPKI validators and RTR servers is a serious endeavor, and that monitoring them is not optional.

https://github.com/cloudflare/gortr/issues/82
> we noticed that one of the ROAs was wrong. When I pulled output.json from octorpki (/output.json), it had the correct value. However, when I ran rtrdump, it had a different ASN value for the prefix. Restarting the gortr process did fix it. Sending SIGHUP did not.
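Divergence like this is cheap to catch mechanically, by the way. Here is a rough sketch of the idea in Python (hypothetical: it assumes both the validator export and the rtrdump output use the common {"roas": [...]} JSON layout; adjust the field names to your tooling):

#!/usr/bin/env python3
# Hypothetical sketch: diff a validator's JSON export against an RTR
# server's view (e.g. rtrdump output) to catch this kind of divergence.
# Assumes both files use the {"roas": [{"prefix": ..., "asn": ...,
# "maxLength": ...}]} layout; adjust for your tooling.
import json
import sys

def load_vrps(path):
    with open(path) as f:
        return {(r["prefix"], str(r["asn"]), r["maxLength"])
                for r in json.load(f)["roas"]}

validator = load_vrps(sys.argv[1])  # e.g. octorpki's output.json
rtr = load_vrps(sys.argv[2])        # e.g. rtrdump's output

for vrp in sorted(validator - rtr):
    print("missing from RTR server:", vrp)
for vrp in sorted(rtr - validator):
    print("stale/extra on RTR server:", vrp)

sys.exit(0 if validator == rtr else 2)  # nagios-style: 2 == CRITICAL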
https://github.com/RIPE-NCC/rpki-validator-3/issues/264
> yesterday we saw an unexpected ROA propagation delay.
>
> After updating a ROA in the RIPE lirportal, NTT, Telia and Cogent saw the update within an hour, but a specific rpki validator 3.1-2020.08.06.14.39 in a third party network did not converge for more than 4 hours.
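To make the freshness-monitoring point concrete: at the RTR layer the cheapest signal is the serial number, which a periodically running validator should bump all the time. A rough, hypothetical sketch of such a check in Python (host, port, state file and threshold are made up, error handling omitted; this is the idea, not the actual rtrcheck logic):

#!/usr/bin/env python3
# Sketch of a "stalled serial" check against an RTR server
# (RFC 6810 / RFC 8210): fetch the current serial, compare it to the
# one recorded on the previous run, and alarm if it has not moved
# for too long.
import json
import os
import socket
import struct
import sys
import time

HOST, PORT = "rtr.example.net", 8282  # hypothetical RTR server
STATE = "/var/tmp/rtr_serial.json"    # remembers the last serial seen
MAX_AGE = 3600                        # seconds before a frozen serial alarms

def current_serial():
    s = socket.create_connection((HOST, PORT), timeout=10)
    # Reset Query: version 0, PDU type 2, zero, length 8 (RFC 6810)
    s.sendall(struct.pack("!BBHI", 0, 2, 0, 8))
    while True:
        hdr = s.recv(8, socket.MSG_WAITALL)
        _ver, ptype, _sess, length = struct.unpack("!BBHI", hdr)
        body = s.recv(length - 8, socket.MSG_WAITALL) if length > 8 else b""
        if ptype == 7:  # End of Data PDU; first payload field is the serial
            s.close()
            return struct.unpack("!I", body[:4])[0]

serial = current_serial()
now = time.time()
prev = {}
if os.path.exists(STATE):
    with open(STATE) as f:
        prev = json.load(f)

if prev.get("serial") != serial:
    with open(STATE, "w") as f:
        json.dump({"serial": serial, "changed": now}, f)
    print("OK: serial %d (changed)" % serial)
    sys.exit(0)

age = now - prev.get("changed", now)
if age > MAX_AGE:
    print("CRITICAL: serial %d unchanged for %.0fs" % (serial, age))
    sys.exit(2)
print("OK: serial %d unchanged for %.0fs" % (serial, age))
sys.exit(0)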
I wrote a naive nagios script along these lines to check for stalled serials on an RTR server:

https://github.com/lukastribus/rtrcheck

and talked about it in this blog post (shameless plug):

https://labs.ripe.net/author/lukas_tribus/rpki-rov-about-stale-rtr-servers-a...

This is on the validation/network side. On the CA side, similar issues apply.

I believe it will take a few more high-profile outages caused by insufficient reliability in the RPKI stacks before people start taking this seriously.

Some specific failure scenarios are currently being addressed, but this doesn't make monitoring optional: rpki-client 7.1 emits a new per-VRP attribute, "expires", which makes it possible for RTR servers to stop considering outdated VRPs:

https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac...

stayrtr (a gortr fork) will consider this attribute in the future:

https://github.com/bgp/stayrtr/issues/3
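As an illustration of what consuming that attribute could look like, a minimal sketch that filters expired VRPs out of a validator's JSON export before an RTR server picks it up (the "expires" field name and the Unix-timestamp semantics are my assumptions based on the commit above; check what your rpki-client version actually emits):

#!/usr/bin/env python3
# Hedged sketch: drop VRPs whose "expires" timestamp has passed, so
# that stale objects are no longer served. Assumes a JSON export with
# a {"roas": [...]} layout and a per-VRP Unix-timestamp "expires"
# field; both are assumptions, adjust to your tooling.
import json
import sys
import time

now = time.time()
data = json.load(sys.stdin)  # e.g. rpki-client's JSON output

fresh = [v for v in data["roas"] if v.get("expires", now + 1) > now]
dropped = len(data["roas"]) - len(fresh)
if dropped:
    print("dropped %d expired VRPs" % dropped, file=sys.stderr)

json.dump({"roas": fresh}, sys.stdout)

cheers,
lukas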