Hello,

On Tue, 27 Jul 2021 at 21:02, heasley <heas@shrubbery.net> wrote:
But I have to emphasize that all those are just examples. Unknown bugs or corner cases can lead to similar behavior in "all in one" daemons like Fort and Routinator. That's why specific improvements absolutely do not mean we don't have to monitor the RTR servers.
I am not convinced that I want the RTR server to be any smarter than necessary, and I think expiration handling is too smart. I want it to load the VRPs provided and serve them, no more.
Leave expiration to the validator and monitoring of both to the NMS and other means.
While I'm all for KISS, the expiration feature makes sure that the cryptographic validity in the ROAs is respected not only on the validator, but also on the RTR server. This is necessary, because there is nothing in the RTR protocol that indicates the expiration, and this change brings it at least into the JSON exchange between validator and RTR server. It's like TTL in DNS, and it's about respecting the wishes of the authority (CA and ROA resource holder).
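To make the expiration idea concrete, here is a rough Python sketch of what an RTR server could do with a per-VRP expiry carried in the JSON it loads. The JSON layout and the field names (`roas`, and `expires` as a Unix timestamp) are assumptions for illustration only, not a definitive description of any particular validator's output:

```python
import json

def unexpired_vrps(document, now):
    """Return the VRPs whose 'expires' timestamp is still in the future.

    Assumed (hypothetical) layout:
    {"roas": [{"prefix": ..., "asn": ..., "maxLength": ..., "expires": ...}]}
    Entries without an 'expires' field are kept, since nothing says they expired.
    """
    data = json.loads(document)
    return [v for v in data.get("roas", []) if v.get("expires", now + 1) > now]

# Documentation prefixes and private-use ASNs, purely as sample data.
sample = json.dumps({"roas": [
    {"prefix": "192.0.2.0/24", "asn": 64496, "maxLength": 24, "expires": 2000},
    {"prefix": "198.51.100.0/24", "asn": 64497, "maxLength": 24, "expires": 1000},
]})

# At time 1500 the second entry has already expired and is no longer served.
print([v["prefix"] for v in unexpired_vrps(sample, now=1500)])
```

If the validator stops producing fresh output, a server filtering like this serves fewer and fewer VRPs over time instead of serving the last file forever.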
The delegations should not be changing quickly[1] enough
How do you come to this conclusion? If I decide I'd like to originate a /24 out of my aggregate, for DDoS mitigation purposes, why shouldn't I be able to update my ROA and expect quasi-complete convergence in 1 or 2 hours?
for me to prefer expiration over the grace period to correct a validator problem. That does not prevent an operator from using other means to share fate; eg: if the validator fails completely for 2 hours, stop the RTR server.
I perceive this to be choosing stability in the RTR sessions over timeliness of updates. And, if a 15 - 30 minute polling interval is reasonable, why isn't 8 - 24 hours?
Well for one, I'd like my ROAs to propagate in 1 or 2 hours. If I need to wait for 24 hours, then this could cause operational issues for me (the DDoS mitigation case above for example, or just any other normal routing change).

The entire RPKI system is designed to fail open, so if you have multiple failures and *all* your RTR servers go down, the worst case is that the routes on the BGP routers turn NotFound, so you'd lose the benefit of RPKI validation. It's *way* *way* more harmful to have obsolete VRPs on your routers. If it's just a few hours, then the impact will probably not be catastrophic. But what if it's 36 hours, 72 hours? What if the rpki-validation started failing 2 weeks ago, when Jerry from IT ("the linux guy") started his vacation?

On the other hand, if only one (of multiple) validator/rtr instances has a problem and the number of VRPs slowly goes down, nothing will happen at all on your routers, as they just use the union of the RTR endpoints and the VRPs from the broken RTR server will slowly be withdrawn. Your router will keep using healthy RTR servers, as opposed to considering erroneous data from a poisoned RTR server.

I define stability not as "RTR session uptime and VRP count", but whether or not my BGP routers are making correct or wrong decisions.
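The union argument can be sketched in a few lines of Python. This is only a toy model under my own assumptions: VRPs are plain (prefix, max length, origin ASN) tuples, the sample values are made up, and `router_view` is a hypothetical helper, not how any router implements it:

```python
def router_view(*rtr_sets):
    """Union of the VRP sets learned from every RTR session."""
    merged = set()
    for vrps in rtr_sets:
        merged |= set(vrps)
    return merged

# VRPs as (prefix, max_length, origin_asn) tuples; values are documentation examples.
healthy = {("192.0.2.0/24", 24, 64496), ("198.51.100.0/24", 24, 64497)}

# A broken server that honours expiration only ever shrinks towards the empty set.
degraded = {("192.0.2.0/24", 24, 64496)}

# A hung server without expiration keeps serving an entry that was since removed.
stale = healthy | {("203.0.113.0/24", 24, 64511)}

print(router_view(healthy, degraded) == healthy)  # shrinking server changes nothing
print(router_view(healthy, stale) == healthy)     # stale server pollutes the union
```

The shrinking server is harmless as long as one healthy server remains, while the stale one injects an obsolete VRP into the router's view no matter how many healthy servers sit next to it.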
I too prefer an approach where the validator and RTR are separate but co-habitated, but this naturally increases the possibility that the two might serve different data due to reachability, validator run-time, .... To what extent differences occur, I have not measured.
[1] The NIST ROA graph confirms the rate of change is low, as I would expect. But, I have no statistic for ROA stability, considering only the prefix and origin.
I don't see how the rate of global ROA changes is in any way related to this issue. The operational issue a hung RTR endpoint creates for other people's networks can't be measured with this.

lukas