plea for comcast/sprint handoff debug help
tl;dr: comcast: does your 50.242.151.5 westin router receive the announcement of 147.28.0.0/20 from sprint's westin router 144.232.9.61?

details: 3130 in the westin announces 147.28.0.0/19 and 147.28.0.0/20 to sprint, ntt, and the six, and we want to remove the /19.

when we stop announcing the /19, a traceroute to comcast through sprint dies at the handoff from sprint to comcast.

    r0.sea#traceroute 73.47.196.134 source 147.28.7.1
    Type escape sequence to abort.
    Tracing the route to c-73-47-196-134.hsd1.ma.comcast.net (73.47.196.134)
    VRF info: (vrf in name/id, vrf out name/id)
      1 r1.sea.rg.net (147.28.0.5) 0 msec 1 msec 0 msec
      2 sl-mpe50-sea-ge-0-0-3-0.sprintlink.net (144.232.9.61) [AS 1239] 1 msec 1 msec 0 msec
      3 * * *
      4 * * *
      5 * * *
      6 * * *

this would 'normally' (i.e. when the /19 is announced) be

    r0.sea#traceroute 73.47.196.134 source 147.28.7.1
    Type escape sequence to abort.
    Tracing the route to c-73-47-196-134.hsd1.ma.comcast.net (73.47.196.134)
    VRF info: (vrf in name/id, vrf out name/id)
      1 r1.sea.rg.net (147.28.0.5) 0 msec 1 msec 0 msec
      2 sl-mpe50-sea-ge-0-0-3-0.sprintlink.net (144.232.9.61) [AS 1239] 1 msec 0 msec 1 msec
      3 be-207-pe02.seattle.wa.ibone.comcast.net (50.242.151.5) [AS 7922] 1 msec 0 msec 0 msec
      4 be-10847-cr01.seattle.wa.ibone.comcast.net (68.86.86.225) [AS 7922] 1 msec 1 msec 2 msec
      etc

specifically, when 147.28.0.0/19 is announced, traceroute from 147.28.7.2 through sprint works to comcast. withdraw 147.28.0.0/19, leaving only 147.28.0.0/20, and the traceroute enters sprint but fails at the handoff to comcast. Bad next-hop? not propagated? covid? magic? which is why we wonder what comcast (50.242.151.5) hears from sprint at that handoff.

note that, at the minute, both the /19 and the /20 are being announced, as we want things to work. so you will not be able to reproduce.

so, comcast, are you receiving the announcement of the /20 from sprint? with a good next-hop?

randy
tl;dr:
comcast: does your 50.242.151.5 westin router receive the announcement of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
tl;dr: diagnosed by comcast. see our short paper to be presented at imc tomorrow

https://archive.psg.com/200927.imc-rp.pdf

lesson: route origin relying party software may cause as much damage as it ameliorates

randy
Hello,

On Wed, 28 Oct 2020 at 16:58, Randy Bush <randy@psg.com> wrote:
tl;dr: diagnosed by comcast. see our short paper to be presented at imc tomorrow https://archive.psg.com/200927.imc-rp.pdf
lesson: route origin relying party software may cause as much damage as it ameliorates
There is a myth that ROV is inherently fail-safe (it isn't if your production routers have stale VRPs), which leads to the assumption that proper monitoring can be neglected.

I'm working on a shell script using rtrdump to detect stale RTR servers (based on serial changes and the actual data). Of course this would never detect partial failures that affect only some child CAs, but it does detect a hung RTR server (or a standalone RTR server whose validator has stopped validating).

lukas
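As a rough illustration of that check, here is a minimal sketch in Python rather than shell. It assumes the rtrdump tool from the gortr suite can be invoked as 'rtrdump -connect host:port -file -' to dump the served VRPs to stdout; the flag names, the state-file path, and the two-hour threshold are all assumptions for the example, not a tested recipe.

    #!/usr/bin/env python3
    """Detect a stale RTR server by checking whether the VRP set it serves
    still changes over time. Sketch only: the rtrdump invocation below and
    the thresholds are assumptions, adjust them to your environment."""

    import hashlib
    import json
    import subprocess
    import sys
    import time
    from pathlib import Path

    RTR_SERVER = "rtr.example.net:3323"            # assumed RTR endpoint
    STATE_FILE = Path("/var/tmp/rtr-staleness.json")
    MAX_UNCHANGED = 2 * 3600                       # alert after 2h without change

    def dump_vrps() -> bytes:
        # Assumption: rtrdump accepts -connect and writes JSON to stdout with '-file -'.
        out = subprocess.run(["rtrdump", "-connect", RTR_SERVER, "-file", "-"],
                             capture_output=True, timeout=60)
        if out.returncode != 0:
            sys.exit(f"ALERT: rtrdump failed against {RTR_SERVER}: {out.stderr.decode()!r}")
        return out.stdout

    def main() -> None:
        digest = hashlib.sha256(dump_vrps()).hexdigest()
        now = time.time()
        state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

        if state.get("digest") != digest:
            # VRP set changed since the last run: remember digest and timestamp.
            STATE_FILE.write_text(json.dumps({"digest": digest, "changed": now}))
            return

        if now - state.get("changed", now) > MAX_UNCHANGED:
            # Same VRP set for too long: the validator feeding this RTR server
            # has probably stopped producing updates.
            sys.exit(f"ALERT: VRP set from {RTR_SERVER} unchanged for "
                     f"{int(now - state['changed'])} seconds")

    if __name__ == "__main__":
        main()

Run from cron, this catches a hung RTR server or a validator that has silently stopped feeding it; a genuinely static global VRP set would also trigger it, but in practice the data changes many times per hour.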
On 28 Oct 2020, at 16:58, Randy Bush <randy@psg.com> wrote:
tl;dr:
comcast: does your 50.242.151.5 westin router receive the announcement of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
tl;dr: diagnosed by comcast. see our short paper to be presented at imc tomorrow https://archive.psg.com/200927.imc-rp.pdf
lesson: route origin relying party software may cause as much damage as it ameliorates
randy
To clarify this for the readers here: there is an ongoing research experiment where connectivity to the RRDP and rsync endpoints of several RPKI publication servers is being purposely enabled and disabled for prolonged periods of time. This is perfectly fine of course.

While the resulting paper presented at IMC is certainly interesting, having relying party software fall back to rsync when RRDP is unavailable is not a requirement specified in any RFC, as the paper seems to suggest. In fact, we argue that it's actually a bad idea to do so:

https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/

We're interested to hear views on this from both an operational and security perspective.

-Alex
tl;dr:
comcast: does your 50.242.151.5 westin router receive the announcement of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
tl;dr: diagnosed by comcast. see our short paper to be presented at imc tomorrow https://archive.psg.com/200927.imc-rp.pdf
lesson: route origin relying party software may cause as much damage as it ameliorates
randy
To clarify this for the readers here: there is an ongoing research experiment where connectivity to the RRDP and rsync endpoints of several RPKI publication servers is being purposely enabled and disabled for prolonged periods of time. This is perfectly fine of course.
While the resulting paper presented at IMC is certainly interesting, having relying party software fall back to rsync when RRDP is unavailable is not a requirement specified in any RFC, as the paper seems to suggest. In fact, we argue that it's actually a bad idea to do so:
https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
We're interested to hear views on this from both an operational and security perspective.
in fact, <senior op at an isp> has found your bug. if it finds an http server, but that server is not serving the new and not-required rrdp protocol, it does not then fall back to the mandatory-to-implement rsync.

randy
i'll see your blog post and raise you a peer reviewed academic paper and two rfcs :)

in dnssec, we want to move from the old mandatory to implement (mti) rsa signatures to the more modern ecdsa. how would the world work out if i fielded a validating dns cache server which *implemented* rsa, because it is mti, but chose not to actually *use* it for validation on odd numbered wednesdays because of my religious belief that ecdsa is superior?

perhaps go over to your unbound siblings and discuss this analog.

but thanks for your help in getting jtk's imc paper accepted. :)

randy
On 30 Oct 2020, at 01:10, Randy Bush <randy@psg.com> wrote:
i'll see your blog post and raise you a peer reviewed academic paper and two rfcs :)
For the readers wondering what is going on here: there is a reason there is only a vague mention of two RFCs instead of the specific paragraph where it says that Relying Party software must fall back to rsync immediately if RRDP is temporarily unavailable. That is because this section doesn't exist.

The point is that there is no bug and in fact, Routinator has a carefully thought out strategy to deal with transient outages. Moreover, we argue that our strategy is the better choice, both operationally and from a security standpoint.

The paper shows that Routinator is the most used RPKI relying party software, and we know many of you here rely on it for route origin validation in a production environment. We take this responsibility and therefore this matter very seriously, and would not want you to think we have been careless in our software design. Quite the opposite.

We have made several attempts within the IETF to have a discussion on technical merit, where aspects such as overwhelming an rsync server with traffic, or using aggressive fallback to rsync as an entry point to a downgrade attack, have been brought forward. Our hope was that our arguments would be considered on technical merit, but that has not happened yet.

Be that as it may, operators can rest assured that if consensus goes against our logic, we will change our design.
perhaps go over to your unbound siblings and discuss this analog.
The mention of the Unbound DNS resolver in this context is interesting, because we have in fact discussed our strategy with the developers on this team, as there is a lot to be learned from other standards and operational experiences.

We feel very strongly about this matter because the claim that using our software negatively affects Internet routing robustness strikes at the core of NLnet Labs' existence: our reputation and our mission to work for the good of the Internet. These are the core values that make it possible for a not-for-profit foundation like ours to make free, liberally licensed open source software.

We're proud of what we've been able to achieve and look forward to a continued open discussion with the community.

Respectfully,

Alex
Alex:

When I follow the RFC rabbit hole:

RFC 6481, A Profile for Resource Certificate Repository Structure:

   The publication repository MUST be available using rsync
   [RFC5781] [RSYNC].  Support of additional retrieval mechanisms
   is the choice of the repository operator.  The supported
   retrieval mechanisms MUST be consistent with the accessMethod
   element value(s) specified in the SIA of the associated CA or
   EE certificate.

Then, RFC 8182, The RPKI Repository Delta Protocol (RRDP):

   This document allows the use of RRDP as an additional repository
   distribution mechanism for RPKI.  In time, RRDP may replace rsync
   [RSYNC] as the only mandatory-to-implement repository distribution
   mechanism.  However, this transition is outside of the scope of
   this document.
Is it not the case, then, that rsync is currently still mandatory, even if RRDP is in place? Or is there a more recent RFC defining the transition that I did not locate?

On Fri, Oct 30, 2020 at 7:49 AM Alex Band <alex@nlnetlabs.nl> wrote:
On 30 Oct 2020, at 01:10, Randy Bush <randy@psg.com> wrote:
i'll see your blog post and raise you a peer reviewed academic paper and two rfcs :)
For the readers wondering what is going on here: there is a reason there is only a vague mention of two RFCs instead of the specific paragraph where it says that Relying Party software must fall back to rsync immediately if RRDP is temporarily unavailable. That is because this section doesn't exist. The point is that there is no bug and in fact, Routinator has a carefully thought out strategy to deal with transient outages. Moreover, we argue that our strategy is the better choice, both operationally and from a security standpoint.
The paper shows that Routinator is the most used RPKI relying party software, and we know many of you here rely on it for route origin validation in a production environment. We take this responsibility and therefore this matter very seriously, and would not want you to think we have been careless in our software design. Quite the opposite.
We have made several attempts within the IETF to have a discussion on technical merit, where aspects such as overwhelming an rsync server with traffic, or using aggressive fallback to rsync as an entry point to a downgrade attack, have been brought forward. Our hope was that our arguments would be considered on technical merit, but that has not happened yet. Be that as it may, operators can rest assured that if consensus goes against our logic, we will change our design.
perhaps go over to your unbound siblings and discuss this analog.
The mention of Unbound DNS resolver in this context is interesting, because we have in fact discussed our strategy with the developers on this team as there is a lot to be learned from other standards and operational experiences.
We feel very strongly about this matter because the claim that using our software negatively affects Internet routing robustness strikes at the core of NLnet Labs’ existence: our reputation and our mission to work for the good of the Internet. They are the core values that make it possible for a not-for-profit foundation like ours to make free, liberally licensed open source software.
We’re proud of what we’ve been able to achieve and look forward to a continued open discussion with the community.
Respectfully,
Alex
On Fri, Oct 30, 2020 at 12:47:44PM +0100, Alex Band wrote:
On 30 Oct 2020, at 01:10, Randy Bush <randy@psg.com> wrote:

i'll see your blog post and raise you a peer reviewed academic paper and two rfcs :)
For the readers wondering what is going on here: there is a reason there is only a vague mention of two RFCs instead of the specific paragraph where it says that Relying Party software must fall back to rsync immediately if RRDP is temporarily unavailable. That is because this section doesn’t exist.
*skeptical face*

Alex, you got it backwards: the section that does not exist is the one saying to *not* fall back to rsync. On the other hand, there are ample RFC sections which outline that rsync is the mandatory-to-implement protocol. It starts at RFC 6481 Section 3: "The publication repository MUST be available using rsync". Even the RRDP RFC itself (RFC 8182) describes that RSYNC and RRDP *co-exist*.

I think this co-existence was factored into both the design of RPKI-over-RSYNC and subsequently RPKI-over-RRDP. An rsync publication point does not become invalid because of the demise of a once-upon-a-time valid RRDP publication point.

Only a few weeks ago a large NIR (IDNIC) disabled their RRDP service because somehow the RSYNC and RRDP repositories were out of sync with each other. The RRDP service remained disabled for a number of days until they repaired their RPKI Certificate Authority service. I suppose that during this time, Routinator was unable to receive any updates related to the IDNIC CA (pinned to RRDP because of a successful fetch prior to the partial IDNIC RPKI outage). This in turn deprived the IDNIC subordinate Resource Holders of the ability to update their Route Origin Authorization attestations (from Routinator's perspective).

Given that RRDP is an *optional* protocol in the RPKI stack, it doesn't make sense to me to strictly pin fetching operations to RRDP: over time (months, years), a CA could enable / disable / enable / disable RRDP service, while listing the RRDP URI as a valid SIA, amongst other valid SIAs.

An analogy to DNS: a website operator may add AAAA records to indicate IPv6 reachability, but over time may also remove the AAAA record if there (temporarily) is some kind of issue with the IPv6 service. The Internet operations community of course encourages everyone to add AAAA records, and the Happy Eyeballs concept for a long time even *favored* IPv6 over IPv4 to help improve IPv6 adoption, but a dual-stack browser will always try to make use of the redundancy that exists through the two address families.

RSYNC and RRDP should be viewed in a similar context as v4 vs v6, but unlike with IPv4 and IPv6, I am convinced that RSYNC can be deprecated in the span of 3 or 4 years; the draft-sidrops-bruijnzeels-deprecate-rsync document is helping towards that goal!
Be that as it may, operators can rest assured that if consensus goes against our logic, we will change our design.
Please change the implementation a little bit (0.8.1). I think it is too soon for the Internet-wide 'rsync to RRDP' migration project to be declared complete and successful, and this actually hampers the transition to RRDP.

Pinning to RRDP *forever* violates the principle of least astonishment in a world where draft-sidrops-bruijnzeels-deprecate-rsync-00 was published only as recently as November 2019. That draft now is a working group document, and it will probably take another 1 or 2 years before it is published as an RFC.

Section 5 of 'draft-deprecate-rsync' says RRDP *SHOULD* be used when it is available. Thus it logically follows that when it is not available, the lowest common denominator is to be used: rsync. After all, the Issuing CA put an RSYNC URI in the 'Subject Information Access' (SIA). Who knows better than the CA? The ability to publish routing intentions, and for others to honor the intentions of the CA, is what RPKI is all about. When the CA says delegated RPKI data is available at both an RSYNC URI and an RRDP URI, both are valid network entry points to the publication point. The resource holder's X.509 signature is even on those 'reference to there' directions (URIs)! :-)

If I can make a small suggestion: make 0.8.1 fall back to rsync after waiting an hour or so (meanwhile polling to see if the RRDP service restores). This way the network operator takes advantage of both transport protocols, whichever is available, with a clear preference to try RRDP first, then eventually rsync.

RPKI was designed in such a way that it can be transported even over printed paper, usb stick, bluetooth, vinyl, rsync, and also https (as rrdp). Because RPKI data is signed using the X.509 framework, the transportation method really is irrelevant. IP holders can publish RPKI data via horse + cart, and still make productive use of it!

Routinator's behavior is not RFC compliant, and has tangible effects in the default-free zone.

Regards,

Job
On Thu, Oct 29, 2020 at 09:14:16PM +0100, Alex Band wrote:
In fact, we argue that it's actually a bad idea to do so:
https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
We're interested to hear views on this from both an operational and security perspective.
I don't see a compelling reason to not use rsync when RRDP is unavailable.

Quoting from the blog post:

"While this isn’t threatening the integrity of the RPKI – all data is cryptographically signed making it really difficult to forge data – it is possible to withhold information or replay old data."

RRDP does not solve the issue of withholding data or replaying old data. The RRDP protocol /also/ is unauthenticated, just like rsync. The RRDP protocol basically is rsync wrapped in XML over HTTPS.

Withholding of information is detected through verification of RPKI manifests (something Routinator didn't verify up until last week!), and replaying of old data is addressed by checking validity dates and CRLs (something Routinator also didn't do until last week!).

Of course I see advantages to this industry mainly using RRDP, but those are not security advantages. The big migration towards RRDP can happen somewhere in the next few years.

The arguments brought forward in the blog post don't make sense to me. The '150,000' number in the blog post seems a number pulled from thin air.

Regards,

Job
Hi Job, all,
On 30 Oct 2020, at 11:06, Job Snijders <job@ntt.net> wrote:
On Thu, Oct 29, 2020 at 09:14:16PM +0100, Alex Band wrote:
In fact, we argue that it's actually a bad idea to do so:
https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
We're interested to hear views on this from both an operational and security perspective.
I don't see a compelling reason to not use rsync when RRDP is unavailable.
Quoting from the blog post:
"While this isn’t threatening the integrity of the RPKI – all data is cryptographically signed making it really difficult to forge data – it is possible to withhold information or replay old data."
RRDP does not solve the issue of withholding data or replaying old data. The RRDP protocol /also/ is unauthenticated, just like rsync. The RRDP protocol basically is rsync wrapped in XML over HTTPS.
Withholding of information is detected through verification of RPKI manifests (something Routinator didn't verify up until last week!), and replaying of old data is addressed by checking validity dates and CRLs (something Routinator also didn't do until last week!).
Of course I see advantages to this industry mainly using RRDP, but those are not security advantages. The big migration towards RRDP can happen somewhere in the next few years.
Routinator does TLS verification when it encounters an RRDP repository. If the repository cannot be reached, or its HTTPS certificate is somehow invalid, it will use rsync instead. It's only after it has found a *valid* HTTPS connection that it refuses to fall back.

There is a security angle here. Malicious-in-the-middle attacks can lead an RP to a bogus HTTPS server and force the software to downgrade to rsync, which has no channel security. The software can then be given old data (new ROAs can be withheld), or the attacker can simply withhold a single object. With the stricter publication point completeness validation introduced by RFC6486-bis this will lead to the rejection of all ROAs published there. The result is the exact same problem that Randy et al.'s research pointed at. If there is a covering less specific ROA issued by a parent, this will then result in RPKI invalid routes.

The fall-back may help in cases where there is an accidental outage of the RRDP server (for as long as the rsync servers can deal with the load), but it increases the attack surface for repositories that keep their RRDP server available.

Regards,

Tim
The arguments brought forward in the blog post don't make sense to me. The '150,000' number in the blog post seems a number pulled from thin air.
Regards,
Job
If there is a covering less specific ROA issued by a parent, this will then result in RPKI invalid routes.
i.e. the upstream kills the customer. not a wise business model.
The fall-back may help in cases where there is an accidental outage of the RRDP server (for as long as the rsync servers can deal with the load)
folk try different software, try different configurations, realize that having their CA gooey exposed because they wanted to serve rrdp and block, ...

randy, finding the fort rp to be pretty solid!
As I've pointed out to Randy and others, and I'll share here.

We planned, but hadn't yet upgraded, our Routinator RP (Relying Party) software to the latest v0.8, which I knew had some improvements. I assumed the problems we were seeing would be fixed by the upgrade. Indeed, when I pulled down the new SW to a test machine, loaded and ran it, I could get both of Randy's ROAs. I figured I was good to go. Then we upgraded the prod machine to the new version and the problem persisted.

An hour or two of analysis made me realize that the "stickiness" of a particular PP (Publication Point) is encoded in the cache filesystem. Routinator seems to build entries in its cache directory under either rsync, rrdp, or http, and the rg.net PPs weren't showing under rsync, but moving the cache directory aside and forcing it to rebuild fixed the issue.

A couple of points seem to follow:

- Randy says: "finding the fort rp to be pretty solid!" I'll say that if you loaded a fresh Fort and fresh Routinator install, they would both have your ROAs.
- The sense of "stickiness" is local only; hence to my mind the protection against "downgrade" attack is somewhat illusory. A fresh install knows nothing of history.

Tony

On Fri, Oct 30, 2020 at 11:57 PM Randy Bush <randy@psg.com> wrote:
If there is a covering less specific ROA issued by a parent, this will then result in RPKI invalid routes.
i.e. the upstream kills the customer. not a wise business model.
The fall-back may help in cases where there is an accidental outage of the RRDP server (for as long as the rsync servers can deal with the load)
folk try different software, try different configurations, realize that having their CA gooey exposed because they wanted to serve rrdp and block, ...
randy, finding the fort rp to be pretty solid!
- Randy says: "finding the fort rp to be pretty solid!" I'll say that if you loaded a fresh Fort and fresh Routinator install, they would both have your ROAs. - The sense of "stickiness" is local only; hence to my mind the protection against "downgrade" attack is somewhat illusory. A fresh install knows nothing of history.
fort running

enabled rrdp on server

router reports

    r0.sea#sh ip bgp rpki table | i 3130
    147.28.0.0/20        20      3130            0 147.28.0.84/323
    147.28.0.0/19        19      3130            0 147.28.0.84/323
    147.28.64.0/19       19      3130            0 147.28.0.84/323
    147.28.96.0/19       19      3130            0 147.28.0.84/323
    147.28.128.0/19      19      3130            0 147.28.0.84/323
    147.28.160.0/19      19      3130            0 147.28.0.84/323
    147.28.192.0/19      19      3130            0 147.28.0.84/323
    192.83.230.0/24      24      3130            0 147.28.0.84/323
    198.180.151.0/24     24      3130            0 147.28.0.84/323
    198.180.153.0/24     24      3130            0 147.28.0.84/323

disabled rrdp on server

added new roa 198.180.151.0/25

waited a while

router reports

    r0.sea#sh ip bgp rpki table | i 3130
    147.28.0.0/20        20      3130            0 147.28.0.84/323
    147.28.0.0/19        19      3130            0 147.28.0.84/323
    147.28.64.0/19       19      3130            0 147.28.0.84/323
    147.28.96.0/19       19      3130            0 147.28.0.84/323
    147.28.128.0/19      19      3130            0 147.28.0.84/323
    147.28.160.0/19      19      3130            0 147.28.0.84/323
    147.28.192.0/19      19      3130            0 147.28.0.84/323
    192.83.230.0/24      24      3130            0 147.28.0.84/323
    198.180.151.0/25     25      3130            0 147.28.0.84/323   <<<===
    198.180.151.0/24     24      3130            0 147.28.0.84/323
    198.180.153.0/24     24      3130            0 147.28.0.84/323

as i said, fort seems solid

randy
    r0.sea#sh ip bgp rpki table | i 3130
    147.28.0.0/20        20      3130            0 147.28.0.84/323
    147.28.0.0/19        19      3130            0 147.28.0.84/323
    147.28.64.0/19       19      3130            0 147.28.0.84/323
    147.28.96.0/19       19      3130            0 147.28.0.84/323
    147.28.128.0/19      19      3130            0 147.28.0.84/323
    147.28.160.0/19      19      3130            0 147.28.0.84/323
    147.28.192.0/19      19      3130            0 147.28.0.84/323
    192.83.230.0/24      24      3130            0 147.28.0.84/323
    198.180.151.0/25     25      3130            0 147.28.0.84/323   <<<===
    198.180.151.0/24     24      3130            0 147.28.0.84/323
    198.180.153.0/24     24      3130            0 147.28.0.84/323
note rov ops: if you do not see that /25 in your router(s), the RP software you are running can be damaging to your customers and to others.

randy
Hi Tony,

I realise there are quite some moving parts so I'll try to summarise our design choices and reasoning as clearly as possible.

Rsync was the original transport for RPKI and is still mandatory to implement. RRDP (which uses HTTPS) was introduced to overcome some of the shortcomings of rsync. Right now, all five RIRs make their Trust Anchors available over HTTPS, all but two RPKI repositories support RRDP, and all but one relying party software package supports RRDP. There is currently an IETF draft to deprecate the use of rsync.

As a result, the bulk of RPKI traffic is currently transported over RRDP and only a small amount relies on rsync. For example, our RPKI repository is configured accordingly: rrdp.rpki.nlnetlabs.nl is served by a CDN and rsync.rpki.nlnetlabs.nl runs rsyncd on a simple, small VM to deal with the remaining traffic. When operators deploying our Krill Delegated RPKI software ask us what to expect and how to provision their services, this is how we explain the current state of affairs.

With this in mind, Routinator currently has this fetching strategy:

1. It starts by connecting to the Trust Anchors of the RIRs over HTTPS, if possible, and otherwise uses rsync.
2. It follows the certificate tree, following several pointers to publication servers along the way. These pointers can be rsync only, or there can be two pointers, one to rsync and one to RRDP.
3. If an RRDP pointer is found, Routinator will try to connect to the service and verify whether there is a valid TLS certificate and data can be successfully fetched. If it can, the server is marked as usable and it'll prefer it. If the initial check fails, Routinator will use rsync, but verify whether RRDP works on the next validation run.
4. If RRDP worked before but is unavailable for any reason, Routinator will use cached data and try again on the next run instead of immediately falling back to rsync.
5. If the RPKI publication server operator takes away the pointer to RRDP to indicate they no longer offer this communication protocol, Routinator will use rsync.
6. If Routinator's cache is cleared, the process will start fresh (a rough sketch of this decision logic appears below).

This strategy was implemented with repository server provisioning in mind. We are assuming that if you actively indicate that you offer RRDP, you actually provide a monitored service there. As such, an outage would be assumed to be transient in nature. Routinator could fall back immediately, of course. But our thinking was that if the RRDP service had a small hiccup, currently 1,000+ Routinator instances would be hammering a possibly underprovisioned rsync server, perhaps causing even more problems for the operator. "Transient" is currently the focus.

In Randy's experiment, he is actively advertising he offers RRDP, but doesn't offer a service there for weeks at a time. As I write this, ca.rg.net, cb.rg.net and cc.rg.net have been returning a 404 on their RRDP endpoint for several weeks and counting. cc.rg.net was unavailable over rsync for several days this week as well. I would assume this is not how operators would run their RPKI publication server normally. Not having an RRDP service for weeks when you advertise you do is fine for an experiment but constitutes pretty bad operational practice for a production network.
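To make the six steps above concrete, here is a rough per-publication-point sketch of that decision logic in Python. It is illustrative only, not Routinator's actual code (which is written in Rust), and every name in it (PublicationPoint, fetch_rrdp, fetch_rsync, the stub bodies) is invented for the example.

    """Illustrative sketch of the fetching strategy described above, per
    publication point. The transfer functions are stubs; a real relying
    party would perform HTTPS (with TLS validation) and rsync transfers."""

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class PublicationPoint:
        rsync_uri: str
        rrdp_uri: Optional[str] = None   # None: the CA does not advertise RRDP

    @dataclass
    class TransportState:
        rrdp_known_good: set = field(default_factory=set)
        last_good: dict = field(default_factory=dict)

    def fetch_rrdp(uri: str) -> Optional[bytes]:
        """Stub: HTTPS fetch with TLS validation; None means the fetch failed."""

    def fetch_rsync(uri: str) -> Optional[bytes]:
        """Stub: rsync transfer; None means the transfer failed."""

    def refresh(pp: PublicationPoint, st: TransportState) -> Optional[bytes]:
        if pp.rrdp_uri is None:
            return fetch_rsync(pp.rsync_uri)        # step 5: rsync only

        if pp.rrdp_uri not in st.rrdp_known_good:
            data = fetch_rrdp(pp.rrdp_uri)          # step 3: first contact
            if data is not None:
                st.rrdp_known_good.add(pp.rrdp_uri)
                st.last_good[pp.rrdp_uri] = data
                return data
            return fetch_rsync(pp.rsync_uri)        # fall back once, retry RRDP next run

        data = fetch_rrdp(pp.rrdp_uri)              # step 4: RRDP worked before
        if data is not None:
            st.last_good[pp.rrdp_uri] = data
            return data
        return st.last_good.get(pp.rrdp_uri)        # serve cached data, retry next run

Clearing the cache (step 6) corresponds to starting over with an empty TransportState, which is exactly the local-only stickiness Tony describes in the quoted message below.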
If a service becomes unavailable, the operator would swiftly be contacted and the issue would be resolved, like Randy and I have done in happier times:

https://twitter.com/alexander_band/status/1209365918624755712
https://twitter.com/enoclue/status/1209933106720829440

On a personal note, I realise the situation has a dumpster fire feel to it. I contacted Randy about his outages months ago, not knowing they were a research project. I never got a reply. Instead of discussing his research and the observed effects, it feels like a 'gotcha' to present the findings in this way. It could even be considered irresponsible, if the fallout is as bad as he claims. The notion that using our software is, quote, "a disaster waiting to happen", is disingenuous at best:

https://www.ripe.net/ripe/mail/archives/members-discuss/2020-September/00423...

The Routinator design was to try to deal with outages in a responsible manner for all actors involved. Again, of course we can change our strategy as a result of this discussion, which I'm happy we're now actually having. In that case I would advise operators who offer an RPKI publication server to ensure that they provision their rsyncd service so that it is capable of handling all of the traffic that their RRDP service normally handles, in case RRDP has a glitch. And even if people do scale their rsync service accordingly, they will only ever find out whether it actually holds up in a time of crisis.

Kind regards,

-Alex
On 31 Oct 2020, at 07:17, Tony Tauber <ttauber@1-4-5.net> wrote:
As I've pointed out to Randy and others, and I'll share here.

We planned, but hadn't yet upgraded, our Routinator RP (Relying Party) software to the latest v0.8, which I knew had some improvements. I assumed the problems we were seeing would be fixed by the upgrade. Indeed, when I pulled down the new SW to a test machine, loaded and ran it, I could get both of Randy's ROAs. I figured I was good to go. Then we upgraded the prod machine to the new version and the problem persisted.

An hour or two of analysis made me realize that the "stickiness" of a particular PP (Publication Point) is encoded in the cache filesystem. Routinator seems to build entries in its cache directory under either rsync, rrdp, or http, and the rg.net PPs weren't showing under rsync, but moving the cache directory aside and forcing it to rebuild fixed the issue.

A couple of points seem to follow:

- Randy says: "finding the fort rp to be pretty solid!" I'll say that if you loaded a fresh Fort and fresh Routinator install, they would both have your ROAs.
- The sense of "stickiness" is local only; hence to my mind the protection against "downgrade" attack is somewhat illusory. A fresh install knows nothing of history.

Tony
On Fri, Oct 30, 2020 at 11:57 PM Randy Bush <randy@psg.com> wrote:
If there is a covering less specific ROA issued by a parent, this will then result in RPKI invalid routes.
i.e. the upstream kills the customer. not a wise business model.
The fall-back may help in cases where there is an accidental outage of the RRDP server (for as long as the rsync servers can deal with the load)
folk try different software, try different configurations, realize that having their CA gooey exposed because they wanted to serve rrdp and block, ...
randy, finding the fort rp to be pretty solid!
Hi Randy, all,
On 31 Oct 2020, at 04:55, Randy Bush <randy@psg.com> wrote:
If there is a covering less specific ROA issued by a parent, this will then result in RPKI invalid routes.
i.e. the upstream kills the customer. not a wise business model.
I did not say it was. But this is the problematic case.

For the vast majority of ROAs the sustained loss of the repository would lead to invalid ROA *objects*, which will not be used in Route Origin Validation anymore, leading to the state 'Not Found' for the associated announcements. This is not the case if there are other ROAs for the same prefixes published by others (most likely the parent). Quick back-of-the-envelope analysis: this affects about 0.05% of ROA prefixes.
The fall-back may help in cases where there is an accidental outage of the RRDP server (for as long as the rsync servers can deal with the load)
folk try different software, try different configurations, realize that having their CA gooey exposed because they wanted to serve rrdp and block, ...
We are talking here about the HTTPS server being unavailable, while rsync *is* available. So this means your HTTPS server is down, unreachable, or has an issue with its HTTPS certificate. Your repository could use a CDN if you don't want to do all this yourself. You could monitor, and fix things.. there is time.

The thing is, even if HTTPS becomes unavailable, this still leaves hours (8 by default for the Krill CA, configurable) to fix things. Routinator (and the RIPE NCC Validator, and others) will use cached data if they cannot retrieve new data. It's only when manifests and CRLs start to expire that the objects would become invalid.

So the fallback helps in case of incidents with HTTPS that were not fixed within 8 hours, for 0.05% of prefixes. On the other hand, the fallback exposes a Malicious-in-the-Middle replay attack surface for 100% of the prefixes published using RRDP, 100% of the time. This allows attackers to prevent changes in ROAs from being seen.

This is a tradeoff. I think that protecting against replay should be considered more important here, given the numbers and the time available to fix an HTTPS issue.
randy, finding the fort rp to be pretty solid!
Unrelated, but sure I like Fort too. Tim
On Mon, Nov 02, 2020 at 09:13:16AM +0100, Tim Bruijnzeels wrote:
On the other hand, the fallback exposes a Malicious-in-the-Middle replay attack surface for 100% of the prefixes published using RRDP, 100% of the time. This allows attackers to prevent changes in ROAs to be seen.
This is a mischaracterization of what is going on. The implication of what you say here is that RPKI cannot work reliably over RSYNC, which is factually incorrect and an injustice to all existing RSYNC-based deployments. Your view on the security model seems to ignore the existence of RPKI manifests and the use of CRLs, which exist exactly to mitigate replays.

Up until 2 weeks ago Routinator indeed was not correctly validating RPKI data; fortunately this has now been fixed: https://mailman.nanog.org/pipermail/nanog/2020-October/210318.html

Also via the RRDP protocol old data can be replayed, because just like RSYNC, the RRDP protocol does not have authentication. When RPKI data is transported from Publication Point (PP) to Relying Party (RP), the RP cannot assume there was an unbroken 'chain of custody' and therefore has to validate all the RPKI signatures.

For example, if a CDN is used to distribute RRDP data, the CDN is the MITM (that is literally what CDNs are: reverse proxies, in the middle). The CDN could accidentally serve up old (cached) content or misserve current content (swap 2 filenames with each other).
This is a tradeoff. I think that protecting against replay should be considered more important here, given the numbers and time to fix HTTPS issue.
The 'replay' issue you perceive is also present in RRDP. The RPKI is a *deployed* system on the Internet and it is important for Routinator to remain interoperable with other non-NLnet Labs implementations.

Routinator not falling back to rsync does *not* offer a security advantage, but does negatively impact our industry's ability to migrate to RRDP. We are in 'phase 0' as described in Section 3 of https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync

Regards,

Job
I hate to jump in late, but... :)

After reading this a few times it seems like what's going on is:

o a set of assumptions were built into the software stack
  this seems fine, hard to build without some assumptions :)

o the assumptions seem to include: "if rrdp fails <how?> feel free to jump back/to rsync"
  I think SOME of the problem is the 'how' there. Admittedly someone (randy) injected a pretty pathological failure mode into the system and didn't react when his 'monitoring' said: "things are broke yo!"

o absent a 'failure' the software kept on getting along as it had before.
  After all, maybe the operator here intentionally put their repository into this whacky state? How is an RP software stack supposed to know what the PP's management is meaning to do?

o lots of debate about how we got to where we are; I don't know that much of it is really helpful.

I think a way forward here is to offer a suggestion for the software folk to cogitate on and improve?

  "What if (for either rrdp or rsync) there is no successful update[0] in X of Y attempts, attempt the other protocol to sync down to bring the remote PP back to life in your local view."

This both allows the RP software to pick their primary path (and stick to that path as long as things work) AND helps the PP folk recover a bit quicker if their deployment runs into troubles. (A rough sketch of this X-of-Y idea follows after the quoted message below.)

0: I think 'failure' here is clear (to me):
   1) the protocol is broken (rsync no connect, no http connect)
   2) the connection succeeds but there is no sync-file (rrdp) nor valid MFT/CRL

The 6486-bis rework effort seems to be getting to: "No MFT? no CRL? you r busted!" So I think if you don't get MFT/CRL in X of Y attempts it's safe to say the PP over that protocol is busted, and attempting the other proto is acceptable.

thanks!
-chris

On Mon, Nov 2, 2020 at 4:37 AM Job Snijders <job@ntt.net> wrote:
On Mon, Nov 02, 2020 at 09:13:16AM +0100, Tim Bruijnzeels wrote:
On the other hand, the fallback exposes a Malicious-in-the-Middle replay attack surface for 100% of the prefixes published using RRDP, 100% of the time. This allows attackers to prevent changes in ROAs to be seen.
This is a mischaracterization of what is going on. The implication of what you say here is that RPKI cannot work reliably over RSYNC, which is factually incorrect and an injustice to all existing RSYNC based deployment. Your view on the security model seems to ignore the existence of RPKI manifests and the use of CRLs, which exist exactly to mitigate replays.
Up until 2 weeks ago Routinator indeed was not correctly validating RPKI data; fortunately this has now been fixed: https://mailman.nanog.org/pipermail/nanog/2020-October/210318.html
Also via the RRDP protocol old data can be replayed, because just like RSYNC, the RRDP protocol does not have authentication. When RPKI data is transported from Publication Point (PP) to Relying Party (RP), the RP cannot assume there was an unbroken 'chain of custody' and therefore has to validate all the RPKI signatures.
For example, if a CDN is used to distribute RRDP data, the CDN is the MITM (that is literally what CDNs are: reverse proxies, in the middle). The CDN could accidentally serve up old (cached) content or misserve current content (swap 2 filenames with each other).
This is a tradeoff. I think that protecting against replay should be considered more important here, given the numbers and time to fix HTTPS issue.
The 'replay' issue you perceive is also present in RRDP. The RPKI is a *deployed* system on the Internet and it is important for Routinator to remain interoperable with other non-NLnet Labs implementations.
Routinator not falling back to rsync does *not* offer a security advantage, but does negatively impact our industry's ability to migrate to RRDP. We are in 'phase 0' as described in Section 3 of https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync
Regards,
Job
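A rough sketch of the X-of-Y idea proposed above, in Python. The thresholds, the class, and the publication point name are illustrative assumptions only, not a proposal for any particular relying party implementation's internals; 'failure' is meant in the sense of footnote [0] above (no connection, or no usable snapshot/delta, MFT, or CRL).

    """Sketch: prefer one transport per publication point, but switch to the
    other once X of the last Y fetch attempts have failed."""

    from collections import defaultdict, deque

    X_FAILURES = 3      # switch after 3 failures ...
    Y_ATTEMPTS = 4      # ... out of the last 4 attempts

    class TransportChooser:
        def __init__(self):
            # Per (publication point, transport): a sliding window of the
            # last Y attempt outcomes (True = success, False = failure).
            self.history = defaultdict(lambda: deque(maxlen=Y_ATTEMPTS))

        def record(self, pp: str, transport: str, success: bool) -> None:
            self.history[(pp, transport)].append(success)

        def too_broken(self, pp: str, transport: str) -> bool:
            window = self.history[(pp, transport)]
            return list(window).count(False) >= X_FAILURES

        def choose(self, pp: str, preferred: str = "rrdp") -> str:
            """Stick with the preferred transport while it works; switch to
            the other one once X of the last Y attempts have failed."""
            other = "rsync" if preferred == "rrdp" else "rrdp"
            if self.too_broken(pp, preferred) and not self.too_broken(pp, other):
                return other
            return preferred

    if __name__ == "__main__":
        chooser = TransportChooser()
        pp = "rsync://ca.example.net/repo/"       # illustrative publication point
        for ok in (True, False, False, False):    # RRDP starts failing
            chooser.record(pp, "rrdp", ok)
        print(chooser.choose(pp))                 # -> "rsync"

This keeps the RP on its preferred transport while it works, and brings the publication point back to life over the other transport after a handful of failed runs, which is the recovery property both the PP operator and the RP operator care about.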
On Fri, Nov 6, 2020 at 5:47 AM Randy Bush <randy@psg.com> wrote:
Admittedly someone (randy) injected a pretty pathological failure mode into the system
really? could you be exact, please? turning an optional protocol off is not a 'failure mode'.
I suppose it depends on how you think you are serving the data. If you thought you were serving it on both protocols, but 'suddenly' the RRDP location was empty, that would be a failure. Same if your RRDP location's tls certificate dies...

One of my points was that it appeared that the software called 'bad tls cert' (among other things I'm sure) a failure, but not 'empty directory' (or no diff file). It's possible that ALSO 'no diff' is considered a failure but that swapping to the alternate transport after a few failures was not implemented. (I don't know, I have not looked at that part of the code, and I don't think alex/tim said either way.)

I don't think alex is wrong in stating that 'ideally the operator monitors/alerts on the health of their service'; I think it's shockingly often that this isn't actually done though. (And it isn't germane in the case of the test / research in question.)

My suggestion is that checking the alternate transport is helpful.

-chris
really? could you be exact, please? turning an optional protocol off is not a 'failure mode'.

I suppose it depends on how you think you are serving the data. If you thought you were serving it on both protocols, but 'suddenly' the RRDP location was empty, that would be a failure.
not necessarily. it could merely be a decision to stop serving rrdp. perhaps a security choice; perhaps a software change; perhaps a phase of the moon.
One of my points was that it appeared that the software called 'bad tls cert' (among other things I'm sure) a failure, but not 'empty directory' (or no diff file). It's possible that ALSO 'no diff' is considered a failure
what the broken client software called what is not my problem. every http[s] server in the universe is not necessarily an rrdp server. if the client has some belief, for whatever reason, that it should be, that is a brokenness.
I don't think alex is wrong in stating that 'ideally the operator monitors/alerts on health of their service'
i do. i run clients.
My suggestion is that checking the alternate transport is helpful.
i do not see rrdp as a critical service, after all it is not mti, but i am quite aware of whether it is running or not. the problem is that routinator seems not to be.

randy
On Fri, Nov 6, 2020 at 3:09 PM Randy Bush <randy@psg.com> wrote:
really? could you be exact, please? turning an optional protocol off is not a 'failure mode'.

I suppose it depends on how you think you are serving the data. If you thought you were serving it on both protocols, but 'suddenly' the RRDP location was empty, that would be a failure.
not necessarily. it could merely be a decision to stop serving rrdp. perhaps a security choice; perhaps a software change; perhaps a phase of the moon.
right, this is all in the same set of "failure modes not caught" (I think; I don't care so much WHY you stopped serving RRDP, just that after a few failures the caller should try my other number (rsync))
i do not see rrdp as a critical service, after all it is not mti, but i am quite aware of whether it is running or not. the problem is that routinator seems not to be.
sure... it's just made one set of decisions. I was hoping with some discussion we'd get to: Welp, sure we can fallback and try rsync if we don't see success in <some> time.
Hi Chris, list,
On 10 Nov 2020, at 05:22, Christopher Morrow <morrowc.lists@gmail.com> wrote:
sure... it's just made one set of decisions. I was hoping with some discussion we'd get to: Welp, sure we can fallback and try rsync if we don't see success in <some> time.
We will implement fallback in the next release of routinator.

We still believe that there are concerns why one may not want to fall back, but we also believe that it will be more constructive to have the technical discussion on this as part of the ongoing deprecate-rsync effort in the sidrops working group in the IETF.

Regards,
Tim
On Wed, Nov 11, 2020 at 9:06 AM Tim Bruijnzeels <tim@nlnetlabs.nl> wrote:
Hi Chris, list,
On 10 Nov 2020, at 05:22, Christopher Morrow <morrowc.lists@gmail.com> wrote:
sure... it's just made one set of decisions. I was hoping with some discussion we'd get to: Welp, sure we can fallback and try rsync if we don't see success in <some> time.
We will implement fallback in the next release of routinator.
cool thanks!
We still believe that there are concerns why one may not want to fall back, but we also believe that it will be more constructive to have the technical discussion on this as part of the ongoing deprecate rsync effort in the sidrops working group in the IETF.
I look forward to chatting about this :) I think, yes with the coming (so soon!) deprecation of rsync having a smooth transition of power from rsync -> rrdp would be great. thanks for reconsidering! -chris
Yes, to Tim and the NLnet Labs folks, thanks for responding to the community concerns and experiences.

Tony

On Wed, Nov 11, 2020 at 10:48 AM Christopher Morrow <morrowc.lists@gmail.com> wrote:
On Wed, Nov 11, 2020 at 9:06 AM Tim Bruijnzeels <tim@nlnetlabs.nl> wrote:
Hi Chris, list,
On 10 Nov 2020, at 05:22, Christopher Morrow <morrowc.lists@gmail.com>
wrote:
sure... it's just made one set of decisions. I was hoping with some discussion we'd get to: Welp, sure we can fallback and try rsync if we don't see success in
<some> time.
We will implement fallback in the next release of routinator.
cool thanks!
We still believe that there are concerns why one may not want to fall back, but we also believe that it will be more constructive to have the technical discussion on this as part of the ongoing deprecate rsync effort in the sidrops working group in the IETF.
I look forward to chatting about this :) I think, yes with the coming (so soon!) deprecation of rsync having a smooth transition of power from rsync -> rrdp would be great.
thanks for reconsidering! -chris
i may understand one place you could get confused.

unlike a root CA, which publishes a TAL that describes transports, a non-root CA does not publish a TAL describing what transports it supports. of course, rsync is mandatory to provide; but anything else is "if it works, enjoy it. otherwise use rsync."

randy
On Fri, Nov 6, 2020 at 1:28 AM Christopher Morrow <morrowc.lists@gmail.com> wrote: <snip>
I think a way forward here is to offer a suggestion for the software folk to cogitate on and improve? "What if (for either rrdp or rsync) there is no successful update[0] in X of Y attempts, attempt the other protocol to sync down to bring the remote PP back to life in your local view."
100% Please do this.

I also agree with Job's pleas to consider this work as part of the path outlined in the RSYNC->RRDP transition draft mentioned below.

Tony
This both allows the RP software to pick their primary path (and stick to that path as long as things work) AND helps the PP folk recover a bit quicker if their deployment runs into troubles.
<more snip>
This is a tradeoff. I think that protecting against replay should be considered more important here, given the numbers and time to fix HTTPS issue.
The 'replay' issue you perceive is also present in RRDP. The RPKI is a *deployed* system on the Internet and it is important for Routinator to remain interoperable with other non-NLnet Labs implementations.
Routinator not falling back to rsync does *not* offer a security advantage, but does negatively impact our industry's ability to migrate to RRDP. We are in 'phase 0' as described in Section 3 of https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync
Regards,
Job
participants (8)

- Alex Band
- Christopher Morrow
- Job Snijders
- Lukas Tribus
- Randy Bush
- Tim Bruijnzeels
- Tom Beecher
- Tony Tauber