In a message written on Mon, Oct 03, 2011 at 12:38:25PM -0400, Danny McPherson wrote:
1) ask other authorities? how many? how frequently? impact?
2) consider implications on _entire_ chain of trust?
3) tell the client something?
4) cache what (e.g., zone cut from who you asked)? how long?
5) other?
"minimal" is not what I was thinking..
I'm asking the BIND team for a better answer, but my best understanding is that this will query a second root server (typically the next best by RTT) when it gets a non-validating answer, and assuming the second-best one validates just fine there are no further follow-on effects. So you're talking one extra query when a caching resolver hits the root. We can argue whether that is minimal or not, but I suspect most end users behind that resolver would never notice.
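For illustration only (this is not BIND's code), a rough dnspython sketch of the fallback described above: try root servers in RTT order and re-ask the next one if an answer fails validation. The address list and validate_response() stub are assumptions; a real validating resolver checks RRSIGs against the root trust anchor and keeps its own RTT estimates.

import dns.exception
import dns.message
import dns.query
import dns.rdatatype

# Example root server addresses, assumed pre-sorted by measured RTT.
ROOTS_BY_RTT = ["198.41.0.4", "192.33.4.12", "192.5.5.241"]

def validate_response(response):
    # Placeholder only: a real resolver validates RRSIGs against the
    # root trust anchor.  Here we just check that something answered.
    return bool(response.answer or response.authority)

def query_root(qname, qtype=dns.rdatatype.NS):
    query = dns.message.make_query(qname, qtype, want_dnssec=True)
    for server in ROOTS_BY_RTT:
        try:
            response = dns.query.udp(query, server, timeout=2)
        except dns.exception.Timeout:
            continue                 # unreachable server: try the next one
        if validate_response(response):
            return response          # common case: the first answer is fine
        # non-validating answer: fall through, costing one extra query
    raise RuntimeError("no root server returned a validating answer")

The point of the sketch is simply that the penalty is one additional round trip to the next-best server, which is the "one extra query" cost described above.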
You miss the point here, Leo. If the operator of a network service can't detect issues *when they occur* in the current system in some automated manner, whether unintentional or malicious, they won't be alerted, they certainly can't "fix" the problem, and the potential exposure window can be significant.
In a message written on Mon, Oct 03, 2011 at 01:09:17PM -0400, Christopher Morrow wrote:
Does ISC (or any other anycast root/*tld provider) have external polling methods that can reliably tell when, as was the case here, local-anycast-instances are made global? (or when the cone of silence widens?)
Could ISC (or any other root operator) do more monitoring? I'm sure, but let's scope the problem first. We're dealing here with a relatively widespread leak, but that is in fact the rare case. There are 39,000 ASNs active in the routing system. Each one of those ASNs can affect its path to the root server by:

1) Bringing up an internal instance of a root server, injecting it into its IGP, and "hijacking" the route.
2) Turning up or down a peer that hosts a root server.
3) Turning up or down a transit provider.
4) Adding or removing links internal to their network that change their internal selection to use a different external route.

The only way to make sure a route was correct, everywhere, would be to have 39,000+ probes, one on every ASN, and check the path to the root server. Even if you had that, how do you define when any of the changes in 1-4 is legitimate? You could DNSSEC-verify to rule out #1, but #2-4 are local decisions made by the ASN (or one of its upstreams).

I suppose, if someone had all 39,000+ probes, we could attempt to write algorithms that determined if too much "change" was happening at once; but I'm reminded of events like the earthquake that took out many Asian cables a few years back. There's a very real danger of such a system shutting down a large number of nodes during such an event due to the magnitude of changes, which I'd suggest is the exact opposite of what the Internet needs to have happen in that event.
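To make the probe idea concrete: one common per-vantage-point check (an illustrative dnspython sketch, not ISC's monitoring tooling) is to ask the root server you actually reach to identify itself via the conventional hostname.bind CHAOS TXT query (id.server works on many implementations as well). F_ROOT below is simply F-root's well-known IPv4 address.

import dns.message
import dns.query
import dns.rdataclass
import dns.rdatatype

F_ROOT = "192.5.5.241"   # f.root-servers.net (IPv4)

def identify_instance(server=F_ROOT, timeout=2.0):
    # The answering instance reports its own node name; which name you get
    # depends entirely on where in the routing system you ask from.
    q = dns.message.make_query("hostname.bind.", dns.rdatatype.TXT,
                               rdclass=dns.rdataclass.CH)
    resp = dns.query.udp(q, server, timeout=timeout)
    return [txt.to_text() for rrset in resp.answer for txt in rrset]

if __name__ == "__main__":
    print(identify_instance())

Of course, this only tells you what your own vantage point sees, which is exactly the scoping problem above: to cover every ASN you would need a probe inside each one.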
(I suppose I'm not prescribing solutions above, just wondering if something like this is, or could feasibly be, done.)
Not really. Look, I chase down several dozen F-Root leaks a year. You never hear about them on NANOG. Why? Well, it's some small ISP in the middle of nowhere leaking to a peer who believes them, and thus they get a 40ms response time when they should have a 20ms response time by believing the wrong route. Basically, almost no one cares; generally it takes some uber-DNS nerd at a remote site to figure this out and contact us for help.

This has taught me that viewpoints are key. You have to be on the right network to detect it has hijacked all 13 root servers; you can't probe that from the outside. You also have to be on the right network to see you're getting the F-Root 1,000 miles away rather than the one 500 miles away. Those 39,000 ASNs are providing a moving playing field, with relationships changing quite literally every day, and every one of them may be a "leak".

This one caught attention not because it was a bad leak. It was IPv6 only. Our monitoring suggests this entire leak siphoned away 40 queries per second, at its peak, across all of F-Root. In terms of a percentage of queries it doesn't even show visually on any of our graphs. No, it drew attention for totally non-technical reasons: US users panicking that the Chinese government was hijacking the Internet, which is just laughable in this context.

There really is nothing to see here. DNSSEC fixes any security implications from these events. My fat fingers have dropped more than 40 qps on the floor more than once this year, and you didn't notice. Bad events (like earthquakes and fiber cuts) have taken any number of servers from any number of operators multiple times this year. Were it not for the fact that someone posted to NANOG, I bet most of the people here would never have noticed that their 99.999% working system kept working just fine.

I think all the root ops can do better, use more monitoring services, detect more route hijacks faster, but none of us will ever get 100%. None will ever be instantaneous. Don't make that the goal; make the system robust in the face of that reality. My own resolution is better IPv6 monitoring for F-root. :)

-- 
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/