In a message written on Mon, Oct 03, 2011 at 12:38:25PM -0400, Danny McPherson wrote:
1) ask other authorities? how many? how frequently? impact?
2) consider implications on _entire_ chain of trust?
3) tell the client something?
4) cache what (e.g., zone cut from who you asked)? how long?
5) other?
"minimal" is not what I was thinking..
I'm asking the BIND team for a better answer, but my best understanding is that this will query a second root server (typically the next best by RTT) when it gets a non-validating answer, and assuming the second-best one validates just fine there are no further follow-on effects. So you're talking one extra query when a caching resolver hits the root. We can argue whether that is minimal or not, but I suspect most end users behind that resolver would never notice.
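For illustration only (this is not BIND's code), a rough dnspython sketch of the fallback described above: try root servers in RTT order and re-ask the next one if an answer fails validation. The address list and validate_response() stub are assumptions; a real validating resolver checks RRSIGs against the root trust anchor and keeps its own RTT estimates.

import dns.exception
import dns.message
import dns.query
import dns.rdatatype

# Example root server addresses, assumed pre-sorted by measured RTT.
ROOTS_BY_RTT = ["198.41.0.4", "192.33.4.12", "192.5.5.241"]

def validate_response(response):
    # Placeholder only: a real resolver validates RRSIGs against the
    # root trust anchor.  Here we just check that something answered.
    return bool(response.answer or response.authority)

def query_root(qname, qtype=dns.rdatatype.NS):
    query = dns.message.make_query(qname, qtype, want_dnssec=True)
    for server in ROOTS_BY_RTT:
        try:
            response = dns.query.udp(query, server, timeout=2)
        except dns.exception.Timeout:
            continue                 # unreachable server: try the next one
        if validate_response(response):
            return response          # common case: the first answer is fine
        # non-validating answer: fall through, costing one extra query
    raise RuntimeError("no root server returned a validating answer")

The point of the sketch is simply that the penalty is one additional round trip to the next-best server, which is the "one extra query" cost described above.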
You miss the point here, Leo. If the operator of a network service can't detect issues *when they occur* in the current system in some automated manner, whether unintentional or malicious, they won't be alerted, they certainly can't "fix" the problem, and the potential exposure window can be significant.
In a message written on Mon, Oct 03, 2011 at 01:09:17PM -0400, Christopher Morrow wrote:
Does ISC (or any other anycast root/*tld provider) have external polling methods that can reliably tell when, as was the case here, local-anycast-instances are made global? (or when the cone of silence widens?)
Could ISC (or any other root operator) do more monitoring? I'm sure, but let's scope the problem first. We're dealing here with a relatively widespread leak, but that is in fact the rare case. There are 39,000 ASNs active in the routing system. Each one of those ASNs can affect its path to the root server by:

1) Bringing up an internal instance of a root server, injecting it into its IGP, and "hijacking" the route.
2) Turning up or down a peer that hosts a root server.
3) Turning up or down a transit provider.
4) Adding or removing links internal to their network that change their internal selection to use a different external route.

The only way to make sure a route was correct, everywhere, would be to have 39,000+ probes, one on every ASN, and check the path to the root server. Even if you had that, how do you define when any of the changes in 1-4 is legitimate? You could DNSSEC-verify to rule out #1, but #2-4 are local decisions made by the ASN (or one of its upstreams).

I suppose, if someone had all 39,000+ probes, we could attempt to write algorithms that determined if too much "change" was happening at once; but I'm reminded of events like the earthquake that took out many Asian cables a few years back. There's a very real danger of such a system shutting down a large number of nodes during such an event due to the magnitude of changes, which I'd suggest is the exact opposite of what the Internet needs to have happen in that event.
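To make the probe idea concrete: one common per-vantage-point check (an illustrative dnspython sketch, not ISC's monitoring tooling) is to ask the root server you actually reach to identify itself via the conventional hostname.bind CHAOS TXT query (id.server works on many implementations as well). F_ROOT below is simply F-root's well-known IPv4 address.

import dns.message
import dns.query
import dns.rdataclass
import dns.rdatatype

F_ROOT = "192.5.5.241"   # f.root-servers.net (IPv4)

def identify_instance(server=F_ROOT, timeout=2.0):
    # The answering instance reports its own node name; which name you get
    # depends entirely on where in the routing system you ask from.
    q = dns.message.make_query("hostname.bind.", dns.rdatatype.TXT,
                               rdclass=dns.rdataclass.CH)
    resp = dns.query.udp(q, server, timeout=timeout)
    return [txt.to_text() for rrset in resp.answer for txt in rrset]

if __name__ == "__main__":
    print(identify_instance())

Of course, this only tells you what your own vantage point sees, which is exactly the scoping problem above: to cover every ASN you would need a probe inside each one.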
(I suppose I'm not prescribing solutions above, just wondering if something like this is, or could feasibly be, done.)
Not really. Look, I chase down several dozen F-Root leaks a year. You never hear about them on NANOG. Why? Well, it's some small ISP in the middle of nowhere leaking to a peer who believes them, and thus they get a 40ms response time when they should have a 20ms response time by believing the wrong route. Basically, almost no one cares; generally it takes some uber-DNS nerd at a remote site to figure this out and contact us for help.

This has taught me that viewpoints are key. You have to be on the right network to detect it has hijacked all 13 root servers; you can't probe that from the outside. You also have to be on the right network to see you're getting the F-Root 1,000 miles away rather than the one 500 miles away. Those 39,000 ASNs are providing a moving playing field, with relationships changing quite literally every day, and every one of them may be a "leak".

This one caught attention not because it was a bad leak. It was IPv6 only. Our monitoring suggests this entire leak siphoned away 40 queries per second, at its peak, across all of F-Root. In terms of a percentage of queries it doesn't even show visually on any of our graphs. No, it drew attention for totally non-technical reasons: US users panicking that the Chinese government was hijacking the Internet, which is just laughable in this context.

There really is nothing to see here. DNSSEC fixes any security implications from these events. My fat fingers have dropped more than 40 qps on the floor more than once this year, and you didn't notice. Bad events (like earthquakes and fiber cuts) have taken any number of servers from any number of operators multiple times this year. Were it not for the fact that someone posted to NANOG, I bet most of the people here would never have noticed that their 99.999% working system kept working just fine.

I think all the root ops can do better, use more monitoring services, detect more route hijacks faster, but none of us will ever get 100%. None will ever be instantaneous. Don't make that the goal; make the system robust in the face of that reality. My own resolution is better IPv6 monitoring for F-root. :)

-- 
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/