Anycast reliability (was: Re: verizon.net and other email grief)

13 Dec 2004

      On Mon, 13 Dec 2004, Simon Waters wrote:
...
Inspection suggests that the anycast announcements in the UK were
pointing to a server that wasn't accepting email.
I believe here the problem is using anycast, and not providing a backup
system not using anycast. The previous case I'm aware of was when bits
of the NE USA lost ".org" because they only had anycast DNS servers (and
still do AFAIK), and the announcement messed up.
Whilst I plead ignorant of the technical details of anycast, strikes me
that it is clearly more complex, and thus more prone to failure, and
these failures are potentially less obvious.

(For anybody reading this who doesn't know, anycast is multiple servers in
multiple locations announcing routes and accepting connections to the same
IP address.)

Off the top of my head, there are two major forms of DNS failures:

- DNS servers stop responding
- DNS servers respond with incorrect answers.

Both of these are problems that can affect Anycast systems, but neither is
unique to Anycast.

In the first case, the DNS protocol has its own failover mechanism, and
good anycast implementations also provide a failover mechanism.  If a DNS
server doesn't get a response from the first IP address it sends a DNS
query to, it will send its query to another IP address listed as a DNS
server for the domain it's trying to query.  This process gets repeated
until a working DNS server is found, or until it runs out of IP addresses
of DNS servers to try.  Anycast adds a second failover mechanism, whereby
if a DNS server knows it's stopped responding, it can withdraw the
announcement of its IP address, and routing protocols will redirect
queries that it would have failed to answer to a different (hopefully
working) server.
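
To make the DNS-level failover described above concrete, here is a minimal
sketch in Python. It is not how any particular resolver is implemented. The
function name query_with_failover and the send_query callback are
hypothetical, and the callback stands in for whatever actually sends the
query and waits for a reply.

```python
from typing import Callable, Optional, Sequence


def query_with_failover(
    servers: Sequence[str],
    send_query: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try each listed server address in turn and return the first answer.

    send_query is a hypothetical callback: it returns the answer as a
    string, returns None if the server gave no usable response, or raises
    OSError (which includes TimeoutError) if the query could not complete.
    """
    for address in servers:
        try:
            answer = send_query(address)
        except OSError:
            # No response from this address; move on to the next one.
            continue
        if answer is not None:
            return answer
    # Ran out of addresses to try.
    return None
```

The point of the sketch is only that this loop needs more than one address
to be of any use, whether or not those addresses are anycasted.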

Since the anycast routing protocol based failover mechanism can fail (for
example, if a server stops answering queries but doesn't withdraw its
routing announcements), it is important for there to be multiple IP
addresses to query when looking for information on a domain.  This allows
DNS's internal failover mechanism to be used.  However, this doesn't mean
that the other listed DNS server IP addresses for a domain shouldn't also
be addresses of anycast networks; they just shouldn't be part of the same
anycast network.
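
The routing-based failover can be pictured as a watchdog running beside each
anycast DNS server. The following is a rough sketch under assumptions, not a
description of any real deployment: is_healthy, announce, and withdraw are
hypothetical callbacks that would wrap a local test query and whatever the
site uses to control its route announcement.

```python
from typing import Callable


def reconcile_announcement(
    is_healthy: Callable[[], bool],
    announce: Callable[[], None],
    withdraw: Callable[[], None],
    currently_announced: bool,
) -> bool:
    """Run one pass of a health watchdog and return the new announcement state.

    All three callbacks are hypothetical placeholders for site-specific code.
    """
    try:
        healthy = is_healthy()
    except Exception:
        # A health check that crashes is treated the same as a failed check.
        healthy = False

    if healthy and not currently_announced:
        announce()
        return True
    if not healthy and currently_announced:
        withdraw()
        return False
    # State already matches health; nothing to change.
    return currently_announced
```

If the watchdog itself dies, or the health check passes while real queries
fail, the route stays up and queries keep arriving at a broken server. That
is the failure case that a second, separate anycast network covers.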

If .org has had failures (I don't know whether it has, and it's not an
argument I want to wander into the middle of), it may be instructive to
look at the differences between .org's anycast setup and that of the root
servers.  .org has two IP addresses listed, which I believe are for two
different anycast clouds.  That means that to get .org responses from any
given location, at least one of two DNS servers must be working properly.
There are thirteen root server IP addresses, most of which are at least
somewhat anycasted.  So, to get a response from the roots from any given
location, at least one of thirteen servers needs to be doing the right
thing.  Again, that's not an issue of anycast vs. non-anycast.  It's an
issue of whether n+1 redundancy works as well as n+12 redundancy.
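
As a back-of-the-envelope illustration of the two-versus-thirteen comparison,
here is a small Python sketch. It assumes each listed address fails
independently and with the same probability as seen from a given location,
which is a strong simplification: shared software, shared operators, or
shared routing problems make failures correlated. The 1% figure is made up
purely for illustration.

```python
def chance_all_unreachable(per_server_failure: float, server_count: int) -> float:
    """Probability that every listed address fails at once, assuming
    independent failures with the same per-address probability."""
    if not 0.0 <= per_server_failure <= 1.0:
        raise ValueError("per_server_failure must be between 0 and 1")
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    return per_server_failure ** server_count


if __name__ == "__main__":
    assumed_failure_rate = 0.01  # hypothetical figure, not a measurement
    for count in (2, 13):
        chance = chance_all_unreachable(assumed_failure_rate, count)
        print(f"{count} addresses: {chance:.3e}")
```

Under those assumptions the gap between two and thirteen addresses is
enormous, but the independence assumption is doing most of the work, so the
real-world gap is likely much smaller.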

The case being talked about here was the other sort of DNS failure -- a
DNS server responding with incorrect information.  That's a problem that
becomes harder to notice as the number of DNS servers for a domain
increases, but has nothing to do with whether those servers are in an
anycast configuration.  Indeed, Mr. Waters' suggested non-anycast backup
would just add yet another server that could potentially get out of sync.
Instead, this is the standard monitoring issue for big networks -- when a
network gets big enough that failures won't be immediately obvious to a
network's operator, there needs to be a monitoring system to detect
failures.  In the DNS case, the monitoring system should probably be
trying DNS queries against all the servers, and making sure it gets
consistent results.
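
A consistency check of that sort could look roughly like the Python sketch
below. It only covers the comparison step; gathering the answers (for
example, the SOA serial each server returns) is left out, and the function
name and data layout are my own assumptions rather than any existing tool's.

```python
from collections import Counter
from typing import Dict, List, Optional


def find_suspect_servers(answers: Dict[str, Optional[str]]) -> List[str]:
    """Return the servers that did not answer or that disagree with the majority.

    answers maps a server name to the answer it gave, or None if it gave none.
    If two answers are tied for most common, the one seen first is treated as
    the majority, so ties deserve a human look.
    """
    responded = [answer for answer in answers.values() if answer is not None]
    if not responded:
        # Nothing answered at all; every server is suspect.
        return sorted(answers)
    majority_answer, _ = Counter(responded).most_common(1)[0]
    return sorted(
        server for server, answer in answers.items() if answer != majority_answer
    )
```

With anycast, a probe from one location only ever reaches one instance
behind each address, so the answers would need to be collected from several
vantage points (or by querying each instance directly) for this to catch a
single bad server.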

-Steve

Steve Gibbard