On Mon, 13 Dec 2004, Simon Waters wrote:
> Inspection suggests that the anycast announcements in the UK were pointing to a server that wasn't accepting email.
>
> I believe here the problem is using anycast, and not providing a backup system not using anycast. The previous case I'm aware of was when bits of the NE USA lost ".org" because they only had anycast DNS servers (and still do AFAIK), and the announcement messed up.
>
> Whilst I plead ignorant of the technical details of anycast, strikes me that it is clearly more complex, and thus more prone to failure, and these failures are potentially less obvious.

(For anybody reading this who doesn't know, anycast is multiple servers in multiple locations announcing routes and accepting connections to the same IP address.)

Off the top of my head, there are two major forms of DNS failures:

- DNS servers stop responding
- DNS servers respond with incorrect answers

Both of these are problems that can affect anycast systems, but neither is unique to anycast.

In the first case, the DNS protocol has its own failover mechanism, and good anycast implementations also provide a failover mechanism. If a DNS server doesn't get a response from the first IP address it sends a DNS query to, it will send its query to another IP address listed as a DNS server for the domain it's trying to query. This process gets repeated until a working DNS server is found, or until it runs out of IP addresses of DNS servers to try.

Anycast adds a second failover mechanism, whereby if a DNS server knows it's stopped responding, it can withdraw the announcement of its IP address, and routing protocols will redirect queries that it would have rejected to a different (hopefully working) server.

Since the anycast routing-protocol-based failover mechanism can fail (for example, if a server stops answering queries but doesn't withdraw its routing announcements), it is important for there to be multiple IP addresses to query when looking for information on a domain. This allows DNS's internal failover mechanism to be used. However, this doesn't mean that the other listed DNS server IP addresses for a domain shouldn't also be addresses of anycast networks; they just shouldn't be part of the same anycast network.

If .org has had failures (I don't know whether it has, and it's not an argument I want to wander into the middle of), it may be instructive to look at the differences between .org's anycast setup and that of the root servers. .org has two IP addresses listed, which I believe are for two different anycast clouds.
That means that to get .org responses from any given location, at least one of two DNS servers must be working properly. There are thirteen root server IP addresses, most of which are at least somewhat anycasted. So, to get a response from the roots from any given location, at least one of thirteen servers needs to be doing the right thing. Again, that's not an issue of anycast vs. non-anycast. It's an issue of whether n+1 redundancy works as well as n+12 redundancy.

The case being talked about here was the other sort of DNS failure -- a DNS server responding with incorrect information. That's a problem that becomes harder to notice as the number of DNS servers for a domain increases, but has nothing to do with whether those servers are in an anycast configuration. Indeed, Mr. Waters' suggested non-anycast backup would just add yet another server that could potentially get out of sync.

Instead, this is the standard monitoring issue for big networks -- when a network gets big enough that failures won't be immediately obvious to a network's operator, there needs to be a monitoring system to detect failures. In the DNS case, the monitoring system should probably be trying DNS queries against all the servers, and making sure it gets consistent results.

-Steve
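
For illustration only, here is a minimal Python sketch of the kind of consistency check described above. It is not taken from any real monitoring system: check_consistency, the query callable, and the 192.0.2.x addresses are all hypothetical. The caller supplies query, which is assumed to return some comparable answer for a server (say, a zone's SOA serial as a string), or None if the server did not respond. The sketch treats the most common answer as the reference, which is itself an assumption; a real monitor might compare against a known-good master instead.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional


def check_consistency(
    servers: Iterable[str],
    query: Callable[[str], Optional[str]],
) -> Dict[str, List[str]]:
    """Query every server and sort them into ok / inconsistent / no_response.

    `query` is a hypothetical caller-supplied function: it takes a server
    address and returns that server's answer, or None if it got no response.
    """
    answers: Dict[str, Optional[str]] = {}
    for server in servers:
        try:
            answers[server] = query(server)
        except Exception:
            # Treat any query error the same as a server that did not answer.
            answers[server] = None

    # Use the most common answer among responding servers as the reference.
    responding = [a for a in answers.values() if a is not None]
    majority = Counter(responding).most_common(1)[0][0] if responding else None

    report: Dict[str, List[str]] = {"ok": [], "inconsistent": [], "no_response": []}
    for server, answer in answers.items():
        if answer is None:
            report["no_response"].append(server)
        elif answer == majority:
            report["ok"].append(server)
        else:
            report["inconsistent"].append(server)
    return report
```

Fed a canned table of answers in place of real DNS queries, e.g. check_consistency(fake, fake.get) where fake maps each address to a serial or None, it separates the servers that agree, the ones that are out of sync, and the ones that are silent. Both failure modes discussed above show up in one report: "no_response" for servers that stopped responding, "inconsistent" for servers giving a different answer from the rest.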