On Fri, 6 Jan 2006, william(at)elan.net wrote:
> On Fri, 6 Jan 2006, Wil Schultz wrote:
>> Apparently they have lost two authoritative servers. ETA is unknown.
>
> You forgot to mention that they only have two authoritative servers for most of their domains...
I didn't look at this while it was happening, and haven't talked to anybody else about it, so I don't know if this was a systems or routing issue. But, in the spirit of trying to learn lessons from incomplete information...

Qwest.net and Qwest.com have two authoritative name server addresses listed, dca-ans-01.inet.qwest.net and svl-ans-01.inet.qwest.net. As the names imply, traceroutes to these two servers appear to go to somewhere in the DC area and somewhere in proximity to Sunnyvale, California. It appears they're really just two servers or single-location load-balanced clusters, and not an anycast cloud with two addresses. It may be that two simultaneous server failures would take out the whole thing, or they may be in less visible load-balancing configurations. Even if it's two individual servers, that's the standard n+1 redundancy that's generally considered sufficient for most things. There is a fair amount of geographic diversity between the two sites, which is a good thing.

The two servers have the IP addresses 205.171.9.242 and 205.171.14.195. These both appear in global BGP tables as part of 205.168.0.0/14, so any outage affecting that single route (flapping, getting withdrawn, getting announced from somewhere without working connectivity to the two name servers, etc.) would take out both of them.

So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart, on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy.

While it's tempting to make fun of Qwest here, variations on this theme -- working hard on one area of design while ignoring another that's also critical -- are really common. It's something we all need to be careful of.
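
For anyone who wants to run the same check on their own name servers, the shared-route observation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two addresses and the /14 are the ones quoted above, the prefix list is typed in by hand rather than pulled from a live BGP feed, and the helper names (covering_prefix, shared_fate) are made up for this example.

```python
import ipaddress

# Addresses and covering route as quoted in this message. The prefix list
# is entered by hand; it is not the result of a live BGP table lookup.
NAME_SERVER_ADDRESSES = ["205.171.9.242", "205.171.14.195"]
ANNOUNCED_PREFIXES = ["205.168.0.0/14"]


def covering_prefix(address, prefixes):
    """Return the most specific prefix containing `address`, or None."""
    ip = ipaddress.ip_address(address)
    networks = [ipaddress.ip_network(p) for p in prefixes]
    matches = [net for net in networks if ip in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


def shared_fate(addresses, prefixes):
    """Group addresses by the route that covers them."""
    groups = {}
    for address in addresses:
        route = covering_prefix(address, prefixes)
        groups.setdefault(route, []).append(address)
    return groups


groups = shared_fate(NAME_SERVER_ADDRESSES, ANNOUNCED_PREFIXES)
for route, members in groups.items():
    print(route, members)
```

If every address lands in one group, a problem with that single route affects all of the name servers at once. Two or more groups would mean the servers at least do not share a covering route, though this says nothing about shared upstream paths or other common dependencies.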
Or, not having seen what happened here, the problem could have been something completely different, perhaps even having nothing to do with routing or network topology. In that case, my general point would remain the same, but this would be a bad example to use.

-Steve