Qwest is having some pretty nice DNS issues right now
Apparently they have lost two authoritative servers. ETA is unknown. -Wil
Well, that would explain it; makes me feel better that they took themselves out as well:

-bash-2.05b$ dig qwest.com
; <<>> DiG 9.3.1 <<>> qwest.com
;; global options: printcmd
;; connection timed out; no servers could be reached

-Wil

william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
Partially back up, I can resolve everything here... They tell me that it's not quite over yet.

-Wil

Wil Schultz wrote:
Well, that would explain it; makes me feel better that they took themselves out as well:
-bash-2.05b$ dig qwest.com
; <<>> DiG 9.3.1 <<>> qwest.com
;; global options: printcmd
;; connection timed out; no servers could be reached
-Wil
william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
On Fri, 6 Jan 2006, Wil Schultz wrote:
Well, that would explain it; makes me feel better that they took themselves out as well:
-bash-2.05b$ dig qwest.com
; <<>> DiG 9.3.1 <<>> qwest.com
;; global options: printcmd
;; connection timed out; no servers could be reached
not anycasted then eh? bummer :(
-Wil
william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
On Fri, 6 Jan 2006, william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
I didn't look at this while it was happening, and haven't talked to anybody else about it, so I don't know if this was a systems or routing issue. But, in the spirit of trying to learn lessons from incomplete information...

Qwest.net and Qwest.com have two authoritative name server addresses listed, dca-ans-01.inet.qwest.net and svl-ans-01.inet.qwest.net. As the names imply, traceroutes to these two servers appear to go to somewhere in the DC area and somewhere in proximity to Sunnyvale, California. It appears they're really just two servers or single-location load-balanced clusters, and not an anycast cloud with two addresses. It may be that two simultaneous server failures would take out the whole thing, or they may be in less visible load-balancing configurations. Even if it's two individual servers, that's the standard n+1 redundancy that's generally considered sufficient for most things. There is a fair amount of geographic diversity between the two sites, which is a good thing.

The two servers have the IP addresses 205.171.9.242 and 205.171.14.195. These both appear in global BGP tables as part of 205.168.0.0/14, so any outage affecting that single route (flapping, getting withdrawn, getting announced from somewhere without working connectivity to the two name servers, etc.) would take out both of them.

So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy. While it's tempting to make fun of Qwest here, variations on this theme -- working hard on one area of design while ignoring another that's also critical -- are really common. It's something we all need to be careful of.

Or, not having seen what happened here, the problem could have been something completely different, perhaps even having nothing to do with routing or network topology. In that case, my general point would remain the same, but this would be a bad example to use.

-Steve
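As a rough sketch of the single-route exposure Steve describes (the covering prefix and both addresses are taken from his post; checking what is actually announced would need a view of the global table, e.g. a route server or looking glass, which this does not attempt), the containment arithmetic looks like this in Python:

import ipaddress

# Prefix and nameserver addresses as reported in Steve's post.
covering_route = ipaddress.ip_network("205.168.0.0/14")
nameservers = {
    "dca-ans-01.inet.qwest.net": ipaddress.ip_address("205.171.9.242"),
    "svl-ans-01.inet.qwest.net": ipaddress.ip_address("205.171.14.195"),
}

covered = [name for name, addr in nameservers.items() if addr in covering_route]
if len(covered) == len(nameservers):
    print("All authoritative servers sit behind %s -- one route failure "
          "takes out the whole set." % covering_route)
else:
    print("Some route diversity: only %s behind %s" % (covered, covering_route))

Both addresses fall inside the /14, so geographic diversity notwithstanding, a single withdrawal or bad announcement covers them both.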
On 1/6/06 9:54 PM, "Steve Gibbard" <scg@gibbard.org> wrote:
On Fri, 6 Jan 2006, william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
[snip]
So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy. While it's tempting to make fun of Qwest here, variations on this theme -- working hard on one area of design while ignoring another that's also critical -- are really common. It's something we all need to be careful of.
Or, not having seen what happened here, the problem could have been something completely different, perhaps even having nothing to do with routing or network topology. In that case, my general point would remain the same, but this would be a bad example to use.
-Steve
At some point in a carrier's growth, Anycast DNS has got to become a best practice. Are there many major carriers that don't do it today, or am I just a starry-eyed idealist? - Dan
Steve Gibbard wrote:
So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy.
I didn't get to play detective at the time of the outage, but a configuration problem (which is automatically replicated) may also have been enough to take out both nameservers. It also makes good management sense to run your nameservers with the same software and versions, but perhaps it doesn't make good continuity sense..?

cheers
-a
Steve Gibbard wrote:
So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy.
I didn't get to play detective at the time of the outage, but a configuration problem (which is automatically replicated) may also have been enough to take out both nameservers.
It also makes good management sense to run your nameservers with the same software and versions, but perhaps it doesn't make good continuity sense..?
That may not necessarily be true. Vendor diversity is not a bad idea. It's expensive support-wise, but you could run different h/w and BIND at two locations. This is perfectly acceptable operationally, AFAIK. Security is another story; that depends largely on people these days, so YMMV. Anyhow, does anyone know what really happened?

-M<
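For what it's worth, a quick way to see what software and version each server claims to run is the CHAOS-class version.bind query. A minimal sketch, assuming dig is installed and using the two Qwest server names mentioned earlier in the thread; many operators hide or spoof this string, so treat the answer as a hint rather than ground truth:

import subprocess

def reported_version(server):
    # CHAOS-class TXT version.bind: whatever version string the server reports.
    try:
        out = subprocess.run(
            ["dig", "@" + server, "version.bind", "TXT", "CH", "+short"],
            capture_output=True, text=True, timeout=15)
    except subprocess.TimeoutExpired:
        return "(no answer)"
    return out.stdout.strip() or "(no answer / hidden)"

for ns in ("dca-ans-01.inet.qwest.net", "svl-ans-01.inet.qwest.net"):
    print(ns, "->", reported_version(ns))

Identical answers from both sites would at least flag the "same software, same bug, same config" exposure Andy raises.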
On Saturday 07 Jan 2006 02:54, you wrote:
While it's tempting to make fun of Qwest here, variations on this theme -
I'll happily make fun of them. If the authoritative DNS servers were in the same logical network, even if one was in Washington and one in California, they'd deserve it. I used to do basic network audits for end-user companies (and one small ISP who bought the service), and this was a standard checklist item -- literally, "are the authoritative name servers on different logical networks?" GX networks did it. Demon Internet did it. We do it for our own hosting despite being a relatively small company, and I'm sure most of the NANOG readership are careful to do this.

I think the comments on anycast are misplaced. Most big ISPs use it, or something similar, for internal recursive resolvers, but I don't think it is that crucial for authoritative servers. Of course, placing all your authoritative nameservers in the same anycast group is one of the things I've complained about here before (not mentioning any TLD by name since they seem to have learnt from that one), so of itself anycast doesn't avoid the issue. You can make the same mistake in many different systems.

There is also some scope for longer TTLs at Qwest, although I can't throw any stones, as we have been busy migrating stuff to new addresses and using very short TTLs ourselves at the moment. But we'll be back to 86400 seconds just as soon as I finish the migration work.

I do agree the management issues with DNS are far harder, and here longer TTLs are a double-edged sword. But it is hard to design a system where the mistakes don't propagate to every DNS server, although some of the common tools do make it easier to check things are okay before updates are unleashed. I think there is scope for saying that DNS TTLs should be related to (and greater than) the time it takes to get clue onto any DNS problem.
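In the spirit of that checklist item, here is a rough sketch of the audit, assuming dig is installed; grouping by /24 is only a crude stand-in for "same logical network" (the real test is whether a single announced route covers them all), so the output is a prompt for a closer look rather than a verdict:

import ipaddress
import subprocess
from collections import defaultdict

def dig_short(*args):
    # Thin wrapper around `dig +short`; returns one record per line.
    try:
        out = subprocess.run(["dig", "+short", *args],
                             capture_output=True, text=True, timeout=15)
    except subprocess.TimeoutExpired:
        return []
    return [line.rstrip(".") for line in out.stdout.splitlines() if line.strip()]

def audit(domain):
    buckets = defaultdict(list)
    for ns in dig_short("NS", domain):
        for addr in dig_short("A", ns):
            try:
                net = ipaddress.ip_network(addr + "/24", strict=False)
            except ValueError:
                continue  # skip anything that isn't a plain IPv4 address
            buckets[net].append((ns, addr))
    for net, members in sorted(buckets.items(), key=lambda kv: str(kv[0])):
        print(net, members)
    if len(buckets) < 2:
        print("All listed name servers fall in one /24 -- no network diversity.")

audit("qwest.com")

Against qwest.com at the time of this thread, the two servers would have landed in different /24s but behind the same /14, which is exactly why the /24 grouping here is only a first pass.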
On Mon, 9 Jan 2006, Simon Waters wrote:
On Saturday 07 Jan 2006 02:54, you wrote:
While it's tempting to make fun of Qwest here, variations on this theme -
I do agree the management issues with DNS are far harder, and here longer TTLs are a double-edged sword. But it is hard to design a system where the mistakes don't propagate to every DNS server, although some of the common tools do make it easier to check things are okay before updates are unleashed.
What's interesting to me, at least, is that this is about the 5th time someone has said similar things in the last 6 months: "DNS is harder than I thought it was" (or something along that line...)

So, do most folks think: 1) get domain name 2) get 2 machines for DNS servers 3) put IPs in TLD system and roll!

It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central? Are they just not well publicized? Do registrars offer this information for end-users/clients? Do they show how their hosted solutions are better/work/comply with these best practices? (worldnic comes to mind)

Should this perhaps be better documented and presented at a future NANOG meeting? (and thus placed online in presentation format)

-Chris
On Mon, Jan 09, 2006 at 05:30:12PM +0000, Christopher L. Morrow wrote:
On Mon, 9 Jan 2006, Simon Waters wrote:
On Saturday 07 Jan 2006 02:54, you wrote:
While it's tempting to make fun of Qwest here, variations on this theme -
I do agree the management issues with DNS are far harder, and here longer TTLs are a double-edged sword. But it is hard to design a system where the mistakes don't propagate to every DNS server, although some of the common tools do make it easier to check things are okay before updates are unleashed.
What's interesting to me, at least, is that this is about the 5th time someone has said similar things in the last 6 months: "DNS is harder than I thought it was" (or something along that line...)
So, do most folks think: 1) get domain name 2) get 2 machines for DNS servers 3) put IPs in TLD system and roll!
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central? Are they just not well publicized? Do registrars offer this information for end-users/clients? Do they show how their hosted solutions are better/work/comply with these best practices? (worldnic comes to mind)
Should this perhaps be better documented and presented at a future NANOG meeting? (and thus placed online in presentation format)
-Chris
IETF tech transfer failure... see RFC 2870 (mislabeled as root-server) for TLD zone machine best practices from several years ago... for even older guidelines ... RFC 1219. --bill
On Mon, 09 Jan 2006 17:30:12 GMT, "Christopher L. Morrow" said:
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central? Are they just not well publicized? Do registrars offer this information for end-users/clients? Do they show how their hosted solutions are better/work/comply with these best practices? (worldnic comes to mind)
Should this perhaps be better documented and presented at a future NANOG meeting? (and thus placed online in presentation format)
Will somebody who has the O'Reilly DNS book handy check and see if Chapter 8 doesn't already cover this? http://www.oreilly.com/catalog/dns4/ If it doesn't, maybe we need to hint to the authors that an update is needed for the 5th edition. If it does, I suspect the basic problem runs much deeper, and can't be solved by a NANOG presentation put online...
--On January 9, 2006 5:30:12 PM +0000 "Christopher L. Morrow" <christopher.morrow@mci.com> wrote:
What's interesting to me, at least, is that this is about the 5th time someone has said similar things in the last 6 months: "DNS is harder than I thought it was" (or something along that line...)
So, do most folks think: 1) get domain name 2) get 2 machines for DNS servers 3) put IPs in TLD system and roll!
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central? Are they just not well publicized? Do registrars offer this information for end-users/clients? Do they show how their hosted solutions are better/work/comply with these best practices? (worldnic comes to mind)
Should this perhaps be better documented and presented at a future NANOG meeting? (and thus placed online in presentation format)
Also, it should be noted that there's a general lack of understanding about how crucial DNS resolver performance is to the end user/customer perception of a network's performance. I can't tell you how many times I've used a local resolver, even on a modem mind you, and seen a dramatic improvement in the end-user experience -- which is to say, the web browser. Other applications are pretty DNS-bound too these days. And many large ISPs overload their resolvers, or have resolvers that aren't prepared/configured to handle the number of queries they're getting. I'm not saying I know the answers there; I'm just saying that I've seen quite a few times where DNS (or even other central directories -- LDAP and Active Directory come to mind) has been the 'bottleneck' from a user standpoint, since name resolution would take so long.
-Chris
-- "Genius might be described as a supreme capacity for getting its possessors into trouble of all kinds." -- Samuel Butler
On Mon, 9 Jan 2006, Randy Bush wrote:
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central?
2182
yes, yes.. people who care (a lot) have read this I'm sure... I was aiming a little lower :) like folks that have enterprise networks :) Or, maybe even registrars offering 'authoritative dns services' like say 'worldnic' who had most of their DNS complex shot in the head for 3 straight days :(
On Monday 09 Jan 2006 21:26, Christopher L. Morrow wrote:
On Mon, 9 Jan 2006, Randy Bush wrote:
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central?
2182
yes, yes.. people who care (a lot) have read this I'm sure... I was aiming a little lower :) like folks that have enterprise networks :) Or, maybe even registrars offering 'authoritative dns services' like say 'worldnic' who had most of their DNS complex shot in the head for 3 straight days :(
It is the old story of ignorance and cost, plus, with DNS, a "perceived loss of control". In the UK many domains are registered with a couple of the cheapest providers, who do not do off-network DNS, and in the past one offered non-RFC-compliant mail forwarding as a bonus. I've seen people switch the DNS part of a hosting arrangement to these guys to save about 10 USD a year. Of course, people competing at those sorts of price levels offer practically no service component, so even if nothing dreadful happens it still turns into a false economy.

It reminds me of the firewall market, when the average punter had no idea how to assess the "security" aspects of a firewall, and so firewall vendors ended up pushing throughput, and price, as the major selling points. I know people who bought firewalls capable of handling 160Mbps of traffic who still have them filtering a 2Mbps Internet connection, badly.

By and large the big ISPs do a good job with DNS; the end users do a terrible job. I think once you get to the size where you need a person (or team) doing DNS work full time, it probably gets a lot easier to do it right. Perhaps I should dust off my report on the quality of DNS configurations in the South West of England and turn it into a buyers' guide?

That said, I don't think doing DNS right is easy. I know pretty much exactly what my current employer is doing wrong, but these failures to conform to best practice aren't as much of a priority as the other things we are doing wrong. At least in our case it is done with knowledge of what can (and likely will eventually) go wrong.
On Mon, Jan 09, 2006 at 10:36:11AM -1000, Randy Bush wrote:
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central?
2182
in deference to the previous RFC editor, who was particular about these things, the proper form is: RFC 2182. --bill
participants (12)
- Andy Davidson
- bmanning@vacation.karoshi.com
- Christopher L. Morrow
- Daniel Golding
- Martin Hannigan
- Michael Loftis
- Randy Bush
- Simon Waters
- Steve Gibbard
- Valdis.Kletnieks@vt.edu
- Wil Schultz
- william(at)elan.net