Qwest is having some pretty nice DNS issues right now
Apparently they have lost two authoritative servers. ETA is unknown. -Wil
Well, that would explain it; makes me feel better that they took themselves out as well:

-bash-2.05b$ dig qwest.com
; <<>> DiG 9.3.1 <<>> qwest.com
;; global options: printcmd
;; connection timed out; no servers could be reached

-Wil

william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
Partially back up, I can resolve everything here... They tell me that it's not quite over yet.

-Wil

Wil Schultz wrote:
Well, that would explain it; makes me feel better that they took themselves out as well:
-bash-2.05b$ dig qwest.com
; <<>> DiG 9.3.1 <<>> qwest.com
;; global options: printcmd
;; connection timed out; no servers could be reached
-Wil
william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
On Fri, 6 Jan 2006, Wil Schultz wrote:
Well, that would explain it; makes me feel better that they took themselves out as well:
-bash-2.05b$ dig qwest.com
; <<>> DiG 9.3.1 <<>> qwest.com
;; global options: printcmd
;; connection timed out; no servers could be reached
not anycasted then eh? bummer :(
-Wil
william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
On Fri, 6 Jan 2006, william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
I didn't look at this while it was happening, and haven't talked to anybody else about it, so I don't know if this was a systems or routing issue. But, in the spirit of trying to learn lessons from incomplete information...

Qwest.net and Qwest.com have two authoritative name server addresses listed, dca-ans-01.inet.qwest.net and svl-ans-01.inet.qwest.net. As the names imply, traceroutes to these two servers appear to go to somewhere in the DC area and somewhere in proximity to Sunnyvale, California. It appears they're really just two servers or single-location load-balanced clusters, and not an anycast cloud with two addresses. It may be that two simultaneous server failures would take out the whole thing, or they may be in less visible load-balancing configurations. Even if it's two individual servers, that's the standard n+1 redundancy that's generally considered sufficient for most things. There is a fair amount of geographic diversity between the two sites, which is a good thing.

The two servers have the IP addresses 205.171.9.242 and 205.171.14.195. These both appear in global BGP tables as part of 205.168.0.0/14, so any outage affecting that single route (flapping, getting withdrawn, getting announced from somewhere without working connectivity to the two name servers, etc.) would take out both of them.

So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy. While it's tempting to make fun of Qwest here, variations on this theme -- working hard on one area of design while ignoring another that's also critical -- are really common. It's something we all need to be careful of.

Or, not having seen what happened here, the problem could have been something completely different, perhaps even having nothing to do with routing or network topology. In that case, my general point would remain the same, but this would be a bad example to use.

-Steve
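As a rough sketch of the single-route exposure Steve describes (the covering prefix and both addresses are taken from his post; checking what is actually announced would need a view of the global table, e.g. a route server or looking glass, which this does not attempt), the containment arithmetic looks like this in Python:

import ipaddress

# Prefix and nameserver addresses as reported in Steve's post.
covering_route = ipaddress.ip_network("205.168.0.0/14")
nameservers = {
    "dca-ans-01.inet.qwest.net": ipaddress.ip_address("205.171.9.242"),
    "svl-ans-01.inet.qwest.net": ipaddress.ip_address("205.171.14.195"),
}

covered = [name for name, addr in nameservers.items() if addr in covering_route]
if len(covered) == len(nameservers):
    print("All authoritative servers sit behind %s -- one route failure "
          "takes out the whole set." % covering_route)
else:
    print("Some route diversity: only %s behind %s" % (covered, covering_route))

Both addresses fall inside the /14, so geographic diversity notwithstanding, a single withdrawal or bad announcement covers them both.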
On 1/6/06 9:54 PM, "Steve Gibbard" <scg@gibbard.org> wrote:
On Fri, 6 Jan 2006, william(at)elan.net wrote:
On Fri, 6 Jan 2006, Wil Schultz wrote:
Apparently they have lost two authoritative servers. ETA is unknown.
You forgot to mention that they only have two authoritative servers for most of their domains...
[snip]
So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy. While it's tempting to make fun of Qwest here, variations on this theme -- working hard on one area of design while ignoring another that's also critical -- are really common. It's something we all need to be careful of.
Or, not having seen what happened here, the problem could have been something completely different, perhaps even having nothing to do with routing or network topology. In that case, my general point would remain the same, but this would be a bad example to use.
-Steve
At some point in a carrier's growth, Anycast DNS has got to become a best practice. Are there many major carriers that don't do it today, or am I just a starry-eyed idealist? - Dan
Steve Gibbard wrote:
So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy.
I didn't get to play detective at the time of the outage, but a configuration problem (which is automatically replicated) may also have been enough to take out both nameservers. It also makes good management sense to run your nameservers with the same software and versions, but perhaps it doesn't make good continuity sense..?

cheers
-a
Steve Gibbard wrote:
So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy.
I didn't get to play detective at the time of the outage, but a configuration problem (which is automatically replicated) may also have been enough to take out both nameservers.
It also makes good management sense to run your nameservers with the same software and versions, but perhaps it doesn't make good continuity sense..?
That may not necessarily be true. Vendor diversity is not a bad idea. It's expensive support-wise, but you could run different h/w and BIND at two locations. This is perfectly acceptable operationally, AFAIK. Security is another story; that depends largely on people these days, so YMMV. Anyhow, does anyone know what really happened?

-M<
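For what it's worth, a quick way to see what software and version each server claims to run is the CHAOS-class version.bind query. A minimal sketch, assuming dig is installed and using the two Qwest server names mentioned earlier in the thread; many operators hide or spoof this string, so treat the answer as a hint rather than ground truth:

import subprocess

def reported_version(server):
    # CHAOS-class TXT version.bind: whatever version string the server reports.
    try:
        out = subprocess.run(
            ["dig", "@" + server, "version.bind", "TXT", "CH", "+short"],
            capture_output=True, text=True, timeout=15)
    except subprocess.TimeoutExpired:
        return "(no answer)"
    return out.stdout.strip() or "(no answer / hidden)"

for ns in ("dca-ans-01.inet.qwest.net", "svl-ans-01.inet.qwest.net"):
    print(ns, "->", reported_version(ns))

Identical answers from both sites would at least flag the "same software, same bug, same config" exposure Andy raises.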
On Saturday 07 Jan 2006 02:54, you wrote:
While it's tempting to make fun of Qwest here, variations on this theme -
I'll happily make fun of them. If the authoritative DNS servers were in the same logical network, even if one was in Washington and one in California, they'd deserve it. I used to do basic network audits for end-user companies (and one small ISP who bought the service), and this was a standard checklist item -- literally, "are the authoritative name servers on different logical networks?" GX networks did it. Demon Internet did it. We do it for our own hosting despite being a relatively small company, and I'm sure most of the NANOG readership are careful to do this.

I think the comments on anycast are misplaced. Most big ISPs use it, or something similar, for internal recursive resolvers, but I don't think it is that crucial for authoritative servers. Of course, placing all your authoritative nameservers in the same anycast group is one of the things I've complained about here before (not mentioning any TLD by name since they seem to have learnt from that one), so of itself anycast doesn't avoid the issue. You can make the same mistake in many different systems.

There is also some scope for longer TTLs at Qwest, although I can't throw any stones, as we have been busy migrating stuff to new addresses and using very short TTLs ourselves at the moment. But we'll be back to 86400 seconds just as soon as I finish the migration work.

I do agree the management issues with DNS are far harder, and here longer TTLs are a double-edged sword. But it is hard to design a system where the mistakes don't propagate to every DNS server, although some of the common tools do make it easier to check things are okay before updates are unleashed. I think there is scope for saying that DNS TTLs should be related to (and greater than) the time it takes to get clue onto any DNS problem.
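In the spirit of that checklist item, here is a rough sketch of the audit, assuming dig is installed; grouping by /24 is only a crude stand-in for "same logical network" (the real test is whether a single announced route covers them all), so the output is a prompt for a closer look rather than a verdict:

import ipaddress
import subprocess
from collections import defaultdict

def dig_short(*args):
    # Thin wrapper around `dig +short`; returns one record per line.
    try:
        out = subprocess.run(["dig", "+short", *args],
                             capture_output=True, text=True, timeout=15)
    except subprocess.TimeoutExpired:
        return []
    return [line.rstrip(".") for line in out.stdout.splitlines() if line.strip()]

def audit(domain):
    buckets = defaultdict(list)
    for ns in dig_short("NS", domain):
        for addr in dig_short("A", ns):
            try:
                net = ipaddress.ip_network(addr + "/24", strict=False)
            except ValueError:
                continue  # skip anything that isn't a plain IPv4 address
            buckets[net].append((ns, addr))
    for net, members in sorted(buckets.items(), key=lambda kv: str(kv[0])):
        print(net, members)
    if len(buckets) < 2:
        print("All listed name servers fall in one /24 -- no network diversity.")

audit("qwest.com")

Against qwest.com at the time of this thread, the two servers would have landed in different /24s but behind the same /14, which is exactly why the /24 grouping here is only a first pass.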
On Mon, 9 Jan 2006, Simon Waters wrote:
On Saturday 07 Jan 2006 02:54, you wrote:
While it's tempting to make fun of Qwest here, variations on this theme -
I do agree the management issues with DNS are far harder, and here longer TTLs are a double-edged sword. But it is hard to design a system where the mistakes don't propagate to every DNS server, although some of the common tools do make it easier to check things are okay before updates are unleashed.
What's interesting to me, at least, is that this is about the 5th time someone has said similar things in the last 6 months: "DNS is harder than I thought it was" (or something along that line...)

So, do most folks think: 1) get domain name 2) get 2 machines for DNS servers 3) put IPs in TLD system and roll!

It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central? Are they just not well publicized? Do registrars offer this information for end-users/clients? Do they show how their hosted solutions are better/work/comply with these best practices? (worldnic comes to mind)

Should this perhaps be better documented and presented at a future NANOG meeting? (and thus placed online in presentation format)

-Chris
On Mon, Jan 09, 2006 at 05:30:12PM +0000, Christopher L. Morrow wrote:
On Mon, 9 Jan 2006, Simon Waters wrote:
On Saturday 07 Jan 2006 02:54, you wrote:
While it's tempting to make fun of Qwest here, variations on this theme -
I do agree the management issues with DNS are far harder, and here longer TTLs are a double-edged sword. But it is hard to design a system where the mistakes don't propagate to every DNS server, although some of the common tools do make it easier to check things are okay before updates are unleashed.
What's interesting to me, at least, is that this is about the 5th time someone has said similar things in the last 6 months: "DNS is harder than I thought it was" (or something along that line...)
So, do most folks think: 1) get domain name 2) get 2 machines for DNS servers 3) put IPs in TLD system and roll!
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central? Are they just not well publicized? Do registrars offer this information for end-users/clients? Do they show how their hosted solutions are better/work/comply with these best practices? (worldnic comes to mind)
Should this perhaps be better documented and presented at a future NANOG meeting? (and thus placed online in presentation format)
-Chris
IETF tech transfer failure... see RFC 2870 (mislabeled as root-server) for TLD zone machine best practices from several years ago... for even older guidelines ... RFC 1219. --bill
On Mon, 09 Jan 2006 17:30:12 GMT, "Christopher L. Morrow" said:
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central? Are they just not well publicized? Do registrars offer this information for end-users/clients? Do they show how their hosted solutions are better/work/comply with these best practices? (worldnic comes to mind)
Should this perhaps be better documented and presented at a future NANOG meeting? (and thus placed online in presentation format)
Will somebody who has the O'Reilly DNS book handy check and see if Chapter 8 doesn't already cover this? http://www.oreilly.com/catalog/dns4/ If it doesn't, maybe we need to hint to the authors that an update is needed for the 5th edition. If it does, I suspect the basic problem runs much deeper, and can't be solved by a NANOG presentation put online...
--On January 9, 2006 5:30:12 PM +0000 "Christopher L. Morrow" <christopher.morrow@mci.com> wrote:
What's interesting to me, at least, is that this is about the 5th time someone has said similar things in the last 6 months: "DNS is harder than I thought it was" (or something along that line...)
So, do most folks think: 1) get domain name 2) get 2 machines for DNS servers 3) put IPs in TLD system and roll!
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central? Are they just not well publicized? Do registrars offer this information for end-users/clients? Do they show how their hosted solutions are better/work/comply with these best practices? (worldnic comes to mind)
Should this perhaps be better documented and presented at a future NANOG meeting? (and thus placed online in presentation format)
Also, it should be noted that there's a general lack of understanding about how crucial DNS resolver performance is to the end user/customer perception of a network's performance. I can't tell you how many times I've used a local resolver, even on a modem mind you, and seen a dramatic improvement in the end-user experience -- which is to say, the web browser. Other applications are pretty DNS-bound too these days. And many large ISPs overload their resolvers, or have resolvers that aren't prepared/configured to handle the number of queries they're getting. I'm not saying I know the answers there; I'm just saying that I've seen quite a few times where DNS (or even other central directories -- LDAP and Active Directory come to mind) has been the 'bottleneck' from a user standpoint, since name resolution would take so long.
-Chris
-- "Genius might be described as a supreme capacity for getting its possessors into trouble of all kinds." -- Samuel Butler
On Mon, 9 Jan 2006, Randy Bush wrote:
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central?
2182
yes, yes.. people who care (a lot) have read this I'm sure... I was aiming a little lower :) like folks that have enterprise networks :) Or, maybe even registrars offering 'authoritative dns services' like say 'worldnic' who had most of their DNS complex shot in the head for 3 straight days :(
On Monday 09 Jan 2006 21:26, Christopher L. Morrow wrote:
On Mon, 9 Jan 2006, Randy Bush wrote:
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central?
2182
yes, yes.. people who care (a lot) have read this I'm sure... I was aiming a little lower :) like folks that have enterprise networks :) Or, maybe even registrars offering 'authoritative dns services' like say 'worldnic' who had most of their DNS complex shot in the head for 3 straight days :(
It is the old story of ignorance and cost, plus, with DNS, a "perceived loss of control". In the UK many domains are registered with a couple of the cheapest providers, who do not do off-network DNS, and in the past one offered non-RFC-compliant mail forwarding as a bonus. I've seen people switch the DNS part of a hosting arrangement to these guys to save about 10 USD a year. Of course, people competing at those sorts of price levels offer practically no service component, so even if nothing dreadful happens it still turns into a false economy.

It reminds me of the firewall market, when the average punter had no idea how to assess the "security" aspects of a firewall, and so firewall vendors ended up pushing throughput, and price, as the major selling points. I know people who bought firewalls capable of handling 160Mbps of traffic who still have them filtering a 2Mbps Internet connection, badly.

By and large the big ISPs do a good job with DNS; the end users do a terrible job. I think once you get to the size where you need a person (or team) doing DNS work full time, it probably gets a lot easier to do it right. Perhaps I should dust off my report on the quality of DNS configurations in the South West of England and turn it into a buyers' guide?

That said, I don't think doing DNS right is easy. I know pretty much exactly what my current employer is doing wrong, but these failures to conform to best practice aren't as much of a priority as the other things we are doing wrong. At least in our case it is done with knowledge of what can (and likely will eventually) go wrong.
On Mon, Jan 09, 2006 at 10:36:11AM -1000, Randy Bush wrote:
It seems like maybe that is all too common. Are the 'best practices' documented for Authoritative DNS somewhere central?
2182
in deference to the previous RFC editor, who was particular about these things, the proper form is: RFC 2182. --bill
participants (12)
- Andy Davidson
- bmanning@vacation.karoshi.com
- Christopher L. Morrow
- Daniel Golding
- Martin Hannigan
- Michael Loftis
- Randy Bush
- Simon Waters
- Steve Gibbard
- Valdis.Kletnieks@vt.edu
- Wil Schultz
- william(at)elan.net