possible ORG problems, maybe?

Joe Abley

16 Oct 2003

4:05 a.m.

I think I'm seeing problems performing recursive queries for names under ORG against tld[12].ultradns.net at the moment, which is causing resolvers without cached data to behave as if domains don't exist. It's not trivial to tell whether this is just a local problem, since all the authoritative nameservers for ORG are anycast instances (i.e. I might just be unlucky in that my local tld[12].ultradns.net nodes are behaving unexpectedly). Is anybody else seeing issues?

Jeff Wasilko

16 Oct

4:52 a.m.

On Thu, Oct 16, 2003 at 12:05:25AM -0400, Joe Abley wrote:

...

I think I'm seeing problems performing recursive queries for names under ORG against tld[12].ultradns.net at the moment, which is causing resolvers without cached data to behave as if domains don't exist.

It's not trivial to tell whether this is just a local problem, since all the authoritative nameservers for ORG are anycast instances (i.e. I might just be unlucky in that my local tld[12].ultradns.net nodes are behaving unexpectedly). Is anybody else seeing issues?

Yes. My personal system's in the .org domain, and I just had outbound mail bounced claiming:

    Domain of sender address jeffw@smoe.org does not exist

My nameservers are very diverse, so I'm wondering what's going on with .org this week. I'll be happy to provide details to interested parties.

-j

Rodney Joffe

4:57 a.m.

Hello Joe, Joe Abley wrote:

...

I think I'm seeing problems performing recursive queries for names under ORG against tld[12].ultradns.net at the moment, which is causing resolvers without cached data to behave as if domains don't exist.

It's not trivial to tell whether this is just a local problem, since all the authoritative nameservers for ORG are anycast instances (i.e. I might just be unlucky in that my local tld[12].ultradns.net nodes are behaving unexpectedly). Is anybody else seeing issues?

If you or any other folks ever see any oddness with the UltraDNS nameservers, it would be helpful if you could provide traceroutes. This will allow us to try and isolate the nameservers that are being affected, as well as the routes that may be involved.

If anyone else is seeing a problem, please send email *with* symptoms to noc@ultradns.net.

Thanks...

--
Rodney Joffe
CenterGate Research Group, LLC.
http://www.centergate.com
"Technology so advanced, even we don't understand it!"(R)

Randy Bush

8:35 a.m.

...

If you or any other folks ever see any oddness with the UltraDNS nameservers, it would be helpful if you could provide traceroutes.

and what assurance do you have that the traceroute is to the same server to which the original query failed?

difficulty debugging anycast dns was the major reason for scepticism re anycast auth servers. have fun.

randy

Rodney Joffe

12:30 p.m.

Randy Bush wrote:

...

and what assurance do you have that the traceroute is to the same server to which the original query failed?

difficulty debugging anycast dns was the major reason for scepticism re anycast auth servers.

You're right, Randy. However, things are never black or white.

In a non-anycast implementation, a typical failure like this would not immediately tell you which of the masters or slaves was at fault, if any. The application would just fail. When troubleshooting began, there would be no guarantee which slave was queried originally.

However as the dns was walked, if indeed a server had a problem, in a non-anycast implementation you could tell which server ip address had the problem. But you could not always tell which actual machine had a problem if it was behind a load balancer of any kind, something increasingly common in large installations.

Anycast is no different.

Notwithstanding all of this, it would appear that given the large scale ddos attacks against networks, and dns in particular, over the last year, an anycast implementation is the *only* way that dns has a chance of surviving. So hopefully you'll be involved actively and positively in the dns WG in developing some BCPs and standards for operating anycast implementations, rather than dismissing anycast out-of-hand.

In terms of UltraDNS, we try to make it easier by having the following two records on every server:

    dig @[UltraDNS Anycast name or ip address] whoareyou.ultradns.net A

and

    dig @[UltraDNS Anycast name or ip address] whoami.ultradns.net A

where whoareyou.ultradns.net provides the unique ip address of the machine being queried, and whoami.ultradns.net provides the ip address of the machine doing the querying (so that a user querying a recursive server can identify which recursive server is actually querying the UltraDNS server).

Dan Senie has suggested the inclusion of a TXT record with the same data so that the actual ip address of the actual server that responded to the query that had a problem was available. Certainly more standardized and elegant, but a subject for the WG mailing lists.

I believe that there is an anycast tutorial or session in Chicago, if anyone wants to weigh in.

--
Rodney Joffe
CenterGate Research Group, LLC.
http://www.centergate.com
"Technology so advanced, even we don't understand it!"(R)
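The two lookups described in the message above can be scripted so that both identities are captured alongside a failing query. What follows is a minimal sketch, not an UltraDNS tool: the helper names `identity_queries` and `first_address` are hypothetical, the `+short` flag is added here only to make the output easy to parse, and nothing in the thread guarantees that these records are still served.

```python
import shlex

# Record names are taken from the message above; everything else is illustrative.
WHOAREYOU = "whoareyou.ultradns.net"   # address of the machine being queried
WHOAMI = "whoami.ultradns.net"         # address of the machine doing the querying


def identity_queries(server):
    """Return the two dig command lines (as argv lists) for one anycast name or address."""
    return [
        ["dig", "+short", f"@{server}", WHOAREYOU, "A"],
        ["dig", "+short", f"@{server}", WHOAMI, "A"],
    ]


def first_address(dig_short_output):
    """Return the first non-empty, non-comment line of `dig +short` output, or None."""
    for line in dig_short_output.splitlines():
        line = line.strip()
        if line and not line.startswith(";"):
            return line
    return None


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch has no network dependency.
    for argv in identity_queries("tld1.ultradns.net"):
        print(shlex.join(argv))
```

Running each printed command and passing its output through `first_address` gives one line per identity that can be pasted into a report to noc@ultradns.net together with the traceroute.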

Bruce Campbell

3:25 p.m.

On Thu, 16 Oct 2003, Rodney Joffe wrote:

...

Randy Bush wrote:

...
and what assurance do you have that the traceroute is to the same server to which the original query failed?

difficulty debugging anycast dns was the major reason for scepticism re anycast auth servers.

However as the dns was walked, if indeed a server had a problem, in a non-anycast implementation you could tell which server ip address had the problem. But you could not always tell which actual machine had a problem if it was behind a load balancer of any kind, something increasingly common in large installations.

Anycast is no different.

Anycast is subtly different, as effectively the internal workings of the load balancer are spread around the world for all to marvel at, rather than at one end site.

...

In terms of UltraDNS, we try to make it easier by having the following two records on every server:

dig @[UltraDNS Anycast name or ip address] whoareyou.ultradns.net A
and
dig @[UltraDNS Anycast name or ip address] whoami.ultradns.net A

For the end-user, where is this documented? I know to look for 'version.bind', 'id.server', 'version.server' and a few others, but I hadn't considered asking for 'whoareyou.arbitrary.domain'. Why would other people consider it?

Also, did the query that I'm debugging really go to the same host that I just got the real IP address from? Does the nameserver on the 'real' IP address function the same way as the anycasted virtual IP address?

( Incidentally, exposing the actual IP address of the back-end server is good for debugging, but really, really bad if the point of using anycast is to protect from attacks. You only want to expose an identifier that's only useful in-house. )

...

I believe that there is an anycast tutorial or session in Chicago, if anyone wants to weigh in.

I'll be there.

--==--
Bruce.

Rodney Joffe

7:30 p.m.

Bruce Campbell wrote:

...

On Thu, 16 Oct 2003, Rodney Joffe wrote:

...

...
However as the dns was walked, if indeed a server had a problem, in a non-anycast implementation you could tell which server ip address had the problem. But you could not always tell which actual machine had a problem if it was behind a load balancer of any kind, something increasingly common in large installations.

Anycast is no different.

Anycast is subtly different, as effectively the internal workings of the load balancer are spread around the world for all to marvel at, rather than at one end site.

Perhaps I was not clear, once again, although you made the point even better than I was able to... Behind a load balancer, there are n nameservers. The queries all go to a single ip address, which hits the load balancer. Behind the load balancer, either in the same location, or perhaps even spread globally, all of these nameservers have different ip addresses, none of which is visible or used by the querying public. In that case, how would *you* know which of the myriad of marvelous nameservers you were being answered by? And because you wouldn't, this mirrors the situation with UltraDNS, and the only point I was making.

...

...
In terms of UltraDNS, we try to make it easier by having the following two records on every server:

dig @[UltraDNS Anycast name or ip address] whoareyou.ultradns.net A
and
dig @[UltraDNS Anycast name or ip address] whoami.ultradns.net A

For the end-user, where is this documented ?

It is not. It is not an end-user feature. It is an UltraDNS troubleshooting feature, shared with customers during trouble-shooting, and available for them to use. I shared it here because as I understand it, some small portion of the folks who read UltraDNS run networks, have customers, and experienced apparent issues, and I felt it was an appropriate thing to mention so that when there appears to be an issue in the future with UltraDNS (not just .org) these tools might answer, partially, the concern that Randy correctly raised. And as more nameservers are migrated to using anycast (in its various implementations) it seemed appropriate that a mechanism *like* ours would be a "good idea"(tm).

...

I know to look for 'version.bind', 'id.server', 'version.server' and a few others, but I hadn't considered asking for 'whoareyou.arbitrary.domain'.

You are not a customer. And you have never had an issue with UltraDNS where you contacted us to troubleshoot the same.

...

Why would other people consider it?

I agree. Your point being?

...

Also, did the query that I'm debugging really go to the same host that I just got the real IP address from?

I believe I covered that in my initial response to Randy which you snipped. I said: "> Dan Senie has suggested the inclusion of a TXT record with the same data

...

so that the actual ip address of the actual server that responded to the query that had a problem was available. Certainly more standardized and elegant, but a subject for the WG mailing lists."

The RFCs governing dns do not currently allow for the standard return of any record type in a dns answer that would indicate the ip address of the server being queried. So this valid issue should be addressed, of course through the appropriate process.

...

Does the nameserver on the 'real' IP address function the same way as the anycasted virtual IP address?

It is the same nameserver, so, yes.

...

( Incidentally, exposing the actual IP address of the back-end server is good for debugging, but really, really bad if the point of using anycast is to protect from attacks. You only want to expose an identifier that's only useful in-house. )

There is an additional series of addresses that drives the nameservers themselves, and there is a process of NAT which produces the answers we are discussing here.

Additionally, security by obscurity is inappropriate, given the extreme level of sophistication of the current group of hackers-for-hire. Be afraid. Be *very* afraid. But that's a different issue I have with many of the "ostriches" in the NANOG community.

...

...
I believe that there is an anycast tutorial or session in Chicago, if anyone wants to weigh in.

I'll be there.

Great. It has nothing to do with me, or UltraDNS - although I assume UltraDNS will be dissected ;-)

--
Rodney Joffe
CenterGate Research Group, LLC.
http://www.centergate.com
"Technology so advanced, even we don't understand it!"(R)

Daniel Senie

8:21 p.m.

At 03:30 PM 10/16/2003, Rodney Joffe wrote:

...

Bruce Campbell wrote:

[much snipped]

...

...
Also, did the query that I'm debugging really go to the same host that I just got the real IP address from?

I believe I covered that in my initial response to Randy which you snipped. I said:

"> Dan Senie has suggest the inclusion of a TXT record with the same data

...
so that the actual ip address of the actual server that responded to the query that had a problem was available. Certainly more standardized and elegant, but a subject for the WG mailing lists."

The RFCs governing dns do not currently allow for the standard return of any record type in a dns answer that would indicate the ip address of the server being queried. So this valid issue should be addressed, of course through the appropriate process.

I am working on an I-D submission on this topic. One of the advantages of the mechanism I'm writing about is that it'd be applicable to both anycast environments AND to load balancer environments. I don't know if I'll make the I-D cutoff or not. If not, it'll get submitted as soon as the gates reopen.

Joe Abley

8:14 p.m.

On 16 Oct 2003, at 11:25, Bruce Campbell wrote:

...

I know to look for 'version.bind', 'id.server', 'version.server' and a few others, but I hadn't considered asking for 'whoareyou.arbitrary.domain'. Why would other people consider it?

Incidentally, there is a similar mechanism available for the F root nameserver, in case people are not aware:

    dig @f.root-servers.net hostname.bind chaos txt

For most people this will reveal a nameserver hostname with a "PAO" or an "SFO" in it. People within the catchment of a local anycast node of F will see different site codes.

"hostname.bind CH TXT" is a general feature of BIND 9, and not a special feature of F.

Joe
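Turning the "hostname.bind" answer into a site code is a small parsing step. The sketch below assumes, purely for illustration, that the site code is the leading run of letters in the first label of the returned hostname; the thread only says the hostname has a "PAO" or an "SFO" in it, so the real naming scheme may differ. The helper names and the sample hostnames in the usage note are hypothetical.

```python
import re


def txt_value(dig_short_output):
    """Return the first TXT string from `dig +short` output with its quotes removed."""
    lines = dig_short_output.strip().splitlines()
    return lines[0].strip().strip('"') if lines else ""


def site_code(hostname):
    """Return the leading letters of the first label, upper-cased, or None.

    ASSUMPTION: the site code leads the first label (e.g. "pao1a..." -> "PAO").
    This is an illustrative convention, not something the thread specifies.
    """
    first_label = hostname.split(".", 1)[0]
    match = re.match(r"[A-Za-z]+", first_label)
    return match.group(0).upper() if match else None
```

For example, `site_code(txt_value('"pao1a.example.net"'))` yields `"PAO"` under that assumption, and an empty or purely numeric answer yields `None` rather than a guessed code.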

Randy Bush

17 Oct

7:47 a.m.

...

Incidentally, there is a similar mechanism available for the F root nameserver, in case people are not aware:

dig @f.root-servers.net hostname.bind chaos txt

For most people this will reveal a nameserver hostname with a "PAO" or an "SFO" in it. People within the catchment of a local anycast node of F will see different site codes.

but one has little assurance that the response is from the same server as the one from which one had the dns response one is debugging.

randy

Daniel Karrenberg

9:34 a.m.

On 17.10 09:47, Randy Bush wrote:

...

but one has little assurance that the response is from the same server as the one from which one had the dns response one is debugging.

That is true. However this only matters if the operator of the server allows them to be inconsistent *and* routing is so volatile that queries are routed to different instances over short periods of time.

In my opinion the increased DDoS resilience alone outweighs this drawback. In addition the service quality can be increased as the number of places at which the service can be provided is independent of the number of server addresses available due to DNS protocol limitations.

Hard data: We probe DNS servers from 60+ points across the internet once a minute on average. We log the id.server or hostname.bind value they return. I have not completed the colour picture version of analysing this part of the data but here is a quick perl script version:

For the period from 0000UTC to 2359UTC yesterday 60 out of 63 probes (95+%) got *all* of their 1400+ answers from the *same instance* of k.root-servers.net. The three probes that talked to different instances showed 1, 2 and 4 change events respectively.

I consider this stable enough for debugging purposes. Data for f.root-servers.net shows a similar picture. Both data files are attached. We will provide this data in full colour form at dnsmon.ripe.net sometime in the coming weeks.

Daniel
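The "change events" figure quoted above can be reproduced from any log of identity answers. The following is a hedged sketch in Python, not the perl script mentioned in the message: it assumes each probe's log is a time-ordered list of the id.server or hostname.bind strings it received, and the function names and the shape of the summary are invented for illustration.

```python
def change_events(identities):
    """Count how often two consecutive answers name a different instance."""
    return sum(1 for prev, cur in zip(identities, identities[1:]) if prev != cur)


def stability_summary(per_probe):
    """Summarise instance stability.

    per_probe maps a probe name to its time-ordered list of identity strings
    (an assumed log format, not the format of the attached data files).
    """
    changes = {probe: change_events(ids) for probe, ids in per_probe.items()}
    stable = sum(1 for count in changes.values() if count == 0)
    return {"probes": len(changes), "stable": stable, "changes": changes}
```

A probe that always reaches the same instance contributes zero change events; a probe whose answers flip to another instance and back contributes two. The ratio of `stable` to `probes` corresponds to the "60 out of 63" style of figure given in the message.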

Randy Bush

4:16 p.m.

...

Hard data:

see Subject:

ORG was broken with serious customer impact, and for a while. and it took a while to debug.

qed

randy

Joe Abley

2:45 p.m.

On 17 Oct 2003, at 03:47, Randy Bush wrote:

...

...
Incidentally, there is a similar mechanism available for the F root nameserver, in case people are not aware:

dig @f.root-servers.net hostname.bind chaos txt

For most people this will reveal a nameserver hostname with a "PAO" or an "SFO" in it. People within the catchment of a local anycast node of F will see different site codes.

but one has little assurance that the response is from the same server as the one from which one had the dns response one is debugging.

In general, that is correct.

F is served by local clusters of nameservers, each arranged within a node. If successive queries use the same source port they will land on the same server, since nameserver selection within a node is done using a CEF-style (src_*, dst_*) hash (it's exactly CEF in a node which uses cisco routers, and it's "forwarding-table export per-packet-load-balance" in a node which uses Juniper routers).

It is usually unlikely that two successive queries will be answered by different local nodes of F. Certainly if you do

    dig @f.root-servers.net hostname.bind chaos txt
    dig @f.root-servers.net ${some_other_query}
    dig @f.root-servers.net hostname.bind chaos txt

and the first and last queries indicate your query is being answered by a particular node, it is extraordinarily unlikely that the middle query is answered somewhere else (since that would indicate abnormally swift BGP reconvergence on the selected path to 192.5.5.0/24, the covering route for F's v4 address).

So the answer to "hostname.bind chaos txt" can give a high-confidence identity of the node answering the query, a low-confidence identity of the server within the node answering the query in the case where source ports on successive queries are different, and a higher confidence identity of the server if care is taken (for the purposes of specific measurement exercises) to keep source ports the same and execute the two queries in quick succession.

Joe
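The three-query procedure above, with the source port held constant, can be sketched as follows. This is an illustration only: the helper names and the wording of the verdicts are hypothetical, and pinning the source relies on dig's `-b address#port` option, with `0.0.0.0#53535` chosen arbitrarily as the example value.

```python
def sandwich_commands(server, query_args, source=None):
    """Build the identity / real query / identity command lines as argv lists.

    source: optional "address#port" handed to `dig -b`, so all three queries
    share a source port and hash to the same server within a node.
    """
    pin = ["-b", source] if source else []
    ident = ["dig", *pin, "+short", f"@{server}", "hostname.bind", "chaos", "txt"]
    middle = ["dig", *pin, f"@{server}", *query_args]
    return [ident, middle, list(ident)]


def classify(before, after, pinned_source_port):
    """Turn the first and last identity answers into a rough verdict."""
    if not before or not after:
        return "no identity returned"
    if before != after:
        return "identity changed; middle answer cannot be attributed"
    if pinned_source_port:
        return "same server, higher confidence"
    return "same node likely; server within node uncertain"
```

The verdicts mirror the three confidence levels in the message: matching identities with a pinned source port give the strongest attribution, matching identities without it speak mainly to the node, and a mismatch means the middle answer cannot be tied to either server.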

bmanning＠karoshi.com

5:26 p.m.

...

...
...
dig @f.root-servers.net hostname.bind chaos txt

Joe

leads to the question that should occur elsewhere, BUT, why are there all these different ways to ID DNS servers? granted, the ISC reference implementation was first out, with the "version.bind" string, which they still use, even though the IETF has specified the use of "version.server" ... now we have id.server, hostname.bind, ... and the ultra "... check my special zone..." - grump.

--bill

William Astle

16 Oct

5:03 a.m.

On Thu, 16 Oct 2003, Joe Abley wrote:

...

I think I'm seeing problems performing recursive queries for names under ORG against tld[12].ultradns.net at the moment, which is causing resolvers without cached data to behave as if domains don't exist.

It's not trivial to tell whether this is just a local problem, since all the authoritative nameservers for ORG are anycast instances (i.e. I might just be unlucky in that my local tld[12].ultradns.net nodes are behaving unexpectedly). Is anybody else seeing issues?

That might explain the complaints some of my customers are getting from people unable to send email to them at their .org domains. What I'm seeing is this:

Even though my name servers are working correctly and answering up the correct information for the .org domains in question, a rather disturbing number of people are receiving bounce messages which indicate that the domain doesn't exist. Note that the domains in question are all .org domains and are not expired.

Unfortunately, I cannot identify a pattern to where these people who are having problems are since I do not have enough information. I am also not seeing any unusual behaviour with .org domain resolution from my name servers.

--
William Astle
finger lost@l-w.net for further information

Geek Code V3.12: GCS/M/S d- s+:+ !a C++ UL++++$ P++ L+++ !E W++ !N w--- !O !M PS PE V-- Y+ PGP t+@ 5++ X !R tv+@ b+++@ !DI D? G e++ h+ y?

Rodney Joffe

5:10 a.m.

William, William Astle wrote:

...

That might explain the complaints some of my customers are getting from people unable to send email to them at their .org domains. What I'm seeing is this:

Even though my name servers are working correctly and answering up the correct information for the .org domains in question, a rather disturbing number of people are receiving bounce messages which indicate that the domain doesn't exist. Note that the domains in question are all .org domains and are not expired.

Unfortunately, I cannot identify a pattern to where these people who are having problems are since I do not have enough information. I am also not seeing any unusual behaviour with .org domain resolution from my name servers.

Joe sent a note that identified a possible common thread in the version of bind the recursive servers were using. Could you perhaps look at that and see if there is any commonality?

Thanks...

--
Rodney Joffe
CenterGate Research Group, LLC.
http://www.centergate.com
"Technology so advanced, even we don't understand it!"(R)

William Astle

3:27 p.m.

On Wed, 15 Oct 2003, Rodney Joffe wrote:

...

Joe sent a note that identified a possible common thread in the version of bind the recursive servers were using. Could you perhaps look at that and see if there is any commonality?

I'll see what I can do about that. Unfortunately, the folks complaining haven't the necessary skills to acquire the information nor do they provide enough information to identify the ISP in question to query them.

And, as I noted previously, I am not experiencing any difficulties with .org resolution via my name servers.

--
William Astle
finger lost@l-w.net for further information

Geek Code V3.12: GCS/M/S d- s+:+ !a C++ UL++++$ P++ L+++ !E W++ !N w--- !O !M PS PE V-- Y+ PGP t+@ 5++ X !R tv+@ b+++@ !DI D? G e++ h+ y?

participants (9)

bmanning＠karoshi.com
Bruce Campbell
Daniel Karrenberg
Daniel Senie
Jeff Wasilko
Joe Abley
Randy Bush
Rodney Joffe
William Astle