On Thu, 18 Sep 2003, John Fraizer wrote:

: As has been stated by others, UltraDNS, like the roots and other TLD hosts
: is under nearly constant attack.  Perhaps your local nodes were affected
: by an attack.  IE; the pipe was full but the service was still alive so the
: anycast prefix wasn't retracted.  Bummer.  Sucks to be you.

Sucks to be anyone trying to use the service whose routers pick those nodes
as the only ones available.  That's the fault of the implementor, not the
client.

The major issue here is that no *gTLD*, particularly one of the Big Three,
should be subject to a SPOF -- even if it's only a regionally visible SPOF
due to anycast selection.  It should *always* be possible to attempt
queries to more than one physical location's servers for a gTLD.  Yet last
night, I could not query .ORG from several different locations in the
continental US, even though there were perfectly functional servers
available (in the same country, no less).

BGP errors happen (everyone here should be able to attest to that readily),
and they did.  What's to stop some other boneheaded DoS or oversight from
causing this again?  And again?

This particular outage was in the late evening in what appeared to be the
affected area from my probing, which is why people like you don't appear to
care; it "didn't affect you".  What about when it happens in the middle of
the day in your neck of the woods?

: Doesn't really matter to me though.  Bitch and moan all you like.
: Demonstrate your lack of experience and understanding.

Uh-huh.  Quite a few people here know better; they also know I am
surrounded by <cloak/> on this list and others.  If my public resume were
up to date and filled in more detail, you'd know otherwise.  Don't try to
speak for my experience from your pedestal when you don't have the
information to make that kind of baseless judgment.
On the other hand, if you can't see the fatal flaw in a major Internet
infrastructure service depending on a single point of failure, I can point
you at a few books that could enlighten you.

--
-- Todd Vierling <tv@duh.org> <tv@pobox.com>
TV> Date: Thu, 18 Sep 2003 14:22:19 -0400 (EDT)
TV> From: Todd Vierling

TV> Sucks to be anyone trying to use the service whose routers
TV> pick those nodes as the only ones available.  That's the
TV> fault of the implementor, not the client.

Yes.

TV> The major issue here is that no *gTLD*, particularly one of
TV> the Big Three, should be subject to a SPOF -- even if it's
TV> only a regionally visible SPOF

Yes.

TV> due to anycast selection.

Which would be due to a broken implementation.  Broken unicast is bad.
Not all unicast is bad.  Broken anycast is bad.  Not all anycast is bad.

TV> It should *always* be possible to attempt queries to more
TV> than one physical location's servers for a gTLD.

_Or_ guarantee that the physical location selected was indeed up.  Again,
it smells an awful lot like plain old multihoming... if you advertise the
route, you'd better be ready to handle the traffic.  (Did someone say
"7007"?)

TV> BGP errors happen (everyone here should be able to attest to
TV> that readily), and they did.  What's to stop some other
TV> boneheaded DoS or oversight from causing this again?  And
TV> again?

I've had problems with unicast when a link went down, yet the upstream
continued advertising the routes.  BGP stupidity happens with unicast
service, too.

Yes, anycast requires some additional thought and out-of-box thinking.
But that doesn't make it inherently unstable.

Eddy
--
Brotsman & Dreger, Inc. - EverQuick Internet Division
Bandwidth, consulting, e-commerce, hosting, and network building
Phone: +1 785 865 5885 Lawrence and [inter]national
Phone: +1 316 794 8922 Wichita

_________________________________________________________________
DO NOT send mail to the following addresses:
blacklist@brics.com -or- alfra@intc.net -or- curbjmp@intc.net
Sending mail to spambait addresses is a great way to get blocked.
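[Archive note: Eddy's "guarantee that the physical location selected was
indeed up" is conventionally implemented as a local health check that
withdraws the anycast announcement when the name server stops answering.
A minimal sketch in Python -- the wire format in `build_dns_query` is
standard DNS, but the probe name, the three-failure threshold, and the
`withdraw_route`/`restore_route` hooks are hypothetical placeholders for
whatever mechanism talks to the node's BGP speaker:]

```python
import socket
import struct
import time

def build_dns_query(qname, qtype=1, txid=0x1234):
    """Build a minimal DNS query packet (QTYPE 1 = A, QCLASS 1 = IN)."""
    # Header: id, flags (RD bit set), QDCOUNT=1, zero AN/NS/AR counts.
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname_wire = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname_wire + struct.pack(">HH", qtype, 1)

def node_answers(server_ip, probe_name="org", timeout=2.0):
    """Probe the local DNS daemon directly over UDP; True if it replies."""
    query = build_dns_query(probe_name)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(query, (server_ip, 53))
        reply, _ = sock.recvfrom(512)
        # Matching transaction ID and QR bit set => a genuine response.
        return reply[:2] == query[:2] and bool(reply[2] & 0x80)
    except OSError:  # timeout, ICMP unreachable, etc.
        return False
    finally:
        sock.close()

def watchdog(server_ip, withdraw_route, restore_route, max_failures=3):
    """Withdraw the anycast prefix after repeated probe failures."""
    failures = 0
    while True:
        if node_answers(server_ip):
            failures = 0
            restore_route()       # hypothetical: resume the announcement
        else:
            failures += 1
            if failures >= max_failures:
                withdraw_route()  # hypothetical: stop announcing the /24
        time.sleep(5)
```

[Note the coupling this creates: a full pipe with a still-live daemon --
the scenario John describes -- can pass a local check like this, which is
why an external service-level probe is needed as well.]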
On Thu, Sep 18, 2003 at 02:22:19PM -0400, Todd Vierling wrote:
> Sucks to be anyone trying to use the service whose routers pick those
> nodes as the only ones available.  That's the fault of the implementor,
> not the client.
I have a sneaking suspicion that if the UltraDNS TLD cluster apparently
located in Equinix-Ashburn had stopped responding to queries for two hours
last night, a lot more people would have noticed.  A *lot* more people.
I also think it's out of line to speculate, without any knowledge of their
configuration, on how UltraDNS has configured these clusters, particularly
in terms of how reachability information is verified and propagated.
> The major issue here is that no *gTLD*, particularly one of the Big
> Three, should be subject to a SPOF -- even if it's only a regionally
> visible SPOF due to anycast selection.  It should *always* be possible
> to attempt queries to more than one physical location's servers for a
> gTLD.  Yet last night, I could not query .ORG from several different
> locations in the continental US, even though there were perfectly
> functional servers available (in the same country, no less).
First it was two locations, one of which you can't tell us about (Deep
inside OSPF Area 51?) -- now it's several?  I've tried myself from many
different hosts today, and they all route to different clusters.  I'm
having trouble finding even two geographically diverse hosts that route to
the same cluster.
> BGP errors happen (everyone here should be able to attest to that
> readily), and they did.  What's to stop some other boneheaded DoS or
> oversight from causing this again?  And again?
Are you absolutely, positively sure this cluster was responding to 0 queries, but still propagating those two /24's?
> This particular outage was in the late evening in what appeared to be
> the affected area from my probing, which is why people like you don't
> appear to care; it "didn't affect you".  What about when it happens in
> the middle of the day in your neck of the woods?
The reason for this is simple -- given the query volume a TLD like .org
receives, and given just how "close" this cluster is to so many millions
of users in the eastern US, the odds of you being the *only* person, even
among the few thousand on this list, to notice a problem are incredibly
slim.

Since you won't tell us the addresses of the "several" hosts you tried to
query from, and you won't tell us exactly which queries you tried, and
how, it is incredibly hard to look into.  This is the equivalent of
calling every fire department in the nation and telling them that there is
a fire, but refusing to tell them where you are, or what you've witnessed.
> Uh-huh.  Quite a few people here know better; they also know I am
> surrounded by <cloak/> on this list and others.  If my public resume
> were up to date and filled in more detail, you'd know otherwise.  Don't
> try to speak for my experience from your pedestal when you don't have
> the information to make that kind of baseless judgment.
> On the other hand, if you can't see the fatal flaw in a major Internet
> infrastructure service depending on a single point of failure, I can
> point you at a few books that could enlighten you.
It isn't a single point of failure, but even if it were, I can assure you
that the collective experience of this list would fill quite a few more
volumes than you are capable of referring us to.  You ask that we make no
assumptions about your experience -- grant us the same courtesy.

	--msa
On Thu, 18 Sep 2003, Majdi S. Abbas wrote:

: > Sucks to be anyone trying to use the service whose routers pick those
: > nodes as the only ones available.  That's the fault of the implementor,
: > not the client.

: I think it's out of line to speculate on how UltraDNS has configured
: these clusters,

I don't care what the underlying implementation is.  I care about the
effect: that for at least one hour, possibly up to two, last night, one of
the physical locations went dead but was still considered available via
BGP, while being considered the best available path to both nets.

: First it was two locations, one of which you can't tell us about
: (Deep inside OSPF Area 51?)

I can't provide all the exact source machines for reasons I can discuss
offlist, but I'm happy to do so to a representative of UltraDNS.  My home
machine, though, is 66.56.93.94.

: now it's several?

Three, to be exact, that I verified last night to be unable to query DNS
from either IP address: one at my home (Atlanta GA), one at my employer
(Atlanta GA), and one in Chicago IL.

However, here are three straw examples of both IPs going to the same place
from spot checks right now (funny, my home machine actually gets two
different ones at this moment):

===== Southern CA =====

traceroute to tld1.ultradns.net (204.74.112.1): 1-30 hops, 38 byte packets
...
. p4-1-0-0.r00.lsanca01.us.bb.verio.net (129.250.16.80)  16.9 ms (ttl=251!)
. p16-1-1-0.r21.lsanca01.us.bb.verio.net (129.250.2.10)  19.5 ms (ttl=250!)
. ge-1-0.a01.lsanca02.us.ra.verio.net (129.250.29.131)  3.44 ms
. 66.238.50.26.ptr.us.xo.net (66.238.50.26)  13.2 ms (ttl=248!)
. dellfwisi.ultradns.net (204.74.98.2)  13.8 ms (ttl=57!) !H

traceroute to tld2.ultradns.net (204.74.113.1): 1-30 hops, 38 byte packets
...
. p5-1-0-0.RAR1.LA-CA.us.xo.net (65.106.5.13)  2.64 ms (ttl=250!)
. p0-0-0.MAR1.LA-CA.us.xo.net (65.106.5.6)  2.73 ms (ttl=249!)
. p1-0.CHR1.LA-CA.us.xo.net (207.88.81.166)  2.78 ms
. 66.238.50.26.ptr.us.xo.net (66.238.50.26)  35.0 ms
. dellfwisi.ultradns.net (204.74.98.2)  29.7 ms (ttl=57!) !H

===== Dallas TX =====

traceroute to tld1.ultradns.net (204.74.112.1): 1-30 hops, 38 byte packets
...
. p16-0-0-0.r01.atlnga03.us.bb.verio.net (129.250.4.195)  25.3 ms (ttl=250!)
. p16-2-0-0.r00.atlnga03.us.bb.verio.net (129.250.5.16)  25.3 ms (ttl=249!)
. p16-1-0-0.r01.mclnva02.us.bb.verio.net (129.250.2.48)  40.8 ms (ttl=247!)
. ge-1-0-0.a00.mclnva02.us.ra.verio.net (129.250.31.170)  40.8 ms (ttl=246!)
. 168.143.247.38 (168.143.247.38)  44.1 ms (ttl=246!)
. 64.124.112.141.ultradns.com (64.124.112.141)  45.0 ms (ttl=244!)
. dellfwpxvn.ultradns.net (204.74.104.2)  43.7 ms (ttl=53!) !H

traceroute to tld2.ultradns.net (204.74.113.1): 1-30 hops, 38 byte packets
...
. sl-bb26-fw-5-1.sprintlink.net (144.232.20.147)  7.54 ms
. sl-bb25-fw-15-0.sprintlink.net (144.232.11.89)  32.0 ms
. sl-bb23-atl-10-0.sprintlink.net (144.232.20.60)  36.4 ms
. sl-bb26-rly-14-1.sprintlink.net (144.232.20.65)  33.3 ms
. sl-st21-ash-14-2.sprintlink.net (144.232.20.3)  34.8 ms
. sl-xocomm-5-0.sprintlink.net (144.223.246.50)  34.2 ms
. p5-0-0.RAR1.Washington-DC.us.xo.net (65.106.3.133)  35.3 ms (ttl=245!)
. p6-1-0.MAR1.Washington-DC.us.xo.net (65.106.3.182)  35.7 ms (ttl=244!)
. p0-0.CHR1.Washington-DC.us.xo.net (207.88.87.10)  35.7 ms
. 64.124.112.141.ultradns.com (64.124.112.141)  39.7 ms (ttl=244!)
. dellfwpxvn.ultradns.net (204.74.104.2)  40.0 ms (ttl=53!) !H

===== Chicago IL =====

traceroute to tld1.ultradns.net (204.74.112.1): 1-30 hops, 38 byte packets
...
. gige3-2.core2.Chicago1.Level3.net (209.244.8.185)  0.796 ms
. so-4-1-0.bbr1.Chicago1.level3.net (209.247.10.165)  0.905 ms (ttl=250!)
. so-6-0-0.edge1.Chicago1.Level3.net (209.244.8.10)  1.01 ms (ttl=249!)
. verio-level3-oc12.Chicago1.Level3.net (209.0.227.66)  0.860 ms (ttl=251!)
. ge-1-2.a00.chcgil07.us.ra.verio.net (129.250.25.136)  0.967 ms (ttl=253!)
. fa-2-1.a00.chcgil07.us.ce.verio.net (128.242.186.134)  1.04 ms (ttl=251!)
. dellfweqch.ultradns.net (204.74.102.2)  0.881 ms (ttl=60!) !H

traceroute to tld2.ultradns.net (204.74.113.1): 1-30 hops, 38 byte packets
...
. 0.so-1-0-0.XL2.CHI13.ALTER.NET (152.63.69.182)  1.58 ms (ttl=251!)
. POS7-0.BR1.CHI13.ALTER.NET (152.63.73.22)  1.29 ms
. a11-0d114.IR1.Chicago2-IL.us.xo.net (206.111.2.73)  1.11 ms (ttl=251!)
. p5-0-0.RAR1.Chicago-IL.us.xo.net (65.106.6.133)  1.40 ms
. p4-0-0.MAR1.Chicago-IL.us.xo.net (65.106.6.142)  2.03 ms
. p0-0.CHR1.Chicago-IL.us.xo.net (207.88.84.10)  1.80 ms (ttl=248!)
. *
. dellfweqch.ultradns.net (204.74.102.2)  1.48 ms (ttl=60!) !H

===

: Are you absolutely, positively sure this cluster was responding to 0
: queries,

Yes.  My mail server was more or less dead (it's a .org) for an hour, and
I was trying frantically to get DNS to resolve with all kinds of "dig"
requests directly to the IPs and traceroute tests until I gave up after an
hour.

: but still propagating those two /24's?

Both traceroutes went to the same place.  I might have had more
information available, had I known this was a more complicated problem; my
original post was just a "did anyone else see this problem?" query.

I had thought, at first (because my spot checks noted above also timed
out), that the zone's "only" servers may have in fact been dead and
happened to be located in the same place -- I didn't know they were
anycasted until I posted here and received responses.  Effectively, of
course, those *were* the only servers for the zone.

: > On the other hand, if you can't see the fatal flaw in a major Internet
: > infrastructure service depending on a single point of failure, I can
: > point you at a few books that could enlighten you.

: It isn't a single point of failure,

It's a single point of failure -- or a blackhole, if you will -- when both
anycast addresses point to the same destination from any site, and that
destination is dead in the water.  The perspective of whether there is
redundant failover is from the querying site, not the provider.  What else
should I call it?
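[Archive note: the "both traceroutes went to the same place" check in the
dumps above -- tld1 and tld2 both terminating at the same dellfw* box --
can be mechanized by comparing the final responding hop of each trace.  A
small sketch; the parsing targets the ". hostname (addr)  rtt ms" dump
format shown above, and the function names are illustrative, not any
UltraDNS tool:]

```python
def last_hop(trace):
    """Return the hostname of the final responding hop in a traceroute
    dump formatted as above: one '. hostname (addr)  rtt ms' line per hop."""
    hops = []
    for line in trace.splitlines():
        body = line.strip().lstrip(". ").strip()
        if not body or body.startswith("*") or body.startswith("traceroute"):
            continue  # skip blanks, non-responding hops, and the banner
        hops.append(body.split()[0])
    return hops[-1] if hops else None

def shared_destination(trace_a, trace_b):
    """True when two traces end at the same box -- i.e. both anycast
    addresses are effectively served from a single physical location."""
    hop_a, hop_b = last_hop(trace_a), last_hop(trace_b)
    return hop_a is not None and hop_a == hop_b
```

[From a single vantage point this only demonstrates local reachability;
the interesting signal comes from running it across several networks to
see whether any region has both addresses pinned to one dead node.]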
--
-- Todd Vierling <tv@duh.org> <tv@pobox.com>
participants (3)
- E.B. Dreger
- Majdi S. Abbas
- Todd Vierling