Daniel Golding suggested that the problem was that many folks are sharing Akamai's magic DNS algorithms. This doesn't appear to be a problem with magic algorithms - it appears that they're sharing the _servers_, and that the reported attack on the servers means that it doesn't matter how magic the algorithms are. Good luck to them on developing a longer-term workaround for the next attack.
Bill Stewart, bill.stewart@pobox.com
Disclaimer: This note is, as usual, my personal opinion, not my employer's.
On 15 Jun 2004, at 21:28, Stewart, William C (Bill), RTSLS wrote:
Daniel Golding suggested that the problem was that many folks are sharing Akamai's magic DNS algorithms. This doesn't appear to be a problem with magic algorithms - it appears that they're sharing the _servers_, and that the reported attack on the servers means that it doesn't matter how magic the algorithms are. Good luck to them on developing a longer-term workaround for the next attack.
Workarounds and defences already exist, and have been in use for a long time.
The chance of catastrophic, systematic operator error (e.g. rdist gone wild, RIF-frenzied, root-wielding, caffeine-crazed sysadmins run amok) can be reduced by including nameservers managed by different organisations in the NS set.
Distributed (and non-distributed) denial of service attacks can be mitigated using dispersed anycast nameserver deployment.
Network partition/isolation events (e.g. undersea cable failures which isolate an economy) can be mitigated by strategic location of (anycast instances of) locally-relevant nameservers.
Operational routing and instrumentation challenges with managing a dispersed anycast deployment can be mitigated by including non-anycast nameservers in the NS set alongside the anycast nameservers.
Failures due to ancillary equipment failure can be avoided by eliminating single points of failure (e.g. wide geographic dispersion of nameservers into topologically-distant infrastructure).
Failures due to political interference can be avoided by deploying nameservers in complementary regions of governance.
Failures or vulnerabilities in individual DNS implementations can be mitigated by ensuring that not all nameservers in the NS set run the same DNS software (or similar software, developed from a common code base).
Failures or vulnerabilities in ancillary software (routers, switches, operating systems, etc) can be mitigated by ensuring that different nameservers rely on different brands of routers, switches and operating systems.
Failures in master servers can be mitigated by having several of them; simultaneous failure of all master servers can be managed to some degree using appropriate SOA timers, so that slave servers provide coverage while master servers are brought back into service.
Different styles of attack can be mitigated by different DNS hosting strategies. A robustly-hosted zone will have an NS set that exhibits several or all of these approaches (and others too).
The hosting of the root zone provides guidance, here.
Joe
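A rough sketch of how some of the checks in the list above might be applied to an NS set. Everything here is hypothetical: the `NameServer` record, its fields, the `audit_ns_set` function and the example hostnames are invented for illustration, and the software labels are just strings. It covers only a few of the precautions (operator, software and region diversity, and a mix of anycast and non-anycast servers), so treat it as an aid to thinking rather than a complete audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameServer:
    """Hypothetical description of one nameserver in an NS set."""
    name: str
    operator: str   # organisation that manages the server
    software: str   # DNS implementation (or code base) it runs
    region: str     # rough region of governance / geography
    anycast: bool   # True if the address is anycast


def audit_ns_set(ns_set):
    """Return warnings for diversity properties the NS set lacks."""
    warnings = []
    if len({ns.operator for ns in ns_set}) < 2:
        warnings.append("all nameservers are managed by one organisation")
    if len({ns.software for ns in ns_set}) < 2:
        warnings.append("all nameservers run the same DNS software")
    if len({ns.region for ns in ns_set}) < 2:
        warnings.append("all nameservers sit in one region")
    if not any(ns.anycast for ns in ns_set):
        warnings.append("no anycast nameservers in the NS set")
    if all(ns.anycast for ns in ns_set):
        warnings.append("no non-anycast nameservers in the NS set")
    return warnings


# Illustrative data only: names, operators and regions are made up.
fragile = [
    NameServer("ns1.example.net", "OrgA", "impl-1", "region-1", False),
    NameServer("ns2.example.net", "OrgA", "impl-1", "region-1", False),
]
robust = [
    NameServer("ns1.example.net", "OrgA", "impl-1", "region-1", True),
    NameServer("ns2.example.org", "OrgB", "impl-2", "region-2", False),
]

for label, ns_set in (("fragile", fragile), ("robust", robust)):
    print(label, audit_ns_set(ns_set))
```

The fragile set trips four of the checks; the robust set trips none, which says only that it passes these particular checks, not that it is immune to attack.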
Workarounds and defences already exist, and have been in use for a long time.
<long list removed>
Failures in master servers can be mitigated by having several of them; simultaneous failure of all master servers can be managed to some degree using appropriate SOA timers, so that slave servers provide coverage while master servers are brought back into service.
Different styles of attack can be mitigated by different DNS hosting strategies. A robustly-hosted zone will have an NS set that exhibits several or all of these approaches (and others too).
The hosting of the root zone provides guidance, here.
Joe
But you don't say how to avoid failures caused by massive confusion when maintaining an excessively complicated system....
Mark
On 16 Jun 2004, at 10:13, Mark Radabaugh wrote:
But you don't say how to avoid failures caused by massive confusion when maintaining an excessively complicated system....
By isolating the complexity to small pockets, each of which is largely invisible to the rest of the system, and reducing the coordination required between the different autonomous operators involved to manageable levels, much as the high complexity of the global routing system is managed.
This isn't just handwaving -- the root zone has been served with enormous reliability for a long time, accommodating all of the precautions on that list. That reliability is a feature of prudence and simplicity, not needless complexity and confusion.
Joe
Mark Radabaugh wrote:
But you don't say how to avoid failures caused by massive confusion when maintaining an excessively complicated system....
I don't have much to offer for the "excessively complicated" case (which I think the instant case is an example of), but there are cases as complex and complicated with some justification in my history.
For those, the best solutions involved concepts like "canned, tested, documented procedures", "quality control", "change management" (which included "staging", "testing and verification", and so on). We were not fond, in the "production" and "system test" environments, of people who made ad hoc changes of any kind.
Many years ago, I hand carried a patch through the approvals process. The group leader reviewed the purpose, urgency, test methods and test results, and signed the sheet. The district manager looked it over and asked "what are the chances that this patch could fail?" I flippantly replied "One in a million!". He handed the documents back unsigned with the words "Seven times in the Metro (Los Angeles, California) office tonight."
-- Requiescas in pace o email Ex turpi causa non oritur actio http://members.cox.net/larrysheldon/
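The arithmetic behind the district manager's reply can be spelled out. The figure of about seven million executions per night is not stated in the anecdote; it is only what "one in a million" and "seven times tonight" imply when taken together. The function names below are made up for this sketch.

```python
def implied_executions(observed_failures: int, odds_against: int) -> int:
    """Executions needed for a 1-in-`odds_against` fault to hit
    `observed_failures` times on average."""
    return observed_failures * odds_against


def expected_failures(executions: int, odds_against: int) -> float:
    """Average number of failures for a 1-in-`odds_against` fault."""
    return executions / odds_against


# "One in a million" against "seven times tonight" (inferred, not stated).
nightly = implied_executions(observed_failures=7, odds_against=1_000_000)
print(nightly)
print(expected_failures(nightly, 1_000_000))
```

A failure rate that sounds negligible per execution stops being negligible once it is multiplied by the volume of a busy production office.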
On 6/15/04 9:28 PM, "Stewart, William C (Bill), RTSLS" <billstewart@att.com> wrote:
Daniel Golding suggested that the problem was that many folks are sharing Akamai's magic DNS algorithms. This doesn't appear to be a problem with magic algorithms - it appears that they're sharing the _servers_, and that the reported attack on the servers means that it doesn't matter how magic the algorithms are. Good luck to them on developing a longer-term workaround for the next attack.
Bill Stewart, bill.stewart@pobox.com
Disclaimer: This note is, as usual, my personal opinion, not my employer's.
Bill,
The point still holds - when too much high value content shares anything (algorithm, infrastructure, etc.) you get vulnerability. The problem I was highlighting was excessive sharing, not AkaDNS magic. (Of course, everything shares the general DNS infrastructure, but the numerous roots (some of which are anycast-ed) plus the distributed nature make that tougher to completely take out.)
It looks like this was an attack on the Akamai DNS redirection infrastructure rather than the Akamai hosting infrastructure. Their DNS servers present far fewer points to attack. It would be interesting to hear a detailed analysis of the attack at some point. Maybe a good topic for the next NANOG? (Patrick? :)
Part of the difficulty of discussing this is that bringing up points of potential vulnerability in a public forum provides hints for those who would wreak havoc. I'm sure many of us can come up with other bits of vulnerable shared infrastructure, but it seems inappropriate to discuss this on such an open forum. I can only wonder if the more private forums being hosted by government organizations are effective, or simply boondoggles designed to provide political cover.
- Dan
participants (5)
- Daniel Golding
- Joe Abley
- Laurence F. Sheldon, Jr.
- Mark Radabaugh
- Stewart, William C (Bill), RTSLS