NSI bulletin 097-004 | Root Server Problems
On Wednesday night, July 16, during the computer-generation of the Internet top-level domain zone files, an Ingres database failure resulted in corrupt .COM and .NET zone files. Despite alarms raised by Network Solutions' quality assurance schemes, at approximately 2:30 a.m. (Eastern Time), a system administrator released the zone file without regenerating the file and verifying its integrity. Network Solutions corrected the problem and reissued the zone file by 6:30 a.m. (Eastern Time).

Thank you.

David H. Holtzman
Sr. VP Engineering, Network Solutions
dholtz@internic.net
Despite alarms raised by Network Solutions' quality assurance schemes, at approximately 2:30 a.m. (Eastern Time), a system administrator released the zone file without regenerating the file and verifying its integrity.
You allow mere humans to affect the process on which the whole net relies? Oh my gawd! I demand my money back! :-)

But seriously. Thanks for the explanation, David. These things happen, though rarely, one hopes. We all kinda wonder, but don't want to add to the problem by calling, whining, ... Shall we have a gaffe of the week contest, this week's major entries being this one and probably MAE-West/MFS? We could call it the Metcalf award. But, David, please don't emulate the competition by doing it twice. :-)

randy
Randy Bush writes:
Despite alarms raised by Network Solutions' quality assurance schemes, at approximately 2:30 a.m. (Eastern Time), a system administrator released the zone file without regenerating the file and verifying its integrity.
You allow mere humans to affect the process on which the whole net relies? Oh my gawd! I demand my money back! :-)
Actually, I *do* think this is silly, Randy. At several of my clients, I have the release process for firm internal DNS completely automated. The DNS in these firms is generated from relational databases (in one case it is even an Ingres database!), a nearly identical problem to the one NSI describes. These are securities trading companies, and without their DNS service they'd have complete equipment shutdown, leaving the firms unable to work -- and the likely result is that I'd never get a job again in my life, so I take considerable care here. We have LOTS of backup servers and lots of backup network connections for all machines, so believe me, this isn't a question of worrying hard about something that isn't a weak link.

In my automated release system, differences are examined between yesterday's and today's files, and if they are too large (a settable constant), the files are not released. About a dozen sanity checks are run as well. Old copies of the database are preserved in case a manual backout is needed. (By the way, all the tests run fast enough that I'd suspect a similar system built for the root zones would be more than practical, even given that they are several thousand times larger.)

In a number of years, there has never been a catastrophic failure -- the sanity checks have always stopped buggy DNS data from being put out to the clients. We've never needed to back out, and I've never worried a single night that I'd wake up and find my career was over.

I admit that the problem at NSI is larger by three orders of magnitude, but essentially the same sort of scripts could be run. If such scripts were in place at NSI, such failures, which have occurred multiple times, would never have happened. Humans CANNOT be trusted with this sort of thing. Humans are fallible. You can't have humans involved in this sort of release process.
It is my professional opinion as an engineer who has built systems almost exactly like the one described that NSI's excuses about its multiple failures in running the root zones are a reflection of poor design and management. If anyone wants to claim that I don't know what I'm talking about they are welcome to, but in this instance I know very well what I'm talking about down to the last detail. As I've said, I'VE BUILT THESE THINGS. Perry
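The release gate Perry describes (diff today's file against yesterday's, refuse to release past a settable threshold, run sanity checks, keep the old copy for manual backout) could be sketched roughly as below. This is a hypothetical illustration, not Perry's or NSI's actual code; the threshold value, file layout, and function names are all assumptions:

```python
# Sketch of an automated zone-release gate (illustrative, not production code):
# refuse to push a new zone file if it differs too much from the last good one,
# and archive the old copy so a manual backout stays possible.
import shutil

MAX_CHANGE_RATIO = 0.10  # settable constant: reject if >10% of records changed


def records(path):
    """Return the set of non-blank, non-comment lines in a zone file."""
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.lstrip().startswith(";")}


def safe_to_release(old_path, new_path):
    """Compare yesterday's and today's files; too large a diff fails the gate."""
    old, new = records(old_path), records(new_path)
    if not new:                       # sanity check: an empty zone is never OK
        return False
    changed = len(old ^ new)          # records added plus records removed
    return changed / max(len(old), 1) <= MAX_CHANGE_RATIO


def release(old_path, new_path, live_path, archive_path):
    """Archive the old zone and install the new one, but only past the gate."""
    if not safe_to_release(old_path, new_path):
        raise RuntimeError("zone diff too large -- refusing automatic release")
    shutil.copy(old_path, archive_path)   # preserved for manual backout
    shutil.copy(new_path, live_path)
```

A real deployment would add the "dozen sanity checks" Perry mentions (syntax validation, serial-number monotonicity, presence of critical records), but the set-difference threshold alone would have stopped a corrupt zone file from going out.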
Randy Bush writes:
Gosh, Perry. I wish I could be and make things that perfect.
Randy, we are engineers, not witch doctors. Real engineers can and do build things to far higher levels of safety than you seem to feel is acceptable.

I work for Wall Street firms. I've been around to witness things like a head trader at a client ordering the purchase of ~$20 billion in securities on heavy leverage, with a couple dozen people working at executing the trade for half the day in such a way as to avoid moving the market. In the sort of firms I work at, literally hundreds of millions to billions of dollars are on the line in our systems. When failures happen in this environment, people can lose money. If people lose money, YOU NEVER WORK AGAIN. No excuses. No "oh, what were you expecting, perfection?" You just don't work again.

As a result, WE BUILD THINGS NOT TO FAIL. We do things like putting every machine on two nets and dual-connecting all networks to every other network. We "red/black" checkerboard the workstations on our trading floors so that even if something really bad happens, your neighbor's machine will still be up and you can get out of your positions. We engineer things so that no single power cord, router, server or communications link is a potential source of failure. It's not trivial. You actually have to work to do this sort of thing. However, it can be done, and it is done.

Moreover, we tend to build things such that dangerous infrastructure is treated with care. We automate our management systems. We don't have humans touch anything that could cause a network or system to go bye-bye, because we can't afford to.

Do we get failures? Sure, we get failures. Do we get catastrophic failures? Very, very, very rarely. No, not never -- but very, very rarely, and when they happen we almost always recover very, very fast. I've heard tales of things not recovering fast. I've never heard tales of the people working on them after that, though. As I said, we are engineers here, NOT WITCH DOCTORS.
Saying sarcastically "I wish I could be perfect, too" is to belittle your profession.

When civil engineers build bridges, they almost never collapse. If one collapses, the engineer probably isn't going to work again, because he's just killed someone. Do collapses happen? Sure, once in a great while. Do they happen often? No.

When telephone network engineers build telephone networks, they rate their switches at two minutes of downtime per year, and they work very, very carefully. Do failures happen? Sure, back in 1989 AT&T had a catastrophic failure for a couple of hours. Do they happen regularly? No, these things are goddamned rare.

When aerospace guys build fly-by-wire computer systems, people die if they fail, so they build them so that they don't fail. Do they sometimes fail? Sure, very rarely -- no one is perfect -- but compared to the way the DNS has been failing of late it isn't even a contest.

Internet guys are no more stupid than aerospace guys or civil engineers, and frankly our job is no harder. We just have lots of lazy people who spit out lots of excuses. I'm used to an environment where excuses get you fired, though, so I find myself less tolerant of "oh, poor us" claims from others. There is no excuse for any system that will let you install a broken zone file without requiring some serious level of manual override -- there is, in fact, little excuse for building such a system so that humans have to be involved in the first place.

I'll say this for a third time: we are engineers, not witch doctors. These things *can* be built correctly.

Perry
On Thu, Jul 17, 1997 at 01:11:00PM -0800, Randy Bush wrote:
Gosh, Perry. I wish I could be as important when I grow up.
You know, in light of the fact that I got bitched at a couple of weeks ago for posting a message that was a) solicited, and b) informative (although questionably on topic; an assertion no one has _yet_ substantiated), I have to say that I think this stream of messages from Mr Bush is entirely uncalled for.

The fact is, Randy, Perry is _right_. No amount of ad hominem attack on his well-written and thought-out opinions is going to change that. It's precisely this sort of thing that causes people like the NTIA to decide we can't handle things "on our own". The job is ours until we lose it, folks. Let's grow up and play nice.

Cheers,
-- jra
--
Jay R. Ashworth <jra@baylink.com>
Member of the Technical Staff, The Suncoast Freenet, Tampa Bay, Florida
+1 813 790 7592
Unsolicited Commercial Emailers Sued
"People propose, science studies, technology conforms." -- Dr. Don Norman
On Thu, 17 Jul 97 13:11 PDT randy@psg.com (Randy Bush) wrote:
Gosh, Perry. I wish I could be as important when I grow up.
So you're actually planning on growing up then, Randy? We all look forward to it.

Neil
--
Neil J. McRae <neil@DOMINO.ORG>
Alive and Kicking. Domino: In the glow of the night.
NetBSD/sparc: 100% SpF (Solaris protection Factor)
Free the daemon in your computer! http://www.NetBSD.ORG/
"Perry E. Metzger" writes:
I admit that the problem at NSI is larger by three orders of magnitude, but essentially the same sort of scripts could be run. If such scripts were in place at NSI, such failures, which have occurred multiple times, would never have happened.
Humans CANNOT be trusted with this sort of thing. Humans are fallible. You can't have humans involved in this sort of release process.
And who wrote the QA scripts you describe? Complex systems have complex failure modes. Yes, there are clearly steps that can be taken to minimize the problems, but anybody who claims that building "robustness" into complex systems is anything other than "hard" should spend some time reading the RISKS archives (http://www.CSL.sri.com/risksinfo.html).
I guess we'll be seeing a new system administration job popping up on jobs.netsol.com?

Michael

On Thu, 17 Jul 1997, David Holtzman wrote:
On Wednesday night, July 16, during the computer-generation of the Internet top-level domain zone files, an Ingres database failure resulted in corrupt .COM and .NET zone files. Despite alarms raised by Network Solutions' quality assurance schemes, at approximately 2:30 a.m. (Eastern Time), a system administrator released the zone file without regenerating the file and verifying its integrity. Network Solutions corrected the problem and reissued the zone file by 6:30 a.m. (Eastern Time).
Thank you. David H. Holtzman Sr VP Engineering, Network Solutions dholtz@internic.net
participants (7)
-
David Holtzman
-
Gary R Wright
-
Jay R. Ashworth
-
MFS
-
Neil J. McRae
-
Perry E. Metzger
-
randy@psg.com