Randy Bush writes:
Despite alarms raised by Network Solutions' quality assurance schemes, at approximately 2:30 a.m. (Eastern Time), a system administrator released the zone file without regenerating the file and verifying its integrity.
You allow mere humans to affect the process on which the whole net relies? Oh my gawd! I demand my money back! :-)
Actually, I *do* think this is silly, Randy.

At several of my clients, I have the release process for the firm's internal DNS completely automated. The DNS in these firms is generated from relational databases (in one case it is even an Ingres database!), a nearly identical problem to the one NSI describes. These are securities trading companies, and without their DNS service their equipment would shut down completely and the firms would be unable to work. The result would likely be that I'd never get a job again in my life, so I take considerable care here. We have LOTS of backup servers and lots of backup network connections for all machines, so believe me, I am not worrying hard about something that isn't a weak link.

In my automated release system, the differences between yesterday's and today's files are examined, and if they are too large (a settable constant), the files are not released. About a dozen sanity checks are run as well. Old copies of the database are preserved in case a manual backout is needed. (By the way, all the tests run fast enough that I'd suspect a similar system built for the root zones would be more than practical, even given that they are several thousand times larger.)

In a number of years there has never been a catastrophic failure -- the sanity checks have always stopped buggy DNS data from being put out to the clients. We've never needed to back out, and I've never worried a single night that I'd wake up and find my career was over.

I admit that the problem at NSI is larger by three orders of magnitude, but essentially the same sort of scripts could be run. If such scripts were in place at NSI, such failures, which have occurred multiple times, would never have happened. Humans CANNOT be trusted with this sort of thing. Humans are fallible. You can't have humans involved in this sort of release process.
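
For the curious, here is a minimal sketch in Python of the kind of release gate described above: compare yesterday's file with today's, refuse to release if too much has changed, and run sanity checks. The 10% threshold, the two sanity checks, and every function name are illustrative assumptions, not the actual scripts or the actual dozen checks.

```python
from difflib import SequenceMatcher

# Hypothetical default for the "settable constant": the largest fraction of
# yesterday's file that may change before the release is refused.
MAX_CHANGED_FRACTION = 0.10


def changed_lines(old, new):
    """Count lines that differ between two lists of zone file lines."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced block counts as the larger of its two sides.
            changed += max(i2 - i1, j2 - j1)
    return changed


def sanity_problems(lines):
    """Two example sanity checks; a real system would run many more."""
    problems = []
    # Ignore blank lines and ';' comment lines.
    records = [l for l in lines if l.strip() and not l.lstrip().startswith(";")]
    if not records:
        problems.append("zone file contains no records")
    if not any("SOA" in l.split() for l in records):
        problems.append("zone file has no SOA record")
    return problems


def safe_to_release(old, new, max_changed_fraction=MAX_CHANGED_FRACTION):
    """Return (ok, problems); ok is True only if every check passes."""
    problems = sanity_problems(new)
    fraction = changed_lines(old, new) / max(len(old), 1)
    if fraction > max_changed_fraction:
        problems.append(
            f"{fraction:.0%} of the file changed, limit is {max_changed_fraction:.0%}"
        )
    return (not problems, problems)
```

A release script would call safe_to_release() on the two files, push the new file only when the first element of the result is True, and keep the previous file around in either case so that a manual backout stays possible.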
It is my professional opinion as an engineer who has built systems almost exactly like the one described that NSI's excuses about its multiple failures in running the root zones are a reflection of poor design and management. If anyone wants to claim that I don't know what I'm talking about they are welcome to, but in this instance I know very well what I'm talking about down to the last detail. As I've said, I'VE BUILT THESE THINGS.

Perry