EXECUTIVE SUMMARY
-----------------

At 1830 EDT on 8/23/2000, the Network Solutions Registry was notified of a problem with four root name servers--b.root-servers.net, g.root-servers.net, j.root-servers.net and m.root-servers.net. The root zone on each of these servers was missing delegation information (NS records) for the com zone. The NSI Registry took immediate action, working with the operators for each of these root name servers to correct the problem. B.root, g.root and j.root were corrected by 1900 EDT; m.root by 1950 EDT.

At no point did NSI make any changes to the root zone. The absence of the com zone delegation information on these name servers resulted from BIND name server behavior that is described in greater detail below. NSI acted immediately to resolve the problem, implemented immediate procedures to avoid recurrence in the short term, and has an action plan to implement a more robust longer-term solution.

ROOT CAUSE
----------

If a BIND name server is authoritative for both a child zone and its parent zone, and the name server is restarted or reloaded with a missing zone database file for the child zone, the name server removes the child zone's delegation information from the in-memory copy of the parent zone.

As part of NSI's zone generation process, there is a small time interval when the com zone database is not present on the NSI name server in question (during the backup of the old database and copying of the new database). On Wednesday, the name server was manually restarted during this interval; therefore, the name server removed the com zone delegation information from the root zone held in memory.

NSI had never encountered this BIND behavior before, nor were several BIND experts we contacted aware of its existence. Subsequent investigation by NSI has revealed that BIND versions 8.1.2 and 8.2.2-P5 exhibit this behavior, but BIND 9.0.0rc4 does not. Nominum has indicated it will pursue a patch to BIND 8.

CORRECTIVE ACTIONS
------------------

To restore service, NSI immediately reinstated the com zone database file on our server, increased the root zone's serial number and restarted the name server. B.root, g.root, and j.root immediately loaded the new root zone and were correctly issuing referrals to the com zone within minutes. M.root (in Japan) took nearly an hour to load the new root zone.

NSI is modifying the current zone generation process to eliminate the existing small interval during which the com zone database file is not present on this nameserver. Until then, NSI is manually querying the root zone to ensure no delegations have been automatically dropped.

--------------------------------------
Brad Verd
gTLD Operations Manager
Network Solutions Registry
Email - bverd@netsol.com
---------------------------------------
[ On Friday, August 25, 2000 at 10:24:46 (-0400), Verd, Brad wrote: ]
Subject: Follow-up to "ROOT SERVERS"
> NSI is modifying the current zone generation process to eliminate the existing small interval during which the com zone database file is not present on this nameserver.
I thought database researchers had solved this very problem decades ago. I *know* that it's a trivial problem to solve in the most basic sense on any modern Unix or unix-like system with the rename(2) system call, and that call has been widely available, and widely used, for many years.

Assuming the .COM zone database file is not created by named-xfer (if it were, then I'd expect from my cursory examination of the most recent code that named would be doing the right thing to ensure there is absolutely no window where the file does not exist), then I'm totally flabbergasted that whoever is responsible for writing, and especially those responsible for reviewing and approving, the zone file update process made such a fundamental and critical implementation error -- one that I'm sure even any competent CompSci student would be loath to make!

Oddly enough, your response does not indicate whether or not this particular failure mode is related in any way to previous failures with root and/or TLD servers.
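For concreteness, here is a minimal sketch of the rename(2) approach described above, written in Python. It is not NSI's process, which is not shown in this thread; the function name `install_zone_file` and the file names are hypothetical. The new contents are written to a temporary file in the same directory and then renamed over the live file, so a name server restarted at any moment finds either the complete old file or the complete new one.

```python
import os
import tempfile


def install_zone_file(path: str, new_contents: str) -> None:
    """Replace the file at `path` without any window in which it is missing.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".zone-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_contents)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        # os.replace() maps to rename(2) on POSIX systems: the target name
        # always refers to either the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A backup of the old database can still be kept by copying (not moving) the live file before calling the function. Note that mkstemp creates the file readable only by its owner, so a real deployment would also need to set whatever permissions the name server requires.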
> Until then, NSI is manually querying the root zone to ensure no delegations have been automatically dropped.
Now there's an indication of the real root of the problem (pardon the pun)!!!!

If real-time 24x7 consistency checks weren't already automated *YEARS* ago by those responsible for root and TLD servers then something's drastically wrong with the operation of the Internet! Indeed I'd naively assumed that by now any problems with the most critical servers would automatically be detected by redundant monitoring and at a minimum ring the pager or cell phone of at least one person!

Perhaps those of us who build such monitoring systems for far lesser systems should just give up -- obviously our efforts are fruitless in any more global failure.... Or do we all independently monitor all of the critical third-party servers we rely upon, such as all of the root and TLD servers, and then send you e-mail or page every time we spot an issue?

-- 
Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
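As an illustration of the kind of automated delegation check discussed above, the following Python sketch scans a textual dump of a parent zone and reports which expected child zones have no NS record. It is a simplified assumption, not any operator's actual monitoring: the function name `missing_delegations` is hypothetical, the sample records are illustrative, and the parser assumes one record per line with an explicit owner name. A real monitor would query each server over DNS and page someone when the list is non-empty.

```python
def missing_delegations(zone_text: str, expected_children: list[str]) -> list[str]:
    """Return the expected child zones that have no NS record in zone_text.

    Simplified parser (assumption): each record sits on one line in the form
    `owner [ttl] [class] NS target`, with the owner name always present.
    """
    delegated = set()
    for line in zone_text.splitlines():
        line = line.split(";", 1)[0]  # drop zone-file comments
        fields = line.split()
        # The record type sits between the owner name and the target.
        if len(fields) >= 3 and "NS" in (f.upper() for f in fields[1:-1]):
            delegated.add(fields[0].lower().rstrip("."))
    return [
        child for child in expected_children
        if child.lower().rstrip(".") not in delegated
    ]


# Illustrative data only: a root zone dump that has lost its com delegation.
sample_root = """
.     518400 IN NS a.root-servers.net.
net.  172800 IN NS a.gtld-servers.net.
org.  172800 IN NS a.gtld-servers.net.
"""

if __name__ == "__main__":
    print(missing_delegations(sample_root, ["com", "net", "org"]))
```

Run periodically against every root server's copy of the zone, a check along these lines would flag a dropped delegation without anyone having to query the zone by hand.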