I'm confused again after reading Network Solutions' press release. The press release makes it sound like the problem was with the four other root servers. On the other hand, the technical description sounds like the original problem was with NSI's root server: NSI's server dropped the COM zone from the root zone file. The other servers simply copied the information provided by NSI's server, and are only culpable insofar as NSI's server provided a root zone file with missing information. Is this correct? If the mistake was that four independent root servers made the same mistake at the same time, I'd like to know how.

Further, the Network Solutions press release states the impact was negligible because DNS resolvers look for multiple servers. This is only partially true. DNS resolvers look for other servers only when a server is unavailable. However, when a server has incorrect information, such as a root zone missing a delegated zone, won't the other servers return NXDOMAIN, which resolvers will assume is an authoritative answer? Therefore any user of those four other servers would have received authoritative answers that .COM did not exist? Is this correct? How many queries do those four servers normally handle?

I realize that press releases aren't intended to convey technical information, and therefore it is important to release the technical information through other channels. But a press release should keep close to what happened.
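(Aside, to make the failover distinction concrete: a resolver moves on to another server only on a timeout or SERVFAIL; an NXDOMAIN is a final answer and gets accepted and cached. Below is a rough sketch of probing roots directly for the COM delegation, using dnspython -- my own tooling choice, not anything from NSI's report -- with most root addresses omitted.)

  # Query a few root servers directly for the COM NS records and report
  # whether each returns a referral, an NXDOMAIN, or no answer at all.
  # A broken-but-reachable server answering NXDOMAIN would be accepted
  # by a resolver as authoritative, unlike a timeout, which triggers
  # failover to another server.
  import dns.exception
  import dns.message
  import dns.query
  import dns.rcode

  ROOT_SERVERS = {
      "a.root-servers.net": "198.41.0.4",
      "f.root-servers.net": "192.5.5.241",
      # ... remaining roots omitted for brevity
  }

  query = dns.message.make_query("com.", "NS")
  for name, addr in ROOT_SERVERS.items():
      try:
          response = dns.query.udp(query, addr, timeout=3)
      except (dns.exception.Timeout, OSError):
          print(f"{name}: no response (a resolver would try another server)")
          continue
      rcode = response.rcode()
      if rcode == dns.rcode.NXDOMAIN:
          print(f"{name}: NXDOMAIN (a resolver would accept this as final)")
      else:
          records = len(response.answer) + len(response.authority)
          print(f"{name}: rcode={dns.rcode.to_text(rcode)}, {records} record sets")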
At 12:15 PM -0700 8/25/00, Sean Donelan wrote:
How many queries do those four servers normally handle?
Using the most basic of logic (slightly flawed, admittedly)...

  r = number of daily root-server .COM requests
  t = total number of root servers
  b = 4 (number of bad root servers)

  affected requests = (b/t) * r

Now, since I suspect that "r" is a sufficiently high value, I would say the impact would not be "negligible". If it was "negligible" and "users would not notice", then it wouldn't have been reported to NSI; we are all end users of their system, and we noticed. :)
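(A quick sketch of that back-of-envelope estimate. The daily query volume r is a made-up placeholder -- the real figure is exactly what was being asked for -- so only the fraction b/t is meaningful here.)

  t = 13                     # total number of root servers
  b = 4                      # root servers serving the broken zone
  r = 1_000_000_000          # assumed daily root-server queries (placeholder)

  affected = (b / t) * r
  print(f"fraction hitting a broken root: {b / t:.0%}")
  print(f"estimated affected queries/day: {affected:,.0f}")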
But a press release should keep close to what happened.
... and not an outright lie? ;-) Sounds like NSI is starting to take some pointers from the Microsoft School of Press Releases.... ;-) d
Further, the Network Solutions press release states the impact was negligible because DNS resolvers look for multiple servers. This is only partially true. DNS resolvers look for other servers only when a server is unavailable. However, when a server has incorrect information, such as a root zone missing a delegated zone, won't the other servers return NXDOMAIN, which resolvers will assume is an authoritative answer? Therefore any user of those four other servers would have received authoritative answers that .COM did not exist? How many queries do those four servers normally handle?
The impact probably was rather small, both because resolvers look for multiple servers (not because they try another one after receiving an authoritative NXDOMAIN, but because only some fraction of resolvers would have tried a broken root server first -- other resolvers, trying other roots, would not have had the issue), and because the missing domain was .COM, which is likely cached in most resolvers anyway.

So the only impact was to clients of caching servers that (a) expired .COM from their cache (or restarted) during the interval of brokenness, and (b) proceeded to attempt to refresh it from a broken root. And even then, those clients would have been impacted only to the extent that they accessed a sub-domain of .COM that was not cached. (So, for example, clients of a server that met requirements (a) and (b) above could probably still get to, say, yahoo.com, because chances are yahoo.com and .com wouldn't both expire from the cache during the period of brokenness. But obscurecompany.com would probably not have been reachable from the same caching server, because it likely wouldn't have been in the cache.)

The NSI posting indicated that trouble reports came in at 18:30. If we assume it took at most half an hour for them to receive reports, the start time of the failure would be 18:00. According to the NSI post, three servers were corrected at 19:00 and one at 19:50. There are 13 authoritative servers for ".". Assuming each is "preferred" by 1/13 of the caching servers out there, that means:

  - 69% (9/13) of the caching servers would have been querying a non-broken root;
  - 23% (3/13) would have been querying a broken root for 1 hour;
  - 8% (1/13) would have been querying a broken root for about 2 hours.

The COM NS records have a TTL of 6 days, or 144 hours. So of the servers in the second group, about 1/144 would have expired COM during the brokenness and thus actually have queried for .COM and received an NXDOMAIN; about 2/144, or 1/72, of the third group would have done the same. So the percentage of caching servers we can expect to have failed is:

  3/13 * 1/144 + 1/13 * 2/144 = 0.27 percent

Not good, but certainly not catastrophic or widespread. The NSI release was technically inaccurate, but not far off the mark in terms of impact. It's also incorrect to say everything was OK at 19:50, though. All the root servers were apparently functioning properly then, but the NXDOMAIN for COM likely remained cached for considerably longer in those 0.27% of servers.

 -- Brett
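(For the record, a small sketch that reproduces the arithmetic above under the same assumptions -- uniform root preference across 13 servers, a 144-hour COM TTL, and cache-expiry times spread evenly over that TTL. The numbers come from the post; nothing here is measured.)

  # Expected fraction of caching servers that would have expired .COM and
  # refreshed it from a broken root during the outage window.
  total_roots = 13
  ttl_hours = 144.0

  # (number of broken roots, hours of brokenness) per group
  broken_groups = [(3, 1.0), (1, 2.0)]

  p_failed = sum(
      (count / total_roots) * (hours / ttl_hours)
      for count, hours in broken_groups
  )
  print(f"expected fraction of caching servers affected: {p_failed:.2%}")
  # -> roughly 0.27%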
participants (3)
- Brett Frankenberger
- Derek J. Balling
- Sean Donelan