Re: [fjk].gtld-servers.net bogus for .com

12 Nov 1998

Just a few clarifications... nothing new, just some explanations of
various things.

On Wed, 11 Nov 1998, Dean Robb wrote:
...
At 15:36 11/11/98 -0600, you wrote:
...
Fixed on the next daily update? So when the AOL problem happened, a special 
update was done, but when several hundred (anyone know how many really) 
entries are trashed, we all must wait until the next daily update?
Another Public Relations/Customer Service triumph for NSI/InterNIC. </sarcasm>
I suspect more than just the fjk servers are hosed...last night around
midnight I was surfing and had over ten sites disappear between one load
and the next.  The domain names ran the gamut from "f" to "u".  Given the
time frames, they likely disappeared as the update propagated.  Now, either
a whole lot of sites simultaneously had server crashes or....
[fjk] do _not_ serve domain names starting with [fjk].  All servers serve
all names.  Without knowing more, what you experienced could have had any
number of causes.

I don't know when people first became aware of this.  I would hope some
were aware before I complained at ~1000 PST, and NSI should have been
aware right away when it happened: if they don't have automated checking
of each server, with a very high notification priority, they are even
worse than stupid.  So I'm somewhat doubtful it started at midnight, but
it is possible.  NSI does make it hard for anyone who may notice it to
contact them.

I can't understand, however, why it took over two hours to bring down all
the badly broken servers.  Some were corrected within 15 minutes or half
an hour after I complained (and who knows how long after the appropriate
people were first notified).  One wasn't.

On Wed, 11 Nov 1998, Michael P. Lucking wrote:
...
Fixed on the next daily update? So when the AOL problem happened, a special 
update was done, but when several hundred (anyone know how many really) 
entries are trashed, we all must wait until the next daily update?
Don't take that too literally.

It isn't entries that were trashed AFAIK, but servers.  A number of (or
all) servers appear to have had trouble updating their zone file.  So far
so good.  Simply not being updated won't kill anything.  Some lost the zone
(on purpose or due to a bug, I don't know) and were acting mostly like a
lame delegation.  No huge problem.  Some lost all (or a very large %) of
.com yet were still thinking they were authoritative and returning various
false negatives.  I know of three that were like that, and have had
reports of more.  Anyone asking one of those servers would be incorrectly
told the domain doesn't exist.  

This is a VERY bad failure mode.
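
The difference between the "lame" case and the "false negative" case shows up in the header of the reply.  Here is a rough sketch in Python of how a checker might tell them apart, assuming it already has the raw reply packet for a name it knows is delegated.  The function name and the category labels are made up for illustration; the flag and RCODE layout is the standard DNS header.  A server that simply has a stale zone can't be spotted this way, since its replies look normal.

```python
import struct

# RCODE values from the standard DNS header
NOERROR, SERVFAIL, NXDOMAIN, REFUSED = 0, 2, 3, 5


def classify_reply(packet: bytes, name_known_to_exist: bool = True) -> str:
    """Classify a raw DNS reply to a query for a name that should be delegated.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    if len(packet) < 12:
        return "malformed"

    # Header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (16-bit, big-endian)
    _id, flags, _qd, ancount, nscount, _ar = struct.unpack("!6H", packet[:12])
    authoritative = bool(flags & 0x0400)  # AA bit
    rcode = flags & 0x000F

    # Claims authority and says the name does not exist: the bad failure mode.
    if rcode == NXDOMAIN and authoritative and name_known_to_exist:
        return "false-negative"

    # Refuses, fails, or returns an empty non-authoritative reply:
    # behaves like a lame delegation, so resolvers move on to another server.
    if rcode in (SERVFAIL, REFUSED):
        return "lame"
    if rcode == NOERROR and not authoritative and ancount == 0 and nscount == 0:
        return "lame"

    # Answer or referral with records in it.
    if rcode == NOERROR:
        return "ok"
    return "other"
```

The point of separating the two is that a resolver will try another server after a "lame" reply, but will believe a "false-negative" one.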

What is the impact?  Well, if 3/12 were doing this then ~1/4 of the
queries (probably not that evenly distributed, but in that ballpark) would
have got false negatives.  Now, that is only 1/4 of all queries to the
root servers.  Domains with a large TTL that were in caches wouldn't be as
impacted.  Domains with a small TTL (e.g. 5 minutes) would be very impacted
because they would expire from caches so quickly.
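
To put rough numbers on that, here is a back-of-the-envelope sketch in Python.  It assumes each lookup goes to one of the 12 servers picked uniformly at random with no retry against another server (real resolvers tend to favor faster servers), that a cached name is refetched once each time its TTL expires, and that the name was freshly cached when the breakage began.  The function names and the two-hour window are illustrative assumptions, not measurements.

```python
def false_negative_fraction(bad_servers: int, total_servers: int) -> float:
    """Share of uncached queries that land on a broken server."""
    return bad_servers / total_servers


def chance_of_at_least_one_failure(bad_servers: int, total_servers: int,
                                   ttl_seconds: int, window_seconds: int) -> float:
    """Probability that some cache refetch during the window hits a broken server.

    Assumes one refetch per TTL expiry and independent, uniform server choice.
    """
    refetches = window_seconds // ttl_seconds
    p_good = 1 - false_negative_fraction(bad_servers, total_servers)
    return 1 - p_good ** refetches


if __name__ == "__main__":
    two_hours = 2 * 60 * 60
    print(false_negative_fraction(3, 12))                         # 0.25
    print(chance_of_at_least_one_failure(3, 12, 300, two_hours))    # 5 minute TTL
    print(chance_of_at_least_one_failure(3, 12, 86400, two_hours))  # 1 day TTL
```

Under those assumptions a 5 minute TTL means 24 refetches in two hours, so it is almost certain to hit a broken server at least once, while a one day TTL that was cached just before the breakage never asks at all.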

A lot of email is particularly badly impacted, because not only does the
domain it is being sent to have to resolve, but on many systems the
sender's domain has to resolve.  

Any resolver implementations that do not put a short upper bound on
negative caching TTLs would be _VERY_ hard hit by this and could still be
having problems unless they were restarted.  I have heard that one of MS's
products is like this, but that is just a vague rumor.  
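
For illustration, here is a minimal sketch in Python of what "a short upper bound on negative caching TTLs" could look like inside a resolver's cache.  The class name, method names, and the 300 second cap are all hypothetical choices for the example and are not taken from any particular implementation.

```python
import time

MAX_NEGATIVE_TTL = 300  # seconds; an assumed cap, pick whatever suits you


class NegativeCache:
    """Remembers "this name does not exist" answers, but never for too long."""

    def __init__(self, max_ttl=MAX_NEGATIVE_TTL, clock=time.monotonic):
        self._max_ttl = max_ttl
        self._clock = clock      # injectable so the cache can be tested
        self._expires = {}       # lowercased name -> expiry timestamp

    def remember_missing(self, name, ttl):
        # Clamp whatever TTL the server suggested to the local upper bound,
        # so one bogus negative answer cannot stick around for a day.
        ttl = min(ttl, self._max_ttl)
        self._expires[name.lower()] = self._clock() + ttl

    def is_known_missing(self, name):
        key = name.lower()
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[key]   # expired: force a fresh lookup
            return False
        return True
```

With a cap like this, a false negative picked up from a broken server is forgotten within a few minutes of the servers being fixed, and no restart is needed.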

Getting back to your question, "the update being completed" refers to
servers being able to transfer the proper zone files and put them in
place.

Marc Slemko