On Fri, 13 Nov 1998, Sean Donelan wrote:
Yes and no. It would be fairly 'easy' to become an editor, start Donelan's Journal Of Internet Disasters, and get a number of noted experts to contribute articles analyzing failures without the cooperation of the organizations involved. But I can predict what the organizations in question would say about such an endeavor:
Your predictions are wrong; however, they would be true if this journal were edited by someone other than yourself. You have a significant amount of credibility in the industry, and if you did edit such a journal, it would be taken seriously.
I'm going to get pedantic. The results may be obvious, but the cause isn't. I would assert there are a number of large failures where the initial obvious cause has turned out to be wrong (or only a contributing factor).
This is a prime example of why your credibility in regard to disaster and disruption analysis is so high. You not only have the background knowledge to understand it and the willingness to research the things you don't know, but you also have the right sceptical attitude that does not stop questioning the situation just because a nice answer has arrived.
difficult for an outside group to analyze the failure. In particular I think it would have been close to impossible for an outside group to find the other contributing factors.
As an editor of a network outages journal, you wouldn't be expected to do all the investigative legwork yourself. But I think that your evenhanded treatment of the events would tend to draw out the internal investigation reports of the companies involved. I think that you could run such a journal in a way that would largely evade the negative effects that people fear from disclosure, because of your ability to draw parallels with disaster situations in other industries.
- Last month the .GOV domain was missing on a.root-servers.net due to a 'known bug' affecting zone transfers from GOV-NIC
- Someone has been probing DNS ports for an unknown reason
- It is known that various individuals flood the Internic with packets related to attempts to suck down the whois database, one item at a time, and/or to detect when a specific domain name goes off hold and becomes available for re-registration
- pathshow indicated that the Internic circuit over which AXFR was being attempted was congested
- f.root-servers.net and NSI's servers reacted differently. What are the differences between them (BIND versions, in-house source code changes, operating systems/run-time libraries/compilers)?
Whatever was causing the Internic link to be congested could have disrupted NSI's server. Wasn't Vixie's server acting properly by answering lame for the zones it could not retrieve? It seems like all the problems revolve around NSI's server and network; Vixie's problems were merely a symptom.

On the other hand, I would classify the inability of AXFR to transfer the zone as a weakness in BIND that could be addressed. Additionally, since it is known that zone transfers require a certain amount of bandwidth, Vixie could improve his operations by implementing a system that monitors the bandwidth with pathshow prior to initiating AXFR. He could also monitor the progress of the AXFR and alarm if it was taking too long. This would have allowed a fallback to FTP sooner, and operationally, such a fallback might even be something that could be automated.

Of course, none of this means Vixie was at fault, and I'd argue that NSI is at fault for not being able to detect the problem sooner and not being able to swap in a backup server sooner. Vixie knows that he runs one of 13 root nameservers, but NSI knows that they run the one and only master root nameserver, which puts more responsibility on them.

--
Michael Dillon - E-mail: michael@memra.com
Check the website for my Internet World articles - http://www.memra.com
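P.S. The alarm-and-fallback idea above could be sketched roughly as follows. This is a hypothetical watchdog, not anything Vixie or NSI actually ran; the command lines passed in would be placeholders for whatever AXFR and FTP retrieval commands an operator actually uses.

```python
import subprocess

def fetch_zone(axfr_cmd, ftp_cmd, timeout_secs):
    """Try a zone transfer with a deadline; fall back to FTP if it stalls.

    axfr_cmd and ftp_cmd are argv lists for the operator's own transfer
    commands (hypothetical -- e.g. a dig axfr wrapper and an ftp fetch
    script). Returns the method that succeeded: "axfr" or "ftp".
    """
    try:
        # Alarm if the AXFR takes too long, rather than letting it hang
        # indefinitely on a congested circuit.
        subprocess.run(axfr_cmd, check=True, timeout=timeout_secs)
        return "axfr"
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Automated fallback: retrieve the zone file another way.
        subprocess.run(ftp_cmd, check=True)
        return "ftp"
```

A cron job could call this periodically and page an operator whenever the return value is "ftp", which is the "alarm sooner" behavior argued for above.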