On Fri, 13 Nov 1998, Michael Dillon wrote:
> - f.root-servers.net and NSI's servers reacted differently. What are the differences between them (BIND versions, in-house source code changes, operating systems/run-time libraries/compilers)?
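One way to get at the BIND-version part of that question from the outside is the version.bind chaos-class TXT query. A rough sketch in Python with the dnspython library (the addresses here are just examples, and servers are free to refuse or fake the answer):

    #!/usr/bin/env python
    # Illustrative only: ask a couple of root servers what BIND version
    # they claim to run, via the version.bind chaos-class TXT query.
    # Needs the dnspython library; servers may hide or fake the answer.
    import dns.message, dns.query, dns.rdatatype, dns.rdataclass

    SERVERS = {"a": "198.41.0.4", "f": "192.5.5.241"}  # example addresses

    for letter, addr in sorted(SERVERS.items()):
        q = dns.message.make_query("version.bind.", dns.rdatatype.TXT,
                                   dns.rdataclass.CH)
        try:
            resp = dns.query.udp(q, addr, timeout=5)
            answer = resp.answer[0][0] if resp.answer else "no answer"
        except Exception as exc:
            answer = "query failed: %s" % exc
        print("%s.root-servers.net: %s" % (letter, answer))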
Whatever was causing the InterNIC link to be congested could also have disrupted NSI's server. Wasn't Vixie's server acting properly by answering lame for the zones it could not retrieve? It seems like all the problems revolve around NSI's server and network; Vixie's problems were merely a symptom.

On the other hand, I would classify the inability of AXFR to transfer the zone as a weakness in BIND that could be addressed. Additionally, since it is known that zone transfers require a certain amount of bandwidth, Vixie could improve his operations by measuring the available bandwidth with pathchar before initiating an AXFR. He could also monitor the progress of the AXFR and raise an alarm if it was taking too long. That would have allowed a fallback to FTP sooner, and operationally such a fallback might even be automated (see the sketch below).

Of course, none of this means Vixie was at fault, and I'd argue that NSI is at fault for not detecting the problem sooner and not swapping in a backup server sooner. Vixie knows that he runs one of 13 root nameservers. NSI knows that they run the one and only master root nameserver, which puts more responsibility on them.
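To make the fallback idea concrete, here is a rough sketch of what an automated transfer-with-deadline might look like. To be clear, this is an illustration, not anything Vixie actually runs: the master address, FTP host, and file path are invented, and it shells out to dig rather than speaking DNS itself.

    #!/usr/bin/env python
    # Hypothetical AXFR watchdog: run the zone transfer with a hard
    # deadline, and fall back to fetching the published zone file over
    # FTP if the transfer stalls or fails.  All names below are made up.
    import subprocess, sys
    from ftplib import FTP

    MASTER   = "198.41.0.4"        # stand-in for the master server
    ZONE     = "."                 # the root zone
    DEADLINE = 600                 # seconds before we give up on AXFR
    FTP_HOST = "ftp.example.net"   # stand-in for a fallback FTP mirror
    FTP_PATH = "zones/root.zone"

    def axfr(master, zone, deadline):
        """Try the zone transfer via dig; return zone text or None."""
        try:
            out = subprocess.run(["dig", "@" + master, zone, "axfr"],
                                 capture_output=True, text=True,
                                 timeout=deadline)
        except subprocess.TimeoutExpired:
            return None            # taking too long: the alarm point
        if out.returncode != 0 or "Transfer failed" in out.stdout:
            return None
        return out.stdout

    def ftp_fallback(host, path):
        """Fetch the published zone file instead."""
        lines = []
        ftp = FTP(host)
        ftp.login()                # anonymous
        ftp.retrlines("RETR " + path, lines.append)
        ftp.quit()
        return "\n".join(lines)

    data = axfr(MASTER, ZONE, DEADLINE)
    if data is None:
        sys.stderr.write("AXFR stalled or failed; falling back to FTP\n")
        data = ftp_fallback(FTP_HOST, FTP_PATH)
    open("root.zone", "w").write(data)

Run from cron, something along these lines would have flagged the stalled transfer and switched to FTP within minutes instead of hours.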
There have been no even remotely logical claims that f.root-servers.net caused any problems at all. If Paul's server had been working correctly and had transferred the zone properly, the impact of NSI's screwups would have been almost exactly the same. What you are discussing is a problem, but not "the" problem, and not a problem that causes a significant impact over the short term. It is important to keep that clear in messages; NSI has already spread enough lies, so adding confusion about the issue isn't wise.

In fact, the fact that at least three of NSI's servers were giving false NXDOMAINs isn't really the issue either, from NANOG's perspective. It needs to be figured out, and it is a major problem in BIND, but it isn't necessarily something they could have or should have been able to prevent before it happened: that is very difficult to judge from the outside, and I can certainly imagine situations where, despite the best operations anywhere, such things could not have been predicted.

The big issue that needs to be addressed is why the heck it took NSI over two hours after they were notified to fix it, especially in the middle of the day, and why they didn't have any automated system that detected it and notified them within minutes (even a trivial probe along the lines of the sketch at the end of this message would have done it). Whatever the exact problem was is important and needs to be addressed, but addressing each instance is pointless without knowing why NSI's operations procedures are so flawed. In fact, they are so flawed that the VP of engineering either had no idea what was going on or chose to lie.

The problem is that NSI currently has no accountability (not even to their customers), and doesn't even make a token effort to follow up on their screwups. The organization that controls the root nameservers should have one of the best operations departments, not one of the worst.
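For what it's worth, the kind of automated check I mean is trivial. A sketch, again with dnspython, where the server list and the alert hook are stand-ins for whatever NSI would actually use: ask each server about a few names that must exist, and page someone on NXDOMAIN.

    #!/usr/bin/env python
    # Sketch of an automated false-NXDOMAIN detector.  Uses dnspython;
    # the server list and alert hook below are stand-ins, not anyone's
    # real monitoring setup.
    import dns.message, dns.query, dns.rcode, dns.rdatatype

    SERVERS = {"a": "198.41.0.4", "f": "192.5.5.241"}  # example addresses
    KNOWN_GOOD = ["com.", "net.", "org."]  # must never be NXDOMAIN

    def alert(msg):
        print("ALERT:", msg)       # stand-in for a pager or mail hook

    for letter, addr in sorted(SERVERS.items()):
        for name in KNOWN_GOOD:
            q = dns.message.make_query(name, dns.rdatatype.NS)
            try:
                resp = dns.query.udp(q, addr, timeout=5)
            except Exception:
                alert("%s.root-servers.net (%s) is not answering"
                      % (letter, addr))
                continue
            if resp.rcode() == dns.rcode.NXDOMAIN:
                alert("%s.root-servers.net returned NXDOMAIN for %s"
                      % (letter, name))

Run that from cron every few minutes and the false NXDOMAINs page someone long before the complaints start rolling in.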