On Fri, 13 Nov 1998, Michael Dillon wrote:
> - f.root-servers.net and NSI's servers reacted differently. What are the differences between them (BIND versions, in-house source code changes, operating systems/run-time libraries/compilers)?
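One way to get at the BIND-version part of that question from the outside is the version.bind chaos-class TXT query. A rough sketch in Python with the dnspython library (the addresses here are just examples, and servers are free to refuse or fake the answer):

    #!/usr/bin/env python
    # Illustrative only: ask a couple of root servers what BIND version
    # they claim to run, via the version.bind chaos-class TXT query.
    # Needs the dnspython library; servers may hide or fake the answer.
    import dns.message, dns.query, dns.rdatatype, dns.rdataclass

    SERVERS = {"a": "198.41.0.4", "f": "192.5.5.241"}  # example addresses

    for letter, addr in sorted(SERVERS.items()):
        q = dns.message.make_query("version.bind.", dns.rdatatype.TXT,
                                   dns.rdataclass.CH)
        try:
            resp = dns.query.udp(q, addr, timeout=5)
            answer = resp.answer[0][0] if resp.answer else "no answer"
        except Exception as exc:
            answer = "query failed: %s" % exc
        print("%s.root-servers.net: %s" % (letter, answer))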
Whatever was causing the InterNIC link to be congested could also have disrupted NSI's server. Wasn't Vixie's server acting properly by answering lame for the zones it could not retrieve? It seems like all the problems revolve around NSI's server and network; Vixie's problems were merely a symptom.

On the other hand, I would classify the inability of AXFR to transfer the zone as a weakness in BIND that could be addressed. Additionally, since it is known that zone transfers require a certain amount of bandwidth, Vixie could improve his operations by measuring the available bandwidth with pathchar before initiating an AXFR. He could also monitor the progress of the AXFR and raise an alarm if it was taking too long. That would have allowed a fallback to FTP sooner, and operationally such a fallback might even be automated (see the sketch below).

Of course, none of this means Vixie was at fault, and I'd argue that NSI is at fault for not detecting the problem sooner and not swapping in a backup server sooner. Vixie knows that he runs one of 13 root nameservers. NSI knows that they run the one and only master root nameserver, which puts more responsibility on them.
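To make the fallback idea concrete, here is a rough sketch of what an automated transfer-with-deadline might look like. To be clear, this is an illustration, not anything Vixie actually runs: the master address, FTP host, and file path are invented, and it shells out to dig rather than speaking DNS itself.

    #!/usr/bin/env python
    # Hypothetical AXFR watchdog: run the zone transfer with a hard
    # deadline, and fall back to fetching the published zone file over
    # FTP if the transfer stalls or fails.  All names below are made up.
    import subprocess, sys
    from ftplib import FTP

    MASTER   = "198.41.0.4"        # stand-in for the master server
    ZONE     = "."                 # the root zone
    DEADLINE = 600                 # seconds before we give up on AXFR
    FTP_HOST = "ftp.example.net"   # stand-in for a fallback FTP mirror
    FTP_PATH = "zones/root.zone"

    def axfr(master, zone, deadline):
        """Try the zone transfer via dig; return zone text or None."""
        try:
            out = subprocess.run(["dig", "@" + master, zone, "axfr"],
                                 capture_output=True, text=True,
                                 timeout=deadline)
        except subprocess.TimeoutExpired:
            return None            # taking too long: the alarm point
        if out.returncode != 0 or "Transfer failed" in out.stdout:
            return None
        return out.stdout

    def ftp_fallback(host, path):
        """Fetch the published zone file instead."""
        lines = []
        ftp = FTP(host)
        ftp.login()                # anonymous
        ftp.retrlines("RETR " + path, lines.append)
        ftp.quit()
        return "\n".join(lines)

    data = axfr(MASTER, ZONE, DEADLINE)
    if data is None:
        sys.stderr.write("AXFR stalled or failed; falling back to FTP\n")
        data = ftp_fallback(FTP_HOST, FTP_PATH)
    open("root.zone", "w").write(data)

Run from cron, something along these lines would have flagged the stalled transfer and switched to FTP within minutes instead of hours.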
There have been no even remotely logical claims that f.root-servers.net caused any problems at all. If Paul's server had been working correctly and had transferred the zone properly, the impact of NSI's screwups would have been almost exactly the same. What you are discussing is a problem, but not "the" problem, and not a problem that causes a significant impact over the short term. It is important to keep that clear in messages; NSI has already spread enough lies, so adding confusion about the issue isn't wise.

In fact, the fact that at least three of NSI's servers were giving false NXDOMAINs isn't really the issue either, from NANOG's perspective. It needs to be figured out, and it is a major problem in BIND, but it isn't necessarily something they could have or should have been able to prevent before it happened: that is very difficult to judge from the outside, and I can certainly imagine situations where, despite the best operations anywhere, such things could not have been predicted.

The big issue that needs to be addressed is why the heck it took NSI over two hours after they were notified to fix it, especially in the middle of the day, and why they didn't have any automated system that detected it and notified them within minutes (even a trivial probe along the lines of the sketch at the end of this message would have done it). Whatever the exact problem was is important and needs to be addressed, but addressing each instance is pointless without knowing why NSI's operations procedures are so flawed. In fact, they are so flawed that the VP of engineering either had no idea what was going on or chose to lie.

The problem is that NSI currently has no accountability (not even to their customers), and doesn't even make a token effort to follow up on their screwups. The organization that controls the root nameservers should have one of the best operations departments, not one of the worst.
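For what it's worth, the kind of automated check I mean is trivial. A sketch, again with dnspython, where the server list and the alert hook are stand-ins for whatever NSI would actually use: ask each server about a few names that must exist, and page someone on NXDOMAIN.

    #!/usr/bin/env python
    # Sketch of an automated false-NXDOMAIN detector.  Uses dnspython;
    # the server list and alert hook below are stand-ins, not anyone's
    # real monitoring setup.
    import dns.message, dns.query, dns.rcode, dns.rdatatype

    SERVERS = {"a": "198.41.0.4", "f": "192.5.5.241"}  # example addresses
    KNOWN_GOOD = ["com.", "net.", "org."]  # must never be NXDOMAIN

    def alert(msg):
        print("ALERT:", msg)       # stand-in for a pager or mail hook

    for letter, addr in sorted(SERVERS.items()):
        for name in KNOWN_GOOD:
            q = dns.message.make_query(name, dns.rdatatype.NS)
            try:
                resp = dns.query.udp(q, addr, timeout=5)
            except Exception:
                alert("%s.root-servers.net (%s) is not answering"
                      % (letter, addr))
                continue
            if resp.rcode() == dns.rcode.NXDOMAIN:
                alert("%s.root-servers.net returned NXDOMAIN for %s"
                      % (letter, name))

Run that from cron every few minutes and the false NXDOMAINs page someone long before the complaints start rolling in.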