This should go on the name-droppers list, but here goes....
these days it's not clear whether namedroppers is an operations list or a protocol list or still both. i think nanog is a fine forum for this:
What do we know about the events with the name servers
- f.root-servers.net was not able to transfer a copy of some of the zone files from a.root-servers.net
- f.root-servers.net became lame for some zones
just COM.
- tcpdump showed odd AXFR from a.root-servers.net
just a lot of missed/retransmitted ACKs.
- [fjk].gtld-servers.net have been reported answering NXDOMAIN to some valid domains, NSI denies any problem
the nanog archives include some dig results that are hard for NSI to deny.
Other events which may or may not have been related
- BGP routing bug disrupted connectivity for some backbones in the preceding days
this turned up a performance problem in BIND's retry code, btw, but was not otherwise related to the COM lossage of yesterday (as far as i know).
- Last month the .GOV domain was missing on a.root-servers.net due to a 'known bug' affecting zone transfers from GOV-NIC
different bug. that one causes truncated zone transfers; the secondary zone files on [fjk].gtld-servers.net yesterday were not truncated and it just took a restart to make them stop behaving badly.
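A rough way to tell the two failure modes apart is to check whether a transferred zone is framed correctly. An AXFR response begins and ends with the zone's SOA record, so a truncated transfer is missing the closing SOA. The sketch below assumes the records have already been parsed into simple tuples; the `Record` shape and the function name are hypothetical, not anything BIND exposes.

```python
from typing import List, Tuple

# Hypothetical parsed form of one resource record: (owner name, type, rdata).
Record = Tuple[str, str, str]


def transfer_looks_complete(records: List[Record]) -> bool:
    """Return True if the record stream is framed like a complete AXFR.

    A complete AXFR starts with the zone's SOA and ends with the same SOA.
    This checks framing only; it says nothing about whether the records
    in between are correct.
    """
    if len(records) < 2:
        return False
    first, last = records[0], records[-1]
    return first[1] == "SOA" and first == last
```

A zone that passes this check can still be served incorrectly, which is consistent with the untruncated files described above.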
- Someone has been probing DNS ports for an unknown reason
Things I don't know
- f.root-servers.net and NSI's servers reacted differently. What are the differences between them (BIND versions, in-house source code changes, operating systems/run-time libraries/compilers)?
they are completely different systems (solaris vs. digital unix) running the same (unmodified) bind 8.1.2 sources, which had completely different failure modes for completely different reasons.
- How long were servers unable to transfer the zone? The SOA says a zone is good for 7 days. Why did they expire/corrupt the old zone before getting a new copy?
damn good question. i'll look into that. shouldn't've happened.
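For reference, the expected behavior can be sketched as follows: a secondary keeps answering from its last good copy until the SOA expire interval has elapsed since the last successful refresh, however many refresh attempts fail in between. This is a minimal illustration, not BIND's code; the `SoaTimers` class and function name are hypothetical, and the refresh and retry values are made up. Only the 7-day expire comes from the question above.

```python
from dataclasses import dataclass

DAY = 86400  # seconds


@dataclass
class SoaTimers:
    refresh: int  # seconds between refresh attempts
    retry: int    # seconds between retries after a failed refresh
    expire: int   # seconds a secondary may serve without a successful refresh


def zone_is_servable(last_good_refresh: float, now: float, timers: SoaTimers) -> bool:
    """True while the secondary's copy is still within the expire interval."""
    return (now - last_good_refresh) < timers.expire


# Illustrative values: refresh and retry are assumptions, expire is 7 days.
timers = SoaTimers(refresh=3 * 3600, retry=3600, expire=7 * DAY)
```

Under these assumptions, a secondary that last refreshed six days ago should still be answering from its old copy, whatever has happened to its transfers since.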
- Routing between ISC and NSI for the preceding period before the problem was discovered
there was asymmetry (they reached me via bbnplanet, i reached them via alternet). they are now preferring alternet to reach me, so we have better path symmetry now. but their first mile is still congested and i am still retransmitting a lot of ACKs.
Theories
- Network connectivity was insufficient between NSI and ISC for long enough that the zones timed out (why were other servers affected?)
other servers are more conservative, and had switched to manual daily FTP of the COM zone longer ago than F has done. (with manual daily FTP you get the advantages of gzip, and of the pretense of "zone master" status while you manually retry after timeouts. AXFR needs those properties.)
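The two properties mentioned here, compression and retrying after a timeout without giving up the old copy, can be sketched in a few lines. This is only an illustration under stated assumptions: `fetch` is a hypothetical callable standing in for whatever actually pulls the gzipped zone file (FTP or otherwise), and the attempt count and delay are arbitrary.

```python
import gzip
import time
import zlib
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], bytes],
    attempts: int = 5,
    delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Fetch a gzip-compressed zone file, retrying on timeouts or bad data.

    The caller keeps serving its existing copy until this returns, and
    replaces it only with a fully decompressed result.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            # TimeoutError and gzip.BadGzipFile are both OSError subclasses;
            # a truncated gzip stream raises EOFError.
            return gzip.decompress(fetch())
        except (OSError, EOFError, zlib.error) as exc:
            last_error = exc
            if attempt + 1 < attempts:
                sleep(delay)
    raise RuntimeError(f"transfer failed after {attempts} attempts") from last_error
```

The point of the design is that a failed or partial fetch never touches the zone being served; only a complete, decompressed file is handed back.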
- Bug in BIND (or an in-house modified version) (why did vixie's and NSI's servers return different responses?)
there's definitely a bug in BIND if [fjk].gtld-servers.net were able to return different answers after restarts with no new zone transfers. (i'm sitting here wishing i had core dumps.)
- Bug in a support system (O/S, RTL, Compiler, etc) or its installation
- Operator error (erroneous reports of failure)
- Other malicious activity?
i think there were a goodly number of procedural errors.
--
Paul Vixie <paul@vix.com>