eric.CArroll@acm.ORG (Eric M. Carroll) writes:
Do we actually need the cooperation of the organizations in question to effect this?
Yes and no. It would be fairly 'easy' to become an editor, start Donelan's Journal Of Internet Disasters, and get a number of noted experts to contribute articles analyzing failures with no cooperation from the organizations. But I can predict what the organizations in question would say about such an endeavor:

1) Donelan is engaging in FUD to sell his journal.
2) They are making rash assumptions without knowing all the facts.
3) You know the Internet, you can't please everyone. It's just a small group of people with an axe to grind.
4) It didn't happen. If it did happen, it was minor. If it wasn't minor, not many people were affected. If many people were affected, it wasn't as bad as they said. If it was that bad, we would have known about it. Besides, we fixed it, and it isn't a problem (anymore).

Sure, sometimes a problem breaks through to the public even when the company tries all those things. Just ask Intel's PR department about their handling of the Pentium math bug. But that is relatively rare, and not really the most efficient way to handle problems.
For large enough failures, the results are obvious and the data is fairly clear. Perhaps a first stage of a Disruption Analysis Working Group would simply be for a coordinated group to gather the facts, sort through the impact, analyze the failure and report recommendations in a public forum.
I'm going to get pedantic. The results may be obvious, but the cause isn't. I would assert there are a number of large failures where the initial obvious cause has turned out to be wrong (or only a contributing factor). Was the triggering fault for the western power grid failure last year caused by a terrorist bombing or a tree growing too close to a high-tension line? From just the results you can't tell the cause. It may have been possible for an outside group, with no cooperation from the power companies, to have discovered the blackened tree on the utility right of way. But without the utility's logs and access to their data, I think it would have been very difficult for an outside group to analyze the failure. In particular, I think it would have been close to impossible for an outside group to find the other contributing factors.

This should go on the name-droppers list, but here goes....

What do we know about the events with the name servers:
- f.root-servers.net was not able to transfer a copy of some of the zone files from a.root-servers.net
- f.root-servers.net became lame for some zones
- tcpdump showed odd AXFR traffic from a.root-servers.net
- [fjk].gtld-servers.net have been reported answering NXDOMAIN for some valid domains; NSI denies any problem

Other events which may or may not have been related:
- A BGP routing bug disrupted connectivity for some backbones in the preceding days
- Last month the .GOV domain was missing on a.root-servers.net due to a 'known bug' affecting zone transfers from GOV-NIC
- Someone has been probing DNS ports for an unknown reason

Things I don't know:
- f.root-servers.net and NSI's servers reacted differently. What are the differences between them (BIND versions, in-house source code changes, operating systems/run-time libraries/compilers)?
- How long were servers unable to transfer the zone? The SOA says a zone is good for 7 days. Why did they expire/corrupt the old zone before getting a new copy?
- Routing between ISC and NSI in the period before the problem was discovered

Theories:
- Network connectivity between NSI and ISC was insufficient for long enough that the zones timed out (why were other servers affected?)
- Bug in BIND (or an in-house modified version) (why did Vixie's and NSI's servers return different responses?)
- Bug in a support system (O/S, RTL, compiler, etc.) or its installation
- Operator error (erroneous reports of failure)
- Other malicious activity?

(A rough sketch of the kind of outside check an independent group could run follows below.)

--
Sean Donelan, Data Research Associates, Inc, St. Louis, MO
  Affiliation given for identification not representation
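For what it's worth, some of the facts above can be checked from the outside with nothing but DNS queries. The following is a minimal sketch, not anything the parties involved actually ran: it asks a couple of name servers directly for a zone's SOA record with recursion turned off, then reports the response code, whether the answer is marked authoritative, and the SOA serial. Differing serials, a non-authoritative answer, or NXDOMAIN for a known-good zone would show the lameness and expired-zone symptoms described above. It assumes the dnspython package; the server addresses and the 'gov.' zone are only examples.

# Sketch: query several name servers directly for a zone's SOA and
# compare what they say. Assumes the dnspython package is installed.
import dns.flags
import dns.message
import dns.query
import dns.rcode
import dns.rdatatype

# Illustrative server list; substitute whichever servers are of interest.
SERVERS = {
    'a.root-servers.net': '198.41.0.4',
    'f.root-servers.net': '192.5.5.241',
}

def check_zone(zone):
    for name, addr in SERVERS.items():
        query = dns.message.make_query(zone, dns.rdatatype.SOA)
        # Clear the recursion-desired flag: we want each server's own
        # view of the zone, not an answer it chased down for us.
        query.flags &= ~dns.flags.RD
        try:
            reply = dns.query.udp(query, addr, timeout=5)
        except Exception as exc:
            print(f"{name}: no reply ({exc})")
            continue
        rcode = dns.rcode.to_text(reply.rcode())
        authoritative = bool(reply.flags & dns.flags.AA)
        serials = [rr.serial for rrset in reply.answer
                   if rrset.rdtype == dns.rdatatype.SOA
                   for rr in rrset]
        print(f"{name}: rcode={rcode} aa={authoritative} soa_serial={serials}")

if __name__ == '__main__':
    check_zone('gov.')

Recursion is disabled so the reply reflects only what each server itself believes about the zone, which is exactly what matters when looking for a lame delegation or an expired copy; comparing serials across servers also shows how stale any one copy has become.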