eric.CArroll@acm.ORG (Eric M. Carroll) writes:
Do we actually need the cooperation of the organizations in question to effect this?
Yes and no. It would be fairly 'easy' to become an editor, start Donelan's Journal Of Internet Disasters, and get a number of noted experts to contribute articles analyzing failures without any cooperation from the organizations. But I can predict what the organizations in question would say about such an endeavor:

1) Donelan is engaging in FUD to sell his journal.
2) They are making rash assumptions without knowing all the facts.
3) You know the Internet, you can't please everyone. It's just a small group of people with an axe to grind.
4) It didn't happen. If it did happen, it was minor. If it wasn't minor, not many people were affected. If many people were affected, it wasn't as bad as they said. If it was that bad, we would have known about it. Besides, we fixed it, and it isn't a problem (anymore).

Sure, sometimes a problem breaks through to the public even when the company tries all those things. Just ask Intel's PR department about their handling of the Pentium math bug. But that is relatively rare, and not really the most efficient way to handle problems.
For large enough failures, the results are obvious and the data is fairly clear. Perhaps a first stage of a Disruption Analysis Working Group would simply be for a coordinated group to gather the facts, sort through the impact, analyze the failure and report recommendations in a public forum.
I'm going to get pedantic. The results may be obvious, but the cause isn't. I would assert there are a number of large failures where the initial obvious cause has turned out to be wrong (or only a contributing factor). Was the triggering fault for the western power grid failure last year caused by a terrorist bombing, or by a tree growing too close to a high-tension line? From just the results you can't tell the cause. It may have been possible for an outside group, with no cooperation from the power companies, to have discovered the blackened tree on the utility right of way. But without the utility's logs and access to their data, I think it would have been very difficult for an outside group to analyze the failure. In particular, I think it would have been close to impossible for an outside group to find the other contributing factors.

This should go on the name-droppers list, but here goes....

What do we know about the events with the name servers:
- f.root-servers.net was not able to transfer a copy of some of the zone files from a.root-servers.net
- f.root-servers.net became lame for some zones
- tcpdump showed odd AXFR traffic from a.root-servers.net
- [fjk].gtld-servers.net have been reported answering NXDOMAIN for some valid domains; NSI denies any problem

Other events which may or may not have been related:
- A BGP routing bug disrupted connectivity for some backbones in the preceding days
- Last month the .GOV domain was missing on a.root-servers.net due to a 'known bug' affecting zone transfers from GOV-NIC
- Someone has been probing DNS ports for an unknown reason

Things I don't know:
- f.root-servers.net and NSI's servers reacted differently. What are the differences between them (BIND versions, in-house source code changes, operating systems/run-time libraries/compilers)?
- How long were servers unable to transfer the zone? The SOA says a zone is good for 7 days. Why did they expire/corrupt the old zone before getting a new copy?
- Routing between ISC and NSI for the period before the problem was discovered

Theories:
- Network connectivity was insufficient between NSI and ISC for long enough that the zones timed out (why were other servers affected?)
- Bug in BIND (or an in-house modified version) (why did Vixie's and NSI's servers return different responses?)
- Bug in a support system (O/S, RTL, compiler, etc.) or its installation
- Operator error (erroneous reports of failure)
- Other malicious activity?

--
Sean Donelan, Data Research Associates, Inc, St. Louis, MO
  Affiliation given for identification not representation
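The cross-checks Sean lists lend themselves to a simple script: compare the root zone SOA serial each server reports, and probe a few names that are known to exist so that a bogus NXDOMAIN stands out. Below is a minimal sketch of that kind of check, written against the modern dnspython library (which postdates this thread); the probe names are illustrative only.

    # Compare the root zone SOA serial seen at each server, and probe a few
    # names that must exist; a mismatched serial or an NXDOMAIN answer is the
    # kind of symptom described above.  Requires the dnspython package.
    import socket

    import dns.message
    import dns.query
    import dns.rcode

    SERVERS = ["a.root-servers.net", "f.root-servers.net"]
    PROBES = ["com.", "net.", "gov."]      # names every root server must know

    def ask(server, name, rdtype):
        ip = socket.gethostbyname(server)
        query = dns.message.make_query(name, rdtype)
        return dns.query.udp(query, ip, timeout=5)

    for server in SERVERS:
        soa = ask(server, ".", "SOA")
        serial = soa.answer[0][0].serial if soa.answer else None
        print(f"{server}: root SOA serial {serial}")
        for name in PROBES:
            response = ask(server, name, "NS")
            if response.rcode() == dns.rcode.NXDOMAIN:
                print(f"  WARNING: {server} answers NXDOMAIN for {name}")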
I think this is an operationally relevant thread, so let me continue to tilt at windmills here. I like your ideas (as usual) and I think there is an executable idea here. I firmly believe something in this area is much, much better than nothing, which is what we have now.

So, here are four communal options:

- constitute a mailing list for failure analysis; everyone pitches in, with or without assistance. The simple act of analyzing the options and possible failure modes is of value (note the reaction from Paul to your mail message - thus value is demonstrated!)
- constitute a closed mailing list, by invitation only. Ask vendors for cooperation, and publish the results with the names removed to protect the guilty and ensure their cooperation. Publish their names if cooperation is refused.
- create a moderated digest list, IFAIL-D, and take input from anywhere, but vet it through a panel of experts for analysis and publication. That's basically your newsletter.
- create a real working group that meets and travels, and visits the vendors in person. Perhaps they get badges eventually, or cool NTSB-like jackets ;-)

So, I will jump into the pool if you will. Let's pick a model and try... The point is, there is a lot of expertise available. I think starting small, involving experts, being professional, using volunteers, and growing as required is a model that has worked many times in Internet Land for some big pieces of infrastructure. In other words, we need to prove the value before people will pay for it. Have we acquired so much operational grey hair we have forgotten our roots? (sorry for the pun).

Regards,

Eric Carroll
This thread has mostly looked at the details of the recent problem, and hasn't responded much to Sean's original points. A very notable exception is Eric's thoughtful consideration of the approaches that might be taken for a discussion forum.

The note about Sean's credibility is obviously also relevant, but I'll note that the recent DNS controversy has made it clear that no amount of personal credibility is enough to withstand a sustained and forceful attack by a diligent and well-funded opponent. Hence, the effort under discussion here needs a group behind it, not just an individual. Which is not to say that having it led by a highly credible individual isn't extremely helpful.

In considering the possible modes that Eric outlines, the two questions I found myself asking were about openness and control. Is it important that the general public be kept out of the analysis and reporting process, as is done for CERT, or is it important (or at least acceptable) that the public be present? With respect to control, should the discussion be subject to control by an authority or should it be free-form?

At 02:17 PM 11/13/98 -0500, Eric M. Carroll wrote:
- constitute a mailing list for failure analysis, everyone pitches in with or without assistance. The simple act of analyzing the options and possible failure modes is of value (note the reaction from Paul to your mail message - thus value is demonstrated!)
This is the open/no-control model. It is the best for encouraging a broad range of opinion. It is the worst for permitting ad hominems, spin control efforts, etc.
- constitute a closed mailing list, by invitation only. Ask vendors for cooperation, and publish the results with the names removed to protect the guilty and ensure their cooperation. Publish their names if cooperation is refused.
This is probably the best for thoughtful analysis and the worst for information gathering.
- create a moderated digest list, IFAIL-D, and take input from anywhere, but vet it through a panel of experts for analysis and publication. That's basically your newsletter.
Open participation means broad input. Moderation means control over the emotional and other distractions. It also might be quite a bit of effort for the moderator...
- create a real working group that meets and travels, and visits the vendors in person. Perhaps they get badges eventually, or cool NTSB-like jackets ;-)
The most fun for the participants, expensive, and probably not (yet) necessary.

I've biased the analysis to show which one I personally prefer, but it's predicated on having a moderator with the time and skill to do the job. On the other hand, if we take the event-detail analysis that has mostly been going on in this thread, we find that contributions have been thoughtful and constructive, so the job of the moderator would have been minimal. In essence, the moderator introduces a small amount of delay but adds a safety mechanism in case the tone would otherwise start getting out of hand.

And now that I've said that, there is a question about timeliness. Does the analysis need to be able to occur in emergency mode, to get things fixed, or will these only be post hoc efforts?

d/

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Dave Crocker                              Tel: +60 (19) 3299 445
<mailto:dcrocker@brandenburg.com>         Post Office Box 296, U.P.M. Serdang, Selangor 43400 MALAYSIA
Brandenburg Consulting                    <http://www.brandenburg.com>
Tel: +1 (408) 246 8253                    Fax: +1 (408) 273 6464
675 Spruce Dr., Sunnyvale, CA 94086 USA
Kind of like an OEM (office of emergency management) for the Internet? 8) I actually had an idea like this some time ago and went ahead and registered oem-i.org; maybe it needs to be reinstated?

On Fri, 13 Nov 1998, Eric M. Carroll wrote:
I think this is an operationally relevant thread, so let me continue to tilt at windmills here. I like your ideas (as usual) and I think there is an executable idea here. I firmly believe something in this area is much, much better than nothing, which is what we have now.
So, here are four communal options:
- constitute a mailing list for failure analysis; everyone pitches in, with or without assistance. The simple act of analyzing the options and possible failure modes is of value (note the reaction from Paul to your mail message - thus value is demonstrated!)
- constitute a closed mailing list, by invitation only. Ask vendors for cooperation, and publish the results with the names removed to protect the guilty and ensure their cooperation. Publish their names if cooperation is refused.
- create a moderated digest list, IFAIL-D, and take input from anywhere, but vet it through a panel of experts for analysis and publication. That's basically your newsletter.
- create a real working group that meets and travels, and visits the vendors in person. Perhaps they get badges eventually, or cool NTSB-like jackets ;-)
So, I will jump into the pool if you will. Let's pick a model and try... The point is, there is a lot of expertise available. I think starting small, involving experts, being professional, using volunteers, and growing as required is a model that has worked many times in Internet Land for some big pieces of infrastructure. In other words, we need to prove the value before people will pay for it. Have we acquired so much operational grey hair we have forgotten our roots? (sorry for the pun).
Regards,
Eric Carroll
On Fri, 13 Nov 1998, Sean Donelan wrote:
Yes and no. It would be fairly 'easy' to become an editor, start Donelan's Journal Of Internet Disasters, and get a number of noted experts to contribute articles analyzing failures without any cooperation from the organizations. But I can predict what the organizations in question would say about such an endeavor:
Your predictions are wrong; however, they would be true if this journal were edited by someone other than yourself. You have a significant amount of credibility in the industry, and if you did edit such a journal, it would be taken seriously.
I'm going to get pedantic. The results may be obvious, but the cause isn't. I would assert there are a number of large failures where the initial obvious cause has turned out to be wrong (or only a contributing factor).
This is a prime example of why your credibility in regard to disaster and disruption analysis is so high. You not only have the background knowledge to understand it and the willingness to research the things you don't know, but you also have the right sceptical attitude that does not stop questioning the situation just because a nice answer has arrived.
difficult for an outside group to analyze the failure. In particular I think it would have been close to impossible for an outside group to find the other contributing factors.
As an editor of a network outages journal, you wouldn't be expected to do all the investigative legwork yourself. But I think that your evenhanded treatment of the events would tend to draw out the internal investigation reports of the companies involved. I think that you could run such a journal in a way that would largely avoid the negative effects that people fear from disclosure, because of your ability to draw parallels with disaster situations in other industries.
- Last month the .GOV domain was missing on a.root-servers.net due to a 'known bug' affecting zone transfers from GOV-NIC - Someone has been probing DNS ports for an unknown reason
- it is known that various individuals flood the InterNIC with packets related to attempts to suck down the whois database one item at a time, and/or to detect when a specific domain name goes off hold and becomes available for re-registration
- pathshow indicated that the InterNIC circuit over which AXFR was being attempted was congested.
- f.root-servers.net and NSI's servers reacted differently. What are the differences between them (BIND versions, in-house source code changes, operating systems/run-time libraries/compilers)
Whatever was causing the InterNIC link to be congested could have disrupted NSI's server. Wasn't Vixie's server acting properly by answering lame for the zones it could not retrieve? It seems like all the problems revolve around NSI's server and network; Vixie's problems were merely a symptom.

On the other hand, I would classify the inability of AXFR to transfer the zone as a weakness in BIND that could be addressed. Additionally, since it is known that zone transfers require a certain amount of bandwidth, Vixie could improve his operations by implementing a system that monitors the bandwidth with pathshow prior to initiating AXFR. Also, he could monitor the progress of the AXFR and alarm if it was taking too long. This would have allowed a fallback to ftp sooner, and operationally, such a fallback might even be something that could be automated.

Of course, none of this means Vixie was at fault, and I'd argue that NSI is at fault for not being able to detect the problem sooner and not being able to swap in a backup server sooner. Vixie knows that he runs one of 13 root nameservers. But NSI knows that they run the one and only master root nameserver, which puts more responsibility on them.

--
Michael Dillon - E-mail: michael@memra.com
Check the website for my Internet World articles - http://www.memra.com
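The monitor-and-fall-back idea above can be sketched concretely: bound the AXFR with a hard deadline, alarm when it is exceeded, and fall back to a bulk copy of the zone file. The sketch below uses the modern dnspython library purely as an illustration; the master server, deadline, and fallback URL are assumptions, not the actual configuration at ISC or NSI.

    # Bound the AXFR with a hard deadline and fall back to a bulk copy of the
    # zone file if the transfer cannot finish in time.  Requires dnspython;
    # the master, deadline, and fallback URL below are illustrative only.
    import socket
    import sys
    import urllib.request

    import dns.exception
    import dns.query
    import dns.zone

    MASTER = "a.root-servers.net"       # hypothetical master to transfer from
    ZONE = "."                          # the root zone
    AXFR_DEADLINE = 300                 # seconds before we alarm and fall back
    FALLBACK_URL = "ftp://ftp.internic.net/domain/root.zone"   # illustrative

    def fetch_zone():
        try:
            master_ip = socket.gethostbyname(MASTER)
            # 'lifetime' caps the whole transfer, so a congested path cannot
            # leave the secondary silently wedged for hours.
            xfr = dns.query.xfr(master_ip, ZONE, lifetime=AXFR_DEADLINE)
            return dns.zone.from_xfr(xfr)
        except (dns.exception.DNSException, OSError) as err:
            print(f"AXFR from {MASTER} failed within {AXFR_DEADLINE}s: {err}",
                  file=sys.stderr)
            print(f"falling back to {FALLBACK_URL}", file=sys.stderr)
            path, _ = urllib.request.urlretrieve(FALLBACK_URL, "root.zone")
            return dns.zone.from_file(path, origin=ZONE)

    zone = fetch_zone()
    print("loaded zone with", len(zone.nodes), "nodes; SOA serial",
          zone.get_rdataset("@", "SOA")[0].serial)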
On Fri, 13 Nov 1998, Michael Dillon wrote:
- f.root-servers.net and NSI's servers reacted differently. What are the differences between them (BIND versions, in-house source code changes, operating systems/run-time libraries/compilers)
Whatever was causing the InterNIC link to be congested could have disrupted NSI's server. Wasn't Vixie's server acting properly by answering lame for the zones it could not retrieve? It seems like all the problems revolve around NSI's server and network; Vixie's problems were merely a symptom. On the other hand, I would classify the inability of AXFR to transfer the zone as a weakness in BIND that could be addressed. Additionally, since it is known that zone transfers require a certain amount of bandwidth, Vixie could improve his operations by implementing a system that monitors the bandwidth with pathshow prior to initiating AXFR. Also, he could monitor the progress of the AXFR and alarm if it was taking too long. This would have allowed a fallback to ftp sooner, and operationally, such a fallback might even be something that could be automated. Of course, none of this means Vixie was at fault, and I'd argue that NSI is at fault for not being able to detect the problem sooner and not being able to swap in a backup server sooner. Vixie knows that he runs one of 13 root nameservers. But NSI knows that they run the one and only master root nameserver, which puts more responsibility on them.
There have been no even remotely logical claims that f.root-servers.net caused any problems at all. If Paul's server had been working correctly and had transferred the zone properly, the impact of NSI's screwups would have been almost exactly the same. What you are discussing is a problem, but not "the" problem, and not a problem that causes a significant impact over the short term. It is important to keep that clear in messages; NSI has already spread enough lies, so any confusion about the issue isn't wise.

In fact, the fact that at least three of NSI's servers were giving false NXDOMAINs isn't really the issue either, from NANOG's perspective. It needs to be figured out, it is a major problem in BIND, etc., but it isn't necessarily something they could have or should have been able to prevent before it happened: that is very difficult to figure out from the outside, and I can certainly imagine situations where, despite the best operations anywhere, they could not predict such things.

The big issue that needs to be addressed is why the heck it took NSI over two hours after they were notified to fix it, especially in the middle of the day, and why they didn't have any automated system that detected it and notified them within minutes. Whatever the exact problem was is important and needs to be addressed, but addressing each instance is pointless without knowing why NSI's operations procedures are so flawed. In fact, they are so flawed that the VP of engineering either had no idea what was going on or chose to lie.

The problem is that NSI currently has no accountability (not even to their customers), and doesn't even make a token effort to follow up on their screwups. The organization that controls the root nameservers should have one of the best operations departments, not one of the worst.
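For a sense of what such an automated detection system could look like, here is a minimal sketch: every few minutes, ask the TLD servers about a handful of names that are known to be registered, and send an alert the moment one of them comes back NXDOMAIN. It assumes the modern dnspython library and a local mail relay; the server list, canary names, and alert address are illustrative.

    # Every few minutes, ask the TLD servers about names that are known to be
    # registered, and send an alert the moment one comes back NXDOMAIN.
    # Requires dnspython and a local MTA; servers, canary names, and the alert
    # address are illustrative.
    import smtplib
    import socket
    import time
    from email.message import EmailMessage

    import dns.exception
    import dns.message
    import dns.query
    import dns.rcode

    TLD_SERVERS = ["f.gtld-servers.net", "j.gtld-servers.net"]
    CANARIES = ["ibm.com.", "uu.net.", "netscape.com."]   # known-registered names
    ALERT_TO = "noc@example.net"                          # hypothetical pager gateway
    CHECK_INTERVAL = 300                                  # seconds between sweeps

    def answers_nxdomain(server, name):
        ip = socket.gethostbyname(server)
        response = dns.query.udp(dns.message.make_query(name, "NS"), ip, timeout=5)
        return response.rcode() == dns.rcode.NXDOMAIN

    def send_alert(server, name):
        msg = EmailMessage()
        msg["Subject"] = f"{server} returned NXDOMAIN for {name}"
        msg["From"] = "dns-monitor@example.net"
        msg["To"] = ALERT_TO
        msg.set_content("A name known to exist is being denied; investigate now.")
        with smtplib.SMTP("localhost") as smtp:    # assumes a local mail relay
            smtp.send_message(msg)

    while True:
        for server in TLD_SERVERS:
            for name in CANARIES:
                try:
                    if answers_nxdomain(server, name):
                        send_alert(server, name)
                except (OSError, dns.exception.DNSException):
                    pass    # an unreachable server is a different alarm
        time.sleep(CHECK_INTERVAL)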
On Fri, 13 Nov 1998, Marc Slemko wrote:
What you are discussing is a problem, but not "the" problem and not a problem that causes a significant impact over the short term.
What I'm getting at is that on a network you cannot simply point the finger at the bad guys, NSI, and say that since they screwed up, everything is their fault. Everyone who interacts with NSI's servers also has a responsibility to arrange their operations so that an NSI problem cannot cause cascading failures, especially since NSI is known to regularly screw up like this. That means that the other root nameserver operators have a responsibility to limit the damage that NSI can do to them. You will also note that some ISPs attempt to mitigate the damage by running their own root zones, which allows them to fix things without waiting for the NSI bureaucracy to get around to fixing their servers.
It is important to keep that clear in messages; NSI has already spread enough lies, so any confusion about the issue isn't wise.
Nevertheless, there are other lessons to be learned from the incident besides the fact that NSI's internal operations are a mess.
The big issue that needs to be addressed is why the heck it took NSI over two hours after they were notified to fix it,
Precisely! Part of NSI's problem is that they simply do not have the skilled professionals available to build a properly robust architecture. This is evident not only in their nameserver operations but also in the domain name registry. But NSI also suffers from the bureaucratic disease that does not give front-line people the authority and the responsibility to fix things fast.
The organization that controls the root nameservers should have one of the best operations departments, not one of the worst.
The solution to this problem is to take this operational responsibility away from NSI, and then to run it totally transparently so that if a problem like this occurred there would be no veil of secrecy. In such an important infrastructure operation, every detail of the event logs, complete with names and dates and times, and the content of internal email messages should all be open to the public. This would be a very positive outcome of the new ICANN and would, in fact, be a resurrection of the way things used to be done on the net, where everyone shared their data openly and jointly figured out how to do things better.

--
Michael Dillon - E-mail: michael@memra.com
Check the website for my Internet World articles - http://www.memra.com
Sure, we have a responsibility to mitigate problems from up the pipe. But we, as an industry and as consumers, need to start demanding high standards of quality from the InterNIC and the other organizations that the Internet depends on - by writing to our congresspeople (they will read letters, not email, and they will listen if enough people contact them), and by complaining to the FCC and other involved groups. We are essentially captive to their screwups, NO MATTER HOW WELL we prepare. Even the FAA recognizes that no matter how good a flight crew is, if they get incorrect information from the Tower, any problems that occur are the Tower's fault!

-Deb
------------------~~~~~~~~~~~~~~~~~---------------~~~~~~~~~~~~~~
Deborah A. Smith          das@digex.net          Because I can
~~~~~~~~~~~~~~~~~~~----------------~~~~~~~~~~~~~~~~---------------
participants (7)
- Dave Crocker
- Deborah Ann Smith
- Eric M. Carroll
- Marc Slemko
- Michael Dillon
- Michael Freeman
- Sean Donelan