[ On Wednesday, January 24, 2001 at 23:23:11 ( -0500), Howard C. Berkowitz wrote: ]
Subject: Re: Monitoring highly redundant operations
My first point is having what physicians call a "high index of suspicion" when seeing a combination of minor symptoms. I suspect that we need to be looking for patterns of network symptoms that are sensitive (i.e., high chance of being positive when there is a problem) but not necessarily selective (i.e., low probability of false positives).
Your analogy is very interesting because, just as in this case with M$'s DNS, the root cause may very well not have been a failure to notice the symptoms or to diagnose them correctly, but rather a failure to prevent the situation that let those symptoms occur in the first place. I don't wish to read more into your analogy and your personal life (in a public forum, no less!) than I have a right to, so let's say, "theoretically", if there were past events in your life that were under your direct personal control and that were known at the time to be almost guaranteed to bring on your condition, then presumably you could have avoided that condition by actively avoiding or counteracting those events. In the same way M$'s DNS would not likely have suffered any significant visible problems, even if their entire campus had been torn to ruin by a massive earthquake or whatever, if only they had deployed registered DNS servers in other locations around the world (and of course if they'd been careful enough to use them fully for all relevant zones).

The DNS was designed to be, and in practice can be, one of the most reliable subsystems on the Internet. However it isn't that way by default -- every zone must be specifically engineered to be that way, and then of course the result needs to be managed properly too. Luckily the engineering and management are extremely simple and in most cases require only periodic co-operation between autonomous entities to make it all fit together. No doubt M$'s zones get a larger-than-average number of queries, but it's still just basic engineering to build an enormously reliable DNS system to distribute those zones and answer those queries. If this were not true the root and TLD zones would have crumbled long ago (and stayed that way! :-).
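To make that concrete, here is a rough sketch of the kind of sanity check I mean: confirm that a zone's advertised name servers don't all sit on one network. It assumes the third-party "dnspython" module, and the /24 comparison and the example.com zone name are purely illustrative (a real check would look at routing and physical topology, not just address prefixes):

#! /usr/bin/env python
# Rough sketch: warn when every name server for a zone sits in one /24.
# Assumes the third-party "dnspython" package; the /24 heuristic and the
# zone name below are illustrative only (IPv4 only, for brevity).
import dns.resolver

def nameserver_prefixes(zone):
    """Collect the /24 prefixes hosting the zone's NS targets."""
    prefixes = set()
    for ns in dns.resolver.resolve(zone, "NS"):
        for a in dns.resolver.resolve(ns.target, "A"):
            prefixes.add(".".join(a.address.split(".")[:3]))
    return prefixes

if __name__ == "__main__":
    zone = "example.com."
    prefixes = nameserver_prefixes(zone)
    if len(prefixes) < 2:
        print("WARNING: every name server for %s is in a single /24" % zone)
    else:
        print("%s: name servers span %d distinct /24 prefixes" % (zone, len(prefixes)))

Something that simple, run periodically from a couple of outside vantage points, would presumably have flagged the lack of network diversity long before any outage.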
Only after there is a clear clinical problem, or several pieces of laboratory evidence, does a physician jump to more invasive tests, or begin aggressive treatment on suspicion. In like manner, you wouldn't do a processor-intensive trace on a router, or do a possibly disruptive switch to backup links, unless you had reasonable confidence that there was a problem.
No, perhaps not, but surely in an organisation the size of M$ there should have been enough operational procedures in place to have identified the events shortly preceding the beginning of the incident (e.g. the configuration change). Similarly, of course, there should have been procedures in place to roll back all such changes to see if the problem goes away. Obviously such operational recovery procedures are not always perfect, as history has shown, but in the case of something as simple as a set of authoritative nameservers is supposed to be, they should have been highly effective.

Furthermore, in this particular case there's no need for expensive or disruptive tests -- a company the size of M$ should have had (and perhaps does have, but doesn't know how to use effectively) proper test gear that can passively analyse the traffic at various points on their networks (including their connection(s) to the Internet) without having to actually use their routers or servers for diagnostic purposes.

Finally, in this particular case the outage was so long that there was ample time for them to have deployed new, network-diverse servers, added their IP#s to the TLD delegations for their zone, and had them show up world-wide well before they'd fixed the actual problem! (A rough sketch of that kind of delegation check is appended below.)

-- 
Greg A. Woods
+1 416 218-0098   VE3TCP   <gwoods@acm.org>   <robohack!woods>
Planix, Inc. <woods@planix.com>;  Secrets of the Weird <woods@weird.com>
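Here is the kind of delegation check I have in mind: ask a parent (gTLD) server directly and see whether the new servers are being handed out yet. Again this assumes the third-party "dnspython" module; the zone, the parent-server address, and the expected server names are placeholders:

#! /usr/bin/env python
# Rough sketch: ask a parent (TLD) server directly whether a zone's
# delegation includes the name servers we expect to see.  Assumes the
# third-party "dnspython" package; ZONE, TLD_SERVER, and EXPECTED are
# placeholders to be replaced with real values.
import dns.message
import dns.query
import dns.rdatatype

ZONE = "example.com."
TLD_SERVER = "192.5.6.30"                    # e.g. a.gtld-servers.net
EXPECTED = {"ns1.example.net.", "ns2.example.org."}

def delegated_ns(zone, parent_server):
    """Return the NS names the parent server hands out for the zone."""
    query = dns.message.make_query(zone, dns.rdatatype.NS)
    response = dns.query.udp(query, parent_server, timeout=5)
    names = set()
    # The delegation may come back in the answer or the authority section.
    for rrset in response.answer + response.authority:
        if rrset.rdtype == dns.rdatatype.NS:
            names.update(str(rr.target) for rr in rrset)
    return names

if __name__ == "__main__":
    seen = delegated_ns(ZONE, TLD_SERVER)
    missing = EXPECTED - seen
    if missing:
        print("not yet in the delegation for %s: %s" % (ZONE, ", ".join(sorted(missing))))
    else:
        print("all expected servers appear in the delegation for %s" % ZONE)

Run against each of the TLD servers in turn, that would have told them exactly when the replacement servers became visible to the rest of the world.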