From the typical monitoring stations Dave sees, everything appears "normal." Yet, out in the real world there is a problem. Like most things, it's rarely a single thing that breaks, but a chain of problems resulting in the final failure.

Not to pick on Dave, since I suspect he is going to have to face the Microsoft PR department for re-indoctrination for speaking out of turn; I'm glad to see someone from Microsoft made an appearance. But he does raise an interesting problem: how do you know if your highly redundant, diverse, etc. system has a problem? With an ordinary system it's easy. It stops working. In a highly redundant system you can start losing critical components, but not be able to tell whether your operation is in fact seriously compromised, because it continues to "work."

As many of us have found out as we moved from simple networks to more complex networks, the network management is often much harder than the architecture of the network itself. Instead of relying on being notified when stuff "breaks," you have to actively monitor the state of your systems. Fairly frequently I see cases where the backup system failed, but no one knows about it until after the primary system also fails.

So what should you be monitoring, in addition to the typical graphs and logins, to detect the problem seen by Microsoft yesterday and today?

On Wed, 24 January 2001, Dave McKay wrote:
Microsoft's ITG is investigating this issue. I haven't been clued in as of yet as to what the main issue is. Hotmail's graphs and logins are currently following the same trends as normal; they seem unaffected. However, this is not the case in all locations. DNS seems to be the obvious choice for the blame, but this is not the case in all areas. At this point Microsoft is not willing to put the blame on anyone, or any protocol for that matter. (Unless they already released a public statement saying so, then who knows?) Anyway, the issues are being worked on and service will be restored as soon as possible. I apologize for not being able to disclose more information.
-- Dave McKay dave@sneakerz.org Microsoft Global Network Architect
But he does raise an interesting problem: how do you know if your highly redundant, diverse, etc. system has a problem? With an ordinary system it's easy. It stops working. In a highly redundant system you can start losing critical components, but not be able to tell whether your operation is in fact seriously compromised, because it continues to "work."
Indeed. We currently monitor each part of our operation from a monitoring station on our network. Under certain conditions, this can give us both false positives and false negatives:

- We've lost off-site routing. Our monitoring station can see all our nodes okay, so it thinks everything is fine, but no-one else can see them.

- We've lost routing to just the part of our network with the monitoring station on it. It reports that everything is down, when in fact stuff is working fine for serving the rest of the internet.

One way we plan to overcome these issues is to locate monitoring stations on other ISPs' networks, at random places on the internet. If you correlate the results from these multiple monitoring stations, then you get a better view of what the rest of the internet is seeing.

Simon

-- Simon Lockhart | Tel: +44 (0)1737 839676 Internet Engineering Manager | Fax: +44 (0)1737 839516 BBC Internet Services | Email: Simon.Lockhart@bbc.co.uk Kingswood Warren,Tadworth,Surrey,UK | URL: http://support.bbc.co.uk/
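A minimal sketch of the correlation Simon describes, assuming each remote station can somehow ship its reachability results back (the station names, targets, and report format are made up for illustration):

# Correlate reachability reports from several vantage points.
# Station names, targets, and the report format are hypothetical;
# how each station gathers and ships its results is left out here.

def classify(reports):
    """reports: {station: {target: bool}} -> {target: verdict}"""
    verdicts = {}
    targets = {t for views in reports.values() for t in views}
    for target in sorted(targets):
        views = {s: v[target] for s, v in reports.items() if target in v}
        up = [s for s, ok in views.items() if ok]
        down = [s for s, ok in views.items() if not ok]
        if not down:
            verdicts[target] = "up everywhere"
        elif not up:
            verdicts[target] = "down everywhere; page someone"
        else:
            verdicts[target] = ("reachable from " + ", ".join(up) +
                                " but not from " + ", ".join(down) +
                                "; suspect routing")
    return verdicts

if __name__ == "__main__":
    reports = {
        "on-net-station": {"www": True,  "mail": True},
        "remote-isp-a":   {"www": False, "mail": True},
        "remote-isp-b":   {"www": False, "mail": True},
    }
    for target, verdict in classify(reports).items():
        print(target, "->", verdict)

The first bullet above is the interesting case: the on-net station alone would report everything fine, while the combined view shows that the rest of the internet cannot reach you.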
On Wed, 24 Jan 2001, Simon Lockhart wrote:
But he does raise an interesting problem: how do you know if your highly redundant, diverse, etc. system has a problem? With an ordinary system it's easy. It stops working. In a highly redundant system you can start losing critical components, but not be able to tell whether your operation is in fact seriously compromised, because it continues to "work."
Indeed. We currently monitor each part of our operation from a monitoring station on our network. Under certain conditions, this can give us both false positives and false negatives:
- We've lost off-site routing. Our monitoring station can see all our nodes okay, so it thinks everything is fine, but no-one else can see them.
With our monitoring software we also check a few off-site links (our interfaces on our uplink routers, and the router after that); it tends to work well.
- We've lost routing to just the part of our network with the monitoring station on it. It reports that everything is down, when in fact stuff is working fine for serving the rest of the internet.
For that situation, the software we use allows us to set dependencies, i.e., servers A, B & C depend on router Z; if router Z is down, assume servers A, B & C are unreachable/down (but don't start spewing out alerts about them). A rough sketch of that dependency logic follows at the end of this message. Unfortunately the software is MS based (Enterprise Monitor, now named IP Monitor, IIRC). I first came across it while working at Xerox; it resides on the only MS box on our network (beyond customer machines, and yes, it's kind of an oxymoron, a Windows monitoring box).
One way we plan to overcome these issues is to locate monitoring stations on other ISPs' networks, at random places on the internet. If you correlate the results from these multiple monitoring stations, then you get a better view of what the rest of the internet is seeing.
A kind of distributed monitoring system would be nice, or just having people who agree to give you access to add your systems to their monitoring systems (easily done with some software, not so easily with others). I also do this to a small extent.

Matthew S. Hallacy XtraTyme Technologies
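A minimal sketch of that dependency-suppression idea, assuming hypothetical host names and a simple ping-based check (the packages mentioned above do this far more thoroughly; ping flags are per Linux iputils):

# Dependency-aware alerting: if a parent (e.g. router Z) is down,
# mark its dependents unreachable but suppress their individual alerts.
# Host names and the ping-based check are illustrative only.

import subprocess

DEPENDS_ON = {            # child -> parent
    "server-a": "router-z",
    "server-b": "router-z",
    "server-c": "router-z",
}

def is_up(host):
    """One ICMP echo; treat any non-zero exit as 'down'."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def check_all(hosts):
    status = {h: is_up(h) for h in hosts}
    alerts = []
    for host, up in status.items():
        if up:
            continue
        parent = DEPENDS_ON.get(host)
        if parent and not status.get(parent, True):
            # Parent is down too: the child is merely unreachable.
            print("suppressed:", host, "unreachable behind", parent)
        else:
            alerts.append(host)
    return alerts

if __name__ == "__main__":
    for host in check_all(["router-z", "server-a", "server-b", "server-c"]):
        print("ALERT:", host, "is down")

The point is simply that a dead router Z produces one alert rather than four.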
On Wed, 24 Jan 2001, Simon Lockhart wrote:
Indeed. We currently monitor each part of our operation from a monitoring station on our network. Under certain conditions, this can give us both false positives and false negatives:
Umm... Keynote? (http://www.keynote.com)

I find it truly amazing that people don't already monitor from diverse locations. Hell, have cronned pings running off your friend's cable modem if that's all you can afford, but for christ's sake, a single box colo'd in someone else's cage, or a shell at shells.com or nether.net, really isn't that expensive.

Fighting the war against bad networks,

Matthew Devney Teamsphere Interactive

P.S.: It is not wise to get me started about other bad network practices (stub network) or various other stupidities that piss me off.
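In that spirit, a throwaway check that could run from cron on any cheap external box, letting cron's MAILTO deliver the complaint; the hostnames are placeholders and the ping flags are per Linux iputils:

#!/usr/bin/env python3
# Minimal external reachability check for a crontab entry such as:
#   MAILTO=noc@example.net
#   */5 * * * * /home/you/pingcheck.py
# Cron mails any output, so the script only prints when something is down.
# Hostnames below are placeholders.

import subprocess

HOSTS = ["www.example.net", "mail.example.net"]

def reachable(host):
    return subprocess.call(
        ["ping", "-c", "2", "-W", "3", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

if __name__ == "__main__":
    for host in HOSTS:
        if not reachable(host):
            print("unreachable from this vantage point:", host)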
[ On Wednesday, January 24, 2001 at 14:31:20 ( -0800), Sean Donelan wrote: ]
Subject: Monitoring highly redundant operations
But he does raise an interesting problem: how do you know if your highly redundant, diverse, etc. system has a problem? With an ordinary system it's easy. It stops working. In a highly redundant system you can start losing critical components, but not be able to tell whether your operation is in fact seriously compromised, because it continues to "work."
The real problem is that the most critical part of the puzzle has _not_ been made "highly redundant" in this case. If none of your registered authoritative DNS servers is responding from the point of view of any given user on the Internet, then your hosts (MX records, etc.) don't exist for those people: their e-mail to you may well bounce and they will not view your web pages.

The only way to ensure that your DNS is highly redundant and working is to ensure that you've got the maximum possible dispersion of _registered_ authoritative servers throughout the network geography, just like the root and TLD servers are widely distributed. Note this is just as important (if not more so!) for any delegated sub-domains in your zone too, and equally important for any related zones (e.g. passport.com in this case).

The only really effective way to measure the effectiveness of your nameserver dispersion is to make it terribly easy for anyone anywhere to report any problems they perceive to you, via as many optional channels as possible. You can't be everywhere at once, but if you make it easy for people to send you information out-of-band then you'll get lots of early warning when various chunks of the Internet can't see your nameservers and/or your other hosts.

Now if the majority of DNS cache server operators don't get too paranoid, you could try to set up a mesh of equally widely dispersed monitoring systems that cross-check the availability of test records from your zone by querying any number of regional and remote cache servers. You'd make the TTL of these test records the minimum recommended by major nameserver software vendors (300 seconds?) and then query the whole group every TTL+N seconds. Obviously you're probably going to have to report your results out-of-band, and/or have independent people at each monitoring site who are responsible for investigating problems immediately and doing what they can locally to resolve them.

-- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <robohack!woods> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
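A minimal sketch of that cross-check, shelling out to dig against a handful of recursive caches; the resolver addresses, test record name, and TTL are placeholders, and reporting the results out-of-band (as Greg notes) is left out:

# Poll a low-TTL test record through several recursive caches and flag
# any cache that can no longer resolve it. Resolver IPs, the record
# name, and the TTL value below are illustrative; requires `dig`.

import subprocess, time

CACHES = ["192.0.2.53", "198.51.100.53", "203.0.113.53"]   # example resolvers
TEST_RECORD = "beacon.example.com"
TTL = 300          # match the record's TTL
SLACK = 30         # the "+N" part

def resolves(cache, name):
    try:
        out = subprocess.run(
            ["dig", "+short", "+time=3", "+tries=1", "@" + cache, name, "A"],
            capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return False
    return out.returncode == 0 and out.stdout.strip() != ""

def main():
    while True:
        for cache in CACHES:
            if not resolves(cache, TEST_RECORD):
                print(time.strftime("%F %T"),
                      "cache", cache, "cannot resolve", TEST_RECORD)
        time.sleep(TTL + SLACK)

if __name__ == "__main__":
    main()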
On Wed, Jan 24, 2001 at 02:31:20AM -0800, Sean Donelan wrote:
Not to pick on Dave, since I suspect he is going to have to face the Microsoft PR department for re-indoctrination for speaking out of turn; I'm glad to see someone from Microsoft made an appearance.
out of curiosity, how do you know he's really from microsoft, whether unofficial or not? (he might be one of those LIENUX ZeAlOtS.) perhaps he might face re-indoctrination over the mail client he is (apparently) using, as well as the mail server software he is (apparently) using:

: [ snip ]
:
: Received: by segue.merit.edu (Postfix)
:       id 91FFE5E0E3; Wed, 24 Jan 2001 18:12:28 -0500 (EST)
: Delivered-To: nanog-outgoing@merit.edu
: Received: by segue.merit.edu (Postfix, from userid 56)
:       id B9B1E5E067; Wed, 24 Jan 2001 17:39:46 -0500 (EST)
: Received: from sneakerz.org (sneakerz.org [207.154.226.254])
:       by segue.merit.edu (Postfix) with ESMTP id 58CCE5EBCC
:       for <nanog@merit.edu>; Wed, 24 Jan 2001 17:32:37 -0500 (EST)
:---> Received: by sneakerz.org (Postfix, from userid 1003)
 --->                                           ^^^^^^^^^^^
:       id E7CC15D006; Wed, 24 Jan 2001 16:32:36 -0600 (CST)
: Date: Wed, 24 Jan 2001 16:32:36 -0600
: From: Dave McKay <dave@sneakerz.org>
:
: [ snipped/edited ]
:
: Message-ID: <20010124163236.A37343@sneakerz.org>
: References: <20010124142230.A36944@sneakerz.org> <Pine.BSO.4.31.0101241330390.16244-100000@dqc.org>
:
:---> User-Agent: Mutt/1.2i
:
: [ snipped/edited ]
:
: --
: Dave McKay
: dave@sneakerz.org
: Microsoft Global Network Architect

-- Henry Yen Aegis Information Systems, Inc. Senior Systems Programmer Hicksville, New York
Sean Donelan <sean@donelan.com> observed,
But he does raise an interesting problem: how do you know if your highly redundant, diverse, etc. system has a problem? With an ordinary system it's easy. It stops working. In a highly redundant system you can start losing critical components, but not be able to tell whether your operation is in fact seriously compromised, because it continues to "work."
I suspect answers here aren't going to be found in traditional engineering, but more in a discipline that deals with extremely complex systems where a full failure may be irretrievable. I'm thinking of clinical medicine.

The initial problem there indeed may be subtle. I have a substantial amount of medical experience, but it easily was 2-3 hours before I recognized, in myself, early symptoms of a cardiac problem. It seemed so much like indigestion, and then a pulled muscle. I remember relaxing, and then recognizing a chain of minor events... sweating... mild but persistent left arm pain radiating into the chest... shortness of breath... and then a big OH SH*T.

My first point is having what physicians call a "high index of suspicion" when seeing a combination of minor symptoms. I suspect that we need to be looking for patterns of network symptoms that are sensitive (i.e., high chance of being positive when there is a problem) but not necessarily selective (i.e., low probability of false positives).

Once the index of suspicion is triggered, the next thing to look for is not necessarily a direct indication of a problem, but a more selective surrogate marker: objective criteria, especially when analyzed as trends, that point in the direction of an impending failure. In emergency medicine, the EKG often isn't as informative as TV drama would suggest. A constantly improving area, however, has been measurement, especially successive measurements, of blood chemicals that indicate cardiac tissue is being damaged or destroyed.

Early in the use of cardiac-related enzymes, it was a matter of considering several nonspecific factors in combination. SGOT, CPK and LDH are all enzymes that will elevate with tissue damage. The problem is that any one can be elevated by problems in different areas: liver and heart, heart and skeletal muscle, etc. You need to look for elevations in a couple of areas that are associated with the heart, AND look for normal values for other tests that rule out liver disease, etc. The biochemical techniques have constantly improved, but you still need to look at several factors.

The second-phase analogy for networking could be more frequent polling and trending, or relatively benign tests such as traceroutes, etc. Only after there is a clear clinical problem, or several pieces of laboratory evidence, does a physician jump to more invasive tests, or begin aggressive treatment on suspicion. In like manner, you wouldn't do a processor-intensive trace on a router, or do a possibly disruptive switch to backup links, unless you had reasonable confidence that there was a problem.
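A minimal sketch of that two-stage idea translated to network terms; the symptom names, thresholds, and the traceroute escalation are illustrative assumptions rather than anything from the post:

# Stage 1: cheap, sensitive-but-nonspecific symptoms raise an "index of
# suspicion". Stage 2: only then run a more selective (and more costly)
# check. Symptoms, thresholds, and the traceroute target are placeholders.

import subprocess

def index_of_suspicion(symptoms):
    """Count how many cheap indicators look abnormal."""
    score = 0
    score += symptoms["smtp_queue_depth"] > 1000        # mail backing up?
    score += symptoms["dns_timeout_ratio"] > 0.05       # >5% lookups timing out?
    score += symptoms["http_error_ratio"] > 0.02        # elevated server-error rate?
    score += symptoms["bgp_flaps_last_hour"] > 10       # unstable routing?
    return score

def escalate(target):
    """The 'more invasive test': a traceroute, run only on suspicion."""
    return subprocess.run(["traceroute", "-n", target],
                          capture_output=True, text=True).stdout

if __name__ == "__main__":
    symptoms = {
        "smtp_queue_depth": 2400,
        "dns_timeout_ratio": 0.08,
        "http_error_ratio": 0.01,
        "bgp_flaps_last_hour": 3,
    }
    if index_of_suspicion(symptoms) >= 2:    # any two weak signals together
        print(escalate("192.0.2.1"))

Each individual signal is noisy (sensitive but not selective); only when several fire together does the more expensive, more selective test run.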
[ On Wednesday, January 24, 2001 at 23:23:11 ( -0500), Howard C. Berkowitz wrote: ]
Subject: Re: Monitoring highly redundant operations
My first point is having what physicians call a "high index of suspicion" when seeing a combination of minor symptoms. I suspect that we need to be looking for patterns of network symptoms that are sensitive (i.e., high chance of being positive when there is a problem) but not necessarily selective (i.e., low probability of false positives).
Your analogy is very interesting, because just as in this case with M$'s DNS, the root cause may very well not have been in failing to notice the symptoms or diagnose them correctly, but rather in allowing a situation to build such that these symptoms even occur in the first place.

I don't wish to read more into your analogy and your personal life (in a public forum, no less!) than I have a right to, so let's say "theoretically" that if it were past events in your life that were under your direct personal control, and which were known at the time to be almost guaranteed to bring on your condition, then presumably you could have avoided that condition by actively avoiding or counteracting those past events.

In the same way, M$'s DNS would not likely have suffered any significant visible problems, even if their entire campus had been torn to ruin by a massive earthquake or whatever, if only they had deployed registered DNS servers in other locations around the world (and of course if they'd been careful enough to use them fully for all relevant zones). A rough check of delegation diversity is sketched after this message.

The DNS was designed to be, and is at least in theory possible to be, one of the most reliable subsystems on the Internet. However, it isn't that way by default: every zone must be specifically engineered to be that way, and then of course the result needs to be managed properly too. Luckily the engineering and management is extremely simple, and in most cases only requires periodic co-operation of autonomous entities to make it all fit together. No doubt M$'s zones get a larger than average number of queries, but still it's just basic engineering to build an enormously reliable DNS system to distribute those zones and answer those queries. If this were not true, the root and TLD zones would have crumbled long ago (and stayed that way! :-).
Only after there is a clear clinical problem, or several pieces of laboratory evidence, does a physician jump to more invasive tests, or begin aggressive treatment on suspicion. In like manner, you wouldn't do a processor-intensive trace on a router, or do a possibly disruptive switch to backup links, unless you had reasonable confidence that there was a problem.
No, perhaps not, but surely in an organisation the size of M$ there should have been enough operational procedures in place to have identified the events shortly preceding the beginning of the incident (eg. the configuration change). Similarly, of course, there should have been procedures in place to roll back all such changes to see if the problem goes away. Obviously such operational recovery procedures are not always perfect, as history has shown, but in the case of something as simple as a set of authoritative nameservers is supposed to be, they should have been highly effective.

Furthermore, in this particular case there's no need for expensive or disruptive tests. A company the size of M$ should have had (and perhaps does have, but doesn't know how to use effectively) proper test gear that can passively analyse the traffic at various points on their networks (including their connection(s) to the Internet) without having to actually use their routers or servers for diagnostic purposes.

Finally, in this particular case the outage was so long that there was ample time for them to have deployed new, network-diverse servers, added their IP#s to the TLD delegations for their zone, and had them show up world-wide well before they'd fixed the actual problem!

-- Greg A. Woods +1 416 218-0098 VE3TCP <gwoods@acm.org> <robohack!woods> Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
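A minimal sketch of checking how dispersed a zone's registered nameservers actually are, again shelling out to dig; grouping by /24 prefix is a crude stand-in for real topological diversity, and the zone name is a placeholder:

# Crude nameserver-diversity check: list the NS records for a zone,
# resolve each to addresses, and group them by /24 prefix. Servers that
# all share a prefix (or a single site) are not meaningfully redundant.
# The zone below is a placeholder; requires `dig` in PATH.

import subprocess
from collections import defaultdict

ZONE = "example.com"

def dig_short(name, rrtype):
    out = subprocess.run(["dig", "+short", name, rrtype],
                         capture_output=True, text=True, timeout=10)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def main():
    by_prefix = defaultdict(list)
    for ns in dig_short(ZONE, "NS"):
        for addr in dig_short(ns, "A"):
            prefix = ".".join(addr.split(".")[:3]) + ".0/24"
            by_prefix[prefix].append(ns)
    print("nameserver groups for", ZONE)
    for prefix, servers in sorted(by_prefix.items()):
        print(" ", prefix, "->", ", ".join(sorted(set(servers))))
    if len(by_prefix) < 2:
        print("WARNING: all registered nameservers sit in one /24;")
        print("a single routing or site failure takes the whole zone out.")

if __name__ == "__main__":
    main()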
participants (7)
- Henry Yen
- Howard C. Berkowitz
- mdevney@teamsphere.com
- poptix@sleepybox.poptix.net
- Sean Donelan
- Simon Lockhart
- woods@weird.com