[ On January 24, 2001 at 14:31:20 (-0800), Sean Donelan wrote: ]
Subject: Monitoring highly redundant operations
But he does raise an interesting problem. How do you know if your highly redundant, diverse, etc. system has a problem? With an ordinary system it's easy. It stops working. In a highly redundant system you can start losing critical components, but not be able to tell if your operation is in fact seriously compromised, because it continues to "work."
The real problem is that the most critical part of the puzzle has _not_ been made "highly redundant" in this case. Unless at least one of your registered authoritative DNS servers is responding from the point of view of any _and_ every user on the Internet, your hosts (MX records, etc.) don't exist for those people: their e-mail to you may well bounce and they will not be able to view your web pages.

The only way to ensure that your DNS is highly redundant and working is to ensure that you've got the maximum possible dispersion of _registered_ authoritative servers throughout the network geography, just as the root and TLD servers are widely distributed. Note this is just as important (if not more so!) for any delegated sub-domains in your zone too, and equally important for any related zones (e.g. passport.com in this case).

The only really effective way to measure the effectiveness of your nameserver dispersion is to make it terribly easy for anyone anywhere to report any problems they perceive to you via as many optional channels as possible -- you can't be everywhere at once, but if you make it easy for people to send you information out-of-band then you'll get lots of early warning when various chunks of the Internet can't see your nameservers and/or your other hosts.

Now, if the majority of DNS cache server operators don't get too paranoid, you could try to set up a mesh of equally widely dispersed monitoring systems that cross-check the availability of test records from your zone by querying any number of regional and remote cache servers. You'd make the TTL of these test records the minimum recommended by major nameserver software vendors (300 seconds?) and then query the whole group every TTL+N seconds. Obviously you're probably going to have to report your results out-of-band, and/or have independent people at each monitoring site who are responsible for investigating problems immediately and doing what they can locally to resolve them.

--
Greg A. Woods

+1 416 218-0098  VE3TCP  <gwoods@acm.org>  <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
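
P.S. For what it's worth, the cross-check described above (a short-TTL test record, queried through many remote caches every TTL+N seconds) might be sketched roughly as follows. This is only an illustration under stated assumptions: the `lookup` callable, the cache host names, the test record name and the value of N are all hypothetical, and the actual DNS query is deliberately left to whatever resolver library or wrapper you already trust.

```python
from typing import Callable, Dict, List, Optional

TEST_TTL = 300   # seconds; the TTL floated (with a question mark) in the post
SLACK = 30       # the "N" in TTL+N; an arbitrary value chosen for illustration

# A lookup takes (cache_server, record_name) and returns the answer string,
# or None if the cache gave no answer. Hypothetical interface: plug in your
# own resolver library call or a wrapper around a command-line tool here.
Lookup = Callable[[str, str], Optional[str]]


def poll_interval(ttl: int = TEST_TTL, slack: int = SLACK) -> int:
    """Seconds between sweeps: long enough that every cache has expired the
    test record and must go back to the authoritative servers for it."""
    return ttl + slack


def sweep(caches: List[str], record: str, expected: str,
          lookup: Lookup) -> Dict[str, str]:
    """Query every cache once; return {cache: reason} for each that failed."""
    problems: Dict[str, str] = {}
    for cache in caches:
        try:
            answer = lookup(cache, record)
        except OSError as exc:       # timeouts, unreachable networks, etc.
            problems[cache] = f"lookup error: {exc}"
            continue
        if answer is None:
            problems[cache] = "no answer"
        elif answer != expected:
            problems[cache] = f"unexpected answer {answer!r}"
    return problems


if __name__ == "__main__":
    # Stand-in lookup so the sketch runs without touching the network.
    canned = {"cache-east.example": "192.0.2.1", "cache-west.example": None}

    def fake_lookup(cache: str, record: str) -> Optional[str]:
        return canned[cache]

    failed = sweep(list(canned), "probe.example.com", "192.0.2.1", fake_lookup)
    print(failed)
    print("next sweep in", poll_interval(), "seconds")
```

Each monitoring site would run a sweep every `poll_interval()` seconds and pass a non-empty result to a human over some channel that does not depend on the zone being tested, since a report that needs your own DNS to get through is no report at all.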