On Wed, 24 Jan 2001, Simon Lockhart wrote:
But he does raise an interesting problem. How do you know if your highly redundant, diverse, etc. system has a problem? With an ordinary system it's easy: it stops working. In a highly redundant system you can start losing critical components but not be able to tell that your operation is in fact seriously compromised, because it continues to "work."
Indeed. We currently monitor each part of our operation from a monitoring station on our network. Under certain conditions, this can give us both false positives and false negatives:
- We've lost off-site routing. Our monitoring station can see all our nodes okay, so it thinks everything is fine, but no-one else can see them.
With our monitoring software we also check a few off-site links (our interfaces on our uplinks' routers, and the router after that); it tends to work well.
- We've lost routing to just the part of our network with the monitoring station on. It reports that everything is down, when in fact stuff is working fine for serving the rest of the internet.
For that situation the software we use allows us to set dependencies, i.e., servers A, B & C depend on router Z; if router Z is down, assume servers A, B & C are unreachable/down (but don't start spewing out alerts about it). Unfortunately the software is MS based (Enterprise Monitor, now named IP monitor iirc). I first came across it while working at Xerox. It resides on the only MS box on our network (beyond customer machines, and yes, it's kind of an oxymoron, a Windows monitoring box).
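The dependency idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Enterprise Monitor / IP monitor does it; the host names, the DEPENDS_ON map and the alerts_to_send function are all made up for the example.

```python
from typing import Dict, Set

# Hypothetical topology: each host maps to the single device it depends on.
DEPENDS_ON: Dict[str, str] = {
    "server-a": "router-z",
    "server-b": "router-z",
    "server-c": "router-z",
}


def alerts_to_send(down: Set[str], depends_on: Dict[str, str]) -> Set[str]:
    """Return the down hosts that have no down host anywhere upstream.

    A host whose parent (or grandparent, etc.) is also down is assumed
    to be merely unreachable, so no alert is raised for it.
    """
    result = set()
    for host in down:
        seen = {host}                      # guards against dependency loops
        parent = depends_on.get(host)
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in down:
                suppressed = True          # an upstream device explains this outage
                break
            seen.add(parent)
            parent = depends_on.get(parent)
        if not suppressed:
            result.add(host)
    return result
```

With router Z and all three servers failing their checks, only router Z would be reported; with server A failing on its own, server A would be reported.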
One way we plan to overcome these issues is to locate monitoring stations on other ISPs' networks at random places on the internet. If you correlate the results from these multiple monitoring stations, then you get a better view of what the rest of the internet is seeing.
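As a rough sketch of what correlating several monitoring stations could look like, the Python below combines one internal result with results from external vantage points and names the two failure cases mentioned earlier. The function name, the vantage-point names and the result strings are assumptions for illustration only, not part of any existing tool.

```python
from typing import Dict


def correlate(internal_ok: bool, external: Dict[str, bool]) -> str:
    """Combine an internal check with checks from external vantage points.

    `external` maps a (hypothetical) vantage-point name to whether the
    target was reachable from there.
    """
    if not external:
        return "unknown"                    # nothing to correlate against
    ext_up = sum(1 for ok in external.values() if ok)
    if ext_up == len(external):
        # Everyone outside can reach us; if the internal station cannot,
        # the station itself is probably cut off.
        return "ok" if internal_ok else "monitor-isolated"
    if ext_up == 0:
        # Nobody outside can reach us; if the internal station still can,
        # off-site routing is the likely culprit.
        return "off-site-routing-lost" if internal_ok else "down"
    return "partial"                        # some of the internet sees us, some does not
```

For example, correlate(True, {"isp-1": False, "isp-2": False}) gives "off-site-routing-lost", and correlate(False, {"isp-1": True, "isp-2": True}) gives "monitor-isolated".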
A kind of distributed monitoring system would be nice, or just having people who agree to give you access to add your systems to their monitoring systems (easily done with some software, not so easily with others). I also do this to a small extent.

Matthew S. Hallacy
XtraTyme Technologies
Simon
--
Simon Lockhart                      | Tel: +44 (0)1737 839676
Internet Engineering Manager        | Fax: +44 (0)1737 839516
BBC Internet Services               | Email: Simon.Lockhart@bbc.co.uk
Kingswood Warren,Tadworth,Surrey,UK | URL: http://support.bbc.co.uk/