Sean Donelan <sean@donelan.com> observed,
But he does raise an interesting problem. How do you know if your highly redundant, diverse, etc. system has a problem? With an ordinary system it's easy. It stops working. In a highly redundant system you can start losing critical components, but not be able to tell if your operation is in fact seriously compromised, because it continues to "work."
I suspect answers here aren't going to be found in traditional engineering, but more in a discipline that deals with extremely complex systems where a full failure may be irretrievable. I'm thinking of clinical medicine.
The initial problem there may indeed be subtle. I have a substantial amount of medical experience, but it was easily 2-3 hours before I recognized, in myself, early symptoms of a cardiac problem. It seemed so much like indigestion, and then a pulled muscle. I remember relaxing, and then recognizing a chain of minor events...sweating...mild but persistent left arm pain radiating into the chest...shortness of breath...and then a big OH SH*T.
My first point is having what physicians call a "high index of suspicion" when seeing a combination of minor symptoms. I suspect that we need to be looking for patterns of network symptoms that are sensitive (i.e., high chance of being positive when there is a problem) but not necessarily selective (i.e., low probability of false positives).
Once the index of suspicion is triggered, the next thing to look for is not necessarily direct indication of a problem, but a more selective surrogate marker: objective criteria that, especially when analyzed as trends, point in the direction of an impending failure. In emergency medicine, the EKG often isn't as informative as TV drama would suggest. A constantly improving area, however, has been measurement, especially successive measurements, of blood chemicals that indicate cardiac tissue is being damaged or destroyed.
Early in the use of cardiac-related enzymes, it was a matter of considering several nonspecific factors in combination. SGOT, CPK and LDH are all enzymes that will elevate with tissue damage. The problem is that any one of them can be elevated by problems in different areas: liver and heart, heart and skeletal muscle, etc.
You need to look for elevations in a couple of areas that are associated with the heart, AND look for normal values for other tests that rule out liver disease, etc. The biochemical techniques have constantly improved, but you still need to look at several factors.
The second-phase analogy for networking could be more frequent polling and trending, or relatively benign tests such as traceroutes. Only after there is a clear clinical problem, or several pieces of laboratory evidence, does a physician jump to more invasive tests, or begin aggressive treatment on suspicion. In like manner, you wouldn't do a processor-intensive trace on a router, or do a possibly disruptive switch to backup links, unless you had reasonable confidence that there was a problem.
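
To make the two-phase idea concrete, here is a rough sketch in Python. Everything specific in it is my assumption for illustration, not something measured: the metric names (rtt_ms, loss_pct, route_flaps, redundant_paths_down), the thresholds, and the "rule-out" measurement reference_loss_pct, which stands for loss to some nearby reference target and plays the role of the normal liver test (it rules out the monitoring station itself as the source of the symptoms). Phase one is a cheap, sensitive screen on combinations of minor symptoms; phase two looks at the trend of successive measurements before anything invasive is suggested.

```python
from statistics import mean

# Hypothetical thresholds for "minor symptoms"; tune for your own network.
MINOR_SYMPTOMS = {
    "rtt_ms": 80.0,
    "loss_pct": 0.5,
    "route_flaps": 3,
    "redundant_paths_down": 1,
}
SUSPICION_MIN_SYMPTOMS = 2   # a combination, not any single symptom
RISING_LOSS_SLOPE = 0.1      # percentage points per poll (assumed)
REFERENCE_LOSS_NORMAL = 0.1  # percent; the rule-out test must be below this


def symptoms_present(sample):
    """Phase 1: return the names of metrics at or above their threshold."""
    return [name for name, limit in MINOR_SYMPTOMS.items()
            if sample.get(name, 0) >= limit]


def suspicious(sample):
    """Sensitive screen: true when several minor symptoms occur together."""
    return len(symptoms_present(sample)) >= SUSPICION_MIN_SYMPTOMS


def slope(values):
    """Least-squares slope of values against their index (units per poll)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def next_action(sample, recent_loss, reference_loss_pct):
    """Decide how far to escalate, from cheapest to most invasive."""
    if not suspicious(sample):
        return "routine polling"
    # Phase 2: a rising trend in the marker AND a normal rule-out test.
    rising = slope(recent_loss) > RISING_LOSS_SLOPE
    ruled_out = reference_loss_pct < REFERENCE_LOSS_NORMAL
    if rising and ruled_out:
        return "escalate"
    return "poll more often"


if __name__ == "__main__":
    sample = {"rtt_ms": 95, "loss_pct": 0.2,
              "route_flaps": 4, "redundant_paths_down": 0}
    print(symptoms_present(sample))
    print(next_action(sample, [0.1, 0.3, 0.5, 0.7], reference_loss_pct=0.0))
```

Note that "escalate" here only means a human should consider the expensive steps (a processor-intensive trace, a switch to backup links); the sketch deliberately never takes a disruptive action on its own.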