On Tue, Aug 25, 2009 at 7:53 AM, Jeff Aitken<jaitken@aitken.com> wrote:
[..] Periodically inducing failures to catch [...] them is sorta like using your smoke detector as an oven timer. [..] machine-parsable format, but the benefit is that you know in pseudo-realtime when something is wrong, as opposed to finding out the next time a device fails.
Config checking can't say much about silent hardware failures. Unanticipated problems are likely to arise in failover systems, especially complicated ones, and a failover system that has not been periodically verified may not work as designed.

Simulations, config review, and change controls are not substitutes for testing; they address overlapping but different problems. Testing detects unanticipated errors; config review is a preventive measure that helps avoid and correct apparent configuration issues. Config checking (of both software and hardware choices) also helps to keep out unnecessary complexity.

A human still has to write the script and review its output. Eventually an operator will accidentally omit something from both the current state and the "desired" state, and a check that compares the two will not flag it, so there is a chance that an erroneous entry escapes detection.

There can be other types of errors. Possibly there is a damaged patch cable, dying port, failing power supply, or other hardware on the warm spare that has silently degraded, and its poor condition won't be detected until it actually tries to take a heavy workload, blows a fuse, eats a transceiver, and everything just falls apart. Perhaps you upgraded a hardware module or software image X months ago to fix bug Y on the secondary unit, and the upgrade caused completely unanticipated side effect Z.

Config checking can't say much about silent hardware problems.

--
-Mysid