
-----Original Message-----
From: Leo Bicknell [mailto:bicknell@ufp.org]
I want to emphasize _and test_.
[snip]
I used to work with a guy who had a simple test for these things, and if I were a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an Ethernet cable. Unplug a server. Flip a breaker.
*DING DING* - we have a winner!
In a previous life, I used to spend a lot of time in other people's data centers. The key question to ask was how often they pulled the plug - i.e. disconnected utility power without having backup generators running. Simulating an actual failure. That goes for pulling out an Ethernet cord, unplugging a server, or flipping a breaker. It's all the same.
The problem is that if you don't do this for a while, you get SCARED of doing it, and you stop doing it. The longer you go without, the scarier it gets, to the point where you will never do it, because you have no idea what will happen, other than that you will probably get fired. This is called "horrible engineering management", and it is very common.
The other problem, of course, is that people design under the assumption that everything will always work, and that failure modes, when they occur, are predictable and fall into a narrow set. Multiple failure modes? Not tested. Failure modes including operator error? Never tested.
When was the last time you had a drill?
- Dan
Then he would wait, to see how long before a technician came to fix it.
If these activities were service impacting to customers, the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults, the team was graded on how quickly they detected and corrected the outage.
I've seen too many companies whose "test" is planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers.
TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant.
--
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/