On Sun, May 01, 2011 at 12:50:37PM -0700, George Bonser wrote:
From my reading of what happened, it looks like they didn't have a single point of failure but ended up routing around their own redundancy.
They apparently had a redundant primary network and, on top of that, a secondary network. The secondary network, however, did not have the capacity of the primary network.
Rather than failing over from the active portion of the primary network to the standby portion of the primary network, they inadvertently failed the entire primary network over to the secondary. This resulted in the secondary network reaching saturation and becoming unusable.
There isn't anything that can be done to mitigate against human error. You can TRY, but as history shows us, it all boils down to the human who implements the procedure. All the redundancy in the world will not do you an iota of good if someone explicitly does the wrong thing. [ ... ]
This looks like it was a procedural error and not an architectural problem. They seem to have had standby capability on the primary network and, from the way I read their statement, did not use it.
The procedural error was putting all the traffic on the secondary network. They promptly recognized that error and fixed it. It's certainly true that you can't eliminate human error.

The architectural problem is that they had insufficient error recovery capability. Initially, the system was trying to use a network that was too small; that situation lasted for some number of minutes, and it's no surprise that the system couldn't operate under those conditions. That isn't an indictment of the architecture. However, after they put it back on a network that wasn't too small, the service stayed down or degraded for many, many hours. That's an architectural problem. (And a very common one. Error recovery is hard and tedious, and more often than not it isn't done well.)

Procedural error isn't the only way to get into that boat. If the wrong pair of redundant equipment in their primary network had failed simultaneously, they'd likely have found themselves in the same place: a short outage caused by a risk they accepted (loss of a redundant pair of hardware), followed by a long outage, after they restored the network, caused by insufficient recovery capability.

Their writeup suggests they fully understand these issues and are doing the right thing by seeking better recovery capability. They spent one sentence saying they'll look at their procedures to reduce the risk of a similar procedural error in the future, and then spent paragraphs on what they are going to do to recover better should something like this happen again.

(One additional comment, for whoever posted that NetFlix had a better architecture and wasn't impacted by this outage: it might well be that NetFlix does have a better architecture, and that might be why they weren't impacted ... but there's also the possibility that they just run in a different region. Lots of entities with poor architecture running on AWS survived this outage just fine, simply by not being in the region that had the problem.)

-- Brett
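(A minimal sketch of that last point, purely as an illustration: a client that tries a primary region first and falls back to a second, independent region if the call fails. The endpoint URLs and service names below are hypothetical placeholders, not anything from the AWS writeup or from NetFlix.)

# Illustrative sketch only: region-level client failover.
# The endpoints are hypothetical; substitute your own regional deployments.
import urllib.request
import urllib.error

REGION_ENDPOINTS = [
    "https://service.us-east-1.example.com/health",  # primary region
    "https://service.us-west-1.example.com/health",  # independent fallback region
]

def fetch_with_regional_failover(endpoints, timeout=5):
    """Return the first successful response body, trying each region in order."""
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # this region is down or degraded; try the next one
    raise RuntimeError("all regions failed: %s" % last_error)

if __name__ == "__main__":
    print(fetch_with_regional_failover(REGION_ENDPOINTS))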