I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one.
Good job by the AWS team, though. I am sure your new procedures and processes will get a shakeout again, and it will be interesting to see how that goes. I bet there will be more to learn along this road for us all.
Mike-
From my reading of what happened, it looks like they didn't have a single point of failure but ended up routing around their own redundancy.
They apparently had a redundant primary network and, on top of that, a secondary network. The secondary network, however, did not have the capacity of the primary network. Rather than failing over from the active portion of the primary network to the standby portion of the primary network, they inadvertently shifted the entire primary network's traffic onto the secondary. The secondary network saturated and became unusable.

There isn't much that can be done to mitigate human error. You can TRY, but as history shows us, it all boils down to the human who implements the procedure. All the redundancy in the world will not do you an iota of good if someone explicitly does the wrong thing.

In this case, my opinion is that Amazon should not have considered their secondary network a true secondary if it was not capable of handling the traffic. A completely broken network might have been an easier failure mode to handle than a saturated one (high packet loss, but the network is still "there").

This looks like a procedural error, not an architectural problem. They seem to have had standby capacity on the primary network and, from the way I read their statement, did not use it.
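To make the capacity point concrete, here is a rough sketch of the kind of guard a failover procedure could apply: refuse to shift traffic onto a target network that cannot absorb the current load. The function name and the numbers are made up for illustration and have nothing to do with Amazon's actual tooling.

```python
# Hypothetical sketch of a capacity-aware failover check.
# All names and figures are illustrative, not AWS's real systems.

def safe_failover(current_load_gbps: float,
                  target_capacity_gbps: float,
                  headroom: float = 0.8) -> bool:
    """Allow failover only if the target network can absorb the
    traffic with some headroom; otherwise stay on the primary's
    standby path (or abort and alert an operator)."""
    return current_load_gbps <= target_capacity_gbps * headroom

# Roughly the scenario described above: the whole primary network's
# load gets pointed at a lower-capacity secondary network.
primary_load = 100.0       # Gbps currently on the primary (made up)
secondary_capacity = 20.0  # Gbps the secondary can carry (made up)

if safe_failover(primary_load, secondary_capacity):
    print("Fail over to secondary network")
else:
    print("Refuse failover: secondary would saturate; use primary standby instead")
```

A check like this would have turned "explicitly doing the wrong thing" into a refused operation rather than a saturated network.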