
The last 2 Amazon outages were power issues isolated to just their us-east Virginia data center. I read somewhere that Amazon has something like 70% of their EC2 resources in Virginia and it's also their oldest EC2 datacenter, so I am guessing they learned a lot of lessons and are stuck with an aged infrastructure there.

I think the real problem here is that a large subset of the customers using EC2 misunderstand the redundancy that is built into the Amazon architecture. You are essentially supposed to view individual virtual machines as being entirely disposable and make duplicates of everything across availability zones and, for extra points, across regions. Most people instead think that the 2 cents/hour price tag is a massive cost savings and the cloud is invincible. Look at the SLA for EC2: Amazon basically doesn't consider it a real outage unless more than one availability zone is down.

What's more surprising is that Netflix was so affected by a single availability zone outage. They are constantly talking about their Chaos Monkey/Simian Army tool that purposely breaks random parts of their infrastructure to prove it's fault tolerant, or to point out weaknesses to fix. (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)

I think the closest thing to a cascading failure they have had was the 4/29/11 outage (http://aws.amazon.com/message/65648/)

Mike

On Jun 30, 2012 3:05 PM, "Todd Underwood" <toddunder@gmail.com> wrote:
This was not a cascading failure. It was a simple power outage.
Cascading failures involve interdependencies among components.
T

On Jun 30, 2012 2:21 PM, "Seth Mattinen" <sethm@rollernet.us> wrote:
On 6/30/12 9:25 AM, Todd Underwood wrote:
On Jun 30, 2012 11:23 AM, "Seth Mattinen" <sethm@rollernet.us> wrote:
But haven't they all been cascading failures?
No. They have not. That's not what that term means.
'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run.
I honestly have no idea how to parse that since it doesn't jibe with my practical view of a cascading failure.
~Seth
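
To make the distinction argued over in this thread a bit more concrete, here is a minimal Python sketch of a toy load-redistribution model. It is only an illustration under assumed numbers: the function name `simulate`, the capacities, and the loads are all hypothetical and do not describe anything Amazon or Netflix actually runs. When the surviving nodes have headroom, the initial failure stays isolated (a simple outage). When they are already near capacity, the orphaned load pushes them over and the failure spreads through that interdependency (a cascading failure).

```python
def simulate(capacities, loads, initially_failed):
    """Toy model: load from failed nodes is split evenly among survivors.

    Returns the set of node indices that end up failed.
    All inputs are hypothetical illustration values.
    """
    loads = list(loads)
    failed = set(initially_failed)

    # Load that the initially failed nodes were carrying must go somewhere.
    orphaned = sum(loads[i] for i in failed)
    for i in failed:
        loads[i] = 0

    while orphaned > 0:
        alive = [i for i in range(len(loads)) if i not in failed]
        if not alive:
            break  # nothing left to absorb the load

        # Spread the orphaned load evenly over the surviving nodes.
        share = orphaned / len(alive)
        orphaned = 0
        for i in alive:
            loads[i] += share

        # Any survivor pushed past its capacity fails too and orphans its load.
        for i in alive:
            if loads[i] > capacities[i]:
                failed.add(i)
                orphaned += loads[i]
                loads[i] = 0

    return failed


# Plenty of headroom: only the node that lost power is down.
print(sorted(simulate([100] * 4, [40] * 4, {0})))  # [0]

# Nodes already running hot: one failure takes out all of them.
print(sorted(simulate([100] * 4, [80] * 4, {0})))  # [0, 1, 2, 3]
```

In both runs every individual node is "well run" in the sense of being under its own capacity before the failure; what differs is whether the nodes depend on each other's spare capacity.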