Re: FYI Netflix is down

30 Jun 2012

The last 2 Amazon outages were power issues isolated to just their us-east
Virginia data center. I read somewhere that Amazon has something like 70%
of their ec2 resources in Virginia and it's also their oldest ec2
datacenter, so I am guessing they learned a lot of lessons and are stuck
with an aged infrastructure there.

I think the real problem here is that a large subset of the customers using
ec2 misunderstand the redundancy that is built into the Amazon
architecture. You are essentially supposed to view individual virtual
machines as being entirely disposable and make duplicates of everything
across availability zones and, for extra points, across regions.
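
To make the point about disposable machines concrete, here is a rough
sketch in Python. It is only a toy model and does not call any AWS API;
the zone names, function names, and the min_healthy threshold are all
made up for illustration. It spreads replicas round-robin over zones and
checks whether losing any one zone still leaves enough capacity.

```python
from collections import Counter
from itertools import cycle

# Hypothetical zone names, used only for illustration.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]


def place_replicas(count, zones=ZONES):
    """Assign replicas to zones round-robin so no single zone holds them all."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(count)]


def survives_zone_loss(placement, min_healthy):
    """Return True if losing any single zone still leaves min_healthy replicas."""
    if not placement:
        return min_healthy <= 0
    per_zone = Counter(placement)
    # The worst case is losing the zone that holds the most replicas.
    worst_loss = max(per_zone.values())
    return len(placement) - worst_loss >= min_healthy


spread = place_replicas(6)        # two replicas in each of three zones
single = ["us-east-1a"] * 6       # everything in one zone
print(survives_zone_loss(spread, min_healthy=4))
print(survives_zone_loss(single, min_healthy=4))
```

With everything in one zone, a single zone outage takes out all six
replicas, which is the situation a lot of ec2 customers seem to be in.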

Most people instead think that the 2 cents/hour price tag is a massive cost
savings and the cloud is invincible. Look at the SLA for ec2: Amazon
basically doesn't consider it a real outage unless it's more than one
availability zone that is down.

What's more surprising is that Netflix was so affected by a single
availability zone outage. They are constantly talking about their chaos
monkey/simian army tool that purposely breaks random parts of their
infrastructure to prove it's fault tolerant, or to point out weaknesses to
fix.
(http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)
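
The general idea of that kind of tool can be sketched in a few lines of
Python. This is not Netflix's code, just a toy model under my own
assumptions: instances are plain ID strings, the list is non-empty, and
chaos_round and min_healthy are hypothetical names. One random instance
is "killed" and we check that enough survivors remain.

```python
import random


def kill_random_instance(instances, rng):
    """Pick one instance at random and return (victim, survivors)."""
    victim = rng.choice(instances)
    survivors = [i for i in instances if i != victim]
    return victim, survivors


def chaos_round(instances, min_healthy, rng=None):
    """Kill one random instance and report whether enough capacity is left."""
    if rng is None:
        rng = random.Random()
    victim, survivors = kill_random_instance(instances, rng)
    return victim, len(survivors) >= min_healthy


victim, ok = chaos_round(["i-a", "i-b", "i-c"], min_healthy=2)
print(f"killed {victim}, still healthy: {ok}")
```

A real test would also have to kill a whole zone's worth of instances at
once, not just one machine, to cover the kind of outage we just saw.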

I think the closest thing to a cascading failure they have had was the
4/29/11 outage (http://aws.amazon.com/message/65648/)

Mike

On Jun 30, 2012 3:05 PM, "Todd Underwood" <toddunder@gmail.com> wrote:
...
This was not a cascading failure.  It was a simple power outage.
Cascading failures involve interdependencies among components.
T
On Jun 30, 2012 2:21 PM, "Seth Mattinen" <sethm@rollernet.us> wrote:
...
On 6/30/12 9:25 AM, Todd Underwood wrote:
...
On Jun 30, 2012 11:23 AM, "Seth Mattinen" <sethm@rollernet.us> wrote:
...
But haven't they all been cascading failures?
No.  They have not.  That's not what that term means.
'Cascading failure' has a fairly specific meaning that doesn't imply
resilience in the face of decomposition into smaller parts.  Cascading
failures can occur even when a system is decomposed into small parts,
each of which is apparently well run.
I honestly have no idea how to parse that since it doesn't jibe with my
practical view of a cascading failure.
~Seth

Mike Devlin