http://aws.amazon.com/message/65648/
___
--
---------------------------------------------------------------
Joly MacFie  218 565 9365  Skype: punkcast
WWWhatsup NYC - http://wwwhatsup.com
http://pinstand.com - http://punkcast.com
VP (Admin) - ISOC-NY - http://isoc-ny.org
--------------------------------------------------------------
On 04/29/2011 12:35 PM, Joly MacFie wrote:
http://aws.amazon.com/message/65648/
___
So, in a nutshell, Amazon had a single point of failure which touched off this entire incident. I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one.

Good job by the AWS team, though. I am sure your new procedures and processes will get a shakeout again, and it will be interesting to see how that goes. I bet there will be more to learn along this road for us all.

Mike-
----- Original Message -----
From: "Mike" <mike-nanog@tiedyenetworks.com>
On 04/29/2011 12:35 PM, Joly MacFie wrote:
http://aws.amazon.com/message/65648/
So, in a nutshell, Amazon had a single point of failure which touched off this entire incident.
I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one.
Well, in fairness to Amazon, let's ask this: did the failure occur *behind a component interface they advertise as Reliable*? Either way, was it possible for a single customer to avoid that failure, and at what cost in expansion of scope and money?

Cheers,
-- jra
On 5/1/2011 2:07 PM, Mike wrote:
I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one.
Sure they can, but as a thought exercise fully 2n redundancy is difficult on a small scale for anything web facing. I've seen a very simple implementation for a website requiring 5 9's that consumed over $50k in equipment, and this wasn't even geographically diverse. I have to believe that scaling up the concept of "doing it right" results in exponential cost increases. To illustrate the problem, I'll give you the first step in the thought exercise: first, find two datacenters with diverse carriers that aren't on the same regional power grid (as we learned in the 2003 power outage, IIRC, New York and DC won't work, nor will Ohio, so you need redundant teams to cover a very remote site).
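For context on the "5 9's" figure above: 99.999% availability allows only about 5.3 minutes of downtime per year, which is part of why even a small web-facing deployment gets expensive. A quick back-of-the-envelope check (illustrative Python, not from the original post):

    # Downtime budget implied by an availability target (illustrative).
    def downtime_minutes_per_year(availability: float) -> float:
        minutes_per_year = 365.25 * 24 * 60          # ~525,960 minutes
        return minutes_per_year * (1 - availability)

    print(downtime_minutes_per_year(0.999))    # three nines -> ~526 min/year
    print(downtime_minutes_per_year(0.9999))   # four nines  -> ~53 min/year
    print(downtime_minutes_per_year(0.99999))  # five nines  -> ~5.3 min/year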
On Sun, May 1, 2011 at 2:18 PM, Andrew Kirch <trelane@trelane.net> wrote:
Sure they can, but as a thought exercise fully 2n redundancy is difficult on a small scale for anything web facing. I've seen a very simple implementation for a website requiring 5 9's that consumed over $50k in equipment, and this wasn't even geographically diverse. I have
What it really boils down to is this: if application developers are doing their jobs, a given service can be easy and inexpensive to distribute to unrelated systems/networks without a huge infrastructure expense. If the developers are not, you end up spending a lot of money on infrastructure to make up for code, databases, and APIs which were not designed with this in mind.

These same developers who do not design and implement services with diversity and redundancy in mind will fare little better with AWS than any other platform. Look at Reddit, for example. This is an application/service which is utterly trivial to implement in a cheap, distributed manner, yet they have failed to do so for years, and suffer repeated, long-duration outages as a result. They probably buy a lot more AWS services than would otherwise be needed, and truly have a more complex infrastructure than such a simple service should.

IT managers would do well to understand that a few smart programmers, who understand how all their tools (web servers, databases, filesystems, load-balancers, etc.) actually work, can often do more to keep infrastructure cost under control, and improve the reliability of services, than any other investment in IT resources.

--
Jeff S Wheeler <jsw@inconcepts.biz>
Sr Network Operator / Innovative Network Concepts
On 5/1/2011 9:29 AM, Jeff Wheeler wrote:
On Sun, May 1, 2011 at 2:18 PM, Andrew Kirch <trelane@trelane.net> wrote:
Sure they can, but as a thought exercise fully 2n redundancy is difficult on a small scale for anything web facing. I've seen a very simple implementation for a website requiring 5 9's that consumed over $50k in equipment, and this wasn't even geographically diverse. I have
What it really boils down to is this: if application developers are doing their jobs, a given service can be easy and inexpensive to distribute to unrelated systems/networks without a huge infrastructure expense. If the developers are not, you end up spending a lot of money on infrastructure to make up for code, databases, and APIs which were not designed with this in mind.
These same developers who do not design and implement services with diversity and redundancy in mind will fare little better with AWS than any other platform. Look at Reddit, for example. This is an application/service which is utterly trivial to implement in a cheap, distributed manner, yet they have failed to do so for years, and suffer repeated, long-duration outages as a result. They probably buy a lot more AWS services than would otherwise be needed, and truly have a more complex infrastructure than such a simple service should.
IT managers would do well to understand that a few smart programmers, who understand how all their tools (web servers, databases, filesystems, load-balancers, etc.) actually work, can often do more to keep infrastructure cost under control, and improve the reliability of services, than any other investment in IT resources.

If you want a perfect example of this, consider Netflix. Their infrastructure runs on AWS and we didn't see any downtime with them throughout the entire affair. One of the interesting things they've done to try and enforce reliability of services is an in-house service called Chaos Monkey, whose sole purpose is to randomly kill instances and services inside the infrastructure. Courtesy of Chaos Monkey and the defensive programming it enforces, nothing is dependent on anything else, and you will always get at least some form of the service. For example, if the recommendation engine dies, the application is smart enough to catch that and instead return a list of the most popular movies, and so on. There is an interesting blog post from their Director of Engineering about what they learned during their migration to AWS, including using less chatty APIs to reduce the impact of typical AWS latency: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html

Paul
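The fallback pattern Paul describes is simple to sketch. This is not Netflix's actual code, just a minimal illustration of catching a failed dependency and degrading to a cheaper answer; the client object and service names are hypothetical:

    # Illustrative sketch of graceful degradation (hypothetical names).
    import logging

    POPULAR_TITLES = ["Title A", "Title B", "Title C"]  # cheap, pre-computed fallback

    def fetch_recommendations(user_id, reco_client):
        """Return personalized picks, or the most-popular list if the
        recommendation service is down or too slow."""
        try:
            return reco_client.recommendations_for(user_id, timeout=0.25)
        except Exception:
            # The dependency died (or timed out): log it and serve a degraded
            # but still useful answer instead of failing the whole page.
            logging.warning("recommendation service unavailable; serving popular titles")
            return POPULAR_TITLES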
Jeff Wheeler wrote:
IT managers would do well to understand that a few smart programmers, who understand how all their tools (web servers, databases, filesystems, load-balancers, etc.) actually work, can often do more to
I fully agree. But much to my dismay and surprise I have learned that developers know very little above and beyond their field of interest, say Java programming. And I bet the reverse is true as well.

It surprised me because I, perhaps naively, assumed IT workers in general have a rather broad knowledge, because in general they're interested in many aspects of IT, try to find out as much as possible, and if they do not know something they make an effort to learn it. That's also considering that many (practical) things just aren't taught in university, which is to be expected, since the idea there is to develop an academic way of thinking. Maybe this "hacker" mentality is less prevalent than I, naively, assumed.

So I believe it's just really hard to find someone who is smart and who understands all or most of the aspects of IT, i.e. servers, databases, file systems, load balancers, networks, etc. And it's easier and cheaper in the short term to just open a can of <insert random IT job> and hope for the best.

Regards,
Jeroen
--
http://goldmark.org/jeff/stupid-disclaimers/
http://linuxmafia.com/~rick/faq/plural-of-virus.html
On Mon, 02 May 2011 12:27:34 PDT, Jeroen van Aart said:
It surprised me because I, perhaps naively, assumed IT workers in general have a rather broad knowledge
No, the average IT worker is always a mere 3 keystrokes away from getting their latest creation listed on www.thedailywtf.com. They're lucky they can manage to get stuff done in their own area of competency, much less develop broad knowledge. Sorry to break it to you.
Valdis.Kletnieks@vt.edu wrote:
On Mon, 02 May 2011 12:27:34 PDT, Jeroen van Aart said:
It surprised me because I, perhaps naively, assumed IT workers in general have a rather broad knowledge
Sorry to break it to you.
That's ok, the past tense in my story testifies to the fact I was already aware of it. But thanks. ;-)

Greetings,
Jeroen
--
http://goldmark.org/jeff/stupid-disclaimers/
http://linuxmafia.com/~rick/faq/plural-of-virus.html
On Mon, May 2, 2011 at 2:04 PM, Jeroen van Aart <jeroen@mompl.net> wrote:
Valdis.Kletnieks@vt.edu wrote:
On Mon, 02 May 2011 12:27:34 PDT, Jeroen van Aart said:
It surprised me because I, perhaps naively, assumed IT workers in general have a rather broad knowledge
Sorry to break it to you.
That's ok, the past tense in my story testifies to the fact I was already aware of it. But thanks. ;-)
There was a significant decline in knowledge as the .com era peaked in the 90s; less CS background required as an entry barrier, the employment pool grew fast enough that community knowledge organizations (Usenix, etc) didn't effectively diffuse into the new community, etc.

The number of people who "get" computer architecture, ops, clusters, networking, systems architecture and engineering, etc... Not good.

Sigh.

--
-george william herbert
george.herbert@gmail.com
On 5/2/2011 4:11 PM, George Herbert wrote:
On Mon, May 2, 2011 at 2:04 PM, Jeroen van Aart<jeroen@mompl.net> wrote:
Valdis.Kletnieks@vt.edu wrote:
On Mon, 02 May 2011 12:27:34 PDT, Jeroen van Aart said:
It surprised me because I, perhaps naively, assumed IT workers in general have a rather broad knowledge
Sorry to break it to you.
That's ok, the past tense in my story testifies to the fact I was already aware of it. But thanks. ;-)
There was a significant decline in knowledge as the .com era peaked in the 90s; less CS background required as an entry barrier, the employment pool grew fast enough that community knowledge organizations (Usenix, etc) didn't effectively diffuse into the new community, etc.
The number of people who "get" computer architecture, ops, clusters, networking, systems architecture and engineering, etc... Not good.
Sigh.
Unfortunately we see this when we interview candidates. Even those who have certifications generally only know how to do specific things within a narrow field. They don't have the base understanding of how things work, such as TCP/IP, so when they need to do something a little outside of the normal, they flounder.

Jason
Unlike the US of A, here in Australia the industry has gone *very* heavily down the path of requiring/expecting certification. They have bought into the faith that unless your resume includes CC?? you're worthless.

There are "colleges" (er, I mean training businesses) who will *guarantee* you will pass your exam at the end of the course. Amazingly enough, for some of them you never actually touch a router console (not even a virtual one) through the entire course.

Unfortunately the end result has been an entire generation of potential employees who are perfectly capable of passing an exam and thereby becoming 'certified', but who cannot be trusted to touch a production network. They have no understanding of what they're doing or why things work (or not) the way they do, no real troubleshooting skills, and certainly not an ounce of real-world production-network common sense.

Not *all* of them, but by far the vast overwhelming majority of candidates have the "I'm certified so gimme a job and pay me big bucks" attitude despite having *no real skills* worth mentioning.

PhilP

On Wed, May 4, 2011 at 12:41 AM, Jason Baugher <jason@thebaughers.com> wrote:
On 5/2/2011 4:11 PM, George Herbert wrote:
On Mon, May 2, 2011 at 2:04 PM, Jeroen van Aart<jeroen@mompl.net> wrote:
Valdis.Kletnieks@vt.edu wrote:
On Mon, 02 May 2011 12:27:34 PDT, Jeroen van Aart said:
It surprised me because I, perhaps naively, assumed IT workers in
general have a rather broad knowledge
Sorry to break it to you.
That's ok, the past tense in my story testifies to the fact I was already aware of it. But thanks. ;-)
There was a significant decline in knowledge as the .com era peaked in the 90s; less CS background required as an entry barrier, the employment pool grew fast enough that community knowledge organizations (Usenix, etc) didn't effectively diffuse into the new community, etc.
The number of people who "get" computer architecture, ops, clusters, networking, systems architecture and engineering, etc... Not good.
Sigh.
Unfortunately we see this when we interview candidates. Even those who have certifications generally only know how to do specific things within a narrow field. They don't have the base understanding of how things work, such as TCP/IP, so when they need to do something a little outside of the normal, they flounder.
Jason
On 05/02/2011 09:27 AM, Jeroen van Aart wrote:
Jeff Wheeler wrote:
IT managers would do well to understand that a few smart programmers, who understand how all their tools (web servers, databases, filesystems, load-balancers, etc.) actually work, can often do more to
I fully agree.
But much to my dismay and surprise I have learned that developers know very little above and beyond their field of interest, say java programming. And I bet this is vice versa.
It surprised me because I, perhaps naively, assumed IT workers in general have a rather broad knowledge because in general they're interested in many aspects of IT, try to find out as much as possible and if they do not know something they make an effort learning it. Also considering many (practical) things just aren't taught in university, which is to be expected since the idea is to develop an academic way of thinking.
I work with a bunch of developers; we're a primarily Java-based company, but I've got more than enough on my plate trying to keep up with everything practical as a sysadmin, from networks to hardware to audit needs, to even start thinking about adding Java skills to my repertoire! Especially given I'm the only sysadmin here and our infrastructure needs are quite diverse. I've learned to interpret the Java stack traces that get sent to me 24x7 on our critical mailing list so that I can identify whether it's code or infrastructure, but that's as far as I go with Java. I don't particularly see that I need to go further, either.

I strive to work *with* developers: no 'them vs us' attitudes, no arrogant "my way or the highway". I can't conceive why anyone would even consider maintaining those kinds of attitudes, but unfortunately I have seen them frequently, and they so often seem to be the norm rather than the exception.

Programming is not something I'd consider myself to be any good at. I'll happily and reasonably competently script stuff in Perl, Python or bash for sysadmin purposes, but I'd never make any pretence of it being 'good', well-done scripting. It's just not the way my mind works. I have my specialisms and they have theirs; the more productive use of time is to work with those who excel at that kind of thing. Here they don't make assumptions about my end of things, and I don't make assumptions about theirs. We ask each other questions and work together to figure out how best to proceed. Thankfully we're a small enough operation that management isn't too much of a burden.

Smart IT managers, in my book, work to take advantage of all the skills their workers have and provide an efficient framework for them to work together. What it seems we see more often than not are IT managers who persist in seeing sysadmin and development as 'ops' and 'dev' separately rather than combined, perpetuating the 'them vs us' attitudes rather than throwing them out for the inefficient, financially wasteful things they are.

Paul
On May 1, 2:29 pm, Jeff Wheeler <j...@inconcepts.biz> wrote:
What it really boils down to is this: if application developers are doing their jobs, a given service can be easy and inexpensive to distribute to unrelated systems/networks without a huge infrastructure expense. If the developers are not, you end up spending a lot of money on infrastructure to make up for code, databases, and APIs which were not designed with this in mind.
Umm... see the CAP theorem. There are certain things, such as ACID transactions, which are *impossible* to geographically distribute with redundancy in a performant and scalable way because of speed-of-light constraints.

Of course, web startups like Reddit have no excuse in this area: they don't even *need* ACID transactions for anything they do, as what they are storing is utterly unimportant in the financial sense and can be handled with eventually-consistent semantics. But asynchronous replication doesn't cut it for something like stock trades, or even B2C order taking.

I like to bag on my developers for not knowing anything about the infrastructure, but sometimes you just can't do it right because of physics. Or you can't do it right without writing your own OS, networking stacks, file systems, etc., which means it is essentially "impossible" in the real world.
On Thu, May 5, 2011 at 10:45 AM, Ryan Malayter <malayter@gmail.com> wrote:
On May 1, 2:29 pm, Jeff Wheeler <j...@inconcepts.biz> wrote:
What it really boils down to is this: if application developers are doing their jobs, a given service can be easy and inexpensive to distribute to unrelated systems/networks without a huge infrastructure expense. If the developers are not, you end up spending a lot of money on infrastructure to make up for code, databases, and APIs which were not designed with this in mind.
Umm... see the CAP theorem. There are certain things, such as ACID transactions, which are *impossible* to geographically distribute with redundancy in a performant and scalable way because of speed of light constraints.
That specific example depends on how order-dependent your consistency constraint is; you can have time-asynchronous, locally ACID changes across databases which are widely separated. If your consistency requires order synchronicity across the geographic DB cluster, then this is a potential epic fail, of course.

The general point is valid. Being able to tell whether your application *really* does require strict consistency, and, if it does, whether it requires strict ordering, is unfortunately beyond most line-level system designers. A lot of people guess wrong in both directions, and either cripple the app's performance unnecessarily or end up with dangerous failure modes inherent in the architecture.

--
-george william herbert
george.herbert@gmail.com
----- Original Message -----
From: "Ryan Malayter" <malayter@gmail.com>
I like to bag on my developers for not knowing anything about the infrastructure, but sometimes you just can't do it right because of physics. Or you can't do it right without writing your own OS, networking stacks, file systems, etc., which means it is essentially "impossible" in the real world.
"Physics"? Isn't that an entirely inadequate substitute for "desire"? Cheers, -- jra
On May 5, 3:51 pm, Jay Ashworth <j...@baylink.com> wrote:
----- Original Message -----
From: "Ryan Malayter" <malay...@gmail.com> I like to bag on my developers for not knowing anything about the infrastructure, but sometimes you just can't do it right because of physics. Or you can't do it right without writing your own OS, networking stacks, file systems, etc., which means it is essentially "impossible" in the real world.
"Physics"?
Isn't that an entirely inadequate substitute for "desire"?
Not really. For some applications, it is physics:

1) You need two or more locations separated by, say, 500 km for disaster protection (think Katrina, or the Japan tsunami).

2) Those two locations need to be 100% consistent, with in-order "serializable" ACID semantics for a particular database entity. An example would be some sort of financial account: the order of transactions against that account must be such that the account cannot go below a certain value, and debits to and from different accounts must always happen together or not at all.

The above implies a two-phase commit protocol. This, in turn, implies *at least* two network round trips. Given a perfect dedicated fiber network and no switch/router/CPU/disk latency, this means at least 10.8 ms per transaction, or at most 92 transactions per second per affected database entity. The reality of real networks, disks, databases, and servers makes this perfect scenario unachievable, often by an order of magnitude.

I don't have inside knowledge, but I suspect this is why Wall Street firms have DR sites across the river in New Jersey, rather than somewhere "safer".

Amazon's EBS service is network-based block storage, with semantics similar to the financial account scenario: data writes to the volume must happen in order at all replicas. Which is why EBS volumes cannot have a replica a great distance away from the primary. So any application which used the EBS abstraction for keeping consistent state was screwed during this Amazon outage. The fact that Amazon's availability zones were not, in fact, very isolated from each other for this particular failure scenario compounded the problem.
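Those figures are easy to sanity-check. Assuming a 500 km one-way fiber path and light propagating at roughly two-thirds of c in glass (assumed values, not taken from the post above):

    # Back-of-the-envelope 2PC latency over 500 km (illustrative assumptions).
    C_VACUUM_KM_S = 299_792.0
    FIBER_KM_S = C_VACUUM_KM_S * 2 / 3            # ~200,000 km/s in fiber
    distance_km = 500.0

    round_trip_s = 2 * distance_km / FIBER_KM_S   # one network round trip: ~5 ms
    commit_s = 2 * round_trip_s                   # two-phase commit: at least 2 round trips
    print(commit_s * 1000)                        # ~10 ms per transaction
    print(1 / commit_s)                           # ~100 serialized transactions/second

A slightly longer or slower real-world path gives the 10.8 ms / 92 TPS figures quoted above; either way, the order of magnitude is set by distance, not by hardware.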
----- Original Message -----
From: "Ryan Malayter" <malayter@gmail.com>
On May 5, 3:51 pm, Jay Ashworth <j...@baylink.com> wrote:
----- Original Message -----
From: "Ryan Malayter" <malay...@gmail.com> I like to bag on my developers for not knowing anything about the infrastructure, but sometimes you just can't do it right because of physics. Or you can't do it right without writing your own OS, networking stacks, file systems, etc., which means it is essentially "impossible" in the real world.
"Physics"?
Isn't that an entirely inadequate substitute for "desire"?
Not really. For some applications, it is physics:
You misinterpreted me. I was making fun of people who think "I want it, and therefore it WILL be so" trumps physics, of whom there are altogether too many in positions of power these days to suit me.
I don't have inside knowledge, but I suspect this is why Wall Street firms have DR sites across the river in New Jersey, rather than somewhere "safer".
You don't need inside knowledge; that issue's the subject of much general press lately; that's exactly why they do it. And they think it's good enough. I truly wish that their finding out it's not wouldn't be so massively disruptive for the rest of us poor slobs...
Amazon's EBS service is network-based block storage, with semantics similar to the financial account scenario: data writes to the volume must happen in-order at all replicas. Which is why EBS volumes cannot have a replica a great distance away from the primary. So any application which used the EBS abstraction for keeping consistent state were screwed during this Amazon outage. The fact that Amazon's availability zones were not, in fact, very isolated from each other for this particular failure scenario compounded the problem.
Oh, so maybe "letting someone else do the cloud for you"'s a bad idea? Whod'a thunk *that*? :-) Cheers, -- jra
I am preparing a graduate-level course for network managers. As part of this course I would like to use a series of case studies looking at problems such as the one described in the report from Amazon. If anyone has something similar or knows where I could find such things, I would appreciate a copy or a link.

Kenneth M. Chipps Ph.D.
I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one.
Good job by the AWS team however, I am sure your new procedures and processes will receive a shakeout again, and it will be interesting to see how that goes. I bet there will be more to learn along this road for us all.
Mike-
From my reading of what happened, it looks like they didn't have a single point of failure but ended up routing around their own redundancy.
They apparently had a redundant primary network and, on top of that, a secondary network. The secondary network, however, did not have the capacity of the primary network. Rather than failing over from the active portion of the primary network to the standby portion of the primary network, they inadvertently failed the entire primary network over to the secondary. This resulted in the secondary network reaching saturation and becoming unusable.

There isn't anything that can be done to mitigate against human error. You can TRY, but as history shows us, it all boils down to the human that implements the procedure. All the redundancy in the world will not do you an iota of good if someone explicitly does the wrong thing.

In this case it is my opinion that Amazon should not have considered their secondary network to be a true secondary if it was not capable of handling the traffic. A completely broken network might have been an easier failure mode to handle than a saturated network (high packet loss, but the network is "there").

This looks like it was a procedural error and not an architectural problem. They seem to have had standby capability on the primary network and, from the way I read their statement, did not use it.
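George's point about the secondary not being a "true" secondary suggests one mechanical safeguard: a traffic-shift procedure that refuses to move load onto a path that cannot carry it. A minimal sketch, with made-up numbers and function names (nothing here reflects Amazon's actual tooling):

    # Illustrative pre-flight check before shifting traffic to a backup network.
    def safe_to_fail_over(current_load_gbps: float,
                          target_capacity_gbps: float,
                          headroom: float = 0.8) -> bool:
        """Allow the shift only if the target can absorb the measured load
        with some headroom to spare."""
        return current_load_gbps <= target_capacity_gbps * headroom

    if not safe_to_fail_over(current_load_gbps=40.0, target_capacity_gbps=10.0):
        raise RuntimeError("refusing traffic shift: target network lacks capacity")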
On Sun, May 01, 2011 at 12:50:37PM -0700, George Bonser wrote:
From my reading of what happened, it looks like they didn't have a single point of failure but ended up routing around their own redundancy.
They apparently had a redundant primary network and, on top of that, a secondary network. The secondary network, however, did not have the capacity of the primary network.
Rather than failing over from the active portion of the primary network to the standby portion of the primary network, they inadvertently failed the entire primary network to the secondary. This resulted in the secondary network reaching saturation and becoming unusable.
There isn't anything that can be done to mitigate against human error. You can TRY, but as history shows us, it all boils down to the human that implements the procedure. All the redundancy in the world will not do you an iota of good if someone explicitly does the wrong thing. [ ... ]
This looks like it was a procedural error and not an architectural problem. They seem to have had standby capability on the primary network and, from the way I read their statement, did not use it.
The procedural error was putting all the traffic on the secondary network. They promptly recognized that error, and fixed it. It's certainly true that you can't eliminate human error.

The architectural problem is that they had insufficient error recovery capability. Initially, the system was trying to use a network that was too small; that situation lasted for some number of minutes; it's no surprise that the system couldn't operate under those conditions, and that isn't an indictment of the architecture. However, after they put it back on a network that wasn't too small, the service stayed down/degraded for many, many hours. That's an architectural problem. (And a very common one. Error recovery is hard and tedious and, more often than not, not done well.)

Procedural error isn't the only way to get into that boat. If the wrong pair of redundant equipment in their primary network failed simultaneously, they'd have likely found themselves in the same boat: a short outage caused by a risk they accepted (loss of a pair of redundant hardware), followed by a long outage (after they restored the network) caused by insufficient recovery capability.

Their writeup suggests they fully understand these issues and are doing the right thing by seeking to have better recovery capability. They spent one sentence saying they'll look at their procedures to reduce the risk of a similar procedural error in the future, and then spent paragraphs on what they are going to do to have better recovery should something like this occur in the future.

(One additional comment, for whoever posted that Netflix had a better architecture and wasn't impacted by this outage. It might well be that Netflix does have a better architecture and that might be why they weren't impacted ... but there's also the possibility that they just run in a different region. Lots of entities with poor architecture running on AWS survived this outage just fine, simply by not being in the region that had the problem.)

-- Brett
Subject: RE: Amazon diagnosis
Date: Sun, 1 May 2011 12:50:37 -0700
From: George Bonser <gbonser@seven.com>
They apparently had a redundant primary network and, on top of that, a secondary network. The secondary network, however, did not have the capacity of the primary network.
Rather than failing over from the active portion of the primary network to the standby portion of the primary network, they inadvertently failed the entire primary network to the secondary. This resulted in the secondary network reaching saturation and becoming unusable.
There isn't anything that can be done to mitigate against human error. You can TRY, but as history shows us, it all boils down to the human that implements the procedure. All the redundancy in the world will not do you an iota of good if someone explicitly does the wrong thing. ...
This looks like it was a procedural error and not an architectural problem.
A sage sayeth sooth: "For any 'fool-proof' system, there exists a *sufficiently determined* fool capable of breaking it."

It would seem that the validity of that has just been re-confirmed. <wry grin>

It is worthy of note that it is considerably harder to protect against accidental stupidity than it is to protect against intentional malice. ('Malice' is _much_ more predictable, in general. <wry grin>)
Date: Sun, 01 May 2011 11:07:56 -0700
From: Mike <mike-nanog@tiedyenetworks.com>
To: nanog@nanog.org
Subject: Re: Amazon diagnosis
On 04/29/2011 12:35 PM, Joly MacFie wrote:
http://aws.amazon.com/message/65648/
___
So, in a nutshell, Amazon had a single point of failure which touched off this entire incident.
I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one.
This was a classic case of _O'Brien's Law_ in action, which states, rather pithily: "Murphy... was an OPTIMIST!!"
On Sun, 01 May 2011 11:07:56 PDT, Mike said:
I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one.
For starters, you almost always screw up and have one NOC full of chuckle-headed banana eaters. And if you have two NOCs, that implies one entity deciding which one takes lead on a problem. ;)
On Fri, Apr 29, 2011 at 2:35 PM, Joly MacFie <joly@punkcast.com> wrote:
http://aws.amazon.com/message/65648/
http://storagemojo.com/2011/04/29/amazons-ebs-outage/

Stefan Mititelu
http://twitter.com/netfortius
http://www.linkedin.com/in/netfortius
participants (17)
- Andrew Kirch
- Brett Frankenberger
- George Bonser
- George Herbert
- Jason Baugher
- Jay Ashworth
- Jeff Wheeler
- Jeroen van Aart
- Joly MacFie
- Kenneth M. Chipps Ph.D.
- Mike
- Paul Graydon
- Phil Pierotti
- Robert Bonomi
- Ryan Malayter
- Stefan
- Valdis.Kletnieks@vt.edu