
Seems that they are unreachable at the moment. Called and there's a recorded message stating they are aware of an issue, no details.
-Joe

Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details.
-Joe

From Amazon
Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/)
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
-Grant
On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com> wrote:
Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details.
-Joe

To further expand:
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
-----Original Message-----
From: Grant Ridder [mailto:shortdudey123@gmail.com]
Sent: Friday, June 29, 2012 8:42 PM
To: Jason Baugher
Cc: nanog@nanog.org
Subject: Re: FYI Netflix is down
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone. -Grant On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com>wrote:
Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details.
-Joe

I have an instance in zone C and it is up and fine, so it must be A, B, or D that is down.
On Fri, Jun 29, 2012 at 10:42 PM, James Laszko <jamesl@mythostech.com> wrote:
To further expand:
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
-----Original Message----- From: Grant Ridder [mailto:shortdudey123@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/ ) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
-Grant
On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com
wrote:
Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details.
-Joe

On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder <shortdudey123@gmail.com> wrote:
I have an instance in zone C and it is up and fine, so it must be A, B, or D that is down.
It is my understanding that instance zones are randomized between customers -- so your zone C may be my zone A.
Ian
--
Ian Wilson
ian.m.wilson@gmail.com
Solving site load issues with database replication is a lot like solving your own personal problems with heroin -- at first, it sorta works, but after a while things just get out of hand.

Yes, although, when you launch an instance, you do have the option of selecting a zone if you want. However, once the instance is started it stays in that zone and does not switch.
On Fri, Jun 29, 2012 at 10:47 PM, Ian Wilson <ian.m.wilson@gmail.com> wrote:
On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder <shortdudey123@gmail.com> wrote:
I have an instance in zone C and it is up and fine, so it must be A, B, or D that is down.
It is my understanding that instance zones are randomized between customers -- so your zone C may be my zone A.
Ian -- Ian Wilson ian.m.wilson@gmail.com
Solving site load issues with database replication is a lot like solving your own personal problems with heroin -- at first, it sorta works, but after a while things just get out of hand.

If I recall correctly, availability zone (AZ) mappings are specific to an AWS account, and in fact there is no way to know if you are running in the same AZ as another AWS account:
http://aws.amazon.com/ec2/faqs/#How_can_I_make_sure_that_I_am_in_the_same_Av...
Also, AWS Elastic Load Balancer (and/or CloudWatch) should be able to detect that some instances are not reachable, and thus can start new instances and remap DNS entries automatically:
http://aws.amazon.com/elasticloadbalancing/
This time only 1 AZ is affected by the power outage, so sites with fault tolerance built into their AWS infrastructure should be able to handle the issues relatively easily. (A hedged sketch of such a multi-AZ setup follows the quoted messages below.)
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
On Fri, Jun 29, 2012 at 11:44 PM, Grant Ridder <shortdudey123@gmail.com> wrote:
I have an instance in zone C and it is up and fine, so it must be A, B, or D that is down.
On Fri, Jun 29, 2012 at 10:42 PM, James Laszko <jamesl@mythostech.com>wrote:
To further expand:
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
-----Original Message----- From: Grant Ridder [mailto:shortdudey123@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/ ) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
-Grant
On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com
wrote:
Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details.
-Joe
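[Editorial sketch of the multi-AZ load-balancing setup Rayson describes above: the names and zones are placeholders, the code uses the current boto3 SDK (which postdates this thread), and it is an illustration rather than anyone's actual configuration. Note also that, as I understand it, classic ELB only pulls unhealthy instances out of rotation; launching replacements is Auto Scaling's job.]

import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Spread one load balancer across several availability zones in the region.
elb.create_load_balancer(
    LoadBalancerName="example-web",   # placeholder name
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                "InstanceProtocol": "HTTP", "InstancePort": 80}],
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)

# Health-check registered instances so ones lost to a zone outage
# are taken out of rotation automatically.
elb.configure_health_check(
    LoadBalancerName="example-web",
    HealthCheck={"Target": "HTTP:80/healthz", "Interval": 30, "Timeout": 5,
                 "UnhealthyThreshold": 2, "HealthyThreshold": 3},
)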

+------------------------------------------------------------------------------
| On 2012-06-30 16:08:40, Rayson Ho wrote:
| If I recall correctly, availability zone (AZ) mappings are specific to an AWS account, and in fact there is no way to know if you are running in the same AZ as another AWS account:
| http://aws.amazon.com/ec2/faqs/#How_can_I_make_sure_that_I_am_in_the_same_Av...
| Also, AWS Elastic Load Balancer (and/or CloudWatch) should be able to detect that some instances are not reachable, and thus can start new instances and remap DNS entries automatically:
| http://aws.amazon.com/elasticloadbalancing/
| This time only 1 AZ is affected by the power outage, so sites with fault tolerance built into their AWS infrastructure should be able to handle the issues relatively easily.
Explain Netflix and Heroku last night. Both of whom architect across multiple AZs and have for many years.
The API and EBS across the region were also affected. ELB was _also_ affected across the region, and many customers continue to report problems with it.
We were told in May of last year after the last massive full-region EBS outage that the "control planes" for the API and related services were being decoupled so issues in a single AZ would not affect all. Seems to not be the case.
Just because they offer these features that should help with resiliency doesn't actually mean they _work_ under duress.
--
bdha
cyberpunk is dead. long live cyberpunk.

On Sat, Jun 30, 2012 at 4:45 PM, Bryan Horstmann-Allen < bdha@mirrorshades.net> wrote:
Explain Netflix and Heroku last night. Both of whom architect across multiple AZs and have for many years.
The API and EBS across the region were also affected. ELB was _also_ affected across the region, and many customers continue to report problems with it.
We were told in May of last year after the last massive full-region EBS outage that the "control planes" for the API and related services were being decoupled so issues in a single AZ would not affect all. Seems to not be the case.
Just because they offer these features that should help with resiliency doesn't actually mean they _work_ under duress. --
But in Netflix's case, if they architected their environment the way they said they did, why wouldn't they just fail over to us-west? Especially at their scale, I wouldn't expect them to be dependent on any AWS function in any region.
Mike

+------------------------------------------------------------------------------
| On 2012-06-30 16:55:53, Mike Devlin wrote:
| But in netflix case, if they architected their environment the way they said they did, why wouldnt they just fail over to us-west? especially at their scale, I wouldn't expect them to be dependent on any AWS function in any region.
Have a look at Asgard, the AWS management tool they just open sourced. It implies they rely very heavily on many AWS features, some of which are very much region specific.
As to their multi-region capability, I have no idea. I don't think I've ever seen them mention it.
--
bdha
cyberpunk is dead. long live cyberpunk.

On Sat, Jun 30, 2012 at 5:04 PM, Bryan Horstmann-Allen < bdha@mirrorshades.net> wrote:
Have a look at Asgard, the AWS management tool they just open sourced. It implies they rely very heavily on many AWS features, some of which are very much region specific.
As to their multi-region capability, I have no idea. I don't think I've ever seen the mention it. -- bdha cyberpunk is dead. long live cyberpunk.
Yeah, I am sure I am making some assumptions about how much resilience they have been building into their architecture, but since every year they have been getting rid of more and more of their physical infrastructure and putting it fully in AWS, and given the fact that they are a pay service, I would think they would account for a region going down.
Mike

Nature is such a PITA.
On 6/29/2012 10:42 PM, James Laszko wrote:
To further expand:
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
-----Original Message----- From: Grant Ridder [mailto:shortdudey123@gmail.com] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
-Grant
On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com>wrote:
Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details.
-Joe

Whatever happened to UPSs and generators?
On Fri, Jun 29, 2012 at 8:45 PM, Jason Baugher <jason@thebaughers.com> wrote:
Nature is such a PITA.
On 6/29/2012 10:42 PM, James Laszko wrote:
To further expand:
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
-----Original Message----- From: Grant Ridder [mailto:shortdudey123@gmail.**com<shortdudey123@gmail.com> ] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) ( http://status.aws.amazon.com/**) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
-Grant
On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com
wrote:
Seeing some reports of Pinterest and Instagram down as well. Amazon
cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a
recorded message stating they are aware of an issue, no details.
-Joe
--
Mike Lyon
408-621-4826
mike.lyon@gmail.com
http://www.linkedin.com/in/mlyon

I was wondering the same thing!
Also, Reddit appears to be really slow right now and I keep getting "reddit is under heavy load right now, sorry. Try again in a few minutes." I wonder if it's related. I believe they use Amazon for some of their stuff.
Derek
On 6/29/2012 11:47 PM, Mike Lyon wrote:
Whatever happened to UPSs and generators?
On Fri, Jun 29, 2012 at 8:45 PM, Jason Baugher <jason@thebaughers.com>wrote:
Nature is such a PITA.
On 6/29/2012 10:42 PM, James Laszko wrote:
To further expand:
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
-----Original Message----- From: Grant Ridder [mailto:shortdudey123@gmail.**com<shortdudey123@gmail.com> ] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) ( http://status.aws.amazon.com/**) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
-Grant
On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com
wrote: Seeing some reports of Pinterest and Instagram down as well. Amazon cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a
recorded message stating they are aware of an issue, no details.
-Joe

They may use it for content, but reddit.com resolves to IPs owned by Qwest.
On Fri, Jun 29, 2012 at 10:51 PM, Seth Mattinen <sethm@rollernet.us> wrote:
On 6/29/12 8:47 PM, Mike Lyon wrote:
Whatever happened to UPSs and generators?
You don't need them with The Cloud!
But seriously, this is something like the third or fourth time AWS fell over flat in recent memory.
~Seth

8:49 PM PDT Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online
On Fri, Jun 29, 2012 at 10:52 PM, Grant Ridder <shortdudey123@gmail.com> wrote:
They may use it for content, but reddit.com resolves to IPs own by quest
On Fri, Jun 29, 2012 at 10:51 PM, Seth Mattinen <sethm@rollernet.us>wrote:
On 6/29/12 8:47 PM, Mike Lyon wrote:
Whatever happened to UPSs and generators?
You don't need them with The Cloud!
But seriously, this is something like the third or fourth time AWS fell over flat in recent memory.
~Seth

On Fri, 29 Jun 2012, Mike Lyon wrote:
Whatever happened to UPSs and generators?
They can and do fail. See list archives for numerous reports and examples :)
Generators are capable of not starting.
ATSs can get into a situation where they don't transfer loads properly, or they can't start the generator(s).
UPSs can fail, drain out, or be left in bypass.
Breakers can trip and need a manual reset etc...
jms
On Fri, Jun 29, 2012 at 8:45 PM, Jason Baugher <jason@thebaughers.com>wrote:
Nature is such a PITA.
On 6/29/2012 10:42 PM, James Laszko wrote:
To further expand:
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
-----Original Message----- From: Grant Ridder [mailto:shortdudey123@gmail.**com<shortdudey123@gmail.com> ] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) ( http://status.aws.amazon.com/**) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
-Grant
On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com
wrote:
Seeing some reports of Pinterest and Instagram down as well. Amazon
cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a
recorded message stating they are aware of an issue, no details.
-Joe
-- Mike Lyon 408-621-4826 mike.lyon@gmail.com

you know what's happening even more?
..Amazon not learning their lesson.
they just had an outage quite similar.. they "performed a full audit" on electrical systems worldwide, according to the rfo/post mortem.
looks like they need to perform a "full and we mean it" audit, and like I've been doing/participating in at dot coms for a decade plus: Actually Do Regular Load tests..
Related/equally to blame: companies that rely heavily on one aws zone, or arguably "one cloud" (period), are asking for it.
Please stop these crappy practices, people. Do real world DR testing. Play "What If This City Dropped Off The Map" games, because tonight, parts of VA in fact did.
Down: Instagram, Pinterest, Netflix, Heroku, Woot, Pocket (Read It Later), and on and on. A bunch of openID sites. A bunch of DNS sites (think zoneedit et al). In fact, probably nearly a /12 if not more of space..
Blame lies both with AWS (again) and with these service providers. They all should know better.
-j
On Jun 29, 2012 11:22 PM, "Justin M. Streiner" <streiner@cluebyfour.org> wrote:
On Fri, 29 Jun 2012, Mike Lyon wrote:
Whatever happened to UPSs and generators?
They can and do fail. See list archives for numerous reports and examples :)
Generators are capable of not starting. ATSs can get into a situation where they don't transfer loads properly, or they can't start the generator(s) UPSs can fail, drain out, or be left in bypass. Breakers can trip and need a manual reset etc...
jms
On Fri, Jun 29, 2012 at 8:45 PM, Jason Baugher <jason@thebaughers.com
wrote:
Nature is such a PITA.
On 6/29/2012 10:42 PM, James Laszko wrote:
To further expand:
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
-----Original Message----- From: Grant Ridder [mailto:shortdudey123@gmail.****com< shortdudey123@gmail.com> ] Sent: Friday, June 29, 2012 8:42 PM To: Jason Baugher Cc: nanog@nanog.org Subject: Re: FYI Netflix is down
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) ( http://status.aws.amazon.com/**** <http://status.aws.amazon.com/**>) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
-Grant
On Fri, Jun 29, 2012 at 10:40 PM, Jason Baugher <jason@thebaughers.com
wrote:
Seeing some reports of Pinterest and Instagram down as well. Amazon
cloud services being implicated.
On 6/29/2012 10:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a
recorded message stating they are aware of an issue, no details.
-Joe
-- Mike Lyon 408-621-4826 mike.lyon@gmail.com
http://www.linkedin.com/in/**mlyon <http://www.linkedin.com/in/mlyon>

On Sat, Jun 30, 2012 at 3:38 PM, jamie rishaw <j@arpa.com> wrote:
... Down: Instagram, Pinterest, Netflix, Heroku, Woot. Pocket(Read It Later), and on and on. A bunch of openID sites. A bunch of DNS sites (think zoneedit et al). Infact, probably nearly a /12 if not more of space.. ...
Zoneedit doesn't seem to be down. I can both use the website and resolve my domains.

On 6/29/2012 10:38 PM, jamie rishaw wrote:
you know what's happening even more?
..Amazon not learning their lesson.
they just had an outage quite similar.. they "performed a full audit" on electrical systems worldwide, according to the rfo/post mortem.
looks like they need to perform a "full and we mean it" audit, and like I've been doing/participating in at dot coms for a decade plus: Actually Do Regular Load tests..
Related/equally to blame: companies that rely heavily on one aws zone, or arguably "one cloud" (period), are asking for it.
Please stop these crappy practices, people. Do real world DR testing. Play "What If This City Dropped Off The Map" games, because tonight, parts of VA infact did.
...
I am not a computer science guy but have been around a long time. Data centers and clouds are like software. Once they reach a certain size, it's impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters.

well one would think that they could at least get power redundancy right...
On Sat, Jun 30, 2012 at 1:07 AM, Roy <r.engehausen@gmail.com> wrote:
On 6/29/2012 10:38 PM, jamie rishaw wrote:
you know what's happening even more?
..Amazon not learning their lesson.
they just had an outage quite similar.. they "performed a full audit" on electrical systems worldwide, according to the rfo/post mortem.
looks like they need to perform a "full and we mean it" audit, and like I've been doing/participating in at dot coms for a decade plus: Actually Do Regular Load tests..
Related/equally to blame: companies that rely heavily on one aws zone, or arguably "one cloud" (period), are asking for it.
Please stop these crappy practices, people. Do real world DR testing. Play "What If This City Dropped Off The Map" games, because tonight, parts of VA infact did.
...
I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters.

On 6/30/12, Grant Ridder <shortdudey123@gmail.com> wrote:
well one would think that they could at least get power redundancy right...
It is very similar to suggesting redundancy within a site against building collapse. Reliable power redundancy is very hard and very expensive. Much harder and much more expensive than achieving network redundancy against switch or router failures.
And there are always tradeoffs involved, because there is only one utility grid available. There are always some limitations in the amount of isolation possible.
You have devices plugged into both power systems. There is some possibility a random device plugged into both systems creates a short in both branches that it plugs into.
Both power systems always have to share the same ground, due to safety considerations. Both power systems always have to have fuses or breakers installed, due to safety considerations, and there is always a possibility that various kinds of anomalies cause fuses to simultaneously blow in both systems.
--
-JH

I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters.
How to run a datacenter 101. Have more than one location, preferably far apart. It being Amazon I would expect more. :/

On 6/30/2012 3:11 AM, Tyler Haske wrote:
How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/
Based on? Clouds are nothing more than outsourced responsibility. My business has stopped while my IT department explains to me that it's not their fault because Amazon's down, and I can't exactly fire Amazon.
The cloud may be a technological wonder, but as far as business practices go, it's a DISASTER.
Andrew

On Sat, Jun 30, 2012 at 03:15:07AM -0400, Andrew D Kirch wrote:
On 6/30/2012 3:11 AM, Tyler Haske wrote:
How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/
Amazon has many datacenters and tries to make it easy to diversify.
Based on? Clouds are nothing more than outsourced responsibility. My business has stopped while my IT department explains to me that it's not their fault because Amazon's down <snip>
It *is* their fault. You can blame faulty manufacturing for having a HDD die, but it's IT's fault if it takes out the only copy of your database.
AWS 101: Amazon has clearly-marked "Availability Zones" for a reason. Oh, and business 101: have an exit strategy for every vendor.
This outage is mighty interesting. It's surprising how many big operations had poor availability strategies.
Also, I've been working on an exit strategy for one of my VM/colo providers, and AWS + colo in NoVa is one of my options.
The cloud may be a technological wonder, but as far as business practices go, it's a DISASTER.
I wouldn't say so. Like any disruptive service, you're getting an acceptably lower-quality product for significantly less money. And like most disruptors, it operates by different rules than the old stuff.
Regards,
Aaron

On 6/30/12 12:11 AM, Tyler Haske wrote:
I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters.
How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/
there are 7 regions in ec2: three in north america, two in asia, one in europe and one in south america.
us east coast, the one currently being impacted, is further subdivided into 5 availability zones.
us east 1d appears to be the only one currently being impacted.
distributing your application is left as an exercise to the reader.
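[One hedged, minimal take on that exercise, for illustration only: the AMI, instance type, and zone names are placeholders, and the code uses the current boto3 SDK, which postdates this thread. The idea is simply to run one copy of the workload in each of several availability zones so a single-zone power loss leaves the others running.]

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # illustrative zone list

for az in ZONES:
    # One instance per zone; a real deployment would also spread state,
    # load balancing, and DNS, not just compute.
    ec2.run_instances(
        ImageId="ami-12345678",       # placeholder AMI
        InstanceType="m1.small",      # instance type of the era, illustrative
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},
    )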

On Jun 30, 2012 12:25 AM, "joel jaeggli" <joelja@bogus.com> wrote:
On 6/30/12 12:11 AM, Tyler Haske wrote:
I am not a computer science guy but been around a long time. Data
centers
and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters.
How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/
there are 7 regions in ec2 three in north america two in asia one in europe and one in south america.
us east coast, the one currently being impacted is further subdivided into 5 availability zones.
us east 1d appears to be the only one currently being impacted.
distributing your application is left as an exercise to the reader.
+1
Sorry to be the Monday morning quarterback, but the sites that went down learned a valuable lesson in single point of failure analysis. A highly redundant and professionally run data center is a single point of failure.
Geo-redundancy is key. In fact, I would take distributed data centers over RAID, UPS, or any other "fancy pants" © mechanisms any day.
And, AWS East also seems to be cursed. I would run out of west for a while. :-)
I would also look into clouds of clouds. ... Who knows. Amazon could have an Enron moment, at which point a corporate entity with a tax id is now a single point of failure.
Pay your money, take your chances.
CB

On 6/30/12, Cameron Byrne <cb.list6@gmail.com> wrote:
On Jun 30, 2012 12:25 AM, "joel jaeggli" <joelja@bogus.com> wrote:
On 6/30/12 12:11 AM, Tyler Haske wrote: Geo-redundancy is key. In fact, i would take distributed data centers over RAID, UPS, or any other "fancy pants" © mechanisms any day.
Geo-redundancy is more expensive than any of those technologies, because it directly impacts every application and reduces performance.
It means that, for example, if an application needs to guarantee something is persisted to a distributed database, such as a record that such and such user's credit card has just been charged $X or such and such user has uploaded this blob to the web service, the round trip time of the longest latency path between any of the redundancy sites is added to the critical path of the WRITE transaction latency during the commit stage. Because you cannot complete a transaction and ensure you have consistency or correct data until that transaction reaches a system at the remote site managing the persistence, and is acknowledged as received intact.
For example, if you have geo sites which are a minimum of 250 miles apart; if you recall, light only travels 186.28 miles per millisecond. That means you have a 500 mile round trip and therefore have added a bare minimum of 2.6 milliseconds of latency to every write transaction, and probably more like 15 milliseconds.
If your original transaction latency was 1 millisecond, or 1000 transactions per second, AND you require only that the data reaches the remote site and is acknowledged (not that the transaction succeeds at the remote site before you commit), you are now at a minimum of 2.6 milliseconds, an average of 384 transactions per second. To actually do it safely, you require 3.6 milliseconds, limiting you to an average of 277 transactions per second.
If the application is not specially designed for remote site redundancy, then this means you require a scheme such as synchronous storage-level replication to achieve clustering, which has even worse results if there is significant geographic dispersion. RAID transactional latencies are much lower. UPSes and redundant power do not increase transaction latencies at all.
--
-JH
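[A quick back-of-the-envelope reproduction of that arithmetic, assuming commits are strictly serialized, one write at a time, and speed-of-light propagation rather than real fiber, which is slower:]

MILES_PER_MS = 186.28        # speed of light, miles per millisecond, per the post
distance_miles = 250         # minimum separation between the two sites
local_commit_ms = 1.0        # original 1 ms commit, i.e. 1000 tx/s serialized

rtt_ms = 2 * distance_miles / MILES_PER_MS   # ~2.7 ms; the post rounds this to 2.6 ms

ack_only_ms = rtt_ms                   # remote site only acknowledges receipt
safe_ms = local_commit_ms + rtt_ms     # commit locally, then wait for the ack

print(f"round trip latency: {rtt_ms:.1f} ms")
print(f"ack-only rate:      {1000 / ack_only_ms:.0f} tx/s")  # ~373 (384 with the post's rounding)
print(f"safe commit rate:   {1000 / safe_ms:.0f} tx/s")      # ~272 (277 with the post's rounding)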

On 6/30/2012 12:11 AM, Tyler Haske wrote:
On 6/29/2012 11:07 PM, Roy wrote:
I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters.
How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/
First off. They HAVE more than one location, and they are indeed far apart. That said, it's all mixed together, like some kind of goulash, and the companies who've gone with this particular model for their sites are paying for that fact.
Second, and more important. I *was* a "computer science guy" in a past life, and this is nonsense. You can have astonishingly large software projects that just continue to run smoothly, day in, day out, and they don't hit the news, because they don't break. There are data centers that don't hit the news, in precisely the same way.
If I had a business, right now, I would not have chosen Amazon's cloud (or anyone's for that matter). I would also not be using Google docs/services, for precisely the same reason. I'm a fan of controlling risk, where possible, and I'd say that this is all in the wrong direction for doing that.
No worries, though. It seems we are doomed to continue making the same mistakes, over and over.
--
Politicians are like a Slinky. They're really not good for anything, but they still bring a smile to your face when you push them down a flight of stairs.

Late reply, but:
On Sat, Jun 30, 2012 at 12:30 AM, Lynda <shrdlu@deaddrop.org> wrote:
... Second, and more important. I *was* a "computer science guy" in a past life, and this is nonsense. You can have astonishingly large software projects that just continue to run smoothly, day in, day out, and they don't hit the news, because they don't break. There are data centers that don't hit the news, in precisely the same way.
I really need to write the book on IT reliability I keep meaning to.
There's reliability - backwards looking statistical, which can be 100% for a given service or datacenter - and then there's dependability, forwards-predicted outage risks, which people often *assert* equals the prior reliability record, but in reality you often have a number of latent failures (and latent cascade paths) that you do not understand, did not identify previously, and are not aware of.
I've had or had to respond to over a billion dollars of cumulative IT disaster loss over my consulting career so far; I have NEVER seen anyone who did it perfect, even the best pros. And I include myself in that list.
Looking at other fields like aerospace and nuclear engineering, what is done in IT is not anywhere close to the same level of QA and engineering analysis and testing. We cannot assert better results with less work.
"Oh, that never happens", except I've had my stuff in three locations that had catastrophic generator failures. "Oh, that never happens" when you're doing power maintenance and the best-rated electrical company in California, in conjunction with the generator vendor and a couple of independent power EEs, mis-balance the maintenance generator loads between legs and blow the generators and datacenter. "Oh, that never happens" that the datacenter burns (or starts to burn and then gets flooded). "Oh, that never happens" that the FM-200 goes off or preaction breaks and water leaks. "Oh, that never happens" that well maintained and monitored and triple-redundant AC units all trip offline due to a common mode failure over the course of a weekend and the room gets up to 106 degrees.
Oh thank god the next thing didn't go wrong in THAT situation, because the spot temperature meters indicated that the ceiling height of that particular room peaked at 1 degree short of the temp at which the sprinkler heads are supposed to discharge, so we nearly lost that room to flooding rather than just a 10% disk and 15% power supply attrition over the next year...
Don't be so confident in the infrastructure. It's not engineered or built or maintained well enough to actually support that assertion. The same can be said of the application software and application architecture and integration.
--
-george william herbert
george.herbert@gmail.com

On 6/30/2012 12:11 AM, Tyler Haske wrote:
I am not a computer science guy but been around a long time. Data centers and clouds are like software. Once they reach a certain size, its impossible to keep the bugs out. You can test and test your heart out and something will slip by. You can say the same thing about nuclear reactors, Apollo moon missions, the NorthEast power grid, and most other technology disasters. How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/ .
It doesn't change my theory. You add that complexity, something happens and the failover routing doesn't work as planned. Been there, done that, have the T-shirt.

----- Original Message -----
From: "Tyler Haske" <tyler.haske@gmail.com>
How to run a datacenter 101. Have more then one location, preferably far apart. It being Amazon I would expect more. :/
Not entirely. Datacenters do go down, our best efforts to the contrary notwithstanding. Amazon doesn't guarantee you redundancy on EC2, only the tools to provide it yourself.
25% Amazon; 75% service provider clients; that's my appraisal of the blame.
Cheers,
-- jra
--
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com         2000 Land Rover DII
St Petersburg FL USA      http://photo.imageinc.us             +1 727 647 1274

On Sun, Jul 1, 2012 at 11:38 AM, Jay Ashworth <jra@baylink.com> wrote:
Not entirely. Datacenters do go down, our best efforts to the contrary notwithstanding. Amazon doesn't guarantee you redundancy on EC2, only the tools to provide it yourself. 25% Amazon; 75% service provider clients; that's my appraisal of the blame.
From a Wired article:
That’s what was supposed to happen at Netflix Friday night. But it didn’t work out that way. According to Twitter messages from Netflix Director of Cloud Architecture Adrian Cockcroft and Instagram Engineer Rick Branson, it looks like an Amazon Elastic Load Balancing service, designed to spread Netflix’s processing loads across data centers, failed during the outage. Without that ELB service working properly, the Netflix and Pintrest services hosted by Amazon crashed.
http://www.wired.com/wiredenterprise/2012/06/real-clouds-crush-amazon/
The GSLB fail-over that was supposed to take place for the affected services (that had configured their applications to fail-over) failed.
I heard about this the day after Google announced the Compute Engine addition to the App Engine product lines they have. The demo was awesome. I imagine Google has GSLB down pat by now, so some companies might start looking... ;-]
--steve

On Sat, 30 Jun 2012, jamie rishaw wrote:
you know what's happening even more?
..Amazon not learning their lesson.
I was not giving anyone a free pass or attempting to shrug off the outage. I was just stating that there are many reasons why things break. I haven't seen anything official on this yet, but this looks a lot like a cascading failure. jms

On 6/30/12 4:50 AM, Justin M. Streiner wrote:
On Sat, 30 Jun 2012, jamie rishaw wrote:
you know what's happening even more?
..Amazon not learning their lesson.
I was not giving anyone a free pass or attempting to shrug off the outage. I was just stating that there are many reasons why things break. I haven't seen anything official on this yet, but this looks a lot like a cascading failure.
But haven't they all been cascading failures? One can't just say "well it's a huge system, therefore hard". Especially when they claimed to have learned their lesson from previous outwardly similar failures; either they were lying, or didn't really learn anything, or the scope simply exceeds their grasp. If it's too hard for entity X to handle a large system (for whatever "large" means to them), then X needs to break it down into smaller parts that they're capable of handling in a competent manner. ~Seth

On Jun 30, 2012 11:23 AM, "Seth Mattinen" <sethm@rollernet.us> wrote:
But haven't they all been cascading failures?
No. They have not. That's not what that term means. 'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run. T

On 6/30/12, Todd Underwood <toddunder@gmail.com> wrote:
On Jun 30, 2012 11:23 AM, "Seth Mattinen" <sethm@rollernet.us> wrote:
But haven't they all been cascading failures? No. They have not. That's not what that term means.
'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading
Not sure where you're going there; cascading failures are common, but fortunately are usually temporary or have some kind of scope limit.
Cascading just means you have a dependency between components, where the failure of one component may result in the failure of a second component, the failure of the second component results in failure of a third component, and this process continues until no more components are dependent or no more components are still operating.
This can happen to the small pieces inside of one specific system, causing that system to collapse. It's just as valid to say cascading failure occurs across larger/more complex pieces of different higher level systems, where the components of one system aren't sufficiently independent of those in other systems, causing both systems to fail.
Your application logic can be a point of failure, just as readily as your datacenter can. Cascades can happen at a higher level where entire systems are dependent upon entire other systems.
And it can happen organizationally. External dependency risk occurs when an entire business is dependent on another organization (such as product support) to remotely administer software they sold, and the subcontractor of the product support org. stops doing their job, or a smaller component (one member of their staff) becomes a rogue/malicious element.
--
-JH

On 6/30/12 9:25 AM, Todd Underwood wrote:
On Jun 30, 2012 11:23 AM, "Seth Mattinen" <sethm@rollernet.us> wrote:
But haven't they all been cascading failures?
No. They have not. That's not what that term means.
'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run.
I honestly have no idea how to parse that since it doesn't jibe with my practical view of a cascading failure.
~Seth

This was not a cascading failure. It was a simple power outage.
Cascading failures involve interdependencies among components.
T
On Jun 30, 2012 2:21 PM, "Seth Mattinen" <sethm@rollernet.us> wrote:
On 6/30/12 9:25 AM, Todd Underwood wrote:
On Jun 30, 2012 11:23 AM, "Seth Mattinen" <sethm@rollernet.us> wrote:
But haven't they all been cascading failures?
No. They have not. That's not what that term means.
'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run.
I honestly have no idea how to parse that since it doesn't jive with my practical view of a cascading failure.
~Seth

On 6/30/12, Todd Underwood <toddunder@gmail.com> wrote:
This was not a cascading failure. It was a simple power outage Cascading failures involve interdependencies among components.
Actually, you can't really say that. It's true that it was a simple power outage for Amazon. Power failed, causing the AWS service at certain locations to experience issues. Any of the issues related to services at locations that didn't lose power are a possible result of cascade.
But as for the other possible outages being reported... Instagram, Pinterest, Netflix, Heroku, Woot, Pocket, zoneedit. Possibly Amazon's power failure caused AWS problems, which resulted in issues with these services. Some of these services may actually have had redundancy in place, but experienced a failure of their service as a result of unexpected cascade from the affected site.
T -- -JH

The last 2 Amazon outages were power issues isolated to just their us-east Virginia data center. I read somewhere that Amazon has something like 70% of their ec2 resources in Virginia and it is also their oldest ec2 datacenter, so I am guessing they learned a lot of lessons and are stuck with an aged infrastructure there.
I think the real problem here is that a large subset of the customers using ec2 misunderstand the redundancy that is built into the Amazon architecture. You are essentially supposed to view individual virtual machines as being entirely disposable and make duplicates of everything across availability zones, and for extra points across regions. Most people instead think that the 2 cents/hour price tag is a massive cost savings and the cloud is invincible. Look at the SLA for ec2... Amazon basically doesn't really consider it a real outage unless it's more than one availability zone that is down.
What's more surprising is that Netflix was so affected by a single availability zone outage. They are constantly talking about their chaos monkey/simian army tool that purposely breaks random parts of their infrastructure to prove it is fault tolerant, or to point out weaknesses to fix. (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html)
I think the closest thing to a cascading failure they have had was the 4/29/11 outage (http://aws.amazon.com/message/65648/).
Mike
On Jun 30, 2012 3:05 PM, "Todd Underwood" <toddunder@gmail.com> wrote:
This was not a cascading failure. It was a simple power outage
Cascading failures involve interdependencies among components.
T On Jun 30, 2012 2:21 PM, "Seth Mattinen" <sethm@rollernet.us> wrote:
On 6/30/12 9:25 AM, Todd Underwood wrote:
On Jun 30, 2012 11:23 AM, "Seth Mattinen" <sethm@rollernet.us> wrote:
But haven't they all been cascading failures?
No. They have not. That's not what that term means.
'Cascading failure' has a fairly specific meaning that doesn't imply resilience in the face of decomposition into smaller parts. Cascading failures can occur even when a system is decomposed into small parts, each of which is apparently well run.
I honestly have no idea how to parse that since it doesn't jive with my practical view of a cascading failure.
~Seth

On 6/30/12 12:04 PM, Todd Underwood wrote:
This was not a cascading failure. It was a simple power outage
Cascading failures involve interdependencies among components.
I guess I'm assuming there were UPS and generator systems involved (and failing) with powering the critical load, but I suppose it could all be direct to utility power. ~Seth

The interesting thing to me is the US population by time zone. If Amazon has 70% of servers in the Eastern time zone it makes some sense. Mountain + Pacific is smaller than Central, which is a bit more than half Eastern.
These stats are older but a good rough gauge: http://answers.google.com/answers/threadview?id=714986
Jared Mauch
On Jun 30, 2012, at 4:03 PM, Seth Mattinen <sethm@rollernet.us> wrote:
On 6/30/12 12:04 PM, Todd Underwood wrote:
This was not a cascading failure. It was a simple power outage
Cascading failures involve interdependencies among components.
I guess I'm assuming there were UPS and generator systems involved (and failing) with powering the critical load, but I suppose it could all be direct to utility power.
~Seth

On Sat, Jun 30, 2012 at 12:04 PM, Todd Underwood <toddunder@gmail.com>wrote:
This was not a cascading failure. It was a simple power outage
Cascading failures involve interdependencies among components.
Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then "fails" itself as a result (in some form or other, but frequently different to the mode of the original failure).
Whilst the Amazon outage might have been a "simple" power outage, it's likely that at least some of the website outages caused were a combination of not just the direct Amazon outage, but also the flow-on effect of their redundancy attempting (but failing) to kick in - potentially making the problem worse than just the Amazon outage caused.
Scott

scott,
This was not a cascading failure. It was a simple power outage
Cascading failures involve interdependencies among components.
Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then "fails" itself as a result (in some form or other, but frequently different to the mode of the original failure).
indeed. and that is an interdependency among components. in particular, it is a capacity interdependency.
Whilst the Amazon outage might have been a "simple" power outage, it's likely that at least some of the website outages caused were a combination of not just the direct Amazon outage, but also the flow-on effect of their redundancy attempting (but failing) to kick in - potentially making the problem worse than just the Amazon outage caused.
i think you over-estimate these websites. most of them simply have no redundancy (and obviously have no tested, effective redundancy) and were simply hoping that amazon didn't really go down that much.
hope is not the best strategy, as it turns out.
i suspect that randy is right though: many of these businesses do not promise perfect uptime and can survive these kinds of failures with little loss to business or reputation. twitter has branded its early failures with a whale that not only didn't hurt it but helped endear the service to millions. when your service fits these criteria, why would you bother doing the complicated systems and application engineering necessary to actually have functional redundancy?
it simply isn't worth it.
t
Scott

-----Original Message----- From: Todd Underwood [mailto:toddunder@gmail.com]
scott,
This was not a cascading failure. It was a simple power outage
Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the event that happened at the same facility approximately two weeks ago (it's immaterial - the details are probably different, but it illustrates the complexity of a data center failure):
Utility Power Failed
First Backup Generator Failed (shut down due to a faulty fan)
Second Backup Generator Failed (breaker coordination problem resulting in faulty trip of a breaker)
In this case, it was clearly a cascading failure, although only limited in scope. The failure in this case also clearly involved people. There was one material failure (the fan), but the system should have been resilient enough to deal with it. The system should also have been resilient enough to deal with the breaker coordination issue (which should not have occurred), but was not.
Data centers are not commodities. There is a way to engineer these facilities to be much more resilient. Not everyone's business model supports it.
- Dan
Cascading failures involve interdependencies among components.
Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then "fails" itself as a result (in some form or other, but frequently different to the mode of the original failure).
indeed. and that is an interdependency among components. in particular, it is a capacity interdependency.
Whilst the Amazon outage might have been a "simple" power outage, it's likely that at least some of the website outages caused were a combination of not just the direct Amazon outage, but also the flow- on effect of their redundancy attempting (but failing) to kick in - potentially making the problem worse than just the Amazon outage caused.
i think you over-estimate these websites. most of them simply have no redundancy (and obviously have no tested, effective redundancy) and were simply hoping that amazon didn't really go down that much.
hope is not the best strategy, as it turns out.
i suspect that randy is right though: many of these businesses do not promise perfect uptime and can survive these kinds of failures with little loss to business or reputation. twitter has branded it's early failures with a whale that no only didn't hurt it but helped endear the service to millions. when your service fits these criteria, why would you bother doing the complicated systems and application engineering necessary to actually have functional redundancy?
it simply isn't worth it.
t
Scott

Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the event that happened at the same facility approximately two weeks ago (its immaterial - the details are probably different, but it illustrates the complexity of a data center failure)
Utility Power Failed First Backup Generator Failed (shut down due to a faulty fan) Second Backup Generator Failed (breaker coordination problem resulting in faulty trip of a breaker)
In this case, it was clearly a cascading failure, although only limited in scope. The failure in this case, also clearly involved people. There was one material failure (the fan), but the system should have been resilient enough to deal with it. The system should also have been resilient enough to deal with the breaker coordination issue (which should not have occurred), but was not. Data centers are not commodities. There is a way to engineer these facilities to be much more resilient. Not everyone's business model supports it.
ok, i give in. at some level of granularity everything is a cascading failure (since molecules collide and the world is an infinite chain of causation in which human free will is merely a myth </Spinoza>).
of course, this use of 'cascading' is vacuous and not useful anymore since it applies to nearly every failure, but i'll go along with it.
from the perspective of a datacenter power engineer, this was a cascading failure of a small number of components.
from the perspective of every datacenter customer: this was a power failure.
from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility.
from the perspective of nanog mailing list readers: this was an interesting opportunity to speculate about failures about which we have no data (as usual!).
can we all agree on those facts? :-)
t

In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote:
from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility.
I want to emphasize _and test_.

Work on an infrastructure which is redundant and designed to provide "100% uptime" (which is impossible, but that's another story) means that there should be confidence in a failure being automatically worked around, detected, and reported.

I used to work with a guy who had a simple test for these things, and if I was a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker.

Then he would wait, to see how long before a technician came to fix it.

If these activities were service impacting to customers, the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults, the team was graded on how quickly they detected and corrected the outage.

I've seen too many companies whose "test" is planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers.

TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant.

-- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
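A minimal sketch of that monthly "break something and time the response" drill, in script form. Everything here is hypothetical: the host list, the ssh/systemctl action, and the health-check URL are placeholders, not anyone's real tooling.

#!/usr/bin/env python3
"""Sketch of a scheduled 'break something at random' drill.
Hosts, commands, and URLs below are hypothetical placeholders."""
import random
import subprocess
import time
import urllib.request

TARGETS = ["app-01.example.net", "app-02.example.net", "db-01.example.net"]
HEALTH_URL = "https://service.example.net/healthz"   # customer-facing check

def break_something(host):
    # Stand-in for pulling a cable or flipping a breaker: stop one service.
    subprocess.run(["ssh", host, "sudo", "systemctl", "stop", "app"], check=True)

def customers_impacted():
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status != 200
    except Exception:
        return True

if __name__ == "__main__":
    victim = random.choice(TARGETS)
    print(f"breaking {victim}")
    start = time.time()
    break_something(victim)
    if not customers_impacted():
        print("no customer-visible impact; now grade how fast a tech notices")
    else:
        while customers_impacted():
            time.sleep(10)
        print(f"customer-visible impact lasted {time.time() - start:.0f}s "
              "- the redundancy, not just the response, needs fixing")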

On Mon, 2 Jul 2012, Leo Bicknell wrote:
I used to work with a guy who had a simple test for these things, and if I was a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the
you mean like this? http://techblog.netflix.com/2011/07/netflix-simian-army.html -- david raistrick http://www.netmeister.org/news/learn2quote.html drais@icantclick.org

In a message written on Mon, Jul 02, 2012 at 12:13:22PM -0400, david raistrick wrote:
you mean like this?
http://techblog.netflix.com/2011/07/netflix-simian-army.html
Yes, Netflix seems to get it, and I think their Simian Army is a great QA tool. However, it is not a complete testing system; I have never seen them talk about testing non-software components, and I hope they do that as well. As we saw in the previous Amazon outage, part of the problem was a circuit breaker configuration. -- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/

On Mon, 2 Jul 2012, Leo Bicknell wrote:
http://techblog.netflix.com/2011/07/netflix-simian-army.html
Yes, Netflix seems to get it, and I think their Simian Army is a great QA tool. However, it is not a complete testing system; I have never seen them talk about testing non-software components, and I hope they do that as well. As we saw in the previous Amazon outage, part of the problem was a circuit breaker configuration.
When the hardware is outsourced how would you propose testing the non-software components? They do simulate availability zone issues (and AZ is as close as you get to controlling which internal power/network/etc grid you're attached to). I suppose they could introduce artificial network latency/loss @ each instance - and could add testing around what happens when amazon's API disappears (as was the case friday). Beyond that....the rest of it is up to the hardware provider (Amazon, in this case). ..david (who also relies on outsourced hardware these days) -- david raistrick http://www.netmeister.org/news/learn2quote.html drais@icantclick.org

In a message written on Mon, Jul 02, 2012 at 12:23:57PM -0400, david raistrick wrote:
When the hardware is outsourced how would you propose testing the non-software components? They do simulate availability zone issues (and AZ is as close as you get to controlling which internal power/network/etc grid you're attached to).
Find a provider with a similar methodology. Perhaps Netflix never conducts a power test, but their colo vendor would perform such testing. If no colo providers exist that share their values on testing, that may be a sign that outsourcing it isn't the right answer... -- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/

On Jul 2, 2012 10:53 AM, "Leo Bicknell" <bicknell@ufp.org> wrote:
In a message written on Mon, Jul 02, 2012 at 12:23:57PM -0400, david raistrick wrote:
When the hardware is outsourced how would you propose testing the non-software components? They do simulate availability zone issues (and AZ is as close as you get to controlling which internal power/network/etc grid you're attached to).
Find a provider with a similar methodology. Perhaps Netflix never conducts a power test, but their colo vendor would perform such testing.
If no colo providers exist that share their values on testing, that may be a sign that outsourcing it isn't the right answer...
-- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
I suggest using RAIC - a Redundant Array of Inexpensive Clouds. Make your chaos animal go after sites and regions instead of individual VMs. CB

On 2 July 2012 19:20, Cameron Byrne <cb.list6@gmail.com> wrote:
Make your chaos animal go after sites and regions instead of individual VMs.
CB
From a previous post mortem http://techblog.netflix.com/2011_04_01_archive.html
" Create More Failures Currently, Netflix uses a service called "Chaos Monkey<http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html>" to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service "Chaos Gorilla". *"* It would seem the Gorilla hasn't quite matured. Tony

On 2 July 2012 19:20, Cameron Byrne <cb.list6@gmail.com> wrote:
Make your chaos animal go after sites and regions instead of individual VMs.
CB
From a previous post mortem http://techblog.netflix.com/2011_04_01_archive.html
" Create More Failures Currently, Netflix uses a service called "Chaos Monkey<http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html>" to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service "Chaos Gorilla". *"*
It would seem the Gorilla hasn't quite matured.
Tony

From conversations with Adrian Cockcroft this weekend it wasn't the result of Chaos Gorilla or Chaos Monkey failing to prepare them adequately. All their automated stuff worked perfectly; the infrastructure tried to self-heal. The problem was that, yet again, Amazon's back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic, and no back-plane meant they were unable to reconfigure it to route around the problem.

Paul

On Jul 2, 2012, at 11:59 AM, Paul Graydon wrote:
back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic and no back-plane meant they were unable to reconfigure it to route around the problem.
Someone needs to define back-plane/control-plane in this case. (and what wasn't working) During the height of the problems, what I saw was a Netflix A record pointing at a broken ELB. If there was an ELB to point to in another AZ, it wouldn't take anything from Amazon to change that A record, as Netflix uses ultradns. -j

On Mon, 2 Jul 2012, James Downs wrote:
back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic and no back-plane meant they were unable to reconfigure it to route around the problem.
Someone needs to define back-plane/control-plane in this case. (and what wasn't working)
Amazon resources are controlled (from a consumer viewpoint) by API - that API is also used by amazon's internal toolkits that support ELB (and RDS..). Those (http accessed) API interfaces were unavailable for a good portion of the outages. I know nothing of the netflix side of it - but that's what -we- saw. (and that caused all us-east RDS instances in every AZ to appear offline..) -- david raistrick http://www.netmeister.org/news/learn2quote.html drais@icantclick.org
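As a rough illustration of what "the API interfaces were unavailable" looks like from the outside, here is a small probe sketch. It assumes the usual ec2.<region>.amazonaws.com endpoint naming and only checks that each regional control plane answers HTTP at all; an unsigned request will get an error back, but a prompt error still means the front end is alive.

"""Probe EC2 regional API endpoints for basic reachability.
Region list and interpretation are illustrative, not exhaustive."""
import time
import urllib.error
import urllib.request

REGIONS = ["us-east-1", "us-west-1", "us-west-2", "eu-west-1"]

for region in REGIONS:
    url = f"https://ec2.{region}.amazonaws.com/"
    start = time.time()
    try:
        urllib.request.urlopen(url, timeout=10)
        status = "up"
    except urllib.error.HTTPError as err:
        # An HTTP error is expected for an unsigned request; the endpoint answered.
        status = f"up (HTTP {err.code})"
    except Exception as err:
        status = f"DOWN ({err.__class__.__name__})"
    print(f"{region:12s} {status:18s} {time.time() - start:.2f}s")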

On Jul 2, 2012, at 1:20 PM, david raistrick wrote:
Amazon resources are controlled (from a consumer viewpoint) by API - that API is also used by amazon's internal toolkits that support ELB (and RDS..). Those (http accessed) API interfaces were unavailable for a good portion of the outages.
Right, and other toolkits like boto. Each Region has a different endpoint (URL), and as I have no resources running in East, I saw no problems with the API endpoints I use. So, as you note, the US-EAST Region was "not controllable".
I know nothing of the netflix side of it - but that's what -we- saw. (and that caused all us-east RDS instances in every AZ to appear offline..)
And, if you lose US-EAST, you need to run *somewhere*. Netflix did not cut www.netflix.com over to another Region. Why not is another question. -j
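For a sense of what that cutover amounts to, here is a sketch of the decision logic only: watch the load balancer fronting the primary Region and, after enough consecutive failures, repoint the record at a standby in another Region. The DNS update itself is provider-specific (UltraDNS, Route 53, or anything else), so update_dns() below is a placeholder, and all hostnames are hypothetical.

"""Sketch of health-check-driven DNS cutover between Regions.
Hostnames are hypothetical and update_dns() is a stub, not a real API."""
import time
import urllib.request

PRIMARY_HEALTH = "https://elb-primary.us-east-1.example.com/healthz"
STANDBY_TARGET = "elb-standby.us-west-2.example.com"
FAILURES_BEFORE_CUTOVER = 5
CHECK_INTERVAL = 30   # seconds; the record's TTL also bounds how fast clients move

def healthy(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def update_dns(name, target):
    # Placeholder: call your DNS provider's API here.
    print(f"would repoint {name} -> {target}")

failures = 0
while True:
    failures = 0 if healthy(PRIMARY_HEALTH) else failures + 1
    if failures >= FAILURES_BEFORE_CUTOVER:
        update_dns("www.example.com", STANDBY_TARGET)
        break
    time.sleep(CHECK_INTERVAL)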

On Jul 2, 2012, at 7:03 PM, James Downs <egon@egon.cc> wrote:
On Jul 2, 2012, at 1:20 PM, david raistrick wrote:
Amazon resources are controlled (from a consumer viewpoint) by API - that API is also used by amazon's internal toolkits that support ELB (and RDS..). Those (http accessed) API interfaces were unavailable for a good portion of the outages.
Right, and other toolkits like boto. Each Region has a different endpoint (URL), and as I have no resources running in East, I saw no problems with the API endpoints I use. So, as you note, the US-EAST Region was "not controllable".
I know nothing of the netflix side of it - but that's what -we- saw. (and that caused all us-east RDS instances in every AZ to appear offline..)
And, if you lose US-EAST, you need to run *somewhere*. Netflix did not cut www.netflix.com over to another Region. Why not is another question.
At what point are you guys going to realize that no matter how much resiliency, redundancy and fault tolerance you plan into an infrastructure, there is always the unforeseen that just doesn't make any sense to plan for? Four major decision factors are cost, complexity, time and failure rate.

At some point a business needs to focus on its core business. IT, like any other business resource, has to be managed efficiently, and its sole purpose is the enablement of said business, nothing more. Some of the posts here are highly laughable and so unrealistic.

People are acting as if Netflix is part of some critical service; they stream movies, for Christ's sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base. Just like with cable, electricity and running water, I suffer a few hours of losses each year from those services. It sucks, yes; is it the end of the world? No.

This horse is dead!

On Jul 2, 2012, at 7:19 PM, Rodrick Brown wrote:
People are acting as if Netflix is part of some critical service; they stream movies, for Christ's sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base. Just like with cable, electricity and running water, I suffer a few hours of losses each year from those services. It sucks, yes; is it the end of the world? No.
You missed the point.

-----Original Message----- From: James Downs [mailto:egon@egon.cc]
On Jul 2, 2012, at 7:19 PM, Rodrick Brown wrote:
People are acting as if Netflix is part of some critical service; they stream movies, for Christ's sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base. Just like with cable, electricity and running water, I suffer a few hours of losses each year from those services. It sucks, yes; is it the end of the world? No.
You missed the point.
And very publicly missed the point, too. The Netflix issues led to a large discussion of downtime, testing, and fault tolerance that has been very useful for the community and could lead to some good content for NANOG conferences (/pokes PC).

For Netflix (and all other similar services) downtime is money and money is downtime. There is a quantifiable cost for customer acquisition and a quantifiable churn during each minute of downtime. Mature organizations actually calculate and track this. The trick is to ensure that you have balanced the cost of greater redundancy against the cost of churn/customer acquisition. If you are spending too much on redundancy, it's as big a mistake as spending too little.

Also, I don't think there is an acceptable level of downtime for water. Neither do water utilities.

- Dan
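A back-of-the-envelope version of that balancing act, with every number invented purely for illustration, might look like the sketch below: estimate what a minute of downtime costs in revenue and churn, then compare the expected annual downtime cost against what the extra redundancy would cost.

# Illustrative only - all figures are made up.
revenue_per_minute = 2_000.0            # hypothetical
churn_cost_per_minute = 1_500.0         # hypothetical lifetime value of lost customers
expected_downtime_min_per_year = 300    # hypothetical: ~5 hours/year as designed today

annual_downtime_cost = (revenue_per_minute + churn_cost_per_minute) \
    * expected_downtime_min_per_year

extra_redundancy_cost_per_year = 900_000.0   # hypothetical
downtime_reduction = 0.8                     # hypothetical: redundancy removes 80% of it

savings = annual_downtime_cost * downtime_reduction
print(f"expected downtime cost: ${annual_downtime_cost:,.0f}/yr")
print(f"redundancy would save:  ${savings:,.0f}/yr "
      f"against ${extra_redundancy_cost_per_year:,.0f}/yr in added cost")
print("worth it" if savings > extra_redundancy_cost_per_year else "not worth it (at these numbers)")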

On Jul 3, 2012, at 6:11 AM, Dan Golding wrote:
Also, I don't think there is an acceptable level of downtime for water. Neither do water utilities.
I remember a certain conversation I had with a web-developer. We were talking about "zero downtime releases". He thought it was acceptable if the website went down for 15 minutes, "because people will just come back". Naturally, he was not as forgiving about the idea that his bank might think the same way, or that I might provide DB or server uptimes with that kind of reliability. Downtime will kill some companies, and not others. Twitter certainly survived their fail-whale period. But then, no one pays for twitter. -j

On Jul 3, 2012, at 9:11 AM, "Dan Golding" <dgolding@ragingwire.com> wrote:
-----Original Message----- From: James Downs [mailto:egon@egon.cc]
On Jul 2, 2012, at 7:19 PM, Rodrick Brown wrote:
People are acting as if Netflix is part of some critical service; they stream movies, for Christ's sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base. Just like with cable, electricity and running water, I suffer a few hours of losses each year from those services. It sucks, yes; is it the end of the world? No.
You missed the point.
And very publicly missed the point, too. The Netflix issues led to a large discussion of downtime, testing, and fault tolerance that has been very useful for the community and could lead to some good content for NANOG conferences (/pokes PC). For Netflix (and all other similar services) downtime is money and money is downtime. There is a quantifiable cost for customer acquisition and a quantifiable churn during each minute of downtime. Mature organizations actually calculate and track this. The trick is to ensure that you have balanced the cost of greater redundancy against the cost of churn/customer acquisition. If you are spending too much on redundancy, it's as big a mistake as spending too little.
I totally got the point; the last bit of my post was just tongue in cheek. As I stated in my original response, it's very unrealistic to plan for every possible failure scenario given the constraints most businesses face when implementing BCP today. I doubt Amazon gave much thought to multiple site outages and clients not being able to dynamically redeploy their engines because of inaccessibility from ELB.
Also, I don't think there is an acceptable level of downtime for water. Neither do water utilities.
- Dan

On Tue, 3 Jul 2012, Rodrick Brown wrote:
face when implementing BCP today. I doubt Amazon gave much thought to multiple site outages and clients not being able to dynamically redeploy their engines because of inaccessibility from ELB.
Considering there's a grand total of -one- tool in the entire AWS toolkit that supports working across multiple regions at all sanely (that would be ec2-migrate-bundle, btw), I'd agree. Amazon has put nearly zero thought into multiple site outages or how their customer base could leverage the multiple sites (regions) operated by AWS. -- david raistrick http://www.netmeister.org/news/learn2quote.html drais@icantclick.org

On Jul 2, 2012, at 7:19 PM, Rodrick Brown <rodrick.brown@gmail.com> wrote:
People are acting as if Netflix is part of some critical service; they stream movies, for Christ's sake. Some acceptable level of loss is fine for 99.99% of Netflix's user base. Just like with cable, electricity and running water, I suffer a few hours of losses each year from those services. It sucks, yes; is it the end of the world? No.
Actually calculating - understanding - the cost of downtime, and what variations on that exist over time, is key to reliability engineering. But if you plan to cover X failure scenarios and only cover X/2 failure scenarios due to implementation glitches, you goofed. The right answer may be "relax and accept the downtime" and it may be "spend $10 million to avoid most of these". If you haven't thought it through and quantified it, do so... George William Herbert Sent from my iPhone

On Mon, 2 Jul 2012, david raistrick wrote:
On Mon, 2 Jul 2012, James Downs wrote:
back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic and no back-plane meant they were unable to reconfigure it to route around the problem.
Someone needs to define back-plane/control-plane in this case. (and what wasn't working)
Amazon resources are controlled (from a consumer viewpoint) by API - that API is also used by amazon's internal toolkits that support ELB (and RDS..). Those (http accessed) API interfaces were unavailable for a good portion of the outages.
It seems like if you're going to outsource your mission-critical infrastructure to the "cloud" you should probably pick at least 2 unrelated cloud providers and, if at all possible, not outsource the systems that balance/direct traffic... and if you're really serious about it, have at least two of these set up at different facilities such that if the primary goes offline, the secondary takes over. If a cloud provider fails, you redirect to another. ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route Senior Network Engineer | therefore you are Atlantic Net | _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________

I believe in my dictionary Chaos Gorilla translates into "Time To Go Home", with a rough definition of "Everything just crapped out - the world is ending"; but then again I may have that incorrect :-) -- Thank you, Robert Miller http://www.armoredpackets.com Twitter: @arch3angel On 7/2/12 2:59 PM, Paul Graydon wrote:
On 2 July 2012 19:20, Cameron Byrne <cb.list6@gmail.com> wrote:
Make your chaos animal go after sites and regions instead of individual VMs.
CB
From a previous post mortem http://techblog.netflix.com/2011_04_01_archive.html
" Create More Failures Currently, Netflix uses a service called "Chaos Monkey<http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html>"
to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service "Chaos Gorilla". *"*
It would seem the Gorilla hasn't quite matured.
Tony

From conversations with Adrian Cockcroft this weekend it wasn't the result of Chaos Gorilla or Chaos Monkey failing to prepare them adequately. All their automated stuff worked perfectly; the infrastructure tried to self-heal. The problem was that, yet again, Amazon's back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic, and no back-plane meant they were unable to reconfigure it to route around the problem.

Paul

Good band name.
Chaos Gorilla
-- --------------------------------------------------------------- Joly MacFie 218 565 9365 Skype:punkcast WWWhatsup NYC - http://wwwhatsup.com http://pinstand.com - http://punkcast.com VP (Admin) - ISOC-NY - http://isoc-ny.org -------------------------------------------------------------- -

On Jul 2, 2012, at 9:23 AM, david raistrick wrote:
When the hardware is outsourced how would you propose testing the non-software components? They do simulate availability zone issues (and AZ is as close as you get to controlling which internal power/network/etc grid you're attached to).
We all know what netflix *says* they do, but they *did* have an outage. -j

This is an excellent example of how tests "should" be run; unfortunately, far too many places don't do this... -- Thank you, Robert Miller http://www.armoredpackets.com Twitter: @arch3angel On 7/2/12 12:09 PM, Leo Bicknell wrote:
In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote:
from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility.
I want to emphasize _and test_.
Work on an infrastructure which is redundant and designed to provide "100% uptime" (which is impossible, but that's another story) means that there should be confidence in a failure being automatically worked around, detected, and reported.
I used to work with a guy who had a simple test for these things, and if I was a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker.
Then he would wait, to see how long before a technician came to fix it.
If these activities were service impacting to customers, the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults, the team was graded on how quickly they detected and corrected the outage.
I've seen too many companies whose "test" is planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers.
TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant.

The problem is large scale tests take a lot of time and planning. For it to be done right, you really need a dedicated DR team. -Grant On Mon, Jul 2, 2012 at 11:31 AM, AP NANOG <nanog@armoredpackets.com> wrote:
This is an excellent example of how tests "should" be run; unfortunately, far too many places don't do this...
--
Thank you,
Robert Miller http://www.armoredpackets.com
Twitter: @arch3angel
On 7/2/12 12:09 PM, Leo Bicknell wrote:
In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote:
from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility.
I want to emphasize _and test_.
Work on an infrastructure which is redundant and designed to provide "100% uptime" (which is impossible, but that's another story) means that there should be confidence in a failure being automatically worked around, detected, and reported.
I used to work with a guy who had a simple test for these things, and if I was a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker.
Then he would wait, to see how long before a technician came to fix it.
If these activities were service impacting to customers, the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults, the team was graded on how quickly they detected and corrected the outage.
I've seen too many companies whose "test" is planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers.
TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant.

-----Original Message----- From: Leo Bicknell [mailto:bicknell@ufp.org]
I want to emphasize _and test_.
[snip]
I used to work with a guy who had a simple test for these things, and if I was a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker.
*DING DING* - we have a winner!

In a previous life, I used to spend a lot of time in other people's data centers. The key question to ask was how often they pulled the plug - i.e. disconnected utility power without having backup generators running. Simulating an actual failure. That goes for pulling out an Ethernet cord or unplugging a server, or flipping a breaker. It's all the same.

The problem is that if you don't do this for a while, you get SCARED of doing it, and you stop doing it. The longer you go without, the scarier it gets, to the point where you will never do it, because you have no idea what will happen, other than that you'll probably get fired. This is called "horrible engineering management", and is very common.

The other problem, of course, is that people design under the assumption that everything will always work, and that failure modes, when they occur, are predictable and fall into a narrow set. Multiple failure modes? Not tested. Failure modes including operator error? Never tested.

When was the last time you had a drill?

- Dan
Then he would wait, to see how long before a technician came to fix it.
If these activities were service impacting to customers, the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults, the team was graded on how quickly they detected and corrected the outage.
I've seen too many companies whose "test" is planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers.
TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant.
-- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/

On Mon, Jul 02, 2012 at 09:09:09AM -0700, Leo Bicknell wrote:
In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd Underwood wrote:
from the perspective of people watching B-rate movies: this was a failure to implement and test a reliable system for streaming those movies in the face of a power outage at one facility.
I want to emphasize _and test_.
Work on an infrastructure which is redundant and designed to provide "100% uptime" (which is impossible, but that's another story) means that there should be confidence in a failure being automatically worked around, detected, and reported.
I used to work with a guy who had a simple test for these things, and if I was a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker.
Sounds like something a VP would do. And, actually, it's an important step: make sure the easy failures are covered. But it's really a very small part of resilience.

What happens when one instance of a shared service starts performing slowly? What happens when one instance of a redundant database starts timing out queries or returning empty result sets? What happens when the Ethernet interface starts dropping 10% of the packets across it? What happens when the Ethernet switch linecard locks up and stops passing dataplane traffic, but link (physical layer) and/or control plane traffic flows just fine? What happens when the server kernel panics due to bad memory, reboots, gets all the way up, runs for 30 seconds, kernel panics, lather, rinse, repeat?

Reliability is hard. And if you stop looking once you get to the point where you can safely toggle the power switch without causing an impact, you're only a very small part of the way there.

-- Brett
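Those messier failure modes - slowness and wrong answers rather than clean crashes - can at least be rehearsed in a test environment. A minimal sketch, with made-up failure rates and a hypothetical client, of wrapping a dependency so some calls are slow, time out, or come back empty:

"""Inject partial failures (slowness, timeouts, empty results) into a client
for testing. The wrapped function and the failure rates are hypothetical."""
import random
import time

def flaky(call, p_slow=0.1, slow_s=5.0, p_error=0.05, p_empty=0.05):
    """Wrap a callable so some invocations are slow, raise, or return nothing."""
    def wrapper(*args, **kwargs):
        r = random.random()
        if r < p_slow:
            time.sleep(slow_s)                      # slow shared dependency
        elif r < p_slow + p_error:
            raise TimeoutError("injected timeout")  # query times out
        elif r < p_slow + p_error + p_empty:
            return []                               # empty result set
        return call(*args, **kwargs)
    return wrapper

def real_query(user_id):
    return [{"user": user_id, "plan": "streaming"}]

query = flaky(real_query, p_slow=0.2, slow_s=0.5, p_error=0.1, p_empty=0.1)
for i in range(5):
    try:
        print(query(i))
    except TimeoutError as err:
        print("caught:", err)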

While I was working for a wireless telecom company, our primary datacenter was knocked off the power grid due to weather. The generators kicked on and everything was fine, till one generator was struck by lightning and that same strike fried the control panel on the second one. With no control panel on the second generator, we had no means of monitoring it for temp, fuel, input voltage (when it came back), output voltage, surge protection, or ultimately whether the generator spiked to full voltage due to a regulator failure. Needless to say, we had to shut the second generator down for safety reasons. While in the military I saw many generators struck by lightning as well.

I'm not saying Amazon was not at fault here, but I can see where this is possible, and it happens more frequently than one might think. I hate to play devil's advocate here, but you as the customer should always have backups to your backups, and practice these fail-overs on a regular basis. Otherwise you are at fault here, no one else...

-- Thank you, Robert Miller http://www.armoredpackets.com Twitter: @arch3angel On 7/2/12 11:01 AM, Dan Golding wrote:
-----Original Message----- From: Todd Underwood [mailto:toddunder@gmail.com]
scott,
This was not a cascading failure. It was a simple power outage
Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the event that happened at the same facility approximately two weeks ago (it's immaterial - the details are probably different, but it illustrates the complexity of a data center failure):
Utility Power Failed
First Backup Generator Failed (shut down due to a faulty fan)
Second Backup Generator Failed (breaker coordination problem resulting in faulty trip of a breaker)
In this case, it was clearly a cascading failure, although limited in scope. The failure in this case also clearly involved people. There was one material failure (the fan), but the system should have been resilient enough to deal with it. The system should also have been resilient enough to deal with the breaker coordination issue (which should not have occurred), but was not. Data centers are not commodities. There is a way to engineer these facilities to be much more resilient. Not everyone's business model supports it.
- Dan
Cascading failures involve interdependencies among components.
Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then "fails" itself as a result (in some form or other, but frequently different to the mode of the original failure).
indeed. and that is an interdependency among components. in particular, it is a capacity interdependency.
Whilst the Amazon outage might have been a "simple" power outage, it's likely that at least some of the website outages caused were a combination of not just the direct Amazon outage, but also the flow-on effect of their redundancy attempting (but failing) to kick in - potentially making the problem worse than just the Amazon outage caused.
i think you over-estimate these websites. most of them simply have no redundancy (and obviously have no tested, effective redundancy) and were simply hoping that amazon didn't really go down that much.
hope is not the best strategy, as it turns out.
i suspect that randy is right though: many of these businesses do not promise perfect uptime and can survive these kinds of failures with little loss to business or reputation. twitter has branded its early failures with a whale that not only didn't hurt it but helped endear the service to millions. when your service fits these criteria, why would you bother doing the complicated systems and application engineering necessary to actually have functional redundancy?
it simply isn't worth it.
t
Scott

On Sat, Jun 30, 2012 at 01:19:54PM -0700, Scott Howard wrote:
On Sat, Jun 30, 2012 at 12:04 PM, Todd Underwood <toddunder@gmail.com>wrote:
This was not a cascading failure. It was a simple power outage
Cascading failures involve interdependencies among components.
Not always. Cascading failures can also occur when there is zero dependency between components. The simplest form of this is where one environment fails over to another, but the target environment is not capable of handling the additional load and then "fails" itself as a result (in some form or other, but frequently different to the mode of the original failure).
That's an interdependency. Environment A is dependent on environment B being up and pulling some of the load away from A; B is dependent on A being up and pulling some of the load away from B. A crashes for reason X -> load shifts to B -> B crashes due to load is a classic cascading failure. And it's not limited to software systems. It's how most major blackouts occur (except with more than three steps in the cascade, of course). -- Brett
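That three-step cascade is easy to see in a toy model. A minimal sketch with made-up capacities: two sites that each run fine on their own, where neither can absorb the other's load when it shifts.

# Toy cascade: A fails, its load shifts to B, and B collapses under it.
# Capacities and loads are invented numbers.
sites = {"A": {"capacity": 100, "load": 70, "up": True},
         "B": {"capacity": 100, "load": 70, "up": True}}

def fail(name):
    """Take a site down and shift its load to whatever is still up."""
    shifted = sites[name]["load"]
    sites[name]["up"] = False
    sites[name]["load"] = 0
    survivors = [s for s in sites.values() if s["up"]]
    for s in survivors:
        s["load"] += shifted / len(survivors)
    # Any survivor pushed past capacity collapses too - the cascade.
    for other, s in sites.items():
        if s["up"] and s["load"] > s["capacity"]:
            print(f"{other} overloaded ({s['load']:.0f} > {s['capacity']}), failing")
            fail(other)

fail("A")   # A loses power -> B takes 140 units of load -> B fails as well
print({k: ("up" if v["up"] else "down") for k, v in sites.items()})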

----- Original Message -----
From: "jamie rishaw" <j@arpa.com>
you know what's happening even more?
..Amazon not learning their lesson.
Please stop these crappy practices, people. Do real-world DR testing. Play "What If This City Dropped Off The Map" games, because tonight, parts of VA in fact did.
You know what I want everyone to do? Go read this. Right now; it's Sunday, and I'll wait: http://interdictor.livejournal.com/2005/08/27/ Start there, and click Next Date a lot, until you get to the end.

Entire metropolitan areas can, and do, fall completely off the map. If your audience is larger than that area, then you need to prepare for it. And being reminded of how big it can get is occasionally necessary. The 4ESS in the third subbasement of 1WTC that was a toll switch for most of the northeast reportedly stayed on the air, talking to its SS7 neighbors, until something like 1500EDT, 11 Sep 2001. It can get *really* big. Are you ready?

Cheers, -- jra -- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com 2000 Land Rover DII St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

On Fri, Jun 29, 2012 at 11:42 PM, Grant Ridder <shortdudey123@gmail.com> wrote:
From Amazon
Amazon Elastic Compute Cloud (N. Virginia) (http://status.aws.amazon.com/) 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region. 8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
Major storm: http://www.washingtonpost.com/blogs/capital-weather-gang/post/severe-thunder... "Storms packing wind gusts of nearly 80 mph have just blown through the D.C.-Baltimore region" https://www.dom.com/storm-center/dominion-electric-outage-summary.jsp Right around 50% of northern Virginia is without power right now. Regards, Bill Herrin Whose generator worked although the first five gas stations I passed had no power. -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004

On 6/29/12 8:22 PM, Joe Blanchard wrote:
Seems that they are unreachable at the moment. Called and theres a recorded message stating they are aware of an issue, no details.
I didn't see anyone post this yet, so here's Amazon's summary of events: http://aws.amazon.com/message/67457/
participants (42)
- Aaron Burt
- Andrew D Kirch
- AP NANOG
- Bjorn Leffler
- Brett Frankenberger
- Bryan Horstmann-Allen
- Cameron Byrne
- Dan Golding
- david raistrick
- Derek Ivey
- George Herbert
- Grant Ridder
- Ian Wilson
- James Downs
- James Laszko
- jamie rishaw
- Jared Mauch
- Jason Baugher
- Jay Ashworth
- Jimmy Hess
- Joe Blanchard
- joel jaeggli
- Joly MacFie
- Jon Lewis
- Justin M. Streiner
- Kyle Creyts
- Leo Bicknell
- Lynda
- Mike Devlin
- Mike Lyon
- Paul Graydon
- Randy Bush
- Rayson Ho
- Rodrick Brown
- Roy
- Scott Howard
- Seth Mattinen
- steve pirk [egrep]
- Todd Underwood
- Tony McCrory
- Tyler Haske
- William Herrin