Re: FYI Netflix is down
Jon Lewis wrote:
It seems like if you're going to outsource your mission-critical infrastructure to "cloud", you should probably pick at least 2 unrelated cloud providers and, if at all possible, not outsource the systems that balance/direct traffic... and if you're really serious about it, have at least two of these set up at different facilities such that if the primary goes offline, the secondary takes over. If a cloud provider fails, you redirect to another.
Really, you need at least three independent providers. One primary (A), one backup (B), and one "witness" to monitor the others for failure. The witness site can of course be low-powered, as it is not in the data plane of the applications, but just participates in the control plane. In the event of a loss of communication, the majority clique wins, and the isolated environments shut themselves down. This is of course how any sane clustering setup has protected against "split brain" scenarios for decades.

Doing it the right way makes the cloud far less cost-effective and far less "agile". Once you get it all set up just so, change becomes very difficult. All the monitoring and fail-over/fail-back operations are generally application-specific and provider-specific, so there's a lot of lock-in. Tools like RightScale are a step in the right direction, but don't really touch the application layer. You also have to worry about the availability of yet another provider!

-- RPM
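P.S. The witness logic itself is nothing fancy; in rough Python it is just a majority count. This is only a sketch, and the site names, hostnames, and heartbeat port below are all made up:

import socket

# Hypothetical sketch of the witness/quorum idea above: each site counts how
# many peers it can reach (itself included); only the majority partition keeps
# serving, and a minority partition shuts itself down to avoid split brain.
SITES = {
    "provider-a": "a.example.net",
    "provider-b": "b.example.net",
    "witness": "w.example.net",
}
HEARTBEAT_PORT = 7000

def reachable(host):
    """Crude health probe: can we open a TCP connection to the heartbeat port?"""
    try:
        with socket.create_connection((host, HEARTBEAT_PORT), timeout=2):
            return True
    except OSError:
        return False

def should_keep_serving(my_site):
    visible = [s for s, host in SITES.items() if s == my_site or reachable(host)]
    # Majority clique wins: with three sites, at least two must see each other.
    return len(visible) > len(SITES) // 2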
On Tue, Jul 3, 2012 at 1:00 PM, Ryan Malayter <malayter@gmail.com> wrote:
Doing it the right way makes the cloud far less cost-effective and far less "agile". Once you get it all set up just so, change becomes very difficult. All the monitoring and fail-over/fail-back operations are generally application-specific and provider-specific, so there's a lot of lock-in. Tools like RightScale are a step in the right direction, but don't really touch the application layer. You also have to worry about the availability of yet another provider!
I am pretty sure Netflix and others were "trying to do it right", as they all had graceful fail-over to a secondary AWS zone defined. It looks to me like Amazon uses DNS round-robin to load balance the zones, because they mention returning a "list" of addresses for DNS queries, and their postmortem explains why the services failed to shunt over to other zones.
Elastic Load Balancers (ELBs) allow web traffic directed at a single IP address to be spread across many EC2 instances. They are a tool for high availability as traffic to a single end-point can be handled by many redundant servers. ELBs live in individual Availability Zones and front EC2 instances in those same zones or in other Availability Zones.
ELBs can also be deployed in multiple Availability Zones. In this configuration, each Availability Zone’s end-point will have a separate IP address. A single Domain Name will point to all of the end-points’ IP addresses. When a client, such as a web browser, queries DNS with a Domain Name, it receives the IP address (“A”) records of all of the ELBs in random order. While some clients only process a single IP address, many (such as newer versions of web-browsers) will retry the subsequent IP addresses if they fail to connect to the first. A large number of non-browser clients only operate with a single IP address.

During the disruption this past Friday night, the control plane (which encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an ELB, and remove traffic from ELBs) began performing traffic shifts to account for the loss of load balancers in the affected Availability Zone. As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones. These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.
http://aws.amazon.com/message/67457/
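In other words, the fail-over depends on every client walking that list of A records. A rough sketch of the client-side behavior the postmortem describes (the hostname here is hypothetical; newer browsers do roughly this internally):

import socket

def connect_any(hostname, port=443, timeout=5):
    """Try each A record for the name in turn, the way a multi-address-aware
    browser does: if the first load balancer IP does not answer, fall back to
    the next one instead of giving up."""
    last_err = None
    for family, socktype, proto, _name, addr in socket.getaddrinfo(
            hostname, port, 0, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err if last_err else OSError("no addresses for " + hostname)

# sock = connect_any("myservice.example.com")  # hypothetical multi-AZ ELB name

A client that stops after the first address never reaches the surviving zones, which matches what Amazon describes.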
*In reality, though, Amazon data centers have outages all the time. In fact, Amazon tells its customers to plan for this to happen, and to be ready to roll over to a new data center whenever there’s an outage.*
*That’s what was supposed to happen at Netflix Friday night. But it didn’t work out that way. According to Twitter messages from Netflix Director of Cloud Architecture Adrian Cockcroft and Instagram Engineer Rick Branson, it looks like an Amazon Elastic Load Balancing service, designed to spread Netflix’s processing loads across data centers, failed during the outage. Without that ELB service working properly, the Netflix and Pinterest services hosted by Amazon crashed.*
http://www.wired.com/wiredenterprise/2012/06/real-clouds-crush-amazon/

I am a big believer in using hardware to load balance data centers, and not leaving it up to software in the data center, which might fail.

Speaking of services like RightScale, Google announced Compute Engine at Google I/O this year. BuildFax was an early adopter, and they gave it great reviews... http://www.youtube.com/watch?v=LCjSJ778tGU

It looks like Google has entered the VPS market. 'bout time... ;-]
http://cloud.google.com/products/compute-engine.html

--steve pirk
On Jul 8, 2012, at 7:27 PM, "steve pirk [egrep]" <steve@pirk.com> wrote:
I am pretty sure Netflix and others were "trying to do it right", as they all had graceful fail-over to a secondary AWS zone defined.
Having a single company as an infrastructure supplier is not "trying to do it right" from an engineering OR business perspective. It's lazy, no matter how many "availability zones" the vendor claims to have.
On Sun, Jul 8, 2012 at 8:27 PM, steve pirk [egrep] <steve@pirk.com> wrote:
I am pretty sure Netflix and others were "trying to do it right", as they all had graceful fail-over to a secondary AWS zone defined. It looks to me like Amazon uses DNS round-robin to load balance the zones, because they mention returning a "list" of addresses for DNS queries, and their postmortem explains why the services failed to shunt over to other zones.
There are also bugs from the Netflix side uncovered by the AWS outage:

"Lessons Netflix Learned from the AWS Storm"
http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.h...

For an infrastructure this large, no matter whether you are running your own datacenter or using the cloud, it is certain that the code is not bug free. And another thing: if everything is too automated, then a failure in one component can trigger bugs in areas that no one has ever thought of...

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
On Mon, Jul 9, 2012 at 15:50 UTC, Rayson Ho wrote:
There are also bugs from the Netflix side uncovered by the AWS outage:
"Lessons Netflix Learned from the AWS Storm"
http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.h...
"We continue to investigate why these connections were timing out during connect, rather than quickly determining that there was no route to the unavailable hosts and failing quickly." potential translation: "We continue to shoot ourselves in the foot by filtering all ICMP without understanding the implications." Cheers, Dave Hart
On Mon, Jul 9, 2012 at 10:20 AM, Dave Hart <davehart@gmail.com> wrote:
"We continue to investigate why these connections were timing out during connect, rather than quickly determining that there was no route to the unavailable hosts and failing quickly."
potential translation:
"We continue to shoot ourselves in the foot by filtering all ICMP without understanding the implications."
Sorry to mention my favorite hardware vendor again, but that is what I liked about using F5 BigIP as load balancing devices... They did layer 7 URL checking to see if the service was really responding (instead of just pinging or opening a connection to the IP). We performed tests that would do a complete LDAP SSL query to verify a directory server could actually look up a person. If it failed to answer within a certain time frame, it was taken out of rotation. I do not know if that was ever implemented in production, but we did verify it worked.

On the "software in the hardware can fail" point, my only defense is that you do redundant testing of the watcher devices and have enough of them to vote misbehaving ones out of service. Oh, and it is best if the global load balancing hardware/software is located somewhere else besides the data centers being monitored.

-- steve pirk
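P.S. In miniature, the idea is just "ask the application a real question and time-box the answer." Here is a toy version for an HTTP backend; the hostnames and URLs are invented, and on a real BigIP this lives in the monitor configuration rather than a script (and in our case the real question was an LDAP search, not HTTP):

import urllib.request

# Toy layer-7 health check: instead of pinging the box or merely opening a TCP
# connection, make the application do real work and drop the member from
# rotation if it cannot answer in time. Pool members and URLs are hypothetical.
POOL = {
    "app1.example.net": "https://app1.example.net/health/lookup?user=testuser",
    "app2.example.net": "https://app2.example.net/health/lookup?user=testuser",
}

def healthy(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

in_rotation = [member for member, url in POOL.items() if healthy(url)]
print("serving from:", in_rotation)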
participants (4)
- Dave Hart
- Rayson Ho
- Ryan Malayter
- steve pirk [egrep]