
I believe in my dictionary Chaos Gorilla translates into "Time To Go Home", with a rough definition of "Everything just crapped out - The world is ending"; but then again I may have that incorrect :-)

--
Thank you,
Robert Miller
http://www.armoredpackets.com
Twitter: @arch3angel

On 7/2/12 2:59 PM, Paul Graydon wrote:
On 07/02/2012 08:53 AM, Tony McCrory wrote:
On 2 July 2012 19:20, Cameron Byrne <cb.list6@gmail.com> wrote:
Make your chaos animal go after sites and regions instead of individual VMs.
CB

From a previous post mortem http://techblog.netflix.com/2011_04_01_archive.html

"Create More Failures

Currently, Netflix uses a service called "Chaos Monkey" <http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html> to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service "Chaos Gorilla"."

It would seem the Gorilla hasn't quite matured.
Tony

From conversations with Adrian Cockcroft this weekend, it wasn't the result of Chaos Gorilla or Chaos Monkey failing to prepare them adequately. All their automated stuff worked perfectly; the infrastructure tried to self-heal. The problem was that yet again Amazon's back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic, and with no back-plane they were unable to reconfigure it to route around the problem.
Paul