
https://twitter.com/CloudFlare/status/308157556113698816 https://twitter.com/CloudFlare/status/308165820285083648 Apparently due to a routing issue... -AW

----- Original Message -----
From: "Arthur Wist" <arthur.wist@gmail.com> To: nanog@nanog.org Sent: Sunday, March 3, 2013 5:46:15 AM
https://twitter.com/CloudFlare/status/308157556113698816 https://twitter.com/CloudFlare/status/308165820285083648
Apparently due to a routing issue...
Unless you're in UTC-12 or your clock is wrong, you posted that 7 hours ago and it just hit the list. (Whoever wrote that Sent header, BTW, presumably your MUA, is 5322 non-compliant: no TZ.)

Anyroad, their Twitter account now seems to say they're back up.

Cheers,
-- jra

--
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com         2000 Land Rover DII
St Petersburg FL USA               #natog                      +1 727 647 1274
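For the curious, the compliance point is easy to see with nothing but the Python standard library; the header strings below are illustrative stand-ins for the quoted "Sent" line, not the actual message headers:

# A small illustration of the RFC 5322 aside above: a compliant Date/Sent
# header ends in a numeric zone offset, which is what lets a reader resolve
# "7 hours ago" vs. "just now".  The strings here are hypothetical examples.
from email.utils import formatdate, parsedate_to_datetime

# What a compliant header looks like (current time, local zone, with offset):
print(formatdate(localtime=True))        # e.g. "Sun, 03 Mar 2013 05:46:15 -0500"

# With the offset present, the parsed datetime is timezone-aware:
dt = parsedate_to_datetime("Sun, 3 Mar 2013 05:46:15 -0500")
print(dt.isoformat())                    # 2013-03-03T05:46:15-05:00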

On 03/03/2013 10:46, Arthur Wist wrote:
Apparently due to a routing issue...
back up again: http://blog.cloudflare.com/todays-outage-post-mortem-82515

tl;dr: outage caused by flowspec filter tickling vendor bug.

Nick

I am not sure it's a bug... it could be normal behavior for how the JunOS CLI handles "extended" packet sizes. Will wait for Juniper's comment on the incident.

Vinod
Date: Sun, 3 Mar 2013 20:02:05 +0000
From: nick@foobar.org
To: arthur.wist@gmail.com
Subject: Re: Cloudflare is down
CC: nanog@nanog.org
On 03/03/2013 10:46, Arthur Wist wrote:
Apparently due to a routing issue...
back up again: http://blog.cloudflare.com/todays-outage-post-mortem-82515
tl;dr: outage caused by flowspec filter tickling vendor bug.
Nick

On 3 March 2013 12:02, Nick Hilliard <nick@foobar.org> wrote:
On 03/03/2013 10:46, Arthur Wist wrote:
Apparently due to a routing issue...
back up again: http://blog.cloudflare.com/todays-outage-post-mortem-82515
tl;dr: outage caused by flowspec filter tickling vendor bug.
Definitely smart to be delegating your DNS to the web-accelerator company and a single point of failure, especially if you are not just running a web-site, but have some other independent infrastructure, too.
CloudFlare's 23 data centers span 14 countries so the response took some time but within about 30 minutes we began to restore CloudFlare's network and services. By 10:49 UTC, all of CloudFlare's services were restored. We continue to investigate some edge cases where people are seeing outages. In nearly all of these cases, the problem is that a bad DNS response has been cached. Typically clearing the DNS cache will resolve the issue.
Yet, apparently, CloudFlare doesn't even support using any of their services with your own DNS solutions. And how exactly do they expect end-users to go about "clearing the DNS cache"? Do I call AT&T and ask them to clear their cache?

http://serverfault.com/questions/479367/how-long-a-dns-timeout-is-cached-for

C.
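For what it's worth, how long a bad answer lingers is governed by the record's TTL, not by anything most end users can flush on demand. A minimal sketch of checking that, using the third-party dnspython package and a placeholder name:

# Sketch of the caching point above: a resolver keeps whatever answer it
# received (good or bad) until the record's TTL runs out, so "clear your
# DNS cache" is not actionable for most end users.
# Requires dnspython; the domain name is illustrative only.
import dns.resolver

answer = dns.resolver.resolve("www.example.com", "A")
for rr in answer:
    print(rr.address)

# The TTL on the answer is the upper bound on how long an intermediate
# resolver (your ISP's, your OS's, your browser's) may keep serving this
# response without asking again.
print("cached for up to", answer.rrset.ttl, "seconds")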

On (2013-03-03 12:46 -0800), Constantine A. Murenin wrote:
Definitely smart to be delegating your DNS to the web-accelerator company and a single point of failure, especially if you are not just running a web-site, but have some other independent infrastructure, too.
To be fair, most of us probably have a harmonized peering edge, running one vendor with one or two software releases, and as such are susceptible to a BGP update taking down the whole edge. I'm not personally comfortable pointing at CloudFlare and saying this was easily avoidable and should not have happened (not implying you are either). If fuzzing BGP were easy, vendors would provide us with working software and we wouldn't lose a good portion of the Internet every few years due to a mangled UPDATE. I know a lot of vendors are fuzzing with 'codenomicon' and they appear not to have a flowspec fuzzer.

A lot of things had to go wrong for this to cause an outage:

1. their traffic analyzer had to have a bug which could claim a packet size of 90k
2. their NOC people had to accept it as legit data
   (2.5: their internal software where the filter is updated had to accept this data; unsure if it was an internal system or JunOS directly)
3. the JunOS CLI had to accept this data
4. flowspec had to accept it and generate an NLRI carrying it
5. the NLRI -> ACL abstraction engine had to accept it and try to program it into hardware

Even if CloudFlare had been running out-sourced anycast DNS with a many-vendor edge, the records would still have been pointing at a network which you couldn't reach. Probably the only thing you could have done to plan against this would have been to have a solid dual-vendor strategy, presuming that sooner or later a software defect will take one vendor completely out. And maybe they did plan for it, but decided dual-vendor costs more than the rare outages.

--
++ytti
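Steps 1 and 2 of that chain are exactly where a plausibility check has an easy time: the IPv4 total-length field is 16 bits, so nothing on the wire can exceed 65,535 bytes. A minimal sketch of such a guard between an analyzer and a flowspec rule generator (the function and thresholds are hypothetical, not CloudFlare's or Juniper's code):

# Hypothetical guard between a traffic analyzer and a flowspec rule
# generator.  The only hard fact used here is that IPv4's total-length
# field is 16 bits, so a claimed ~90,000-byte packet cannot be a real
# on-the-wire length and should never reach the routers.
MAX_IP_PACKET = 65_535     # 16-bit total-length field
JUMBO_MTU = 9_216          # illustrative jumbo-frame ceiling (assumption)

def plausible_packet_length(low: int, high: int, mtu: int = JUMBO_MTU) -> bool:
    """Return True only for length ranges that could describe real traffic."""
    if not (0 < low <= high):
        return False
    if high > MAX_IP_PACKET:   # impossible on the wire, whatever the tooling says
        return False
    if high > mtu:             # technically possible, but deserves a human look
        return False
    return True

# The ~90k figure discussed in this thread would be rejected before it ever
# became a flowspec rule:
assert not plausible_packet_length(90_000, 90_000)
print("90k packet-length match rejected")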

In a message written on Mon, Mar 04, 2013 at 09:31:13AM +0200, Saku Ytti wrote:
Probably the only thing you could have done to plan against this would have been to have a solid dual-vendor strategy, presuming that sooner or later a software defect will take one vendor completely out. And maybe they did plan for it, but decided dual-vendor costs more than the rare outages.
From what I have heard so far there is something else they could have done: hire higher quality people.

Any competent network admin would have stopped and questioned a 90,000+ byte packet and done more investigation. Competent programmers writing their internal tools would have flagged that data as out of rage.

I can't tell you how many times I've sat in a post-mortem meeting about some issue and the answer from senior management is "why don't you just provide a script to our NOC guys, so the next time they can run it and make it all better." Of course it's easy to say that, the smart people have diagnosed the problem!

You can buy these "scripts" for almost any profession. There are manuals on how to fix everything on a car, and treatment plans for almost every disease. Yet most people intuitively understand you take your car to a mechanic and your body to a doctor for the proper diagnosis. The primary thing you're paying for is expertise in what to fix, not how to fix it. That takes experience and training.

But somehow it doesn't sink in with networking. I would not at all be surprised to hear that someone over at Cloudflare right now is saying "let's make a script to check the packet size" as if that will fix the problem. It won't. Next time the issue will be different, and the same undertrained person who missed the packet size this time will miss the next issue as well. They should all be sitting around saying, "how can we hire competent network admins for our NOC", but that would cost real money.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/

On (2013-03-04 06:51 -0800), Leo Bicknell wrote:
From what I have heard so far there is something else they could have done: hire higher quality people.
Your solution to mistakes seems to be not to make them. I can understand the train of thought, but I question the practicality of such advice.

--
++ytti

On Mar 04, 2013, at 09:51 , Leo Bicknell <bicknell@ufp.org> wrote:
Any competent network admin would have stopped and questioned a 90,000+ byte packet and done more investigation. Competent programmers writing their internal tools would have flagged that data as out of rage.
The last couple words are the best thing I've read on NANOG in a very long time. :) -- TTFN, patrick

+1.

On Mon, Mar 4, 2013 at 9:51 AM, Leo Bicknell <bicknell@ufp.org> wrote:
will fix the problem. It won't. Next time the issue will be different, and the same undertrained person who missed the packet size this time will miss the next issue as well. They should all be sitting around saying, "how can we hire competent network admins for our NOC", but that would cost real money.
I think that is hard because virtually all training / education in our industry is based on procedures, not on concepts. Pick up any book about networking and you'll find examples of how to configure a lab of Cisco 2900s so you can pass an exam. Very few go into conceptual detail or troubleshooting of any kind. Educational programs suffer from the same flaw.

There are exceptions to this rule, but they are very few. I'm sure many NANOG readers are familiar with "Interdomain Multicast Routing," for example. It is an excellent book because it covers concepts and compares two popular vendor platforms on a variety of multicast topics.

We have lots of stupid people in our industry because so few understand "The Way Things Work."

--
Jeff S Wheeler <jsw@inconcepts.biz>
Sr Network Operator / Innovative Network Concepts

On (2013-03-04 13:23 -0500), Jeff Wheeler wrote:
> We have lots of stupid people in our industry because so few
> understand "The Way Things Work."

We have a tendency to view the mistakes we make as unavoidable human errors and the mistakes other people make as avoidable stupidity. We should actively plan for mistakes/errors; if you actively plan for no 'stupid mistakes', you're gonna have a bad time.

From my point of view, outages are caused by:
1) operator
2) software defect
3) hardware defect

Most people design only against 3), often with a design which actually increases the likelihood of 2) and 1), reducing the overall MTBF of a design which, strictly in theory, increases it.

--
++ytti

On Mon, 04 Mar 2013 20:40:58 +0200, Saku Ytti said:
Most people design only against 3), often with a design which actually increases the likelihood of 2) and 1), reducing the overall MTBF of a design which, strictly in theory, increases it.
I have to admit I've always suspected that MTBWTF would be a more useful metric of real-world performance.

On Mon, Mar 4, 2013 at 10:40 AM, Saku Ytti <saku@ytti.fi> wrote:
On (2013-03-04 13:23 -0500), Jeff Wheeler wrote:
We have lots of stupid people in our industry because so few understand "The Way Things Work."
We have a tendency to view the mistakes we make as unavoidable human errors and the mistakes other people make as avoidable stupidity.
We should actively plan for mistakes/errors; if you actively plan for no 'stupid mistakes', you're gonna have a bad time.
From my point of view, outages are caused by: 1) operator 2) software defect 3) hardware defect
Most people design only against 3), often with a design which actually increases the likelihood of 2) and 1), reducing the overall MTBF of a design which, strictly in theory, increases it.
...And a lot of people who know the hierarchy solve 3, and then solve 2 in a way (multiple parallel environments with different vendors' equipment) only to find that 1 increased due to the additional complexity. On the other hand, I've seen people who had horrible explosions of 2 or 3 due to ignoring all but 1.

If you ACTUALLY need that many 9s, you need all of: redundancy, diversity of vendors, and suitably trained, exercised, process-supported net admins. That's a few multiples of 2 more expense than nearly anyone typically wants to pay for.

--
-george william herbert
george.herbert@gmail.com

From my point of view, outages are caused by: 1) operator 2) software defect 3) hardware defect
From my experience, nowadays the likelihood of an outage as a result of 3) is an order of magnitude less than 2), and the same goes for the 2) to 1) ratio. In other words, the vast majority of outages are caused by human error. One way to partially rule out 1) is to have a fully customized, stupid-proof provisioning system - customized by those who know how stuff works.
adam

On 2013-03-04 08:09, Christopher Morrow wrote:
On Mon, Mar 4, 2013 at 2:31 AM, Saku Ytti <saku@ytti.fi> wrote:
I know a lot of vendors are fuzzing with 'codenomicon' and they appear not to have a flowspec fuzzer.
i suspect they fuzz where the money is ...
number of users of bgp? number of users of flowspec?
While fuzzing of BGP[*] on the wire may have identified some of this, there were many components involved (e.g., the DDoS attack on a customer's DNS servers that tickled their "attack profiler", their attack profiler was presumably confused about the suspect packet sizes as indicated in the presented "output signature", their operator didn't identify the issue before disseminating the recommended "signatures", JUNOS didn't barf when compiling the configuration (that'd be a big packet), a memory leak / thrashing triggered by the ingested flowspec UPDATE crashed receiving routers, routers apparently recovered non-deterministically, etc.).

Leo's comments remind me of the findings of the President's Commission to Investigate the Accident at Three Mile Island (TMI), where pretty much everyone was blamed, but the operators were identified as ultimately culpable (in this case, presumably, they also wrote the "attack profiler", although "they" may not have been precisely who deployed the policy).

For an interesting perspective on "normal accidents" derived from interactive complexity see [NormalAccidents]; it's quite applicable to today's network systems, methinks.

-danny

[NormalAccidents] Perrow, Charles, "Normal Accidents: Living with High-Risk Technologies", 1999.
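As a toy illustration of what a flowspec-aware fuzz pass might even look like (the validate_packet_length_rule target below is a hypothetical stand-in, not any vendor's code; a real campaign would drive an actual BGP speaker), a harness only needs to feed implausible values and check that they are refused:

# Toy fuzz harness for a hypothetical packet-length ingest routine, i.e.
# whatever accepts a length match before it becomes a flowspec NLRI.
import random

MAX_IP_PACKET = 65_535   # IPv4 total-length is a 16-bit field

def validate_packet_length_rule(low: int, high: int) -> None:
    """Hypothetical routine under test: should reject impossible sizes."""
    if not (0 <= low <= high <= MAX_IP_PACKET):
        raise ValueError("implausible packet-length match")
    # ...a real implementation would go on to build the NLRI / ACL here...

def fuzz(iterations: int = 10_000) -> None:
    random.seed(0)
    for _ in range(iterations):
        # Mix plausible values with wildly out-of-range ones (the ~90k case).
        low = random.choice([random.randint(0, 1_500), random.randint(60_000, 120_000)])
        high = low + random.randint(0, 1_000)
        try:
            validate_packet_length_rule(low, high)
        except ValueError:
            continue                      # rejection is the desired outcome
        assert high <= MAX_IP_PACKET, f"accepted impossible range {low}-{high}"

fuzz()
print("toy target rejects the impossible ranges it was fed")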

On 3 March 2013 23:31, Saku Ytti <saku@ytti.fi> wrote:
On (2013-03-03 12:46 -0800), Constantine A. Murenin wrote:
Definitely smart to be delegating your DNS to the web-accelerator company and a single point of failure, especially if you are not just running a web-site, but have some other independent infrastructure, too.
To be fair, most of us probably have a harmonized peering edge, running one vendor with one or two software releases, and as such are susceptible to a BGP update taking down the whole edge.
I'm not personally comfortable pointing at CloudFlare and saying this was easily avoidable and should not have happened (not implying you are either).
The issue I have is not with their network. The issue is that they require ALL of their customers to hand over DNS control, and completely disregard any kind of situation such as what has just happened.

* They don't provide any IP addresses which you can set your A or AAAA records to.
* They don't provide any hostnames which you can set a CNAME to. (Supposedly, they do offer CNAME support to paid customers, but if you look at their help page for CNAME support, it's clearly evident that it's highly discouraged and effectively an unsupported option.)
* They don't let you AXFR and mirror the zones, either.

So, the issue here is that a second point of failure is suddenly introduced into your own harmonised network, and introduced in such a way as to suggest that it's not a big deal and will make everything better anyways.

In actuality, this doesn't even stop their users from going the unsupported route: I've seen some relatively major and popular hosting provider turn over their web-site to CloudFlare when it was under attack, but they did it with an A record, potentially to not suffer the complete embarrassment of having `whois` show that they don't even use the nameservers that they provide to their own users.

[...]
Even if cloudflare had been running out-sourced anycast DNS with many vendor edge, the records had still been pointing out to a network which you couldn't reach.
This is where you have it wrong. DNS is not only useful for http. Yet CloudFlare only provides http-acceleration, and yet they do require that you delegate your domains to the nameservers on their own single-vendor network, with no option to opt out.

I don't think they should necessarily be running an out-sourced DNS, but I do think that they should not make it a major problem for users to use http-acceleration services without DNS tie-ins. Last I checked, CloudFlare didn't even let you set up just a subdomain for their service, i.e. they do require complete DNS control from the registrar-zone level, all the time, every single time.

C.
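The distinction being drawn here, handing over the whole zone via NS delegation versus pointing a single label at a provider, is observable from the outside. A short sketch using the third-party dnspython library, with a placeholder domain rather than any site mentioned in the thread:

# Sketch of the delegation point above: whether a site handed its whole
# zone to a provider (NS records at the apex) or merely pointed one label
# at it (CNAME/A on www) is externally visible.  Requires dnspython;
# "example.com" is a placeholder.
import dns.resolver

def describe_delegation(domain: str) -> None:
    ns = dns.resolver.resolve(domain, "NS")
    print(f"{domain} apex NS:", sorted(str(r.target) for r in ns))

    try:
        cname = dns.resolver.resolve(f"www.{domain}", "CNAME")
        print(f"www.{domain} CNAME:", [str(r.target) for r in cname])
    except dns.resolver.NoAnswer:
        print(f"www.{domain}: no CNAME (plain A/AAAA records)")

describe_delegation("example.com")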

On (2013-03-04 12:33 -0800), Constantine A. Murenin wrote:
to use http-acceleration services without DNS tie-ins. Last I checked, CloudFlare didn't even let you set up just a subdomain for their service, i.e. they do require complete DNS control from the registrar-zone level, all the time, every single time.
I'm not going to justify this behaviour. It would not occur to me to give zone control out to a 3rd party unless all the zones point to records hosted by the same 3rd party.

--
++ytti

Is there any blog or some sort of site that has an up-to-date list of the latest network outages? Like, not just Cloudflare, but every major outage that has happened lately. It's really nice to see a post-mortem analysis like in this case. Bugs/hidden "features" are not "documented" in most of the books I've read, so the only way to run into them is to either step on them yourself or learn from other networks going splat before yours does.

I've tried to search a bit more and I've either found "electrical network outage" or reports from individual businesses as to why their service went down, but I wasn't able to find the "big list". I'd appreciate it a lot if anyone could point me in the right direction.

Thanks,
Alex.

On 03/03/2013 10:02 PM, Nick Hilliard wrote:
On 03/03/2013 10:46, Arthur Wist wrote:
Apparently due to a routing issue...
back up again: http://blog.cloudflare.com/todays-outage-post-mortem-82515
tl;dr: outage caused by flowspec filter tickling vendor bug.
Nick

I'd start here: http://www.outages.org/ and the listserv is here:
http://wiki.outages.org/index.php/Main_Page#Outages_Mailing_Lists

Set aside the idea of "every" outage, and "major" is relative to the eye of the beholder. =)

Frank

participants (18)
- Adam Vitkovsky
- Alex
- Arthur Wist
- Christopher Morrow
- Constantine A. Murenin
- danny@tcb.net
- Florian Weimer
- Frank Bulk
- George Herbert
- Jay Ashworth
- Jeff Wheeler
- Leo Bicknell
- Nick Hilliard
- Patrick W. Gilmore
- Saku Ytti
- Valdis.Kletnieks@vt.edu
- Vinod K
- Warren Bailey