Yahoo outage summary

Marcus H. Sachs

8 Jul 2007

7:29 p.m.

I put up a diary at the Storm Center (http://isc.sans.org/diary.html?storyid=3112) that summarizes what we know about the Yahoo outage on Friday. If anybody has any additional info they want to share or comments about the write-up please let me know.

Marc

--
Marc Sachs
SANS ISC
marc@sans.org

Steven M. Bellovin

8 Jul

9:20 p.m.

On Sun, 8 Jul 2007 15:29:10 -0400 "Marcus H. Sachs" <marc@sachsfamily.net> wrote:

...

I put up a diary at the Storm Center (http://isc.sans.org/diary.html?storyid=3112) that summarizes what we know about the Yahoo outage on Friday. If anybody has any additional info they want to share or comments about the write-up please let me know.

In other words, it was yet another BGP screw-up that secured routing could have prevented. Any clue about the root cause, i.e., malice or accident? --Steve Bellovin, http://www.cs.columbia.edu/~smb

Marcus H. Sachs

9:47 p.m.

Yep, soBGP or S-BGP could have prevented this. But that seems to be a bridge too far right now. I don't know about the cause - malicious or accidental - perhaps somebody from Level3 or Hanaro Telecom can explain the rest of the story.

Marc

-----Original Message-----
From: Steven M. Bellovin [mailto:smb@cs.columbia.edu]
Sent: Sunday, July 08, 2007 5:20 PM
To: Marcus H. Sachs
Cc: 'Nanog'
Subject: Re: Yahoo outage summary

On Sun, 8 Jul 2007 15:29:10 -0400 "Marcus H. Sachs" <marc@sachsfamily.net> wrote:

...

I put up a diary at the Storm Center (http://isc.sans.org/diary.html?storyid=3112) that summarizes what we know about the Yahoo outage on Friday. If anybody has any additional info they want to share or comments about the write-up please let me know.

In other words, it was yet another BGP screw-up that secured routing could have prevented. Any clue about the root cause, i.e., malice or accident? --Steve Bellovin, http://www.cs.columbia.edu/~smb

Ricardo V. Oliveira

9 Jul

12:52 a.m.

...

Yep, soBGP or S-BGP could have prevented this. But that seems to be a bridge too far right now.

This seems to have been a route leak from AS9318, not a false origin announcement, so not sure if soBGP or s-BGP would help here.

--Ricardo

Marcus H. Sachs

1:56 a.m.

If we had routing registries that were accurate and authoritative, then soBGP/S-BGP would have something to verify a route change against. It should not matter if last Friday's event was a leak or a false announcement - with some sort of verification system we could mitigate errors, intentional or accidental.

Marc

-----Original Message-----
From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Ricardo V. Oliveira
Sent: Sunday, July 08, 2007 8:52 PM
To: Nanog
Subject: Re: Yahoo outage summary

...

Yep, soBGP or S-BGP could have prevented this. But that seems to be a bridge too far right now.

This seems to have been a route leak from AS9318, not a false origin announcement, so not sure if soBGP or s-BGP would help here.

--Ricardo

Chris L. Morrow

2:18 a.m.

On Sun, 8 Jul 2007, Marcus H. Sachs wrote:

...

If we had routing registries that were accurate and authoritative, then soBGP/S-BGP would have something to verify a route change against. It should not matter if last Friday's event was a leak or a false announcement - with some sort of verification system we could mitigate errors, intentional or accidental.

either way, in this case (and a number of other public incidents/outages in the last 3 years) simple prefix-list application would have resolved the issue.

Cogent leak via (i think) turk-telecom
NY-Edison leak
9918 leak
this-leak

all would have been prevented with the most simple of steps: "prefix filter your customers".

While S*BGP seem like they may offer additional protections and additional knobs to be used for protecting 'us' from 'them', the very basics are obviously not being done so added complexity is not going to really help :( Or, perhaps its not that its not going to help its just not going to get done because even prefix-lists are 'too hard', apparently.

-Chris
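For readers who want to see what "prefix filter your customers" amounts to in logic terms, here is a minimal Python sketch. It is not any vendor's prefix-list implementation: the function names, the (prefix, max length) layout, and the documentation prefixes used as "customer" routes are all assumptions made for illustration.

```python
import ipaddress

def build_filter(entries):
    """Turn (prefix, max_len) pairs into (network, max_len) tuples."""
    return [(ipaddress.ip_network(p), max_len) for p, max_len in entries]

def accept(announced, prefix_filter):
    """Accept an announcement only if it sits inside a registered customer
    prefix and is no more specific than the allowed maximum length."""
    net = ipaddress.ip_network(announced)
    for allowed, max_len in prefix_filter:
        if (net.version == allowed.version
                and net.subnet_of(allowed)
                and net.prefixlen <= max_len):
            return True
    return False  # default deny: anything not registered is dropped

# Hypothetical customer registrations (documentation address space).
customer_filter = build_filter([
    ("192.0.2.0/24", 24),
    ("198.51.100.0/24", 24),
])

print(accept("192.0.2.0/24", customer_filter))    # registered route
print(accept("203.0.113.0/24", customer_filter))  # leaked route, not registered
```

The default-deny return is the point of the exercise: a leaked route is dropped because nobody registered it, not because anyone recognized it as bad.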

Sean Donelan

2:38 a.m.

On Mon, 9 Jul 2007, Chris L. Morrow wrote:

...

While S*BGP seem like they may offer additional protections and additional knobs to be used for protecting 'us' from 'them', the very basics are obviously not being done so added complexity is not going to really help :( Or, perhaps its not that its not going to help its just not going to get done because even prefix-lists are 'too hard', apparently.

Yep, if the simple steps were implemented and didn't work, then adding more complex steps may be appropriate. But in the absence of people using even the simple steps, why do people think adding more complexity will work better?

The Internet is an on-going example of just-in-time engineering; and fix only when it breaks. Yes, I know someone will claim Yahoo lost gazillion dollars due to the fubared routing. On the other hand, it was fixed in a short amount of time. While lots of folks have their patent pending solutions waiting, are those solutions more cost effective than fixing the occasional fubared nature of the Internet when it happens? So far, the people who pay the bills don't think so. And the Department of Homeland Security isn't paying those bills.

Valdis.Kletnieks＠vt.edu

2:47 p.m.

On Mon, 09 Jul 2007 02:18:25 -0000, "Chris L. Morrow" said:

...

While S*BGP seem like they may offer additional protections and additional knobs to be used for protecting 'us' from 'them', the very basics are obviously not being done so added complexity is not going to really help :( Or, perhaps its not that its not going to help its just not going to get done because even prefix-lists are 'too hard', apparently.

"Wow, prefix-lists are *hard*" -- BGP Barbie.. You'd think that by now, we as an industry could do better than that. (Yes, I know the jury is still out on what really happened at L3-Hanaro. Doesn't change the fact that we collectively shoot ourselves in the foot because providers will believe the most implausible things from their neighbors, like announcements for 128/1 ;)

Chris L. Morrow

3:11 p.m.

On Mon, 9 Jul 2007 Valdis.Kletnieks@vt.edu wrote:

...

On Mon, 09 Jul 2007 02:18:25 -0000, "Chris L. Morrow" said:

...
While S*BGP seem like they may offer additional protections and additional knobs to be used for protecting 'us' from 'them', the very basics are obviously not being done so added complexity is not going to really help :( Or, perhaps its not that its not going to help its just not going to get done because even prefix-lists are 'too hard', apparently.

"Wow, prefix-lists are *hard*" -- BGP Barbie..

shopping anyone?

...

You'd think that by now, we as an industry could do better than that.

I think that over all, over a goodly period of time, we are... we occasionally step on the wrong end of the rake still :(

...

(Yes, I know the jury is still out on what really happened at L3-Hanaro.

from some other conversations about this, this seems to be a similar problem to what happened to NY-Edison about 1.5/2 years ago now (panix.com route hijackage)... 'auto filter from IRR data' without some form of checking for proper authority.

Of course, now that I stirred the 'l3 shoulda filtered' pot I should probably also stir the 'large ISP customers should outbound prefix-filter' pot. It's very likely that they DO filter outbound, at least to pref routes from place to place, perhaps twin failures caught them? :(

I think Marcus, Randy, Steve, Lixia all are getting at an underlying issue: "The interwebs are not as trivial to the world as they once were" So more strict control and operational due diligence should be on everyone's plate... At least for basics like making sure the routing system functions properly going forward.

Anyway, should be interesting to get some more details on what happened if they are ever to become available.

-Chris

jared mauch

3:19 p.m.

On Jul 9, 2007, at 10:47 AM, Valdis.Kletnieks@vt.edu wrote:

...

On Mon, 09 Jul 2007 02:18:25 -0000, "Chris L. Morrow" said:

...
While S*BGP seem like they may offer additional protections and additional knobs to be used for protecting 'us' from 'them', the very basics are obviously not being done so added complexity is not going to really help :( Or, perhaps its not that its not going to help its just not going to get done because even prefix-lists are 'too hard', apparently.

"Wow, prefix-lists are *hard*" -- BGP Barbie..

You'd think that by now, we as an industry could do better than that.

I agree that we need something better but nobody has shown me a better system than prefix lists and irr that actually *works*. The simple truth is that prefix lists ARE hard to manage. There are a lot of folks that have complex relationships or don't see why they should register their routes. Some people lack tools and automation to make it work or to manage their networks. It would be nice to see everyone filter routes, including those from even transit and large peers. I don't think we will be able to ignore this forever. I also do not see the status quo changing soon either.

Patrick W. Gilmore

5:42 p.m.

On Jul 9, 2007, at 11:19 AM, jared mauch wrote:

...

The simple truth is that prefix lists ARE hard to manage. There are a lot of folks that have complex relationships or don't see why they should register their routes. Some people lack tools and automation to make it work or to manage their networks. It would be nice to see everyone filter routes, including those from even transit and large peers. I don't think we will be able to ignore this forever. I also do not see the status quo changing soon either.

I'm not sure we can't ignore it forever. The telephone network has been around for a lot longer than the 'Net, has way, way, way more connections, and there are corners of it which are managed even worse than the inter-web.

Like Sean said, cost/benefit. If the cost of avoiding a 1 day outage per year is the same as a 5 day outage, management will not fix it.

--
TTFN, patrick

Florian Weimer

3:42 p.m.

* Valdis Kletnieks:

...

(Yes, I know the jury is still out on what really happened at L3-Hanaro. Doesn't change the fact that we collectively shoot ourselves in the foot because providers will believe the most implausible things from their neighbors, like announcements for 128/1 ;)

Well, if L3 creates its filters based on RADB entries (which is still considered a RR, isn't it?), they will accept a 213/8 announcement. 8-( 128/1 isn't too far away, I fear.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

Lixia Zhang

2:34 a.m.

On Jul 8, 2007, at 6:56 PM, Marcus H. Sachs wrote:

...

If we had routing registries that were accurate and authoritative, then soBGP/S-BGP would have something to verify a route change against. It should not matter if last Friday's event was a leak or a false announcement - with some sort of verification system we could mitigate errors, intentional or accidental.

I agree here. So the mostly needed thing is some verification system.

Wonder what people think about this: http://www.nanog.org/mtg-0706/osterweil.html something one can start using *now*. There was virtually no comment after the presentation at last NANOG.

Lixia

Cat Okita

5:12 a.m.

On Sun, 8 Jul 2007, Lixia Zhang wrote:

...

I agree here. So the mostly needed thing is some verification system.

I keep on coming back to "any proposal that requires universal adoption and deployment is fatally flawed". As far as "needing a verification system", is there something deeply problematic about filtering your customers? It's a fine example of thinking globally and acting locally.

...

Wonder what people think about this: http://www.nanog.org/mtg-0706/osterweil.html something one can start using *now*. There was virtually no comment after the presentation at last NANOG.

Perhaps that's because there wasn't much to say about it? I think DNS is already badly overloaded, and adding still more random crap into it just isn't on my list.

Beyond that, the phrase "leverages concepts from PGP's Web of Trust" makes my hair stand on end. Perhaps the author(s) are comfortable with using a pile of disassociated islands 'with low reliability for confirming prefix ownership -- I'd rather apply my efforts to something that doesn't boil down to the equivalent of an educated judgement call.

cheers!
==========================================================================
"A cat spends her life conflicted between a deep, passionate and profound desire for fish and an equally deep, passionate and profound desire to avoid getting wet. This is the defining metaphor of my life right now."

Andy Dills

6:27 a.m.

On Mon, 9 Jul 2007, Cat Okita wrote:

...

As far as "needing a verification system", is there something deeply problematic about filtering your customers? It's a fine example of thinking globally and acting locally.

That's what I'm curious about...this boils down to L3 not properly filtering Hanaro.

Having recently turned up some L3 connectivity, I was happy to discover that they can use any of the routing registries as the source of their prefix filters. They told me the prefix filters are automatically constructed based on the RR of your choice...update the RR, and their filters will update that night, no need to bug them. Yay for that, wish everybody worked that way.

But...why wasn't Hanaro being filtered? If the filters are being automatically generated, I would think they would just filter all of their peers regardless of number of prefixes etc.

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---

Sam Stickland

10 Jul

11:24 a.m.

Andy Dills wrote:

...

On Mon, 9 Jul 2007, Cat Okita wrote:

...
As far as "needing a verification system", is there something deeply problematic about filtering your customers? It's a fine example of thinking globally and acting locally.

That's what I'm curious about...this boils down to L3 not properly filtering Hanaro.

Having recently turned up some L3 connectivity, I was happy to discover that they can use any of the routing registries as the source of their prefix filters. They told me the prefix filters are automatically constructed based on the RR of your choice...update the RR, and their filters will update that night, no need to bug them. Yay for that, wish everybody worked that way.

But...why wasn't Hanaro being filtered? If the filters are being automatically generated, I would think they would just filter all of their peers regardless of number of prefixes etc.

Handily it's possible to query Level3's filters through a whois interface, for example:

whois -h filtergen.level3.net RIPE::AS25577

For some reason this doesn't work:

whois -h filtergen.level3.net APNIC::AS9318

But this does, presumably this one is reading off an internal Level3 db:

whois -h filtergen.level3.net AS9318

List of prefixes returned is:

Prefix list for policy AS9318 = LEVEL3::AS9318
61.98.32.0/19
61.98.64.0/20
61.98.96.0/20
124.111.0.0/16
210.93.131.0/24
210.93.132.0/22
211.33.96.0/20
211.49.96.0/20
211.49.144.0/20
211.49.192.0/19
211.49.240.0/20
211.59.160.0/19
211.59.208.0/20
211.59.224.0/20
211.110.160.0/19
211.110.224.0/20
211.186.0.0/19
211.186.32.0/20
211.186.64.0/19
211.186.96.0/20
211.186.144.0/20
211.186.160.0/19
211.243.144.0/20
211.243.224.0/19
211.244.0.0/20
211.244.32.0/19
211.244.64.0/20
211.244.176.0/20
211.245.48.0/20
218.49.114.0/24
218.234.87.0/24

Sam
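As a rough illustration of how output like this could be turned into something a script can work with, the Python sketch below pulls the IPv4 prefixes out of a whois-style text response. The exact layout of the filtergen response is an assumption here (only the header and the first few prefixes from the message above are reproduced), and `extract_prefixes` is a hypothetical helper, not part of any Level3 tooling.

```python
import ipaddress
import re

# Matches IPv4 CIDR tokens such as 61.98.32.0/19 anywhere in the text.
CIDR = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}/\d{1,2}\b")

def extract_prefixes(text):
    """Return every IPv4 prefix found in the text as an ip_network object.
    ip_network() raises ValueError on malformed prefixes or set host bits."""
    return [ipaddress.ip_network(token) for token in CIDR.findall(text)]

# Excerpt of the response quoted in the message; line layout is assumed.
sample = """Prefix list for policy AS9318 = LEVEL3::AS9318
61.98.32.0/19
61.98.64.0/20
61.98.96.0/20
124.111.0.0/16
"""

prefixes = extract_prefixes(sample)
for p in prefixes:
    print(p, p.num_addresses)
```

Because the extraction keys on the CIDR tokens rather than on line positions, it should tolerate modest differences in how the real response is laid out.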

Roland Dobbins

9 Jul

6:25 a.m.

On Jul 9, 2007, at 8:56 AM, Marcus H. Sachs wrote:

...

If we had routing registries that were accurate and authoritative

Irrespective of the status of s*BGP deployment, following existing BCPs with currently-deployed techniques/functionality/features would have prevented the issue described in the post. s*BGP deployment is a separate issue, and conflating the two doesn't necessarily follow.

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@cisco.com> // 408.527.6376 voice

Culture eats strategy for breakfast.

-- Ford Motor Company

Randy Bush

6:31 a.m.

...

following existing BCPs with currently-deployed techniques/functionality/features would have prevented the issue described in the post.

knowing that level(3) is one of the most serious deployments of irr-based route filters and other prudent practices, perhaps we should wait for a post mortem from level(3) before jumping to conclusions? randy

Roland Dobbins

6:43 a.m.

On Jul 9, 2007, at 1:31 PM, Randy Bush wrote:

...

perhaps we should wait for a post mortem from level(3) before jumping to conclusions?

I said, 'the issue described in the post'. I've no idea whether the post was accurate or complete in its analysis.

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@cisco.com> // 408.527.6376 voice

Culture eats strategy for breakfast.

-- Ford Motor Company

Tony Tauber

4:05 p.m.

On Mon, Jul 09, 2007 at 02:31:10PM +0800, Randy Bush wrote:

...

...
following existing BCPs with currently-deployed techniques/functionality/features would have prevented the issue described in the post.

knowing that level(3) is one of the most serious deployments of irr-based route filters and other prudent practices, perhaps we should wait for a post mortem from level(3) before jumping to conclusions?

randy

Level3's filter implementation is indeed well-done, however, the fact remains that the IRR (which I use and endorse) has no linkage to any other source of information for purposes of validation. It's fundamentally garbage in, garbage out.

Say some ISP has a provisioning tool which updates their router configs and the IRR in one fell swoop. If the provisioner makes a typo the IRR will gladly accept the entry for, say, 12/8, and the upstream will rebuild their filters with that entry automatically and you get the same result.

There's no magic bullet in updating BGP if a fundamental, verifiable data model is not accepted and agreed upon.

Tony
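To make the "no linkage to any other source" point concrete, the Python sketch below checks a proposed registry entry against an independent list of blocks actually allocated to the ISP before it is accepted. Everything here is hypothetical: the `ALLOCATED` table stands in for whatever authoritative allocation data might exist, and the prefixes are documentation addresses apart from the 12/8 typo taken from the message above.

```python
import ipaddress

# Hypothetical: blocks allocated to this ISP according to an independent,
# authoritative source (not the IRR itself).
ALLOCATED = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def validate_entry(prefix, allocated=ALLOCATED):
    """Accept a new registry entry only if it lies inside an allocated block."""
    net = ipaddress.ip_network(prefix)
    return any(
        net.version == block.version and net.subnet_of(block)
        for block in allocated
    )

print(validate_entry("203.0.113.0/25"))  # inside an allocated block
print(validate_entry("12.0.0.0/8"))      # the provisioning typo from the example
```

Without a cross-check of this kind, the typo flows from the provisioning tool into the registry and on into the upstream's automatically generated filter, which is the garbage-in, garbage-out path described above.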

Randy Bush

4:31 p.m.

Tony Tauber wrote:

...

On Mon, Jul 09, 2007 at 02:31:10PM +0800, Randy Bush wrote:

...
...
following existing BCPs with currently-deployed techniques/functionality/features would have prevented the issue described in the post. knowing that level(3) is one of the most serious deployments of irr-based route filters and other prudent practices, perhaps we should wait for a post mortem from level(3) before jumping to conclusions? There's no magic bullet in updating BGP if a fundamental, verifiable data model is not accepted and agreed upon.

the space of routing data validation is large, we can explore it at our leisure, and we have been for some years. but my point was that it is silly to indulge in conjecturbation on the cause of the recent event and excoriate l(3), hanaro, or john curran's grandmother until we have heard from the folk who have actual data. randy

Sean Donelan

4:45 p.m.

On Tue, 10 Jul 2007, Randy Bush wrote:

...

the space of routing data validation is large, we can explore it at our leisure, and we have been for some years. but my point was that it is silly to indulge in conjecturbation on the cause of the recent event and excoriate l(3), hanaro, or john curran's grandmother until we have heard from the folk who have actual data.

If companies thought it was in their self-interest, they might actually share that actual data. However history has shown over and over again that companies generally avoid any public discussion about their problems until they are overwhelmed. http://tech.monstersandcritics.com/news/article_1327791.php/Yahoo_outage_cau... If you wait for the companies to reveal the data, you will probably have a long wait. WorldCom still hasn't released its official investigative report into why its national frame networks failed for nearly a week in 1999.

Douglas Otis

5:18 p.m.

On Jul 9, 2007, at 9:31 AM, Randy Bush wrote:

...

Tony Tauber wrote:

...
There's no magic bullet in updating BGP if a fundamental, verifiable data model is not accepted and agreed upon.

the space of routing data validation is large, we can explore it at our leisure, and we have been for some years. but my point was that it is silly to indulge in conjecturbation on the cause of the recent event and excoriate l(3), hanaro, or john curran's grandmother until we have heard from the folk who have actual data.

I can't help but conjecturbate how this might relate to route flap damping, and whether overly aggressive RFD might be related to such DoS. The other side of the coin would be that RFD might also limit the extent of spoofed routes.

The amount of noise within the system makes it difficult for administrators to fully comprehend what happened while it is happening. A means to even partially validate routing information might provide more timely and greater insight. This insight may help rule out nefarious causes. When it doesn't, the issue might be far more serious. Crying wolf too many times is bad, but not seeing the wolf could be worse.

-Doug

Sean Donelan

8 Jul

11:51 p.m.

On Sun, 8 Jul 2007, Steven M. Bellovin wrote:

...

...
I put up a diary at the Storm Center (http://isc.sans.org/diary.html?storyid=3112) that summarizes what we know about the Yahoo outage on Friday. If anybody has any additional info they want to share or comments about the write-up please let me know.

In other words, it was yet another BGP screw-up that secured routing could have prevented.

Or using route registries and filters, or any of the other dozen ideas suggested over the last decade.

...

Any clue about the root cause, i.e., malice or accident?

Does it matter? You are screwed either way.

Steven M. Bellovin

9 Jul

1:11 a.m.

On Sun, 8 Jul 2007 19:51:04 -0400 (EDT) Sean Donelan <sean@donelan.com> wrote:

...

On Sun, 8 Jul 2007, Steven M. Bellovin wrote:

...
...
I put up a diary at the Storm Center (http://isc.sans.org/diary.html?storyid=3112) that summarizes what we know about the Yahoo outage on Friday. If anybody has any additional info they want to share or comments about the write-up please let me know.

In other words, it was yet another BGP screw-up that secured routing could have prevented.

Or using route registries and filters, or any of the other dozen ideas suggested over the last decade.

...
Any clue about the root cause, i.e., malice or accident?

Does it matter? You are screwed either way.

It tells us what we need to do to prevent such things from happening in the future. For example, most misconfigurations could be blocked if all routers matched prefixes against originating ASNs, and it doesn't matter much if the assertion is digitally signed or not -- all that matters is that the check is done against some authoritative database run, say, by the RIRs. (No, that's not quite the right solution, but it serves to illustrate my point.) That's completely inadequate against an attacker.

--Steve Bellovin, http://www.cs.columbia.edu/~smb
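The check described here, matching a prefix against its originating ASN, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of soBGP, S-BGP, or any deployed system: the `ORIGINS` table is a hypothetical stand-in for an authoritative database, and the prefixes and AS numbers come from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical authoritative table: prefix -> ASNs allowed to originate it.
ORIGINS = {
    ipaddress.ip_network("192.0.2.0/24"): {64500},
    ipaddress.ip_network("198.51.100.0/24"): {64501, 64502},
}

def origin_ok(prefix, as_path, origins=ORIGINS):
    """Check only the origin: the last ASN in the path must be registered
    for a block covering the announced prefix. Unknown prefixes fail."""
    net = ipaddress.ip_network(prefix)
    origin = as_path[-1]
    return any(
        net.version == registered.version
        and net.subnet_of(registered)
        and origin in asns
        for registered, asns in origins.items()
    )

print(origin_ok("192.0.2.0/24", [64510, 64500]))         # correct origin
print(origin_ok("192.0.2.0/24", [64510, 64511]))         # wrong origin
print(origin_ok("192.0.2.0/24", [64510, 64511, 64500]))  # odd path, right origin
```

The third call passes because the function looks only at the last ASN in the path. A route that arrives over an unexpected path but keeps its legitimate origin is not caught, and neither is a sender who simply places the registered origin at the end of a fabricated path.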

Sean Donelan

2:26 a.m.

On Sun, 8 Jul 2007, Steven M. Bellovin wrote:

...

...
...
Any clue about the root cause, i.e., malice or accident?

Does it matter? You are screwed either way.

It tells us what we need to do to prevent such things from happening in the future. For example, most misconfigurations could be blocked if all routers matched prefixes against originating ASNs, and it doesn't matter much if the assertion is digitally signed or not -- all that matters is that the check is done against some authoritative database run, say, by the RIRs. (No, that's not quite the right solution, but it serves to illustrate my point.) That's completely inadequate against an attacker.

The bad guys will (almost) always say Oops, it was an accident while being very clever at deliberately bypassing every safety feature you can design into a system.

The foolish guys will (almost) always say Oops, I didn't know while also being very clever at accidentally bypassing every safety feature you can design into a system.

Unfortunately engineering can't rely on human intentions. Both the evil and the foolish have the same result. The hope is the foolish will give up before bypassing the last step, so you keep adding more steps to stop the fool. The hope is the evilish will go after something easier before bypassing the last step, so you keep adding more steps to stop the evil (sic).

As always evil or foolish gals do the same thing as evil or foolish guys.

It's not just IP addresses that exhibit misrouting. But it only occasionally affects the important or famous enough to attract attention. http://blog.oregonlive.com/siliconforest/2007/06/rivalry_between_qwest_comca...


participants (17)

Andy Dills
Cat Okita
Chris L. Morrow
Douglas Otis
Florian Weimer
jared mauch
Lixia Zhang
Marcus H. Sachs
Patrick W. Gilmore
Randy Bush
Ricardo V. Oliveira
Roland Dobbins
Sam Stickland
Sean Donelan
Steven M. Bellovin
Tony Tauber
Valdis.Kletnieks＠vt.edu