Katrina Network Damage Report


Todd Underwood

9 Sep 2005

6:13 p.m.

As promised, Renesys has released a brief paper on the effects of Hurricane Katrina as seen from the Internet. We cover the period of landfall in some detail and also review the recovery efforts.

http://www.renesys.com/resource_library/Renesys-Katrina-Report-9sep2005.pdf

People who are interested should obviously read the report (and I'm pretty sure it's on-topic, for once! This might be the second on-topic thread today. Danger!). But highlights include:

--the Internet was fine
--the Gulf Coast wasn't
--Louisiana was hit particularly hard
--many outaged prefixes still haven't been restored, 10 days later

We're happy to take questions on the report, the data, the methodology, etc.

t.

--
_____________________________________________________________________
todd underwood
director of operations & security
renesys - interdomain intelligence
todd@renesys.com www.renesys.com


Randy Bush

10 Sep

10:49 a.m.

this report repeatedly uses the term "outage." how is that determined/measured?

randy

Todd Underwood

12:02 p.m.

randy,

On Sat, Sep 10, 2005 at 05:49:59PM +0700, Randy Bush wrote:

...

this report repeatedly uses the term "outage." how is that determined/measured?

i think this is covered in the report several times, but i'm sorry if it wasn't clear. this is based on work that we've done for a while (some of which was presented at nanog30: http://nanog.org/mtg-0402/ogielski.html).

the general idea is: take a large peerset sending you full routes, keep every update forever, and take a reasonably long (at least a month or two) time horizon. calculate a consensus view for each prefix as to whether that prefix is reachable by some set of those peers. an outaged prefix is one that used to be reachable that no longer is. in other words, one that has been withdrawn from the full table by some sufficiently large number of peers. we exclude single-peer outages and outages that only affect a few peers through some reasonable thresholding.

make sense? that's the general idea. the implementation is obviously a *lot* more complicated.

t.

(i'm sure the question of covering prefixes will come up shortly and i'll address it when/if it does). :-)
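The consensus idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Renesys implementation (which the message says is a lot more complicated): the function name `outaged_prefixes`, the per-peer set representation, and both thresholds (`min_baseline_peers`, `withdrawn_fraction`) are hypothetical choices made for the example.

```python
from collections import defaultdict


def outaged_prefixes(baseline, current, min_baseline_peers=3, withdrawn_fraction=0.8):
    """Return prefixes that a consensus of peers used to route and no longer does.

    baseline: dict mapping peer -> set of prefixes seen from that peer over the
              long time horizon (hypothetical input format).
    current:  dict mapping peer -> set of prefixes that peer routes now.
    Both thresholds are illustrative assumptions, not values from the report.
    """
    # Count how many peers carried each prefix during the baseline window.
    carriers = defaultdict(set)
    for peer, prefixes in baseline.items():
        for prefix in prefixes:
            carriers[prefix].add(peer)

    outaged = set()
    for prefix, peers in carriers.items():
        # Ignore prefixes that only a few peers ever saw.
        if len(peers) < min_baseline_peers:
            continue
        # Peers that had the prefix before and have withdrawn it since.
        lost = sum(1 for peer in peers if prefix not in current.get(peer, set()))
        # Thresholding: single-peer or few-peer withdrawals do not count.
        if lost / len(peers) >= withdrawn_fraction:
            outaged.add(prefix)
    return outaged
```

With four peers that all carried 192.0.2.0/24 in the baseline, a withdrawal seen by only one of them is filtered out, while a withdrawal seen by all four is reported.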

Randy Bush

12:22 p.m.

but what about existence of covering or more specific prefixes?

while aggregate inferences are likely reasonable, in general, inferring unreachability of end interfaces by looking only at routing data, especially multi-hop bgp data, worries me.

randy

Todd Underwood

1:25 p.m.

randy brings up two separate questions...

On Sat, Sep 10, 2005 at 07:22:34PM +0700, Randy Bush wrote:

...

but what about existence of covering or more specific prefixes? while aggregate inferences are likely reasonable, in general,

see? i told y'all that this would come up!

yes, covering prefixes count. there are many fewer covering prefixes than most net geeks would like to believe. there are also many prefixes that appear (in routing data only) to cover that do not, in fact, provide forwarding for the more specific prefixes. a simple analysis that only includes a covering prefix if it has exactly the same origination pattern (last two ASes maybe), might be sufficient. still no way to tell, for certain, whether the cover works.

our analysis didn't look at covering prefixes, but a spot check of the outaged prefixes doesn't reveal many. perhaps someone else would like our list of outaged prefixes to check those for cover?
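For anyone who wants to try the cover check suggested above, here is a minimal sketch using Python's standard `ipaddress` module. It assumes a hypothetical input format (prefix strings mapped to AS paths as lists of integers) and takes "same origination pattern" to mean the last two ASes of the path match, as the message tentatively proposes. As noted there, a match in routing data still cannot prove that the cover actually forwards traffic.

```python
import ipaddress


def effective_covers(outaged, table, tail_len=2):
    """Find covering prefixes with the same origination pattern.

    outaged: dict mapping an outaged prefix -> its last known AS path.
    table:   dict mapping currently routed prefixes -> their AS paths.
    tail_len: how many trailing ASes define the origination pattern
              (2 is the "last two ASes maybe" suggestion; an assumption).
    """
    result = {}
    for prefix, old_path in outaged.items():
        net = ipaddress.ip_network(prefix)
        covers = []
        for candidate, path in table.items():
            cand_net = ipaddress.ip_network(candidate)
            # subnet_of() raises TypeError across IP versions, so filter first.
            if cand_net.version != net.version:
                continue
            # A cover is strictly less specific and contains the outaged prefix.
            if cand_net.prefixlen < net.prefixlen and net.subnet_of(cand_net):
                # Only count it if the origination pattern is identical.
                if path[-tail_len:] == old_path[-tail_len:]:
                    covers.append(candidate)
        result[prefix] = covers
    return result
```

An outaged prefix that maps to an empty list has no cover with a matching origination pattern in the supplied table.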

...

inferring unreachability of end interfaces by looking only at routing data, especially multi-hop bgp data, worries me.

me, too. that's why we didn't do that. two issues in this second question:

1) the multi-hop issue is bogus, i believe. i'll ignore it unless randy chooses to say what he means here.

2) yes, indeed. we chose only to comment on changes in the routing table as changes in the routing table. inferences about unreachability of end interfaces are left entirely to the reader (randy, in this case).

t.

Randy Bush

9:11 p.m.

Re: From: Todd Underwood <todd@renesys.com>

to quote bobby dylan "you don't need a weatherman to know which way the wind blows." i.e., unless you were the president, the department of fatherland security, or fema, you probably knew there was a major disaster ongoing in nola and surrounds. if you could read the newspapers, you could even have known of it in advance.

but, the geolocation stuff is cool. could it have told us, in an operationally useful/timely manner, that at&t had moved from new jersey to spain the other day?

...

1) the multi-hop issue is bogus, i believe. i'll ignore it unless randy chooses to say what he means here.

maybe use <http://nanog.org/mtg-0210/wang.html>. some citeseer entries seem a bit mangled, but [0] seems ok.

...

2) yes, indeed. we chose only to comment on changes in the routing table as changes in the routing table. inferences about unreachability of end interfaces is left entirely to the reader

but reachability is what it's all about. the folk here are paid to deliver packets. the control plane (routing) is one of the tools we use to achieve that end.

Re: From: George William Herbert <gherbert@retro.com>

...

Looking at the routing tables you see failures. If a prefix goes away completely and utterly, and is truly unreachable, then anyone trying to see it is going to see an outage.

not if a covering or more specific tells us how to get packets to the destination. but perhaps that's what you mean by a prefix being unreachable and i am being too picky.

randy

---

[0] - bibtex

@inproceedings{wang02observation,
  Author = "Lan Wang and Xiaoliang Zhao and Dan Pei and Randy Bush and Daniel Massey and Allison Mankin and S. Felix Wu and Lixia Zhang",
  Title = "Observation and Analysis of BGP Behavior Under Stress",
  BookTitle = "Proc. of ACM SIGCOMM Internet Measurement Workshop 2002, Marseille, France",
  Month = "Nov",
  Year = "2002",
  url = "citeseer.ist.psu.edu/article/wang02observation.html"
}

bmanning＠vacation.karoshi.com

9:20 p.m.

...

but reachability is what it's all about. the folk here are paid to deliver packets. the control plane (routing) is one of the tools we use to achieve that end.

Re: From: George William Herbert <gherbert@retro.com>

...
Looking at the routing tables you see failures. If a prefix goes away completely and utterly, and is truly unreachable, then anyone trying to see it is going to see an outage.

not if a covering or more specific tells us how to get packets to the destination. but perhaps that's what you mean by a prefix being unreachable and i am being too picky.

would that be that -all- your neighbors have no information on how to forward that packet, then the destination is unreachable. what if a neighbor lies about reachability and you dump your packets into their "blackhole"? that darned policy-constrained routing ick can be tough to deal w/...

...

randy

--bill (who will return to lurking)

Todd Underwood

11 Sep

3:05 p.m.

randy, all,

On Sun, Sep 11, 2005 at 04:11:50AM +0700, Randy Bush wrote:

...

Re: From: Todd Underwood <todd@renesys.com>

...

but, the geolocation stuff is cool. could it have told us, in an operationally useful/timely manner, that at&t had moved from new jersey to spain the other day?

yes, within about 30s. but randy, you should know better than to think that requires any geolocation. 12/8 didn't move to spain, it moved to bolivia (AS26210). and since '12956 26210' was a novel origination pattern for 12/8 (and the other /8s involved), no geolocation required. simple analysis of bgp updates tells the story. anyone who can process updates from a large peerset and compare those to recent routing history or routing policy can report that as an anomaly.
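The novel-origination check described above needs nothing more than a comparison of each update against recent routing history. A minimal sketch follows, assuming a hypothetical history structure (prefix mapped to the set of origination patterns already seen) and using the last two ASes of the path as the pattern; the function name and input format are illustrative, not any real tool's API.

```python
def novel_originations(history, updates, tail_len=2):
    """Flag announcements whose origination pattern was never seen before.

    history: dict mapping prefix -> set of tuples, each tuple being the
             trailing ASes of a previously observed path (assumed format).
    updates: iterable of (prefix, as_path) announcements, as_path a list.
    """
    anomalies = []
    for prefix, as_path in updates:
        pattern = tuple(as_path[-tail_len:])
        seen = history.get(prefix)
        # Only prefixes with some history can be judged; a pattern that is
        # absent from that history is reported as an anomaly.
        if seen and pattern not in seen:
            anomalies.append((prefix, pattern))
    return anomalies


# Example: the prior pattern uses placeholder AS numbers; '12956 26210' is
# the novel pattern for 12/8 quoted in the message above.
history = {"12.0.0.0/8": {(64500, 64501)}}
updates = [("12.0.0.0/8", [64510, 12956, 26210])]
flagged = novel_originations(history, updates)
```

Run against a live feed from a large peerset, the same comparison would be applied to every incoming update, with the history refreshed from the retained update archive.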

...

...
1) the multi-hop issue is bogus, i believe. i'll ignore it unless randy chooses to say what he means here.

maybe use <http://nanog.org/mtg-0210/wang.html>. some citeseer entries seem a bit mangled, but [0] seems ok.

i'm familiar with the presentation, but thanks for citing it. as you know jim cowie, andy ogielski and bj premore did some related work for renesys including

http://www.renesys.com/resource_library/renesys-spie2002.pdf

and

http://www.renesys.com/resource_library/Renesys-NANOG23.pdf
(linked from: http://www.nanog.org/mtg-0110/global.html)

these are all interesting pieces of work regarding whether bgp session resets during large-scale worms are the cause of monitoring artifacts or whether those worms cause instability themselves. there are differences of opinion about the results, but it's all interesting and worth reading. good stuff.

but off topic here, i believe. randy: why do you think resets of multi-hop sessions have anything to do with these results reporting individual prefix outages in the Katrina-affected regions? sorry for being slow, but i'm just not seeing any connection. maybe someone smarter than me can spell it out in small words for me.

...

...
2) yes, indeed. we chose only to comment on changes in the routing table as changes in the routing table. inferences about unreachability of end interfaces is left entirely to the reader

but reachability is what it's all about. the folk here are paid to deliver packets. the control plane (routing) is one of the tools we use to achieve that end.

yes, of course. prefixes with no entry in a routing table are not reachable from that device. what i am saying is that we are not implying that the end interface went down or that the point to point link between the end user and their provider went down (although both of these seem likely). we are saying that there was not a routed path from a consensus of our peers to that prefix. so that is definitely unreachable from that consensus of those peers.

...

Re: From: George William Herbert <gherbert@retro.com>

...
Looking at the routing tables you see failures. If a prefix goes away completely and utterly, and is truly unreachable, then anyone trying to see it is going to see an outage.

not if a covering or more specific tells us how to get packets to the destination. but perhaps that's what you mean by a prefix being unreachable and i am being too picky.

i think you may be being picky, but i've already admitted that i'm having trouble following your points. :-)

we can look in more detail at coverings and more specifics. but the depressing fact of the matter is that there are very few covering prefixes for many of these that are effective (which i define to mean: have the same origination pattern).

my claim, and anyone with routeviews/ripe data and a few hundred MB of space can verify this, is that these prefixes are really and truly outaged. not reachable from pretty much anywhere on the Internet. i think that at the higher level, this is probably not as controversial as it seems to be in nanog so far. :-)

t.

Sean Donelan

10 Sep

2:18 p.m.

On Sat, 10 Sep 2005, Todd Underwood wrote:

...

the general idea is: take a large peerset sending you full routes, keep every update forever, and take a reasonably long (at least a month or two) time horizon. calculate a consensus view for each prefix as to whether that prefix is reachable by some set of those peers. an outaged prefix is one that used to be reachable that no longer is. in other words, one that has been withdrawn from the full table by some sufficiently large number of peers.

This describes a partitioning, not necessarily an outage.

Todd Underwood

2:26 p.m.

sean,

On Sat, Sep 10, 2005 at 10:18:25AM -0400, Sean Donelan wrote:

...

On Sat, 10 Sep 2005, Todd Underwood wrote:

...
the general idea is: take a large peerset sending you full routes, keep every update forever, and take a reasonably long (at least a month or two) time horizon. calculate a consensus view for each prefix as to whether that prefix is reachable by some set of those peers. an outaged prefix is one that used to be reachable that no longer is. in other words, one that has been withdrawn from the full table by some sufficiently large number of peers.

This describes a partitioning, not necessarily an outage.

can you explain what you mean?

t.
