inter-domain link recovery
Hi folks,

I find that link recovery is sometimes very slow when a failure occurs between different ASes. The outage may last hours. In such cases, it seems that the automatic recovery of BGP-like protocols fails and the repair is taken over manually.

We should still remember the Taiwan earthquake in Dec. 2006, which damaged almost all the submarine cables. The network condition was quite terrible in the following few days. One might need minutes to load a web page in the US from Asia. However, two main cables luckily escaped damage. Furthermore, we actually have more routing paths, e.g., from Asia to Europe over the trans-Russia networks of Rostelecom and TransTeleCom. With these redundant paths, the condition should not have been that horrible.

And here is what I'd like to discuss with you, especially the network operators:

1. Why does a BGP-like protocol sometimes fail to recover the path? Is it mainly because of the policies set by the ISPs and network operators?
2. What actions will a network operator take when such failures occur? Is it the case that they 1) find (an) alternative path(s); 2) negotiate with other ISPs if needed; 3) modify the policy and reroute the traffic? Which actions may be time-consuming?
3. There may be more than one alternative path; what is the criterion for the network operator to finally select one or some of them?
4. What information is required for a network operator to find the new route?

Thank you.

C. Hu
On Aug 14, 2007, at 9:06 PM, Chengchen Hu wrote:
1. Why does a BGP-like protocol sometimes fail to recover the path? Is it mainly because of the policies set by the ISPs and network operators?
There are an infinitude of possible answers to these questions which have nothing to do with BGP, per se; those answers are very subjective in nature. Can you provide some specific examples (citing, say, publicly-available historical BGP tables available from route-views, RIPE, et al.) of an instance in which you believe that the BGP protocol itself is the culprit, along with the supporting data which indicate that the prefixes in question should've remained globally (for some value of 'globally') reachable? Or are these questions more to do with the general provisioning of interconnection relationships, and not specific to the routing protocol(s) in question?

Physical connectivity to a specific point in a geographical region does not equate to logical connectivity to all the various networks in that larger region; SP networks (and customer networks, for that matter) are interconnected and exchange routing information (and, by implication, traffic) based upon various economic/contractual, technical/operational, and policy considerations which vary greatly from one instance to the next. So, the assertion that there were multiple unaffected physical data links to/from Taiwan in the cited instance - leaving aside for the moment whether this was actually the case, or whether sufficient capacity existed in those links to service traffic to/from the prefixes in question - in and of itself has no bearing on whether or not the appropriate physical and logical connectivity was in place in the form of peering or transit relationships to allow continued global reachability of the prefixes in question.
2. What actions will a network operator take when such failures occur? Is it the case that they 1) find (an) alternative path(s); 2) negotiate with other ISPs if needed; 3) modify the policy and reroute the traffic? Which actions may be time-consuming?
All of the above, and all of the above. Again, it's very situationally dependent.
3. There may be more than one alternative path; what is the criterion for the network operator to finally select one or some of them?
Proximate physical connectivity; capacity; economic/contractual, technical/operational, and policy considerations.
4. What information is required for a network operator to find the new route?
By 'find the new route', do you mean a new physical and logical interconnection to another SP? The following references should help shed some light on the general principles involved:

<http://en.wikipedia.org/wiki/Peering>
<http://www.nanog.org/subjects.html#peering>
<http://www.aw-bc.com/catalog/academic/product/0,1144,0321127005,00.html>

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@cisco.com> // 408.527.6376 voice

Culture eats strategy for breakfast.

-- Ford Motor Company
Thank you for your detailed explanation. Just suppose there are no business factors (like multiple ASes belonging to the same ISP): is it always possible for BGP to automatically find an alternative path when a failure occurs, if one exists? If not, what may be the causes?

C. Hu
On Aug 15, 2007, at 12:07 AM, Chengchen Hu wrote:
is it always possible for BGP to automatically find an alternative path when a failure occurs, if one exists? If not, what may be the causes?
Barring implementation bugs or network misconfigurations, I've never experienced an operational problem with BGP4 (or OSPF or EIGRP or IS-IS or RIPv2, for that matter) converging correctly due to a flaw in the routing protocol, if that's the gist of the first question. There are many other factors external to the workings of the protocol itself which may affect routing convergence, of course; it really isn't practical to provide a meaningful answer to the second question in a reasonable amount of time. Please see the previous reply.

The questions that you're asking essentially boil down to 'How does the Internet work?', or, even more fundamentally, 'How does routing work?'. I would strongly suggest familiarizing oneself with the reference materials cited in the previous reply, as they provide a good introduction to the fundamentals of this topic.

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@cisco.com> // 408.527.6376 voice

Culture eats strategy for breakfast.

-- Ford Motor Company
Chengchen Hu wrote:
Thank you for your detailed explanation.
Just suppose there are no business factors (like multiple ASes belonging to the same ISP): is it always possible for BGP to automatically find an alternative path when a failure occurs, if one exists? If not, what may be the causes?
If you have multiple paths to a given prefix in your RIB, you're going to use the shortest one. If it's withdrawn, you'll use the next shortest one. If you have no paths remaining to that prefix, you can't forward the packet anymore.

To look back at your original question: you're asking a specific question about the Dec '06 earthquake outage. The best people to ask why it took so long to restore are the operators who were most dramatically affected. The fact of the matter is most ISPs are not in the business of buying more diversity than they think they need in order to ensure business continuity, support SLAs and stay in business. The earthquake and undersea landslide affected a number of fiber paths over a short period of time. I think it's fair to assume that a number of operators have updated their risk models to account for that sort of threat in the future; it's hard to totally anticipate the threat of losing ~80% of your fiber capacity in a rather dense and well-connected corridor.

There were two talks on the subject of that particular event at the first NANOG of 2007; you can peruse them here:

http://www.nanog.org/mtg-0702/topics.html

In particular the second talk discusses the signature of that outage in the routing table in some detail.
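As a rough illustration of that fallback behaviour (a toy sketch only - the prefix, AS numbers, and data structures are invented for the example, and real BGP best-path selection has many more tie-breakers than AS_PATH length):

    # Toy model: per-prefix path selection by shortest AS_PATH. Illustration only.
    rib = {
        "192.0.2.0/24": [                      # invented example prefix
            ["64500", "64496"],                # 2 AS hops - current best path
            ["64501", "64502", "64496"],       # 3 AS hops - backup path
        ]
    }

    def best_path(prefix):
        paths = rib.get(prefix, [])
        return min(paths, key=len) if paths else None

    prefix = "192.0.2.0/24"
    print(best_path(prefix))               # ['64500', '64496']
    rib[prefix].remove(best_path(prefix))  # best path withdrawn
    print(best_path(prefix))               # ['64501', '64502', '64496']
    rib[prefix].remove(best_path(prefix))  # last path withdrawn
    print(best_path(prefix))               # None - prefix unreachable, packets dropped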
On 15 Aug 2007, at 08:07, Chengchen Hu wrote:
Just suppose there are no business factors (like multiple ASes belonging to the same ISP): is it always possible for BGP to automatically find an alternative path when a failure occurs, if one exists? If not, what may be the causes?
I think everyone here has already covered a lot of the bases to do with your original question (i.e. 'because kit might not be configured to re-converge in an optimal way'), but if I can summarise the question you are trying to answer as "how can we improve convergence times", you might like to look at the notes that Nate Kushman presented in Toronto this year:

http://www.nanog.org/mtg-0702/kushman.html

Summary: Many studies show that when Internet links go up or down, the dynamics of BGP may cause several minutes of packet loss. The loss occurs even when multiple paths between the sender and receiver domains exist, and is unwarranted given the high connectivity of the Internet. Instead, we would like to ensure that Internet domains stay connected as long as the underlying network is connected.

Andy
On Aug 15, 2007, at 12:06 AM, Chengchen Hu wrote:
I find that link recovery is sometimes very slow when a failure occurs between different ASes. The outage may last hours. In such cases, it seems that the automatic recovery of BGP-like protocols fails and the repair is taken over manually.
We should still remember the Taiwan earthquake in Dec. 2006, which damaged almost all the submarine cables. The network condition was quite terrible in the following few days. One might need minutes to load a web page in the US from Asia. However, two main cables luckily escaped damage. Furthermore, we actually have more routing paths, e.g., from Asia to Europe over the trans-Russia networks of Rostelecom and TransTeleCom. With these redundant paths, the condition should not have been that horrible.
And here is what I'd like to discuss with you, especially the network operators: 1. Why does a BGP-like protocol sometimes fail to recover the path? Is it mainly because of the policies set by the ISPs and network operators?
Why do you think BGP was supposed to find the remaining path? Is it possible that the remaining fibers were not owned or leased by the networks in question? Or are you suggesting that any capacity should be available to anyone who "needs" it, whether they pay or not? BGP cannot find a path that the business rules forbid. -- TTFN, patrick
Thank you for your comments. I know there are economic/contractual relationships between two networks, and BGP cannot find a path that the business rules forbid. But in these cases, how is it recovered? Do the network operators just wait for the link to be physically repaired, or do they manually configure an alternative path by paying another network for transit service or finding a peering network?

C. Hu
On Aug 15, 2007, at 12:11 AM, Chengchen Hu wrote:
But in these cases, how is it recovered? Do the network operators just wait for the link to be physically repaired, or do they manually configure an alternative path by paying another network for transit service or finding a peering network?
Or they already have sufficient diversity in terms of peering/transit relationships and physical interconnectivity to handle the situation in question - depending upon the situation, of course.

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@cisco.com> // 408.527.6376 voice

Culture eats strategy for breakfast.

-- Ford Motor Company
Thank you for your comments. I know there are economic/contractual relationships between two networks, and BGP cannot find a path that the business rules forbid. But in these cases, how is it recovered? Do the network operators just wait for the link to be physically repaired, or do they manually configure an alternative path by paying another network for transit service or finding a peering network?
It sounds like you are asking this question in the context of an Internet exchange point, where you connect to the exchange point and then negotiate separate peering agreements with each participant, or a telecom hotel/data centre.

In the exchange point, you could theoretically have special "INSURANCE" peering agreements where you don't exchange traffic until there is an emergency, and then you can quickly turn it on, perhaps using an automated tool. In the data centre, you could theoretically have a similar sort of agreement that only requires cross-connect cables to be installed. In fact, you could already have the cross-connect cables in place, waiting to be plugged in on your end, or fully plugged in waiting for you to enable the port.

I wonder if anyone on the list has such INSURANCE peering or transit arrangements in place? Given the fact that most providers will go to extra efforts to install new circuits when there is an emergency like the Taiwan quake, perhaps there isn't as much value to such insurance arrangements as you might think.

If we ever get to the point where most circuit connections in the core are via switched wavelengths, then perhaps BGP will be used to find new paths when others have failed.

--Michael Dillon
On Wed, 15 Aug 2007 10:15:01 BST, michael.dillon@bt.com said:
telecom hotel/data centre. In the exchange point, you could theoretically have special "INSURANCE" peering agreements where you don't exchange traffic until there is an emergency, and then you can quickly turn it on, perhaps using an automated tool.
And then there's the fun of doing actual live fail-over testing to make sure it works as intended. (I wonder how many people who multi-home for outage survival actually *test* their multi-homing on a reasonably regular basis?) And as Michael noted, this only works in a telecom hotel, where you don't have to pay for dark fiber from your site to the connection point...
without specific data, everything is guesswork, folklore, and guesswork. did i say guesswork. the two biggest causes of long bgp convergence are policy and damping. policy is hidden and you can only guess and infer. if you are lucky and have records from folk near the problem (not the incident, but the folk who can not see it) damping can be seen in places like route views and ris. can you give specific data? e.g asn42 could not see prefix 666.42/16 being announced by asn96 2007.03.24 at 19:00 gmt. randy
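To make the damping point concrete, a small sketch (the penalty, suppress, reuse and half-life values below are assumptions loosely following commonly cited defaults, and the model is deliberately simplified relative to real implementations): a burst of flaps can keep a prefix suppressed for tens of minutes after the last flap.

    import math

    # Assumed illustrative damping parameters (not any particular vendor's defaults).
    PENALTY_PER_FLAP = 1000.0
    SUPPRESS_LIMIT = 2000.0
    REUSE_LIMIT = 750.0
    HALF_LIFE_MIN = 15.0

    def penalty_after(initial, minutes):
        # The flap penalty decays exponentially with the configured half-life.
        return initial * math.exp(-math.log(2.0) * minutes / HALF_LIFE_MIN)

    penalty = 3 * PENALTY_PER_FLAP             # three flaps in quick succession
    print(penalty >= SUPPRESS_LIMIT)           # True - route is suppressed

    minutes = 0
    while penalty_after(penalty, minutes) > REUSE_LIMIT:
        minutes += 1
    print(minutes)                             # ~30 minutes before the route is reusable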
On Wed, Aug 15, 2007 at 12:06:36PM +0800, Chengchen Hu wrote:
I find that link recovery is sometimes very slow when a failure occurs between different ASes. The outage may last hours. In such cases, it seems that the automatic recovery of BGP-like protocols fails and the repair is taken over manually.
We should still remember the Taiwan earthquake in Dec. 2006, which damaged almost all the submarine cables. The network condition was quite terrible in the following few days. One might need minutes to load a web page in the US from Asia. However, two main cables luckily escaped damage. Furthermore, we actually have more routing paths, e.g., from Asia to Europe over the trans-Russia networks of Rostelecom and TransTeleCom. With these redundant paths, the condition should not have been that horrible.
Please see the presentation I made at AMSIX in May (original version by Todd at Renesys): http://www.thedogsbollocks.co.uk/tech/0705quakes/AMSIXMay07-Quakes.ppt

BGP failover worked fine; much of the instability occurred after the cable cuts as operators found their networks congested and tried to manually change to new uncongested routes.

(Check slide 4) - the simple fact was that with something like 7 of 9 cables down, the redundancy is useless. Even if operators maintained N+1 redundancy, which is unlikely for many operators, that would imply 50% of capacity was actually used with 50% spare; however, we see around 78% of capacity was lost. There was simply too much traffic and not enough capacity. IP backbones fail pretty badly when faced with extreme congestion.
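A back-of-the-envelope illustration of that point (the numbers are assumptions chosen to echo the figures above, not measurements): even a region provisioned with a generous 50% of headroom cannot absorb the loss of roughly 78% of its capacity.

    # Invented figures, illustration only; units are arbitrary.
    total_capacity = 100.0
    offered_load = 50.0            # N+1-style provisioning: only half the capacity in use
    fraction_lost = 0.78           # roughly the share of capacity lost after the cuts

    remaining = total_capacity * (1.0 - fraction_lost)
    print(round(remaining, 1))                 # 22.0 units of capacity left
    print(round(offered_load / remaining, 2))  # 2.27 - more than 2x oversubscribed
    # Even perfect rerouting leaves the surviving links hopelessly congested.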
And here is what I'd like to discuss with you, especially the network operators: 1. Why does a BGP-like protocol sometimes fail to recover the path? Is it mainly because of the policies set by the ISPs and network operators?
No, BGP was fine.. this was a congestion issue - ultimately caused by lack of resiliency in cable routes in and out of the region.
2. What actions will a network operator take when such failures occur? Is it the case that they 1) find (an) alternative path(s); 2) negotiate with other ISPs if needed; 3) modify the policy and reroute the traffic? Which actions may be time-consuming?
Yes, and as the data shows this only made a bad situation worse.. any routes that may have had capacity were soon overwhelmed.
3. There may be more than one alternative path; what is the criterion for the network operator to finally select one or some of them?
Pick one that works? But in this case no such option was available.
4. What information is required for a network operator to find the new route?
In the case of a BGP change, presumably the operator checks that the new path appears to function without undue latency or delay (a traceroute would be a basic way to check). In terms of a real fix, it can't be done with BGP; you would need to find unused Layer 1 capacity and plug in a new cable. Slides 28-31 show that this occurred, with Asian networks picking up westward paths to Europe, but it took some manual intervention, time, and money.

I think the real question, given the facts around this, is whether South East Asia will look to protect against a future failure by providing new routes that circumvent single points of failure such as the Luzon Strait at Taiwan. But that costs a lot of money .. so the future's not hopeful!

Steve
I think the real question, given the facts around this, is whether South East Asia will look to protect against a future failure by providing new routes that circumvent single points of failure such as the Luzon Strait at Taiwan. But that costs a lot of money .. so the future's not hopeful!
In addition to the existing (fairly new) Rostelecom fiber via Heihe, there is a new 10G fiber build by China Unicom and the Russian company TTC. On the Russian side, TTC is a fully owned subsidiary of the Russian Railways, which means that they have full access to Russia's extensive rail network rights-of-way. Russia is a huge country, and except for a small area in the west (known as continental Europe) the rail network is the main means of transport. It's a bit like the excellent European railways except with huge railcars like in North America. I think that TTC will become the main land route from the Far East into Europe because of this.

Compare this map of the Trans-Baikal region railroad with the Google satellite images of the area:

http://branch.rzd.ru/wps/PA_1_0_M1/FileDownload?vp=2&col_id=121&id=9173

The Unicom/TTC project is coming across the Chinese border on the second spur from the lower right corner. It's actually a cross-border line; the map just doesn't show the Chinese railways. If you go to the 7th level of zoom-in on Google Maps, the first Russian town that shows on the Chinese border (Blagoveshchensk) is where the fibre line will cross.

--Michael Dillon
On Wed, 15 Aug 2007, Stephen Wilcox wrote:
(Check slide 4) - the simple fact was that with something like 7 of 9 cables down, the redundancy is useless. Even if operators maintained N+1 redundancy, which is unlikely for many operators, that would imply 50% of capacity was actually used with 50% spare; however, we see around 78% of capacity was lost. There was simply too much traffic and not enough capacity. IP backbones fail pretty badly when faced with extreme congestion.
Remember the end-to-end principle. IP backbones don't fail with extreme congestion, IP applications fail with extreme congestion. Should IP applications respond to extreme congestion conditions better? Or should IP backbones have methods to predictably control which IP applications receive the remaining IP bandwidth? Similar to the telephone network special information tone -- All Circuits are Busy. Maybe we've found a new use for ICMP Source Quench. Even if the IP protocols recover "as designed," does human impatience mean there is a maximum recovery timeout period before humans start making the problem worse?
On Aug 15, 2007, at 8:35 AM, Sean Donelan wrote:
Or should IP backbones have methods to predictably control which IP applications receive the remaining IP bandwidth? Similar to the telephone network special information tone -- All Circuits are Busy. Maybe we've found a new use for ICMP Source Quench.
Source Quench wouldn't be my favored solution here. What I might suggest is taking TCP SYN and SCTP INIT (or new sessions if they are encrypted or UDP) and put them into a lower priority/rate queue. Delaying the start of new work would have a pretty strong effect on the congestive collapse of the existing work, I should think.
On Wed, 15 Aug 2007, Fred Baker wrote:
On Aug 15, 2007, at 8:35 AM, Sean Donelan wrote:
Or should IP backbones have methods to predictably control which IP applications receive the remaining IP bandwidth? Similar to the telephone network special information tone -- All Circuits are Busy. Maybe we've found a new use for ICMP Source Quench.
Source Quench wouldn't be my favored solution here. What I might suggest is taking TCP SYN and SCTP INIT (or new sessions if they are encrypted or UDP) and put them into a lower priority/rate queue. Delaying the start of new work would have a pretty strong effect on the congestive collapse of the existing work, I should think.
I was joking about Source Quench (missing :-), it's got a lot of problems.

But I think the fundamental issue is who is responsible for controlling the back-off process? The edge or the middle?

Using different queues implies the middle (i.e. routers). At best it might be the "near-edge," and creating some type of shared knowledge between past, current and new sessions in the host stacks (and maybe middle-boxes like NAT gateways).

How fast do you need to signal large-scale back-off over what time period? Since major events in the real-world also result in a lot of "new" traffic, how do you signal new sessions before they reach the affected region of the network? Can you use BGP to signal the far-reaches of the Internet that I'm having problems, and other ASNs should start slowing things down before they reach my region (security can-o-worms being opened).
On Wed, 15 Aug 2007 11:59:54 EDT, Sean Donelan said:
Since major events in the real-world also result in a lot of "new" traffic, how do you signal new sessions before they reach the affected region of the network? Can you use BGP to signal the far-reaches of the Internet that I'm having problems, and other ASNs should start slowing things down before they reach my region (security can-o-worms being opened).
I'm more worried about state getting "stuck", kind of like the total inability of the DHS worry-o-meter to move lower than yellow.
let me answer at least twice.

As you say, remember the end-2-end principle. The end-2-end principle, in my precis, says "in deciding where functionality should be placed, do so in the simplest, cheapest, and most reliable manner when considered in the context of the entire network. That is usually close to the edge." Note the presence of advice and absence of mandate.

Parekh and Gallager, in their 1993 papers on the topic, proved using control theory that if we can specify the amount of data that each session keeps in the network (for some definition of "session") and for each link the session crosses define exactly what the link will do with it, we can mathematically predict the delay the session will experience. TCP congestion control as presently defined tries to manage delay by adjusting the window; some algorithms literally measure delay, while most measure loss, which is the extreme case of delay. The math tells me that the place to control the rate of a session is in the end system. Funny thing, that is found "close to the edge".

What ISPs routinely try to do is adjust routing in order to maximize their ability to carry customer sessions without increasing their outlay for bandwidth. It's called "load sharing", and we have a list of ways we do that, notably in recent years using BGP advertisements. Where Parekh and Gallager calculated what the delay was, the ISP has the option of minimizing it through appropriate use of routing.

i.e., edge and middle both have valid options, and the totality works best when they work together. That may be heresy, but it's true. When I hear my company's marketing line on intelligence in the network (which makes me cringe), I try to remind my marketing folks that the best use of intelligence in the network is to offer intelligent services to the intelligent edge that enable the intelligent edge to do something intelligent. But there is a place for intelligence in the network, and routing is its poster child.

In your summary of the problem, the assumption is that both of these are operative and have done what they can - several links are down, the remaining links (including any rerouting that may have occurred) are full to the gills, TCP is backing off as far as it can back off, and even so, due to high loss, little if anything productive is in fact happening. You're looking for a third "thing that can be done" to avoid congestive collapse, which is the case in which the network or some part of it is fully utilized and yet accomplishing no useful work.

So I would suggest that a third thing that can be done, after the other two avenues have been exhausted, is to decide to not start new sessions unless there is some reasonable chance that they will be able to accomplish their work. This is a burden I would not want to put on the host, because the probability is vanishingly small - any competent network operator is going to solve the problem with money if it is other than transient. But from where I sit, it looks like the "simplest, cheapest, and most reliable" place to detect overwhelming congestion is at the congested link, and given that sessions tend to be of finite duration and present semi-predictable loads, if you want to allow established sessions to complete, you want to run the established sessions in preference to new ones. The thing to do is delay the initiation of new sessions.
If I had an ICMP that went to the application, and if I trusted the application to obey me, I might very well say "dear browser or p2p application, I know you want to open 4-7 TCP sessions at a time, but for the coming 60 seconds could I convince you to open only one at a time?". I suspect that would go a long way. But there is a trust issue - would enterprise firewalls let it get to the host, would the host be able to get it to the application, would the application honor it, and would the ISP trust the enterprise/host/application to do so? is ddos possible? <mumble>

So plan B would be to in some way rate limit the passage of TCP SYN/SYN-ACK and SCTP INIT in such a way that the hosed links remain fully utilized but sessions that have become established get acceptable service (maybe not great service, but they eventually complete without failing).
[...Lots of good stuff deleted to get to this point...] On Wed, 15 Aug 2007, Fred Baker wrote:
So I would suggest that a third thing that can be done, after the other two avenues have been exhausted, is to decide to not start new sessions unless there is some reasonable chance that they will be able to accomplish their work. This is a burden I would not want to put on the host, because the probability is vanishingly small - any competent network operator is going to solve the problem with money if it is other than transient. But from where I sit, it looks like the "simplest, cheapest, and most reliable" place to detect overwhelming congestion is at the congested link, and given that sessions tend to be of finite duration and present semi-predictable loads, if you want to allow established sessions to complete, you want to run the established sessions in preference to new ones. The thing to do is delay the initiation of new sessions.
I view this as part of the flash crowd family of congestion problems, a combination of a rapid increase in demand and a rapid decrease in capacity. But instead of targeting a single destination, the impact is across multiple networks in the region.

In the flash crowd cases (including DDOS variations), the place to respond (Note: the word change from "detect" to "respond") to extreme congestion does not seem to be at the congested link, but several hops upstream of the congested link. Current "effective practice" seems to be 1-2 ASNs away from the congested/failure point, but that may just also be the distance to reach "effective" ISP backbone engineer response.
If I had an ICMP that went to the application, and if I trusted the application to obey me, I might very well say "dear browser or p2p application, I know you want to open 4-7 TCP sessions at a time, but for the coming 60 seconds could I convince you to open only one at a time?". I suspect that would go a long way. But there is a trust issue - would enterprise firewalls let it get to the host, would the host be able to get it to the application, would the application honor it, and would the ISP trust the enterprise/host/application to do so? is ddos possible? <mumble>
For the malicious DDOS, of course we don't expect the hosts to obey. However, in the more general flash crowd case, I think the expectation of hosts following the RFC is pretty strong, although it may take years for new things to make it into the stacks. It won't slow down all the elephants, but maybe it can turn the stampede into just a rampage. And the advantage of doing it in the edge host is that their scale grows with the Internet.

But even if the hosts don't respond to the back-off, it would give the edge more in-band trouble-shooting information. For example, ICMP "Destination Unreachable - Load shedding in effect. Retry after "N" seconds" (where N is stored like the Next-Hop MTU). Sending more packets to signal congestion just makes congestion worse. However, having an explicit Internet "busy signal" is mostly to help network operators, because firewalls will probably drop those ICMP messages just like PMTU.
So plan B would be to in some way rate limit the passage of TCP SYN/SYN-ACK and SCTP INIT in such a way that the hosed links remain fully utilized but sessions that have become established get acceptable service (maybe not great service, but they eventually complete without failing).
This would be a useful plan B (or plan F - when things are really FUBARed), but I still think you need a way to signal it upstream 1 or 2 ASNs from the Extreme Congestion to be effective. For example, BGP says for all packets for network w.x.y.z with community a, implement back-off queue plan B. Probably not a queue per network in backbone routers, just one alternate queue plan B for all networks with that community. Once the origin ASN feels things are back to "normal," they can remove the community from their BGP announcements.

But what should the alternate queue plan B be? Probably not fixed capacity numbers, but a distributed percentage across different upstreams.

  Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc)        1% queue
  Datagram protocol packets (UDP, ICMP, GRE, etc)                        20% queue
  Session protocol established/finish packets (TCP ACK/FIN, etc)      normal queue

That values session oriented protocols more than datagram oriented protocols during extreme congestion. Or would it be better to let the datagram protocols fight it out with the session oriented protocols, just like normal Internet operations

  Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc)        1% queue
  Everything else (UDP, ICMP, GRE, TCP ACK/FIN, etc)                  normal queue

And finally why only do this during extreme congestion? Why not always do it?
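Purely as an illustration of the classification in queue plan B above (the packet representation, field names, and queue labels are invented for the sketch; this is not router configuration):

    from dataclasses import dataclass

    @dataclass
    class Packet:
        proto: str      # "tcp", "sctp", "udp", "icmp", "gre", ...
        flags: set      # e.g. {"SYN"}, {"SYN", "ACK"}, {"ACK", "FIN"}, set()

    SESSION_PROTOS = {"tcp", "sctp"}
    START_FLAGS = {"SYN", "INIT"}

    def queue_for(pkt):
        # Map a packet onto the sketched plan B queues.
        if pkt.proto in SESSION_PROTOS and pkt.flags & START_FLAGS:
            return "start-1pct"        # new sessions get a tiny share under congestion
        if pkt.proto not in SESSION_PROTOS:
            return "datagram-20pct"    # UDP, ICMP, GRE and friends
        return "normal"                # established/finishing sessions

    print(queue_for(Packet("tcp", {"SYN"})))           # start-1pct
    print(queue_for(Packet("udp", set())))             # datagram-20pct
    print(queue_for(Packet("tcp", {"ACK", "FIN"})))    # normal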
On Aug 15, 2007, at 8:39 PM, Sean Donelan wrote:
Or would it be better to let the datagram protocols fight it out with the session oriented protocols, just like normal Internet operations
Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc)        1% queue
Everything else (UDP, ICMP, GRE, TCP ACK/FIN, etc)                  normal queue
And finally why only do this during extreme congestion? Why not always do it?
I think I would always do it, and expect it to take effect only under extreme congestion.

On Aug 15, 2007, at 8:39 PM, Sean Donelan wrote:
On Wed, 15 Aug 2007, Fred Baker wrote:
So I would suggest that a third thing that can be done, after the other two avenues have been exhausted, is to decide to not start new sessions unless there is some reasonable chance that they will be able to accomplish their work.
I view this as part of the flash crowd family of congestion problems, a combination of a rapid increase in demand and a rapid decrease in capacity.
In many cases, yes. I know of a certain network that ran with 30% loss for a matter of years because the option didn't exist to increase the bandwidth. When it became reality, guess what they did. That's when I got to thinking about this.
On Wed, Aug 15, 2007, Fred Baker wrote:
And finally why only do this during extreme congestion? Why not always do it?
I think I would always do it, and expect it to take effect only under extreme congestion.
Well, empirically (on multi-megabit customer-facing links) it takes effect immediately and results in congestion being "avoided" (for values of avoided). You don't hit a "hm, this is fine" and "hm, this is congested"; you actually notice a much smoother performance degradation right up to 95% constant link use.

Another thing that I've done on DSL links (and this was spawned by some of Tony Kapela's NANOG stuff) is to actually rate limit TCP SYN, UDP DNS, ICMP, etc., but what I noticed was that during periods of 90+% load, TCP connections could still be established and slowly progress forward; what really busted up stuff was various P2P stuff. By also rate-limiting per-user TCP connection establishment (doing per-IP NAT maximum session counts, all in 12.4 on little Cisco 800s) the impact on bandwidth-hoggy applications was immediate. People were also very happy that their links were suddenly magically usable.

I know a lot of these tricks can't be played on fat trunks (fair queueing on 10Gig?) as I just haven't touched the equipment, but my experience in enterprise switching environments with the Cisco QoS koolaid really does show congestion doesn't have to destroy performance. (Hm, an Ixia or two and a 7600 would be useful right about now.)

Adrian
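A toy sketch of the per-user session cap Adrian mentions (the cap value and data structures are invented for illustration; this does not reproduce actual IOS NAT behaviour):

    from collections import defaultdict

    MAX_SESSIONS_PER_IP = 10                   # assumed cap, illustration only
    active = defaultdict(set)                  # source IP -> set of (dst, dport)

    def admit_new_session(src, dst, dport):
        # Refuse the new flow if the source already holds its full quota.
        if len(active[src]) >= MAX_SESSIONS_PER_IP:
            return False
        active[src].add((dst, dport))
        return True

    def close_session(src, dst, dport):
        active[src].discard((dst, dport))

    # A P2P-style host trying to open 20 flows is capped at 10;
    # hosts with a handful of sessions are untouched.
    admitted = sum(admit_new_session("198.51.100.7", "203.0.113.1", port)
                   for port in range(6881, 6901))
    print(admitted)                            # 10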
On Aug 15, 2007, at 10:13 PM, Adrian Chadd wrote:
Well, empirically (on multi-megabit customer-facing links) it takes effect immediately and results in congestion being "avoided" (for values of avoided). You don't hit a "hm, this is fine" and "hm, this is congested"; you actually notice a much smoother performance degradation right up to 95% constant link use.
yes, theory says the same thing. It's really convenient when theory and practice happen to agree :-) There is also a pretty good paper by Sue Moon et al in INFOCOM 2004 that looks at the Sprint network (they had special access) and looks at variation in delay pop-2-pop at a microsecond granularity and finds some fairly interesting behavior long before that.
On Wed, 15 Aug 2007, Fred Baker wrote:
On Aug 15, 2007, at 8:39 PM, Sean Donelan wrote:
On Wed, 15 Aug 2007, Fred Baker wrote:
So I would suggest that a third thing that can be done, after the other two avenues have been exhausted, is to decide to not start new sessions unless there is some reasonable chance that they will be able to accomplish their work.
I view this as part of the flash crowd family of congestion problems, a combination of a rapid increase in demand and a rapid decrease in capacity.
In many cases, yes. I know of a certain network that ran with 30% loss for a matter of years because the option didn't exist to increase the bandwidth. When it became reality, guess what they did. That's when I got to thinking about this.
Yeah, necessity is always the mother of invention. I first tried rate limiting the TCP SYNs with the Starr/Clinton report. It worked great for a while, but then the SYN flood started backing up not only on the "congested" link, but also started congesting the other peering networks (those were the days of OC3 backbones and head-of-line blocking NAP switches). And then the server choked....

So that's why I keep returning to the need to pushback traffic a couple of ASNs back. If its going to get dropped anyway, drop it sooner. It's also why I would really like to try to do something about the woodpecker hosts that think congestion means try more. If the back-off slows down the host re-trying, it's even further pushback.
An "Internet variable speed limit" is a nice idea, but there are some serious trust issues; applications have to trust the network implicitly not to issue gratuitous slow down messages, and certainly not to use them for evil purposes (not that I want to start a network neutrality flamewar...but what with the AT&T/Pearl Jam row, it's not hard to see rightsholders/telcos/government/alien space bats leaning on your upstream to spoil your access to content X). Further, you're going to need *very good* filtration; necessary to verify the source of any such packets closely due to the major DOS potential. Scenario: Bad Guy controls some hacked machines on AS666 DubiousNet, who peer at AMS-IX. Bad Guy has his bots inject a mass of "slow down!" packets with a faked source address taken from the IX's netblock...and everything starts moving Very Slowly. Especially if the suggestion upthread that the slowdown ought to be implemented 1-2 AS away from the problem is implemented, which would require forwarding the slowdowns between networks. It has some similarities with the Chinese firewall's use of quick TCP RSTs to keep users from seeing Bad Things; in that you could tell your machine to ignore'em. There's a sort of tragedy of the commons problem - if everyone agrees to listen to the slowdown requests, it will work, but all you need is a significant minority of the irresponsible, and there'll be no gain in listening to them.
On Thu, Aug 16, 2007 at 10:55:34AM +0100, Alexander Harrowell wrote:
An "Internet variable speed limit" is a nice idea, but there are some serious trust issues; applications have to trust the network implicitly not to issue gratuitous slow down messages, and certainly not to use them for evil purposes (not that I want to start a network neutrality flamewar...but what with the AT&T/Pearl Jam row, it's not hard to see rightsholders/telcos/government/alien space bats leaning on your upstream to spoil your access to content X).
Further, you're going to need *very good* filtration; necessary to verify the source of any such packets closely due to the major DOS potential. Scenario: Bad Guy controls some hacked machines on AS666 DubiousNet, who peer at AMS-IX. Bad Guy has his bots inject a mass of "slow down!" packets with a faked source address taken from the IX's netblock...and everything starts moving Very Slowly. Especially if the suggestion upthread that the slowdown ought to be implemented 1-2 AS away from the problem is implemented, which would require forwarding the slowdowns between networks.
It has some similarities with the Chinese firewall's use of quick TCP RSTs to keep users from seeing Bad Things; in that you could tell your machine to ignore'em. There's a sort of tragedy of the commons problem - if everyone agrees to listen to the slowdown requests, it will work, but all you need is a significant minority of the irresponsible, and there'll be no gain in listening to them.
Sounds a lot like MEDs - something you have to trust an unknown upstream to send you, of dubious origin, making unknown changes to performance on your network. And also like MEDs, whilst it may work for some, it won't for others: a DSL provider may try to control input, but a CDN will want to ignore them to maximise throughput and revenue.

Steve
On Thu, 16 Aug 2007, Alexander Harrowell wrote:
An "Internet variable speed limit" is a nice idea, but there are some serious trust issues; applications have to trust the network implicitly not to issue gratuitous slow down messages, and certainly not to use them for
Yeah, that's why I was limiting the need (requirement) to only 1-few ASN hops upstream. I view this as similar to some backbones offering a special blackhole everything BGP community that usually is not transitive. This is the Oh Crap, Don't Blackhole Everything but Slow Stuff Down BGP community.
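A toy sketch of the control-plane side of that idea (the community value, route representation, and queue-plan names are invented; no such standard community exists):

    # Hypothetical "slow stuff down" community, by analogy with blackhole communities.
    SLOWDOWN_COMMUNITY = "64496:911"           # invented value, illustration only

    received_routes = [
        {"prefix": "203.0.113.0/24", "communities": {"64496:911"}},
        {"prefix": "198.51.100.0/24", "communities": set()},
    ]

    def queue_plan(route):
        # Traffic toward a prefix tagged by its origin gets the back-off treatment.
        if SLOWDOWN_COMMUNITY in route["communities"]:
            return "plan-B"
        return "normal"

    for route in received_routes:
        print(route["prefix"], queue_plan(route))
    # 203.0.113.0/24 plan-B
    # 198.51.100.0/24 normal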
Further, you're going to need *very good* filtration; necessary to verify the source of any such packets closely due to the major DOS potential. Scenario: Bad Guy controls some hacked machines on AS666 DubiousNet, who peer at AMS-IX. Bad Guy has his bots inject a mass of "slow down!" packets with a faked source address taken from the IX's netblock...and everything starts moving Very Slowly. Especially if the suggestion upthread that the slowdown ought to be implemented 1-2 AS away from the problem is implemented, which would require forwarding the slowdowns between networks.
For the ICMP packet, protecting against man-in-the-middle attacks is really no different from the validation required for any other protocol. For most protocols, you "should" get at least 64 bytes back of the original packet in the ICMP error message. You "should" be validating everything against what you sent. Be conservative in what you send, be suspicious in what you receive.
It has some similarities with the Chinese firewall's use of quick TCP RSTs to keep users from seeing Bad Things; in that you could tell your machine to ignore'em. There's a sort of tragedy of the commons problem - if everyone agrees to listen to the slowdown requests, it will work, but all you need is a significant minority of the irresponsible, and there'll be no gain in listening to them.
Penalty box, penalty box. Yeah, this is always the argument. But as we've seen with TCP, most host stacks try (more or less) to follow the RFCs. Why implement any TCP congestion management?
Yeah, that's why I was limiting the need (requirement) to only 1-few ASN hops upstream. I view this as similar to some backbones offering a special blackhole everything BGP community that usually is not transitive. This is the Oh Crap, Don't Blackhole Everything but Slow Stuff Down BGP community.
and the two hops upstream but not the source router spools the packets to the hard drive? randy
On 8/16/07, Randy Bush <randy@psg.com> wrote:
Yeah, that's why I was limiting the need (requirement) to only 1-few ASN hops upstream. I view this as similar to some backbones offering a special blackhole everything BGP community that usually is not transitive. This is the Oh Crap, Don't Blackhole Everything but Slow Stuff Down BGP community.
and the two hops upstream but not the source router spools the packets to the hard drive?
Ideally you'd want to influence the endpoint protocol stack, right? (Which brings us to the user trust thing.)
Alexander Harrowell wrote:
Yeah, that's why I was limiting the need (requirement) to only 1-few ASN hops upstream. I view this as similar to some backbones offering a special blackhole everything BGP community that usually is not transitive. This is the Oh Crap, Don't Blackhole Everything but Slow Stuff Down BGP community.

and the two hops upstream but not the source router spools the packets to the hard drive?

Ideally you'd want to influence the endpoint protocol stack, right?
ECN

sally floyd ain't stoopid
On Thu, 16 Aug 2007, Randy Bush wrote:
Alexander Harrowell wrote:
Yeah, that's why I was limiting the need (requirement) to only 1-few ASN hops upstream. I view this as similar to some backbones offering a special blackhole everything BGP community that usually is not transitive. This is the Oh Crap, Don't Blackhole Everything but Slow Stuff Down BGP community.

and the two hops upstream but not the source router spools the packets to the hard drive?

Ideally you'd want to influence the endpoint protocol stack, right?
ECN
sally floyd ain't stoopid
ECN doesn't affect the initial SYN packets. I agree, sally floyd ain't stoopid.
On Wed, 15 Aug 2007, Randy Bush wrote:
So that's why I keep returning to the need to pushback traffic a couple of ASNs back. If its going to get dropped anyway, drop it sooner.
ECN
Oh goody, the whole RED, BLUE, WRED, AQM, etc menagerie. Connections already in progress (i.e. the ones with ECN) we want to keep working and finish. We don't want those connections to abort in the middle, and then add to the congestion when they retry. The phrase everyone is trying to avoid saying is "Admission Control." The Internet doesn't do admission control well (or even badly).
So that's why I keep returning to the need to pushback traffic a couple of ASNs back. If its going to get dropped anyway, drop it sooner.

ECN

Oh goody, the whole RED, BLUE, WRED, AQM, etc menagerie.
wow! is that what ECN stands for? somehow, in all this time, i missed that. live and learn.
Connections already in progress (i.e. the ones with ECN) we want to keep working and finish. We don't want those connections to abort in the middle, and then add to the congestion when they retry.
so the latest version of ECN aborts connections? wow! i am really learning a lot, and it's only the first cup of coffee today. thanks!
The phrase everyone is trying to avoid saying is "Admission Control."
you want "pushback traffic a couple of ASNs back," the actual question i was answering, you are talking admission control. randy
In many cases, yes. I know of a certain network that ran with 30% loss for a matter of years because the option didn't exist to increase the bandwidth. When it became reality, guess what they did.
How many people have noticed that when you replace a circuit with a higher capacity one, the traffic on the new circuit is suddenly greater than 100% of the old one. Obviously this doesn't happen all the time, such as when you have a 40% threshold for initiating a circuit upgrade, but if you do your upgrades when they are 80% or 90% full, this does happen. --Michael Dillon
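A quick numeric illustration of why that happens (the figures are invented): if loss on the old circuit was suppressing demand, the true offered load only shows up once the bottleneck is removed.

    # Invented figures, illustration only (Mb/s).
    old_capacity = 155.0       # e.g. an OC-3
    new_capacity = 622.0       # upgraded to an OC-12
    true_demand = 180.0        # what would be sent if nothing were dropped

    carried_before = min(true_demand, old_capacity)   # TCP backed off to ~155
    carried_after = min(true_demand, new_capacity)    # demand now fits: 180

    print(round(carried_before / old_capacity, 2))    # 1.0  - the old circuit looked "full"
    print(round(carried_after / old_capacity, 2))     # 1.16 - >100% of the old circuit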
How does akamai handle traffic congestion so seamlessly? Perhaps we should look at existing setups implemented by companies such as akamai for guidelines regarding how to resolve this kind of issue...
On Thu, Aug 16, 2007 at 09:07:31AM -0700, Hex Star wrote:
How does akamai handle traffic congestion so seamlessly? Perhaps we should look at existing setups implemented by companies such as akamai for guidelines regarding how to resolve this kind of issue...
and if you are a Content Delivery Network wishing to use a cache deployment architecture you should do just that ... but for networks with big backbones as per this discussion we need to do something else Steve
On Aug 17, 2007, at 6:57 AM, Stephen Wilcox wrote:
On Thu, Aug 16, 2007 at 09:07:31AM -0700, Hex Star wrote:
How does akamai handle traffic congestion so seamlessly? Perhaps we should look at existing setups implemented by companies such as akamai for guidelines regarding how to resolve this kind of issue...
and if you are a Content Delivery Network wishing to use a cache deployment architecture you should do just that ... but for networks with big backbones as per this discussion we need to do something else
Ignoring "Akamai" and looking at just content providers (CDN or otherwise) in general, there is a huge difference between telling a web server "do not serve more than 900 Mbps on your GigE port", and a router which simply gets bits from random sources to be forwarded to random destinations. IOW: Steve is right, those are two different topics. -- TTFN, patrick
On Thu, 16 Aug 2007, michael.dillon@bt.com wrote:
How many people have noticed that when you replace a circuit with a higher capacity one, the traffic on the new circuit is suddenly greater than 100% of the old one. Obviously this doesn't happen all the time, such as when you have a 40% threshold for initiating a circuit upgrade, but if you do your upgrades when they are 80% or 90% full, this does happen.
I'd say this might happen on links connected to devices with small buffers, such as a 7600 with LAN cards, a Foundry device, or the like. If you look at the same behaviour on a deep-packet-buffer device such as a Juniper or a Cisco GSR/CRS-1, the behaviour you're describing doesn't exist (at least not that I have noticed). -- Mikael Abrahamsson email: swmike@swm.pp.se
Mikael Abrahamsson wrote:
On Thu, 16 Aug 2007, michael.dillon@bt.com wrote:
How many people have noticed that when you replace a circuit with a higher capacity one, the traffic on the new circuit is suddenly greater than 100% of the old one. Obviously this doesn't happen all the time, such as when you have a 40% threshold for initiating a circuit upgrade, but if you do your upgrades when they are 80% or 90% full, this does happen.
I'd say this might happen on links connected to devices with small buffers, such as a 7600 with LAN cards, a Foundry device, or the like. If you look at the same behaviour on a deep-packet-buffer device such as a Juniper or a Cisco GSR/CRS-1, the behaviour you're describing doesn't exist (at least not that I have noticed).
Depends on your traffic type and I think this really depends on the granularity of your study set (when you are calculating 80-90% usage). If you upgrade early, or your (shallow) packet buffers convince you to upgrade late, the effects might be different. If you do upgrades assuming the same amount of latency and packet loss on any circuit, you should see the same effect irrespective of buffer depth (for any production equipment by a main vendor). Deeper buffers allow you to run closer to 100% (longer) with fewer packet drops at the cost of higher latency. The assumption being that more congested devices with smaller buffers are dropping some packets here and there and causing those sessions to back off in a way the deeper buffer systems don't. It's a business case whether it's better to upgrade early or buy gear that lets you upgrade later. DJ
On Thu, 16 Aug 2007, Deepak Jain wrote:
Depends on your traffic type and I think this really depends on the granularity of your study set (when you are calculating 80-90% usage). If you upgrade early, or your (shallow) packet buffers convince you to upgrade late, the effects might be different.
My guess is that the value comes from MRTG or the like, i.e. 5-minute average utilization.
If you do upgrades assuming the same amount of latency and packet loss on any circuit, you should see the same effect irrespective of buffer depth. (for any production equipment by a main vendor).
I do not agree. A shallow-buffer device will give you packet loss without any major latency increase, whereas a deep-buffer device will give you latency without packet loss (as most users out there will not have sufficient TCP window size to utilize 300+ ms of latency due to buffering, they will throttle back their usage of the link, and it can stay at 100% utilization without packet loss for quite some time). Yes, these two cases will both enable link utilization to get to 100% on average, and in most cases users will actually complain less, as the packet loss will most likely be less noticeable to them in traceroute than the latency increase due to buffering.

Anyhow, I still consider a congested backbone an operational failure, as one is failing to provide adequate service to the customers. Congestion should happen on the access line to the customer, nowhere else.
Deeper buffers allow you to run closer to 100% (longer) with fewer packet drops at the cost of higher latency. The assumption being that more congested devices with smaller buffers are dropping some packets here and there and causing those sessions to back off in a way the deeper buffer systems don't.
Correct.
It's a business case whether it's better to upgrade early or buy gear that lets you upgrade later.
It depends on your bandwidth cost; if your link is very expensive then it might make sense to use manpower opex and equipment capex to prolong the usage of that link by trying to cram everything you can out of it. In the long run there is of course no way to avoid an upgrade, as users will notice it anyhow. -- Mikael Abrahamsson email: swmike@swm.pp.se
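A back-of-the-envelope way to see the "users with small TCP windows throttle themselves on a deep-buffered link" point: a window-limited flow can move at most one window per round trip, so the extra queueing latency alone caps per-flow throughput. The window size and RTTs below are illustrative, not measurements.

def max_throughput_mbps(window_bytes, rtt_seconds):
    # A window-limited TCP moves at most one window of data per round trip.
    return window_bytes * 8 / rtt_seconds / 1e6

window = 64 * 1024  # a common default receive window without window scaling
for rtt_ms in (30, 100, 300):
    mbps = max_throughput_mbps(window, rtt_ms / 1000.0)
    print(f"{rtt_ms:4d} ms RTT -> at most {mbps:5.2f} Mbit/s per flow")

At 300 ms of buffering-induced RTT a 64 KB window yields under 2 Mbit/s per flow, which is why the flows back off instead of overrunning the link.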
On Aug 16, 2007, at 7:46 AM, <michael.dillon@bt.com> wrote:
In many cases, yes. I know of a certain network that ran with 30% loss for a matter of years because the option didn't exist to increase the bandwidth. When it became reality, guess what they did.
How many people have noticed that when you replace a circuit with a higher capacity one, the traffic on the new circuit is suddenly greater than 100% of the old one. Obviously this doesn't happen all the time, such as when you have a 40% threshold for initiating a circuit upgrade, but if you do your upgrades when they are 80% or 90% full, this does happen.
well, so let's do a thought experiment.

First, that infocomm paper I mentioned says that they measured the variation in delay pop-2-pop at microsecond granularity with hyper-synchronized clocks, and found that with 90% confidence the variation in delay in their particular optical network was less than 1 ms. Also with 90% confidence, they noted "frequent" (frequency not specified, but apparently pretty frequent, enough that one of the authors later worried in my presence about offering VoIP services on it) variations on the order of 10 ms. For completeness, I'll note that they had six cases in a five-hour sample where the delay changed by 100 ms and stayed there for a period of time, but we'll leave that observation for now.

Such spikes are not difficult to explain. If you think of TCP as an on-off function, a wave function with some similarities to a sine wave, you might ask yourself what the sum of a bunch of sine waves with slightly different periods is. It is also a wave function, and occasionally has a very tall peak. The study says that TCP synchronization happens in the backbone. Surprise.

Now, let's say you're running your favorite link at 90% and get such a spike. What happens? The tip of it gets clipped off - a few packets get dropped. Those TCPs slow down momentarily. The more that happens, the more frequently TCPs get clipped and back off. Now you upgrade the circuit and the TCPs stop getting clipped. What happens? The TCPs don't slow down. They use the bandwidth you have made available instead. In your words, "the traffic on the new circuit is suddenly greater than 100% of the old one".

In 1995 at the NGN conference, I found myself on a stage with Phill Gross, then a VP at MCI. He was basically reporting on this phenomenon and apologizing to his audience. MCI had put in an OC-3 network - gee-whiz stuff then - and had some of the links run too close to full before starting to upgrade. By the time they had two OC-3's in parallel on every path, there were some paths with a standing 20% loss rate. Phill figured that doubling the bandwidth again (622 everywhere) on every path throughout the network should solve the problem for that remaining 20% of load, and started with the hottest links. To his surprise, with the standing load > 95% and experiencing 20% loss at 311 MBPS, doubling the rate to 622 MBPS resulted in links with a standing load > 90% and 4% loss. He still needed more bandwidth. After we walked offstage, I explained TCP to him... Yup. That's what happens.

Several folks have commented on p2p as a major issue here. Personally, I don't think of p2p as the problem in this context, but it is an application that exacerbates the problem. Bottom line, the common p2p applications like to keep lots of TCP sessions flowing, and have lots of data to move. Also (and to my small mind this is egregious), they make no use of locality - if the content they are looking for is both next door and half-way around the world, they're perfectly happy to move it around the world. Hence, moving a file into a campus doesn't mean that the campus has the file and will stop bothering you. I'm pushing an agenda in the open source world to add some concept of locality, with the purpose of moving traffic off ISP networks when I can. I think the user will be just as happy or happier, and folks pushing large optics will certainly be.
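The "sum of a bunch of waves with slightly different periods" picture is easy to reproduce numerically. The toy sketch below (flow count, periods, and headroom values are all arbitrary) sums oscillating per-flow rates and reports how often the aggregate exceeds a link provisioned at various multiples of the mean; the occasional tall peak is exactly what gets clipped on a link run hot.

import math

# Toy illustration: many flows whose rates oscillate with slightly different
# periods. All numbers here are arbitrary, chosen only to show the effect.
flows, samples = 50, 100_000
periods = [100.0 + 0.5 * i for i in range(flows)]   # slightly different periods

aggregate = []
for t in range(samples):
    # each flow's rate swings between 0 and 1 arbitrary units
    rate = sum(0.5 * (1 + math.sin(2 * math.pi * t / p)) for p in periods)
    aggregate.append(rate)

mean = sum(aggregate) / samples
for headroom in (1.1, 1.25, 1.5):
    capacity = mean * headroom
    clipped = sum(1 for r in aggregate if r > capacity) / samples
    print(f"capacity = {headroom:.2f} x mean: {clipped:.2%} of samples would be clipped")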
On Thu, 16 Aug 2007, Fred Baker wrote:
world, they're perfectly happy to move it around the world. Hence, moving a file into a campus doesn't mean that the campus has the file and will stop bothering you. I'm pushing an agenda in the open source world to add some concept of locality, with the purpose of moving traffic off ISP networks when I can. I think the user will be just as happy or happier, and folks pushing large optics will certainly be.
With the regular user small TCP window size, you still get a sense of locality as more data during the same time will flow from a source that is closer to you RTT-wise than from one that is far away. We've been pitching the idea to bittorrent tracker authors to include a BGP feed and prioritize peers that are in the same ASN as the user himself, but they're having performance problems already so they're not so keen on adding complexity. If it could be solved better at the client level that might help, but the end user who pays flat rate has little incentive to help the ISP in this case. -- Mikael Abrahamsson email: swmike@swm.pp.se
We've been pitching the idea to bittorrent tracker authors to include a BGP feed and prioritize peers that are in the same ASN as the user himself, but they're having performance problems already so they're not so keen on adding complexity. If it could be solved better at the client level that might help, but the end user who pays flat rate has little incentive to help the ISP in this case.
Many networking stacks have a "TCP_INFO" ioctl that can be used to query for more accurate statistics on how the TCP connection is faring (number of retransmits, TCP's current estimate of the RTT (and jitter), etc). I've always pondered if bittorrent clients made use of this to better choose which connections to prefer and which ones to avoid. I'm unfortunately unsure if Windows has anything similar.

One problem with having clients only getting told about clients that are near to them is that the network starts forming "cliques". Each clique works as a separate network and you can end up with silly things like one clique being full of seeders, and another clique not even having any seeders at all. Obviously this means that a tracker has to send a handful of addresses of clients outside the "clique" network that the current client belongs to.

You want to make hosts talk to people that are close to you, you want to make sure that hosts don't form cliques, and you want something that a tracker can very quickly figure out from information that is easily available to people who run trackers. My thought here was to sort all the IP addresses, and send the next 'n' IP addresses after the client IP as well as some random ones. If we assume that IPs are generally allocated in contiguous groups then this means that clients should be generally at least told about people nearby, and hopefully that these hosts aren't too far apart (at least likely to be within a LIR or RIR). This should be able to be done in O(log n), which should be fairly efficient.
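The "sort the addresses and hand back the next n plus some random ones" idea fits in a few lines. The sketch below is hypothetical (the function name, the near/random split, and the wrap-around behaviour are made up for illustration), but it shows the O(log n) lookup against a tracker's sorted peer list.

import bisect
import ipaddress
import random

def select_peers(sorted_peer_ints, client_ip, near=40, rand=10):
    """sorted_peer_ints: ascending list of peer IPv4 addresses as integers."""
    near = min(near, len(sorted_peer_ints))
    me = int(ipaddress.IPv4Address(client_ip))
    i = bisect.bisect_right(sorted_peer_ints, me)      # O(log n) position lookup
    nearby = sorted_peer_ints[i:i + near]
    if len(nearby) < near:                             # wrap around the sorted list
        nearby += sorted_peer_ints[:near - len(nearby)]
    nearby = [p for p in nearby if p != me]            # don't hand the client itself back
    nearby_set = set(nearby)
    pool = [p for p in sorted_peer_ints if p not in nearby_set and p != me]
    extra = random.sample(pool, min(rand, len(pool)))  # random far peers break up cliques
    return [str(ipaddress.IPv4Address(p)) for p in nearby + extra]

The lookup itself is O(log n) against a list the tracker keeps sorted anyway, and the handful of random far-away peers is what keeps the swarm from splitting into cliques.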
On Sun, 19 Aug 2007, Perry Lorier wrote:
Many networking stacks have a "TCP_INFO" ioctl that can be used to query for more accurate statistics on how the TCP connection is faring (number of retransmits, TCP's current estimate of the RTT (and jitter), etc). I've always pondered if bittorrent clients made use of this to better choose which connections to prefer and which ones to avoid. I'm unfortunately unsure if Windows has anything similar.
Well, by design bittorrent will try to get everything as fast as possible from all peers, so any TCP session giving good performance (often low packet loss and low latency) will thus end up transmitting a lot of the data in the torrent, so by design bittorrent is kind of localised, at least in the sense that it will utilize fast peers more than slower ones and these are normally closer to you.
One problem with having clients only getting told about clients that are near to them is that the network starts forming "cliques". Each clique works as a separate network and you can end up with silly things like one clique being full of seeders, and another clique not even having any seeders at all. Obviously this means that a tracker has to send a handful of addresses of clients outside the "clique" network that the current client belongs to.
The idea we pitched was that of the 50 addresses that the tracker returns to the client, 25 (if possible) should be from the same ASN as the client itself, or a nearby ASN (by some definition). If there are a lot of peers (more than 50) the tracker will return a random set of clients, we wanted this to be not random but 25 of them should be by network proximity (by some definition).
You want to make hosts talk to people that are close to you, you want to make sure that hosts don't form cliques, and you want something that a tracker can very quickly figure out from information that is easily available to people who run trackers. My thought here was to sort all the IP addresses, and send the next 'n' IP addresses after the client IP as well as some random ones. If we assume that IPs are generally allocated in contiguous groups then this means that clients should be generally at least told about people nearby, and hopefully that these hosts aren't too far apart (at least likely to be within a LIR or RIR). This should be able to be done in O(log n), which should be fairly efficient.
Yeah, we discussed that the list of IPs should be sorted (doing insertion sort) in the data structures in the tracker already, so what you're saying is one way of defining proximity that, as you say, would probably be quite efficient. -- Mikael Abrahamsson email: swmike@swm.pp.se
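The ASN-biased variant described above can be sketched the same way. The prefix-to-ASN table below stands in for a real BGP feed, and the 50-peer response with a 25-peer local target is just the illustrative split from the discussion; the function and table names are made up.

import ipaddress
import random

PREFIX_TO_ASN = {   # toy table; in practice this would be built from a BGP feed
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64500,
    ipaddress.ip_network("203.0.113.0/24"): 64501,
}

def asn_of(ip):
    addr = ipaddress.ip_address(ip)
    matches = [net for net in PREFIX_TO_ASN if addr in net]
    if not matches:
        return None
    # longest-prefix match decides the origin ASN
    return PREFIX_TO_ASN[max(matches, key=lambda net: net.prefixlen)]

def select_peers_by_asn(all_peers, client_ip, total=50, local_target=25):
    my_asn = asn_of(client_ip)
    local = [p for p in all_peers if my_asn is not None and asn_of(p) == my_asn]
    picked = random.sample(local, min(local_target, len(local)))
    # fill the rest of the response at random so clients never form pure cliques
    rest = [p for p in all_peers if p not in picked and p != client_ip]
    picked += random.sample(rest, min(total - len(picked), len(rest)))
    return picked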
This is what I eventually upshot.. http://www.telco2.net/blog/2007/08/variable_speed_limits_for_the.html
On Tue, 21 Aug 2007, Alexander Harrowell wrote:
This is what I eventually upshot..
http://www.telco2.net/blog/2007/08/variable_speed_limits_for_the.html
You wrote in your blog:
"The problem is that if there is a major problem, very large numbers of users applications will all try to resend; generating a packet storm and creating even more congestion."
Do you have any data/facts to back up this statement? I'd be very interested to hear them, as I have heard this statement a few times before but it's a contradiction to the way I understand things to work. -- Mikael Abrahamsson email: swmike@swm.pp.se
On Thu, Aug 16, 2007 at 10:55:59PM +0200, Mikael Abrahamsson wrote: [snip]
We've been pitching the idea to bittorrent tracker authors to include a BGP feed and prioritize peers that are in the same ASN as the user himself, but they're having performance problems already so they're not so keen on adding complexity. If it could be solved better at the client level that might help, but the end user who pays flat rate has little incentive to help the ISP in this case.
Some of those maligned middleboxes deployed in last-mile networks don't just throttle, but also use available topology data to optimize locality per the ISPs policies. -- RSUC / GweepNet / Spunk / FnB / Usenix / SAGE
The TCPs don't slow down. They use the bandwidth you have made available instead.
in your words, "the traffic on the new circuit is suddenly greater than 100% of the old one".
Exactly! To be honest, I first encountered this when Avi Freedman upgraded one of his upstream connections from T1 to DS3 and either Avi, or one of his employees mentioned this on inet-access or nanog. So I did a bit of digging and discovered that other people had noticed that TCP traffic tends to be fractal (or multi-fractal) in nature. That means that the peaks which cause this effect are hard to get rid of entirely.
To his surprise, with the standing load > 95% and experiencing 20% loss at 311 MBPS, doubling the rate to 622 MBPS resulted in links with a standing load > 90% and 4% loss. He still needed more bandwidth. After we walked offstage, I explained TCP to him...
That is something that an awful lot of operations and capacity planning people do not understand. They still think in terms of pipes with TCP flavoured water flowing in them. But this is exactly the behavior that you would expect from fractal traffic. The doubled capacity gave enough headroom for some of the peaks to get through, but not enough for all of them. On Ebone in Europe we used to have 40% as our threshold for upgrading core circuits.
I'm pushing an agenda in the open source world to add some concept of locality, with the purpose of moving traffic off ISP networks when I can. I think the user will be just as happy or happier, and folks pushing large optics will certainly be.
When you hear stories like the Icelandic ISP who discovered that P2P was 80% of their submarine bandwidth and promptly implemented P2P throttling, I think that the open source P2P will be driven to it by their user demand. --Michael Dillon
On Thu, Aug 16, 2007, michael.dillon@bt.com wrote:
I'm pushing an agenda in the open source world to add some concept of locality, with the purpose of moving traffic off ISP networks when I can. I think the user will be just as happy or happier, and folks pushing large optics will certainly be.
When you hear stories like the Icelandic ISP who discovered that P2P was 80% of their submarine bandwidth and promptly implemented P2P throttling, I think that the open source P2P will be driven to it by their user demand.
.. or we could start talking about how Australian ISPs are madly throttling P2P traffic. Not just because of its impact on international trunks, but their POP/wholesale DSL infrastructure method just makes P2P even between clients on the same ISP mostly horrible. Adrian
On 8/17/07, Adrian Chadd <adrian@creative.net.au> wrote:
On Thu, Aug 16, 2007, michael.dillon@bt.com wrote:
I'm pushing an agenda in the open source world to add some concept of locality, with the purpose of moving traffic off ISP networks when I can. I think the user will be just as happy or happier, and folks pushing large optics will certainly be.
This is badly needed in my humble opinion; regarding the wireless LAN case described, it's true that this behaviour would be technically suboptimal, but interestingly the real reason for implementing it would be maintained - economics. After all, the network operator (the owner of the wireless LAN) isn't consuming any more upstream as a result.
When you hear stories like the Icelandic ISP who discovered that P2P was 80% of their submarine bandwidth and promptly implemented P2P throttling, I think that the open source P2P will be driven to it by their user demand.
Yes. An important factor in future design will be "network friendliness/responsibility".
.. or we could start talking about how Australian ISPs are madly throttling P2P traffic. Not just because of its impact on international trunks, but their POP/wholesale DSL infrastructure method just makes P2P even between clients on the same ISP mostly horrible.
Similar to the pre-LLU, BT IPStream ops in the UK. Charging flat rates to customers and paying per-bit to wholesalers is an obvious economic problem; possibly even more expensive to localise the p2p traffic, if the price of wholesale access bits is greater than peering/transit ones!
Hey Sean, On Wed, Aug 15, 2007 at 11:35:43AM -0400, Sean Donelan wrote:
On Wed, 15 Aug 2007, Stephen Wilcox wrote:
(Check slide 4) - the simple fact was that with something like 7 of 9 cables down the redundancy is useless .. even if operators maintained N+1 redundancy, which is unlikely for many operators, that would imply 50% of capacity was actually used with 50% spare.. however we see around 78% of capacity is lost. There was simply too much traffic and not enough capacity.. IP backbones fail pretty badly when faced with extreme congestion.
Remember the end-to-end principle. IP backbones don't fail with extreme congestion, IP applications fail with extreme congestion.
Hmm I'm not sure about that... a 100% full link dropping packets causes many problems:
L7: Applications stop working, humans get angry
L4: TCP/UDP drops cause retransmits, connection drops, retries etc
L3: BGP sessions drop, OSPF hellos are lost.. routing fails
L2: STP packets dropped.. switching fails
I believe any or all of the above could occur on a backbone which has just failed massively and now has 20% capacity available, such as occurred in SE Asia
Should IP applications respond to extreme congestion conditions better? alert('Connection dropped') "Ping timed out"
kinda icky but its not the applications job to manage the network
Or should IP backbones have methods to predictably control which IP applications receive the remaining IP bandwidth? Similar to the telephone network special information tone -- All Circuits are Busy. Maybe we've found a new use for ICMP Source Quench.
Yes and no.. for a private network perhaps, but for the Internet backbone where all traffic is important (right?), differentiation is difficult unless applied at the edge, and when you have major failure and congestion I don't see what you can do that will have any reasonable effect. Perhaps you are a government contractor and you reserve some capacity for them and drop everything else, but what is really out there as a solution?

FYI, I have seen telephone networks fail badly under extreme congestion. COs have small CPUs that don't do a whole lot - set up calls, send busy signals .. once a call is in place it doesn't occupy CPU time as the path is locked in place elsewhere. However, if something occurs to cause a serious number of busy circuits then CPU usage goes through the roof and you can cause cascade failures of whole COs.

Telcos look to solutions such as call gapping to intervene when they anticipate major congestion, and not rely on the network to handle it.
Even if the IP protocols recover "as designed," does human impatience mean there is a maximum recovery timeout period before humans start making the problem worse?
I'm not sure they were designed to do this.. the ARPANET wasn't intended to be massively congested.. the redundant links were in place to cope with loss of a node and usage was manageable. Steve
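Call gapping mentioned above is essentially admission control applied at call setup: attempts beyond a configured rate are refused cheaply (busy tone) instead of being queued, so the CO's CPU never sees the retry pile-up. A token-bucket style sketch, with a made-up gap interval and burst size:

import time

class CallGap:
    """Token-bucket admission control for setup attempts; numbers are illustrative."""
    def __init__(self, gap_seconds=0.5, burst=1):
        self.gap = gap_seconds            # one new setup allowed per gap interval
        self.burst = burst                # short bursts up to this size tolerated
        self.tokens = float(burst)
        self.last = time.monotonic()

    def admit(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) / self.gap)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True      # process this setup attempt
        return False         # refuse immediately; do not queue the work

The refused attempts cost almost nothing, which is the whole point: the scarce resource only ever sees work it can actually finish.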
Congestion and applications... My opinion:

A tier 1 provider does not care what traffic it carries. That is all a function of the application, not the network. A tier 2 provider may do traffic shaping, etc. A tier 3 provider may decide to block traffic patterns.

------------------------------

More or less... The network was intended to move data from one machine to another... The less manipulation in the middle the better... No manipulation of the payload is the name of the game. That being said, it's entirely a function of the application to time out and drop out-of-order packets, etc. ONS is designed around this principle. In streaming data... often it is better to get bad or missing data than to try and put out-of-order or bad data in the buffer... A good example is digital over-the-air TV... If you didn't build in enough error correction... then you'll have digital breakup, etc. It is impossible to recover any of that data. If reliable transport of data is required... that is a function of the application. ONS is an Optical Networking Standard in the development stage.

-Chiloe Temuco
Is this a declaration of principles? There is no reason why 'Tier 1' means that the carrier will not have an incentive to shape or even block traffic. Particularly, if they have a lot of eyeballs.

Roderick S. Beck
Director of EMEA Sales
Hibernia Atlantic
1, Passage du Chantier, 75012 Paris
http://www.hiberniaatlantic.com
Wireless: 1-212-444-8829. Landline: 33-1-4346-3209
AOL Messenger: GlobalBandwidth
rod.beck@hiberniaatlantic.com rodbeck@erols.com
``Unthinking respect for authority is the greatest enemy of truth.'' Albert Einstein.
participants (20)
- Adrian Chadd
- Alexander Harrowell
- Andy Davidson
- Chengchen Hu
- Chiloé Temuco
- Deepak Jain
- Fred Baker
- Hex Star
- Joe Provo
- Joel Jaeggli
- michael.dillon@bt.com
- Mikael Abrahamsson
- Patrick W. Gilmore
- Perry Lorier
- Randy Bush
- Rod Beck
- Roland Dobbins
- Sean Donelan
- Stephen Wilcox
- Valdis.Kletnieks@vt.edu