Let me answer at least twice. As you say, remember the end-to-end principle. The end-to-end principle, in my precis, says "in deciding where functionality should be placed, do so in the simplest, cheapest, and most reliable manner when considered in the context of the entire network. That is usually close to the edge." Note the presence of advice and the absence of mandate.

Parekh and Gallager, in their 1993 papers on the topic, proved using control theory that if we can specify the amount of data that each session keeps in the network (for some definition of "session"), and for each link the session crosses define exactly what the link will do with it, we can mathematically predict the delay the session will experience. (I sketch the form of their bound below.) TCP congestion control as presently defined tries to manage delay by adjusting the window; some algorithms literally measure delay, while most measure loss, which is the extreme case of delay. The math tells me that the place to control the rate of a session is in the end system. Funny thing: that is found "close to the edge".

What ISPs routinely try to do is adjust routing in order to maximize their ability to carry customer sessions without increasing their outlay for bandwidth. It's called "load sharing", and we have a list of ways we do that, notably in recent years using BGP advertisements. Where Parekh and Gallager calculated what the delay was, the ISP has the option of minimizing it through appropriate use of routing. That is, edge and middle both have valid options, and the totality works best when they work together. That may be heresy, but it's true. When I hear my company's marketing line on intelligence in the network (which makes me cringe), I try to remind my marketing folks that the best use of intelligence in the network is to offer intelligent services to the intelligent edge that enable the intelligent edge to do something intelligent. But there is a place for intelligence in the network, and routing is its poster child.

In your summary of the problem, the assumption is that both of these are operative and have done what they can: several links are down, the remaining links (including any rerouting that may have occurred) are full to the gills, TCP is backing off as far as it can back off, and even so, due to high loss, little if anything productive is in fact happening. You're looking for a third "thing that can be done" to avoid congestive collapse, which is the case in which the network or some part of it is fully utilized and yet accomplishing no useful work.

So I would suggest that a third thing that can be done, after the other two avenues have been exhausted, is to decide not to start new sessions unless there is some reasonable chance that they will be able to accomplish their work. This is a burden I would not want to put on the host, because the probability is vanishingly small: any competent network operator is going to solve the problem with money if it is other than transient. But from where I sit, it looks like the "simplest, cheapest, and most reliable" place to detect overwhelming congestion is at the congested link, and given that sessions tend to be of finite duration and present semi-predictable loads, if you want to allow established sessions to complete, you want to run the established sessions in preference to new ones. The thing to do is delay the initiation of new sessions; a rough sketch of that follows below.
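On the Parekh-Gallager result: working from memory rather than from the papers themselves, the end-to-end delay bound they proved for a session policed by a (sigma, rho) leaky bucket and guaranteed a service rate g >= rho across K weighted-fair-queueing hops has this shape:

\[
  D^{*} \;\le\; \frac{\sigma}{g} \;+\; (K-1)\,\frac{L}{g} \;+\; \sum_{m=1}^{K} \frac{L^{(m)}_{\max}}{r_m}
\]

Here sigma is the burst the session is allowed to keep in the network, L is the session's maximum packet size, and L_max^(m) and r_m are the maximum packet size and speed of link m. Even if I have a constant slightly off, the structure is the point: the sigma/g term is governed by the sender's burstiness and rate, which live at the edge, while the per-hop terms are in the operator's hands.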
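And when I say TCP "adjusts the window", I mean the familiar additive-increase/multiplicative-decrease loop. A toy rendering in Python, illustrative of the mechanism rather than of any particular stack:

# Toy AIMD congestion window, counted in MSS-sized segments.
# Real stacks (Reno, CUBIC, Vegas, ...) differ in many details.
class AimdWindow:
    def __init__(self, initial_cwnd=2.0):
        self.cwnd = initial_cwnd  # congestion window, in segments

    def on_ack(self):
        # Additive increase: roughly one segment per round trip,
        # i.e. 1/cwnd per ACK, during congestion avoidance.
        self.cwnd += 1.0 / self.cwnd

    def on_loss(self):
        # Multiplicative decrease: halve the window on a loss signal.
        # Loss is the extreme delay signal; delay-measuring variants
        # back off when the measured RTT starts to climb instead.
        self.cwnd = max(1.0, self.cwnd / 2.0)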
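To make "run established sessions in preference to new ones" concrete, here is the kind of classifier I have in mind at the congested link. This is a sketch under my own assumptions: the packet fields and queue names are invented for illustration, not any router's actual API.

# Sketch: packets that open new sessions (TCP SYN without ACK,
# SCTP INIT) go to a lower-priority queue under congestion, so
# established sessions keep the bandwidth. The scheduler (not shown)
# would serve deferred_q only with capacity normal_q leaves unused,
# keeping the link fully utilized either way.

TCP_SYN = 0x02
TCP_ACK = 0x10
SCTP_INIT_CHUNK = 1

def starts_new_session(pkt):
    if pkt.proto == "tcp":
        # A pure SYN (SYN set, ACK clear) initiates a new connection.
        return bool(pkt.tcp_flags & TCP_SYN) and not (pkt.tcp_flags & TCP_ACK)
    if pkt.proto == "sctp":
        return pkt.first_chunk_type == SCTP_INIT_CHUNK
    return False

def enqueue(pkt, normal_q, deferred_q, link_congested):
    if link_congested and starts_new_session(pkt):
        deferred_q.append(pkt)  # delay the initiation of new sessions
    else:
        normal_q.append(pkt)    # established sessions get preference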
If I had an ICMP that went to the application, and if I trusted the application to obey me, I might very well say "dear browser or p2p application, I know you want to open 4-7 TCP sessions at a time, but for the coming 60 seconds could I convince you to open only one at a time?". I suspect that would go a long way. But there is a trust issue: would enterprise firewalls let it get to the host, would the host be able to get it to the application, would the application honor it, and would the ISP trust the enterprise/host/application to do so? Is DDoS possible? <mumble> So plan B would be to rate-limit the passage of TCP SYN/SYN-ACK and SCTP INIT in some way such that the hosed links remain fully utilized but sessions that have become established get acceptable service (maybe not great service, but they eventually complete without failing). Rough sketches of both plans are appended below the quoted exchange.

On Aug 15, 2007, at 8:59 AM, Sean Donelan wrote:
On Wed, 15 Aug 2007, Fred Baker wrote:
On Aug 15, 2007, at 8:35 AM, Sean Donelan wrote:
Or should IP backbones have methods to predictably control which IP applications receive the remaining IP bandwidth? Similar to the telephone network's special information tone -- "all circuits are busy." Maybe we've found a new use for ICMP Source Quench.
Source Quench wouldn't be my favored solution here. What I might suggest is taking TCP SYN and SCTP INIT (or new sessions if they are encrypted or UDP) and put them into a lower priority/rate queue. Delaying the start of new work would have a pretty strong effect on the congestive collapse of the existing work, I should think.
I was joking about Source Quench (the :-) was missing); it's got a lot of problems.
But I think the fundamental issue is: who is responsible for controlling the back-off process? The edge or the middle?
Using different queues implies the middle (i.e., routers). At best it might be the "near-edge." The alternative is creating some type of shared knowledge among past, current, and new sessions in the host stacks (and maybe middle-boxes like NAT gateways).
How fast do you need to signal large-scale back-off, and over what time period? Since major events in the real world also generate a lot of "new" traffic, how do you signal new sessions before they reach the affected region of the network? Can you use BGP to signal the far reaches of the Internet that I'm having problems, so that other ASNs start slowing things down before traffic reaches my region (opening a security can-o'-worms)?
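As promised, a sketch of plan A. It assumes a purely hypothetical ICMP-style "defer new sessions" advisory that the stack delivers up to the application; no such signal exists today, and the names and the 60-second figure are placeholders:

# Plan A (hypothetical): the application receives a "defer new
# sessions" advisory and throttles its own parallelism for a while.
# The advisory type and its delivery path are invented here.
import time

class ConnectionThrottle:
    def __init__(self, normal_parallel=6):
        self.normal_parallel = normal_parallel  # a browser's usual 4-7
        self.defer_until = 0.0

    def on_defer_advisory(self, defer_seconds=60):
        # The network asks: for the coming defer_seconds, please
        # open only one session at a time.
        self.defer_until = time.time() + defer_seconds

    def max_parallel_connections(self):
        return 1 if time.time() < self.defer_until else self.normal_parallel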
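And a sketch of plan B: meter TCP SYN/SYN-ACK and SCTP INIT through a token bucket at the congested link, while traffic from established sessions passes unmetered. The rates are placeholder numbers, not a recommendation:

# Plan B (sketch): session-opening packets pass only as fast as a
# token bucket allows; everything else is forwarded normally, so the
# link stays fully utilized with traffic from established sessions.
import time

def is_session_opener(pkt):
    # TCP segments with the SYN bit set (SYN or SYN-ACK), or SCTP INIT.
    if pkt.proto == "tcp":
        return bool(pkt.tcp_flags & 0x02)
    return pkt.proto == "sctp" and pkt.first_chunk_type == 1

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # openers admitted per second
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.time()

    def allow(self):
        now = time.time()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

opener_bucket = TokenBucket(rate_per_sec=50, burst=10)  # placeholders

def forward(pkt, send, defer):
    if is_session_opener(pkt) and not opener_bucket.allow():
        defer(pkt)  # hold the opener; it can be queued and retried
    else:
        send(pkt)

Either way, the property we want is the one stated above: the hosed links stay busy, but busy with work that can actually complete.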