[...Lots of good stuff deleted to get to this point...] On Wed, 15 Aug 2007, Fred Baker wrote:
So I would suggest that a third thing that can be done, after the other two avenues have been exhausted, is to decide to not start new sessions unless there is some reasonable chance that they will be able to accomplish their work. This is a burden I would not want to put on the host, because the probability is vanishingly small - any competent network operator is going to solve the problem with money if it is other than transient. But from where I sit, it looks like the "simplest, cheapest, and most reliable" place to detect overwhelming congestion is at the congested link, and given that sessions tend to be of finite duration and present semi-predictable loads, if you want to allow established sessions to complete, you want to run the established sessions in preference to new ones. The thing to do is delay the initiation of new sessions.
I view this as part of the flash crowd family of congestion problems, a combination of a rapid increase in demand and a rapid decrease in capacity. But instead of targeting a single destination, the impact is across multiple networks in the region. In the flash crowd cases (including DDOS variations), the place to respond (Note: the word change from "detect" to "respond") to extreme congestion does not seem to be at the congested link but several hops upstream of the congested link. Current "effective practice" seems to be 1-2 ASNs away from the congested/failure point, but that may also just be the distance needed to reach an "effective" ISP backbone engineer response.
If I had an ICMP that went to the application, and if I trusted the application to obey me, I might very well say "dear browser or p2p application, I know you want to open 4-7 TCP sessions at a time, but for the coming 60 seconds could I convince you to open only one at a time?". I suspect that would go a long way. But there is a trust issue - would enterprise firewalls let it get to the host, would the host be able to get it to the application, would the application honor it, and would the ISP trust the enterprise/host/application to do so? is ddos possible? <mumble>
For the malicious DDOS, of course we don't expect the hosts to obey. However, in the more general flash crowd case, I think the expectation of hosts following the RFC is pretty strong, although it may take years for new things to make it into the stacks. It won't slow down all the elephants, but maybe it can turn the stampede into just a rampage. And the advantage of doing it in the edge host is that its scale grows with the Internet. But even if the hosts don't respond to the back-off, it would give the edge more in-band trouble-shooting information. For example, ICMP "Destination Unreachable - Load shedding in effect. Retry after "N" seconds" (where N is stored like the Next-Hop MTU). Sending more packets to signal congestion just makes congestion worse. However, having an explicit Internet "busy signal" is mostly to help network operators, because firewalls will probably drop those ICMP messages just like PMTU.
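To make the "N is stored like the Next-Hop MTU" idea concrete, here is a rough sketch of what such a message could look like on the wire. It reuses the layout of the existing "fragmentation needed" Destination Unreachable message (type 3, 16 unused bits, then a 16-bit value) and puts the retry interval where the Next-Hop MTU would go. The code value 250 and the function names are my own placeholders; no such ICMP code is assigned, so treat this as an illustration only.

```python
import struct
from typing import Optional

ICMP_DEST_UNREACH = 3      # real ICMP type: Destination Unreachable
CODE_LOAD_SHEDDING = 250   # HYPOTHETICAL code value, not an assigned ICMP code


def internet_checksum(data: bytes) -> int:
    """Ones'-complement sum over 16-bit words (the standard Internet checksum)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_busy_signal(retry_after: int, original: bytes) -> bytes:
    """Build the hypothetical message.

    `original` is the IP header plus leading payload bytes of the packet
    that triggered the message, as in other Destination Unreachable messages.
    """
    if not 0 <= retry_after <= 0xFFFF:
        raise ValueError("retry_after must fit in 16 bits")
    # First pass with a zero checksum, second pass with the real one.
    header = struct.pack("!BBHHH", ICMP_DEST_UNREACH, CODE_LOAD_SHEDDING,
                         0, 0, retry_after)
    csum = internet_checksum(header + original)
    header = struct.pack("!BBHHH", ICMP_DEST_UNREACH, CODE_LOAD_SHEDDING,
                         csum, 0, retry_after)
    return header + original


def parse_retry_after(message: bytes) -> Optional[int]:
    """Return N in seconds, or None if this is not a valid busy signal."""
    if len(message) < 8:
        return None
    icmp_type, code, _csum, _unused, retry_after = struct.unpack(
        "!BBHHH", message[:8])
    if icmp_type != ICMP_DEST_UNREACH or code != CODE_LOAD_SHEDDING:
        return None
    if internet_checksum(message) != 0:  # a valid message sums to zero
        return None
    return retry_after
```

A host stack or a troubleshooting tool would call parse_retry_after() on the received ICMP payload and, if it chose to honor the hint, hold off new sessions to that destination for N seconds.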
So plan B would be to in some way rate limit the passage of TCP SYN/SYN-ACK and SCTP INIT in such a way that the hosed links remain fully utilized but sessions that have become established get acceptable service (maybe not great service, but they eventually complete without failing).
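One simple way to picture that kind of rate limit is a token bucket applied only to session-start packets, with everything else passed untouched. The sketch below is just that picture in a few lines; the class name, the rate and burst parameters, and the idea of passing the clock in explicitly are my assumptions for illustration, not anything a router vendor ships under this name.

```python
class SessionStartLimiter:
    """Token bucket that meters only session-start packets (e.g. TCP SYN).

    Packets belonging to established sessions are always admitted, so the
    link stays full while new sessions are admitted at a bounded rate.
    """

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate = rate_per_sec   # tokens (new sessions) added per second
        self.burst = burst         # maximum tokens that can accumulate
        self.tokens = burst        # start with a full bucket
        self.last = now            # time of the last refill

    def admit(self, is_session_start: bool, now: float) -> bool:
        if not is_session_start:
            return True            # established traffic is never limited here
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False               # drop or delay this session start
```

A rejected SYN would normally just be dropped, so the sender's own retransmission timer does the "delay the initiation of new sessions" part.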
This would be a useful plan B (or plan F - when things are really FUBARed), but I still think you need a way to signal it upstream 1 or 2 ASNs from the Extreme Congestion to be effective. For example, BGP says for all packets for network w.x.y.z with community a, implement back-off queue plan B. Probably not a queue per network in backbone routers, just one alternate queue plan B for all networks with that community. Once the origin ASN feels things are back to "normal," they can remove the community from their BGP announcements.

But what should the alternate queue plan B be? Probably not fixed capacity numbers, but a distributed percentage across different upstreams.

  Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc)     1% queue
  Datagram protocol packets (UDP, ICMP, GRE, etc)                      20% queue
  Session protocol established/finish packets (TCP ACK/FIN, etc)       normal queue

That values session oriented protocols more than datagram oriented protocols during extreme congestion. Or would it be better to let the datagram protocols fight it out with the session oriented protocols, just like normal Internet operations?

  Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc)     1% queue
  Everything else (UDP, ICMP, GRE, TCP ACK/FIN, etc)                   normal queue

And finally why only do this during extreme congestion? Why not always do it?
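For what it's worth, the classification step behind the two queue plans above is small. The sketch below maps a packet to a queue name using only the IP protocol number, the TCP flags, and the first SCTP chunk type. The Packet type, the queue names, and the separate_datagrams switch are placeholders of mine; the 1% and 20% shares are the ones from the lists above, and how a scheduler enforces them is left out.

```python
from dataclasses import dataclass
from typing import Optional

# IP protocol numbers and header constants (standard assigned values).
IPPROTO_ICMP, IPPROTO_TCP, IPPROTO_UDP = 1, 6, 17
IPPROTO_GRE, IPPROTO_SCTP = 47, 132
TCP_SYN = 0x02         # SYN bit in the TCP flags byte (set on SYN and SYN-ACK)
SCTP_CHUNK_INIT = 1    # SCTP INIT chunk type

# Share of link capacity per queue; None means "whatever is left".
QUEUE_SHARE = {"session_start": 0.01, "datagram": 0.20, "normal": None}


@dataclass(frozen=True)
class Packet:
    """Hypothetical, minimal view of the header fields the classifier needs."""
    protocol: int
    tcp_flags: int = 0
    sctp_chunk_type: Optional[int] = None


def classify(pkt: Packet, separate_datagrams: bool = True) -> str:
    """Return the queue name for a packet.

    separate_datagrams=True is the three-queue plan; False is the variant
    where only session starts are singled out and everything else shares
    the normal queue.
    """
    if pkt.protocol == IPPROTO_TCP and pkt.tcp_flags & TCP_SYN:
        return "session_start"
    if pkt.protocol == IPPROTO_SCTP and pkt.sctp_chunk_type == SCTP_CHUNK_INIT:
        return "session_start"
    if separate_datagrams and pkt.protocol in (
            IPPROTO_UDP, IPPROTO_ICMP, IPPROTO_GRE):
        return "datagram"
    return "normal"
```

Flipping separate_datagrams is the whole difference between the two plans, which is part of why the question of whether datagram traffic deserves its own cap can be argued separately from the SYN limit itself.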