On Fri, 6 Sep 2002 sgorman1@gmu.edu wrote:
You also have the problem of cascading failures. Just because there are redundant paths and alternate peering locations does not mean those facilities have the bandwidth to handle all the redirected traffic. If A gets swamped you go to B; if the redirected traffic is too much for B then you go to C, and so on. Each time the amount of traffic increases and the available bandwidth decreases. According to the analysis I've seen and run on the Baltimore incident, this is the gist of how a few cut lines rippled across the Internet. I would think Alex's scenario would have a bigger impact than that incident.
For some reason, I guess because Baltimore is near Washington DC, this incident seems to have captured the imagination of folks in Washington DC. Although some brand-name providers were affected by this incident, it had minimal impact on other providers. Essentially every major Internet exchange point has failed at one time or another. In the past, there have been simultaneous failures in at least three different locations.
The problem with your analysis is that's not what happens on the Internet. One of the current issues in Internet traffic engineering is that traffic doesn't roll over to alternate paths B or C when the primary path A is congested. That kind of overflow routing is a traditional design in the switched telephone network, but it is not common in the Internet. Internet traffic tends to follow the "best" available route.
Unlike phone calls, TCP traffic doesn't occur in fixed bandwidth increments. TCP traffic, 90% of Internet traffic, is elastic. By design, TCP adjusts the traffic rate to keep the bottleneck congested. As the bottleneck moves, traffic reacts by increasing or decreasing the rate to match the available capacity. This feedback occurs independently of what is happening on nearby traffic paths. Even if there is available capacity elsewhere, the current Internet design is not very good at using it. Some people view this as an inefficient use of available capacity; other people view it as a self-protective mechanism.
In today's Internet, the type of cascading failure you postulated probably won't happen. The design goal of the Internet is not to keep every part of the network operating under every condition, but to ensure that failures in one part of the network do not disrupt other parts. That's why during the Baltimore train tunnel incident you saw some providers with severe problems in parts of their networks, while other providers didn't experience any slowdowns in theirs. I wouldn't be surprised if a few people even experienced an improvement in their traffic that day.
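To make the "elastic" point concrete, here is a rough toy sketch of additive-increase/multiplicative-decrease, the general style of feedback TCP uses. It is not real TCP and not a model of any actual network: the function name aimd_rates, the round-based timing, and all parameter values are made up for illustration. It only shows the aggregate rate settling near whatever the bottleneck capacity happens to be, and following it when that capacity changes.

```python
def aimd_rates(capacities, flows=4, increase=1.0, decrease=0.5, start=1.0):
    """Toy AIMD model (hypothetical, for illustration only).

    capacities: bottleneck capacity for each round, in arbitrary units.
    Each round, if the flows together exceed the capacity, every flow
    multiplies its rate by `decrease`; otherwise every flow adds `increase`.
    Returns the aggregate sending rate after each round.
    """
    rates = [start] * flows
    history = []
    for capacity in capacities:
        if sum(rates) > capacity:
            # Congestion signal: all flows back off multiplicatively.
            rates = [r * decrease for r in rates]
        else:
            # No congestion: all flows probe for more bandwidth.
            rates = [r + increase for r in rates]
        history.append(sum(rates))
    return history


if __name__ == "__main__":
    # Assumed scenario: the bottleneck shrinks from 100 to 40 units halfway.
    caps = [100.0] * 60 + [40.0] * 60
    for round_no, total in enumerate(aimd_rates(caps)):
        if round_no % 10 == 9:
            print(round_no + 1, caps[round_no], total)
```

Note that nothing in the loop looks at capacity on any other path. The flows only react to the bottleneck they are actually crossing, which is the behavior described above.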
There are vendors trying to sell systems that will "steer" traffic through alternate paths to avoid congestion. In addition, there are efforts like IEPREP that seek to bypass the congestion feedback controls for selected traffic. It is unclear to me what impact these will have on Internet traffic during a crisis. It is possible these improvements will in fact make the Internet more brittle.