Re: Cascading Failures Could Crash the Global Internet
I think the key is that the failures described in the paper are caused by overload rather than other things - too much demand for power blows out the generator, and without it, the grid tries to get the power from the next nearest generators, which overload and fail, and try to pull an even larger amount from the _next_ nearest, etc. So the bit about heterogeneity is probably referring to the fact that some nodes are bigger or better-connected than others, and are more likely to blow out a bunch of their neighbors when they fail and shed a big load.

That's not really how Internet systems usually fail. Overload can cause problems, and we've seen congestion collapse in the past, but TCP is usually tuned to discourage it; when a system is overloaded, well-behaved applications (which is most of them) back off, gradually or rapidly, and unless the load is weird enough to blow out router CPUs or crowd out BGP and OSPF packets, the network itself usually stays up and running. If what's failing is an overload of BGP routes or something, that's different - sometimes the load on the system shrinks as components fail, but sometimes that just makes everything flap all at once, increasing load and delaying convergence.
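To make the overload mechanism described above concrete, here is a toy sketch in Python. It is only an illustration, not the model from the paper: the function name `simulate_cascade`, the even split of a failed node's load among its surviving neighbors, and the four-node ring are all assumptions chosen for simplicity.

```python
from collections import deque


def simulate_cascade(loads, capacities, neighbors, first_failure):
    """Toy overload cascade (hypothetical model, not the paper's).

    When a node fails, its current load is split evenly among its
    still-live neighbors. Any neighbor pushed above its capacity
    fails in turn. Returns the set of failed nodes.
    """
    loads = dict(loads)          # work on a copy
    failed = set()
    pending = deque([first_failure])
    while pending:
        node = pending.popleft()
        if node in failed:
            continue
        failed.add(node)
        live = [n for n in neighbors[node] if n not in failed]
        if not live:
            continue             # nowhere to move the load; it is simply shed
        share = loads[node] / len(live)
        loads[node] = 0.0
        for n in live:
            loads[n] += share
            if loads[n] > capacities[n] and n not in pending:
                pending.append(n)
    return failed


# Four generators in a ring, each carrying 8 units of load.
ring = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
load = {n: 8.0 for n in ring}

tight = {n: 10.0 for n in ring}   # little headroom: losing A takes down everything
roomy = {n: 13.0 for n in ring}   # more headroom: the neighbors absorb A's load

print(sorted(simulate_cascade(load, tight, ring, "A")))
print(sorted(simulate_cascade(load, roomy, ring, "A")))
```

With the tight capacities, losing one node takes down the whole ring; with more headroom, the failure stays local. The contrast with TCP is that a well-behaved sender reduces the load it offers instead of shifting all of it onto the next path.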
--- "Stewart, William C (Bill), SALES" <billstewart@att.com> wrote:
If what's failing is an overload of BGP routes or something, that's different - and sometimes the load on the system shrinks as components fail, but sometimes that just makes everything flap all at once, increasing load and delaying convergence.
I seem to recall a massive routing failure in October which was caused by BGP routes getting imported into a major ISP's IGP... The core ${VENDOR 1} routers were able to handle the influx of routes, but the edge ${VENDOR 2} routers could not - so the failure didn't exactly cascade, but did more of a ripple. However, the reloading of all of the edge devices increased the BGP instability.
-David Barak
-fully RFC 1925 compliant-
From: "Stewart, William C (Bill), SALES"
I think the key is that the failures described in the paper are caused by overload rather than other things - too much demand for power blows out the generator, and without it, the grid tries to get the power from the next nearest generators, which overload and fail, and try to pull an even larger amount from the _next_ nearest, etc. So the bit about heterogeneity is probably referring to the fact that some nodes are bigger or better-connected than others, and are more likely to blow out a bunch of their neighbors when they fail and shed a big load.
That's not really how Internet systems usually fail.
A prime example of this theory was the large network I was using back when IE5 first came out. They had one bad circuit, which overloaded an ATM circuit at another NAP, causing it to generate bit errors. Shutting down the second circuit overloaded both MAE circuits, effectively shutting down the network. However, it required manual intervention to create full failure; otherwise TCP would back off to the point of being useless, effectively killing all connections going that path, but not causing an issue with other paths until the manual intervention of shutting down the circuit.
While in theory it was still a cascade failure, it was also poor planning/policy on the part of the network to not be able to compensate in case of failure. The information provided may be partially inaccurate and is only hearsay concerning actual outages and effects when various interventions were tried; no hard fact. Thus it could be taken as solely my conjecture and not actual fact.
-Jack
Hello;

A packet-switched network can be engineered against cascading failures in a way that's hard for a circuit-switched network. Every time you see a random wait in a protocol, it's a good bet that the protocol writers were trying to protect against the tight coupling that leads to cascading failures.

Regards
Marshall Eubanks

On Sunday, February 9, 2003, at 10:07 AM, Jack Bates wrote:
From: "Stewart, William C (Bill), SALES"
I think the key is that the failures described in the paper are caused by overload rather than other things - too much demand for power blows out the generator, and without it, the grid tries to get the power from the next nearest generators, which overload and fail, and try to pull an even larger amount from the _next_ nearest, etc. So the bit about heterogeneity is probably referring to the fact that some nodes are bigger or better-connected than others, and are more likely to blow out a bunch of their neighbors when they fail and shed a big load.
That's not really how Internet systems usually fail.
A prime example of this theory was the large network I was using back when IE5 first came out. They had one circuit bad which overloaded an ATM circuit at another NAP causing it to generate bit errors. Shutting down the second circuit overloaded both MAE circuits effectively shutting down the network. However, it required manual intervention to create full failure, otherwise TCP would pull back to being useless, effectively killing all connections going that path, but not causing an issue with other paths until the manual intervention of shutting down the circuit.
While in theory it was still a cascade failure, it was also poor planning/policy on the part of the network to not be able to compensate in case of failure. The information provided may be partially inaccurate and is only hearsay concerning actual outages and effects when various interventions were tried; no hard fact. Thus it could be taken as solely my conjecture and not actual fact.
-Jack
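The point above about random waits can be sketched in a few lines of Python. This is a minimal illustration of one common pattern (exponential backoff with "full jitter"), not the timer logic of any particular protocol; the function name `backoff_delay` and the `base` and `cap` defaults are assumptions made for the example.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized wait in seconds before retry number `attempt`.

    The ceiling doubles with each attempt (up to `cap`), and the actual
    wait is drawn uniformly from [0, ceiling], so peers that failed at
    the same moment are unlikely to retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Ten hypothetical peers that all lost a session at the same instant.
rng = random.Random(2003)  # fixed seed only so the demo is repeatable
waits = [backoff_delay(3, rng=rng) for _ in range(10)]
print([round(w, 2) for w in waits])
```

Without the random draw, all ten peers would retry after exactly the same interval and hit the recovering system together, which is the kind of tight coupling the random wait is meant to break.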
participants (4)
- David Barak
- Jack Bates
- Marshall Eubanks
- Stewart, William C (Bill), SALES