RE: IS-IS protocol implementation problem
| I had a bizarre event occur on Thursday night/Friday morning, and this is
| likely the culprit.

Some of your symptoms are consistent with a badly-broken sloshing IGP, notably the drop in traffic load and large numbers of dying TCPs passing through the afflicted network. These are two sides of the same coin: a destination in your network, learned through (e)BGP, is mapped to a next-hop address (typically the interface across which you are talking (e)BGP) and propagated through their network via iBGP. The IGP is used so that each iBGP-talking router knows how to get to each next-hop address. A sloshing IGP will break connectivity between a given router and all the addresses associated with a broken next-hop.

A hypothesis: for each afflicted router, the unreachability of a single next-hop address will cause your ENTIRE network to be unreachable by sources relying upon traffic passing through that router. This may mean a sizeable proportion of their customer base simply could not reach you reliably enough to maintain a TCP connection in equilibrium, or at all. Frequent transition to slow-start due to loss/out-of-order packets *and* a reduction in the overall number of TCP "mice" would severely reduce traffic.

An interesting question, however, is why their iBGP TCP connections would appear to remain functional (you aren't losing eBGP routes) in this sort of mess. Did loopback addresses not come and go, but interface addresses did? (That would be interesting to consider in the face of possible aggregation of interface addresses into the IGP.) Is there significant partitioning because of, for example, AS confederating, mitigating the problem by removing iBGP's need to know about distant loopback addresses, but not distant next-hop addresses?

We are lucky to have what could be a very interesting case study in routing scalability trade-offs. What a pity nothing like outage@sprint.net exists any more, where we might find useful information from the victim provider.
:-( Sean.
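
The next-hop dependency described above can be illustrated with a minimal sketch. It is only a toy model of the reasoning, not any router's actual implementation: the function name `usable_routes`, the prefixes, and the next-hop addresses are all hypothetical, and the IGP is reduced to a plain set of currently reachable addresses.

```python
# Toy model (hypothetical names and addresses): a BGP route is usable only
# while the IGP can reach its next-hop address.

def usable_routes(bgp_routes, igp_reachable):
    """Return the prefixes whose next-hop is currently reachable via the IGP."""
    return {prefix for prefix, next_hop in bgp_routes.items()
            if next_hop in igp_reachable}

# Prefixes learned via eBGP and carried in iBGP, each mapped to the
# next-hop address of the peering interface (documentation ranges only).
bgp_routes = {
    "192.0.2.0/24": "10.0.0.1",
    "198.51.100.0/24": "10.0.0.1",
    "203.0.113.0/24": "10.0.0.2",
}

# Stable IGP: both next-hops are known, so every prefix resolves.
stable = usable_routes(bgp_routes, {"10.0.0.1", "10.0.0.2"})

# Sloshing IGP: 10.0.0.1 has dropped out, so every prefix behind it
# becomes unusable at once, although the BGP routes are still present.
sloshing = usable_routes(bgp_routes, {"10.0.0.2"})

print(sorted(stable))
print(sorted(sloshing))
```

In this sketch the loss of one next-hop removes every prefix that depends on it, which is the "entire network unreachable through that router" effect hypothesised above.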
On Mon, 30 Oct 2000 07:27:41 -0800 smd@clock.org wrote:
We are lucky to have what could be a very interesting case study in routing scalability trade-offs. What a pity nothing like outage@sprint.net exists any more, where we might find useful information from the victim provider. :-(
Indeed. outage@sprint.net still exists, BTW; it just doesn't carry any information other than "A router did/will/might be reloaded".

Regards,
Neil.
--
Neil J. McRae C O L T I N T E R N E T neil@COLT.NET
"In this world there's two kinds of people my friend: Those with loaded guns and those who dig. You dig?"