RE: IS-IS protocol implementation problem

30 Oct 2000

      No, I'm a single-AS hosting provider, no confederation.  The more I think
about it, the more I'm convinced that CEF simply stopped working; all my
interfaces were active, and there were no apparent problems with my IGP,
which is OSPF.

I think that major BGP wigginess caused the CEF problem; thanks very much
for your insight. I definitely need to think about it some more.

-----Original Message-----
From: smd@clock.org [mailto:smd@clock.org]
Sent: Monday, October 30, 2000 7:28 AM
To: nanog@merit.edu; rdobbins@netmore.net
Cc: neil@colt.net; sean@donelan.com
Subject: RE: IS-IS protocol implementation problem

| I had a bizarre event occur on Thursday night/Friday morning, and this is
| likely the culprit.

Some of your symptoms are consistent with a badly-broken sloshing
IGP, notably the drop in traffic load and large numbers of dying
TCPs passing through the afflicted network.   These are two sides of
the same coin:  a destination in your network, learned through
(e)BGP is mapped to a next-hop address (typically the interface
across which you are talking (e)BGP) and propagated through their
network via iBGP.  The IGP is used so that each iBGP-talking router
knows how to get to each next-hop address.  A sloshing IGP will 
break connectivity between a given router and all the addresses
associated with a broken next-hop.   A hypothesis: for each
afflicted router, the failure of one next-hop-address to be reachable
will cause your ENTIRE network to be unreachable by sources relying
upon traffic passing through that router.  This may mean a sizeable
proportion of their customer base simply could not reach you 
reliably enough to maintain a TCP connection in equilibrium, or at all.
Frequent transition to slow-start due to loss/out-of-order packets,
*and* a reduction in the overall number of TCP "mice", would severely
reduce traffic.
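
To make the dependency concrete, here is a rough Python sketch of the
two-step lookup described above: a BGP-learned prefix maps to a next-hop
address, and the IGP table must then supply a route to that next-hop.
The prefixes, next-hop address, interface name, and function names are
all made up for illustration (documentation address ranges), and this
is a toy model, not how any particular router or CEF implements it.

```python
import ipaddress

def longest_match(addr, table):
    """Return the value stored under the most specific prefix covering addr, or None."""
    addr = ipaddress.ip_address(addr)
    best = None
    for prefix, value in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return None if best is None else best[1]

def forward(dest, bgp, igp):
    """Resolve dest via BGP to a next-hop, then resolve that next-hop via the IGP."""
    next_hop = longest_match(dest, bgp)
    if next_hop is None:
        return None  # no BGP route at all
    # If the IGP has lost the route to the next-hop, the BGP route is unusable.
    return longest_match(next_hop, igp)

# Hypothetical tables: two BGP prefixes share a single next-hop address.
bgp = {"198.51.100.0/24": "192.0.2.1", "203.0.113.0/24": "192.0.2.1"}
igp_stable = {"192.0.2.0/30": "Serial0"}
igp_sloshing = {}  # the IGP route to the next-hop has been withdrawn

print(forward("198.51.100.7", bgp, igp_stable))
print(forward("198.51.100.7", bgp, igp_sloshing))
```

With the stable IGP table both prefixes resolve to the same outgoing
interface; once the single next-hop route disappears, every prefix
behind it becomes unreachable at once, which is the failure mode
hypothesized above.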

An interesting question, however, is why would their iBGP TCP connections
appear to remain functional (you aren't losing eBGP routes) in this
sort of mess?   Did loopback addresses not come and go, but interface
addresses did? (That would be interesting to consider in the face of
possible aggregation of interface addresses into the IGP).  Is there
significant partitioning because of, for example, AS confederating,
mitigating the problem by removing iBGP's need to know about distant
loopback addresses, but not distant next-hop-addresses?   

We are lucky to have what could be a very interesting case
study in routing scalability trade-offs.  What a pity nothing like
outage@sprint.net exists any more, where we might find useful
information from the victim provider. :-(

	Sean.

rdobbins＠netmore.net
