Re: seeing the trees in the forest of confusion

26 Apr 1997

I agree that there appears to be some underlying problem with the BGP code
on the backbone that is delaying route withdrawals beyond a reasonable
time.  We ran into a similar problem Wednesday night where one of our
customers started advertising more specifics for our network blocks to
another transit provider (who does not filter customer routes). After
shutting down the customer's BGP peering, the bogus routes were still in
the table an hour later, at which time we started advertising our own more
specifics to restore service to our other customers -- this led to our
unfortunate position in Thursday's CIDR report.
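
To illustrate why a leaked more specific pulls traffic away from an
aggregate, and why announcing still longer prefixes pulls it back, here is
a minimal Python sketch of longest-prefix-match selection. The prefixes,
the origin labels, and the best_route helper are all hypothetical; the
actual prefix lengths involved are not stated above, and the sketch
assumes our own announcements were longer than the leaked one. Real BGP
best-path selection between equal-length prefixes involves more than this.

```python
import ipaddress


def best_route(table, destination):
    """Return the (network, origin) entry with the longest matching prefix.

    Returns None when no entry in the table covers the destination.
    """
    addr = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in table if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda entry: entry[0].prefixlen)


# Hypothetical prefixes, chosen only for illustration.
aggregate = (ipaddress.ip_network("10.20.0.0/16"), "us")
leaked = (ipaddress.ip_network("10.20.1.0/24"), "customer-leak")

# While the stale leaked route is in the table, it beats our aggregate.
leak_table = [aggregate, leaked]

# Assumed fix: announce two longer prefixes covering the leaked block.
fixed_table = leak_table + [
    (ipaddress.ip_network("10.20.1.0/25"), "us"),
    (ipaddress.ip_network("10.20.1.128/25"), "us"),
]

for name, table in (("during leak", leak_table), ("after fix", fixed_table)):
    net, origin = best_route(table, "10.20.1.5")
    print(f"{name}: 10.20.1.5 -> {net} via {origin}")
```

The cost of the workaround is visible in the table itself: every leaked
block needs extra, longer announcements, which is what inflates the
global table and shows up in the CIDR report.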

On a possibly related note, when we stopped advertising the more specifics
4 hours later, one of our transit providers (call them X) continued to
hold some of the more specific routes in a _portion_ of their BGP tables
with a next hop pointing to another of our transit providers (call them Y)
despite the fact that Y no longer had the more specific routes
anywhere in their tables.  This continued to cause a routing loop in X's
network (due to the inconsistent routes within their IBGP mesh) for 5
hours as X attempted to isolate the problem.  After that point, X's
solution was for us to announce more specifics for the affected networks
until they could schedule some core router reloads. 
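
As a rough illustration of how inconsistent routes within an IBGP mesh
can produce a loop like this, the Python sketch below traces a packet
through a made-up three-router chain A -- B -- C inside X. The topology,
router names, and the trace helper are assumptions for the example, not a
description of X's actual network: A is the exit toward us, C is the exit
toward Y, and only some routers still hold the stale more specific.

```python
def trace(forwarding, start, max_hops=16):
    """Follow per-router next hops for one destination.

    forwarding maps a router name to its next-hop router, or to None when
    the packet leaves the network at that router.
    Returns (path, looped).
    """
    path = [start]
    seen = {start}
    current = start
    for _ in range(max_hops):
        nxt = forwarding[current]
        if nxt is None:
            return path, False  # packet exits the network here
        path.append(nxt)
        if nxt in seen:
            return path, True  # revisited a router: forwarding loop
        seen.add(nxt)
        current = nxt
    return path, True  # hop limit reached, treated as a loop


# Every router agrees: only the aggregate exists, exit is at A.
consistent = {"A": None, "B": "A", "C": "B"}

# A and C still hold the stale more specific (exit toward Y at C),
# while B has already withdrawn it and follows the aggregate toward A.
inconsistent = {"A": "B", "B": "A", "C": None}

print(trace(consistent, "C"))
print(trace(inconsistent, "A"))
```

In the inconsistent case A sends the packet toward C via B, and B sends
it straight back toward A, so traffic entering at A never leaves.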

These cases seem to point to a problem with BGP route withdrawals that will
continue to increase the time it takes to recover from network problems.
Perhaps the router vendors would like to comment.

- Doug

 /  Douglas A. Junkins    |   Network Engineering        \
/   Network Engineer      |   NorthWestNet                \
\   junkins@nwnet.net     |   Bellevue, Washington, USA   /
 \  +1-206-649-7419       |                              /

On Sat, 26 Apr 1997, Alex.Bligh wrote:
...
...
I suppose it is more fun to criticize policy and NSPs, but it
  may well be a hole in the BGP protocol, or more likely
  implementations in vendors' code [or user's implementation
  of twiddleable holddown timers].

My (possibly misinformed) understanding was that certain NSPs running
Cisco backbones had holddown timers configured to delay withdrawals. Even
after 7007 was disconnected, there were 7007 routes still being advertised
well over an hour later. I do not believe these NSPs are going to have
timers configured for >1hr.

We've seen a problem before where a transit provider (Cisco based) was
causing us problems, and we decided to turn them off. They were still
advertising our routes an hour later. (Provider unconnected with any
in this case). Pulling the session back up and clearing it did not
help things.

I'd therefore suggest that your analysis is correct. >80% of the
downtime is due either to a protocol bug or a s/w bug somewhere, not
NOC failure.

Alex Bligh
Xara Networks
