I agree that there appears to be some underlying problem with the BGP code on the backbone that is delaying route withdrawals beyond a reasonable time.

We ran into a similar problem Wednesday night, when one of our customers started advertising more specifics for our network blocks to another transit provider (who does not filter customer routes). After we shut down the customer's BGP peering, the bogus routes were still in the table an hour later, at which time we started advertising our own more specifics to restore service to our other customers; this led to our unfortunate position in Thursday's CIDR report.

On a possibly related note, when we stopped advertising the more specifics 4 hours later, one of our transit providers (call them X) continued to hold some of the more-specific routes in a _portion_ of their BGP tables, with a next hop pointing to another of our transit providers (call them Y), despite the fact that Y no longer had the more-specific routes anywhere in their tables. This continued to cause a routing loop in X's network (due to the inconsistent routes within their IBGP mesh) for 5 hours as X attempted to isolate the problem. After that point, X's solution was for us to announce more specifics for the affected networks until they could schedule some core router reloads.

These cases seem to point to a problem with BGP route withdrawals that will continue to increase the time it takes to recover from network problems. Perhaps the router vendors would like to comment.

- Doug

 /  Douglas A. Junkins  |  Network Engineering        \
/   Network Engineer    |  NorthWestNet                \
\   junkins@nwnet.net   |  Bellevue, Washington, USA   /
 \  +1-206-649-7419     |                             /

On Sat, 26 Apr 1997, Alex.Bligh wrote:
I suppose it is more fun to criticize policy and NSPs, but it may well be a hole in the BGP protocol, or more likely in implementations in vendors' code [or users' configuration of twiddleable holddown timers].
My (possibly misinformed) understanding was that certain NSPs running Cisco backbones had holddown timers configured to delay withdrawals. Even after 7007 was disconnected, there were 7007 routes still being advertised well over an hour later. I do not believe these NSPs are going to have timers configured for >1hr.
We've seen a problem before where a transit provider (Cisco based) was causing us problems, and we decided to turn them off. They were still advertising our routes an hour later. (Provider unconnected with any in this case). Pulling the session back up and clearing it did not help things.
I'd therefore suggest that your analysis is correct. >80% of the downtime is due either to a protocol bug or a s/w bug somewhere, not NOC failure.
Alex Bligh
Xara Networks