seeing the trees in the forest of confusion
The source was isolated within 60 minutes of the problem. The routes propagated for more than 120 minutes after that without the source contributing (yes, the routes with an originating AS path of 7007). This is the interesting problem that no one seems to focus on. I suppose it is more fun to criticize policy and NSPs, but it may well be a hole in the BGP protocol, or more likely implementations in vendors' code [or users' settings of twiddleable holddown timers].

As well, for your question:

] where were all the network people ....

The people who could and did fix this problem were on a bridged (voice) conference call, understanding what was going on, sharing information, and working to resolve the issues. The 24-hour NOC functionality worked quite properly.

There is a balance between informing people what is going on and working to fix the problem. Most people would prefer that the problem be resolved and the postmortem take whatever time it needs, rather than prolonging the incident in order to inform the masses of what actions are being taken. Having intelligent people answer the phone and explain what was going on wouldn't have helped solve the problem, just made people feel better. You could achieve the same feeling by taking a walk outdoors.

This wasn't an Internet outage, and this wasn't a catastrophe. It was a brownout that sporadically hit providers at various strengths. At least one NSP measured its backbone load going down by only 15% during the incident.

The Internet infrastructure did not react as expected to this technical problem. It continued to propagate the bad information when withdrawals should have caught up and canceled the announcements. To make a security analogy: an entity designs a security policy based on the risks it perceives. We openly acknowledged that this could happen and would hurt. The risk assessment was not accurate because the infrastructure did not behave as expected. We did not expect it to hurt for 3 hours. It should have stopped earlier. Why it didn't is the only interesting question left.

-a

] There are already 2 articles on the net about it
]
] http://www.news.com/News/Item/0,4,10083,00.html?latest
] http://www.wired.com/news/technology/story/3442.html
]
] and I am sure there are more to come. It seems too easy for one person/company
] to bring down the net. Yes, we all agree it was some kind of accident, one
] that wasn't supposed to happen BUT I have to ask, where were all the network
] people that should have caught this before it hit the internet full force.
]
] Yes, there is talk about having the RAs setup and to prevent this type of
] thing, but not much else. We put our stuff in the RA when we remember to but I
] don't think that many people look much at them anyways.
]
] Hopefully next time this can be stopped before long, like maybe within 10 - 20
] mins. Maybe it is time for all companies, even the smaller ones, or those that
] use BGP to have and really USE a 24 hour noc. Where there are people there
] that really know about routers and BGP. Maybe as the internet grows, those
] making it grow need a more active role. Sorry, Voice mail doesn't cut it, or a
] busy number when the internet comes down....
]
I suppose it is more fun to criticize policy and NSPs, but it may well be a hole in the BGP protocol, or more likely implementations in vendors' code [or users' settings of twiddleable holddown timers].
My (possibly misinformed) understanding was that certain NSPs running Cisco backbones had holddown timers configured to delay withdrawals. Even after 7007 was disconnected, there were 7007 routes still being advertised well over an hour later. I do not believe these NSPs would have timers configured for more than an hour.

We've seen this problem before: a transit provider (Cisco-based) was causing us problems, and we decided to turn them off. They were still advertising our routes an hour later. (That provider was unconnected with any involved in this case.) Pulling the session back up and clearing it did not help things.

I'd therefore suggest that your analysis is correct: more than 80% of the downtime is due either to a protocol bug or a software bug somewhere, not NOC failure.

Alex Bligh
Xara Networks
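As a rough back-of-the-envelope way to see why configured holddown timers alone are a weak explanation for a 2+ hour cleanup, here is a small Python sketch. The hop count and timer value are invented for illustration; nothing here is taken from any vendor's actual implementation.

    # Hedged back-of-the-envelope sketch, not any vendor's actual algorithm:
    # assume each speaker along an AS path only flushes pending withdrawals
    # every `holddown_seconds`, so the worst-case cleanup compounds per hop.

    def worst_case_withdrawal_delay(hops, holddown_seconds):
        """Worst-case time for a withdrawal to cross `hops` such speakers."""
        return hops * holddown_seconds

    # Example: a 6-hop path where every speaker batches updates every 90 seconds
    # still cleans up in about 9 minutes.
    print(worst_case_withdrawal_delay(6, 90) / 60.0, "minutes")

Even with generous per-hop timers the compounding stays in the tens of minutes, which is consistent with the suspicion above that something other than timer settings kept the 7007 routes alive for more than two hours.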
I agree that there appears to be some underlying problem with the BGP code on the backbone that is delaying route withdrawals beyond a reasonable time.

We ran into a similar problem Wednesday night when one of our customers started advertising more specifics for our network blocks to another transit provider (who does not filter customer routes). After shutting down the customer's BGP peering, the bogus routes were still in the table an hour later, at which time we started advertising our own more specifics to restore service to our other customers -- this led to our unfortunate position in Thursday's CIDR report.

On a possibly related note, when we stopped advertising the more specifics 4 hours later, one of our transit providers (call them X) continued to hold some of the more-specific routes in a _portion_ of their BGP tables with a next hop pointing to another of our transit providers (call them Y), despite the fact that Y no longer had the more-specific routes anywhere in their tables. This continued to cause a routing loop in X's network (due to the inconsistent routes within their IBGP mesh) for 5 hours as X attempted to isolate the problem. After that point, X's solution was for us to announce more specifics for the affected networks until they could schedule some core router reloads.

These cases seem to point to a problem with BGP route withdrawals that will continue to increase the time it takes to recover from network problems. Perhaps the router vendors would like to comment.

- Doug

 / Douglas A. Junkins  | Network Engineering        \
/  Network Engineer    | NorthWestNet                \
\  junkins@nwnet.net   | Bellevue, Washington, USA   /
 \ +1-206-649-7419     |                            /

On Sat, 26 Apr 1997, Alex.Bligh wrote:
I suppose it is more fun to criticize policy and NSPs, but it may well be a hole in the BGP protocol, or more likely implementations in vendors' code [or users' settings of twiddleable holddown timers].
My (possibly misinformed) understanding was that certain NSPs running Cisco backbones had holddown timers configured to delay withdrawals. Even after 7007 was disconnected, there were 7007 routes still being advertised well over an hour later. I do not believe these NSPs would have timers configured for more than an hour.
We've seen this problem before: a transit provider (Cisco-based) was causing us problems, and we decided to turn them off. They were still advertising our routes an hour later. (That provider was unconnected with any involved in this case.) Pulling the session back up and clearing it did not help things.
I'd therefore suggest that your analysis is correct: more than 80% of the downtime is due either to a protocol bug or a software bug somewhere, not NOC failure.
Alex Bligh
Xara Networks
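For readers less steeped in BGP, the reason the bogus more specifics took traffic away (and the reason re-announcing your own more specifics restores it) is plain longest-prefix match. Here is a tiny Python sketch, using documentation prefixes and made-up labels rather than the real routes involved:

    # Illustrative only: documentation prefixes and invented labels, not the
    # actual routes from the incident.
    import ipaddress

    def best_route(dest, table):
        """Pick the most specific (longest-prefix) entry covering dest."""
        matches = [(net, via) for net, via in table if dest in net]
        return max(matches, key=lambda m: m[0].prefixlen)

    dest = ipaddress.ip_address("192.0.2.55")
    table = [
        (ipaddress.ip_network("192.0.2.0/24"), "legitimate aggregate"),
        (ipaddress.ip_network("192.0.2.0/25"), "bogus more specific via the leaking customer"),
    ]
    print(best_route(dest, table))   # the bogus /25 wins despite the valid /24

    # Announcing an even more specific from the legitimate origin pulls
    # traffic back -- the workaround described in the message above.
    table.append((ipaddress.ip_network("192.0.2.0/26"), "legitimate more specific"))
    print(best_route(dest, table))   # the legitimate /26 now wins

The cost of the workaround is exactly what Doug notes: the extra more specifics show up as deaggregation in the CIDR report.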
I agree that there appears to be some underlying problem with the BGP code on the backbone that is delaying route withdrawals beyond a reasonable time. We ran into a similar problem Wednesday night when one of our customers started advertising more specifics for our network blocks to another transit provider (who does not filter customer routes). After shutting down the customer's BGP peering, the bogus routes were still in the table an hour later, at which time we started advertising our own more specifics to restore service to our other customers -- this led to our unfortunate position in Thursday's CIDR report.
Were they in as dampened, as history, or just in as if they were in and had not flapped?

Dampening of more-specific bogus announcements is a problem I'd like to see addressed, since the more general (and correct) routes won't be used if more specifics are dampened.
- Doug
Avi
Avi Freedman wrote:
I agree that there appears to be some underlying problem with the BGP code on the backbone that is delaying route withdrawals beyond a reasonable time. We ran into a similar problem Wednesday night when one of our customers started advertising more specifics for our network blocks to another transit provider (who does not filter customer routes). After shutting down the customer's BGP peering, the bogus routes were still in the table an hour later, at which time we started advertising our own more specifics to restore service to our other customers -- this led to our unfortunate position in Thursday's CIDR report.
Were they in as dampened, as history, or just in as if they were in and had not flapped?

Dampened is what I saw looking at the Digex looking glasses, and some of them had times >1 hour.
Dampening of more-specific bogus announcements is a problem I'd like to see addressed, since the more general (and correct) routes won't be used if more specifics are dampened.

I agree that this is a problem.
Larry Rosenman CyberRamp.net (AS6243)
From Home.
- Doug
Avi
--
Larry Rosenman                    http://www.lerctr.org/~ler
Phone: +1 972-399-0210 (voice)    Internet: ler@lerami.lerctr.org
US Mail: 900 Lake Isle Circle, Irving, TX 75060-7726
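Since dampening keeps coming up, here is a rough sketch of the arithmetic, assuming Cisco-style defaults (15-minute half-life, reuse threshold 750, suppress threshold 2000, roughly 1000 penalty per flap). These parameter values are assumptions for illustration, not a statement about how any particular backbone was configured:

    # Hedged sketch of route-flap dampening decay with assumed defaults;
    # suppression starts once the penalty exceeds 2000 and ends when the
    # exponentially decaying penalty drops below the reuse threshold.
    import math

    HALF_LIFE_MIN = 15.0
    REUSE = 750.0
    PENALTY_PER_FLAP = 1000.0

    def minutes_until_reuse(flaps):
        """Minutes until the accumulated penalty decays below REUSE."""
        penalty = flaps * PENALTY_PER_FLAP
        if penalty <= REUSE:
            return 0.0
        # solve: penalty * 0.5 ** (t / half_life) == REUSE for t
        return HALF_LIFE_MIN * math.log2(penalty / REUSE)

    for flaps in (3, 5, 10):
        print(flaps, "flaps ->", round(minutes_until_reuse(flaps)), "minutes dampened")

With these numbers a prefix that flapped a handful of times during the event stays dampened for roughly 30 to 60 minutes, so dampening times approaching an hour on the looking glasses are plausible. Times well past an hour would suggest either a longer configured max-suppress time or something other than dampening at work, assuming the routers in question behaved anything like this sketch.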
These cases seem to point to a problem with BGP route withdrawals that will continue to increase the time it takes to recover from network problems. Perhaps the router vendors would like to comment.
This seems inappropriate to me. You have just said: "I sat and watched a provider keep routes around long past their being withdrawn, and they didn't know what to do so suggested two kludges: 1) advertising more-specifics and 2) rebooting routers. Could some vendor comment on this problem?"

This is every vendor's worst nightmare. Every vendor necessarily (and rightly so!) provides all users enough rope to hang themselves with. It seems inappropriate for someone who doesn't know what the full story is to call vendors to account. If the provider in question adjusted some knobs and settings so as to cause such a problem, what is the vendor to do? How could the vendor even come close to trying to explain the problem without detailed information about the problems and configurations?

Pessimistically speaking, it seems that there are two ways that this thread could come to a close:

1) People will keep badgering the vendor, and the vendor will come out looking ugly if they cannot account for the problem based on insufficient data.

2) People will all be quiet and stop complaining until the operator(s) in question and vendor(s) have information and communicate it.

2) seems obviously preferable, but I suspect that the people on this list will go for 1), since it will allow everyone to flame and chatter incessantly, increasing NANOG mail volume and everyone's productivity.

If anyone who has seen this problem first hand has detailed technical information to provide, that is of course useful and welcome in this forum. But complaining without having any of the data? What's the point?

--jhawk
If anyone who has seen this problem first hand has detailed technical information to provide, that is of course useful and welcome in this forum.
Thanks, anyway. Why would that help? I already have enough clueless rantings in my mailbox. We sent it to the operator(s) and the vendor(s).

randy
On Sat, 26 Apr 1997, John Hawkinson wrote:
These cases seem to point to a problem with BGP route withdrawals that will continue to increase the time it takes to recover from network problems. Perhaps the router vendors would like to comment.
This seems inappropriate to me.
You have just said: "I sat and watched a provider keep routes around long past their being withdrawn, and they didn't know what to do so suggested two kludges: 1) advertising more-specifics and 2) rebooting routers. Could some vendor comment on this problem?".
Perhaps I should have been more clear about what the provider did during the 5 hours that the routing loop continued in their backbone. It didn't take 5 hours for the provider to identify that there was a problem with the routes in their tables (i.e., a few of the routers in their IBGP mesh had more specifics from Provider Y while most did not). Instead, it took the provider 5 hours to troubleshoot the problem with the router vendor before both agreed that it was a software bug and identified the need to reload some of the routers. The hack of advertising more specifics was used to buy time before reloading the routers, to minimize the impact.
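To make the failure mode concrete, here is a toy Python model of an IBGP mesh holding inconsistent routes. The topology and prefixes are hypothetical, chosen only to show how one router holding a stale more specific while its neighbor holds only the aggregate produces a forwarding loop:

    # Toy model, hypothetical topology: router A still holds a stale /25
    # pointing toward B, while B holds only the /24 whose best exit is via A.
    fib = {
        "A": {"192.0.2.0/25": "B", "192.0.2.0/24": "exit-A"},
        "B": {"192.0.2.0/24": "A"},
    }

    def longest_match(router, covering_prefixes):
        """Most specific prefix this router knows among those covering the destination."""
        known = [p for p in fib[router] if p in covering_prefixes]
        return max(known, key=lambda p: int(p.split("/")[1]))

    def trace(start, covering_prefixes, max_hops=6):
        hops, router = [], start
        for _ in range(max_hops):
            hops.append(router)
            nxt = fib[router][longest_match(router, covering_prefixes)]
            if nxt.startswith("exit"):
                return hops + [nxt]
            router = nxt
        return hops + ["...loop"]

    # A packet for an address covered by both prefixes bounces A -> B -> A -> ...
    print(trace("A", {"192.0.2.0/25", "192.0.2.0/24"}))

Until every router in the mesh agrees on the same set of prefixes, packets for the affected destinations ping-pong between the routers that disagree, which matches the loop described above persisting while X and the vendor debugged.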
This is every vendor's worst nightmare.
Every vendor necessarily (and rightly so!) provides all users enough rope to hang themselves with. It seems inappropriate for someone who doesn't know what the full story is to call vendors to account.
If the provider in question adjusted some knobs and settings so as to cause such a problem, what is the vendor to do?
How could the vendor even come close to trying to explain the problem without detailed information about the problems and configurations?
Pessimistically speaking, it seems that there are two ways that this thread could come to a close:
1) People will keep badgering the vendor and the vendor will come out looking ugly if they cannot account for the problem based on insufficient data.
2) People will all be quiet and stop complaining until the operator(s) in question and vendor(s) have information and communicate it.
2) seems obviously preferable, but I suspect that the people on this list will go for 1) since it will allow everyone to flame and chatter incessantly, increasing NANOG mail volume and everyone's productivity.
If I'm the only person that's seen this type of problem, I'll shut up about it. But if this type of problem has impacted more providers, I think it's appropriate in this forum to ask the router vendors to comment on any known problems with BGP route withdrawals. If they don't have enough information to account for the problem, then they should tell us that, so we can get the data to them the next time something like this happens.

- Doug
If anyone who has seen this problem first hand has detailed technical information to provide, that is of course useful and welcome in this forum. But complaining without having any of the data? What's the point?
--jhawk
participants (7)

- alan@mindvision.com
- Alex.Bligh
- Avi Freedman
- Doug Junkins
- John Hawkinson
- Larry Rosenman
- randy@psg.com