Brett Frankenberger wrote:
Out of curiosity - did anyone see a period of significant instability in the global routing tables on Saturday afternoon? Without violating NDA, all I can say is that it resembled a historic event involving a bad route, Ciscos, and Bay routers (only this time, it was a bad route, Ciscos, and <X> vendor whom I cannot name but who is being soundly beaten with wet noodles to resolve the issue). The bad route, and the instability, were seen across all of our transit vendors (all "household" names of transit service).
Hmm ... why is <X> being beaten? Was the problem reversed this time?
The only historic event I can recall involving a bad route, Cisco, and Bay (actually, events would be better, since it happened at least twice) was a case of (a) someone injecting a bad route, (b) the Cisco at the other end accepting it in violation of the RFC, (c) Ciscos passing that bad route all around the Internet, all in violation of the RFC, (d) that route eventually hitting a Cisco<->Bay peering connection, and (e) the Bay (although the problem wasn't limited to Bay, as gated, and possibly other implementations as well, behaved the same way) properly sending a NOTIFICATION and taking down the BGP session, as required by the RFC.
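For readers who have not seen this failure mode, here is a minimal sketch of the difference between steps (b)/(c) and step (e). It assumes the bad route carried a malformed AS_PATH attribute with 2-octet AS numbers; the thread does not say exactly what was wrong with the route, so that is an illustration only. The function names (parse_as_path, handle_update) and the strict flag are hypothetical and do not correspond to any vendor's code. The NOTIFICATION values used are error code 3 (UPDATE Message Error) and subcode 11 (Malformed AS_PATH).

```python
import struct

# NOTIFICATION values for a malformed AS_PATH.
UPDATE_MESSAGE_ERROR = 3   # error code
MALFORMED_AS_PATH = 11     # error subcode

# Valid AS_PATH segment types.
AS_SET, AS_SEQUENCE = 1, 2


class MalformedASPath(Exception):
    """Raised when an AS_PATH attribute value cannot be parsed."""


def parse_as_path(value: bytes) -> list:
    """Parse an AS_PATH attribute value (2-octet AS numbers assumed).

    Each segment is: type (1 octet), AS count (1 octet), then count ASNs.
    """
    segments = []
    offset = 0
    while offset < len(value):
        if len(value) - offset < 2:
            raise MalformedASPath("truncated segment header")
        seg_type, count = value[offset], value[offset + 1]
        offset += 2
        if seg_type not in (AS_SET, AS_SEQUENCE):
            raise MalformedASPath("unknown segment type %d" % seg_type)
        end = offset + 2 * count
        if end > len(value):
            raise MalformedASPath("segment overruns attribute")
        asns = list(struct.unpack("!%dH" % count, value[offset:end]))
        segments.append((seg_type, asns))
        offset = end
    return segments


def handle_update(as_path_value: bytes, strict: bool = True) -> tuple:
    """Hypothetical receiver: decide what to do with an UPDATE's AS_PATH.

    strict=True  models step (e): send NOTIFICATION, drop the session.
    strict=False models steps (b)/(c): accept and pass the bad route on.
    """
    try:
        path = parse_as_path(as_path_value)
    except MalformedASPath:
        if strict:
            return ("NOTIFICATION", UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
        return ("PROPAGATE", as_path_value)
    return ("ACCEPT", path)
```

With a lenient speaker upstream and a strict one downstream, the strict side resets the session each time the route is re-sent, which is the flap loop described further down the thread.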
A) Ciscos flap sessions, according to the only reports I've heard. B) <X> routers were crashing, either due to the bug or due to the session resets. Thus, <X> is being flogged. I have reports of at least one <Y> having problems, as well. C) I would post the BugID, but the only source I have is under NDA. However, having now heard this much in a public forum (i.e., not covered), I can say "Invalid AS path data bug".
It only took two major outages before Cisco fixed the problem. (The BGP advertisement was posted to NANOG both times, as was the BugID the second time.)
I have the guilty announcement, but again, it's under NDA. However, I can say that we are now seeing this announcement from all of our upstreams, non-blocked, so it appears that they fixed the originating point.
So if this is the same issue, Cisco would be the vendor to flog, although assuming they didn't re-introduce it, the flogging might more correctly be directed at providers still running code old enough to have this particular problem.
I would flog Cisco as well, but A) they have a bug on it already, and B) we're not using Ciscos for our core (note: this is my personal email, and I am not speaking for my employer; however, this is publicly documented on my employer's website, so it's not NDAed).
Both my transits (Bay on my end, Cisco on the other end) made it through just fine, though. (This time. The last two times it happened, the Ciscos on the other end happily passed the invalid route to me and the Bay on my end happily dropped the BGP session, and this was repeated ad infinitum until the bogus route was removed from the other end.)
I have no data on Bay; my apologies if this wasn't clear. Bay was *only* being referenced as a historical point of note. No attempt at FUD, and my apologies if anyone read it that way.
--
***************************************************************************
Joel Baker
System Administrator - lightbearer.com
lucifer@lightbearer.com
http://www.lightbearer.com/~lucifer