Brett Frankenberger wrote:
A) Ciscos flap sessions, according to the only reports I've heard.
Is it an invalid AS_PATH? If so, if such is received by a Cisco, the Cisco is required by the RFC to drop the session. Failing to do so (and then propagating the bogus advertisement) was the cause of the original problem ... AFAIK, the fix (which was released a long time ago, but may not yet be running everywhere) causes the Cisco to behave properly, which is to drop the session.
Clarification: Ciscos take a buggy route, and turn it into an invalid one. This causes Cisco peers to flap the session (yes, as they should), and some other vendors (B, below) appear to have more serious issues.
B) <X> routers were crashing, either due to the bug, or the session resets. Thus, <X> is being flogged. I have reports of at least one <Y> having problems, as well.
Well, OK. If <X> is crashing, then <X> has a problem. And I didn't mean to imply that they didn't. Mostly, I was posting because I frequently hear the "Bay vs. Cisco" crashes of yore reported as "Bays were dropping BGP sessions". That implies that the Bay was broken, when in reality Bay (and most other non-Cisco implementations) was doing what was required by the RFC.
The reason for my post, not knowing who <X> is (although I could probably guess) or what <X> was doing, was to clarify that routers that drop BGP sessions upon receiving invalid advertisements are not broken; but rather, they are doing what is required.
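To make the required behavior concrete, here is a minimal sketch in Python of a receiver that checks an AS_PATH attribute and, if it is malformed, answers with a NOTIFICATION and tears the session down instead of passing the route on. It assumes the classic encoding with 2-octet AS numbers, and the names parse_as_path, handle_update and BgpNotification are made up for illustration. The actual malformation in this incident is under NDA, so the checks shown (unknown segment type, truncated segment) are generic examples only, not a description of the bug.

```python
import struct

# Path segment types defined by the BGP-4 specification.
AS_SET = 1
AS_SEQUENCE = 2

# NOTIFICATION error code / subcode for a bad AS_PATH.
UPDATE_MESSAGE_ERROR = 3
MALFORMED_AS_PATH = 11


class BgpNotification(Exception):
    """Hypothetical exception standing in for 'send NOTIFICATION, close session'."""

    def __init__(self, code, subcode):
        super().__init__(f"NOTIFICATION code={code} subcode={subcode}")
        self.code = code
        self.subcode = subcode


def parse_as_path(data: bytes):
    """Parse an AS_PATH attribute value made of 2-octet AS numbers.

    Returns a list of (segment_type, [asn, ...]) tuples, or raises
    BgpNotification if the attribute cannot be parsed cleanly.
    """
    segments = []
    offset = 0
    while offset < len(data):
        # Each segment starts with a 1-octet type and a 1-octet AS count.
        if len(data) - offset < 2:
            raise BgpNotification(UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
        seg_type, count = data[offset], data[offset + 1]
        offset += 2
        if seg_type not in (AS_SET, AS_SEQUENCE):
            raise BgpNotification(UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
        end = offset + 2 * count
        if end > len(data):
            # The segment claims more AS numbers than the attribute holds.
            raise BgpNotification(UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
        asns = list(struct.unpack(f"!{count}H", data[offset:end]))
        segments.append((seg_type, asns))
        offset = end
    return segments


def handle_update(as_path_bytes: bytes):
    """Accept a well-formed path; otherwise report that the session must drop."""
    try:
        path = parse_as_path(as_path_bytes)
    except BgpNotification as err:
        # Do NOT propagate the route: notify the peer and close the session.
        return ("close_session", err.code, err.subcode)
    return ("accept", path)
```

The point of the sketch is the last function: a bad path never reaches the "accept" branch, so it cannot be re-advertised to other peers. The cost is a session reset, which is the flapping described above.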
A good point, and entirely true. I apologize for not being clear about the bug, but I was/am trying to step carefully around the NDAs. And yes, they're annoying, and there are probably some people who believe I'm violating them even now. (Hopefully not the lawyers...)
I have no data on Bay; my apologies if this wasn't clear. Bay was *only* being referenced as a historical point of note. No attempt at FUD, and my apologies if anyone read it that way.
And I wasn't attempting to defend them, either -- I'm just curious about the problem.
Anyway, someone had to be passing this advertisement around ... if the Ciscos were dropping the session in response to it, and <X>'s were crashing, who's left to pass the bad advertisement around? Cisco with older code that propagated the advertisement upon receipt, instead of issuing a NOTIFICATION and tearing the session down?
I'm not entirely clear on this; the bug ID implies that iBGP may be treated differently than external peers (specifically, part of it appears to involve appending one's own ASN, possibly; again, I'm not entirely clear on it, even reading the bug report).
Naturally, you might be unable to answer the above, due to NDA ... mostly, I'm just fishing for details (from anywhere) on what happened.
Sorry. As Sean said... most of it is covered by NDAs, and this is exactly what will lead to required outage reporting for everyone, if they don't start relaxing it some. From our point of view (here), a lot of the issues were second-order, caused by the number of flaps in the global table from various directions, and/or the bug in vendor <X>'s equipment causing rapid reboots. Though, to their credit, <X> was good about handling the ticket, and had engineers talking to us rapidly, etc etc. Reasonable handling, IMO.
-- 
***************************************************************************
Joel Baker                           System Administrator - lightbearer.com
lucifer@lightbearer.com              http://www.lightbearer.com/~lucifer