On Mar 28, 2010, at 12:00 PM, Anton Kapela wrote:
I guess what I'm hinting at is precisely something finer-grained (path, not prefix), as you suggest. Enabling it per neighbor, rather than across the entire BGP RIB, would be preferred. I'm also interested in the *chronic* nature of these apparent instabilities. An average of one flap per minute could imply that the end-site is not getting a lot of useful TCP moved, and as such, after something on the (n)-hour timescale, perhaps it's worth suppressing it.
So, I'd ask for a long-timescale dampening function, indexed per path and enforced per neighbor. Perhaps as-path lists could be combined with relaxed timers on existing implementations to achieve this today (in a VRF target/context).
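To make the per-path, per-neighbor idea concrete, here is a minimal Python sketch. It borrows the penalty and half-life model of conventional route flap damping (RFC 2439 style) but keys the state on (neighbor, prefix, AS path) rather than on the prefix alone. The class and method names are hypothetical, and the half-life and thresholds are illustrative assumptions chosen so that roughly an hour of one-flap-per-minute behavior triggers suppression; none of this reflects any vendor's implementation.

```python
from dataclasses import dataclass


@dataclass
class DampState:
    penalty: float = 0.0
    last_update: float = 0.0  # seconds, same clock as the caller's `now`
    suppressed: bool = False


class PathDamper:
    """Long-timescale damping keyed per (neighbor, prefix, AS path).

    All parameter values are illustrative assumptions, not recommendations.
    """

    def __init__(self, half_life_s=4 * 3600, flap_penalty=1000.0,
                 suppress_at=60000.0, reuse_at=30000.0):
        self.half_life_s = half_life_s
        self.flap_penalty = flap_penalty
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.state = {}

    def _decay(self, st, now):
        # Exponential decay: the penalty halves every half_life_s seconds.
        elapsed = now - st.last_update
        st.penalty *= 0.5 ** (elapsed / self.half_life_s)
        st.last_update = now

    def record_flap(self, neighbor, prefix, as_path, now):
        # Only this exact path from this neighbor is penalized.
        key = (neighbor, prefix, tuple(as_path))
        st = self.state.setdefault(key, DampState(last_update=now))
        self._decay(st, now)
        st.penalty += self.flap_penalty
        if st.penalty >= self.suppress_at:
            st.suppressed = True
        return st.suppressed

    def is_suppressed(self, neighbor, prefix, as_path, now):
        st = self.state.get((neighbor, prefix, tuple(as_path)))
        if st is None:
            return False
        self._decay(st, now)
        # Release the path once the penalty has decayed below the reuse threshold.
        if st.suppressed and st.penalty < self.reuse_at:
            st.suppressed = False
        return st.suppressed
```

With these assumed values, a path flapping once a minute crosses the suppress threshold after a bit more than an hour, while other paths to the same prefix, and the same path learned from other neighbors, are left alone.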
It's not just AS_PATH. A lot of the reason so many duplicate updates occur (nearly 50% of all updates at times, and often more during the busiest periods) is that implementations on the other end don't keep egress advertisement state per attribute. For example, if a change in cluster_list length triggers an internal transition, a new update is sent to external peers carrying no new information, because the internal attributes that caused it are stripped before the update is transmitted. Yet those *prefixes* might well be suppressed as a result of the implementation and/or network architecture on the other end of the BGP connection.
Then couple that with what Joe was pointing out, where intermediate nodes with consistently unstable links or "paths" result in penalizing an entire prefix, not just the unstable paths, and route flap damping makes for more brokenness than benefit when it is employed.
It's not that people haven't studied this or don't understand why it occurs. The issue is that implementation optimizations always seem to win out today over systemic state effects (i.e., that "be conservative in what you send" thing doesn't seem to apply in practice, unfortunately).
-danny
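For illustration, the duplicate-update case described above could be avoided by remembering what was last advertised to each external peer and comparing against that before sending. The following Python sketch is hypothetical: the names `INTERNAL_ONLY`, `external_view`, and `EgressState` are made up for this example, and the set of internal-only attributes is an assumption, not a complete list.

```python
# Attributes assumed (for this sketch) to be stripped before sending to eBGP peers.
INTERNAL_ONLY = {"cluster_list", "originator_id", "local_pref"}


def external_view(attrs):
    """Return only the attributes an external peer would actually see."""
    return {k: v for k, v in attrs.items() if k not in INTERNAL_ONLY}


class EgressState:
    """Tracks the last advertised attribute set per (peer, prefix)."""

    def __init__(self):
        self.adj_rib_out = {}

    def should_send(self, peer, prefix, attrs):
        view = external_view(attrs)
        key = (peer, prefix)
        if self.adj_rib_out.get(key) == view:
            # Nothing the peer can see has changed: sending would be a duplicate.
            return False
        self.adj_rib_out[key] = view
        return True


if __name__ == "__main__":
    egress = EgressState()
    route = {"as_path": (64500, 64501), "next_hop": "192.0.2.1",
             "cluster_list": ("10.0.0.1",)}
    print(egress.should_send("ebgp-peer", "198.51.100.0/24", route))
    # An internal-only change: the cluster_list grows, the external view does not.
    route = dict(route, cluster_list=("10.0.0.1", "10.0.0.2"))
    print(egress.should_send("ebgp-peer", "198.51.100.0/24", route))
```

The trade-off is the one noted above: keeping this state costs memory per peer and per prefix, which is presumably why implementations optimize it away.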