On Tue, Jun 8, 2010 at 7:27 AM, Andy B. <globichen@gmail.com> wrote:
I finally decided to shut down all peerings and brought them back one by one.
Everything is stable again, but I don't like the way I had to deal with it, since it will most likely happen again when DECIX or another IX we're at is having issues.
I've seen a few BGP convergence discussions on NANOG, but none about deadlock situations and what could be done to avoid them. Setting higher MTU or bigger hold queues did not help.
- Andy
Some people have found that upgrading to an alternate router vendor helps. ^_^;

Fundamentally, the CPU on your router is underpowered for the amount of state information that needs to be updated within the time window of the hold timers. If you can't move to a faster or more efficient platform, then you may need to negotiate raising the keepalive interval and corresponding hold timers with your neighbors, to give your router time to finish processing updates.

Alternatively, if you aren't in a position to upgrade platforms but have spare routers around, connecting a second router to the exchange and splitting your neighbors between the two links would reduce the load on each router during reconvergence, and buy you time until you can move to a more capable platform.

Matt
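To put rough numbers on the two suggestions above, here is a minimal Python sketch. The hold-time rule follows RFC 4271 (the session uses the smaller of the two values proposed in the OPEN messages, which is why raising timers has to be agreed with each neighbor). The reconvergence estimate is only a back-of-the-envelope model: the function names and all peer counts, update counts, and processing rates are hypothetical placeholders, not measurements from any real router.

```python
def negotiated_hold_time(local: int, remote: int) -> int:
    """Hold time a BGP session ends up with (RFC 4271): the smaller of
    the two proposed values. 0 disables keepalives; 1 and 2 are invalid."""
    for value in (local, remote):
        if value < 0 or value in (1, 2):
            raise ValueError("hold time must be 0 or at least 3 seconds")
    return min(local, remote)


def keepalive_interval(hold_time: int) -> int:
    """Conventional keepalive interval: one third of the hold time."""
    return hold_time // 3


def reconvergence_seconds(peers: int, updates_per_peer: int,
                          updates_per_second: float) -> float:
    """Crude assumption: the CPU works through updates at a fixed rate,
    so processing time scales linearly with the number of peers."""
    return peers * updates_per_peer / updates_per_second


if __name__ == "__main__":
    # Raising only the local timer does not help if the neighbor keeps a lower one.
    hold = negotiated_hold_time(local=180, remote=90)
    print("negotiated hold time:", hold, "keepalive:", keepalive_interval(hold))

    # Hypothetical load: 100 peers, 3000 updates each, 5000 updates/s of CPU.
    one_router = reconvergence_seconds(100, 3000, 5000.0)
    # Same neighbors split evenly across two routers.
    per_router = reconvergence_seconds(50, 3000, 5000.0)
    print("one router:", one_router, "s; each of two routers:", per_router, "s")
    print("fits within hold time on one router:", one_router < hold)
    print("fits within hold time when split:", per_router < hold)
```

Under these made-up figures, a single router needs longer than the negotiated hold time to finish, while each of two routers finishes well inside it. Real behavior depends on the platform and on how the router schedules keepalives against update processing, so treat this as an illustration of the reasoning rather than a sizing tool.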