BGP convergence problem
Hi, This morning there was an ethernet loop problem on DECIX, causing many BGP sessions to flap throughout the entire platform. While this can happen, I am myself facing with BGP convergence problems on our DECIX router (SUP720-3BXL with IOS SXI3). De DECIX loop has been solved two hours ago, but my BGP sessions are still flapping and not converging at all. This has been flooding our logs, and is still going on: Jun 8 11:47:03 x.x.x.131 239447: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.32 Up Jun 8 11:47:03 x.x.x.131 239448: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.231 Up Jun 8 11:47:03 x.x.x.131 239449: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.109 Up Jun 8 11:47:03 x.x.x.131 239450: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.50 Up Jun 8 11:47:03 x.x.x.131 239451: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.81 Up Jun 8 11:47:03 x.x.x.131 239452: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.28 Up Jun 8 11:47:03 x.x.x.131 239453: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.212 Up Jun 8 11:47:03 x.x.x.131 239454: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.147 Up Jun 8 11:47:03 x.x.x.131 239455: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.74 Up Jun 8 11:47:03 x.x.x.131 239456: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.241 Up Jun 8 11:47:03 x.x.x.131 239457: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.5 Up Jun 8 11:47:03 x.x.x.131 239458: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.40 Up Jun 8 11:47:03 x.x.x.131 239459: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::1A44:0:1 Up Jun 8 11:47:03 x.x.x.131 239460: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::8605:0:1 Up Jun 8 11:47:03 x.x.x.131 239461: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::1A0B:0:1 Up Jun 8 11:47:03 x.x.x.131 239462: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::3029:0:1 Up Jun 8 11:47:03 x.x.x.131 239463: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::6E4:0:1 Up Jun 8 11:47:03 x.x.x.131 239464: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::CB0:0:1 Up Jun 8 11:47:03 x.x.x.131 239465: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::21C8:0:1 Up Jun 8 11:47:03 x.x.x.131 239466: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::8463:0:2 Up Jun 8 11:47:04 x.x.x.131 239467: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::31AA:0:1 Up Jun 8 11:47:04 x.x.x.131 239468: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.29 Up Jun 8 11:47:04 x.x.x.131 239469: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::62BF:0:1 Up Jun 8 11:47:04 x.x.x.131 239470: Jun 8 11:48:39.656 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.101 Down BGP Notification sent Jun 8 11:47:04 x.x.x.131 239471: Jun 8 11:48:39.656 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.101 4/0 (hold time expired) 0 bytes Jun 8 11:47:07 x.x.x.131 239472: Jun 8 11:48:41.696 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.104 Up Jun 8 11:47:10 x.x.x.131 239473: Jun 8 11:48:44.488 CEST: %BGP-3-BGP_NO_REMOTE_READ: 80.81.193.187 connection timed out - has not accepted a message from us for 20000ms (hold time), 1 messages pending transmition. Jun 8 11:47:10 x.x.x.131 239474: Jun 8 11:48:44.488 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.187 Down BGP Notification sent Jun 8 11:47:10 x.x.x.131 239475: Jun 8 11:48:44.488 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.193.187 4/0 (hold time expired) 0 bytes Jun 8 11:47:10 x.x.x.131 239476: Jun 8 11:48:44.900 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.61 Up Jun 8 11:47:10 x.x.x.131 239477: Jun 8 11:48:44.900 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.149 Up Jun 8 11:47:10 x.x.x.131 239478: Jun 8 11:48:44.900 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.136 Up Jun 8 11:47:10 x.x.x.131 239479: Jun 8 11:48:44.904 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::8463:0:1 Up Jun 8 11:47:10 x.x.x.131 239480: Jun 8 11:48:46.352 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::6268:0:1 Up Jun 8 11:47:14 x.x.x.131 239481: Jun 8 11:48:48.084 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.78 Up Jun 8 11:47:14 x.x.x.131 239482: Jun 8 11:48:49.172 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.239 Up Jun 8 11:47:14 x.x.x.131 239483: Jun 8 11:48:49.172 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.24 Up Jun 8 11:47:17 x.x.x.131 239484: Jun 8 11:48:52.160 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.45 Up Jun 8 11:47:17 x.x.x.131 239485: Jun 8 11:48:52.160 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.108 Up Jun 8 11:47:17 x.x.x.131 239486: Jun 8 11:48:52.160 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.164 Up Jun 8 11:47:17 x.x.x.131 239487: Jun 8 11:48:52.164 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.49 Up Jun 8 11:47:17 x.x.x.131 239488: Jun 8 11:48:52.164 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.139 Up Jun 8 11:47:17 x.x.x.131 239489: Jun 8 11:48:52.164 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::1536:0:1 Up Jun 8 11:47:17 x.x.x.131 239490: Jun 8 11:48:52.164 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::8601:0:1 Up Jun 8 11:47:17 x.x.x.131 239491: Jun 8 11:48:53.788 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.45 Up Jun 8 11:47:17 x.x.x.131 239492: Jun 8 11:48:53.788 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::A2DC:0:1 Up Jun 8 11:47:21 x.x.x.131 239493: Jun 8 11:48:55.056 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.91 Down BGP Notification sent Jun 8 11:49:04 x.x.x.131 239583: Jun 8 11:50:37.684 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.14 Down Peer closed the session Jun 8 11:49:04 x.x.x.131 239584: Jun 8 11:50:38.656 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.120 Down BGP Notification sent Jun 8 11:49:04 x.x.x.131 239585: Jun 8 11:50:38.656 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.120 4/0 (hold time expired) 0 bytes Jun 8 11:49:04 x.x.x.131 239586: Jun 8 11:50:38.656 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.229 Down BGP Notification sent Jun 8 11:49:04 x.x.x.131 239587: Jun 8 11:50:38.656 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.229 4/0 (hold time expired) 0 bytes Jun 8 11:49:04 x.x.x.131 239588: Jun 8 11:50:38.656 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.108 Down BGP Notification sent Jun 8 11:49:04 x.x.x.131 239589: Jun 8 11:50:38.656 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.108 4/0 (hold time expired) 0 bytes Jun 8 11:49:07 x.x.x.131 239590: Jun 8 11:50:41.944 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.73 Down BGP Notification sent Jun 8 11:49:07 x.x.x.131 239591: Jun 8 11:50:41.944 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.73 4/0 (hold time expired) 0 bytes Jun 8 11:49:07 x.x.x.131 239592: Jun 8 11:50:41.944 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::20AD:0:2 Down BGP Notification sent Jun 8 11:49:07 x.x.x.131 239593: Jun 8 11:50:41.944 CEST: %BGP-3-NOTIFICATION: sent to neighbor 2001:7F8::20AD:0:2 4/0 (hold time expired) 0 bytes Jun 8 11:49:07 x.x.x.131 239594: Jun 8 11:50:41.944 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.115 Down BGP Notification sent Jun 8 11:49:07 x.x.x.131 239595: Jun 8 11:50:41.944 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.193.115 4/0 (hold time expired) 0 bytes Jun 8 11:49:07 x.x.x.131 239596: Jun 8 11:50:44.124 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.3 Down BGP Notification sent Jun 8 11:49:11 x.x.x.131 239597: Jun 8 11:50:44.124 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.3 4/0 (hold time expired) 0 bytes Jun 8 11:49:11 x.x.x.131 239598: Jun 8 11:50:45.200 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.215 Down BGP Notification sent Jun 8 11:49:11 x.x.x.131 239599: Jun 8 11:50:45.200 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.215 4/0 (hold time expired) 0 bytes Jun 8 11:49:11 x.x.x.131 239600: Jun 8 11:50:47.336 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.141 Down BGP Notification sent Jun 8 11:49:11 x.x.x.131 239601: Jun 8 11:50:47.336 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.141 4/0 (hold time expired) 0 bytes Jun 8 11:49:14 x.x.x.131 239602: Jun 8 11:50:48.432 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::3B41:0:1 Down BGP Notification sent Jun 8 11:49:14 x.x.x.131 239603: Jun 8 11:50:48.432 CEST: %BGP-3-NOTIFICATION: sent to neighbor 2001:7F8::3B41:0:1 4/0 (hold time expired) 0 bytes Jun 8 11:49:14 x.x.x.131 239604: Jun 8 11:50:49.720 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.239 Down BGP Notification sent Jun 8 11:49:14 x.x.x.131 239605: Jun 8 11:50:49.720 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.239 4/0 (hold time expired) 0 bytes Jun 8 11:49:17 x.x.x.131 239606: Jun 8 11:50:50.976 CEST: %BGP-5-ADJCHANGE: neighbor 2001:2000:3080:B4::1 Down Peer closed the session Jun 8 11:49:17 x.x.x.131 239607: Jun 8 11:50:52.976 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.20 Down BGP Notification sent Jun 8 11:49:17 x.x.x.131 239608: Jun 8 11:50:52.976 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.20 4/0 (hold time expired) 0 bytes Jun 8 11:49:17 x.x.x.131 239609: Jun 8 11:50:54.044 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.21 Down BGP Notification sent Jun 8 11:49:17 x.x.x.131 239610: Jun 8 11:50:54.044 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.193.21 4/0 (hold time expired) 0 bytes Jun 8 11:49:20 x.x.x.131 239611: Jun 8 11:50:56.204 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::1A0B:0:1 Down BGP Notification sent Jun 8 11:49:20 x.x.x.131 239612: Jun 8 11:50:56.204 CEST: %BGP-3-NOTIFICATION: sent to neighbor 2001:7F8::1A0B:0:1 4/0 (hold time expired) 0 bytes Jun 8 11:49:23 x.x.x.131 239613: Jun 8 11:50:58.400 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.63 Down BGP Notification sent Jun 8 11:49:23 x.x.x.131 239614: Jun 8 11:50:58.400 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.63 4/0 (hold time expired) 0 bytes Jun 8 11:49:23 x.x.x.131 239615: Jun 8 11:50:59.448 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.97 Down BGP Notification sent Jun 8 11:49:23 x.x.x.131 239616: Jun 8 11:50:59.448 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.97 4/0 (hold time expired) 0 bytes Jun 8 11:49:27 x.x.x.131 239617: Jun 8 11:51:01.664 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.131 Down BGP Notification sent Jun 8 11:49:27 x.x.x.131 239618: Jun 8 11:51:01.664 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.131 4/0 (hold time expired) 0 bytes Jun 8 11:49:27 x.x.x.131 239619: Jun 8 11:51:03.872 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.83 Down BGP Notification sent Jun 8 11:49:27 x.x.x.131 239620: Jun 8 11:51:03.872 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.83 4/0 (hold time expired) 0 bytes Jun 8 11:49:27 x.x.x.131 239621: Jun 8 11:51:03.872 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.156 Down BGP Notification sent Jun 8 11:49:30 x.x.x.131 239622: Jun 8 11:51:03.872 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.156 4/0 (hold time expired) 0 bytes Jun 8 11:49:30 x.x.x.131 239623: Jun 8 11:51:06.056 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.50 Down BGP Notification sent Jun 8 11:49:30 x.x.x.131 239624: Jun 8 11:51:06.056 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.50 4/0 (hold time expired) 0 bytes CPU load is constantly at 100% doing BGP and more BGP. We have around 200 BGP sessions on DECIX and I would not want to shut them down and bring them up individually. How can I get out of this deadlock? Andy
Dear Andy
This morning there was an ethernet loop problem on DECIX, causing many BGP sessions to flap throughout the entire platform. While this can happen, I am myself facing with BGP convergence problems on our DECIX router (SUP720-3BXL with IOS SXI3).
De DECIX loop has been solved two hours ago, but my BGP sessions are still flapping and not converging at all. This has been flooding our logs, and is still going on:
route half or more of the peering-network to Null -> lowering bgp session up's. (at the other side, your bgp-router seems to be overloaded). Kind regards, Ingo Flaschberger
I finally decided to shut down all peerings and brought them back one by one. Everything is stable again, but I don't like the way I had to deal with it since it will most likely happen again when DECIX or an other IX we're at is having issues. I've seen a few BGP convergence discussions on NANOG, but none about deadlock situations and what could be done to avoid them. Setting higher MTU or bigger hold queues did not help. - Andy On Tue, Jun 8, 2010 at 2:35 PM, Ingo Flaschberger <if@xip.at> wrote:
Dear Andy
This morning there was an ethernet loop problem on DECIX, causing many BGP sessions to flap throughout the entire platform. While this can happen, I am myself facing with BGP convergence problems on our DECIX router (SUP720-3BXL with IOS SXI3).
De DECIX loop has been solved two hours ago, but my BGP sessions are still flapping and not converging at all. This has been flooding our logs, and is still going on:
route half or more of the peering-network to Null -> lowering bgp session up's. (at the other side, your bgp-router seems to be overloaded).
Kind regards, Ingo Flaschberger
On Jun 8, 2010, at 10:27 AM, Andy B. wrote:
I finally decided to shut down all peerings and brought them back one by one.
Everything is stable again, but I don't like the way I had to deal with it since it will most likely happen again when DECIX or an other IX we're at is having issues.
I've seen a few BGP convergence discussions on NANOG, but none about deadlock situations and what could be done to avoid them. Setting higher MTU or bigger hold queues did not help.
The Cisco 7600 and 6500 platforms are getting fairly old and have underpowered cpus these days. Starting in SXH the control plane did not scale quite as well as in SXF. This got better in SXI, but is not back on par with SXF performance yet. I mostly attribute this to a combination of bloat in software and routing tables. I would start to look for a replacement sooner rather than later. - Jared
On Tue, Jun 08, 2010 at 12:22:04PM -0400, Jared Mauch wrote:
The Cisco 7600 and 6500 platforms are getting fairly old and have underpowered cpus these days.
Starting in SXH the control plane did not scale quite as well as in SXF. This got better in SXI, but is not back on par with SXF performance yet.
I mostly attribute this to a combination of bloat in software and routing tables. I would start to look for a replacement sooner rather than later.
Place blame where blame is due, the cpu may be slow, but the crappy ios scheduler is the real problem here. We saw a huge reduction in the number of self-sustaining protocols timeouts cycles on these boxes (where the process of trying to bring up a new neighbor and converge routing uses so much cpu that it causes other neighbors to time out, resulting in a never-ending cycle of fail until you shut down everything and bring them up one neighbor at a time) with the move from SXF to the SR branches. We never really went down the SXH/SXI road, but I'd have assumed they would have introduced the same improvements there too. I guess you know what they say about assuming. :) Try the usual suspects: * Configure "process-max-time 20" at the top level, this improves interactivity by making the scheduler switch processes more often. * Make sure you don't have an overly aggressive control-plane policer. In my experience the COPP rate-limits are quite harsh, and if you end up bumping against them you don't get a graceful slowing of the exchange of routes, you get protocol timeouts. * Make sure you don't have any stupid mls rate-limits, such as cef receive. I don't know why anyone would ever want to configure this, all it does is make your box fall over faster (as if these things need any help) by rate-limiting all traffic to the msfc. * You might want to try something like "scheduler allocate 400 4000", which gives the vast majority of the cpu time to the control plane rather than process switching on the data plane (which in theory shouldn't happen on an entirely hw forwarded box like 6500/7600, though of course we all know that isn't true :P). Oh and also the OP should take this to the cisco-nsp mailing list, where all the good bitching about broken Crisco routers takes place. :) -- Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
The Cisco 7600 and 6500 platforms are getting fairly old and have underpowered cpus these days.
the hamsters in them were never well fed, ever. though i have never run one, too yucchhy, i have measured receiving a research feed from one. over ten minutes for a full table while a router takes two. some researcher into archeology might try to measure if is just a sick tcp or if it is closer to rib-out. randy
On Tue, Jun 8, 2010 at 7:27 AM, Andy B. <globichen@gmail.com> wrote:
I finally decided to shut down all peerings and brought them back one by one.
Everything is stable again, but I don't like the way I had to deal with it since it will most likely happen again when DECIX or an other IX we're at is having issues.
I've seen a few BGP convergence discussions on NANOG, but none about deadlock situations and what could be done to avoid them. Setting higher MTU or bigger hold queues did not help.
- Andy
Some people have found that upgrading to an alternate router vendor helps. ^_^; Fundamentally, the CPU on your router is underpowered for the amount of state information that needs to be updated in the time window of the hold timers. If you can't move to a faster/more efficient platform, then you may need to negotiate raising the keepalive interval and corresponding hold timers with your neighbors, to give your router time to finish processing updates. Alternately, if you aren't in a position to be able to upgrade platforms, but have spare routers around, connecting a second router up to the exchange and splitting your neighbors up among two links into the exchange would reduce the load on each router during reconvergence, and buy you time until you can move to a more capable platform. Matt
* globichen@gmail.com (Andy B.) [Tue 08 Jun 2010, 16:28 CEST]:
I finally decided to shut down all peerings and brought them back one by one.
Sadly that's often the way it has to be done, modulo mild tweaks.
Everything is stable again, but I don't like the way I had to deal with it since it will most likely happen again when DECIX or an other IX we're at is having issues.
As others have said upthread in more polite wordings, get a better router if yours can't handle the load. (Or use the route servers more - it's what they're there for.)
I've seen a few BGP convergence discussions on NANOG, but none about deadlock situations and what could be done to avoid them. Setting higher MTU or bigger hold queues did not help.
I hope you didn't change the MTU to anything different from what everybody else on the DE-CIX Peering LAN uses - that only leads to suffering. -- Niels. -- "It's amazing what people will do to get their name on the internet, which is odd, because all you really need is a Blogspot account." -- roy edroso, alicublog.blogspot.com
Hi Andy, We have had similar problems with s720/3bxl on exchanges with large numbers of peers. Exact same symptoms, can be triggered by any significant UPDATE flux, even iBGP originated path-hunts. This problem is compounded if you are taking full tables on the same device, to the extent that the bgp scanner and bgp IO processes grind the control-plane to halt causing ISIS/OSPF adjacencies to drop, SNMP and SSH unresponsive, etc. Same behavior is seen regardless of IOS train. As others have pointed out, the sad fact of the matter is that the s720/3BXL simply does not have the CPU power to cope with hundreds of neighbor sessions and the growing numbers of paths. Here are some things that we tried with varied success to remedy bgp deadlock on this platform: * lower process-max-time to prevent bgp scanner/bgp io processes from completely consuming the control-plane * Take soft-reconfiguration off of neighbors/peer-groups where you can, this will help tremendously * Split the load of neighbor sessions between multiple devices, move full table feeds to other devices The 'final solution' is to simply replace this platform with a newer more powerful alternative, and there are numerous candidates :) Best Regards, Kevin Hodle On Tue, Jun 8, 2010 at 4:58 AM, Andy B. <globichen@gmail.com> wrote:
Hi,
This morning there was an ethernet loop problem on DECIX, causing many BGP sessions to flap throughout the entire platform. While this can happen, I am myself facing with BGP convergence problems on our DECIX router (SUP720-3BXL with IOS SXI3).
De DECIX loop has been solved two hours ago, but my BGP sessions are still flapping and not converging at all. This has been flooding our logs, and is still going on:
Jun 8 11:47:03 x.x.x.131 239447: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.32 Up Jun 8 11:47:03 x.x.x.131 239448: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.231 Up Jun 8 11:47:03 x.x.x.131 239449: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.109 Up Jun 8 11:47:03 x.x.x.131 239450: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.50 Up Jun 8 11:47:03 x.x.x.131 239451: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.81 Up Jun 8 11:47:03 x.x.x.131 239452: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.28 Up Jun 8 11:47:03 x.x.x.131 239453: Jun 8 11:48:38.364 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.212 Up Jun 8 11:47:03 x.x.x.131 239454: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.147 Up Jun 8 11:47:03 x.x.x.131 239455: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.74 Up Jun 8 11:47:03 x.x.x.131 239456: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.241 Up Jun 8 11:47:03 x.x.x.131 239457: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.5 Up Jun 8 11:47:03 x.x.x.131 239458: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.40 Up Jun 8 11:47:03 x.x.x.131 239459: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::1A44:0:1 Up Jun 8 11:47:03 x.x.x.131 239460: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::8605:0:1 Up Jun 8 11:47:03 x.x.x.131 239461: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::1A0B:0:1 Up Jun 8 11:47:03 x.x.x.131 239462: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::3029:0:1 Up Jun 8 11:47:03 x.x.x.131 239463: Jun 8 11:48:38.368 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::6E4:0:1 Up Jun 8 11:47:03 x.x.x.131 239464: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::CB0:0:1 Up Jun 8 11:47:03 x.x.x.131 239465: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::21C8:0:1 Up Jun 8 11:47:03 x.x.x.131 239466: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::8463:0:2 Up Jun 8 11:47:04 x.x.x.131 239467: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::31AA:0:1 Up Jun 8 11:47:04 x.x.x.131 239468: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.29 Up Jun 8 11:47:04 x.x.x.131 239469: Jun 8 11:48:38.372 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::62BF:0:1 Up Jun 8 11:47:04 x.x.x.131 239470: Jun 8 11:48:39.656 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.101 Down BGP Notification sent Jun 8 11:47:04 x.x.x.131 239471: Jun 8 11:48:39.656 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.101 4/0 (hold time expired) 0 bytes Jun 8 11:47:07 x.x.x.131 239472: Jun 8 11:48:41.696 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.104 Up Jun 8 11:47:10 x.x.x.131 239473: Jun 8 11:48:44.488 CEST: %BGP-3-BGP_NO_REMOTE_READ: 80.81.193.187 connection timed out - has not accepted a message from us for 20000ms (hold time), 1 messages pending transmition. Jun 8 11:47:10 x.x.x.131 239474: Jun 8 11:48:44.488 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.187 Down BGP Notification sent Jun 8 11:47:10 x.x.x.131 239475: Jun 8 11:48:44.488 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.193.187 4/0 (hold time expired) 0 bytes Jun 8 11:47:10 x.x.x.131 239476: Jun 8 11:48:44.900 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.61 Up Jun 8 11:47:10 x.x.x.131 239477: Jun 8 11:48:44.900 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.149 Up Jun 8 11:47:10 x.x.x.131 239478: Jun 8 11:48:44.900 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.136 Up Jun 8 11:47:10 x.x.x.131 239479: Jun 8 11:48:44.904 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::8463:0:1 Up Jun 8 11:47:10 x.x.x.131 239480: Jun 8 11:48:46.352 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::6268:0:1 Up Jun 8 11:47:14 x.x.x.131 239481: Jun 8 11:48:48.084 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.78 Up Jun 8 11:47:14 x.x.x.131 239482: Jun 8 11:48:49.172 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.239 Up Jun 8 11:47:14 x.x.x.131 239483: Jun 8 11:48:49.172 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.24 Up Jun 8 11:47:17 x.x.x.131 239484: Jun 8 11:48:52.160 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.45 Up Jun 8 11:47:17 x.x.x.131 239485: Jun 8 11:48:52.160 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.108 Up Jun 8 11:47:17 x.x.x.131 239486: Jun 8 11:48:52.160 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.164 Up Jun 8 11:47:17 x.x.x.131 239487: Jun 8 11:48:52.164 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.49 Up Jun 8 11:47:17 x.x.x.131 239488: Jun 8 11:48:52.164 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.139 Up Jun 8 11:47:17 x.x.x.131 239489: Jun 8 11:48:52.164 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::1536:0:1 Up Jun 8 11:47:17 x.x.x.131 239490: Jun 8 11:48:52.164 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::8601:0:1 Up Jun 8 11:47:17 x.x.x.131 239491: Jun 8 11:48:53.788 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.45 Up Jun 8 11:47:17 x.x.x.131 239492: Jun 8 11:48:53.788 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::A2DC:0:1 Up Jun 8 11:47:21 x.x.x.131 239493: Jun 8 11:48:55.056 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.91 Down BGP Notification sent
Jun 8 11:49:04 x.x.x.131 239583: Jun 8 11:50:37.684 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.14 Down Peer closed the session Jun 8 11:49:04 x.x.x.131 239584: Jun 8 11:50:38.656 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.120 Down BGP Notification sent Jun 8 11:49:04 x.x.x.131 239585: Jun 8 11:50:38.656 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.120 4/0 (hold time expired) 0 bytes Jun 8 11:49:04 x.x.x.131 239586: Jun 8 11:50:38.656 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.229 Down BGP Notification sent Jun 8 11:49:04 x.x.x.131 239587: Jun 8 11:50:38.656 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.229 4/0 (hold time expired) 0 bytes Jun 8 11:49:04 x.x.x.131 239588: Jun 8 11:50:38.656 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.108 Down BGP Notification sent Jun 8 11:49:04 x.x.x.131 239589: Jun 8 11:50:38.656 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.108 4/0 (hold time expired) 0 bytes Jun 8 11:49:07 x.x.x.131 239590: Jun 8 11:50:41.944 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.73 Down BGP Notification sent Jun 8 11:49:07 x.x.x.131 239591: Jun 8 11:50:41.944 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.73 4/0 (hold time expired) 0 bytes Jun 8 11:49:07 x.x.x.131 239592: Jun 8 11:50:41.944 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::20AD:0:2 Down BGP Notification sent Jun 8 11:49:07 x.x.x.131 239593: Jun 8 11:50:41.944 CEST: %BGP-3-NOTIFICATION: sent to neighbor 2001:7F8::20AD:0:2 4/0 (hold time expired) 0 bytes Jun 8 11:49:07 x.x.x.131 239594: Jun 8 11:50:41.944 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.115 Down BGP Notification sent Jun 8 11:49:07 x.x.x.131 239595: Jun 8 11:50:41.944 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.193.115 4/0 (hold time expired) 0 bytes Jun 8 11:49:07 x.x.x.131 239596: Jun 8 11:50:44.124 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.3 Down BGP Notification sent Jun 8 11:49:11 x.x.x.131 239597: Jun 8 11:50:44.124 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.3 4/0 (hold time expired) 0 bytes Jun 8 11:49:11 x.x.x.131 239598: Jun 8 11:50:45.200 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.215 Down BGP Notification sent Jun 8 11:49:11 x.x.x.131 239599: Jun 8 11:50:45.200 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.215 4/0 (hold time expired) 0 bytes Jun 8 11:49:11 x.x.x.131 239600: Jun 8 11:50:47.336 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.141 Down BGP Notification sent Jun 8 11:49:11 x.x.x.131 239601: Jun 8 11:50:47.336 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.141 4/0 (hold time expired) 0 bytes Jun 8 11:49:14 x.x.x.131 239602: Jun 8 11:50:48.432 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::3B41:0:1 Down BGP Notification sent Jun 8 11:49:14 x.x.x.131 239603: Jun 8 11:50:48.432 CEST: %BGP-3-NOTIFICATION: sent to neighbor 2001:7F8::3B41:0:1 4/0 (hold time expired) 0 bytes Jun 8 11:49:14 x.x.x.131 239604: Jun 8 11:50:49.720 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.239 Down BGP Notification sent Jun 8 11:49:14 x.x.x.131 239605: Jun 8 11:50:49.720 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.239 4/0 (hold time expired) 0 bytes Jun 8 11:49:17 x.x.x.131 239606: Jun 8 11:50:50.976 CEST: %BGP-5-ADJCHANGE: neighbor 2001:2000:3080:B4::1 Down Peer closed the session Jun 8 11:49:17 x.x.x.131 239607: Jun 8 11:50:52.976 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.20 Down BGP Notification sent Jun 8 11:49:17 x.x.x.131 239608: Jun 8 11:50:52.976 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.20 4/0 (hold time expired) 0 bytes Jun 8 11:49:17 x.x.x.131 239609: Jun 8 11:50:54.044 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.193.21 Down BGP Notification sent Jun 8 11:49:17 x.x.x.131 239610: Jun 8 11:50:54.044 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.193.21 4/0 (hold time expired) 0 bytes Jun 8 11:49:20 x.x.x.131 239611: Jun 8 11:50:56.204 CEST: %BGP-5-ADJCHANGE: neighbor 2001:7F8::1A0B:0:1 Down BGP Notification sent Jun 8 11:49:20 x.x.x.131 239612: Jun 8 11:50:56.204 CEST: %BGP-3-NOTIFICATION: sent to neighbor 2001:7F8::1A0B:0:1 4/0 (hold time expired) 0 bytes Jun 8 11:49:23 x.x.x.131 239613: Jun 8 11:50:58.400 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.63 Down BGP Notification sent Jun 8 11:49:23 x.x.x.131 239614: Jun 8 11:50:58.400 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.63 4/0 (hold time expired) 0 bytes Jun 8 11:49:23 x.x.x.131 239615: Jun 8 11:50:59.448 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.97 Down BGP Notification sent Jun 8 11:49:23 x.x.x.131 239616: Jun 8 11:50:59.448 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.97 4/0 (hold time expired) 0 bytes Jun 8 11:49:27 x.x.x.131 239617: Jun 8 11:51:01.664 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.131 Down BGP Notification sent Jun 8 11:49:27 x.x.x.131 239618: Jun 8 11:51:01.664 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.131 4/0 (hold time expired) 0 bytes Jun 8 11:49:27 x.x.x.131 239619: Jun 8 11:51:03.872 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.83 Down BGP Notification sent Jun 8 11:49:27 x.x.x.131 239620: Jun 8 11:51:03.872 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.83 4/0 (hold time expired) 0 bytes Jun 8 11:49:27 x.x.x.131 239621: Jun 8 11:51:03.872 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.192.156 Down BGP Notification sent Jun 8 11:49:30 x.x.x.131 239622: Jun 8 11:51:03.872 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.192.156 4/0 (hold time expired) 0 bytes Jun 8 11:49:30 x.x.x.131 239623: Jun 8 11:51:06.056 CEST: %BGP-5-ADJCHANGE: neighbor 80.81.194.50 Down BGP Notification sent Jun 8 11:49:30 x.x.x.131 239624: Jun 8 11:51:06.056 CEST: %BGP-3-NOTIFICATION: sent to neighbor 80.81.194.50 4/0 (hold time expired) 0 bytes
CPU load is constantly at 100% doing BGP and more BGP.
We have around 200 BGP sessions on DECIX and I would not want to shut them down and bring them up individually.
How can I get out of this deadlock?
Andy
-- ================================================================ :: :: Kevin Hodle | http://www.linkedin.com/in/kevinhodle :: :: PGP Key ID | fingerprint :: :: 0x803F24BE | 1094 FB06 837F 2FAB C86B E4BE 4680 3679 803F 24BE "Elegance is not a dispensable luxury but a factor that decides between success and failure. " -Edsgar Dijkstra ================================================================
participants (8)
-
Andy B.
-
Ingo Flaschberger
-
Jared Mauch
-
Kevin Hodle
-
Matthew Petach
-
Niels Bakker
-
Randy Bush
-
Richard A Steenbergen