EGP-related routing problem on T3 backbone
During the last 72 hours, we have seen multiple instances of routing instability across the T3 network, including RS6000 CPU starvation on several ENSS nodes and route flapping on E139, E138, E136, and E134. Several sites have called to complain about route flapping and instability. We spent considerable time over the last couple of days trying to isolate this. The following is our current understanding of the problem and our plan for correction. We will update the list as we learn more.

With the new network announcements that resulted from the scheduled configuration update last Friday (8/7), the T3 ENSS nodes started sending EGP updates larger than 8KB to regional peer routers at all sites configured for explicit routing. The problem starts when the ENSS sends the regional peer an 8KB+ update. The peer router may flap if it is running software that does NOT support 8KB+ EGP updates.

Also, there is a dormant bug in the RS6000 rcp_routed EGP code in which routes imported via IBGP do not get flushed out of a queue. When the regional router flaps (misses one message from the peer) due to the 8KB+ update described above, the rcp_routed routes derived from EGP that are sitting in the queue do not get installed, and this might indirectly result in a routing inconsistency between the ENSS and its CNSS neighbor.

We suspect that a simple fix to the T3 network problem is to flush the queue every time we time out the EGP-derived routes. We have a new version of rcp_routed that flushes the queue and has a trace statement that logs the event. We will install the new rcp_routed on C99 and C51 and, if that is successful, we will install it on E138 at 5am EST 8/11 (with SURAnet approval). If this is successful, we will install it on several other nodes as emergency maintenance on 8/12 pm. We will send another note to the nwg list tomorrow with an update on the problem.
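For readers who want a picture of the proposed fix, the following is a minimal sketch only. It is not the rcp_routed source. The class, method names, and trace format are all hypothetical, and it assumes that "flush" means discarding the queued entries at the moment the EGP-derived routes are timed out.

```python
from collections import deque


class EgpPeer:
    """Toy model of one EGP neighbor. All names here are hypothetical."""

    def __init__(self, name):
        self.name = name
        self.pending = deque()  # routes queued but not yet installed
        self.installed = set()  # EGP-derived routes currently in use
        self.trace = []         # stand-in for the daemon's trace log

    def queue_route(self, net):
        """Queue a route for later installation."""
        self.pending.append(net)

    def install_pending(self):
        """Move every queued route into the installed set."""
        while self.pending:
            self.installed.add(self.pending.popleft())

    def on_egp_timeout(self):
        """Time out the EGP-derived routes and flush the queue with them.

        Assumption: flushing means discarding whatever is still queued,
        so that no stale entry outlives the timeout.
        """
        self.installed.clear()
        flushed = len(self.pending)
        self.pending.clear()
        # Log the event, mirroring the trace statement described above.
        self.trace.append(
            f"{self.name}: EGP timeout, flushed {flushed} queued route(s)"
        )
        return flushed
```

In this sketch, a peer with one installed route and one still-queued route ends up with both collections empty after on_egp_timeout(), and the trace records that one queued route was flushed.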
However, this fix will not solve the problem of the 8KB+ updates causing route flapping on several regional peer routers. Peer networks that are running BGP should not experience this problem. We have contacted Cisco, Proteon, and Wellfleet, and have learned the following regarding their suggested software fixes to this problem.

Cisco
-----

Whether you experience this problem depends on the version of software that you are running. To find out what you're currently running, do a "show buffers" and note the size of the huge buffers. This is the maximum size IP packet that the router can reassemble. If EGP updates come in that are larger than this, then you will get reassembly failures, which can be seen in "show ip traffic".

In later Cisco releases, there is now a knob so that you can change the buffer size on huge buffers. Using this, you can reassemble IP packets of up to 64KB.

The following releases support the following buffer sizes:

    v8.1         8KB buffer
    v8.2         12KB buffer
    v8.3(1)      12KB buffer
    v8.3(>=3)    18KB buffer, but configurable
    v9.0         18KB buffer, but configurable

Proteon Routers
---------------

The Proteon router has a fixed size reassembly buffer. Any packets bigger than the reassembly buffer will be dropped. Proteon will generally increase the size of the reassembly buffer in each release. Its current size (in Proteon releases 11.0 and greater) is 12KB. In release 10.0b, a large number of which are probably still in the field, the reassembly size was 8KB.

If a site thinks it is having this problem with a Proteon router, it can contact Proteon customer service to get the latest software revision. Customer service is familiar with the reassembly buffer issues.

Wellfleet
---------

The Wellfleet router also has a fixed size reassembly buffer. Any packets bigger than the reassembly buffer will be dropped. Any Wellfleet router running a software release of v5.6 or older will have a 4KB buffer. This release stopped shipping 2 years ago.
All new releases are compiled with a buffer size of 16KB and should not experience this problem.

--Jordan Becker, ANS
  Mark Knopper, Merit