The good news is that since Friday morning we have had a stable gated/BGP4 router interoperating with 90+ rcp_routed peers within AS690, and we are preparing for further gated/BGP4 deployment on AS690 routers.

The bad news is that we suffered a major system-wide rcp_routed failure along the way. On Thursday evening around 22:00EST, we exercised a dormant bug within rcp_routed: the gated router (ENSS205) sent a valid IBGP message to its 90+ rcp_routed peers, which did not process it correctly, and the routing daemon on those peers died. This resulted in an outage of about 45 minutes, although the gated peer (ENSS205) stayed up. The problem was immediately traced to an rcp_routed bug. The bug (which was not found on our testnet) was fixed, and the fix was deployed during Friday morning's configuration window.

The ENSS205 router is a semi-production system that has been used to stage new gated versions onto AS690. The last update to gated on this system was done outside the normal configuration maintenance window in an effort to expedite the deployment on other full-production nodes for the CIDR transition. While this outage could have occurred at any time (it was induced by a random external peer interaction), we are making a procedural improvement for all gated installations: special scheduled maintenance windows will be established to minimize the impact of any problems while still working towards a timely AS690 CIDR deployment.

There are two non-critical known problems with the AS690 gated that have been (or will be) addressed prior to the next installation. I have summarized these problems below for those interested.

Another new development is that the rcp_routed support for a default route that was previously deployed on AS690 was successfully tested during the last scheduled configuration maintenance window.
A non-disruptive configuration change can be administered at any time to enable AS690 to default to the AS1133 (gated BGP4/CIDR) peer, which is connected both to MAE-East and to CNSS57 (AS690) over a private 10Mb/s ethernet segment. This may be required to assist specific MAE-East peers with a timely CIDR deployment.

We have established two specific upcoming maintenance windows for gated deployment on AS690 routers. The next window will be Sunday morning Feb 14 (00:30EST - 08:00EST). The AS690 routers that we have picked as candidates to convert to run gated (pending consent of the peers) include:

    ENSS194 (ANS Ann Arbor Backup Router)
    ENSS158 (Maui HPCC)
    CNSS120 (Honolulu CNSS)
    ENSS205 (new Gated re-deployment)
    ENSS160 (ANS Elmsford)
    ENSS139 (Rice U)
    ENSS131 (Ann Arbor)

There is an overlapping scheduled power maintenance window for the MCI New York POP (Sunday morning 00:30EST-02:30EST) that we will work around; it should not interfere with the gated deployment activities.

The ANS NOC has contacted each of the peers that will be affected by service disruptions on these nodes to confirm their consent, and NSR messages have been sent for each of these maintenance windows describing the gated deployment and the potential for AS690 routing instabilities during these windows. Following this maintenance we would like to validate with the peers that all configurations with external peers are behaving as expected.

If this maintenance is successful we would like to continue the AS690 deployment during the Tuesday morning Feb 15th configuration maintenance window (05:00EST - 08:00EST) for the following candidate nodes:

    ENSS136 (College Park)
    ENSS145 (FIX-East)
    ENSS144 (FIX-West)

We would try to identify any problems within the maintenance window. If we could not complete this maintenance within the window, these routers would be rolled back to run rcp_routed.
Once we have stabilized the above set of nodes, deployment on the rest of AS690 could proceed as rapidly as can be scheduled (within maintenance windows).

For general interest, there are two known problems with the AS690 gated that we expect to fix before the AS690 deployment concludes. These may be summarized as:

1. Jurassic LSPs. At the time rcp_routed was originally developed, there was an IS-IS LSP packet format that was designed to carry external routes, before IBGP was implemented to carry these. Modern rcp_routed no longer cares about this format, but specific production rcp_routed nodes still occasionally generate these "Jurassic LSPs", which have a header but carry no external routes, at rcp_routed startup time (we cannot reproduce this on our testnet). Gated does not know what to do with these packets, so it drops them and logs an error message. Rcp_routed acks them and goes on about its business. Rather than modifying rcp_routed to eliminate these LSPs, we instead worked around this by adding a few lines of code to the gated SLSP to ack these packets and ignore them. This does not cause problems running gated on a few production network nodes and will not bother ENSS205, but we should and will fix this before we scale the deployment up further.

2. BGP memory leak. We have known from our testnet experience that there is a slow memory leak in the gated BGP code that is exercised when a connection to an external peer fails and gated tries to re-establish the session. The only way we could observe this problem on the testnet was to force a gated machine to try to establish sessions (2 times per second) for about 4 days with 80 external peers that refused to connect. We would not expect to see problems with this on nodes that are configured properly, but in any case we expect to have a fix for this shortly, and it will be retrofitted on existing gated nodes during a future installation window.
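
For readers who want a concrete picture of the Jurassic LSP workaround in item 1, the following is a minimal illustrative sketch in Python, not the actual gated SLSP code. The names `Lsp`, `Action`, `classify` and the `legacy_external_format` flag are hypothetical and invented for this example; the only behavior taken from the description above is that gated originally dropped and logged these packets, and with the workaround acks and ignores them.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCESS = "process"              # ordinary LSP: handle normally
    ACK_AND_IGNORE = "ack_ignore"    # workaround: ack the Jurassic LSP, then discard it
    DROP_AND_LOG = "drop_log"        # behavior before the workaround


@dataclass(frozen=True)
class Lsp:
    """Hypothetical, heavily simplified view of a received LSP."""
    sender: str
    sequence: int
    # Assumption: the receiver can tell from the header that this is the
    # old external-route LSP format ("Jurassic LSP").
    legacy_external_format: bool


def classify(lsp: Lsp, workaround_enabled: bool = True) -> Action:
    """Decide what the receiver does with an incoming LSP."""
    if not lsp.legacy_external_format:
        return Action.PROCESS
    if workaround_enabled:
        return Action.ACK_AND_IGNORE
    return Action.DROP_AND_LOG


if __name__ == "__main__":
    # "example-node" is a placeholder, not a real AS690 router name.
    jurassic = Lsp(sender="example-node", sequence=1, legacy_external_format=True)
    print(classify(jurassic))
    print(classify(jurassic, workaround_enabled=False))
```

The sketch keeps the decision separate from any packet I/O so that the three outcomes are easy to compare; how the real code recognizes the old format is not described here and is left as an assumption.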
Jordan Becker