update regarding 12/3/94 service disruption
All,

ANS has a fix being exercised on our testnet for the routing software (known as "GateD" - gateway daemon) bug which caused the service disruption on Saturday, 12/3/94. The sequence of events leading to the problem is extremely obscure, as should be evident from the description below. This particular bug has been exercised only twice before in the history of our use of this software and will not appear again following deployment of the new software.

We will implement a phased rollout of the new routing software. The rollout will begin this coming week on a small number of routers. During the course of the week the behavior of the new software will be observed. Pending successful results, a network-wide deployment will take place the week of 12/19.

Steve Heimlich
Manager, Infrastructure Development
ANS

-------

We have these prerequisites:

- there is a network X which is announced into our backbone

- there is a primary announcement (1), a secondary announcement (2), and a tertiary announcement (3)

- one ENSS A acts as (1) and one ENSS B acts as both (2) and (3) (e.g., MAE-East may speak with Sprint and Alternet, which may be secondary and tertiary providers, respectively)

- ENSS A must have a lower router ID than ENSS B (i.e., A < B)

and this sequence of events:

- ENSS A goes away non-gracefully such that iGP connectivity from the backbone to ENSS A is withdrawn but the iBGP session stays up (e.g., a power loss or circuit outage but not a clean GateD shutdown)

- all routers notice loss of iGP connectivity to ENSS A within one minute and reset the next hop for route (1) to network X to be null, keeping the route in the BGP RIB in case iGP connectivity is restored

- in addition to the above, ENSS B injects (2) into the backbone via iBGP

- the exterior peer providing (2) withdraws the route to network X within 2 minutes of the initial AS 690 loss of iGP connectivity to ENSS A

- ENSS B then injects (3) into the backbone via iBGP

- all other routers see that the preference for network X has worsened and therefore traverse the BGP RIB to find the best current route to network X, attempting to verify as well that any route under consideration has a valid next hop

- during the traversal, the routers mistakenly use an incorrect pointer to verify existence of a good next hop, not realizing that the former primary route (1) has a null next hop

- due to a bug in some comparison logic, the formerly primary route (1) is selected from the BGP RIB if A < B and is installed into the kernel

- the iBGP sessions from all backbone machines to ENSS A time out three minutes after loss of iGP connectivity to ENSS A

- GateD crashes when it attempts to delete the mistakenly installed formerly primary route (1) from the kernel
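
The traversal step in the sequence above can be illustrated with a small model. This is only a sketch under stated assumptions, not GateD's code: the Route class, the function names, the numeric preferences, and the idea that the faulty check reads the next hop of the newly announced route instead of the candidate are all hypothetical, chosen to show how a null next hop on route (1) could go unnoticed. The router ID comparison (A < B) is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    label: str               # "(1)", "(2)", "(3)" as in the description
    preference: int          # assumption: a lower value is more preferred
    next_hop: Optional[str]  # None models the null next hop


def select_best(rib: List[Route]) -> Optional[Route]:
    """Intended behaviour: best route among those whose own next hop is valid."""
    usable = [r for r in rib if r.next_hop is not None]
    return min(usable, key=lambda r: r.preference, default=None)


def select_best_buggy(rib: List[Route], just_announced: Route) -> Optional[Route]:
    """Hypothetical faulty traversal: the next-hop check uses the wrong route."""
    best = None
    for candidate in rib:
        # BUG (illustrative only): validity is read from `just_announced`
        # rather than from `candidate`, so a null next hop on the
        # candidate is never seen.
        if just_announced.next_hop is None:
            continue
        if best is None or candidate.preference < best.preference:
            best = candidate
    return best


# State of the RIB after (2) is withdrawn and (3) is injected:
# route (1) is still present but has a null next hop.
rib = [
    Route("(1)", preference=1, next_hop=None),
    Route("(3)", preference=3, next_hop="ENSS B"),
]

correct = select_best(rib)
wrong = select_best_buggy(rib, just_announced=rib[1])
print(correct.label, wrong.label)
```

In this model the correct selection returns route (3), while the faulty one returns route (1), which has no usable next hop. That corresponds to the mistakenly installed kernel route whose later deletion triggered the crash.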