ANSNET/NSFNET Backbone Engineering Report
December 31, 1991

Jordan Becker                           Mark Knopper
becker@ans.net                          mak@merit.edu
Advanced Network & Services Inc.        Merit Network Inc.

Summary
=======

The T3 network continued to perform reliably during the month of December. A change freeze and stability test period was observed on the T3 network during December 13-20 in preparation for the cutover of additional traffic from the T1 network, which is scheduled to begin in mid-January. During the stability test period, two test outages were scheduled during off-peak hours. Some final software changes are being deployed following the stability period, and a routing migration plan has been developed to support the continued migration of traffic from T1 to T3.

Changes were deployed on the T1 backbone during December to improve reliability. A new routing daemon was deployed to fix the chronic EGP peer loss problem that had been exhibited most notably on NSS10, in addition to some other nodes. The T1 network continues to experience a low level of congestion, primarily at the ethernet interfaces on some nodes.

The December 1991 T1 and T3 traffic statistics are available via FTP in pub/stats on merit.edu. The total inbound packet count for the T1 network was 9,683,414,659, down 4.4% from November. 529,316,363 of these packets entered from the T3 network. The total inbound packet count for the T3 network was 2,201,976,944, up 38.7% from November. 489,233,191 of these packets entered from the T1 network. The combined total inbound packet count for the T1 and T3 networks, less the cross-network traffic counted on both backbones (9,683,414,659 + 2,201,976,944 - 529,316,363 - 489,233,191), was 10,866,842,049, down 3.3% from November.

Finally, the plan to deploy the RS/960 technology for Phase III of the T3 backbone is taking shape. Testing on the T3 Research Network will begin in January, with possible deployment in late February.

T1 Backbone Update
==================

NSS Software Problems and Changes

We had been experiencing an EGP session loss problem on several T1 backbone NSS systems, occurring most frequently on NSS10 at Ithaca. This problem has been fixed by a change to the rcp_routed program running on the RCP nodes in the backbone NSSs. The problem was due to the timing between the creation of routing packets and their transmission during transient conditions. The new software prevents the simultaneous loss of EGP sessions across multiple PSPs in an NSS that had been observed at some nodes.

Since this problem was corrected, we have experienced a few isolated disconnects of an EGP/BGP peer on NSS10 at Ithaca, which are believed to be unrelated to the earlier problem. This symptom occurs less frequently and involves only one PSP at a time. The latest occurrences have been traced to an isolation of the PSP from the RCP, caused by CPU looping on the PSP during a flood of traffic sourced from the local ethernet interface. We are working to attach a sniffer to the local ethernet to determine the source of these traffic floods on the PSP ethernet interface.

Another problem, seen roughly three times a month, is a crash of a T1 RCP node due to a known virtual memory bug in the RT kernel. We are working on both of these problems now and hope to have them corrected soon.

We continue to experience congestion on the T1 backbone. We are collecting 15-minute peak and average traffic measures via SNMP on all interfaces, and we also sample some interfaces at shorter intervals to look at burst traffic. We have occasionally measured sustained T1 backbone link utilization around 50% average, with peaks above 70% on several T1 lines. We have also experienced high burst data streaming on the local EPSP ethernet interface (3500 packet-per-second bursts at an average of 200 bytes per packet). We have already taken a number of actions to reduce the congestion, including the addition of split-EPSP routers and T1 links, and the installation of dual-ethernet EPSP systems in which the routes are split across the two ethernet interfaces. These have been deployed at Ithaca, College Park, and Champaign. There are a number of things we can still do to improve this; however, the greatest reduction in congestion has come, and will continue to come, from the migration of traffic from the T1 to the T3 network.
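The utilization figures above are derived from interface octet counters sampled via SNMP. The following is a minimal sketch of that computation; the counter values and the 15-minute interval are examples only, and the polling machinery itself is omitted:

    # Sketch: average link utilization from two samples of an SNMP
    # interface octet counter (e.g. ifInOctets). Polling is not shown.

    T1_BPS = 1544000  # DS1 line rate in bits per second

    def utilization(octets_start, octets_end, interval_seconds, line_bps=T1_BPS):
        """Average utilization over the interval, as a fraction of line rate."""
        bits_carried = (octets_end - octets_start) * 8
        return bits_carried / (interval_seconds * line_bps)

    # Example: ~87 MB crossing a T1 link in a 15-minute (900 second)
    # interval corresponds to roughly 50% average utilization.
    u = utilization(0, 87000000, 900)
    print("average utilization: %.0f%%" % (u * 100))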
ICMP Network Unreachable Messages

The T1 network has exhibited some problematic behavior, previously addressed on the T3 network, in which ICMP network unreachable messages are generated by core backbone nodes and transmitted outside the backbone during routing transients. This manifests itself as a problem for host software implementations when routing transients occur in the backbone due to circuit problems or other causes. It was addressed on the T3 network by implementing an option that limits the generation of these unreachable messages to nodes equipped with an external LAN interface, rather than allowing the core backbone nodes to generate them. An equivalent implementation of this option is now being tested for deployment on the T1 network.

T1 Backbone Circuit Problems

On the T1 backbone nodes, CSU circuit error reporting data is not made available to the RT router software via SNMP, as is the case on the T3 backbone. This makes it more difficult to generate NOC alerts that correspond to circuit performance problems recorded by the CSU equipment. However, the PSP nodes are able to detect DCD transitions (known as "DCD waffles") and record them in the router log files. An increase in circuit performance problems on the T1 backbone has been observed on several links lately, as evidenced by DCD waffles, and some actions have been taken to resolve the problems. These include work in progress to provide more timely reports on all DCD waffle events, as well as direct integration of carrier-monitored T1 ESF data with our SNMP-based network monitoring tools. Procedures have been improved for the diagnosis and troubleshooting of T1 backbone circuit problems in cooperation with MCI and the local exchange carriers. We have also worked to improve the procedures and communications between our operators and engineers and our peer network counterparts.
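As an illustration of the kind of DCD waffle reporting that is in progress, the sketch below tallies DCD transitions per interface from router log files. The log line format shown is invented for the example; the actual PSP log syntax differs:

    import re
    from collections import Counter

    # Hypothetical log entry: "Dec 14 03:22:10 psp7 DCD transition on serial1"
    DCD_PATTERN = re.compile(r"DCD transition on (\S+)")

    def count_waffles(log_lines):
        """Tally DCD transitions ("waffles") per interface."""
        waffles = Counter()
        for line in log_lines:
            match = DCD_PATTERN.search(line)
            if match:
                waffles[match.group(1)] += 1
        return waffles

    # Interfaces whose waffle counts exceed a threshold could then be
    # reported to the NOC as candidate circuit problems.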
T3 Backbone Update
==================

Summary

During the week of December 13-20, a change freeze was conducted and stability measurements were performed. No software or hardware changes were administered to the backbone nodes, with the exception of the normal Tuesday and Friday morning routing configuration updates. Prior to the change freeze and stability week, several changes and software enhancements were introduced on the T3 system to address known problems.

During stability week, two problems were identified. One problem involves the loss of connectivity to an adjacent IS-IS neighbor. Following stability week, a new rcp_routed program was installed on the network with instrumentation developed to help identify this problem. Unfortunately, the problem has not been observed again since the new code was installed.

A new plan for T1-T3 routing and traffic exchange has been developed. This will support the continued migration of traffic from the T1 to the T3 system, which is expected to commence in January 1992.

Pre-Stability Period Changes

--Safety Net

The remaining two links for "Safety Net" were installed and configured. Safety Net is a collection of 12 T1 links connecting the CNSS nodes in a redundant fashion within the core of the T3 network. Safety Net has proven to be a useful backup path on a couple of occasions, each several minutes in duration, when all T3 paths out of a core node became unusable due to a T3 interface or CNSS failure. Safety Net will remain in place until the existing T3 interface hardware is replaced with the newer RS/960 interface hardware, after which it will no longer be necessary.

--Routing Software Changes

Three changes were made to the rcp_routed daemon software.

An MD4-based digital signature was implemented in the rcp_routed software to ensure the integrity of IBGP messages between systems.

An enhancement to increase the level of route aggregation was made in the BGP software, reducing the size of external routing updates sent to peer routers. This provided a workaround for a problem in some of the regional routers supporting external BGP, in which the peer router would freeze after receiving a BGP update message.

The "route loss" problem mentioned in the November 1991 report was identified and fixed prior to the commencement of the stability period. This was identified as a bug involving the exchange of information between the external and internal routing software.
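The report does not specify the signing mechanism in detail; as a rough sketch of digest-based message integrity in the spirit of the MD4 change above, the following computes an MD4 digest over a message with a shared secret appended. The secret and helper names are invented, and MD4 support in Python's hashlib depends on the underlying OpenSSL build:

    import hashlib

    SHARED_SECRET = b"example-secret"  # placeholder; key management not shown

    def md4_digest(message, secret=SHARED_SECRET):
        """MD4 digest over the message plus a shared secret.
        ("md4" is only available if the OpenSSL backing hashlib provides it.)"""
        h = hashlib.new("md4")
        h.update(message + secret)
        return h.digest()

    def verify(message, received_digest):
        """Accept a routing message only if its digest matches."""
        return md4_digest(message) == received_digest

A receiver that shares the secret can recompute the digest and discard any message that was altered in transit or injected by a third party.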
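The effect of the route aggregation enhancement can be illustrated with Python's ipaddress module, which collapses adjacent prefixes into a single covering route. The prefixes below are examples only; the actual rcp_routed aggregation logic is not reproduced here:

    import ipaddress

    # Four adjacent example networks announced individually...
    routes = [
        ipaddress.ip_network("192.168.0.0/24"),
        ipaddress.ip_network("192.168.1.0/24"),
        ipaddress.ip_network("192.168.2.0/24"),
        ipaddress.ip_network("192.168.3.0/24"),
    ]

    # ...collapse into one covering announcement, shrinking the update
    # that must be carried to (and parsed by) each external peer.
    aggregated = list(ipaddress.collapse_addresses(routes))
    print(aggregated)  # [IPv4Network('192.168.0.0/22')]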
--AIX Build 3.0.62 Kernel Installed

A new system software build was deployed on all RS/6000 nodes in the T3 backbone to fix several problems. One of these was the T960 "sticky T1" problem, which would cause a delay in packet forwarding across a T1 interface. Another problem that was fixed involved a delay in the download of routes from the RS/6000 system to the T960 ethernet and T1 "smart" interface cards.

Change Freeze and Stability Week 12/13-12/20

During this period, no hardware configuration or software changes were administered, and several reliability and stability tests were performed. Some of these tests included scheduled test outages of selected nodes during off-peak hours.

A test outage of the T1/T3 interconnect gateway was performed. The external BGP sessions on the Ann Arbor interconnect gateway were disconnected, forcing the Houston backup interconnect gateway to become operational. This transition occurred automatically over a 15-minute period. After the switchover, the Ann Arbor primary gateway was put back into production.

Another test that was performed was a node outage of the Denver T3 backbone CNSS. This node was chosen because it does not yet support any production ENSS nodes. The routing daemon on this node was taken down and brought back up again. This had no unexpected results and no noticeable impact on other network traffic during the IS-IS routing convergence, which was measured to be on the order of 25 seconds across the T3 network.

As a result of these tests and the measurement of improved T3 backbone stability, the change freeze week was concluded successfully on 12/20. Plans are described below to migrate additional traffic from the T1 to the T3 network in January.

Post-Stability Week Actions and Plans

--New Routing Software

The new rcp_routed with instrumentation to debug the IS-IS adjacency loss problem was installed. This problem has not occurred since 12/22.

--AIX Kernel Build 3.0.63 Targeted for Installation

A new software build is being tested at this time to address the T960 ethernet freeze problem, and to support a full CRC-32 link layer error check computed in software (a sketch of such a check appears after the migration plan below). This new software build will be deployed in two phases. Build 63 also includes a version of the NNstat feature, which allows the net-to-net traffic statistics matrix to be collected (also sketched below). This is a necessary change targeted for deployment prior to migrating a major portion of the T1 backbone traffic over to T3.

--Routing Architecture Plan

With the traffic migration from T1 to T3, it will be necessary to split the announcements of routes from the T3 network to the T1 network (for networks that are connected to both T1 and T3) across multiple T1/T3 interconnect gateways, both to balance the load and to ensure that the IS-IS packets carrying the announcements do not become excessively large. Routing announcements from the T1 to the T3 network will be made on all primary interconnect gateways, as will routing announcements for networks that are connected only to the T3 network. The routing configuration database modifications and configuration updates to support this design are already underway.

To provide improved redundancy for traffic between the T1 and T3 networks, additional T1/T3 interconnect gateways will be established. A fourth T1/T3 gateway is being installed at the Princeton site to act as backup to the Ann Arbor primary gateway. A fifth and sixth gateway are planned for future expansion, with the expectation that IS-IS packet sizes will increase as the total number of networks announced to the T3 and T1 backbones grows.

--T1->T3 Traffic Migration Plan

A plan has been drafted that addresses the T1->T3 traffic migration in support of peer networks that are not already using the T3 network. Regional networks that maintain a co-located peer router with both a T1 NSS and a T3 ENSS are requested to maintain EGP/BGP peer sessions with both the T1 and T3 networks. This will allow them to announce their networks to both the T1 and T3 systems. It is advised that regionals have their peer routers learn default routes from the T1 NSS and explicit routes for all destinations from the T3 ENSS. This will result in all traffic destined for a site primarily reachable via T3 taking the T3 path, and likewise for T1 (a sketch of this route selection appears below). The goal here is to minimize traffic on the interconnect gateways. Primary reachability via T3 or T1 will be managed through the adjustment of routing metrics on the T1 and T3 systems.

An analysis of the traffic associated with each Autonomous System has been conducted. Migration of traffic will be implemented by choosing the AS pairs that account for the largest inter-AS traffic flows. These will be moved over together, in pairwise fashion, as part of a scheduled routing configuration update. We are working with some regionals now to schedule this. We will proceed slowly with this migration: at first, no more than one pair of ASs will be moved over in a single week. We are coordinating this with the regionals, and we hope to have a significant portion of the T1 traffic cut over to T3 by the end of February.

Some traffic will likely remain on the T1 backbone for several reasons. Since the T3 nodes do not yet support the OSI CLNP protocol, that traffic will remain on the T1 backbone. There are also some international networks that do not peer directly with the T3 network and will announce themselves only to the T1 backbone.
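The software CRC-32 link layer check included in Build 3.0.63 is a standard computation. The following is a minimal illustration using Python's zlib, with an invented 4-byte trailer convention standing in for the actual link layer framing:

    import zlib

    def frame_with_crc(payload):
        """Append a CRC-32 of the payload as a 4-byte big-endian trailer."""
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def check_frame(frame):
        """Verify the trailing CRC-32; True only if the frame is intact."""
        payload, trailer = frame[:-4], frame[-4:]
        return zlib.crc32(payload).to_bytes(4, "big") == trailer

    frame = frame_with_crc(b"example link-layer payload")
    assert check_frame(frame)
    corrupted = b"X" + frame[1:]       # damage the first payload byte
    assert not check_frame(corrupted)  # the error check catches it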
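The net-to-net matrix collected by the NNstat feature can be pictured as a counter keyed by (source network, destination network) pairs. The toy sketch below uses a simplified classful reduction of addresses to networks; the addresses are examples only:

    from collections import Counter

    def classful_net(addr):
        """Reduce a dotted-quad address to its classful network (simplified:
        class D/E addresses are not handled)."""
        octets = [int(x) for x in addr.split(".")]
        if octets[0] < 128:
            return "%d.0.0.0" % octets[0]                        # class A
        if octets[0] < 192:
            return "%d.%d.0.0" % (octets[0], octets[1])          # class B
        return "%d.%d.%d.0" % (octets[0], octets[1], octets[2])  # class C

    matrix = Counter()

    def record_packet(src_addr, dst_addr):
        """Accumulate one packet into the net-to-net traffic matrix."""
        matrix[(classful_net(src_addr), classful_net(dst_addr))] += 1

    record_packet("35.1.1.42", "192.203.229.10")
    record_packet("35.1.2.7", "192.203.229.10")
    # matrix is now {("35.0.0.0", "192.203.229.0"): 2}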
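The default-plus-explicit arrangement in the migration plan above steers traffic as intended because of longest-prefix matching: an explicit route learned from the T3 ENSS is always more specific than the default learned from the T1 NSS, so it wins whenever it is present. A small sketch, with invented next-hop labels and example prefixes:

    import ipaddress

    # A regional peer router's table: explicit routes learned from the
    # T3 ENSS, plus a default route learned from the T1 NSS.
    table = [
        (ipaddress.ip_network("0.0.0.0/0"), "T1-NSS"),          # default
        (ipaddress.ip_network("35.0.0.0/8"), "T3-ENSS"),        # explicit
        (ipaddress.ip_network("192.203.229.0/24"), "T3-ENSS"),  # explicit
    ]

    def next_hop(destination):
        """Longest-prefix match: the most specific matching route wins."""
        addr = ipaddress.ip_address(destination)
        matches = [(net, hop) for net, hop in table if addr in net]
        net, hop = max(matches, key=lambda m: m[0].prefixlen)
        return hop

    print(next_hop("35.1.1.1"))  # "T3-ENSS": explicit route beats the default
    print(next_hop("10.9.8.7"))  # "T1-NSS": only the default route matches

Destinations reachable only via the T1 backbone are covered by the default route, so they continue to take the T1 path without crossing the interconnect gateways.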
Phase III T3 Network RS/960 T3 Adapter Upgrade
==============================================

A phased implementation plan for the new RS/960 T3 adapters is being developed, and testing will begin on the T3 Research Network in mid-January. The testing phase will take over a month and will exercise many features and test cases. Redundant backup facilities to be used during the phased upgrade will be supported and tested on the research network. Performance and stress testing will also be conducted. Test outages and adapter swaps will take place to simulate expected maintenance scenarios.

The RS/960 T3 adapters do not interoperate across a DS3 serial link with the existing T3 adapters, so the phased upgrade must be administered on a link-by-link rather than node-by-node basis. Deployment will be coordinated with the peer networks to ensure that adequate advance notice and backup planning is afforded. Deployment could begin in late February, depending upon the test results from the T3 Research Network.