Hi.  I may have procrastinated on this too long for it to be in the
official Internet Monthly Report, but here is the report in any case.

Mark

------------------------------------------------------------------------

              ANSNET/NSFNET Backbone Engineering Report
                            October 1992

        Jordan Becker, ANS                Mark Knopper, Merit
        becker@ans.net                    mak@merit.edu


T3 Backbone Status
==================

During October, the T3 network router software was upgraded to support
10,000 destination networks with up to four routes per destination.
Development work in support of the migration to GATED routing software
continues on schedule.  Problems addressed in October included a major
routing software bug that resulted in three separate routing
instability events, an RS960 memory parity problem, and an AIX TCP
software bug.

The phase-4 backbone upgrade activities were completed in October.
Significant positive experience was gained with the RS960 FDDI
interface during the month, which will lead to additional FDDI
deployments on T3 ENSS nodes.

Preparations continued for the dismantling of the T1 backbone, which is
scheduled for November.  Activities included testing of OSI CLNP
encapsulation over the T1 backbone, deployment of the redundant backup
circuits for the T3 ENSS gateways at each regional network, collection
of network source/destination traffic statistics on the T3 backbone,
and cutover of the ESnet networks to the T3 backbone.  The EASInet and
CA*net systems are not yet using the T3 backbone and will be cut over
in November.


Backbone Traffic and Routing Statistics
=======================================

The total inbound packet count for the T1 network was 4,213,308,629, up
20.0% from September.  469,891,322 of these packets entered from the T3
network.  The total inbound packet count for the T3 network was
18,940,301,000, up 20.8% from September.  185,369,688 of these packets
entered from the T1 network.  The combined total inbound packet count
for the T1 and T3 networks (less cross-network traffic) was
22,498,348,619, up 20.1% from September.

As of October 31, the number of networks configured in the NSFNET
Policy Routing Database was 7354 for the T1 backbone and 7046 for the
T3 backbone.  Of these, 1343 networks were never announced to the T1
backbone and 1244 were never announced to the T3 backbone.  For the T1,
the maximum number of networks announced to the backbone during the
month (from samples collected every 15 minutes) was 5378; on the T3 the
maximum number of announced networks was 5124.  The average number of
networks announced on 10/31 was 5335 on the T1 and 5085 on the T3.


Routing Software on the T3 Network
==================================

During October, the T3 network routing and system software was upgraded
to increase the on-card routing table size on the RS960 interfaces
(T3/FDDI) and T960 interface (T1/ethernet) to 10,000 destination
networks with up to 4 alternate routes per destination network.  The
previous limit was 6000 destination networks with up to 4 alternate
routes per destination.

A serious routing bug was exposed that caused instabilities across the
entire T3 system during three separate events, the first on 10/19 and
the second and third on 10/23.  We successfully installed a new version
of the rcp_routed software on all T3 backbone nodes to fix the problem.
This bug involved the interface between the routing daemon and the SNMP
subagent.  With the addition of the 86th AS peer on the T3 backbone,
the buffer between the routing daemon and the SNMP subagent would get
corrupted and induce a failure of the routing software.
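
The report does not describe the internal data structures of rcp_routed
or its SNMP subagent interface, but the failure class is a familiar
one: a fixed-capacity table overrun when one more entry (here, the 86th
peer) is added than the table was sized for.  The following C sketch is
purely illustrative; the structure names and the capacity constant are
invented, and the actual fix may differ.

    /* Hypothetical sketch of a fixed-capacity peer table.  The names
     * and the AS_PEER_MAX value are invented for illustration; they do
     * not come from the rcp_routed implementation. */
    #include <stdio.h>

    #define AS_PEER_MAX 85          /* assumed capacity, exceeded by the 86th peer */

    struct as_peer {
        unsigned short as_number;   /* peer autonomous system number */
        unsigned long  peer_addr;   /* peer IP address */
    };

    static struct as_peer peer_table[AS_PEER_MAX];
    static int peer_count = 0;

    /* Without a bounds check, adding one peer too many writes past the
     * end of the table and corrupts adjacent memory; the explicit
     * check rejects the peer instead of corrupting the buffer. */
    int add_peer(unsigned short asn, unsigned long addr)
    {
        if (peer_count >= AS_PEER_MAX)
            return -1;
        peer_table[peer_count].as_number = asn;
        peer_table[peer_count].peer_addr = addr;
        peer_count++;
        return 0;
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 90; i++)
            if (add_peer((unsigned short)(100 + i), 0x0a000001UL + i) < 0)
                printf("peer %d rejected: table full (%d entries)\n",
                       i + 1, AS_PEER_MAX);
        return 0;
    }
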
With the increased number of routes in the on-card routing tables, we
have begun to observe problems with the performance of route
installs/deletes between the on-card forwarding tables and the AIX
kernel.  During transient routing conditions we may experience high
card/system traffic due to route downloads, which can cause transient
instabilities.  We plan to deploy new software in November that will
improve the efficiency of route installs/deletes between the on-card
forwarding tables and the AIX kernel.

Also during November, we plan to install new routing software that will
support static routes.  These can be used in situations where there is
no peer router available to announce the shared interface and any
networks behind it.  This version of the routing software will also be
able to selectively filter routes out of the local routing tables on a
per-network and/or per-AS basis.  The software will also increase the
limit to 16 peer AS numbers per ENSS, and improve the checks on the
size of BGP updates and attributes.

The development of GATED software to replace the rcp_routed software
base is proceeding on schedule.  During October, the BGP4 protocol was
developed and unit tested in GATED, along with the interim link state
IGP that will be used to interoperate with internal nodes running
rcp_routed.  We expect to deploy GATED software on the T3 network in
early 1993, following the upgrade to the AIX 3.2 operating system.


RS960 Memory Parity Problem
===========================

During October, we continued to experience some problems on CNSS nodes
due to the failure of parity checking logic within the on-card memory
on selected RS960 T3 adapters.  These problems have largely been
isolated to a few specific nodes, including CNSS97 (Denver), CNSS32
(New York), and CNSS40 (Cleveland).  These outages do not generally
result in ENSS isolation from the network, since only a single
interface is affected and redundant connectivity is employed on other
CNSS interfaces.  The problem can be cleared by a software reset of the
interface.  Some of these problems have been alleviated with hardware
replacement (e.g. CNSS97 in Denver).


AIX TCP Software Problem
========================

During October we experienced a TCP session deadlock problem affecting
I-BGP sessions between particular ENSS routers.  A bug was found in the
TCP implementation of the AIX 3.1 operating system (the same bug exists
in BSD) where an established TCP session between two ENSS routers (e.g.
for I-BGP information transfer) would hang and induce a high traffic
condition between an RS960 T3 interface and the RS/6000 system
processor.  This would cause one of the ENSS routers on either end of
the TCP session to suffer from performance problems until the condition
was cleared with a reboot.  This problem occurred on several ENSS nodes
in October, including ENSS134 (Boston), ENSS144 (Ames), ENSS131 (Ann
Arbor), and ENSS129 (Champaign).  A fix for this problem was identified
and successfully tested in October.  It will be released as part of a
new system software build for the RS/6000 router in November.


Phase-4 Deployment Complete
===========================

The phase-4 network upgrade was completed in October '92.  The final
steps in the upgrade involved the installation of CNSS36 in the New
York POP and the completion of the T3 DSU PROM upgrades.  The DSU
firmware upgrade supports new C-bit parity monitoring features and
incorporates several bug fixes.
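
As an aside on the route install/delete performance issue noted under
"Routing Software on the T3 Network" above: the new software planned
for November is not described in detail in this report, but the general
idea of reducing per-route overhead by batching forwarding-table
downloads can be sketched as follows.  The structure and function names
are hypothetical and do not come from the router software.

    /* Hypothetical sketch of batching route changes before downloading
     * them to a forwarding table, so that a burst of transient changes
     * costs one download transaction instead of many.  Names and sizes
     * are illustrative only. */
    #include <stdio.h>

    #define BATCH_MAX 256

    struct route_change {
        unsigned long dest;     /* destination network */
        unsigned long gateway;  /* next hop, 0 for a delete */
    };

    static struct route_change batch[BATCH_MAX];
    static int batch_len = 0;

    /* Stand-in for the expensive card/kernel download transaction. */
    static void download_batch(const struct route_change *rc, int n)
    {
        printf("downloading %d route changes (first dest %08lx)\n",
               n, rc[0].dest);
    }

    /* Queue a change; flush only when the batch fills (in practice a
     * short timer would also force a flush). */
    void queue_change(unsigned long dest, unsigned long gw)
    {
        batch[batch_len].dest = dest;
        batch[batch_len].gateway = gw;
        if (++batch_len == BATCH_MAX) {
            download_batch(batch, batch_len);
            batch_len = 0;
        }
    }

    int main(void)
    {
        unsigned long net;
        for (net = 0; net < 1000; net++)
            queue_change(0xc0000000UL + (net << 8), 0x0a000001UL);
        if (batch_len)                  /* flush the final partial batch */
            download_batch(batch, batch_len);
        return 0;
    }
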
RS960 FDDI Deployment Status
============================

Since the installation of the new RS960 FDDI adapters for the RS/6000
router on ENSS128 (Palo Alto), ENSS135 (San Diego), ENSS129
(Champaign), and ENSS132 (Pittsburgh), there have been only two
operational problems.  One involved an instance of backlevel microcode
following a software update, and the other involved a failure of the
originally installed hardware.  The experience with RS960 FDDI has been
extremely positive so far.  Performance tests designed to exploit the
high bandwidth capability of FDDI are under way with the Pittsburgh
Supercomputing Center, the San Diego Supercomputer Center, and the
National Center for Supercomputing Applications.  Following the
completion of these tests, additional RS960 FDDI adapters will be
deployed.

All production network interfaces are still configured for a 1500 byte
Maximum Transmission Unit (MTU).  We will soon reconfigure the MTU on
most network interfaces to maximize performance for applications
designed to exploit T3/FDDI bandwidth, while maintaining satisfactory
performance for sites that interconnect to the T3 routers via an
ethernet-only interface.  The new configuration will be:

  o  Any ENSS with an RS960 FDDI interface will have a 4000 byte MTU,
     except for the ethernet interfaces, which will remain at 1500
     bytes.  The FDDI interface MTU will be set to 4352 bytes following
     the deployment of AIX 3.2.

  o  All other ENSS ethernet interfaces will have a 1500 byte MTU.

  o  All T3 CNSS interfaces will have a 4000 byte MTU, except T3
     interfaces connecting to an ENSS with ethernet only and interfaces
     connecting to a T1 CNSS.

  o  All T1 CNSS interfaces and T1 ENSS interfaces will have a 1500
     byte MTU.


Dismantling the T1 Backbone
===========================

Plans to dismantle the T1 backbone have proceeded on schedule.  We will
begin dismantling the T1 backbone in November '92.  This will occur as
soon as (1) the remaining networks using the T1 backbone are cut over
to the T3 backbone (EASInet and CA*net); (2) the OSI CLNP encapsulation
support for the T3 backbone is deployed; and (3) the T3 ENSS nodes are
backed up by additional T1 circuits terminating at alternate backbone
POPs.  These activities are described below.

T1 Routing Announcement Change
------------------------------

In early November a change will be made to eliminate the redundant
announcements of networks from the T3 to the T1 backbone via the
interconnect, for those networks which are announced to both backbones.
This has the effect of eliminating the use of the T1/T3 interconnect
gateway for CA*net and EASInet in the case of isolation of a
multi-homed regional from the T1 backbone.  This change is necessary in
the interim to allow these duplicate routes to be removed from the
overloaded routing tables on the T1 RCP nodes, and to allow the T1
backbone to remain in use for a few more weeks.

Remaining Network Cutovers to T3
--------------------------------

A new T1 ENSS will be installed at CERN in Switzerland to provide
connectivity to the T3 backbone for EASInet.  Cutover of EASInet
traffic will occur when this installation is complete.

The Seattle RT E-PSP for CA*net is being converted to run the CA*net
software and operate as part of CA*net's domain.  It will run GATED and
speak BGP with the T3 ENSS.  Once this has been debugged and tested,
the Princeton and Ithaca connections will be similarly upgraded.
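
Returning to the MTU reconfiguration described under "RS960 FDDI
Deployment Status" above, the planned per-interface values reduce to a
simple selection by interface type.  The following C sketch merely
restates that list; the enum and function are invented for illustration
and are not router code.

    /* Illustrative restatement of the planned MTU policy. */
    #include <stdio.h>

    enum if_kind {
        ENSS_FDDI,                  /* ENSS with an RS960 FDDI interface */
        ENSS_ETHERNET,              /* ENSS ethernet interface */
        CNSS_T3,                    /* T3 CNSS interface */
        CNSS_T3_TO_ETHERNET_ENSS,   /* T3 interface toward an ethernet-only ENSS */
        CNSS_T3_TO_T1_CNSS,         /* T3 interface toward a T1 CNSS */
        CNSS_T1,                    /* T1 CNSS interface */
        ENSS_T1                     /* T1 ENSS interface */
    };

    int planned_mtu(enum if_kind k)
    {
        switch (k) {
        case ENSS_FDDI:
            return 4000;            /* to become 4352 after the AIX 3.2 deployment */
        case CNSS_T3:
            return 4000;
        default:
            return 1500;            /* ethernet, T1, and T1/ethernet-facing paths */
        }
    }

    int main(void)
    {
        printf("ENSS FDDI MTU:     %d\n", planned_mtu(ENSS_FDDI));
        printf("ENSS ethernet MTU: %d\n", planned_mtu(ENSS_ETHERNET));
        printf("CNSS T3 MTU:       %d\n", planned_mtu(CNSS_T3));
        return 0;
    }
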
OSI Support Plan
----------------

We have successfully tested the RT/PC OSI encapsulator software (EON)
that was described in the August '92 engineering report.  Because the
encapsulator software uses IP to route encapsulated OSI traffic, it can
be tested over the production T1 network.  Encapsulation is enabled in
one-way mode from NSS17 in Ann Arbor to selected NSAP prefixes (i.e.
encapsulation outgoing, native CLNP incoming).  This half-duplex
capability is important for testing and deployment.  When any of the T1
NSFNET E-PSP nodes receives an encapsulated OSI packet, it decapsulates
the packet and switches it via native CLNP.  The OSI encapsulator E-PSP
is configured on a per-prefix basis (e.g. any prefix configured to use
EON at a given NSS will have its outgoing packets encapsulated).  This
flexibility in configuration will allow us to switch two regional
networks over to OSI encapsulation during the second week of November,
with full conversion to OSI encapsulation during the third week of
November.  Native CLNP switching services will be available in an
upcoming release of the RS/6000 AIX 3.2 system software, which is
scheduled for deployment on the T3 network in early 1993.

T3 ENSS Backup Plan
-------------------

The plan to provide backup connectivity to T3 ENSS nodes proceeded on
schedule in October.  We have begun to install dedicated T1 leased line
circuits between each T3 ENSS node and a CNSS T1 router in a secondary
POP.  These T1 circuits are replacing T1 router ports that were
formerly used by T1 safetynet circuits on the T1 CNSS routers.  This
work is expected to be complete by the end of November.  The planned
topology for T3 ENSS backup is illustrated in a PostScript map that is
available via anonymous FTP on ftp.ans.net in the file
</pub/info/t3enss-backup.ps>.

Once the backup infrastructure is in place, we will begin to work on
subsequent optimizations to further improve backup connectivity.  We
have already begun discussions with several regional networks on this.
Several regionals have indicated that they will stop peering with the
T1 backbone when the T1 backup circuit for their ENSS is in place.  The
final phaseout of the T1 backbone will occur after OSI encapsulation is
deployed, the final traffic cutovers are complete, and these backup
circuits are installed.


Network Source/Destination Statistics Collection
================================================

During October we tested and deployed software on the T3 backbone to
collect network source/destination traffic statistics.  This is a
feature that has been supported on the T1 backbone in the past, and was
supported for a brief period on the T3 backbone prior to the migration
to RS960 switching technology.  For each ENSS local area network
interface, we will collect the following information for each
source/destination network pair: packets (in and out), bytes (in and
out), packets distributed by port number, and packets distributed by
protocol type (UDP, TCP).  Packets are sampled on the card (1 in 50
packets sampled) and forwarded to the system processor for reduction
and storage.  We expect to have collected a full month of network
source/destination statistics by the end of November.


Notable Outages in October
==========================

MCI Fiber Outage - 10/17
------------------------

At 12:33 EST on 10/17 we experienced a major MCI fiber outage on the
east coast.  A truck accident damaged a fiber line between Trenton and
New Brunswick, New Jersey.  This caused an extended loss of
connectivity for several T3 and T1 circuits that transited the MCI
junction in West Orange, New Jersey.
All circuits affected by the fiber cut were back on their original
paths as of 22:00 EDT on 10/17/92.  During the outage, several circuits
were moved to backup restoration facilities and were moved back during
the early morning of 10/18.  There were some periods of routing
instability, with circuits going down and coming back up, that caused
temporary loss of connectivity for other network sites as well.

Routing Software Instabilities - 10/19, 10/23
---------------------------------------------

During the week of October 19th, the T3 network experienced three
unscheduled outages (each roughly one hour in duration).  On 10/25 we
successfully installed a new version of the rcp_routed software on all
T3 backbone nodes, which fixed the problem.  This bug involved the
interface between the routing daemon and the SNMP subagent.  With the
addition of the 86th AS peer on the T3 backbone, the buffer between the
routing daemon and the SNMP subagent would get corrupted and induce a
crash of the routing software.  This problem first occurred during the
cutover of the ESnet peers at FIX-E and FIX-W on 10/19.  Following the
resolution of this problem, the ESnet peers were successfully cut over
to use the T3 backbone.
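
As a closing note on the network source/destination statistics
collection described earlier, the on-card collection amounts to 1-in-50
packet sampling with per network-pair counters forwarded to the system
processor.  The following C sketch is a minimal, hypothetical
illustration of that idea; the structures and names are invented, and
the real on-card code is not shown in this report.

    /* Minimal sketch of 1-in-N packet sampling with per
     * source/destination network counters.  Only the 1-in-50 rate
     * comes from the report; everything else is illustrative. */
    #include <stdio.h>

    #define SAMPLE_RATE 50          /* 1 in 50 packets sampled */

    struct net_pair_stats {
        unsigned long src_net;
        unsigned long dst_net;
        unsigned long packets;      /* sampled packet count */
        unsigned long bytes;        /* sampled byte count */
    };

    static unsigned long pkt_seen = 0;

    /* Called per packet; only every 50th packet is counted and would be
     * forwarded to the system processor for reduction and storage. */
    int sample_packet(unsigned long src_net, unsigned long dst_net,
                      unsigned long length, struct net_pair_stats *s)
    {
        if (++pkt_seen % SAMPLE_RATE != 0)
            return 0;               /* not sampled */
        s->src_net = src_net;
        s->dst_net = dst_net;
        s->packets++;
        s->bytes += length;
        return 1;
    }

    int main(void)
    {
        struct net_pair_stats s = { 0, 0, 0, 0 };
        unsigned long i, sampled = 0;
        for (i = 0; i < 5000; i++)
            sampled += (unsigned long) sample_packet(0x23000000UL,
                                                     0x84000000UL, 512, &s);
        printf("sampled %lu of %lu packets\n", sampled, i);
        return 0;
    }
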