November Backbone Engineering Report
This report is included in this month's Internet Monthly Report.

Mark

           ANSNET/NSFNET Backbone Engineering Report
                        November 1992

     Jordan Becker, ANS              Mark Knopper, Merit
     becker@ans.net                  mak@merit.edu

Network Status Summary
======================

All remaining T1 backbone traffic was cut over to the T3 backbone, and
the T1 backbone network was turned off on December 2nd.

There were network stability problems observed at several sites prior
to, and during, the IETF MBONE video/audio multicast event in November.
The problems were mostly due to increased traffic and to inefficient
processing of source route packets, which caused routing instabilities.
Software changes have been deployed on the T3 backbone to reduce the
routing instability problems due to the MBONE, and to improve the
efficiency of downloading large routing updates to the packet
forwarding interfaces on the T3 routers.

New RS960 FDDI interfaces are being scheduled for deployment to
selected T3 ENSS nodes in December.  Performance measurements over the
T3 network are proceeding at sites that already have FDDI interfaces
installed.

Backbone Traffic and Routing Statistics
=======================================

The total inbound packet count for the T1 network was 3,589,916,970,
down 14.8% from October.  598,015,432 of these packets entered from the
T3 network.  The total inbound packet count for the T3 network was
20,968,465,293, up 10.7% from October.  134,269,388 of these packets
entered from the T1 network.  The combined total inbound packet count
for the T1 and T3 networks (less cross-network traffic) was
23,826,097,443, up 5.9% from October.  (A worked cross-check of this
combined total appears below, following the cutover notes.)

As of November 30, the number of networks configured in the NSFNET
Policy Routing Database was 7833 for the T1 backbone and 7581 for the
T3 backbone.  Of these, 1642 networks were never announced to the T1
backbone and 1602 were never announced to the T3 backbone.  For the T1,
the maximum number of networks announced to the backbone during the
month (from samples collected every 15 minutes) was 5772; on the T3 the
maximum number of announced networks was 5548.  The average number of
networks announced on 11/30 was 5707 to the T1 and 5495 to the T3.

T1 NSFNET Backbone Turned Off
=============================

The activities required to turn off the T1 backbone were completed
during November, and the network was officially turned off by disabling
routing on the NSS nodes starting at 00:01 EST on 12/2.  Several
actions were taken in advance of the routing shutdown, including
installation of the T1 ENSS206 at CERN in Switzerland, reconfiguration
of the T1 NSS routers to gateway traffic between CA*net and NSFNET,
deployment of the EON software for OSI encapsulation over IP, and the
final installation of the T1 backup circuit infrastructure connecting
T3 ENSS nodes to secondary CNSS nodes.

Remaining Network Cutovers to T3
--------------------------------

AS 68 (Los Alamos National Laboratory networks) and AS 22 (operated by
MilNet at the San Diego Supercomputer Center) were cut over to use the
T3 backbone in November.

A new T1 ENSS (ENSS206) was installed at CERN in Switzerland to provide
connectivity to the T3 backbone for EASInet.  ENSS206 is interconnected
via a T1 circuit with CNSS35 in New York City.  The ENSS was initially
configured with less than the recommended memory, and had to be
upgraded to overcome some performance problems.  Other than that, the
installation went smoothly, and EASInet traffic was cut over to the T3
backbone on 12/1.

The NSS nodes in Seattle, Ithaca and Princeton were converted for use
by CA*net, to allow CA*net to peer with the T3 network until the
longer-term GATED hardware peer configurations are available.  The
E-PSP nodes for CA*net will be converted to run the CA*net software and
operate as part of CA*net's domain.  These nodes will run GATED and
exchange routes via BGP with the T3 ENSS.
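As a cross-check on the traffic figures reported under "Backbone
Traffic and Routing Statistics" above, the combined total can be
reproduced from the per-backbone counts.  The short Python fragment
below is purely illustrative and is not part of the Merit/ANS
statistics tooling; it simply restates the arithmetic on the numbers
quoted above.

    # Illustrative cross-check of the November combined inbound packet count.
    t1_inbound = 3589916970    # packets entering the T1 backbone
    t3_inbound = 20968465293   # packets entering the T3 backbone
    from_t3    = 598015432     # T1 inbound packets that entered from the T3
    from_t1    = 134269388     # T3 inbound packets that entered from the T1

    # Cross-network traffic is counted on both backbones, so it is removed
    # once from each side to obtain the combined total.
    combined = t1_inbound + t3_inbound - from_t3 - from_t1
    print(combined)            # 23826097443, matching the reported total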
OSI Support on T3 Backbone
--------------------------

OSI CLNP forwarding over the T3 backbone was configured via
encapsulation of CLNP packets in IP using the EON method (RFC 1070),
until native CLNP switching services are available on the T3 routers.
RT PSP routers were configured as EON encapsulators at most regional
and peer networks.  CLNP traffic on the regional is first routed to the
EON machine.  The EON machine encapsulates the CLNP packet in an IP
packet and sends it to the remote EON machine that is associated with
the destination NSAP address prefix in the CLNP packet.  The IP packet
generated by EON contains the source address of the local EON machine
and the destination address of the remote EON machine.

The following static mapping tables exist in the EON machines (an
illustrative sketch of this lookup and encapsulation appears below,
after the T1 shutdown notes):

    NSAP prefix -> remote IP address of the EON machine that will
                   decapsulate the IP packet back into a CLNP packet

    For local NETs of a router:

    NSAP prefix -> local NET of the router on the Ethernet used to
                   route the traffic off the NSFNET service

Changes or requests to be added to these tables should be sent to
nsfnet-admin@merit.edu.

Native CLNP switching support for the T3 backbone continued to be
tested in November.  The AIX 3.2 system software that supports CLNP
switching is in system test and is expected to be available on the T3
backbone in February '93.

T3 ENSS Backup Plan
-------------------

The installation and testing of dedicated T1 leased-line circuits
between all T3 ENSS nodes and a CNSS T1 router in a secondary POP was
completed in November.  The topology for T3 ENSS backup is illustrated
in a PostScript map that is available via anonymous FTP on ftp.ans.net
in the file /pub/info/t3enss-backup.ps.  We have begun to work on
subsequent optimizations to further improve backup connectivity.  There
may be situations where a secondary ENSS router is used to terminate T1
backup circuits.

T1 Backbone Turned Off
----------------------

The activities required to precede the T1 backbone shutdown were
concluded on 12/1, and the shutdown of the T1 backbone commenced on
12/2.  There were a couple of problems during the T1 backbone shutdown
that were quickly corrected.  Some regionals that maintained default
routing configurations pointing to the T1 backbone lost connectivity
for a brief period.  Some regional router configurations were changed,
and the T3 backbone will continue to announce the T1 backbone address
(129.140) from several ENSS nodes for a while longer to ease the
transition.

Also, a problem was discovered with the RCP nodes in the "NSS-router"
configuration used for the CA*net interconnection to the T3 network.
The RCPs could not manage the full 6000+ network destination routing
tables.  As a workaround, the three NSS-routers are now configured to
advertise the T3 backbone network as a default route to CA*net.
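Returning to the OSI support described above, the fragment below is a
minimal, hypothetical sketch of the per-NSAP-prefix lookup and EON
(RFC 1070) style encapsulation performed by an EON machine.  The table
entries, addresses, and function names are illustrative assumptions
only; the real tables are maintained by the NSFNET staff (changes go to
nsfnet-admin@merit.edu), and this is not the actual EON implementation.

    # Hypothetical sketch of EON-style handling of an outbound CLNP packet.
    # The mapping table entries below are examples, not real configuration.
    NSAP_TO_EON = {
        "47.0005": "192.0.2.1",    # NSAP prefix -> remote EON decapsulator
        "47.0023": "192.0.2.17",
    }

    def lookup_eon_peer(dest_nsap):
        """Longest-prefix match of the destination NSAP against the table."""
        best_prefix, best_ip = "", None
        for prefix, ip in NSAP_TO_EON.items():
            if dest_nsap.startswith(prefix) and len(prefix) > len(best_prefix):
                best_prefix, best_ip = prefix, ip
        return best_ip

    def eon_encapsulate(clnp_packet, dest_nsap, local_eon_ip):
        """Return the addressing for the IP packet that carries the CLNP
        packet across the IP backbone: the IP source and destination are
        the EON machines themselves, not the CLNP end systems."""
        remote_eon_ip = lookup_eon_peer(dest_nsap)
        if remote_eon_ip is None:
            raise ValueError("no EON mapping for NSAP " + dest_nsap)
        return (local_eon_ip, remote_eon_ip, clnp_packet)

The second mapping described above (NSAP prefix to the local NET of a
router) plays the corresponding role on the decapsulating side and is
omitted here for brevity.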
ANSNET/NSFNET Operational Experiences with MBONE
================================================

During the weeks of 11/9 and 11/16, there were a number of operational
problems during the preparation and actual operation of the IETF MBONE
packet video/audiocast.  The use of loose source routed packets and the
large volume of MBONE traffic appear to have caused fairly widespread
problems for several Internet service providers.  However, the volume
of MBONE traffic and source route optioned packets did not seem to
adversely affect the ANSNET/NSFNET itself, as was earlier believed.

There were severe routing instabilities with peer networks at several
ANSNET/NSFNET border gateways, including E128 (Palo Alto), E144
(FIX-E), E145 (FIX-W) and, most notably, E133 (Ithaca), due to the
MBONE traffic and the processing of source route packets.  The
instability in these peer networks, coupled with inefficient handling
of very large and frequent routing changes introduced through EGP,
resulted in some ANSNET/NSFNET instabilities.  Networks carrying MBONE
traffic frequently stopped being advertised by external peers and were
timed out by the ENSS.  The external peer would then stabilize, and
these networks would be advertised to the ENSS by the external peer
soon thereafter.  This process repeated itself in a cyclical fashion.
It caused a few connectivity problems at various places on the ANSNET,
but was by far the worst at ENSS133 (Ithaca).  One reason the problem
was worse at ENSS133 than elsewhere was that Cornell was on the sending
end of a fair number of MBONE tunnels, which meant that card-to-system
traffic for unreachable destinations tended to be higher on the ENSS133
router than on other routers.

Several actions were taken during the week of 11/16 (the IETF
video/audiocast) which reduced the severity of this problem, including:

(a) ICMP unreachable messages were turned off on the external
    interfaces of ENSS routers that experienced problems.  These
    messages were not being processed directly on the external ENSS
    interfaces, which resulted in some inefficiency.  New software will
    be deployed in early December to correct this.

(b) SprintLink rerouted traffic (and the MBONE tunnel) from the IETF to
    Cornell from the CIX path (via the internal PSInet path) to the T3
    ANSNET path.  This improved stability within PSInet and within
    ANSNET.

(c) Cornell rerouted traffic (MBONE tunnel) to SDSC from the PSInet
    path to the T3 ANSNET path.

(d) One of the two parallel IETF audio/video channels was disabled.

(e) A default route was established on ENSS133 pointing to its adjacent
    internal router (CNSS49).  This ensured that card<->system traffic
    being processed due to unreachable destinations was moved to the
    CNSS router, which was not involved in processing EGP updates.

(f) A new version of the routing software was installed on the four
    ENSS nodes that experienced route flapping, to aggregate EGP
    updates from external peers before sending IBGP messages to other
    internal T3 routers.

The combination of all of these actions stabilized ENSS133 and the
other ENSS routers that experienced instabilities.
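The aggregation change in item (f) above (and expanded on in item (1)
below) can be illustrated with a minimal, hypothetical sketch: instead
of flooding one IBGP update per changed prefix, route changes learned
from an external EGP peer are collected for a short hold time and then
flooded as a single batched update.  The hold time, data structures,
and function names below are assumptions for illustration only and do
not describe the actual ENSS software.

    # Hypothetical sketch of batching EGP-derived route changes into a
    # single IBGP update instead of flooding one update per prefix change.
    import time

    class RouteChangeAggregator:
        def __init__(self, flood_ibgp, hold_time=5.0):
            # flood_ibgp(changes) sends one IBGP update carrying all
            # pending changes; the hold_time value (seconds) is assumed.
            self.flood_ibgp = flood_ibgp
            self.hold_time = hold_time
            self.pending = {}        # prefix -> latest reachability seen
            self.deadline = None

        def egp_route_change(self, prefix, reachable):
            # Called for every route change learned from an EGP peer;
            # later changes for the same prefix supersede earlier ones.
            self.pending[prefix] = reachable
            if self.deadline is None:
                self.deadline = time.time() + self.hold_time

        def tick(self):
            # Called periodically; when the hold timer expires, flood one
            # aggregated update rather than one update per flapping prefix.
            if self.deadline is not None and time.time() >= self.deadline:
                self.flood_ibgp(self.pending)
                self.pending = {}
                self.deadline = None

Collapsing repeated transitions for the same prefix within the hold
time is what keeps a flapping external peer from translating every EGP
change into backbone-wide IBGP churn.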
There are several actions which we have already implemented, or will
soon implement, to avoid ANSNET border router instabilities during
future MBONE multicast events:

(1) The ENSS EGP software has been enhanced to support improved
    aggregation of updates from external peers into IBGP update
    messages.  The ENSS will now aggregate EGP-derived routes together
    into a single update before flooding it to other routers across the
    backbone via IBGP.  This improves the efficiency of the ENSS
    dramatically.

(2) A change to the ANSNET router interface microcode has been
    implemented (and will be deployed during early December) so that
    problems resulting from large amounts of ENSS card-system traffic
    will be eliminated when destinations become unreachable.  Even if
    mrouted keeps sending traffic, it will be dropped on the incoming
    ENSS interface.

(3) The T1 NSFNET backbone was disconnected on 12/2.  The T1 network
    (particularly the interconnect points with the T3 system) was a
    major source of route flapping, and eliminating it should provide
    an additional margin for handling instability from other peer
    networks.

While the changes we are making to the T3 network will significantly
improve T3 network performance in dealing with external EGP peer
flapping and related MBONE routing problems, our changes will *NOT*
improve the problems that other Internet networks may experience when
processing source route packets and handling routing transitions with
MBONE tunnels.  We recommend that each service provider develop its own
internal routing plan to address this; we continue to recommend
migration to BGP at all border gateways; and we recommend that the
MBONE software be upgraded to support IP encapsulation, to avoid the
problems with routers that do not process loose source route optioned
packets efficiently.  We are also recommending that the MBONE
developers explore optimizing the mrouted software to avoid the
sustained unidirectional flows to unreachable destinations that we
observed.  Finally, it is recommended that an mrouted machine be
maintained on the ENSS DMZ of each participating regional, and that
this node be used as a hierarchical distribution point to locations in
the local campus and regional.  Backhauling of traffic across campuses
and regionals should be discouraged.

Routing Software on the T3 Network
==================================

New routing software was installed on the T3 backbone in November to
support various enhancements.  Route download performance is improved
through the use of asynchronous IOCTL calls, and there is new support
for static routes and for checks on the size of BGP updates and
attributes.  New software will be installed in early December that
addresses the routing instability problems observed during the MBONE
multicast events in November.  This will include the code for improved
aggregation of EGP updates, thereby eliminating CPU starvation during
EGP route flapping.

Next AIX 3.1 System Software Release
====================================

New RS6000 system software was deployed on several T3 network nodes in
late November and will be fully deployed by early December.  The most
significant change is the ability to drop packets for which no route is
available on the interface that receives them.  During transient
routing conditions we may experience high card/system traffic due to
route downloads, which can cause transient instabilities.  This change
will improve the efficiency of route installs/deletes between router
system memory and the on-card forwarding tables.  This code also
supports the generation of ICMP network unreachable messages on the
card rather than on the system.  There are two bug fixes in this
software: one for the performance problem that can occur when an IBGP
TCP session gets deadlocked, and another that avoids FDDI problems if a
T1 backup link on a T3 ENSS gets congested.  Also, all ENSS interfaces
now support path MTU discovery.
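Path MTU discovery, mentioned above and negotiated by the ENSS nodes
with systems outside the backbone (see the next section), follows the
RFC 1191 approach.  The fragment below is a simplified, hypothetical
sketch of the sender-side logic; the probe helper and the fall-back
plateau values are illustrative assumptions rather than the actual ENSS
implementation.

    # Simplified sketch of sender-side path MTU discovery (RFC 1191 style).
    # send_with_df(size) is an assumed helper: it sends a datagram of the
    # given size with the Don't Fragment bit set and returns None on
    # success, or the next-hop MTU reported in the ICMP "fragmentation
    # needed" error (0 if the router reports no value).
    PLATEAUS = [4352, 2002, 1492, 1006, 576, 296, 68]  # common MTU values

    def discover_path_mtu(send_with_df, first_hop_mtu):
        mtu = first_hop_mtu
        while True:
            reported = send_with_df(mtu)
            if reported is None:
                return mtu          # probe got through; this is the path MTU
            if mtu <= 68:
                return 68           # already at the minimum IP MTU
            if reported:
                # router reported its next-hop MTU; never step upward
                mtu = max(68, min(reported, mtu - 1))
            else:
                # old-style ICMP error with no MTU: step down a plateau
                mtu = next((p for p in PLATEAUS if p < mtu), 68)

With the larger MTUs now configured inside the T3 backbone, this
negotiation is what allows end systems outside the backbone to use
datagrams larger than the conservative defaults without relying on
fragmentation.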
RS960 FDDI Deployment Status
============================

We are proceeding to schedule RS960 FDDI adapter installations on
several ENSS nodes in December, including ENSS134 (NEARnet), ENSS130
(Argonne), ENSS145 (FIX-E), ENSS136 (SURAnet), ENSS139 (SESQUInet),
ENSS144 (FIX-W), ENSS142 (WestNet), and ENSS143 (NorthWestNet).

Performance tests are under way involving the Pittsburgh Supercomputer
Center, the San Diego Supercomputer Center, and the National Center for
Supercomputer Applications that are designed to exploit the high
bandwidth capability of FDDI.  The MTUs on various interfaces in the T3
backbone have been changed as described in the October engineering
report, and the ENSS nodes will negotiate path MTU discovery with
systems outside the T3 backbone.  During the tests performed so far,
low-level packet loss (1.6%) has been observed between the Cray end
systems at the supercomputer centers, which has prevented the
large-window TCP implementations from achieving peak performance over
the T3 backbone.  The packet loss has not been traced to sources inside
the T3 backbone, and the problem is being investigated by the
supercomputer centers.

Network Source/Destination Statistics Collection
================================================

During November, we collected the first full month of T3 network
source/destination traffic statistics.  This data will be used for
topology engineering, capacity planning, and various research projects.

RS960 Memory Parity Problem
===========================

During November, we experienced two problems, on CNSS17 and CNSS65, due
to the failure of parity checking logic within the on-card memory on
selected RS960 T3 adapters.  These outages are very infrequent and do
not generally result in ENSS isolation from the network, since only a
single interface is affected and redundant connectivity is employed on
other CNSS interfaces.  The problems were cleared by a software reset
of the interface.  The problem is suspected to be due to the memory
technology used on some of the cards.