Phase-III T3 Network Upgrade Plan Summary
=========================================

Jordan Becker, ANS
Mark Knopper, Merit

ANS & Merit are planning a major upgrade to the ANS/NSFNET T3 backbone service, beginning in late April and scheduled to complete at the end of May. The upgrade involves changing all of the T3 interface adapters in the RS/6000 T3 routers as well as the DSUs. The new T3 adapter (called RS960) supports on-card packet forwarding, which will dramatically improve performance because packet forwarding will no longer require support from the main RS/6000 CPU. The mechanism used will allow a T3 adapter to send data directly to another adapter across the RS/6000 bus. The concept is roughly comparable to an evolution of the architecture of the T1 nodes on the NSFNET, since all the processors are in a single system sharing a common bus. The reliability of the new RS960 adapter is much greater than that of the existing T3 adapter (known as the "Hawthorne T3 adapter").

We will also upgrade the DSU to provide "C-bit parity" over the T3 links. C-bit parity (based on the ANSI T1.107A spec.) will provide improved end-to-end real-time link level network management information. C-bit parity is conceptually comparable to Extended Super Framing (ESF) over T1 circuits. Other minor changes on the router include the replacement of the fan, the I/O planar card, and associated cabling to support the adapter upgrade.

The result of this upgrade will be higher speed packet switching and increased SNMP functionality, better T3 link monitoring, higher T3 router reliability and availability, and better diagnostic and repair capability.

The deployment of this upgrade will temporarily affect T3 network connectivity while the core node (CNSS) and end node (ENSS) routers are upgraded. In order to minimize down time, all nodes will be upgraded during off-hours on weekends. The target start date for the phase-III upgrade is April 24.
This is contingent upon successful completion of the test plan described below.

Implementation of the deployment is planned to cover six steps, each step taking place over a Friday night/Saturday morning 8-hour window. Each step will correspond to the upgrade of all CNSS's located at two adjacent POPs and all of the T3 ENSS nodes supported by those CNSS's.

T3 network outages at NSFNET midlevel T3 network sites are expected to average 2 hours per step, although we have provisioned an 8-hour window to accomplish the upgrade to the T3 ENSS and adjacent CNSS nodes. NSFNET T3 midlevel networks will be cut over to the T1 backbone prior to the scheduled upgrade and will remain on the T1 until the affected nodes are upgraded and network reachability has been successfully re-established. During the outage period in which the T3 CNSS routers are upgraded (expected to be 2 hours), traffic from other T1 and lower speed ENSS nodes, as well as T1 CNSS transit traffic, will continue to be routed via the T1 safetynet circuits.

All upgrades will start on a Friday night, with a starting time of approximately 23:00 local time. There will be two exceptions: changes to the Denver and Cleveland facilities will start at approximately 24:00 local time. There will be a scheduled visit to your site by the local CE on the Thursday prior to the Friday installation in order to ensure that all parts have been received on site in preparation for the installation.

Each CNSS site should take about 8 hours to complete, with the exception of the New York City core node. At the New York City site, we will also be physically moving the equipment within the POP and expect this move to add about 4 hours to the upgrade there.

Although each upgrade is expected to require 8 hours to complete, we are reserving the 48-hour window from approximately 23:00 local time on Friday to 24:00 local time on Sunday for possible disruptions to service.
This window will allow enough time to debug any unforeseen problem that may arise.

A second visit to core nodes will be required to replace a single remaining old-technology T3 adapter. This will result in a T3 outage of approximately 105 minutes at sites as indicated below.

We have established a tentative upgrade schedule which is contingent upon successful completion of all testing (described below). If any unforeseen problems are encountered during the deployment, we will postpone subsequent upgrades until those problems are resolved. The current upgrade schedule by CNSS/ENSS site is as follows:

Step 1, April 24/25
     Denver, Seattle core nodes
     T3 ENSS Nodes: Boulder E141, Salt Lake E142, U. Washington E143

Step 2, May 1/2
     San Francisco, Los Angeles core nodes
     T3 ENSS Nodes: Palo Alto E128, San Diego E135
     Other ENSS Nodes: E144, E159, E156, E170 also affected
     Second core node visit: Seattle

Step 3, May 8/9
     Chicago, Cleveland, New York City, Hartford core nodes
     T3 ENSS Nodes: Argonne E130, Ann Arbor E131, Pittsburgh E132,
                    Ithaca E133, Boston E134, Princeton E137
     Other ENSS Nodes: E152, E162, E154, E158, E167, E168, E171, E172,
                       CUNY E163, E155, E160, E161, E164, E169
                       also affected
     Second core node visit: San Francisco

Step 4, May 15/16
     St. Louis, Houston core nodes
     T3 ENSS Nodes: Champaign E129, Lincoln E140, Rice U. E139
     Other ENSS Nodes: U. Louisville E157, E165 also affected
     Second core node visit: Los Angeles, Denver, Chicago

Step 5, May 22/23
     Washington D.C., Greensboro core nodes
     T3 ENSS Nodes: College Park E136, FIX-E E145, Georgia Tech E138
     Other ENSS Nodes: Concert E150, VPI E151, E153, E166 also affected
     Second core node visit: Houston, New York City, Hartford

It is anticipated that, with the exception of the indicated T3 ENSS site outages, other ENSS nodes and transit traffic services will be switched across the safetynet and T1 CNSS concentrators during the upgrades. However, brief routing transients or instabilities may be observed.
NSR messages will be posted 48 hours in advance of any scheduled changes, and questions or comments on the schedule or plan may be directed to the ie@merit.edu mailing list.


T3 Research Network RS960 Test Plan and Experiences
===================================================

This upgrade is the culmination of several months of development and lab and network testing of the new technology. Many of the problems identified during the deployment of the phase-II T3 network have been corrected in the new hardware/software. A summary of our experiences with this technology on the T3 Research network is given below.

ANS/Merit and their partners maintain a full wide area T3/T1 Research network for development and testing of new hardware, software, and procedures prior to introduction on the production networks. The T3 Research network includes 3 core node locations with multiple fully configured CNSS routers. There are 5 ENSS sites at which we maintain full T3 ENSS routers as well as local ethernet and FDDI LANs that interconnect multiple peer routers and test hosts. The Research network is designed to emulate the production network configuration as much as possible. The wide area Research network interconnects with multiple testbeds at each of the 5 ENSS locations. These testbeds are configured to emulate regional and campus network configurations.

General Testing Goals and Methods
---------------------------------

Unit and system testing of all phase-III technology is conducted first in the development laboratories, and the technology is then regression tested on the development testbeds at each of the participating sites. The primary goal for testing of the phase-III technology on the T3 Research network is to determine whether the new technology meets several acceptance criteria for deployment on the production network. These criteria include engineering requirements such as routing, packet forwarding, manageability, and fault tolerance.
They also include regression testing against all known problems that have been corrected to date on the T3 system, including all of its components.

The following are secondary goals that the test plan includes:

1. To gain experience with new components so that the NOC and engineering staffs can recover from problems once in production.

2. To identify any problems resulting from attempts to duplicate production network traffic load and distributions on the Test Network.

3. To perform load saturation, route flapping, and other stress tests to measure system response and determine failure modes and performance limits.

4. To duplicate selected system and unit tests in a more production-like environment.

5. To design and execute tests that reflect "end-user perceptions" of network performance and availability.

6. To isolate new or unique components under test to evaluate new criteria or performance objectives for testing those specific new components. Regression testing against the entire system is also emphasized.

7. To independently evaluate the validity of specific tests to ensure their usefulness.


Phase-III Components to be tested
=================================

The RS960 interface upgrade consists of two new hardware components:

   1. RS960 DS3/HSSI interfaces
   2. New T3 DSU adapters
      a. Communication card with C-bit parity (ANSI T1.107A)
      b. High Speed Serial Interface (HSSI) interface card
         (HSSI, developed by Cisco & T3plus Inc., is a de facto
         standard comparable to V.35 or RS-422).

There will be several new software components tested for the target production network AIX 3.1 operating system level:

   1. RS960 DS3 driver & kernel modifications
   2. SNMP software
      a. SNMP Daemon, DSU proxy subagent
      b. T3 DSU logging and interactive control programs
   3. RS960 adapter firmware
   4. New RS960 utility programs for AIX Operating System
      a. ifstat - Interface statistics
      b.
         ccstat - On-Card Statistics


General Areas Tested
====================

Test objectives are exercised in the following areas. Extensive testing is done to ensure that we meet these objectives.

1) Packet Forwarding
   a. Performance and stress testing
   b. Reliability under stress conditions

2) Routing
   a. Focus on consistency of routes
      - across system tables and smart card(s) tables
      - after interface failure
      - after network partitioning
      - after rapid node/circuit/interface transitions
      - under varying traffic load conditions
   b. Limit testing
      - determine limits on:
          - number of routes
          - number of AS's
          - packet sizes
          - number of networks
          - number of nets at given metric per node
      - system behavior when these limits are exceeded
   c. Interoperability with Cisco BGP testing

3) System monitoring via SNMP
   a. RS960 hardware, driver, microcode
   b. DSU functions, C-bit parity end-to-end connectivity and performance

4) End to End Tests
   a. Connection availability
      - does TCP connection stay open in steady state?
   b. Throughput
      - measure throughput on host to host connection through network
   c. Delay
      - measure packet delays from user point of view
      - observe round trip and unidirectional delays
   d. Steady State Performance
      - evaluate effect on TCP connection due to network changes

5) Unit Test Verification
   a. Repeat selected regression tests performed by development
   b. Exercise DSU functions
      - measure packet loss
      - perform loopback tests
   c. Induce random & deterministic noise into circuits and evaluate
      interface hardware/software response

6) Deployment Phase Testing
   a. Test all machine configurations that will be used during the
      deployment transition phases
   b. Measure throughput under production load across transitional
      configurations (RS960 T3 <-> Hawthorne T3 path)

7) NOC Procedure Testing
   a. During testing, identify all possible failure modes
   b. Install & document specific recovery procedures
   c.
      Train NOC staff on test network


Test Schedule Summary
=====================

DATE       EVENT

3/1        Development unit & system testing complete, code review
             o Build AIX system for testnet
             o Freeze all fix buckets

3/6-3/10   Install RS960 on testnet with transition configurations

3/20       Disconnect testnet from production net
             o Bring back Richardson - White Plains T3 link
             o Burn-in and swap out any failed hardware components.
             o Prepare for testbed regression tests
             o Install new build (3/19)

3/21-3/24  Steady state traffic flow testing, stability testing
             o Configure external Ciscos with static routes
             o Inject routes and traffic with copy tools and packet
               generators to induce a non-overload level of traffic.
             o Configure and start end hosts for application tests.

3/25       Throughput testing 1
             o Stress test 3xHawthorne T3 -> 1xRS960 T3 CNSS flow on
               WP3, lesser flow in reverse direction. Measure
               degradation with timestamped echo request packets, as
               well as analyze memory used. Check all links for
               dropped or lost packets.

3/26       Throughput testing 2
             o Repeat 3/23 test on White Plains CNSS2 with
               3xRS960s -> 1xHawthorne T3 CNSS config.
           Throughput testing 3
             o Repeat above on BW5 POP with all RS960 CNSS config.
               Look for any system degradation, or pathological
               route_d effects.

3/27       Upgrade testnet to match post-deployment configuration
             o White Plains CNSS3 <-> Elmsford ENSS and White Plains
               CNSS3 <-> Yorktown ENSS become RS960 links

3/27       Routing testing 1
             o Configure & validate Cisco route injection setup
             o Copy tools loaded with matching configs for routes.
             o Install utility to diff netstat -r and ifstat -r tables
             o Vary traffic load & distribution with traffic
               generators and copy tools, run scripts to look for
               incorrect route entries
             o End to end applications running with changing routes
               and loads, look for errors (timeouts) and throughput.

3/30       Routing testing 2
             o MCI to get error noise injector for T3 at White Plains,
               on Bridgewater-White Plains T3 link. ANS to operate
               noise injector.
             o Repeat 3/25 tests with
                 - Intentionally crashed card (Use Milford utility to
                   crash card)
                 - Partitioned network (pull HSSI cables in White
                   Plains)
                 - Injected line errors (ANS to inject in White Plains
                   on MCI equipment)

3/31       Routing limit testing
             o as defined in testplan:
                 - routes
                 - AS's
                 - packet sizes
                 - number of nets at given metric per node

4/01       SNMP RS960 tests
             o as defined in testplan:
                 - RS960 hardware
                 - RS960 driver
                 - RS960 microcode

4/02       SNMP DSU tests
             o as defined in testplan:
                 - DSU functions
                 - C-bit parity

4/03       Regression End-to-End tests
             o Induce background traffic load with packet generators
               and copy tools, run end-user applications and also
               transit end-end 1x10**6 packets to an idle machine with
               pkt generators. Count packets dropped.

4/06       Regression Configuration Conversion
             o Upgrade any adapters, DSUs, system software, microcode
               that are not uplevel, make sure that DSUs are now
               configured for SNMP control
             o Repeat everything


Phase-III Test Experiences and Results To-Date Summary
======================================================

Testing is scheduled to continue up until 4/17. We will then freeze the test network in preparation for deployment on 4/24. The software build that supports the new technology will be installed on selected nodes between 4/17 and 4/24 to burn in prior to the 4/24 hardware upgrades. Testing on the test network will continue throughout the deployment.

Overall Testnet Observations
----------------------------

The performance of the RS960 technology is far superior to that of the existing Hawthorne T3 adapter technology. Although peak performance throughput tests have not yet been conducted, the steady state performance for card-to-card transfers has been measured in excess of 20 KPPS with excellent stability.

During the early deployment of the RS960 and DSU technology on the test network, we observed several new transmission facility problems that were not observed in the lab tests, or in earlier DS3 adapter tests.
We found a new form of data pattern sensitivity where, under certain conditions, the new DSU can generate a stream of 010101 bits that induces an Alarm Indication Signal (AIS, or blue alarm) within the MCI network.

The RS960 and existing Hawthorne T3 card do not interoperate over a serial link. However, they will interoperate if both are installed within a single CNSS node using specially developed driver software which uses the main RS/6000 CPU for packet forwarding between the RS960 and Hawthorne T3 cards. As part of the phase-III migration period we had originally planned to support an interim CNSS configuration of 3xHawthorne T3 adapters and 1xRS960 adapter, as well as a 1xHawthorne T3 & 3xRS960 T3 interim CNSS configuration. Unfortunately, during the performance tests we determined that the 3xHawthorne T3 & 1xRS960 T3 configuration creates a performance bottleneck that could cause congestion under heavy load. This is due to the interim-state RS960 <-> Hawthorne interface software driver bottleneck, where the main RS/6000 CPU is used for packet forwarding between dissimilar adapters. We have therefore eliminated this configuration from the deployment plan and will support only the 1xHawthorne, 3xRS960 configuration as an interim state during the deployment. We are also looking at various deployment strategies that will avoid any congestion across an interim RS960 <-> Hawthorne path. These strategies include interior T3 link metric adjustment, safetynet link load splitting, careful placement of these transition links to avoid hot-spots, or temporary addition of new DS3 links that will short-cut these transition links.

We are using a copy tool, developed as a host system that interfaces to both a production network ethernet and a test network ethernet. All production ethernet packets are promiscuously copied on the host, given a new destination address, and injected into the Research network to simulate production traffic flows within the Research network.
We have found a bug in the copy tool that has caused problems on the test ethernets at a couple of Research network locations. Every time the copy tool is rebooted, we experience congestion on the test ethernets due to an erroneous broadcast of a copied packet onto a test ethernet. We are fixing this problem before we run these tests again.

We have run numerous route flapping tests where we install and delete routes repeatedly on all installed RS960 cards, and we have not encountered any chronic problems so far. The installation and deletion of 6000 routes on the card is fast enough that we cannot measure any inconsistencies between different on-card route tables. We have compiled a limit of 6000 routes on the card for now, since this reflects the deployment configuration; however, we can support up to 14000 IP routes on the card if necessary.

We have opened and closed over 100 problems during the several months of lab and Research network testing on RS960. There are currently 14 open problems remaining from our tests to date, and we can provide some details on these for anyone interested. We have fixes for most of these problems that will be regression tested on the Research network next week. We expect to close these problems prior to deployment of RS960 on 4/24.
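
As an illustration of the route consistency checks described above (comparing the system route table against the on-card route tables), the comparison step might be sketched as follows. This is a minimal Python sketch and not the actual test utility: the table representation (a mapping from destination network to next hop), the function names, and the sample addresses and next-hop names are all assumptions made for illustration. Parsing of real netstat -r or ifstat -r output is deliberately left out, since those formats are not described here.

```python
from typing import Dict, List, Tuple

# Assumed representation: destination network -> next hop.
RouteTable = Dict[str, str]
Problem = Tuple[str, str, str]  # (destination, kind, detail)


def diff_route_tables(system: RouteTable, card: RouteTable) -> List[Problem]:
    """Return every inconsistency between a system table and one card table."""
    problems: List[Problem] = []
    # Walk the union of destinations in sorted order for stable output.
    for dest in sorted(set(system) | set(card)):
        if dest not in card:
            problems.append((dest, "missing_on_card", system[dest]))
        elif dest not in system:
            problems.append((dest, "extra_on_card", card[dest]))
        elif system[dest] != card[dest]:
            detail = f"{system[dest]} != {card[dest]}"
            problems.append((dest, "next_hop_mismatch", detail))
    return problems


def check_cards(system: RouteTable,
                cards: Dict[str, RouteTable]) -> Dict[str, List[Problem]]:
    """Map each card name to its inconsistencies (empty list when consistent)."""
    return {name: diff_route_tables(system, table)
            for name, table in cards.items()}


if __name__ == "__main__":
    # Hypothetical sample data; real tables would come from the routers.
    system_table = {"10.1.0.0": "hopA", "10.2.0.0": "hopB"}
    card_tables = {
        "card0": dict(system_table),      # fully consistent copy
        "card1": {"10.1.0.0": "hopA"},    # one route not yet installed
    }
    for name, problems in check_cards(system_table, card_tables).items():
        print(name, "consistent" if not problems else problems)
```

Run repeatedly while routes are being installed and deleted, a check of this kind would report an empty list for every card whenever the tables agree, which corresponds to the "no measurable inconsistencies" result reported for the route flapping tests.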