Phase-III T3 Network Upgrade Plan Summary
=========================================
Jordan Becker, ANS
Mark Knopper, Merit
ANS & Merit are planning a major upgrade to the ANS/NSFNET T3 backbone
service beginning in late April and scheduled to complete at the end of
May. The upgrade involves changing all of the T3 interface adapters in the
RS/6000 T3 routers as well as the DSUs. The new T3 adapter (called RS960)
supports on-card packet forwarding, which will dramatically improve
performance, since packet forwarding will no longer require support from the
main RS/6000 CPU. The mechanism used will allow a T3 adapter to send data
directly to another adapter across the RS/6000 bus. The concept is roughly
comparable to an evolution of the architecture of the T1 nodes on the NSFNET,
where all the processors are in a single system sharing a common bus. The
reliability of the new RS960 adapter is much greater than that of the existing
T3 adapter (known as the "Hawthorne T3 adapter").
We will also upgrade the DSUs to provide "C-bit parity" over the T3
links. C-bit parity (based on the ANSI T1.107A specification) will provide
improved end-to-end, real-time, link-level network management information.
C-bit parity is conceptually comparable to Extended Super Framing (ESF) over
T1 circuits.
Other minor changes on the router include the replacement of the fan, the I/O
planar card, and associated cabling to support the adapter upgrade.
The result of this upgrade will be higher-speed packet switching,
increased SNMP functionality, better T3 link monitoring, higher T3 router
reliability and availability, and better diagnostic and repair capability.
The deployment of this upgrade will temporarily affect T3 network
connectivity while the core node (CNSS) and end node (ENSS) routers are
upgraded. In order to minimize down time, all nodes will be upgraded during
off-hours on weekends.
The target start date for the phase-III upgrade is April 24. This is
contingent upon successful completion of the test plan described below.
Implementation of the deployment is planned to cover six steps, each step
taking place over a Friday night/Saturday morning 8-hour window. Each step
will correspond to the upgrade of all CNSS's located at two adjacent POPs and
all of the T3 ENSS nodes supported by those CNSS's. T3 network outages at
NSFNET midlevel T3 network sites are expected to average 2 hours per step,
although we have provisioned an 8-hour window to accomplish the upgrade to the
T3 ENSS and adjacent CNSS nodes. NSFNET T3 midlevel networks will be cut over
to the T1 backbone prior to the scheduled upgrade and will remain on the T1
until the affected nodes are upgraded and network reachability has been
successfully re-established.
During the outage period when the T3 CNSS routers are upgraded
(expected to be 2 hours), traffic from other T1 and lower-speed ENSS nodes, as
well as T1 CNSS transit traffic, will continue to be routed via the T1
safetynet circuits.
All upgrades will start on a Friday night, with a starting time of
approximately 23:00 local time. There will be two exceptions: changes to the
Denver and Cleveland facilities will start at approximately 24:00 local
time. There will be a scheduled visit to your site by the local CE on the
Thursday prior to the Friday installation in order to ensure that all parts
have been received on site in preparation for the installation. Each CNSS
site should take about 8 hours to complete, with the exception of the New York
City core node. At the New York City site, we will also be physically moving
the equipment within the POP and expect this move to add about 4 hours to the
upgrade there. Although each upgrade is expected to require 8 hours to
complete, we are reserving a 48-hour window, from approximately 23:00 local
time on Friday to 24:00 local time on Sunday, for possible disruptions to
service. This window will allow enough time to debug any unforeseen problems
that may arise. A second visit to core nodes will be required to replace a
single remaining old-technology T3 adapter. This will result in a T3 outage
of approximately 105 minutes at the sites indicated below.
We have established a tentative upgrade schedule which is contingent
upon successful completion of all testing (described below). If any unforeseen
problems are encountered during the deployment, we will postpone subsequent
upgrades until those problems are resolved. The current upgrade
schedule by CNSS/ENSS site is as follows:
Step 1, April 24/25
Denver, Seattle core nodes
T3 ENSS Nodes: Boulder E141, Salt Lake E142, U. Washington E143
Step 2, May 1/2
San Francisco, Los Angeles core nodes
T3 ENSS Nodes: Palo Alto E128, San Diego E135,
Other ENSS Nodes: E144, E159, E156, E170 also affected
Second core node visit: Seattle
Step 3, May 8/9
Chicago, Cleveland, New York City, Hartford core nodes
T3 ENSS Nodes: Argonne E130, Ann Arbor E131, Pittsburgh E132,
Ithaca E133, Boston E134, Princeton E137
Other ENSS Nodes: E152, E162, E154, E158, E167, E168, E171, E172,
CUNY E163, E155, E160, E161, E164, E169 also affected
Second core node visit: San Francisco
Step 4, May 15/16
St. Louis, Houston core nodes
T3 ENSS Nodes: Champaign E129, Lincoln E140, Rice U. E139
Other ENSS Nodes: U. Louisville E157, E165 also affected
Second core node visit: Los Angeles, Denver, Chicago
Step 5, May 22/23
Washington D.C., Greensboro core nodes
T3 ENSS Nodes: College Park E136, FIX-E E145, Georgia Tech E138
Other ENSS Nodes: Concert E150, VPI E151, E153, E166 also affected
Second core node visit: Houston, New York City, Hartford
It is anticipated that, with the exception of the indicated T3 ENSS
site outages, other ENSS nodes and transit traffic services will be switched
across the safetynet and T1 CNSS concentrators during the upgrades. However,
brief routing transients or instabilities may be observed. NSR messages will
be posted 48 hours in advance of any scheduled changes, and questions or
comments on the schedule or plan may be directed to the ie@merit.edu mailing
list.
T3 Research Network RS960 Test Plan and Experiences
===================================================
This upgrade is the culmination of several months of development and
lab and network testing of the new technology. Many of the problems
identified during the deployment of the phase-II T3 network have been
corrected in the new hardware/software. A summary of our experiences with
this technology on the T3 Research network is given below.
ANS/Merit and their partners maintain a full wide area T3/T1 Research
network for development and testing of new hardware, software, and procedures
prior to introduction on the production networks. The T3 Research network
includes 3 core node locations with multiple fully configured CNSS routers.
There are 5 ENSS sites at which we maintain full T3 ENSS routers as well as
local ethernet and FDDI LANs that interconnect multiple peer routers and test
hosts. The Research network is designed to emulate the production network
configuration as much as possible. The wide area Research network
interconnects with multiple testbeds at each of the 5 ENSS locations. These
testbeds are configured to emulate regional and campus network configurations.
General Testing Goals and Methods
---------------------------------
Unit and system testing of all phase-III technology is conducted first
in the development laboratories, and then regression tested on the development
testbeds at each of the participating sites. The primary goal for testing of
the phase-III technology on the T3 Research network is to determine whether
the new technology meets several acceptance criteria for deployment on the
production network. These criteria include engineering requirements such as
routing, packet forwarding, manageability, and fault tolerance. They also
include regression testing against all known problems that have been corrected
to date on the T3 system, including all of its components.
The following are secondary goals that the test plan includes:
1. To gain experience with new components so that the NOC and engineering
staffs can recover from problems once in production.
2. To identify any problems resulting from attempts to duplicate production
network traffic load and distributions on the Test Network.
3. To perform load saturation, route flapping, and other stress tests to
measure system response and determine failure modes and performance limits.
4. To duplicate selected system and unit tests in a more production-like
environment.
5. To design and execute tests that reflect "end-user perceptions" of
network performance and availability.
6. To isolate new or unique components under test to evaluate new criteria or
performance objectives for testing those specific new components. Regression
testing against the entire system is also emphasized.
7. To independently evaluate the validity of specific tests to ensure their
usefulness.
Phase-III Components to be tested
=================================
The RS960 interface upgrade consists of two new hardware components:
1. RS960 DS3/HSSI interfaces
2. New T3 DSU adapters
a. Communication card with c-bit parity (ANSI T1.107A)
b. High Speed Serial Interface (HSSI) interface card
(HSSI, developed by Cisco and T3plus Inc., is a de facto standard
comparable to V.35 or RS-422.)
Several new software components will also be tested at the target production
network AIX 3.1 operating system level:
1. RS960 DS3 driver & kernel modifications
2. SNMP software (see the polling sketch after this list)
a. SNMP Daemon, DSU proxy subagent
b. T3 DSU logging and interactive control programs
3. RS960 adapter firmware
4. New RS960 utility programs for AIX Operating System
a. ifstat - Interface statistics
b. ccstat - On-Card Statistics
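As a rough illustration of the kind of SNMP polling these components
support, the sketch below reads standard MIB-II interface counters
(ifOperStatus, ifInOctets, ifOutOctets) using the net-snmp snmpget
utility from Python. The router name, community string, and interface
index are placeholders, and the DSU proxy subagent's vendor-specific
objects are not shown.

#!/usr/bin/env python3
# Minimal SNMP polling sketch: read standard MIB-II interface counters
# from a router by shelling out to the net-snmp 'snmpget' utility.
# The hostname, community string, and interface index are placeholders;
# the DSU proxy subagent's own objects are not shown here.
import subprocess

HOST = "cnss-test.example.net"   # hypothetical router/agent
COMMUNITY = "public"             # hypothetical read community
IFINDEX = 2                      # hypothetical T3 interface index

# Standard MIB-II ifTable columns (1.3.6.1.2.1.2.2.1.<col>.<ifIndex>)
OIDS = {
    "ifOperStatus": "1.3.6.1.2.1.2.2.1.8.%d" % IFINDEX,
    "ifInOctets":   "1.3.6.1.2.1.2.2.1.10.%d" % IFINDEX,
    "ifOutOctets":  "1.3.6.1.2.1.2.2.1.16.%d" % IFINDEX,
}

def poll():
    for name, oid in OIDS.items():
        out = subprocess.run(
            ["snmpget", "-v1", "-c", COMMUNITY, "-Ovq", HOST, oid],
            capture_output=True, text=True)
        print("%-14s %s" % (name, out.stdout.strip() or out.stderr.strip()))

if __name__ == "__main__":
    poll()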
General Areas Tested
====================
The following are the areas in which the test objectives are
exercised. Extensive testing is done to ensure that we meet these objectives.
1) Packet Forwarding
a. Performance and stress testing
b. Reliability under stress conditions
2) Routing
a. Focus on consistency of routes
- across system tables and smart card(s) tables
- after interface failure
- after network partitioning
- after rapid node/circuit/interface transitions
- under varying traffic load conditions
b. Limit testing
- determine limits on:
- number of routes
- number of AS's
- packet sizes
- number of networks
- number of nets at given metric per node
- system behavior when these limits are exceeded
c. Interoperability with Cisco BGP testing
3) System monitoring via SNMP
a. RS960 hardware, driver, microcode
b. DSU functions, C-bit parity end-to-end connectivity and performance
4) End to End Tests (a minimal probe sketch follows this list)
a. Connection availability
- does TCP connection stay open in steady state?
b. Throughput
- measure throughput on host to host connection through network
c. Delay
- measure packet delays from user point of view
- observe round trip, and unidirectional delays
d. Steady State Performance
- evaluate effect on TCP connection due to network changes
5) Unit Test Verification
a. Repeat selected regression tests performed by development
b. Exercise DSU functions
- measure packet loss
- perform loopback tests
c. Induce random & deterministic noise into circuits and evaluate
interface hardware/software response
6) Deployment Phase Testing
a. Test all machine configurations that will be used during the
deployment transition phases
b. Measure throughput under production load across transitional
configurations (RS960 T3 <-> Hawthorne T3 path)
7) NOC Procedure Testing
a. During testing, identify all possible failure modes
b. Install & document specific recovery procedures
c. Train NOC staff on test network
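The following is a rough sketch of the kind of end-to-end probe implied
by item 4 above: timestamped UDP datagrams are bounced off an echo
responder, and round-trip delay and loss are reported. The target host,
port, probe count, and packet size are illustrative assumptions, not the
actual test tools used on the Research network.

#!/usr/bin/env python3
# Minimal round-trip delay probe: send timestamped UDP datagrams to an
# echo responder and report round-trip time and loss.  Host, port, probe
# count, and packet size below are placeholders.
import socket
import struct
import time

TARGET = ("enss-test-host.example.net", 7)   # hypothetical UDP echo responder
COUNT = 100                                  # probes per run
PAD = b"\x00" * 52                           # pad each datagram to 64 bytes

def probe(count=COUNT):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(2.0)
    rtts, lost = [], 0
    for seq in range(count):
        # carry the sequence number and send timestamp inside the packet
        s.sendto(struct.pack("!Id", seq, time.time()) + PAD, TARGET)
        try:
            data, _ = s.recvfrom(1024)
        except socket.timeout:
            lost += 1
            continue
        rseq, sent = struct.unpack("!Id", data[:12])
        if rseq == seq:
            rtts.append((time.time() - sent) * 1000.0)    # milliseconds
    if rtts:
        print("min/avg/max rtt = %.2f/%.2f/%.2f ms, %d/%d lost"
              % (min(rtts), sum(rtts) / len(rtts), max(rtts), lost, count))

if __name__ == "__main__":
    probe()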
Test Schedule Summary
=====================
DATE EVENT
3/1 Development unit & system testing complete, code review
o Build AIX system for testnet
o Freeze all fix buckets
3/6-3/10 Install RS960 on testnet with transition configurations
3/20 Disconnect testnet from production net
o Bring back Richardson - White Plains T3 link
o Burn-in and swap out any failed hardware components.
o Prepare for testbed regression tests
o Install new build (3/19)
3/21-3/24 Steady state traffic flow testing, stability testing
o Configure external Ciscos with static routes
o Inject routes and traffic with copy tools and packet
generators to induce a non-overload level of traffic.
o Configure and start end hosts for application tests.
3/25 Throughput testing 1
o Stress test 3xHawthorne T3 -> 1xRS960 T3 CNSS flow on WP3,
lesser flow in reverse direction. Measure degradation with
timestamped echo request packets, as well as analyze memory
used. Check all links for dropped or lost packets.
3/26 Throughput testing 2
o Repeat 3/23 test on White Plains CNSS2 with
3xRS960s->1xHawthorne T3 CNSS config.
Throughput testing 3
o Repeat above on BW5 POP with all-RS960 CNSS config. Look for
any system degradation, or pathological route_d effects.
3/27 Upgrade testnet to match post-deployment configuration
o White Plains CNSS3-Elmsford ENSS and White Plains CNSS3 <->
Yorktown ENSS become RS960 links
3/27 Routing testing 1
o Configure & validate cisco route injection setup
o Copy tools loaded with matching configs for routes.
o Install utility to diff netstat -r and ifstat -r tables (see
the consistency-check sketch after this schedule)
o Vary traffic load & distribution with traffic generators and
copy tools, run scripts to look for incorrect route entries
o End to end applications running with changing routes
and loads, look for errors (timeouts) and throughput.
3/30 Routing testing 2
o MCI to get error noise injector for T3 at White Plains, on
Bridgewater-White Plains T3 link. ANS to operate noise injector.
o Repeat 3/25 tests with
- Intentionally crashed card (Use Milford utility to crash
card)
- Partitioned network (pull HSSI cables in White Plains)
- Injected line errors (ANS to inject in White Plains on MCI
equipment)
3/31 Routing limit testing
o as defined in testplan:
- routes
- AS's
- packet sizes
- number of nets at given metric per node
4/01 SNMP RS960 tests
o as defined in testplan:
- RS960 hardware
- RS960 driver
- RS960 microcode
4/02 SNMP DSU tests
o as defined in testplan:
- DSU functions
- C-bit parity
4/03 Regression End-to-End tests
o Induce background traffic load with packet generators and
copy tools, run end-user applications, and also transmit
1x10**6 packets end-to-end to an idle machine with packet
generators. Count dropped packets.
4/06 Regression Configuration Conversion
o Upgrade any adapters, DSUs, system software, microcode that
are not uplevel, make sure that DSUs are now configured for
SNMP control
o Repeat everything
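The route-table consistency check named in the 3/27 entry above can be
sketched roughly as follows: collect the destination prefixes reported by
netstat -r and by the on-card ifstat -r listing, then report any
differences. The ifstat flags and output format are assumptions here;
only the general diff approach is illustrated.

#!/usr/bin/env python3
# Sketch of the route-table consistency check: diff the kernel table
# (netstat -rn) against the on-card table (ifstat -r).  The ifstat -r
# column layout and flags are assumed for illustration.
import subprocess

def destinations(cmd):
    """Return the set of destination prefixes in the first column of a
    route-table listing, skipping headers and blank lines."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    dests = set()
    for line in out.splitlines():
        fields = line.split()
        # crude heuristic: keep rows whose first field looks like an address
        if fields and fields[0][0].isdigit():
            dests.add(fields[0])
    return dests

def main():
    kernel = destinations(["netstat", "-rn"])
    # "ifstat -r" is the RS960 on-card route listing named in this document;
    # its exact flags and output format on the test systems are assumed
    oncard = destinations(["ifstat", "-r"])
    for d in sorted(kernel - oncard):
        print("in kernel table only:", d)
    for d in sorted(oncard - kernel):
        print("on-card only:", d)
    if kernel == oncard:
        print("tables consistent: %d routes" % len(kernel))

if __name__ == "__main__":
    main()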
Phase-III Test Experiences and Results To-Date Summary
======================================================
Testing is scheduled to continue up until 4/17. We will then freeze
the test network in preparation for deployment on 4/24. The software build
that supports the new technology will be installed on selected nodes between
4/17 and 4/24 to burn-in prior to the 4/24 hardware upgrades. Testing on the
test network will continue throughout the deployment.
Overall Testnet Observations
----------------------------
The performance of the RS960 technology is far superior to that of the
existing Hawthorne T3 adapter technology. Although peak performance
throughput tests have not yet been conducted, the steady state performance for
card-card transfers has been measured in excess of 20KPPS with excellent
stability.
During the early deployment of the RS960 and DSU technology on the
test network, we observed several new transmission facility problems that were
not observed in the lab tests, or in earlier DS3 adapter tests. We found a
new form of data pattern sensitivity where, under certain conditions, the new
DSU can generate a stream of 010101 bits that induces an Alarm Indication
Signal (blue alarm) within the MCI network.
The RS960 and existing Hawthorne T3 card do not interoperate over a
serial link. However they will interoperate if both are installed within a
single CNSS node using specially developed driver software which uses the main
RS6000 CPU for packet forwarding between the RS960 and Hawthorne T3 cards. As
part of the phase-III migration period we had originally planned to support an
interim CNSS configuration of 3xHawthorne T3 adapters and 1xRS960 adapter as
well as a 1xHawthorne T3 & 3xRS960 T3 interim CNSS configuration.
Unfortunately, during the performance tests we determined that the 3xHawthorne
T3 & 1xRS960 T3 configuration creates a performance bottleneck that could
cause congestion under heavy load. This is due to the interim-state RS960 <->
Hawthorne interface software driver bottleneck, where the main RS6000 CPU is
used for packet forwarding between dissimilar adapters. We have therefore
eliminated this configuration from the deployment plan and will support only
the 1xHawthorne, 3xRS960 configuration as an interim state during the
deployment. We are also looking at various deployment strategies that will
avoid any congestion across an interim RS960<->Hawthorne path. These strategies
include interior T3 link metric adjustment, safetynet link load splitting,
careful placement of these transition links to avoid hot-spots, or temporary
addition of new DS3 links that will short-cut these transition links.
We are using a tool called the copy tool, developed as a host
system that interfaces to both a production network ethernet and a test
network ethernet. All production ethernet packets are promiscuously copied on
the host, given a new destination address, and injected into the Research
network to simulate production traffic flows within the Research network. We
have found a bug in the copy tool that has caused problems on the test
ethernets at a couple of Research network locations. Every time the copy tool
is re-booted, we experience congestion on the test ethernets due to an
erroneous broadcast of a copied packet onto a test ethernet. We are fixing
this problem before we run these tests again.
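A minimal sketch of the copy-tool idea follows: read frames
promiscuously from the production-side ethernet, rewrite the destination
address, and re-inject them on the test-side ethernet. The interface
names, the substitute destination MAC, and the use of Linux-style raw
packet sockets are assumptions for illustration; filtering out
broadcast/multicast copies, as shown, is one way to avoid the kind of
congestion problem described above.

#!/usr/bin/env python3
# Minimal copy-tool sketch: copy frames from a production-side ethernet,
# rewrite the destination MAC, and re-inject them on a test-side ethernet.
# Interface names, the rewritten destination MAC, and Linux AF_PACKET
# sockets are assumptions; the real tool differs in detail.
import socket

PROD_IF = "eth0"                         # production-side interface (assumed)
TEST_IF = "eth1"                         # test-side interface (assumed)
NEW_DST = bytes.fromhex("02005e000001")  # test-net router MAC (assumed)
ETH_P_ALL = 0x0003

def main():
    # PROD_IF is assumed to already be in promiscuous mode
    # (e.g. "ip link set eth0 promisc on")
    rx = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    rx.bind((PROD_IF, 0))
    tx = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    tx.bind((TEST_IF, 0))
    while True:
        frame, _ = rx.recvfrom(65535)
        if len(frame) < 14:
            continue
        dst = frame[:6]
        # never forward broadcast/multicast copies onto the test ethernet;
        # skipping them sidesteps the congestion problem noted above
        if dst[0] & 0x01:
            continue
        # rewrite the destination address, keep source and payload as-is
        tx.send(NEW_DST + frame[6:])

if __name__ == "__main__":
    main()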
We have run numerous route flapping tests in which we install and
delete routes repeatedly on all installed RS960 cards, and we have not
encountered any chronic problems so far. The installation and deletion of 6000
routes on the card is fast enough that we cannot measure any inconsistencies
between different on-card route tables. We have set a limit of 6000 routes on
the card for now since this reflects the deployment configuration; however, we
can support up to 14000 IP routes on the card if necessary.
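For illustration only, a toy flap driver in the spirit of these tests
might repeatedly install and delete a block of static routes and then
re-check table consistency. The prefix block, gateway, cycle count, and
use of the 'ip route' command are assumptions; the actual tests flapped
roughly 6000 routes through the routing daemon rather than static
entries.

#!/usr/bin/env python3
# Toy route-flap driver: repeatedly add and delete a block of static
# routes, then re-check route-table consistency.  Prefixes, gateway,
# cycle count, and the 'ip route' command are illustrative assumptions.
import subprocess
import ipaddress

GATEWAY = "192.0.2.1"                          # next hop on test ethernet (assumed)
BLOCK = ipaddress.ip_network("10.128.0.0/16")  # test prefix block (assumed)
CYCLES = 10

def ip_route(action, prefix):
    subprocess.run(["ip", "route", action, str(prefix), "via", GATEWAY],
                   check=False)   # tolerate 'already exists' / 'not found'

def main():
    prefixes = list(BLOCK.subnets(new_prefix=24))   # 256 /24s per cycle
    for cycle in range(CYCLES):
        for p in prefixes:
            ip_route("add", p)
        for p in prefixes:
            ip_route("del", p)
        print("cycle %d complete (%d routes flapped)" % (cycle, len(prefixes)))
        # at this point one would re-run the netstat/ifstat diff shown earlier

if __name__ == "__main__":
    main()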
We have opened and closed over 100 problems during the several months
of lab and Research network testing on RS960. There are currently 14 open
problems remaining from our tests to date and we can provide some details on
this for anyone interested. We have fixes for most of these problems that
will be regression tested on the Research network next week. We expect to
close these problems prior to deployment of RS960 on 4/24.