Phase-III RS960 Deployment Status Report - Week 1
=================================================
Jordan Becker, ANS
Mark Knopper, Merit
Summary
=======
- RS/960 Hardware Deployment
Early last Saturday morning, April 25, the first step of
the Phase-III RS960 DS3 upgrade to the T3 Backbone was completed.
The following T3 backbone nodes are currently running with new T3
hardware and software in a stable configuration:
Seattle POP: CNSS88, CNSS89, CNSS91
Denver POP: CNSS96, CNSS97, CNSS99
Regionals: ENSS141 (Boulder), ENSS142 (Salt Lake), ENSS143 (U. Washington)
This Friday night 5/1 and Saturday morning 5/2, the second step
will occur, with the following nodes being upgraded:
Los Angeles POP (barring rioting, etc.): CNSS16, CNSS17, CNSS19
San Francisco POP: CNSS8, CNSS9, CNSS11
Regionals: ENSS128 (Palo Alto), ENSS144 (NASA Ames/FIX-West),
ENSS135 (San Diego)
- Kernel Build and Routing Software Deployment
Early this morning (April 30), all of the network nodes were
upgraded with Build 2.78.22 and a new version of rcp_routed. Several
nodes (the Step 1 nodes and also the Ann Arbor ENSS) have been running
in this configuration since April 27. This software build includes
drivers for the RS/960 cards, and several bug fixes. The fixes include
reliability improvements to the FDDI software, notably a soft reset for
a hung card; and some additional improvements to the T960 Ethernet and
T1 card support to fix a problem where the card could hang under a
heavy load. The rcp_routed change includes a fix to a problem that was
observed during RS/960 testing. The problem involved some aberrant
behavior of rcp_routed when a misconfigured regional peer router would
advertise a route via BGP to an ENSS whose next hop was the ENSS
itself. The old rcp_routed could go into a loop sending multiple
redirect packets out onto the subnet. The new rcp_routed will close
the BGP session if it receives an announcement of such a route. The
new rcp_routed software also has support for externally administered
inter-AS metrics, an auto-restart capability, and bug fixes for BGP
overruns with peer routers.
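To illustrate the routing change described above, the following is a
minimal sketch in C of the kind of next-hop check the new rcp_routed
performs. The rcp_routed source itself is not reproduced here; the
structure names, the local address table, and the helper functions in
this sketch are hypothetical illustrations, not the daemon's actual
interfaces.

    /*
     * Minimal sketch of the next-hop check described above.  This is not
     * rcp_routed source; the structure names, the local address table,
     * and the helper functions are hypothetical illustrations only.
     */
    #include <stdbool.h>
    #include <unistd.h>
    #include <netinet/in.h>

    struct bgp_peer {
        struct in_addr remote_addr;   /* peer (regional router) address  */
        int            sock;          /* TCP socket carrying the session */
    };

    struct route_update {
        struct in_addr prefix;        /* advertised destination network   */
        struct in_addr next_hop;      /* NEXT_HOP attribute from the peer */
    };

    /* Addresses configured on this ENSS, filled in from the interface
     * configuration at startup (hypothetical). */
    static struct in_addr local_addrs[8];
    static int n_local_addrs;

    static bool is_local_address(struct in_addr a)
    {
        for (int i = 0; i < n_local_addrs; i++)
            if (local_addrs[i].s_addr == a.s_addr)
                return true;
        return false;
    }

    static void close_bgp_session(struct bgp_peer *peer)
    {
        close(peer->sock);            /* dropping the TCP connection ends */
        peer->sock = -1;              /* the BGP session with the peer    */
    }

    /*
     * Old behavior: a route whose next hop pointed back at the ENSS itself
     * was accepted, and forwarding along it produced a loop of redirect
     * packets onto the subnet.  New behavior: close the session with the
     * misconfigured peer instead of accepting such a route.
     */
    void handle_update(struct bgp_peer *peer, const struct route_update *upd)
    {
        if (is_local_address(upd->next_hop)) {
            close_bgp_session(peer);
            return;
        }
        /* ... otherwise process and install the route as usual ... */
    }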
This deployment caused a few problems. One was that the
new route-rejection behavior of rcp_routed exposed a misconfigured peer
router at Rice University in Houston. This caused the BGP connection to
open and close rapidly, which caused further problems on the peer
router. Eventually the peer was reconfigured to remove the
bad route, which fixed the problem. Another problem was on the
Argonne ENSS. This node crashed in such a way that it was
not reachable and there was difficulty contacting the site to
reboot the machine. Once the reboot was done, the ENSS came back
up running properly. We are looking into the cause of this crash,
as this is the first time this has happened in all of our testing
and deployment. The third problem was at the CoNCert ENSS, which
crashed in such a way that the Unix file system was completely lost,
requiring a complete rebuild of the
system. This problem has happened in two cases previously, and
seems to be related to an AIX problem rather than anything in
the new software build. It is an insidious problem since it
completely wipes out any trace of what may have caused it. We
continue to look into this.
In our view, none of these problems is severe enough
to call off the deployment of the RS/960 hardware upgrade on
the west coast Saturday morning. These problems are not related
to the new hardware or to software support for the new cards, but
appear to be problems in other parts of the new software build
or routing daemon. The following section details our experiences
from the first week's deployment. We believe we have an understanding
of the problems that occurred and have increased the number
of spare cards in the kits that the teams will take out to the
sites.
Summary of Week 1 (beginning 4/25) RS/960 Deployment
====================================================
Since the deployment, nodes CNSS88 and CNSS96 have been running
with mixed technology (i.e. 3xRS960 T3 interfaces, 1xHawthorne
T3 interface). Production traffic on the affected ENSS nodes was
cut over to the T1 backbone at 2:00 AM EST on 4/25. The Denver POP
nodes were returned to full service after approximately 10.5 hours, and
Seattle after about 13 hours of work.
The new software had been deployed on the affected nodes during the
prior week. The physical upgrade consisted of mechanical changes to the
RS6000 machines including (1) replacement of the RS6000 I/O planar, (2)
replacement of the Hawthorne T3 adapters with RS960 adapters, (3) upgrading
the DSU hardware and associated cabling, (4) a modification of the cooling
capabilities of the RS6000, (5) updating and standardizing on new labels and
(6) local software configuration changes accommodating the use of new
hardware. The nodes were then tested and returned to production service.
Although the resulting configuration is very stable and gratifying to
the team working on the upgrade, the process was not quite as smooth as had
been expected. There were 33 RS960 T3 interface adapters, T3 DSUs, and 9 I/O
planars installed. Out of these, 5 RS960 adapters and 2 I/O planars had to be
replaced to achieve stability. The failing new components were air-shipped
back to the staging lab for a complete failure analysis, and a summary of the
results is provided below.
All of the problems involving replaced components have been
identified or explained. The post-manufacturing burn-in and
staging process consists of a stress test involving card-to-card
traffic testing, DSU testing, and power cycling prior to shipping. Lab
burn-in and staging procedures have been adjusted based upon our
experiences with the first deployment step. The installation
procedures will be fine-tuned to improve our efficiency and reduce the
installation time required. It is therefore our intention to continue
with the Step 2 deployment plan in the San Francisco and Los Angeles
backbone nodes (and adjacent ENSS nodes) on Friday evening 5/1.
Installation Problems Encountered
=================================
The replacement of the RS6000 I/O planars had to be performed
twice on nodes CNSS88 and CNSS91 after the nodes did not come
up properly the first time. The original set of planars was returned
to the lab for analysis. It was determined that the undersides of the
planars were damaged during the physical installation process. The
procedure for installing these planars in the RS6000 model 930
rack-mounted system has been modified to prevent this type of damage
from occurring.
Adapter re-seating and reconfiguration were required on the
asynchronous communications adapter on CNSS89.
The dual planar change and troubleshooting work adversely
affected the hard disk on CNSS88, and the disk image had to be
reloaded from tape.
An apparently dysfunctional DSU HSSI card connected to CNSS97 was
replaced and returned for lab failure analysis. The
analysis revealed that a DSU microprocessor was not properly seated.
T3 Technologies will re-inspect the DSUs for proper seating of all DSU
components.
The DSU at ENSS142 (Salt Lake City) needed internal work. The analysis
revealed a bent pin on the connector between the HSSI card and the DSU
backplane. T3 Technologies Inc. (DSU manufacturer) identified a mechanical
deficiency in the connector that can cause certain pins to loosen and bend
during insertion of the card. T3 Technologies will add a procedure to inspect
these connectors for any deficiencies prior to shipment.
An RS960 T3 card in CNSS89 did not have the correct electrical isolation
spacers installed, and this caused a temporary short circuit. The card was
returned to the lab, where it was determined that a last-minute change had
been made to the card to swap an i596 serializer chip and that the spacer had
been erroneously left off. A visual inspection procedure for spacers and standoffs
will be added to the manufacturing and staging inspection process.
The RS960 T3 card in CNSS97 came up in NONAP mode; however, the card
continued to respond to DSU queries. The card was replaced. During the
failure analysis, the card passed all stress tests. This card was swapped
prior to a planar change and the NONAP mode has been attributed to the planar
problem.
When reliable connectivity could not be achieved between
CNSS97 and ENSS141, another RS960 T3 card in CNSS97 was
replaced. The lab failure analysis revealed that the 20 MHz oscillator
crystal on the card was broken due to a mechanical shock. This shock
could have occurred during shipping or handling.
CNSS91 rebooted after the upgrade and exhibited an unstable
configuration. Both the I/O planar and the RS960 T3 card were
changed before stable operation was achieved. The RS960 card was later
found to have a cold solder joint between the RAM module and adapter.
The solder joint would allow the card to work when cold, but would fail
intermittently when running hot. A simple adapter thermal cycling
procedure will be investigated to determine if this problem can be
avoided in the future.
CNSS88 was the last node to come up. The I/O planar and memory card
had to be re-seated, and an RS960 T3 card replacement was necessary.
The RS960 card was found to have a solder splash bridging
two adjacent pins. A general corrective action for this problem
is to add a high-magnification visual inspection of the Micro Channel
interface for shorts.
Conclusions
===========
Although the Step 1 installation was complicated by several component
problems, the entire deployment team (IBM, Merit, ANS, MCI) has a high comfort
level with the results of the failure analysis, the stability of the resulting
installation, and the overall deployment process. We are very encouraged that we
have completed this first step and achieved a very stable configuration within
the time, human resources, spares, and support that were provisioned as part of
the deployment plan. The software bring-up and test scripts worked exactly as
planned.
The teamwork between the 50+ individuals involved at Merit, IBM, MCI
and ANS was excellent. The basic installation process is sound, although some
minor changes will be implemented in the future installation steps. These
will include changes to the RS960 adapter installation, the DSU troubleshooting
procedure, and the I/O planar change process. Also, the number of spares
provisioned in a deployment kit will be increased as a precaution for future
installations.
There has been one software problem identified with the new technology
on the test network since the end of system testing on 4/17. This involves a
microcode bug where a portion of the RS960 on-card memory will not report a
parity error to the system if that portion of the memory suffers a failure.
All RS960 on-card memory is tested prior to shipment, and a microcode fix for
this problem will be delivered by IBM for testing within the next week. Only
a single RS960 adapter has demonstrated an on-card memory parity error during
all test network testing.
Next Steps
==========
Step 2 of the deployment is currently scheduled to commence at 23:00
local time on 5/1. Step 2 will involve the following nodes/locations:
May 1/2
San Francisco, Los Angeles core nodes
T3 ENSS Nodes: Palo Alto E128, San Diego E135,
Other ENSS Nodes: E144, E159, E156, E170 also affected
Second core node visit: Seattle