Phase-III RS960 Deployment Status Report - Week 1
=================================================

Jordan Becker, ANS
Mark Knopper, Merit

Summary
=======

- RS/960 Hardware Deployment

Early last Saturday morning, April 25, the first step of the Phase-III
RS960 DS3 upgrade to the T3 Backbone was completed. The following T3
backbone nodes are currently running with new T3 hardware and software in
a stable configuration:

    Seattle POP:  CNSS88, CNSS89, CNSS91
    Denver POP:   CNSS96, CNSS97, CNSS99
    Regionals:    ENSS141 (Boulder), ENSS142 (Salt Lake), ENSS143 (U. Washington)

This Friday night 5/1 and Saturday morning 5/2, the second step will
occur, with the following nodes being upgraded:

    Los Angeles POP (barring rioting, etc.):  CNSS16, CNSS17, CNSS19
    San Francisco POP:  CNSS8, CNSS9, CNSS11
    Regionals:  ENSS128 (Palo Alto), ENSS144 (NASA Ames/FIX-West), ENSS135 (San Diego)

- Kernel Build and Routing Software Deployment

Early this morning (April 30), all of the network nodes were upgraded
with Build 2.78.22 and a new version of rcp_routed. Several nodes (the
Step 1 nodes and also the Ann Arbor ENSS) have been running in this
configuration since April 27. This software build includes drivers for
the RS/960 cards and several bug fixes. The fixes include reliability
improvements to the FDDI software, notably a soft reset for a hung card,
and additional improvements to the T960 ethernet and T1 card support to
fix a problem where the card could hang under a heavy load.

The rcp_routed change includes a fix for a problem that was observed
during RS/960 testing. The problem involved aberrant behavior of
rcp_routed when a misconfigured regional peer router would advertise a
route via BGP to an ENSS whose next hop was the ENSS itself. The old
rcp_routed could go into a loop, sending multiple redirect packets out
onto the subnet. The new rcp_routed will close the BGP session if it
receives an announcement of such a route (a sketch of this check appears
at the end of this summary). The new rcp_routed software also has support
for externally administered inter-AS metrics, an auto-restart capability,
and bug fixes for BGP overruns with peer routers.

This deployment caused a few problems. The first is that this new feature
of rcp_routed pointed out a misconfigured peer router at Rice University
in Houston. This caused the BGP connection to open and close rapidly,
which caused further problems on the peer router. Eventually the peer was
reconfigured to remove the bad route, which fixed the problem.

Another problem was on the Argonne ENSS. This node crashed in such a way
that it was not reachable, and there was difficulty contacting the site
to reboot the machine. Once the reboot was done, the ENSS came back up
running properly. We are looking into the cause of this crash, as this is
the first time it has happened in all of our testing and deployment.

The third problem was at the CoNCert ENSS. This node crashed in such a
way that the unix file system was completely lost, which requires a
complete rebuild of the system. This problem has happened in two cases
previously and seems to be related to an AIX problem rather than anything
in the new software build. It is an insidious problem since it completely
wipes out any trace of what may have caused it. We continue to look into
this.

In our view, none of these problems are severe enough to call off the
deployment of the RS/960 hardware upgrade on the west coast Saturday
morning. These problems are not related to the new hardware or software
support for the new cards, but appear to be problems in other parts of
the new software build or routing daemon. The following section details
our experiences from the first week's deployment. We believe we have an
understanding of the problems that occurred and have increased the number
of spare cards in the kits that the teams will take out to the sites.
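
For reference, below is a minimal, self-contained C sketch of the
next-hop check described above. It is not the actual rcp_routed source;
the data structures, helper names, peer name, and addresses are invented
for illustration. It shows only the new behavior: an announcement whose
next hop is the receiving ENSS itself is rejected and the BGP session is
flagged for closing, rather than installing the route and looping on
redirect packets.

    /*
     * Illustrative sketch only -- not the actual rcp_routed code.  Names
     * such as check_announcement() and local_addrs[] are invented here.
     */
    #include <stdio.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define MAX_LOCAL_ADDRS 8

    /* Addresses owned by this ENSS, as the daemon would learn at startup. */
    static struct in_addr local_addrs[MAX_LOCAL_ADDRS];
    static int local_addr_count;

    /* Return nonzero if the announced next hop is one of our own addresses. */
    static int next_hop_is_self(struct in_addr next_hop)
    {
        int i;

        for (i = 0; i < local_addr_count; i++)
            if (local_addrs[i].s_addr == next_hop.s_addr)
                return 1;
        return 0;
    }

    /*
     * Decide what to do with one route from a peer's BGP UPDATE.  Returns
     * 0 if the route may be installed, or -1 if the announcement points
     * back at ourselves and the session should be closed (the new
     * behavior); the old daemon installed such a route and then looped
     * emitting redirects onto the subnet.
     */
    static int check_announcement(const char *peer, struct in_addr dest,
                                  struct in_addr next_hop)
    {
        if (next_hop_is_self(next_hop)) {
            printf("bgp: peer %s announced %s with our own address as "
                   "next hop; closing session\n", peer, inet_ntoa(dest));
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        struct in_addr dest, hop;

        /* Pretend this ENSS owns 192.0.2.1 (example address only). */
        local_addrs[local_addr_count++].s_addr = inet_addr("192.0.2.1");

        dest.s_addr = inet_addr("10.0.0.0");   /* route being announced         */
        hop.s_addr  = inet_addr("192.0.2.1");  /* misconfigured: next hop is us */

        if (check_announcement("example-peer", dest, hop) < 0)
            printf("session with example-peer would be closed\n");
        return 0;
    }

In the running daemon such a check would presumably be applied to each
prefix as UPDATE messages are parsed, before any route is handed to the
kernel forwarding table.
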
Summary of Week 1 (beginning 4/25) RS/960 Deployment
====================================================

Since the deployment, nodes CNSS88 and CNSS96 have been running with
mixed technology (i.e. 3xRS960 T3 interfaces, 1xHawthorne T3 interface).
Production traffic on the affected ENSS nodes was cut over to the T1
backbone at 2:00 AM EST on 4/25. The Denver POP nodes were returned to
full service after approximately 10.5 hours of work and Seattle after
about 13 hours. The new software had been deployed on the affected nodes
during the prior week.

The physical upgrade consisted of mechanical changes to the RS6000
machines, including (1) replacement of the RS6000 I/O planar, (2)
replacement of the Hawthorne T3 adapters with RS960 adapters, (3)
upgrading the DSU hardware and associated cabling, (4) a modification of
the cooling capabilities of the RS6000, (5) updating and standardizing on
new labels, and (6) local software configuration changes accommodating
the use of the new hardware. The nodes were then tested and returned to
production service.

Although the resulting configuration is very stable and gratifying to the
team working on the upgrade, the process was not quite as smooth as had
been expected. There were 33 RS960 T3 interface adapters, T3 DSUs, and 9
I/O planars installed. Out of these, 5 RS960 adapters and 2 I/O planars
had to be replaced to achieve stability. The failing new components were
air-shipped back to the staging lab for a complete failure analysis; a
summary of the results is described below. All of the problems involving
replaced components have been identified or explained.

The post-manufacturing burn-in and staging process consists of a stress
test involving card-to-card traffic testing, DSU testing, and power
cycling prior to shipping. Lab burn-in and staging procedures have been
adjusted based upon our experiences with the first deployment step. The
installation procedures will be fine-tuned to improve our efficiency and
reduce the installation time required. It is therefore our intention to
continue with the Step 2 deployment plan in the San Francisco and Los
Angeles backbone nodes (and adjacent ENSS nodes) on Friday evening 5/1.

Installation Problems Encountered
=================================

The replacement of the RS6000 I/O planars had to be performed twice on
nodes CNSS88 and CNSS91 after the nodes did not come up properly the
first time. The original set of planars was returned to the lab for
analysis. It was determined that the underside of the planars was damaged
during the physical installation process. The procedure for installing
these planars in the RS6000 model 930 rack-mounted system has been
modified to prevent this type of damage from occurring.

Adapter re-seating and reconfiguration was required on the asynchronous
communications adapter on CNSS89. The dual planar change and
troubleshooting work adversely affected the hard disk on CNSS88, and the
disk image had to be reloaded from tape.

An apparently dysfunctional DSU HSSI card connected to CNSS97 was
replaced and returned for lab failure analysis.
The analysis revealed that a DSU microprocessor was not properly seated.
T3 Technologies will re-inspect the DSUs for proper seating of all DSU
components.

The DSU at ENSS142 (Salt Lake City) needed internal work. The analysis
revealed a bent pin on the connector between the HSSI card and the DSU
backplane. T3 Technologies Inc. (the DSU manufacturer) identified a
mechanical deficiency in the connector that can cause certain pins to
loosen and bend during insertion of the card. T3 Technologies will add a
procedure to inspect these connectors for such deficiencies prior to
shipment.

An RS960 T3 card in CNSS89 did not have the correct electrical isolation
spacers installed, and this caused a temporary short circuit. The card
was returned to the lab, and it was determined that a last-minute change
had been made to the card to swap an i596 serializer chip and that the
spacer was erroneously left off. A visual inspection procedure for
spacers and standoffs will be added to the manufacturing and staging
inspection process.

The RS960 T3 card in CNSS97 came up in a NONAP mode; however, the card
continued to respond to DSU queries. The card was replaced. During the
failure analysis, the card passed all stress tests. This card was swapped
prior to a planar change, and the NONAP mode has been attributed to the
planar problem.

When reliable connectivity could not be achieved between CNSS97 and
ENSS141, another RS960 T3 card in CNSS97 was replaced. The lab failure
analysis revealed that the 20 MHz oscillator crystal on the card was
broken due to a mechanical shock. This shock could have occurred during
shipping or handling.

CNSS91 rebooted after the upgrade and exhibited an unstable
configuration. Both the I/O planar and the RS960 T3 card were changed
before stable operation was achieved. The RS960 card was later found to
have a cold solder joint between the RAM module and the adapter. The
solder joint would allow the card to work when cold but would fail
intermittently when running hot. A simple adapter thermal cycling
procedure will be investigated to determine if this problem can be
avoided in the future.

CNSS88 was the last node to come up. The I/O planar and memory card had
to be re-seated, and an RS960 T3 card replacement was necessary. The
RS960 card was found to have a solder splash overlapping two adjacent
pins. As a general corrective action, a high-magnification visual
inspection of the microchannel interface for shorts will be added to the
staging and inspection process.

Conclusions
===========

Although the Step 1 installation was complicated by several component
problems, the entire deployment team (IBM, Merit, ANS, MCI) has a high
comfort level with the results of the failure analysis, the stability of
the resulting installation, and the overall deployment process. We are
very encouraged that we have completed this first step and achieved a
very stable configuration within the time, human resources, spares, and
support that were provisioned as part of the deployment plan. The
software bring-up and test scripts worked exactly as planned. The
teamwork among the 50+ individuals involved at Merit, IBM, MCI and ANS
was excellent.

The basic installation process is sound, although some minor changes will
be implemented in the future installation steps. These include changes to
the RS960 adapter installation, the DSU troubleshooting procedure, and
the I/O planar change process. Also, the number of spares provisioned in
a deployment kit will be increased as a precaution for future
installations.

There has been one software problem identified with the new technology on
the test network since the end of system testing on 4/17. This involves a
microcode bug where a portion of the RS960 on-card memory will not report
a parity error to the system if that portion of the memory suffers a
failure. All RS960 on-card memory is tested prior to shipment (a sketch
of this kind of pattern test appears at the end of this report), and a
microcode fix to this problem will be delivered by IBM for testing within
the next week. Only a single RS960 adapter has demonstrated an on-card
memory parity error during all test network testing.

Next Steps
==========

Step 2 of the deployment is currently scheduled to commence at 23:00
local time on 5/1. Step 2 will involve the following nodes/locations:

    May 1/2    San Francisco, Los Angeles core nodes
               T3 ENSS Nodes: Palo Alto E128, San Diego E135
               Other ENSS Nodes: E144, E159, E156, E170 also affected
               Second core node visit: Seattle
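
As an aside on the on-card memory parity issue described above, the
following is an illustrative, self-contained C sketch of the kind of
pattern test implied by "all RS960 on-card memory is tested prior to
shipment". It is not the vendor's test program, and an ordinary heap
buffer stands in for the adapter's memory (a real test would drive the
card's memory window through the adapter, which this sketch does not
attempt). It writes a few classic patterns, reads them back, and reports
any word that fails, which is the class of failure the microcode fix is
intended to surface as a parity error.

    /*
     * Illustrative pattern test only -- not the vendor's RS960 memory
     * test.  An ordinary heap buffer stands in for on-card memory.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static const unsigned long patterns[] = {
        0x00000000UL, 0xFFFFFFFFUL, 0xAAAAAAAAUL, 0x55555555UL
    };

    /* Write each pattern to every word, read it back, count mismatches. */
    static long pattern_test(volatile unsigned long *mem, size_t words)
    {
        long errors = 0;
        size_t i, p;

        for (p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
            for (i = 0; i < words; i++)
                mem[i] = patterns[p];
            for (i = 0; i < words; i++)
                if (mem[i] != patterns[p]) {
                    printf("word %lu failed pattern %#lx\n",
                           (unsigned long)i, patterns[p]);
                    errors++;
                }
        }
        return errors;
    }

    int main(void)
    {
        size_t words = 256 * 1024;          /* arbitrary stand-in size */
        unsigned long *mem = malloc(words * sizeof(*mem));

        if (mem == NULL)
            return 1;
        printf("%ld error(s) found\n", pattern_test(mem, words));
        free(mem);
        return 0;
    }

A production memory test would also use address-unique patterns (for
example, writing each word's own address) to catch addressing faults, but
the overall structure is the same.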