RS/960 deployment on the T3 Backbone - Week 3
Phase-III T3 Network Deployment - Step 3 Status Report
======================================================

Jordan Becker, ANS
Mark Knopper, Merit

Step 3 of the phase-III network was successfully completed last
Saturday 5/9. This was the most extensive step planned as part of the
phase-III deployment. It was completed with only a few minor problems,
which caused us to extend our scheduled maintenance window by 1 hour
and 55 minutes. The deployment was completed at 11:55 EST on 5/9.

The following T3 backbone nodes are currently running with new T3
RS960 hardware and software in a stable configuration:

  Seattle POP:        CNSS88, CNSS89, CNSS91
  Denver POP:         CNSS96, CNSS97, CNSS99
  San Fran. POP:      CNSS8,  CNSS9,  CNSS11
  L.A. POP:           CNSS16, CNSS17, CNSS19
  Chicago POP:        CNSS24, CNSS25, CNSS27
  Cleveland POP:      CNSS40, CNSS41, CNSS43
  New York City POP:  CNSS32, CNSS33, CNSS35
  Hartford POP:       CNSS48, CNSS49, CNSS51

  Regionals: ENSS141 (Boulder), ENSS142 (Salt Lake),
             ENSS143 (U. Washington), ENSS128 (Palo Alto),
             ENSS144 (FIX-W), ENSS135 (San Diego),
             ENSS130 (Argonne), ENSS131 (Ann Arbor),
             ENSS132 (Pittsburgh), ENSS133 (Ithaca),
             ENSS134 (Boston), ENSS137 (Princeton)

CNSS16, CNSS96, CNSS24, CNSS32, and CNSS48 are now running with mixed
technology (e.g. 3 RS960 T3 interfaces and 1 Hawthorne T3 interface).

Step 3 Deployment Difficulties
==============================

During the step 3 deployment, a suspected bad RS/960 card was removed
from the Ann Arbor ENSS131 node. This took about 1 hour of
troubleshooting time.

The T1-C1 (C35) crashed and would not reboot. This turned out to be a
loose connector to the SCSI controller, probably due to the physical
move of all CNSS equipment within the NYU POP.

There was a problem getting the link to Argonne ENSS130 back online.
This turned out to be a misconnected cable on the DSU.

There was also an early Saturday morning problem with the T3-B machine
at the Cleveland POP. The machine would come up and then crash within
minutes. A number of different troubleshooting procedures were
attempted, and the final solution was to replace the RS960 card which
interconnects the T3-B machine to the T3-C1 (i.e. C40->C41). There was
a metal shaving on the HSSI connector; however, when this was removed,
the card still did not work. This was the last machine to be upgraded,
and it took us a few hours to fix (9am-11am).

Finally, during the troubleshooting of the Cleveland CNSS40 node, we
found that the RS960 card in the Chicago POP CNSS27 node had frozen
(probably a mis-seated or broken card). It had been running fine for
an estimated 5 hours. The card was replaced.

Ten hours after the upgrade, everything seemed to be normal with one
exception: there was an abnormal packet loss rate between
CNSS8<->CNSS9, measured via an on-card RS960 utility program (a rough
stand-in for this kind of measurement is sketched at the end of this
section). The RS960 card in the CNSS8 node was replaced Sunday at
6:00am PST.

During the deployment activities, the "rover" monitoring software was
running at both Merit (Ann Arbor) and ANS (Elmsford), with a backup
monitor running on an unused RS/6000 CNSS at the Denver POP. An effort
was made to keep either Ann Arbor or Elmsford up and connected to the
T3 backbone at all times during the night, so that the backup monitor
would not be required. The NOC was able to successfully monitor the T1
and T3 backbones throughout the deployment timeframe.

In addition to all the normal deployment activities, we were able to
swap the RS960 ethernet card at Pittsburgh.
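The on-card RS960 utility referenced above is internal to the card,
and its interface is not described in this report. As a rough
stand-in, the following sketch estimates the loss rate across a link
by sending a burst of ICMP echo probes and applying the same
arithmetic (1 - received/sent). It assumes only a standard Unix
ping(8); the hostname is a made-up placeholder, not a real node name.

#!/usr/bin/env python3
# Hypothetical sketch: approximate the loss rate across a backbone
# link with ICMP echo probes. This is NOT the on-card RS960 utility;
# it only illustrates the loss-rate calculation.
import subprocess

def loss_rate(host, count=100):
    """Send `count` pings to `host` and return the fraction lost."""
    proc = subprocess.run(["ping", "-c", str(count), "-q", host],
                          capture_output=True, text=True)
    received = 0
    for line in proc.stdout.splitlines():
        # Summary line: "100 packets transmitted, 97 received, ..."
        # (BSD ping says "97 packets received"; both parse the same.)
        if "packets transmitted" in line:
            received = int(line.split(",")[1].split()[0])
    return 1.0 - received / count

if __name__ == "__main__":
    # Placeholder name for the far end of the CNSS8<->CNSS9 link.
    print("loss rate: %.1f%%" % (100 * loss_rate("cnss9.t3.example.net")))

Note that ICMP probes measure round-trip loss over the whole path
rather than one card's counters, so this is only a coarse substitute
for the on-card statistics.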
Following the weekend deployment, we identified an RS960 card on
CNSS25 that was recording some DMA underrun errors. This was not
affecting user traffic, but the card was scheduled to be swapped out
as a precaution. Since we are revisiting the Chicago POP during the
upcoming step 4 deployment this weekend, we will replace the card
then.

Taking into consideration that we upgraded 4 POPs at the same time, we
feel this deployment went rather well. There were 4 RS960 cards
shipped back to IBM for failure analysis. Preliminary analysis in the
lab did not result in any reproducible failures.

We now have a complete cross-country RS960 path established. We have
established modified T3 link metrics to balance traffic across the 5
existing hybrid links. We are watching very closely over the next two
weeks for any circuit or equipment problems that would result in the
cross-country RS960 path being broken, since this could cause
congestion on one or more hybrid links.

Step 4 Deployment Scheduled for 5/15
====================================

Based upon the successful completion of step 3 of the deployment, step
4 is currently scheduled to commence at 23:00 local time on 5/15.
Step 4 will involve the following nodes/locations:

  St. Louis POP:          CNSS80, CNSS81, CNSS83
  Houston POP:            CNSS64, CNSS65, CNSS67
  Second Visit:           CNSS96 (Denver), CNSS24 (Chicago), CNSS16 (L.A.)
  Regionals:              ENSS130 (Argonne), ENSS140 (Lincoln), ENSS139 (Rice)
  Other ENSS's Affected:  ENSS165, ENSS157, ENSS174, ENSS173

During this deployment, the Houston ENSS139 node will be isolated from
the backbone. Therefore the Houston T1/T3 interconnect will be
switched over to San Diego ENSS135 prior to the deployment. The Ann
Arbor ENSS131 interconnect gateway is expected to remain operational
throughout the deployment.

Following the step 4 deployment, selected T3 internal link metrics
will be re-adjusted to support load balancing of traffic across the 3
different hybrid technology links that will then exist. These link
metrics were chosen by calculating the traffic distribution on each
link from an AS<->AS traffic matrix (a simplified version of this
calculation is sketched below).
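For illustration only, a minimal sketch of the metric-selection
calculation described above: given an AS<->AS traffic matrix collapsed
onto backbone attachment nodes and a candidate set of link metrics,
route each demand over its metric-shortest path and total the
resulting load per link. The topology, metric values, and traffic
volumes below are made-up placeholders; the actual tools and data are
not shown in this report.

#!/usr/bin/env python3
# Hypothetical sketch: predict per-link load for candidate T3 link
# metrics by routing a traffic matrix over metric-shortest paths.
import heapq
from collections import defaultdict

# Candidate metric per link (placeholder values). Higher metrics on
# the hybrid links steer traffic onto the full RS960 path first.
LINKS = {
    ("CNSS8", "CNSS24"): 10,   # RS960 segment
    ("CNSS24", "CNSS32"): 10,  # RS960 segment
    ("CNSS8", "CNSS16"): 30,   # hybrid link
    ("CNSS16", "CNSS32"): 30,  # hybrid link
}

ADJ = defaultdict(list)
for (a, b), m in LINKS.items():
    ADJ[a].append((b, m))
    ADJ[b].append((a, m))

def shortest_path(src, dst):
    """Plain Dijkstra over the undirected metric graph."""
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, m in ADJ[u]:
            if d + m < dist.get(v, float("inf")):
                dist[v], prev[v] = d + m, u
                heapq.heappush(heap, (d + m, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return list(reversed(path))

# AS<->AS demands collapsed to attachment nodes (placeholder Mbit/s).
TRAFFIC = {("CNSS8", "CNSS32"): 4.0, ("CNSS16", "CNSS24"): 1.5}

load = defaultdict(float)
for (src, dst), mbps in TRAFFIC.items():
    path = shortest_path(src, dst)
    for a, b in zip(path, path[1:]):
        load[tuple(sorted((a, b)))] += mbps

for (a, b), mbps in sorted(load.items()):
    print("%s <-> %s : %.1f Mb/s" % (a, b, mbps))

One would iterate on the candidate metric values until the predicted
load across the hybrid links is acceptably even; the re-adjustment
planned after step 4 is this same kind of calculation repeated for the
smaller set of remaining hybrid links.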