According to AT&T, in a story just released, the problem was a Cisco code problem. It would be interesting to hear what the actual software cause was, if anyone from Cisco cares to let us know?.../rlj

- - -

AT&T's network failure

AT&T Corp. reported today that last week's frame relay network failure was due to a software-based problem.

Last Monday's network failure disrupted service nationally for thousands of AT&T's business customers for about a day. The cause of the outage was previously attributed to the interaction between two frame-relay switches, but AT&T now says it was a software problem.

AT&T Chairman C. Michael Armstrong said today the company is working closely with Cisco Systems, which provided its switches, on ways to fix the problem.

The type of network application affected by frame-relay service is "any kind of high-speed data applications where large amounts of information [are] exchanged in quick bursts," Granieri said. Querying an inventory database is one example of such an application, he said.

Reuters contributed to this report
:: Rodney Joffe writes ::
According to AT&T, in a story just released, the problem was a Cisco code problem. It would be interesting to hear what the actual software cause was, if anyone from Cisco cares to let us know?.../rlj
http://www.cisco.com/warp/public/770/fn042198.shtml is what's available publicly.

Summary: Most StrataCom cards support what they call "Y-Redundancy", in which you connect two physical cards with a Y cable and the switch can then fail over in the event of a hardware failure. The problem in question concerns BXM cards (a DS3/OC-3/OC-12 trunk card) that are configured for Y-Redundancy.

If the active card restarts, the previously-standby card will quickly become the active card, and the previously-active card, when it comes back, will become the standby card. If the newly-active card then restarts or fails within a certain period of time, the system is vulnerable to the problem.

The specific problem is that the two cards get into a loop sending cells to each other, and during each iteration of the loop, the active card also sends one cell out on the trunk. The cells that loop are control cells, and when the node at the far end of the trunk receives these cells at full line rate, it will eventually end up aborting (and restarting).

One way of becoming vulnerable to this is to upgrade BXM firmware in a specific sequence that leads to the above-mentioned card restarts. Specific sequences of hardware failures can also cause this, as can specific sequences of card-level control commands (reset card, etc.).

Based on ATT's press release, I would guess that they were upgrading firmware on a pair of Y-Redundant BXMs in the order that opens up vulnerability to this issue. ATT's release said only one node was being upgraded, and Cisco's field notice isn't clear on how this failure on one node cascades throughout a large network.

Firmware fixes are available today. Software upgrades are being developed to contain the problem if it or a similar problem does occur.

- Brett (brettf@netcom.com)
(Who's glad that the BPXs in his network don't have BXMs).

------------------------------------------------------------------------------
                 ... Coming soon to a | Brett Frankenberger
.sig near you ... a Humorous Quote ... | brettf@netcom.com
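
For anyone trying to follow the restart sequence in the summary above, here is a toy Python model of it. This is only a sketch of the description in this message, not Cisco's firmware logic: the class name, the method names and the window length are all hypothetical, and the field notice says only "a certain period of time" without giving a number.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value: the actual length of the window is not stated.
VULNERABLE_WINDOW = 10.0  # arbitrary time units


@dataclass
class YRedundantPair:
    """Toy model of two Y-cabled cards, "A" and "B"; one is active at a time."""
    active: str = "A"
    last_failover: Optional[float] = None  # time of the most recent switchover
    vulnerable: bool = False               # set once the risky sequence occurs

    def standby(self) -> str:
        """Return the name of the card that is currently standby."""
        return "B" if self.active == "A" else "A"

    def restart_active(self, now: float) -> None:
        """The active card restarts or fails; the standby card takes over."""
        recently_failed_over = (
            self.last_failover is not None
            and now - self.last_failover < VULNERABLE_WINDOW
        )
        if recently_failed_over:
            # The newly-active card went down again inside the window:
            # this is the sequence described as opening up the problem.
            self.vulnerable = True
        self.active = self.standby()
        self.last_failover = now


pair = YRedundantPair()
pair.restart_active(now=0.0)   # A restarts, B becomes active
pair.restart_active(now=3.0)   # B restarts soon after, A becomes active again
print(pair.active, pair.vulnerable)
```

The model only flags when a pair has been through the risky sequence; it does not attempt to reproduce the cell loop itself or its effect on the node at the far end of the trunk.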