On Sat, Nov 24, 2001 at 02:16:38PM -0500, Sean Donelan wrote:
On Sat, 24 Nov 2001, Neil J. McRae wrote:
I'd be surprised if it was the GSR, and in any case that doesn't absolve anyone. If it was a software issue, why wasn't the software properly tested? Why was such a critical upgrade rolled out across the entire network at the same time? It doesn't add up.
Even after a full lab test and a limited "real-life" deployment, you can never see all the possible cases that might come back to haunt you later. Sometimes you do an across-the-board upgrade for security as well as specific feature/bugset reasons, to move the set of bugs into the "we know what they are and how to deal with them" category. No vendor claims to have perfect software, and you won't find anyone but an irresponsible vendor suggesting that any specific image is "perfect".
It appears to be yet another CEF bug. If you want to use a GSR, you are stuck using some version of IOS with a CEF bug. The question is which bug you want. Each version of IOS has a slightly different set. Several US network providers have been bitten by CEF bugs too.
True, but most of those are in the past. I'm not familiar with the specifics of the bugs that BT encountered, but something that should be taken note of is the ability of a Cisco router to keep functioning when it is in a "broken" state and you want to get a 'fixed' image onto it. It would be nice if there were easier ways to do that in some cases, but you can't have a perfect environment, especially since when you do software upgrades you don't always have on-site hands standing by to help you swap flash cards or deal with whatever logistical issues you may encounter.
While trying to fix one set of bugs, BT upgraded their network. I'm not sure if they were upgrading at 9am, or had upgraded earlier and the bug finally came out under load at 9am. When the BT network melted down, Cisco suggested installing a different version of IOS, which had previously been tested. At noon, BT found the new version had an even worse bug, sending packets out the wrong interface. It wasn't until 2200 (13 hours later) that BT and Cisco found a version of IOS which stabilized the network. "Stabilized," not fixed. The running version of IOS still has a bug, but it isn't as severe.
I'm sure that BT and Cisco have by now had some conversations, after such a public outage, about what can be done to improve the testing that Cisco does so that it better simulates their network.

--
Jared Mauch | pgp key available via finger from jared@puck.nether.net
clue++; | http://puck.nether.net/~jared/ My statements are only mine.