They probably did. The vendor probably did also. Of course, they can't always simulate real network conditions. Nor can your own labs. Heck, even a small deployment on 2 or 3 routers (out of, say, 200) can't catch everything. It is a simple fact that some bugs don't show up until it's too late. And cascade failures occur more often than you might think (and not necessarily from software). Remember the AT&T frame outage? Procedural error. How about the Netcom outage of a few years ago? Someone misplaced a '.*', if I remember correctly. Human error of the simplest kind. I've had a data center go offline because someone slipped and turned off one side of a large breaker box. These things happen. The challenge is to eliminate the ones you CAN control. And, IMO, the industry is generally doing a good job of that. I chalk this whole thing up to bad karma for BT.

-Wayne

On Sat, Nov 24, 2001 at 11:05:20AM +0000, Neil J. McRae wrote:
BT is telling ISPs the reason for the multi-hour outage was a software bug in the interface cards used in BT's core network. BT installed a new version of the software. When that didn't fix the problem, they fell back to a previous version of the software.
BT didn't identify the vendor, but it bills itself as a "Cisco Powered Network(tm)." Non-BT folks believe the problem was with GSR interface cards; I can't independently confirm that.
I'd be surprised if it was the GSR, and in any case that doesn't absolve anyone. If it was a software issue, why wasn't the software properly tested? And why was such a critical upgrade rolled out across the entire network at the same time? It doesn't add up.
Neil.
---
Wayne Bouchard
web@typo.org
Network Engineer
http://www.typo.org/~web/resume.html