On Sat, 05 February 2000, Sean Donelan wrote:
> Since Lucent equipment was also involved in the 10 days of Worldcom problems, is there a common root cause between Worldcom's problems and Qwest's problems? Is there some lesson other providers should be learning from these events? Or is each service provider expected to learn and re-learn these lessons individually? Is there some network design decision engineers are getting wrong?
Lucent people told me that the Worldcom problem resulted from a software upgrade to Worldcom's Lucent switches that was done without a good fallback plan. Lucent engineers had recommended a different strategy, but Worldcom went ahead and did it their way. The upgrade then triggered some kind of cascading problem that either affected the old code or travelled through the network, or both. In other words, they created a problem as a side effect of the upgrade but didn't have a good strategy to contain or kill it once it propagated like some kind of living organism.

Seems to me that we *HAVE* seen this type of problem before in the Internet, with things like the AS7007 routes that seemed to hang around parts of the net for days. How do you plan to roll back to a known state when you can't simply backtrack or reverse your actions?

---
Michael Dillon
Director of Product Engineering, GTS IP Services
151 Shaftesbury Ave., London WC2H 8AL, UK
Phone: +44 (20) 7769 8489   Mobile: +44 (79) 7099 2658