To quote Dr. Halamka:
"I hope that my approval enables [Cisco] to share more with the press so that everyone understands that our issues were purely architectural and not related to Cisco products or Cisco engineers."
So, now that I have it in writing, here's what I'll share:

The Topology

There are three campuses, each with several physical sites, connected to each other with 8 different FEC and OC3 LANE links -- all L2. Each campus had several core switches depending on the number of buildings, again connected redundantly within the campus with FEC and/or OC3 LANE L2 links, plus dozens of access switches with redundant L2 uplinks to various core switches. Many protocols were in use, including non-routable protocols like NetBEUI and SNA.

How did this mess occur? "The CareGroup network grew organically due to the BIDMC merger, PACS installation, East to West movement of clinical services, Libby sale and changing CareGroup environment." In short, the network was growing very quickly due to business changes and was never redesigned to cope with the much larger scale and new application requirements. This resulted in the previously-mentioned 10+ hop STP, which spanned multiple sites and multiple administrative zones.

CareGroup had recently gone through a network audit by a third party which listed most/all of the potential problems in this design and suggested solutions. The outage occurred before CareGroup had an opportunity to act on these recommendations.

The Outage

The trigger for the outage was the addition of a new high-bandwidth application which somehow interfered with the proper operation of STP. Even after this application was removed, STP could not reconverge due to the default 7-hop limit.

During troubleshooting, Cisco inserted L3 hops at logical boundaries, breaking the STP into more manageable chunks. Most of these smaller chunks were still unstable, so Cisco removed all redundant links in each area, verified stability, and then reintroduced redundant links one-by-one to troubleshoot potential problems. As you can imagine, this takes a long time, and was done with Cisco staff on site travelling to each building and wiring closet to make the necessary changes.
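To make the hop-count problem concrete, here is a rough Python sketch (not anything Cisco or CareGroup actually ran) that measures how many switches sit on the longest path through an L2 domain and compares that with the 7-hop figure quoted above. The switch names, the `l2_diameter` helper, and the reading of "7 hops" as a count of switches on the path are all my own assumptions for illustration. It measures shortest paths in the physical graph, which is only a lower bound on the path through whatever tree STP actually builds.

```python
from collections import deque

# Assumption: the 7-hop figure is read as "at most 7 switches on any path".
MAX_STP_DIAMETER = 7


def hops_from(start, adjacency):
    """Breadth-first search: number of links from start to each reachable switch."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in adjacency[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def l2_diameter(links):
    """Switches on the longest shortest path in an L2 domain (links + 1)."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    longest = max(max(hops_from(s, adjacency).values()) for s in adjacency)
    return longest + 1


# Hypothetical topology: 11 switches daisy-chained across sites, all L2.
chain = [(f"sw{i}", f"sw{i + 1}") for i in range(10)]
print(l2_diameter(chain), l2_diameter(chain) > MAX_STP_DIAMETER)

# Turning sw5 into an L3 hop leaves two smaller L2 domains on either side.
west, east = chain[:4], chain[6:]
print(l2_diameter(west), l2_diameter(east))
```

In this made-up chain the single L2 domain is 11 switches across, well over the limit, while each half left after inserting one L3 boundary is 5 switches across. That is the same idea as breaking the STP into more manageable chunks.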
Going Forward

The new design calls for one pair of distribution L3 switches per building, with L3 GE links in a full mesh between all distribution switches in a campus, and between specific switches at each campus. STP will be confined to the access L2 switches. All non-IP traffic will be removed or isolated.

In the meantime, CareGroup will be performing small-scale hardware and link upgrades to improve resiliency within each campus, and will install computers with dial-up connections to the datacenter in key locations so that they will have minimal connectivity if another LAN event occurs. Strict change control will be implemented with Cisco approvals.

Conclusion

As I alluded to before, this is not a story of equipment failure or STP's inherent scaling problems. It's a story of a business which didn't or couldn't adapt the network to the business needs -- even though everyone saw the train coming at them. The lesson here isn't to review your network design, but to review how your business handles growth and/or known problems.

S