Cisco wins... As a result of the crash, Beth Israel Deaconess plans to spend $3 million to replace its entire network - creating an entire parallel set of wires and switches, double the capacity the medical center thought it needed.
The question is how long that parallel network will be around before it fails due to the same problem (dclue/dt < 0 on the part of those who run it) manifesting itself in a different way.
Alex
Oh wow, I worked for a company that integrated some fairly large network-based imaging systems in there, and things were broken then too. Their techs kept cutting fibers and disconnecting nodes, and it took days for them to figure out why.
(In reply to the original message from "Huff, Mark" <mhuff@integratelecom.com> to "Nanog (E-mail)" <nanog@merit.edu>, sent Wednesday, November 27, 2002 8:19 AM, subject "Spanning tree melt down ?".)
speculating on cause and effect, my first bet would be that someone turned off spanning tree on a trunk or trunks immediately prior to the flood. my next bet would be a babbling device - i've seen an unauthorized hub on a flat layer 2 net basically shut the network down. it was after a power hit. when we found the bugger and power cycled it, all was well. i don't think that the researcher was the culprit. more likely the victim. thoughts?
(In reply to the message from "Scott Granados" <scott@wworks.net> to "Huff, Mark" <mhuff@integratelecom.com>; "Nanog (E-mail)" <nanog@merit.edu>, sent Wednesday, November 27, 2002 12:54 PM, subject "Re: Spanning tree melt down ?".)
On Thu, 28 Nov 2002, Garrett Allen wrote:
speculating on cause and effect, my first bet would be that someone turned off spanning tree on a trunk or trunks immediately prior to the flood. my next bet would be a babbling device - i've seen an unauthorized hub on a flat layer 2 net basically shut the network down. it was after a power hit. when we found the bugger and power cycled it, all was well. i don't think that the researcher was the culprit. more likely the victim.
This article had some more information:
http://www.nwfusion.com/news/2002/1125bethisrael.html
This slashdot article also seems to have some details:
http://slashdot.org/comments.pl?sid=46238&cid=4770093
Text as follows:
I contacted Dr. John D. Halamka to see if he could provide more detail on the network outage. Dr. Halamka is the chief information officer for CareGroup Health System, the parent company of the Beth Israel Deaconess medical center. His reply is as follows:
"Here's the technical explanation for you.
When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with 802.1d standards. The management vlan (vlan 1) had in some locations 10 Layer 2 hops from root. The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from one another. Part of this restriction comes from the age field that Bridge Protocol Data Units (BPDUs) carry: when a BPDU is propagated from the root bridge towards the leaves of the tree, the age field is incremented each time it goes through a bridge. Eventually, when the age field of a BPDU goes beyond max age, it is discarded. Typically, this will occur if the root is too far away from some bridges of the network. This issue will impact convergence of the spanning tree.
A major contributor to this STP issue was the PACS network and its connection to the CareGroup network. To eliminate its influence on the CareGroup network we isolated it with a Layer 3 boundary. All redundancy in the network was removed to ensure no STP loops were possible. Full connectivity was restored to remote devices and networks that were disconnected in troubleshooting efforts prior to TAC's involvement. Redundancy was returned between the core campus devices. Spanning Tree was stabilized and localized issues were pursued.
Thanks for your support. CIO Magazine will devote the February issue to this event and Harvard Business School is doing a case study."
--
Simon Lyall.                |  Newsmaster  | Work: simon.lyall@ihug.co.nz
Senior Network/System Admin |  Postmaster  | Home: simon@darkmere.gen.nz
ihug, Auckland, NZ          | Asst Doorman | Web: http://www.darkmere.gen.nz
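For readers who want to see where the "diameter of seven" in the quoted explanation comes from, here is a minimal sketch of the timer arithmetic. It assumes the relationships commonly given in vendor STP timer-tuning guidance (max age = 4 x hello + 2 x diameter - 2, forward delay = (4 x hello + 3 x diameter) / 2, rounded up); those formulas and the function names below are assumptions for illustration and do not come from the thread itself.

```python
import math

# Assumed 802.1D default timers, in seconds.
DEFAULT_HELLO = 2
DEFAULT_MAX_AGE = 20


def max_age_for(diameter, hello=DEFAULT_HELLO):
    """Max age (seconds) needed for a given bridge diameter (assumed formula)."""
    return 4 * hello + 2 * diameter - 2


def forward_delay_for(diameter, hello=DEFAULT_HELLO):
    """Forward delay (seconds) for a given diameter, rounded up (assumed formula)."""
    return math.ceil((4 * hello + 3 * diameter) / 2)


def supported_diameter(max_age=DEFAULT_MAX_AGE, hello=DEFAULT_HELLO):
    """Largest diameter whose required max age still fits in the configured one."""
    return (max_age - 4 * hello + 2) // 2


if __name__ == "__main__":
    # Compare the default design point (7) with the 10-hop case from the report.
    for dia in (7, 10):
        print(dia, max_age_for(dia), forward_delay_for(dia))
    print("diameter supported by default timers:", supported_diameter())
```

Under these assumptions the default timers correspond to a diameter of seven, and a 10-hop topology like the one described would need longer timers than the defaults, or fewer Layer 2 hops, which is consistent with the Layer 3 boundary described in the quoted explanation.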
Heh, so they kept bolting stuff on, and a failure somewhere caused a spanning tree change which, because of over-complexity and out-of-date config, was unable to converge.
Ah yes, Occam also applies to switch topology :)
Steve
(In reply to Simon Lyall's message of Fri, 29 Nov 2002.)
I'm still failing to see why this required a $3M forklift of new equipment to correct the problem. Was this just Cisco sales pouncing on someone's misfortune as a way to push new stuff?
(In reply to Stephen J. Wilcox's message of Thu, 28 Nov 2002.)
Smells like it to me... sounds like they said, "HALP" to Cisco, and Cisco said, "Clean out the warehouse, we've got a live one!"
(In reply to Robert A. Hayden's message of 16:08 11/28/02 -0600.)
Well, yes, they were. But don't blame Cisco - it's not like they held a gun to anyone's head. Of course, there is also the possibility that the hospital IT folks said "if you had just agreed to our capital requests last year, none of this would have happened" and the money tap got turned on. This could have been long-deferred equipment. Or, it could have been a panic-buy. Either way, the buck stops with the hospital, rather than the vendor.
- Dan
(In reply to Robert A. Hayden's message of Thu, 28 Nov 2002.)
Thus spake "Robert A. Hayden" <rhayden@geek.net>
I'm still failing to see why this required a $3M forklift of new equipment to correct the problem. Was this just Cisco sales pouncing on someone's misfortune as a way to push new stuff?
Environments with a STP diameter of 10+ are unlikely to have the necessary equipment on-hand to make a L3 campus network that adheres to current best practices. Nevertheless, it is TAC policy to get the customer the equipment that is needed to solve the problem, and Sales handles any monetary issues later after the network is stable.
One might also consider if there was a campus upgrade plan already in the works and this event merely altered the timeline for implementation.
S
I just heard that NPR is about to do a piece on this - it should air in a few minutes...
Marshall
(In reply to Stephen Sprunk's message of Friday, November 29, 2002, at 03:32 PM.)
Unnamed Administration sources reported that Marshall Eubanks said:
I just heard that NPR is about to do a piece on this - it should air in a few minutes...
Immediate cause: Someone tried to move a multi-gigabyte file.....
--
A host is a host from coast to coast.................wb8foz@nrk.com
& no one will talk to a host that's close........[v].(301) 56-LINUX
Unless the host (that isn't close).........................pob 1433
is busy, hung or dead....................................20915-1433
participants (12)
- alex@yuriev.com
- blitz
- Daniel Golding
- David Lesher
- Garrett Allen
- Huff, Mark
- Marshall Eubanks
- Robert A. Hayden
- Scott Granados
- Simon Lyall
- Stephen J. Wilcox
- Stephen Sprunk