On Mon, Dec 31, 2018 at 7:24 AM Naslund, Steve <SNaslund@medline.com> wrote:
Bad design if that’s the case, that would be a huge subnet.
According to the notes at the URL Saku shared, they suffered a cascade failure from which they needed the equipment vendor's help to recover. That indicates at least two grave design errors:

1. Vendor monoculture is a single point of failure. Same equipment running the same software triggers the same bug. It all kabooms at once. Different vendors running different implementations have compatibility issues, but when one has a bug it's much less likely to take down all the rest.

2. Failure to implement system boundaries. When you automate systems, it's important to restrict the reach of that automation. Whether it's a regional boundary or independent backbones, a critical system like this one should be structurally segmented so that malfunctioning automation can bring down only one piece of it (sketched below).

Regards,
Bill Herrin

However, even if that was the case, you would not need to replace hardware in multiple places. You might have to reset it, but not replace it. Also, being an ILEC, it seems hard to believe how long their dispatches to their own central offices took. It might have taken a while to locate the original problem, but they should have been able to send a corrective procedure to CO personnel, who are a lot closer to the equipment. In my region (Northern Illinois) we can typically get access to a CO in under 30 minutes, 24/7. They are essentially smart-hands technicians who can reseat or replace line cards.
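As a rough illustration of the segmentation point in item 2, here is a minimal Python sketch, assuming a hypothetical device-to-region inventory and made-up thresholds: an automation run that would cross a structural boundary is refused before anything is pushed.

# Minimal sketch of a "blast radius" guard for network automation, assuming a
# hypothetical inventory that maps each device to a region. The idea is only
# that a malfunctioning automation run can ever touch one segment at a time.

from collections import defaultdict

# Hypothetical inventory; in practice this would come from a CMDB or IPAM.
DEVICE_REGION = {
    "den-oadm-01": "mountain",
    "den-oadm-02": "mountain",
    "chi-oadm-01": "midwest",
    "nyc-oadm-01": "northeast",
}

MAX_REGIONS_PER_RUN = 1   # structural boundary: one region per change
MAX_DEVICES_PER_RUN = 10  # and a cap on how many boxes one run may touch


def check_blast_radius(targets):
    """Refuse a change set that crosses a region boundary or is too large."""
    by_region = defaultdict(list)
    for device in targets:
        by_region[DEVICE_REGION[device]].append(device)

    if len(by_region) > MAX_REGIONS_PER_RUN:
        raise RuntimeError(
            f"change spans {len(by_region)} regions ({', '.join(by_region)}); "
            "split it into one run per region")
    if len(targets) > MAX_DEVICES_PER_RUN:
        raise RuntimeError(
            f"change touches {len(targets)} devices, cap is {MAX_DEVICES_PER_RUN}")
    return by_region


if __name__ == "__main__":
    # Passes: a single region.
    print(check_blast_radius(["den-oadm-01", "den-oadm-02"]))
    # Rejected: it would reach two regions at once.
    try:
        check_blast_radius(["den-oadm-01", "chi-oadm-01"])
    except RuntimeError as err:
        print("rejected:", err)

The same guard could sit in front of a configuration push or a software-update job; the inventory, names, and limits above are only placeholders.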
2. Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching? Very doubtful, which means that the systems were actually broken by trying to PROCESS the "invalid frames". Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.
L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH. However, it can be argued that an optical network should fail up in the absence of a control plane, while an IP network has to fail down.
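To make the fail-up/fail-down distinction concrete, here is a minimal Python sketch with hypothetical stand-ins for an optical mux and an IP router; it only models the policy each should follow when it loses its control plane, not any real vendor behavior.

# Minimal sketch of "fail up" vs "fail down". The classes, ports, and routes
# are made up; the point is what each box should do on control-plane loss.


class OpticalNode:
    """Keeps forwarding its last-known cross-connects when the control plane dies."""

    def __init__(self, cross_connects):
        self.cross_connects = dict(cross_connects)

    def on_control_plane_loss(self):
        # Fail up: hold the last configuration and keep the light flowing.
        return f"holding {len(self.cross_connects)} cross-connects, forwarding continues"


class IpRouter:
    """Stops attracting traffic when it can no longer trust its control plane."""

    def __init__(self, routes):
        self.routes = dict(routes)

    def on_control_plane_loss(self):
        # Fail down: withdraw routes so neighbors reroute around us, rather
        # than blackholing traffic with stale state.
        withdrawn = len(self.routes)
        self.routes.clear()
        return f"withdrew {withdrawn} routes, traffic shifts to healthy paths"


if __name__ == "__main__":
    mux = OpticalNode({"port1": "port7", "port2": "port9"})
    router = IpRouter({"10.0.0.0/8": "ge-0/0/0", "192.0.2.0/24": "ge-0/0/1"})
    print("optical:", mux.on_control_plane_loss())
    print("ip:     ", router.on_control_plane_loss())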
Most of the optical muxes I have worked with will run without any management card or control plane at all. Usually the line cards keep forwarding according to the existing configuration even in the absence of all management functions. It would help if we knew what gear this was. True optical muxes do not require much care and feeding once they have a configuration loaded. If they are truly dependent on that control plane, then it needs to be redundant, with watchdogs to reset the cards if they become non-responsive, and they need policers and rate limiters on their interfaces. It seems they would be vulnerable to a DoS if a bad BPDU can wipe them out.
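As a sketch of the watchdog and policer ideas above (not any particular vendor's implementation), here is a minimal Python version, assuming a hypothetical management card that punts frames to a general-purpose CPU; the rates, timeouts, and function names are made up.

# Minimal sketch of control-plane protection: a token-bucket policer in front
# of the management CPU, and a watchdog that resets the card if it stops
# answering health checks. All numbers here are hypothetical.

import time


class TokenBucketPolicer:
    """Drop punted frames beyond a configured packets-per-second budget."""

    def __init__(self, rate_pps, burst):
        self.rate = rate_pps
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # punt to the management CPU
        return False      # drop early; the CPU never sees it


def watchdog(card_is_responsive, reset_card, timeout_s=30.0, poll_s=5.0):
    """Reset the management card if it stops answering health checks."""
    last_ok = time.monotonic()
    while True:
        if card_is_responsive():
            last_ok = time.monotonic()
        elif time.monotonic() - last_ok > timeout_s:
            reset_card()
            last_ok = time.monotonic()
        time.sleep(poll_s)


if __name__ == "__main__":
    # Simulate a burst of 10,000 frames hitting the policer at once.
    policer = TokenBucketPolicer(rate_pps=100, burst=50)
    punted = sum(policer.allow() for _ in range(10_000))
    print(f"punted {punted} of 10000 frames; the rest never reached the CPU")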
3. In the cited document it was stated that the offending packet did not have source or destination information. If so, how did it get propagated throughout the network?
BPDU
Maybe, but it would be strange for the frame to be invalid yet valid enough to keep being forwarded. In any case, loss of the management network should not interrupt forwarding. I also would not be happy with an optical network that relies on spanning tree to remain operational.
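For illustration, here is a minimal Python sketch of that control-plane hygiene: validate the frame header before any protocol logic sees it, and drop anything that does not parse. The checks and the all-zero example frame are simplified assumptions, not a reconstruction of the actual offending frame.

# Minimal sketch: sanity-check an Ethernet header before processing, so a
# malformed frame is dropped instead of tying up the control plane.

def should_process(frame: bytes) -> bool:
    """Return True only for frames whose Ethernet header looks sane."""
    if len(frame) < 14:                      # truncated: no full Ethernet header
        return False
    dst, src = frame[0:6], frame[6:12]
    if dst == b"\x00" * 6 or src == b"\x00" * 6:
        return False                         # "no source or destination" -> drop
    if src[0] & 0x01:
        return False                         # multicast bit set in a source MAC is invalid
    return True


if __name__ == "__main__":
    bogus = b"\x00" * 64                     # all-zero frame with no addressing at all
    bpdu_like = bytes.fromhex("0180c2000000") + bytes.fromhex("001122334455") + b"\x00" * 52
    print("bogus frame processed?    ", should_process(bogus))
    print("BPDU-like frame processed?", should_process(bpdu_like))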
My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.
Lots of possible reasons. I choose to believe that what they've communicated is what the writer of the communication thought had happened, but as they are likely not an SME, it comes through like a game of broken telephone. A BCAST storm on the L2 DCN would plausibly fit the very ambiguous reason offered and is something people actually are doing.
My biggest problem with their explanation is the replacement of line cards in multiple cities. The only way that happens is when bad code gets pushed to them. If it took them that long to fix an L2 broadcast storm, something is seriously wrong with their engineering. Resetting the management interfaces should be sufficient once the offending line card is removed. That is why I think this was a software update failure or a configuration push. Either way, they should be jumping up and down on their vendor as to why this caused such large scale effects.
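If it really was a bad package or configuration push, the usual guard is a staged rollout. Here is a minimal Python sketch, assuming hypothetical push_package() and health_check() hooks: the canary site gets the package first, and the rollout aborts the moment any site fails its check, instead of propagating bad code network-wide.

# Minimal sketch of a staged rollout with an abort-on-failure rule. The site
# names and the hooks are placeholders, not anyone's actual tooling.

def staged_rollout(sites, push_package, health_check, canary_count=1):
    """Push to canaries first; abort the rollout on the first unhealthy site."""
    ordered = list(sites)
    done = []
    for i, site in enumerate(ordered):
        push_package(site)
        if not health_check(site):
            return {"status": "aborted", "failed_at": site,
                    "completed": done, "untouched": ordered[i + 1:]}
        done.append(site)
        if i + 1 == canary_count:
            # In practice: pause here for a soak period before continuing.
            pass
    return {"status": "complete", "completed": done, "untouched": []}


if __name__ == "__main__":
    # Hypothetical sites and a package that breaks everything it touches.
    sites = ["denver", "chicago", "atlanta", "seattle"]
    pushed = []
    result = staged_rollout(sites,
                            push_package=pushed.append,
                            health_check=lambda s: False)   # bad package
    print(result)   # only the canary was touched; the other three sites were spared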
--
William Herrin ................ herrin@dirtside.com  bill@herrin.us
Dirtside Systems ......... Web: <http://www.dirtside.com/>