My best parsing of that ticket, with some guesses:

- Infinera management card goes Really Bad, knocks out local waves, and starts spewing garbage out onto the management network
- Management network propagates the garbage, other Infinera management cards get it and fall into the same state, knocking down local waves and re-spewing garbage.
- Backup tunnels in place to ensure management network connectivity works all the time help propagate the garbage.
- They start getting into some devices via OOB, probably rebooting. Devices come up ok, then this garbage traffic knocks them over again.
- They start pulling down the backup tunnels to stop the virus from spreading, bouncing stuff again, putting filters on each device to drop the garbage traffic.
- This starts to work, but then they hit other problems with linecards from devices that were bounced.
- They also start hitting sites that they don't have functional OOB for, and have to get someone driving out to manually get access into.

On Sun, Dec 30, 2018 at 8:45 AM Saku Ytti <saku@ytti.fi> wrote:
Apologies for the URL, I do not know the official source and I do not share the URL's sentiment. https://fuckingcenturylink.com/
Can someone translate this into IP-engineer terms? What actually happened? From my own history, I rarely recognise the problem I fixed from reading the public RCA. I hope CenturyLink will do better.
Best guess so far that I've heard is:

a) CenturyLink runs a global L2 DCN/OOB
b) there was a HW fault which caused an L2 loop (perhaps HW dropped BPDU, I've had this failure mode)
c) DCN had direct access to the control-plane, and L2 congested control-plane resources, causing it to deprovision waves
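
To make (b) concrete: Ethernet frames carry no TTL, so once loop protection fails, flooded frames are never aged out. The following is only a toy sketch of that effect, not a model of the actual CenturyLink DCN. The four-switch topology, the switch names, and the `frames_in_flight` helper are all made up for illustration, and it assumes every frame is flooded (broadcast or unknown unicast) with no MAC learning, no STP, and no rate limiting.

```python
def frames_in_flight(links, source, rounds):
    """Count flooded frame copies in flight after each forwarding round.

    links  -- iterable of (switch_a, switch_b) pairs, treated as bidirectional
    source -- switch where a single broadcast frame is injected
    rounds -- number of forwarding rounds to simulate
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each frame copy is (switch it is at, switch it arrived from).
    frames = [(source, None)]
    history = []
    for _ in range(rounds):
        # Flood: forward every copy out of all ports except the ingress port.
        frames = [
            (nxt, here)
            for here, prev in frames
            for nxt in neighbors.get(here, ())
            if nxt != prev
        ]
        history.append(len(frames))
    return history


# Hypothetical loop-free DCN: a star with A in the middle.
tree = [("A", "B"), ("A", "C"), ("A", "D")]
# Same switches with redundant links and nothing blocking them.
mesh = tree + [("B", "C"), ("C", "D"), ("B", "D")]

print(frames_in_flight(tree, "A", 4))  # [3, 0, 0, 0]
print(frames_in_flight(mesh, "A", 4))  # [3, 6, 12, 24]
```

In the loop-free case the frame dies out after one hop; with the redundant links unblocked the copies double every round, which is the kind of load that would land on any control-plane reachable from that L2 domain.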
Now of course this is entirely speculation, but it is intended to show what type of explanation is acceptable and can be used to fix things. Hopefully CenturyLink does come out with an explanation readable by IP engineers, so that we may use it as leverage to support work in our own domains to remove such risks.
a) do not run L2 DCN/OOB
b) do not connect MGMT ETH (it is unprotected access to control-plane, it cannot be protected by CoPP/lo0 filter/LPTS etc.)
c) do add in your RFP a scoring item for a proper OOB port (like Cisco CMP)
d) do fail optical network up
-- ++ytti