Apologies for the URL; I do not know an official source and I do not share the URL's sentiment. https://fuckingcenturylink.com/ Can someone translate this for an IP engineer? What actually happened?
From my own history, I rarely recognise the problem I fixed from reading the public RCA. I hope CenturyLink will do better.
Best guess so far that I've heard is:

a) CenturyLink runs a global L2 DCN/OOB
b) there was a HW fault which caused an L2 loop (perhaps the HW dropped BPDUs; I've had this failure mode)
c) the DCN had direct access to the control plane, and the L2 loop congested control-plane resources, causing it to deprovision waves

Now of course this is entirely speculation, but it is intended to show what type of explanation is acceptable and can be used to fix things. Hopefully CenturyLink does come out with an IP-engineering-readable explanation, so that we may use it as leverage to support work in our own domains to remove such risks:

a) do not run an L2 DCN/OOB
b) do not connect MGMT ETH (it is unprotected access to the control plane; it cannot be protected by CoPP/lo0 filter/LPTS etc.)
c) do add an RFP scoring item for a proper OOB port (like Cisco CMP)
d) do fail the optical network up

--
  ++ytti
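To make the speculated failure mode concrete: Ethernet frames carry no TTL, so a broadcast frame caught in an L2 loop is re-flooded out of every port except the one it arrived on, and the offered load grows until the links and every management CPU on the segment saturate. The toy Python model below illustrates that growth; the fan-out, tick length and CPU limit are invented for illustration and say nothing about CenturyLink's actual topology.

# Toy model of a broadcast storm on a looped L2 management network.
# Purely illustrative: fan-out, tick length and CPU limit are made up.

def storm(frames: int = 1, flood_ports: int = 4, ticks: int = 12,
          cpu_pps_limit: int = 10_000) -> None:
    """Ethernet frames carry no TTL, so a broadcast caught in a loop is
    re-flooded out every port except the one it arrived on, forever.
    With redundant paths the in-flight frame count grows geometrically
    until links and every management CPU on the segment are saturated."""
    for tick in range(ticks):
        state = "control planes saturated" if frames > cpu_pps_limit else "ok"
        print(f"tick {tick:2d}: ~{frames:>12,} frames/interval ({state})")
        frames *= flood_ports - 1   # split-horizon flooding on each switch

if __name__ == "__main__":
    storm()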
It's technical enough so that laypeople immediately lose interest, yet completely useless to anyone that works with this stuff.

-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com
Midwest-IX
http://www.midwest-ix.com
Technical obscurity... managed perception.
One thing that is troubling when reading that URL is that it appears several steps of restoration required teams to go onsite for local login, etc. Granted, to troubleshoot hardware you need to be physically present to pop a line card in and out, but CTL/LVL3 should have full out-of-band console and power control to all core devices; we shouldn't be waiting for someone to drive to a location to get console access or do power cycling. And I would imagine the first step in a lot of the troubleshooting was power cycling and pulling local console logs.

-John
Hey John,

Your criticism is warranted, but it would also be addressed by an explanation of the DCN/OOB being the source of the problem.

At any rate, I am looking forward to stopping the speculation and reading a post-mortem written by someone who knows how networks work.

--
  ++ytti
There's a Reddit user claiming he works at CL who says the reason was some faulty Infinera DTN-X instances: https://www.reddit.com/r/centurylink/comments/aa2qa4/comment/ecovgab

(dunno though why the user posted that to Reddit and not here)
Not buying this explanation for a number of reasons:

1. Are you telling me that several line cards failed in multiple cities in the same way at the same time? Don't think so unless the same software fault was propagated to all of them. If the problem was that they needed to be reset, couldn't that be accomplished by simply reseating them?

2. Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching? Very doubtful which means that the systems were actually broken due to trying to PROCESS the "invalid frames". Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.

3. In the cited document it was stated that the offending packet did not have source or destination information. If so, how did it get propagated throughout the network?

My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.

Steven Naslund
Chicago IL
Hey Steve, I will continue to speculate, as that's all we have.
1. Are you telling me that several line cards failed in multiple cities in the same way at the same time? Don't think so unless the same software fault was propagated to all of them. If the problem was that they needed to be reset, couldn't that be accomplished by simply reseating them?
L2 DCN/OOB, whole network shares single broadcast domain
2. Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching? Very doubtful which means that the systems were actually broken due to trying to PROCESS the "invalid frames". Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.
L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH. However, it can be argued that the optical network should fail up in the absence of a control plane, while the IP network has to fail down.
3. In the cited document it was stated that the offending packet did not have source or destination information. If so, how did it get propagated throughout the network?
BPDU
My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.
Lots of possible reasons. I choose to believe that what they've communicated is what the writer of the communication thought had happened, but as they are likely not an SME, it's broken-radio communication. A BCAST storm on an L2 DCN would plausibly fit the very ambiguous reason offered and is something people actually do.

--
  ++ytti
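For readers puzzled by point 3 above (a packet "without source or destination information" that still propagated): a classic STP BPDU is a pure layer-2 frame sent to a reserved multicast MAC with an LLC header and no IP header at all, so a write-up produced from an IP point of view would find no addresses in it. Below is a minimal Python sketch of such a frame; every field value is an illustrative default, not a claim about the actual offending frames.

import struct

# Sketch of a classic 802.1D configuration BPDU as raw bytes.
# Illustrative only: note there is no IP header anywhere in the frame,
# just MAC addresses, an 802.3 length field and an LLC header.

dst_mac = bytes.fromhex("0180c2000000")   # reserved STP multicast address
src_mac = bytes.fromhex("02aabbccddee")   # made-up, locally administered MAC
llc     = bytes.fromhex("424203")         # DSAP 0x42, SSAP 0x42, UI frame

bpdu = struct.pack(
    "!HBBB8sI8sHHHHH",
    0x0000,       # protocol identifier (spanning tree)
    0x00,         # protocol version (802.1D)
    0x00,         # BPDU type: configuration
    0x00,         # flags
    bytes(8),     # root bridge ID (priority + MAC), zeroed for the sketch
    0,            # root path cost
    bytes(8),     # sender bridge ID, zeroed for the sketch
    0x8001,       # port ID
    0,            # message age   (timers in units of 1/256 s)
    20 * 256,     # max age
    2 * 256,      # hello time
    15 * 256,     # forward delay
)

frame = dst_mac + src_mac + struct.pack("!H", len(llc) + len(bpdu)) + llc + bpdu
print(f"{len(frame)} bytes, dst={dst_mac.hex(':')}, and no IP header present")

The only identifiers present are MAC addresses, which would also let switches keep flooding it even though an IP-centric summary saw "no source or destination".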
This seems entirely plausible, given that DWDM amplifiers and lasers are a complex analog system and need OOB to align.

--
Eric
They shouldn't need OOB to operate existing lambdas, just to configure new ones. One possibility is that the management interface also handles master timing, which would be a really bad idea but possible (it should be redundant, and it should be able to free-run for a reasonable amount of time). The main issue exposed is that the management interface is obviously critical and is not redundant enough. That is, if we believe the OOB explanation in the first place (which, by the way, is obviously not OOB, since it wiped out the in-band network when it failed).

Steven Naslund
Chicago IL
A theory, and only a theory: in order to troubleshoot a much smaller problem (OOB, etc.), they deployed an optical configuration change that, when faced with inaccessibility to multiple nodes, ended up causing a significant inconsistency in their optical network, wreaking havoc on all sorts of other systems. With the OOB network already in chaos, card reseats were required to stabilize things on that network, and then they could rebuild the optical network from a fully reachable state.

Again, only a theory.

-Dave
Yeah, it could have been one of those gone-from-bad-to-worse things like Dave mentioned... the initial problem and the course of action taken perhaps led to a worse problem.

I've had DWDM issues that have taken down multiple locations far apart from each other due to how the transport guys hauled stuff.

A few years back I had about 15 routers all reboot suddenly... they were all far apart from each other. It turned out one of the dual BGP sessions to the RR cluster flapped and all 15 routers crash-rebooted.

But ~50 hours of downtime!?

Aaron
It could have been worse: https://www.cio.com.au/article/65115/all_systems_down/
"Make network changes only between 2am and 5am on weekends." Wow. Just wow. I suppose the IT types are considerably different than Process Operations. Our rule is to only make changes scheduled at 09:00 (or no later than will permit a complete backout and restore by 15:00) Local Time on "Full Staff" day that is not immediately preceded or followed by a reduced staff day, holiday, or weekend-day. -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.
It depends on your system architecture. If you've built your redundancy well, so that you have a continuously maintainable system, then you do the work during normal staffing, and only when followed by days when folks will be around to notice and fix any mistakes. If you require a disruptive maintenance window, then you schedule it for minimum-usage times instead.

Other conclusions from the article are dubious as well:

* Retire legacy network gear faster and create overall life cycle management for networking gear.

Retire equipment when it ceases to be cost-effective, not merely because it was manufactured too many years ago. Just don't forget to factor risk into the cost.

* Document all changes, including keeping up-to-date physical and logical network diagrams.

"Good intentions never work, you need good mechanisms to make anything happen." - Jeff Bezos

Regards,
Bill Herrin

--
William Herrin ................ herrin@dirtside.com  bill@herrin.us
Dirtside Systems ......... Web: <http://www.dirtside.com/>
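As a back-of-the-envelope illustration of factoring risk into the retirement decision above, here is a tiny Python calculation; every figure in it is invented purely for illustration.

# Is keeping legacy gear still cost-effective once outage risk is priced in?
# All numbers below are invented for illustration only.

replacement_cost   = 250_000     # new platform, installed
old_annual_support = 40_000      # maintenance contract on the legacy gear
new_annual_support = 15_000
old_outage_prob    = 0.10        # guessed yearly chance of a major failure
new_outage_prob    = 0.01
outage_cost        = 2_000_000   # SLA credits, lost revenue, reputation

old_expected_yearly = old_annual_support + old_outage_prob * outage_cost
new_expected_yearly = new_annual_support + new_outage_prob * outage_cost

payback_years = replacement_cost / (old_expected_yearly - new_expected_yearly)
print(f"expected payback on replacement: {payback_years:.1f} years")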
On 12/31/18, Keith Medcalf <kmedcalf@dessus.com> wrote:
It could have been worse: https://www.cio.com.au/article/65115/all_systems_down/
"Make network changes only between 2am and 5am on weekends."
Wow. Just wow.
yeah. out of all the possible lessons they could have learned..
I suppose the IT types are considerably different than Process Operations. Our rule is to only make changes scheduled at 09:00 (or no later than will permit a complete backout and restore by 15:00) Local Time on "Full Staff" day that is not immediately preceded or followed by a reduced staff day, holiday, or weekend-day.
Do you get paid differently based on time of day? I used to be at a place where they were drifting into a 'no changes until midnight' mode except for one group; the rumor I heard was they got overtime pay after 6PM which is why they got to do all their changes during the day. Lee
We had a fairly extensive discussion on this list decades ago. Deploy non-emergency changes early Tuesday morning local time, a few hours before regular working hours.

Agreed about "not immediately preceded or followed by a reduced staff day, holiday, or weekend-day." Because you won't know it's really working until the actual users have reported no problems. Tuesdays give a couple of working days to ensure that there were no hidden ill effects.

Weekends are terrible for that reason. As are Mondays and Fridays, because actual users aren't around in overseas locations. Even those of us who have operated regional ISPs can still affect the world. And those of us with multiple datacenters world-wide have to ensure that changes in one place don't affect the others....

As I remember, at the time this NANOG wisdom propagated to legal blogs. Because so much of what network operators do has legal implications.
A note for the guys hanging on to those POTS lines... it won't really help. One of our sites in Dubuque, Iowa had ten CenturyLink PRIs (they are the LEC there) homed off of a 5ESS switch. These were all unable to process calls during the CenturyLink problem. The ISDN messaging returned indicated that the CL phone switch had no routes. This tells me that either their inter-switch trunking or their SS7 network, or both, is transported over the same optical network as the Internet services. So even if your local line is POTS or traditional TDM, it won't matter if all of their transport is dependent on the IP world.

Looking at the Reddit comments on the Infinera devices being a problem, that makes more sense, because that device blurs the line between an optical mux and an IP-enabled device with its Ethernet mapping functions. One advantage of the pure optical mux is that it does not need, care about, or understand L2 and L3 network protocols and is largely unaffected by those layers. Convergence in devices spanning more network layers exposes them to more potential bugs. Convergence can easily lead to more single points of failure, and the traffic capacity of these devices kind of encourages carriers to put more stuff in one basket than they traditionally did. I understand the motivation to build a single high-speed IP-centric backbone, but it makes everything dependent on that backbone.

Steven Naslund
Chicago IL
See my comments in line. Steve
Hey Steve,
I will continue to speculate, as that's all we have.
1. Are you telling me that several line cards failed in multiple cities in the same way at the same time? Don't think so unless the same software fault was propagated to all of them. If the problem was that they needed to be reset, couldn't that be accomplished by simply reseating them?
L2 DCN/OOB, whole network shares single broadcast domain.
Bad design if that's the case, that would be a huge subnet. However, even if that were the case, you would not need to replace hardware in multiple places. You might have to reset it, but not replace it. Also, being an ILEC, it seems hard to believe how long their dispatches to their own central offices took. It might have taken a while to locate the original problem, but they should have been able to send a corrective procedure to CO personnel, who are a lot closer to the equipment. In my region (Northern Illinois) we can typically get access to a CO in under 30 minutes, 24/7. They are essentially smart-hands technicians who can reseat or replace line cards.
2. Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching? Very doubtful which means that the systems were actually broken due to trying to PROCESS the "invalid frames". Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.
L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH. However, it can be argued that the optical network should fail up in the absence of a control plane, while the IP network has to fail down.
Most of the optical muxes I have worked with will run without any management card or control plane at all. Usually the line cards keep forwarding according to the existing configuration even in the absence of all management functions. It would help if we knew what gear this was. True optical muxes do not require much care and feeding once they have a configuration loaded. If they are truly dependent on that control plane, then it needs to be redundant enough, with watchdogs to reset the cards if they become non-responsive, and they need policers and rate limiters on their interfaces. Seems they would be vulnerable to a DoS if a bad BPDU can wipe them out.
3. In the cited document it was stated that the offending packet did not have source or destination information. If so, how did it get propagated throughout the network?
BPDU
Maybe, but it would be strange for a frame to be invalid yet valid enough to keep being forwarded. In any case, loss of the management network should not interrupt forwarding. I also would not be happy with an optical network that relies on spanning tree to remain operational.
My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.
Lots of possible reasons. I choose to believe that what they've communicated is what the writer of the communication thought had happened, but as they are likely not an SME, it's broken-radio communication. A BCAST storm on an L2 DCN would plausibly fit the very ambiguous reason offered and is something people actually do.
My biggest problem with their explanation is the replacement of line cards in multiple cities. The only way that happens is when bad code gets pushed to them. If it took them that long to fix an L2 broadcast storm, something is seriously wrong with their engineering. Resetting the management interfaces should be sufficient once the offending line card is removed. That is why I think this was a software update failure or a configuration push. Either way, they should be jumping up and down on their vendor as to why this caused such large scale effects.
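Illustrating the policer and rate-limiter point made a few paragraphs above: a management CPU only survives a storm if something in front of it caps the packet rate. The following token-bucket sketch in Python is a generic illustration of that principle, not any vendor's CoPP/LPTS implementation.

import time

class TokenBucket:
    """Minimal token-bucket policer: admit at most `rate` packets per second,
    with bursts up to `burst`, and drop everything else. Real control-plane
    protection runs in hardware and is per-protocol, but the idea is the same."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def admit(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True        # packet is allowed up to the management CPU
        return False           # packet dropped before it can hurt anything

# During a simulated storm nearly everything is dropped and the CPU survives.
policer = TokenBucket(rate=1000, burst=200)
storm_packets = 1_000_000
passed = sum(policer.admit() for _ in range(storm_packets))
print(f"{passed} of {storm_packets} storm packets reached the CPU")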
On Mon, Dec 31, 2018 at 7:24 AM Naslund, Steve <SNaslund@medline.com> wrote:

Bad design if that's the case, that would be a huge subnet.

According to the notes at the URL Saku shared, they suffered a cascade failure from which they needed the equipment vendor's help to recover. That indicates at least two grave design errors:

1. Vendor monoculture is a single point of failure. Same equipment running the same software triggers the same bug; it all kabooms at once. Different vendors running different implementations have compatibility issues, but when one has a bug it's much less likely to take down all the rest.

2. Failure to implement system boundaries. When you automate systems it's important to restrict the reach of that automation. Whether it's a regional boundary or independent backbones, a critical system like this one should be structurally segmented so that malfunctioning automation can bring down only one piece of it.

Regards,
Bill Herrin
-- William Herrin ................ herrin@dirtside.com bill@herrin.us Dirtside Systems ......... Web: <http://www.dirtside.com/>
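To make the "system boundaries" point above concrete, here is a hedged Python sketch of rollout automation that enforces a per-region blast radius and a failure budget. The push_change and health_check callables, the canary-first ordering and the budget of one failure are all assumptions made for illustration, not anyone's real deployment system.

from typing import Callable, Dict, List

def staged_rollout(devices_by_region: Dict[str, List[str]],
                   push_change: Callable[[str], None],
                   health_check: Callable[[str], bool],
                   max_failures_per_region: int = 1) -> None:
    """Apply a change one region at a time, canary first, and halt the moment
    a region exceeds its failure budget, so a bad change cannot propagate
    past the boundary it first misbehaves in."""
    for region, devices in devices_by_region.items():
        failures = 0
        for index, device in enumerate(devices):
            push_change(device)
            if not health_check(device):
                failures += 1
                if index == 0 or failures > max_failures_per_region:
                    # Canary failed or budget exhausted: stop everything here.
                    raise RuntimeError(f"rollout halted in {region} at {device}")
        # Move on only once the current region is verifiably healthy.
        print(f"{region}: {len(devices)} devices updated, {failures} failures")

The same idea applies whether the "change" is a firmware push, an optical provisioning job, or a config template.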
(Forgive my top posting, not on my desktop as I'm out of town)

Wild guess, based on my own experience as a NOC admin/head of operations at a large ISP: they have an automated deployment system for new firmware for a (mission-critical) piece of backbone hardware.

They may have tested said firmware on a chassis with cards that did not exactly match the hardware they had in actual deployment (i.e., the card in the deployed hardware was an older hw revision), and while it worked fine there, it proceeded to shit the bed in production. Or they missed a mandatory low-level hardware firmware upgrade that has to be applied separately before the main upgrade.

Kinda picturing in my mind that they staged all the updates, set a timer, staggered the reboots, and after the first hit the fan, they couldn't stop the rest as it fell apart, each upgraded unit falling on its own sword on reboot. I've been bit by the 'this card revision is not supported under this platform/release' bug more often than I'd like to admit.

And, yes, my eyes did start to get glossy and hazy the more I read their explanation as well. It's exactly the kind of useless post I'd write when I want to get (stupid) people off my back about a problem.

Sent from my iPad
I agree 100%. Now they need to figure out why bricking the management network stopped forwarding on the optical side.

Steven Naslund
Chicago IL
My best parsing of that ticket, with some guesses:

- An Infinera management card goes Really Bad, knocks out local waves, and starts spewing garbage onto the management network.
- The management network propagates the garbage; other Infinera management cards receive it, fall into the same state, knock down their local waves and re-spew garbage.
- Backup tunnels in place to ensure management-network connectivity works all the time help propagate the garbage.
- They start getting into some devices via OOB, probably rebooting. Devices come up OK, then the garbage traffic knocks them over again.
- They start pulling down the backup tunnels to stop the virus from spreading, bouncing devices again and putting filters on each device to drop the garbage traffic.
- This starts to work, but then they hit other problems with line cards in devices that were bounced.
- They also start hitting sites they don't have functional OOB for, and have to get someone to drive out and manually get access.
participants (15)

- Aaron1
- Brielle
- Dave Temkin
- Eric Loos
- Joe Carroll
- John Von Essen
- Keith Medcalf
- Lee
- Mike Hammett
- Naslund, Steve
- Saku Ytti
- Tom Beecher
- Töma Gavrichenkov
- William Allen Simpson
- William Herrin