I’m “guessing,” based on all the services that were impacted, that the outage was likely caused by a change that triggered a routing change in their multi-service network, which overloaded many network devices, and that by isolating the source of the routes or traffic, the rest of the network was able to recover.

But just a guess.

Shane
On Jul 11, 2022, at 4:22 PM, Matthew Petach <mpetach@netflight.com> wrote:



On Mon, Jul 11, 2022 at 9:01 AM Andrey Kostin <ankost@podolsk.ru> wrote:
It's hard to believe that simultaneous maintenance affecting so many
devices in the core network could be approved. Core networks are built
with redundancy, so that failures can't completely destroy the whole
network.

I think you might need to re-evaluate your assumption 
about how core networks are built.

A well-designed core network will have layers of redundancy 
built in, with easy isolation of fault layers, yes.

I've seen (and sometimes worked on) too many networks 
that didn't have enough budget for redundancy, and were 
built as a string of pearls, one router to the next; if any router 
in the string of pearls broke, the entire string of pearls would 
come crashing down, to abuse a metaphor just a bit too much.
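
As a rough illustration of the string-of-pearls problem, here is a minimal 
Python sketch that flags routers whose loss would partition a topology. 
The router names and link lists are made up for the example, and the 
brute-force approach (remove one node, re-check connectivity) is only 
meant for small graphs, not as a production tool.

```python
from collections import deque

def build_adj(links):
    """Build an undirected adjacency map from (router, router) link pairs."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def is_connected(adj, removed=None):
    """Return True if every remaining router can reach every other one."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(nodes)

def single_points_of_failure(adj):
    """Routers whose removal splits the remaining network."""
    return sorted(n for n in adj if not is_connected(adj, removed=n))

# Hypothetical topologies: a four-router chain and the same routers in a ring.
string_of_pearls = build_adj([("r1", "r2"), ("r2", "r3"), ("r3", "r4")])
ring = build_adj([("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r1")])

print(single_points_of_failure(string_of_pearls))  # ['r2', 'r3']
print(single_points_of_failure(ring))              # []
```

Closing the chain into a ring is enough to remove the single-router 
failure points in this toy case; it says nothing about shared fiber paths, 
power, or the control plane.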

Really well-thought-out redundancy takes a design team that 
has enough experience and enough focused hours in the day 
to think through different failure modes and lay out the design 
ahead of time, before purchases get made.    Many real-world 
networks share the same engineers between design, deployment, 
and operation of the network--and in that model, operation and 
deployment always win over design when it comes time to allocate 
engineering hours.  Likewise, if you didn't have the luxury of being 
able to lay out the design ahead of time, before purchasing hardware 
and leasing facilities, you're likely doing the best you can with locations 
that were contracted before you came into the picture, using hardware 
that was decided on before you had an opportunity to suggest better 
alternatives. 

Taking it a step further, and thinking about the large Facebook outage, 
even if you did well in the design phase, and chose two different vendors, 
with hardware redundancy and site redundancy in your entire core 
network, did you also think about redundancy and diversity for the 
O&M side of the house?   Does each redundant data plane have a 
diverse control plane and management plane, or would an errant 
redistribution of BGP into IGP wipe out both data planes, and both 
hardware vendors at the same time?  Likewise, if a bad configuration 
push isolates your core network nodes from the "God box" that 
controls the device configurations, do you have redundancy in 
connectivity to that "God box" so that you can restore known-good 
configurations to your core network sites, or are you stuck dispatching 
engineers with laptops and USB sticks with configs on them to get 
back to a working condition again?
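
One way to make those questions concrete is to write down what each 
"redundant" plane depends on and look for overlap. The sketch below is 
only an illustration: the plane names and dependency labels are 
hypothetical, and a real inventory would be far larger and messier.

```python
def shared_dependencies(planes):
    """Return the dependencies that every plane has in common."""
    dep_sets = [set(deps) for deps in planes.values()]
    return set.intersection(*dep_sets) if dep_sets else set()

# Hypothetical inventory: different vendors and sites, but the same IGP
# domain and the same configuration server behind both planes.
planes = {
    "plane_a": {"vendor_a", "site_east", "igp_domain_1", "config_server"},
    "plane_b": {"vendor_b", "site_west", "igp_domain_1", "config_server"},
}

print(sorted(shared_dependencies(planes)))  # ['config_server', 'igp_domain_1']
```

Anything that shows up in the intersection is something a single bad 
change could use to take out both planes at once.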

As you follow the control of core networks back up the chain, 
you ultimately realize that no network is truly redundant and 
diverse.  Every network ultimately comes back to a single point 
of failure, and the only distinction you can make is how far up the 
ladder you climb before you discover that single point of failure.

Thanks!

Matt