I’m “guessing,” based on all the services that were impacted, that the outage was likely caused by a change that triggered a routing change in their multi-service network, which overloaded many network devices; by isolating the source of the routes or traffic, the rest of the network was able to recover. But just a guess.

Shane
On Jul 11, 2022, at 4:22 PM, Matthew Petach <mpetach@netflight.com> wrote:
On Mon, Jul 11, 2022 at 9:01 AM Andrey Kostin <ankost@podolsk.ru> wrote:

It's hard to believe that a simultaneous maintenance affecting so many devices in the core network could be approved. Core networks are built with redundancy, so that failures can't completely destroy the whole network.
I think you might need to re-evaluate your assumption about how core networks are built.
A well-designed core network will have layers of redundancy built in, with easy isolation of fault layers, yes.
I've seen (and sometimes worked on) too many networks that didn't have enough budget for redundancy, and were built as a string of pearls, one router to the next; if any router in the string of pearls broke, the entire string of pearls would come crashing down, to abuse a metaphor just a bit too much.
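To put rough numbers on why the string of pearls is so fragile, here's a quick back-of-the-envelope sketch (my own illustration, with invented availability figures and an assumption of independent failures):

# Rough comparison: serial "string of pearls" vs. redundant pairs.
# Numbers are invented for illustration; failures assumed independent.
ROUTER_AVAILABILITY = 0.999   # each router up 99.9% of the time
HOPS = 10                     # ten routers end to end

# String of pearls: every router must be up, so availabilities multiply.
serial = ROUTER_AVAILABILITY ** HOPS

# Redundant pairs: a hop only fails if *both* routers in the pair fail.
pair = 1 - (1 - ROUTER_AVAILABILITY) ** 2
redundant = pair ** HOPS

print(f"string of pearls: {serial:.4%} uptime")     # ~99.00%
print(f"redundant pairs:  {redundant:.4%} uptime")  # ~99.9990%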
Really well-thought-out redundancy takes a design team that has enough experience and enough focused hours in the day to think through different failure modes and lay out the design ahead of time, before purchases get made. Many real-world networks share the same engineers between design, deployment, and operation of the network--and in that model, operation and deployment always win over design when it comes time to allocate engineering hours.

Likewise, if you didn't have the luxury of being able to lay out the design ahead of time, before purchasing hardware and leasing facilities, you're likely doing the best you can with locations that were contracted before you came into the picture, using hardware that was decided on before you had an opportunity to suggest better alternatives.
Taking it a step further, and thinking about the large Facebook outage, even if you did well in the design phase, and chose two different vendors, with hardware redundancy and site redundancy in your entire core network, did you also think about redundancy and diversity for the O&M side of the house? Does each redundant data plane have a diverse control plane and management plane, or would an errant redistribution of BGP into IGP wipe out both data planes, and both hardware vendors at the same time?

Likewise, if a bad configuration push isolates your core network nodes from the "God box" that controls the device configurations, do you have redundancy in connectivity to that "God box" so that you can restore known-good configurations to your core network sites, or are you stuck dispatching engineers with laptops and USB sticks with configs on them to get back to a working condition again?
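To make the shared-fate point concrete, here's a toy dependency model (my own sketch, with invented component names, not anything from an actual post-mortem): two "redundant" data planes that both hang off a single control plane fail together the moment that shared dependency goes away.

# Toy failure-domain model: each component lists what it depends on.
# Component names and topology are invented purely for illustration.
DEPENDS_ON = {
    "data-plane-vendor-A": {"control-plane"},
    "data-plane-vendor-B": {"control-plane"},
    "control-plane": {"management-network"},
    "management-network": set(),
}

def is_up(component, failed, deps=DEPENDS_ON):
    """A component is up only if it hasn't failed and everything it
    depends on (transitively) is also up."""
    if component in failed:
        return False
    return all(is_up(dep, failed, deps) for dep in deps[component])

# An errant redistribution of BGP into the IGP melts the one shared
# control plane, and both "redundant" data planes go down with it.
failed = {"control-plane"}
for dp in ("data-plane-vendor-A", "data-plane-vendor-B"):
    print(dp, "up:", is_up(dp, failed))   # both print False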
As you follow the control of core networks back up the chain, you ultimately realize that no network is truly redundant and diverse. Every network ultimately comes back to a single point of failure, and the only distinction you can make is how far up the ladder you climb before you discover that single point of failure.
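For the topology layer at least, you can make "how far up the ladder" measurable: treat the network as a graph and look for the nodes whose removal disconnects everything else. A brute-force sketch (invented topology; this only covers connectivity, not the shared control- and management-plane dependencies above):

from collections import deque

# Invented topology for illustration: two cores, two POPs, one mgmt node.
TOPOLOGY = {
    "pop-east": {"core-1"},
    "pop-west": {"core-2"},
    "core-1":   {"pop-east", "core-2", "mgmt"},
    "core-2":   {"pop-west", "core-1", "mgmt"},
    "mgmt":     {"core-1", "core-2"},
}

def connected(graph):
    """True if the graph is connected (BFS from an arbitrary node)."""
    if not graph:
        return True
    start = next(iter(graph))
    seen, queue = {start}, deque([start])
    while queue:
        for nbr in graph[queue.popleft()]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(graph)

def single_points_of_failure(graph):
    """Nodes whose removal disconnects the rest of the network."""
    return [node for node in graph
            if not connected({n: nbrs - {node}
                              for n, nbrs in graph.items() if n != node})]

print(single_points_of_failure(TOPOLOGY))   # ['core-1', 'core-2']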
Thanks!
Matt