On Mon, Jul 11, 2022 at 9:01 AM Andrey Kostin <ankost@podolsk.ru> wrote:
> It's hard to believe that a same-time maintenance affecting so many
> devices in the core network could be approved. Core networks are built
> with redundancy, so that failures can't completely destroy the whole
> network.

I think you might need to re-evaluate your assumption 
about how core networks are built.

A well-designed core network will have layers of redundancy 
built in, with clean fault isolation between layers, yes.

I've seen (and sometimes worked on) too many networks 
that didn't have enough budget for redundancy, and were 
built as a string of pearls, one router to the next; if any router 
in the string of pearls broke, the entire string of pearls would 
come crashing down, to abuse a metaphor just a bit too much.

Really well-thought-out redundancy takes a design team that 
has enough experience and enough focused hours in the day 
to think through different failure modes and lay out the design 
ahead of time, before purchases get made.  Many real-world 
networks share the same engineers between design, deployment, 
and operation of the network--and in that model, operation and 
deployment always win over design when it comes time to allocate 
engineering hours.  Likewise, if you didn't have the luxury of being 
able to lay out the design ahead of time, before purchasing hardware 
and leasing facilities, you're likely doing the best you can with locations 
that were contracted before you came into the picture, using hardware 
that was decided on before you had an opportunity to suggest better 
alternatives. 

Taking it a step further, and thinking about the large Facebook outage, 
even if you did well in the design phase, and chose two different vendors, 
with hardware redundancy and site redundancy in your entire core 
network, did you also think about redundancy and diversity for the 
O&M side of the house?   Does each redundant data plane have a 
diverse control plane and management plane, or would an errant 
redistribution of BGP into IGP wipe out both data planes, and both 
hardware vendors at the same time?  Likewise, if a bad configuration 
push isolates your core network nodes from the "God box" that 
controls the device configurations, do you have redundancy in 
connectivity to that "God box" so that you can restore known-good 
configurations to your core network sites, or are you stuck dispatching 
engineers with laptops and USB sticks with configs on them to get 
back to a working condition again?
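
To make that redistribution failure mode concrete, here's a 
hypothetical IOS-style snippet (the AS number and OSPF process ID 
are invented purely for illustration) showing how little it takes:

  router ospf 1
   redistribute bgp 65000 subnets
  ! one line dumps every BGP-learned prefix into OSPF as external
  ! LSAs; with a full Internet table behind AS 65000, every router
  ! that hears those LSAs suffers, regardless of vendor or which
  ! "redundant" data plane it sits in

Pushed from a single management system, that one line reaches both 
vendors and both planes at the same time, which is exactly why the 
O&M side needs its own diversity.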

As you follow the control of core networks back up the chain, 
you eventually realize that no network is truly redundant and 
diverse.  Every network ultimately comes back to a single point 
of failure, and the only distinction you can make is how far up the 
ladder you climb before you discover that single point of failure.

Thanks!

Matt