It's hard to believe that a simultaneous maintenance affecting so many
devices in the core network could be approved. Core networks are built
with redundancy, so that failures can't completely destroy the whole
network.
I think you might need to re-evaluate your assumption
about how core networks are built.
A well-designed core network will have layers of redundancy
built in, with easy isolation of fault domains, yes.
I've seen (and sometimes worked on) too many networks
that didn't have enough budget for redundancy, and were
built as a string of pearls, one router to the next; if any router
in the string of pearls broke, the entire string of pearls would
come crashing down, to abuse a metaphor just a bit too much.
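As a toy illustration of the difference (just a sketch in Python, with
made-up router names, not anyone's real topology or tooling): a
four-router chain is partitioned by almost any single failure, while the
same four routers with one extra link closing a ring survive any single
failure.

# Toy comparison of a "string of pearls" core versus a minimally
# redundant ring when a single router fails. Router names are invented.

from collections import deque

def reachable(adjacency, start, failed):
    """Return the set of nodes reachable from `start`, skipping `failed`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor != failed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def survives_any_single_failure(adjacency):
    """True if every single-node failure leaves the survivors connected."""
    nodes = list(adjacency)
    for failed in nodes:
        survivors = [n for n in nodes if n != failed]
        if reachable(adjacency, survivors[0], failed) != set(survivors):
            return False
    return True

# String of pearls: r1 - r2 - r3 - r4, one router to the next.
chain = {
    "r1": ["r2"],
    "r2": ["r1", "r3"],
    "r3": ["r2", "r4"],
    "r4": ["r3"],
}

# Ring: the same four routers with one extra link closing the loop.
ring = {
    "r1": ["r2", "r4"],
    "r2": ["r1", "r3"],
    "r3": ["r2", "r4"],
    "r4": ["r3", "r1"],
}

print("chain survives any single failure:", survives_any_single_failure(chain))  # False
print("ring survives any single failure: ", survives_any_single_failure(ring))   # True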
Really well-thought-out redundancy takes a design team that
has enough experience and enough focused hours in the day
to think through different failure modes and lay out the design
ahead of time, before purchases get made. Many real-world
networks share the same engineers between design, deployment,
and operation of the network--and in that model, operation and
deployment always win over design when it comes time to allocate
engineering hours. Likewise, if you didn't have the luxury of being
able to lay out the design ahead of time, before purchasing hardware
and leasing facilities, you're likely doing the best you can with locations
that were contracted before you came into the picture, using hardware
that was decided on before you had an opportunity to suggest better
alternatives.
Taking it a step further, and thinking about the large Facebook outage,
even if you did well in the design phase, and chose two different vendors,
with hardware redundancy and site redundancy in your entire core
network, did you also think about redundancy and diversity for the
O&M side of the house? Does each redundant data plane have a
diverse control plane and management plane, or would an errant
redistribution of BGP into IGP wipe out both data planes, and both
hardware vendors at the same time? Likewise, if a bad configuration
push isolates your core network nodes from the "God box" that
controls the device configurations, do you have redundancy in
connectivity to that "God box" so that you can restore known-good
configurations to your core network sites, or are you stuck dispatching
engineers with laptops and USB sticks with configs on them to get
back to a working condition again?
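To make that point concrete, here's a small Python sketch (the component
names and the dependency list are hypothetical, invented for illustration
rather than a description of Facebook's actual setup) that walks the
dependencies of two supposedly redundant data planes and surfaces what
they share upstream:

# Toy dependency walk: two "redundant" core planes from different vendors
# can still share a single management dependency, which is exactly the
# kind of hidden single point of failure the questions above are probing.

dependencies = {
    "core-plane-vendor-a": ["mgmt-network"],
    "core-plane-vendor-b": ["mgmt-network"],
    "mgmt-network":        ["config-god-box"],
    "config-god-box":      [],
}

def upstream(component, deps):
    """Everything a component transitively depends on."""
    found = set()
    stack = [component]
    while stack:
        for dep in deps[stack.pop()]:
            if dep not in found:
                found.add(dep)
                stack.append(dep)
    return found

shared = upstream("core-plane-vendor-a", dependencies) & upstream("core-plane-vendor-b", dependencies)
print("shared upstream dependencies:", shared)
# -> {'mgmt-network', 'config-god-box'}: lose either one, and both
#    "redundant" data planes lose their management path at the same time.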
As you follow the control of core networks back up the chain,
you ultimately realize that no network is truly redundant and
diverse. Every network eventually comes back to a single point
of failure, and the only distinction you can make is how far up the
ladder you climb before you discover that single point of failure.
Thanks!
Matt