[...]
it seems that the problem FB ran into was really that there wasn't either:
"secondary path to communicate: 'You are the last one standing, do not die'" (to an edge node)
or:
"maintain a very long/less-preferred path to a core location(s) to maintain service in case the CDN disappears"
There are almost certainly more complexities which FB is not discussing in their design/deployment which
affected their services last week, but it doesn't look like they were very far off on their deployment, if they
need to maintain back-end connectivity to serve customers from the CDN locales.
-chris
Having worked on trying to solve health-checking situations
in large production complexes in the past, I can definitely
say that it is an exponentially difficult problem for a single
site to determine whether it is "safe" for it to fail out, or if
doing so will result in an entire service going offline, short
of having a central controller which tracks every edge site's
health, and can determine "no, we're below $magic_threshold
number of sites, you can't fail yourself out no matter how
unhealthy you think you are". Which of course you can't
really have, without undoing one of the key reasons for
distributing your serving sites to geographically distant
places in different buildings on different providers--namely
to eliminate single points of failure in your serving infrastructure.
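To make the "$magic_threshold" idea concrete, here is a minimal sketch of what such a central controller's decision might look like. The class name, method names, and site names are all hypothetical, made up for illustration; this is not how FB or anyone else actually does it. Note that the controller itself is exactly the single point of failure described above.

```python
class FailoutController:
    """Hypothetical central controller tracking which edge sites are serving."""

    def __init__(self, sites, min_serving):
        self.serving = set(sites)
        # The "$magic_threshold": never drop below this many serving sites.
        self.min_serving = min_serving

    def request_failout(self, site):
        """Return True if `site` may withdraw itself, False if it must stay up."""
        if site not in self.serving:
            return True  # already withdrawn, nothing to decide
        if len(self.serving) - 1 < self.min_serving:
            # "You can't fail yourself out no matter how unhealthy you think you are."
            return False
        self.serving.discard(site)
        return True

    def mark_recovered(self, site):
        """Put a repaired site back into the serving set."""
        self.serving.add(site)


# Example with three made-up sites and a threshold of two.
ctl = FailoutController(["edge-a", "edge-b", "edge-c"], min_serving=2)
first = ctl.request_failout("edge-a")   # allowed: two sites remain
second = ctl.request_failout("edge-b")  # refused: would leave only one
```

The logic is trivial; the hard part is that every edge site now has to be able to reach the controller at the exact moment things are going wrong.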
Doing the equivalent of "no router bgp" on your core backbone
is going to make things suck, no matter how you slice it, and
I don't think any amount of tweaking the anycast setup or
DNS values would have made a whit of difference to the
underlying outage.
I think the only question we can armchair quarterback
at this point is whether there were prudent steps that
could go into a design to shorten the recovery interval.
So far, we seem to have collected a few key points:
1) make sure your disaster recovery plan doesn't depend
on your production DNS servers being usable; have
key nodes in /etc/hosts files that are periodically updated
via $automation_tool, but ONLY for non-production,
out-of-band recovery nodes; don't static any of your
production-facing entries.
2) Have a working out-of-band that exists entirely independent
of your production network. Dial, frame relay, SMDS, LTE
modems, starlink dishes on the roof; pick your poison, but
budget it in for every production site. Test it monthly to ensure
connectivity to all sites works. Audit regularly to ensure no
dependencies on the production infrastructure have crept in.
3) Ensure you have a good "oh sh**" physical access plan for
key personnel. Some of you at a recent virtual happy hour
heard me talk about the time I isolated the credit card payment
center for a $dayjob, which also cut off access for the card readers
to get into it to restore the network. Use of a fire axe was granted
to on-site personnel during that. Take the time to think through
how physical access is controlled for every key site in your network,
think about failure scenarios, and have an "in case of emergency,
break glass to get the key" plan in place to shorten recovery times.
4) Have a dependency map/graph of your production network.
a) if everything dies, and you have to restart, what has to come up first?
b) what dependencies are there that have to be done in the right order?
c) what services are independent that can be brought up in parallel to speed
up recovery?
d) does every team supporting services on the critical, dependent pathway
have 24x7 on-call coverage, and do they know where in the recovery graph
they're needed? It doesn't help to have teams that can't start back up until
step 9 crowding around asking "are you ready for us yet?" when you still can't
raise the team needed for step 1 on the dependency graph. ^_^;
5) Do you know how close the nearest personnel are to each POP/CDN node,
in case you have to do emergency "drive over with a laptop, hop on the console,
and issue the following commands" rousting in the middle of the night? If someone
lives 3 miles from the CDN node, it's good to know that, so you don't call the person
who is on-call but is 2 hours away without first checking if the person 3 miles away
can do it faster.
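For point 4, the questions in a) through c) can be answered mechanically once the dependency map exists. A rough sketch, assuming the map is kept as "service -> list of things it needs first", using Python's standard graphlib module (3.9+). The service names are invented for illustration; a real map would be far larger and messier.

```python
from graphlib import TopologicalSorter


def recovery_waves(depends_on):
    """Group services into restart waves.

    Everything in one wave can be brought up in parallel; a wave only
    starts once all earlier waves are done.
    """
    ts = TopologicalSorter(depends_on)
    ts.prepare()  # raises graphlib.CycleError if the map has a circular dependency
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # sorted only to make output stable
        waves.append(ready)
        ts.done(*ready)
    return waves


# Hypothetical dependency map: each key lists what must be up before it.
deps = {
    "oob-access": [],
    "backbone-routing": ["oob-access"],
    "internal-dns": ["backbone-routing"],
    "auth": ["backbone-routing"],
    "web-frontend": ["internal-dns", "auth"],
}

for step, wave in enumerate(recovery_waves(deps), start=1):
    print(f"step {step}: {', '.join(wave)}")
```

The wave number also tells each on-call team roughly when they will be needed, which speaks to d): the step 9 folks can see they are step 9. The CycleError case is worth running against the real map ahead of time, since a circular dependency is exactly the kind of thing you do not want to discover mid-outage.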
I'm sure others have even better experiences than I, who can contribute
and add to the list. If nothing else, perhaps collectively we can help
other companies prepare a bit better, so that when the next big "ooops"
happens, the recovery time can be a little bit shorter. :)
Thanks!
Matt