On Mon, Oct 11, 2021 at 8:07 AM Christopher Morrow <morrowc.lists@gmail.com> wrote:
On Sat, Oct 9, 2021 at 11:16 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Bill Woodcock wrote:

[...] 

it seems that the problem FB ran into was really that there wasn't either:
   "secondary path to communicate: "You are the last one standing, do not die"  (to an edge node)
 or:
  "maintain a very long/less-preferred path to a core location(s) to maintain service in case the CDN disappears"

There are almost certainly more complexities which FB is not discussing in their design/deployment which
affected their services last week, but it doesn't look like they were very far off on their deployment, if they
need to maintain back-end connectivity to serve customers from the CDN locales.

-chris

Having worked on trying to solve health-checking situations 
in large production complexes in the past, I can definitely 
say that it is an extraordinarily difficult problem for a single 
site to determine whether it is "safe" for it to fail out, or 
whether doing so will take an entire service offline, short 
of having a central controller which tracks every edge site's 
health and can determine "no, we're below $magic_threshold 
number of sites, you can't fail yourself out no matter how 
unhealthy you think you are".   Which of course you can't 
really have without undoing one of the key reasons for 
distributing your serving sites to geographically distant 
places, in different buildings, on different providers--namely, 
to eliminate single points of failure in your serving infrastructure.
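
To make that tradeoff concrete: the controller-side logic itself is 
trivial; the hard part is that the controller becomes exactly the kind 
of single point of failure the distributed design was meant to avoid. 
A minimal sketch, with the threshold, site names, and health-report 
shape all invented for illustration:

# Hypothetical sketch only -- the threshold, site names, and the shape of
# the health report are invented, not anything any CDN actually runs.
MAGIC_THRESHOLD = 3   # minimum number of healthy edge sites to keep serving

def may_fail_out(requesting_site, site_health):
    """Decide whether a site asking to drain itself is allowed to do so.

    site_health maps site name -> True if that site currently reports healthy.
    """
    healthy = {site for site, ok in site_health.items() if ok}
    remaining = healthy - {requesting_site}
    # "You are the last one standing, do not die."
    return len(remaining) >= MAGIC_THRESHOLD

# Example: only two other healthy sites remain, so the drain request is denied.
print(may_fail_out("pop-iad", {"pop-iad": False, "pop-sjc": True,
                               "pop-ams": True, "pop-nrt": False}))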

Doing the equivalent of "no router bgp" on your core backbone 
is going to make things suck, no matter how you slice it, and 
I don't think any amount of tweaking the anycast setup or 
DNS values would have made a whit of difference to the 
underlying outage.

I think the only question we can armchair quarterback 
at this point is whether there were prudent steps that 
could go into a design to shorten the recovery interval. 

So far, we seem to have collected a few key points:

1) make sure your disaster recovery plan doesn't depend 
    on your production DNS servers being usable; have 
    key nodes in /etc/hosts files that are periodically updated 
    via $automation_tool, but ONLY for non-production, 
    out-of-band recovery nodes; don't statically configure any 
    of your production-facing entries.  (rough sketch below)
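
Point 1 lends itself to a small sketch.  Everything below (node names, 
addresses, the managed-block markers) is hypothetical, and in practice 
the render-and-push step would live in whatever $automation_tool you 
already use:

# Hypothetical sketch: keep ONLY out-of-band recovery nodes in a managed
# /etc/hosts block.  Names, addresses, and markers below are invented.
OOB_RECOVERY_NODES = {
    "oob-console.pop1.example.net": "192.0.2.10",
    "oob-console.pop2.example.net": "192.0.2.20",
    "recovery-jump.example.net":    "198.51.100.5",
}
BEGIN = "# BEGIN oob-recovery (managed)"
END = "# END oob-recovery (managed)"

def render_block():
    entries = [f"{ip}\t{name}" for name, ip in sorted(OOB_RECOVERY_NODES.items())]
    return "\n".join([BEGIN, *entries, END]) + "\n"

def update_hosts(path="/etc/hosts"):
    with open(path) as f:
        text = f.read()
    if BEGIN in text and END in text:
        # Drop the previous managed block before appending a fresh one.
        head, rest = text.split(BEGIN, 1)
        text = head + rest.split(END, 1)[1]
    with open(path, "w") as f:
        f.write(text.rstrip("\n") + "\n" + render_block())

if __name__ == "__main__":
    update_hosts()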

2) Have a working out-of-band network that is entirely independent 
    of your production network.  Dial, frame relay, SMDS, LTE 
    modems, Starlink dishes on the roof; pick your poison, but 
    budget it in for every production site.  Test it monthly to ensure 
    connectivity to all sites works.  Audit regularly to ensure no 
    dependencies on the production infrastructure have crept in.  
    (a reachability-check sketch follows)
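
For the monthly test, even something as small as the sketch below 
beats discovering a dead OOB link mid-outage.  Hostnames here are 
invented, and a real check would of course have to run across the 
out-of-band path itself rather than the production network:

import socket

# Hypothetical console hosts; a real list would come from inventory.
OOB_CONSOLES = [
    ("oob-console.pop1.example.net", 22),
    ("oob-console.pop2.example.net", 22),
]

def reachable(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

failures = [(host, port) for host, port in OOB_CONSOLES if not reachable(host, port)]
if failures:
    # In practice: page the on-call and open a ticket for the audit trail.
    print("OOB check FAILED for:", failures)
else:
    print("all out-of-band consoles reachable")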

3) Ensure you have a good "oh sh**" physical access plan for 
    key personnel.  Some of you at a recent virtual happy hour 
    heard me talk about the time I isolated the credit card payment 
    center for a $dayjob, which also cut off access for the card readers 
    to get into it to restore the network.   Use of a fire axe was granted 
    to on-site personnel during that.  Take the time to think through 
    how physical access is controlled for every key site in your network, 
think about failure scenarios, and have an "in case of emergency, 
    break glass to get the key" plan in place to shorten recovery times.

4) Have a dependency map/graph of your production network.  
 a) if everything dies, and you have to restart, what has to come up first?
 b) what dependencies are there that have to be done in the right order?
 c) what services are independent and can be brought up in parallel to speed
   up recovery?  (a rough ordering sketch follows this list)
 d) does every team supporting services on the critical, dependent pathway 
   have 24x7 on-call coverage, and do they know where in the recovery graph 
   they're needed?  It doesn't help to have teams that can't start back up until 
   step 9 crowding around asking "are you ready for us yet?" when you still can't 
   raise the team needed for step 1 on the dependency graph.  ^_^;
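
For 4a-4c, the dependency map translates directly into a restart 
ordering.  A minimal sketch using Python's graphlib, with an invented 
service graph standing in for the real one:

from graphlib import TopologicalSorter   # Python 3.9+

# Invented dependency map: service -> the services that must be up first.
DEPENDS_ON = {
    "oob-network":  set(),
    "core-routing": {"oob-network"},
    "internal-dns": {"core-routing"},
    "auth":         {"internal-dns"},
    "config-mgmt":  {"internal-dns"},
    "edge-cdn":     {"core-routing", "internal-dns", "auth"},
}

ts = TopologicalSorter(DEPENDS_ON)
ts.prepare()
stage = 0
while ts.is_active():
    ready = sorted(ts.get_ready())   # everything in `ready` can restart in parallel
    stage += 1
    print(f"stage {stage}: bring up {', '.join(ready)}")
    ts.done(*ready)

Each printed stage is a set of services that can be restarted in 
parallel; anything in a later stage is blocked until the earlier 
stages report healthy, which also tells you which teams (4d) need to 
be on the bridge first.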

5) do you know how close the nearest personnel are to each POP/CDN node, 
   in case you have to do emergency "drive over with a laptop, hop on the console, 
   and issue the following commands" rousting in the middle of the night?  If someone 
   lives 3 miles from the CDN node, it's good to know that, so you don't call the person 
   who is on-call but is 2 hours away without first checking if the person 3 miles away 
   can do it faster.  (a distance-ranking sketch follows)
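
And for point 5, a back-of-the-envelope distance table is easy to 
precompute ahead of time.  Names and coordinates below are made up, 
and straight-line miles are only a rough proxy for drive time:

from math import asin, cos, radians, sin, sqrt

def miles_between(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959 * asin(sqrt(h))   # mean Earth radius ~= 3959 miles

# Hypothetical staff home locations and one hypothetical CDN node.
STAFF = {"alice": (47.61, -122.33), "bob": (45.52, -122.68)}
CDN_NODE = (47.60, -122.30)

for dist, name in sorted((miles_between(loc, CDN_NODE), name) for name, loc in STAFF.items()):
    print(f"{name}: {dist:.0f} miles from the node")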
   
I'm sure others have even better experiences than I do, and can contribute 
and add to the list.  If nothing else, perhaps collectively we can help 
other companies prepare a bit better, so that when the next big "ooops" 
happens, the recovery time can be a little bit shorter.   :)

Thanks!

Matt