On Mon, Oct 11, 2021 at 8:07 AM Christopher Morrow <morrowc.lists@gmail.com> wrote:
On Sat, Oct 9, 2021 at 11:16 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Bill Woodcock wrote:

[...] 

it seems that the problem FB ran into was really that there wasn't either:
   "secondary path to communicate: "You are the last one standing, do not die"  (to an edge node)
 or:
  "maintain a very long/less-preferred path to a core location(s) to maintain service in case the CDN disappears"

There are almost certainly more complexities which FB is not discussing in their design/deployment which
affected their services last week, but it doesn't look like they were very far off on their deployment, if they
need to maintain back-end connectivity to serve customers from the CDN locales.

-chris

Having worked on trying to solve health-checking situations 
in large production complexes in the past, I can definitely 
say that it is an extraordinarily difficult problem for a single 
site to determine whether it is "safe" for it to fail out, or 
whether doing so will take an entire service offline, short 
of having a central controller which tracks every edge site's 
health and can determine "no, we're below $magic_threshold 
number of sites, you can't fail yourself out no matter how 
unhealthy you think you are".   Which of course you can't 
really have without undoing one of the key reasons for 
distributing your serving sites to geographically distant 
places, in different buildings, on different providers--namely, 
to eliminate single points of failure in your serving infrastructure.
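
To make that tradeoff concrete: the controller-side logic itself is 
trivial; the hard part is that the controller becomes exactly the kind 
of single point of failure the distributed design was meant to avoid. 
A minimal sketch, with the threshold, site names, and health-report 
shape all invented for illustration:

# Hypothetical sketch only -- the threshold, site names, and the shape of
# the health report are invented, not anything any CDN actually runs.
MAGIC_THRESHOLD = 3   # minimum number of healthy edge sites to keep serving

def may_fail_out(requesting_site, site_health):
    """Decide whether a site asking to drain itself is allowed to do so.

    site_health maps site name -> True if that site currently reports healthy.
    """
    healthy = {site for site, ok in site_health.items() if ok}
    remaining = healthy - {requesting_site}
    # "You are the last one standing, do not die."
    return len(remaining) >= MAGIC_THRESHOLD

# Example: only two other healthy sites remain, so the drain request is denied.
print(may_fail_out("pop-iad", {"pop-iad": False, "pop-sjc": True,
                               "pop-ams": True, "pop-nrt": False}))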

Doing the equivalent of "no router bgp" on your core backbone 
is going to make things suck, no matter how you slice it, and 
I don't think any amount of tweaking the anycast setup or 
DNS values would have made a whit of difference to the 
underlying outage.

I think the only question we can armchair quarterback 
at this point is whether there were prudent steps that 
could go into a design to shorten the recovery interval. 

So far, we seem to have collected a few key points:

1) make sure your disaster recovery plan doesn't depend 
    on your production DNS servers being usable; have 
    key nodes in /etc/hosts files that are periodically updated 
    via $automation_tool, but ONLY for non-production, 
    out-of-band recovery nodes; don't statically configure any 
    of your production-facing entries.  (rough sketch below)
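
Point 1 lends itself to a small sketch.  Everything below (node names, 
addresses, the managed-block markers) is hypothetical, and in practice 
the render-and-push step would live in whatever $automation_tool you 
already use:

# Hypothetical sketch: keep ONLY out-of-band recovery nodes in a managed
# /etc/hosts block.  Names, addresses, and markers below are invented.
OOB_RECOVERY_NODES = {
    "oob-console.pop1.example.net": "192.0.2.10",
    "oob-console.pop2.example.net": "192.0.2.20",
    "recovery-jump.example.net":    "198.51.100.5",
}
BEGIN = "# BEGIN oob-recovery (managed)"
END = "# END oob-recovery (managed)"

def render_block():
    entries = [f"{ip}\t{name}" for name, ip in sorted(OOB_RECOVERY_NODES.items())]
    return "\n".join([BEGIN, *entries, END]) + "\n"

def update_hosts(path="/etc/hosts"):
    with open(path) as f:
        text = f.read()
    if BEGIN in text and END in text:
        # Drop the previous managed block before appending a fresh one.
        head, rest = text.split(BEGIN, 1)
        text = head + rest.split(END, 1)[1]
    with open(path, "w") as f:
        f.write(text.rstrip("\n") + "\n" + render_block())

if __name__ == "__main__":
    update_hosts()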

2) Have a working out-of-band network that is entirely independent 
    of your production network.  Dial, frame relay, SMDS, LTE 
    modems, Starlink dishes on the roof; pick your poison, but 
    budget it in for every production site.  Test it monthly to ensure 
    connectivity to all sites works.  Audit regularly to ensure no 
    dependencies on the production infrastructure have crept in.  
    (a reachability-check sketch follows)
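
For the monthly test, even something as small as the sketch below 
beats discovering a dead OOB link mid-outage.  Hostnames here are 
invented, and a real check would of course have to run across the 
out-of-band path itself rather than the production network:

import socket

# Hypothetical console hosts; a real list would come from inventory.
OOB_CONSOLES = [
    ("oob-console.pop1.example.net", 22),
    ("oob-console.pop2.example.net", 22),
]

def reachable(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

failures = [(host, port) for host, port in OOB_CONSOLES if not reachable(host, port)]
if failures:
    # In practice: page the on-call and open a ticket for the audit trail.
    print("OOB check FAILED for:", failures)
else:
    print("all out-of-band consoles reachable")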

3) Ensure you have a good "oh sh**" physical access plan for 
    key personnel.  Some of you at a recent virtual happy hour 
    heard me talk about the time I isolated the credit card payment 
    center for a $dayjob, which also cut off access for the card readers 
    to get into it to restore the network.   Use of a fire axe was granted 
    to on-site personnel during that.  Take the time to think through 
    how physical access is controlled for every key site in your network, 
think about failure scenarios, and have an "in case of emergency, 
    break glass to get the key" plan in place to shorten recovery times.

4) Have a dependency map/graph of your production network.  
 a) if everything dies, and you have to restart, what has to come up first?
 b) what dependencies are there that have to be done in the right order?
 c) what services are independent and can be brought up in parallel to speed
   up recovery?  (a rough ordering sketch follows this list)
 d) does every team supporting services on the critical, dependent pathway 
   have 24x7 on-call coverage, and do they know where in the recovery graph 
   they're needed?  It doesn't help to have teams that can't start back up until 
   step 9 crowding around asking "are you ready for us yet?" when you still can't 
   raise the team needed for step 1 on the dependency graph.  ^_^;
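
For 4a-4c, the dependency map translates directly into a restart 
ordering.  A minimal sketch using Python's graphlib, with an invented 
service graph standing in for the real one:

from graphlib import TopologicalSorter   # Python 3.9+

# Invented dependency map: service -> the services that must be up first.
DEPENDS_ON = {
    "oob-network":  set(),
    "core-routing": {"oob-network"},
    "internal-dns": {"core-routing"},
    "auth":         {"internal-dns"},
    "config-mgmt":  {"internal-dns"},
    "edge-cdn":     {"core-routing", "internal-dns", "auth"},
}

ts = TopologicalSorter(DEPENDS_ON)
ts.prepare()
stage = 0
while ts.is_active():
    ready = sorted(ts.get_ready())   # everything in `ready` can restart in parallel
    stage += 1
    print(f"stage {stage}: bring up {', '.join(ready)}")
    ts.done(*ready)

Each printed stage is a set of services that can be restarted in 
parallel; anything in a later stage is blocked until the earlier 
stages report healthy, which also tells you which teams (4d) need to 
be on the bridge first.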

5) do you know how close the nearest personnel are to each POP/CDN node, 
   in case you have to do emergency "drive over with a laptop, hop on the console, 
   and issue the following commands" rousting in the middle of the night?  If someone 
   lives 3 miles from the CDN node, it's good to know that, so you don't call the person 
   who is on-call but is 2 hours away without first checking if the person 3 miles away 
   can do it faster.  (a distance-ranking sketch follows)
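
And for point 5, a back-of-the-envelope distance table is easy to 
precompute ahead of time.  Names and coordinates below are made up, 
and straight-line miles are only a rough proxy for drive time:

from math import asin, cos, radians, sin, sqrt

def miles_between(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959 * asin(sqrt(h))   # mean Earth radius ~= 3959 miles

# Hypothetical staff home locations and one hypothetical CDN node.
STAFF = {"alice": (47.61, -122.33), "bob": (45.52, -122.68)}
CDN_NODE = (47.60, -122.30)

for dist, name in sorted((miles_between(loc, CDN_NODE), name) for name, loc in STAFF.items()):
    print(f"{name}: {dist:.0f} miles from the node")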
   
I'm sure others have even better experiences than I do, and can contribute 
and add to the list.  If nothing else, perhaps collectively we can help 
other companies prepare a bit better, so that when the next big "ooops" 
happens, the recovery time can be a little bit shorter.   :)

Thanks!

Matt