On Mon, Oct 11, 2021 at 8:07 AM Christopher Morrow <morrowc.lists@gmail.com> wrote:
On Sat, Oct 9, 2021 at 11:16 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Bill Woodcock wrote:
[...]
it seems that the problem FB ran into was really that there wasn't either: a secondary path to communicate "You are the last one standing, do not die" (to an edge node), or: "maintain a very long/less-preferred path to a core location(s) to maintain service in case the CDN disappears"
There are almost certainly more complexities, which FB is not discussing, in the design/deployment that affected their services last week, but it doesn't look like they were very far off on their deployment, if they need to maintain back-end connectivity to serve customers from the CDN locales.
-chris
Having worked on trying to solve health-checking situations in large production complexes in the past, I can definitely say that it is an exponentially difficult problem for a single site to determine whether it is "safe" for it to fail out, or if doing so will result in an entire service going offline, short of having a central controller which tracks every edge site's health and can determine "no, we're below $magic_threshold number of sites, you can't fail yourself out no matter how unhealthy you think you are" (a rough sketch of that kind of check is included after the list below). Which of course you can't really have, without undoing one of the key reasons for distributing your serving sites to geographically distant places, in different buildings, on different providers--namely, to eliminate single points of failure in your serving infrastructure.

Doing the equivalent of "no router bgp" on your core backbone is going to make things suck no matter how you slice it, and I don't think any amount of tweaking the anycast setup or DNS values would have made a whit of difference to the underlying outage. I think the only question we can armchair-quarterback at this point is whether there were prudent steps that could have gone into a design to shorten the recovery interval. So far, we seem to have collected a few key points:

1) Make sure your disaster recovery plan doesn't depend on your production DNS servers being usable; have key nodes in /etc/hosts files that are periodically updated via $automation_tool, but ONLY for non-production, out-of-band recovery nodes; don't static any of your production-facing entries. (Sketch after the list.)

2) Have a working out-of-band network that exists entirely independent of your production network. Dial, frame relay, SMDS, LTE modems, Starlink dishes on the roof; pick your poison, but budget it in for every production site. Test it monthly to ensure connectivity to all sites works, and audit regularly to ensure no dependencies on the production infrastructure have crept in. (Sketch after the list.)

3) Ensure you have a good "oh sh**" physical access plan for key personnel. Some of you at a recent virtual happy hour heard me talk about the time I isolated the credit card payment center for a $dayjob, which also cut off access for the card readers needed to get into it to restore the network. Use of a fire axe was granted to on-site personnel during that. Take the time to think through how physical access is controlled for every key site in your network, think about failure scenarios, and have an "in case of emergency, break glass to get the key" plan in place to shorten recovery times.

4) Have a dependency map/graph of your production network. (Sketch after the list.)
   a) If everything dies and you have to restart, what has to come up first?
   b) What dependencies are there that have to be brought up in the right order?
   c) What services are independent and can be brought up in parallel to speed up recovery?
   d) Does every team supporting services on the critical, dependent pathway have 24x7 on-call coverage, and do they know where in the recovery graph they're needed? It doesn't help to have teams that can't start back up until step 9 crowding around asking "are you ready for us yet?" when you still can't raise the team needed for step 1 on the dependency graph. ^_^;

5) Do you know how close the nearest personnel are to each POP/CDN node, in case you have to do emergency "drive over with a laptop, hop on the console, and issue the following commands" rousting in the middle of the night? If someone lives 3 miles from the CDN node, it's good to know that, so you don't call the person who is on-call but is 2 hours away without first checking whether the person 3 miles away can do it faster. (Sketch after the list.)
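To make the central-controller idea above a bit more concrete, here's a rough Python sketch of the kind of "can I fail myself out?" check I mean. Everything in it (site names, the threshold value, the controller itself) is invented for illustration, and as noted above, such a controller is exactly the kind of single point of failure you distributed your sites to avoid:

MIN_HEALTHY_SITES = 4  # the $magic_threshold

class FailoutController:
    """Toy central controller that tracks which edge sites are still serving."""
    def __init__(self, sites):
        self.serving = {name: True for name in sites}

    def request_failout(self, site):
        """Return True if 'site' may drain itself, False if it must stay up."""
        healthy_after = sum(self.serving.values()) - (1 if self.serving[site] else 0)
        if healthy_after < MIN_HEALTHY_SITES:
            return False  # "you are one of the last ones standing, do not die"
        self.serving[site] = False
        return True

controller = FailoutController(["iad", "ord", "dfw", "sjc", "ams"])
print(controller.request_failout("ord"))  # True: plenty of healthy sites remain
print(controller.request_failout("dfw"))  # False: would drop below the floor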
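For the /etc/hosts piece of point 1, the $automation_tool step could be as simple as the sketch below: rewrite a clearly marked block containing only the out-of-band recovery nodes and touch nothing else. The hostnames and addresses are placeholders, and you'd adapt this to whatever config management you already run:

#!/usr/bin/env python3
# Rewrite a managed block in /etc/hosts with OOB recovery nodes only.
# Hostnames/addresses below are placeholders, not real infrastructure.

OOB_RECOVERY_NODES = {
    "oob-console-east.example.net": "198.51.100.10",
    "oob-console-west.example.net": "198.51.100.20",
}
BEGIN = "# BEGIN OOB-RECOVERY (managed block)\n"
END = "# END OOB-RECOVERY (managed block)\n"

def render_block():
    lines = [BEGIN]
    for host, addr in sorted(OOB_RECOVERY_NODES.items()):
        lines.append(f"{addr}\t{host}\n")
    lines.append(END)
    return lines

def update_hosts(path="/etc/hosts"):
    with open(path) as f:
        lines = f.readlines()
    kept, skipping = [], False
    for line in lines:          # strip any previous managed block
        if line == BEGIN:
            skipping = True
        elif line == END:
            skipping = False
        elif not skipping:
            kept.append(line)
    with open(path, "w") as f:  # re-append a fresh managed block at the end
        f.writelines(kept + render_block())

if __name__ == "__main__":
    update_hosts()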
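For the monthly test in point 2, even a trivial reachability check run from a box on the OOB side is better than nothing; something along these lines (endpoints invented), wired into cron or your monitoring so a failure actually pages someone:

#!/usr/bin/env python3
# Try to reach every out-of-band console endpoint and exit non-zero on
# any failure, so cron/monitoring notices. Endpoints are placeholders.

import socket

OOB_ENDPOINTS = [
    ("oob-console-east.example.net", 22),
    ("oob-console-west.example.net", 22),
    ("lte-gw-pop7.example.net", 22),
]

def reachable(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

failures = [(h, p) for h, p in OOB_ENDPOINTS if not reachable(h, p)]
for host, port in failures:
    print(f"OOB check FAILED: {host}:{port}")
if failures:
    raise SystemExit(1)
print("all OOB endpoints reachable")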
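Point 4 is the one that really pays off to keep as data rather than tribal knowledge. If the dependency graph lives in a file, you can compute the bring-up order from it and see which services can be restarted in parallel (4a-4c), and hang the on-call rosters for 4d off the same structure. A minimal sketch, with entirely made-up services:

# service -> the services it depends on (which must be up first).
# These names are placeholders, not anyone's real architecture.
DEPENDS_ON = {
    "oob-access":   [],
    "backbone":     ["oob-access"],
    "internal-dns": ["backbone"],
    "auth":         ["internal-dns"],
    "config-mgmt":  ["internal-dns"],
    "monitoring":   ["internal-dns"],
    "edge-cdn":     ["backbone", "auth"],
}

def recovery_stages(deps):
    """Group services into stages; everything in a stage can start in parallel."""
    remaining = dict(deps)
    stages = []
    while remaining:
        ready = [s for s, d in remaining.items()
                 if not any(dep in remaining for dep in d)]
        if not ready:
            raise ValueError(f"circular dependency among: {sorted(remaining)}")
        stages.append(sorted(ready))
        for s in ready:
            del remaining[s]
    return stages

for step, services in enumerate(recovery_stages(DEPENDS_ON), start=1):
    print(f"step {step}: {', '.join(services)}")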
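And for point 5, if you keep rough coordinates for each POP and for the folks who are willing to be rousted, the "who's actually closest?" question can be answered before the phone calls start. A toy sketch with invented names and coordinates:

from math import radians, sin, cos, asin, sqrt

def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

PEOPLE = {"alice": (47.61, -122.33), "bob": (45.52, -122.68)}
POPS = {"sea1": (47.45, -122.30), "pdx1": (45.59, -122.60)}

for pop, loc in POPS.items():
    nearest = min(PEOPLE, key=lambda person: distance_km(PEOPLE[person], loc))
    print(f"{pop}: try {nearest} first, about "
          f"{distance_km(PEOPLE[nearest], loc):.0f} km away")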
I'm sure others have even better experiences than I do, and can contribute and add to the list. If nothing else, perhaps collectively we can help other companies prepare a bit better, so that when the next big "ooops" happens, the recovery time can be a little bit shorter. :)

Thanks!

Matt