On Sat, Oct 9, 2021 at 11:16 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
> Bill Woodcock wrote:
>
>>> It may be that facebook uses all the four name server IP addresses
>>> in each edge node. But, it effectively kills essential redundancy
>>> of DNS to have two or more name servers (at separate locations)
>>> and the natural consequence is, as you can see, mass disaster.
>>
>> Yep.  I think we even had a NANOG talk on exactly that specific topic a long time ago.
>>
>> https://www.pch.net/resources/Papers/dns-service-architecture/dns-service-architecture-v10.pdf
>
> Yes, having separate sets of anycast addresses by two or more pops
> should be fine.


To be fair, it looks like FB has 4 /32's (and 4 /128's) for their DNS authoritatives.
All are from different /24's or /48's, so they should have decent routing diversity.
They could choose to announce half/half from alternate pops, or play other games along those lines.
I don't know that that would have solved any of last week's problems, or any future ones.
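
For illustration only, a tiny Python sketch of the "announce half/half from alternate pops" idea above;
the prefixes are RFC 5737 placeholders rather than FB's real nameserver addresses, and the even/odd
split across pops is just one arbitrary way to partition them:

    # Rough sketch, not FB tooling: split four authoritative DNS prefixes into
    # two announcement sets and have alternating PoPs originate only one set,
    # so a single failure mode is less likely to take out all four addresses
    # at once.  Prefixes below are RFC 5737 placeholders.
    NS_PREFIXES = [
        "192.0.2.1/32",       # "a" nameserver (placeholder)
        "192.0.2.129/32",     # "b" nameserver (placeholder)
        "198.51.100.1/32",    # "c" nameserver (placeholder)
        "198.51.100.129/32",  # "d" nameserver (placeholder)
    ]
    SET_A = NS_PREFIXES[:2]   # announced by "even" PoPs
    SET_B = NS_PREFIXES[2:]   # announced by "odd" PoPs

    def prefixes_for_pop(pop_index: int) -> list:
        """Return the DNS service prefixes this PoP should originate."""
        return SET_A if pop_index % 2 == 0 else SET_B

    for pop in range(4):
        print("pop%d announces %s" % (pop, prefixes_for_pop(pop)))
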
I think Bill's slide 30 is pretty much what FB has/had deployed:
  1) I would think the a/b cloud is really 'as similar a set of paths from like deployments as possible'
  2) redundant pairs of servers in the same transit/network
  3) hidden masters (almost certainly these are in the depths of the FB datacenter network)
      (though also this part isn't important for the conversation)
  4) control/sync traffic on a different topology than the customer-serving one (rough sketch after this list)
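
To make that shape concrete, a hedged sketch of how I read points 1, 3 and 4 (a/b clouds of like deployments,
a hidden master, and control/sync traffic kept off the customer-serving anycast addresses); all names and
addresses below are invented for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class EdgeNode:
        name: str
        cloud: str           # "a" or "b": like deployments grouped on similar paths
        service_addrs: list  # anycast addresses answered toward customers
        mgmt_addr: str       # unicast address on the control/sync topology

    @dataclass
    class HiddenMaster:
        mgmt_addr: str       # reachable only on the control topology
        secondaries: list = field(default_factory=list)

        def notify_targets(self):
            # Zone NOTIFY/transfer traffic rides the management topology,
            # never the anycast service addresses.
            return [n.mgmt_addr for n in self.secondaries]

    edges = [
        EdgeNode("pop1", "a", ["192.0.2.1", "192.0.2.129"], "10.0.0.1"),
        EdgeNode("pop2", "b", ["198.51.100.1", "198.51.100.129"], "10.0.0.2"),
    ]
    master = HiddenMaster("10.0.255.1", edges)
    print(master.notify_targets())   # control-plane destinations only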
 
> However, if CDN provider has their own transit backbone, which is,
> seemingly, not assumed by your slides, and retail ISPs are tightly

I think it is, actually, in slide 30?
   "We need a network topology to carry control and synchronization traffic between the nodes"

> connected to only one pop of the CDN provider, the CDN provider

it's also not clear that FB is connecting their CDN to single points in any provider...
I'd guess there are some cases of that, but for larger networks I would imagine there are multiple CDN
deployments per network. I can't imagine that it's safe to deploy 1 CDN node for all of 7018 or 3320, for instance.
 
> may be motivated to let users access only one pop killing essential
> redundancy of DNS, which should be overengineering, which is my
> concern of the paragraph quoted by you.


it seems that the problem FB ran into was really that there wasn't either:
   a "secondary path to communicate 'you are the last one standing, do not die'" (to an edge node; rough sketch below)
 or:
   a "maintain a very long/less-preferred path to a core location(s) to maintain service in case the CDN disappears"

There are almost certainly more complexities in FB's design/deployment, which they aren't discussing, that
affected their services last week, but it doesn't look like they were very far off on their deployment, given that
they need to maintain back-end connectivity to serve customers from the CDN locales.

-chris