On Sat, Oct 9, 2021 at 1:40 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Christopher Morrow wrote:
>> means their DNS servers were serving the zone, even after they
>> recognize their zone data were too old, that is, expired.

> that's not what this means. I think Mr. Petach previously described
> this,

He wrote:

> So, the idea is that if the edge CDN node loses connectivity to
> the core datacenters, the DNS servers should stop answering
> queries for A records with the local CDN node's address, and
> let a different site respond back to the client's DNS request.

which may be performed by standard DNS with a short expire period,
after which name servers will return SERVFAIL and other name
servers in other edge nodes with different IP addresses are tried.

(Apologies for the delayed response--I had back-to-back board 
meetings the past two days which had me completely tied up.)

That is one way in which it *could* be done--but is by no means 
the ONLY way in which it can be done.

With an anycast setup using the same IP addresses in every
location, returning SERVFAIL doesn't have the same effect,
however, because a query that fails over from anycast address 1
to anycast address 2 is likely to be routed to the same pop
location, where the same result will occur.

You don't really want to hunt among different *IP addresses*,
you want to hunt to a different *location*.

This is why withdrawing the BGP announcement from that 
location works more effectively, because it allows the clients 
to continue querying the same IP address, but get routed to 
the next most proximal location.
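
To make the routing behaviour concrete, here is a toy Python sketch
(not anyone's production code). It assumes a made-up set of pop names
and a single "distance" number standing in for BGP path selection;
route_query and the data are hypothetical. All it shows is that
withdrawing the announcement at one pop moves a client to the next
closest pop while the queried IP address stays the same.

```python
from typing import Dict, Optional, Set


def route_query(distances: Dict[str, int], announcing: Set[str]) -> Optional[str]:
    """Return the closest pop that still announces the anycast prefix.

    `distances` is a stand-in for BGP best-path selection from one client's
    point of view (an assumption for illustration, not real routing logic).
    """
    candidates = [pop for pop in distances if pop in announcing]
    if not candidates:
        return None  # nobody announces the prefix: the query goes nowhere
    return min(candidates, key=lambda pop: distances[pop])


# Hypothetical pops and distances as seen from a single client.
distances = {"sjc": 5, "ord": 40, "ams": 140}
announcing = {"sjc", "ord", "ams"}

before = route_query(distances, announcing)

# sjc fails its health check and withdraws its BGP announcement.
announcing.discard("sjc")
after = route_query(distances, announcing)

print(before, "->", after)
```

The client never changes the address it queries; only the set of
locations announcing that address changes.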

If you simply return SERVFAIL and have the client pick a 
different IP address from the list of NS entries, it falls into 
one of two situations:
a) the new IP address is also anycasted, and is therefore
   likely to land on the same unhealthy pop, with similar
   results, or
b) the new IP address is *not* anycasted, but is served from
   a single geographical location, which means answers given
   back by that DNS server are unlikely to be geolocated with
   any accuracy, and therefore the content served is also unlikely
   to be geographically relevant or correct.
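
Case (a) can be sketched with a toy model. Everything here is
hypothetical: the addresses come from the documentation prefixes, the
pop names are made up, and try_nameservers just walks the NS list the
way a resolver might after getting a SERVFAIL.

```python
from typing import Dict, List, Optional, Set, Tuple


def try_nameservers(
    ns_addresses: List[str],
    pop_for_address: Dict[str, str],
    healthy_pops: Set[str],
) -> Tuple[Optional[str], List[str]]:
    """Walk the NS list until one address lands on a healthy pop.

    Returns (address that answered, or None; list of pops that were hit).
    """
    pops_hit = []
    for addr in ns_addresses:
        pop = pop_for_address[addr]  # where routing delivers this address
        pops_hit.append(pop)
        if pop in healthy_pops:
            return addr, pops_hit
        # otherwise: SERVFAIL, move on to the next NS address
    return None, pops_hit


# Hypothetical anycast addresses, all routed to the same (unhealthy) pop.
ns_addresses = ["192.0.2.1", "198.51.100.1", "203.0.113.1"]
same_pop = {addr: "sjc" for addr in ns_addresses}

answered, pops_hit = try_nameservers(ns_addresses, same_pop, healthy_pops={"ord"})
print(answered, pops_hit)
```

Hunting through the address list never leaves the unhealthy pop, even
though a healthy one exists elsewhere.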
 

It may be that Facebook uses all four name server IP addresses
in each edge node. But that effectively kills the essential DNS
redundancy of having two or more name servers (at separate
locations), and the natural consequence is, as you can see, mass
disaster.

Even if the four anycasted nameserver IP addresses weren't 
completely overlapping (let's assume as a hypothetical that 
a.ns is served out of EU pops, b.ns is served out of NA pops,
c.ns is served out of SA pops, and d.ns is served out of APAC 
pops), if all sites run the same healthcheck code, then if the 
underlying healthcheck fails, *every site* will decide it is 
unhealthy, and stop answering requests; so, all the EU sites 
fail health check and stop serving a.ns; all the North America 
sites fail health check, and stop serving b.ns...and so forth.

You followed the best practices, you had different NS entries 
that were on different subnets, that were geographically 
dispersed around the globe, that were redundant for each 
other.  But because they all used the same fundamental 
health check, they all *independently* decided they were 
unhealthy and needed to stop giving out DNS answers, 
and instead let one of the other healthier sites take over.
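
The correlated failure is easy to see in a toy model. The a/b/c/d
to region mapping is the hypothetical from above; the pop names and
the core_reachable flags are made up, and the health check is reduced
to "can this pop reach the core datacenters".

```python
from typing import Dict, List


def serving_pops(
    ns_to_pops: Dict[str, List[str]],
    core_reachable: Dict[str, bool],
) -> Dict[str, List[str]]:
    """For each NS name, list the pops that pass the shared health check."""
    return {
        ns: [pop for pop in pops if core_reachable[pop]]
        for ns, pops in ns_to_pops.items()
    }


# Hypothetical layout: each NS name served from a different region.
ns_to_pops = {
    "a.ns": ["ams", "fra"],  # EU
    "b.ns": ["sjc", "ord"],  # NA
    "c.ns": ["gru"],         # SA
    "d.ns": ["nrt", "sin"],  # APAC
}
all_pops = [pop for pops in ns_to_pops.values() for pop in pops]

# One pop loses the core: only that pop drops out, the rest keep serving.
one_down = {pop: pop != "sjc" for pop in all_pops}
partial = serving_pops(ns_to_pops, one_down)

# The core itself goes away: every pop fails the same check at once.
core_gone = {pop: False for pop in all_pops}
total = serving_pops(ns_to_pops, core_gone)

print(partial)
print(total)
```

In the second case every NS name ends up with no pop serving it,
despite the geographic and subnet diversity.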
 

> but: 1) dns server in pop serves some content (ttls aren't
> important right now)

You MUST distinguish TTL and EXPIRE. They are different.

TTL and EXPIRE are irrelevant here.
The only thing changing those values would do is change 
how long it took for caching resolvers to reflect the loss of 
connectivity at the DNS layer.  Once the underlying layer 3 
connectivity had broken, DNS answers became meaningless.
No matter what records were returned, or cached, you couldn't 
reach the servers.

Yes, yes, as an academic exercise you can point out that 
there's a difference in how and when those DNS records 
stop being used, and you're right about that--but in terms 
of this particular failure, this particular post-mortem we're 
beating to a horse-shaped pulp, it's entirely meaningless.   ^_^;
 

> there's not a lot of magic here... and it's not about the zone data
> really at all.

Petach's statement, "the edge CDN node loses connectivity to
the core datacenters, the DNS servers should stop answering",
means, in DNS terminology, that the zone data is expired, which
has nothing to do with TTL.

As you're using my words, I'm going to have to point out that
"the DNS servers should stop answering" does not require that 
any change happens *at the DNS layer* -- in this case, the 
change can happen at the routing layer, ensuring that even 
if some caching resolver out there is completely defiant of 
your expire time, you *will not answer* because the query 
packets can never reach you in the first place.
 
                                                Masataka Ohta

Thanks!

Matt