On Sat, Oct 9, 2021 at 1:40 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:

Christopher Morrow wrote:
>> means their DNS servers were serving the zone, even after they
>> recognize their zone data were too old, that is, expired.

> that's not what this means. I think Mr. Petach previously described
> this,

He wrote:

> So, the idea is that if the edge CDN node loses connectivity to
> the core datacenters, the DNS servers should stop answering
> queries for A records with the local CDN node's address, and
> let a different site respond back to the client's DNS request.

which may be performed by standard DNS with a short expire period,
after which name servers will return SERVFAIL and other name
servers in other edge nodes with different IP addresses are tried.

(Apologies for the delayed response--I had back-to-back board
meetings the past two days which had me completely tied up.)

That is one way in which it *could* be done--but is by no means
the ONLY way in which it can be done.

With an anycast setup using the same IP addresses in every
location, returning SERVFAIL doesn't have the same effect,
however, because failing over from anycast address 1 to
anycast address 2 is likely to be routed to the same pop
location, where the same result will occur.

You don't really want to hunt among different *IP addresses*,
you want to hunt to a different *location*.

This is why withdrawing the BGP announcement from that
location works more effectively, because it allows the clients
to continue querying the same IP address, but get routed to
the next most proximal location.
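
As a rough illustration of that mechanism, here is a minimal sketch of a
health-check-driven announce/withdraw loop. It assumes a helper process
that hands text commands to a BGP speaker, loosely modeled on the style
of ExaBGP's text API. The prefix 192.0.2.53/32 and the names
bgp_command and run are hypothetical and are not taken from any real
deployment.

```python
from typing import Iterable, List, Optional

# Hypothetical anycast service address (from the TEST-NET-1 documentation range).
ANYCAST_PREFIX = "192.0.2.53/32"


def bgp_command(healthy: bool, announced: bool) -> Optional[str]:
    """Return the command for a state change, or None if nothing changes."""
    if healthy and not announced:
        return f"announce route {ANYCAST_PREFIX} next-hop self"
    if not healthy and announced:
        return f"withdraw route {ANYCAST_PREFIX} next-hop self"
    return None


def run(health_samples: Iterable[bool]) -> List[str]:
    """Feed a sequence of health-check results; collect the emitted commands."""
    announced = False
    commands = []
    for healthy in health_samples:
        cmd = bgp_command(healthy, announced)
        if cmd is not None:
            commands.append(cmd)
            announced = healthy  # state now matches the last command sent
    return commands


if __name__ == "__main__":
    # Healthy, healthy, unhealthy, unhealthy, healthy again.
    for line in run([True, True, False, False, True]):
        print(line)
```

The service address never changes from the client's point of view; only
the set of locations announcing it does, which is why clients land on
the next closest site without having to pick a different NS entry.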

If you simply return SERVFAIL and have the client pick a
different IP address from the list of NS entries, it falls into
one of two situations:

a) the new IP address is also anycasted, and is therefore
likely to pick the same pop that is unhealthy, with similar
results, or

b) the new IP address is *not* anycasted, but is served from
a single geographical location, which means answers given
back by that DNS server are unlikely to be geolocated with
any accuracy, and therefore the content served is also unlikely
to be geographically relevant or correct.

It may be that Facebook uses all four name server IP addresses
in each edge node. But that effectively kills the essential DNS
redundancy of having two or more name servers (at separate
locations), and the natural consequence is, as you can see, mass
disaster.

Even if the four anycasted nameserver IP addresses weren't
completely overlapping (let's assume as a hypothetical that
a.ns is served out of EU pops, b.ns is served out of NA pops,
c.ns is served out of SA pops, and d.ns is served out of APAC
pops), if all sites run the same healthcheck code, then if the
underlying healthcheck fails, *every site* will decide it is
unhealthy, and stop answering requests; so, all the EU sites
fail health check and stop serving a.ns; all the North America
sites fail health check, and stop serving b.ns...and so forth.

You followed the best practices, you had different NS entries
that were on different subnets, that were geographically
dispersed around the globe, that were redundant for each
other. But because they all used the same fundamental
health check, they all *independently* decided they were
unhealthy and needed to stop giving out DNS answers,
and instead let one of the other healthier sites take over.
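
To make the correlated-failure point concrete, here is a small sketch of
the hypothetical layout above (a.ns in EU, b.ns in NA, c.ns in SA, d.ns
in APAC). Every site evaluates the same health check on its own; the
check functions and their names are illustrative assumptions, not a
description of any real healthcheck code.

```python
from typing import Callable, Dict, List

# Hypothetical mapping of nameserver to the region whose pops serve it.
SITES: Dict[str, str] = {
    "a.ns": "EU",
    "b.ns": "NA",
    "c.ns": "SA",
    "d.ns": "APAC",
}


def surviving_nameservers(check: Callable[[str], bool]) -> List[str]:
    """Each region runs the same check independently; return who still serves."""
    return sorted(ns for ns, region in SITES.items() if check(region))


def regional_outage(region: str) -> bool:
    """Only EU loses its path to the core; the other regions stay healthy."""
    return region != "EU"


def backbone_outage(region: str) -> bool:
    """The shared dependency (the core) is gone, so every region fails."""
    return False


if __name__ == "__main__":
    print(surviving_nameservers(regional_outage))
    print(surviving_nameservers(backbone_outage))
```

With a regional failure the other three nameservers keep answering, as
intended. When the thing being checked is common to every site, all
four drop out at once, and no amount of NS diversity helps.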

> but: 1) dns server in pop serves some content (ttls aren't
> important right now)

You MUST distinguish TTL and EXPIRE. They are different.

TTL and EXPIRE are irrelevant here.

The only thing changing those values would do is change
how long it took for caching resolvers to reflect the loss of
connectivity at the DNS layer. Once the underlying layer 3
connectivity had broken, DNS answers became meaningless.
No matter what records were returned, or cached, you couldn't
reach the servers.
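
A toy model of that point, assuming a resolver cache keyed by name and a
separate set of addresses that are reachable at layer 3. The names,
addresses, and the can_reach helper are all hypothetical.

```python
from typing import Dict, Set, Tuple

# name -> (cached IP address, absolute time at which the cached answer expires)
Cache = Dict[str, Tuple[str, int]]


def can_reach(name: str, now: int, cache: Cache, reachable_ips: Set[str]) -> bool:
    """True only if a cached answer is still valid AND its address is routable."""
    entry = cache.get(name)
    if entry is None:
        return False
    ip, expires_at = entry
    if now >= expires_at:
        return False  # the cached answer has aged out
    return ip in reachable_ips  # a valid answer is useless without a route


if __name__ == "__main__":
    short_ttl: Cache = {"www.example.com": ("192.0.2.10", 60)}
    long_ttl: Cache = {"www.example.com": ("192.0.2.10", 86400)}

    # Routes present: the cached answer works.
    print(can_reach("www.example.com", 10, short_ttl, {"192.0.2.10"}))
    # Routes withdrawn: neither a short nor a long lifetime helps.
    print(can_reach("www.example.com", 10, short_ttl, set()))
    print(can_reach("www.example.com", 10, long_ttl, set()))
```

Changing the lifetime only moves the moment at which the cache stops
handing out the answer; it has no effect on whether the address behind
that answer can be reached.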

Yes, yes, as an academic exercise you can point out that
there's a difference in how and when those DNS records
stop being used, and you're right about that--but in terms
of this particular failure, this particular post-mortem we're
beating to a horse-shaped pulp, it's entirely meaningless. ^_^;

> there's not a lot of magic here... and it's not about the zone data
> really at all.

Statement of Petach: "the edge CDN node loses connectivity to
the core datacenters, the DNS servers should stop answering"
means, in DNS terminology, that the zone data is expired, which
has nothing to do with TTL.

As you're using my words, I'm going to have to point out that
"the DNS servers should stop answering" does not require that
any change happens *at the DNS layer* -- in this case, the
change can happen at the routing layer, ensuring that even
if some caching resolver out there is completely defiant of
your expire time, you *will not answer* because the query
packets can never reach you in the first place.

Masataka Ohta

Thanks!

Matt