Christopher Morrow wrote:
>> means their DNS servers were serving the zone even after they
>> recognized that their zone data was too old, that is, expired.
> that's not what this means. I think Mr. Petach previously described
> this,
He wrote:
> So, the idea is that if the edge CDN node loses connectivity to
> the core datacenters, the DNS servers should stop answering
> queries for A records with the local CDN node's address, and
> let a different site respond back to the client's DNS request.
which may be performed by standard DNS with a short expire period,
after which the name servers will return SERVFAIL and other name
servers, in other edge nodes with different IP addresses, are tried.
(Apologies for the delayed response--I had back-to-back board
meetings the past two days which had me completely tied up.)
That is one way in which it *could* be done--but is by no means
the ONLY way in which it can be done.
With an anycast setup using the same IP addresses in every
location, returning SERVFAIL doesn't have the same effect,
however, because failing over from anycast address 1 to
anycast address 2 is likely to be routed to the same pop
location, where the same result will occur.
You don't really want to hunt among different *IP addresses*,
you want to hunt to a different *location*.
This is why withdrawing the BGP announcement from that
location works more effectively, because it allows the clients
to continue querying the same IP address, but get routed to
the next most proximal location.
If you simply return SERVFAIL and have the client pick a
different IP address from the list of NS entries, it falls into
one of two situations:
a) the new IP address is also anycasted, and is therefore
likely to pick the same pop that is unhealthy, with similar
results, or
b) the new IP address is *not* anycasted, but is served from
a single geographical location, which means answers given
back by that DNS server are unlikely to be geolocated with
any accuracy, and therefore the content served is also unlikely
to be geographically relevant or correct.
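
A toy sketch of case (a) versus withdrawing the announcement. The PoP
names, distances, and the ns1/ns2 labels are all invented for
illustration, and the route() helper is only a stand-in for what BGP
would actually decide:

```python
# Toy anycast model. PoP names, distances, and addresses are invented
# for illustration; real routing is decided by BGP, not by this function.

def route(distances, announcements, address):
    """Return the nearest PoP announcing `address`, or None if none does."""
    pops = announcements.get(address, set())
    if not pops:
        return None
    return min(pops, key=lambda pop: distances[pop])

# One client's (hypothetical) distance to each PoP.
client = {"ams": 10, "fra": 15, "iad": 90}

# Both nameserver addresses are anycast from every PoP.
announcements = {
    "ns1": {"ams", "fra", "iad"},
    "ns2": {"ams", "fra", "iad"},
}

# Case (a): ams is unhealthy and answers SERVFAIL, so the client retries ns2.
first_try = route(client, announcements, "ns1")
retry = route(client, announcements, "ns2")

# Alternative: ams withdraws its announcements instead of answering.
for pops in announcements.values():
    pops.discard("ams")
after_withdraw = route(client, announcements, "ns1")

print(first_try, retry, after_withdraw)
```

In this model the retry against ns2 lands on the same unhealthy PoP
(ams) as the first query, while after the withdrawal the same address
is served by the next closest PoP (fra).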
It may be that Facebook uses all four name server IP addresses
in each edge node. But that effectively kills the essential redundancy
DNS gets from having two or more name servers (at separate locations),
and the natural consequence is, as you can see, mass disaster.
Even if the four anycasted nameserver IP addresses weren't
completely overlapping (let's assume as a hypothetical that
a.ns is served out of EU pops, b.ns is served out of NA pops,
c.ns is served out of SA pops, and d.ns is served out of APAC
pops), if all sites run the same healthcheck code, then if the
underlying healthcheck fails, *every site* will decide it is
unhealthy, and stop answering requests; so, all the EU sites
fail health check and stop serving a.ns; all the North America
sites fail health check, and stop serving b.ns...and so forth.
You followed the best practices, you had different NS entries
that were on different subnets, that were geographically
dispersed around the globe, that were redundant for each
other. But because they all used the same fundamental
health check, they all *independently* decided they were
unhealthy and needed to stop giving out DNS answers,
and instead let one of the other healthier sites take over.
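
The same point as a small sketch, assuming the health check boils down
to "can this site reach the core?". The site names and the
answering() helper are hypothetical, and the a/b/c/d split is the
hypothetical one from above, not a description of the real deployment:

```python
# Hypothetical layout: each NS name is served from sites in one region.
NS_SITES = {
    "a.ns": ["eu-1", "eu-2"],
    "b.ns": ["na-1", "na-2"],
    "c.ns": ["sa-1"],
    "d.ns": ["apac-1"],
}

def healthy(site, cut_off):
    """Every site runs the same check: can it still reach the core?"""
    return site not in cut_off

def answering(ns_sites, cut_off):
    """Map each NS name to the sites still willing to answer for it."""
    return {
        ns: [site for site in sites if healthy(site, cut_off)]
        for ns, sites in ns_sites.items()
    }

all_sites = {site for sites in NS_SITES.values() for site in sites}

# One site loses the core: its neighbours keep answering for a.ns.
local_failure = answering(NS_SITES, {"eu-1"})

# The core itself disappears: every site fails the same check at once.
global_failure = answering(NS_SITES, all_sites)

print(local_failure)
print(global_failure)
```

With a single site cut off, every NS name still has at least one site
answering. When the thing being checked is itself gone, every list
comes back empty, even though each site made its decision
independently.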
> but: 1) dns server in pop serves some content (ttls aren't
> important right now)
You MUST distinguish TTL and EXPIRE. They are different.
TTL and EXPIRE are irrelevant here.
The only thing changing those values would do is change
how long it took for caching resolvers to reflect the loss of
connectivity at the DNS layer. Once the underlying layer 3
connectivity had broken, DNS answers became meaningless.
No matter what records were returned, or cached, you couldn't
reach the servers.
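
A minimal sketch of why the timer values don't change the outcome,
assuming a resolver cache keyed by name and a set standing in for
layer 3 reachability. The name, address, and TTL are placeholders:

```python
def resolve(cache, name, now):
    """Return the cached address while its TTL has not run out, else None."""
    entry = cache.get(name)
    if entry is None:
        return None
    address, stored_at, ttl = entry
    return address if now < stored_at + ttl else None

def can_connect(address, reachable):
    """Layer 3 reachability, independent of anything DNS says."""
    return address in reachable

# Placeholder record: cached at t=0 with a 300 second TTL.
cache = {"www.example.com": ("192.0.2.10", 0, 300)}
reachable = set()  # routing is broken: no address is reachable

results = []
for now in (100, 400):
    address = resolve(cache, "www.example.com", now)
    usable = address is not None and can_connect(address, reachable)
    results.append((now, address, usable))

print(results)
```

Before the TTL runs out the resolver still hands back an address, and
afterwards it has nothing, but in both cases the client cannot reach
the server. A different TTL only moves the point where the answer
changes from stale to absent.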
Yes, yes, as an academic exercise you can point out that
there's a difference in how and when those DNS records
stop being used, and you're right about that--but in terms
of this particular failure, this particular post-mortem we're
beating to a horse-shaped pulp, it's entirely meaningless. ^_^;
> there's not a lot of magic here... and it's not about the zone data
> really at all.
Statement of Petach: "the edge CDN node loses connectivity to
the core datacenters, the DNS servers should stop answering"
means, in DNS terminology, that the zone data is expired, which has
nothing to do with TTL.
As you're using my words, I'm going to have to point out that
"the DNS servers should stop answering" does not require that
any change happens *at the DNS layer* -- in this case, the
change can happen at the routing layer, ensuring that even
if some caching resolver out there is completely defiant of
your expire time, you *will not answer* because the query
packets can never reach you in the first place.
Masataka Ohta
Thanks!
Matt