On Wed, Oct 6, 2021 at 10:45 AM Michael Thomas <mike@mtcc.com> wrote:
So if I understand their post correctly, their DNS servers have the
ability to withdraw routes that they determine are sub-optimal (fsvo). I
can certainly understand the DNS servers not giving out answers they
think are unreachable, but there is always the possibility that the DNS
servers themselves are partitioned, and not the routes. At a minimum, I
would think they'd need some consensus protocol across multiple servers
before declaring something broken.

But I just don't understand why this is a good idea at all. Network
topology is not DNS's bailiwick so using it as a trigger to withdraw
routes seems really strange and fraught with unintended consequences.
Why is it a good idea to withdraw the route if it doesn't seem reachable
from the DNS server? Give answers that are reachable, sure, but to
actually make a topology decision? Yikes. And what happens to the cached
answers that still point to the supposedly dead route? They're going to
fail until the TTL expires anyway, so why is it preferable to withdraw the
route too?

My guess is that their post, while clearer than most, still doesn't go
into enough detail. But is it just me, or does this seem like a really
weird thing to do?

Mike


Hi Mike,

You're kinda thinking about this from the wrong angle.

It's not that the route is withdrawn if it doesn't seem reachable 
from the DNS server.

It's that your DNS server is geolocating requests to the nearest 
content delivery cluster, where the CDN cluster is likely fetching 
content from a core datacenter elsewhere.  You don't want that 
remote/edge CDN node to give back A records for a CDN node 
that is isolated from the rest of the network and can't reach the 
datacenter to fetch the necessary content; otherwise, you'll have 
clients that reach the page and can load its static elements, but 
all the dynamic elements hang, waiting on a fetch from the origin 
that will never complete.  Not a very good end user experience.
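To make the DNS-side decision concrete, here's a toy sketch of the logic described above. All the names and addresses are my own hypothetical illustrations (documentation prefixes from RFC 5737), not anything from the actual deployment:

```python
# Toy model: an edge DNS server only hands out its local CDN cluster's
# address while that cluster can still reach the core datacenter.
# Addresses are hypothetical (RFC 5737 documentation ranges).

LOCAL_CDN_ADDR = "192.0.2.10"       # this edge cluster's service address
FALLBACK_ADDRS = ["198.51.100.10"]  # some other, presumably healthy cluster

def answer_a_query(origin_reachable: bool) -> list:
    """Return the A records to hand back for a content hostname.

    If this edge cluster can reach the origin datacenter, it's safe
    to direct clients here; otherwise, steer them elsewhere so their
    dynamic fetches don't hang forever.
    """
    if origin_reachable:
        return [LOCAL_CDN_ADDR]
    return FALLBACK_ADDRS
```

A healthy edge answers with itself; an isolated edge stops volunteering its own address.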

So, the idea is that if the edge CDN node loses connectivity to 
the core datacenters, the DNS servers should stop answering 
queries for A records with the local CDN node's address, and 
let a different site respond back to the client's DNS request.
In particular, you really don't want the client to even send the 
request to the edge CDN node that's been isolated, you want 
to allow anycast to find the next-best edge site; so, once the 
DNS servers fail the "can-I-reach-my-datacenter" health check, 
they stop announcing the Anycast service address to the local 
routers; that way, they drop out of the Anycast pool, and normal 
Internet routing will ensure the client DNS requests are now sent 
to the next-nearest edge CDN cluster for resolution and retrieving 
data.
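The "can-I-reach-my-datacenter" gate driving the anycast announcement can be sketched as a couple of small functions. This is purely illustrative: the prefix, probe counts, and command strings are my assumptions, though in practice sites often drive a BGP speaker such as ExaBGP with announce/withdraw lines like these:

```python
# Sketch of the health check that controls whether this site keeps
# announcing the anycast DNS service prefix to its local routers.
# Prefix and thresholds are hypothetical.

ANYCAST_PREFIX = "203.0.113.0/24"  # hypothetical anycast service prefix

def evaluate_health(probe_results: list, min_ok: int = 2) -> bool:
    """Require at least `min_ok` successful probes to the datacenter,
    so one flaky probe doesn't flap the announcement."""
    return sum(probe_results) >= min_ok

def bgp_command(healthy: bool) -> str:
    """Emit an ExaBGP-style command: announce while healthy,
    withdraw once the datacenter is unreachable."""
    if healthy:
        return f"announce route {ANYCAST_PREFIX} next-hop self"
    return f"withdraw route {ANYCAST_PREFIX} next-hop self"
```

Once the withdraw goes out, the site drops out of the anycast pool and ordinary BGP path selection carries client DNS queries to the next-nearest edge cluster.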

This works fine for ensuring that one or two edge sites that get 
isolated due to fiber cuts don't end up pulling client requests into 
them, and subsequently leaving the users hanging, waiting for 
data that will never arrive.

However, it fails big-time if *all* sites fail their "can-I-reach-the-datacenter" 
check simultaneously.  When I was involved in the decision making 
on a design like this, a choice was made to have a set of "really core" 
sites in the middle of the network always announce the anycast prefixes, 
as a fallback, so even if the routing wasn't optimal to reach them, the 
users would still get *some* level of reply back. 
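That fallback amounts to one extra condition in the announce decision. A toy sketch, where the notion of a "core" anchor site is my own shorthand for the design described above:

```python
# Core "anchor" sites announce the anycast prefix unconditionally,
# as a last-resort fallback; edge sites announce only while their
# datacenter health check passes.

def should_announce(site_is_core: bool, datacenter_reachable: bool) -> bool:
    return site_is_core or datacenter_reachable
```

So even if every edge site fails its health check at once, the anchor sites keep the prefix reachable, at the cost of less optimal routing.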

In this situation, that would have ensured that at least some DNS 
servers were reachable; but it wouldn't have fixed the "oh crap we 
pushed 'no router bgp' out to all the routers at the same time" type 
problem.  But that isn't really the core of your question, so we'll 
just quietly push that aside for now.   ^_^;

Point being--it's useful and normal for edge sites that may become 
isolated from the rest of the network to be configured to stop announcing 
the Anycast service address for DNS to local peers and transit 
providers at that site while they are isolated, to prevent users 
from being directed to CDN servers which can't fetch content from 
the origin servers in the datacenter.  It's just generally assumed 
that not every site will become "isolated" at the same time 
like that.   :)

I hope this helps clear up the confusion.

Thanks!

Matt