
On Fri, Aug 8, 2025 at 2:24 AM Saku Ytti via NANOG <nanog@lists.nanog.org> wrote:
> On Fri, 8 Aug 2025 at 12:19, Måns Nilsson via NANOG <nanog@lists.nanog.org> wrote:
>> my one advice on anycast is to make _certain_ that the routing reflects service availability on individual nodes -- i.e a node that can't answer queries MUST stop advertising the resolver /128 (or /32 if you have that).
>
> If you do this in a single ASN, where you can guarantee preferences are honored, then instead of pulling advertisement, deprefer it.
> Eventually you will manage to cause an issue, where all advertisements are falsely pulled.
> Same strategy works in any domain where you are testing if something works, like default route by pinging 8.8.8.8, don't pull, depref.

Having been bitten by this in the past...never base your determination of "healthy" or "working" on a single external data reference.

It can be tempting to just assume 8.8.8.8 will always be "up" and "pingable" to verify your internet connectivity is good...right up to the point where Google has a routing snafu, and your DNS infrastructure goes into cascading failure as every one of your sites begins depreferencing its announcements based on the failure of the external health check, and the load begins shifting to a smaller and smaller number of serving sites that were slower at detecting and depreferencing their route announcements, often to the point where the final site is so overwhelmed by all the traffic slamming it that it can't perform healthcheck/depreferencing anymore.

Always have at least 3 external probe destinations or health check sites, operated by different entities, and only depreference upon failure to reach 3/3 or 2/3. Do not make decisions about the health of your network based upon the health of a single external entity (unless they are your only upstream provider, or you otherwise share fate with them).

If you're pinging someone else to make sure the internet is still alive, ping several, like 8.8.8.8, 1.1.1.1, and 9.9.9.9, and don't react unless you see failures to reach multiple of them. Otherwise, it's likely to be their failure, not yours, and there's no reason to make things worse by changing your systems based on their problems.

...so many painful lessons learned the hard way over the years... ^_^;

Matt
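
For anyone who wants the "2 of 3" rule spelled out, here is a rough Python sketch of the decision logic only. It is an illustration, not code from anyone's production setup: the names PROBE_TARGETS, ping_once and should_depref are made up for this example, the ping flags assume Linux iputils, and wiring the result into whatever actually lowers the preference of your announcement is left out on purpose.

```python
import subprocess
from typing import Callable, Iterable

# Probe destinations run by different operators (the three named in the thread).
PROBE_TARGETS = ("8.8.8.8", "1.1.1.1", "9.9.9.9")


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to host succeeds.

    Assumption: Linux iputils ping, where -c is the packet count and
    -W is the reply timeout in seconds.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (OSError, subprocess.TimeoutExpired):
        # No ping binary, or ping itself hung: count it as a failed probe.
        return False
    return result.returncode == 0


def should_depref(
    targets: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
    min_failures: int = 2,
) -> bool:
    """Return True only when at least min_failures targets are unreachable.

    With three targets and min_failures=2, one operator having a bad day
    is ignored; two or three failing at once points at our own side.
    """
    failures = sum(1 for target in targets if not probe(target))
    return failures >= min_failures


if __name__ == "__main__":
    if should_depref(PROBE_TARGETS):
        print("multiple probes failed: depref the announcement")
    else:
        print("healthy enough: leave the announcement alone")
```

The probe function is passed in as a parameter so the quorum logic can be exercised without touching the network, and so ping can be swapped for a DNS query or an HTTP check against the same set of independent operators.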