Recommended DNS server for a medium-sized 20-30k-user ISP

Hello all, do you have any recommendations for recursive DNS servers for a medium-sized (20-30k users) ISP? We have used PowerDNS and Unbound, but sometimes find the cache response times a bit on the high side. Any suggestions between these two, or anything new? We would also appreciate pointers on how much to tune the settings, and the pros and cons if any. Thank you /DP

We find SimpleDNSPlus (https://simpledns.plus/) scales quite well. It runs under Windows, but as long as you operate behind a good firewall, it's not a security issue. It can also host zones, so it can double as a recursive name server and as a primary or secondary name server for your hosted domains. It includes a nice graphical display of performance metrics. -mel via cell

For millions of customers I would use Knot Resolver: https://www.knot-resolver.cz/ - Andrew "lathama" Latham

Maybe I'm naive here, but would ISC BIND not be a reasonably good choice? -Rusty

We ran BIND on Linux boxes for decades, but just got tired of the tedious maintenance tasks and error-prone zone file maintenance. SimpleDNSPlus has a great GUI, simple software-update procedures, excellent real-time performance monitoring, and paid technical support. The pointless tedium of bare-bones BIND doesn't make a lot of sense for resilient infrastructure. SimpleDNSPlus is just one of many packaged DNS products; you choose one based on the features and scalability you need. -mel via cell

BrbOS: https://brbytelatam.com/brbos

Webmin is a GUI for BIND if you're managing a zone. For 30k users it sounds more like a caching server, which BIND/named would be a great option for.

On Thu, 2025-08-07 at 22:02 -0400, Josh Luthman via NANOG wrote:
Webmin is a gui for bind if you're managing a zone.
For 30k users it sounds more like a cache server, which bind/named would be a great option for.
You can of course spread the load out across several servers for redundancy. DNS clients typically round-robin requests between servers. On Linux boxes, you can also set up client caches, which will reduce requests across the network; Windows and macOS clients have a similar caching service. BIND caches responses from remote zones as well. I am not familiar with other DNS servers, but I bet they all have similar functionality. It may just come down to ease of administration. -- Smoot Carl-Mitchell System/Network Architect voice: +1 480 922-7313 cell: +1 602 421-9005 smoot@tic.com
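On Linux, the client-side cache mentioned above can be as simple as pointing clients at the local systemd-resolved stub. A minimal sketch; the upstream addresses are documentation placeholders standing in for the ISP's resolvers:

```
# /etc/systemd/resolved.conf -- hedged sketch; 203.0.113.x are placeholders
[Resolve]
DNS=203.0.113.10 203.0.113.11
Cache=yes
```

Clients then query the 127.0.0.53 stub listener, and repeated lookups are answered locally without crossing the network.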

It's surprising that you didn't get the performance you hoped for out of PowerDNS. You already tried the suggestions in their tuning guide[0], I'm assuming? You may also want to load entire zones into the hot cache[1]. And there's always horizontal scaling; sometimes you just plain hit limits on vertical scale. I haven't tried it yet, but dnsdist[2] should let you do this. (Or keepalived and/or HAProxy, or any load balancer that can handle raw TCP and UDP.) dnsdist in particular seems explicitly targeted toward a large set of untrusted clients, with additional optional "safeguarding/consumer protection" features. Quad9 uses it in some fashion, if I recall correctly. [0] https://doc.powerdns.com/recursor/performance.html [1] https://docs.powerdns.com/recursor/lua-config/ztc.html [2] https://www.dnsdist.org/index.html
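The hot-cache loading referenced above ("zone to cache") is a one-liner in the Recursor's Lua configuration. A hedged sketch, assuming you want the root zone preloaded:

```lua
-- PowerDNS Recursor lua-config sketch: fetch the root zone over HTTPS
-- into the record cache, refreshed according to the zone's SOA timers.
zoneToCache(".", "url", "https://www.internic.net/domain/root.zone")
```

This goes in the file named by the Recursor's lua-config-file setting; the same call works for any zone you can fetch by AXFR, URL, or local file.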

You beat me to it - dnsdist is an exceptionally robust solution for front-ending recursive (or authoritative) servers. Quad9 is indeed using it for all our recursive systems, and we split traffic on the back end between PowerDNS Recursor and Unbound. dnsdist has a packet-cache feature which handles much of the load once warmed, and it answers on DoT/DoH as well as providing a very rich set of tooling for managing unwanted behaviors.

The combination of dnsdist plus a good recursive resolver should easily handle 30k users on a single modest chassis, though of course there are very good reasons to have several similarly configured systems in fail-over models using ECMP or your favorite routing protocol. (Hot caches work better - try not to spread load too much.)

At this point, I can't imagine running a recursive system that is open to anything other than a tiny number of users without putting dnsdist in front of it - it's exactly the right thing and has been sandblasted by a lot of trial and error to make it fast and reliable, with lots of features for ISP environments. If a decent-sized system doesn't seem fast, there may be some other underlying issue at the root of the perceived speed problem. There is useful data that can be pulled out of dnsdist with Prometheus-style outputs - I would suggest instrumenting things and seeing where the problems are.

Now, the original question of "points on how much we tune the settings" - that is a much longer discussion, but honestly you can get to 80% goodput without too much fiddling. JT
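The split-backend-plus-packet-cache setup described here can be sketched in a few lines of dnsdist's Lua configuration. All addresses, ports, and certificate paths below are placeholders, not anyone's production config:

```lua
-- dnsdist config sketch: listen for clients, cache answers, and spread
-- the remaining load across a PowerDNS Recursor and an Unbound backend.
setLocal("198.51.100.53:53")                        -- client-facing UDP/TCP
addTLSLocal("198.51.100.53:853",
            "/etc/dnsdist/cert.pem", "/etc/dnsdist/key.pem")  -- DoT listener

newServer({address = "127.0.0.1:5301", name = "pdns-recursor"})
newServer({address = "127.0.0.1:5302", name = "unbound"})
setServerPolicy(leastOutstanding)                   -- pick the least-loaded backend

pc = newPacketCache(1000000, {maxTTL = 86400})      -- answer repeats from memory
getPool(""):setCache(pc)
```

Once the packet cache is warm, most repeat queries never reach the resolvers at all, which is where much of the single-chassis headroom comes from.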

Not a lot of detail on your needs, but you may consider just providing service through one of the very big DNS providers. The expense of building, managing, and supporting your own infrastructure is not insignificant. You may also be able to offer add-on services through a big provider that would be difficult to roll on your own, like security features, safe search, parental controls, etc.

*NEVER* use an off-net resolving DNS server for an ISP. ----- Mike Hammett Intelligent Computing Solutions http://www.ics-il.com Midwest-IX http://www.midwest-ix.com

On Fri, Aug 8, 2025, 00:41 John Todd <jtodd@loligo.com> wrote:
Thanks, John! I was considering evaluating and deploying dnsdist for our own customers, and this has me convinced that's a solid direction; if it works well for y'all at Quad9, it'd definitely work for us. Cheers!

On Aug 7, 2025, at 9:41 PM, John Todd via NANOG <nanog@lists.nanog.org> wrote:
we split traffic on the "back-end" between PowerDNS recursor and Unbound
Using multiple products is definitely best practice. At my company, half of our (anycasted) authoritative DNS servers run BIND and the other half run PowerDNS. If you don't do this, you can be vulnerable to something like CVE-2025-40775, where an attacker can terminate all your DNS servers simultaneously by sending each a malicious packet. Or maybe there's some other bug in the software that makes it randomly crash at a certain time. If that happens, you want to make sure that only half of them go offline. -- Robert L Mathews

On 8 Aug 2025 at 00:44, DurgaPrasad - DatasoftComnet via NANOG wrote:
The university I worked at ran ISC BIND 9 on two machines as a combined recursive resolver and authoritative server. You can run several of them and assign them randomly to your customers if the load is too high. -- Regards, Marco Send unsolicited bulk mail to 1754606680muell@cartoonies.org
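A minimal BIND 9 recursive-resolver configuration in that spirit might look like the fragment below; the ACL prefixes and cache size are placeholders to adapt:

```
// named.conf sketch: recursion for customers only, with a bounded cache.
acl customers { 192.0.2.0/24; 2001:db8::/32; };

options {
    recursion yes;
    allow-recursion { customers; };
    allow-query { customers; };
    max-cache-size 2048M;   // cap the resolver cache
};
```

Restricting recursion to your own prefixes matters: an open resolver quickly becomes a reflection-attack amplifier.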

Howdy, For 30k users, a pair of BIND 9 servers will do just fine without any special performance tuning. Whether you use BIND 9 or any other DNS server software, the key things are that these should be bare metal, not virtual machines, and they should be dedicated to the DNS task. VMs or competing workloads introduce latency which will be perceptible in your DNS performance. You'll observe that the CPU is lightly used on these machines, and that's the result you want to see. This is true even if, for some reason, the bulk of your users do not use DoH to a public server for their web browsers' DNS lookups. On Thu, Aug 7, 2025 at 7:17 PM Smoot Carl-Mitchell via NANOG <nanog@lists.nanog.org> wrote:
DNS clients typically round robin requests between servers.
They do not. DNS resolvers may round-robin requests between authoritative servers, but clients usually talk to resolvers in the order configured. It's something to keep in mind if you want to spread the load between the DNS resolvers; 30k users is not enough for it to make much difference. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
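For stub resolvers that honor glibc's resolver options, the ordered-by-default behavior (and its override) looks like this; the addresses are placeholders:

```
# /etc/resolv.conf sketch -- glibc queries nameservers in listed order
# unless "rotate" is set; short timeouts keep failover snappy.
nameserver 203.0.113.10
nameserver 203.0.113.11
options rotate timeout:2 attempts:2
```

Without `rotate`, the second server sees traffic only when the first times out, which is exactly the uneven load Bill describes.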

Hi, We use PowerDNS Recursor together with dnsdist to handle millions of DNS requests per day for more than 100k users. In our experience, a small server such as one with an Intel E-22xx series CPU and 32 GB of RAM is sufficient for this setup. Beyond that, you only need to install dnsdist for load balancing and implement per-IP rate limiting. Best regards, David
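The per-IP rate limiting mentioned here is a one-liner in dnsdist's configuration; a hedged sketch, where the 50 qps threshold is an arbitrary example value:

```lua
-- dnsdist sketch: drop queries from any client source address that
-- exceeds roughly 50 queries per second.
addAction(MaxQPSIPRule(50), DropAction())
```

MaxQPSIPRule tracks per-source rates, so well-behaved customers are unaffected while a single abusive or infected host is contained.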

Anycast Unbound, preferably on something more mature than Linux, like FreeBSD or OpenBSD. The crucial part is _anycast_, so you don't have to pay protection money to the likes of HAProxy or F5 but can still have good service availability. The troublesome thing with resolver service is that clients tend to wait painfully long before they try No. 2 in the resolver list, so fast answers from the first one are kind of important.

My one piece of advice on anycast is to make _certain_ that the routing reflects service availability on individual nodes - i.e. a node that can't answer queries MUST stop advertising the resolver /128 (or /32 if you have that).

I have built this several times at various organisations. It is solid, as in "it just works". Also, since I made certain my resolvers speak IPv6, resolution is much snappier; authoritative DNS service has a very good v6 rollout status overall.

On tuning, you have a metric ton of options in Unbound - considerably more than in BIND. OTOH, since I learnt of Unbound I have avoided BIND for recursive service, so there might have been some evolution there. With that said, the people at CZ.NIC (Knot Resolver) are quite competent, so I would follow the advice given and look at their offering too. Of course you can run anycast with Knot Resolver as well.

-- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE SA0XLR +46 705 989668 Hmmm ... a PINHEAD, during an EARTHQUAKE, encounters an ALL-MIDGET FIDDLE ORCHESTRA ... ha ... ha ...
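The withdraw-on-failure rule can be sketched as a small health-check state machine. This is an illustrative Python sketch; the probe itself (e.g. a query against 127.0.0.1) and the hook that actually toggles the route advertisement are assumptions left to the reader's routing daemon:

```python
# Hedged sketch: a node that cannot answer queries must stop advertising
# the anycast prefix. Hysteresis avoids flapping on a single lost probe.

class AnycastHealth:
    def __init__(self, fail_threshold: int = 3, ok_threshold: int = 2):
        self.fail_threshold = fail_threshold  # withdraw after N failed probes
        self.ok_threshold = ok_threshold      # re-advertise after M good probes
        self.fails = 0
        self.oks = 0
        self.advertising = True

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the /128 should be advertised."""
        if probe_ok:
            self.oks += 1
            self.fails = 0
            if not self.advertising and self.oks >= self.ok_threshold:
                self.advertising = True
        else:
            self.fails += 1
            self.oks = 0
            if self.advertising and self.fails >= self.fail_threshold:
                self.advertising = False
        return self.advertising
```

A cron- or daemon-driven loop would probe the local resolver, feed `record()`, and enable or disable the routing daemon's anycast protocol accordingly.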

On Fri, 8 Aug 2025 at 12:19, Måns Nilsson via NANOG <nanog@lists.nanog.org> wrote:
my one advice on anycast is to make _certain_ that the routing reflects service availability on individual nodes -- i.e a node that can't answer queries MUST stop advertising the resolver /128 (or /32 if you have that).
If you do this in a single ASN, where you can guarantee that preferences are honored, then instead of pulling the advertisement, deprefer it. Eventually you will manage to cause an issue where all advertisements are falsely pulled. The same strategy works in any domain where you are testing whether something works, like conditioning a default route on pinging 8.8.8.8: don't pull, depref. -- ++ytti
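The depref-instead-of-pull idea might be sketched in a bird2 export filter like the one below; the prefix, LOCAL_PREF values, and the DEGRADED constant (which a health checker would rewrite before reconfiguring the daemon) are all assumptions for illustration:

```
# Hedged bird2 sketch of "depref, don't pull": the route stays advertised
# even when degraded, so a buggy health check cannot blackhole everything.
define DEGRADED = false;

filter anycast_export {
    if net = 192.0.2.53/32 then {
        if DEGRADED then bgp_local_pref = 50;   # still present, just less preferred
        else bgp_local_pref = 200;
        accept;
    }
    reject;
}
```

If every node's checker falsely reports failure, all nodes end up depreferred equally and service continues, which is the failure mode Saku is optimizing for.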

Eventually you will manage to cause an issue, where all advertisements are falsely pulled.
Good advice. -- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE SA0XLR +46 705 989668 Do you like "TENDER VITTLES"?

Saku Ytti via NANOG wrote on 08/08/2025 10:23:
Eventually you will manage to cause an issue, where all advertisements are falsely pulled.
Someone up-thread mentioned firewalling DNS servers. Withdrawing DNS service workers due to firewall state overloading can cause a cascading service failure that takes out an entire DNS infrastructure within milliseconds. Don't ask me how I know this. It also, obviously, works when n=1. tl;dr: packet filters only for DNS, preferably in hardware. Don't ever use state tracking. Nick
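In the spirit of "packet filters only, no state tracking", a stateless nftables sketch might look like this; the customer prefix is a placeholder and the rules live at raw priority so DNS never enters conntrack:

```
# nftables sketch: filter DNS statelessly and keep it out of conntrack.
table inet dns_filter {
    chain pre {
        type filter hook prerouting priority raw; policy accept;
        udp dport 53 notrack       # never create conntrack state for DNS
        tcp dport 53 notrack
        udp dport 53 ip saddr != 192.0.2.0/24 drop   # customers only
    }
}
```

Because no per-flow state is kept, a query flood can exhaust CPU at worst, never a state table, which removes the cascading-failure mode described above.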

Nick Hilliard said: "Withdrawing DNS service workers due to firewall state overloading can cause cascading service failure which can take out an entire DNS infrastructure within milliseconds. Don't ask me how I know this. Also obviously works when n=1. tl;dr: packet filters only for DNS, preferably in hardware. Don't ever use state tracking."

Nick, Appropriately sized, HA firewall pairs mitigate this pretty handily. In my opinion, the days of not firewalling critical infrastructure are pretty much over. There are just too many potential vulnerabilities to expect packet filters alone to address them. If necessary, you can use multiple segregated firewalled networks for redundancy to mitigate cascading service failures. -mel

We do do it. No problems in ten years. We upgrade the firewalls to cheaper, faster, more reliable models every few years. In the meantime, DNS traffic has actually declined, probably due to DoH. I'm happy to hear your war stories 🙂 -mel

On Aug 8, 2025, Nick Hilliard <nick@foobar.org> replied, quoting Mel Beckman:
Appropriately sized, HA firewall pairs mitigate this pretty handily.
Mel, Please don't let me stop you from doing this. The failure modes are really quite entertaining, at least from a distance. Anyone got popcorn? Nick

Yeah. As a person who in my $dailyjob builds hardware firewalls (so-called NGFWs, but also "SP class" boxes), I can assure you properly configured DNS servers can absolutely defend themselves. If they need protection, you're doing it wrong. And there are design choices (load balancers, ECMP/UCMP, anycast) that make these designs scale and switch over without any problems, if additional "capabilities" don't get in their way.

Adding a stateful firewall in front of them is a waste of good hardware. Moreover, if you insist on doing so, you'll likely suffer from state exhaustion or self-DDoS at some point. That typically leads you to blame the firewall vendor, and not your poor thinking, design, or planning skills. Don't do that. KISS is decent design practice. Doing "tricks" with a firewall may be relevant to Enterprise types of deployment, where "fusing" DNS info with other pieces (identity, data plane telemetry, etc.) is typically an element of your security architecture (and defense).

What is way more useful for layered defence is applying QoS on the upstream switch/router, if it is enforced in hardware. "QoS" as expressed in maximum packets/second (which are roughly requests), not as in bits/second (which is pretty useless). That is, if you do know the rough levels beyond which your server behaves in a less stable/predictable way. This is hardly unique or innovative though.

I did deploy myself, and helped others to deploy, FreeBSD-based BIND and nsd+unbound anycasted DNS servers. The biggest one (two pairs of Xeon-based servers) was handling requests from ~3 million users while mostly idling, last time I checked. And that was a couple of years ago. I know it's still in production and handling "more". The only firewall they have is pf with a pretty generic set of rules to drop host attacks and protect management access; DNS traffic is unfiltered, as filtering it doesn't make any sense. -- ./

Subject: Re: Recommended DNS server for a medium 20-30k users isp Date: Fri, Aug 08, 2025 at 05:19:39PM +0100 Quoting Nick Hilliard via NANOG (nanog@lists.nanog.org):
Mel Beckman wrote on 08/08/2025 17:08:
Appropriately sized, HA firewall pairs mitigate this pretty handily.
Mel,
Please don't let me stop you from doing this. The failure modes are really quite entertaining, at least from a distance. Anyone got popcorn?
I suppose you bring the beer then, because it's going to take both to endure the cringefest that is "cascading resource exhaustion in DNS / firewall setup" -- it can pretty fast end up snowballing completely out of hand. Don't ask me how I know without picking up the bar tab. /Måns -- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE SA0XLR +46 705 989668 Am I accompanied by a PARENT or GUARDIAN?

Sheesh! People claim firewalling DNS is bad, but hide the receipts behind "pay my bar tab" evasion. Here's the real bar talk: put up or shut up. LOL! Data or it never happened. -mel via cell

On Sat, 9 Aug 2025 at 15:42, Måns Nilsson via NANOG <nanog@lists.nanog.org> wrote:
I suppose you bring the beer then, because it's going to take both to endure the cringefest that is "cascading resource exhaustion in DNS / firewall setup" -- it can pretty fast end up snowballing completely out of hand. Don't ask me how I know without picking up the bar tab.
I can share lessons from personal mistakes.

a) A FW is always an additional fuse in front of the service; the failure modes are the union of FW and service, so MTBF is lower and MTTR is higher:
- state establishment rate is reduced
- state count is reduced
- either the FW has protocol intelligence and occasionally (as protocols evolve or more exotic use cases exist) drops valid protocol packets, or it is protocol-unintelligent and adds nothing over a stateless HW-based filter on the edge router
- any service protected by a FW is easier to DoS than the same service without one

b) Even if a FW is run (like in front of a corporate LAN, which doesn't have to deal with denial-of-service issues, and where a regulator, PCI, or equivalent may require a FW), the valid configurations in my mind are:
- if 2 == cluster, 1 == single, and + == routing separation
- 1, 1+1, and 2+1 are valid configurations
- 2 and 2+2 are invalid configurations
- every time I've run '2', eventually there has been a case where the cluster is dead and MTTR is high, as the vendor needs to be engaged, and depending on the hour the people at the vendor who can actually troubleshoot the issue are not at work (used to be US hours; now increasingly the experts are on India time)
- so if you can only afford 2 devices, have two devices separated by routing; you'll lose state during a failure, but you'll have fewer failures. Even if you can afford 4 devices, don't buy two clusters, since the problem that breaks one cluster may affect both clusters

Generally a FW is needed if what is behind it has dubious and unknown state (like a user LAN). But if what is behind the FW is a well-thought-out DNS or HTTP service, the FW adds no utility and a lot of liability. -- ++ytti

Saku, Thanks for the well-delineated examples. I agree with them. You clearly illustrate wrong configurations that can cause unanticipated failure modes. Thus it's best to follow established design patterns, rather than cooking your own recipe.

But how is this different than using a firewall to protect any other service? Firewalls can fail, and thus require resiliency considerations. But they also can do a lot to insulate underlying services from attacks — source IP flooding, for example, or the myriad of sequence attacks — the kinds of attacks that are difficult to protect against in the pure IP stack.

I submit that one major firewall advantage is consistency of implementation. People who are protecting their DNS by cleverly hardening them using packet filters and load balancing are doing so with error-prone manual methods. Human error, as HAL says, is always a problem. Firewall code, on the other hand, goes through certification processes and deep regression testing before being deployed. Firewall developers are dedicated to the protection mission, while people standing up DNS at many enterprises, including ISPs, are not DNS experts. DNS is just one of many services they must manage.

I appreciate your anecdotes, but as every good scientist knows, the plural of anecdote is not data. I need to see some data backing up these claims about the relative unreliability of firewalls. -mel

Firewalls have a long history of breaking DNS. They have been known to:

Throw away UDP fragments. This breaks responses that exceed the path MTU. There is a myth that IPv6 doesn't have fragments so they can just be blocked, which makes IPv6 particularly bad in this respect.

Drop ICMP PTB. This breaks PMTU discovery, which partially affects IPv6 UDP responses getting through, as the sender needs to fragment. It also stops TCP responses where the MSS and PMTU don't align. MSS fix-up wouldn't be needed if ICMP PTB weren't blocked and were consistently generated.

Filter out every query type but a handful that are magically blessed. The firewalls doing this are often years behind the current query mix, and DNS servers don't need this service anyway. DNS servers know how to return "this record does not exist." Additionally, if you have added the record to the zone, you don't need a firewall second-guessing your desires.

Block DNS over TCP. DNS has ALWAYS used both UDP and TCP for normal queries. There have been plenty of times where UDP responses have said "retry over TCP" because the answer is too big, only for the TCP request to be blocked because of the myth that DNS is UDP-only.

Run out of state tracking. Recursive servers make hundreds of queries per incoming query when their caches are empty. We've seen connection tracking tables overwhelmed often.

Be stupid firewalls that "know" that this bit is 0, or that this type never appears in this section, or that there aren't any EDNS options in requests, or that drop requests with unknown EDNS options. Nameservers have rules for dealing with the unknown, and they are infinitely better than dropping the request.

I'm sure there are other stupidities I've seen firewalls do. Juniper were particularly bad until we complained enough to get the defaults changed. -- Mark Andrews
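Mark's point that DNS has always used both UDP and TCP can be sketched with a minimal, stdlib-only resolver stub: send the query over UDP, and if the response comes back with the TC (truncated) bit set, retry over TCP with the two-byte length prefix that RFC 1035 requires. This is an illustrative sketch, not production code; the resolver address 9.9.9.9 is just an example.

```python
import socket
import struct

def build_query(name, qtype=1, txid=0x1234):
    """Minimal DNS query (RD=1) for `name`; qtype 1 = A record."""
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + qname + b"\x00" + struct.pack(">HH", qtype, 1)

def is_truncated(response):
    """True if the TC bit (0x0200 in the DNS flags word) is set."""
    flags = struct.unpack(">H", response[2:4])[0]
    return bool(flags & 0x0200)

def resolve(name, server="9.9.9.9", timeout=2.0):
    """Query over UDP first; fall back to TCP when the answer is truncated."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
        udp.settimeout(timeout)
        udp.sendto(query, (server, 53))
        response, _ = udp.recvfrom(4096)
    if not is_truncated(response):
        return response
    # RFC 1035 section 4.2.2: TCP DNS messages carry a two-byte length prefix.
    with socket.create_connection((server, 53), timeout=timeout) as tcp:
        tcp.sendall(struct.pack(">H", len(query)) + query)
        size = struct.unpack(">H", tcp.recv(2))[0]
        buf = b""
        while len(buf) < size:
            buf += tcp.recv(size - len(buf))
        return buf
```

A firewall that blocks TCP/53 silently breaks the fallback branch above, which is exactly the failure mode being described.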

But how is this different than using a firewall to protect any other service? Firewalls can fail, and thus require resiliency considerations. But they also can do a lot to insulate underlying services from attacks — source IP flooding, for example, or the myriad of sequence attacks — the kinds of attacks that are difficult to protect against in the pure IP stack.

For a DNS service, a few network ACLs upstream, combined with very standard protection mechanisms on the host, is more than sufficient.

I submit that one major firewall advantage is consistency of implementation. People who are protecting their DNS by cleverly hardening them using packet filters and load balancing are doing so with error-prone manual methods.

It has been trivial for many years now to manage entire fleets of servers with automation and tooling to maintain consistent configurations.

Firewall code, on the other hand, goes through certification processes and deep regression testing before being deployed. Firewall developers are dedicated to the protection mission, while people standing up DNS at many enterprises, including ISPs, are not DNS experts. DNS is just one of many services they must manage.

The large firewall vendors do way less testing than you are assuming here. Some are better than others, but there isn't a single one that is releasing excellent quality code on a timely basis. It's also fair to say that most enterprises aren't doing their own extensive testing of firewall code and operation themselves before deploying.

I need to see some data backing up these claims about the relative unreliability of firewalls.

I'm not sure anyone is making the claim that firewalls are unreliable. The statement is that putting stateful firewalls in front of a DNS service can cause said DNS service to become unreliable, because of the way stateful firewalls function and the nature of DNS traffic and operation.

On Sat, Aug 9, 2025 at 1:46 PM Mel Beckman via NANOG <nanog@lists.nanog.org> wrote:

On Fri, Aug 8, 2025 at 2:24 AM Saku Ytti via NANOG <nanog@lists.nanog.org> wrote:
On Fri, 8 Aug 2025 at 12:19, Måns Nilsson via NANOG <nanog@lists.nanog.org> wrote:
my one advice on anycast is to make _certain_ that the routing reflects service availability on individual nodes -- i.e a node that can't answer queries MUST stop advertising the resolver /128 (or /32 if you have that).
If you do this in a single ASN, where you can guarantee preferences are honored, then instead of pulling advertisement, deprefer it.
Eventually you will manage to cause an issue, where all advertisements are falsely pulled.
Same strategy works in any domain where you are testing if something works, like default route by pinging 8.8.8.8, don't pull, depref.
Having been bitten by this in the past...never base your determination of "healthy" or "working" on a single external data reference. It can be tempting to just assume 8.8.8.8 will always be "up" and "pingable" to verify your internet connectivity is good...right up to the point where Google has a routing snafu, and your DNS infrastructure goes into cascading failure as every one of your sites begins depreferencing its announcements based on the failure of the external health check, and the load begins shifting to a smaller and smaller number of serving sites that were slower at detecting and depreferencing their route announcements, often to the point where the final site is so overwhelmed by all the traffic slamming it that it can't perform healthcheck/depreferencing anymore.

Always have at least 3 external probe destinations or health check sites, operated by different entities, and only depreference upon failure to reach 3/3 or 2/3. Do not make decisions about the health of your network based upon the health of a single external entity (unless they are your only upstream provider, or you otherwise share fate with them). If you're pinging someone else to make sure the internet is still alive, ping several, like 8.8.8.8, 1.1.1.1, and 9.9.9.9, and don't react unless you see failures to reach multiple of them. Otherwise, it's likely to be their failure, not yours, and there's no reason to make things worse by changing your systems based on their problems.

...so many painful lessons learned the hard way over the years... ^_^; Matt
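Matt's rule (at least three independently operated probe targets, and react only when a quorum of them fails) can be sketched in a few lines. This is a hedged sketch: the probe targets are examples from the thread, the Linux iputils `ping` flags are assumed, and `should_depref`/`connectivity_ok` are hypothetical names.

```python
import subprocess

# Example probe targets; pick several operated by different entities.
PROBE_TARGETS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]

def probe(target, timeout_s=2):
    """One ICMP probe via the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def should_depref(results, min_failures=2):
    """Depreference our anycast announcement only when at least
    `min_failures` of the independent probes fail; a single unreachable
    target is treated as their problem, not ours."""
    return sum(1 for ok in results if not ok) >= min_failures

def connectivity_ok():
    return not should_depref([probe(t) for t in PROBE_TARGETS])
```

With `min_failures=2` out of three targets, one provider's routing snafu (or ping ACL) never triggers a withdrawal on its own.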

On Mon, Aug 11, 2025 at 3:08 PM Matthew Petach via NANOG <nanog@lists.nanog.org> wrote:
often to the point where the final site is so overwhelmed by all the traffic slamming it that it can't perform healthcheck/depreferencing anymore.
Hi Matthew, The unix "nice" command helps in this situation. It's counterintuitive to run the critical Internet-facing service at a below-normal priority, but it works. Under normal load there's no difference in performance but when the server is overloaded administrative access and health checks have priority access to the CPU. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
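Bill's suggestion amounts to launching the daemon with `nice -n 10` (or equivalent). A minimal Python sketch of the same idea, with the hypothetical helper `start_deprioritized` and the niceness value chosen arbitrarily:

```python
import os
import subprocess

SERVICE_NICENESS = 10  # below-normal; health checks and admin shells stay at 0

def start_deprioritized(argv):
    """Start the daemon at below-normal CPU priority. Under normal load the
    niceness is invisible; under CPU saturation, the niceness-0 health
    checks and administrative sessions win the scheduler contention."""
    return subprocess.Popen(argv, preexec_fn=lambda: os.nice(SERVICE_NICENESS))
```

For example, `start_deprioritized(["unbound", "-d"])` would behave like `nice -n 10 unbound -d`.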

On Mon, Aug 11, 2025 at 3:40 PM William Herrin <bill@herrin.us> wrote:
On Mon, Aug 11, 2025 at 3:08 PM Matthew Petach via NANOG <nanog@lists.nanog.org> wrote:
often to the point where the final site is so overwhelmed by all the traffic slamming it that it can't perform healthcheck/depreferencing anymore.
Hi Matthew,
The unix "nice" command helps in this situation. It's counterintuitive to run the critical Internet-facing service at a below-normal priority, but it works. Under normal load there's no difference in performance but when the server is overloaded administrative access and health checks have priority access to the CPU.
Oh--I wasn't talking about the CPU having issues. I was talking about DDoSing your own site, with all the inbound worldwide traffic focusing in on the last remaining site, hammering the network links to the point of absolute congestion. At that point, trying to send update messages to depref the anycast routes for the site generally fails, leading to an extended outage as all the traffic gets stuck trying to reach that last site.

It's helpful to set a minimum number of anycast sites in your topology automation systems, such that sites will no longer remove themselves from rotation/distribution if doing so would reduce the count of active sites below the minimum required site count. Dynamic systems are great things, but as with most things in the world, "all things in moderation" is a good motto to keep in mind. Allow sites to dynamically adjust, but only within reasonably set bounds. Don't let too many sites decide they need to shed load at once; the first several, sure; but if the conditions continue, have a floor below which the system stops trying to react, and instead holds steady while paging a human to look at the bigger-picture problem, before the entire system goes offline due to the lemmings of automation all chasing one another off the proverbial cliff.

Fortunately for me, the search engine caches have long since purged the evidence of how some of these lessons were learned. ^_^;; Matt
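The minimum-site-count floor described above reduces to a small guard in the withdrawal logic. A sketch under stated assumptions: the floor value, the `may_withdraw` helper, and the `page_oncall` placeholder are all hypothetical names for illustration.

```python
MIN_ACTIVE_SITES = 3  # hypothetical floor for this anycast deployment

def page_oncall(message):
    """Placeholder: wire this to your real alerting system."""
    print(f"PAGE: {message}")

def may_withdraw(site, active_sites, min_active=MIN_ACTIVE_SITES):
    """Allow a site to withdraw its anycast route only while the count of
    active sites stays at or above the floor. At the floor, hold steady
    and page a human rather than automating the last sites off the air."""
    if site not in active_sites:
        return False  # already withdrawn; nothing to do
    if len(active_sites) - 1 < min_active:
        page_oncall(f"site {site} unhealthy but holding: at minimum site count")
        return False
    return True
```

The first few unhealthy sites shed load normally; once withdrawing would drop the system below the floor, the automation stops reacting and escalates instead.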

On Mon, Aug 11, 2025 at 6:16 PM Matthew Petach <mpetach@netflight.com> wrote:
Oh--I wasn't talking about the CPU having issues. I was talking about DDoSing your own site, with all the inbound traffic worldwide traffic focusing in on the last remaining site, hammering the network links to the point of absolute congestion. At that point, trying to send update messages to depref the anycast routes for the site generally fails, leading to an extended outage as all the traffic gets stuck trying to reach that last site.
Howdy. Why wouldn't the server itself be originating the announcement so that the high-pref route goes away when the routing session collapses?
It's helpful to set a minimum number of anycast sites in your topology automation systems, such that sites will no longer remove themselves from rotation/distribution if doing so would reduce the count of active sites below the minimum required site count.
Treading dangerous territory, since the participants can't necessarily know the difference between a site that's down and a site that's inaccessible to them (but not to other people). It might be safer for the system's components to intentionally collapse to the neutral routing preference at that point, rather than waiting for the failure cascade to push the system there. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/

On Mon, Aug 11, 2025 at 3:08 PM Matthew Petach via NANOG < nanog@lists.nanog.org> wrote:
Having been bitten by this in the past...never base your determination of "healthy" or "working" on a single external data reference. It can be tempting to just assume 8.8.8.8 will always be "up" and "pingable" to verify your internet connectivity is good...right up to the point where Google has a routing snafu
... No need for a routing snafu... 8.8.8.8 is currently getting a steady-state 27 Mpps (million packets/second) of ICMP ECHO_REQUEST. Internet connectivity checking is not a service we offer, and there is no SLA for it, therefore it may go away at any time. There is a very real risk of me running an April 1st experiment of "what would happen if I just ACL off all the pings?". I might have guessed I'd light up a couple dozen pagers and start a nanog@ flamewar... but if anyone is basing routing decisions on that, it will be a "fun" day indeed! Damian -- Damian Menscher :: Security Reliability Engineer :: Google :: AS15169

This here has always been my biggest concern with external monitoring. If the chosen site decides to deny ping one day, then your monitoring tool is broken. You can do a quick DNS lookup via a DNS server instead, since they shouldn't turn that off. But what happens when they notice the same site doing the same lookup(s) every x minutes?

In the past I've utilized the root DNS servers as a good measurement tool. The majority are anycast. All are dual-stack, so I get both IPv4 and IPv6 verification. If 60% of them are responding, we should be good. Again, this is load they aren't expecting, though I assume they know it is happening. I can rotate through doing a DNS lookup for .com, .net, .org, .gov, etc., so that I'm not doing the same thing over and over, and I'm utilizing something they are designed and prepared to handle.

David -- https://dprall.net On 8/11/2025 8:08 PM, Damian Menscher via NANOG wrote:
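David's approach (rotate the queried name across probes, spread the probes over the root servers, and call things healthy at a 60% response rate) can be sketched with the stdlib alone. Hedged sketch: the helper names are hypothetical, and the root server addresses listed are a small subset; take the full, current list from a root hints file rather than trusting these to stay accurate.

```python
import itertools
import socket
import struct

# Subset of root server IPv4 addresses, for illustration only.
ROOT_SERVERS = ["198.41.0.4", "192.33.4.12", "193.0.14.129"]

# Rotate the question so we're not repeating the exact same lookup.
QUERY_NAMES = itertools.cycle(["com.", "net.", "org.", "gov."])

def build_probe_query(name, txid=0x2222):
    """Minimal non-recursive DNS query for the NS records of `name`."""
    header = struct.pack(">HHHHHH", txid, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + qname + b"\x00" + struct.pack(">HH", 2, 1)  # NS, IN

def probe(server, name, timeout=2.0):
    """True if `server` answers a single UDP query within the timeout."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(build_probe_query(name), (server, 53))
            s.recvfrom(4096)
        return True
    except OSError:
        return False

def meets_threshold(results, threshold=0.6):
    """Healthy when at least `threshold` of the probes got answers."""
    return sum(results) / len(results) >= threshold

def connectivity_ok():
    return meets_threshold(
        [probe(srv, next(QUERY_NAMES)) for srv in ROOT_SERVERS]
    )
```

The 60% threshold tolerates individual root instances rate-limiting or dropping a probe without declaring your own connectivity broken.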

You would be surprised at what percentage of DNS recursive resolution traffic is "a.root-servers.net" and "www.example.com" and other more specific names like "connectivitycheck.gstatic.com" (which I know has different purposes).

Related: there is a draft at the IETF about probing for "reachability" using the DNS, rather than picking random names (which tends to skew data or present unnecessary costs in various ways) or using ICMP echo. Since query-based status checking seems to be a thing that people do anyway, maybe it should be formalized so everyone can use/expect the same methods. https://datatracker.ietf.org/doc/draft-sst-dnsop-probe-name/

Despite the flippant comment below about an "april 1st experiment" with the largest global resolver, there is a significant risk associated with the concentration of measurements on systems with unintentional shared-fate issues. I expect there is a large community of services which expect correct DNS resolution and ICMP echo response from "a.root-servers.net" and "www.google.com" as indicators of general network accessibility. If (for example) the services in .com/.net/.org were to be offline, this would probably create much larger impact than their localized outage, since both those services would be offline, which would trigger undetermined failure behaviors in many network monitoring/automation or application software stacks. Using IP addresses for service-check destinations is slightly better, but as noted, ICMP is rarely a service with an SLA, and ICMP echo is frequently blocked or heavily rate-limited.

I will comment with my Quad9 hat on that there is no risk of us doing an April 1st experiment of turning off ICMP echo packets to 9.9.9.9. There are, however, real risks of ICMP having increased failure rates in DDoS conditions in any network, either locally or at the receiving end.

As another DNS-oriented friend of mine has in his .sig: "The Prudent Mariner never relies solely on any single aid to navigation."
JT

On 12 Aug 2025, at 7:15, David Prall via NANOG wrote:
This here has always been my biggest concern with external monitoring. If the chosen site decides to deny ping one day then your monitoring tool is broken.
Can do a quick DNS lookup via a DNS server, since they shouldn't turn that off. But, what happens when they notice the same site doing the same lookup(s) every x minutes.
In the past I've utilized the root DNS servers as a good measurement tool. Majority are anycast. All are dual-stack so I get both IPv4 and IPv6 verification. If 60% of them are responding we should be good. But again this is load they aren't expecting, but I assume they know is happening. I can rotate through doing a DNS lookup for .com, .net, .org, .gov, etc. so that I'm not doing the same thing over and over and I'm utilizing something they are designed and prepared to handle.
David
On 8/11/2025 8:08 PM, Damian Menscher via NANOG wrote:
On Mon, Aug 11, 2025 at 3:08 PM Matthew Petach via NANOG < nanog@lists.nanog.org> wrote:
Having been bitten by this in the past...never base your determination of "healthy" or "working" on a single external data reference. It can be tempting to just assume 8.8.8.8 will always be "up" and "pingable" to verify your internet connectivity is good...right up to the point where Google has a routing snafu
...
No need for a routing snafu... 8.8.8.8 is currently getting a steady-state 27Mpps (million packets/second) of ICMP ECHO_REQUEST. Internet connectivity checking is not a service we offer, and there is no SLA for it, so it may go away at any time. There is a very real risk of me running an April 1st experiment of "what would happen if I just ACL off all the pings?". I might have guessed I'd light up a couple dozen pagers and start a nanog@ flamewar... but if anyone is basing routing decisions on that, it will be a "fun" day indeed!
Damian
_______________________________________________ NANOG mailing list https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/YIM6ZS3Z...

On Tue, Aug 12, 2025 at 10:48 AM David Prall via NANOG <nanog@lists.nanog.org> wrote:
Can do a quick DNS lookup via a DNS server, since they shouldn't turn that off. But, what happens when they notice the same site doing the same lookup(s) every x minutes.
I think they won't notice, because that kind of query volume is orders of magnitude less than the average usage of one internet-connected device; that is, if you are running 2 or 3 queries every 3 or 4 minutes. Meanwhile, the average web-surfing user connects to websites that easily cause 20+ DNS queries over the span of a couple of seconds in order to load a whole web page, with all its JS frameworks, CSS, and fonts being remote-loaded from various domains.

Querying the service on the IP with an actual query is the best test, but it should be: use a few common FQDNs on different domains to run the lookup on, not just one FQDN. If any of the lookups succeed, the resolver is deemed "alive and working / available". If you only query one FQDN per resolver, you might not always be able to easily distinguish between a failure of the target authoritative domain you are querying and a lack of responsiveness by that resolver in general.

-- -JA
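The multi-FQDN rule above is easy to sketch; this is a minimal, hedged illustration (the probe names are placeholders, and the actual lookup transport is left as a pluggable callable rather than a specific resolver library):

```python
def check_resolver(resolve, fqdns):
    """Classify a resolver as 'alive' if any one of several independent
    lookups succeeds, and 'down' only when every lookup fails.

    `resolve` is any callable fqdn -> bool (e.g. a wrapper around your
    stub-resolver query), so the probe transport stays pluggable and
    the decision logic stays testable."""
    results = {name: resolve(name) for name in fqdns}
    state = "alive" if any(results.values()) else "down"
    return state, results

# A single failed name (say, an authoritative outage for one domain)
# does not mark the resolver down, which is the point of probing
# several unrelated domains.
probes = ["www.example.com", "www.example.net", "www.example.org"]
```

The `results` map is worth keeping: when a resolver is "alive" but one name consistently fails, you have evidence the problem is on the authoritative side, not the resolver.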

On Tue, 12 Aug 2025 at 01:08, Matthew Petach <mpetach@netflight.com> wrote:
If you're pinging someone else to make sure the internet is still alive, ping several, like 8.8.8.8, 1.1.1.1, and 9.9.9.9, and don't react unless you see failures to reach multiple of them. Otherwise, it's likely to be their failure, not yours, and there's no reason to make things worse by changing your systems based on their problems.
I am repeating myself a bit, apologies. But do also ensure that your health check is demoting, not removing; for example, by changing the admin weight to an inferior value. This way, if everything 'fails' because the health check itself is bogus, you are back to square one. Very easy to do with IP SLA/tracking and the like, yet most examples, even in vendor documentation, remove instead of demote. -- ++ytti
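As an illustration of "demote, not remove": an IOS-style sketch (all addresses, object numbers, and distances below are placeholders, and exact syntax varies by platform and release). The tracked static is withdrawn when the probe fails, but a floating copy of the same route at a worse administrative distance keeps the path usable if it turns out the health check itself is what broke:

```
ip sla 1
 dns a.root-servers.net name-server 198.41.0.4
 frequency 60
ip sla schedule 1 life forever start-time now
track 1 ip sla 1 reachability
! Preferred route, withdrawn when the probe fails...
ip route 0.0.0.0 0.0.0.0 192.0.2.1 track 1
! ...and the same next hop again at AD 250, so a bogus health check
! demotes the path instead of removing it outright.
ip route 0.0.0.0 0.0.0.0 192.0.2.1 250
```

If every better path really is gone, the floating route still forwards; if only the probe is broken, nothing user-visible changes.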

On Fri, Aug 8, 2025 at 2:17 AM Måns Nilsson via NANOG <nanog@lists.nanog.org> wrote:
anycast unbound, preferably on something more mature than Linux, so like FreeBSD or OpenBSD.
You don't need anycast DNS for 30k users. Stay away from anycast unless you really, really, really know what you're doing. DNS is also TCP and no commodity DNS software environment implements an anycast TCP stack, only the normal unicast stack. Route splitting shows up in the most unexpected places and it won't just give you a bad day, it'll give you a bad month with intractable and seemingly (but not really) intermittent problems that are challenging to nail down. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/

I do Anycast for much much smaller. It's great to reboot one server and have the other take all of the load. 0 customer interruption, not even a single DNS query lost. On Fri, Aug 8, 2025, 12:21 PM William Herrin via NANOG < nanog@lists.nanog.org> wrote:
On Fri, Aug 8, 2025 at 2:17 AM Måns Nilsson via NANOG <nanog@lists.nanog.org> wrote:
anycast unbound, preferably on something more mature than Linux, so like FreeBSD or OpenBSD.
You don't need anycast DNS for 30k users. Stay away from anycast unless you really, really, really know what you're doing.
DNS is also TCP and no commodity DNS software environment implements an anycast TCP stack, only the normal unicast stack. Route splitting shows up in the most unexpected places and it won't just give you a bad day, it'll give you a bad month with intractable and seemingly (but not really) intermittent problems that are challenging to nail down.
Regards, Bill Herrin
-- William Herrin bill@herrin.us https://bill.herrin.us/

On Fri, Aug 8, 2025 at 9:42 AM Josh Luthman <josh@imaginenetworksllc.com> wrote:
I do Anycast for much much smaller. It's great to reboot one server and have the other take all of the load. 0 customer interruption, not even a single DNS query lost.
Hi Josh, You don't need anycast routing to do that, or more precisely you don't need the route to persist in an anycast state for more than a few seconds during the handoff. You can implement dynamic but still unicast routing to the DNS servers without incurring the wrath of the anycast gods. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/

Subject: Re: Recommended DNS server for a medium 20-30k users isp Date: Fri, Aug 08, 2025 at 10:09:04AM -0700 Quoting William Herrin via NANOG (nanog@lists.nanog.org):
On Fri, Aug 8, 2025 at 9:42 AM Josh Luthman <josh@imaginenetworksllc.com> wrote:
I do Anycast for much much smaller. It's great to reboot one server and have the other take all of the load. 0 customer interruption, not even a single DNS query lost.
Hi Josh,
You don't need anycast routing to do that, or more precisely you don't need the route to persist in an anycast state for more than a few seconds during the handoff. You can implement dynamic but still unicast routing to the DNS servers without incurring the wrath of the anycast gods.
The elephant in the room is cascading failures. Other than that, I'd not want to be without anycast, for its service level record. I don't have to be up in the middle of the night to patch my resolvers. I can take the most loaded one out of service at any time by shutting down BGP; after a couple of seconds it will be completely drained of requests, and I can reboot. No customer or end user is going to notice.

Regarding TCP, yes, this is a potential issue. You can think about it and it will grow in your mind, or you can do some observations and conclude that unless you messed your routing up really badly (which is not DNS' fault, but still on-topic here), the mean length of a client-to-first-hop-resolver TCP session is going to be orders of magnitude shorter than the time between routing updates that make a given router change its mind about which anycast node is closest.

Further, I'd make an educated guess that the recursion traffic going from resolver to auth server is much more likely to hit TCP, and that is unicast all the way. Also, EDNS0: we usually have ~1200 bytes to play with, not 512. YMMV.

-- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE SA0XLR +46 705 989668 YOW!! Everybody out of the GENETIC POOL!

On Sat, Aug 9, 2025 at 5:38 AM Måns Nilsson <mansaxel@besserwisser.org> wrote:
Regarding TCP, yes, this is a potential issue. You can think about it and it will grow in your mind, or you can do some observations and conclude that unless you messed your routing up really badly (which is not DNS' fault but still on-topic here) the mean session length for a client-to 1st hop resolver TCP session is going to be orders of magnitude shorter than the times between routing updates that make a certain router change its mind about which anycast node is the closest one.
Hi Måns, This is a case of misunderstanding what the numbers are telling you. Yes, the failure rate is low, but it's not random. It's not a case of 99 queries working and 1 failing, where you try again and it works. It's a case of queries working for 99 people while 1 person, with just the wrong connections in the network graph, experiences persistent failures. And then your front-line customer support blames the customer for your error, because obviously it's working for everybody else. If it doesn't work in the corner cases, then it doesn't work. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/

On 8/7/25 20:44, DurgaPrasad - DatasoftComnet via NANOG wrote:
Hello all, Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP. We have used powerdns and unbound but sometimes find the caching times a bit on upper side. Any suggestions between these two or anything new?

I've been happy with PowerDNS Recursor. What sort of latency were you seeing, and at what loading?
-- Bryan Fields 727-409-1194 - Voice http://bryanfields.net

Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP. We have used powerdns and unbound but sometimes find the caching times a bit on upper side. Any suggestions between these two or anything new? I've been happy with PowerDNS Recursor. What sort of latency were you seeing in it and at what loading?
I wonder if we're trying to answer the wrong question. "sometimes find the caching times a bit on upper side" could certainly be interpreted to refer to caching according to the TTL specified for the zone on the *authoritative* server. If so - it's possible that the user simply needs to be able to specify a maximum TTL, independent of the setting on the authoritative server. Steinar Haug, AS2116

On Fri, Aug 08, 2025 at 12:44:40AM +0000, DurgaPrasad - DatasoftComnet via NANOG wrote:
Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP.
Yes. ISC BIND, running on OpenBSD. Performs well on minimal hardware, plus the OpenBSD firewall implementation ("pf") is excellent. And since both can be configured and operated from the command line, this setup readily lends itself to revision control, scripting, and synchronization. ---rsk

And completely the opposite in every possible way from running some GUI-dependent security nightmare on Windows, a platform renowned for its amazing scheduler. If you want authoritative-only on OpenBSD, then NSD works well and can be synced from BIND, if you want to present only OpenBSD to the internet.
On 8 Aug 2025, at 14:20, Rich Kulawiec via NANOG <nanog@lists.nanog.org> wrote:
On Fri, Aug 08, 2025 at 12:44:40AM +0000, DurgaPrasad - DatasoftComnet via NANOG wrote:
Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP.
Yes. ISC BIND, running on OpenBSD. Performs well on minimal hardware, plus the OpenBSD firewall implementation ("pf") is excellent. And since both can be configured and operated from the command line, this setup readily lends itself to revision control, scripting, and synchronization.
---rsk

At $lastjob, we had 60k subs using a pair of BSD boxes running a pretty simple BIND instance with no issues. We used FRR to advertise a VIP for customers to query against, but otherwise it was a pretty simple install/config.
On Aug 7, 2025, at 19:45, DurgaPrasad - DatasoftComnet via NANOG <nanog@lists.nanog.org> wrote:
Hello all, Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP. We have used powerdns and unbound but sometimes find the caching times a bit on upper side. Any suggestions between these two or anything new? Also need points on how much we tune the settings pros and cons if any.
Thank you /DP

On 8/8/25 02:44, DurgaPrasad - DatasoftComnet via NANOG wrote:
Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP. We have used powerdns and unbound but sometimes find the caching times a bit on upper side. Any suggestions between these two or anything new? Also need points on how much we tune the settings pros and cons if any.
In my experience, with ~700k DSL customers before 2010 and DC setups after that, the default PowerDNS Recursor settings do not really need tuning, apart from limiting the number of entries in cache [0], which directly corresponds to memory usage. The amount of memory required per entry depends on your platform and has changed over time, so you should monitor resource usage and adjust accordingly. I also usually limit max-negative-ttl to 10 minutes instead of the 1 hour default [1], which helps with recovery after some misconfiguration out there.

For monitoring these and other metrics, I can recommend prometheus/grafana via the provided metrics endpoint. [2] The average response latency in particular can let you know when the quality of your recursive nameserver's network connection deteriorates.

Since there is also dnsdist [3] these days, I can wholeheartedly recommend putting your recursive DNS service behind it, or behind an HA setup of them, so you can seamlessly switch between nodes or even implementations. dnsdist also provides a /metrics endpoint. [4]

[0] https://doc.powerdns.com/recursor/settings.html#max-cache-entries
[1] https://doc.powerdns.com/recursor/settings.html#max-negative-ttl
[2] https://doc.powerdns.com/recursor/metrics.html#using-prometheus-export
[3] https://www.dnsdist.org/index.html
[4] https://www.dnsdist.org/statistics.html
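For concreteness, the two Recursor settings mentioned above look roughly like this in recursor.conf (the values are illustrative only; size the cache to your own query volume and available memory):

```
# recursor.conf -- illustrative values, not a recommendation
max-cache-entries=2000000   # cap the cache; memory per entry varies by platform
max-negative-ttl=600        # cache negative answers at most 10 minutes
```

Watching cache hit rate and memory after a change tells you whether the cap is too tight for your population.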

Responding to the thread in general, not any particular person. For those of you with a firewall in front of your DNS servers, what are you having the firewall do? ----- Mike Hammett Intelligent Computing Solutions http://www.ics-il.com Midwest-IX http://www.midwest-ix.com
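One common answer to the question above is simply restricting port 53 to your own customer space. A minimal pf sketch (the table contents are placeholders, and a production ruleset would add rate limiting and management access):

```
# /etc/pf.conf -- minimal sketch; replace table contents with your prefixes
table <customers> { 192.0.2.0/24, 2001:db8::/32 }
set skip on lo
block in all
# DNS over UDP and TCP from customers only; state is kept by default
pass in proto { udp tcp } from <customers> to any port 53
pass out all
```

Allowing TCP as well as UDP matters: truncated responses and large answers fall back to TCP/53.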

We have used powerdns and unbound but sometimes find the caching times a bit on upper side.
By default, unbound caches an entry for the duration of the TTL received from the authoritative server. You can modify this:

- cache-min-ttl: if the TTL is less than this, cache the entry for at least this long
- cache-max-ttl: if the TTL is more than this, only cache it for this long

There are many other config options in unbound that allow you to tune cache behavior to your desired use case. A cursory look at the PowerDNS docs shows it has similar options. I'd suggest first working with the software you already have to see if it can be configured to meet your requirements. Likely to be less effort. On Thu, Aug 7, 2025 at 8:45 PM DurgaPrasad - DatasoftComnet via NANOG < nanog@lists.nanog.org> wrote:
Hello all, Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP. We have used powerdns and unbound but sometimes find the caching times a bit on upper side. Any suggestions between these two or anything new? Also need points on how much we tune the settings pros and cons if any.
Thank you /DP
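The TTL clamps described above look like this in unbound.conf (values are illustrative; both options are real unbound settings, but pick bounds that match your tolerance for stale data):

```
server:
    # Floor and ceiling on cached TTLs, overriding what the
    # authoritative server handed back.
    cache-min-ttl: 60      # never cache for less than a minute
    cache-max-ttl: 3600    # never cache for more than an hour
```

Note that raising cache-min-ttl deliberately violates the authoritative TTL, so keep it small if upstream operators rely on fast record changes.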

I've had good luck with knot-resolver, combined with ExaBGP and a health check script controlling the announcements upstream, among other things. Ryan Hamel ________________________________ From: DurgaPrasad - DatasoftComnet via NANOG <nanog@lists.nanog.org> Sent: Thursday, August 7, 2025 5:44 PM To: nanog@lists.nanog.org <nanog@lists.nanog.org> Cc: DurgaPrasad - DatasoftComnet <dp@datasoftcomnet.com> Subject: Recommended DNS server for a medium 20-30k users isp Hello all, Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP. We have used powerdns and unbound but sometimes find the caching times a bit on upper side. Any suggestions between these two or anything new? Also need points on how much we tune the settings pros and cons if any. Thank you /DP

Use nameservers that support DNS COOKIE (RFC 7873) and enable it if it is not already on by default. If the nameserver vendor you are currently using doesn't support DNS COOKIE, find a better nameserver. DNS COOKIE provides cheap protection against off-path DNS spoofing, but only if both server and client support it. It's been 9 years since RFC 7873 was published, and in that time just about all of the servers with broken EDNS implementations that failed to ignore unknown EDNS options, as per RFC 6891, have been replaced with ones that are RFC compliant. If you disabled sending DNS COOKIE requests in the past, it is time to re-enable it. Mark
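For reference, the COOKIE option rides inside the OPT pseudo-RR as EDNS option code 10 with an 8-byte client cookie. A minimal sketch of the wire format in Python (the query ID and UDP payload size here are arbitrary choices, not requirements):

```python
import os
import struct

def query_with_cookie(qname: str) -> bytes:
    """Build a DNS A query for qname carrying a DNS COOKIE (RFC 7873):
    EDNS option code 10 with an 8-byte client cookie in the OPT record."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 1)
    # Question: length-prefixed labels, then QTYPE=A(1), QCLASS=IN(1)
    labels = b"".join(
        bytes([len(l)]) + l.encode("ascii")
        for l in qname.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)
    # OPT pseudo-RR: root name, TYPE=41, CLASS=UDP payload size, TTL=0
    cookie = os.urandom(8)                         # fresh client cookie
    rdata = struct.pack("!HH", 10, len(cookie)) + cookie
    opt = b"\x00" + struct.pack("!HHIH", 41, 1232, 0, len(rdata)) + rdata
    return header + question + opt

pkt = query_with_cookie("example.com")
```

A cookie-aware server echoes the client cookie back (appending its server cookie), which is what makes blind off-path spoofing expensive.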
On 8 Aug 2025, at 10:44, DurgaPrasad - DatasoftComnet via NANOG <nanog@lists.nanog.org> wrote:
Hello all, Do you have any recommendations for recursive DNS servers for a medium sized (20-30k users) ISP. We have used powerdns and unbound but sometimes find the caching times a bit on upper side. Any suggestions between these two or anything new? Also need points on how much we tune the settings pros and cons if any.
Thank you /DP
-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
participants (32)
- Andrew Latham
- brent saner
- Bryan Fields
- Crist Clark
- Damian Menscher
- David Guo
- David Prall
- DurgaPrasad - DatasoftComnet
- Jay Acuna
- John Todd
- Josh Luthman
- Marco Moock
- Mark Andrews
- Matthew Petach
- Mel Beckman
- Mike Hammett
- Mike Simpson
- Måns Nilsson
- Nick Hilliard
- Rich Kulawiec
- Robert L Mathews
- Rusty Dekema
- Ryan Hamel
- Saku Ytti
- Smoot Carl-Mitchell
- Stefan Schmidt
- sthaug@nethelp.no
- Tim Burke
- Tom Beecher
- Uesley Correa
- William Herrin
- Łukasz Bromirski