On Mon, May 18, 2020 at 11:46:24PM -0400, Justin Wilson (Lists) wrote:
> What are you folk doing to validate your DNS cache server configs and operation? In other words, what are you doing to make sure they are performing well, not just alive?

There are various things you can do. With resolvers like BIND's named, the "rndc stats" command dumps statistics counters to a file. This contains a variety of statistics that are worth monitoring. I'll list some things for BIND's named, but you can probably do something similar for other products too:

(1) Check the size of your cache and ensure that it is not too small and that it is bounded. The max-cache-size config option limits it. In old versions of named, there was no limit. Current versions have an automatic limit. You want this size to be at least a few hundred MB for a small LAN, and larger if it is a widely used resolver. Check the "cache records deleted due to memory exhaustion" counter in rndc stats output.

(2) Check your cache hit rate (CHR). This is the number of queries that were answered from cache vs. the number of overall client queries. You should be able to compute this from rndc stats output. It can usually be anywhere between 50% and 95% depending on the usage, but if you see it dipping below 50% and this is an ordinary resolver, you may want to look into why that is so. CHR is typically graphed and monitored that way.

(3) Check the number of outstanding queries that the resolver is performing. This should not be very high (the CHR influences it, but other factors can cause it to go high too). "rndc recursing" dumps the list of clients that are waiting on recursion to finish (because the cache didn't have an answer for them; note that many clients waiting for the same question doesn't mean the resolver makes as many queries to upstream authorities). The "recursive-clients" named.conf option is related. The rndc recursing clients dump also contains a timestamp in seconds since the epoch, and it lists IPv4 and IPv6 clients in order of arrival. The first timestamp for an IPv4 or IPv6 client should not be very far from the current time. (If it is, that can be due to various issues.)

(4) Check the resolver and socket I/O counters in the rndc stats dump. Check "NNNN queries caused recursion", "NNNN recursing clients", "NNNN UDP queries in progress", "NNNN active fetches", etc. They have other identifiers in named's XML statistics. Check how many UDP recv and send errors are happening.

Keep an eye on query logs, if you have them, for attack patterns (random subdomain style attacks are common, but there are also some other attack patterns which I won't mention). Mitigation may involve contacting your DNS vendor. Keep an eye on response sizes and amplification attacks if you're running a publicly reachable service. With very low TTLs on the answers to highly popular questions (names not typically used by humans but by apps), some mitigations don't work very well. If you are at this scale, you will likely be able to contact your DNS product's support for help.

A resolver needs to maximize its CHR so it does very little work and serves its purpose of being a cache. A resolver also has to be a good internet citizen and not query upstream nameservers excessively, or upstream nameservers will sometimes drop queries.

The above are examples of things to monitor that fall outside the normal, where normal includes things like failures to contact nameservers, DNSSEC validation failures, etc. A resolver is a hugely complex process, routinely with software bugs, and with attacks happening if it is public facing. What is documented may be different from what actually happens.

Your question is a good one, and it is good to monitor a highly used resolver.

Mukund
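
As a rough illustration of the CHR check in (2), here is a minimal Python sketch that computes the hit rate over the interval between two statistics dumps. It rests on several assumptions that you should verify against your own output: that counter lines look like "<count> <description>", that each text passed in holds a single dump (named typically appends to the statistics file on every "rndc stats", so you would need to split the file first), and that the counters are named "cache hits (from query)" and "cache misses (from query)", which may differ between versions. The function names are made up for this example.

```python
import re

# Assumed shape of a counter line in the dump: "<count> <description>".
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def parse_counters(text):
    """Return {description: count} for one dump.

    Counters that repeat (for example once per view) are summed.
    Section headers and view markers do not start with a number,
    so they are skipped.
    """
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            name = match.group(2)
            counters[name] = counters.get(name, 0) + int(match.group(1))
    return counters


def cache_hit_rate(before, after,
                   hits="cache hits (from query)",      # assumed counter name
                   misses="cache misses (from query)"):  # assumed counter name
    """Hit rate between two snapshots, or None if it cannot be computed.

    The counters are cumulative, so the rate for an interval comes from
    the difference between two dumps. A negative difference suggests the
    counters were reset (for example by a restart), so None is returned.
    """
    d_hits = after.get(hits, 0) - before.get(hits, 0)
    d_misses = after.get(misses, 0) - before.get(misses, 0)
    if d_hits < 0 or d_misses < 0 or d_hits + d_misses == 0:
        return None
    return d_hits / (d_hits + d_misses)
```

You would call parse_counters() on two dumps taken some minutes apart and feed the results to cache_hit_rate(), then graph the value over time. The same parsed dictionaries can be used to watch whether "cache records deleted due to memory exhaustion" from (1) is growing between dumps.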