
Hi,

Honestly, the best and safest way to combat this is to make it cheap and fast for you to generate the pages the bots request; that way you no longer really care whether they request them or not. For example, a lot of software sets random, useless cookies for no reason at all, and those cookies prevent caching. What you could do is strip the cookies in both directions and simply cache the generic responses your backend generates for generic requests. Standard OSS nginx can do this by clearing the Cookie and Set-Cookie headers.

If you really want to limit requests by User-Agent, instead of by REMOTE_ADDR as is normally done, standard OSS nginx can do that too; see http://nginx.org/r/limit_req_zone and nginx.org/r/$http_ for `$http_user_agent`, although you might inadvertently rate-limit popular browsers this way. Rough sketches of both setups follow below the quoted message.

C.

On Wed, 16 Jul 2025 at 11:49, Andrew Latham via NANOG <nanog@lists.nanog.org> wrote:
I just had an issue with a web-server where I had to block a /18 of a large scraper. I have some topics I could use some input on.
1. What tools or setups have people found most successful for dealing with bots/scrapers that do not respect robots.txt, for example?
2. What tools for response rate limiting deal with bots/scrapers that cycle over a large variety of IPs with the exact same user agent?
3. Has anyone written or found a tool to concentrate IP addresses into networks for IPTABLES or NFT? (e.g., if 60% of the IPs in network X are already in the list, add network X and remove the individual IP entries.)
-- - Andrew "lathama" Latham -
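
A minimal sketch of the cookie-stripping-plus-cache approach, assuming a single upstream at 127.0.0.1:8080 (placeholder address), a server_name of example.com, and that the generic pages really are safe to cache; paths, sizes, and validity times are illustrative only:

# Drop cookies in both directions so otherwise-generic responses
# become cacheable, then serve them from the nginx proxy cache.
proxy_cache_path /var/cache/nginx/pages levels=1:2 keys_zone=pages:50m
                 max_size=1g inactive=60m;

server {
    listen 80;
    server_name example.com;

    location / {
        # Strip cookies on the request and hide/ignore them on the response.
        proxy_set_header     Cookie "";
        proxy_hide_header    Set-Cookie;
        proxy_ignore_headers Set-Cookie;

        # Cache successful responses; stale content covers backend hiccups.
        proxy_cache           pages;
        proxy_cache_valid     200 301 10m;
        proxy_cache_use_stale error timeout updating;

        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:8080;
    }
}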
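
And a sketch of per-User-Agent rate limiting with limit_req_zone; the 1r/s rate and burst=20 are arbitrary placeholders. Every client sending the same User-Agent string shares one bucket, which is exactly why a popular browser string can end up limited along with the bots:

# Key the shared-memory zone on the raw User-Agent header.
limit_req_zone $http_user_agent zone=per_ua:20m rate=1r/s;

server {
    listen 80;
    server_name example.com;

    location / {
        # Allow short bursts, reject the excess with 429 instead of 503.
        limit_req        zone=per_ua burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8080;
    }
}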