
Hi,

Honestly, the best and safest way to combat this is to make it cheap and fast for you to generate the pages the bots request; that way you no longer really care whether they request them or not. For example, a lot of software sets random, useless cookies for no reason at all, and those cookies prevent caching. What you could do is strip the cookies in both directions and simply cache the generic responses your backend generates for generic requests. Standard OSS nginx can do this by clearing the Cookie and Set-Cookie headers.

If you really want to limit requests by User-Agent, instead of by REMOTE_ADDR as is normally done, standard OSS nginx can do that too; see http://nginx.org/r/limit_req_zone and nginx.org/r/$http_ for `$http_user_agent`, although you might inadvertently rate-limit popular browsers this way. Rough sketches of both setups follow below the quoted message.

C.

On Wed, 16 Jul 2025 at 11:49, Andrew Latham via NANOG <nanog@lists.nanog.org> wrote:
I just had an issue with a web-server where I had to block a /18 of a large scraper. I have some topics I could use some input on.
1. What tools or setups have people found most successful for dealing with bots/scrapers that do not respect robots.txt, for example?
2. What tools for response rate limiting deal with bots/scrapers that cycle over a large variety of IPs with the exact same user agent?
3. Has anyone written or found a tool to concentrate IP addresses into networks for IPTABLES or NFT? (e.g., if 60% of the IPs in network X are already in the list, add network X and remove the individual IP entries.)
-- - Andrew "lathama" Latham -
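
A minimal sketch of the cookie-stripping-plus-cache approach, assuming a single upstream at 127.0.0.1:8080 (placeholder address), a server_name of example.com, and that the generic pages really are safe to cache; paths, sizes, and validity times are illustrative only:

# Drop cookies in both directions so otherwise-generic responses
# become cacheable, then serve them from the nginx proxy cache.
proxy_cache_path /var/cache/nginx/pages levels=1:2 keys_zone=pages:50m
                 max_size=1g inactive=60m;

server {
    listen 80;
    server_name example.com;

    location / {
        # Strip cookies on the request and hide/ignore them on the response.
        proxy_set_header     Cookie "";
        proxy_hide_header    Set-Cookie;
        proxy_ignore_headers Set-Cookie;

        # Cache successful responses; stale content covers backend hiccups.
        proxy_cache           pages;
        proxy_cache_valid     200 301 10m;
        proxy_cache_use_stale error timeout updating;

        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:8080;
    }
}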
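
And a sketch of per-User-Agent rate limiting with limit_req_zone; the 1r/s rate and burst=20 are arbitrary placeholders. Every client sending the same User-Agent string shares one bucket, which is exactly why a popular browser string can end up limited along with the bots:

# Key the shared-memory zone on the raw User-Agent header.
limit_req_zone $http_user_agent zone=per_ua:20m rate=1r/s;

server {
    listen 80;
    server_name example.com;

    location / {
        # Allow short bursts, reject the excess with 429 instead of 503.
        limit_req        zone=per_ua burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8080;
    }
}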