Correctly dealing with bots and scrapers.

I just had an issue with a web server where I had to block a /18 belonging to a large scraper. I have some topics I could use some input on.

1. What tools or setups have people found most successful for dealing with bots/scrapers that do not respect robots.txt for example?
2. What tools for response rate limiting deal with bots/scrapers that cycle over a large variety of IPs with the exact same user agent?
3. Has anyone written or found a tool to concentrate IP addresses into networks for iptables or nft? (e.g., 60% of the IPs for network X are already in the list, so add network X and remove the individual IP entries.)

--
- Andrew "lathama" Latham -
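For question 3, here is one rough way to sketch that concentration step in Python. The /24 granularity is an assumed default, the 60% threshold comes from the example in the question, and the function name is made up; this is an illustration, not a finished tool.

# Collapse individual banned IPs into covering networks: if at least
# `threshold` of a network's addresses are already in the list, ban the
# whole network and drop the individual entries.
import ipaddress
from collections import defaultdict

def concentrate(ips, prefixlen=24, threshold=0.60):
    nets = defaultdict(set)
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        nets[ipaddress.ip_network(f"{addr}/{prefixlen}", strict=False)].add(addr)

    result = []
    for net, members in nets.items():
        if len(members) / net.num_addresses >= threshold:
            result.append(str(net))                  # add network X
        else:
            result.extend(str(a) for a in members)   # keep individual IPs
    return result

# Example:
# concentrate(["198.51.100.%d" % i for i in range(160)] + ["203.0.113.7"])
# -> ['198.51.100.0/24', '203.0.113.7']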

On 16.07.2025 at 10:48:39, Andrew Latham via NANOG wrote:
1. What tools or setups have people found most successful for dealing with bots/scrapers that do not respect robots.txt for example?
Place a link to a file that is hidden from normal people. Exclude the directory via robots.txt. Then use fail2ban to block all IP addresses that poll the file.

--
Regards
Marco

Send unsolicited bulk mail to 1752655719muell@cartoonies.org
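A standalone Python sketch of that trap's matching logic follows; the trap path and log file path are hypothetical placeholders, and in practice fail2ban's own filter and jail would do this job.

# Find clients that fetched the hidden trap URL that robots.txt told
# well-behaved crawlers to stay away from. The resulting IP list can be
# fed to fail2ban, nft, or whatever ban mechanism you use.
import re

TRAP_PATH = "/secret-trap/"                 # hypothetical hidden directory
LOG_FILE = "/var/log/nginx/access.log"      # adjust to your web server
ip_re = re.compile(r"^(\S+) ")              # client IP is the first log field

def trap_offenders():
    seen = set()
    with open(LOG_FILE) as log:
        for line in log:
            if TRAP_PATH in line:
                m = ip_re.match(line)
                if m:
                    seen.add(m.group(1))
    return seen

if __name__ == "__main__":
    for ip in sorted(trap_offenders()):
        print(ip)    # pipe into your ban mechanism of choice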

Once upon a time, Marco Moock <mm@dorfdsl.de> said:
Place a link to a file that is hidden from normal people. Exclude the directory via robots.txt.
Then use fail2ban to block all IP addresses that poll the file.
The problem with a lot of the "AI" scrapers is that they're apparently using botnets and will often only make a single request from a given IP address, so reactive blocking doesn't work (and can cause other issues, like trying to block 100,000 IPs, which fail2ban for example doesn't really handle well).

--
Chris Adams <cma@cmadams.net>
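At that scale, one common workaround is to keep the banlist in a kernel set and reload it as an atomic batch rather than adding addresses one at a time. A Python sketch, assuming nftables and made-up table/set names:

# Bulk-load a large banlist into an nftables set in one atomic batch,
# which copes with six-figure entry counts far better than per-IP rules.
# Assumes the set was created beforehand, e.g.:
#   nft add table inet filter
#   nft add set inet filter banned '{ type ipv4_addr; flags interval; }'
#   nft add rule inet filter input ip saddr @banned drop   (given a base chain "input")
# Table, chain and set names here are placeholders.
import subprocess, tempfile

def load_banlist(entries):
    """entries: iterable of IPv4 addresses or CIDR prefixes as strings."""
    with tempfile.NamedTemporaryFile("w", suffix=".nft", delete=False) as f:
        f.write("flush set inet filter banned\n")
        f.write("add element inet filter banned { %s }\n" % ", ".join(entries))
        path = f.name
    subprocess.run(["nft", "-f", path], check=True)

# Example:
# load_banlist(["192.0.2.1", "198.51.100.0/24", "203.0.113.0/24"])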

Chris

Spot on, and I am getting the feeling this is where the value of a geo-IP service comes into play, one that offers defined "eyeball networks" to allow.
-- - Andrew "lathama" Latham -

As Chris states, broad IP-based blocking is unlikely to be very effective, and likely more problematic down the line anyway.

For the slightly more 'honorable' crawlers, which respect robots.txt, you can block their UAs there.

Fail2ban is a very good option right now. It will be even better if nepenthes eventually integrates with it. Then you can have some real fun.

I've thought about using client-side tokens with behavioral analysis and a points system, but haven't implemented it on one of my sites to test its reliability just yet.

Basically: issue a signed token to each client and watch that client using JS instantiated on each page of the site. Give actions (or the lack thereof) scores based on the likelihood of being a bot, and fail2ban them past a certain threshold. Look at mouse movement, scroll depth, keypress timings, page transitions, dwell time, JS events, start/end routes, etc. For example:

- Client is on a page for under 100 ms: +10 pts
- No EventListener JS events: +40 pts
- Client joins the site at a nested page and drills down: +5 pts (low because people could bookmark a specific page that triggers this)
- etc.

Then decay these points when the opposite behaviors happen. There's no guaranteed solution, and this would require careful tweaking of the ban threshold and whatnot with real testing so you don't accidentally block your users, but I feel it could mitigate a lot of bot situations. For banned users, just redirect them to a page with a little unban form they can quickly fill out or something.

-- Ryland
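A minimal server-side sketch of that score-and-decay idea, assuming the page JS reports named signals back to the server; the signal names and weights echo the examples above, and the decay rate and ban threshold are placeholder values.

# Per-token bot score: accumulate points for bot-like signals, decay them
# over time, and flag the client for banning past a threshold.
import time
from collections import defaultdict

WEIGHTS = {
    "dwell_under_100ms": 10,
    "no_event_listener_activity": 40,
    "entered_at_nested_page": 5,
}
DECAY_PER_SECOND = 0.05     # points forgiven per second of normal behavior
BAN_THRESHOLD = 60

scores = defaultdict(lambda: {"points": 0.0, "last": time.monotonic()})

def record(token: str, signal: str) -> bool:
    """Apply one observed signal; return True if the client should be banned."""
    s = scores[token]
    now = time.monotonic()
    # Decay first, so quiet/normal clients drift back toward zero.
    s["points"] = max(0.0, s["points"] - (now - s["last"]) * DECAY_PER_SECOND)
    s["last"] = now
    s["points"] += WEIGHTS.get(signal, 0)
    return s["points"] >= BAN_THRESHOLD

# Example: if record("client-token-123", "no_event_listener_activity")
# returns True, hand that token's IP to fail2ban (or show the unban form).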

I don't have experience with it, but I was pointed to this project recently which may be of interest to you.

"Anubis is a Web AI Firewall Utility that weighs the soul of your connection <https://en.wikipedia.org/wiki/Weighing_of_souls> using one or more challenges in order to protect upstream resources from scraper bots."

https://github.com/TecharoHQ/anubis

There is what we could call a "growth market" of grey-market organizations and individuals selling residential proxies that route traffic through actual residential cable modem, DSL, and FTTH connections at people's houses.

Usually this is implemented in one of two ways: a router/home gateway device that's been pwned and is part of a botnet, or tricking people into installing some form of proxy software on one of their persistently connected home computers.

These are used for bot scraping as well as more mundane credit card fraud and various scams: giving people access to geo-restricted online casinos, weird hawala-adjacent online money/cryptocurrency exchange and transfer platforms, and a vast array of other shady stuff. Google "residential proxies for sale" to see the tip of the iceberg.

On Wed, Jul 16, 2025 at 11:57 AM Chris Adams via NANOG <nanog@lists.nanog.org> wrote:
The problem with a lot of the "AI" scrapers is that they're apparently using botnets and will often only make a single request from a given IP address, so reactive blocking doesn't work
So do reactive un-blocking instead. The first request from any IP address gets an empty web page containing "<meta http-equiv=refresh content=1>" and triggers a permit rule which allows subsequent requests to reach the actual web server and content. That way the one-shot sources get an empty page no matter what URL they request.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
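A toy Python sketch of that first-contact logic, with the web server itself standing in for the firewall permit rule; the handler, port, and placeholder content are all hypothetical.

# First request from an IP gets only an empty refresh page and "permits"
# the address; subsequent requests get the real content. In practice the
# permit would be a firewall/load-balancer rule rather than an in-process set.
from http.server import BaseHTTPRequestHandler, HTTPServer

permitted = set()
REFRESH_PAGE = b'<meta http-equiv="refresh" content="1">'

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if ip not in permitted:
            permitted.add(ip)
            self.wfile.write(REFRESH_PAGE)      # one-shot scrapers stop here
        else:
            self.wfile.write(b"<html><body>real content here</body></html>")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()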

On Wed, Jul 16, 2025 at 1:57 PM Chris Adams via NANOG <nanog@lists.nanog.org> wrote:
The problem with a lot of the "AI" scrapers is that they're apparently using botnets and will often only make a single request from a given IP address, so reactive blocking doesn't work (and can cause other issues,
Append a canary ID to all URLs displayed on the page. For example:

https://example.com/example.html?visitor=1234ABCDEF&signature=XYZ

Upon receiving a page request that is missing a valid "visitor=" tag, or where the visitor's IP address does not match the IP address linked to that visitor tag: create a new visitor tag in the database and return an empty page with a 302 redirect sending the visitor back to the homepage with the new tag added, refusing to display the individual page requested until they click a link provided by the website. Do the same if the signature= attribute is missing or fails to verify.

The signature attribute is an HMAC which authenticates that the combination of the URL path and visitor ID came from a page displayed by the web server and has not been altered by the client. For example, they cannot simply learn their client ID and append it on their own to https://example.com/example.html; they need the unique signature= added to the link by the web server in order to have access to the example.html page.

-- -JA
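A short Python sketch of generating and verifying such signed links, assuming a server-side secret and the visitor/signature parameter names from the example URL; the secret value and function names are placeholders.

# Sign and verify visitor-tagged links so clients cannot forge or reuse them
# on other paths.
import hmac, hashlib
from urllib.parse import urlencode

SECRET = b"server-side-secret"  # placeholder; keep this out of the client

def sign(path: str, visitor_id: str) -> str:
    msg = f"{path}|{visitor_id}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def make_link(path: str, visitor_id: str) -> str:
    qs = urlencode({"visitor": visitor_id, "signature": sign(path, visitor_id)})
    return f"{path}?{qs}"

def verify(path: str, visitor_id: str, signature: str) -> bool:
    # Constant-time comparison to avoid leaking the valid signature.
    return hmac.compare_digest(sign(path, visitor_id), signature)

# Example:
# make_link("/example.html", "1234ABCDEF")
# -> /example.html?visitor=1234ABCDEF&signature=<hmac hex>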

Yes Marco, that is one of the tactics I see for dealing with this. Fail2ban is a great piece of software that I use.

This reminded me of another related topic: security scans. Any requests for wordpress could be an easy way to flag and block with fail2ban when wordpress is not in use.
-- - Andrew "lathama" Latham -

On 16.07.2025 13:33 Andrew Latham <lathama@gmail.com> wrote:
Any requests for wordpress could be an easy way to flag and block with fail2ban when wordpress is not in use.
I've added various paths and user agents to my banfilter. If somebody is interested, let me know.

--
Kind regards
Marco

Send spam to abfall1752665585@stinkedores.dorfdsl.de

On Wed, 16 Jul 2025 at 14:33, Andrew Latham via NANOG <nanog@lists.nanog.org> wrote:
related topic. Security Scans. Any requests for wordpress could be an easy way to flag and block with fail2ban when wordpress is not in use.
For WordPress and PHP, I think it's simply easier to catch these scenarios with an nginx config and cheaply return errors from the front-end web server, without wasting any of the real backend resources.

C.

Constantine

Good call there. I need to investigate the 404 responses to see if there are any improvements to be made.
-- - Andrew "lathama" Latham -

Hi Andrew,

Yes, you could use something like the following in nginx.conf:

    location ^~ /wp- {
        return 444;
    }

The `^~` modifier ensures that the regex locations will not be checked. The 444 return is a special nginx code that closes the connection without sending a response, which may tie up the resources of the bot doing the scans.

References:
* http://nginx.org/r/location
* http://nginx.org/r/return

Best regards,
Constantine.

Hi,

Honestly, the best and safest way to combat this is to ensure that it's very cheap and fast for you to generate the pages that the bots request; that way, you wouldn't really care whether they request said pages or not.

For example, a lot of software invalidly sets random/useless cookies for literally no reason, which prevents caching. What you could do is strip out all of the cookies in both directions and then simply cache the generic responses that your backend generates for the generic requests. This can be done with standard OSS nginx by clearing the Cookie and Set-Cookie headers.

If you really want to limit requests by User-Agent, instead of by REMOTE_ADDR like it's normally done, standard OSS nginx can do that too; see http://nginx.org/r/limit_req_zone and nginx.org/r/$http_ for `$http_user_agent`, although you might inadvertently blacklist popular browsers this way.

C.
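To illustrate the idea of keying a rate limit on the User-Agent rather than the client address (outside nginx, which does this natively with limit_req_zone), here is a tiny token-bucket sketch in Python; the rate and burst numbers are arbitrary placeholders.

# Token-bucket rate limiting keyed on User-Agent instead of client IP.
# Purely illustrative of the keying choice, not a replacement for nginx.
import time
from collections import defaultdict

RATE = 5.0      # tokens added per second (placeholder)
BURST = 20.0    # maximum bucket size (placeholder)

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(user_agent: str) -> bool:
    b = buckets[user_agent]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False  # this User-Agent is over its limit

# Example:
# allow("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/129.0.0.0 ...")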

I fight too, but I'm slowly losing the battle. Luckily, it's just my private stuff. I run a distributed FW to guard all my servers at once:

fwcli> stats rules
946 rules in DB

All subnets are /24 or bigger... and they keep coming. I'm slowly starting to think that it's time to pull my basic services off the Internet.

Speaking of residential proxies, I wrote up a summary of the last Security Track (and ran it by the participants): https://hydrolix.io/blog/residential-criminal-proxies/?utm_source=smlk

Regards,
Krassi

On Jul 16, 2025, at 9:48 AM, Andrew Latham via NANOG <nanog@lists.nanog.org> wrote:
2. What tools for response rate limiting deal with bots/scrapers that cycle over a large variety of IPs with the exact same user agent?
If the bots are impersonating real browser User-Agents, and you use something like ModSecurity that can examine HTTP headers, you can look at a few requests and probably find that they send or omit things compared to real browsers.

Today, for example, I blocked some of the requests from a botnet that often sends this pair of headers:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36
Sec-Ch-Ua-Platform: "macOS"

Note the mismatch of "Windows NT" vs. "macOS": it appears the bot randomizes "Sec-Ch-Ua-Platform" but not the "User-Agent", so a good percentage of their requests show this mismatch.

Another recent high-volume botnet impersonating Chrome/134 is sending this header:

Referrer: https://www.google.com/

[sic]: They forgot to misspell "Referer".

Most botnets I look at have multiple "tells" like this in the HTTP headers. You have to be mindful to avoid false positives from proxies that mess with headers, but it's otherwise an effective way to block them and stop them from consuming CPU time.

Whether this is worth your time is a different matter. It's worth mine because we host thousands of sites, but I probably wouldn't waste the effort on it if it was just my own site, unless the botnet was making the site not work.

-- Robert L Mathews
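A rough Python sketch of checking for those two tells, assuming the request headers are available as a dict; the header names match the examples above, while the platform-to-UA mapping and the return format are made up for illustration.

# Flag requests whose headers show the "tells" described above:
# 1) the Sec-Ch-Ua-Platform client hint doesn't match the OS in the User-Agent
# 2) a misspelled "Referrer" header is present (real browsers send "Referer")
def suspicious_headers(headers: dict) -> list:
    reasons = []
    ua = headers.get("User-Agent", "")
    platform = headers.get("Sec-Ch-Ua-Platform", "").strip('"')

    # Substrings a genuine UA string would contain for each claimed platform.
    expected = {"Windows": "Windows NT", "macOS": "Macintosh",
                "Linux": "Linux", "Android": "Android"}
    if platform in expected and expected[platform] not in ua:
        reasons.append(f"UA/platform mismatch: {platform!r} vs {ua[:40]!r}")

    if "Referrer" in headers:
        reasons.append('misspelled "Referrer" header')

    return reasons

# The header pair quoted above trips the first check:
# suspicious_headers({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; ...) Chrome/129.0.0.0 ...",
#                     "Sec-Ch-Ua-Platform": '"macOS"'})
# -> ['UA/platform mismatch: ...']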

Robert

This is a good observation. It has been a decade or two since I worked in the hosting world. Back in those days I think I did something like an iptables --recent match on a high number of new connections with a 5-minute block.

I have made several observations and developed a few ideas. This whole process is like the telemarketer torture systems we discussed back in the Asterisk project. Some of my thinking is around:

A. What is a search bot versus an AI bot versus a vulnerability scanner versus an email address scraper? (I know, but documenting it has shown the filter issues.)
B. What is the CPU and logging resource usage, and where is the balance on inspection?
C. Are there dynamic lists like Spamhaus DROP for AI scrapers?
D. For or against AI scraper bots?
E. Response rate limiting options versus drop or reject.
F. Scan depth issues. (Gitea instance and per-commit diff getting scanned.)
-- - Andrew "lathama" Latham -
participants (13)
- Andrew Latham
- borg@uu3.net
- Chris Adams
- Compton, Rich
- Constantine A. Murenin
- Eric Kuhnke
- Jay Acuna
- maillists@krassi.biz
- Marco Moock
- Robert L Mathews
- Ryland Kremeier
- Tom Beecher
- William Herrin