I just had an issue with a web server where I had to block a /18 of a large scraper. I have some topics I could use some input on.

1. What tools or setups have people found most successful for dealing with bots/scrapers that do not respect robots.txt, for example?
2. What tools for response rate limiting deal with bots/scrapers that cycle over a large variety of IPs with the exact same user agent?
3. Has anyone written or found a tool to concentrate IP addresses into networks for IPTABLES or NFT? (60% of IPs for network X in list, so add network X and remove individual IP entries.)

--
- Andrew "lathama" Latham -
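On question 3, the aggregation rule described there can be sketched in a few lines of Python with the standard `ipaddress` module. The prefix length and the hit threshold below are tunable assumptions, not part of the original question:

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

def aggregate(ips, prefix=24, min_hits=8):
    """Collapse individual IPv4 addresses into covering networks.

    When at least `min_hits` blocked addresses fall inside the same
    /prefix network, emit the network instead of the individual
    entries (one nft/iptables rule for the whole net). Tune min_hits
    to taste, e.g. as a percentage of the addresses seen.
    """
    buckets = defaultdict(list)
    for ip in ips:
        net = ip_network(f"{ip}/{prefix}", strict=False)
        buckets[net].append(ip_address(ip))
    out = []
    for net, members in buckets.items():
        if len(members) >= min_hits:
            out.append(str(net))
        else:
            out.extend(str(m) for m in members)
    return sorted(out)
```

The output list can be fed straight into an nft set or ipset; sparse networks stay as host entries.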
On 16.07.2025 at 10:48:39, Andrew Latham via NANOG wrote:
1. What tools or setups have people found most successful for dealing with bots/scrapers that do not respect robots.txt for example?
Place a link to a file that is hidden from normal visitors. Exclude the directory via robots.txt. Then use fail2ban to block all IP addresses that poll the file.

--
Regards
Marco
Send unsolicited bulk mail to 1752655719muell@cartoonies.org
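A minimal sketch of Marco's trap as fail2ban configuration, assuming nginx logging in the default combined format and an invented `/trap/` path excluded via robots.txt (all file names and thresholds here are placeholders):

```ini
; /etc/fail2ban/filter.d/honeypot.conf  (hypothetical file name)
[Definition]
failregex = ^<HOST> .*"GET /trap/

; /etc/fail2ban/jail.d/honeypot.local  (hypothetical file name)
[honeypot]
enabled  = true
filter   = honeypot
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

With `maxretry = 1`, a single request for the hidden path is enough to ban the source for a day.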
Once upon a time, Marco Moock <mm@dorfdsl.de> said:
Place a link to a file that is hidden to normal people. Exclude the directory via robots.txt.
Then use fail2ban to block all IP addresses that poll the file.
The problem with a lot of the "AI" scrapers is that they're apparently using botnets and will often only make a single request from a given IP address, so reactive blocking doesn't work (and it can cause other issues, like trying to block 100,000 IPs, which fail2ban for example doesn't really handle well).

--
Chris Adams <cma@cmadams.net>
Chris

Spot on, and I am getting the feeling this is where the value of a geo-ip service that offers defined "eyeball networks" to allow comes into play.
-- - Andrew "lathama" Latham -
As Chris states, broad IP-based blocking is unlikely to be very effective, and likely more problematic down the line anyway.

For the slightly more 'honorable' crawlers, they'll respect robots.txt, and you can block their UAs there.

Fail2ban is a very good option right now. It will be even better if Nepenthes eventually integrates with it. Then you can have some real fun.
I've thought about using client-side tokens with behavioral analysis and a points system, but haven't implemented it on one of my sites to test its reliability just yet.

Basically: issue a signed token to each client and watch that client using JS instantiated on each page of the site. Give actions (or lack thereof) scores based on the likelihood of being a bot, and fail2ban them past a certain threshold. Look at mouse movement, scroll depth, keypress timings, page transitions, dwell time, JS events, start/end routes, etc. For example:

- client is on a page for under 100 ms: +10 pts
- no EventListener JS events: +40 pts
- client joins the site at a nested page and drills down: +5 pts (low because people could bookmark a specific page that triggers this)
- etc.

Then decay these points when the opposite behaviors happen. There's no guaranteed solution, and this would require careful tweaking of the ban threshold with real testing to not accidentally block your users, but I feel it could mitigate a lot of bot situations. For banned users, just redirect them to a page with a little unban form they can quickly fill out.

--
Ryland
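The point-scoring idea described above can be sketched as follows; the signal names, weights, and threshold are illustrative assumptions, not a tested policy:

```python
from dataclasses import dataclass

# Hypothetical weights; tune against real traffic before enforcing.
SIGNALS = {
    "dwell_under_100ms": 10,
    "no_js_events": 40,
    "deep_entry_page": 5,   # low: could just be a bookmark
}
HUMAN_DECAY = 5             # points forgiven per human-like action
BAN_THRESHOLD = 50

@dataclass
class Client:
    """Per-token bot score, accumulated from client-side telemetry."""
    score: int = 0

    def observe(self, signal: str) -> None:
        # Unknown signals score zero, so new telemetry is safe to send.
        self.score += SIGNALS.get(signal, 0)

    def human_action(self) -> None:
        # Decay on human-like behavior (mouse movement, scrolling, ...).
        self.score = max(0, self.score - HUMAN_DECAY)

    @property
    def should_ban(self) -> bool:
        return self.score >= BAN_THRESHOLD
```

Once `should_ban` trips, the client's current IP could be handed to fail2ban or redirected to an unban form, as suggested.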
I don’t have experience with it, but I was pointed to this project recently, which may be of interest to you.

“Anubis is a Web AI Firewall Utility that weighs the soul of your connection <https://en.wikipedia.org/wiki/Weighing_of_souls> using one or more challenges in order to protect upstream resources from scraper bots.”

https://github.com/TecharoHQ/anubis
There is what we could call a "growth market" of grey-market organizations and individuals selling residential proxies that route traffic through actual residential cable modem, DSL, and FTTH connections at people's houses.

Usually this is implemented in one of two ways: a router/home gateway device that's been pwned and is part of a botnet, or tricking people into installing some form of proxy software on one of their persistently connected home computers.

They're used for bot scraping and also for more mundane credit card fraud and various scams: providing people access to geo-restricted online casinos, weird hawala-adjacent online money/cryptocurrency exchange and transfer platforms. Just a vast array of shady stuff. Google "residential proxies for sale" to see the tip of the iceberg.
On Wed, Jul 16, 2025 at 11:57 AM Chris Adams via NANOG <nanog@lists.nanog.org> wrote:
The problem with a lot of the "AI" scrapers is that they're apparently using botnets and will often only make a single request from a given IP address, so reactive blocking doesn't work
So do reactive un-blocking instead. The first request from any IP address gets an empty web page with "<meta http-equiv=refresh content=1>" and triggers a permit rule which allows subsequent requests to reach the actual web server and content. That way the one-shot sources get an empty page no matter what URL they request.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
On Wed, Jul 16, 2025 at 1:57 PM Chris Adams via NANOG <nanog@lists.nanog.org> wrote:
The problem with a lot of the "AI" scrapers is that they're apparently using botnets and will often only make a single request from a given IP address, so reactive blocking doesn't work (and can cause other issues,
Append a canary ID to all URLs displayed on the page. For example:

https://example.com/example.html?visitor=1234ABCDEF&signature=XYZ

Upon receiving a page request that is missing a valid "visitor=" tag, or where the visitor's IP address does not match the IP address linked to that visitor tag: create a new visitor tag in the database and return an empty page with a 302 redirect sending the visitor back to the homepage with the new tag added, refusing to display the individual page requested until they click a link provided by the website. Do the same if the signature= attribute is missing or fails to verify.

The signature attribute is an HMAC which authenticates that the combination of the URL path and visitor ID came from a page displayed by the web server and has not been altered by the client. For example, they cannot simply learn their client ID and append it on their own to https://example.com/example.html; they need the unique signature= added to the link by the web server in order to have access to the example.html page.

--
-JA
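A minimal sketch of that signing scheme. The secret handling and function names are assumptions; a real deployment would persist the secret across restarts and also bind visitor IDs to IP addresses as described:

```python
import hashlib
import hmac
import secrets

# Per-deployment server secret (hypothetical: regenerate-on-start
# shown here for brevity; persist it in practice).
SECRET = secrets.token_bytes(32)

def sign(path: str, visitor: str) -> str:
    """Return the signed URL the server embeds in every displayed link."""
    mac = hmac.new(SECRET, f"{path}|{visitor}".encode(), hashlib.sha256)
    return f"{path}?visitor={visitor}&signature={mac.hexdigest()}"

def verify(path: str, visitor: str, signature: str) -> bool:
    """True only if this path+visitor pair was signed by this server."""
    mac = hmac.new(SECRET, f"{path}|{visitor}".encode(), hashlib.sha256)
    return hmac.compare_digest(mac.hexdigest(), signature)
```

Because the HMAC covers both the path and the visitor ID, a scraper cannot reuse a signature it learned on one page to fetch a different one.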
Yes Marco, that is one of the tactics I see for dealing with this. Fail2ban is a great piece of software that I use.

This reminded me of another related topic: security scans. Any request for WordPress could be an easy way to flag and block with fail2ban when WordPress is not in use.
-- - Andrew "lathama" Latham -
On 16.07.2025 13:33 Andrew Latham <lathama@gmail.com> wrote:
Any requests for wordpress could be an easy way to flag and block with fail2ban when wordpress is not in use.
I've added various paths and user agents to my ban filter. If somebody is interested, let me know.

--
Kind regards
Marco
Send spam to abfall1752665585@stinkedores.dorfdsl.de
On Wed, 16 Jul 2025 at 14:33, Andrew Latham via NANOG <nanog@lists.nanog.org> wrote:
related topic. Security Scans. Any requests for wordpress could be an easy way to flag and block with fail2ban when wordpress is not in use.
For WordPress and PHP, I think it's simply easier to catch these scenarios with an nginx config and cheaply return errors from the front-end webserver, without wasting any of the real backend's resources.

C.
Constantine

Good call there. I need to investigate the 404 responses to see if there are any improvements to be made.
-- - Andrew "lathama" Latham -
Hi Andrew,

Yes, you could use something like the following in nginx.conf:

    location ^~ /wp- {
        return 444;
    }

The `^~` modifier ensures that the regex locations will not be checked. The 444 return is a special nginx code that closes the connection without sending a response, which may tie up the resources of the bot doing the scans.

References:
* http://nginx.org/r/location
* http://nginx.org/r/return

Best regards,
Constantine.
Hi,

Honestly, the best and safest way to combat this is to ensure that it's very cheap and fast for you to generate the pages that the bots request; that way, you wouldn't really care whether they request said pages or not.

For example, a lot of software invalidly sets random/useless cookies for literally no reason. This prevents caching. What you could do is strip out all of the cookies in both directions, and then simply cache the generic responses for the generic requests that your backend generates. This can be done with standard OSS nginx by clearing the Cookie and Set-Cookie headers.

If you really want to limit requests by User-Agent instead of by REMOTE_ADDR like it's normally done, standard OSS nginx can do that too; see http://nginx.org/r/limit_req_zone and nginx.org/r/$http_ for `$http_user_agent`, although you might inadvertently blacklist popular browsers this way.

C.
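A hedged sketch of that cookie-stripping cache in standard OSS nginx; the cache zone name, backend address, and TTLs below are placeholders, not from the thread:

```nginx
# Hypothetical names/paths; adjust to your setup.
proxy_cache_path /var/cache/nginx keys_zone=pages:10m inactive=10m;

server {
    listen 80;
    location / {
        proxy_pass http://127.0.0.1:8080;   # your real backend
        proxy_cache pages;
        proxy_cache_valid 200 5m;
        proxy_set_header Cookie "";          # strip cookies toward the backend
        proxy_ignore_headers Set-Cookie;     # cache responses despite Set-Cookie
        proxy_hide_header Set-Cookie;        # strip cookies toward the client
    }
}
```

With this in front, repeated bot fetches of the same generic page are served from the cache and never reach the backend.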
I fight too, but I'm slowly losing the battle. Luckily, it's just my private stuff. I run a distributed FW to guard all my servers at once:

fwcli> stats rules
946 rules in DB

All subnets are /24 or bigger... and they keep coming. I'm slowly starting to think that it's time to pull my basic services off the Internet.
Speaking of residential proxies, I wrote up a summary of the last Security Track (and ran it by the participants): https://hydrolix.io/blog/residential-criminal-proxies/?utm_source=smlk Regards, Krassi
On Jul 16, 2025, at 9:48 AM, Andrew Latham via NANOG <nanog@lists.nanog.org> wrote:
2. What tools for response rate limiting deal with bots/scrapers that cycle over a large variety of IPs with the exact same user agent?
If the bots are impersonating real browser User-Agents, and you use something like ModSecurity that can examine HTTP headers, you can look at a few requests and probably find that they send or omit things compared to real browsers.

Today, for example, I blocked some of the requests from a botnet that often sends this pair of headers:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36
Sec-Ch-Ua-Platform: "macOS"

Note the mismatch of "Windows NT" vs. "macOS": it appears the bot randomizes "Sec-Ch-Ua-Platform" but not the "User-Agent", so a good percentage of their requests show this mismatch.

Another recent high-volume botnet impersonating Chrome/134 is sending this header:

Referrer: https://www.google.com/

[sic]: they forgot to misspell "Referer".

Most botnets I look at have multiple "tells" like this in the HTTP headers. You have to be mindful to avoid false positives from proxies that mess with headers, but it's otherwise an effective way to block them and stop them from consuming CPU time.

Whether this is worth your time is a different matter. It's worth mine because we host thousands of sites, but I probably wouldn't waste the effort on it if it was just my own site, unless the botnet was making the site not work.

--
Robert L Mathews
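For instance, the Windows-vs-macOS mismatch described above could be caught with a chained ModSecurity rule along these lines (the rule id is a placeholder; test in detection-only mode against your own traffic before denying, since proxies can mangle headers):

```apache
# Hypothetical rule id; pick one from your local id range.
# Deny requests whose User-Agent claims Windows while the
# Sec-Ch-Ua-Platform client hint claims macOS.
SecRule REQUEST_HEADERS:User-Agent "@contains Windows NT" \
    "id:1000101,phase:1,deny,status:403,log,\
    msg:'UA vs Sec-Ch-Ua-Platform mismatch',chain"
    SecRule REQUEST_HEADERS:Sec-Ch-Ua-Platform "@contains macOS"
```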
Robert

This is a good observation. It has been a decade or two since I worked in the hosting world. Back in those days I think I did something like an IPTABLES --recent rule on a high number of new connections, with a 5-minute block.

I have made several observations and developed a few ideas. This whole process is like the telemarketer torture systems we discussed back in the Asterisk project.

Some of my thinking is around:

A. What is a search bot versus an AI bot versus a vulnerability scanner versus an email address scraper (I know, but documenting it has shown the filter issues)
B. What is the CPU and logging resource usage, and where is the balance on inspection
C. Are there dynamic lists of AI scrapers, like SpamHaus DROP?
D. For or against AI scraper bots?
E. Response rate limiting options versus drop or reject
F. Scan depth issues (Gitea instance and per-commit diffs getting scanned)
-- - Andrew "lathama" Latham -
This a.m., dealing with scrapers coming out of TOR nodes with user agents of Windows 95 and Windows 98, which is so funny I just have to share...
-- - Andrew "lathama" Latham -
On 21.03.2026 at 09:44:37, Andrew Latham via NANOG wrote:
This am, dealing with scrapers coming out of TOR nodes with a user-agent of Windows 95 and Windows 98 which is so funny I just have to share...
What other parts does it have? I doubt those are real Win9x machines, even if the Tor expert bundle claimed to support Win98 long after its EoL.

--
Kind regards
Marco
Send unsolicited bulk mail to 1774082677muell@cartoonies.org
Get a small version of a very old, very fast, very inaccurate LLM. Have it generate a couple of terabytes of endless nonsense. Redirect scrapers to it, and poison whatever LLM they are trying to train.

Andrew
Andrew

The issue is that it is hitting a Gitea instance and a MediaWiki instance, then getting lost in the commit/change diff system. I should just figure out how to disable the diff tools on Gitea and MediaWiki to keep the bots from going down a rabbit hole on a decade or more of commits and page edits. These are CPU intensive, which is my issue at the moment.
-- - Andrew "lathama" Latham -
Ron Guilmette / rfg wrote this perl tool he called wpoison back in the 90s that’d feed junk email addresses to spambots. If a 90s Perl script can do this i seriously doubt we need an LLM of any sort. Why waste all that compute just to give scrapers indigestion? --srs ________________________________ From: Andrew Kirch via NANOG <nanog@lists.nanog.org> Sent: Saturday, March 21, 2026 9:23:00 PM To: North American Network Operators Group <nanog@lists.nanog.org> Cc: Andrew Kirch <trelane@trelane.net> Subject: Re: Correctly dealing with bots and scrapers. Get a small version of a very old very fast very inaccurate LLM. Have it generate a couple terabytes of endless nonsense. Redirect scrapers to it, and poison whatever LLM they are trying to train. Andrew On Wed, Jul 16, 2025 at 12:49 PM Andrew Latham via NANOG < nanog@lists.nanog.org> wrote:
On 21.03.2026 at 15:59:38, Suresh Ramasubramanian via NANOG wrote:
Ron Guilmette / rfg wrote this perl tool he called wpoison back in the 90s that’d feed junk email addresses to spambots.
More interesting is what happens when such bots are presented with spamtrap addresses nowadays.
--
Regards, Marco
Send unsolicited bulk mail to 1774105178muell@cartoonies.org
Marco

A small set from the last few minutes:

Mozilla/5.0 (Windows 95; et-EE; rv:1.9.1.20) Gecko/4223-11-01 05:21:19.603653 Firefox/8.0
Mozilla/5.0 (Windows; U; Windows 95) AppleWebKit/532.39.7 (KHTML, like Gecko) Version/4.0 Safari/532.39.7
Opera/9.58.(Windows 95; wae-CH) Presto/2.9.165 Version/12.00
Mozilla/5.0 (compatible; MSIE 6.0; Windows 95; Trident/5.1)
Mozilla/5.0 (compatible; MSIE 5.0; Windows 95; Trident/4.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows 95; Trident/4.0)
Mozilla/5.0 (compatible; MSIE 6.0; Windows 95; Trident/5.0)

On Sat, Mar 21, 2026 at 9:52 AM Marco Moock via NANOG <nanog@lists.nanog.org> wrote:
On 21.03.2026 at 09:44:37, Andrew Latham via NANOG wrote:
This a.m. I am dealing with scrapers coming out of TOR nodes with a user-agent of Windows 95 and Windows 98, which is so funny I just have to share...
Which other parts does it have?
I doubt those are real Win9x machines, even if the Tor expert bundle said it supported Win 98 long after its EoL.
--
kind regards
Marco
Send unsolicited bulk mail to 1774082677muell@cartoonies.org
-- - Andrew "lathama" Latham -
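For the specific pattern above (ancient Windows user agents hitting a TLS-only site), a trivial check works, because no Win9x/NT4-era stack can negotiate modern TLS. A sketch, with a regex I made up to cover the versions named in this thread:

```python
import re

# Any UA claiming a Win9x/CE/NT4/NT5-era OS on an HTTPS-only site is
# almost certainly forged: those stacks cannot speak modern TLS.
# Version list is illustrative, taken from the UAs posted above.
ANCIENT_WINDOWS = re.compile(r"Windows (95|98|CE|NT [45])\b")

def implausible_ua(user_agent: str) -> bool:
    return bool(ANCIENT_WINDOWS.search(user_agent))
```

Feeding access-log UAs through this before handing IPs to a blocker avoids maintaining an exact-match blocklist.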
Modern scrapers will eventually move on. Generating a large amount of AI hallucinations with a crappy LLM will do the maximum damage before they do.
Andrew
On Sat, Mar 21, 2026 at 11:59 AM Suresh Ramasubramanian <ops.lists@gmail.com> wrote:
Ron Guilmette / rfg wrote this perl tool he called wpoison back in the 90s that’d feed junk email addresses to spambots.
If a 90s Perl script can do this i seriously doubt we need an LLM of any sort. Why waste all that compute just to give scrapers indigestion?
--srs
Also, synthetic data leads to premature model collapse in even the best underlying models. You could use a very low-resolution LLM for the poison task; it could probably run on a Pi 5.
/mike
On 21 Mar 2026, at 16:08, Andrew Kirch via NANOG <nanog@lists.nanog.org> wrote:
Modern scrapers will eventually move on. Generating a large amount of AI hallucinations with a crappy LLM will do the maximum damage before it goes.
Andrew
More often than not, gibberish formatted as email addresses.
--srs
On Saturday, March 21, 2026 at 9:33 PM, Marco Moock via NANOG <nanog@lists.nanog.org> wrote:
Ron Guilmette / rfg wrote this perl tool he called wpoison back in the 90s that’d feed junk email addresses to spambots.
More interesting is what happens if such bots are presented spamtrap addresses nowadays.
--
Regards, Marco
It appears that Andrew Kirch via NANOG <nanog@lists.nanog.org> said:
Get a small version of a very old very fast very inaccurate LLM. Have it generate a couple terabytes of endless nonsense.
See https://www.web.sp.am/

Feel free to point stuff at it; it's a tiny overloaded VPS that has nothing better to do. You may think it's hopelessly lame, but see https://jl.ly/Internet/scrapeup.html: Anthropic's bot has visited it over 48 million times since November.

R's,
John
Update

Many software projects have solutions for this; I was not searching for the right thing. The term to search for is "expensive", as in
https://github.com/go-gitea/gitea/blob/main/custom/conf/app.example.ini#L786... from https://github.com/go-gitea/gitea/issues/33966

I am also finding solutions for other software I use, along the lines of requiring sign-in for paths that have computational costs. It was a shower thought to require sign-in to read my Gitea instance, and when I found the setting I was very happy.

My sites use TLS (80-to-443 redirect), so I also banned user agents claiming Windows 95 (4k hits per hour), Windows 98 (9k hits per hour), Windows CE (400 hits per hour), Windows NT 4 (4k hits per hour), and Windows NT 5 (5k hits per hour), to name a few.

Next up: make sense of the Apple OS versions and which would most likely be able to reach TLS endpoints.

P.S. I know people mean well, but no off-list emails please.

On Sat, Mar 21, 2026 at 9:59 AM Andrew Latham <lathama@gmail.com> wrote:
Andrew
The issue is that it is hitting a gitea instance and a mediawiki instance, then getting lost in the commit/change diff system. I should just figure out how to disable the diff tools on Gitea and MediaWiki to keep the bots from going down a rabbit hole on a decade or more of commits and page edits.
These are CPU intensive, which is my issue at the moment.
-- - Andrew "lathama" Latham -
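That user-agent ban can be sketched as an nginx map; this is a hedged illustration (variable names and the 403 response are mine, not Andrew's actual config):

```nginx
# Flag user agents claiming ancient Windows versions; on an HTTPS-only
# site these cannot be genuine browsers. The version list mirrors the
# ones banned upthread and is illustrative, not exhaustive.
map $http_user_agent $ancient_ua {
    default          0;
    "~Windows 95"    1;
    "~Windows 98"    1;
    "~Windows CE"    1;
    "~Windows NT 4"  1;
    "~Windows NT 5"  1;
}

server {
    listen 443 ssl;
    # ... certificate and site configuration ...
    if ($ancient_ua) {
        return 403;
    }
}
```

Using `map` keeps the regex matching in one place and evaluates lazily per request, which scales better than stacking `if` blocks per pattern.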
I’ve been collecting lots of bot traffic and other naughty things and have some automated tooling that reads and updates my nginx files. Feel free to check out my GitHub repository for IoCs, signatures, and mitigations: https://github.com/trumb/nginx-hardening

I’m also experimenting with a Claude Code plugin that does something similar, but it’s a work in progress: https://github.com/trumb/claude-nginx-hardening
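Question 3 from the original post (promote a covering network once enough of its addresses appear in the blocklist, then drop the individual entries) can be sketched with Python's ipaddress module. The 60% threshold and the grouping prefix are parameters; emitting actual nft/iptables rules is left out:

```python
import ipaddress
from collections import defaultdict

def concentrate(ips, prefixlen=24, threshold=0.6):
    """Collapse individual IPv4 addresses into covering networks.

    If at least `threshold` (as a fraction) of a /prefixlen network's
    addresses appear in `ips`, emit the network instead of its member
    addresses; everything else stays an individual /32.
    """
    buckets = defaultdict(set)
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        net = ipaddress.ip_network(f"{addr}/{prefixlen}", strict=False)
        buckets[net].add(addr)
    result = []
    for net, members in buckets.items():
        if len(members) / net.num_addresses >= threshold:
            result.append(net)  # promote: network replaces members
        else:
            result.extend(ipaddress.ip_network(f"{a}/32")
                          for a in sorted(members))
    return sorted(result,
                  key=lambda n: (int(n.network_address), n.prefixlen))
```

The output list maps directly onto an nft set or iptables rules; running it periodically over a fail2ban-style log keeps the ruleset small, which addresses Chris's point about 100,000 individual entries.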
participants (18)
- Andrew Kirch
- Andrew Latham
- borg@uu3.net
- Chris Adams
- Compton, Rich
- Constantine A. Murenin
- Eric Kuhnke
- Jay Acuna
- John Levine
- maillists@krassi.biz
- Marco Moock
- Maurice Brown
- Mike Simpson
- Robert L Mathews
- Ryland Kremeier
- Suresh Ramasubramanian
- Tom Beecher
- William Herrin