
Robert, this is a good observation. It has been a decade or two since I worked in the hosting world. Back in my hosting days I think I did something like an iptables --recent match on a high number of new connections with a 5-minute block; a rough sketch is below. I have made several observations and developed a few ideas. This whole process is like the telemarketer torture systems we discussed back in the Asterisk project. Some of my thinking is around:

A. What is a search bot versus an AI bot versus a vulnerability scanner versus an email address scraper? (I know, but documenting it has shown the filter issues.)
B. What is the CPU and logging resource usage, and where is the balance on inspection?
C. Are there dynamic lists of AI scrapers, like the Spamhaus DROP list?
D. For or against AI scraper bots?
E. Response Rate Limiting options versus drop or reject.
F. Scan depth issues. (Gitea instance and per-commit diff getting scanned.)
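Something like this, from memory (the port, window, and hitcount are illustrative and would need tuning; the xt_recent module's default list size also caps usable hitcounts):

    # Record each new connection's source IP in a "scrapers" list
    iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate NEW \
        -m recent --name scrapers --set
    # Drop sources with 10+ new connections in the last 300 seconds;
    # --update refreshes the timestamp, so the block holds while they keep trying
    iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate NEW \
        -m recent --name scrapers --update --seconds 300 --hitcount 10 -j DROP

On Fri, Jul 18, 2025 at 4:13 PM Robert L Mathews via NANOG <nanog@lists.nanog.org> wrote: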
On Jul 16, 2025, at 9:48 AM, Andrew Latham via NANOG <nanog@lists.nanog.org> wrote:
2. What tools for response rate limiting deal with bots/scrapers that cycle over a large variety of IPs with the exact same user agent?
If the bots are impersonating real browser User-Agents, and you use something like ModSecurity that can examine HTTP headers, you can look at a few requests and probably find that they send or omit headers compared to real browsers.
Today, for example, I blocked some of the requests from a botnet that often sends this pair of headers:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36
Sec-Ch-Ua-Platform: "macOS"
Note the mismatch of "Windows NT" vs. "macOS": it appears the bot randomizes "Sec-Ch-Ua-Platform" but not the "User-Agent", so a good percentage of their requests show this mismatch.
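In ModSecurity terms, a chained rule along these lines catches it (an untested sketch; the rule id is arbitrary):

    # Deny requests whose User-Agent claims Windows while the
    # Sec-Ch-Ua-Platform client hint claims macOS
    SecRule REQUEST_HEADERS:User-Agent "@contains Windows NT" \
        "id:1000101,phase:1,deny,status:403,log,\
        msg:'User-Agent/Sec-Ch-Ua-Platform OS mismatch',chain"
        SecRule REQUEST_HEADERS:Sec-Ch-Ua-Platform "@contains macOS" "t:none"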
Another recent high-volume botnet impersonating Chrome/134 is sending this header:
Referrer: https://www.google.com/
[sic]: They forgot to misspell "Referer".
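Since no genuine browser sends a header actually named "Referrer", a header-name check is enough (again an untested sketch with an arbitrary id):

    # The standard header is the historically misspelled "Referer";
    # a literal "Referrer" header is a strong bot tell
    SecRule REQUEST_HEADERS_NAMES "@rx (?i)^referrer$" \
        "id:1000102,phase:1,deny,status:403,log,\
        msg:'Bogus Referrer header'"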
Most botnets I look at have multiple "tells" like this in the HTTP headers. You have to be mindful to avoid false positives from proxies that mess with headers, but it's otherwise an effective way to block them and stop them from consuming CPU time.
Whether this is worth your time is a different matter. It's worth mine because we host thousands of sites, but I probably wouldn't waste the effort on it if it was just my own site, unless the botnet was making the site not work.
--
Robert L Mathews
--
- Andrew "lathama" Latham -