
On Wed, 2 Jul 2025 at 05:50, niels=nanog--- via NANOG <nanog@lists.nanog.org> wrote:
> * Constantine A. Murenin [Wed 02 Jul 2025, 05:23 CEST]:
> > But the bots are not a problem if you're doing proper caching and throttling.
>
> Have you been following the news at all lately? Website operators are complaining left and right about the load from scrapers related to AI companies. They're seeing 10x, 100x the normal visitor load, with not just User-Agents but also source IP addresses masked to present as regular visitors. Captchas are unfortunately one of the more visible ways to address this, even if not perfect.
>
> For example, https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-for...
That article describes a classic case of impedance mismatch, which is an engineering issue. It also fails to provide any actual engineering details, beyond the fact that it's about a git webservice, and, more specifically, about a self-hosted instance of Gitea.

Git requires a LOT of resources and is NOT web scale. It is, in fact, super-fast for its tasks when used LOCALLY, compared to CVS/SVN, but serving it to anonymous web visitors is another matter entirely. Putting git onto the web, for anonymous access, using a tool optimised for local access, without doing any of the most basic caching or rate limiting with nginx or the like, as well as on the app itself, is the cause of the issue here. Adjusting robots.txt is NOT a "standard defensive measure" in this case; using the rate limits in nginx would be (a rough sketch is in the P.S. below).

If your website only has 10 active daily users, with each visit lasting only a few minutes, of course 100 daily bots using the site 24/7 will bring the site down. The solution? Maybe have more than 10 daily users so you don't have to worry about the 100 bots? Rate limiting with nginx isn't even mentioned. Why is the website so inefficient that the load of what is presumably just 100 users can bring it down? Why are you even running the website yourself instead of using GitHub for public access in that case?

On another note, about this Anubis spam…

The most ridiculous recent adoption of Anubis was on www.OpenWrt.org. It's literally a "static" website with some 230 pages (ironically, they haven't adopted it on forum.openwrt.org as of now). They've been using Anubis on all pages of www.openwrt.org, even on the front page. Why? The site literally changes like twice a year, and might as well simply be cached in its entirety on a daily basis without any external users noticing a thing. That would be the correct solution (the second sketch in the P.S. shows one way to do it):

* Not logged into the Wiki? You get last day's cache.
* Logged in? Fresh data.

And, I mean, how inefficient does your wiki have to be that 100 simultaneous users could bring the site down? What's the purpose of having Anubis again?

C.
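
P.S. A minimal sketch of the kind of nginx rate limiting meant above, assuming a self-hosted Gitea instance proxied on its default port 3000; the hostname and the numbers are made-up placeholders, to be tuned to the site's real traffic:

    # http{} context: one token bucket per client address, 5 requests/second
    limit_req_zone $binary_remote_addr zone=gitweb:10m rate=5r/s;

    server {
        listen 80;
        server_name git.example.org;            # placeholder name

        location / {
            # Allow short bursts, then answer anything faster with 503,
            # which is what a scraper hammering every repo URL will see.
            limit_req zone=gitweb burst=20 nodelay;
            proxy_pass http://127.0.0.1:3000;   # Gitea's default port
        }
    }

With something like that in place, a misbehaving crawler gets cheap 503s from nginx instead of making Gitea spawn git processes for every request.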
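
And a sketch of the "anonymous visitors get last day's cache, logged-in users get fresh data" approach, again in nginx, assuming the wiki sets some session cookie on login; the cookie name, upstream port and paths are placeholders, not the actual OpenWrt setup:

    # http{} context: a 1 GB on-disk cache, entries kept for up to a day
    proxy_cache_path /var/cache/nginx/wiki levels=1:2 keys_zone=wiki:10m
                     max_size=1g inactive=24h;

    server {
        listen 80;
        server_name www.example.org;            # placeholder name

        location / {
            proxy_cache       wiki;
            proxy_cache_valid 200 301 24h;      # anonymous pages live for a day

            # "SESSIONID" stands in for whatever login cookie the wiki
            # actually sets; logged-in users bypass the cache entirely.
            proxy_cache_bypass $cookie_SESSIONID;
            proxy_no_cache     $cookie_SESSIONID;

            # Keep serving the stale copy while one request refreshes it,
            # so a crawler burst never piles up on the wiki backend.
            proxy_cache_use_stale error timeout updating;
            proxy_cache_lock on;

            proxy_pass http://127.0.0.1:8080;   # placeholder upstream
        }
    }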