
On Wed, 2 Jul 2025 at 05:50, niels=nanog--- via NANOG <nanog@lists.nanog.org> wrote:
> * Constantine A. Murenin [Wed 02 Jul 2025, 05:23 CEST]:
> > But the bots are not a problem if you're doing proper caching and throttling.
>
> Have you been following the news at all lately? Website operators are complaining left and right about the load from scrapers related to AI companies. They're seeing 10x, 100x the normal visitor load, with not just User-Agents but also source IP addresses masked to present as regular visitors. Captchas are unfortunately one of the more visible ways to address this, even if not perfect.
>
> For example, https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-for...
That article describes a classic case of impedance mismatch, which is an engineering issue. It also fails to provide any actual engineering details, beyond the fact that it's about a git webservice, and, more specifically, about a self-hosted instance of Gitea.

Git requires a LOT of resources and is NOT web scale. It is, in fact, super-fast for its tasks when used LOCALLY, compared to CVS/SVN, but serving it to anonymous web visitors is another matter entirely. Putting git onto the web, for anonymous access, using a tool optimised for local access, without doing any of the most basic caching or rate limiting with nginx or the like, as well as on the app itself, is the cause of the issue here. Adjusting robots.txt is NOT a "standard defensive measure" in this case; using the rate limits in nginx would be (a rough sketch is in the P.S. below).

If your website only has 10 active daily users, with each visit lasting only a few minutes, of course 100 daily bots using the site 24/7 will bring the site down. The solution? Maybe have more than 10 daily users so you don't have to worry about the 100 bots? Rate limiting with nginx isn't even mentioned. Why is the website so inefficient that the load of what is presumably just 100 users can bring it down? Why are you even running the website yourself instead of using GitHub for public access in that case?

On another note, about this Anubis spam…

The most ridiculous recent adoption of Anubis was on www.OpenWrt.org. It's literally a "static" website with some 230 pages (ironically, they haven't adopted it on forum.openwrt.org as of now). They've been using Anubis on all pages of www.openwrt.org, even on the front page. Why? The site literally changes like twice a year, and might as well simply be cached in its entirety on a daily basis without any external users noticing a thing. That would be the correct solution (the second sketch in the P.S. shows one way to do it):

* Not logged into the Wiki? You get last day's cache.
* Logged in? Fresh data.

And, I mean, how inefficient does your wiki have to be that 100 simultaneous users could bring the site down? What's the purpose of having Anubis again?

C.
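
P.S. A minimal sketch of the kind of nginx rate limiting meant above, assuming a self-hosted Gitea instance proxied on its default port 3000; the hostname and the numbers are made-up placeholders, to be tuned to the site's real traffic:

    # http{} context: one token bucket per client address, 5 requests/second
    limit_req_zone $binary_remote_addr zone=gitweb:10m rate=5r/s;

    server {
        listen 80;
        server_name git.example.org;            # placeholder name

        location / {
            # Allow short bursts, then answer anything faster with 503,
            # which is what a scraper hammering every repo URL will see.
            limit_req zone=gitweb burst=20 nodelay;
            proxy_pass http://127.0.0.1:3000;   # Gitea's default port
        }
    }

With something like that in place, a misbehaving crawler gets cheap 503s from nginx instead of making Gitea spawn git processes for every request.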
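
And a sketch of the "anonymous visitors get last day's cache, logged-in users get fresh data" approach, again in nginx, assuming the wiki sets some session cookie on login; the cookie name, upstream port and paths are placeholders, not the actual OpenWrt setup:

    # http{} context: a 1 GB on-disk cache, entries kept for up to a day
    proxy_cache_path /var/cache/nginx/wiki levels=1:2 keys_zone=wiki:10m
                     max_size=1g inactive=24h;

    server {
        listen 80;
        server_name www.example.org;            # placeholder name

        location / {
            proxy_cache       wiki;
            proxy_cache_valid 200 301 24h;      # anonymous pages live for a day

            # "SESSIONID" stands in for whatever login cookie the wiki
            # actually sets; logged-in users bypass the cache entirely.
            proxy_cache_bypass $cookie_SESSIONID;
            proxy_no_cache     $cookie_SESSIONID;

            # Keep serving the stale copy while one request refreshes it,
            # so a crawler burst never piles up on the wiki backend.
            proxy_cache_use_stale error timeout updating;
            proxy_cache_lock on;

            proxy_pass http://127.0.0.1:8080;   # placeholder upstream
        }
    }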