
On 8/07/25 03:50, Brandon Martin via NANOG wrote:
> On 7/7/25 13:46, Dan Lowe via NANOG wrote:
>> For the crawler mitigations I've personally been involved with, this would not have worked. The source IPs numbered at least in the tens of thousands. They didn't repeat; each source IP made one request and was never seen again. They didn't aggregate together into prefixes we could filter. They didn't use any common identifier we could find to filter on, including the user-agent (which were valid-looking and randomized).
>>
>> In my case, we were able to simply put 99% of the site behind a login, which mitigated the problem. Many sites don't have that option.
>
> A perhaps interesting question would be how the entities involved in this crawling activity have come to control so much IP space. It doesn't seem like a use-case that readily justifies it.
>
> (Yes, I know that hardly matters)
This is a service you can buy, and it even has different service classes. If you want to launder your traffic through residential addresses, it's expensive, on the order of dollars per GB, because the providers typically have to use real connections at real eyeball ISPs. The more reputable ones pay people to run a proxy tool on their computers, and perhaps also get their own connections under aliases; the less reputable ones rent time on botnets. If you can deal with datacenter addresses, it's much cheaper: one provider's smallest plan, for example, is unlimited use of 1,000 datacenter proxies for $25/month. You can also rent mobile-phone proxies, which cost even more than residential; I expect those operators run through a lot of burner SIM cards in phone farms, in addition to paying people to run the proxy on their legitimate phones.

These services already existed before LLMs and were (and still are) used for a wide variety of purposes against non-cooperative websites. One surprisingly popular category is apparently "sneaker botting": buying out the stock of limited-edition shoes the moment it goes on sale so you can resell the shoes to actual people at higher prices. You can also imagine more beneficial uses, such as scraping Amazon to build historical price charts (bad for Amazon, good for society), and more destructive ones, such as spamming web forms.

AWS Lambda used to be a popular way to get a large pool of IPs (and EC2 a smaller one), but site operators quickly saw that traffic from that network was almost always bots and blocked the entirety of AWS. Blocking all of Comcast or Deutsche Telekom is harder to justify, so they throw up a CAPTCHA instead. That's why you see CAPTCHAs on your home internet connection today: other users of your ISP are running AI scraping proxies for money. (In today's economy, who can blame them?)

The blocking decision is outsourced to yet more third-party companies that compile lists of IP ranges, classify them as datacenter, residential, mobile, Tor exit node, and so on, and assign each a score for how likely it is to be a proxy. Cloudflare probably has its own internal division doing this, but a site like Reddit is likely to query some IP-intelligence company, and if the answer isn't "residential, not a proxy", it serves you a CAPTCHA or severely rate-limits you.

The end game is that almost all addresses get marked "probably a proxy", the signal becomes useless, and half of all HTTP packets get routed across the world three times for no good reason.
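
To make the price gap between the two service classes concrete, here is a back-of-envelope sketch in Python. The $3/GB residential rate and the 500 GB/month crawl volume are assumptions picked purely for illustration; only the $25/month datacenter figure comes from the plan quoted above.

    # Back-of-envelope comparison of the two proxy service classes.
    residential_per_gb = 3.00    # assumed: "on the order of dollars per GB"
    datacenter_flat = 25.00      # quoted: 1,000 datacenter proxies, unlimited use
    crawl_gb_per_month = 500     # assumed monthly crawl volume

    print(f"residential: ${residential_per_gb * crawl_gb_per_month:,.0f}/month")
    print(f"datacenter:  ${datacenter_flat:,.0f}/month")

At those assumed numbers a crawler pays roughly 60x more to look residential, which is why most start from datacenter space and only pay for residential addresses once those ranges get blocked.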
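The "block the entirety of AWS" step is easy to reproduce because AWS publishes every prefix it announces in a machine-readable feed. A minimal sketch, assuming you're willing to refuse everything in that feed (a production setup would load the list into a radix tree or straight into the firewall rather than scan it per request):

    import ipaddress
    import json
    import urllib.request

    # AWS's published list of its IPv4 and IPv6 prefixes.
    AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    def load_aws_networks():
        """Fetch the current set of AWS prefixes from the public feed."""
        with urllib.request.urlopen(AWS_RANGES_URL) as resp:
            data = json.load(resp)
        nets = [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]
        nets += [ipaddress.ip_network(p["ipv6_prefix"]) for p in data["ipv6_prefixes"]]
        return nets

    def is_aws(client_addr, networks):
        """True if the client address falls inside any published AWS prefix."""
        ip = ipaddress.ip_address(client_addr)
        return any(ip in net for net in networks)

    if __name__ == "__main__":
        networks = load_aws_networks()
        # Example addresses; whether the first one matches depends on the
        # feed's contents at the time you run this.
        for client in ("52.94.76.10", "203.0.113.7"):
            print(client, "->", "block (AWS)" if is_aws(client, networks) else "allow")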
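The per-request decision at the end, serve normally, serve a CAPTCHA, or block, based on an IP-reputation lookup, might look roughly like the sketch below. lookup_reputation(), its fields, and the thresholds are all hypothetical stand-ins for whatever commercial feed and policy a given site actually uses.

    from dataclasses import dataclass

    @dataclass
    class Reputation:
        ip_type: str        # "residential", "datacenter", "mobile", "tor", ...
        proxy_score: float  # 0.0 = almost certainly a real user, 1.0 = almost certainly a proxy

    def lookup_reputation(client_addr: str) -> Reputation:
        """Placeholder for a query against a commercial IP-intelligence feed."""
        raise NotImplementedError

    def decide(client_addr: str) -> str:
        rep = lookup_reputation(client_addr)
        if rep.ip_type == "datacenter" or rep.proxy_score > 0.9:
            # Hosting ranges are almost never real readers, so block outright.
            return "block"
        if rep.ip_type != "residential" or rep.proxy_score > 0.5:
            # Can't justify blocking all of Comcast or Deutsche Telekom,
            # so degrade to a CAPTCHA / heavy rate limit instead.
            return "captcha"
        return "serve"

The failure mode I'm describing is visible right in decide(): as the proxy score assigned to ordinary residential space creeps up, more and more legitimate users fall through to the "captcha" branch, until the signal stops meaning anything at all.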