
On 8/07/25 03:50, Brandon Martin via NANOG wrote:
> On 7/7/25 13:46, Dan Lowe via NANOG wrote:
>> For the crawler mitigations I've personally been involved with, this would not have worked. The source IPs numbered at least in the tens of thousands. They didn't repeat; each source IP made one request and was never seen again. They didn't aggregate together into prefixes we could filter. They didn't use any common identifier we could find to filter on, including the user-agent (which were valid-looking and randomized).
>>
>> In my case, we were able to simply put 99% of the site behind a login, which mitigated the problem. Many sites don't have that option.
>
> A perhaps interesting question would be how the entities involved in this crawling activity have come to control so much IP space. It doesn't seem like a use-case that readily justifies it.
>
> (Yes, I know that hardly matters)
This is a service you can buy, and it even has different service classes. If you want to launder your traffic through residential addresses, it's expensive, on the order of dollars per GB, because the providers typically have to use real connections at real eyeball ISPs. The more reputable ones pay people to run a proxy tool on their computers, and perhaps also get their own connections under aliases; the less reputable ones rent time on botnets. If you can deal with datacenter addresses, it's much cheaper: one provider's smallest plan, for example, is unlimited use of 1,000 datacenter proxies for $25/month. You can also rent mobile-phone proxies, which cost even more than residential; I expect those operators run through a lot of burner SIM cards in phone farms, in addition to paying people to run the proxy on their legitimate phones.

These services already existed before LLMs and were (and still are) used for a wide variety of purposes against non-cooperative websites. One surprisingly popular category is apparently "sneaker botting": buying out the stock of limited-edition shoes the moment it goes on sale so you can resell the shoes to actual people at higher prices. You can also imagine more beneficial uses, such as scraping Amazon to build historical price charts (bad for Amazon, good for society), and more destructive ones, such as spamming web forms.

AWS Lambda used to be a popular way to get a large pool of IPs (and EC2 a smaller one), but site operators quickly saw that traffic from that network was almost always bots and blocked the entirety of AWS. Blocking all of Comcast or Deutsche Telekom is harder to justify, so they throw up a CAPTCHA instead. That's why you see CAPTCHAs on your home internet connection today: other users of your ISP are running AI scraping proxies for money. (In today's economy, who can blame them?)

The blocking decision is outsourced to yet more third-party companies that compile lists of IP ranges, classify them as datacenter, residential, mobile, Tor exit node, and so on, and assign each a score for how likely it is to be a proxy. Cloudflare probably has its own internal division doing this, but a site like Reddit is likely to query some IP-intelligence company, and if the answer isn't "residential, not a proxy", it serves you a CAPTCHA or severely rate-limits you.

The end game is that almost all addresses get marked "probably a proxy", the signal becomes useless, and half of all HTTP packets get routed across the world three times for no good reason.
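
To make the price gap between the two service classes concrete, here is a back-of-envelope sketch in Python. The $3/GB residential rate and the 500 GB/month crawl volume are assumptions picked purely for illustration; only the $25/month datacenter figure comes from the plan quoted above.

    # Back-of-envelope comparison of the two proxy service classes.
    residential_per_gb = 3.00    # assumed: "on the order of dollars per GB"
    datacenter_flat = 25.00      # quoted: 1,000 datacenter proxies, unlimited use
    crawl_gb_per_month = 500     # assumed monthly crawl volume

    print(f"residential: ${residential_per_gb * crawl_gb_per_month:,.0f}/month")
    print(f"datacenter:  ${datacenter_flat:,.0f}/month")

At those assumed numbers a crawler pays roughly 60x more to look residential, which is why most start from datacenter space and only pay for residential addresses once those ranges get blocked.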
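The "block the entirety of AWS" step is easy to reproduce because AWS publishes every prefix it announces in a machine-readable feed. A minimal sketch, assuming you're willing to refuse everything in that feed (a production setup would load the list into a radix tree or straight into the firewall rather than scan it per request):

    import ipaddress
    import json
    import urllib.request

    # AWS's published list of its IPv4 and IPv6 prefixes.
    AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    def load_aws_networks():
        """Fetch the current set of AWS prefixes from the public feed."""
        with urllib.request.urlopen(AWS_RANGES_URL) as resp:
            data = json.load(resp)
        nets = [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]
        nets += [ipaddress.ip_network(p["ipv6_prefix"]) for p in data["ipv6_prefixes"]]
        return nets

    def is_aws(client_addr, networks):
        """True if the client address falls inside any published AWS prefix."""
        ip = ipaddress.ip_address(client_addr)
        return any(ip in net for net in networks)

    if __name__ == "__main__":
        networks = load_aws_networks()
        # Example addresses; whether the first one matches depends on the
        # feed's contents at the time you run this.
        for client in ("52.94.76.10", "203.0.113.7"):
            print(client, "->", "block (AWS)" if is_aws(client, networks) else "allow")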
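The per-request decision at the end, serve normally, serve a CAPTCHA, or block, based on an IP-reputation lookup, might look roughly like the sketch below. lookup_reputation(), its fields, and the thresholds are all hypothetical stand-ins for whatever commercial feed and policy a given site actually uses.

    from dataclasses import dataclass

    @dataclass
    class Reputation:
        ip_type: str        # "residential", "datacenter", "mobile", "tor", ...
        proxy_score: float  # 0.0 = almost certainly a real user, 1.0 = almost certainly a proxy

    def lookup_reputation(client_addr: str) -> Reputation:
        """Placeholder for a query against a commercial IP-intelligence feed."""
        raise NotImplementedError

    def decide(client_addr: str) -> str:
        rep = lookup_reputation(client_addr)
        if rep.ip_type == "datacenter" or rep.proxy_score > 0.9:
            # Hosting ranges are almost never real readers, so block outright.
            return "block"
        if rep.ip_type != "residential" or rep.proxy_score > 0.5:
            # Can't justify blocking all of Comcast or Deutsche Telekom,
            # so degrade to a CAPTCHA / heavy rate limit instead.
            return "captcha"
        return "serve"

The failure mode I'm describing is visible right in decide(): as the proxy score assigned to ordinary residential space creeps up, more and more legitimate users fall through to the "captcha" branch, until the signal stops meaning anything at all.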