It appears that Patrick Clochesy <patrick@mach.net> said:
Both robots respect robots.txt; of course they’re not going to answer.
The content farm is not one site with six billion pages, it's six billion sites each with one page. They check the robots.txt for each site they visit, but by then it's too late. Most spiders can take the hint that they're all on the same IP. But not these two.
R's, John
On Feb 13, 2024, at 8:35 PM, John Levine <johnl@iecc.com> wrote:
One day I set up the world's lamest content farm. You can see it here:
While humans tend not to find its six billion pages very interesting, some web spiders are entranced. In the past week or so, Amazon's amazonbot has visited it 6 million times, and OpenAI's gptbot 2.6 million. (If you were wondering what they use to train ChatGPT, now you know.) I don't care that googlebot comes by every 5 or 10 minutes, but gptbot comes by every few seconds and amazonbot as fast as the server will respond.
They both come from predictable IPs, so I can set packet filters, but they're still hammering pretty hard. Each has a URL in the user agent string; Amazon's page has an address to write to but OpenAI's doesn't. I wrote to the Amazon address but got no response.
If anyone has contacts at either I would appreciate it. A few years ago the bingbot got trapped but fortunately I knew someone at Microsoft who could pass the word. He reported back that while he could not go into detail, there was a great deal of animated conversation at the other end of the hall, and shortly after that it stopped.
R's, John