Anyone have contacts at the Amazon or OpenAI web spiders?
One day I set up the world's lamest content farm. You can see it here: https://www.web.sp.am/

While humans tend not to find its six billion pages very interesting, some web spiders are entranced. In the past week or so, Amazon's amazonbot has visited it 6 million times, and OpenAI's gptbot 2.6 million. (If you were wondering what they use to train ChatGPT, now you know.) I don't care that googlebot comes by every 5 or 10 minutes, but gptbot is every few seconds and amazon as fast as the server will respond.

They both come from predictable IPs so I can set packet filters but they're still hammering pretty hard. Each has a URL in the user agent string, Amazon's page has an address to write to but OpenAI's doesn't. I wrote to the Amazon address, no response.

If anyone has contacts at either I would appreciate it. A few years ago the bingbot got trapped but fortunately I knew someone at Microsoft who could pass the word. He reported back that while he could not go into detail, there was a great deal of animated conversation at the other end of the hall, and shortly after that it stopped.

R's, John
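For anyone wanting to do the same kind of filtering, here is a minimal sketch of matching a client address against a list of crawler ranges before deciding what to do with it. The range list and the names `HYPOTHETICAL_CRAWLER_RANGES`, `build_index` and `classify` are made up for illustration; the CIDR blocks are the RFC 5737 documentation ranges, not the real Amazon or OpenAI ones, which would have to come from whatever each operator publishes.

```python
import ipaddress

# Placeholder ranges taken from the RFC 5737 documentation blocks.
# These are NOT real crawler ranges; substitute the operators' published lists.
HYPOTHETICAL_CRAWLER_RANGES = {
    "crawler-a": ["192.0.2.0/24"],
    "crawler-b": ["198.51.100.0/25", "203.0.113.0/24"],
}


def build_index(ranges):
    """Parse the CIDR strings once so lookups do not re-parse them."""
    return {
        name: [ipaddress.ip_network(cidr) for cidr in cidrs]
        for name, cidrs in ranges.items()
    }


def classify(ip, index):
    """Return the name of the crawler whose range contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, networks in index.items():
        if any(addr in net for net in networks):
            return name
    return None


INDEX = build_index(HYPOTHETICAL_CRAWLER_RANGES)
```

A log processor or a filter generator could call `classify(client_ip, INDEX)` per request. Note that a filter like this only drops the traffic at the server; it does nothing to slow the crawler down at its end.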
Both robots respect robots.txt, of course they’re not going to answer.

On Feb 13, 2024, at 8:35 PM, John Levine <johnl@iecc.com> wrote:
I wrote to the Amazon address, no response.
It appears that Patrick Clochesy <patrick@mach.net> said:
Both robots respect robots.txt, of course they’re not going to answer.
The content farm is not one site with six billion pages; it's six billion sites, each with one page. They check the robots.txt for each site they visit, but by then it's too late. Most spiders can take the hint that they're all on the same IP. But not these two.

R's, John
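To make the "same IP" hint concrete: robots.txt is fetched per hostname, so a crawler that only paces itself per hostname sees every one of those sites as fresh. Here is a rough sketch of pacing keyed on the resolved address instead. This is not how amazonbot, gptbot or any other real spider works; the class name `PerAddressThrottle`, its parameters and the 10-second interval are all assumptions for illustration, and the resolver is passed in so the sketch runs without touching DNS.

```python
import time


class PerAddressThrottle:
    """Allow at most one fetch per min_interval seconds per resolved address.

    Hypothetical sketch: keying on the address rather than the hostname means
    many hostnames served from one IP share a single politeness budget.
    """

    def __init__(self, resolve, min_interval=10.0, clock=time.monotonic):
        self.resolve = resolve            # callable: hostname -> address string
        self.min_interval = min_interval  # seconds between fetches per address
        self.clock = clock                # injectable for testing
        self._next_allowed = {}           # address -> earliest next fetch time

    def try_acquire(self, hostname):
        """Return True if a fetch to hostname may go ahead right now."""
        address = self.resolve(hostname)
        now = self.clock()
        if now < self._next_allowed.get(address, float("-inf")):
            return False
        self._next_allowed[address] = now + self.min_interval
        return True
```

With a resolver that maps every hostname to the same address, the second and later hostnames are refused until the interval has passed, which is the behaviour the well-behaved spiders appear to approximate.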
On Wed, Feb 14, 2024 at 1:36 PM John Levine <johnl@iecc.com> wrote:
If anyone has contacts at either I would appreciate it.
https://developer.amazon.com/support/amazonbot probably returned as a result of searching "amazonbot" on your favourite search engine.
Um, that is the site I mentioned in the line above the one you quoted. As I said, I wrote to the contact address, no reply.
Regards, John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies", Please consider the environment before reading this e-mail. https://jl.ly
participants (4): John Levine, John R. Levine, Lincoln Dale, Patrick Clochesy