On Thu, 20 Jan 2005 14:30:04 +0200, Gadi Evron <gadi@tehila.gov.il> wrote:
Inktomi (now Yahoo!) sends its spiders all over the Internet. Lately some of our systems have been reporting that they open many HTTP connections to our web sites, then disconnect immediately without ever sending any data. This has reached a level where it disturbs us.
I have heard previous stories of Inktomi ignoring robots.txt (I have not seen this myself, though). There are also threads like this one. Quoting from http://www.webmasterworld.com/forum11/1968-1-15.htm:
I've got Scooter allowed in, but I've also got it lumped in with a number of agents that are not allowed to get non-HTML files. This is especially important at my site, as it includes a number of very large binary datasets in numerous locations, and the robots have proven too stupid to understand that downloading them is a waste of bandwidth.
RewriteCond %{HTTP_USER_AGENT} .*Ask.Jeeves.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*FAST.WebCrawl.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*ia_archiver.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*InfoSeek.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*inktomi.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Scooter.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Slurp.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Teoma.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*VoilaBot.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*Google.*
RewriteRule !.*(html|htm|txt|/)$ /www/msgs/badagent.html [F]
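
For anyone who wants to check what that rule set would do to a given request before putting it on a live server, here is a rough Python sketch of the same decision. It is only an approximation under a few assumptions of mine: the function name is_forbidden is made up for illustration, the path is assumed to be the full URL path with its leading slash (as in per-server context), and matching is case-sensitive because the conditions carry no [NC] flag. It does not reproduce everything mod_rewrite does.

```python
import re

# User-agent fragments copied from the RewriteCond lines. Apache patterns are
# unanchored, so ".*X.*" behaves like a plain search for X. The "." in
# "Ask.Jeeves" is still a regex wildcard, as in the original.
LISTED_AGENT_PATTERNS = [
    r"Ask.Jeeves", r"FAST.WebCrawl", r"ia_archiver", r"InfoSeek", r"inktomi",
    r"Scooter", r"Slurp", r"Teoma", r"VoilaBot", r"Google",
]
LISTED_AGENT_RE = re.compile("|".join(LISTED_AGENT_PATTERNS))

# The RewriteRule pattern without its leading "!": paths the listed robots
# may still fetch (HTML, text, and directory indexes).
ALLOWED_PATH_RE = re.compile(r".*(html|htm|txt|/)$")


def is_forbidden(user_agent: str, path: str) -> bool:
    """Return True if the rule set would answer this request with 403.

    Hypothetical helper: a listed agent is refused only when the path does
    NOT end in html, htm, txt, or a slash.
    """
    listed = LISTED_AGENT_RE.search(user_agent) is not None
    allowed = ALLOWED_PATH_RE.search(path) is not None
    return listed and not allowed


if __name__ == "__main__":
    slurp = "Mozilla/5.0 (compatible; Yahoo! Slurp)"
    print(is_forbidden(slurp, "/data/big.tar.gz"))
    print(is_forbidden(slurp, "/index.html"))
```

Note that this only helps with robots that send an honest User-Agent header and actually issue a request; it does nothing about connections that are opened and closed without any data being sent.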