Still no word from google, and no indication that there's anything wrong with the robots.txt. Google's estimated hit count is going slightly up, instead of way down. Why am I bugging NANOG with this? Well, I'm sure that if Googlebot keeps ignoring my robots.txt file, thereby hammering the server and facilitating spam, they're doing the same with a google of other sites. (Well, ok, not a google, but you get my point.)

On 11/14/05 2:18 PM, Coyle, Brian sent forth electrons to convey:
Just thinking out loud...
Have you confirmed the IP addresses of the Googlebot entries in your log actually belong to Google?
/paranoia :)

The google search URL I posted shows that google is hitting the site. There are results in there that point to pages that postdate the robots.txt that should have blocked 'em. (http://www.google.com/search?q=site%3Awiki.fastmail.fm)
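(For anyone who wants to run Brian's check: a reverse lookup on the hitting IP should land in googlebot.com, and a forward lookup of that name should give back the same IP. A quick sketch in Python; the sample IP is hypothetical, pull real ones from your access log:)

    import socket

    def is_googlebot(ip):
        # Reverse-then-forward DNS check: spoofing the User-Agent
        # string is trivial, spoofing both lookups is not.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith("googlebot.com"):
            return False
        try:
            return socket.gethostbyname(host) == ip
        except socket.gaierror:
            return False

    # e.g.: is_googlebot("66.249.66.1")  # hypothetical log entry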
On 11/14/05 2:09 PM, Jeff Rosowski sent forth electrons to convey:
Are you trying to block everything except the main page? I know to block everything ...

No; and I know how to block everything, too. See http://www.google.com/webmasters/remove.html
The above page says that

    User-agent: Googlebot
    Disallow: /*?

will block all standard-looking dynamic content, i.e. URLs with "?" in them.
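(To illustrate the semantics: "*" matches any run of characters, and matching is anchored at the start of the path, so /*? catches any URL with a query string. A toy matcher in Python, assuming Google's documented wildcard rules:)

    import re

    def blocked_by(pattern, path):
        # Translate Google's robots.txt wildcard into a regex:
        # "*" becomes ".*", everything else is literal, anchored
        # at the start of the URL path.
        regex = "^" + re.escape(pattern).replace(r"\*", ".*")
        return re.match(regex, path) is not None

    assert blocked_by("/*?", "/SomePage?action=edit")   # dynamic: blocked
    assert not blocked_by("/*?", "/SomePage")           # static: allowed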
On Mon, 14 Nov 2005, Matthew Elvey wrote:
Doh! I had no idea my thread would require login/be hidden from general view! (A robots.txt info site had directed me there...) It seems I fell for an SEO scam... how ironic. I guess that's why I haven't heard from google...
Anyway, here's the page content (with some editing and paraphrasing):
Subject: paging google! robots.txt being ignored!
Hi. My robots.txt was put in place in August! But google still has tons of results that violate the file.
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi doesn't complain (other than about the use of google's nonstandard extensions described at http://www.google.com/webmasters/remove.html )
The above page says that it's OK that

    #per [[AdminRequests]]
    User-agent: Googlebot
    Disallow: /*?*

is last (after the User-agent: * section), and seems to suggest that the syntax is OK.
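(For the record, the whole file is laid out roughly like this; the catch-all Disallow lines are paraphrased, only the Googlebot section is verbatim:)

    # rules for all standards-compliant robots (paths illustrative)
    User-agent: *
    Disallow: /some/path

    #per [[AdminRequests]]
    User-agent: Googlebot
    Disallow: /*?*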
I also tried

    User-agent: Googlebot
    Disallow: /*?

but it hasn't helped.
I asked google to review it via the automatic URL removal system (http://services.google.com/urlconsole/controller). Result:

    URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
    DISALLOW: /*?
How insane is that?
Oh, and while /*?* wasn't the exact pattern from their example, it was legal per their documented syntax, same as /*? !
The site has around 35,000 pages, and I don't think a small robots.txt that does what I want is possible without using the wildcard extension.
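(To make that concrete: the standard protocol only does prefix matching, so without the wildcard extension every dynamic URL needs its own Disallow line. A rough Python sketch of generating such a file, where dynamic_urls.txt is a hypothetical dump of the wiki's URLs:)

    # Without wildcards, one Disallow line per dynamic URL --
    # with ~35,000 pages the file balloons accordingly.
    with open("dynamic_urls.txt") as f, open("robots.txt", "w") as out:
        out.write("User-agent: Googlebot\n")
        for line in f:
            url = line.strip()
            if "?" in url:
                # keep only the path-and-query part the robot sees
                path = "/" + url.split("/", 3)[-1]
                out.write("Disallow: %s\n" % path)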