Still no word from google, and no indication that there's anything wrong with the robots.txt. Google's estimated hit count is going slightly up, instead of way down. Why am I bugging NANOG with this? Well, I'm sure that if Googlebot keeps ignoring my robots.txt file, thereby hammering the server and facilitating spam, they're doing the same with a google of other sites. (Well, ok, not a google, but you get my point.)

On 11/14/05 2:18 PM, Coyle, Brian sent forth electrons to convey:
Just thinking out loud...
Have you confirmed the IP addresses of the Googlebot entries in your log actually belong to Google?
/paranoia :)

The google search URL I posted shows that google is hitting the site. There are results in there that point to pages that postdate the robots.txt that should have blocked 'em. (http://www.google.com/search?q=site%3Awiki.fastmail.fm)
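(For anyone who wants to run Brian's check: a reverse lookup on the hitting IP should land in googlebot.com, and a forward lookup of that name should give back the same IP. A quick sketch in Python; the sample IP is hypothetical, pull real ones from your access log:)

    import socket

    def is_googlebot(ip):
        # Reverse-then-forward DNS check: spoofing the User-Agent
        # string is trivial, spoofing both lookups is not.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith("googlebot.com"):
            return False
        try:
            return socket.gethostbyname(host) == ip
        except socket.gaierror:
            return False

    # e.g.: is_googlebot("66.249.66.1")  # hypothetical log entry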
On 11/14/05 2:09 PM, Jeff Rosowski sent forth electrons to convey:
Are you trying to block everything except the main page? I know to block everything ...

No; and I know how to block everything, too. See http://www.google.com/webmasters/remove.html
The above page says that

    User-agent: Googlebot
    Disallow: /*?

will block all standard-looking dynamic content, i.e. URLs with "?" in them.
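(To illustrate the semantics: "*" matches any run of characters, and matching is anchored at the start of the path, so /*? catches any URL with a query string. A toy matcher in Python, assuming Google's documented wildcard rules:)

    import re

    def blocked_by(pattern, path):
        # Translate Google's robots.txt wildcard into a regex:
        # "*" becomes ".*", everything else is literal, anchored
        # at the start of the URL path.
        regex = "^" + re.escape(pattern).replace(r"\*", ".*")
        return re.match(regex, path) is not None

    assert blocked_by("/*?", "/SomePage?action=edit")   # dynamic: blocked
    assert not blocked_by("/*?", "/SomePage")           # static: allowed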
On Mon, 14 Nov 2005, Matthew Elvey wrote:
Doh! I had no idea my thread would require login/be hidden from general view! (A robots.txt info site had directed me there...) It seems I fell for an SEO scam... how ironic. I guess that's why I haven't heard from google...
Anyway, here's the page content (with some editing and paraphrasing):
Subject: paging google! robots.txt being ignored!
Hi. My robots.txt was put in place in August! But google still has tons of results that violate the file.
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi doesn't complain (other than about the use of google's nonstandard extensions described at http://www.google.com/webmasters/remove.html )
The above page says that it's OK that

    #per [[AdminRequests]]
    User-agent: Googlebot
    Disallow: /*?*

is last (after the User-agent: * section), and seems to suggest that the syntax is OK.
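(For the record, the whole file is laid out roughly like this; the catch-all Disallow lines are paraphrased, only the Googlebot section is verbatim:)

    # rules for all standards-compliant robots (paths illustrative)
    User-agent: *
    Disallow: /some/path

    #per [[AdminRequests]]
    User-agent: Googlebot
    Disallow: /*?*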
I also tried

    User-agent: Googlebot
    Disallow: /*?

but it hasn't helped.
I asked google to review it via the automatic URL removal system (http://services.google.com/urlconsole/controller). Result:

    URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
    DISALLOW: /*?
How insane is that?
Oh, and while /*?* wasn't the exact pattern from their example, it was legal per their documented syntax, same as /*? !
The site has around 35,000 pages, and I don't think a small robots.txt that does what I want is possible without using the wildcard extension.
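(To make that concrete: the standard protocol only does prefix matching, so without the wildcard extension every dynamic URL needs its own Disallow line. A rough Python sketch of generating such a file, where dynamic_urls.txt is a hypothetical dump of the wiki's URLs:)

    # Without wildcards, one Disallow line per dynamic URL --
    # with ~35,000 pages the file balloons accordingly.
    with open("dynamic_urls.txt") as f, open("robots.txt", "w") as out:
        out.write("User-agent: Googlebot\n")
        for line in f:
            url = line.strip()
            if "?" in url:
                # keep only the path-and-query part the robot sees
                path = "/" + url.split("/", 3)[-1]
                out.write("Disallow: %s\n" % path)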