Still no word from google, or indication that there's anything wrong with the robots.txt. Google's estimated hit count is going slightly up, instead of way down. Why am I bugging NANOG with this? Well, I'm sure if Googlebot keeps ignoring my robots.txt file, thereby hammering the server and facilitating spam, they're doing the same with a google of other sites. (Well, ok, not a google, but you get my point.)

On 11/14/05 2:18 PM, Coyle, Brian sent forth electrons to convey:
Just thinking out loud...
Have you confirmed the IP addresses of the Googlebot entries in your log actually belong to Google?
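Brian's check can be automated. Google's documented way to verify a claimed Googlebot is a double reverse-DNS lookup: reverse-resolve the IP, check the hostname is under googlebot.com or google.com, then forward-resolve the name and confirm it maps back to the same IP. A minimal sketch (the function name is mine):

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Verify a claimed Googlebot IP via reverse DNS plus a
    forward-confirmation lookup (double reverse DNS)."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except OSError:
        return False
    # Genuine Googlebot crawl hosts live under googlebot.com / google.com.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

print(is_real_googlebot("127.0.0.1"))  # False (localhost is not a crawler)
```

A log entry whose IP fails this check is a spoofed User-Agent, not Google.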
/paranoia :)

The google search URL I posted shows that google is hitting the site. There are results in there that point to pages that postdate the robots.txt that should have blocked 'em. (http://www.google.com/search?q=site%3Awiki.fastmail.fm)
On 11/14/05 2:09 PM, Jeff Rosowski sent forth electrons to convey:
Are you trying to block everything except the main page? I know to block everything ...

No; me too. See http://www.google.com/webmasters/remove.html

The above page says that

User-agent: Googlebot
Disallow: /*?

will block all standard-looking dynamic content, i.e. URLs with "?" in them.
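For reference, the wildcard extension that page describes can be mimicked with a small matcher (a sketch; the function name and regex translation are mine, but the semantics follow Google's description: '*' matches any run of characters, an optional trailing '$' anchors the end, and rules are otherwise prefix matches):

```python
import re

def googlebot_disallow_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern using Google's extensions
    ('*' wildcard, optional trailing '$' anchor) matches a URL path."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Escape everything except '*', which becomes '.*'.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    # re.match anchors at the start, giving prefix-match semantics.
    return re.match(regex + ("$" if anchored else ""), path) is not None

print(googlebot_disallow_matches("/*?", "/wiki/Page?action=edit"))   # True
print(googlebot_disallow_matches("/*?", "/wiki/Page"))               # False
print(googlebot_disallow_matches("/*?*", "/wiki/Page?action=edit"))  # True
```

Note that /*? and /*?* match the same set of paths under prefix semantics, which is the point made further down the thread.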
On Mon, 14 Nov 2005, Matthew Elvey wrote:
Doh! I had no idea my thread would require login/be hidden from general view! (A robots.txt info site had directed me there...) It seems I fell for an SEO scam... how ironic. I guess that's why I haven't heard from google...
Anyway, here's the page content (with some editing and paraphrasing):
Subject: paging google! robots.txt being ignored!
Hi. My robots.txt was put in place in August! But google still has tons of results that violate the file.
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi doesn't complain (other than about the use of google's nonstandard extensions described at http://www.google.com/webmasters/remove.html )
The above page says that it's OK that
#per [[AdminRequests]]
User-agent: Googlebot
Disallow: /*?*
is last (after User-agent: *)
and seems to suggest that the syntax is OK.
I also tried
User-agent: Googlebot
Disallow: /*?

but it hasn't helped.
I asked google to review it via the automatic URL removal system (http://services.google.com/urlconsole/controller). Result: URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card: DISALLOW: /*?
How insane is that?
Oh, and while /*?* wasn't per their example, it was legal, per their syntax, same as /*? !
The site has around 35,000 pages, and I don't think a small robots.txt to do what I want is possible without using the wildcard extension.
Hi there,

Looking at your robots.txt... are you sure that is correct? On the sites I host, robots.txt always has:

User-Agent: *
Disallow: /

in /htdocs or wherever the httpd root lives. Thus far it keeps the spiders away. GoogleSpider will also obey NOARCHIVE, NOFOLLOW, NOINDEX placed within the meta tag inside of the html header. With the above for robots.txt I've had no problems.

-M.
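The meta tag M mentions is worth spelling out, because (as noted later in the thread) robots.txt only stops crawling, not indexing of externally-linked URLs. A generic example, placed in the head of each page to be kept out of the index:

```html
<head>
  <meta name="robots" content="noindex, nofollow, noarchive">
</head>
```

Unlike a robots.txt rule, this works even when the page is discovered via links from other sites, since the crawler must fetch the page to see the tag.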
On Tue, 15 Nov 2005, Steven Kalcevich wrote:
www.paypal.com
Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, webmaster@paypal.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log.
Works for me. Same BS splash advertising that always comes up. Damn that is annoying.

Chris
--
Chris Owen ~ Garden City (620) 275-1900 ~ Lottery (noun):
President ~ Wichita (316) 858-3000 ~ A stupidity tax
Hubris Communications Inc ~ www.hubris.net
* matthew@elvey.com (Matthew Elvey) [Wed 16 Nov 2005, 01:56 CET]:
Still no word from google, or indication that there's anything wrong with the robots.txt. Google's estimated hit count is going slightly up, instead of way down.
robots.txt is about explicitly spidering your site; Google will still follow links from outside towards your website and index pages linked that way. This is common knowledge.

-- Niels.
--
"Calling religion a drug is an insult to drugs everywhere. Religion is more like the placebo of the masses." -- MeFi user boaz
matthew@elvey.com (Matthew Elvey) [Wed 16 Nov 2005, 01:56 CET]:
Still no word from google, or indication that there's anything wrong with the robots.txt. Google's estimated hit count is going slightly up, instead of way down.
Way back in the early '90s, someone came up with an elegant solution to this problem. When building a site in a folder named /httproot, all dynamic pages, i.e. scripts, were placed in a folder named /httproot/cgi-bin. Then somebody invented robots.txt to allow people to tell spiders to leave the cgi-bin folder alone. Sites which follow the ancient paradigm do not run into these kinds of problems.

Some people would say that asking the world to re-engineer the robots.txt protocol, instead of building sites compliant with the protocol, is in violation of the robustness principle as expressed by Jon Postel in RFC 793 section 2.10 and reiterated in section 4.5 of RFC 3117. When something doesn't work, the correct operational response is to fix it.

--Michael Dillon
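The "ancient paradigm" Michael describes needs only the original, wildcard-free robots.txt syntax, since all dynamic content lives under one path prefix:

```
User-agent: *
Disallow: /cgi-bin/
```

Every robot that honors robots.txt at all understands this form; no vendor extensions are involved.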
participants (7)
- Chris Owen
- Harald Koch
- Matthew Elvey
- MH
- Michael.Dillon@btradianz.com
- Niels Bakker
- Steven Kalcevich