On Wed, Jul 12, 2006 at 02:50:54PM -0700, Malcolm Staudinger wrote:
> Google is your friend? They're a search engine. robots.txt and
> forget it.
>
> Malcolm

That's assuming whoever designed their software actually adheres to
robots.txt. The robots exclusion standard is purely advisory;
adherence is voluntary, operationally "optional", and some crawlers
simply ignore it. I can't find a single reference to what standards
"GigaBlast" adheres to, or any technical data about how their engine
works. The way their site is designed, it looks like a total
fly-by-night operation. (If their crawler does honor robots.txt,
excluding it is a two-line change; see the sketch below.)

If "GigaBlast" really is "indexing" his site, they have to be basing
their GET requests on something (the equivalent of a normal browser's
Referer header; but again, who knows if they pass that along?). The
requests Jim is seeing appear to be garbage, similar in composition
to spam, not based on actual references/indexes. Logging the Referer
and User-Agent headers would settle that one way or the other (second
sketch below). I could be outright wrong here.

Additionally, how does this solve the issue of Jim's bandwidth, CPU,
and memory, if not his time, being wasted on HTTP requests which
shouldn't even be arriving at his boxes (which is what he's
essentially complaining about)? "So filter upstream, or on the
machine itself." Okay, that's a solution, but a host-level filter
only suppresses the responses; the incoming requests still cross his
pipe (example below).

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
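
For completeness, if their crawler does honor robots.txt, blocking it
is trivial. This is a sketch only: I'm assuming the bot identifies
itself as "Gigabot", since I can't find that documented anywhere;
check the User-Agent strings in the access logs first and substitute
whatever actually shows up:

    # robots.txt at the site root; bot name is an assumption
    User-agent: Gigabot
    Disallow: /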
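
As for checking what these requests claim as a referring page: if
Jim's running Apache (an assumption on my part), the stock "combined"
log format already records both Referer and User-Agent, e.g. in
httpd.conf:

    # Log Referer and User-Agent alongside each request
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
    CustomLog /var/log/httpd-access.log combined

If the Referer field comes back "-" on every one of these hits, the
requests aren't being driven by indexed links, which would back up
the spam-composition theory.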
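
And to illustrate the filtering point: a host-level pf rule like the
following (interface name and netblock are made up; pull real
addresses from the logs) drops the packets before the web server ever
sees them, which saves CPU, memory, and the outbound responses, but
the inbound packets have already crossed the wire by the time pf sees
them:

    # /etc/pf.conf -- hypothetical values; substitute your own
    ext_if = "fxp0"
    table <badbots> { 192.0.2.0/24 }
    block in quick on $ext_if proto tcp from <badbots> to any port 80

Only filtering upstream, at the provider's edge, actually keeps the
traffic off his pipe.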