Feel free to clue me in on this please... ;-)
What is www.gigablast.com? And why is it constantly performing "questionable" queries (mostly HTTP) across every IP that I have access to check?
I get a couple of thousand hits (mostly questionable requests for non-existent URLs) from that IP (66.154.103.75). Anyone else seeing/questioning this?
Completewhois shows some listings in some RBLs, but not the more popular ones.
-Jim P.
:-) Let me add something before everyone on NANOG reminds me that gigablast is a search engine.....
I know what they do, but what I don't understand is why they are searching my systems for URLs that have never existed there. It's as though they are doing random word searches in hopes of striking it lucky.
They are "crawling" for URLs like this (unfortunately most people won't see these because their spam blockers will block all the exclamation points):
/Hj!!lpMall
/BuscaP!!gina
/!!!!!!-!!!!!!
/P!!ginasAbandonadas
/HilfeIndex
/CategoryCategory
/Aktuelle!!nderungen
/EfterladteSider
/SystemPagesInDanishGroup
/!!rvaLapok
/ForSide
/!!!!!!!!!!!!
/!!!!!!-!!!
/StartSeite
/!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/Hj!!lpTilHenvisninger
/!!!!!!-!!!!!!!!!!!!
/ExplorerCeWiki
/Xslt!!!!!!!!!!!!
/P!!ginaInicial
/SenesteRettelser
/!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/Pr!!f!!rencesUtilisateur
/WikiHomePage
/HilfeZuParsern
/AiutoModello
/GewenstePaginas
/HilfeZu!!berschriften
-Jim P.
> :-) Let me add something before everyone on NANOG reminds me that gigablast is a search engine..... I know what they do, but what I don't understand is why are they searching my systems for URLs that haven't ever existed there before. It's as though they are doing random word searches in hopes of striking lucky. They are "crawling" for URLs like this: (unfortunately most people won't see these because their spam blockers will block all the exclamation points)
[list of random path names snipped]
This seems to be a very wrong and bad thing to do. Google searches URLs because a human gives it permission to do so, for example by linking to that URL. (What purpose does a link have other than to be something to click on?) What gigablast seems to be doing, on the other hand, is trying to open every window in a house in the hopes that it will find one that's open. It has no invitation or permission to do this, and I would consider such behavior inappropriate.
You do not have the right to make requests of other people's computers without their permission. You can certainly argue implied permission in many cases -- for example, if Ford registers the domain ford.com, and assigns an IP address to 'www.ford.com', you can certainly argue that they have invited the public to access that URL because that's the normal reason people create such things. However, you have no implied permission to try numerous combinations of random paths on the end of that in the hopes that you'll find something Ford did not invite you into.
DS
> What gigablast seems to be doing, on the other hand, is trying to open
> every window in a house in the hopes that it will find one that's open.
Just looking at the text strings in the URLs, my off-the-top-of-my-head guess was that those were URLs it saw in email spam. They looked very similar to a lot of the ASCII garbage that gets generated by spammers trying to get through Bayesian filters. It seemed plausible to me (not a good idea, of course, but the sort of thing that happens) that they might have been grepping web pages for URLs, and run across an archive of spam.
-Bill
Google is your friend? They're a search engine. robots.txt and forget it.
Malcolm
That's exactly it... they are doing site indexing... if you like Google, you'll need to like them! =P
I personally wouldn't worry about anything in the logs unless you start seeing attempts to search for and exploit .cgi and executable files...
-Payam
--
Payam Tarverdyan Chychi
Network Analyst
On Wed, Jul 12, 2006 at 02:50:54PM -0700, Malcolm Staudinger wrote:
> Google is your friend? They're a search engine. robots.txt and forget it.
> Malcolm
That's assuming whoever designed their software actually adheres to robots.txt. RFCs recommend people adhere to it, but there are some who don't; it's operationally "optional".
I can't find a single reference to what standards "GigaBlast" adheres to, or any technical data about how their engine works. The way their site is designed, it looks like a total fly-by-night operation.
If "GigaBlast" is supposedly "indexing" his site, they have to be basing their GET requests on something (the equivalent of a normal browser's Referer header; but again, who knows if they pass that along?). The requests Jim is seeing appear to be garbage, similar to spam composition, not based on actual references/indexes. I could be outright wrong here.
Additionally, how does this solve the issue of Jim's bandwidth, CPU, memory, if not his time, being wasted for HTTP requests which shouldn't necessarily even be arriving at his boxes (which is what he's essentially complaining about)?
"So filter upstream, or on the machine itself". Okay, that's a solution, but it doesn't address incoming traffic (just responses).
--
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP: 4BD6C0CB |
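To make the "operationally optional" point concrete, here is a minimal sketch of the check a well-behaved crawler would run on its own side before issuing a GET. It uses Python's standard urllib.robotparser. The robots.txt content, the "Gigabot" user-agent token, and the example.com URLs are assumptions for illustration only; check your own access logs for the string the crawler really sends. Nothing on the server enforces the result, which is exactly the concern raised above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish. The "Gigabot" token
# is an assumption; the real user-agent string may differ.
ROBOTS_TXT = """\
User-agent: Gigabot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url.

    This is the check a polite crawler performs voluntarily; a crawler
    that skips it will still have its requests reach the server.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The named crawler is shut out entirely; other agents only lose /private/.
print(is_allowed("Gigabot", "http://www.example.com/WikiHomePage"))
print(is_allowed("SomeOtherBot", "http://www.example.com/WikiHomePage"))
print(is_allowed("SomeOtherBot", "http://www.example.com/private/notes.html"))
```

Because the decision is made by the client, this only helps against crawlers that choose to run it; it does not stop the requests from arriving.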
It appears that some of the queries are valid for an older site that existed in the past. That site was a wiki, and some of the Giga hits are for internationalized versions of the default help/support pages. This is fine and acceptable behavior by them (IMHO). The fact that they are querying something that no longer exists is something I can deal with.
The strangeness is that some of their crawling is looking for URLs with multiple exclamation points; those URLs never existed. This may be indicative of a character translation on my system or theirs.
BUT, the net net is that I no longer feel a need to be concerned about them.
Thanks all,
-Jim P.
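One way to test the character-translation idea is sketched below. It rests on two assumptions that the thread does not confirm: that the requested page names contained non-ASCII letters encoded as UTF-8, and that some tool along the way printed each non-ASCII byte as "!". The candidate names (for example "/HjälpMall") are guesses at what the wiki pages might have been called, not data from the logs.

```python
def mask_non_ascii_bytes(path: str, placeholder: str = "!") -> str:
    """Encode path as UTF-8 and replace every non-ASCII byte with placeholder.

    A letter such as "ä" takes two bytes in UTF-8, so it becomes two
    placeholders in the output.
    """
    return "".join(
        chr(byte) if byte < 0x80 else placeholder
        for byte in path.encode("utf-8")
    )


# Guessed original page names (assumptions, not taken from the logs).
candidates = [
    "/HjälpMall",
    "/BuscaPágina",
    "/PréférencesUtilisateur",
    "/HilfeZuÜberschriften",
]

for name in candidates:
    print(name, "->", mask_non_ascii_bytes(name))
```

Under these assumptions the guessed names map onto strings of the same shape as several of the logged paths, which would also fit the paths that consist of nothing but exclamation points if those page names were written entirely in a non-Latin script.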
On Wed, Jul 12, 2006 at 06:24:08PM -0400, Jim Popovitch <jimpop@yahoo.com> wrote a message of 32 lines which said:
> The strangeness is that some of their crawling is looking for URLs with multiple exclamation points, those URLs never existed. This may be indicative of a character translation on my system or theirs.
From my experience (and I talked with people - or at least intelligent bots - at Gigablast), their HTML parser is seriously broken and it generates non-existent URLs quite often. For instance, <a href="http://www.example.fr/Cafe%20au%20lait"> will make their crawler ask for "/Cafe".
I reported the problem months ago but I got nothing except a standard "Thanks for telling us".
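The symptom described above can be reproduced with a small sketch. How Gigablast's parser actually fails is not known from this thread; the broken variant below is one hypothetical mechanism (decode the percent-escapes first, then cut at the first whitespace) that happens to yield "/Cafe". The class and function names are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect the raw href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def request_path(href: str) -> str:
    """Path a crawler should request: the percent-encoding is left intact."""
    return urlsplit(href).path


def broken_request_path(href: str) -> str:
    """Hypothetical bug: decode %20 to a space, then stop at the first space."""
    return unquote(urlsplit(href).path).split()[0]


collector = LinkCollector()
collector.feed('<a href="http://www.example.fr/Cafe%20au%20lait">menu</a>')
href = collector.hrefs[0]

print(request_path(href))         # /Cafe%20au%20lait
print(broken_request_path(href))  # /Cafe
```

The design point is that an escaped space belongs to the path and has to stay escaped in the request line; treating it as a delimiter at any stage produces a request for a page that was never linked.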
participants (7)
- Bill Woodcock
- David Schwartz
- Jeremy Chadwick
- Jim Popovitch
- Malcolm Staudinger
- Payam Tarverdyan Chychi
- Stephane Bortzmeyer