yahoo crawlers hammering us
So I guess I'm new at internets, as my colleagues told me, because I haven't gone around to 30-40 systems I control (minus customer self-managed gear) and installed a restrictive robots.txt everywhere to make the web less useful to everyone.

Does that really mean that a big outfit like Yahoo should be expected to download stuff at high speed off my customers' servers? For varying values of 'high speed', ~500K/s (4Mbps+) for a 3 gig file is kinda... a bit harsh. Especially for an exe a user left exposed in a webdir that's possibly (C) software and shouldn't have been there (now removed by customer, some kinda OS boot cd/toolset thingy).

This makes it look like Yahoo is actually trafficking in pirated software, but that's kinda too funny to expect to be true, unless some yahoo tech decided to use that IP/server @yahoo for his nefarious activity, but there are better sites than my customer's box to get his 'juarez'.

At any rate:
From Address        To Address               Proto   Bytes      CPS
====================================================================
67.196.xx.xx..80    67.195.112.151..44507    tcp     14872000   523000
$ host 67.195.112.151 8.8.8.8
151.112.195.67.in-addr.arpa domain name pointer b3091122.crawl.yahoo.net.

CIDR: 67.195.0.0/16 NetName: A-YAHOO-US8

so that's yahoo, or really well spoofed.

Is this expected/my own fault or what?

A number of years ago, there were 1000s of videos on a customer site (training for elderly care, extremely exciting stuff for someone into -1-day movies to post on torrent sites). The customer called me to say his bandwidth was gone, and I checked and found 12 Yahoo crawlers hitting the site at 300K/s each (~30Mbps+) downloading all the videos. This was all the more injurious as it was only 2004 and bandwidth was more than $1/Mbps back then. I did the really crass thing and nullrouted the whole /20 or whatever they were on per ARIN. It was the new-at-the-time video.yahoo.com search engine coming to index the whole site. I suppose they can't be too slow about it, or they'll never index a whole webfull of videos this century, but still, 12x 300K/s in 2004? (At the time Rasmus thought it was kinda funny. I do too, now.)

/kc

--
Ken Chase - ken@heavycomputing.ca - +1 416 897 6284 - Toronto CANADA
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.
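For what it's worth, the "really crass" nullroute Ken describes is usually a one-liner on a Linux box. This is only a sketch: the 67.195.0.0/20 prefix below is a placeholder borrowed from the crawler block in this thread, not the actual 2004 block.

  ip route add blackhole 67.195.0.0/20    # drop outbound packets toward the block (run as root)
  ip route del blackhole 67.195.0.0/20    # and to remove it again later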
That speed doesn't seem too bad to me - robots.txt is our friend when one has bandwidth limitations. (A minimal example follows the quoted message below.)

Leslie

On 9/7/10 1:19 PM, Ken Chase wrote:
So I guess I'm new at internets, as my colleagues told me, because I haven't gone around to 30-40 systems I control (minus customer self-managed gear) and installed a restrictive robots.txt everywhere to make the web less useful to everyone.
Does that really mean that a big outfit like Yahoo should be expected to download stuff at high speed off my customers' servers? For varying values of 'high speed', ~500K/s (4Mbps+) for a 3 gig file is kinda... a bit harsh. Especially for an exe a user left exposed in a webdir that's possibly (C) software and shouldn't have been there (now removed by customer, some kinda OS boot cd/toolset thingy).
This makes it look like Yahoo is actually trafficking in pirated software, but that's kinda too funny to expect to be true, unless some yahoo tech decided to use that IP/server @yahoo for his nefarious activity, but there are better sites than my customer's box to get his 'juarez'.
At any rate:
From Address        To Address               Proto   Bytes      CPS
====================================================================
67.196.xx.xx..80    67.195.112.151..44507    tcp     14872000   523000
$ host 67.195.112.151 8.8.8.8
151.112.195.67.in-addr.arpa domain name pointer b3091122.crawl.yahoo.net.
CIDR: 67.195.0.0/16 NetName: A-YAHOO-US8
so that's yahoo, or really well spoofed.
Is this expected/my own fault or what?
A number of years ago, there were 1000s of videos on a customer site (training for elderly care, extremely exciting stuff for someone into -1-day movies to post on torrent sites). The customer called me to say his bandwidth was gone, and I checked and found 12 Yahoo crawlers hitting the site at 300K/s each (~30Mbps+) downloading all the videos. This was all the more injurious as it was only 2004 and bandwidth was more than $1/Mbps back then. I did the really crass thing and nullrouted the whole /20 or whatever they were on per ARIN. It was the new-at-the-time video.yahoo.com search engine coming to index the whole site. I suppose they can't be too slow about it, or they'll never index a whole webfull of videos this century, but still, 12x 300K/s in 2004? (At the time Rasmus thought it was kinda funny. I do too, now.)
/kc
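A minimal robots.txt along the lines Leslie suggests might look like the sketch below. 'Slurp' was the user-agent token Yahoo's crawler matched on; the Crawl-delay value and the /files/ path are placeholders, and not every crawler honours Crawl-delay, so treat it as a hint rather than a guarantee.

  # robots.txt at the web root (example only)
  User-agent: Slurp
  Crawl-delay: 10
  Disallow: /files/

  User-agent: *
  Disallow: /files/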
On Tue, Sep 07, 2010 at 04:19:58PM -0400, Ken Chase wrote:
This makes it look like Yahoo is actually trafficking in pirated software, but that's kinda too funny to expect to be true, unless some yahoo tech decided to use that IP/server @yahoo for his nefarious activity, but there are better sites than my customer's box to get his 'juarez'.
It's not uncommon at all for a web spider to find large files and download them. I don't think there's some conspiracy at Yahoo to find warez; they are just operating as a normal spider, indexing the Internet.
~500K/s (4Mbps+) for a 3 gig file is kinda... a bit harsh.
What speed would you like a spider to download at? You could rate-limit Yahoo's netblocks server-side if you care enough (a sketch follows this message). Ideally, ask your customer not to throw large programs on there if you're concerned about bandwidth. 4 Mb/s isn't abnormal at all for a spider, especially on a larger file.
Is this expected/my own fault or what?
A little bit of both :)
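One way to do the server-side throttling hinted at above, assuming a Linux web server and the prefix from the flow dump earlier in the thread: mark packets headed for the crawler's netblock and shape them with tc. The interface name (eth0), the fwmark (10) and the 1mbit ceiling are all placeholders; this is a sketch, not a tested recipe.

  # mark outbound packets destined for the crawler netblock
  iptables -t mangle -A OUTPUT -d 67.195.0.0/16 -j MARK --set-mark 10

  # shape only the marked traffic with HTB; everything else stays unshaped
  tc qdisc add dev eth0 root handle 1: htb
  tc class add dev eth0 parent 1: classid 1:10 htb rate 1mbit ceil 1mbit
  tc filter add dev eth0 parent 1: protocol ip prio 1 handle 10 fw flowid 1:10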
On Tue, Sep 7, 2010 at 1:19 PM, Ken Chase <ken@sizone.org> wrote:
So I guess I'm new at internets, as my colleagues told me, because I haven't gone around to 30-40 systems I control (minus customer self-managed gear) and installed a restrictive robots.txt everywhere to make the web less useful to everyone.
Does that really mean that a big outfit like Yahoo should be expected to download stuff at high speed off my customers' servers? For varying values of 'high speed', ~500K/s (4Mbps+) for a 3 gig file is kinda... a bit harsh. Especially for an exe a user left exposed in a webdir that's possibly (C) software and shouldn't have been there (now removed by customer, some kinda OS boot cd/toolset thingy).
The large search engines like Google, Bing, and Yahoo do try to be good netizens, and not have multiple crawlers hitting a given machine at the same time, and they put delays between each request, to be nice to the CPU load and bandwidth of the machines; but I don't think any of the crawlers explicitly make efforts to slow down single-file-fetches.

Ordinarily, the transfer speed doesn't matter as much for a single URL fetch, as it lasts a very short period of time, and then the crawler waits before doing another fetch from the same machine/same site, reducing the load on the machine being crawled. I doubt any of them rate-limit down individual fetches, though, so you're likely to see more of an impact when serving up large single files like that.

I *am* curious--what makes it any worse for a search engine like Google to fetch the file than any other random user on the Internet? In either case, the machine doing the fetch isn't going to rate-limit the fetch, so you're likely to see the same impact on the machine, and on the bandwidth.
Is this expected/my own fault or what?
Well...if you put a 3GB file out on the web, unprotected, you've got to figure at some point someone's going to stumble across it and download it to see what it is. If you don't want to be serving it, it's probably best to not put it up on an unprotected web server where people can get to it. ^_^;

Speaking purely for myself in this manner, as a random user who sometimes sucks down random files left in unprotected directories, just to see what they are.

Matt
(now where did I put that antivirus software again...?)
I *am* curious--what makes it any worse for a search engine like Google to fetch the file than any other random user on the Internet
Possibly because that other user is who the customer pays to have their content delivered to?

Bruce Williams
--
You can close your eyes to things you don't want to see, but you can't close your heart to the things you don't want to feel.

On Wed, Sep 8, 2010 at 12:04 AM, Matthew Petach <mpetach@netflight.com> wrote:
On Tue, Sep 7, 2010 at 1:19 PM, Ken Chase <ken@sizone.org> wrote:
So I guess I'm new at internets, as my colleagues told me, because I haven't gone around to 30-40 systems I control (minus customer self-managed gear) and installed a restrictive robots.txt everywhere to make the web less useful to everyone.
Does that really mean that a big outfit like Yahoo should be expected to download stuff at high speed off my customers' servers? For varying values of 'high speed', ~500K/s (4Mbps+) for a 3 gig file is kinda... a bit harsh. Especially for an exe a user left exposed in a webdir that's possibly (C) software and shouldn't have been there (now removed by customer, some kinda OS boot cd/toolset thingy).
The large search engines like Google, Bing, and Yahoo do try to be good netizens, and not have multiple crawlers hitting a given machine at the same time, and they put delays between each request, to be nice to the CPU load and bandwidth of the machines; but I don't think any of the crawlers explicitly make efforts to slow down single-file-fetches. Ordinarily, the transfer speed doesn't matter as much for a single URL fetch, as it lasts a very short period of time, and then the crawler waits before doing another fetch from the same machine/same site, reducing the load on the machine being crawled. I doubt any of them rate-limit down individual fetches, though, so you're likely to see more of an impact when serving up large single files like that.
I *am* curious--what makes it any worse for a search engine like Google to fetch the file than any other random user on the Internet? In either case, the machine doing the fetch isn't going to rate-limit the fetch, so you're likely to see the same impact on the machine, and on the bandwidth.
Is this expected/my own fault or what?
Well...if you put a 3GB file out on the web, unprotected, you've got to figure at some point someone's going to stumble across it and download it to see what it is. If you don't want to be serving it, it's probably best to not put it up on an unprotected web server where people can get to it. ^_^;
Speaking purely for myself in this manner, as a random user who sometimes sucks down random files left in unprotected directories, just to see what they are.
Matt (now where did I put that antivirus software again...?)
Possibly because that other user is who the customer pays to have their content delivered to?
Customers don't want to deliver their content to search engines? That seems silly. http://www.last.fm/robots.txt (Note the final 3 disallow lines...)
On Wed, 08 Sep 2010 02:21:31 PDT, Bruce Williams said:
I *am* curious--what makes it any worse for a search engine like Google to fetch the file than any other random user on the Internet
Possibly because that other user is who the customer pays to have their content delivered to?
Seems to me that if you're doing content-for-pay and are upset when some(body|thing) downloads it without paying for it, you shouldn't be leaving gigabytes of said content where even a stupid bot can find it and download it without paying for it. Just sayin'.
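For the "don't leave it lying around" part, on an Apache box of that era a few lines of .htaccess in the directory is usually enough, assuming AllowOverride permits it; the allowed range below is a placeholder for the customer's own addresses, not anything from this thread.

  # .htaccess in the directory holding the large files (Apache 2.2-style syntax)
  # deny everyone except the customer's own range (192.0.2.0/24 is a placeholder)
  Order deny,allow
  Deny from all
  Allow from 192.0.2.0/24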
On Wed, Sep 08, 2010 at 12:04:07AM -0700, Matthew Petach said:
I *am* curious--what makes it any worse for a search engine like Google to fetch the file than any other random user on the Internet? In either case, the machine doing the fetch isn't going to rate-limit the fetch, so you're likely to see the same impact on the machine, and on the bandwidth.
I think that the difference is that there's a way to get to Yahoo and ask them WTF, whereas you have no recourse against the guy who mass-downloads your site with a script in 2 hrs (modulo well-funded banks dispatching squads with baseball bats to resolve hacking incidents). I also expect that Yahoo's behaviour is driven by policy, not random assholishness (I hope :), and therefore I should expect such incidents often. I also expect whinging on NANOG might get me some visibility into said policy and leverage to change it! </dream>

/kc

--
Ken Chase - ken@heavycomputing.ca - +1 416 897 6284 - Toronto CANADA
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.
On Wed, Sep 8, 2010 at 9:20 AM, Ken Chase <ken@sizone.org> wrote:
On Wed, Sep 08, 2010 at 12:04:07AM -0700, Matthew Petach said:
I *am* curious--what makes it any worse for a search engine like Google to fetch the file than any other random user on the Internet? In either case, the machine doing the fetch isn't going to rate-limit the fetch, so you're likely to see the same impact on the machine, and on the bandwidth.
I think that the difference is that there's a way to get to Yahoo and ask them WTF, whereas you have no recourse against the guy who mass-downloads your site with a script in 2 hrs (modulo well-funded banks dispatching squads with baseball bats to resolve hacking incidents). I also expect that Yahoo's behaviour is driven by policy, not random assholishness (I hope :), and therefore I should expect such incidents often. I also expect whinging on NANOG might get me some visibility into said policy and leverage to change it! </dream>
Well, I'd hazard a guess that the policy of the webcrawling machines at Bing, Google, Yahoo, Ask.com, and every other large search engine is probably to crawl the Internet, pulling down pages and indexing them for their search engine, always checking for a robots.txt file and carefully following the instructions located within said file.

Lacking any such file, one might suppose that the policy is to limit how many pages are fetched per interval of time, to avoid hammering a single server unnecessarily, and to space out intervals at which the site is visited, to balance out the desire to maintain a current, fresh view of the content, while at the same time being mindful of the limited server resources available for serving said content.

Note that I have no actual knowledge of the crawling policies present at any of the aforementioned sites, I'm simply hypothesizing at what their policies might logically be.

I'm curious--what level of visibility are you seeking into the crawling policies of the search engines, and what changes are you hoping to gain leverage to make to said policies?

Thanks!

Matt
(speaking only for myself, not for any current or past employer)
participants (7)
- Bruce Williams
- Harry Strongburg
- Ken Chase
- Leslie
- Matthew Petach
- Nathan Eisenberg
- Valdis.Kletnieks@vt.edu