yahoo crawlers hammering us

7 Sep 2010

      So i guess im new at internets as my colleagues told me because I havent gone
around to 30-40 systems I control (minus customer self-managed gear) and
installed a restrictive robots.txt everywhere to make the web less useful to
everyone.

Does that really mean that a big outfit like yahoo should be expected to
download stuff at high speed off my customers servers? For varying values of
'high speed', ~500K/s (4Mbps+) for a 3 gig file is kinda... a bit harsh.
Especially for an exe a user left exposed in a webdir, thats possibly (C)
software and shouldnt have been there (now removed by customer, some kinda OS boot
cd/toolset thingy).

This makes it look like Yahoo is actually trafficking in pirated software, but
that's kinda too funny to expect to be true, unless some yahoo tech decided to
use that IP/server @yahoo for his nefarious activity, but there are better sites
than my customer's box to get his 'juarez'.

At any rate:
...
From Address           To Address                Proto    Bytes    CPS
==============================================================================================================================================================================================
67.196.xx.xx..80       67.195.112.151..44507     tcp    14872000 523000
$ host 67.195.112.151 8.8.8.8

151.112.195.67.in-addr.arpa domain name pointer b3091122.crawl.yahoo.net.

CIDR:           67.195.0.0/16
NetName:        A-YAHOO-US8

so that's yahoo, or really well spoofed.

Is this expected/my own fault or what?

A number of years ago, there were 1000s of videos on a customer site (training
for elderly care, extremely exciting stuff for someone into -1-day movies to
post on torrent sites). Customer called me to say his bw was gone, and I
checked and found 12 yahoo crawlers hitting the site at 300K/s each (~30Mbps
+) downloading all the videos. This was all the more injurious as it was only
2004 and bandwidth was more than $1/mbps back then. I did the really crass
thing and nullrouted the whole /20 or whatever they were on per ARIN. It was
the new-at-the-time video.yahoo.com search engine coming to index the whole
site. I suppose they cant be too slow about it, or they'll never index a whole
webfull of videos this century, but still, 12x 300K/s in 2004? (At the time
Rasmus though it was kinda funny. I do too, now.)

/kc
-- 
Ken Chase - ken@heavycomputing.ca - +1 416 897 6284 - Toronto CANADA
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.

Ken Chase

Leslie

Harry Strongburg

Matthew Petach

Bruce Williams

Nathan Eisenberg

Bruce Williams

Valdis.Kletnieks＠vt.edu

Ken Chase

Matthew Petach

tags

participants (7)