On Wed, Sep 8, 2010 at 9:20 AM, Ken Chase <ken@sizone.org> wrote:
On Wed, Sep 08, 2010 at 12:04:07AM -0700, Matthew Petach said:
> I *am* curious--what makes it any worse for a search engine like Google
> to fetch the file than any other random user on the Internet? In either
> case, the machine doing the fetch isn't going to rate-limit the fetch,
> so you're likely to see the same impact on the machine, and on the
> bandwidth.
I think that the difference is that there's a way to get to Yahoo and ask them WTF. Whereas you have no recourse against the guy who mass-downloads your site with a script in 2 hrs (modulo well-funded banks dispatching squads with baseball bats to resolve hacking incidents). I also expect that Yahoo's behaviour is driven by policy, not random assholishness (I hope :), and therefore that such incidents will recur. I also expect whinging on nanog might get me some visibility into said policy and leverage to change it! </dream>

/kc
--
Ken Chase - ken@heavycomputing.ca - +1 416 897 6284 - Toronto CANADA
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.
Well, I'd hazard a guess that the policy of the webcrawling machines at Bing, Google, Yahoo, Ask.com, and every other large search engine is probably to crawl the Internet, pulling down pages and indexing them for their search engines, always checking for a robots.txt file and carefully following the instructions within it.

Lacking any such file, one might suppose the policy is to limit how many pages are fetched per interval of time, to avoid hammering a single server unnecessarily, and to space out the intervals at which the site is revisited, balancing the desire to maintain a current, fresh view of the content against the limited server resources available for serving it.

Note that I have no actual knowledge of the crawling policies at any of the aforementioned sites; I'm simply hypothesizing about what their policies might logically be.

I'm curious--what level of visibility are you seeking into the crawling policies of the search engines, and what changes are you hoping to gain leverage to make to those policies?

Thanks!

Matt
(speaking only for myself, not for any current or past employer)
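P.S. To make the robots.txt point concrete: a site that wants to slow crawlers down can publish something like

    User-agent: *
    Crawl-delay: 10
    Disallow: /private/

(Crawl-delay isn't part of the original robots.txt spec, but several large crawlers, Yahoo's Slurp among them, have historically honored it.) Here's a minimal sketch in Python of what a polite crawler's check might look like, using the standard library's urllib.robotparser; the hostname example.com and the user-agent string "ExampleBot" are placeholders, and I'm not claiming any real crawler is implemented this way:

    import time
    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt (hostname is hypothetical).
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    agent = "ExampleBot"  # placeholder user-agent for illustration
    url = "https://example.com/some/page.html"

    if rp.can_fetch(agent, url):
        # Honor an explicit Crawl-delay if the site declares one;
        # otherwise fall back to a conservative default interval.
        delay = rp.crawl_delay(agent) or 10
        time.sleep(delay)
        # ... actually fetch the page here ...

In practice a crawler would also cache the parsed robots.txt rather than re-fetching it for every page, and spread requests to a given host over time, but the above captures the basic politeness check.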