Hi,
a) Obey robots.txt files
b) Allow network admins to automatically have their netblocks exempted on request
c) Allow ISPs' caches to sync with it.
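
For (a), the standard robots.txt check is fairly easy to wire in. Here's a minimal Python sketch using the standard library's urllib.robotparser; the user-agent string is just a placeholder, not anything from your project:

    from urllib.parse import urlsplit, urlunsplit
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleCrawler/0.1"   # placeholder; use the crawler's real name

    def allowed_to_fetch(url):
        """Fetch and parse the site's robots.txt, then test the URL against it."""
        parts = urlsplit(url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()                       # downloads and parses robots.txt
        return rp.can_fetch(USER_AGENT, url)

In practice you'd also want to cache the parsed robots.txt per host rather than re-fetching it for every URL.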
I don't know if this is already on your list, but I'd also suggest "d) Rate-limiting of requests to a netblock/server". I haven't got any references immediately to hand, but I do seem to recall a crawler written in such a way that it remained "server-friendly" and would not fire off too many requests too quickly.
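The simplest way I can think of to do (d) is a per-host minimum delay between requests; something along these lines, where the 5-second interval is an arbitrary figure and you would probably want to key on netblocks rather than hostnames:

    import time
    from urllib.parse import urlsplit

    MIN_INTERVAL = 5.0        # seconds between requests to the same host (arbitrary)
    _last_request = {}        # host -> time of our last request to it

    def wait_politely(url):
        """Sleep until at least MIN_INTERVAL has passed since our last hit on this host."""
        host = urlsplit(url).hostname
        earliest = _last_request.get(host, 0.0) + MIN_INTERVAL
        now = time.monotonic()
        if now < earliest:
            time.sleep(earliest - now)
        _last_request[host] = time.monotonic()

Call wait_politely(url) immediately before each fetch and the crawler stays "server-friendly" without any per-site configuration.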
ISPs that run caches would gain an advantage if they used the cache developed by this project to preload their tables, but I don't know whether there is an Internet-wide WCCP or equivalent out there, or whether the improvement would be worth the management overhead.
It may be worth having a quick look at http://www.ircache.net/ - there is a database of known caches available through a WHOIS interface, amongst other things.

HTH,
Jonathan
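
P.S. If you want to query that WHOIS database from a script, a classic WHOIS-style query (plain text over TCP port 43) should do it. A rough Python sketch follows; note that the hostname "whois.ircache.net" is my guess, so check their site for the actual server:

    import socket

    def whois_query(query, server="whois.ircache.net", port=43):
        """Send a plain-text WHOIS query and return the server's full reply.

        NOTE: the default server name is a guess, not a confirmed hostname."""
        with socket.create_connection((server, port), timeout=10) as sock:
            sock.sendall((query + "\r\n").encode("ascii"))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:            # server closes the connection when done
                    break
                chunks.append(data)
        return b"".join(chunks).decode("latin-1", errors="replace")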