Hi,
a) Obey robots.txt files
b) Allow network admins to automatically have their netblocks exempted on request
c) Allow ISPs' caches to sync with it.
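On point (a), here is a minimal sketch of the robots.txt check using Python's standard robotparser module - the URL and user-agent string below are just placeholders, not anything from this project:

from urllib import robotparser

# Placeholder robots.txt URL and user-agent; substitute the crawler's own values.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

if rp.can_fetch("ExampleCrawler/1.0", "http://www.example.com/some/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")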
I don't know if this is already on your list, but I'd also suggest "d) Rate-limiting of requests to a netblock/server". I haven't got any references immediately to hand, but I do seem to recall a crawler written in such a way that it remained "server-friendly" and would not fire off too many requests too quickly.
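By rate-limiting I mean something along these lines - a minimal Python sketch, where the 5-second per-host gap is only an example and a real crawler would probably key on netblocks rather than hostnames:

import time
import urllib.request
from urllib.parse import urlparse

# Minimum gap between requests to the same host; 5 seconds is illustrative.
MIN_DELAY_SECONDS = 5.0
last_request = {}  # host -> time of the most recent request to that host

def polite_fetch(url):
    """Fetch a URL, sleeping first if its host was contacted too recently."""
    host = urlparse(url).netloc
    previous = last_request.get(host)
    if previous is not None:
        wait = MIN_DELAY_SECONDS - (time.time() - previous)
        if wait > 0:
            time.sleep(wait)
    last_request[host] = time.time()
    with urllib.request.urlopen(url) as response:
        return response.read()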
ISPs that run caches would have an advantage if they used the cache developed by this project to load their tables, but I do not know whether there is an Internet-wide WCCP or equivalent out there, or whether the improvement would be worth the management overhead.
It may be worth having a quick look at http://www.ircache.net/ - amongst other things, there is a database of known caches available through a WHOIS interface.
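A plain WHOIS lookup is just a one-line query over TCP port 43 (RFC 3912), so something like the sketch below would do for poking at such a database - the server name is a placeholder, as I don't have the real host name for ircache.net's WHOIS service to hand:

import socket

# Generic WHOIS client: send the query, then read until the server closes.
# The default server name is a placeholder, not the real ircache.net host.
def whois_query(query, server="whois.example.net", port=43):
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((query + "\r\n").encode("ascii"))
        reply = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            reply.append(data)
    return b"".join(reply).decode("utf-8", errors="replace")

print(whois_query("some-cache.example.org"))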
HTH,
Jonathan