I figured this was the best forum to post this; if anyone has suggestions where this might be better placed, please let me know. A university in our customer base has received funding to start a reasonably large spider project. It will crawl websites [search-engine fashion] and save certain parts of the information it receives. This information will be made available to research institutions and other concerns. We have been asked for recommendations on what functions/procedures they can put in to be good netizens and not cause undue stress to networks out there.

On the list of functions:

a) Obey robots.txt files
b) Allow network admins to have their netblocks automatically exempted on request
c) Allow ISPs' caches to sync with it

There are others, but they all revolve around a & b. C was something that seemed like a good idea, but I don't know if there is any real demand for it. Essentially, this project will have at least 1 Gb/s of inbound bandwidth; average usage is expected to be around 500 Mb/s for the first several months. ISPs who cache would have an advantage if they used the cache developed by this project to load their tables, but I do not know if there is an Internet-wide WCCP or equivalent out there, or if the improvement is worth the management overhead.

Because the funding is there, this project is essentially a certainty. If there are suggestions that should be added or concerns that this raises, please let me know [privately is fine].

All input is appreciated,

DJ
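For reference, a minimal sketch of the robots.txt check from item a), assuming the crawler were written in Python and used the standard library's robotparser module. The user-agent string "UniSpider" and the example URLs are placeholders, not names from the project:

  # Check a site's robots.txt before fetching a page.
  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("http://www.example.com/robots.txt")
  rp.read()                          # fetch and parse the site's robots.txt

  url = "http://www.example.com/some/page.html"
  if rp.can_fetch("UniSpider", url):
      print("allowed:", url)         # the crawler would fetch the page here
  else:
      print("disallowed:", url)      # skip it and move on

A missing robots.txt is treated as allow-all, so the check costs the remote site at most one small extra request per crawl run.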
On Wed, Jan 23, 2002 at 02:35:17PM -0500, Deepak Jain wrote: [snip]
This information will be made available to research institutions and other concerns. [snip] c) Allow ISP's caches to sync with it. [snip] ISPs who cache would have an advantage if they used the cache developed by this project to load their tables, but I do not know if there is an internet-wide WCCP or equivalent out there or if the improvement is worth the management overhead. [snip]
Assuming that the info will be made available in HTML format, the only thing you really need to do to achieve c) is to choose an appropriate value for the http-equiv="Expires" meta tag when serving the info, and have a cron job at each ISP make a request for the info at some arbitrary time. This last step really isn't that useful unless there are points of congestion, or times when the servers are bogged down.

The caches have to respect the Expires tag, though, and a broken clock can cause all sorts of fun on that end...

-- Bob <melange@yip.org> | Please don't feed the sock puppet.
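To make that suggestion concrete, a rough sketch (in Python, purely as an illustration) of emitting an Expires value that downstream caches can honor, either as an HTTP response header or as the meta tag. The 24-hour lifetime is an arbitrary example, not a project setting:

  # Build an RFC 1123 date 24 hours in the future and emit it both ways.
  from datetime import datetime, timedelta, timezone
  from email.utils import format_datetime

  expires = datetime.now(timezone.utc) + timedelta(hours=24)
  stamp = format_datetime(expires, usegmt=True)   # e.g. "Thu, 24 Jan 2002 19:35:17 GMT"

  print("Expires: " + stamp)                                  # HTTP header form
  print('<meta http-equiv="Expires" content="%s">' % stamp)   # meta-tag form

Either form only helps if the clocks on both ends are sane, which is the "broken clock" caveat above.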
<embarrassed look> Er, I just re-read that, and now understand what you meant by c). Sorry for the wasted bandwidth.

On Wed, Jan 23, 2002 at 03:29:38PM -0500, Bob K wrote:
[snip]
-- Bob <melange@yip.org> | Please don't feed the sock puppet.