Jamie writes:
While this thread is slowly drifting, I disagree with your assertion that so much of the web traffic is cacheable. NLANR's caching effort, if I remember correctly, only got around a 60% hit rate, pooled over a large number of clients; that is probably a fair estimate of the proportion of cacheable content on the net. If anything, the net is moving to be *more* dynamic. The problem is that web sites are setting unrealistic Expires values on images and html files because they're driven by ad revenues, and I doubt that any of the US-based commercial websites are interested in losing entries in their hit logs. Caching is also the kind of thing that is totally broken by session-ids (at sites like amazon.com and cdnow).
The only way caching is going to be truly viable in the next 5 years is either for a commercial company to step in and work with commercial content providers (which is happening now), or for webserver software vendors to work with content companies on truly embracing a hit-reporting protocol.
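To make Jamie's session-id point concrete, here is a toy sketch (illustrative only; the URLs and session tokens are made up and not taken from any real site). A shared cache keyed on the full request URL never sees a repeat when every user carries a private token in the query string, so identical pages become distinct cache entries:

    cache = {}

    def fetch(url):
        # Return the page body plus whether it came from the cache.
        if url in cache:
            return cache[url], "HIT"
        body = "<html>...the same product page...</html>"   # pretend origin fetch
        cache[url] = body
        return body, "MISS"

    # Two users ask for the *same* page, but the session-id makes the cache keys differ:
    print(fetch("http://shop.example.com/product/42?session=alice-8d1f")[1])   # MISS
    print(fetch("http://shop.example.com/product/42?session=bob-77c2")[1])     # MISS

    # Strip the session-id and the second request is a hit:
    print(fetch("http://shop.example.com/product/42")[1])                      # MISS
    print(fetch("http://shop.example.com/product/42")[1])                      # HIT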
The results from the last IRCACHE workshop have some interesting data on hit rates in a variety of caches (http://workshop.ircache.net/ for the main program). In general, it is even worse than you assert; hit rates are often as low as 40 percent, even for a cache serving a large number of users. There has, however, been a fair amount of work to determine which cache replacement algorithms are effective; John Dilley, in particular, has implemented several for Squid (the IRCACHE group's example cache engine).

Like Jamie, I tend to believe that the current caching paradigm is broken. It relies on a community of users having sufficiently similar patterns of use to populate a cache with resources which will be re-used; in most cases, that doesn't happen often enough to make it worthwhile, except where the resources are very expensive to fetch (trans-oceanic links, etc.) or where the cache and the aggregated user community are very large indeed.

At a BOF at the last IRCACHE workshop, a group of us discussed the idea of creating a caching system that acts on behalf of the content providers rather than the users (an outward-facing "surrogate" instead of an inward-facing "proxy"). This paradigm relies on the fairly well-documented phenomenon of "flash crowds" or "cnn events": the presumption is that the users accessing a particular content provider will tend to request heavily overlapping sets of resources over short time intervals. That reflects my experience as a NASA web guy, as well as the experience of some of the web hosting providers in the room at the time. You won't always get the high overlap rates of a CNN event, of course, but it seems worth checking whether we can do better than the rates for proxy caches. Surrogates have their own problems, of course, but they do solve some of the traditional proxy issues, like hit metering and authentication, since the surrogate operator has a prior business relationship with the content provider.

This discussion and work continues on the mailing list surrogates@equinix.com (majordomo syntax to the -request address). The URL of the original BOF info is http://workshop.ircache.net/BOFs/bof2.html, for those who are interested.

regards,
Ted Hardie
Equinix
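P.S. For a concrete feel for why the surrogate model can pay off even with very short freshness lifetimes, here is a toy sketch (illustrative only; the class, URL, and TTL are made up and have nothing to do with Squid or any real surrogate implementation). The only point is that a flash crowd's overlapping requests mostly hit the cache even when each object is held for just a short while:

    import time

    class Surrogate:
        # A cache sitting in front of a single origin server, keyed on URL,
        # holding each response for a short, fixed freshness lifetime (TTL).
        def __init__(self, origin_fetch, ttl_seconds=30):
            self.origin_fetch = origin_fetch       # callable: url -> body
            self.ttl = ttl_seconds
            self.store = {}                        # url -> (expires_at, body)
            self.hits = self.misses = 0

        def get(self, url):
            now = time.time()
            entry = self.store.get(url)
            if entry and entry[0] > now:           # still fresh: serve from cache
                self.hits += 1
                return entry[1]
            self.misses += 1
            body = self.origin_fetch(url)          # go back to the origin
            self.store[url] = (now + self.ttl, body)
            return body

    # A burst of near-simultaneous requests for the same story mostly hits the cache:
    surrogate = Surrogate(lambda url: "<html>story at %s</html>" % url)
    for _ in range(1000):
        surrogate.get("http://news.example.com/big-story")
    print(surrogate.hits, surrogate.misses)        # 999 hits, 1 miss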