What Are Caches Good For?

Experiences with installing and running an institutional cache.

This paper was first presented at the third International Web Caching Workshop in Manchester, June 1998.

Andrew Cormack
Cardiff University
40/41 Park Place

E-mail: Cormack@Cardiff.ac.uk

Phone: 01222-874489

When we were considering introducing a web cache in early 1996, there was little information available to help persuade management that the expenditure of time and money was worthwhile. Two years on, we know a great deal more about caches and hope that our experiences will be useful to others. This paper describes our experiences in bringing a cache into service and running it for eighteen months. It sets out the benefits to be obtained from using a cache, and how to maximise these, and looks at the likely future of caching.

While our experience has been with the squid cache software, running on Digital Unix and Linux platforms, most of the conclusions should be pertinent to any cache system.

Why Have a Web Cache?

For Users

The first, and most important, beneficiaries of introducing a web cache are users. A document obtained from our Cardiff cache [1] will reach a local user, on average, nearly five times faster than if it had to be retrieved from its site of origin. Since the cache can usually satisfy between 50% and 55% of requests this should result in a perceptible improvement in the users' experience of the web. Furthermore, with suitable preparation, the effect can be immediate: the cache achieved a 50% hit rate within 24 hours of being brought into service and the performance only declined when growth in demand exceeded the disk capacity originally provided.

The key to this rapid start-up was ensuring that a large number of users were ready to use the cache as soon as it was available. An Internet name for the cache (wwwcache.cardiff.ac.uk) was created nine months before the machine itself existed. This name was set as an alias for the JANET national cache [2] - then at the University of Kent, Canterbury - though the same function could have been provided by any existing cache which was prepared to accept requests from our users. The local name was then installed as the default configuration for new and updated web browsers so that in time a reasonable proportion of users acquired references to the cache without having to perform any manual changes. Once the local cache machine had been purchased, installed and tested by trial users with re-configured browsers, the alias was changed to refer to it and a significant user community followed.
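The alias arrangement is simple to express in DNS terms. A sketch in BIND zone-file syntax follows; the host names are those used in this paper, but the TTL and the eventual local cache machine's name are illustrative:

```
; Stage one: the local cache name is an alias for the national cache.
wwwcache    3600    IN  CNAME   wwwcache.ja.net.

; Stage two: once the local machine is in service, only the alias
; target changes; browsers configured with the local name follow
; automatically, once resolvers' cached copies of the record expire.
; wwwcache  3600    IN  CNAME   cache1.cardiff.ac.uk.
```

Because browsers are configured with the stable local name, neither stage requires any change on users' machines.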

Introducing caching in this "transparent" fashion carries with it an obligation that the change is indeed as transparent as possible. Since users do not need to take any action to use the cache there is no guarantee that they will read, or even be aware of, any release note or warning message which is provided. When the cache forces itself into their consciousness, it must therefore do so in a way which is completely self-explanatory. The most common example is error messages. A user who types an invalid host name in a URL will get a different error message if they are using a cache, since it will be the cache, and not the browser, which discovers the error. When we first used the national cache we received many reports of "faults on the Internet", simply because the error message came from a remote location. In configuring our site cache we have tried to avoid this problem by ensuring that any error pages returned by the cache include a hypertext link to a help page. This has simple explanations of the cause, and recommended action, for each message. We would not claim these as perfect but they seem to have the desired effect: the page receives a large number of hits and the fault desk no longer does!
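squid keeps its error messages as editable HTML template files, one per error, in its errors directory, so adding the help link is a matter of editing those templates. A sketch for a DNS-failure page is below; the help-page URL is illustrative, and %U is squid's substitution code for the requested URL:

```
<!-- Sketch of a customised errors/ERR_DNS_FAIL template -->
<h2>The web cache could not find the site you asked for</h2>
<p>The address in the link <tt>%U</tt> does not appear to exist;
the most common cause is a mistyped host name.</p>
<p>The <a href="http://wwwcache.cardiff.ac.uk/help.html">cache help
page</a> explains this and the other messages the cache may return.</p>
```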

The cache must always retain users' goodwill. We do not currently force browsers to use it and such compulsion would be strongly resisted in our university environment. So long as users retain the ability to disable caching, it is essential to ensure that they do not exercise it. The cache can only perform as well as, or better than, an independent browser if it is provided with sufficient resources, especially physical memory. If the cache server ever begins to swap its speed of response will fall and users will desert. There must also be adequate cover for hardware failures. We have never installed an auto-configuration script into our browsers - when we started few users could take advantage of them - which leaves the cache as a single point of failure affecting all access to the web. Instead, in our original installation, we had a smaller backup machine running the cache software which could be brought into service on demand by changing the target of the wwwcache alias. This was never required to cover a failure but did allow us to take the main cache out of service for operating system and software upgrades without interrupting access to the web. We have now replaced the original cache with a pair of identical machines which are continuously in service. This provides better resilience, though some browsers still require manual intervention after a change of cache configuration. We hope, in time, to make the recovery process completely automatic. When changing cache aliases, we have found that it is necessary to make the change well in advance of any planned downtime. Since most browsers choose a cache only when they start up, and users often leave a browser running all day, the changeover is best done the evening before planned maintenance work. Similarly, even after normal service is restored, some client machines continue to refer to the temporary configuration for several hours.
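Where browsers do support auto-configuration, the PAC script format removes the single point of failure directly, since a browser will fall back through a list of proxies in order. A sketch, with illustrative host names:

```
// Proxy auto-configuration (PAC) sketch: browsers try each PROXY
// entry in turn and fall back to a direct connection if both cache
// machines are down.
function FindProxyForURL(url, host) {
    // Keep requests for local servers off the cache entirely.
    if (dnsDomainIs(host, ".cardiff.ac.uk"))
        return "DIRECT";
    return "PROXY cache1.cardiff.ac.uk:3128; " +
           "PROXY cache2.cardiff.ac.uk:3128; DIRECT";
}
```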

The majority of users now seem satisfied with, or at least unaware of, the web cache. There are occasional complaints that a page could be retrieved more quickly with caching disabled, but these seldom include sufficient detail to allow a full investigation. Most users seem to have learnt that typing in a bare domain name, such as "netscape" or "microsoft", and relying on their browser to try all possibilities, will not work through the cache. Perhaps they found that most of the pre-configured expansions were inappropriate for British use. Surprisingly, there have been no complaints of the cache serving out of date pages, even though the refresh parameters are deliberately aggressive. Perhaps the best proof of satisfaction is that, without forcing anyone to use the cache, over 75% of our web traffic does so. The remainder may be dissatisfied users, or simply those with their own browsers who do not know the cache exists. Contacting them is a problem for the future.

One unexpected benefit of the cache has fallen to users at the university's field stations which use commercial Internet Service Providers for their network connection. By routing their requests via the cache they can gain access to internal university pages as well as other services restricted to academic addresses. Careful configuration is required to allow these users access while preventing the cache providing the same address-laundering function to other sites.

For Accountants

Caches are also popular with accountants and others concerned with the volume of network traffic. However, they should not have false expectations. Even the best cache is unlikely to halve a site's total traffic, so the same speed of external network connection will still be required. A cache will provide much more efficient use of the connection by reducing the amount of duplicated web transfers. This effect can be seen in statistics from the JANET trans-Atlantic link: of Cardiff's total use in January 1998, only 59% was due to web browsers, compared to the national average of 69%. Sites which do not have institutional caches have figures as high as 83% [3]. By releasing network resources the cache can even improve the response of other services which have to compete for bandwidth. The introduction of a cache will reduce the volume of network traffic, but over time that volume will still grow: the potential saving from a cache seems to be reasonably constant at around 45% of web traffic, but a fixed percentage of a growing demand is still an ever-increasing quantity.

There is a plausible theory that if one cache is a good thing, then many caches will be even better. So site caches should be served by regional caches which should be served by national caches. By reducing the network traffic to the absolute minimum, the theory goes, the users' satisfaction will be maximised. This theory assumes that network congestion is the only factor delaying traffic which, in practice, is not always true. Each cache takes a little time to process a request and check its tables of stored objects and, in a deep hierarchy, these additional delays may outweigh any possible saving. This is particularly likely to occur on very high-speed networks such as SuperJANET where the network's performance can only be matched by very powerful servers.

The initial configuration of the Cardiff cache assumed that access within JANET was fast and everything else was slow, so all requests for sites not in .ac.uk should be passed to the JANET National Cache. Following complaints from users in November 1997, alternative configurations were compared using the average transfer rate as a measure of the performance perceived by users. This is calculated as the number of bytes transferred to the user, divided by the elapsed time between the cache receiving a request and completing the transfer of the document back to the browser. This definition will clearly be affected by factors other than the caches, but seems to relate well to users' understanding of "performance". The table shows the average transfer rate on weekdays for each configuration as calculated by the analysis script, calamaris [4]. Configuration A has the national cache as parent for all sites other than .ac.uk; configuration B uses the cache as parent for sites outside the .uk domain; configuration C is selfish, using the national cache only when it is known to have the requested file.

  Configuration   Parent cache used for        Transfer rate
  A               sites outside .ac.uk
  B               sites outside .uk
  C               known cache hits only
We were shocked to find that this last configuration doubled the average transfer rate, compared with the original! The explanation lies in the arrangement of JANET's international links which are, at present, available on an equal basis to all JANET sites. Since the Cardiff cache has a high-speed connection to SuperJANET it has exactly the same access to international bandwidth as the national cache. It is therefore inevitable that a document fetched for us by the national cache will arrive more slowly than had the Cardiff cache obtained it direct from its origin, simply because the request is being processed by one additional cache server. Requests which can be satisfied from the national cache should arrive more quickly, but these are in a minority since most of the popular pages will be found in the local cache and need go no further.
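The transfer-rate figure behind this comparison is straightforward to compute from squid's native access.log, in which the second field is the elapsed time in milliseconds and the fifth is the number of bytes returned to the client. A minimal sketch of the calculation (calamaris does this, and a great deal more):

```python
def average_transfer_rate(log_lines):
    """Return the mean transfer rate, in bytes per second, over all
    requests in an iterable of squid native access.log lines."""
    total_bytes = 0
    total_seconds = 0.0
    for line in log_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip malformed lines
        elapsed_ms = int(fields[1])   # time to complete the request
        size = int(fields[4])         # bytes delivered to the browser
        total_bytes += size
        total_seconds += elapsed_ms / 1000.0
    return total_bytes / total_seconds if total_seconds else 0.0
```

Comparing this figure for the same days of the week under each configuration gives the numbers discussed above.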

In the spirit of co-operation we have reverted to our second configuration, using the national cache as parent for sites outside the UK, and users seem satisfied with the result. This decision would be made easier if there were some incentive to use the national cache. This may be provided by the stick of charging for direct use of the international links, or by the carrot of dedicating some international bandwidth to the national cache to give it a performance advantage.
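This configuration can be expressed in a few lines of squid configuration. A sketch using current directive names (releases of the period used cache_host and cache_host_domain, but the structure is the same; ports 3128 and 3130 are the conventional HTTP and ICP ports):

```
# Use the national cache as parent only for sites outside .uk;
# everything within .uk is fetched direct.
cache_peer        wwwcache.ja.net  parent  3128  3130
cache_peer_domain wwwcache.ja.net  !.uk
```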

Caching in Future

Web caching appears to have a secure future as it becomes increasingly important to make efficient use of scarce network resources. The number of site caches seems likely to increase, since the benefits can be obtained at relatively low cost. The most common cache software, squid [5], is free and has only modest CPU requirements [6]. The principal demand is for memory and disk, both of which are extremely cheap. A web cache is one of the few internet servers unlikely to make sudden unexpected demands for more hardware: provided disk and memory sizes are kept in proportion, increased demand (within the constraints of disk and network bandwidth) produces only a gradual reduction in the length of time documents stay in the cache.

To provide the best user service from new and existing networks it will be necessary to tune cache configurations to reflect changes in both supply and demand. Peering arrangements need to be kept under review, since, over time, there is a tendency for peer caches to acquire similar contents, making them less able to satisfy one another's requests. Within a group of caches, different load-sharing algorithms may be more efficient in both network and disk use. Problems with squid's ICP protocol have been identified in theory by Ross [7] and in practice by the national cache [8] during November 1997. However, the alternative peering methods require common, or at least co-ordinated, management.

The greatest encouragement to use caches is likely to come from the increased cost of networks and, in the UK at least, the decision for the first time to pass some of that cost on to Higher Education Institutions from August 1998 [9]. Cache administrators should welcome their new-found popularity, but also beware of some possible hazards.

The ideal charging regime should be designed to promote good practice in use of the network, and should reward efforts by users and sites to reduce the overall demand on central resources. In particular it should encourage the greatest possible use of web caches. A simple model, where sites are charged for each byte they receive across international links, would seem to meet this requirement. Unfortunately it also discourages large sites, which already have caches, from supporting their smaller neighbours. If "foreign" browsers are permitted access to a site cache, then the cost of the reduced traffic will be transferred to the owners of the cache. Even if a small site is able to install its own cache, larger neighbours may be reluctant to enter into unequal peering arrangements, since they are likely to effectively subsidise their partners in bytes if not in pounds. Where a number of cacheless sites share a connection, for example to a Metropolitan Area Network, it may be better for the MAN to provide a shared cache for these sites, forming a peering group with local institutional caches.

The implications of charging for the national cache service are also unclear. If a reduced price is offered as an incentive to use the cache there may be a temptation for site caches to save money by passing on requests which are known to be uncachable, such as form submissions, forcing the national caches to do fruitless work.

On the other hand, one benefit of charging may be to provide an economic counter-argument to commercial sites which deliberately defeat caching to obtain high usage figures to satisfy their own accountants. Such sites seem to feel that one user on the balance sheet is worth several who gave up because the site was too slow [10].

Since network traffic costs money, and caches are proven to reduce network traffic, there will be an obvious temptation for financial minds to move from encouraging the use of a site cache to enforcing it. This is technically simple - configuring a site's router to allow only the cache to make off-site HTTP connections takes a matter of moments - but has other implications which must not be forgotten. With such a configuration, the cache becomes a critical point of failure. If it breaks then no-one will be able to access the World Wide Web or, as many people perceive it, the Internet itself. University research and teaching are increasingly reliant on the web. Without enforcement a cache failure is unfortunate, since it damages the reputation of the service: with enforcement a failure threatens much of the work of the university. A cache installation, such as ours, which requires manual intervention to restore full service may no longer be adequate protection, and automatic fault-tolerant systems may be necessary. For web caches, such systems are relatively easy to construct, and may be as simple as a monitor program running on the primary DNS server and switching an alias pointer. Browser auto-configuration scripts may also be used to switch automatically without the need for complex failsafe systems which insert themselves at the IP level. All of these solutions require that the cache hardware be duplicated or, at least, that the normal cache service be provided by multiple machines which can re-configure themselves if one of their number fails.
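The liveness check at the heart of such a monitor is simple. A sketch in Python: probe each cache machine's HTTP port in turn and report the first that answers. In a real deployment the result would drive the DNS alias change (for example via a dynamic update); that step is omitted here, and the host names would be the site's own.

```python
import socket

def first_live_cache(caches, timeout=2.0):
    """Return the first (host, port) pair accepting a TCP connection,
    or None if every cache in the list is down. Intended to be run
    periodically by a monitor on the primary DNS server."""
    for host, port in caches:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue  # refused or timed out: try the next machine
    return None
```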

Even with a working cache, sites which, for technical or licensing reasons, can only be reached over a direct connection will become inaccessible. If the restrictions cannot be altered, it may be necessary to relax the enforcement rules for such sites.

In the home and in commerce, Internet access policies are often enforced by blocking access to undesirable sites. A site web cache, being a choke point through which all requests must pass, might seem a good place to apply such restrictions. In a research university, at least, there are a number of problems. The first is simple quantity. Lists of sites containing questionable material often run into tens or hundreds of thousands. The Cardiff cache receives over 600,000 requests a day from browsers; the hardware required to check each of these against all restricted sites at the speed our users expect of a 100Mbit/sec network does not bear contemplating! Second is the problem of translating into rules the simple statement "non-academic use". Cardiff University has 26 departments studying a vast range of topics and the cache also provides a service to the University of Wales College of Medicine, which widens the scope of "legitimate" activities still further. I hope that on this point sanity will prevail and, at most, written policy and post hoc accounting will be proposed. Even this bears the risk that users will simply disable caching (and therefore accounting) while engaging in dubious activities. These are, of course, the sites most likely to use non-standard port numbers, and so escape the notice of any exclusive router policy.

Using cache logs for accounting raises issues of privacy which need to be addressed by policy before the occasion arises. IP addresses in cache logs can often be translated to individual people, so a cache log may well contain a record of a person's activity on the web. If this is not regarded as fully public information, then rules must be agreed covering the circumstances in which it may be disclosed, to whom, and by what procedure. This issue may be further complicated when a cache is shared between institutions which may have different policies. Cache logs can also be useful in tracing computer misuse: it is surprising how many people forget them when posting abusive mail through an "anonymous" mail or guestbook service.
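One practical precaution, where logs must be passed to accounting scripts, is to pseudonymise the client addresses first. The keyed-hash scheme below is an illustration, not a squid feature: the same address always yields the same token, so per-user totals survive, but the mapping cannot be reversed without the site's secret key.

```python
import hashlib
import hmac

def pseudonymise(ip_address, secret_key):
    """Return a short, stable pseudonym for an IP address, using a
    keyed hash so the mapping cannot be inverted without the key."""
    digest = hmac.new(secret_key, ip_address.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]
```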


The development of web caching has been achieved through generous co-operation by people as well as machines. The increasing cost of networks, and the greater awareness of this cost to institutions, will be strong incentives in favour of caching. However it will be important to resist purely economic arguments which may threaten the existing caching community spirit.


  1. Cardiff web cache <URL: http://www.cf.ac.uk/uwc/comp/Webcache/>
  2. JANET National web cache <URL: http://wwwcache.ja.net/>
  3. Trans-Atlantic network traffic <URL: http://statto.ukerna.ac.uk/> (to June 1998); <URL: http://bill.ja.net/> (from July 1998).
  4. Calamaris home page <URL: http://www.detmold.netsurf.de/homepages/cord/tools/squid/Welcome.html.en>
  5. Squid program <URL: http://squid.nlanr.net/Squid/>
  6. Squid hardware requirements <URL: http://wwwcache.ja.net/servers/squids.html>
  7. Keith W. Ross, Hash Routing for Collections of Shared Web Caches. IEEE Network Vol.11 No.6, pp37-44
  8. National cache problems <URL: http://wwwcache.ja.net/ops/1997/11/>
  9. JISC Charging announcement <URL: http://www.jisc.ac.uk/pub98/c3_98.html>
  10. Cache busting draft <URL: ftp://ftp.net.lut.ac.uk/internet-drafts/draft-hamilton-cachebusting-00.txt>