Collection Level Description

A review of existing practice

...an eLib supporting study

3. Collection Description

3.9 Centroids

WHOIS++ [RFC1835] is a simple search and retrieval protocol for the Internet, designed to be straightforward to implement. The protocol supports only those features for which there is a demonstrable need. Additionally, the protocol does not require major computing resources nor does it require any special expertise of its implementers. In particular, WHOIS++:

Only supports one query language
Constrains responses to follow a standard format
Provides a small number of well defined 'system commands'

A WHOIS++ client does not have to worry about whether the server will be able to handle the query that it sends, or that the server will respond with a format of results which it does understand. This is because in both request and response there can only be the one prescribed format for the data that passes between client and server. All WHOIS++ communications take the form of plain text, sent a line at a time. Text based protocols are considered easy to work with and this has resulted in their wide popularity on the Internet.

WHOIS++ supports distributed cross-searching via centroids and referrals.

3.9.1 Referrals and centroids

Cross searching is supported by WHOIS++ via two modes of operation. The first (obvious) method is sending the query to a number of different servers and collating the results. This method is undesirable from a bandwidth point of view, because it is likely to lead to a situation where every WHOIS++ service is being queried whenever a search is performed. It may, however, be desirable from the point of view of the WHOIS++ server to use this mechanism to collect usage and (potential) billing information.

The second method offers a way of discovering information held on other servers in addition to the one being explicitly queried. This is achieved via a 'referral'. A referral gives connection information for another server that may be able to answer the query. Referrals are generated from forward knowledge gathered previously. In WHOIS++ terminology, this forward knowledge is represented as a summary of the contents of a server and is known as a 'centroid'. Since centroids are only a summary, a referral resulting from a centroid match should not be assumed to be a genuine hit. For example, centroids have no way of knowing about word proximity information.

Each centroid is essentially a simple inverted index of the information in the database. For each of the types of record that have been used, and each of the attributes within those records, the centroid contains a list of all the index terms that have been found. To see how this works in practice, consider a database that contains the following SERVICE records:

Title: Social history server
Description: This server provides researchers in social history with
 resources that may be of interest.

Title: Military history database
Description: A database of pointers to military history resources.

Title: Medical history server
Description: A server that provides pointers to resources dealing
 with medical history

It can be seen that a number of words appear in the same attribute in multiple records. For example, in all three, the term 'history' appears in the value associated with the 'Title' attribute. The centroid of these records consists of the list of unique terms associated with each attribute. By unique terms we mean that we only record the first instance of an index term, even though it might occur repeatedly in the given attribute when the whole database is considered. Consequently, the centroid information associated with the 'Title' attribute would be:

Social
History
Server
Military
Database
Medical

Note that the original 'Title' fields in the records contained a total of nine words in the values whereas the centroid only has six words. The removal of redundant multiple instances of index terms typically makes the centroid associated with an attribute much smaller than the original data held in that attribute over all of the records in the database, which is a major difference when compared with the inverted index. Since real databases are much larger than this example, the chance that the same words will appear over and over again in multiple records is increased, and so the size of the centroid relative to the database is likely to be significantly smaller. We find that centroids for large data sets are quite large when viewed by themselves but relatively small when compared with the size of the original data from which they were derived.

If a database makes use of more than one object type then a centroid will contain a list of object types, the attributes contained in each type and the unique words within each attribute of each type. This would be treated separately from the 'Title' attribute of the 'DOCUMENT' object. This increases centroid size slightly as different object types are likely to share some common attributes and the unique words in these attributes will have to be repeated.

Although referrals and centroids have been presented here in the context of WHOIS++, there is no fundamental obstacle to using them with other protocols. For example, in the case of an open-ended protocol such as Z39.50, it would simply be necessary to develop a distributed indexing and searching 'profile' based on a common agreement on how best to implement these features in the Z39.50 context.

3.9.2 Building index servers

The combination of a simple search and retrieval protocol such as WHOIS++ with referrals and centroids offers us a simple yet highly sophisticated way of doing searches that span multiple databases. Once a centroid has been generated for a given collection of information, it can be made available to interested parties over the Internet. These interested parties are usually referred to as 'index servers', and there are two main ways for them to ingest centroids:

3.9.3 Pushing centroids

The originating server has a list of index services with which it will share the centroids it may generate of its collection(s). In this case, each index server may be contacted in turn and informed that the centroid of the originating server is now out of date due to data changes. This has the advantage that the remote copy can be kept relatively up to date, perhaps by automatically sending a new version whenever the database is changed. However, care needs to be taken in choosing distribution times which are convenient for both parties - e.g. if they are in different time zones. It also means that new index servers may have to wait until they are manually added to the distribution list before they can get centroids.

3.9.4 Pulling centroids

The index server knows about the servers that it receives centroid information from and regularly polls them for centroids, resulting in it 'pulling' centroids from the downstream servers. This is good from the point of view of the index server, since it can decide when it would like to pull centroids and how often. It may also be able to dynamically pick up centroid information from new collections that it has discovered - though there is no requirement that the information be made available to a previously unknown poller. It has the disadvantage that stale centroids may be left in the database of the polling server, and the index server might try to pull centroids at times which are inconvenient for the polled services - e.g. due to time zone differences.

3.9.5 A centroid redux

Both models of centroid distribution are likely to be of use on the Internet. A single service might even use a combination of both methods. For example, consider a subject gateway having several satellite subject services specialising in particular kinds of information. It might gather centroids by polling these and from time to time push an updated centroid up to a higher level cross-disciplinary subject service. The main database might also contain centroids drawn from other WHOIS++ servers that were not part of this framework of satellites. From this it can be seen that not all centroid aware servers will find it worth picking up centroid information themselves. Many of these can be expected to make use of an index server, usually topologically close (nearby on the network) rather than collecting centroids. A typical scenario would involve the smaller services referring the user upstream to index servers if no matches were found in the database they first started working with. This decision of which index servers to refer the end user to could be based to some extent on search preferences (e.g. only medical/health related information) and/or lists of index servers which are considered to be worthwhile contacting - such as our hypothetical subject gateway index server. It may also be desirable to rank referrals in terms of the perceived relevance of the server - for example, eLib funded services choose to specialise in quality reviewed resources whereas services operated by volunteers may not be quite as particular. This may not be the case with the other centroids held by the index server. This inter-linking of index servers is typically referred to as a 'mesh', and is described in more detail in [RFC1914] and [RFC1835].

Martin Hamilton, Loughborough University