A review of existing practice
...an eLib supporting study
WHOIS++ [RFC1835] is a simple search and retrieval protocol for the Internet, designed to be straightforward to implement. The protocol supports only those features for which there is a demonstrable need. It requires neither major computing resources nor any special expertise from its implementers. In particular, WHOIS++:
A WHOIS++ client does not have to worry about whether the server will be able to handle the query that it sends, or whether the server will respond with results in a format that it does not understand. This is because there is only one prescribed format for the data that passes between client and server, in both request and response. All WHOIS++ communications take the form of plain text, sent a line at a time. Text-based protocols are considered easy to work with, and this has made them widely popular on the Internet.
WHOIS++ supports distributed cross-searching via centroids and referrals.
Cross-searching is supported by WHOIS++ via two modes of operation. The first (obvious) method is to send the query to a number of different servers and collate the results. This method is undesirable from a bandwidth point of view, because it is likely to lead to a situation where every WHOIS++ service is queried whenever a search is performed. It may, however, be desirable from the point of view of the WHOIS++ server, since the mechanism can be used to collect usage and (potential) billing information.
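The first method can be sketched as follows. This is only an illustration: the server names, the in-memory title lists and the functions search_server and cross_search are hypothetical stand-ins for real network connections, and the matching rule (whole-word match on the title) is an assumption.

```python
# Hypothetical stand-ins for remote WHOIS++ servers: each entry maps a
# server name to the record titles it holds. A real client would open a
# network connection to each server instead.
SERVERS = {
    "server-a.example.org": ["Social history server", "Road atlas"],
    "server-b.example.org": ["Military history database"],
    "server-c.example.org": ["Weather reports"],
}


def search_server(titles, term):
    """Return the titles on one server that contain the query term."""
    term = term.lower()
    return [title for title in titles if term in title.lower().split()]


def cross_search(term):
    """Send the same query to every known server and collate the hits."""
    collated = []
    servers_queried = 0
    for name, titles in SERVERS.items():
        servers_queried += 1  # every server does work for every search
        for title in search_server(titles, term):
            collated.append((name, title))
    return collated, servers_queried


hits, servers_queried = cross_search("history")
print(hits)
print("servers queried:", servers_queried)
```

Note that all three servers are contacted even though only two hold matching records, which is the bandwidth cost described above.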
The second method offers a way of discovering information held on other servers in addition to the one being explicitly queried. This is achieved via a 'referral'. A referral gives connection information for another server that may be able to answer the query. Referrals are generated from forward knowledge gathered previously. In WHOIS++ terminology, this forward knowledge is represented as a summary of the contents of a server and is known as a 'centroid'. Since a centroid is only a summary, a referral resulting from a centroid match should not be assumed to be a genuine hit. For example, a centroid holds no word proximity information, so it cannot tell whether the query terms occur close together in any one record.
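A minimal sketch of referral generation is given below, assuming the queried server holds one centroid per remote server. The Referral class, the host names, the centroid contents and the referrals_for function are all hypothetical; port 63 is the port registered for WHOIS++. The matching rule (every query term must appear in the centroid for the attribute) is also an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen so instances can be used as dictionary keys
class Referral:
    """Connection details for a server that may hold matching records."""
    host: str
    port: int


# Hypothetical forward knowledge held by the queried server: one centroid
# per remote server, mapping attribute name -> set of index terms.
CENTROIDS = {
    Referral("history.example.org", 63): {
        "Title": {"social", "history", "server", "military",
                  "database", "medical"},
    },
    Referral("maps.example.org", 63): {
        "Title": {"road", "atlas", "server"},
    },
}


def referrals_for(attribute, query_terms):
    """Return servers whose centroid contains every query term."""
    wanted = {term.lower() for term in query_terms}
    return [server for server, centroid in CENTROIDS.items()
            if wanted <= centroid.get(attribute, set())]


# A match only means that each term occurs somewhere in the remote 'Title'
# values; the remote server must still be queried to confirm a genuine hit.
print(referrals_for("Title", ["medical", "database"]))
```

In this sketch the query for 'medical database' produces a referral to the first server because both words appear in its centroid, even though the centroid alone cannot say whether any single record contains both.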
Each centroid is essentially a simple inverted index of the information in the database. For each of the types of record that have been used, and each of the attributes within those records, the centroid contains a list of all the index terms that have been found. To see how this works in practice, consider a database that contains the following SERVICE records:
Title: Social history server
Description: This server provides researchers in social history with resources that may be of interest.

Title: Military history database
Description: A database of pointers to military history resources.

Title: Medical history server
Description: A server that provides pointers to resources dealing with medical history
It can be seen that a number of words appear in the same attribute in multiple records. For example, in all three, the term 'history' appears in the value associated with the 'Title' attribute. The centroid of these records consists of the list of unique terms associated with each attribute. By unique terms we mean that we only record the first instance of an index term, even though it might occur repeatedly in the given attribute when the whole database is considered. Consequently, the centroid information associated with the 'Title' attribute would be:
Social
History
Server
Military
Database
Medical
Note that the original 'Title' fields in the records contained a total of nine words, whereas the centroid has only six. Removing redundant multiple instances of index terms typically makes the centroid for an attribute much smaller than the original data held in that attribute across all of the records in the database. This is a major difference when compared with the inverted index. Real databases are much larger than this example, so the chance that the same words will appear over and over again in multiple records is greater, and the centroid is likely to be significantly smaller relative to the database. We find that centroids for large data sets are quite large when viewed by themselves but relatively small when compared with the size of the original data from which they were derived.
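The construction described above can be sketched in a few lines. The record layout, the tokenize rule (lower-cased alphanumeric words) and the build_centroid function are assumptions made for illustration and are not taken from the WHOIS++ specification.

```python
import re

# In-memory copy of the three SERVICE records from the example.
RECORDS = [
    {"Title": "Social history server",
     "Description": "This server provides researchers in social history "
                    "with resources that may be of interest."},
    {"Title": "Military history database",
     "Description": "A database of pointers to military history resources."},
    {"Title": "Medical history server",
     "Description": "A server that provides pointers to resources dealing "
                    "with medical history"},
]


def tokenize(value):
    """Split an attribute value into lower-cased index terms (assumed rule)."""
    return re.findall(r"[a-z0-9]+", value.lower())


def build_centroid(records):
    """Return {attribute: [unique terms in first-seen order]}."""
    centroid = {}
    for record in records:
        for attribute, value in record.items():
            terms = centroid.setdefault(attribute, [])
            for term in tokenize(value):
                if term not in terms:  # keep only the first instance
                    terms.append(term)
    return centroid


centroid = build_centroid(RECORDS)
title_words = sum(len(tokenize(record["Title"])) for record in RECORDS)
print(centroid["Title"])
print(title_words, "words reduced to", len(centroid["Title"]))
```

Run against the three records, the sketch yields the six 'Title' terms in the order social, history, server, military, database, medical, from nine words in the original values.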
If a database makes use of more than one object type then a centroid will contain a list of object types, the attributes contained in each type and the unique words within each attribute of each type. For example, the 'Title' attribute of the 'SERVICE' object would be treated separately from the 'Title' attribute of the 'DOCUMENT' object. This increases centroid size slightly, as different object types are likely to share some common attributes and the unique words in these attributes will have to be repeated.
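One way to picture a centroid covering several object types is as a nested mapping from object type to attribute to unique terms. The sketch below assumes that shape; the two records, the "type" key and the build_typed_centroid function are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; the "type" key names the object type of each record.
RECORDS = [
    {"type": "SERVICE", "Title": "Medical history server"},
    {"type": "DOCUMENT", "Title": "A history of medical imaging"},
]


def build_typed_centroid(records):
    """Return {object type: {attribute: [unique terms]}}."""
    centroid = defaultdict(lambda: defaultdict(list))
    for record in records:
        for attribute, value in record.items():
            if attribute == "type":
                continue  # the type selects the sub-centroid, it is not indexed
            terms = centroid[record["type"]][attribute]
            for term in value.lower().split():
                if term not in terms:
                    terms.append(term)
    return centroid


centroid = build_typed_centroid(RECORDS)
print(centroid["SERVICE"]["Title"])
print(centroid["DOCUMENT"]["Title"])

# Terms that occur under 'Title' in both object types are stored twice.
shared = set(centroid["SERVICE"]["Title"]) & set(centroid["DOCUMENT"]["Title"])
print(sorted(shared))
```

Here 'history' and 'medical' appear under 'Title' for both object types, so each is held twice, which is the slight increase in size noted above.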
Although referrals and centroids have been presented here in the context of WHOIS++, there is no fundamental obstacle to using them with other protocols. For example, in the case of an open-ended protocol such as Z39.50, it would simply be necessary to develop a distributed indexing and searching 'profile' based on a common agreement on how best to implement these features in the Z39.50 context.
Martin Hamilton, Loughborough University