A CONFERENCE ORGANISED BY UKOLN IN ASSOCIATION WITH
THE BRITISH LIBRARY, CNI, CAUSE AND JISC
9th and 10th February 1996 at the Ramada Hotel, Heathrow, UK
This account was drafted for this report by The Marc Fresko Consultancy. It is based on notes taken during the presentation.
In this paper, some of the preliminary findings of the CNI working party on networked information discovery and retrieval (NIDR) are discussed. These include considerations of cost and usage, and of cataloguing and metadata. A number of significant gaps requiring further work are highlighted.
The thinking in this paper is shaped by research conducted for the CNI white paper on information discovery and retrieval. This work was initiated when the CNI was approached by Avra Michelson of the MITRE Corporation, who crystallised the need for research in this area. In late 1994 CNI formed a team, including the author, Craig Summerhill from CNI and Cecilia Preston; the work was co-sponsored by the MITRE Corporation. This team has held meetings and discussions, and currently is drafting its paper.
It is gratifying to note that some of the team's early predictions are already beginning to become reality.
A major question at the outset is how to impose some sort of structure on the discovery and retrieval "problem". Predicting how it is likely to evolve is also a fruitful source of debate. Because of the scope and size of the issues, the team consciously avoided current detailed or technical points; it tried instead to determine where work will be needed strategically in the future, on concerns such as protocols, metadata taxonomies, etc.
The first generation of networked information discovery and retrieval (NIDR) tools were predicated on the concept that "content is free". For example, tools such as Web Crawlers would search and index network servers systematically, assuming that they would find information which is freely available. However, this will change (in fact, it has already begun to change). The content of the Internet of the future will not all be free; it will contain a complex mix of proprietary and non-proprietary information. As the proportion of information which is charged for increases, so the impact on NIDR tools will increase too.
We are already starting to see the evolution of complex and information-rich organisational information spaces. These can be described as Intranets, that is, parts of the Internet which are internal to an organisation and segregated from the bulk of the Internet by a firewall. As this trend continues, organisations will store more and more of their internal data in their Intranets while simultaneously relying on other Internet resources for external data. The consequence is that owners of Intranets will demand consistent tools for Intranet and Internet use, which will attract major software vendors to the NIDR marketplace. We are already seeing some software houses trying to squeeze revenue from the Intranet phenomenon.
Discovery is conceptually straightforward - it consists of finding an appropriate network location, then identifying the required information. Similarly, retrieval is simple - it only requires a collection of bits to be moved to the desired location. However, a complication arises because retrieval is not equivalent to use. For example, in today's software environment, it is easy enough to retrieve a resource only to find that the "helper application" needed to use the resource is not available. We anticipate that this gap between retrieval and use will grow and will become more and more significant. Accordingly, the concept of "fitness for use" will become more important.
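The distinction between retrieval and use can be sketched in code. The following is an illustrative sketch only, not from the paper: the handler registry and function names are invented, and a real client would consult the platform's MIME or helper-application configuration.

```python
# Hypothetical sketch of a "fitness for use" check made before retrieval.
# HELPER_APPS is an invented stand-in for the client's real helper-
# application configuration.

HELPER_APPS = {
    "text/html": "browser",   # formats this client can actually render
    "image/gif": "viewer",
}

def fitness_for_use(mime_type, helpers=HELPER_APPS):
    """Return True if the client can *use* (not merely retrieve) the object."""
    return mime_type in helpers

# A resource may be perfectly retrievable yet unusable on arrival:
usable = fitness_for_use("text/html")                  # True
unusable = fitness_for_use("application/x-proprietary")  # False
```

The point of the sketch is that the check belongs before retrieval: moving the bits succeeds either way, but only one of the two objects is fit for use.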
Today, information needs in the NIDR environment are typically formulated in terms of intellectual content description. In the future environment, other factors will become important: access time, service levels, price, and formats. Future tools will need to take these more mechanical factors into account.
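One way such a future tool might weigh these mechanical factors is sketched below. This is a speculative illustration: the weighting scheme, field names and penalty value are all invented, not drawn from the paper.

```python
# Sketch (not from the paper): ranking otherwise-equivalent sources of the
# same intellectual content by "mechanical" factors - price, access time
# and format. The weights are illustrative only.

def score(candidate, format_usable):
    # Lower is better: cost plus expected delay, with a large penalty
    # for formats the client cannot use.
    penalty = 0 if format_usable(candidate["format"]) else 1000
    return candidate["price"] + candidate["access_time_s"] + penalty

candidates = [
    {"server": "A", "price": 5.0, "access_time_s": 2.0, "format": "text/html"},
    {"server": "B", "price": 1.0, "access_time_s": 90.0, "format": "application/x-odd"},
]

usable = lambda fmt: fmt == "text/html"
best = min(candidates, key=lambda c: score(c, usable))
```

Here the cheaper source loses because it is slow and arrives in an unusable format: content description alone would not have distinguished the two.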
From the outset, the team felt that the subject of discovery in a network context would be crucial. It remains very significant, but its importance is decreasing because of the growth of digital libraries. As a matter of definition, digital libraries are established by individuals or bodies who care enough about their contents to build, maintain and - in some cases - describe them. It follows that for many users, the task of network discovery will be reduced to finding the best digital library, then searching within it. This will make discovery much easier than it is at present.
Of course, network discovery technologies will be needed for use within the digital libraries, but the libraries will impose constraints and simplifications which will facilitate the process.
Digital libraries will not meet all users' needs. There will remain a core of users who will need to search the entire Internet routinely. In the main, these will be users who deal in "raw" information, for example in law enforcement, academic research, and corporate research. Generally, these will be well-informed and sophisticated users of network tools.
Most searching approaches and activities are heavily influenced by the nature of online cataloguing. However, online catalogues have several weaknesses as information systems, including a lack of memory of previous searches and a lack of personalisation features. Unfortunately, these weaknesses are being carried forward into NIDR systems. Clearly we need to move beyond the current model, towards a more active one which involves the "push" distribution of information by systems which learn and become personalised through use.
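The "active" model described above can be sketched as follows. All names are invented for illustration; no real online catalogue exposes such an interface.

```python
# Illustrative sketch of a catalogue front-end that remembers previous
# searches and uses that memory to personalise the ranking of new results -
# the opposite of the memoryless catalogue model criticised in the text.

from collections import Counter

class PersonalisedCatalogue:
    def __init__(self):
        self.interest = Counter()   # accumulated memory of past queries

    def search(self, query, records):
        self.interest.update(query.lower().split())
        # rank records by overlap with the user's accumulated interests
        def relevance(record):
            return sum(self.interest[w] for w in record.lower().split())
        return sorted(records, key=relevance, reverse=True)

catalogue = PersonalisedCatalogue()
catalogue.search("metadata standards", [])
results = catalogue.search(
    "metadata mapping",
    ["a guide to metadata", "birds of britain"],
)
```

After two queries about metadata, the record mentioning metadata ranks first even though the ranking term never appears in the second query's results explicitly; a "push" system would go one step further and volunteer such records unprompted.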
The cataloguing tradition is non-evaluative. This is superficially attractive, since it seems to permit cataloguing by unskilled personnel - something which does not work well in practice - but a purely descriptive approach has very real limitations which prevent the catalogue from answering the evaluative questions users would like to be able to ask, such as "what is the best reference?" or "which is more relevant?". Because of these limitations, NIDR suffers from a perception that "catalogues don't do this!"
Potentially, there are fortunes to be made by entrepreneurs who successfully depart from the conventional catalogue model to meet users' needs and desires more fully.
The architectural model used to conceptualise the retrieval component of NIDR today is still dominated by the view of the world imposed by the file transfer protocol (FTP), which is twenty years old. This model views retrieval as the process of moving a file - a collection of bits - from one place to another. It lacks the concept of content at an intellectual level. This concept recognises that content can be represented through a variety of digital representations - formats, resolutions, etc - which might be selected based on the capabilities of the client and server involved in the retrieval and the network linking the two, as well as the ultimate needs of the person who will use the intellectual content. Protocols more modern than FTP, such as Z39.50, allow much more negotiation about format as part of retrieval; indeed, retrieval becomes a process rather than a simple action.
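The contrast between FTP-style transfer and retrieval-as-negotiation can be sketched as below. This is a toy model loosely in the spirit of Z39.50 record-syntax negotiation; the data structures and function are invented, not the actual protocol encoding.

```python
# Toy model of retrieval as a negotiation process rather than a blind
# file transfer. AVAILABLE is an invented server-side table mapping one
# intellectual object to its several digital representations.

AVAILABLE = {
    "report-42": ["application/postscript", "text/plain"],
}

def negotiate(object_id, client_preferences):
    """Return the first client-preferred representation the server holds.

    FTP has no equivalent step: it ships a fixed file regardless of what
    the client can use. Here format is settled per retrieval.
    """
    held = AVAILABLE.get(object_id, [])
    for fmt in client_preferences:
        if fmt in held:
            return fmt
    return None   # negotiation failed; the client may retry with new terms

chosen = negotiate("report-42", ["text/html", "text/plain"])
```

The same intellectual content yields different bits for different clients, which is exactly the concept the FTP-era model lacks.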
The very simplistic view of retrieval today has also confused a number of issues related to metadata. There is a tendency to consider information related to a digital object (price, format, etc) as static metadata associated with the object, rather than as transitory, situational parameters that are established in the context of a specific retrieval of the object.
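The distinction can be made concrete with a small sketch. The field names, user classes and prices below are all invented for illustration.

```python
# Sketch separating static descriptive metadata (which travels with the
# object) from parameters that only come into existence in the context of
# one specific retrieval. All values here are illustrative.

STATIC_METADATA = {
    "title": "White paper on NIDR",      # true of the object everywhere
    "creator": "CNI working party",
}

def retrieval_context(user_class, network):
    """Price and format are settled per transaction, not stored on the object."""
    price = 0.0 if user_class == "member" else 10.0
    fmt = "text/plain" if network == "slow" else "application/postscript"
    return {"price": price, "format": fmt}

ctx = retrieval_context("guest", "slow")
```

Two users retrieving the same object may see different prices and formats, so recording those values as if they were properties of the object is a category error.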
Today, retrieval normally assumes acquisition of digital information, in the sense of physically copying the information to the user's machine. The legal framework for networked information use has largely shifted from the purchase of copies to licenses to use copies. Retrieval will need to be extended to allow also the negotiation and signing of licenses, not simply the transfer of funds to pay for an object. In the near future we are likely to see a much richer and more complex set of use options for information than simply making local copies.
The structures used for the collection of metadata will necessarily grow more complex. For the last few years, we have seen Web Crawlers trawling the network to derive descriptive metadata, following the model that was established by the archie system for indexing FTP archives. As alluded to earlier, this is not well suited to a network which contains valuable proprietary information; but, paradoxically, the owners of that information will have a vested interest in publicising the existence and nature of their information while still retaining control over it. We conclude that a new series of interface points will emerge where suppliers will interact with search engines. Quite possibly this will require new levels of abstraction. Users will not be satisfied solely with metadata provided by the information suppliers, which often is basically advertising; the more neutral information extracted from indexing and cataloguing processes will still be needed.
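The dual-source metadata this paragraph anticipates - supplier-provided descriptions alongside neutral, automatically extracted index data - might be combined as sketched here. The merge policy and field names are invented assumptions.

```python
# Sketch of one possible "interface point" between suppliers and search
# engines: supplier-provided metadata (which may amount to advertising)
# is retained, but neutral index-derived values take precedence, and the
# provenance of every field is recorded for the user.

def merge_metadata(supplier, extracted):
    record = {}
    for field, value in extracted.items():
        record[field] = {"value": value, "source": "index"}      # neutral
    for field, value in supplier.items():
        record.setdefault(field, {"value": value, "source": "supplier"})
    return record

merged = merge_metadata(
    supplier={"title": "The BEST report you will EVER read!", "price": "free"},
    extracted={"title": "Report on networked information discovery"},
)
```

Where the two sources conflict, the neutral indexed value wins, but supplier-only fields (such as price) survive with their provenance labelled, letting the user discount advertising.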
Metadata presents a complex set of ill-defined issues. The team spent considerable time looking at metadata issues; our conclusion was that to a great extent the view of information as metadata is highly contextual, depending on specific usage scenarios. Also, the linkage between metadata and the base object that metadata describes is elusive and becoming more blurred as technologies for automatic indexing, abstracting, translation, image recognition and similar processes improve.
With only limited research, we found no fewer than ten metadata mapping schemes. Mapping metadata between these schemes is imperfect at best; this represents a gap which should be filled.
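Why such mappings are "imperfect at best" can be shown with a toy example. The two schemes and their field names below are entirely hypothetical, not any of the ten schemes the team found.

```python
# Hypothetical mapping between two invented metadata schemes. Fields with
# no counterpart in the target scheme are simply dropped, and the loss is
# reported - the structural reason scheme-to-scheme mapping is lossy.

SCHEME_A_TO_B = {
    "author": "creator",   # clean one-to-one correspondences...
    "name": "title",
}                          # ...but scheme A's other fields have no target

def map_record(record, field_map=SCHEME_A_TO_B):
    mapped = {field_map[k]: v for k, v in record.items() if k in field_map}
    lost = sorted(k for k in record if k not in field_map)
    return mapped, lost

mapped, lost = map_record(
    {"author": "Smith", "name": "NIDR overview", "review_status": "refereed"}
)
```

Even in this friendly case a field is silently unexpressible in the target scheme; with ten schemes the pairwise losses compound, which is the gap the text says should be filled.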
One current trend is to try to define "core sets" of metadata, on the premise that agreeing on the bottom layer of a conceptual hierarchy of metadata sets will permit some form of interoperability between models. Unfortunately, this premise overlooks a critical weakness: it is impossible to rebuild a full metadata set starting from a core set. This weakness diminishes the value of the concept.
The compromise approach of lowest common denominator core sets of metadata also gives rise to another problem. While it is possible, through approaches such as the "Dublin Core" work, to develop reasonable core sets and gain consensus around them, they are not particularly interpretable or interoperable at the machine level. Typically, the data elements are not highly structured and are going to be used with free text searching and presented to people for decision making.
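The machine-level weakness of free-text core elements can be illustrated directly. The records and names below are invented examples, not Dublin Core data.

```python
# Two invented records describing the same person in a core-set "creator"
# element. Because the element is unstructured free text, a naive machine
# comparison fails even though a human sees the match at once.

record_1 = {"creator": "Smith, Janet"}
record_2 = {"creator": "Janet B. Smith"}

exact_match = record_1["creator"] == record_2["creator"]   # False

# A rough workaround - comparing bags of name tokens - recovers the match,
# but it is exactly the kind of fuzzy, human-oriented processing that free
# text search does well and machine interoperation cannot rely on.
tokens = lambda s: {t.strip(".,").lower() for t in s.split()}
rough_match = tokens(record_1["creator"]) >= {"smith", "janet"}
```

This is why the text says such elements end up in free-text searching and human decision making rather than in machine-level interoperation.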
There is currently much talk about "intelligent agents" and about sophisticated semantic interoperability. The current view of NIDR presumes the active, continual involvement of a human to control the process and to make complex decisions involving relatively unstructured information. A world of intelligent agents would have to rely on highly structured data and data interchange standards. In this sense, current NIDR technology is largely disconnected from the development of intelligent agents and focused on solving a different problem.
The current draft or final version of the white paper is available at http://www.cni.org/ and ftp://ftp.cni.org/
British Library R&D Report 6250
© The British Library Board 1996
© Joint Information Systems Committee of the Higher Education Funding Bodies 1996
The opinions expressed in this report are those of the contributors and not necessarily those of the sponsoring organisations.
The primary publication medium for this report is via the Internet at URL
It may also be purchased as photocopies or microfiche from the British Thesis Service, British Library Document Supply Centre, Boston Spa, Wetherby, West Yorkshire, LS23 7BQ.
This report of the conference was prepared by The Marc Fresko Consultancy Telephone +44 181 645 0080 E-mail firstname.lastname@example.org
Converted to HTML by Isobel Stark of UKOLN, July 1996