Position paper for NSF/ERCIM meeting

Rachel Heery, The UK Office for Library and Information Networking (UKOLN), University of Bath
E-mail: <r.heery@ukoln.ac.uk>
16 December 1997


These notes introduce issues and directions for metadata research. These are ideas emerging from the work with which UKOLN has been involved over the last few years, in particular various collaborative EC, eLib and British Library funded projects. The intention is to highlight areas for consideration in the process of creating and processing metadata while acknowledging that metadata cannot be viewed in isolation from resource discovery and the wider issues of interoperability between services, distributed searching and the organisational frameworks within which users and services act.

We will briefly consider some of the projects in which UKOLN has participated and try to draw out the plans for metadata research and suggest some issues meriting further consideration.

ROADS

UKOLN has participated in the Resource Organisation and Discovery in Subject-based Services (ROADS) project, which provides software for subject-based search services such as SOSIG and OMNI, a number of which have been funded by eLib and by the EC DESIRE project. These services provide for the retrieval of internet resources selected on the basis of 'high quality' criteria formulated as appropriate by the individual services. The software provides for the creation, searching and browsing of metadata. The metadata format is based on IAFA templates with some additional fields such as 'classification scheme'. Current practice amongst the services is to create the templates manually, although some facility is available for pre-populating fields with content using harvesting techniques.

ROADS software is designed to be queried using the Whois++ protocol and incorporates a Whois++ server. This option was chosen (rather than Z39.50) as Whois++ is perceived to be relatively lightweight to implement, and is associated with building 'centroids' (summary indexes). A mesh of participating servers can be constructed, using the Common Indexing Protocol to share 'forward knowledge' in the form of centroids. This allows queries to be routed between servers.
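The idea of routing queries by centroid can be illustrated with a minimal Python sketch. It is not the ROADS or Whois++ implementation: it assumes a centroid is simply the set of unique words that occur under each attribute on a server, and the server names, records, and the functions build_centroid and route_query are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_centroid(records):
    """Summarise a server's records as {attribute: set of unique lowercase words}."""
    centroid = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            centroid[attribute].update(value.lower().split())
    return dict(centroid)


def route_query(centroids, attribute, term):
    """Return the names of servers whose centroid holds the term for the attribute."""
    term = term.lower()
    return sorted(
        name
        for name, centroid in centroids.items()
        if term in centroid.get(attribute, set())
    )


# Hypothetical servers and records, for illustration only.
servers = {
    "social-science": [
        {"title": "Social Policy Research Gateway", "keywords": "welfare policy"},
    ],
    "medicine": [
        {"title": "Clinical Trials Register", "keywords": "medicine trials"},
    ],
}

# Each server would share its centroid with its neighbours as 'forward knowledge'.
centroids = {name: build_centroid(records) for name, records in servers.items()}

# Only servers that could possibly answer the query are contacted.
print(route_query(centroids, "keywords", "policy"))
```

A centroid records only that a word occurs somewhere on a server, not in which record, so it stays small enough to exchange while still letting a query skip servers that cannot match.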

The ROADS project will run until late 1998 and there are plans to investigate a number of areas relevant to metadata during this period:

A number of considerations could arise from this work. Of particular interest might be the effectiveness of retrieval based on metadata created by a range of services (even where the metadata is in the same format, local practice may differ); the effect of searching across collections that vary markedly in size (from 3,000 to 100,000 records); and the loss in metadata quality that may result from the mappings and conversions needed to achieve cross-searching. Evaluation and feasibility studies would be useful in a number of areas:

Much of the recent innovation and experimentation in the UK regarding metadata has been done in the context of establishing new resource discovery services (chiefly funded as part of the eLib programme). Inevitably these services are small scale in comparison to legacy systems (i.e. the existing installed base). As the eLib programme draws to a close, issues of scalability and managing integration with the installed base will gain in priority.

DESIRE

WP3 of the DESIRE project has implemented two approaches to the creation of metadata for resource discovery: the manual creation of metadata for selective services (based on the ROADS software) and robot-generated metadata based on software developed for the Nordic Web Index. The resource discovery aspect of the project is due to run for two more years (as DESIRE II) and plans to include work in a number of areas relevant to the use of metadata (note that these plans are still subject to negotiation):

The DESIRE project illustrates the tension between the benefits of implementing emerging standards early and those of waiting for richer functionality to be defined. The difficulties of incorporating rapidly changing technology and seeking forward compatibility once a production service has been established are obvious. Nevertheless, the intention is to do this, particularly as regards the Dublin Core activity and the current RDF work. Significant considerations include:

BIBLINK

The EC project BIBLINK aims to improve the flow of metadata from publishers of electronic resources to national bibliographic agencies. A demonstrator service will run from July 1998 with the participation of several European national libraries along with a number of publishers of electronic resources. Publishers will create metadata records based on the Dublin Core or supply an SGML document header. The BIBLINK system will provide a workspace in which these metadata records can be shared between publishers, national bibliographic agencies and other interested parties. After conversion to UNIMARC the records can be enhanced by national bibliographic agencies and returned to the publishers in either Dublin Core or a national MARC format.

BIBLINK will create MARC records at 'CIP level' (i.e. brief records equivalent to the Cataloguing in Publication level in use in some national libraries). Further areas for investigation might be:

Arts and Humanities Data Service (AHDS)

The AHDS is one of the first significant attempts to provide genuine cross-domain access. There are a number of service providers in different subject and curatorial domains: performing arts, visual arts, history, textual studies, archaeology. These will be providing access to a wide variety of resources including archaeological survey data, digital images, electronic texts, film, video, sound recordings and geospatial information. AHDS intends to offer integrated access to these diverse collections for the 'inter-disciplinary' user as well as offering specialist search facilities in each area. Substantial work has already been carried out identifying the resource discovery requirements from the subject and media perspectives of each service provider. Dublin Core will provide a core element set as a basis for semantic interoperability.

This service is beginning to address the need for the user to navigate between collection level description (a subject of separate study) and generic metadata (such as DC), and then, if relevant material is found, to 'drill down' to the richer domain specific metadata. The Warwick Framework may be instrumental in relating generic and specific metadata. The challenge is to provide a meaningful user interface in particular taking account of the widely recognised requirement to rank, de-duplicate and format result sets. Once again the pros and cons of syntax conversion and resulting degradation in the quality of data need to be considered.
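The requirement to rank, de-duplicate and format result sets can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of the AHDS design: it assumes each service returns hits carrying an identifier, a title and a relevance score, that scores from different services are directly comparable (in practice they often are not), and that two hits are duplicates when their identifiers match after simple normalisation. The service names, records, and the functions merge_results and format_hit are hypothetical.

```python
def merge_results(result_sets):
    """Merge hits from several services, keeping the best-scoring copy of each."""
    best = {}
    for service, hits in result_sets.items():
        for hit in hits:
            # Assumed duplicate test: identifiers equal after normalisation.
            key = hit["identifier"].strip().lower().rstrip("/")
            candidate = {**hit, "service": service}
            if key not in best or candidate["score"] > best[key]["score"]:
                best[key] = candidate
    # Rank by descending score; ties are broken alphabetically by title.
    return sorted(best.values(), key=lambda h: (-h["score"], h["title"]))


def format_hit(hit):
    """Render one hit as a display line that discloses the originating service."""
    return f'{hit["score"]:.2f}  {hit["title"]}  [{hit["service"]}]'


# Hypothetical result sets from two service providers.
result_sets = {
    "history": [
        {"identifier": "http://example.org/a/", "title": "Census data 1881", "score": 0.9},
        {"identifier": "http://example.org/b", "title": "Parish maps", "score": 0.4},
    ],
    "archaeology": [
        {"identifier": "HTTP://example.org/a", "title": "Census data 1881", "score": 0.7},
        {"identifier": "http://example.org/c", "title": "Excavation survey", "score": 0.8},
    ],
}

merged = merge_results(result_sets)
for hit in merged:
    print(format_hit(hit))
```

Matching on identifiers is the simplest possible duplicate test. Records describing the same resource in different formats would need a comparison of converted fields, which is where any degradation introduced by syntax conversion would begin to affect the merged result.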

Possible associated areas for research:

Digital Preservation

The CURL Exemplar for Digital ARchiveS (CEDARS) project is funded by eLib and CURL and is a three year project scheduled to start in March 1998. CEDARS is intended to identify strategies and techniques for the long-term preservation of the intellectual content of digital materials. UKOLN will have some input to the part of the project concerning metadata for preservation. Some of the questions raised by this part of the project will be:

Conclusion

UKOLN's involvement with digital library developments, influenced by its funders, concentrates on the provision of innovative demonstrator and prototype services as a means of serving the community and raising awareness. Much could be gained from one or two large scale projects which involved the creation and management of a variety of metadata covering diverse domains including existing library 'legacy systems'. Whether or not this happens there are a variety of opportunities for smaller scale developments, some of which we have outlined above, which would illustrate what is possible.

The integration of services based on disparate metadata raises the issue of how aware users are of the discovery process. Some disclosure to the user of the processes involved, and in particular of the metadata searched, would raise awareness of the value and limitations of the results.


Maintained by Michael Day of UKOLN, The University of Bath.
Last updated 18-Dec-1997.