University of Bath
Bath BA2 7AY, United Kingdom
This is a draft version of a paper on UKOLN metadata projects presented at the international conference: Electronic Resources: Definition, Selection and Cataloguing, Biblioteca nazionale centrale di Roma, Rome, Italy, 27th November 2001.
A revised version has been published in both Italian and English in: Mauro Guerrini, (ed.), Le risorse elettroniche: Definizione, selezione e catalogazione = Electronic resources: definition, selection and cataloguing, Milan: Editrice Bibliografica, 2002, pp. 193-206 (Italian), pp. 561-573 (English). ISBN 88-7075-586-X.
Abstract: This paper provides an introduction to two of the metadata-related projects in which UKOLN has been a partner. It first describes the development of services known as quality controlled subject gateways and looks in more detail at the Resource Discovery Network and the EU Renardus project. It then provides an outline of recent preservation metadata initiatives and describes the way the OAIS model has been used in the Cedars project.
While the first use of 'metadata' originated in contexts related to digital information (chiefly with regard to databases), the general understanding of the term has since expanded to include any kind of standardised descriptive information about both digital and non-digital resources. Thus, for example, the content of library catalogues and the descriptive data that lies behind abstracting and indexing services, archival finding aids and museum documentation, might all be understood to be metadata. The advantages of this are twofold. Firstly, it acts as a spur for librarians, archivists and museum documentation specialists to co-operate across professional boundaries. Secondly, it enables the cultural heritage professions to communicate more effectively with other domains that also have an interest in metadata: e.g., software developers, publishers, the recording industry, television companies, the producers of digital educational content and those concerned with geographical and satellite-based information. Metadata is currently, therefore, an important field of research and development .
Over the past decade, UKOLN (formerly the UK Office for Library and Information Networking) at the University of Bath has been involved in a number of metadata-based projects and initiatives. There are accounts of some former projects in overviews by Dempsey  and Day . This paper will describe progress in some more recent projects and initiatives in which UKOLN and its partners have been involved. It will focus in particular on two strands of metadata research and development. The first of these will be the continuing development of 'subject gateway' services and will focus on the progress of the European Renardus project and the UK's Resource Discovery Network (RDN). The second theme will be the ongoing development of metadata schemas to support digital preservation. The main project in this area in which UKOLN has had some involvement is the Cedars (CURL Exemplars in Digital Archives) project, but some other recent initiatives will also be described.
With regard to the title of this paper, it is important to stress that these are not UKOLN metadata projects, but metadata-related projects in which UKOLN is involved. Almost all of the metadata projects in which UKOLN is involved have multiple partners and most project outcomes are produced collaboratively. The projects described here are no exception. So while the main focus of the paper will be on work undertaken by staff at UKOLN, it is important to recognise the vital roles of our partner organisations and the project leaders.
The Internet, and the World Wide Web in particular, is the dissemination medium of choice for a wide (and growing) range of information. One consequence of this has been the development of new types of information retrieval tools to support Internet resource discovery. Examples of these tools are the popular robot-based Web index services Google and AltaVista. These are based on harvesting software that periodically crawls through Web pages, following links and indexing all (or some) of the text associated with them. An alternative, but complementary, approach has been developed by Web directory services like Yahoo!, which provide links to Internet resources that are presented for users to browse in hierarchical lists. Selected resources are described by humans and assigned relevant subject terms, the user interface presenting some basic metadata in the form of a title, a brief description and a URL. A similar kind of service - based on the selection and cataloguing efforts of volunteers - is provided by the Open Directory Project.
Subject gateways are services that give access to selected third-party Web resources, usually with a specific focus on high-quality information suitable for use in research or higher education. On a superficial level, subject gateways share some of the same characteristics as Web directory services. However, there tends to be a much greater emphasis on the selection of resources according to some pre-defined quality criteria combined with the production of rich descriptive metadata which can enhance the potential of both searching and browsing facilities. This can be seen as analogous to the value-added selection and cataloguing services that libraries and other information professionals have traditionally provided for print formats . Traugott Koch notes that the most sophisticated gateways would have policies for collection development and management, would create rich metadata complying with particular standards, and would use subject vocabularies to create a browse structure and to enhance searching . He refers to these services as 'quality-controlled subject gateways.' Examples of this kind of service would include the Finnish Virtual Library  and the gateways that make up the UK's Resource Discovery Network (RDN).
UKOLN has previously been involved in a number of collaborative projects that have concerned the development and support of subject gateways. These included the ROADS (Resource Organisation And Discovery in Subject-based services) initiative funded by the Joint Information Systems Committee (JISC) and the two phases of the EU-funded DESIRE (Development of a European Service for Information on Research and Education) project . ROADS developed software tools which would help support the establishment and maintenance of subject gateways, but also investigated associated issues like cross-searching multiple gateways, interoperability and query routing . DESIRE looked at some of the same issues in a wider European context and in its second phase produced an Information Gateways Handbook to help support the creation and management of gateway services . UKOLN's current work on subject gateways is undertaken as part of its role in the RDN and as partners in the Renardus and IMesh Toolkit projects.
The broad aim of the Resource Discovery Network (RDN) is to give the UK learning and research communities access to high-quality Internet resources across the whole range of subjects required at further education and higher education levels . It is funded by the JISC, while some of its constituent services have additional support from the Economic and Social Research Council (ESRC) and the Arts and Humanities Research Board (AHRB). It is an attempt to build on the experiences of the subject gateways that previously had been funded by JISC as part of the Electronic Libraries Programme (eLib) and the ROADS project .
Organisationally, the RDN consists of a central organisation, the Resource Discovery Network Centre (RDNC) and a number of independent service providers called 'hubs'. The RDNC is run jointly by staff based at King's College London and UKOLN. The Centre is responsible for setting service standards, creating collection development policies and exploring strategic partnerships. Hubs provide one or more subject gateway services for their target user community, but can also offer other services. Existing hub services cover the health and life sciences (BIOME), engineering, computing and maths (EEVL), the humanities (Humbul Humanities Hub), the physical sciences (PSIgate), and social sciences, business and law (SOSIG). Three new hubs to cover hospitality, leisure, sport and tourism (Altis), geography and the environment, and the arts and creative industries are currently under development.
In order to help users search for resources described in all constituent RDN gateways, an all-RDN cross-search service known as ResourceFinder has been developed. Since its launch, this feature has been based on a variety of different technologies. Early versions of the feature were based on cross-searching using either the WHOIS++ protocol (as implemented by the ROADS subject gateway toolkit) or Z39.50. However, Andy Powell noted problems with these approaches related to response times and the difficulty of developing flexible browse interfaces . The RDN, therefore, began to investigate the possibility of basing ResourceFinder and other services on a record-sharing model rather than through cross-searching. The tool chosen to test this new model was an implementation of the protocol for metadata harvesting developed by the Open Archives Initiative (OAI) . Each RDN service converts records and moves them periodically to the RDN's OAI repository, where they are stored as unqualified Dublin Core in XML. Searches of ResourceFinder currently return relatively simple result sets consisting of title, description, URL and an indication of which RDN service the record was taken from. It therefore does not at present use the full richness of the metadata that exists within the RDN services, although it is hoped that additional services based on OAI tools will be developed in the future.
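The record-sharing model can be illustrated with a minimal sketch of what happens to one harvested record. It assumes a record serialised in the oai_dc format using the namespace URIs of the later OAI-PMH 2.0 specification, which may differ from those in use when the RDN repository was first built. The sample record, its identifier, and the use of `setSpec` to indicate the contributing service are all invented for illustration and are not taken from the RDN implementation.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as published for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical harvested record: identifier, set name and content are invented.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:sosig-0001</identifier>
    <setSpec>sosig</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example Social Science Resource</dc:title>
      <dc:description>An invented description used for illustration.</dc:description>
      <dc:identifier>http://www.example.org/resource</dc:identifier>
      <dc:subject>Sociology</dc:subject>
    </oai_dc:dc>
  </metadata>
</record>
"""


def summarise(record_xml):
    """Reduce one harvested record to a simple four-field search result."""
    record = ET.fromstring(record_xml)
    dc = record.find("oai:metadata/oai_dc:dc", NS)

    def first(tag):
        # Return the text of the first matching DC element, or None if absent.
        element = dc.find(f"dc:{tag}", NS)
        return element.text.strip() if element is not None and element.text else None

    return {
        "title": first("title"),
        "description": first("description"),
        "url": first("identifier"),
        # Assumption: the contributing service is recorded as the OAI set name.
        "source": record.findtext("oai:header/oai:setSpec", default=None, namespaces=NS),
    }


result = summarise(SAMPLE_RECORD)
```

Note that the subject term in the sample record is discarded by `summarise`, which mirrors the point that a simple result display does not use the full richness of the underlying metadata.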
Renardus is a project funded by the European Commission as part of its Information Society Technologies (IST) programme . The twelve partners in Renardus include national libraries, research centres and subject gateway services from Denmark, Finland, Germany, the Netherlands, Sweden and the UK, co-ordinated by the Koninklijke Bibliotheek (National Library of the Netherlands). The project aims to develop a Web-based service to enable searching and browsing across a range of distributed European-based subject gateways .
The overall aim of Renardus is to establish a collaborative framework for European subject gateways that will benefit both users in terms of enhanced services, and the gateways themselves in terms of shared solutions. In order to achieve this aim, Renardus is firstly providing a pilot service for the European academic and research communities, brokering access to those gateways that currently participate in the project. Secondly, the project is exploring ways to establish the organisational basis for co-operative efforts like metadata sharing, joint technical solutions and agreement on standardisation. It is hoped that this exploration will feed back valuable experience to the individual participating gateways and suggest ways in which their services can be enhanced.
The services participating at the project stage in Renardus are:
A team based at the Technical Knowledge Centre and Library of Denmark (DTV) and NetLab (Lund University Libraries) developed the pilot Renardus broker. It is a completely distributed system using the Z39.50 protocol to search across the content of all participating gateway services. Each participant has to set up a Z39.50 server with data normalised with reference to the Renardus data model.
Work on developing a common data model (metadata format) that would be used by the Renardus broker was led by the Renardus project team based at the Niedersächsische Staats- und Universitätsbibliothek Göttingen. The team first undertook a detailed analysis of all metadata formats used by participating services and then proposed the minimum set of metadata elements which would need to be supported. These were, where possible, based on the Dublin Core (DC) Metadata Element Set. The minimum set for describing gateway content contained the DC elements Title, Creator, Description, Subject, Identifier, Language and Type, and the non-DC element Country. For administrative purposes, the data model also defined additional elements for identifying the location of the original metadata for a resource (Full Record URL) and a gateway identifier (SBIG ID). Some of these elements can be qualified using refinements defined by DC (e.g., Title.Alternative) or by the use of schemes.
The Renardus pilot broker provides both simple and advanced searching facilities. The advanced search enables the user to define which particular metadata fields they want to search. It also allows users to filter results by document type, language or country, and to change the display defaults.
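A minimal sketch of the normalisation step and of the advanced-search filters follows. The element names come from the minimum set listed in the text. Everything else is an assumption made for illustration: the Python field names, the local gateway field names in `FIELD_MAP`, the choice of which elements are repeatable, and the functions `normalise` and `filter_records`. None of it reflects the actual broker code, which works over Z39.50.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RenardusRecord:
    """Minimum element set named in the text (field names are invented)."""
    title: str
    identifier: str                      # DC Identifier, e.g. the resource URL
    creator: List[str] = field(default_factory=list)
    description: str = ""
    subject: List[str] = field(default_factory=list)
    language: str = ""
    type: str = ""
    country: str = ""                    # non-DC element
    full_record_url: str = ""            # location of the original metadata
    sbig_id: str = ""                    # gateway identifier


# Assumption: these two elements are treated as repeatable.
REPEATABLE = ("creator", "subject")


def normalise(local: Dict[str, object], field_map: Dict[str, str],
              sbig_id: str) -> RenardusRecord:
    """Map a gateway's local field names onto the common data model."""
    values = {}
    for local_name, value in local.items():
        target = field_map.get(local_name)
        if target is None:
            continue                     # local field outside the minimum set
        values[target] = value
    for name in REPEATABLE:
        if isinstance(values.get(name), str):
            values[name] = [values[name]]
    return RenardusRecord(sbig_id=sbig_id, **values)


def filter_records(records: List[RenardusRecord],
                   language: Optional[str] = None,
                   country: Optional[str] = None,
                   doc_type: Optional[str] = None) -> List[RenardusRecord]:
    """Keep records matching every filter that has been set."""
    def keep(record: RenardusRecord) -> bool:
        return ((language is None or record.language == language)
                and (country is None or record.country == country)
                and (doc_type is None or record.type == doc_type))
    return [record for record in records if keep(record)]


# Hypothetical local record and field mapping for one gateway.
FIELD_MAP = {"Name": "title", "URL": "identifier", "Keywords": "subject",
             "Lang": "language", "Land": "country"}
local_record = {"Name": "Example Physics Archive",
                "URL": "http://www.example.org/physics",
                "Keywords": "Physics", "Lang": "en", "Land": "DE",
                "InternalNote": "checked by staff"}

record = normalise(local_record, FIELD_MAP, sbig_id="example-gateway")
english_only = filter_records([record], language="en")
```

Local fields with no counterpart in the common model (here `InternalNote`) are simply dropped, which is one reason the data model also carries a link back to the full original record.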
As well as the cross-search feature, the Renardus partners also wanted to offer some kind of subject browsing across all of the participating gateways. A working group of the project was set up to investigate this. In order to achieve some level of cross-browsing, a classification scheme was chosen to act as a central 'switching-language' within the Renardus broker. The scheme that was chosen was the Dewey Decimal Classification (DDC), and a research licence was negotiated with its maintainers, OCLC Forest Press. Gateways participating in the Renardus system were expected to map DDC terms to the subject terms used in their own browse hierarchies. In addition, a software tool developed as part of the German CARMEN (Content Analysis, Retrieval and MetaData: Effective Networking) project was adapted to facilitate the mapping workflow.
Mappings were made from the DDC to the range of subject browse structures produced by participating gateways. The Renardus partners use a wide range of different types of subject classification, including general classification schemes (e.g., the Universal Decimal Classification, Nederlandse Basisclassificatie, etc.), subject specific schemes (e.g., Mathematics Subject Classification, Ei, etc.) and some that have been produced specifically for the gateway. Also, some gateways have considerably modified the schemes that they use. The precise number of mappings that needed to be produced depended upon the complexity of the browse structures that had been implemented by gateways. It was not expected that all terms would have direct equivalents in the DDC. For this reason, a set of mapping relevance levels was defined. These indicate, for example, where there are narrower or broader equivalents or major and minor overlaps.
The cross-browse structures that could be created using this system are relatively limited. It was not possible, for example, to link directly to resource descriptions from multiple gateways in a single browse hierarchy without a greater level of centralisation than that offered by the Renardus pilot. Instead, the Renardus browse system links directly into the subject hierarchies of individual gateways. For example, a user looking for information on violins or violas would first browse through the DDC-based structure on the Renardus system until they find a relevant match (e.g., Stringed Instruments) with its relevance level. If a part of an individual gateway's browse structure has been mapped to this DDC term, the gateway's name and logo become visible and the local subject term becomes a hyperlink to the relevant part of the local browse structure. Once the user follows this link, they leave the Renardus system and join the browse interface of the local gateway.
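The switching-language idea can be sketched as a lookup table from DDC classes to local browse nodes, each carrying a relevance level. The relevance labels below paraphrase the kinds of relationship mentioned in the text and are not the project's own vocabulary. The gateways, URLs and class numbers are invented for illustration, and ordering the links by relevance is an assumption rather than documented broker behaviour.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Relevance(Enum):
    # Illustrative labels, listed from closest to loosest match.
    EQUIVALENT = "fully equivalent"
    NARROWER = "narrower equivalent"
    BROADER = "broader equivalent"
    MAJOR_OVERLAP = "major overlap"
    MINOR_OVERLAP = "minor overlap"


@dataclass(frozen=True)
class Mapping:
    ddc_class: str       # class in the central switching language
    gateway: str         # name of the participating gateway
    local_term: str      # term in that gateway's own browse structure
    local_url: str       # entry point into the local browse hierarchy
    relevance: Relevance


# Hypothetical mappings: gateways, URLs and class numbers are made up.
MAPPINGS = [
    Mapping("787", "Gateway A", "Violin and viola",
            "http://a.example.org/browse/violin", Relevance.NARROWER),
    Mapping("787", "Gateway B", "Music",
            "http://b.example.org/browse/music", Relevance.BROADER),
    Mapping("787", "Gateway C", "String instruments",
            "http://c.example.org/browse/strings", Relevance.EQUIVALENT),
    Mapping("510", "Gateway B", "Mathematics",
            "http://b.example.org/browse/maths", Relevance.EQUIVALENT),
]


def browse_links(ddc_class: str, mappings: List[Mapping]) -> List[Mapping]:
    """Return the local browse nodes mapped to one DDC class, closest first."""
    order = list(Relevance)
    hits = [m for m in mappings if m.ddc_class == ddc_class]
    return sorted(hits, key=lambda m: order.index(m.relevance))


for link in browse_links("787", MAPPINGS):
    print(f"{link.gateway}: {link.local_term} ({link.relevance.value}) -> {link.local_url}")
```

The function returns links into each gateway's own hierarchy rather than resource descriptions, matching the limitation described above: the user leaves the central system as soon as a link is followed.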
The classification mapping work remains experimental but is an interesting way of joining the browse structures of distributed services. More details are available in a conference paper produced by members of the working group . The pilot Renardus broker was released for evaluation in the Summer of 2001.
Digital preservation has been defined by Margaret Hedstrom as "the planning, resource allocation, and application of preservation methods and technologies necessary to ensure that digital information of continuing value remains accessible and usable". It is important to remember that preservation is normally linked with maintaining access, not just long term storage. The reasons why preserving digital information is difficult are technological, related to things like relatively short media lifetimes, obsolete hardware and software, and defunct Web sites. Proposed solutions are partly technical and partly organisational, as Hedstrom's definition suggests. Various digital preservation strategies have been proposed, the most widely discussed being based on migration, emulation or technology preservation. Alternatives might include the relatively expensive data recovery programmes that are sometimes known as 'digital archaeology'. Regardless of which particular strategy is adopted, however, long-term preservation will depend upon the generation and maintenance of metadata that describes the digital information being preserved and enables its interpretation. The rest of this paper will attempt to review some recent initiatives that relate to preservation metadata for digital objects and in particular the metadata specification developed by the Cedars project. Other relevant initiatives have been described in more detail in a conference paper by Day.
In technical terms, the successful long-term preservation of digital information will be dependent upon organisations identifying and implementing suitable preservation strategies . If one ignores strategies that involve converting digital information into non-digital forms (e.g. printouts or microforms), at the moment there are three main strategies: data migration, software emulation and technology preservation . None of these options will be a perfect solution for all digital information types and it is assumed that more than one strategy may have to be adopted.
The technology preservation approach involves the preservation of a digital information object together with all of the software and hardware needed to interpret it. This may have an important short-term role for the recovery of data from obsolete storage media and platforms, but is unlikely to become a viable long-term strategy. Feeney points out that collection managers who relied only upon this approach would soon end up with "a museum of ageing and incompatible computer hardware".
Emulation strategies are based on the premise that the best way to preserve the functionality and 'look-and-feel' of digital information objects is to preserve their original software and run this on emulators which can mimic the behaviour of obsolete hardware and operating systems. Emulation strategies involve encapsulating a data object together with the application software used to create or interpret it and a description of the required hardware environment - i.e., a specification for an emulator. Jeff Rothenberg says that the emulation approach requires "the development of an annotation scheme that can save ... explanations [of how to open an encapsulation] in a form that will remain human-readable, along with metadata which provide the historical, evidential and administrative context for preserving digital documents".
A more established preservation strategy is the periodic migration of digital information from one generation of computer technology to a subsequent one. It is currently the most tried-and-tested approach, and is often combined with some standardisation of formats on 'ingestion' into an archive in order to simplify future migration processes. However, as Seamus Ross points out, data migration inevitably leads to some losses in functionality, accuracy, integrity and usability. For this reason, migration strategies will be dependent upon the creation and maintenance of metadata that records the successive migration processes and helps preserve the authenticity of the migrated information.
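The kind of metadata that a migration strategy depends on can be sketched as a history of migration events attached to each object. This is an illustration only: the class and field names are invented, the use of SHA-256 checksums is an assumption, and the character-encoding conversion stands in for whatever format conversion a real archive would perform.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class MigrationEvent:
    """One entry in an object's migration history (field names are invented)."""
    source_format: str
    target_format: str
    tool: str
    source_sha256: str      # checksum of the bits before migration
    target_sha256: str      # checksum of the bits after migration
    performed_at: str       # ISO 8601 timestamp in UTC
    notes: str = ""         # e.g. known losses in functionality


@dataclass
class PreservedObject:
    data: bytes
    format: str
    history: List[MigrationEvent] = field(default_factory=list)


def migrate(obj: PreservedObject, convert: Callable[[bytes], bytes],
            target_format: str, tool: str, notes: str = "") -> PreservedObject:
    """Convert the data and append a record of the migration to its history."""
    new_data = convert(obj.data)
    event = MigrationEvent(
        source_format=obj.format,
        target_format=target_format,
        tool=tool,
        source_sha256=hashlib.sha256(obj.data).hexdigest(),
        target_sha256=hashlib.sha256(new_data).hexdigest(),
        performed_at=datetime.now(timezone.utc).isoformat(),
        notes=notes,
    )
    # The original object is left untouched; a new object carries the history.
    return PreservedObject(new_data, target_format, obj.history + [event])


# Toy migration: re-encode Latin-1 text as UTF-8.
original = PreservedObject(data=b"caf\xe9", format="text/plain; charset=iso-8859-1")
migrated = migrate(
    original,
    convert=lambda raw: raw.decode("latin-1").encode("utf-8"),
    target_format="text/plain; charset=utf-8",
    tool="hypothetical-reencoder 1.0",
    notes="Character encoding changed; no content loss expected.",
)
```

Recording checksums on both sides of each step lets a later reader confirm which bit stream each event refers to, while the free-text note is where losses of the kind Ross describes would be documented.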
We can already see that metadata is an important component of any digital preservation policy. Essentially, preservation metadata is all of the various types of data that allow the re-creation and interpretation of the structure and content of digital data after the bits themselves have been preserved. Defined in this way, therefore, it is clear that preservation metadata needs to support a number of related, but distinct, functions. Clifford Lynch, for example, states that within a digital repository, "... metadata accompanies and makes reference to each digital object and provides associated descriptive, structural, administrative, rights management, and other kinds of information". The wide range of functions that preservation metadata aims to fulfil means that the production of metadata standards is not simple and that most of the currently published schemas are relatively complicated.
UKOLN participated in the development of a proposed preservation metadata specification as part of the Cedars project. Cedars was a Consortium of University Research Libraries (CURL) project funded by the JISC between 1998 and 2002 to investigate digital preservation issues and make recommendations on good practice. A major strand in Cedars concerned metadata; the project produced a review of existing preservation metadata initiatives and, in March 2000, a draft outline metadata specification. The Cedars specification had two main aims. Firstly, that it could be used within demonstrator services being developed elsewhere in the project. Secondly, that it would make a contribution to the international standardisation of preservation metadata. Work on developing the Cedars metadata specification started in early 1999. An initial draft (for expert comment) was published in January 1999, and was broadly organised according to the information model provided in the influential Reference Model for an Open Archival Information System (OAIS) published by the Consultative Committee for Space Data Systems (CCSDS).
The development of the OAIS model resulted from a request from the International Organization for Standardization (ISO) that the CCSDS should co-ordinate the development of standards to support the long-term preservation of digital information obtained from observations of the terrestrial and space environments. The latest version of the standard (CCSDS 650.0-R-2) is currently a CCSDS Draft Recommendation, and has been recently accepted as ISO DIS (Draft International Standard) 14721 .
The OAIS model aims to provide a common framework that can be used to help understand archival challenges, and especially those that relate to digital information. This is the model's real value: providing a high-level common language that can facilitate discussion across the different communities interested in digital preservation. The document defines a high-level reference model for an OAIS, which is defined as an organisation of people and systems that have "accepted the responsibility to preserve information and make it available for a Designated Community" . The OAIS model has a much wider scope than metadata. It defines both a functional model and an information model. The functional model outlines the range of functions which would need to be undertaken by a repository, and defines in more detail those functions described within the OAIS specification as access, administration, archival storage, data management, ingest and preservation planning. The information model defines the broad types of information (or metadata) which would be required in order to preserve and access the information stored in a repository. However, it is important to realise that the OAIS standard is a reference model, not a detailed specification for any implementation based upon it. All of the different communities interested in digital preservation will have to apply the model (including the information model) in their own particular contexts, both organisational and technical.
The OAIS information model defines a number of different Information Objects covering the various types of information required for long term preservation. A basic assumption of the model is that all Information Objects are composed of a Data Object - typically a sequence of bits for digital data - and the Representation Information that permits the full interpretation of the Data Object into meaningful information. The OAIS model defines four distinct Information Objects: Content Information, Preservation Description Information (PDI), Packaging Information and Descriptive Information.
The OAIS information model sub-divides the PDI into four distinct groupings, based on categories discussed in the 1996 report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access (CPA) and the Research Libraries Group (RLG). The task force wrote that "in the digital environment, the features that determine information integrity and deserve special attention for archival purposes include the following: content, fixity, reference, provenance and context" . Accordingly, the OAIS taxonomy divides PDI into four: Reference Information, Context Information, Provenance Information and Fixity Information.
The OAIS model also defines a conceptual structure for Information Packages. This is viewed as a container that logically encapsulates Content Information and its associated PDI within a single Data Object. Information Packages (IP) are defined for submission (SIP), archival storage (AIP) and dissemination (DIP). Of these, the AIP is the most important for digital preservation, as it contains "all of the qualities needed for permanent, or indefinite, Long Term Preservation of a designated Information Object".
The Cedars project team took the existing draft of the OAIS information model and used it as a broad framework for an outline preservation metadata specification. It is an outline specification because in many cases it only defines the highest levels of the metadata scheme that would be required for any implementation. Also, elements and sub-elements are not specified as being 'mandatory' or 'optional,' but just given a significance level. In accordance with the OAIS's Information Package model, the project team envisaged that resources (Content Information) would be packaged together with their metadata (PDI). The specification focused on defining both the Representation Information which would enable the Content Information Data Object to be understood and the Content Information's associated PDI. Less consideration was given to the specific Representation Information which would be required for the PDI Data Object, or to Packaging or Descriptive Information.
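The relationship between Content Information, PDI and an Information Package can be sketched as a set of nested data structures. The OAIS is a reference model and does not prescribe any concrete encoding, so everything below is an assumption made for illustration: the class and field names, the use of plain strings for each PDI category, and the choice of a SHA-256 digest as Fixity Information.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ContentInformation:
    data_object: bytes                    # the bits being preserved
    representation_information: str       # what is needed to interpret them


@dataclass
class PreservationDescriptionInformation:
    reference: str      # identifier(s) for the content
    context: str        # relationship to its environment and other objects
    provenance: str     # history, origin and custody
    fixity: str         # evidence that the content has not been altered


@dataclass
class InformationPackage:
    kind: str                             # "SIP", "AIP" or "DIP"
    content: ContentInformation
    pdi: PreservationDescriptionInformation


def build_aip(data: bytes, representation_information: str, reference: str,
              context: str, provenance: str) -> InformationPackage:
    """Package content with its PDI, computing fixity from the data itself."""
    fixity = "sha256:" + hashlib.sha256(data).hexdigest()
    return InformationPackage(
        kind="AIP",
        content=ContentInformation(data, representation_information),
        pdi=PreservationDescriptionInformation(reference, context, provenance, fixity),
    )


def verify_fixity(package: InformationPackage) -> bool:
    """Recompute the digest and compare it with the stored Fixity Information."""
    algorithm, _, expected = package.pdi.fixity.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported fixity algorithm: {algorithm}")
    return hashlib.sha256(package.content.data_object).hexdigest() == expected


# Hypothetical example object.
aip = build_aip(
    data=b"Example report text",
    representation_information="Plain text, ASCII encoding",
    reference="urn:example:report-0001",
    context="Part of an invented report series",
    provenance="Deposited by the author; no migrations performed",
)
intact = verify_fixity(aip)
```

Keeping the fixity value inside the PDI, alongside the content it describes, reflects the idea of the package as a single logical container: the evidence needed to check the Data Object travels with it.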
The Cedars project team was aware that the proposed metadata element set would not necessarily support all of the roles identified in the OAIS functional model, e.g. the administration or data management functions. Despite this, however, it was recognised that some of the information provided as part of the Provenance Information could help support administrative functions such as rights management. In fact, the Provenance Information defined in the Cedars outline specification contains a number of elements specific to rights management which may go well beyond the OAIS model's assumption that provenance is primarily concerned with supporting the integrity of a given Data Object. This reflects the difficulty of defining simple metadata schemes where the same information can be used by functionally different parts of a system.
A quick look at the hierarchical structure of the Cedars specification demonstrates its basic dependence upon the OAIS information model. The first three levels of the hierarchy inherit the exact terminology and some of the definitions used in the OAIS model.
The Cedars specification tried not to make too many assumptions about the actual form of the digital objects being preserved or about the 'granularity' of specific objects. It was hoped that the specification would be applicable at any level of granularity, but the authors recognised that the specifics of implementation would be the responsibility of repositories. Also, the specification made no assumptions about which particular preservation strategy would be used, although this may have an impact on which particular elements would be required.
After publication of the outline specification, meetings were held in Birmingham and Cambridge to 'walk-through' the metadata element set with regard to specific resources. These raised many issues related to how the specification should be implemented and with regard to the organisation of metadata handling within a repository. This included questions about who would be responsible for generating this metadata and the relevant workflow.
There has been no serious development of the Cedars outline metadata specification since its publication in 2000. However, Cedars project participants are currently involved in the work of the Preservation Metadata Working Group supported by OCLC Online Computer Library Center and the RLG . The membership of the working group is international, and includes key individuals who were involved in the development of the Cedars metadata specification and others, including those developed by the NEDLIB project  and the National Library of Australia (NLA) . The key deliverable of the working group to date has been a white paper, a review of the state-of-the-art in preservation metadata . This includes a summary of the OAIS model, descriptions of the element sets developed by Cedars, NEDLIB and the NLA and an attempt to map between them using the OAIS information model as a general framework. Also, the working group has more recently published a Recommendation for Content Information, which provides an expanded conceptual structure for a Content Information package and a set of metadata elements . The recommendation includes elements based on the ones defined in the Cedars, NEDLIB and NLA specifications as well as new elements defined by the working group. Future work will include a recommendation on PDI.
This paper has attempted to describe current progress in a selection of the metadata projects in which UKOLN is currently involved. It has concentrated on two main areas. Firstly, subject gateway co-operation as expressed in the Resource Discovery Network and the Renardus project. Secondly, on the development of preservation metadata standards. These topics are important because subject gateways and digital preservation represent key areas where some of the traditional values of the information profession, with regard to, for example, the selection, description and preservation of resources, have begun to be applied in the context of the Internet.
This paper is based on work undertaken for the Cedars, Renardus and SCHEMAS projects and for the Resource Discovery Network. The Joint Information Systems Committee (JISC) of the UK higher and further education funding councils funds both the RDN and the Cedars project. The European Commission funds the Renardus and SCHEMAS projects as part of its Information Society Technologies (IST) Programme. All of these projects are collaborative activities and the author would like to acknowledge the input of other project partners into the work described in this overview.
UKOLN is funded by Resource: the Council for Museums, Archives and Libraries and the JISC, as well as by project funding from JISC, the European Union and other sources. UKOLN also receives support from the University of Bath, where it is based.
Maintained by: Michael Day, Research Officer, UKOLN, University of Bath.