|Roots of the OAI-PMH||Top|
The roots of OAI lie in the development of e-print repositories (so-called archives). E-print repositories were established in order to communicate the results of ongoing scholarly research prior to peer review and journal publication. The earliest of these was xxx (later arXiv), which began with high energy physics in 1991 and expanded to cover the field of physics plus related fields of mathematics, nonlinear sciences and computer science. CogPrints followed for psychology, linguistics and neuroscience. The Networked Computer Science Technical Reference Library (NCSTRL) provided access to computer science technical reports deposited either in xxx or in departmental repositories of cooperating research bodies. Similarly, RePEc provided authors in the field of economics with the option to submit working papers to their departmental archive or, if there was none, to the EconWPA archive at Washington University. In addition, the Networked Digital Library of Theses and Dissertations (NDLTD) built a digital library of electronic theses and dissertations (ETDs) authored by students of member institutions.
The mechanism for filling these repositories, in all cases, was by author deposit. (Within the OAI, and for the purposes of this tutorial, an "e-print" is defined as an author self-archived document.) Web interfaces allowed people to interact with these repositories and some finding aids were provided. Different interfaces were designed for different repositories, so end users were forced to learn diverse interfaces in order to access the various repositories and finding aids. The "Guildford protocol" supported interoperability between the RePEc archives, while NCSTRL repositories implemented the Dienst protocol. These protocols made possible a variety of end-user services, including those supporting end-user search and browse across repositories in each grouping. NDLTD created a workflow for submitting material, and developed an XML DTD (Document Type Definition) for ETDs, as well as supporting an ETD digital library. However, little or no autonomous metadata sharing was supported across this diverse environment, and still further separate initiatives in the area of new means of scholarly communication were taking shape. Certain key players in these developments came to see interoperability as an increasingly important issue to be addressed by the e-print community.
|Santa Fe meeting||Top|
"the joint impact of these and future initiatives can be substantially higher when interoperability between them [e-print archives] can be established"
(Ginsparg, Luce, Van de Sompel, UPS Call, July 1999)
Two key interoperability problems were identified as impairing the impact of e-print archives: end users were faced with multiple search interfaces, making resource discovery harder, and there was no machine-based way of sharing the metadata. Solutions that were being explored included cross-searching of archives (on one hand) and harvesting archive metadata (on the other hand) in order to provide centralised search services. In July 1999 Paul Ginsparg, Rick Luce, and Herbert Van de Sompel of the Los Alamos National Laboratory issued a call to a restricted group of technical experts to attend a meeting in Santa Fe, New Mexico in October of the same year. Ginsparg was involved with arXiv, and Van de Sompel was also still associated with the University of Ghent at that time. They proposed the creation of a universal service for author self-archived scholarly literature (the Universal Preprint Service, or UPS). The UPS would be "the fundamental and free layer of scholarly information, above which both free and commercial services could flourish". The first steps toward establishing it would be the identification or creation of interoperable technologies and frameworks for the dissemination of e-prints. This was announced to a wider audience under the headline "The Open Archives initiative aimed at the further promotion of author self-archived solutions".
The aim of the meeting in Santa Fe was to discuss interoperability issues, agree to begin work on a promotional prototype digital library service based on the main existing e-print repositories, and establish a forum for continuing work on interoperability of self-archiving solutions. In preparation for the meeting, some foundational work was undertaken. Van de Sompel initiated a project that simulated some aspects of interoperable distributed e-print archives. Thomas Krichel (University of Surrey & RePEc) experimented with converting data from existing e-print initiatives into the ReDIF metadata format used in RePEc. Michael Nelson (NASA Langley) took this data and used it to create various archives architected along the lines of his Smart Objects, Dumb Archives (SODA) concept. Data used came from sources including CogPrints, NASA, NCSTRL, RePEc and xxx. The aim of this work was not to make statements about the architectural directions that the UPS should take, but rather to facilitate discussions about this at the October meeting.
|Challenges and proposed solutions||Top|
Cross search or harvest?
Choice of a general direction to take in developing the architectural framework for a UPS was a key issue at this early stage. Two possible approaches were cross-searching multiple archives based on a protocol such as Z39.50, or else harvesting metadata into one or more "central" services in a bulk move of data that would bring it closer to the user interface.
Digital library experience suggested that cross searching does not scale well, at least partly because the search service degrades to the level of the slowest and least reliable server in the cross search set. For example, NCSTRL found that distributed searching of a small number of nodes was viable, but that performance was very bad over 100 nodes. In the UK, the Resource Discovery Network (RDN) was finding that even with only five subject gateways in its cross search there were problems of poor performance and in the provision of a browse interface, and developers were looking for a feasible centralised database solution. The more servers that are cross-searched, the higher the chance of encountering one or more slow or unreliable servers.
There is also the problem of knowing which target servers to use in any particular cross search. Collection descriptions, where they are available at all, may be inconsistent across repositories, are not designed for machine-to-machine communication, and require time-consuming examination by end users. Differences in query language syntax and search attribute variation (between servers and over time) introduce barriers of complexity, either for the end user or the cross-search software, or both. Ranked merging of results from distributed servers presents further technical and user-interface problems, and different sizes and types of targets can skew results. A browse interface is very difficult to build when the metadata to be browsed is distributed across a number of repositories. It was suggested that a solution would be to get all the metadata records together in one place.
The UPS prototype brought to the Santa Fe meeting demonstrated a cross-archive digital library providing services based on a collection of metadata harvested from multiple archives. Its architecture drew on NCSTRL and a modified version of the Dienst protocol. In this way, the number of nodes being searched could be reduced to one, giving significant performance benefits. A service could be provided using one query language, set of search attributes and ranking algorithm. In addition, an awareness of the data makes browse structures easier to build.
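The scaling argument in this section can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that every server is slow or unavailable with the same probability and independently of the others; this simplification, the function name and the sample figures are not taken from the NCSTRL or RDN experience described above.

```python
def chance_of_degraded_search(n_servers: int, p_slow: float) -> float:
    """Probability that at least one of n_servers is slow.

    Illustrative assumption: each server is slow with the same
    probability p_slow, independently of all the others.
    """
    if n_servers < 0:
        raise ValueError("n_servers must be non-negative")
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be between 0 and 1")
    # P(at least one slow) = 1 - P(every server is fast)
    return 1.0 - (1.0 - p_slow) ** n_servers


if __name__ == "__main__":
    # Compare a single harvested index with small and large cross searches.
    for n in (1, 5, 100):
        print(n, round(chance_of_degraded_search(n, 0.02), 3))
```

Under these assumptions the chance of hitting a slow server rises steadily with the number of servers queried, whereas a service built on harvested metadata queries a single node at search time.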
Data and Service Providers
The UPS architecture identified two logical roles: "Data Providers" and "Service Providers". Data Providers handle the deposit and publishing of resources in a repository and "expose" for harvesting the metadata about resources in the repository. They are the creators and keepers of the metadata and repositories of resources. Service Providers harvest metadata from Data Providers. They use the harvested metadata for the purpose of providing one or more services across all the data. The types of services that may be offered include a search interface, peer-review system, etc. Note that one 'provider' organisation can play both roles, offering both data for harvesting and end-user services. The key architectural shift was the move away from only supporting human end-user interfaces for each repository, to supporting both human end-user interfaces and machine interfaces for harvesting.
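The two roles can be sketched as a toy in-memory model. This is only an illustration of the division of responsibilities: the class names, method names and record fields below are hypothetical, and no network protocol is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    identifier: str
    datestamp: str   # ISO date string, e.g. "2003-10-14"
    metadata: dict


@dataclass
class DataProvider:
    """Keeps records and exposes their metadata for harvesting."""
    name: str
    records: list = field(default_factory=list)

    def expose(self, from_date: str = "") -> list:
        # ISO date strings sort chronologically, so string comparison works.
        return [r for r in self.records if r.datestamp >= from_date]


@dataclass
class ServiceProvider:
    """Harvests metadata from Data Providers and offers a service on it."""
    index: dict = field(default_factory=dict)

    def harvest(self, provider: DataProvider, from_date: str = "") -> int:
        harvested = provider.expose(from_date)
        for record in harvested:
            self.index[record.identifier] = record
        return len(harvested)

    def search(self, word: str) -> list:
        # A single local index: one query language, one set of attributes.
        word = word.lower()
        return sorted(
            ident for ident, rec in self.index.items()
            if word in rec.metadata.get("title", "").lower()
        )
```

A Service Provider would call `harvest` once per Data Provider and then answer every search from its own index. An organisation playing both roles would simply run both kinds of object.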
|The dawn of a protocol||Top|
The name UPS (Universal Preprint Service) was quickly changed, partly in order to avoid potential difficulties related to the fact that UPS is an established brand name for a commercial parcel delivery service, and partly because it was recognised that not all e-prints were preprints. The framework within which this universal service would be developed was now designated the Open Archives initiative (shortened to OAi, later OAI), a phrase that had gained currency in early discussions.
It was clear from discussions and experiments that in order to facilitate metadata harvesting there must be agreement on:
An initial agreement in key areas made it possible to develop a protocol for metadata harvesting, named the Santa Fe Convention in honour of the meeting where this agreement was reached.
|OAI-PMH version history||Top|
The Santa Fe Convention was the first incarnation of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It drew upon the UPS Prototype, the RePEc/SODA Service/Data provider model, the Dienst Protocol, and the work of the Santa Fe group. The focus of the Santa Fe Convention was to "optimise the discovery of e-prints".
The OAI-PMH 1.0 introduced the unqualified Dublin Core element set as a baseline for metadata interoperability. It drew upon the Santa Fe Convention, Digital Library Federation meetings, work at Cornell University, and feedback from alpha-testers. The focus expanded to facilitating the discovery of "document-like objects". It was a low-barrier interoperability specification, based on a metadata harvesting model. It was HTTP based, using HTTP GET / POST requests and XML responses. Note that it is not a search protocol; rather, it is a protocol for harvesting metadata. OAI-PMH 1.1 was a revision of the 1.0 specification taking account of changes to the emerging XML Schema specification. Both v.1.0 and 1.1 were experimental in nature.
The OAI-PMH 2.0 is a major revision of the protocol, and is not compatible with v.1.x. It drew upon OAI-PMH 1.x, feedback from OAI Implementers List, OAI Tech deliberation, and feedback from alpha-testers. Once again the focus of the protocol expanded; now it was said to be about "the recurrent exchange of metadata about resources between systems". It is still a low-barrier interoperability specification based around a metadata harvesting model. No longer experimental, v.2.0 is a stable protocol, and OAI has committed to making subsequent revisions of the protocol backwards compatible.
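To give a feel for the XML side of the exchange, the sketch below parses a small v.2.0-style response using only the Python standard library. The sample response is hand-written and abridged for illustration; the repository address `example.org`, the record and the function name `parse_records` are all invented, and a real harvester would also need to handle errors, deleted records and partial result lists.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 and its unqualified Dublin Core format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written, abridged sample response (hypothetical repository and record).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-10-14T16:36:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2003-10-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example e-print</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text: str) -> list:
    """Return identifier, datestamp and titles for each record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        header = rec.find("oai:header", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            # Dublin Core elements are repeatable, so collect every title.
            "titles": [t.text for t in rec.findall(".//dc:title", NS)],
        })
    return records
```

Calling `parse_records(SAMPLE_RESPONSE)` yields one dictionary per harvested record, which a Service Provider could then load into its own index.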
|Flexible deployment: various ways OAI-PMH may be deployed||Top|
OAI-PMH enables flexible deployment. Because it is a simple protocol based on HTTP and XML, it allows for rapid deployment. A number of toolkits are available, as will be discussed later in this tutorial. Systems can be deployed in a variety of configurations, as illustrated in the following diagrams. Metadata and full-text resources are typically made freely available, but this is not a requirement. OAI-PMH can also be used between closed groups; for metadata-sharing only; and in commercial applications.
Multiple Service Providers
Multiple Service Providers can harvest from multiple Data Providers.
Aggregators can sit between Data Providers and Service Providers.
Harvesting combined with searching
The harvesting approach can be complemented with searching based, for example, on Z39.50 or SRW.
Early movers were developing separate solutions, but the need for interoperability was recognised. In response, the Santa Fe Meeting led to substantial support for the OAI, which promotes interoperability by developing OAI-PMH as an open standard and by disseminating information about OAI-PMH. OAI-PMH is a low-cost mechanism for harvesting metadata records from one system to another, that is, from Data Providers to Service Providers. Multiple Service Providers can harvest from multiple Data Providers, ensuring a wider spread of metadata. OAI-PMH is not a search protocol, but its use can underpin search-based services; it is a base layer on which to build other services.
Development over the last two to three years has seen a move from the specific to the generic: from discovery of e-prints to sharing descriptions of any resources. Although unqualified Dublin Core is specified for baseline interoperability, OAI-PMH can be extended to any metadata format that can be encoded in XML. Based on HTTP for requests (and access-control, compression, error codes, etc.) and on XML for responses, it is Web-friendly, and therefore firewall-friendly. It allows Service Providers to say 'give me some or all of your records', where 'some' is based on date-stamps, sets, or metadata formats. Simple, and built on existing technology, OAI-PMH is easy to deploy, with many toolkits that can hide the protocol from developers.
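The 'give me some or all of your records' request can be sketched as the construction of an HTTP GET URL. The sketch below uses the `ListRecords` verb and the `metadataPrefix`, `from`, `until` and `set` arguments of OAI-PMH 2.0; the base URL `http://example.org/oai`, the set name `physics` and the helper function itself are hypothetical.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Build a selective-harvesting request URL for a repository.

    Leaving out from_date, until_date and set_spec asks for all records;
    supplying them narrows the request to 'some' records.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # records changed on or after this date
    if until_date:
        params["until"] = until_date    # records changed on or before this date
    if set_spec:
        params["set"] = set_spec        # restrict to one set of the repository
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_list_records_url("http://example.org/oai",
                                 from_date="2003-01-01", set_spec="physics"))
```

Because the request is an ordinary URL, it can be tried in a Web browser as well as issued by harvesting software, which is part of what makes the protocol easy to deploy.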
|Seven key definitions||Top|
An e-print is an author self-archived document. In the sense that the term is ordinarily used, the content of an e-print is the result of scientific or other scholarly research.
A document-like object is a digital data unit that is comparable to a paper document. The term designates a relatively simple, stable resource, and would not cover, for example, multimedia artifacts or interactive services.
A resource is anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., today's "weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources.
(Definition from Guidelines for implementing Dublin Core in XML by Andy Powell and Pete Johnston)
XML is the acronym for Extensible Markup Language. XML is a language for creating other languages. It defines a means of describing data. XML can be validated against a DTD or schema setting out the elements of the language created. XML mappings exist for a number of metadata record formats.
DTD is the acronym for Document Type Definition. A DTD is a formal specification of the structure of a document.
Dublin Core (DC) is a metadata format defined on the basis of international consensus. The Dublin Core Metadata Element Set defines fifteen elements for simple resource description and discovery, all of which are recommended, and none of which are mandatory. DC has been extended with further optional elements, element qualifiers and vocabulary terms.
(Definition draws on UKOLN's metadata glossary and Metadata in a nutshell by Michael Day)
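The Dublin Core definition above can be made concrete with a short sketch that lists the fifteen elements and checks a record against them. The representation of a record as a dictionary of lists and the function name are assumptions made for illustration only.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def unknown_elements(record: dict) -> set:
    """Return the keys of a record that are not Dublin Core elements.

    No element is mandatory, so an empty record passes this check.
    """
    return set(record) - DC_ELEMENTS


if __name__ == "__main__":
    record = {
        "title": ["An example e-print"],
        "creator": ["Example, A.", "Example, B."],  # elements may repeat
    }
    print(unknown_elements(record))
```

A record using only some of the fifteen elements is acceptable, which is one reason unqualified Dublin Core works as a low-barrier baseline.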
Interoperability is the ability of systems, services and organisations to work together seamlessly toward common or diverse goals. In the technical arena it is supported by open standards for communication between systems and for description of resources and collections, among others. Interoperability is considered here primarily in the context of resource discovery and access.
|Sources of further information||Top|
Lynch, C.A. Metadata Harvesting and the Open Archives Initiative. ARL Bimonthly Report 217, August 2001.
Van de Sompel, H., Krichel, T., Nelson, M. L. and others. The UPS Prototype: An Experimental End-User Service across E-Print Archives. D-Lib Magazine, vol.6, no.2. February 2000.
Van de Sompel, H., Lagoze, C. The Santa Fe Convention of the Open Archives Initiative. D-Lib Magazine, vol.6, no.2. February 2000.
OAI Web site: http://www.openarchives.org/
|Copyright © 2003 University of Bath. All rights reserved.
Author: Leona Carpenter (co-ordinating author) for OA-Forum and UKOLN
|Last modified: 14 Oct 2003 16:36
Authored in CALnet