UKOLN AHDS Metadata Harvesting


As the number of available digital resources increases, so does the need for quick and accurate resource discovery. To allow users to search more effectively, many resource discovery services now operate across the resources of multiple distributed content providers. There are two ways to do this: by distributed searching across many metadata databases, or by searching harvested metadata.

Metadata harvesting is the aggregation of metadata records from multiple providers into a single database. Building applications or services that use these aggregated records provides additional views of those resources, assisting in access across sectors and greater exposure of those resources to the wider community.

Open Archives Initiative Protocol for Metadata Harvesting

When metadata harvesting is carried out within the JISC Information Environment, the Open Archives Initiative Protocol for Metadata Harvesting (OAI PMH) [1] version 2.0 is recommended. The Open Archives Initiative [2] had its roots in the e-prints community, which was trying to improve access to scholarly resources. The OAI PMH was initially developed by an international technical committee in 1999. It is a lightweight, low-cost protocol built on HTTP and XML. The protocol defines six requests, known as verbs:

  1. GetRecord
  2. Identify
  3. ListIdentifiers
  4. ListMetadataFormats
  5. ListRecords
  6. ListSets
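
Because the protocol is built on HTTP, each verb is sent as an ordinary request parameter. The following minimal Python sketch shows how such request URLs might be assembled. The repository address and the helper name build_request_url are assumptions made for illustration and do not refer to a real service.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs defined by the protocol.
OAI_VERBS = ("GetRecord", "Identify", "ListIdentifiers",
             "ListMetadataFormats", "ListRecords", "ListSets")


def build_request_url(base_url, verb, **arguments):
    """Return the GET URL for one OAI-PMH request (illustrative helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, used only for illustration.
BASE_URL = "http://repository.example.org/oai"

identify_url = build_request_url(BASE_URL, "Identify")
list_url = build_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(list_url)
```

Rejecting unknown verbs in the helper catches typing mistakes before a request is sent.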

In order for metadata to be shared effectively two things need to happen:

  1. Content/data providers need to make metadata records available in a commonly understood form.
  2. Service providers need to obtain these metadata records from the content providers and hold them in a repository.

OAI PMH provides a means of doing the above.
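
The service provider side of this exchange can be sketched as a simple loop: issue ListRecords, store each record under its identifier, and continue while the response carries a non-empty resumptionToken (the flow-control mechanism in OAI-PMH 2.0). The sketch below is an illustration only, not a complete harvester. The fetch argument, the stub responses and the identifiers are all assumptions, and the stub records omit the metadata part for brevity.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc"):
    """Collect the records of one provider, keyed by identifier.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that takes a dict of request arguments and returns response XML text.
    """
    records = {}
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(arguments))
        for record in root.iter(OAI_NS + "record"):
            identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
            records[identifier] = record
        token = root.findtext(f".//{OAI_NS}resumptionToken")
        if not token:  # missing or empty token: the list is complete
            return records
        # Follow-up requests carry only the verb and the token.
        arguments = {"verb": "ListRecords", "resumptionToken": token}


def page(identifiers, token=""):
    """Build a stub ListRecords response (illustrative test data)."""
    body = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>"
        for i in identifiers)
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f"<ListRecords>{body}"
            f"<resumptionToken>{token}</resumptionToken>"
            "</ListRecords></OAI-PMH>")


def fake_fetch(arguments):
    """Stand-in for an HTTP call: serves two canned pages."""
    if arguments.get("resumptionToken") == "page2":
        return page(["oai:example.org:3"])
    return page(["oai:example.org:1", "oai:example.org:2"], token="page2")


harvested = harvest(fake_fetch)
print(sorted(harvested))
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP library and lets it be tried offline against canned responses.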

Record Format

At a minimum, a data provider must support the simple Dublin Core [3] record format ('oai_dc'). This format is defined by the OAI-PMH DC XML schema [4]. Data providers may also provide metadata records in other formats. Within the JISC Information Environment, if the repository is of value to the learning and teaching community, projects should also consider exposing metadata records that conform to the UK Common Metadata Framework [5], in line with the IMS Digital Repositories Specification, using the IEEE LOM XML schemas [6].
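
A harvester can find out which formats a repository offers from its ListMetadataFormats response. The following Python sketch checks such a response for the mandatory 'oai_dc' prefix. The sample response is hand-written for illustration and the function name supported_prefixes is an assumption.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def supported_prefixes(list_metadata_formats_xml):
    """Return the set of metadataPrefix values found in the response."""
    root = ET.fromstring(list_metadata_formats_xml)
    return {element.text for element in root.iter(OAI_NS + "metadataPrefix")}


# Hand-written sample response, for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

prefixes = supported_prefixes(SAMPLE_RESPONSE)
print("oai_dc" in prefixes)
```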

OAI-PMH also provides a number of facilities to supply metadata about metadata records. For example, rights and/or provenance information can be provided in the <about> element of the GetRecord response, and collection-level descriptions can be provided in the <description> element of the Identify response.

Example OAI DC metadata record

The following example is taken from the Library of Congress Repository 1.

<dc:title>Empire State Building. [View from], to Central Park</dc:title>
<dc:creator>Gottscho, Samuel H. 1875-1971, photographer.</dc:creator>
<dc:date>1932 Jan. 19</dc:date>
<dc:type>two-dimensional nonprojectible graphic</dc:type>
<dc:type>Cityscape photographs.</dc:type>
<dc:type>Acetate negatives.</dc:type>
<dc:coverage>United States--New York (State)--New York.</dc:coverage>
<dc:rights>No known restrictions on publication.</dc:rights>
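
Once harvested, a record of this kind can be read with any XML parser. The Python sketch below groups the example's elements by name, keeping repeated elements such as dc:type as lists. The enclosing oai_dc:dc element and its namespace declarations, which the excerpt above omits, are added here so that the fragment parses. The function name dc_fields is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_NS = "http://purl.org/dc/elements/1.1/"

# The example record, wrapped in the oai_dc container element so it parses.
RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Empire State Building. [View from], to Central Park</dc:title>
  <dc:creator>Gottscho, Samuel H. 1875-1971, photographer.</dc:creator>
  <dc:date>1932 Jan. 19</dc:date>
  <dc:type>two-dimensional nonprojectible graphic</dc:type>
  <dc:type>Cityscape photographs.</dc:type>
  <dc:type>Acetate negatives.</dc:type>
  <dc:coverage>United States--New York (State)--New York.</dc:coverage>
  <dc:rights>No known restrictions on publication.</dc:rights>
</oai_dc:dc>"""


def dc_fields(record_xml):
    """Map each Dublin Core element name to the list of its values."""
    fields = defaultdict(list)
    for element in ET.fromstring(record_xml):
        # ElementTree tags look like "{namespace}name".
        namespace, _, name = element.tag[1:].partition("}")
        if namespace == DC_NS:
            fields[name].append(element.text)
    return dict(fields)


fields = dc_fields(RECORD)
print(fields["title"])
print(len(fields["type"]))
```

Storing every value in a list reflects the fact that simple Dublin Core elements are repeatable, as the three dc:type values in the example show.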

Conformance Testing for Basic Functionality

The OAI gives information on tests an OAI repository must successfully complete in order to be entered in the registry. For example:

More information on the tests necessary is available from the OAI Web site [7]. Projects could use the tests listed to create a checklist to measure their repository's conformance.


References

  1. The Open Archives Initiative Protocol for Metadata Harvesting
  2. Open Archives Initiative
  3. Dublin Core
  4. OAI-PMH DC XML Schema
  5. UK Common Metadata Framework
  6. IMS Digital Repositories Specification
  7. Registering as a Data Provider

Further Information