OAI collection description and set semantics

Caroline Arms, Andy Powell, Mogens Sandfaer


Introduction

Each OAI repository can be thought of as a collection - a collection of metadata records and/or "full content" items. The collection of records and/or items may, optionally, be partitioned into one or more sub-collections, known as sets. It should be noted that OAI repositories need not contain "full content" items. For example, a subject gateway (a database of metadata about remote Internet resources) is a repository containing a collection of records but no items. A pre-print archive is an example of a repository that contains a collection of items and a collection of records.

This document considers the description of OAI repositories and sets and makes recommendations for mechanisms to encode such collection descriptions within the OAI Metadata Harvesting Protocol (OAI-MHP). The intention is to develop a generic collection description mechanism that can be used across all OAI repositories, i.e. one that is applicable to all domains, and that is rich enough to support:

Ideally, the descriptions should be simple enough that they will be provided by the majority of repositories.

This document does not consider the more general issues of collection description, nor does it consider how the OAI-MHP can be used to harvest collection description metadata about arbitrary external collections, though that is a perfectly valid use of the protocol.

Terminology

repository
a collection of metadata records and/or full content items that can be accessed using the OAI-MHP.
set
a sub-collection of metadata records and/or full content items within a repository.
item
a "full content" resource available from a repository.
record
a description (metadata) available from a repository.
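The four terms above can be read as a small data model. The following Python sketch is one possible reading, not part of the protocol: the class names `Collection` and `OaiSet`, the `identifier` and `spec` fields, and the `has_items` helper are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """A description (metadata) available from a repository."""
    identifier: str  # hypothetical field, for illustration only


@dataclass
class Item:
    """A "full content" resource available from a repository."""
    identifier: str  # hypothetical field, for illustration only


@dataclass
class Collection:
    """Metadata records and/or full content items (shared by repositories and sets)."""
    name: str
    records: List[Record] = field(default_factory=list)
    items: List[Item] = field(default_factory=list)

    def has_items(self) -> bool:
        # A repository need not contain any "full content" items.
        return bool(self.items)


@dataclass
class OaiSet(Collection):
    """A sub-collection of records and/or items within a repository."""
    spec: str = ""


@dataclass
class Repository(Collection):
    """A collection accessible using the OAI-MHP, optionally partitioned into sets."""
    sets: List[OaiSet] = field(default_factory=list)
```

Under this reading, a subject gateway would be a `Repository` with records but an empty `items` list, while a pre-print archive would populate both.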

Existing collection description within the OAI-MHP

There is one existing collection description mechanism within the OAI-MHP. The response to an Identify request may contain a list of description containers, which provide an extensible mechanism for communities to describe their repositories.
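As a rough illustration of how a harvester might discover which description schemas a repository offers, the Python sketch below lists the payload element and namespace found inside each description container of an Identify response. The sample response, its namespace URIs (modelled on the naming pattern of the 1.1 schemas) and the function names are assumptions made for this example; they are not quoted from the specification.

```python
import xml.etree.ElementTree as ET

# Illustrative Identify response. Namespace URIs and content are assumptions.
SAMPLE_IDENTIFY = """<Identify xmlns="http://www.openarchives.org/OAI/1.1/OAI_Identify">
  <repositoryName>Example Repository</repositoryName>
  <description>
    <oai-identifier xmlns="http://www.openarchives.org/OAI/1.1/oai-identifier"/>
  </description>
  <description>
    <eprints xmlns="http://www.openarchives.org/OAI/1.1/eprints">
      <metadataPolicy/>
      <dataPolicy/>
    </eprints>
  </description>
</Identify>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag form into (local, namespace)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return local, namespace
    return tag, ""


def description_schemas(identify_xml):
    """Return (element name, namespace) for the payload of each description container."""
    root = ET.fromstring(identify_xml)
    found = []
    for child in root:
        local, _ = split_tag(child.tag)
        if local != "description":
            continue
        # Each description container wraps one community-defined payload element.
        for payload in child:
            found.append(split_tag(payload.tag))
    return found


for name, namespace in description_schemas(SAMPLE_IDENTIFY):
    print(name, namespace)
```

Matching on the local element name rather than a fixed namespace keeps the sketch tolerant of whichever protocol version a repository answers with.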

The 1.1 specification contains two sample XML Schemas - oai-identifier and eprints:

It should be noted that the oai-identifier schema doesn't provide a true collection description; it merely describes the format of identifiers used by the repository. The eprints schema provides a collection description covering both the items (dataPolicy) and the records (metadataPolicy). However, the eprints schema is fairly minimal. For example, there is no indication of the subject coverage of the repository.

In general, responses to the Identify request describe the repository as a whole - there is no agreed mechanism for separately describing the collection of items and the collection of records within the same repository.

The current version of the protocol provides no mechanism for describing the sets within a repository, other than providing the setName as part of the response to a ListSets request.

Current usage of sets

An analysis of the usage of sets by currently registered OAI repositories is available in Appendix A.

Out of 49 repositories, 39 are using sets. Of these, 13 appear to partition their collection by subject area, 13 by genre, and 9 by source of records.

Collection description schemas

In order to share descriptions of repositories and sets, a collection description schema needs to be agreed. Five possibilities are suggested here:

Of these, there is some benefit in using simple Dublin Core because of its use elsewhere in the protocol. EAD has a strictly archival background and may not be applicable across the whole range of OAI repositories. The RSLP schema may be considered to be over-complex for use within OAI. The eprints schema may not be considered complex enough. UDDI primarily focuses on 'service' description, though it may also provide a framework for describing collections.

It is worth noting that mappings between the DC, EAD and RSLP schemas already exist.

Possible protocol enhancements

As described above, the OAI-MHP already provides a mechanism for describing the repository as a whole, using the description container within the response to an Identify request. All of the schemas listed above could be encoded within the description container, provided a suitable XML schema is made available.

To provide descriptions of the sets within a repository, the protocol will need to be enhanced. For example, it would be possible to add an optional setDescription container within the response to a ListSets request. A response might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<ListSets xmlns="http://www.openarchives.org/OAI/1.1/OAI_ListSets"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.openarchives.org/OAI/1.1/OAI_ListSets
                              http://www.openarchives.org/OAI/1.1/OAI_ListSets.xsd">
  <responseDate>2001-06-01T19:20:30-04:00</responseDate> 
  <requestURL>http://an.oa.org/OAI-script?verb=ListSets</requestURL>
  <set>
    <setSpec>Oceanside</setSpec>
    <setName>Oceanside University of Nebraska</setName>
    <setDescription>
      <dc xmlns="http://purl.org/dc/elements/1.1/" 
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
          xsi:schemaLocation="http://purl.org/dc/elements/1.1/
                              http://www.openarchives.org/OAI/1.1/dc.xsd">
        <subject>Marine biology</subject>
        <publisher>University of Nebraska</publisher>
        <rights>Metadata may be used without restrictions as long as the oai
        identifier remains attached to it.</rights>
      </dc>
    </setDescription>
  </set>
</ListSets>
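To show how a harvester might consume such a response, here is a minimal Python sketch that reads the proposed setDescription container and treats it as optional. It assumes the description payload is simple Dublin Core, as in the example above; the second set in the sample (`Plain`) and the function name `read_sets` are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/1.1/OAI_ListSets",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed version of the example response, plus one hypothetical set
# that carries no setDescription.
SAMPLE_LIST_SETS = """<ListSets xmlns="http://www.openarchives.org/OAI/1.1/OAI_ListSets">
  <set>
    <setSpec>Oceanside</setSpec>
    <setName>Oceanside University of Nebraska</setName>
    <setDescription>
      <dc xmlns="http://purl.org/dc/elements/1.1/">
        <subject>Marine biology</subject>
        <publisher>University of Nebraska</publisher>
      </dc>
    </setDescription>
  </set>
  <set>
    <setSpec>Plain</setSpec>
    <setName>Set without a description</setName>
  </set>
</ListSets>"""


def read_sets(list_sets_xml):
    """Return one dict per set; 'description' maps DC element names to lists of values."""
    root = ET.fromstring(list_sets_xml)
    result = []
    for set_el in root.findall("oai:set", NS):
        entry = {
            "setSpec": set_el.findtext("oai:setSpec", default="", namespaces=NS),
            "setName": set_el.findtext("oai:setName", default="", namespaces=NS),
            "description": {},
        }
        # setDescription is optional, so a missing container is not an error.
        dc = set_el.find("oai:setDescription/dc:dc", NS)
        if dc is not None:
            for dc_field in dc:
                name = dc_field.tag.rsplit("}", 1)[-1]
                value = (dc_field.text or "").strip()
                # DC elements are repeatable, so collect values in a list.
                entry["description"].setdefault(name, []).append(value)
        result.append(entry)
    return result


for entry in read_sets(SAMPLE_LIST_SETS):
    print(entry["setSpec"], entry["description"])
```

Because the container is optional, existing harvesters that ignore unknown elements and repositories that supply no description would both continue to work under this design.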

However, it is possible to view each repository and/or set as two collections - a collection of items and a collection of metadata records. It should be noted that, in the discussion above, the use of description and setDescription containers does not provide a mechanism for separately describing these two collections within the repository and/or set. In order to support separate descriptions, separate containers would be required. For example: itemsDescription, recordsDescription, setItemsDescription and setRecordsDescription.
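A sketch of what separate containers could look like for a set is given below, in Python. The container names setItemsDescription and setRecordsDescription come from the paragraph above; everything else (the use of simple Dublin Core inside each container, the choice of which DC element goes in which container, the omission of the ListSets namespace, and the helper names) is an assumption made only to keep the example short.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dc_block(fields):
    """Build a simple Dublin Core element from (name, value) pairs."""
    dc = ET.Element(f"{{{DC_NS}}}dc")
    for name, value in fields:
        ET.SubElement(dc, f"{{{DC_NS}}}{name}").text = value
    return dc


def build_set(spec, name, items_fields, records_fields):
    """Build a <set> element with separate item and record descriptions."""
    set_el = ET.Element("set")
    ET.SubElement(set_el, "setSpec").text = spec
    ET.SubElement(set_el, "setName").text = name
    # One container describes the collection of items ...
    ET.SubElement(set_el, "setItemsDescription").append(dc_block(items_fields))
    # ... and a second describes the collection of metadata records.
    ET.SubElement(set_el, "setRecordsDescription").append(dc_block(records_fields))
    return set_el


example = build_set(
    "Oceanside",
    "Oceanside University of Nebraska",
    items_fields=[("subject", "Marine biology")],
    records_fields=[("rights", "Metadata may be used without restrictions "
                               "as long as the oai identifier remains attached to it.")],
)
print(ET.tostring(example, encoding="unicode"))
```

The same pattern would apply at repository level, with itemsDescription and recordsDescription containers in the Identify response.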

Rather than providing collection descriptions in-line within the protocol, an alternative approach might be to provide some mechanism for linking to an external collection description about the repository and/or sets.

Summary of issues

It is recommended that further work be carried out to propose answers to the following questions:

Appendix A - Analysis of set usage in registered OAI repositories

By Caroline Arms, Library of Congress
As of 2001-10-22

| OAI id | Repository name | # of sets | Apparent semantics |
|---|---|---|---|
| celebration | A Celebration of Women Writers | 0 | |
| aps | American Philosophical Society | 5 | discipline |
| arXiv | arXiv | 4 | discipline |
| bmc | BioMed Central | 60 | topic |
| CDLCIAS | California International and Area Studies Digital Repository | 11 | discipline (region of study) |
| caltechcstr | Caltech Computer Science Technical Reports | 8 | discipline (1), decade |
| caltecheerl | Caltech Earthquake Engineering Research Laboratory Technical Reports | 1 | |
| caltechETD | Caltech Electronic Theses and Dissertations | 1 | genre |
| cimi | CIMI Metadata Harvesting Working Group Demonstration Repository | >100 | source of records |
| citebase | Cite-Base services | 2 | source of records |
| cogprints | CogPrints | 50 | topic |
| cbold | Comparative Bantu Online Dictionary (CBOLD) | 1 | genre of content |
| CSTC | Computer Science Teaching Center | 1 | |
| CDLDERM | Dermatology Digital Repository | 34 | topic |
| DUETT | DUETT - Dissertations and other Documents of the Gerhard-Mercator-University Duisburg | 1 | genre (dissertations and theses) |
| eldorado | Elektronisches Dokumenten-, Archivierungs- und Retrievalsystem der Universitaet, Dortmund | 16 | discipline |
| elra | European Language Resources Association | 10 | genre |
| formations | Formations | 23 | discipline |
| cav2001 | Fourth International Symposium on Cavitation | 21 | topic (1), session at conference |
| hsss | Hochschulschriftenserver (HSSS) der SLUB Dresden | 29 | genre (3), discipline |
| HUBerlin | Humboldt University of Berlin, GERMANY, Document Server | 36 | genre (6), discipline |
| scout | Internet Scout Project OAI Repository | 0 | |
| lcoa1 | Library of Congress Open Archive Initiative Repository 1 | 4 | source of records, genre of content |
| ldc | Linguistic Data Consortium | | ERROR (403) |
| | LTRS | 50 | topic |
| | M.I.T. Theses | 2 | source of records |
| | NACA | 0 | |
| etdcat | OCLC Online Computer Library Center Theses and Dissertations Repository | 0 | |
| | Perseus Digital Library | 15 | source of records |
| physdoc | PhysNet, Oldenburg, Germany, Document Server | 1 | genre (PhD theses) |
| | Resource Discovery Network | 0 | |
| RIACS | RIACS - Research Institute for Advanced Computer Science - Eprint Archive | | ERROR (500) |
| sceti | Schoenberg Center for Electronic Text and Image | 0 | |
| MONARCH | Technical University of Chemnitz - MONARCH | 14 | genre of content (level of thesis, article, etc.) |
| EKUTuebingen | The Eberhard Karls University of Tuebingen | 0 | |
| lacito | The LACITO Archive | 3 | genre of content |
| dfki | The Natural Language Software Registry | 1 | |
| | The Oxford Text Archive | 1 | |
| dlpscoll | The University of Michigan. University Library. Digital Library Production Service. | 15 | source of records, genre of content |
| CDLTC | Tobacco Control Digital Repository | >100 | topic |
| | Tropicos | 1 | genre |
| thesis | Universidad de las Americas - Puebla: Digital Thesis | 1 | genre (theses) |
| uiLib | University of Illinois Library | 6 | source of records |
| tkn | University of Tennessee Libraries | 8 | source of records |
| VTETD | Virginia Tech Electronic Thesis and Dissertation Collection | 1 | genre (theses) |
| | AISRI (American Indian Studies Research Institute) | 1 | genre of content |
| anlc | Alaska Native Language Center | 0 | |
| idli | University of Illinois at Urbana-Champaign, Digital Library Initiative | 3 | source of records |
| | Chemistry Preprint Server | 10 | discipline |


Last updated: 29 October 2001