Metadata in Digital Libraries, DELOS meeting, Riga, Latvia, 16 April 2003

Exercise 2: Exploring the use of
the OAI Protocol for Metadata Harvesting


Aims

This exercise provides an introduction to the OAI-PMH and to some services developed using it. It gives you an opportunity to see how XML is used to encode metadata records in OAI-PMH messages, but it is not intended to provide detailed coverage of the technical elements of the protocol.

Task 1 : Explore arXiv.org: An e-prints archive

arXiv is a subject/disciplinary e-print service covering the fields of physics, mathematics, non-linear science and computer science. It is owned, operated and funded by Cornell University.

Go to the arXiv home page at <URL: http://www.arxiv.org/>.

Spend a few minutes exploring the archive.

For the purposes of this exercise, identify a single paper to work with as an example, and make a note of the identifier used by arXiv. For example, I worked with a paper titled "Intersecting brane solutions in string and M-theory" by Douglas J. Smith, which has an identifier in arXiv of hep-th/0210157.

Task 2 : Explore Citebase, an OAI service provider

Citebase is an experimental service provided by the University of Southampton. Citebase uses the OAI-PMH to gather metadata from a number of e-print archives (which act as data providers), and provides an "impact analysis" based on analysing the citations in the full text of the papers.

N.B. The citation analysis feature of Citebase depends on access to the full-text of the resource described by the arXiv records, not only to the metadata records. The OAI-PMH itself specifies only how the metadata is harvested; access to the resource is outside the scope of the protocol itself.

Go to the Citebase home page at <URL: http://citebase.eprints.org/>.

Search for your selected paper from arXiv in Citebase: probably the quickest way of doing this is to use the OAI-identifier search option. In the identifier textbox, enter the identifier of the e-print which you recorded from arXiv, prefixed by the string oai:arXiv.org: (including the trailing colon). For my example above, the full OAI identifier would be: oai:arXiv.org:hep-th/0210157
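The prefixing rule can be sketched in a few lines of Python. The helper name to_oai_identifier is hypothetical and used only for illustration; the prefix string is the one given in the text.

```python
# Prefix given in the exercise text for arXiv items.
ARXIV_OAI_PREFIX = "oai:arXiv.org:"


def to_oai_identifier(arxiv_id: str) -> str:
    """Return the OAI identifier for an arXiv identifier (hypothetical helper)."""
    arxiv_id = arxiv_id.strip()
    if arxiv_id.startswith(ARXIV_OAI_PREFIX):
        return arxiv_id  # already a full OAI identifier, leave unchanged
    return ARXIV_OAI_PREFIX + arxiv_id


print(to_oai_identifier("hep-th/0210157"))  # prints oai:arXiv.org:hep-th/0210157
```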

Click the Abstract button to obtain the Citebase abstract display. This display includes the metadata harvested from arXiv, together with the results of the (experimental) citation analysis performed by Citebase: a list of references cited in the current paper and a list of papers which cite the current paper. In both cases, if those other papers are indexed by Citebase, a hyperlink is generated.

Task 3 : The OAI Repository Explorer

The OAI Repository Explorer provides a simple forms interface which allows you to submit OAI requests to a repository.

Go to the Repository Explorer home page at <URL: http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai>.

  1. Select the arXiv.org repository from the list of those available.

  2. Issue an OAI-PMH Identify request to the repository, by clicking on the "Identify" hyperlink in the "Verbs" table.

    You should receive a display of metadata describing the repository as a whole. Note that the Identify response can include information about access/use rights, both to the metadata and to the resources described.

  3. Issue an OAI-PMH List Metadata Formats request.

    The response indicates that the arXiv repository supports the unqualified Dublin Core format (identified by the prefix oai_dc) mandated by the OAI-PMH specification, but does not support any additional metadata formats. That is, all harvesting from arXiv is of unqualified DC metadata records.

  4. To see the metadata record which was harvested by Citebase from arXiv for your selected e-print, issue an OAI-PMH Get Record request for that item. The Get Record request requires two parameters:

    • the identifier of the item from which the record is to be generated (in the current example, oai:arXiv.org:hep-th/0210157)
    • the metadata format required (oai_dc)

    The response includes a formatted version of the simple Dublin Core metadata record.

  5. Finally, select the "Raw XML" display option (towards the foot of the page) and reissue the Get Record request. (The metadata prefix and identifier parameter values should be retained.)

    You should be able to recognise a <metadata> XML element containing a simple Dublin Core record similar to the examples you generated using DC-dot in the previous exercise.
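Behind its forms, the Repository Explorer sends ordinary HTTP requests in which the OAI-PMH verb and its parameters appear as query arguments. The following Python sketch shows roughly how the requests from steps 2 to 4 could be built, and how the <metadata> element of a Raw XML Get Record response could be read. Several things here are assumptions rather than part of the exercise: BASE_URL is a placeholder and not the real arXiv address, the helper names build_request and dc_fields are hypothetical, and SAMPLE_RESPONSE is a hand-written, heavily abbreviated illustration rather than actual arXiv output. The verb names, parameter names and XML namespaces follow the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: placeholder only; substitute the repository's real OAI-PMH base URL.
BASE_URL = "http://example.org/oai2"


def build_request(verb, **params):
    """Build an OAI-PMH request URL for a verb and its parameters (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"verb": verb, **params})


# Steps 2 to 4 of the task expressed as request URLs.
identify_url = build_request("Identify")
formats_url = build_request("ListMetadataFormats")
get_record_url = build_request(
    "GetRecord",
    identifier="oai:arXiv.org:hep-th/0210157",
    metadataPrefix="oai_dc",
)

# Hand-written, abbreviated illustration of a Get Record response (not real output).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:hep-th/0210157</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Intersecting brane solutions in string and M-theory</dc:title>
          <dc:creator>Douglas J. Smith</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def dc_fields(xml_text):
    """Return (element name, value) pairs for the simple DC record in <metadata>."""
    root = ET.fromstring(xml_text)
    metadata = root.find(".//oai:metadata", NS)
    if metadata is None:
        return []
    dc_prefix = "{" + NS["dc"] + "}"
    fields = []
    for elem in metadata.iter():
        if elem.tag.startswith(dc_prefix):
            fields.append((elem.tag[len(dc_prefix):], (elem.text or "").strip()))
    return fields


print(get_record_url)
for name, value in dc_fields(SAMPLE_RESPONSE):
    print(name + ": " + value)
```

Note that urlencode percent-encodes the ":" and "/" characters in the identifier, so the identifier in the printed URL looks different from the form typed into the Repository Explorer, although it carries the same value.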

Task 4 : Other OAI Service Providers

If you have time, explore some other OAI-based services, such as:
