KE briefing

From DigiRepWiki

Research paper metadata


A briefing paper for the Research paper metadata strand at the Knowledge Exchange Workshop on Institutional Repositories, Utrecht, NL, 15-17 January 2007


Author: Julie Allinson

Contributors: Mogens Sandfaer, Neil Jacobs

Background

The Knowledge Exchange (http://www.knowledge-exchange.info/) is a co-operative partnership between Denmark’s Electronic Research Library (DEFF), the German Research Foundation (DFG), the Joint Information Systems Committee (JISC) in the United Kingdom and the SURF Foundation (SURF) in the Netherlands. Its overarching aim is to support the use and development of ICT infrastructure for higher education and research. In January 2007, SURF is hosting an Institutional Repositories workshop, organised around the following 6 thematic strands:

  • usage statistics
  • research paper metadata
  • exchanging information
  • author identification
  • Open Archives Initiative
  • eTheses

The focus of this workshop is on identifying issues and solutions, towards agreeing common approaches and achieving interoperability. This briefing paper has been prepared by the moderators of the 'Research paper metadata' strand to facilitate discussion of research paper metadata by identifying the formats currently used and the issues and problems faced. Beyond metadata, the paper is cognisant of the role of scholarly works and their metadata in relation to repositories, other data types, other information systems and interoperability standards.

Introduction

For institutional open access repositories as well as research databases and other e-science documentation systems, research papers (scholarly works or eprints) are one of the most commonly collected types of data and are a key component of the scholarly research information cycle. It is quite common for metadata about a single research paper, or scholarly work, to exist in multiple places, e.g. the institutional repository of each author, a subject repository, research information systems, aggregator search services and so on. In many cases, current practice involves duplication of effort, or manual manipulation of data to fit a particular system, which goes against the metadata maxim of "do it once, do it well". Within the increasingly complex ecology of interoperating systems and services, there is a clear need to agree and use common formats and standards for exchange, and to create high-quality, consistent metadata.

Formats and approaches

A number of metadata standards exist for capturing and sharing metadata. These include:

Local, repository-specific, national or other requirements are outside the scope of such metadata formats, so application profiles are often developed to meet them. Several application profiles exist across the four KE partner countries.

Denmark (DEFF)

Current situation: The MXD format for document exchange is used in The Danish National Research Database (http://www.forskningsdatabasen.dk/) and thus data from the universities is exported in this format. The PURE format is used at a range of universities using the PURE system (http://www.atira.dk/) for research registration.

Danish Research Database: Metadata Exchange Format of Documents (DDF-MXD)

The MXD format comprises 4 entities: person, organisation, project and event. It has a simple yet flexible design, intended to cater as well for the simplest minimum-level metadata as for more detailed and sophisticated cataloguing practices. It can be mapped to oai_dc for exposure over OAI-PMH.

PURE format for the PURE research system

The PURE system collects a wide set of information, including bibliometric analysis (citations), research activities, research project information, project planning and national statistics.

Future work: Working group on a national metadata standard for research information, under the auspices of OIO (http://www.oio.dk/).

United Kingdom (JISC)

Current situation: Most repositories use oai_dc for exchange over OAI-PMH. Internal repository metadata is based on DC in most cases, although examples of other formats, such as MODS, do exist. The two most commonly used repository systems in the UK are GNU eprints and DSpace. DSpace uses a qualified Dublin Core metadata profile (http://www.dspace.org/technology/metadata.html); GNU eprints comes with its own out-of-the-box metadata. The following application profile is a new development and has yet to be implemented within either of these software products, and thus by repositories. It is likely that this format will be taken up during 2007. There is little known evidence of interoperability between UK repositories and research systems.

Dublin Core Application Profile for Scholarly Works:

This application profile takes FRBR as the basis for its entity-relationship model and uses the DCMI Abstract Model to facilitate the grouping together of descriptions for each of its five entities: Scholarly Work, Expression, Manifestation, Copy and Agent. Metadata elements are taken from simple and qualified Dublin Core, the MARC relators and foaf schemes, plus some new eprints terms and four vocabularies for EntityType, Type, AccessRights and PeerReviewedStatus. It can be mapped to oai_dc for exposure over OAI-PMH, or exposed using the eprints dc xml schema.

A note about the Dublin Core Abstract Model

The DCMI Abstract Model (DCAM) (http://dublincore.org/documents/abstract-model/) provides a syntax-neutral abstract model for Dublin Core metadata and describes the constructs that make up a DC metadata description set. According to the DCAM:

  • a description set is made up of one or more descriptions
  • a description is made up of
    • zero or one resource URI and
    • one or more statements
  • a statement is made up of
    • exactly one property URI and
    • zero or one reference to a value in the form of a value URI
    • zero or more representations of a value, each in the form of a value representation
    • zero or one vocabulary encoding scheme URI
  • a value representation is either
    • a value string or
    • a rich representation
  • a value string may have an associated value string language
  • a value string may have an associated syntax encoding scheme URI
  • a value may be the subject of a related description

The DCMI is currently revising its guidelines for encoding DC metadata in XML and RDF/XML in line with the DCAM. These new guidelines will enable the exchange of more complex description sets between software applications.
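The DCAM constructs listed above can be sketched as a small set of Python data classes. This is an illustrative reading of the bullet list only, not an official DCMI binding: the class and field names are assumptions, the example resource URI is hypothetical, and rich representations are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ValueString:
    text: str
    language: Optional[str] = None                    # associated value string language
    syntax_encoding_scheme_uri: Optional[str] = None  # associated syntax encoding scheme


@dataclass
class Statement:
    property_uri: str                                 # exactly one property URI
    value_uri: Optional[str] = None                   # zero or one value URI
    value_strings: List[ValueString] = field(default_factory=list)  # zero or more
    vocabulary_encoding_scheme_uri: Optional[str] = None            # zero or one
    related_description: Optional["Description"] = None  # a value may itself be described


@dataclass
class Description:
    statements: List[Statement]                       # one or more statements
    resource_uri: Optional[str] = None                # zero or one resource URI

    def __post_init__(self):
        if not self.statements:
            raise ValueError("a description needs one or more statements")


@dataclass
class DescriptionSet:
    descriptions: List[Description]                   # one or more descriptions

    def __post_init__(self):
        if not self.descriptions:
            raise ValueError("a description set needs one or more descriptions")


# A one-statement description of a (hypothetical) eprint.
title = Statement(
    property_uri="http://purl.org/dc/elements/1.1/title",
    value_strings=[ValueString("Research paper metadata", language="en")],
)
paper = Description(statements=[title], resource_uri="http://example.org/eprint/1")
record = DescriptionSet(descriptions=[paper])
```

The `related_description` field is what allows a description set to hold several linked descriptions, for example one for a paper and one for its author.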

Netherlands (SURF)

Current situation: In DARE, the national programme for institutional repositories, (simple) Dublin Core is combined with MPEG21/DIDL. Most parts of CERIF are implemented by Metis, the Research Information System used by all Dutch Universities. In many universities, Metis and the IR are linked. Metadata records are exchanged between the two systems; in most cases the metadata records flow from Metis to the IR, in some cases Metis is also updated by records from the IR. For the Nereus project, a qualified DC XML scheme has been developed, with the addition of an author identifier (DAI). This identifier links to person records in the Dutch Union Catalogue (hosted by OCLC PICA).

DARE use of Dublin Core Version 2 (Dec. 2004)

This profile uses qualified Dublin Core and includes a list of resource type values for scholarly works.

Nereus use of metadata (2005):

This profile uses qualified Dublin Core and includes a list of resource type values for scholarly works.

Germany (DFG)

It is not clear what formats are currently in use in Germany.

Metadata

Entities and relationships

As part of any requirements gathering exercise, it is useful to think about the different 'entities' involved when talking about scholarly works, and the relationships between those entities. Two data models exist which can provide some help in this area:

CERIF - the Common European Research Information Format (http://cordis.europa.eu/cerif/) is a multi-entity data model for current research information. It can be used as a data model for current research information systems (CRIS) and as a model for data exchange. CERIF is organised into 5 levels of abstraction. At its highest level CERIF identifies three primary base entities: Person, Project and OrgUnit. The remaining 4 levels of abstraction define the secondary base entities, of which Result_Publication is one; the language field base (for multi-language attributes); lookup tables (which define the nature of relationships between entities); and link tables (the relationships between entities). Within CERIF, research publications (papers, presentations etc.) are treated as an output of one or more of the base entities and sit alongside two other types of output: product and patent.

Result_Publication has the following attributes and relationships:

  • Id (secondary base attribute)
  • Date (secondary base attribute)
  • Type (secondary base attribute)
  • Reference (secondary base attribute)
  • URI (secondary base attribute)
  • URI Type (secondary base attribute)
  • Result_Publication-Title (language field base)
  • Project-Result_Publication-Role (lookup tables)
  • Person-Result_Publication-Role (lookup tables)
  • Result_Publication-Type (lookup tables)
  • Project-Result_Publication (link tables)
  • Person-Result_Publication (link tables)
  • OrgUnit-Result_Publication (link tables)
  • Result_Publication-Classification (link tables)
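As a rough illustration of how the attributes and link tables listed above fit together, the sketch below models a Result_Publication and its Person link in Python. The Python field names, identifiers and role values are assumptions made for the example and are not taken from the CERIF specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ResultPublication:
    # Secondary base attributes from the list above.
    id: str
    date: Optional[str] = None
    type: Optional[str] = None
    reference: Optional[str] = None
    uri: Optional[str] = None
    uri_type: Optional[str] = None
    # Language field base: language code -> title in that language.
    titles: Dict[str, str] = field(default_factory=dict)


@dataclass
class PersonPublicationLink:
    # One row of a Person-Result_Publication link table.
    person_id: str
    publication_id: str
    role: str  # a value from the Person-Result_Publication-Role lookup table


def publications_by_person(links: List[PersonPublicationLink], person_id: str) -> List[str]:
    """Return the sorted ids of publications linked to one person."""
    return sorted({link.publication_id for link in links if link.person_id == person_id})


# Hypothetical records: two publications linked through the same Person.
pub1 = ResultPublication(id="pub1", titles={"en": "A paper", "nl": "Een artikel"})
pub2 = ResultPublication(id="pub2", titles={"en": "Another paper"})
links = [
    PersonPublicationLink("p1", "pub1", "author"),
    PersonPublicationLink("p1", "pub2", "editor"),
    PersonPublicationLink("p2", "pub2", "author"),
]
```

Here `publications_by_person(links, "p1")` returns both publication ids, showing how two publications are related through a shared base entity rather than directly.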

FRBR - Functional Requirements for Bibliographic Records (http://www.ifla.org/VII/s13/frbr/frbr.pdf) is concerned primarily with bibliographic records, and looks at the different entities that make up a single 'Work' (definition). These are Work, Expression, Manifestation and Item. Additionally, it identifies the entities outwith the Work: Person, Corporate Body, Concept, Object, Event, Place. The JISC-funded Eprints Application Profile described above has applied the primary entities of Scholarly Work (for Work), Expression, Manifestation and Copy (for Item), along with 'Agent' to capture both Person and Corporate Body; the other entities were not modelled in this application profile.
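The five entities used by the Eprints Application Profile can be pictured as a simple containment hierarchy. The following is a minimal sketch only: the field names, the example values and the `copies_of` helper are hypothetical and do not reproduce the profile's actual properties.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Agent:
    name: str  # stands in for both Person and Corporate Body


@dataclass
class Copy:
    uri: str  # where this copy can be retrieved


@dataclass
class Manifestation:
    media_format: str
    copies: List[Copy] = field(default_factory=list)


@dataclass
class Expression:
    version: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class ScholarlyWork:
    title: str
    creators: List[Agent] = field(default_factory=list)
    expressions: List[Expression] = field(default_factory=list)


def copies_of(work: ScholarlyWork) -> List[Tuple[str, str, str]]:
    """List every copy of a work as (version, format, uri)."""
    return [
        (expression.version, manifestation.media_format, copy.uri)
        for expression in work.expressions
        for manifestation in expression.manifestations
        for copy in manifestation.copies
    ]


# Hypothetical work with two versions, each held in a different place.
work = ScholarlyWork(
    title="An example paper",
    creators=[Agent("A. Author")],
    expressions=[
        Expression("accepted manuscript", [
            Manifestation("application/pdf", [Copy("http://repository.example.org/1/paper.pdf")]),
        ]),
        Expression("published version", [
            Manifestation("text/html", [Copy("http://publisher.example.com/paper")]),
        ]),
    ],
)
```

Because each copy hangs off a specific manifestation and expression, a service can tell which version of the work a given file represents.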

Although CERIF and FRBR each have a different focus, it is possible to extract useful key entities applicable to research papers. The primary entities concerning a scholarly work can be summarised as:

  • the papers / scholarly works (including the different versions and manifestations)
  • the people involved (creators, editors, other contributors)
  • the organisations associated (publishers, funders, author institutions etc.)

Additional entities could include: research projects for which those scholarly works are produced; the other data that a scholarly work might be related to (experimental data, images, other research papers, learning objects etc.); other resources (classification schemes, authority files, vocabularies etc.); repositories and services making available the metadata. The extent to which these entities and relationships are modelled depends on the functional requirements that the data model needs to meet and the level of complexity that the users of such a data model can sustain.

Metadata elements

When considering the metadata that is captured to describe and identify research papers, it is worth considering the different types of metadata required and also the possibilities for automated extraction and creation. Metadata used in particular application profiles may be drawn from a variety of sources, or mapped to various metadata schemas. Descriptive metadata is perhaps of most obvious interest when considering the description of research papers in repositories; it is used to describe resources in order to facilitate resource discovery and identification. Several types of administrative metadata may also be required. These provide the information necessary for managing, accessing and manipulating resources, such as technical details about file type, size and so on. Creation details, access information and rights management metadata could also usefully be considered as part of the administrative metadata. Preservation metadata, too, is necessary if a resource is to be adequately archived and preserved. The PREMIS data dictionary (http://www.loc.gov/standards/premis/) offers a set of core preservation metadata elements.

Vocabularies

The use of controlled vocabularies facilitates metadata interoperability and consistency. Examples of vocabularies range from large subject classification schemes like the Library of Congress, to small local term lists. One important area for using a standard vocabulary is resource types. The DCMI Type vocabulary (http://dublincore.org/documents/dcmi-type-vocabulary/) provides a general, cross-domain list of DCMI-approved terms. These terms are not specific enough for research papers, which would generally fall within the 'Text' type. Type vocabularies have been developed within the DARE, DEFF and JISC application profiles listed above.
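To illustrate why a finer-grained type vocabulary is needed, the sketch below checks a resource type against the DCMI Type vocabulary and against a local list of subtypes. The DCMI terms are those of the published vocabulary; the subtype names are hypothetical and are not taken from the DARE, DEFF or JISC profiles.

```python
from typing import Optional, Tuple

# The DCMI Type vocabulary terms.
DCMI_TYPES = frozenset({
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
})

# Hypothetical local subtypes, each refining a DCMI type.
LOCAL_SUBTYPES = {
    "JournalArticle": "Text",
    "ConferencePaper": "Text",
    "WorkingPaper": "Text",
}


def resolve_type(term: str) -> Tuple[str, Optional[str]]:
    """Return (dcmi_type, local_subtype) for a term, or raise ValueError."""
    if term in DCMI_TYPES:
        return term, None
    if term in LOCAL_SUBTYPES:
        return LOCAL_SUBTYPES[term], term
    raise ValueError(f"unknown resource type: {term!r}")
```

All three hypothetical subtypes collapse to 'Text' at the DCMI level, so a service that only sees the general type cannot distinguish a journal article from a working paper.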

Requirements and issues for discussion

Why do systems capture metadata? What do we want to do with that metadata? What are the issues and barriers to fulfilling these requirements?

Requirements / issues include:

  • Handling full-text
  • Capturing citations and references
  • Unambiguously identifying the full-text (or object)
  • Version identification
  • Metadata quality
  • Metadata consistency
  • Name authority and personal names; author identifiers
  • Unambiguously identifying different resource types

The Eprints Application Profile working group identified an extensive list of scenarios and functional requirements: http://www.ukoln.ac.uk/repositories/digirep/index/Functional_Requirements

Issues with CERIF

  • How widely CERIF has been used is unclear.
  • CERIF does not appear to facilitate linking Result_Publications with each other, or with other secondary base entities (e.g. product, funding programme, event or patent). With the exception of Classification, these can only be related via a primary base entity (e.g. two publications created by the same Person, or a publication presented at an event about a Project).
  • CERIF facilitates the capture of the Result_Publication title in multiple languages, but it does not capture the abstract at all.
  • CERIF does not provide sufficient metadata for institutional repositories. This is especially so for those wishing to expose their metadata over OAI-PMH. For this, oai_dc is a minimum and CERIF cannot provide a full oai_dc record. Most repositories would need to supplement any CERIF-compliant data.
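The gap between CERIF and oai_dc can be made concrete by mapping the Result_Publication attributes listed earlier onto the fifteen simple Dublin Core elements and seeing what is left over. The mapping below is an assumption made for illustration, not an agreed crosswalk, and other mappings are possible.

```python
from typing import Dict, List

# The fifteen elements of simple Dublin Core, as carried by oai_dc.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical crosswalk from Result_Publication attributes and links.
CERIF_TO_DC = {
    "Result_Publication-Title": "title",
    "Person-Result_Publication": "creator",
    "Result_Publication-Classification": "subject",
    "Date": "date",
    "Type": "type",
    "URI": "identifier",
}


def unmapped_dc_elements(mapping: Dict[str, str]) -> List[str]:
    """Return the Dublin Core elements that no source attribute maps to."""
    covered = set(mapping.values())
    return [element for element in DC_ELEMENTS if element not in covered]
```

Under this assumed mapping, `unmapped_dc_elements(CERIF_TO_DC)` includes `description` (where an abstract would normally go) and `rights`, which a repository would have to supply from another source.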

Issues with institutional repositories

  • Most IR metadata captured at the moment doesn't unambiguously identify the relationships between entities. This is a particular issue where multiple versions/expressions, manifestations or copies of a single scholarly work exist. The DCAP for scholarly works in the UK (also known as the Eprints Application Profile) has been based on an entity-relationship model to facilitate the capture of a set of key relationships.
  • Institutional repositories don't capture information about the wider research context in which a scholarly work has been produced. In the UK, many research councils are beginning to mandate the deposit of funded research publications into open access repositories. In this situation, repositories must be able to identify the research funder, at the very least.

Metadata exchange

For metadata exchange, most open access repositories make use of the OAI-PMH (http://www.openarchives.org/) protocol for metadata harvesting, making available simple Dublin Core in the oai_dc format mandated by the protocol and other locally agreed formats such as the Danish MXD format. Current work on the Object Reuse and Exchange initiative in the US will provide new specifications for the exchange and movement of digital objects across repositories and other systems (http://www.openarchives.org/ore/). Aggregators using OAI-PMH include BASE and OAIster; the European DRIVER and JISC Intute Repository Search projects will also make use of this protocol, at least initially. Google and Google Scholar are also significant services for metadata exposure.
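A harvester's side of this exchange can be sketched in a few lines of Python. The example builds an OAI-PMH ListRecords request for the oai_dc format and extracts titles from a response. The repository base URL and the sample record are hypothetical, and the sample response is trimmed to the parts needed here; a real response also carries elements such as responseDate and request.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request URL for a repository's OAI-PMH base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def dc_titles(response_xml: str) -> List[str]:
    """Collect every dc:title found in the records of a ListRecords response."""
    root = ET.fromstring(response_xml)
    titles = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        for title in record.iter(f"{{{DC_NS}}}title"):
            titles.append(title.text)
    return titles


# Trimmed, hypothetical response used in place of a live network call.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Research paper metadata</dc:title>
          <dc:creator>Allinson, Julie</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://repository.example.org/oai")
titles = dc_titles(SAMPLE_RESPONSE)
```

Requesting a different `metadata_prefix` is how a harvester would ask for a locally agreed format instead of oai_dc, provided the repository supports it.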