Issues with current use of simple DC

From DigiRepWiki

Background

The ePrints UK project developed a set of guidelines for describing eprints using simple DC. The guidelines are available at http://www.rdn.ac.uk/projects/eprints-uk/docs/simpledc-guidelines/. The use of simple DC means that eprint metadata can be easily transferred using the OAI-PMH. However, there are some issues and problems for application developers caused by the limitations of simple DC.

This document analyses each of the recommendations in the guidelines and summarises those areas where there is felt to be a significant weakness in the metadata, caused by the use of simple DC.

Analysis

Note: properties marked with an asterisk are mandatory. All other properties are optional.

Each property below gives the current guidance first, followed by the issues/problems as a bulleted list.
dc:title (*)

The title of the eprint.

Preserve the original wording, order and spelling of the eprint title. Only capitalize proper nouns. Punctuation need not reflect the usage of the original. Subtitles should be separated from the title by a colon. For example:

  • <dc:title>Initial sequencing and analysis of the human genome</dc:title>
  • <dc:title>The new nationalism and the old history: perspectives on the West German Historikerstreit</dc:title>

If necessary, repeat this element for multiple titles.

  • Where multiple titles are provided, there is no way of determining the main title.
dc:creator (*)

An author of the eprint.

Personal names should be listed surname or family name first, followed by forename or given name or initial followed by a full stop. Separate the surname (or family name) from the forenames, given names or initials with a comma. Titles (Dr., Prof., etc.) should precede the forenames, generational suffixes (Jr., Sr., etc.) should follow the family name. When in doubt, give the name as it appears, and do not invert. For example:

  • <dc:creator>Sulston, John E.</dc:creator>
  • <dc:creator>Evans, R.J.</dc:creator>
  • <dc:creator>Ng, Tze Beng</dc:creator>
  • <dc:creator>Walker Jnr., Dr. John</dc:creator>

In the case of organizations where there is clearly a hierarchy present, list the parts of the hierarchy from largest to smallest, separated by full stops. If it is not clear whether there is a hierarchy present, or unclear which is the larger or smaller portion of the body, give the name as it appears in the eprint. For example:

  • <dc:creator>International Human Genome Sequencing Consortium</dc:creator>
  • <dc:creator>Loughborough University. Department of Computer Science</dc:creator>

Only encode organisations in this element to indicate corporate authorship, not to indicate the affiliation of an individual.

The inclusion of personal and corporate name headings from authority lists constructed according to AACR2 [8], e.g. the Library of Congress Name Authority File (LCNA), is also acceptable.

In cases of lesser responsibility, other than authorship, use dc:contributor. If the nature of the responsibility is ambiguous, recommended best practice is to use dc:publisher for organizations, and dc:creator for individuals.

If necessary, repeat this element for multiple authors.

  • Recipient of metadata has no knowledge that normalised form of name has been used. Therefore difficult to disambiguate different author names or combine different names for the same author.
  • There is no mechanism for providing the affiliation of the author(s).
  • There is no mechanism for indicating whether the author is a person or an organisation.
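Without an explicit indication, a recipient that needs to tell people from organisations is left with heuristics. The Python sketch below shows one such guess. It assumes, purely for illustration, that an inverted personal name always contains a comma; the function name guess_agent_kind is hypothetical and is not part of the guidelines.

```python
def guess_agent_kind(name):
    """Guess whether a dc:creator value names a person or an organisation.

    Assumption (not in the guidelines): a comma means an inverted personal
    name ("Surname, Forename"); anything else is treated as an organisation.
    Simple DC carries no type information, so this can only be a guess.
    """
    return "person" if "," in name else "organisation"


for value in (
    "Sulston, John E.",
    "Loughborough University. Department of Computer Science",
):
    print(value, "->", guess_agent_kind(value))
```

The guess fails in both directions: a corporate name containing a comma, such as "John Wiley & Sons, Inc. (US)", is taken for a person, and a personal name given as it appears (not inverted) is taken for an organisation.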
dc:subject (*)

The topic of the eprint.

In general, choose the most significant and unique words for keywords, avoiding those too general to describe a particular eprint. If the subject of the eprint is a person or an organization, use the same form of the name as you would if the person or organization were an author, but do not repeat the name in the dc:creator element.

For free-text keywords either encode multiple terms with a semi-colon separating each keyword; or repeat the element for each term. There are no requirements regarding the capitalization of keywords though internal (within archive) consistency is recommended.

Where terms are taken from a standard classification scheme: encode each term in a separate element. Encode the complete subject descriptor according to the relevant scheme. Use the capitalisation and punctuation used in the original scheme. Where subject terms are taken from LCSH, the subfields of the subject heading should be separated by double dash (--) and spaces should be omitted. For example (using free-text keywords and LCSH):

  • <dc:subject>polar oceanography; boundary current; mass transport; water masses; halocline; mesoscale eddies</dc:subject>
  • <dc:subject>World War, 1939-1945--Germany</dc:subject>
  • <dc:subject>Germany--History--1933-1945</dc:subject>
  • <dc:subject>Hitler, Adolf, 1889-1945</dc:subject>
  • Recipient of metadata has no knowledge about whether terms have been taken from controlled vocabularies. Therefore difficult to build browse interfaces based on knowledge of vocabulary hierarchies/relationships.
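As an illustration of this issue, the following Python sketch shows what a harvesting application might have to do with dc:subject values. The function name split_subjects and the labels it returns are hypothetical, and the rule that a double dash signals an LCSH heading is an assumption made for the example, not something the guidelines state.

```python
def split_subjects(values):
    """Flatten dc:subject values into (term, guessed_scheme) pairs.

    Assumption: a value containing "--" is guessed to be an LCSH-style
    heading and kept whole; any other value is treated as free-text
    keywords separated by semi-colons. Simple DC does not say which
    scheme (if any) a term comes from, so the labels are guesses.
    """
    terms = []
    for value in values:
        if "--" in value:
            terms.append((value.strip(), "possibly-LCSH"))
        else:
            for keyword in value.split(";"):
                keyword = keyword.strip()
                if keyword:
                    terms.append((keyword, "unknown"))
    return terms


print(split_subjects([
    "polar oceanography; boundary current",
    "Germany--History--1933-1945",
]))
```

A controlled heading with no double dash, such as a personal-name heading, falls into the "unknown" group, which is why a browse interface cannot rely on this kind of inference.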
dc:description (*)

A summary of the content of the eprint, typically in the form of an abstract.

  • None.
dc:publisher (*)

Eprint-specific Recommendation:

The publisher of the eprint, typically either the author's institution or a commercial publisher.

In the case of organizations where there is clearly a hierarchy present, list the parts of the hierarchy from largest to smallest, separated by full stops. If it is not clear whether there is a hierarchy present, or unclear which is the larger or smaller portion of the body, give the name as it appears in the eprint. For example:

  • <dc:publisher>Loughborough University. Department of Computer Science</dc:publisher>
  • <dc:publisher>University of Cambridge. Department of Earth Sciences</dc:publisher>
  • <dc:publisher>University of Oxford. Museum of the History of Science</dc:publisher>
  • <dc:publisher>University of Reading. Rural History Centre</dc:publisher>
  • <dc:publisher>University of Exeter. Institute of Cornish Studies</dc:publisher>
  • <dc:publisher>European Bioinformatics Institute</dc:publisher>
  • <dc:publisher>John Wiley & Sons, Inc. (US)</dc:publisher>

Personal names should be listed surname or family name first, followed by forename or given name or initial followed by a full stop. Separate the surname (or family name) from the forenames, given names or initials with a comma. Titles (Dr., Prof., etc.) should precede the forenames, generational suffixes (Jr., Sr., etc.) should follow the family name. When in doubt, give the name as it appears, and do not invert. For example:

  • <dc:publisher>Sulston, John E.</dc:publisher>
  • <dc:publisher>Evans, R.J.</dc:publisher>
  • <dc:publisher>Ng, Tze Beng</dc:publisher>
  • <dc:publisher>Walker Jnr., Dr. John</dc:publisher>

The inclusion of personal and corporate name headings from authority lists constructed according to AACR2 [8], e.g. the Library of Congress Name Authority File (LCNA), is also acceptable.

  • Recipient of metadata has no knowledge that normalised form of name has been used. Therefore difficult to disambiguate different publisher names or combine different names for the same publisher.
  • There is no mechanism for indicating whether the publisher is a person or an organisation.
dc:contributor

A contributor to the eprint (but not one of the primary authors). For example, a supervisor, editor, technician or data collector.

Personal names should be listed surname or family name first, followed by forename or given name or initial followed by a full stop. Separate the surname (or family name) from the forenames, given names or initials with a comma. Titles (Dr., Prof., etc.) should precede the forenames, generational suffixes (Jr., Sr., etc.) should follow the family name. When in doubt, give the name as it appears, and do not invert. For example:

  • <dc:contributor>Sulston, John E.</dc:contributor>
  • <dc:contributor>Evans, R.J.</dc:contributor>
  • <dc:contributor>Ng, Tze Beng</dc:contributor>
  • <dc:contributor>Walker Jnr., Dr. John</dc:contributor>

In the case of organizations where there is clearly a hierarchy present, list the parts of the hierarchy from largest to smallest, separated by full stops. If it is not clear whether there is a hierarchy present, or unclear which is the larger or smaller portion of the body, give the name as it appears in the eprint. For example:

  • <dc:contributor>International Human Genome Sequencing Consortium</dc:contributor>
  • <dc:contributor>Loughborough University. Department of Computer Science</dc:contributor>

Only encode organisations in this element to indicate a corporate contribution, not to indicate the affiliation of an individual.

The inclusion of personal and corporate name headings from authority lists constructed according to AACR2 [8], e.g. the Library of Congress Name Authority File (LCNA), is also acceptable.

  • Recipient of metadata has no knowledge that normalised form of name has been used. Therefore difficult to disambiguate different contributor names or combine different names for the same contributor.
  • There is no mechanism for indicating whether the contributor is a person or an organisation.
  • There is no mechanism for recording the nature of the contribution made by the contributor (editor, illustrator, etc.).
dc:date (*)

The 'last-modified' date of the eprint and/or the date of its accession into the archive.

The date should be formatted according to the W3C encoding rules for dates and times [9] (a profile based on ISO 8601 known as W3C-DTF), for example:

  • <dc:date>2000-12-25</dc:date>
  • <dc:date>1999</dc:date>
  • <dc:date>2003-01</dc:date>

If necessary, repeat this element to provide both the last-modified date and the date of accession. The last-modified date will be assumed to be the more recent of the two dates. If only one date is provided, it will be assumed that the last-modified date and the date of accession are the same.

  • Recipient of metadata has no knowledge about what kind of date is being provided or how the date is formatted. Therefore difficult to make any reliable use of the date in user-interface or other applications.
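The following Python sketch shows how a recipient might apply the rule above (the last-modified date is the more recent of the dates supplied). It is a minimal sketch under stated assumptions: it handles only the three date-only W3C-DTF forms used in the examples (YYYY, YYYY-MM, YYYY-MM-DD), does not check that month and day values are in range, pads missing months and days with 1 for comparison, and silently skips anything else. The function names are hypothetical.

```python
import re

# Date-only W3C-DTF forms from the examples: YYYY, YYYY-MM, YYYY-MM-DD.
_W3CDTF_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_w3cdtf_date(value):
    """Return (year, month, day), or None if the value is not recognised.

    Assumption: a missing month or day is padded with 1 so that values
    of different precision can be compared.
    """
    match = _W3CDTF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day = match.groups()
    return (int(year), int(month or 1), int(day or 1))


def guess_last_modified(dc_dates):
    """Return the most recent recognisable date, or None if there is none."""
    parsed = [parse_w3cdtf_date(value) for value in dc_dates]
    parsed = [date for date in parsed if date is not None]
    return max(parsed) if parsed else None


print(guess_last_modified(["2000-12-25", "2003-01"]))
```

Because nothing in the record says that a value follows W3C-DTF, a date in any other format is simply dropped here, and the recipient still cannot tell which of the remaining dates is the accession date.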
dc:type (*)

The type of eprint.

Recommended best practice is to take the value of this element from the following list:

  • Book
  • BookChapter
  • ConferenceProceedings
  • ConferencePaper
  • ConferencePoster
  • InCollection
  • TechnicalReport
  • OnlineJournalArticle
  • JournalArticle
  • NewsArticle
  • Other
  • Preprint
  • Thesis

For example:

  • <dc:type>JournalArticle</dc:type>

If necessary, repeat this element to encode multiple types.

If necessary, repeat this element to indicate the peer-reviewed status of the eprint, using one of the following values:

  • PeerReviewed
  • NonPeerReviewed

For example:

  • <dc:type>PeerReviewed</dc:type>
  • Recipient of metadata has no knowledge that the value has explicitly been taken from the controlled lists provided here and is therefore only able to infer (i.e. guess) that the originator system's use of, say, 'Preprint' corresponds to the use suggested in the guidelines.
  • The use of dc:type to carry 'status' information somewhat stretches the semantics of the property.
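Because type and status share one property, a recipient has to separate them by matching against the two lists. The Python sketch below does this with exact string matching; the function name partition_types and the shape of its result are hypothetical, and a match only suggests, rather than proves, that the originator followed the guidelines.

```python
# Values copied from the two lists in the guidance.
EPRINT_TYPES = {
    "Book", "BookChapter", "ConferenceProceedings", "ConferencePaper",
    "ConferencePoster", "InCollection", "TechnicalReport",
    "OnlineJournalArticle", "JournalArticle", "NewsArticle", "Other",
    "Preprint", "Thesis",
}
PEER_REVIEW_STATUSES = {"PeerReviewed", "NonPeerReviewed"}


def partition_types(dc_types):
    """Split repeated dc:type values into types, status and the rest."""
    result = {"types": [], "status": [], "unrecognised": []}
    for value in dc_types:
        if value in PEER_REVIEW_STATUSES:
            result["status"].append(value)
        elif value in EPRINT_TYPES:
            result["types"].append(value)
        else:
            # Anything else, including case variants such as "preprint".
            result["unrecognised"].append(value)
    return result


print(partition_types(["JournalArticle", "PeerReviewed", "preprint"]))
```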
dc:format

The media-type of the eprint.

Recommended best practice is to select a term from the IANA registered list of Internet Media Types (MIME types) [10]. For example:

  • <dc:format>application/pdf</dc:format>

Repeat this element if the eprint is available in multiple formats.

  • Recipient of metadata has no explicit knowledge that a MIME type is being provided.
  • Not clear what is being described here (1:1 problem). If the work is being described then use of dc:format is incorrect. If a single manifestation is being described, then dc:format shouldn't be repeated.
dc:identifier (*)

Eprint-specific Recommendation:

A URI or bibliographic citation for the eprint, typically the URI of the 'jump-off page' for the eprint, as served by the archive. For example:

If possible, repeat this element to provide a full bibliographic citation for the eprint. For example:

  • <dc:identifier>Heery, R. (2000). "Information gateways: collaboration on content."
    Online Information Review, 24 (1), 40-45.</dc:identifier>

If possible, also repeat this element to provide an OpenURL [11] for the eprint, using the form below. For example:

  • <dc:identifier>openurl:?sid=ukoln:&genre=article&sid=ukoln:
    &atitle=Information%20gateways:%20collaboration%20on%20content
    &title=Online%20Information%20Review&issn=1468-4527&volume=24&spage=40
    &epage=45&artnum=1&aulast=Heery&aufirst=Rachel</dc:identifier>

(Note that lines in these two examples have been wrapped for readability.)

  • Recipient of metadata has no explicit knowledge that a URI is being provided. Nor is it particularly clear whether the 'work' or a 'manifestation' of the work is being identified.
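Since dc:identifier may carry a URI, an OpenURL or a citation, a recipient has to inspect each value to decide which it has. The Python sketch below is one hedged way of doing so. The function name classify_identifier, the three labels and the short list of accepted URI schemes are assumptions made for the example; the "openurl:" prefix is taken from the form shown above.

```python
from urllib.parse import urlparse


def classify_identifier(value):
    """Guess whether a dc:identifier value is an OpenURL, a URI or a citation.

    Assumption: only http, https and ftp URIs with a host are treated as
    URIs; everything that is neither that nor an "openurl:" value is
    assumed to be a bibliographic citation.
    """
    text = value.strip()
    if text.startswith("openurl:"):
        return "openurl"
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https", "ftp") and parsed.netloc:
        return "uri"
    return "citation"


print(classify_identifier(
    "http://www.rdn.ac.uk/projects/eprints-uk/docs/simpledc-guidelines/"
))
```

Even when a value is recognised as a URI, the sketch cannot say whether that URI identifies the work or one manifestation of it.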
dc:source

The URI, title or bibliographic citation for a resource from which the eprint is derived.

In general, this element should not be used.

  • Where this property is used, the recipient of metadata has no explicit knowledge about whether a URI or title or bibliographic citation is being provided.
dc:language (*)

Eprint-specific Recommendation:

The language in which the eprint is written.

Use the language codes defined in RFC 3066 [12], for example:

  • <dc:language>en-GB</dc:language>

If necessary, repeat this element to indicate multiple languages.

  • Recipient of metadata has no explicit knowledge that an RFC 3066 language tag is being provided.
dc:relation (*)

The URI of each available format of the eprint. If necessary, repeat this element for multiple formats. Also repeat this element if the eprint is available from other locations, for example from the publisher's Web site. For example:

  • Recipient of metadata has no explicit knowledge that a URI is being provided. Nor is there any indication about the relationship between the eprint and the related resource. For example, in some cases the relationship will be 'isInstanceOf' but in others it could be 'isCitedBy'.
dc:coverage

The geographic location or temporal period that the eprint is about.

Recommended best practice is to select the value from a controlled vocabulary (for example, the Getty Thesaurus of Geographic Names [13] or TGN) and that, where appropriate, named places or time periods be used in preference to numeric identifiers such as sets of co-ordinates or date ranges.

If necessary, repeat this element to encode multiple locations or periods.

  • Recipient of metadata has no explicit knowledge that a term taken from the TGN has been used.
  • There is no mechanism for indicating whether coverage is spatial or temporal.
  • Where coverage is temporal, there is no explicitly agreed mechanism for recording dates and date ranges.
dc:rights

A human-readable statement about the rights held in and over the eprint, the URI of a Creative Commons [14] licence or the URI of a machine-readable statement. For example:

  • Recipient of metadata has no explicit knowledge about whether a human readable statement or a URI is being provided.

Summary

A summary of the major issues follows:

  • It is difficult to differentiate ‘works/expressions’ from ‘manifestations/items’
    • therefore difficult to use metadata as basis for bringing together information about different manifestations of the same work/expression, e.g. for citation analysis purposes
  • In particular, it is difficult to determine whether dc:identifier is being used to identify the work/expression or a particular manifestation/item of the work. In the ePrints UK Guidelines for using Simple DC to describe ePrints, dc:identifier is used to identify the ‘work/expression’ and dc:relation is used to identify the ‘manifestation/item’. However, dc:relation may also be used for other resources (e.g. cited works), so there is ambiguity in the metadata record. In any case, the guidelines are not widely implemented
    • therefore difficult for software applications to move reliably from the metadata record to the full text.
  • It is not possible to determine whether subject terms are taken from a controlled vocabulary or not (e.g. is ‘Physics’ a free-text keyword or a term taken from Dewey?)
    • therefore difficult to base subject-browse interfaces on controlled vocabulary hierarchy.
  • It is not possible to disambiguate authors with same name or reconcile instances of the same author being given different forms of name
    • therefore difficult to build browse-by-author type interfaces.
  • Dates are ambiguous (either because of formatting and/or because type of date is not known)
    • therefore difficult for software applications to make decisions based on dates in the metadata.