A review of metadata: a survey of current resource description formats

Work Package 3 of Telematics for Research project DESIRE (RE 1004)

Peer Review

DESIRE: Peer Review Report
Project Number:	RE 1004 (RE)
Project Title:	DESIRE - Development of a European Service for Information on Research and Education
Deliverable Number:	D3.2
Deliverable Title:	Specification for resource description methods: a review of metadata: a survey of current resource description formats
Review Method:	Report Reading
Principal Reviewer:	Name	Tony Gill
	Address	Surrey Institute of Art & Design, Farnham, Surrey GU9 7DS, UK
	E-Mail	tony@adam.ac.uk
	Telephone	+44 (0)1252 722441
	Fax	+44 (0)1252 712925
	Credentials	Programme Leader: ADAM & VADS. The Art, Design, Architecture & Media Information Gateway (ADAM) is an Access to Network Resources project of the Electronic Libraries programme. The Visual Arts Data Service will store curated visual arts resources and resource descriptions.

Summary:	Relevant	5 (1 = poor, 5 = excellent)
	State-of-Art	4
	Meets Objectives	4
	Clarity	3
	Value to Users	5
Specific Criticisms	1	Small number of unsubstantiated assertions made
	2	Small number of excessive generalisations made
	3	Some terminology used without adequate definition
	4	Terms associated with specific metadata formats used inappropriately
Developer Response:	1	(developer's response given to general comments below)
	2
	3
	4

Peer review report: general comments from Tony Gill

(Within this section Developer responses are italicised)

The Survey document attempts to provide background information about the pertinent issues to consider when selecting a metadata format for implementation, and consistently structured outline descriptions of significant metadata standards initiatives to date.

The document is split into two main sections; Part I is a discursive overview of metadata and the general issues relating to the description of networked resources for a variety of purposes, whereas Part II provides a more structured directory-style description of the key metadata initiatives worldwide.

Part I

Part I provides a generally coherent and accurate summary of the issues, although it is somewhat terse in places, with certain passages assuming a high degree of prior knowledge on the part of the reader (see Clarity, below). There are also a small number of generalisations and unsubstantiated assertions that, whilst not necessarily disputed by this reviewer, possibly warranted more detailed discussion. For example:

"It is unlikely that some monolithic metadata format will be universally used. This is for a number of more or less well known reasons." (page 5)

Some brief explanation of these reasons would be helpful.

(Explanation now included.)

"Newer approaches based on manual descriptions have initially tended to focus on servers, and not describe particular information objects on those servers [..]. The subject information gateways fall into this category." (page 6)

The scope of the term 'subject information gateways' should be defined in this context before making this type of generalisation, since there undoubtedly exist subject-based information services that do not fall into the category as described.

(Definition now included.)

There is also some apparent inconsistency in the discussion of the three-band model for classifying metadata formats. Band one, a conceptual class of metadata format postulated in the review, is described as being "relatively unstructured data, typically automatically extracted from resources and indexed for searching." The assertion that "the data has little explicit semantics and does not support searching by field" sits uneasily with the statement that "there are moves to develop a shared format for exchange, perhaps based on SOIF", since the Summary Object Interchange Format "is based on simple attribute-value pair elements" and should therefore support searching by field.

(Wording has been changed in the text to aid clarity. The reviewer has perhaps not taken into account the flexibility of SOIF, which can be used for records with very little structure. In addition, 'searching by field' implies some level of delineation of the semantic content of a record over and above the two or three attribute-value pairs that would be typical of a Band One record.)
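
The disagreement above can be made concrete with a small sketch. It assumes a deliberately simplified model in which each record is a plain mapping of attribute names to values; this is not SOIF syntax, and the attribute names, example URLs and the `search_by_field` helper are hypothetical. It suggests that attribute-value pairs make fielded search mechanically possible, while a record carrying only two or three coarse attributes gives that search little to work with.

```python
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical Band One style records: very few attributes, mostly raw
# text extracted automatically from the resource (not real SOIF syntax).
records: List[Record] = [
    {"url": "http://example.org/a",
     "full-text": "Survey of metadata formats by J. Smith"},
    {"url": "http://example.org/b",
     "full-text": "Smith and Sons engineering catalogue"},
]


def search_by_field(recs: List[Record], field: str, term: str) -> List[Record]:
    """Return the records whose named attribute contains the term."""
    needle = term.lower()
    return [r for r in recs if needle in r.get(field, "").lower()]


# A fielded search works mechanically on any attribute that exists...
by_text = search_by_field(records, "full-text", "smith")

# ...but with no 'author' attribute the search cannot tell a person
# named Smith from a company named Smith.
by_author = search_by_field(records, "author", "smith")
```

Under these assumptions `by_text` matches both records and `by_author` matches none, which is consistent with both positions: the format permits searching by field, but a sparsely structured record offers few meaningful fields to search.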

The three-band model creates additional difficulties, since some of the other formats do not conform well to the defining characteristics of their class; for example, Alta Vista, a popular web crawler using a metadata format of the type described in band one, supports limited searching by field using HTML tags inherent in the resource itself. Similarly, Dublin Core records do not fully conform to the description of band two metadata formats, since they offer a relatively straightforward mechanism for describing relationships between objects. Overall, the three-band model appears to be somewhat artificial, and does not appear to add much value or clarity to the discussion.

(Any one format may not have all the characteristics of the band in which it is placed, and a note to this effect has been added to the text. In a number of discussions this grouping has proved beneficial in identifying the differences and similarities between formats.)
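
The kind of limited fielded search the reviewer attributes to web crawlers can be sketched as follows. This is an illustration only, using Python's standard `html.parser` module; the page contents, the example URLs and the `title_search` helper are hypothetical, and nothing here describes how Alta Vista is actually implemented. The point is that markup already present in a resource, such as the HTML title element, can act as a field without any separately created metadata record.

```python
from html.parser import HTMLParser
from typing import Dict, List


class TitleExtractor(HTMLParser):
    """Collect the text found inside the <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def title_search(pages: Dict[str, str], term: str) -> List[str]:
    """Return URLs of pages whose title (not body text) contains the term."""
    needle = term.lower()
    return [url for url, html in pages.items()
            if needle in extract_title(html).lower()]


# Hypothetical pages: both mention 'metadata', only one does so in its title.
pages = {
    "http://example.org/a": "<html><head><title>Metadata survey</title></head>"
                            "<body>Dublin Core and others</body></html>",
    "http://example.org/b": "<html><head><title>Engineering links</title></head>"
                            "<body>A metadata directory</body></html>",
}

hits = title_search(pages, "metadata")
```

Here `hits` contains only the first URL, because the search is restricted to the title element rather than the full text of each page.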

Taken as a whole, however, the Overview is an accurate, concise and useful introduction to the pertinent issues.

Part II

The use of a consistent structure across each entry in Part II, the review of metadata formats, enables comparisons between diverse metadata formats to be made, and the structure itself provides a sensible and clear description of each format in the context of the broader issues of resource description as outlined in Part I.

The descriptions of each metadata format generally provide a good synthesis between an analysis of the format, and discussion of the broader factors affecting the development of networked information discovery and retrieval initiatives. The Implementations section in particular is useful for ascertaining which formats are attracting interest from the influential web browser developer community.

Comments on individual sections are below:

CIMI

Caution should be taken when equating OSI (a framework for describing communication protocol layers) and TCP/IP (a family of communication protocols).

(Ambiguity now removed from text.)

The CHIO demonstrator requires the use of an SGML browser such as Panorama, in addition to a generic web browser, in order to view the SGML-encoded documents.

(Information added to text.)

Dublin Core

The Conversion to other formats section could be updated to include a reference to the DC/USMARC crosswalk exercise.

EAD

'Hand lists' (in the museum and archive sense at least) are not equivalent to detailed catalogues, but are more akin to inventory lists.

(Text changed.)

EEVL/EELS

The descriptions of EELS and of EEVL both refer to the absence of alternative formats for use by the engineering community, yet no cross-reference is made between the EELS and EEVL entries.

(Cross-referencing now included.)

FGDC

Describing mSQL as a search engine is potentially misleading; it is in fact a freely-available relational database management system.

(The text has been amended.)

TEI Headers

The assertion that the inherent flexibility of the TEI Headers "might well lead to difficulties" could usefully be elaborated upon by examples of the type of difficulties that could be encountered as a result.

(The original comments on the implications for interoperability and distributed record creation have been elaborated.)

Clarity

The style of writing throughout both parts of the document is necessarily technical in nature, with acronyms and often-obscure references scattered liberally amongst the prose; since no guidelines about the intended audience for the document were supplied, it has been assumed throughout this review that the document is aimed at a reasonably technical audience with some prior knowledge of the issues pertaining to information retrieval in the network environment.

The multiple authorship of the document occasionally results in noticeable changes in the prose style from section to section. This has a marginal impact on the clarity of the document as a whole.

The liberal inclusion of URLs throughout, while slightly detrimental to the clarity of the document in paper form, allows it to be employed as a useful starting point for more in-depth study, and reflects its dual role as both a traditional paper document and an electronic resource (arguably the more useful of the two, in view of the hyperlinking capability).

(HTML versions of the document have been made available as it has evolved.)

A more serious barrier to clarity is created by the occasional use of terminology associated with a particular metadata format to describe another format; the most common examples are the misleading use of the term Template to refer to records, a practice that has developed amongst the ROADS/IAFA/WHOIS++ community (pages 73, 75), and the phrase Document-Like Objects (pages 44, 45, 84), coined and only loosely defined by example in the Dublin Core initiative and not defined in the document under review.

(Different communities tend to use different terminology and this is certainly the case with metadata. For example templates, schemas, formats are used to refer to the 'format' of a record by different communities. The reviewer refers to the SOIF section where indeed the Harvest User manual does make use of the term 'template' to reference both format and record. Wherever possible ambiguity has been removed in the text, but there will inevitably be some borrowing of terminology amongst authors who come from different communities themselves.)

Technical slang is also occasionally employed, for example vanilla ASCII (page 55), on-the-wire format (page 55). These should not, however, present much hindrance to understanding for a technical reader.

(Where it proves useful and enlivens the style technical slang has been allowed to remain.)

It would also be helpful for the term use chain to be defined.

(The meaning of this term can be gleaned from the context. It is a phrase in current use in the field.)

A glossary of acronyms, and possibly some technical terminology, would greatly increase the clarity and potential audience of the document, should this be considered worthwhile.

(We will consider adding a glossary as part of further project work on resource description.)

A small number of typographical errors, listed as an appendix to this review, were spotted during the review process.

Conclusion

Documents of this nature are extremely difficult to compile and present clearly, since the requisite information, which must be collected from sources throughout the world, assimilated and reorganised, is almost immediately out of date in such a rapidly-evolving field.

Nonetheless, this Survey is a valuable and timely attempt to provide a coherent overview of the current state of the art of networked resource description, providing as it does a reasonably detailed and consistently structured account of the majority of the significant metadata initiatives taking place globally.

The Survey's usefulness is significantly enhanced by its publication as an electronic resource, allowing the user to carry out more in-depth research by following hyperlinks to detailed information about individual initiatives and formats.

This document is almost certainly the most comprehensive (and for the time being at least the most current) introduction to the diverse metadata formats currently in existence.

Page maintained by: UKOLN Metadata Group
Last updated: 10-Jun-1998