UKOLN AHDS Quality Assurance For Metadata



Introduction

Once you have decided to make use of metadata in your project, you need to agree on the functionality to be provided, the metadata standards to be used and the architecture for managing and deploying your metadata. However, this is not the end of the matter. You will also need appropriate quality assurance procedures to ensure that your metadata is fit for its purpose.

What Can Go Wrong?

There are a number of ways in which services based on metadata can go wrong, such as:

Incorrect content:
The content of the metadata may be incorrect or out-of-date. Metadata is even more likely to be out-of-date than normal content, because normal content is visible whereas metadata is not usually displayed on, say, a Web page, so errors go unnoticed. In addition, humans can tolerate errors and ambiguities in ways that the software tools which process metadata normally cannot.
Inconsistent content:
The metadata content may be inconsistent due to a lack of cataloguing rules and inconsistent approaches if multiple people are involved in creating metadata.
Non-interoperable content:
Even if metadata is consistent within a project, other projects may apply different cataloguing rules. For example the date 01/12/2003 could be interpreted as 1 December or 12 January if projects based in the UK and USA make assumptions about the date format.
Incorrect format:
The metadata may be stored in an invalid format. Although Web browsers are normally tolerant of HTML errors, formats such as XML insist on compliance with standards.
Errors with metadata management tools:
Metadata creation and management tools could output metadata in invalid formats.
Errors with the workflow process:
Data processed by metadata or other tools could become corrupted as it passes through the workflow. As a simple example, an MS Windows character such as © could be entered into a database and then output as an invalid character in an XML file.
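
The date ambiguity described under non-interoperable content above can be illustrated with a short sketch. It parses the same string under the UK and US conventions and prints each result in the unambiguous ISO 8601 form (YYYY-MM-DD). The variable names are illustrative only, and the choice of ISO 8601 as the exchange format is an assumption rather than a recommendation made in this document.

```python
from datetime import datetime

raw = "01/12/2003"

# The same string read under two different cataloguing assumptions.
uk = datetime.strptime(raw, "%d/%m/%Y").date()  # day first (UK)
us = datetime.strptime(raw, "%m/%d/%Y").date()  # month first (USA)

print(uk.isoformat())  # 2003-12-01
print(us.isoformat())  # 2003-01-12
```

Both interpretations parse without error, which is why this kind of problem is hard to detect once the metadata has been exchanged.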

QA For Metadata Content

You should have procedures to ensure that the metadata content is correct when it is created and is maintained as appropriate. This could involve agreeing cataloguing rules and providing mechanisms to ensure that the rules are followed (possibly enforced in software when the metadata is created). You may also need systematic procedures for periodic checking of the metadata.
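
As an illustration of enforcing cataloguing rules in software, the following minimal sketch checks a metadata record against two rules. The rules themselves (creators recorded as "Surname, Forename" and dates as YYYY-MM-DD), the field names and the function name check_record are all hypothetical; a real project would substitute its own agreed rules.

```python
import re
from datetime import date

# Hypothetical cataloguing rules, for illustration only.
CREATOR_PATTERN = re.compile(r"^[^,]+, [^,]+$")   # "Surname, Forename"
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # "YYYY-MM-DD"


def check_record(record):
    """Return a list of rule violations for one metadata record (a dict of strings)."""
    problems = []

    creator = record.get("creator", "")
    if not CREATOR_PATTERN.match(creator):
        problems.append(f"creator not in 'Surname, Forename' form: {creator!r}")

    value = record.get("date", "")
    if not DATE_PATTERN.match(value):
        problems.append(f"date not in YYYY-MM-DD form: {value!r}")
    else:
        try:
            date.fromisoformat(value)  # rejects impossible dates such as 2003-02-30
        except ValueError:
            problems.append(f"date is not a real calendar date: {value!r}")

    return problems


print(check_record({"creator": "Smith, John", "date": "2003-12-01"}))
print(check_record({"creator": "John Smith", "date": "01/12/2003"}))
```

The same function could be run when a record is created and again as part of a periodic check over the whole collection.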

QA For Metadata Formats

As metadata which is to be reused by other applications is increasingly being stored in XML, it is essential that the format is compliant (otherwise tools will not be able to process the metadata). XML compliance checking can be implemented fairly easily. It is more difficult to ensure that metadata makes use of appropriate XML schemas.
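
A basic XML compliance check of the kind mentioned above might look like the following sketch, which uses the parser in the Python standard library to test whether a document is well-formed. The function name and the sample records are illustrative. Note that this checks well-formedness only: the standard library parser does not validate against an XML schema, so schema checking would need a separate tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<record><title>Fish &amp; Chips</title></record>"
bad = "<record><title>Fish & Chips</title></record>"  # unescaped ampersand

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```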

QA For Metadata Tools

You should ensure that the output from metadata creation and management tools is compliant with appropriate standards. You should expect such tools to come with a rich set of test suites which validate their output across a wide range of environments. You will need to consider the same issues if you develop your own metadata management system.
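
One simple test of a tool's output, sketched here under the assumption that the tool embeds Dublin Core metadata in HTML meta elements with names of the form DC.Element, is to confirm that every such name refers to one of the fifteen elements of the Dublin Core Metadata Element Set. The class and function names are hypothetical, and qualifiers after the element name (such as DC.Date.Created) are not themselves checked.

```python
from html.parser import HTMLParser

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


class DCMetaChecker(HTMLParser):
    """Collect meta names that start with 'DC.' but name no known DC element."""

    def __init__(self):
        super().__init__()
        self.unknown = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        name = dict(attrs).get("name") or ""
        if name.lower().startswith("dc."):
            # Take the element part only, ignoring any qualifier after it.
            element = name[3:].split(".")[0].lower()
            if element not in DC_ELEMENTS:
                self.unknown.append(name)


def unknown_dc_names(html_text):
    """Return the DC.* meta names in html_text that are not valid DC elements."""
    checker = DCMetaChecker()
    checker.feed(html_text)
    checker.close()
    return checker.unknown


page = (
    '<head><meta name="DC.Creator" content="Smith, John">'
    '<meta name="DC.Author" content="Smith, John"></head>'
)
print(unknown_dc_names(page))  # ['DC.Author']
```

A check like this could be run over a sample of pages each time the tool or its templates are changed.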

QA For Metadata Workflow

You should ensure that metadata does not become corrupted as it flows through a workflow system.
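
Two checks that could be built into a workflow are sketched below. The first confirms that a value written to XML can be read back unchanged. The second detects bytes which are not valid UTF-8, as happens when text saved in the Windows-1252 encoding is passed on unconverted. The function names are illustrative, and the assumption that the workflow exchanges UTF-8 is ours rather than something stated in this document.

```python
import xml.etree.ElementTree as ET


def to_xml(title):
    """Serialise a title as UTF-8 XML; the library escapes &, < and >."""
    root = ET.Element("record")
    ET.SubElement(root, "title").text = title
    return ET.tostring(root, encoding="utf-8")


def round_trips(title):
    """Return True if the title read back from the XML equals the one written."""
    return ET.fromstring(to_xml(title)).find("title").text == title


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The copyright sign as a Windows-1252 system would store it: a single byte 0xA9.
legacy_bytes = "© 2003".encode("cp1252")

print(round_trips("£5 Fish & Chips © 2003"))  # True
print(is_valid_utf8(legacy_bytes))             # False
```

Running checks of this kind at each hand-over point in the workflow helps to locate the step at which corruption is introduced.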

A Fictitious Nightmare Scenario

A multimedia e-journal project is set up. Dublin Core metadata is used for articles which are published. Unfortunately there are no documented cataloguing rules and, due to a high staff turnover (staff are on short-term contracts), there are many inconsistencies in the metadata (John Smith and Smith, J.; University of Bath and Bath University; etc.).

The metadata is managed by a home-grown tool. Unfortunately the author metadata is output in HTML as DC.Author rather than DC.Creator. In addition the tool outputs the metadata in XHTML 1.0 format, which is then embedded in HTML 4.0 documents.

The metadata is created by hand and is not checked. This results in a large number of typos and use of characters which are not permitted in XML without further processing (e.g. £, — and &).

Rights metadata for images, which describes which images can be published freely and which are restricted to local use, becomes separated from the images during the workflow process.