The Development of Metadata for the Census Resource Discovery System

Background

The UK Data Archive at the University of Essex is one of the partners within the JISC-funded Collection of Historical and Contemporary Census Data and Related Materials (CHCC) project [1]. The project, led by MIMAS at the University of Manchester, runs from October 2000 to September 2003.

The central aim of the project is to increase use of the CHCC in learning and teaching. It is doing this by: improving accessibility to the primary data resources; developing an integrated set of learning and teaching materials; improving awareness about the contexts in which census data can be used in learning and teaching; integrating contextual materials; providing access to web-based data exploration/visualisation tools; and developing resource discovery tools.

The UK Data Archive's role has been to develop this last output, a Census Resource Discovery System (initially and temporarily entitled a 'Census Portal'), which will allow both the primary census data and the CHCC-created related learning and teaching materials to be searched and browsed.

Finally, by way of introduction, it should be noted that, owing to staff changes, Lucy Bell took over the project management of the Census Resource Discovery System (CRDS) only at the end of its development. The majority of the work described below was carried out between 2001 and 2003 by colleagues within the UK Data Archive and the History Data Service: Cressida Chappell, Steve Warin and Amanda Closier.

The Metadata Issues

Schema

As the Census Resource Discovery System (CRDS) was intended to index two very different sorts of resource - primary data and teaching and learning materials - much initial work, before the start of the project, was put into identifying which metadata scheme should be used. It was not possible to index all the materials to a fine enough degree using a single scheme; therefore, the DDI (Data Documentation Initiative) Codebook [2] was used for the data and the IMS Learning Resource Metadata Specification [3] for the learning and teaching materials.

Both schemas were analysed and a CHCC Application Profile was created for each. An initial problem encountered in the first six months of the project was that the extensions to the DDI had not been finalised by the time they were required for the development work on the CRDS. This delayed the development of the Metadata Entry System (MES); however, the work to set up the MES for the learning and teaching materials went ahead as planned.

The MES is a 'behind-the-scenes' tool, written in Visual FoxPro 7, created so that the metadata which form the CRDS records can be entered remotely into the database. Other CHCC project staff have been sent copies of the MES on CD, which they have installed locally on their PCs and used to enter the metadata. The completed records are automatically sent to the database, where they become live the following day and can then be searched and browsed by users via the CRDS web site.

Working with two schemas has meant designing a MES which is suitable for either sort of resource. It has also meant identifying the related fields within each profile and mapping them to each other, for the purposes of search, browse and display. The MES can be set up so that the appropriate scheme is used once the metadata creator selects 'data' or 'learning and teaching resource' at the start, but users still need to be able to search across all the resources, no matter which scheme has been used to catalogue them.
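The mapping of related fields can be pictured as a simple crosswalk from a shared set of search fields to the corresponding field in each scheme. The following is a minimal sketch only: the field paths are simplified illustrations rather than the actual CHCC Application Profiles, the two sample records are invented, and the function names (to_common, search) are hypothetical.

```python
# Illustrative crosswalk: common search field -> field path within each scheme.
# The paths are simplified examples, not the actual CHCC Application Profiles.
CROSSWALK = {
    "title": {"ddi": "stdyDscr/citation/titlStmt/titl", "lom": "general/title"},
    "description": {"ddi": "stdyDscr/stdyInfo/abstract", "lom": "general/description"},
    "keyword": {"ddi": "stdyDscr/stdyInfo/subject/keyword", "lom": "general/keyword"},
}


def to_common(record):
    """Map a scheme-specific record onto the shared search fields."""
    scheme = record["scheme"]  # "ddi" for data, "lom" for learning materials
    fields = record["fields"]
    return {
        common: fields.get(paths[scheme], "")
        for common, paths in CROSSWALK.items()
    }


def search(records, field, term):
    """Return records whose common field contains the term, whatever the scheme."""
    term = term.lower()
    return [r for r in records if term in to_common(r)[field].lower()]


# Invented sample records, one per scheme.
records = [
    {"scheme": "ddi", "fields": {
        "stdyDscr/citation/titlStmt/titl": "1991 Census: Small Area Statistics",
        "stdyDscr/stdyInfo/subject/keyword": "census; population",
    }},
    {"scheme": "lom", "fields": {
        "general/title": "Introduction to census geography",
        "general/keyword": "census; geography",
    }},
]

hits = search(records, "title", "census")
```

Because every query goes through the common fields, a single search on 'title' returns both the data record and the learning resource, which is the behaviour users need.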

This work was undertaken at the end of 2001 and during much of 2002. Near the end of the first phase of the project, when these essential preparatory tasks should have been completed, the second of the MES-related obstacles was hit: it was discovered that the IMS specification was likely to be superseded by an eLearning standard, the IEEE Learning Object Metadata (IEEE LOM) [4]. The team working on the CRDS had to move fast to ensure that the system was kept as up-to-date as possible in light of these changes.

Keywords

Another key task was the identification of the most appropriate thesaurus to use as the controlled vocabulary for the system. It was essential to find an appropriately specific list of keywords for indexing all the metadata records within the database. The list would be employed by several project staff in diverse locations, all entering their own metadata, so it needed to be of a manageable size while still containing all the terms which might be required.

Three thesauri were on offer as likely candidates: the Humanities and Social Science Electronic Thesaurus (HASSET) [5], the European Language Social Science Thesaurus (ELSST) [6] and the Social Research Methodology thesaurus (SRM) [7]. The third issue to cause a metadata-related delay was the need for the project team to reach a consensus on which thesaurus to use.

The Approaches Taken

Schema

Although some staff members had already left (as the project was reaching the conclusion of its first phase), it was decided to upgrade from IMS to IEEE LOM. The JISC agreed to fund a short-term extension of four months. During this time, as well as incorporating OAI harvesting and setting the system up as a Z39.50 target, the changes between the two schemas were to be analysed and technical alterations made to both the MES and the web site's search and browse functionality. This work is now halfway through. The profile used has been the UK Common Metadata Framework (UKCMF) [8]. The current major task is to update the guidelines used by those entering metadata, to ensure that they correspond exactly with what is seen on the screen.

The biggest headache during the upgrade has been the application of the appropriate XML bindings. At first, it was thought that the system might have to use the IMS bindings, as an IEEE LOM set was not yet available and the IMS XML was considered to be similar to that of the IEEE LOM. Following the release of the draft IEEE LOM bindings, however, it was decided that it would be more appropriate to use these. The work to complete the CRDS has to be finished before the bindings are finalised; nonetheless, it still seems more sensible to apply what will be the eventual schema rather than one which may become obsolete. The XML is being applied using Java Architecture for XML Binding (JAXB) [9]. With the IEEE LOM XML this is not proving to be as straightforward as was hoped, due to issues with the custom bindings file; in contrast, the DDI XML bindings have been relatively simple.
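For readers unfamiliar with the term, 'binding' here means converting between in-memory objects and XML that conforms to a schema. JAXB generates the Java classes for this from the schema itself. The short Python sketch below only illustrates the general idea with a hand-written mapping; it is not the CRDS code. The namespace is a placeholder, the element layout is a heavily simplified stand-in for a LOM record, and all names (LearningResource, marshal, unmarshal) are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Placeholder namespace for illustration; not the namespace of the real binding.
NS = "urn:example:lom-sketch"


@dataclass
class LearningResource:
    title: str
    keywords: list = field(default_factory=list)


def marshal(resource):
    """Turn a LearningResource object into an XML string."""
    root = ET.Element(f"{{{NS}}}lom")
    general = ET.SubElement(root, f"{{{NS}}}general")
    ET.SubElement(general, f"{{{NS}}}title").text = resource.title
    for kw in resource.keywords:
        ET.SubElement(general, f"{{{NS}}}keyword").text = kw
    return ET.tostring(root, encoding="unicode")


def unmarshal(xml_text):
    """Rebuild a LearningResource object from the XML produced by marshal()."""
    root = ET.fromstring(xml_text)
    general = root.find(f"{{{NS}}}general")
    title = general.findtext(f"{{{NS}}}title", default="")
    keywords = [e.text for e in general.findall(f"{{{NS}}}keyword")]
    return LearningResource(title=title, keywords=keywords)


# Invented sample record: object -> XML -> object should be lossless.
original = LearningResource("Introduction to census geography", ["CENSUS", "GEOGRAPHY"])
round_tripped = unmarshal(marshal(original))
```

A binding is working when a record survives this round trip unchanged; with a generated binding, problems of the kind mentioned above typically show up as elements that fail to map cleanly in one direction or the other.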

Keywords

It soon became clear that a single thesaurus would not do the job. Although each included some census-specific terms, none was comprehensive enough. It is expected that some of the CRDS's users will have, or will have been given by their tutors, sophisticated and precise keywords to use, which are specific to certain census concepts. Additionally, because many of the CHCC-created learning and teaching objects constitute overviews of the subject and introductions to research methodologies, it was also vital to include appropriate methodological keywords to describe these.

In the end, terms were selected from all three of the candidate thesauri (HASSET, ELSST and SRM) and shared with the rest of the CHCC partners. This initial list numbered about 150 terms; however, some essential terms, such as FAMILY STRUCTURE, SOCIO-ECONOMIC GROUP or STANDARD OCCUPATION CLASSIFICATION, were still missing. The CHCC partners suggested additional terms and, after much debate, a final amalgamated list, gleaned from all four sources (the three thesauri and the CHCC partners), was settled on. The final list contains 260 terms.
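The amalgamation of several term lists into one controlled vocabulary, and its use to check what metadata creators enter, can be sketched as follows. This is an illustration under stated assumptions, not the procedure the project followed: the thesaurus terms are invented samples, the partner terms are the three named above, and the function names (normalise, build_vocabulary, check_keywords) are hypothetical.

```python
def normalise(term):
    """Upper-case and collapse whitespace so variants of a term compare equal."""
    return " ".join(term.upper().split())


def build_vocabulary(*sources):
    """Merge several term lists into one sorted, duplicate-free list."""
    merged = set()
    for source in sources:
        merged.update(normalise(t) for t in source)
    return sorted(merged)


def check_keywords(keywords, vocabulary):
    """Split entered keywords into accepted terms and rejected ones."""
    allowed = set(vocabulary)
    accepted = [normalise(k) for k in keywords if normalise(k) in allowed]
    rejected = [k for k in keywords if normalise(k) not in allowed]
    return accepted, rejected


# Invented sample terms standing in for the thesaurus selections.
thesaurus_terms = ["Census", "Households", "census"]
# Terms the text names as missing from the initial list.
partner_terms = ["FAMILY STRUCTURE", "SOCIO-ECONOMIC GROUP",
                 "STANDARD OCCUPATION CLASSIFICATION"]

vocabulary = build_vocabulary(thesaurus_terms, partner_terms)
accepted, rejected = check_keywords(["family  structure", "migration"], vocabulary)
```

Normalising before merging keeps the list to a manageable size by removing case and spacing variants, while the check ensures that staff in different locations all index against the same agreed terms.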

Lessons Learnt

The biggest lesson to have been learnt through the development of the CHCC CRDS is the need to build as much slippage time as possible into a timetable. This sounds obvious but is worth repeating. That said, several of the obstacles encountered during the last 19 months of this project could not possibly have been predicted.

It was expected that the DDI would be finalised in early 2002, but this did not happen until late spring; it was expected that the IMS metadata specification would be the final one to be used (and, in fact, this was the specification advocated by the JISC); and it was hoped that the thesaurus question would be resolved more quickly than proved possible. Most project plans will include slippage time for instances such as the third in this list, but few will be able to include provision for changing or delayed standards.

The second lesson learnt, and probably the most important one, is the need to be flexible when working with metadata standards and to communicate with those in the know as much as possible.

The CHCC CRDS project has been fortunate in receiving additional funding to incorporate the changes in eLearning metadata standards; without this, a product could have been launched which would have been out of date before it was even released.

Contact details

Lucy Bell
UK Data Archive
University of Essex
Wivenhoe Park
Colchester
CO4 3SQ

Project Web site: http://www.chcc.ac.uk/

References

[1] Collection of Historical and Contemporary Census Data and Related Materials (CHCC),
<http://www.chcc.ac.uk/>
[2] Data Documentation Initiative (DDI),
<http://www.icpsr.umich.edu/DDI/>
[3] IMS Learning Resource Metadata Specification,
<http://www.imsglobal.org/metadata/>
[4] IEEE Learning Object Metadata (IEEE LOM),
<http://ltsc.ieee.org/wg12/>
[5] Humanities and Social Science Electronic Thesaurus (HASSET),
<http://www.data-archive.ac.uk/search/hassetSearch.asp>
[6] European Language Social Science Thesaurus (ELSST),
<http://www.limber.rl.ac.uk/Internal/Deliverables/D4_2_final_V2.doc>
[7] Social Research Methodology thesaurus (SRM),
<http://www.niwi.knaw.nl/srmonline/>
[8] UK Common Metadata Framework (UKCMF),
<http://www.cetis.ac.uk/profiles/ukcmf/>
[9] Java Architecture for XML Binding (JAXB),
<http://java.sun.com/xml/jaxb/>
