Metadata issues for Scottish repositories 2006-12-04

National Library of Scotland, Edinburgh, 4th December 2006, 2.00-4.30

This event was organised by the Cataloguing and Indexing Group in Scotland (CIGS), the National Library of Scotland (NLS) and the Centre for Digital Library Research (CDLR). As the title implies, it was geared specifically towards librarians and metadata specialists working in Scotland, but the invitation was open beyond Scotland. The purpose of the event was to inform the development of guidelines and recommendations for the IRIScotland project, for a workpackage due to complete by March 2007. The workshop was led by Gordon Dunsire from CDLR and was a mixture of speakers and discussion.

There were three speakers. Alan Dawson (CDLR) opened with a talk on some of the metadata content issues faced by aggregator services wishing to offer subject search and, particularly, browse access. His list of 31 issues, illustrated with real examples from the c.6000 records harvested by IRIScotland to date, will be familiar to many and included: typos; punctuation; hyphens; quotes; spaces and double spaces; ASCII and extended character sets; mixed case (particularly for UNIX-based systems); plurals; places/people/organisations entered as subject terms; resource types entered as subjects; abbreviations; synonyms; spelling (en-GB / en-US); initial articles; repeating terms in phrases and singly; Latin and common names; cross-references; uncontrolled keywords (a 'jumble'); context and the need for qualifiers; ambiguity; identifiers as subject terms; classification schemes that are too broad; topics that are too specific; achieving interoperability between different schemes; multiple languages.

For many of these, Alan believes that the solution lies in clear, simple procedures and guidance (e.g. don't copy and paste; don't include punctuation; don't include 'the' etc.). For the slightly more difficult issues, agreeing standards was the goal. The most difficult to solve were issues affected by differences in the scope and requirements of different services and repositories. Service providers and aggregators have a role in performing data cleaning, e.g. by removing double spaces, doing character translations, identifying anomalies through reports etc. Overall, Alan urged repositories to avoid keyword clutter and to use some kind of authority control and classification, even if it is LCSH (not ideal, but the best we've got).
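As an illustration of the kind of data cleaning an aggregator might do, here is a minimal Python sketch. It is not the IRIScotland implementation: the function names, the list of initial articles and the exact cleaning rules are all assumptions made for the example, and a real service would need to agree its own.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed list of initial articles to drop (English only, for illustration).
INITIAL_ARTICLES = ("the ", "a ", "an ")

# Character translations: curly quotes to plain ASCII quotes.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalise_subject(term: str) -> str:
    """Return a cleaned display form of a harvested subject term."""
    # Unify compatibility characters such as non-breaking spaces.
    text = unicodedata.normalize("NFKC", term).translate(QUOTE_MAP)
    # Collapse double spaces and other whitespace runs, then trim.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove leading and trailing punctuation and quotes.
    text = text.strip(" .,;:'\"")
    # Drop one initial article, if present.
    lowered = text.lower()
    for article in INITIAL_ARTICLES:
        if lowered.startswith(article):
            text = text[len(article):]
            break
    # Capitalise the first letter only; leave the rest as entered.
    return text[:1].upper() + text[1:]


def variant_report(terms):
    """Group raw terms that clean to the same form, ignoring case.

    Only groups with more than one distinct raw form are reported.
    """
    groups = defaultdict(set)
    for term in terms:
        groups[normalise_subject(term).casefold()].add(term)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}
```

A report like `variant_report(harvested_terms)` could be sent back to repositories as a list of anomalies. It does nothing for the harder issues such as synonyms, plurals or ambiguity, which need authority control rather than string cleaning.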

To finish, Alan highlighted issues faced by repositories trying to import data from departmental databases - inconsistencies, missing data, incorrect data etc. He also gave a very quick demonstration of the IRIScotland pilot service and how these issues affect its ability to offer subject browsing. At present, aggregation services rarely offer browsing for the reasons outlined above.

Julie Allinson (i.e. me) gave a 15-minute whistlestop introduction to the Eprints Application Profile (see presentation at

Gordon Dunsire (CIGS/CDLR) then assessed the implications of the Application Profile for repositories in Scotland. Gordon was very supportive of the Eprints Application Profile, while usefully exploring some of its grey areas and highlighting areas for further discussion. The following areas include both Gordon's queries and comments from the discussion.

AVAILABILITY - availability of the complete profile and the potential for customisation (and re-application). For repositories - training for non-cataloguers, retro-conversion for consistency. For aggregation services - de-duplication and consolidation of multi-level metadata.

IDENTIFIERS - the AP mandates only one identifier for the description set; this presents a potential problem if repositories want to unpack the WEMI (Work, Expression, Manifestation, Item) parts of a description set. Also, where metadata exists elsewhere (e.g. for a publisher's 'expression'), there is potential for duplication. The AP provides the framework, but there is still a need for the community to make decisions and come to agreements, particularly about identifiers and URIs. De-duplicating aggregated records from different repositories is another issue.
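To make the de-duplication problem concrete, the following Python sketch flags possible duplicates among aggregated records when no shared identifier exists. It assumes each harvested record is a dictionary with hypothetical `identifier`, `title` and `year` fields; these names and the matching rule are inventions for the example, not part of the AP or of any IRIScotland specification.

```python
import re
from collections import defaultdict


def match_key(record: dict) -> tuple:
    """Build a crude matching key from title and year.

    The title is lower-cased and every run of non-alphanumeric
    characters becomes a single space, so punctuation and spacing
    differences between repositories are ignored.
    """
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return (title, record.get("year"))


def find_possible_duplicates(records):
    """Return lists of record identifiers that share a matching key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["identifier"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical records from two repositories.
records = [
    {"identifier": "repoA:1", "title": "Metadata Quality in Repositories", "year": 2006},
    {"identifier": "repoB:7", "title": "Metadata quality in repositories.", "year": 2006},
    {"identifier": "repoA:2", "title": "Subject browsing for aggregators", "year": 2006},
]
possible_duplicates = find_possible_duplicates(records)
```

A key like this only suggests candidates for a human to check. It cannot tell whether two records describe the same expression or different versions of the same work, which is exactly where agreed identifiers would help.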

SCOPE - the AP has a defined scope (OA scholarly works). IRIScotland must consider its own scope, in terms of restricted access and non-digital material (not out of scope for the AP), plus other resource types such as images, datasets, institutional administrative (FOI) content and teaching materials. Institutional policies have an impact on what is available for aggregation, e.g. peer-reviewed only, literature, e-theses. E-theses, of course, have unique issues if a UK-wide e-theses service is established to follow on from the EThOS project.

SUBJECT TERMINOLOGY - is an agreed subject/classification scheme feasible for IRIScotland? LC is too broad as a classification scheme; but controlled lists need to be well governed. The AP supports both controlled and uncontrolled keywords and the identification of the scheme used.

NAMES - is an agreed authority file for corporate names feasible for Scottish organisations; or for research institutions and departments? The AP provides separate properties for family and given name (this is a good thing) and it has an identifier property which could help distinguish different people with the same name. Use of authority files could help, but there is no resource to develop such a thing in Scotland. JISC funding for a name authority service will not provide a solution within the timescale of IRIScotland. The ensuing discussion touched on many of the topics outlined above, in particular name authority and standardisation, access management (Shibboleth), library-based authority files, identity codes and governance of any central authority. The problem of dealing with name changes and changes in institution was raised, and it was noted that the AP and the DCAM can only offer a limited solution to this, through the AP 'affiliated institution' element.
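The value of separate name parts plus an identifier can be sketched in a few lines of Python. The class and field names below are hypothetical and are not the AP's own property names, and the identifier strings stand in for whatever scheme a community might agree on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Creator:
    family_name: str
    given_name: str
    identifier: Optional[str] = None  # hypothetical agreed person identifier

    def sort_name(self) -> str:
        """Inverted form for browse lists, built from the separate parts."""
        return f"{self.family_name}, {self.given_name}"


def same_person(a: Creator, b: Creator) -> Optional[bool]:
    """Compare two creators by identifier.

    Returns True or False when both carry an identifier, and None
    (undecided) otherwise: a name match alone cannot separate two
    people with the same name, and a name mismatch may just be a
    name change.
    """
    if a.identifier and b.identifier:
        return a.identifier == b.identifier
    return None


# Two different people sharing a name, told apart only by identifier.
first = Creator("Smith", "John", "person:001")
second = Creator("Smith", "John", "person:002")
unidentified = Creator("Smith", "John")
```

Without an authority file or shared identifier scheme, most comparisons fall into the undecided case, which is the practical gap the discussion kept returning to.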

The IRIScotland metadata standard may have to extend the Eprints Application Profile for its own requirements. There is no problem with this, provided it still retains interoperability with the AP. But IRIScotland can only provide simple guidelines and ask repositories to *try* to implement them; this cannot be mandated. This is why it is important to get communities involved. Suggesting adherence to RDA for metadata may be desirable but problematic, since RDA won't be available until 2009.

Other issues included the importance of library involvement and library mediation for repositories (including a suggestion for a SCURL statement to recommend library mediation of institutional repository metadata as standard practice); the use of mandates vs academic freedom; and the usual advocacy and policy issues. An interesting point was made about the high level of manual control at the moment, e.g. in duplicate checking, importing records and metadata creation, and possible scalability issues as repositories grow. These issues are particularly pertinent for home-grown open source repositories, whereas commercial solutions are generally much better at offering additional data management tools (for a cost!). This inspired the question, from Gordon: "are we re-inventing the square wheel?". There were murmurs of agreement at this question, but also, I think, a genuine awareness that repositories are at a very early stage and are trying to deal with issues of real complexity surrounding duplication and interoperability. Agreement on rules and policies is difficult even within a small circle of metadata experts, but the Application Profile and projects like IRIScotland are taking small steps toward getting this community agreement.