HILT: High-Level Thesaurus project

Automated and manual approaches to the provision of thesauri and subject vocabularies

Rosemary Russell and Michael Day
UKOLN, University of Bath, Bath BA2 7AY, UK.

Version 1.0, 5th October 2001

Summary

This review addresses one of the components of the original HILT proposal: "A report of a survey of literature, projects, organisations, and individuals to determine current views on best practice in respect of both user and machine oriented interfaces both to thesauri and to subject terminologies generally."

1. Introduction

2. Approach and methodology

3. Interfaces to 'traditional' thesauri

4. Projects and initiatives

5. Software and products

6. Standards issues

6.2.1 The Zthes profile

6.3.1 The Vocabulary Markup Language (Voc-ML)

7. References and literature review

8. Acknowledgements

1. Introduction

The HILT project aims to research, document and make recommendations on how to facilitate cross-searching and browsing by subject across a wide range of communities and services.

The term thesaurus is used in different contexts to describe tools that fulfil different functions. From an information science point of view, thesauri were originally developed as tools to allow terminology control for detailed subject indexing of printed documents. By extension, they can be used to aid the selection of relevant search terms, although this function may not always be easily available to the end user. Some thesauri are general in scope, while many others have been developed to meet specialist subject needs. Standards for the establishment and development of monolingual and multilingual thesauri have been published by the International Organization for Standardization (ISO 2788:1986; ISO 5964:1985) and by the US National Information Standards Organization (ANSI/NISO Z39.19-1993). The Z39.19 standard, for example, defines a thesaurus as a "controlled vocabulary of terms in natural language that are designed for postcoordination." What distinguishes thesauri from some other subject vocabulary types is that they show relationships between concepts. Hodge (2000a, p. 6) has described the most commonly used relationships:

Relationships commonly expressed in a thesaurus include hierarchy, equivalence (synonymy), and association or relatedness. These relationships are generally represented by the notation BT (broader term), NT (narrower term), SY (synonymy), and RT (associated or related term).
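The relationships listed above can be pictured as a small data structure. The following is a minimal sketch in Python, not a description of any particular thesaurus product; the class names and the sample terms are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term and its thesaurus relationships."""
    label: str
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT
    synonyms: set = field(default_factory=set)   # SY


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def term(self, label):
        # Create the term on first use so relationships can be added in any order.
        return self.terms.setdefault(label, Term(label))

    def add_broader(self, label, broader_label):
        # BT and NT are reciprocal: recording one also records the other.
        self.term(label).broader.add(broader_label)
        self.term(broader_label).narrower.add(label)

    def add_related(self, a, b):
        # RT is recorded in both directions.
        self.term(a).related.add(b)
        self.term(b).related.add(a)

    def add_synonym(self, label, synonym):
        self.term(label).synonyms.add(synonym)


# Hypothetical sample terms, not taken from any published thesaurus.
t = Thesaurus()
t.add_broader("Castles", "Fortifications")
t.add_broader("Hill forts", "Fortifications")
t.add_related("Castles", "Moats")
t.add_synonym("Castles", "Fortresses")
```

Storing BT and NT as reciprocal entries means that a hierarchy can be walked in either direction from any term, which is what most browse interfaces rely on.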

Many thesauri have complex hierarchies of terms. In addition to thesauri, there is a range of other types of controlled subject terminologies (or vocabularies) that have been developed, often on a more informal basis to meet local needs. They sometimes have a simpler, flatter structure with less information on relationships.

Thesauri and subject vocabularies are used within resource discovery services in different ways. Sometimes they may offer browse access, so that users are presented with some form of alphabetical list or hierarchy of subject terms that can be selected and thereby link to resources with that attribute assigned. It may also be possible to use the thesaurus as part of a subject search, so that if, for example, users input a non-preferred term, they are prompted to use the preferred term instead; the system may additionally be able to suggest related, broader or narrower terms, etc. Clearly, some of these types of advanced resource discovery options may not be appropriate for all users, many of whom will only ever use the simplest search options. An alternative adopted by some systems is to make the use of preferred search terms invisible to the user, rather than explicit. However, the counter-argument is that this could make search results confusing, if it is not clear to users how certain records were retrieved.
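The prompting behaviour described above can be sketched as a simple lookup. This is an illustrative Python sketch only: the vocabulary, the entry terms and the function names are hypothetical and do not represent any of the services reviewed here.

```python
# Hypothetical data: preferred terms with their relationships.
PREFERRED = {
    "Automobiles": {"BT": ["Motor vehicles"], "NT": ["Sports cars"], "RT": ["Roads"]},
    "Motor vehicles": {"BT": [], "NT": ["Automobiles"], "RT": []},
}
# Hypothetical non-preferred (entry) terms pointing at their preferred term.
USE = {"cars": "Automobiles", "motor cars": "Automobiles"}


def resolve(query):
    """Return (preferred_term, was_redirected), or (None, False) if unknown."""
    key = query.strip().lower()
    for term in PREFERRED:
        if term.lower() == key:
            return term, False
    if key in USE:
        return USE[key], True
    return None, False


def suggest(query):
    """Build the prompt a search interface might show for a user's query."""
    term, redirected = resolve(query)
    if term is None:
        return {"message": f"'{query}' is not in the vocabulary"}
    if redirected:
        message = f"Use '{term}' instead of '{query}'"
    else:
        message = f"Searching on '{term}'"
    rel = PREFERRED[term]
    return {
        "message": message,
        "term": term,
        "broader": rel["BT"],
        "narrower": rel["NT"],
        "related": rel["RT"],
    }
```

A system that hides the substitution from the user would call resolve() silently and discard the message; keeping the message is one way of making clear why particular records were retrieved.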

Until recently, thesauri and subject vocabularies have mainly been used for item level discovery within individual collections. The increased need for cross searching of databases brings with it a need for collection level discovery. Appropriate thesauri and subject vocabularies will be of key importance for effective and consistent collection level description and discovery.

2. Approach and methodology

The following review aims to be practical in nature - a state-of-the-art review of a variety of approaches to thesaurus and subject terminology interfaces. The literature indicated that little work has been published specifically in the area of interfaces to thesauri. It is not planned to investigate related areas of work like human-computer interaction (HCI) as this is a complex area of technical research outside the broad scope of this work.

A questionnaire was not considered to be appropriate; it was considered more useful to email or speak with individuals who are involved in relevant projects and initiatives.

The first section will look at user interfaces to a small selection of 'traditional' thesauri and subject vocabularies that have been made available on the Web. This is useful as a reminder of the way that most subject vocabularies are currently being used, in stand-alone format and in specialist environments.

The second part of the study reviews several relevant projects and initiatives working in the area of cross searching by subject, some of which involve mapping between subject schemes. In order to provide some context, the review will briefly mention some technologies, e.g. neural and semantic networks, which are derived from developments in artificial intelligence (AI). Such techniques, particularly when combined with traditional classification or subject indexing, can offer powerful alternatives to manually built thesauri.

A further section examines software and products that are being made available to automate implementations. This includes some knowledge management software that is primarily marketed to the corporate sector but whose use is increasingly being considered in wider contexts.

The final section covers a range of standards issues and developments. This includes national and international standards for the development and maintenance of thesauri, the use of the Z39.50 protocol for thesaurus navigation (the Zthes profile) and the development of an Extensible Markup Language (XML) based Document Type Definition (DTD) for knowledge organisation systems (Voc-ML).

3. Interfaces to 'traditional' thesauri

This section includes short overviews of selected 'traditional' thesauri, in order to provide a flavour of the types of interface offered to users. While there is a huge number of thesauri and subject vocabularies that could potentially be examined, the aim was to include a small cross-section from the library, museum and archive domains, both in the UK and overseas. The following were selected from the A-Z list of thesauri at the HILT project Web site:

http://hilt.cldr.strath.ac.uk/Sources/thesauri.html

Some thesauri and subject vocabularies are only available on the Web, but it is usual for vocabularies to be made available in several different formats. For example, the BEI (British Education Index) is available in print, on CD-ROM and on the Web (via BIDS). Most of their interfaces provide browse and search options to the terms in the thesaurus itself. The majority are provided on a standalone basis, sometimes as research or cataloguing aids, or as tools to be used by service providers on the basis of a license. Even when a machine-readable version of a vocabulary exists, it is not always possible to extract the data in order to build a database. This was the experience of the UK National Digital Archive of Datasets (NDAD) with a CD-ROM version of the UNESCO Thesaurus. For this reason and because of interface limitations, NDAD - after evaluation - used an earlier printed version of the thesaurus as the basis of the new NDAD Thesaurus (Garrod, 2000, p. 41). This thesaurus can be browsed in order to help users select search terms that describe the holdings of the NDAD service. In a similar way, the UK Data Archive's BIRON database directly links the Humanities And Social Science Electronic Thesaurus (HASSET) into its Web-based search interface.

3.1 AAT (Art & Architecture Thesaurus)

The Art & Architecture Thesaurus (AAT) was developed by the J. Paul Getty Trust and is maintained by the Getty Vocabulary Program (part of the Getty Research Institute). The version of AAT available on the Web is not directly linked to a database of the Getty's collections. On the Web interface to AAT, it is possible to browse a list of 'AAT Hierarchies' that organises preferred terms - here known as 'descriptors' - by one of seven high-level 'facets'. Alternatively, terms and their associated scope notes can be searched directly through the Web interface. The information available on each descriptor includes, where appropriate, a scope note, together with some additional information that might include broader or narrower terms, related terms, alternative forms of speech, British English equivalents, historical notes, etc. It is possible for AAT to be licensed for use by third parties, and the files are available as ASCII records in a fielded format or as USMARC (MARC21) authority records.

Art and Architecture Thesaurus (AAT): http://www.getty.edu/research/tools/vocabulary/aat/

3.2 HASSET (Humanities And Social Science Electronic Thesaurus)

HASSET initially presents 24 top-level terms, with some (possibly the more important) of the narrower terms listed below them. When the user chooses a top-level term, a box is displayed with broader, narrower, related terms, etc. on the left, with another box on the right displaying records that match the term (from the UK Data Archive's BIRON database). The thesaurus itself can also be licensed to service providers for indexing and retrieval of their own resources.

HASSET: Humanities And Social Science Electronic Thesaurus, version 2.1: http://dasun1.essex.ac.uk/services/zhasset.html

3.3 MeSH (Medical Subject Headings)

The Medical Subject Headings (MeSH) were developed by the US National Library of Medicine (NLM) for the subject indexing of MEDLINE and other databases. The vocabulary is continually revised and updated by staff in the NLM's Medical Subject Headings Section. The Web-based MeSH browser (http://www.nlm.nih.gov/mesh/MBrowser.html) - which is not linked to any of the Internet versions of the MEDLINE database - offers a complex range of search options using a rather technical vocabulary, and appears to be designed for information professionals. Searching will retrieve MeSH records that include scope notes, annotations, entry terms, history notes, allowable qualifiers, etc. For example, a search for "Diabetes Mellitus" will retrieve the heading, annotation and scope note, together with the entry term 'Diabetes', some see also references and information on allowable qualifiers. Below this is an extract from the MeSH tree structure, placing the term searched for in its hierarchical context. There is also a browse-based option - called "navigate from tree top" - that allows users to navigate from the top of the MeSH hierarchical tree-structure. In 2001, the top-level of the MeSH hierarchy contained fifteen headings.

Medical Subject Headings: http://www.nlm.nih.gov/mesh/meshhome.html

3.4 The Plumb Design Visual Thesaurus

The Plumb Design Visual Thesaurus provides an alternative and interesting visual design, but doesn't appear to be very usable or practical. Terms move (rather quickly!) around the screen; clicking on a selected term moves it to the centre of the screen, with related terms hovering around. New search terms can also be typed into a search box.

Visual Thesaurus: http://www.visualthesaurus.com/ [unavailable on 5 October 2001]

4. Projects and initiatives

The following section describes selected projects and initiatives with relevance to interfaces to thesauri.

4.1 The DESIRE II project

The original DESIRE (Development of a European Service for Information on Research and Education) project had an interest in the use of classification systems and published a report on the role of classification schemes in Internet resource description (Koch & Day, 1997). The DESIRE II project followed up this broad interest in classification, including some co-operation with the OCLC Office of Research on the application of automatic classification techniques for vocabulary support in Internet search services (Koch & Vizine-Goetz, 1998).

The DESIRE projects proposed the development and support of subject gateway services that would facilitate access to high-quality Internet resources (e.g., Koch, 2000). It was felt likely that the future sustainability of these services would depend upon their increased co-operation (Dempsey, et al., 1999) and the development of services that would allow both cross-searching and cross-browsing across gateways. DESIRE II conducted some background work on subject vocabularies in order to support the development of interoperable subject gateways, especially with regard to multilinguality and the mapping of different vocabularies.

A report (Cross, Brickley & Koch, 2000) proposed a system of encoding thesauri that could act both as a means of facilitating the transfer of controlled vocabularies and as an aid to "the expression of relationships between the terms of different thesauri and between the different language terms within a single thesaurus." The authors proposed a representation of the conceptual relationships typical of controlled vocabularies (specifically thesauri and classification schemes) using the Resource Description Framework (RDF). It was hoped that such an approach would enable the use of generic RDF tools as a basis for mapping between subject vocabularies. The report included a proposal for an RDF/XML Thesaurus Schema that attempted to demonstrate how the RDF data model could represent a web of inter-related concepts and terms from more than one thesaurus. The schema only modelled two relationships (broaderConcept and relatedConcept) as it was assumed that most other relationships (e.g. NT) would either be implicit or unnecessary. The DESIRE II Thesaurus Schema did not attempt to model any of the richer semantic relationships that would hold between the entities denoted by concepts (e.g. for mapping thesauri), although the report authors were confident that RDF schemas could be developed to represent this kind of information.
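The reason why a relationship such as NT can be left implicit is that it is the inverse of the broader relationship. The following Python sketch illustrates the idea with plain (subject, property, object) tuples. It is not the DESIRE II schema itself: only the two property names come from the report, while the concept identifiers and function names are hypothetical.

```python
# Triples are (subject, property, object). The property names follow the two
# relationships named in the report; the concept identifiers are hypothetical.
TRIPLES = [
    ("ex:castles", "broaderConcept", "ex:fortifications"),
    ("ex:hillforts", "broaderConcept", "ex:fortifications"),
    ("ex:castles", "relatedConcept", "ex:moats"),
]


def broader(concept, triples):
    """Concepts stated to be broader than the given concept."""
    return sorted(o for s, p, o in triples if s == concept and p == "broaderConcept")


def narrower(concept, triples):
    """NT is not stored: it is derived by reading broaderConcept backwards."""
    return sorted(s for s, p, o in triples if o == concept and p == "broaderConcept")


def related(concept, triples):
    """Related concepts, treated as symmetric (an assumption of this sketch)."""
    out = {o for s, p, o in triples if s == concept and p == "relatedConcept"}
    out |= {s for s, p, o in triples if o == concept and p == "relatedConcept"}
    return sorted(out)
```

For example, narrower("ex:fortifications", TRIPLES) lists both castles and hill forts even though no narrower statement was ever recorded.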

DESIRE II project: http://www.desire.org/

4.2 The HITITE project

The Heritage Illustrated Thesaurus (HITITE) project is concerned with the development of an online thesaurus of monument terms that would give access to images from English Heritage's National Monuments Record (NMR). The NMR Thesaurus of Monument Types (RCHME, 1998) contains around 6,300 terms and the HITITE initiative is intended to offer several ways of accessing it: keyword searching, an expert system based query service and an experimental virtual-reality (VR) search. The VR interface part of the project is exploring "the possibility of providing three-dimensional VR worlds in which users can explore, intuitively, different monument types and their associated terminology" (Carlisle, 2001).

The HITITE project is being undertaken by English Heritage and Adlib Information Systems, and will be funded by the European Commission as part of its Information Society Technologies (IST) programme.

4.3 Project LIMBER

Project LIMBER (Language Independent Metadata Browsing of European Resources) is a project concerned with the development of multilingual access to the contents of social science archives. The project is partially funded by the European Union under its IST programme. Partners include the Central Laboratory of the Research Councils (CLRC) Rutherford Appleton Laboratory, the UK Data Archive, the Norwegian Social Science Data Services (NSD), Intrasoft and a number of other European social science data archives.

The LIMBER project will adapt HASSET - initially developed by the UK Data Archive - by extending it with 'equivalent terms' in non-English languages; initially in French, German and Spanish (Matthews and Wilson, 2000). A multilingual search tool will use these equivalent terms to search for relevant data across a number of data sources. Matthews and Wilson (2000) note that a major problem is that the 'equivalent term' in different languages may not have an exact match and that "this is particularly a problem in social science (as opposed to the physical sciences) where the exact meaning of terms is culturally dependent." They further note that this problem may get worse if thesauri are arranged in hierarchies.
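The use of equivalent terms for cross-language searching can be sketched as follows. This Python example is illustrative only and is not LIMBER code: the equivalents table, the data sources and their record identifiers are all hypothetical, and the sample terms are not taken from HASSET.

```python
# Hypothetical equivalents for two English terms; not taken from HASSET.
EQUIVALENTS = {
    "unemployment": {"fr": "chômage", "de": "Arbeitslosigkeit", "es": "desempleo"},
    "housing": {"fr": "logement", "de": "Wohnungswesen", "es": "vivienda"},
}

# Hypothetical data sources, each indexed in a single language.
SOURCES = {
    "archive_fr": ("fr", {"chômage": ["FR-001", "FR-007"]}),
    "archive_de": ("de", {"Arbeitslosigkeit": ["DE-042"]}),
    "archive_uk": ("en", {"unemployment": ["UK-310"]}),
}


def multilingual_search(english_term):
    """Search every source using the equivalent term in that source's language."""
    hits = {}
    for name, (lang, index) in SOURCES.items():
        if lang == "en":
            local = english_term
        else:
            local = EQUIVALENTS.get(english_term, {}).get(lang)
        if local is None:
            # No equivalent recorded for this language: the source is skipped.
            continue
        hits[name] = index.get(local, [])
    return hits
```

The sketch treats every equivalent as an exact match. The problem noted by Matthews and Wilson is precisely that this assumption often fails, so a fuller model would need to record how close each equivalent is.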

The LIMBER project itself builds on a completed Telematics Applications Programme project called NESSTAR (Networked Social Science Tools and Resources). This project was concerned with developing software tools that would facilitate the publishing and retrieval of statistical and other social science data on the Web (Musgrave and Ryssevik, 2000). The NESSTAR system utilised an international standard for social science metadata, an XML-based 'codebook' developed by the Data Documentation Initiative (DDI). One of LIMBER's aims is to develop a multilingual thesaurus that can operate either on its own or as a plug-in to the NESSTAR system.

Project LIMBER: http://www.limber.rl.ac.uk/ [URL changed, 31 October 2001]

4.4 The MACS project

MACS (Multilingual Access to Subjects) is a project of the Conference of European National Librarians (CENL), financed by the four partners: the Swiss National Library (SNL), the Bibliothèque nationale de France (BnF), Die Deutsche Bibliothek (DDB) and the British Library (BL). The project is primarily concerned with the problem of multilingual access to bibliographic databases with subject headings in different languages. Clavel-Merrin (1999) noted that the majority of subject searches in bibliographic databases "are only possible using the subject entries in the language of the country." As an example, she notes that in the BnF, "each document, independently of the language in which it has been written, is indexed using a French-language subject heading language." Landry (2000, p. 2) describes the problem using the SNL as an example: "a researcher from abroad must not only master the German language but also the intricacies of the SWD/RSWK subject heading language in order to access material by subject."

Following a pilot study based on the subject heading languages used in the four partner institutions, the libraries funded the development of a prototype MACS system that would establish links between the headings in each language. The relevant subject heading languages were:

Schlagwortnormdatei/Regeln für den Schlagwortkatalog (SWD/RSWK)
Répertoire d'autorité-matière encyclopédique et alphabétique unifié (RAMEAU)
Library of Congress Subject Headings (LCSH)

A consortium made up of Index Data (Denmark) and Tilburg University Library (Netherlands) was contracted to develop this prototype system. In order to test the prototype, c. 15,000 bibliographic records from each database and 1,000 headings were loaded into the system. These headings were in the fields of sport and theatre, together with an additional set of links derived from the 500 most-used RAMEAU headings (Landry, 2000, p. 3). Links between the different subject heading languages are dealt with by the Link Management Interface.

The search interface uses Z39.50 (based on Index Data's Zebra server). As noted below in the section on standards, the prototype is based on the Zthes Z39.50 Profile for Thesaurus Navigation. The MACS system does not as yet offer direct access to the partner libraries' catalogues, although an operational system is planned for 2002, together with the addition of a subject heading language in a new language (Freyre & Naudi, 2001).

Mapping (or 'equivalence linking') has been done between the three indexing languages used by the partner libraries: SWD/RSWK for German, RAMEAU for French and LCSH for English. Topics (headings) from the three lists are analysed to determine whether they are exact or partial matches, of a simple or complex nature. The approach is based partly on the ISO 5964 guidelines for multilingual thesauri. Freyre & Naudi (2001) note that the MACS system takes the role of a multilingual dictionary of subject heading languages and therefore acts as a bridge between systems that have been developed to organise and name concepts.
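The 'multilingual dictionary' role can be pictured as a table of link records, each tying together one heading from each subject heading language and noting the kind of match. The Python sketch below is an illustration under stated assumptions, not the MACS Link Management Interface: the record layout, the match labels and the sample headings are hypothetical and have not been checked against the authority files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One equivalence link between headings in the three languages."""
    lcsh: str
    rameau: str
    swd: str
    match: str  # "exact" or "partial"; these labels are an assumption


# Hypothetical links in the fields of theatre and sport.
LINKS = [
    Link("Theater", "Théâtre", "Theater", "exact"),
    Link("Fencing", "Escrime", "Fechten", "exact"),
    Link("Actors", "Acteurs", "Schauspieler", "partial"),
]


def translate(heading, source, target, links=LINKS):
    """Return (linked heading, match type) in the target language, or None."""
    for link in links:
        if getattr(link, source) == heading:
            return getattr(link, target), link.match
    return None
```

A user searching with a RAMEAU heading could then be offered the linked SWD heading, for example translate("Escrime", "rameau", "swd"), without needing to know the German subject heading language.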

MACS project: http://infolab.kub.nl/prj/macs/

4.5 The Renardus project

The overall aim of the Renardus project is to establish a collaborative framework for European subject gateways that will benefit both users in terms of enhanced services, and the gateways themselves in terms of shared solutions. As part of this, Renardus is developing a pilot service brokering access to European-based subject gateways. This includes a Z39.50-based cross-search and an experimental cross-browse service based on mapping classification schemes used by partner gateways to a central 'switching language' - in Renardus, the Dewey Decimal Classification (Koch, Neuroth & Day, 2001).

Subject gateways use a wide variety of different classification schemes to provide browse access to selected Internet resources. These include well-known general schemes like the Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC) as well as a number of more subject specialised or locally-produced schemes. Within Renardus, for example, partner services use (or have adapted) parts of universal schemes like DDC or the Universal Decimal Classification (UDC), subject specialised schemes like the Ei classification, as well as locally-produced schemes like the Nederlandse Basisclassificatie (BC) and the Göttinger-Online-Klassifikation (GOK).

In Renardus, participating gateways are expected to map DDC captions to the subject terms used in local browse hierarchies, based on mapping guidelines that have been produced by a project working group. It is not expected that all terms will have direct equivalents in the DDC. For this reason, a set of relevance levels has been defined. These can indicate, for example, where there are narrower or broader equivalents, or major and minor overlaps. The person doing the mapping indicates the relevance level as part of the mapping workflow.

The cross-browse structures that can be created using this model are relatively limited. It is not possible, for example, to link directly to resource descriptions from multiple gateways in a single browse hierarchy without a greater level of centralisation than that offered by the Renardus broker. Instead, the Renardus browse system links directly into the subject hierarchies of individual gateways. If a part of an individual gateway's browse structure has been mapped to the DDC caption that the user is viewing, the gateway's name, the local term and its relevance level are displayed. The local term is a hyperlink to the relevant part of the local browse structure. Once the user follows this link, they will leave the Renardus system and 'jump' into the browse interface of the local gateway.
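The cross-browse lookup described above amounts to filtering a mapping table by DDC class. The following Python sketch is illustrative only and does not reproduce the format defined by the Renardus working group: the gateway names, local terms, URLs and the exact wording of the relevance levels are hypothetical.

```python
# Each mapping ties a DDC class to a term in a gateway's own browse hierarchy.
# Gateway names, local terms, URLs and relevance labels are all hypothetical.
MAPPINGS = [
    {"ddc": "540", "gateway": "GatewayA", "local_term": "Chemistry",
     "relevance": "fully equivalent",
     "url": "http://gateway-a.example/browse/chemistry"},
    {"ddc": "540", "gateway": "GatewayB", "local_term": "Physical sciences",
     "relevance": "broader equivalent",
     "url": "http://gateway-b.example/browse/physsci"},
    {"ddc": "780", "gateway": "GatewayA", "local_term": "Music",
     "relevance": "fully equivalent",
     "url": "http://gateway-a.example/browse/music"},
]


def browse_entries(ddc_class, mappings=MAPPINGS):
    """Return what a broker page for one DDC class would list: for each mapped
    gateway, its name, local term, relevance level and the outbound link."""
    return [
        (m["gateway"], m["local_term"], m["relevance"], m["url"])
        for m in mappings
        if m["ddc"] == ddc_class
    ]
```

The broker holds only the mappings and the outbound links. The resource descriptions themselves stay with each gateway, which is why following a link takes the user out of the central system.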

The mapping information is stored in a format defined by the working group. This means that this information can be used to support the Renardus advanced search or by other features yet to be developed.

A beta version of the Renardus broker is currently (October 2001) being evaluated.

Renardus: http://www.renardus.org/

4.6 The SALT project

The SALT (Standards-based Access service to multilingual Lexicons and Terminologies) project is an international research initiative, part funded by the European Commission as part of the IST programme. The project is concerned with combining terminology databases and machine translation lexicons. In particular, SALT is concerned with the integration of the Open Lexicon Interchange Format (OLIF) and the ISO 12200:1999 MAchine-Readable Terminology Interchange Format (MARTIF) to create an XML-based database called the eXchange format for Lex/Term-data (XLT). It is hoped thereby to facilitate access to and interchange between multilingual lexicons and terminologies. Part of the research effort of SALT concerns the mapping of ontologies into one another with minimal information loss (Meltby, 2000).

SALT project: http://www.ttt.org/salt/

4.7 The Semantic Web

The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users... For the Semantic Web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning (Berners-Lee, Hendler & Lassila, 2001).

The Semantic Web is the vision, most powerfully articulated by Tim Berners-Lee of the World Wide Web Consortium (W3C), of an extension of the current Web in which information is given well-defined meaning so that machines can begin to understand it, and process it accordingly. This is not achieved through advanced AI techniques, but by relying "solely on the machine's ability to solve well-defined problems by performing well-defined operations on well-defined data" (Berners-Lee & Hendler, 2001, p. 1023). This means that content creators will need to use new languages that will make Web content understandable to machines. Its supporters note that the "challenge of the Semantic Web is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web" (Berners-Lee, Hendler & Lassila, 2001).

Some of the technical building blocks for the Semantic Web are already in place. Semantic Web standards developed by the W3C include the Extensible Markup Language (XML), XML Schema, the Resource Description Framework (RDF) and RDF Schema. XML helps to add structure to documents, while RDF provides a simple data model (based on triples) and an XML-based syntax for the application of metadata about a resource. RDF differs from XML in that it uses Uniform Resource Identifiers (URIs) to unambiguously denote objects and the properties of relationships between objects (Decker, Mitra & Melnik, 2000). The RDF Schema specification provides a mechanism to define properties and the relationships between these properties and other resources (Brickley & Guha, 2000).
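The role of URIs in the triple-based data model can be illustrated with a short Python sketch. It does not use any RDF library and all of the namespaces and property URIs are hypothetical; the point is only that two concepts sharing a label remain distinct because they are identified by different URIs.

```python
# Hypothetical namespaces and property URIs, used only for illustration.
VOCAB_A = "http://example.org/vocab-a/"
VOCAB_B = "http://example.org/vocab-b/"
LABEL = "http://example.org/schema#label"
BROADER = "http://example.org/schema#broader"

# Each statement is a (subject, property, object) triple.
triples = {
    (VOCAB_A + "mercury", LABEL, "Mercury"),
    (VOCAB_A + "mercury", BROADER, VOCAB_A + "planets"),
    (VOCAB_B + "mercury", LABEL, "Mercury"),
    (VOCAB_B + "mercury", BROADER, VOCAB_B + "metals"),
}


def subjects_with_label(label):
    """All resources carrying the given label, identified by URI."""
    return sorted(s for s, p, o in triples if p == LABEL and o == label)


def objects(subject, prop):
    """All values of one property for one resource."""
    return sorted(o for s, p, o in triples if s == subject and p == prop)
```

A search on the label "Mercury" finds two resources, and following the broader property from each leads to planets in one vocabulary and metals in the other, so the two meanings are never confused.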

While XML has been an undoubted success, to date there has been little take-up of RDF. Tim Berners-Lee admitted in his keynote talk at the 10th International WWW Conference that although researchers are developing interesting Semantic Web applications based on RDF, there is still little interest from the commercial community or evidence of the deployment of production services (Kelly, 2001). RDF has been widely criticised for being overly complex, thereby involving high development overheads. There has also been much discussion on W3C lists about the Semantic Web, with contributors claiming that nobody fully understands it (partly because of poor documentation) and that unnecessary confusion is caused. Although RDF may seem like an academic exercise and is unlikely to achieve widespread deployment in the short term, it is an initial step to provide software with the means to interpret the semantics of 'unknown' metadata. As one of the initial building blocks of the Semantic Web, it enables automatic manipulation of structured metadata and schema vocabularies. For this reason, it cannot be ignored as a potential enabling technology for mapping subject vocabularies and thesauri.

There have been a number of research-based implementations of RDF and RDF Schema with regard to controlled vocabularies and ontologies. RDF Schema has, for example, been used to formally represent thesauri (e.g. in the preliminary work undertaken by the DESIRE II project described in section 4.1) and to test the integration of ontologies and thesaurus hierarchies in the cultural heritage domain (Amann & Fundulaki, 1999; Amann, Fundulaki & Scholl, 2000).

Another area of research relevant to the Semantic Web relates to the development of ontology languages. Ontologies are difficult to define simply, but the term is used by the AI community (and others) to mean something like an "explicit specification of a conceptualisation" (Gruber, 1995, p. 908). This means that the term is used to refer to vocabularies (or the conceptualisations that underlie them) that can build a representation of some area of knowledge, or the body of knowledge that describes a given domain (Chandrasekaran, Josephson & Benjamins, 1999, p. 20). In the context of the Semantic Web, Hendler (2001, p. 30) has recently referred to ontologies as "defined terms and relationships between them, usually in some formal and preferably machine-readable manner." Soergel (1999) has noted that ontologies have much in common with the classification schemes and thesauri developed by librarians and information scientists, and has argued for better communication between the various communities involved in their creation and maintenance. Potential applications of ontologies would be to form the basis of knowledge management systems and to aid e-commerce with regard to product information and electronic transactions (Fensel, et al., 2001, pp. 39-40). RDF Schema appears to be particularly suited for building ontology languages, and AI researchers have extended RDF and RDF Schema to create a number of knowledge representation languages. Relevant initiatives include the Ontology Inference Layer (OIL), the DARPA (US Defense Advanced Research Projects Agency) Agent Markup Language (DAML) and the Simple HTML Ontology Extensions (SHOE). OIL and DAML have recently joined forces to create DAML-OIL (Fensel, 2001, pp. 83-108; Fensel, et al., 2001; Harmelen & Horrocks, 2000; Hendler & McGuinness, 2000). This area of research is likely to be an important one for the future development of the Semantic Web, and may have implications for the wider deployment of thesauri and other subject vocabularies in a Web context.

The W3C has also recently set up a Web Ontology (WebOnt) Working Group to focus on the development of a language to extend the 'semantic reach' of XML and RDF. More generally, there is a European 'thematic network' called OntoWeb that is funded by the European Commission as part of its Information Society Technologies (IST) research programme. This has a large number of partners, both academic and industrial, and is fast becoming the main European focus of work concerned with the use of ontology-based technologies for the Semantic Web.

W3C Semantic Web activity: http://www.w3c.org/2001/sw/

The DARPA Agent Markup Language Homepage: http://www.daml.org/

OIL: http://www.ontoknowledge.org/oil/

Web Ontology Working Group: http://www.w3.org/2001/sw/WebOnt/

OntoWeb Network: http://www.ontoweb.org/

4.8 XML Topic Maps (XTM)

The ISO/IEC 13250:2000 Topic Maps standard defines a "notation for interchangeably representing information about the structure of information resources used to define topics, and the relationships between topics." Topicmaps.Org is a consortium that is defining an XML grammar for exchanging Web-based Topic Maps, called XML Topic Maps (XTM). The developers hope that XTM will facilitate the use of the topic maps paradigm on the Web and help to realise its potential for finding and managing information.

XML Topic Maps: http://www.topicmaps.org/xtm/1.0/

5. Software and products

Lists of software packages are available at several Web sites; one of the most comprehensive is that maintained by Leonard Will:

http://www.willpower.demon.co.uk/thessoft.htm

It contains short reviews of software for the development and maintenance of thesauri, including database systems like ADLIB, CAIRS and STAR, and systems that visually represent information like KnowledgeMap. The list mostly covers stand-alone thesaurus software, and does not generally include thesaurus or subject authority control modules included in library management systems.

5.1 Supporting software products

An alternative approach to the problem of cross searching by subject, where different domains use different thesauri, has been to develop tools for merging thesauri, using varying computational techniques (Hunter, 2001). These approaches tend to be based on technologies like RDF and RDFS, and include attempts to integrate ontologies and thesaurus hierarchies (e.g. Amann & Fundulaki, 1999) and the development of knowledge representation languages like OIL and DAML+OIL. These have been covered in more detail in section 4.7.

5.2 Automatic indexing and classification tools

Although automatic indexing tools do not yet appear to be widely used for production services, a growing number are becoming available. Some of these have originated in the library and information world. For example, the OCLC Office of Research has experimented with the building of tools for automatic subject recognition in the Scorpion project (http://purl.oclc.org/scorpion). This section presents a selection of some of the more 'talked about' commercial products. These products tend to be aimed at "knowledge management" applications in the commercial sector, but they have a wider relevance and often use technologies developed by information scientists and information retrieval experts. Because of their proprietary nature, most of the information available about these tools takes the form of product information. Gilchrist and Kibby (2000) provide more information about a number of other software products that automatically create taxonomies, including grapevine, Meta Packager, Readware, etc.

5.2.1 Autonomy

Autonomy makes knowledge management software that is primarily marketed to the corporate sector but that is increasingly being considered by the government and education sectors. The core of the Autonomy knowledge management system uses advanced pattern-matching techniques and probability theory to help perform tasks that are usually done manually. The Autonomy Web site claims that this "provides a comprehensive infrastructure for automating the processing of unstructured information, including but by no means limited to, search." The company's FAQ says that "Autonomy develops infrastructure technology that automatically processes and organizes large amounts of unstructured information into personally relevant content in real-time and in an efficient manner." The technology that does this is known as the Dynamic Reasoning Engine (DRE). Some technical information on the Autonomy system is available in their white paper (Autonomy Systems, 2001). The British Council has used Autonomy software in the development of their Human Rights portal.

Autonomy Web pages: http://www.autonomy.com/autonomy/index.shtml

5.2.2 Interwoven

Interwoven is a company that produces content management systems and other software. One of these is the MetaTagger, which uses controlled vocabularies - either those available from third parties or locally created - as part of the process of creating metadata for documents created by Interwoven's TeamSite content management system. These vocabularies cover the names of public companies, geographical locations and industrial codes, and are expressed in XML.

Interwoven Web pages: http://www.interwoven.com/

5.2.3 Semio

Semio produces a product called "Semio Taxonomy" that attempts to organise a collection of documents into a browsable structure. The software automatically extracts key concepts from large volumes of text-based content, then structures (clusters) the information into logical categories, in order to provide user-friendly access.

Semio Taxonomy combines linguistic analysis technology and statistical clustering with user-defined vocabulary requirements to create an intuitively browsable structure of categories that provides access to information within a mass of formerly unstructured text… The client can configure taxonomies to reflect a corporate thesaurus or controlled vocabulary, fully compliant with ISO thesauri…There is an ability to generate multiple taxonomies representing different 'views' of the same source data… - 'Taxonomies – tools and developments' Barry Mahon, with additions from Gilchrist (Gilchrist & Kibby, 2000).

Semio Web site: http://www.semio.com/

5.2.4 Wordmap

Wordmap is a commercial company that specialises in the development of systems based on 'knowledge maps.' These are defined as "visualised, controlled structures of related terms." In general, any subject vocabulary, thesaurus or ontology can be thought of as a knowledge map. Wordmap attempts to 'mesh' various vocabularies together so that users are able to locate relevant terms for a search. The knowledge maps would help a user avoid ambiguity (e.g. where homonyms exist) and would provide alternative and supplementary search terms, context and cross-references.
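
The general idea of using a vocabulary to resolve homonyms and suggest alternative search terms can be sketched in a few lines of Python. This is not a description of how Wordmap's products work internally; the data structure, the function name and the sample entries are all assumptions made for illustration.

```python
# A hypothetical, hand-built fragment of a 'knowledge map'.
# Each ambiguous term lists its senses, with a context label and
# alternative search terms for each sense.
KNOWLEDGE_MAP = {
    "mercury": [
        {"context": "chemical elements", "alternatives": ["quicksilver", "Hg"]},
        {"context": "planets", "alternatives": ["planet Mercury"]},
    ],
}


def suggest(term):
    """Return (context, alternative terms) pairs for each sense of `term`."""
    senses = KNOWLEDGE_MAP.get(term.lower(), [])
    return [(sense["context"], sense["alternatives"]) for sense in senses]


for context, alternatives in suggest("Mercury"):
    print(context, "->", ", ".join(alternatives))
```

A search interface built on such a structure could show the context labels to the user, who then picks the intended sense before the query is sent to a search service.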

Wordmap aims to provide a set of generic knowledge maps. The core of this is a large-scale general knowledge dataset. Linked to this are other subject vocabularies and classification schemes, including the North American Industry Classification System (NAICS), the Standard Industrial Classification and LCSH. A 'management toolset' enables the compilation, editing and updating of customer-specific knowledge maps. The Wordmap metasearch server deploys knowledge maps in the search and navigation of diverse data sources. A demonstrator is available on the Web that will search the content of Web search engines like AltaVista. Users can browse by broad top-level categories, or choose slightly narrower terms within these.

Wordmap Web pages: http://www.wordmap.com/

6. Standards issues

6.1 National and international standards

Both national and international standards exist for the creation and maintenance of thesauri. For example, the US National Information Standards Organization (NISO) publishes ANSI/NISO Z39.19-1993, Guidelines for the construction, format, and management of monolingual thesauri. International standards also exist. The International Organization for Standardization (ISO) publishes guidelines for both monolingual (ISO 2788:1986) and multilingual (ISO 5964:1985) thesauri. Some national standards (e.g., BS 5723:1987) are identical to the relevant standard published by ISO.

6.2 Z39.50 and thesauri

6.2.1 The Zthes profile

Zthes (Taylor, 2000) is a Z39.50 profile for thesaurus navigation. The profile describes an abstract model for representing and searching thesauri (e.g. hierarchies of terms as described in ISO 2788) and specifies how this model may be implemented using the Z39.50 protocol. It also suggests how the model may be implemented using other protocols and formats; e.g. a Zthes DTD for XML is provided as an appendix to the profile. Real Zthes datasets have been exchanged in the form of XML documents conforming to this DTD.
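
As a rough illustration of the kind of abstract model the profile describes, the following Python sketch represents a thesaurus term with typed relations (BT, NT and RT being the conventional codes for broader, narrower and related terms) and serialises it to XML. The class names, sample terms and element names are assumptions made for this example; they loosely echo the style of a thesaurus record but have not been checked against the Zthes DTD, and no Z39.50 communication is involved.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Relation:
    relation_type: str  # e.g. "BT" (broader), "NT" (narrower), "RT" (related)
    term_id: str
    term_name: str


@dataclass
class Term:
    term_id: str
    term_name: str
    relations: list = field(default_factory=list)


def term_to_xml(term):
    """Serialise one term and its relations to an XML string.

    The element names are illustrative only.
    """
    root = ET.Element("term")
    ET.SubElement(root, "termId").text = term.term_id
    ET.SubElement(root, "termName").text = term.term_name
    for rel in term.relations:
        rel_el = ET.SubElement(root, "relation")
        ET.SubElement(rel_el, "relationType").text = rel.relation_type
        ET.SubElement(rel_el, "termId").text = rel.term_id
        ET.SubElement(rel_el, "termName").text = rel.term_name
    return ET.tostring(root, encoding="unicode")


def narrower_terms(term):
    """List the names of the terms directly narrower than `term`."""
    return [r.term_name for r in term.relations if r.relation_type == "NT"]


# Hypothetical sample record.
castles = Term("1", "castles", [
    Relation("BT", "0", "fortifications"),
    Relation("NT", "2", "motte and bailey castles"),
    Relation("RT", "3", "town walls"),
])
print(term_to_xml(castles))
```

Keeping the relation type as data, rather than as separate fields for each kind of relationship, means that a client can navigate up, down or sideways through the hierarchy with the same simple lookup.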

The profile document states that only one complete implementation of the Zthes client and server was known to exist at the time of writing: one developed by Index Data for their Zebra database. Since MACS (see section 4.4 above) uses Zebra, it is one of the few project implementations using the Zthes profile. It has developed an extended version of the Zthes record structure. A Z39.50 client can search for a subject heading from one of the indexing schemes and linked terms from the other schemes will also be returned. Several other EU-funded projects have also used Zthes, including ELVIL (the European Legislative Virtual Library), the UNIverse project, Decomate II and ELISE II.
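
The behaviour described here, where a search on a heading in one scheme also returns the linked headings from the other schemes, can be sketched as a simple in-memory lookup. The Python below is a minimal illustration under stated assumptions: the link table, its sample headings and the function name are invented for this example, and it does not reproduce the MACS record structure or use Z39.50.

```python
# Hypothetical link records: each dictionary groups headings from
# different schemes that have been judged equivalent.
LINKS = [
    {"LCSH": "Libraries", "RAMEAU": "Bibliothèques", "SWD": "Bibliothek"},
    {"LCSH": "Museums", "RAMEAU": "Musées", "SWD": "Museum"},
]


def linked_headings(scheme, heading):
    """Return the equivalent headings in the other schemes.

    Matching is case-insensitive; an empty dictionary means no link
    record contains the heading in the given scheme.
    """
    for link in LINKS:
        if link.get(scheme, "").lower() == heading.lower():
            return {s: h for s, h in link.items() if s != scheme}
    return {}


print(linked_headings("LCSH", "Libraries"))
```

In a real service the link records would be held in a database and maintained by subject specialists; the sketch only shows why a single search can return terms from several schemes at once.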

6.3 NKOS

6.3.1 The Vocabulary Markup Language (Voc-ML)

The Networked Knowledge Organization Systems/Services (NKOS) working group is concerned with the creation of interactive Knowledge Organization Systems (KOS) accessible over the Web (http://nkos.slis.kent.edu/). Hodge (2000a, p. 1) has attempted to define the scope of such systems:

The term knowledge organization systems is intended to encompass all types of schemes for organizing information and promoting knowledge management. Knowledge organization systems include classification and categorization schemes that organize materials at a general level, subject headings that provide more detailed access, and authority files that control variant versions of key information such as geographic names and personal names. Knowledge organization systems also include highly structured vocabularies, such as thesauri, and less traditional schemes, such as semantic networks and ontologies.

A NISO workshop on Electronic Thesauri: Planning for a Standard held in 1999 (Milstead, 1999) concluded (amongst other things) that there was a need for a metadata content standard for the description of knowledge organisation systems. NKOS has since then defined a set of attributes for the description of knowledge organisation systems, and developed a draft XML DTD known as the Vocabulary Markup Language (Voc-ML). The DTD is being developed by Joseph Busch and Ron Daniel of Interwoven (Hodge, 2000b). The schema includes Dublin Core metadata that would describe the knowledge organisation systems being encoded. It also defines tags and syntax for uniquely identifying each term, its relationship to other terms, and descriptive information like scope notes and definitions. It is hoped that the schema, when finalised, will work for a range of different types of system, including authority files, hierarchical thesauri, classification schemes, digital gazetteers, and subject heading lists (Hodge, 2000b). A new draft of the Voc-ML schema was discussed at the 4th NKOS Workshop held in conjunction with the ACM-IEEE Joint Conference on Digital Libraries (JCDL) in Roanoke, Va.

NKOS: http://nkos.slis.kent.edu/

7. References and literature review

Literature searches were carried out using LISAnet, zetoc, BUBL, ISI Web of Science, and other Web search resources, as well as by following up suggestions from colleagues and links from relevant Web sites. Some items are less relevant than others, but have been listed in order to show the types of issues that other groups have been addressing.

Based on searches conducted thus far, not much literature exists specifically on interfaces to thesauri, and projects have not tended to focus on this. Some more general user interface design literature may be relevant, although this tends to have a more technical/mathematical approach. Leona Carpenter (UKOLN) has suggested some sources in this area.

Not all the items listed have been consulted directly. Some were not held in British Library stock in London or Boston Spa; several were requested via ILL.

Adams, Katherine C. (2001). Word Wranglers: automatic classification tools transform enterprise documents from "bags of words" into knowledge resources. IntelligentKM, 1 January: http://www.intelligentkm.com/feature/010101/feat1.shtml

Amann, B. & Fundulaki, I. (1999). Integrating ontologies and thesauri to build RDF schemas. In: Abiteboul, S. & Vercoustre, A.-M., eds., Research and advanced technology for digital libraries: third European conference, ECDL'99, Paris, France, September 22-24, 1999: proceedings. (Lecture Notes in Computer Science, 1696). Berlin: Springer, 234-253.

Amann, B., Fundulaki, I. & Scholl, M. (2000). Integrating ontologies and thesauri for RDF schema creation and metadata querying. International Journal on Digital Libraries, 3 (3), 221-236.

ANSI/NISO Z39.19-1993. Guidelines for the construction, format, and management of monolingual thesauri. Bethesda, Md.: NISO Press.

Autonomy Systems, Ltd. (2001). Technology white paper. PDF version available at: http://www.autonomy.com/echo/userfile/Autonomy_Technology_WP(0401).pdf

Berners-Lee, T. & Hendler, J. (2001). Publishing on the Semantic Web. Nature, 410 (6832), 26 April, 1023-1024.

Berners-Lee, T., Hendler, J. & Lassila, O. (2001). The Semantic Web. Scientific American, 284 (5), May, 28-37. http://www.sciam.com/2001/0501issue/0501berners-lee.html

Brickley, D. & Guha, R.V., eds. (2000). Resource Description Framework (RDF) Schema specification, 1.0. W3C Candidate Recommendation, 27 March.
http://www.w3.org/TR/2000/CR-rdf-schema-20000327/

BS 5723:1987. British Standard guide to the establishment and development of monolingual thesauri. London: British Standards Institution.

Carlisle, P. (2001). VR and Heritage Illustrated Thesaurus (HITITE). mda Conference 2001, University of East Anglia, Norwich, 12-14 September. Abstract available at: http://www.mda.org.uk/conference2001/paper24.htm

Chandrasekaran, B., Josephson, J.R. & Benjamins, V.R. (1999). What are ontologies, and why do we need them? IEEE Intelligent Systems, 14 (1), January/February, 20-26.

Clavel-Merrin, G. (1999). The need for co-operation in creating and maintaining multilingual subject authority files. 65th IFLA Council and General Conference, Bangkok, Thailand, 20-28 August 1999. http://www.ifla.org/IV/ifla65/papers/080-155e.htm

Cranefield, S. (2001). Networked knowledge representation and exchange using UML and RDF. Journal of Digital Information, 1 (8), 5 February. http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Cranefield/

Cross, P., Brickley, D. & Koch, T. (2001). Enhancements to user interface. DESIRE II internal deliverable, D3.6b. http://www.desire.org/html/research/deliverables/D3.6/d36b.html

Decker, S., Mitra, P. & Melnik, S. (2000). Framework for the Semantic Web: an RDF tutorial. IEEE Internet Computing, 4 (6), November/December, 68-73.

Dempsey, L., Gardner, T., Day, M. & Werf, T. van der. (1999). International information gateway collaboration: report of the first IMesh Framework Workshop. D-Lib Magazine, 5 (12), December. http://www.dlib.org/dlib/december99/12dempsey.html

Dextre Clarke, S. (2001). Pan-government thesaurus workshop, 10 May 2001, London. Report of proceedings prepared for the Office of the e-Envoy, 21 May (draft).

Doerr, M. (2001). Semantic problems of thesaurus mapping. Journal of Digital Information, 1 (8), 5 February. http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Doerr/

Fensel, D. (2001). Ontologies: a silver bullet for knowledge management and electronic commerce. Berlin: Springer.

Fensel, D., Harmelen, F. van, Horrocks, I., McGuinness, D.L. & Patel-Schneider, P.F. (2001). OIL: an ontology infrastructure for the Semantic Web. IEEE Intelligent Systems, 16 (2), March/April, 38-45.

Freyre, E. & Naudi, M. (2001). MACS: subject access across languages and networks. Subject Retrieval in a Networked Environment, IFLA Satellite Meeting, OCLC, Dublin, Ohio, 14-16 August 2001.

Garrod, P. (2000). Use of the UNESCO Thesaurus for archival subject indexing at UK NDAD. Journal of the Society of Archivists, 21 (1), 37-54.

Gilchrist, A. & Kibby, P. (2000). Taxonomies for business: access and connectivity in a wired world. London: TFPL.

Gilchrist, A. (1990). Knowledge organisation and the human interface. In: Fugmann, R., ed., Tools for knowledge organization and the human interface: proceedings 1st International ISKO-Conference, Darmstadt, 14-17 August 1990. (Advances in Knowledge Organization, 1-2). Frankfurt/Main: Indeks, Vol. 1, 1991, 163-170.

Gruber, T.R. (1995). Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies, 43, 907-928.

Harmelen, F. van & Horrocks, I. (2000). FAQs on OIL: the Ontology Inference Layer. IEEE Intelligent Systems, 15 (6), November/December, 69-72.

Harter, S.P. & Cheng, Y.-R. (1996). Colinked descriptors: improving vocabulary selection for end-user searching. Journal of the American Society for Information Science, 47 (4), 311-325.

Hendler, J. (2001). Agents and the Semantic Web. IEEE Intelligent Systems, 16 (2), March/April, 30-37.

Hendler, J. & McGuinness, D.L. (2000). The DARPA Agent Markup Language. IEEE Intelligent Systems, 15 (6), November/December, 72-73.

Hodge, G. (2000a). Systems of knowledge organisation for digital libraries: beyond traditional authority files. Washington, D.C.: Council on Library and Information Resources, Digital Library Federation, April. http://www.clir.org/pubs/abstract/pub91abst.html

Hodge, G. (2000b). NKOS group reviews draft DTD for thesauri. D-Lib Magazine, 6 (12), December. http://www.dlib.org/dlib/december00/12inbrief.html#HODGE

Hunter, J. (2001). MetaNet: a metadata term thesaurus to enable semantic interoperability between metadata domains. Journal of Digital Information, 1 (8), 5 February. http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Hunter/

ISO 2788:1986. Documentation - Guidelines for the establishment and development of monolingual thesauri. Geneva: International Organization for Standardization.

ISO 5964:1985. Documentation - Guidelines for the establishment and development of multilingual thesauri. Geneva: International Organization for Standardization.

ISO 12200:1999. Computer applications in terminology -- Machine-readable terminology interchange format (MARTIF) -- Negotiated interchange. Geneva: International Organization for Standardization.

ISO/IEC 13250:2000. Information technology - SGML applications - Topic Maps. Geneva: International Organization for Standardization.

Jones, S. & Hancock-Beaulieu, M. (1994). Support strategies for interactive thesaurus navigation. In: Albrechtsen, H. & Ørnager, S., eds., Knowledge organization and quality management: proceedings of the Third International ISKO Conference, 20-24 June 1994, Copenhagen, Denmark. (Advances in Knowledge Organization, 4). Frankfurt/Main: Indeks.

Jorna, K. & Davies, S. (2001). Multilingual thesauri for the modern world: no ideal solution? Journal of Documentation, 57 (2), 284-295.

Kelly, B. (2001). Web Focus: Hot News From WWW10. Ariadne, 28, June. http://www.ariadne.ac.uk/issue28/web-focus/

Klaus, H. (1999). Understandings of thesauri in the online dialogue: implications for construction and display. Cataloguing Australia, 25 (1/4), 120-139.

Koch, T. (2000). Quality controlled subject gateways: definitions, typologies, empirical overview. Online Information Review, 24 (1), 24-34.

Koch, T. & Day, M. (1997). The role of classification schemes in Internet resource description and discovery. DESIRE deliverable, D3.2 (3). http://www.ukoln.ac.uk/metadata/desire/classification/

Koch, T. & Vizine-Goetz, D. (1998). Automatic classification and content navigation support for Web services. In: Annual review of OCLC research, 1998. Dublin, Ohio: OCLC Online Computer Library Center. http://www.oclc.org/research/publications/arr/1998/toc.htm

Koch, T., Neuroth, H. & Day, M. (2001). Renardus: cross-browsing European subject gateways via a common classification system (DDC). Subject Retrieval in a Networked Environment, IFLA Satellite Meeting, OCLC, Dublin, Ohio, 14-16 August 2001. http://www.ukoln.ac.uk/metadata/renardus/papers/ifla-satellite/

Landry, P. (2000). The MACS project: Multilingual Access to Subjects (LCSH, RAMEAU, SWD). 66th IFLA Council and General Conference, Jerusalem, Israel, 13-18 August 2000. http://www.ifla.org/IV/ifla66/papers/165-181e.pdf

Landry, P. (2001). The MACS project: Multilingual Access to Subjects (LCSH, RAMEAU, SWD). International Cataloguing and Bibliographic Control, 30 (3), 46-49.

Leusen, M. van. (2001). Indexing and retrieving archaeological resources on the Internet: a prototype multilingual thesaurus application. In: Stancic, Z. & Veljanovski, T., eds., Computing archaeology for understanding the past: CAA 2000 Computer Applications and Quantitative Methods in Archaeology: proceedings of the 28th Conference, Ljubljana, April 2000. (BAR International Series, 931). Oxford: Archaeopress, 303-312.

López-Huertas, M.J. (1997). Thesaurus structure design: a conceptual approach for improved interaction. Journal of Documentation, 53 (2), 139-177.

Matthews, B. & Wilson, M. (2000). Multilingual metadata to access social science data. Data Management Workshop, CLRC Daresbury Laboratory, Cheshire, UK, 23-25 February. http://www.limber.rl.ac.uk/External/paper1.htm [URL changed, 31 October 2001]

Melby, A.K. (2000). SALT: Standards-based Access service to multilingual Lexicons and Terminologies. http://www.ttt.org/salt/description.html

Miller, K. & Matthews, B. (2001). Having the right connections: the LIMBER Project. Journal of Digital Information, 1 (8), 5 February. http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Miller/

Miller, P. (2000). I say what I mean, but do I mean what I say? Ariadne, 23, March. http://www.ariadne.ac.uk/issue23/metadata/

Milstead, J. (1999). Workshop on Electronic Thesauri: planning for a standard: report. Bethesda, Md.: National Information Standards Organization, November. http://www.niso.org/thes99rprt.html

Musgrave, S. & Ryssevik, J. (2000). Beyond NESSTAR: faster access to data. IASSIST 2000 - Data in the Digital Library: Charting the Future for Social, Spatial and Government Data, Northwestern University, Evanston, Ill., USA, 7-10 June. Draft PDF version available at: http://www.nesstar.org/papers/beyondnesstar.pdf

O'Brien, E.A., Tinker, A., Pollitt, A.S. & Braekevelt, P.A. (2000). The Dewey Decimal Classification and the transition from physical to electronic organisation. Knowledge Organization, 26 (2), 80-96.

Priss, U. (1998). A graphical interface for conceptually navigating faceted thesauri. In: Mustafa El Hadi, W., Maniez, J., & Pollitt, A. S., eds., Structures and relations in knowledge organization: proceedings of the Fifth International ISKO Conference, 25-29 August 1998, Lille, France. (Advances in Knowledge Organization, 6). Würzburg: Ergon, 184-190.

Qin, J. & Paling, S. (2001). Converting a controlled vocabulary into an ontology: the case of GEM. Information Research: an international electronic journal, 6 (2), January. http://informationr.net/ir/6-2/paper94.html

Royal Commission on the Historical Monuments of England. (1998). Thesaurus of monument types: a standard for use in archaeological and architectural records, 2nd ed. Swindon: Royal Commission on the Historical Monuments of England, with English Heritage.

Shapiro, C.D. & Yan, P.-F. (1996). Generous tools: thesauri in digital libraries. In: Williams, M.E., ed., National Online Meeting: proceedings 1996: proceedings of the 17th National Online Meeting, New York, May 14-16 1996. Medford, N.J.: Information Today, 323-332.

Shneiderman, B., Byrd, D. & Croft, W.B. (1998). Sorting out searching: a user-interface framework for text searches. Communications of the ACM, 41 (4), April, 95-98.

Soergel, D. (1999). The rise of ontologies or the reinvention of classification. Journal of the American Society for Information Science, 50 (12), 1119-1120.

Taylor, M. (2000). Zthes: a Z39.50 profile for thesaurus navigation, version 0.4, November. http://lcweb.loc.gov/z3950/agency/profiles/zthes-04.html

Tudhope, D., Alani, H. & Jones, C. (2001). Augmenting thesaurus relationships: possibilities for retrieval. Journal of Digital Information, 1 (8), 5 February. http://jodi.ecs.soton.ac.uk/Articles/v01/i08/Tudhope/

Vasiljev, A. (1991). Enhancement of the subject access vocabulary in an online catalogue: tools for knowledge organisation and the human interface. In: Fugmann, R., ed., Tools for knowledge organization and the human interface: proceedings 1st International ISKO-Conference, Darmstadt, 14-17 August 1990. (Advances in Knowledge Organization, 1-2). Frankfurt/Main: Indeks, Vol. 1, 1991, 163-170.

8. Acknowledgements

HILT (High-Level Thesaurus) was a one-year project jointly funded by the Research Support Libraries Programme (RSLP) and the Joint Information Systems Committee (JISC) of the UK higher education funding councils. The project was led by the Centre for Digital Library Research (CDLR) at the University of Strathclyde: http://hilt.cdlr.strath.ac.uk/

UKOLN is funded by Resource: the Council for Museums, Archives and Libraries, the Joint Information Systems Committee (JISC) of the UK higher education funding councils, as well as by project funding from the JISC, the European Union and other sources. UKOLN also receives support from the University of Bath, where it is based: http://www.ukoln.ac.uk/

Maintained by Michael Day of UKOLN, University of Bath.
Created: 31-Oct-2001.
Last updated: 31-Oct-2001.
