Terminology Services and Technology 

JISC state of the art review

 

 

Douglas Tudhope     University of Glamorgan

Traugott Koch          UKOLN, University of Bath

Rachel Heery           UKOLN, University of Bath

 

 

 

Document details

Date:

15-09-2006

Version:

Final draft for approval

Notes:

Circulation to JISC Development Team

Acknowledgement to funders

This work was funded as part of the JISC Information Environment.

 

UKOLN is funded by the MLA: The Museums, Libraries and Archives Council, the Joint Information Systems Committee (JISC) of the Higher and Further Education Funding Councils, as well as by project funding from the JISC and the European Union.  UKOLN also receives support from the University of Bath where it is based.

 

 

 

 

 

 

 

 

 

 

 

 

The Arts and Humanities Data Service is a UK national service funded by the Joint Information Systems Committee (JISC) of the Higher and Further Education Funding Councils and the Arts and Humanities Research Board to collect, preserve and promote the electronic resources which result from research and teaching in the arts and humanities.

 

 


Contents page

 

 

Executive Summary

Purpose

Overview of report contents

Key points

Recommendations

1.     Introduction

1.1       Purpose of this review

1.2       Terminology services overview

1.2.1        Controlled vocabularies

1.2.2        Folksonomies

1.2.3        Combination of terminology tools and techniques

1.3       Cost benefit issues

1.3.1        Benefits

1.3.2        Return on investment

2      Use cases - scenarios

2.1       Retrieval performance

2.2       Name Authorities

2.3       Mapping and other TS

2.4       Repositories

3      Types of vocabularies

3.1       Vocabularies by structure

3.1.1        Term Lists

3.1.2        Taxonomies

3.1.3        Subject Headings

3.1.4        Relationship-based KOS

3.2       Vocabularies by purpose

3.2.1        Retrieval purposes

3.2.2        Linguistic purposes

3.2.3        AI purposes - modeling the entities in a domain

3.2.4        eLearning purposes

3.2.5        eScience purposes

3.3       Named entity authority and disambiguation services

3.3.1        Name Authority databases

3.3.2        Other named entity authorities

3.3.3        Named entity recognition, text mining, name disambiguation

3.3.4        Tools, Web services

3.4       Social tagging and folksonomies

3.4.1        Terminology

3.4.2        Context

3.4.3        Categorization of tagging systems

3.4.4        Disadvantages and problems

3.4.5        Advantages and benefits

3.4.6        Proposed developments

3.4.7        Research

3.5       Best practice guidelines for constructing and using vocabularies

3.6       Network access to vocabularies

3.7       Terminology Registries

4      Activities with TS

4.1       Studies and models of information seeking behaviour

4.2       Information lifecycle with regard to TS

4.3       Types of Terminology Web Services

4.3.1        Definition of Terminology Web Services

4.3.2        Groups (and layers) of abstract terminology services

4.3.3        Illustration of TS assisted search process

4.3.4        Terminology Web Services review

4.4       Mapping

4.5       Automatic classification and indexing

4.6       Text mining and information extraction

4.7       General sources for work in TS

5      Review of current terminology service activity

5.1       JISC related activity

5.1.1        Archaeology Data Service (ADS)

5.1.2        Co-ODE: Collaborative Open Ontology Development Environment

5.1.3        geoXwalk Gazetteer Service

5.1.4        High Level Thesaurus (HILT)

5.1.5        Learning and Teaching Portal (Portals Programme)

5.1.6        Mersey Libraries, Archives Hub and Cheshire

5.1.7        Resource Discovery Network (RDN)

5.2       Other UK activity

5.2.1        COHSE Conceptual Open Hypermedia Project

5.2.2        FACET

5.2.3        FATKS

5.2.4        FISH Interoperability Toolkit

5.2.5        NHM Nature Navigator and other Scientific Taxonomic Projects

5.2.6        OpenGALEN

5.2.7        SKOS (Simple Knowledge Organisation System)

5.2.8        STAR (Semantic Technologies for Archaeological Resources)

5.3       International activity

5.3.1        Alexandria Digital Library

5.3.2        E-Biosci : EC platform e-publishing and info integration in Life

5.3.3        Renardus

5.3.4        Simile Piggy Bank

5.3.5        SPIRIT

5.3.6        OCLC and OCLC Research

5.4       Projects in relation to vocabulary lifecycle framework

5.5       Repositories

5.6       Augmenting existing programmes and projects

6      Standards

6.1       Design

6.2       Representations

6.3       Identification of concepts, terms and vocabularies

6.3.1        URIs

6.3.2        Practical experience

6.3.3        Further issues

6.4       Protocols, profiles and APIs

6.4.1        Protocols to access a vocabulary

6.4.2        Protocols to support query

6.5       Related standards

7      Conclusions

8      References (by main sections of the review)

 


Executive Summary

 

Purpose

Over the next two years, as part of its Capital Funding Programme, the Joint Information Systems Committee (JISC) is supporting further work to realize a rich information environment within the learning and research communities. This review is intended to inform JISC’s planning for future work related to Terminology Services and Technology, as well as to provide useful background information for participants in future calls, whether specifically featuring terminology or where terminology can be used to underpin other services.

Overview of report contents

This report reviews vocabularies of different types, best practice guidelines, research on terminology services and related projects. It discusses possibilities for terminology services within the JISC Information Environment and eFramework.

 

Terminology Services (TS) are a set of services that present and apply vocabularies, both controlled and uncontrolled, including their member terms, concepts and relationships. This is done for purposes of searching, browsing, discovery, translation, mapping, semantic reasoning, subject indexing and classification, harvesting, alerting etc. Indicative use cases are discussed.

 

One type of TS attempts to increase consistency and improve access to digital collections and Web navigation systems via vocabulary control. Vocabulary control aims to reduce the ambiguity of natural language when describing and retrieving items for purposes of information searching. Another type of TS is not concerned with consistency but with making it easier for end-users to describe information items and to have access to other users’ descriptions. This results in vocabularies (folksonomies) that may not be controlled, at least initially. The report reviews different kinds of vocabularies, according to their structure and their intended purpose. Potential benefits and return on investment are discussed. Named entity authority and social tagging services are discussed in some detail. Pointers are given on best practice guidelines and networked access to vocabularies, including key issues for future terminology registries.

 

The wider context of TS is considered. Relevant literature on user studies is reviewed. TS are located within an information lifecycle and within the JISC IE. Suggestions are made towards a more specific definition of Terminology Web Services within the JISC IE. Current work on Terminology Web Services is reviewed, along with work on mapping, automatic classification/indexing and repositories. Current projects that involve TS activity (JISC, UK, and international) are briefly reviewed.

 

Relevant standards are discussed, particularly for vocabulary representation; identification of concepts, terms and vocabularies; protocols and APIs.

Key points

TS can be machine-to-machine (m2m) or interactive, user-facing services and can be applied at all stages of the search process. Services include resolving search terms to controlled vocabulary, disambiguation services, offering browsing access, offering mapping between vocabularies, query expansion, query reformulation, and combined search and browsing. These can be applied as immediate elements of the end-user interface or can underpin services behind the scenes, according to context. The appropriate balance between interactive and automatic service components requires careful attention.

 

Return on investment should be considered in any service provision. There are various types of vocabularies serving different purposes, with different degrees of vocabulary control, richness of semantic relationships, formality and editorial control. There is a range of TS options, both interactive and automatic. There is potential for piloting TS to augment existing JISC programmes and projects.

 

TS are sometimes contrasted with free text searching, assisted by statistical Information Retrieval techniques in automatic indexing and ranking. These are not, however, exclusive options and there are opportunities in exploring different combinations of the two approaches. It should be noted that Web search engines have introduced elements of TS, by offering synonym and lexical expansion options. Thus TS should not be seen as antithetical to free text searching and can augment it.

 

There are many existing vocabularies. Different arrangements regarding ownership, maintenance and licensing of vocabularies can be found. The issue of who will maintain a vocabulary and the basis on which it can be described or made available in a registry needs investigation since this underpins systematic use of vocabularies in the JISC IE. This involves establishing business models for access to and maintenance of vocabularies.

 

Mapping is a key requirement for semantic interoperability in heterogeneous environments. Although schemas, frameworks and tools can help, detailed mapping work at the concept level is necessary, requiring a combination of intellectual work and automated assistance. The impact on retrieval is a key consideration.

 

Automatic classification and indexing tools are important for addressing the potential resource overheads in applying TS to indexed collections and repositories. Some tools are emerging that should be investigated for JISC purposes. Many argue for a combination of intellectual and automatic methods.

 

It is important to consider how people search for information when designing and evaluating TS, in order to reduce the scope for design errors and increase the possibility that services will actually be used. User studies should be conducted where feasible in ongoing project work.

 

TS should not be seen as an isolated, free-standing component. TS need to be considered within the wider context of the JISC IE, and need to be integrated with other components of the eFramework. They should be seen as forming a set of services that can be combined with a wide range of other services. There is a need for specifications of TS and their workflow, as part of the JISC IE.

 

Interoperability requires commonly agreed standards and protocols. Standards exist at different levels and types of interoperability. The prospect is emerging for a broad set of standards across different aspects of terminology services - persistent identifiers, representation of vocabularies, protocols for programmatic access, vocabulary-level metadata in repositories. Such standards are an infrastructure upon which future TS will rest but it is not feasible to wait for international agreements; international consensus will be influenced by operational experience. Pilot TS projects should orient to existing potential standards (in persistent identifiers, representations, protocols for programmatic access) and help to evaluate and evolve them.

Recommendations

The review was asked to include: “recommendations for further activities needed in this field, and the extent to which JISC should be involved in the work (both short and longer term), including collaboration with other organizations as a possible form of involvement”. The following recommendations are listed according to the relevant section of the review, where further context may be found.

 

1.         Introduction   

1.1          Purpose of this review                        
·        Terminology services can support various stages of the information lifecycle
·        JISC should highlight subject access and terminology services in all relevant JISC programmes, whether as extensions to existing projects or as new projects

 

1.2       Terminology Services overview

·        Demonstrate integration of Terminology Services with other components of the JISC Information Environment. (See also Recommendation 4.3)

 

 

1.2.3    Combination of terminology tools and techniques  

·        Encourage inter-disciplinary collaboration in the development of terminology services and co-operation with memory institutions and archives

·        Investigate different combinations of TS and uncontrolled (non-TS) search

 

1.3.2    Return on investment           

·        Investigate methods to make vocabularies available to the education sector through a Registry, initially for experimentation purposes but ultimately in a sustainable, maintained, licensed manner. (See also Recommendation 3.7)

 

2          Use cases - scenarios           

 

3          Types of vocabularies           

 

3.1       Vocabularies by structure

 

3.2       Vocabularies by purpose      

·        Descriptions of intended purposes of a vocabulary would be a useful element of a vocabulary registry (see also Recommendation 3.7).

 

3.2.4    eLearning purposes  

·        User studies of behaviour by indexers (cataloguers), students, teachers. Investigate how to support effective practice with a variety of indexing and retrieval tools

 

3.2.5    eScience purposes

 

3.3       Named entity authority and disambiguation services

·        Investigate lists of institutional names and academic affiliations (IESR Agents etc.)

           

3.4       Social tagging and folksonomies

 

3.7       Terminology Registries

Demonstrate the use of a terminologies registry within the JISC IE testbed, to include

·        Investigating inclusion of terminologies into IESR, potentially describing vocabularies as collections

·        Developing a marketing proposition for a UK terminology registry (including use scenarios, IPR issues, business models, cost benefit)

·        Evaluating use of the draft metadata description profile proposed by NKOS

·        Maintaining collaboration between various UK initiatives (with eScience e.g. GRIMOIRES and learning communities e.g. Becta Vocabulary Tool) and internationally (e.g. NSDL)

 

4          Activities with TS      

4.1       Studies and models of information seeking behaviour       

           

4.3       Types of Terminology Web Services

 

4.3.4    Terminology Web Services review

 

4.4       Mapping        

 

4.5       Automatic classification and indexing          

 

4.6       Text mining and information extraction

·        Investigate relationship between KOS and text mining:

           

5          Review of current terminology service activity      

 

5.5       Repositories  

-                 use of subject classification and

-                 use of specialised KOS vocabularies

-                 use of author assigned keywords

-                 full text indexing

 

5.6       Augmenting existing programmes and projects

 

6          Standards      

·        JISC should encourage participation in international standardisation activities

 

6.1       Design           

 

6.2       Representations        

 

6.3       Identification of concepts, terms and vocabularies 

 

6.4       Protocols, profiles and APIs 

 


1.   Introduction

1.1      Purpose of this review

Over the next two years, as part of its Capital Funding Programme, the Joint Information Systems Committee (JISC) is supporting further work to realize a rich information environment within the learning and research communities. This review is intended to inform JISC’s planning for future work related to Terminology Services and Technology, as well as to provide useful background information for participants in future calls, whether specifically featuring terminology or where terminology can be used to underpin other services. The review is intended to identify useful areas of activity and highlight current initiatives of interest rather than be comprehensive or prescriptive. The review will recommend a number of areas with potential, either for further investigation, or for the development of tools or demonstrator services.

 

JISC’s interest in terminology services is part of its strategy for shared infrastructure services underpinning resource discovery, both m2m and user-facing services. Within the education sector, there is interest from service-provider, developer and service user representatives. With adoption of a Services Oriented Approach (SOA), there is potential for a granular approach to terminology services, with different services arising from and being maintained by different communities. There is potential for re-use of some widely-applicable services to support learning, teaching and research, with other providers (possibly including other public sector bodies, research communities, professional societies and so on) providing more specifically-focused services. 

 

The recommendations made by this report are based both on a review of current activity and on contacts made with a number of interested parties. There is some overlap of topic with the JISC Shared Infrastructure Services Review and the JISC Pedagogical Vocabularies Project. This review has a more specific focus on Terminology Services, making reference to the other studies as appropriate. Multilingual vocabulary support, translation support, spelling correction and dictionary services are considered out of scope. References are provided at the end of the report, organized by section.

 

Recommendations: 
Terminology services can support various stages of the information lifecycle.    
JISC should highlight subject access and terminology services in all relevant JISC programmes whether as extensions to existing projects or as new projects

 

1.2      Terminology services overview

Terminology Services (TS) are a set of services that present and apply vocabularies, both controlled and uncontrolled, including their member terms, concepts and relationships. This is done for purposes of searching, browsing, discovery, translation, mapping, semantic reasoning, subject indexing and classification, harvesting, alerting etc. They can be m2m or interactive, user-facing services and can be applied at all stages of the retrieval process.

 

TS can be confusing in that they span very different application areas, vocabularies, communities, and can provide quite different kinds of services. They can be applied as immediate elements of the end-user interface (e.g. pick lists, browsers or navigation menus, search options) or can underpin services behind the scenes.

 

TS need to be considered within the wider context of the JISC Information Environment, and need to be integrated with other components of the environment and with other services (Section 4.3). Standard representations, protocols and APIs need to be defined to enable programmatic access and encourage interoperability (Section 6).

 

Recommendation: Demonstrate integration of Terminology Services with other components of the JISC Information Environment. (See also Recommendation 4.3)

 

Vocabularies are often associated with control of subject (or topic) metadata. This includes the major bibliographic or educational subject classifications, thesauri used for subject indexing, species taxonomies, etc. Other types of metadata can also benefit from vocabulary control, prominent examples including place names, personal names, genre and various descriptors of educational context in eLearning.

1.2.1      Controlled vocabularies

One type of terminology service attempts to increase consistency and improve access to digital collections and Web navigation systems via vocabulary control. Vocabulary control aims to reduce the ambiguity of natural language when describing and retrieving items for purposes of information searching.

 

Controlled vocabularies consist of terms: words from natural language that the vocabulary designers have selected as useful for retrieval purposes. A term can be one or more words. A term is used to represent a concept.

Two features of natural language (synonymy and ambiguity) pose potential problems.

a)      Different terms (synonyms) can represent the same concept.

b)      The same term (a homograph) can represent different concepts.

 

A controlled vocabulary can attempt to reduce ambiguity between terms by:

- defining the scope of terms, that is, how they are to be used within a particular vocabulary

- providing a set of synonyms or effective synonyms for each concept

- restricting scope so that terms only have one meaning (and relate to only one concept).

 

Not all vocabularies provide all three features above. Some are just simple lists of authorized terms (authority lists). Controlled vocabularies also provide vocabulary for Knowledge Organization Systems (KOS), which additionally structure their concepts via different types of semantic relationship. Types of KOS are discussed in Section 3.
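
The three features listed above can be illustrated with a small sketch. This is not drawn from any of the systems reviewed here: the class names, the resolve function and the sample terms are all assumptions made for illustration. It shows synonyms leading to one concept, and a homograph leading to several concepts whose scope notes could be offered to the user for disambiguation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept in a controlled vocabulary (illustrative structure only)."""
    preferred_term: str
    scope_note: str = ""
    synonyms: list[str] = field(default_factory=list)


class ControlledVocabulary:
    def __init__(self, concepts):
        # Index every preferred term and synonym (case-insensitively)
        # against the concepts it may refer to.
        self._index = {}
        for concept in concepts:
            for term in [concept.preferred_term, *concept.synonyms]:
                self._index.setdefault(term.lower(), []).append(concept)

    def resolve(self, term):
        """Return the concepts a search term may refer to.

        One result: the term is unambiguous in this vocabulary.
        Several results: the term is a homograph, and the scope notes
        can be shown so that the user chooses the intended concept.
        """
        return self._index.get(term.lower(), [])


# Sample entries invented for this sketch.
vocabulary = ControlledVocabulary([
    Concept("Automobiles", "Road vehicles for passengers", ["Cars", "Motor cars"]),
    Concept("Mercury (planet)", "The planet nearest the Sun", ["Mercury"]),
    Concept("Mercury (element)", "The chemical element Hg", ["Mercury", "Quicksilver"]),
])

cars = vocabulary.resolve("cars")        # synonym resolved to one concept
mercury = vocabulary.resolve("Mercury")  # homograph: two candidate concepts
```

A simple authority list would correspond to using only the preferred terms, with no synonyms or scope notes recorded.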

1.2.2      Folksonomies

Another type of TS is not concerned with consistency but with making it easier for end-users to describe information items and to have access to other users’ descriptions. This results in vocabularies that may not be controlled, at least initially. In principle, this is not a new type of terminology but novel web applications have gained attention recently. Various neologisms have emerged for this activity, including social tagging and folksonomies. It is seen by some to hold promise of reducing indexing costs and perhaps most significantly, encouraging end-user participation in information services and contributing to community building. However it has yet to be evaluated for educational purposes and existing social tagging applications have not been designed with general retrieval in mind. Folksonomy-based terminology services are discussed in Section 3.4.

 

1.2.3      Combination of terminology tools and techniques

TS are sometimes contrasted with free text searching, assisted by statistical Information Retrieval techniques in automatic indexing and ranking. These are not, however, exclusive options and there are opportunities in exploring different combinations of the two approaches. It should be noted that Web search engines, such as Google, have introduced elements of TS, by offering synonym and lexical expansion options. Thus TS should not be seen as antithetical to free text searching and can augment it.

 

In general, different disciplines make use of vocabularies and can contribute to TS, including Artificial Intelligence, Human-Computer Interaction, Information Retrieval, Library & Information Science and Natural Language Processing.

 

Recommendations:

Encourage inter-disciplinary collaboration in the development of terminology services and co-operation with memory institutions and archives

Investigate different combinations of TS and uncontrolled (non-TS) search

 

1.3      Cost benefit issues

Various cost benefit issues relating to terminology services should be considered.

1.3.1      Benefits

Terminology Services enable users to undertake educational and research inquiries more effectively. When searching free text with uncontrolled terms, significant differences can stem from trivial variations in search statements and from differing conceptualisations of an information need. Different people use different words for the same concept or employ slightly different concepts. It can be difficult for non-specialists to employ technical vocabulary and variation in person or place names can frustrate consistent access. This may not be a problem if the purpose is just to obtain a few relevant items as examples of a topic. However, when the purpose is an in-depth educational review or systematic research on a specialized topic then it is undesirable to miss potentially relevant items. These problems can be helped by various Terminology Services.

 

At the simplest level, a controlled list of terms ensures consistency in searching and indexing, helping to reduce problems arising from synonym and homograph mismatches. Name authorities are an important example.

 

At a more complex level, the presentation of concepts in hierarchies and other semantic structures helps the indexer and searcher choose the most appropriate concept for their purposes. Browsing based user interfaces become possible.

 

A KOS can assist both precision (by allowing specific searching) and recall (by retrieving items described by related concepts or equivalent terms). It also provides potential pathways (for human and machine) that connect a searcher’s and an indexer’s choice of terminology. The more formal specification of specific semantic relationships in an ontology can assist applications where rules are specified on the relationships and logic-based inferencing is appropriate.

 

The use of uncontrolled vocabularies may encourage end-user participation in social indexing or tagging and help build user communities for an application.

 

Many mapping and semantic interoperability applications depend upon KOS of different types, as do other downstream applications.

1.3.2      Return on investment

The return on investment (ROI) should be considered. There are many different kinds of vocabularies, with different degrees of vocabulary control, richness of semantic relationships, formality, editorial control - all serving slightly different purposes (see Section 3).

 

Different arrangements regarding ownership, maintenance and licensing of vocabularies can be found. The issue of who will maintain a vocabulary and the basis on which it can be described or made available in a registry needs investigation since this underpins systematic use of vocabularies in the JISC Information Environment. This would involve establishing business models for access to and maintenance of vocabularies.

 

Recommendation:

Investigate methods to make vocabularies available to the education sector through a Registry, initially for experimentation purposes but ultimately in a sustainable, maintained, licensed manner. (See also Recommendation 3.7)

 

There is overhead in designing a controlled vocabulary and also in its use for classification or indexing. Thus cost/benefit issues should be considered for the particular application in mind, when deciding on richness of semantic relationships and degree of formality. For example, is a simple controlled authority list sufficient for the purpose?  On the other hand, there are many existing vocabularies and indexed datasets which can be leveraged or combined in larger schemes. There is also potential in (semi)automatic indexing and classification techniques, both in application of products from commercial systems and outcomes of projects in this area (see Section 5). There is also potential in the application of interactive metadata assignment tools and their embedding in application interfaces and project workflow.

2         Use cases - scenarios

In light of the possibilities discussed in this review, there is general potential for TS augmentation - some near term, some longer term. The following high-level scenarios and discussion are intended to illustrate a selection of the benefits TS might offer.

 

Some of the following scenarios are based on the RDN (recently re-launched as Intute) as an example of a prominent information service that could be further improved by various TS. No criticism is intended, as similar points could apply across many JISC information services.

 

2.1      Retrieval performance

Your teacher has given an assignment to find information from the RDN on how vog is relevant to tomorrow’s classes. Unfortunately your attention wandered momentarily at the point when this new word was explained. You do not know if it is something to do with the morning class on Japanese culture and street style or the afternoon’s class on volcanoes and global warming. You do a search with RDN on vog and find no hits. Using a TS that searches a general subject vocabulary, you look up vog and find it is related to volcanic gases. You search RDN with these terms and find relevant resources.

 

This scenario illustrates a hypothetical web service that suggests extra terms to construct or refine a query. It is an abridged version of Vizine-Goetz’s scenario for OCLC Research Terminology Services, which involved a general Web search (http://www.oclc.org/news/publications/newsletters/oclc/2004/266/research.html). That scenario employed Library of Congress Subject Headings (LCSH) as an example of an authoritative and frequently updated general vocabulary. LCSH is sometimes used as a general classification scheme, often along with more specialized vocabularies. Similar TS could be offered by other vocabularies, in more specialized domain applications.

 

Initial stages of a search process (see Section 4.1) may involve a process of exploration or familiarization with details of an information need. As well as general subject vocabularies, online dictionaries or encyclopedias are sometimes used for this general purpose. Various TS could be integrated as an option in the search process, as sources for query terms. Google Toolbar already offers a dictionary service and similar forms of TS can be envisaged.

 

In information retrieval systems, Synonym Rings or Search Thesauri (see Section 3) are used to improve search performance by taking account of synonyms, and also terms from related concepts, when matching a query. A range of TS to improve query performance (both recall and precision) is possible. This includes various query expansion possibilities, where result ranking can be based on degree of semantic match. For example, you may wish to search with very specific terminology; you would be very interested in matches on those concepts and, failing that, would also be interested in matches on closely related concepts. Employing query expansion can combine several search ‘moves’ in the one query.
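
Query expansion with ranking by degree of semantic match might be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any system reviewed here: the relationship weights, the toy thesaurus, the item index and the function names are all invented for the purpose.

```python
# Hypothetical weights for each kind of semantic match (assumed values).
WEIGHTS = {"exact": 1.0, "synonym": 1.0, "narrower": 0.8, "related": 0.5}

# Toy thesaurus, invented for illustration only.
THESAURUS = {
    "volcanoes": {
        "synonym": ["volcanos"],
        "narrower": ["shield volcanoes"],
        "related": ["volcanic gases", "lava"],
    },
}


def expand_query(term):
    """Map the query term and its thesaurus neighbours to match weights."""
    expanded = {term: WEIGHTS["exact"]}
    for relation, neighbours in THESAURUS.get(term, {}).items():
        for neighbour in neighbours:
            # Keep the strongest weight if a term is reachable in two ways.
            expanded[neighbour] = max(expanded.get(neighbour, 0.0), WEIGHTS[relation])
    return expanded


def search(term, index):
    """Rank items by the best-weighted expanded term among their index terms."""
    expanded = expand_query(term)
    scored = []
    for item, index_terms in index.items():
        score = max((expanded.get(t, 0.0) for t in index_terms), default=0.0)
        if score > 0:
            scored.append((item, score))
    # Highest degree of semantic match first; ties broken by item name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy collection: each item with the terms it has been indexed under.
INDEX = {
    "Item A": ["volcanoes"],
    "Item B": ["shield volcanoes"],
    "Item C": ["lava", "geology"],
    "Item D": ["street style"],
}

results = search("volcanoes", INDEX)
```

In this sketch a single query returns the exact match first, then the item indexed under a narrower concept, then the item indexed under a related concept, which corresponds to combining several search ‘moves’ in one query.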

 

Another example scenario from OCLC Research is an item from Dempsey’s Weblog (Aug 18, 2005) on how the catalog can be used to offer access paths, for example via the same author or a similar topic. This is demonstrated with an example from the WorldCat Find in a Library service (http://orweblog.oclc.org/archives/000772.html).

2.2      Name Authorities

You wish to find articles by author D. Smith in your University ePrints repository. The repository allows search by Smith, D. However, a large number of search results are returned in a single list, with several variants of the name (including hyphenated surnames, first names, middle initials) representing several different authors. There is no easy way to disambiguate the different people and achieve a definitive list. There is no online authority file which you could search or browse to select the intended person. The situation becomes even more difficult when the author name occurs as both first name and surname (e.g. Thomas or Michael).

 

Provision of an integrated ‘added value’ name authority service would allow the searcher to disambiguate author names.
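As a rough sketch of how such a service might support the scenario, the example below looks up a name string in an authority file and then retrieves articles by the identifier of the person selected. The records, identifiers, article data and function names are hypothetical and purely illustrative.

```python
# Hypothetical authority file: one record per person, with a stable
# identifier, a preferred form and the name variants found in metadata.
AUTHORITY = [
    {"id": "auth-001", "preferred": "Smith, David A.",
     "variants": {"Smith, D.", "Smith, D. A.", "Smith, David"}},
    {"id": "auth-002", "preferred": "Smith, Deborah",
     "variants": {"Smith, D.", "Smith-Jones, D."}},
]

# Hypothetical repository records that carry authority identifiers.
ARTICLES = [
    {"title": "Soil carbon in upland farms", "author_ids": ["auth-001"]},
    {"title": "Thesaurus mapping in practice", "author_ids": ["auth-002"]},
]


def candidates(name):
    """Return the authority records whose preferred or variant form matches."""
    return [rec for rec in AUTHORITY
            if name == rec["preferred"] or name in rec["variants"]]


def articles_by(author_id, articles):
    """Select the articles that carry the chosen authority identifier."""
    return [a["title"] for a in articles if author_id in a["author_ids"]]
```

A search on "Smith, D." would first present both candidate persons; once the searcher selects one, retrieval by identifier gives a definitive list regardless of the name variant used in each record.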

2.3      Mapping and other TS

This scenario extends the attractive BIOME Alternative Land Uses Case Study by considering mapping and some other TS:

Farm diversification is often held up as a panacea for a time of falling prices at the farm gate. Changing to new farm products or going into organic or conservation grade production is viewed as on route out of the cycle of downward farm gate prices. But also environmental schemes often referred to as agric environment schemes are put forward as a way of stabilising farm incomes and giving benefits to the wider community both rural and urban.

http://www.rdn.ac.uk/casestudies/biome/agriculture/case4.html

 

The scenario discusses two AgriFor resources, resulting from a BIOME search on farm environment schemes. The query is given as a starting point in the case study. However it is not obvious that a student would formulate such a query as a first step. Various TS might help in constructing the query by suggesting controlled terms (as discussed in Section 2.1), or with different forms of query expansion – both synonym and concept expansion.

 

Initial browsing is also a common early stage of the search process. Browsing is available by AgriFor high level categories and relevant items can be found under Economics, Trade and Rural Development/ Agricultural Economics/ Government farm policies. The third level category is not visible on the main Browse screen, so a student would need to select Agricultural Economics when browsing. See Section 2.4 for a discussion on a vocabulary search TS for extended browsing systems.

 

Having found the information item mentioned in the case study, there is no easy way of ‘beaming-up’ to the AgriFor categories, other than the browser back button. Instead, information items are indexed with CAB Thesaurus concepts. This is helpful – knowing how items are indexed is potentially useful for refining a search. The CAB concepts provide an option for navigating through the collection via clicking on an index descriptor. However, the thesaurus structuring of knowledge is not available to further assist the search. Browsing is via an alphabetical list - hierarchical context and related concepts are not available.

 

Combining a classification with a thesaurus for indexing provides excellent resources. More use could be made of the combination. One possibility is to map the two vocabularies together. This might help advanced search facilities, such as query expansion. Another possibility is a greater integration of search and browsing (see the DeweyBrowser, Section 5.3.6).

 

At present, the RDN case studies tend to be isolated within one of the BIOME gateways. Mapping could extend beyond the two vocabularies used inside AgriFor to the BIOME vocabularies, generally. For example, the Natural Selection gateway also contains useful resources for the case study. Natural Selection can be browsed by DDC headings and information items are indexed by free-standing keywords. The collection includes items on Computer Based Learning in Land Use and Environmental Sciences, a journal on natural resource management and restoration, technology for ecology management, DEFRA wildlife and countryside, etc. – all potentially relevant to the case study. A mapping between the DDC headings, the AgriFor categories and the CAB Thesaurus could underpin a variety of TS and access routes. Cross-browsing and cross-searching would be enabled across the two collections.
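A minimal sketch of how a mapping between vocabularies might underpin cross-searching is given below. The mapping pairs, the collection contents and the function names are assumptions made for illustration; they do not reproduce actual BIOME data, and a real service would need to handle partial and one-to-many mappings.

```python
# Hypothetical mapping pairs between concepts in different vocabularies.
MAPPINGS = [
    (("AgriFor", "Government farm policies"), ("CAB", "agricultural policy")),
    (("AgriFor", "Government farm policies"), ("DDC", "338.18")),
]

# Hypothetical collections, each indexed with a single vocabulary.
COLLECTIONS = {
    "AgriFor": {"vocab": "CAB", "items": {"Item A": ["agricultural policy"]}},
    "Natural Selection": {"vocab": "DDC",
                          "items": {"Item B": ["338.18"], "Item C": ["577"]}},
}


def equivalents(vocab, concept):
    """Return the concepts mapped to (vocab, concept), in either direction."""
    key = (vocab, concept)
    found = []
    for left, right in MAPPINGS:
        if left == key:
            found.append(right)
        elif right == key:
            found.append(left)
    return found


def cross_search(vocab, concept, collections):
    """Search each collection with the equivalent concept in its own vocabulary."""
    # Simplification: at most one target concept is kept per vocabulary.
    targets = dict([(vocab, concept)] + equivalents(vocab, concept))
    hits = []
    for name, coll in collections.items():
        wanted = targets.get(coll["vocab"])
        for title, concepts in coll["items"].items():
            if wanted in concepts:
                hits.append((name, title))
    return hits
```

Starting from a single AgriFor category, the searcher retrieves items from both collections without needing to know the vocabulary used to index either of them.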

2.4      Repositories

You wish to search your institutional ePrints repository for articles on a particular subject. Since the coverage is wide, a general vocabulary is available for browsing access, in this case the top 2-3 levels of the Library of Congress Subject Areas, with associated postings. However, it is not clear from the main menu where your subject interest would fall – the terms you usually employ to describe your subject are not mentioned and you don’t feel like browsing multiple sub-menus in the quite extensive browsing classification. In the browser, you try to Find on this page without success. There is no way of searching the vocabulary to find where your interest might fall. You can, of course, search the full text but this relies on a subject keyword appearing in the text.

 

A TS that augmented the general classification with an entry vocabulary of synonyms and allowed search of this extended vocabulary would extend the utility of the retrieval functionality. This would provide additional entry points for browsing. The more extensive the classification and the browsing options, the more useful this will be.
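The idea of an entry vocabulary can be sketched as follows. The class captions and entry terms are a small illustrative sample chosen for the example, and the function name is hypothetical; a real TS would hold a far larger entry vocabulary and would probably support stemming and ranking.

```python
# Illustrative subset of a browsing classification: class code -> caption.
CLASSES = {
    "QA": "Science / Mathematics",
    "S": "Agriculture",
    "HD": "Social sciences / Industries. Land use. Labor",
}

# Hypothetical entry vocabulary: terms users employ -> class code.
ENTRY_TERMS = {
    "farming": "S",
    "statistics": "QA",
    "land tenure": "HD",
}


def find_entry_points(query):
    """Return (code, caption) pairs whose caption or entry terms match the query."""
    q = query.strip().lower()
    hits = {code for code, caption in CLASSES.items() if q in caption.lower()}
    hits |= {code for term, code in ENTRY_TERMS.items() if q in term}
    return sorted((code, CLASSES[code]) for code in hits)
```

A search on "farming" leads to the Agriculture class even though the word does not appear in any caption on the browse screen, which is the additional entry point the scenario calls for.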

 

This scenario assumes that subject search of a University publication repository is a sensible option. Given the probable patchy distribution of coverage in any one University, some form of known item search or author-based search may be more likely. However, subject-based access would be applicable to various types of aggregated repositories in the future.

 

Recommendation: Use cases should be developed and refined on an ongoing basis, along with case studies of TS in practice, user session logging, observation, etc.

3         Types of vocabularies

Descriptions and comparisons of different types of vocabularies are often confusing because the terminology is not controlled and there is also a fair degree of overlap. Furthermore, systems can be compared across different criteria. For example vocabularies differ in structure and levels of complexity but also in the application purposes for which they are designed and used. We first consider vocabularies by their structure and then discuss them according to some major high level purposes or application areas.

 

Recommendation: Provide access to a range of different vocabularies according to context

3.1      Vocabularies by structure

Vocabularies can be considered by their structure (Hodge 2000 and see also BSI, NISO). One way of organizing them is by increasing structural complexity and types of relationship, which is roughly the order of the main divisions in the following discussion. Knowledge organization systems (KOS) are controlled vocabularies, which are organized and structured via different types of semantic relationships.

3.1.1      Term Lists

At the simplest level, Term Lists offer ambiguity control. They are usually unstructured lists and are particularly appropriate when a limited set of options is offered. If made available as a pick list, they can ensure terminology control in interactive indexing and searching applications. Authority Files are used to control variants of named items, such as personal, organizational or place names, and are often presented in alphabetical order. For large Authority Files, a limited hierarchy might be employed to make access easier. See Section 3.3 for more information. Glossaries are lists of terms from a subject domain with accompanying definitions. Dictionaries usually have more general domain application than glossaries and may include different senses of a word. They are always presented alphabetically and may have information on word origins. Gazetteers list place names and may also include coordinate information on locations in various types of ‘footprint’, such as centroid, bounding box, etc. Synonym Rings have recently emerged as a type of term list, offering synonym control in (free text) web search tools. They are not used for indexing purposes but give the option of synonym query expansion of a concept in free text (uncontrolled) search engines. For example, Google has recently added an option of synonym expansion to searching, while domain specific sets of synonyms can be found in search engines for particular websites.

3.1.2      Taxonomies

All taxonomies provide a hierarchical organization of categories. The hierarchical relationship may be loosely or more specifically defined. They usually serve a classification purpose (similar items are grouped into the same “bucket” – see above). As such, they can be considered as (simpler) examples of classification schemes. Complex examples from the Library domain, such as the Dewey Decimal Classification (DDC) and the Universal Decimal Classification (UDC), are considered below under classification schemes.

 

Hierarchical organization of information occurs in many domains and various forms of taxonomies exist, serving different purposes and organized by different types of characteristic of division. Taxonomy is a particularly loose term, with a wide usage even within terminology circles, varying from relatively simple menu systems to complex corporate knowledge bases. Taxonomy is associated with (at least) three different communities: scientific taxonomic systems, website designers and corporate taxonomies. Examples from the sciences include the well known scientific taxonomies – see the life science projects discussed in Section 5.2.5. In website design, taxonomy is the most common term for a variety of terminology systems, sometimes very informal. Taxonomies are used as the basis for menu systems, as a method of organizing a website to facilitate interactive browsing through sections of the website, or to underpin other access mechanisms. In some situations, a very loose hierarchical relationship is employed to structure the menu system. Sometimes the menu structure is dynamically generated from an underlying knowledge base. In some business information environments, with different tailored views possible, we approach a more general corporate knowledge management structure. Various web development applications attempt to provide some form of automatic creation of taxonomies (see eg ch16, Rosenfeld and Morville). However human input is also recognized and the new role of ‘Information Architect’ has emerged. Daniels and Busch (2005a, 2005b), from the company Taxonomy Strategies, review commercial use of taxonomies and discuss best and worst practices, also considering vocabularies with regard to particular Dublin Core metadata elements. They recommend factoring the DC Subject element into separate facets when appropriate and give examples. ROI issues are discussed (see also Rosenfeld and Morville, ch 17-18).

 

To be useful, it is important to remember that more is involved than creating a simple hierarchical structure. Consider an example from a case study of Microsoft’s successful application of taxonomies (considered broadly) to the internal MSWEB, (described in detail in ch20, Rosenfeld and Morville). The Microsoft team’s use of taxonomy encompassed: hierarchical controlled vocabularies with equivalent terms for the same concept; metadata schema of the attributes for a given document type; category labels for the displayed options in menu systems. Their tools included a Vocabulary Manager (supporting the editing of vocabularies and relationships between them, including thesaurus relationships), a Metadata Registry and a URL Cataloguing Service.

 

Recommendation: It is important to consider the broader context and return on investment

 

3.1.3      Subject Headings

Subject headings are controlled lists of subject terms. They often have broad coverage but with shallow hierarchies. They usually allow for ‘coordinated’, composite headings, formed by combining single subject terms according to rules. These rules may be more restrictive than a faceted classification. Well known examples include Library of Congress Subject Headings (LCSH) and Medical Subject Headings (MeSH). They typically have a set of main headings which may be allowed to have subdivisions or qualifiers appended.

3.1.4      Relationship-based KOS

Relationship-based KOS are defined in terms of concepts and more clearly distinguish between different kinds of relationships than the previous KOS structures, while varying in granularity of relationships and degree of formality of definition. There tends to be a practical trade off between expressivity (eg number of relationships) and both interoperability (via common agreement on meaning and use of the relationships) and overhead in design. The common KOS variants tend to overlap in structure but are designed with different purposes in mind.

3.1.4.1  Thesauri

The thesaurus is designed for retrieval purposes and has a restricted set of relationships. The three thesaurus relationships are Equivalence (connects a concept to terms that act as effective synonyms), Hierarchical (broader / narrower concepts) and Associative (more loosely related, ‘see also’ concepts). These are defined by international standards. The British and US standards have recently been revised and extended (BSI still ongoing, NISO). The standards discuss common subtypes of the three relationships. For example, the hierarchical relationship can be specialized into Generic (subclass/superclass), Instance (class/instance) and Partitive (whole-part) relationships. The equivalence relationship connects a concept with a set of equivalent terms, treated as synonyms for the retrieval situations envisaged by the designers, and again various subtypes are possible. Either mono or poly hierarchical structures may be employed.

 

According to the thesaurus standards, assertion of relationships between concepts is governed by strict rules. Some widely used thesauri do not follow all the rules but still appear to function effectively for their purposes. Thesauri tend to be defined for a particular subject domain or family of products and can be large. They are usually employed for descriptive indexing purposes and corresponding search systems. Thesauri can also be used as a query expansion resource in free text search engines (sometimes then referred to as “search thesauri”).
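The three thesaurus relationships can be illustrated with a simple data structure. This is a sketch under stated assumptions: the class and method names and the sample concepts are invented for the example and do not follow any particular standard's data model or exchange format.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred: str
    use_for: list = field(default_factory=list)   # Equivalence: entry terms
    broader: list = field(default_factory=list)   # Hierarchical: a list allows polyhierarchy
    related: list = field(default_factory=list)   # Associative: 'see also' concepts


class Thesaurus:
    def __init__(self):
        self.concepts = {}

    def add(self, concept):
        self.concepts[concept.preferred] = concept

    def resolve(self, term):
        """Map a preferred term or entry term to the concept's preferred term."""
        for c in self.concepts.values():
            if term == c.preferred or term in c.use_for:
                return c.preferred
        return None

    def narrower(self, preferred):
        """Derive narrower concepts by inverting the stored broader links."""
        return sorted(c.preferred for c in self.concepts.values()
                      if preferred in c.broader)


# Hypothetical sample content.
thesaurus = Thesaurus()
thesaurus.add(Concept("cereals"))
thesaurus.add(Concept("wheat", use_for=["Triticum"], broader=["cereals"],
                      related=["flour milling"]))
thesaurus.add(Concept("barley", broader=["cereals"]))
```

Resolving an entry term to its preferred term supports indexing and search with controlled descriptors, while the narrower and related links are the raw material for the query expansion uses mentioned above.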

3.1.4.2  Classification Schemes

Classification Schemes in many ways are similar to Taxonomies (above). The more complex classifications, with well defined hierarchical relationships should be considered as relationship-based KOS. Well structured classification schemes are mono-hierarchical, conforming to the principles of exhaustivity (covering all relevant subjects) and mutual exclusivity. Complex schemes, such as DDC, augment a concept with a wide range of auxiliary information and connections, including sets of effective synonyms, ‘see also’ cross-references within the scheme, direct and looser (eg co-occurrence) mappings to concepts in related schemes or thesauri, etc.

 

There are two approaches when dealing with compound subject descriptions that combine individual concepts. In enumerative schemes, all legitimate combinations are explicitly specified in the scheme and located at a place in the class structure. Any new compound subject must be explicitly added to a new version of the scheme. Alternatively, there may be rules to express valid combinations (synthesis rules) by combining atomic concepts and this allows a much wider range of subjects to be described than is practical to explicitly enumerate. Such schemes are called synthetic. In practice, there are also hybrid approaches.

3.1.4.2.1   Faceted Classification Schemes

Faceted systems apply facet analysis to the process of synthesizing complex descriptions from atomic elements. The term, facet, is used in different ways which gives rise to some confusion. In this context, it normally refers to a set of fundamental categories (as appropriate to an application domain) and their combination according to rules. Each fundamental category might itself be a class hierarchy. Most commonly the different facet dimensions are mutually exclusive. Single concepts from different facets are combined together when indexing an object - or forming a query. Often this is a simpler and more logical organization than attempting to form a single hierarchy that encompasses all different possible combinations of (e.g.) objects and materials and agents.

 

Faceted browsing interfaces to web databases are useful when a user is able to orient to the initial display and various commercial search engines now offer this facility. Pollitt’s HIBROWSE system demonstrated the potential of browsing facet hierarchies and interactively combining terms from several facets to refine a query (Pollitt, 1997). The Flamenco system dynamically generates previews of query results as the user browses different facets (Hearst et al. 2002; Yee, 2003). Some user evaluation has been conducted and Flamenco is now available on an open source basis. In the UK, the Adiuri faceted system has been used to develop Web interfaces to some JISC Projects (see the Common Information Environment (CIE) Demonstrators in Section 5.1.5). Faceted, ‘filter-flow’ interfaces can guide the user through a set of choices, dynamically updating the range of options with each choice. Faceted approaches to searching may also be helpful in situations where query rather than browsing is appropriate (e.g. deep/unfamiliar hierarchies) or when query preview is impractical (Tudhope et al. 2006).
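The ‘filter-flow’ behaviour described above, where each choice updates the remaining options, can be sketched as follows. The facets, items and function names are hypothetical, and the sketch treats each facet as a flat list of values rather than a hierarchy; it is not a description of how HIBROWSE, Flamenco or Adiuri are implemented.

```python
from collections import Counter

# Hypothetical collection: each item carries one value per facet.
ITEMS = [
    {"title": "Oak chair", "Object": "chair", "Material": "oak", "Period": "Victorian"},
    {"title": "Oak table", "Object": "table", "Material": "oak", "Period": "Georgian"},
    {"title": "Iron chair", "Object": "chair", "Material": "iron", "Period": "Victorian"},
]
FACETS = ("Object", "Material", "Period")


def refine(items, selections):
    """Keep the items that match every facet value selected so far."""
    return [i for i in items if all(i[f] == v for f, v in selections.items())]


def preview(items, selections):
    """Count the remaining values in each facet that has not yet been selected."""
    remaining = refine(items, selections)
    return {f: dict(Counter(i[f] for i in remaining))
            for f in FACETS if f not in selections}
```

After the user selects Object = chair, the preview shows only the Material and Period values that still lead to results, together with their counts, so that a choice never produces an empty result set.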

 

A somewhat simpler notion of facet is prevalent in many of the Web interface contexts and in the USA. Here facets are often different metadata elements and there is little notion of the semantics of combining them (see for example NISO and Rosenfeld and Morville 2003). Facets might include Place, Time, Price, Colour, Audience, etc. and may not always be hierarchically organised. See also the XFML representation for a class of faceted web interfaces in Section 6.2.

 

In the UK, influenced by the work of the Classification Research Group, more complex faceted systems can be found. Facet analysis is applied to different aspects of Subject, all hierarchically organised. Here fundamental categories might include Abstract entities, Objects (of different types), Materials, Agents, Processes, etc. Different types of rules govern the ‘syntactical’ combination of facets (sometimes recursively at lower levels) and an ordering principle is often applied, useful for structured browsing. For more information, see BSI Part 3; Aitchison et al. (2000).

 

Faceted Classification Schemes are similar in some ways to faceted thesauri, such as the Getty Art and Architecture Thesaurus, and to some ontologies (eg Section 5.2.6 - OpenGalen).

 

Recommendation: Consider faceted approaches when developing vocabularies and TS

3.1.4.3  Lexical Databases

The most well known lexical database is Princeton University’s WordNet, which is a general purpose linguistic resource, with a wider range of semantic relationships than thesauri. There are separate databases for nouns, verbs and adjective/adverbs, each with its own set of relations, including hierarchical relationships. WordNet distinguishes between different word senses via domain-independent lexical relationships, including homonymy, antonymy and synonymy (extensive “synsets” are provided). It has been employed in a wide variety of general language processing applications, although other lexical databases might well be used for specialised purposes. An EC Telematics Project produced a EuroWordNet, with different European language versions.

3.1.4.4  Ontologies

The term, ontology, is sometimes used loosely for any knowledge organization system, particularly if it is represented using Semantic Web standards, such as RDF. However as intended for AI modelling and inferencing purposes, ontologies tend to have the most precise and formal definition of relationships of the knowledge systems discussed here. An ontology will contain classes (concepts) and instances of those classes, being objects in the domain. Classes will usually have attributes so that complex objects in the domain can be described. Relationships will include is-a (for class hierarchies), instance, partitive and (sometimes many) domain specific relationships.

 

A distinction is made between detailed domain ontologies (which can be thesauri or classification schemes or enriched versions) and more general upper (foundational) ontologies, which describe fundamental rules and axioms governing relationships and their composition. Core ontologies seek to act as unifying frameworks for a general domain, sometimes bridging different domain ontologies. The CIDOC Conceptual Reference Model (CIDOC CRM) is a widely used example from the cultural heritage domain. Ontologies can be associated with formally defined axioms and rules for processing and combining relationships and are intended for use with logical reasoning systems. Consequently, they are suited to applications with well defined objects and operations (see Section 3.2.3).
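A very small illustration of the kind of inference that the is-a and instance relationships make possible is given below. The classes, the instance identifier and the function names are hypothetical, and the sketch assumes a single superclass per class; real ontologies are normally expressed in a formal language and processed by a dedicated reasoner rather than by ad hoc code.

```python
# Hypothetical class hierarchy (is-a) and instance assertions.
SUBCLASS_OF = {"Wheat": "Cereal", "Cereal": "Crop", "Crop": "Plant"}
INSTANCE_OF = {"field-7-harvest-2005": "Wheat"}


def superclasses(cls):
    """Follow is-a links upwards and return all inherited superclasses."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(instance, cls):
    """Infer class membership from the direct class and its superclasses."""
    direct = INSTANCE_OF.get(instance)
    return direct is not None and (direct == cls or cls in superclasses(direct))
```

Only the membership of the instance in Wheat is asserted; its membership in Cereal, Crop and Plant is inferred by applying the transitivity of the is-a relationship.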

3.2      Vocabularies by purpose

We now consider different broad purposes, communities of practice and intended contexts of use (allowing for some overlap in practice). Folksonomy and name authority services are considered separately due to current topical interest.

 

Recommendation: Descriptions of intended purposes of a vocabulary would be a useful element of a vocabulary registry (see also Recommendation 3.7)

3.2.1      Retrieval purposes

Information retrieval KOS are intended primarily to assist retrieval of resources, originally from bibliographic databases and library catalogues and now from Digital Libraries and the Web. The design rationale is perceived assistance in future retrieval operations. These include classification and indexing, search (including browsing, query and various forms of “intelligent” searching), mapping between KOS (mono and multi- lingual), providing a framework for learning a subject domain or exploring it in order to refine a (re)search question (defining concepts and setting them in context). A KOS might be used both for classification/indexing and searching, or just searching. KOS can be used to support manual cataloguing and also automatic cataloguing activities. KOS range from domain specific KOS to general classification systems, from two hierarchical levels to systems with great depth and breadth of coverage.

3.2.1.1  Classification vs Indexing

The distinction between classification and indexing is important but often misunderstood, especially in new Web developments (Lancaster 2003 is a good text). Both processes assign descriptors or tags to information resources. Both can involve KOS with hierarchical arrangement of concepts. However, classification seeks to group similar items together, whereas indexing seeks to bring out the differences between items, in order to help distinguish them during search. Classification provides an overview and assists organization of material. This structure facilitates methods of access based on browsing, whether browsing library shelves or hierarchical menu systems. Classification Schemes are often associated with a notation or coding scheme that produces an ordering, useful both in shelving and in ranking results of a search. Indexing (eg with a thesaurus) seeks to be more descriptive of an item’s content, as opposed to assigning an item to a broad category. Thesaurus descriptors may be combined during search. The difference is sometimes compared to a table of contents versus a back of book index. While the structure of a classification system and a thesaurus may be fairly similar, in that both consist of hierarchical structures of concepts, they will tend to differ in the exhaustivity and specificity of their application to information items. Thus an information item will generally tend to be classified by fewer, more general concepts from a classification system and conversely will tend to be indexed by several, more specific concepts from a thesaurus.

 

Sometimes a classification and an indexing system are combined to cover both purposes, for example a classification scheme with a thesaurus. This affords much flexibility in browsing interfaces and rich resources for automatic classification and search tools. It can also be very useful in offering different classification-based filters on (thesaurus-based) search results.

3.2.2      Linguistic purposes

KOS are used as resources for various natural language text processing techniques (both automatic and intellectual), including the areas of machine assisted translation and language engineering with named entity extraction, text mining, summarization. Section 6.5 briefly points to some language technology standards. The term, ‘terminology’ is often used by the natural language community to refer to a language-purposed vocabulary.

 

Commonly used general purpose, linguistic resources will tend to contain finer grained relationships for language engineering purposes and are briefly discussed under lexical databases (Section 3.1.4.3). However some information retrieval KOS can be considered as containing elements of linguistic resources, for pragmatic application of natural language techniques. Thus, significant thesauri and classification systems will have a large entry vocabulary of terms considered equivalent for the envisaged use contexts. They may contain extensive scope notes or definitions of different kinds, which can be viewed as linguistic resources.

3.2.3      AI purposes - modeling the entities in a domain

The term, ‘ontology’, derives from metaphysics, a branch of philosophy concerned with the description of reality (Smith 2003). It was adopted by the AI knowledge representation community, although the AI use has some differences. An ontology

… is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.

Practically, an ontological commitment is an agreement to use a vocabulary (i.e., ask queries and make assertions) in a way that is consistent (but not complete) with respect to the theory specified by an ontology. We build agents that commit to ontologies. We design ontologies so we can share knowledge with and among these agents.

A conceptualization is an abstract, simplified view of the world that we wish to represent for some purpose. Every knowledge base, knowledge-based system, or knowledge-level agent is committed to some conceptualization, explicitly or implicitly.

(Gruber)

 

As outlined in Section 3.1.4.4, AI ontologies are formal representations, modelling a knowledge domain with precise definitions and relationships. They are designed to be used by first order logic reasoning systems and are a knowledge representation mechanism for communication between (automatic) intelligent agents. They are often associated with Semantic Web research and database schema integration. They are suited to applications with well-defined objects and operations, situations where it is possible to reach agreement as to the precise definition of concepts (and terms) and where it is useful to define logical rules for processing relationships and possibly inferring new knowledge. These applications tend to have a different focus than retrieval per se, for example elements of analysis in eScience, or automatic generation of new data. Examples might include many scientific applications, where the ontology is a model of currently accepted scientific knowledge and smaller subject domains, such as some business applications. There is overhead in creating (and sustaining) formal representations and in some situations it may not be feasible to come to commonly agreed, precise definitions on abstract or contested concepts (eg some descriptions of human activity). For example, in search applications, where a fuzzy notion of ‘aboutness’ is the basis for indexing or classifying a document, as opposed to an assertion of fact, a less formal approach may be suited.

3.2.4      eLearning purposes

The field of eLearning covers a variety of applications and projects, including work on eLearning Repositories, VLEs and projects with dedicated material. To some extent, vocabulary work in the eLearning and Digital Library fields tends to take place independently and increased cross-fertilisation would be a beneficial future development. Collaborative examples include the ADEPT Project, which investigated the combination of structured vocabularies and visualisation techniques to assist students learn scientific concepts in an applied context (Smith et al. 2004) and the recent RDN/LTSN partnership (Powell and Barker 2004).

 

Recommendation: Increased cross-fertilisation between eLearning and Digital Library fields

 

The eLearning domain has seen an emphasis on standards for Learning Object metadata, with various elements recommended to be taken from relevant controlled vocabularies. The IEEE Learning Object Metadata/IMS Learning Resource Metadata is a standard, which allows comprehensive description of the different aspects of a learning resource. The JISC Centre for Educational Technology Interoperability Standards (CETIS) has useful sets of Metadata standards briefings, along with links to standards and guidelines.

 

Like other domains, eLearning applications attempt to make use of terminology for more effective cataloguing, sharing, discovering and retrieving objects in the domain. However, eLearning has some distinctive aspects, due to the specific nature of the learning resources that are its particular focus. Learning Objects (LOs) are complex entities that can be accessed in different ways and that combine different use perspectives – for example learner(s), teacher, developer, digital librarian. The granularity of learning resources might vary from a complete course to a very fine grained LO, with one precise objective. They may involve multimedia elements and may involve design and control of navigation paths. There is interest in mechanisms that might allow primitive LOs to be defined and combined together in instructional sequences.

 

Potential access points for LOs subsume pedagogical dimensions, in addition to the Library’s traditional subject or topic based description. The different dimensions might not be considered orthogonal; for example, in some situations, appropriate subject description terminology might vary with the intended educational level of a learning resource. Even when considering topical subject description independently, it can be approached from the point of view of a general subject discipline or from a curriculum related perspective. Thus the LO selection process, employed by a student or teacher or indeed indexer, may involve multi-faceted relevance judgements in ascertaining whether a particular LO (perhaps the result of a search) is appropriate, as opposed to bibliographic topical relevance judgements. Knowledge of how people use LO vocabulary elements in practice would be useful.

 

Recommendation: User studies of behaviour by indexers (cataloguers), students, teachers. Investigate how to support effective practice with a variety of indexing and retrieval tools

 

In eLearning applications, vocabularies tend to be taxonomies (and classification schemes) or term lists, although thesauri are also found, particularly for subject description. Within eLearning, the IMS Vocabulary Definition Exchange (VDEX) is an open specification for the representation and exchange of the eLearning vocabulary types mentioned above. Faceted and poly-hierarchical vocabularies can be represented. An Information Model and Guides for Best Practice, Implementation and an XML Binding are available (Fegen 2006). The XML binding allows application of XML style sheets for tailored views. However an RDF binding is not available.

 

Recommendation: Investigate conversion between VDEX and SKOS Core representations for compatible vocabularies (see also Recommendation 6.2)

 

The full IMS LOM data model consists of 9 basic categories: General, Life cycle, Meta-metadata, Technical, Educational, Rights, Relation, Annotation and Classification. See Barker (2005) for an overview and the IMS Meta-data Best Practice Guide for an extensive description. Vocabulary type elements are constrained to be drawn from a specified controlled vocabulary and are Source-Value pairs. The Classification category (which can be repeated) is probably the most relevant for general vocabulary issues, although vocabulary based elements may occur in the other 8 categories, particularly examples of term lists. The Classification element is complex with various sub-elements, allowing description of different Purposes (or perspectives) for classification and, via the Taxon Path, source and taxonomic identification. Free text descriptions and keywords are also possible to complement controlled vocabulary or allow more specific descriptors for particular applications.
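The shape of the Classification category can be sketched as a simple data structure. This is an informal illustration, not the normative LOM information model or its XML binding: the Python class and field names are assumptions, and the sample purpose and taxon values are chosen only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SourceValue:
    source: str   # the controlled vocabulary the value is drawn from
    value: str    # the term selected from that vocabulary


@dataclass
class Taxon:
    id: str       # notation or identifier within the scheme
    entry: str    # caption of the taxon


@dataclass
class TaxonPath:
    source: str                                  # the classification scheme used
    taxa: list = field(default_factory=list)     # ordered from general to specific


@dataclass
class Classification:
    purpose: SourceValue
    taxon_paths: list = field(default_factory=list)
    keywords: list = field(default_factory=list)  # free text complement


def path_label(path):
    """Render a taxon path as 'scheme: caption > caption'."""
    return path.source + ": " + " > ".join(t.entry for t in path.taxa)


# Hypothetical example: a subject (discipline) classification of a resource.
example = Classification(
    purpose=SourceValue(source="LOMv1.0", value="discipline"),
    taxon_paths=[TaxonPath(source="DDC",
                           taxa=[Taxon("600", "Technology"),
                                 Taxon("630", "Agriculture")])],
    keywords=["farm diversification"],
)
```

Because the category can be repeated, a resource could carry one such Classification for subject discipline and another, with a different Purpose and a different source vocabulary, for educational level.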

 

Indexing might draw on various aspects of the Purpose sub-element of Classification. Some tend to be free text, such as accessibility restrictions; prerequisite requirement; skill level; security level; competency. Vocabulary controlled elements often include:

 

A few indicative examples of educational vocabularies are listed in the eLearning references. For a comprehensive listing of educationally oriented vocabularies, see Report 1 from the JISC Pedagogical Vocabularies Project, which also has more details on eLearning vocabularies and related projects in general. This includes the Becta Vocabularies Studio (hosted by the Vocabulary Management Group), which supports editing and maintenance of vocabularies (see also the JISC Shared Infrastructure Services Review). There is also a Vocabulary Bank, a repository for educational vocabularies, with a web services interface and a Tagging Tool. The Vocabulary Studio maintains a central spine, used for dynamically mapping between vocabularies. Vocabularies are represented in the Zthes XML DTD. Vocabulary management software is available from the Vocabulary Management Group built on SchemaLogic’s SchemaServer engine with open source additions. Basic browsing and searching of the vocabularies is supported.

 

Various eLearning tagging tools have been developed (see references) and useful eLearning oriented cataloguing guidelines are available from JORUM and LearnDirect. Currier et al. (2004) see a continued need for tool development to support both cataloguing and search and for guidelines to support effective use of eLearning vocabularies. They describe examples of projects which experienced difficulty in cataloguing LOs and recommend collaborative teams with expertise in subject, pedagogic, metadata and discovery areas. The consistency problems they describe echo studies of indexing/classification practice over the years, showing low intra- and inter-indexer consistency. (See Section 4.1 on studies of user information seeking behaviour.)

3.2.5      eScience purposes

Vocabularies for eScience share the general points relating to vocabularies outlined above. They also retain some of their own particular characteristics. A brief selection is mentioned to introduce some issues.

 

One major feature is coverage of non-textual material as a basic information element. Thus vocabularies exist whose purpose is to describe numerical datasets. These range from controlled term lists and Authority Files to the more structured relationship-based vocabularies. As with eLearning, they may attempt to deal with very fine-grained data elements and may involve non-topical vocabularies, for example physical units or parameter files for experiments, as well as various types of authority name. In some disciplines, there is a move to specialised markup languages for this purpose. For example, in Chemistry there have been moves to link chemical names to molecular structures and to describe experiments in a structured way, for purposes of re-use. Ontologies have been used recently in UK Grid projects (for example MyGrid).

 

There is a long tradition of making use of taxonomies, and various initiatives facilitating Web-based taxonomic resources have made progress in recent years. Some life science projects are briefly reviewed in Section 5.2.5, including NCBI’s Life Sciences Search Engine and Taxonomy Browser. Much effort has gone into vocabulary-based indexing (and searching) of the medical research literature and initiatives such as the UMLS metathesaurus have sought to unify different medical vocabularies. Effort has also gone into indexing abstracts or resources with multiple vocabularies.

 

Recommendation: Studies of user practice with vocabularies describing research data.

3.3      Named entity authority and disambiguation services

Factual data, in the form of named entity authorities, is an important aspect of terminology services. The main function is to identify named entities and use them correctly: (a) improving precision and recall in retrieval by joining different name variants of an identical entity and (b) disambiguating identical name forms that refer to different entities. These are the same general controlled vocabulary problems outlined in Section 1.2.3 but they are intensified with name authorities due to the frequency and importance of their occurrence. These problems can be extensive in a single database or repository. They multiply, however, when using different sources for searching or when building aggregator services. Areas of application include support for indexing, linking, searching, browsing, disambiguation, metadata enhancement and terminology creation. Project Perseus (Crane and Jones 2006) found that about 6-7% of all words in text are named entities, i.e. person and organisational names, places, times and dates.
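
The two functions, (a) joining variants and (b) disambiguating identical forms, can be shown with a minimal sketch. Everything here is hypothetical: the persons, identifiers, documents and function names are invented, and the normalisation is deliberately crude.

```python
from collections import defaultdict

# Hypothetical authority records: identifier -> preferred form and variants.
AUTHORITY = {
    "auth:1": {"preferred": "Smith, John, 1920-1990",
               "variants": ["Smith, J.", "John Smith"]},
    "auth:2": {"preferred": "Smith, John, 1955-",
               "variants": ["Smith, J.", "Smith, John A."]},
}

# Hypothetical documents indexed by authority identifier, not by name string.
DOCUMENTS = {"doc-a": "auth:1", "doc-b": "auth:1", "doc-c": "auth:2"}


def normalise(name):
    """Crude normalisation: lower case, drop full stops and commas."""
    return " ".join(name.lower().replace(".", "").replace(",", " ").split())


def build_index(authority):
    """Map every known name form to the identifiers that carry it."""
    index = defaultdict(set)
    for ident, record in authority.items():
        for form in [record["preferred"], *record["variants"]]:
            index[normalise(form)].add(ident)
    return index


def resolve(name, index):
    """Return candidate identifiers; more than one means the form is ambiguous."""
    return sorted(index.get(normalise(name), set()))


def search(name, index, documents):
    """Retrieve documents for every entity the name form may refer to."""
    idents = set(resolve(name, index))
    return sorted(doc for doc, ident in documents.items() if ident in idents)


INDEX = build_index(AUTHORITY)
```

A search on "John Smith" retrieves both documents indexed under auth:1, whatever name form appeared in them (improving recall). The form "Smith, J." resolves to two identifiers, so the service can ask the user to choose rather than silently merging two people (protecting precision).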

 

Semantic interoperability efforts have aimed to foster consistency by standardising with the help of, primarily, name authority databases and gazetteers or other geographic name authorities. Text and data mining techniques can be instrumental as a support for such authority files and their creation and maintenance or even as an alternative in some of the application areas.

 

In more detail, the results of such efforts are needed to:

a) support keyword assignment and named entity indexing

b) allow and improve automatic indexing of content

c) support advanced searching and browsing

d) allow metadata validation and enhancement operations

e) allow cross-searching/browsing and linking between several information sources

f) identify potential candidate terms for the creation of a suitable and topical domain terminology and to contribute to the building of domain-specific authority files

3.3.1      Name Authority databases

Libraries, especially National Libraries, have a long history of controlling names and creating name authorities. This was originally aimed at authors in the traditional printed publication world, via printed and online catalogues and national bibliographies. In its most advanced form, a name authority lists all known name forms, identifies a preferred form and provides additional biographical and affiliation information, including sources to assist in uniquely identifying an author. Each record carries a local identifier number, which can be used to associate records in literature databases with a unique person. Clearly, this level of authority control is quite expensive. The key part of uniquely identifying an author needs to be carried out by humans, even though there can be a high level of machine assistance.

 

The best-known effort of this kind is the Library of Congress Name Authority File (LCNAF). Name authority records in MARC format can be downloaded free of charge for use in a local library system. In the UK, the British Library (BL) Name Authority List is no longer used by the British Library. Since 1997 the BL has been contributing new personal name headings to LCNAF and a retrospective merging of the files is ongoing.

 

On an international level, several European projects support development and integration of name authority records, emanating primarily from national libraries, for example the LEAF project, Linking and Exploring Authority Files (LEAF).

 

A national effort in the Netherlands, actively integrating academic authors publishing on the Internet with names from the Union Catalogue, is the "National Author Thesaurus" (strictly not a thesaurus). OCLC Pica has been commissioned to develop this for the Dutch national digital academic repository network (DARE). According to presentations by Leo Waaijers, 50% of Dutch authors are covered by the National Union Catalogue and another 40% are expected to be added via matching with the institutional research registration system, Metis. One reason so many additions are anticipated is that authors of academic journal articles are traditionally not covered by the libraries or the National Union Catalogue. Manual additions are expected to lead to full coverage. Final release of the database is expected at the end of 2006. DARE is actively looking for international cooperation. The ePrints UK project aimed to apply name authority to authors' names in the descriptive metadata records the service harvested from UK institutional repositories, but could not find an appropriate source to build upon that met the requirements of the service.

 

There are various discipline-specific and organisational name lists available; however, the requirements of services differ depending on their content.

 

The key issue with name authority files is generating the initial data to populate them. Archivists have always recorded more detail than libraries in name authority files, finding this necessary in order to distinguish between names. The National Register of Archives (NRA) has some 180,000 standardised corporate, personal and family names, each of which needs to be developed from the current skeleton record into a full record by the addition of content and links. There are potentially many thousands more, including some on A2A (Access to Archives).  Developing the name entries in the index into full authority records is a labour-intensive process, and has so far proved an insuperable barrier to the NRA indexes being launched formally as name authority files. Funding is unlikely to be available within The National Archives (TNA) in the foreseeable future.

 

In order to progress development, TNA is keen to collaborate with JISC and other interested bodies. TNA is willing to provide leadership and technical expertise to support the initiative. Preliminary discussions have already taken place with a range of organisations including JISC, The Arts and Humanities Research Council, The Arts and Humanities Data Service, MLA and The Heritage Lottery Fund. It is also planned to include the British Library in discussions.

 

Subject-specific authority files include the ERIC Identifier Authority List, relating to education, and the American Institute of Physics Authority Database, relating to physics and allied sciences. Professional societies such as IUPAC and IUCr in Chemistry and Crystallography maintain lists associated with their World Directories of researchers and members. Person and institutional names could possibly be extracted from universities and research funding agencies (as in the Netherlands), provided privacy legislation does not prevent such re-use. Commercial enterprises are building services on top of personal and institutional name lists, e.g. the CSA-owned Community of Science (COS) claims to have registered about 500,000 researchers from 1600 institutions to assist in identifying people with specific areas of expertise.

 

Reference sources are also authorities. Useful sources for building authority systems include rich, traditional sources such as encyclopaedias and dictionaries, but also the recent, participatory and open encyclopaedia on the web, Wikipedia. This provides authority information about people and organisations, performs name disambiguation, synonym control etc. (Wikipedia). The German National Library has cooperated with German Wikipedia since 2005 on the use of its name authority file, the Personennamendatei (PND). 20,000 out of 100,000 biographical articles in German Wikipedia carry PND numbers (identifiers), which can be used for bi-directional links between Wikipedia and authority records, or bibliographic information about publications in library OPACs (Voss 2005). Project Perseus has also been making use of Wikipedia, finding high levels of correctness. In any case, such reference sources are highly valuable as training data for named entity recognition and text mining purposes.

 

Compared with the authority systems created by libraries, usually the lists developed by disciplines and organizations (in their raw form at least) are not authority files ‘proper’. Since they do not contain sufficient and unique information, they are not very well suited for, say, disambiguating names. Different name variants may still appear, for example because the association between a name and an organisation is often temporary and organisations can be renamed, split up, merged etc. Modern authority and access systems could assist in the necessary upgrade to unique identifiers for people and organizations in the non-library lists.

 

In this context, the benefits of standard formats for authority information become obvious. The Library of Congress Name Authority Format structures rich information, as does ISAAR(CPF) - the International Standard Archival Authority Record for Corporate Bodies, Persons, and Families (Second edition, 2004) - in the archives world. This includes biographical and historical information about persons/organisations, potentially highly useful for disambiguation purposes. Another standard is the National Council on Archives format. When cross-searching information or joining sources, the lack of interchange mechanisms between different standards is a serious problem. Using authority databases is an expensive but effective approach, which greatly improves retrieval performance for users and avoids duplication of details about names in many databases, resources and collections.

 

With regard to (unique) identifiers for names, name authority lists normally use internal record numbers to identify author names. For authority services that establish the unique person, this identifier becomes a unique identifier for the person, e.g. the LC control number in OCLC's service. Nevertheless, other authority lists and services may well have assigned a different unique identifier to the same person, severely hampering cross-search.


Unique identifiers should be based on proper authority records, not just on different name forms found. Another requirement for these IDs to be useful is organised cooperation between service providers running local databases in order to correctly merge or link records for identical persons. Correct name disambiguation requires a proper authority system.
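
The linking of records for identical persons across services can be sketched as a clustering of identifiers. The sketch below assumes that cooperating providers have already agreed pairwise equivalence statements (the hard, largely human, part); it then groups identifiers using a standard union-find structure. The service prefixes, identifiers and function name are hypothetical.

```python
def link_identifiers(equivalences):
    """Group identifiers that have been declared to denote the same person.

    `equivalences` is an iterable of (identifier_a, identifier_b) pairs.
    Returns a sorted list of sorted identifier clusters.
    """
    parent = {}

    def find(x):
        # Locate the representative of x's cluster, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in equivalences:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for ident in parent:
        clusters.setdefault(find(ident), set()).add(ident)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical equivalence statements between three services.
EQUIVALENCES = [
    ("svcA:17", "svcB:x9"),
    ("svcB:x9", "svcC:0042"),
    ("svcA:18", "svcC:0777"),
]
CLUSTERS = link_identifiers(EQUIVALENCES)
```

Note that services A and C never stated directly that svcA:17 and svcC:0042 are the same person; the link follows from their shared mapping to service B. A single wrong equivalence would merge two people, which is why the statements need to rest on proper authority records.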

National coordination efforts have the potential to be recognised as authorities for people clearly belonging to the country, and their identifiers might be widely reused. DARE in the Netherlands runs the "Digital Author Identification" project, carried out in Groningen as part of the DARE "National Author Thesaurus" effort, to investigate and solve these problems at a national level. In the UK, the ePrints UK project intended to establish authority control. In that context, there was a suggestion that HESA (Higher Education Statistics Agency) identifiers (HEI Identifiers for institutions, work on student and staff IDs) might be a building block, perhaps through a national HESA registry.

At an international standards development level, the IFLA Working Group on Functional Requirements and Numbering of Authority Records (FRANAR) aims to study the feasibility of an International Standard Authority Data Number (IFLA).

Recommendations:

Investigate lists of institutional names and academic affiliations (IESR Agents etc.)

Study the coverage of available name authorities in OPACs and academic web publishing (LEAF, CiteSeer and similar)

Engage in international cooperation (OCLC, DARE, LEAF)

Prototype a demonstrator UK Name Authority File, possibly together with the BL and universities (authentication, staff, institution databases), and evaluate its use in a limited application

3.3.2      Other named entity authorities

Apart from personal and organisational names, there are other named entities of relevance to information services for research, learning, memory institutions and Digital Libraries: place and other geographical names, street names and addresses, dates and periods, products and names for sources or objects of study. The use and creation of these authorities are very similar to those discussed for personal names. Thus, only a few pointers are provided here.

 

Place and geographical names have attracted most interest and activity. Quite a lot of geo-referenced information is available in digital form. Although not the focus of this review, localised searching is a prominent Internet search feature with heavy investment from major companies in the business. This probably tends to be dominated by GIS coordinate data and map-based navigation, rather than place names. This is also the case with HEIRPORT (HEIRPORT), the portal provided by the Historic Environment Resources Network, which makes available data drawn from different public bodies about archaeology and the historic environment, in collaboration with ADS, the Archaeology Data Service (see also Section 5.1.1). A map interface translates a search for a location or area into coordinates, which are used to search the database. To what degree place names are used, mapped to coordinates or standardised upon is not immediately clear.
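
The general pattern of translating a location into coordinates and then searching on those coordinates can be sketched as follows. This is an illustration of the pattern only, not a description of how HEIRPORT is implemented: the gazetteer entries, coordinates, record identifiers and function name are all invented.

```python
# Hypothetical gazetteer: place name -> bounding box
# (min_lat, min_lon, max_lat, max_lon). All values are invented.
GAZETTEER = {
    "exampleton": (51.00, -2.50, 51.20, -2.20),
    "sampleford": (53.40, -1.60, 53.50, -1.40),
}

# Hypothetical geo-referenced records held by a service.
RECORDS = [
    {"id": "rec-1", "lat": 51.10, "lon": -2.30},
    {"id": "rec-2", "lat": 53.45, "lon": -1.50},
    {"id": "rec-3", "lat": 51.19, "lon": -2.49},
]


def search_by_place(name, gazetteer=GAZETTEER, records=RECORDS):
    """Translate a place name into a bounding box, then filter by coordinates."""
    bbox = gazetteer.get(name.strip().lower())
    if bbox is None:
        return []  # unknown name: nothing to translate into coordinates
    min_lat, min_lon, max_lat, max_lon = bbox
    return [
        record["id"]
        for record in records
        if min_lat <= record["lat"] <= max_lat
        and min_lon <= record["lon"] <= max_lon
    ]
```

In this arrangement the records themselves carry no place names at all, so retrieval quality depends entirely on the coverage and standardisation of the gazetteer that performs the translation.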

 

In the digital library context, two databases and efforts have been pioneering and still dominate when it comes to global coverage: the database of the Alexandria Digital Library (ADL) and the Getty Thesaurus of Geographic Names (TGN). A good overview can be found in the special D-Lib issue on Georeferencing (Hill 2004). Many countries have national and regional gazetteers, which are valuable sources of place names. Historical place names pose challenges. In Britain they are addressed, for example, by Southall's Great Britain Historic GIS Project (GBHGIS), with the historic boundaries accessible via Edina (UKBORDERS); place names and landmarks from three late 19th century gazetteers can be queried in BBC Vision of Britain through Time (Vision).

 

JISC is funding two important projects in this field: GeoXWalk (GeoCrossWalk) and the recently started GRADE project (GRADE), investigating the reuse of geospatial data.

 

Recommendation: Address the treatment of place and geographical names in UK services and activities, and the development of standards and authorities, in cooperation between these projects and related terminology efforts.

 

Regarding time periods, a recent important project to establish an authority file (or directory, as the project calls it) is the Time Period Directory at the University of California Berkeley, listing named time periods in connection with associated locations (Petras). Interestingly, the initial set of named periods is extracted from a traditional library subject heading system, the Library of Congress Subject Headings, and the project is trying to model a standard after the ADL Gazetteer Content Standard (which operates on place names). The level of integration between these named entity approaches is obvious from the project's further plans: "... development of prototype systems to support the dynamic interaction of Time Period Directories, digital gazetteers, biographical data and ontological structures like thesauri and classification schemes, in combination with a variety of network-accessible digital library resources ranging from library catalogs to archival collections and digitized versions of historical primary resources." The crucial role of topical terminologies is obvious, as are the rich dynamic options for interaction between maps, biographies, timelines and chronologies and primary research materials.
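
Why a named period needs an associated location can be shown with a small sketch. The entries below are illustrative only and are not taken from the Time Period Directory; the record layout and function names are assumptions.

```python
# Illustrative entries only; the layout is an assumption for this sketch.
PERIODS = [
    {"name": "Civil War", "place": "United States", "start": 1861, "end": 1865},
    {"name": "Civil War", "place": "England", "start": 1642, "end": 1651},
]


def lookup_period(name, place=None, periods=PERIODS):
    """Return (place, start, end) for every period matching the name.

    Without a place the same period name may match several entries.
    """
    wanted = name.strip().lower()
    return [
        (p["place"], p["start"], p["end"])
        for p in periods
        if p["name"].lower() == wanted and (place is None or p["place"] == place)
    ]


def periods_covering(year, place, periods=PERIODS):
    """Return the names of periods at a place that include the given year."""
    return [
        p["name"]
        for p in periods
        if p["place"] == place and p["start"] <= year <= p["end"]
    ]
```

The same lookup, run in the other direction, turns a date and a place found in a record into a named period that can be used as an access point.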

 

The previously mentioned Perseus project (Perseus) has broad activities with regard to named entities. It has explored text mining techniques for creating different authority lists for predominantly historical texts and has developed useful visualization and navigation options based on this. While the lists themselves might not directly be useful for very different geographical, historical and cultural contexts, the approaches and techniques almost certainly are (for an overview, see Crane 2004).

 

For each geographical and historical/cultural context, separate efforts to create authority databases are usually required. They might later add up to broader coverage via international and inter-disciplinary cooperation, if this is deemed useful. However, in the natural and biological sciences particularly, named entities have to be managed and standardised by international bodies and professional organisations. This is happening to a varying degree and many efforts do not necessarily reach the level of established and used standards. In Chemistry, for example, there are International Chemical Identifiers, chemical formulae, IUPAC chemical names and the so-called Colour Books for other terminology, all at varying degrees of standardisation and alongside commercial "standards" such as those from Chemical Abstracts.

 

Recommendation: Support active participation of UK institutions in international naming standardisation efforts in scientific disciplines and, via project support, assist their implementation in the UK.

3.3.3      Named entity recognition, text mining, name disambiguation

Named entity recognition (e.g. via text mining methods) cannot fully replace authority systems: since it only identifies text strings representing named entities, it cannot finally decide which name forms are correct and which are fully equivalent to a given unique named entity. From the context a name appears in, certain assumptions can be made with different degrees of probability, via approaches such as co-reference resolution to identify variants of names for the same object. However, in combination with existing authority files, the performance of these techniques can be greatly improved. Vice versa, authority files and services can be expanded and improved, based upon text mining from publications and other suitable sources.
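
The difference between recognising a string and identifying an entity can be illustrated with a deliberately naive sketch. The recogniser here is just a pattern for runs of capitalised words, which is far cruder than real text mining tools; the sample sentence, the authority entries and the identifiers are all hypothetical.

```python
import re

# Naive recogniser: one or more consecutive capitalised words.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")

# Hypothetical authority file: normalised name form -> identifier.
AUTHORITY = {
    "jane doe": "person:1",
    "richard roe": "person:2",
    "exampleton": "place:1",
}

TEXT = ("In 1850 Jane Doe wrote to Richard Roe about the survey. "
        "Roe replied from Exampleton.")


def recognise(text):
    """Step 1: find candidate strings. No decision is made about identity."""
    return CANDIDATE.findall(text)


def identify(text, authority=AUTHORITY):
    """Step 2: confirm candidates against the authority file where possible."""
    return [(c, authority.get(c.lower())) for c in recognise(text)]


RESULTS = identify(TEXT)
```

Recognition alone proposes "In" as a candidate and cannot tell that the second "Roe" is the same person as "Richard Roe"; the authority file confirms three of the five candidates and leaves the rest unresolved. Resolving "Roe" would require co-reference resolution, and unresolved candidates are also potential additions to the authority file.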

 

Name disambiguation services work best based upon appropriate and comprehensive authority services. In narrow disciplines and more coherent collections, named entity recognition and co-reference resolution may be capable of providing a sufficient level of disambiguation, depending on ambitions for the quality of service and a thorough cost-benefit analysis. All approaches covered in this section essentially assist name disambiguation to a greater or lesser degree, but do not necessarily provide a complete solution.

 

In a broader context, text and data mining can be used to improve both repositories and aggregator services, with respect to subject access and terminology use (see Section 4.6). Mining can be applied for various purposes, and the specific techniques and approaches needed will differ accordingly. Particular techniques will be used in relation to named entities. Text mining can be applied in order to identify, link, search, browse, index and extract named entities, such as author and institutional names or chemical compounds and their components, by identifying more or less standardised named entities, or their substrings, in textual materials and publications. Probably equally important is the extraction of relevant topical terms and phrases. Apart from repository documents, other collections such as corpora of text and data and reference works are needed.

 

When using data mining with e-science databases and data-centric repositories, additional features might be needed for knowledge extraction and hypothesis creation, data selection and comparison, correlations, calculations etc. Not all disciplines and sources are equally suited to mining and knowledge extraction. Lynch (2006) points out that certain disciplines may be in a favourable position for early and successful exploration of such options and highlights the role of terminologies: "Areas such as biomedicine or chemistry, where much of the literature is relatively well-structured and where a base of investment in the ontologies, specialized vocabularies and vocabulary mappings and similar tools has been extensive, would likely be fertile ground for early advances." Lynch also notes the important role of incorporating markup to facilitate computational processing.

 

There is a rich literature on text and data mining methods and techniques (e.g. Witten and Frank 2000). The next section outlines some projects and tools.

 

Recommendation: Apply methods of name extraction and investigate their benefits compared to and in combination with traditional authority systems. Build and evaluate different name disambiguation demonstrators.

3.3.4      Tools, Web services

The Library of Congress, National Libraries, LEAF project partners and OCLC have run prominent projects and have tools for creating and maintaining the traditional name authority databases. OCLC runs VIAF, the Virtual International Authority File project. Using its software for matching and linking authority records for personal names, authority records from Die Deutsche Bibliothek are matched to the corresponding authority records from the Library of Congress. Shared OAI servers will maintain the authority files and provide user access to the files (VIAF). The combined approach involving national data about researchers from publication repositories is represented by OCLC Pica, which carries out this task for DARE.

 

As a value-adding service, OCLC Research developed a Name Authority control service (an interactive and automated name authority look-up service, with web access to authority records in its Linked Authority File), originally for the ePrints UK project (LC Name). It is intended as metadata creation support, to be hooked up with templates in the DSpace repository software package. Created as web services, this and other developments of the Metadata Switch project/Terminology services project at OCLC (OCLC Metadata) can provide remote functionality that enhances semantic interoperability and can be plugged into local applications.

 

The Perseus Project at Tufts University has a long-term specialisation in named entity recognition and mining, achieving quite good results (Smith 2002). They are planning an open source release of the software tools and to offer a service (Crane and Jones 2006). The GATE project (GATE) has developed tools for automatic tagging of personal names (the technique is now integrated into the Greenstone DL software as well). The University of Sheffield, where GATE is developed, plans to use text mining on digital 18th century materials (Armadillo).

 

Elsevier's Scopus Author Identifier aims to automatically match name variations and to disambiguate between similar names (STLQ). CiteSeer has made serious efforts in name correction (e.g. with user participation) and name disambiguation using clustering methods based on naïve Bayes and SVM models (Han et al. 2005).
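
To give a flavour of how a naïve Bayes model might be used for name disambiguation, the following is a generic textbook sketch and should not be read as the method of Han et al. or of CiteSeer. It assumes that each citation is reduced to a bag of features (here co-author names and title words) and that some citations have already been assigned to the correct person. The author identities, features and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: citations already attributed to one of two
# different people who both publish as "J. Smith".
TRAINING = [
    ("smith_j_1", ["coauthor:lee", "word:thesaurus", "word:indexing"]),
    ("smith_j_1", ["coauthor:lee", "word:vocabulary", "word:mapping"]),
    ("smith_j_2", ["coauthor:garcia", "word:protein", "word:folding"]),
    ("smith_j_2", ["coauthor:garcia", "word:enzyme", "word:kinetics"]),
]


def train(examples):
    """Count citations per author and feature occurrences per author."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for author_id, features in examples:
        class_counts[author_id] += 1
        feature_counts[author_id].update(features)
        vocabulary.update(features)
    return class_counts, feature_counts, vocabulary


def classify(features, model):
    """Return the author with the highest posterior log-probability.

    Uses multinomial naive Bayes with add-one (Laplace) smoothing;
    features never seen in training are ignored.
    """
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_author, best_score = None, float("-inf")
    for author_id, count in class_counts.items():
        score = math.log(count / total)  # prior
        denominator = sum(feature_counts[author_id].values()) + len(vocabulary)
        for feature in features:
            if feature in vocabulary:
                score += math.log((feature_counts[author_id][feature] + 1) / denominator)
        if score > best_score:
            best_author, best_score = author_id, score
    return best_author


MODEL = train(TRAINING)
```

The sketch depends on correctly attributed training citations, which in practice would come from an authority file or from user corrections of the kind CiteSeer invites.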

 

The US project NORA and the University of Illinois, Urbana-Champaign, use a tool for rapid flexible mining and machine learning, including a visualization tool, Data to Knowledge (D2K), which is available under an academic and research licence. For other work relating to visualization tools, in the context of text and data mining, see (Fayyad et al. 2001; Shneiderman 2002).

 

Cornell University leads developments regarding metadata enhancement tools and services for the National Science Digital Library (NSDL) project, together with partners such as INFOMINE (University of California, Riverside) and its iVia Virtual Library software. Metadata augmentation, apart from enriching metadata records with subject headings and keywords (subject authorities), can comprise transformation services to correct degraded terms from controlled vocabularies and recognize values from recommended vocabularies, ascribing the appropriate vocabulary encoding scheme to statements. New metadata values can be generated, based on mappings between schemas or vocabularies (Hillmann et al. 2004).
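
A transformation service of the kind described, correcting degraded terms and ascribing a vocabulary encoding scheme, might look roughly like the sketch below. It is not based on the NSDL or iVia software: the vocabulary names, terms and function names are hypothetical, and the only "degradation" handled is variation in case, spacing and punctuation.

```python
import re

# Hypothetical recommended vocabularies, keyed by encoding scheme name.
VOCABULARIES = {
    "ExampleSubjectHeadings": ["Digital libraries", "Information retrieval"],
    "ExampleTypeList": ["Text", "Dataset", "Image"],
}


def _key(term):
    """Reduce a term to a comparison key: lower case, alphanumerics only."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def build_lookup(vocabularies):
    """Map comparison keys to (scheme, canonical term) pairs."""
    lookup = {}
    for scheme, terms in vocabularies.items():
        for term in terms:
            lookup.setdefault(_key(term), []).append((scheme, term))
    return lookup


def enhance(value, lookup):
    """Return the canonical term and its scheme if the value matches uniquely.

    Unrecognised or ambiguous values are passed through unchanged with no
    scheme, so that nothing is asserted without evidence.
    """
    matches = lookup.get(_key(value), [])
    if len(matches) == 1:
        scheme, term = matches[0]
        return {"value": term, "scheme": scheme}
    return {"value": value, "scheme": None}


LOOKUP = build_lookup(VOCABULARIES)
```

Passing unmatched values through untouched is a cautious design choice: an enhancement service that guesses risks lowering metadata quality rather than raising it.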

 

Guidelines published in an article on the improvement of metadata quality in ePrint archives (Guy et al. 2004 - in the context of the ePrints UK project) underline the importance of early decisions on the usage and granularity of controlled vocabularies, their consistent application and the importance of built-in support for them in metadata editing tools.

 

Acknowledgement: For the material relating to archives and several other valuable pointers, thanks to the authors of the JISC Shared Infrastructure Services Review, A. Chapman and R. Russell, UKOLN.

 

Recommendations:

Experiment with a Name Authority Web Service e.g. to be built into metadata creation tools.

Develop or support metadata enhancement services for correction and enrichment: vocabularies, schemes, mapping, names.

 

3.4      Social tagging and folksonomies

The enthusiastic publicity regarding social tagging and folksonomies (and the broader perspective of Web 2.0) is reminiscent of previous enthusiasm surrounding Semantic Web visions and early metadata initiatives. A balanced approach is needed, acknowledging the value of previous vocabulary work, whilst not ignoring new possibilities.

 

In the context of this report and upcoming JISC initiatives, it is necessary to investigate to what degree social tagging and similar features have the potential to contribute to an improvement of subject indexing and knowledge organisation and subsequently to benefit resource discovery, browsing and searching. This relates both to possible new and already existing services. Another aspect is the role social tagging and folksonomies could play in creating, upgrading and maintaining vocabularies. Our focus in this review is on the use of social tagging and folksonomies as a contribution to knowledge organisation and discovery, rather than on other potential aspects of social tagging, such as social communication, group creation, bilateral recommendation and personal recommendation lists.

 

It is important to be aware that many aspects, both benefits and shortcomings, of social tagging are similar to activities known for a much longer time and under different names: author provided keywords (e.g. in scientific articles, Index Medicus and MedLine), user created browsing structures (e.g. in DMOZ and originally in Yahoo), invited user corrections of systems (e.g. CiteSeer) or user-created metadata. In these contexts, user participation concerning keyword indexing, classification and metadata provision has normally not been seen as an undisputed success, with regard to the functionalities of such systems. It remains to be seen to what degree other and new characteristics of social tagging might be more productive, such as the anticipated mass scale of tagging, potential convergence of terminology through public exposure, direct access to most of the sources involved, support through easy-to-use tools and visualisation, community-based user interfaces (with access to other people's tags), along with the realisation of private and immediate rewards. These characteristics might, apart from any scale factor, increase speed and reduce costs, offering new qualities in user-oriented information services. Clearly, as a public sector institution providing infrastructure, the JISC has a potential role to play here in terms of encouraging the development of trustworthy, sustainable and freely available services.

3.4.1      Terminology

A wide variety of terms are used to name the participatory activity which is the focus of this section, many of them rather misleading. Folksonomy (see Wikipedia entry) is often used as a synonym for ‘social tagging’, rather than identifying the whole of the vocabulary space emanating from tagging activity in a specific service. In this report we use the term ‘social tagging’ (although Wikipedia alone gives 16 different meanings of tag and tagging). In the following sections, we go on to delineate various, specific features.
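
The distinction between the activity (social tagging) and its aggregate result (the folksonomy of a service) can be made concrete with a small sketch. The users, resources and tags are invented, and folding tags to lower case is an assumption made here for simplicity; real services differ in how they treat variant forms.

```python
from collections import Counter

# Each tagging act is a (user, resource, tag) triple. All values are invented.
TAGGINGS = [
    ("user1", "res1", "thesaurus"),
    ("user2", "res1", "Thesaurus"),
    ("user2", "res1", "kos"),
    ("user3", "res2", "thesaurus"),
    ("user1", "res2", "toread"),
]


def folksonomy(taggings):
    """Aggregate all tagging acts in a service into its tag vocabulary.

    Returns a Counter mapping each (lower-cased) tag to its frequency.
    """
    return Counter(tag.lower() for _user, _resource, tag in taggings)


def tags_for(resource, taggings):
    """Return the distinct tags that any user has applied to one resource."""
    return sorted({tag.lower() for _user, res, tag in taggings if res == resource})


VOCABULARY = folksonomy(TAGGINGS)
```

The individual triples are the social tagging; the counted vocabulary is the folksonomy. The personal tag "toread" illustrates that not every entry in that vocabulary describes the subject of a resource.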

3.4.2      Context

We need to be aware that social tagging exists in the context of a broad range of participatory or community activities in information systems, sometimes a component, sometimes an enabling, overlapping or alternative feature. Related activities include: Linking; Citation; Annotation; Recommendation; Lists (reading, shopping etc.); Exploration of usage popularity, user behaviour and preferences; User contributed metadata; Collaborative filtering; Social searching; Group learning; Customization and personalization (if shared with others).

 

These activities tend to have in common that they involve communities of use, predominantly secondary and tertiary resources, metadata, opinions, judgments and evaluations, notes and usage experiences, personal views and preferences. Aims include stimulating re-use through reference and recommendation, participating in and contributing to information services, supporting collaboration, cooperative research, learning (and entertainment).

 

It would be useful to pursue some theoretical effort to systematize and structure the field of activity broadly described as ‘social tagging’, to study purposes, methods and outcomes, as a framework for further research, development and comprehensive suites of features and services.

3.4.3      Categorization of tagging systems

The narrow field of social tagging systems has been categorized by several authors (e.g. Hammond et al. 2005): by content creator and tag user (oneself or others); by audience (scholarly or general); and by object type, such as web pages/bookmarks/blogs (delicious, Connotea, CiteULike, Technorati), pictures (flickr), music (Last.fm), products (Amazon product tagging), news (Digg) etc. Thus, social tagging is predominantly associated with publications outside traditional channels such as pictures, music, blogs and news. The number of differing systems and applications is so far not very high and the differences between them are rather small.

 

In our context, services enabling tagging by and for a scholarly audience and covering relevant media are most interesting. Several media types need to be considered, such as traditional primary publications, books, journal articles, museum objects, objects in repositories and data sets, name and organisational directories, terminology systems etc.

3.4.4      Disadvantages and problems

Current publications about social tagging (many in blogs) provide long lists of advantages and disadvantages of social tagging and tend to be written by enthusiastic advocates in a highly promotional vein. Few evaluative, systematic studies from professional circles in knowledge organization, information science or semantic web communities have appeared to date. In order to stimulate future projects and improvements, potential disadvantages and problems of existing social tagging systems are discussed, as regards the scope of this review. Potential benefits are reviewed in the following section.

 

Compared with traditional knowledge organisation, social tagging redistributes costs, moving them from term assignment to discovery, or as Ian Davis (2005) puts it, "Tagging bulldozes the cost of classification and piles it onto the price of discovery".

 

An obvious issue with existing social tagging systems is that they are not designed for information discovery and retrieval. They tend to combine various functions in the same application and the same approach is often applied to very different object types: text, web pages, link lists, blogs, pictures and other media, multimedia etc.

 

The most obvious and often mentioned shortcoming is the lack of any control of the vocabulary. Most harmful to retrieval performance is the lack of simple control (irrespective of whether it is applied at the time of input or through later improvement processes applied by the system), such as control of word forms (singular, plural), morphological forms (nouns), spelling, use of numbers, character sets and transliteration. A certain linking of synonyms and disambiguation of homonyms is crucial for acceptable recall and precision when searching, as is control of how names are presented (first names, last names, initials, nicknames). Equally crucial is control of place names, dates/times and acronyms. Also missing are the advanced benefits of KOS, regarding concepts, semantic relationships and controlled mapping of terms.
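To make the notion of ‘simple control’ concrete, the following Python sketch normalises a raw tag at input time. It is a minimal illustration only: the function name, the small synonym table and the deliberately naive plural rule are assumptions made for this example and do not describe any existing tagging service.

```python
import unicodedata

# Hypothetical table of variant forms mapped to a preferred form (illustrative only).
SYNONYMS = {"nyc": "new york city", "colour": "color"}


def normalise_tag(raw: str) -> str:
    """Return a lightly controlled form of a user-supplied tag."""
    # Character sets: decompose accented characters and drop the accents.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Case, separators and whitespace.
    text = text.lower().replace("_", " ").replace("-", " ")
    text = " ".join(text.split())
    # Very naive singular/plural control; a real system would need a
    # stemmer or lemmatiser together with an exception list.
    if len(text) > 3 and text.endswith("s") and not text.endswith("ss"):
        text = text[:-1]
    # Map known variants to a preferred term.
    return SYNONYMS.get(text, text)
```

Under these assumptions, `normalise_tag(" Web-Pages ")` and `normalise_tag("web_page")` both yield the same tag, so that a search on one form also finds items tagged with the other. The same routine could equally be run as a later batch process over existing tags.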

 

Another shortcoming is the absence of rules for indexing/tagging: rules concerning exhaustivity, specificity, granularity, compound construction or provision of context. Tags indicating a personal context (e.g. my brother) may not be useful to the public. Place names acting as a subject/topic should be differentiated from personal associations (e.g. a picture of a car photographed in Labrador does not have Labrador as its subject; it does not say anything meaningful about Labrador, nor about the breed of dogs with that name). The absence of rules, e.g. about phrases (often prohibited) or the construction of compound terms, leads to various ad-hoc practices, with different special characters used for connection of words or other information encoded into the tags, such as creation of hierarchy and structure, or other non-topical metadata (places, coordinates, times, names, types).

 

The lack of structure among the tags deprives the systems of concept-based navigational options, such as systematic browsing exploring hierarchies, or other forms of semantic relationships. Alphabetical, indexer-name or popularity sorting are often the only options, hampered in addition by the previously mentioned lack of control. This results in a lack of context for the tagged information items.

 

In summary, while social tagging may have other benefits, as currently constituted it is not suited for targeted effective search or systematic topical browsing. In retrieval terms, social tagging systems would have low precision and low recall. The only discovery approach which might be favoured is serendipity. There is, of course, some benefit in situations where there would otherwise be a complete absence of any other indexing and discovery features.
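For readers less familiar with the two retrieval measures referred to here, the short Python sketch below computes them from a set of retrieved items and a set of relevant items. The function name and the figures in the usage note are invented for illustration and are not results from any study.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one search.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

As a hypothetical case, suppose a search on the tag ‘labrador’ retrieves four items, only one of which concerns the region, while five items about the region exist in the collection. Then `precision_recall({1, 2, 3, 4}, {1, 5, 6, 7, 8})` gives a precision of 0.25 and a recall of 0.2: homonyms depress precision, while unlinked synonyms and variant forms depress recall.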

3.4.5      Advantages and benefits

 

The major, general benefits with existing social tagging systems include the likely ability to:

 

In addition, there may be potential benefits for ambitious information systems addressing the HE sector in the following contexts:

 

Social tagging cannot and should not replace other indexing and knowledge organisation efforts. For the purpose of resource discovery at least, the main recommendation is to explore their strengths and to use them in complementary ways, both by optimising such systems for discovery and by combining them in different ways with more controlled knowledge organisation and retrieval systems. More detailed development and research suggestions are outlined in the two following sections.

3.4.6      Proposed developments

Three directions are proposed to stimulate development efforts and experiments: 1) Improvement of existing social tagging systems; 2) Development of alternative tagging systems; 3) Integration of social tagging features into existing systems and services. The latter two groups of suggestions appear more promising and should be prioritised.

 

1           Improve existing social tagging systems

 

2           Build alternative tagging systems

To stimulate innovation, guidance and improved praxis, the creation of alternative tagging systems with different approaches could be supported, such as:

·         create a social tagging system optimised for discovery and retrieval

·         build a system for a homogeneous service, with well-defined user group and purpose

·         explore systematic use of controlled vocabularies in a tagging system

·         apply advanced mapping of tags to facets, established KOS or authorities for named entities

·         hook library and discovery services into a social tagging system, in the way OCLC does with search engines and institutional repositories (OpenWorldCat in Google, Name Authority service in DSpace).

 

This development direction has a parallel in the effort of the Nature Publishing Group to create a tagging system specialised in bibliographic information and reference management based on Connotea (Lund et al. 2005).

 

Recommendations:

Experiment with combination of KOS-based controlled indexing with an established vocabulary and free (social) tagging for research purposes in a specific discipline, optimised for discovery and retrieval

Experiment with potential for automatic linking of tags to facets, controlled vocabularies and authorities
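One simple way to approach the automatic linking of tags to a controlled vocabulary is exact matching against the preferred labels and entry terms of a thesaurus. The Python sketch below illustrates this under stated assumptions: the thesaurus fragment, the concept identifiers and the function names are all hypothetical, and a realistic experiment would add normalisation, partial matching and disambiguation.

```python
# Hypothetical thesaurus fragment: concept identifier -> labels (illustrative only).
THESAURUS_FRAGMENT = {
    "c1": {"pref": "Dogs", "entry_terms": ["dog", "canines", "hounds"]},
    "c2": {"pref": "Labrador (Canada)", "entry_terms": ["labrador peninsula"]},
}


def build_lookup(thesaurus: dict) -> dict:
    """Index every preferred label and entry term by its lower-cased form."""
    lookup = {}
    for concept_id, concept in thesaurus.items():
        for label in [concept["pref"], *concept["entry_terms"]]:
            lookup[label.lower()] = concept_id
    return lookup


def link_tags(tags: list, thesaurus: dict) -> tuple:
    """Split tags into (tag -> concept id) links and a list of unmatched tags."""
    lookup = build_lookup(thesaurus)
    linked, unmatched = {}, []
    for tag in tags:
        concept_id = lookup.get(tag.strip().lower())
        if concept_id is None:
            # Kept as a free tag, or queued for review as candidate vocabulary.
            unmatched.append(tag)
        else:
            linked[tag] = concept_id
    return linked, unmatched
```

With this fragment, `link_tags(["Dog", "hounds", "labrador", "my brother"], THESAURUS_FRAGMENT)` links the first two tags to concept `c1` and leaves the other two unmatched. The ambiguous tag ‘labrador’ is deliberately not resolved by exact matching, which is where interactive or context-based disambiguation would be needed.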

 

3           Integration of social tagging features into existing systems and services

An important strand of work when it comes to realising the benefits of social tagging is to integrate user contributions into existing information systems and services. This is a special case of the application of all kinds of participatory approaches. In the example list below, we focus quite narrowly on the provision of tags. Addition and integration of different types of user participation to established services might be equally rewarding.

 

Among the most promising options are:

·         OPACs (Online Public Access Catalogues, for library materials): here, some activity can already be seen, e.g. PennPal or the OPACi prototype from Casey Bisson. The immediate activity would be to allow users to add tags to bibliographic records and then to use them for different "views", linking etc. Other options could be inspired by Amazon, e.g. the Open WorldCat reviews.

·         Subject Gateways, Intute hubs: User tags could be used to inform resource selection for final inclusion and to support improved subject access as in OPACs above. User annotation has been tried before, e.g. in SOSIG Grapevine.

·         Directories: Users have been creating and populating categories early on in the Yahoo directory, the dmoz and Open Directory efforts. Yahoo Social Systems has the ambition to do more with the directory.

·         Subject repositories: the situation is similar to OPACs and Subject Gateways. There are benefits for resource selection, the acquisition of new indexing vocabulary, the creation of conceptual structures and categories where they are missing now.

·         Citation services: CiteSeer already offers the option of user corrections; name disambiguation and adding of content tags would be other alternatives.

·         Digital libraries: similar to OPACs, plus creation of structures where there are none. Integration with other participatory efforts would be promising also.

·         Search engines: the big commercial search engines are already active in this area, e.g. Yahoo and Google Answer, with Yahoo building its whole brand on becoming a "social system". Specialised, academic and local search engines would greatly benefit from similar approaches, possibly integrated with traditional KOS. Systems like Vivisimo could use social tagging to label clusters.

·         KOS creation and development systems might benefit from broader user input and reuse of tags from other systems. The DDC editorial system is opening up for broader expert participation.

·         Museum online interactive exhibitions and object catalogues: a few projects are active here, e.g. Steve.museum and the ED2 project at the Cambridge University Museum of Anthropology, mainly with inviting user descriptions of pieces of art and tagging of user experiences (Trant 2006).

·         Metadata enhancement services: as with several services above, richness of indexing, disambiguation and correction of errors come to mind.

·         Blogs, news services and RSS feeds: these services are natively based on user contributions. More advanced usage of tagging could be imagined, however.

 

In addition, all these services could, based on social tagging, provide different and alternative views or layers of resources and search results, co-occurrence clustering or automatic linking from tags in the system to external resources such as flickr and delicious.
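The co-occurrence clustering mentioned above starts from a count of how often two tags are assigned to the same resource. The following Python sketch shows that first step and a simple ‘related tags’ lookup built on it. The function names and the sample data in the usage note are assumptions for illustration; real services would normally weight or threshold the raw counts.

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrence(tagged_items: dict) -> Counter:
    """Count how often each unordered pair of tags is assigned to the same item."""
    pairs = Counter()
    for tags in tagged_items.values():
        # Sorting makes (a, b) and (b, a) fall on the same key.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs


def related_tags(tag: str, pairs: Counter, top: int = 3) -> list:
    """Return up to `top` tags that co-occur with `tag`, most frequent first."""
    scores = Counter()
    for (a, b), count in pairs.items():
        if a == tag:
            scores[b] += count
        elif b == tag:
            scores[a] += count
    return [t for t, _ in scores.most_common(top)]
```

For example, given three hypothetical resources tagged `["thesaurus", "kos", "skos"]`, `["thesaurus", "kos"]` and `["kos", "ontology"]`, the pair (kos, thesaurus) is counted twice and ‘thesaurus’ is returned as the tag most closely related to ‘kos’. Such lists could drive alternative views or "see also" links alongside controlled subject access.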

 

Recommendation:

Integrate tagging into existing services such as repositories, OPACs, (RDN/Intute) subject gateways, Digital Libraries, KOS creation and management systems, museum exhibitions and catalogues, metadata enhancement services etc.

3.4.7      Research

In the context of social tagging, there are many important aspects which would require research efforts in parallel to service developments. Usage aspects and benefits need to be studied systematically, both on existing systems now and on new or improved systems once they have been created. Very little research seems to be done at this time. Among more specific research topics are:

 

Recommendation:

Comparison study between different types of social contributions: annotation, recommendation, personalization, restructuring of information, categorization, concept space, concept maps, topic map tools. This could inform a prototype integrating different types of user participation with social tagging.    

 

3.5      Best practice guidelines for constructing and using vocabularies

Best practice guidelines for design of different kinds of vocabularies offer practical help. Aitchison et al. (2000) is a standard reference in the UK for thesaurus design and construction, while the Willpower website offers useful practical guidelines, along with a list of commercial software. As well as describing their respective standards, the BSI and NISO standards documents also offer best practice design guidelines, with their scope now widening beyond thesauri. The BSI Guide is perhaps particularly relevant for JISC UK purposes. Middleton’s Controlled Vocabulary List includes a bibliography, a list of software and some pointers to guidelines. The University of British Columbia’s Indexing Resources on the WWW contains an extended set of links to guidelines on related issues.

 

Rosenfeld and Morville (2002) is a widely used textbook for information architecture and website design techniques that build on various vocabularies. It includes chapters on management, ROI and case studies. Daniels and Busch (2005a, 2005b) give specific best (and worst) practice guidelines and ROI considerations from a commercial DC perspective. TASI give an introduction to adopting a vocabulary within a metadata framework. The JISC CO-ODE Project (Section 5.1.2) offers tools and tutorials on ontology design. The GovTalk archive provides design/selection criteria for vocabulary software. The recently published e-Government Metadata Standard (Version 3.1) recommends the Integrated Public Sector Vocabulary (IPSV) as mandatory for its subject element. Concepts from other controlled vocabularies may be added (with the scheme being declared). The IPSV is available in full and abridged versions and in CSV, XML, RDFS and other formats, while various guides to tagging and general use are also available.

3.6      Network access to vocabularies

 

Both for human Web access and m2m access there is a need to discover appropriate terminologies, and to evaluate, navigate and query the terminology once found. 

 

Many terminologies and thesauri are now made available over the Web, intended for human use. However, wider use of KOS and their integration into applications in an automated fashion will require m2m access. A number of common standards for representing and accessing vocabularies are emerging; these are outlined in Section 6. There have been some significant steps forward in an attempt to stimulate wider use of existing terminologies. For example, the OCLC Terminology Services project (Section 5.3.6) has recently made available some dozen vocabularies in the MARC 21 Format for Authority Data in XML on its website, in addition to the Dewey Decimal Classification (DDC) 22 Summaries.

 

There are many web based human readable lists of vocabulary resources (both commercial and freely available); several are detailed in the references. Notable examples include the following. The JISC HILT project has compiled an AtoZ of thesauri. The JISC Pedagogical Vocabularies Project has compiled a list of educational vocabularies. The TASI (Technical Advisory Service for Images) website has a list of thesauri, classifications and authority lists, along with an introduction to their use within a metadata framework. The Species 2000 website maintains a checklist of online taxonomic databases. The Text Mining Centre has a list of bio-medical ontologies. Middleton’s Controlled Vocabulary List includes subject heading lists, thesauri and classification schemes. The University of British Columbia’s Indexing Resources on the WWW has lists of classification schemes as well as other vocabularies. Dextre Clarke’s extensive review of Taxonomies in the Public Sector also includes sources on potential benefits as well as software and design.

 

Synapse provides the Taxonomy Warehouse of taxonomies, thesauri, classification schemes and authority files, organized by category. This includes both links to online vocabularies and their own “value-added fulfillment service” of conversion and packaging with other software. The University of Toronto maintains the Subject Analysis Systems (SAS) Collection, which acts as a “North American Clearinghouse for subject classifications and controlled vocabularies in many different subject areas”. The MDA (formerly the Museum Documentation Association) website has a Terminology Bank of cultural heritage vocabularies it has sponsored and publishes online.

 

In the longer term, it is hoped that such human readable lists might be maintained as ‘registries’ although the issue of who is to maintain them has to be resolved. See Section 3.7 on registries (and Section 3.2.4 on the Becta Vocabulary Bank).

3.7      Terminology Registries

 

There are a number of registry initiatives within the education domain designed to support a services oriented approach to component development, providing ‘look-up’ functionality. These registries provide programmatic access to registered data of various sorts. Such registries include service and collection description registries (e.g. Ockham, JISC IESR), transaction service registries (some using UDDI, mainly within the eScience community, such as GRIMOIRES), registries of mapping and crosswalks (OCLC crosswalks registry - see Metadata Switch Project in Section 3.3.4), and metadata schema registries (JISC IEMSR, The European Library, DART). There has been little activity in relation to terminologies with the exception of the NSDL Schema Registry which plans to register both metadata schemas and KOS related controlled vocabularies in use within NSDL.

 

In general, registries enable discovery, navigation, access and re-use of the objects that are registered. In relation to terminologies such registries might take different approaches depending on the functionality they are designed to deliver, whether registering descriptions of vocabularies, registering individual terms and concepts, or usage within domains or discipline. Services based on terminologies (such as disambiguation services, query expansion, mapping) might also be registered whether within a specific terminology services registry or within a services registry with a wider inclusion remit such as the JISC IESR.

 

 

Providing m2m access to information about terminologies, and terminology services would encourage exploitation of existing vocabularies and enable innovative interfacing with applications from ‘other domains’. Policies would need to be established covering status, persistence, identification, and quality.
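As a rough illustration of a registry that holds descriptions of vocabularies and supports programmatic look-up, the Python sketch below defines a registry record and a minimal in-memory registry. All class names, field names and example values are assumptions made for this sketch; they are not taken from IESR, NSDL, the NKOS description profile or any other existing registry.

```python
from dataclasses import dataclass, field


@dataclass
class VocabularyRecord:
    """Description of one vocabulary (hypothetical field set)."""
    identifier: str                                 # persistent identifier
    title: str
    kos_type: str                                   # e.g. "thesaurus"
    subjects: list = field(default_factory=list)
    formats: list = field(default_factory=list)     # e.g. ["XML", "CSV"]
    status: str = "draft"                           # set according to registry policy
    rights: str = ""                                # licence or IPR statement


class TerminologyRegistry:
    """Minimal in-memory registry supporting registration and discovery."""

    def __init__(self):
        self._records = {}

    def register(self, record: VocabularyRecord) -> None:
        # Identification policy: one record per identifier.
        if record.identifier in self._records:
            raise ValueError(f"identifier already registered: {record.identifier}")
        self._records[record.identifier] = record

    def get(self, identifier: str) -> VocabularyRecord:
        return self._records[identifier]

    def find(self, subject=None, kos_type=None) -> list:
        """Return records matching the given subject and/or KOS type."""
        results = []
        for record in self._records.values():
            if subject and subject.lower() not in [s.lower() for s in record.subjects]:
                continue
            if kos_type and record.kos_type != kos_type:
                continue
            results.append(record)
        return results
```

A client application could then call, for instance, `registry.find(subject="archaeology", kos_type="thesaurus")` to discover candidate vocabularies before requesting access to them. The `status` and `rights` fields indicate where policies on quality, persistence and IPR would surface in the data; a registry of individual terms and concepts, or of terminology services, would need a different record structure.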

 

 

Registries might be more or less centralised or distributed, depending on policy and finance drivers as well as on technical design decisions. There is scope for co-operation with other international initiatives, both re-using software and exchanging data. To enable interworking and data exchange, registries themselves need to be standards compliant, although standards are immature in this area, particularly around data description. The ISO/IEC 11179 standard has some relevance here (and in particular the XMDR project taking this forward to register more complex structures); however, this standards making activity is driven by database and data dictionary technologies, rather than by enhancement of the semantic interoperability of web based services, a focus of more interest to the JISC community.

 

 

There needs to be careful consideration of the cost benefit of registries. Whilst some funding organisations such as the JISC might consider registries as a means to identify and promote services available to their communities, there also needs to be investigation of the ROI for providers of specific terminologies and services to contribute to registries. Several of the larger vocabularies have commercial business models and m2m use will raise issues around managing IPR and copyright even for smaller community based vocabularies. There might be a variety of business models, but as with other ‘shared services’ it is sometimes unclear who is the obvious funder. Other business issues include clarifying who owns the content of a registry, who is responsible for transforming the content of vocabularies into a machine readable structure, and whether there is commercial motivation for KOS owners to ‘work together’ in the context of interoperable registries.

 

Recommendations:

Demonstrate the use of a terminologies registry within the JISC IE testbed, to include:

·        Investigating inclusion of terminologies into IESR, potentially describing vocabularies as collections

·        Developing a marketing proposition for a UK terminology registry (including use scenarios, IPR issues, business models, cost benefit)

·        Evaluating use of the draft metadata description profile proposed by NKOS

·        Maintaining collaboration between various UK initiatives (with eScience e.g. GRIMOIRES and learning communities e.g. Becta Vocabulary Tool) and internationally (e.g. NSDL)

 

 

4         Activities with TS

This section attempts to generalise beyond specific projects and types of vocabulary to discuss some ways that terminology services can be applied in wider frameworks. We begin by considering user behaviour, go on to discuss the different types of terminology service in context of the JISC Information Environment and eFramework and finish by considering terminology as part of work in automatic mapping, classification and text mining.

4.1      Studies and models of information seeking behaviour

It is important to consider how people search for information when designing and evaluating TS, in order to reduce the scope for design errors and increase the possibility that services will actually be used. While this is a difficult and complex area, there is a considerable literature on studies of searching behaviour and dedicated conferences, such as Information Seeking in Context (ISIC)[1], have emerged. Such studies offer possible insight into discovery strategies, user needs and user contexts. While remembering that variations in environmental context and individual characteristics, such as training and motivation, can be important, these studies can be a useful resource for planning future developments and evaluation methodology.

 

The term information seeking usually refers to the broader context of an information need, while information searching denotes interaction with a computer for a specific search, although the distinction sometimes becomes blurred (Marchionini, 1995; Spink et al., 2002; Wilson, 1999).

 

Terminology support has been found potentially helpful for both recall and precision.  In an extensive study of online behaviour by search intermediaries, Fidel’s findings (1991) supported the utility of terminology support alongside free text retrieval. According to circumstances both retrieval modes were used to improve either recall or precision. Even professional searchers tended not to make use of synonyms in free text searching, leading to the conclusion that there is a need for well designed thesauri and associated tools.

 

Information seeking models, such as Choo et al. (2000), Kuhlthau (1991) and Saracevic (1997), provide general frameworks of information seeking behaviour which can assist with higher-level design aims. Ellis (1989) critiqued the restrictive assumptions of controlled laboratory evaluations and argued for an empirical, behavioural approach to information seeking studies. This led to a focus on basic information seeking patterns, such as browsing, chaining, monitoring, etc. Soergel (1994) stressed the need to take account of the full context of indexing, system and user factors in evaluation. Kuhlthau’s (1991) and Marchionini’s (1995) models describe the basic stages in the information searching process, in terms of problem definition, query formulation and execution and examination of results. Blocks et al. (2006) provide a low level model of the stages of thesaurus assisted search, intended as a practical reference model for system developers.

 

Studies of searching behaviour generally reveal it to be an iterative process. Bates’ influential, 'berry-picking' searching model (1989) emphasised an evolving search, in contrast to models of a static information need where a single query is optimised. She found that in many cases users' information needs evolve as the session progresses in interaction with the material encountered.

 

This leads to a need to consider the appropriate balance between interactive and automatic TS. For example, the balance between system and user control of terminology supported query expansion (QE) has been the subject of much research. The various Okapi projects conducted a number of experiments with thesaurus based QE in operational settings as part of a probabilistic query model (Beaulieu, 1997). These ranged from fully automatic to interactive QE. Their conclusions favour a balance between automatic and interactive control and between explicit and implicit use of the thesaurus. Other empirical studies considering the user-system balance include Jones et al. (1995); Shiri & Revie (2006); Vakkari et al. (2004), and Greenberg (2001), who compared the performance of different thesaurus relationships in automatic versus interactive query expansion. She argues that intelligent systems should take into account (evolving) user retrieval goals.
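The balance between automatic and interactive query expansion can be sketched in code. In the Python example below, some thesaurus relationships are applied to the query without user involvement while others are only offered as suggestions. The thesaurus fragment, the function name and the particular split between automatic and interactive relationships are assumptions made for illustration; they are not the configuration used by Okapi or recommended by any of the studies cited.

```python
# Hypothetical thesaurus fragment with the usual relationship types:
# UF (used for / synonyms), BT (broader), NT (narrower), RT (related).
RELATIONS = {
    "dogs": {"UF": ["canines"], "BT": ["mammals"], "NT": ["retrievers"], "RT": ["kennels"]},
    "retrievers": {"UF": [], "BT": ["dogs"], "NT": [], "RT": []},
}


def expand_query(term, thesaurus, automatic=("UF", "NT"), interactive=("BT", "RT")):
    """Return (expanded terms, suggestions) for a single query term.

    Relationships named in `automatic` are added to the query directly;
    those named in `interactive` are returned for the user to choose from.
    """
    entry = thesaurus.get(term.lower())
    if entry is None:
        # Unknown term: fall back to an unexpanded free text search.
        return [term], {}
    expanded = [term.lower()]
    for rel in automatic:
        expanded.extend(entry.get(rel, []))
    suggestions = {rel: list(entry[rel]) for rel in interactive if entry.get(rel)}
    return expanded, suggestions
```

Calling `expand_query("Dogs", RELATIONS)` adds the synonym and the narrower term to the query and offers ‘mammals’ and ‘kennels’ as optional moves. Passing `automatic=()` makes expansion fully interactive, and moving every relationship into `automatic` makes it fully automatic, so the two parameters express the system-user balance discussed above.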

 

Research has argued the importance of strategic or conceptual support (e.g. Brajnik et al., 1996; Fidel 1995). Bates (1979) and Fidel (1985) identified a number of tactics or moves respectively employed by professional searchers to modify or reformulate queries, for example moving to a broader or related term. Bates (1990) discusses possibilities for system support of search activities at different levels of granularity, within a framework of end-user control of the search steps. She argues that one reason current interfaces are difficult to use is that they tend not to be designed around typical search behaviours that promote strategic search goals. She particularly recommends that research be directed to system support for end-user searching at the mid-level range of tactics and stratagems, as opposed to basic moves and high level strategies. 

 

Recommendation: User studies of TS in context of JISC IE, illuminating the search process (for work flow of services) and the appropriate balance between interactive and automatic TS.

4.2      Information lifecycle with regard to TS

The model presented here draws on the information lifecycle management model described in the DELOS (deliverable D5.3.1, section 3.2.2) state of the art review of semantic interoperability in Digital Libraries (Patel et al. 2005), which synthesized lifecycle models from the knowledge representation and information science fields. The DELOS lifecycle model is applied (and extended) with regard to two different aspects of TS, a) the vocabulary as an entity in itself and b) elements of a vocabulary used as part of an information system (e.g. providing a search term). In the latter case, a search system might be ‘terminology-aware’ in its use of TS, or it might simply treat terminology elements as a source of uncontrolled terms, for its purposes. These two aspects are combined informally in the revised framework; it should be emphasized that other configurations and selections of the elements are possible. The purpose is to provide a heuristic, unifying framework for considering the range of TS applications. In Section 5.4, some projects are roughly located within this broad framework.

 

The TS Lifecycle framework is given in Figure 1. Creation here refers to the production of a vocabulary, while Acquisition refers to the stage when the vocabulary is integrated with a collection or a registry of some kind. Identification (considered under Cataloguing) provides a unique key for a vocabulary or a vocabulary element (see Section 6.3). Integration is discussed in Section 4.4. Access, Search and Discovery has been treated in more detail than the DELOS version due to the focus on TS. Of course, the other elements are also relevant to this review, with Acquisition, Maintenance and Archiving being rather less central. Note that the lifecycle may involve creators/authors, publishers, information systems managers, service providers and end-users of different kinds. More generally, this lifecycle model connects or overlaps with wider models of information seeking behaviour (see Section 4.1) and the scholarly lifecycle (Lyon 2003). Note that in practice many user activities involving TS are an iterative process.

 

Creation and modification of vocabularies

    Creating/sustaining vocabularies

 

Publication of vocabularies

    Licensing

 

Acquisition of vocabularies

    Selection, storage

 

Cataloguing (metadata, identification/naming, registration)

    Indexing/classification/annotation

            Intellectual, semi-automatic, automatic, disambiguation

 

 Integration (syntactic and semantic interoperability issues)

    Mapping, merging

 

Mediation

    User interfaces (TS surfaced in interface)

            Faceted/Spatial/Other access afforded

     Browsing and visualization of vocabularies

     Personalization (of interface or vocabulary)

 

Access, search and discovery

    Discovery of services, vocabularies (and concepts), databases/collections,

    Search and retrieval

            Querying

                        TS-aware[2] query

                        Query expansion – synonyms, semantic

                        Cross-searching, cross-browsing across distributed collections

 

Use (as part of a broader service)

    Search+Analysis applications,

    Information extraction, mining

 

Maintenance of vocabularies

    Evolution, versioning

    Application specific, collaborative work

 

Archiving and preservation of vocabularies

 

Figure 1. TS Lifecycle framework

 

4.3      Types of Terminology Web Services

This section discusses terminology services, in relation to the JISC Information Environment and e-Framework. We first discuss Terminology Services as abstract services and then go on to discuss hierarchical layers of services and Terminology Web Services, specifically.


In the context of a service oriented view of the JISC Information Environment, where information services, at least in part, may be offered through web services for use by software applications, it is necessary to extend the scope of terminology related services beyond the few, initial examples documented to date, e.g. by Andy Powell (Powell 2005b, Nov). Terminology services are also mentioned in other service oriented architectures and e-framework initiatives such as those adopted by the DLF Abstract Services Taskforce in the US (DLF) and the e-Framework for Education and Research in the UK (e-Framework), including its component, the E-Learning Framework (ELF). Again, TS are not delineated in any detail and one of the aims of this section of the review is to outline some of the different types of TS, with a view to stimulating more detailed design work in this area.

 

Recommendations:

Develop more precise definitions of TS, as part of the JISC IE and eFramework

Define search process workflow of TS within JISC IE eFramework


At the service components level, as shown in the JISC IE architecture diagram (figure 1 in Powell 2005b, Nov.), a group of services forms the Terminology Services component. Four are subsequently listed as abstract services: Vocabulary search interface, Vocabulary harvest interface, Vocabulary deposit interface and Terminology service (mapping and expanding terms). (News channel and Delete interface are also mentioned without details). Relevant vocabulary standards and protocols should be involved in any bindings of the abstract services. However, it is not clear what precisely is entailed by these services.

4.3.1      Definition of Terminology Web Services

A more comprehensive definition of terminology web services is the following:

Terminology Services are a group of abstract services, presenting and applying vocabularies, their member concepts, terms and relationships, describing the meaning of terms and facilitating semantic interoperability. This is done for purposes of searching, browsing, discovery, translation, mapping, semantic reasoning, subject indexing and classification, harvesting, alerting etc.


Potentially, abstract services supporting creation, storage and management of terminology might be added, such as deposit, manage, edit, delete. They may partly overlap with services for presenting and applying vocabularies, but are not at this stage included in the definition and examples below.

 

Note that in this context, vocabularies include the different types of controlled vocabulary described in Section 3.1 and, additionally, sets of mapped (or translated) terms and concepts resulting from mapping services. Uncontrolled vocabularies, such as uncontrolled term lists, author provided keywords, tagsets, folksonomies should also be included for terminology service purposes.

 

There are layers of services at different levels of granularity. At the bottom level, bindings are particular instantiations of an abstract service, giving (as appropriate) specific data representations, an API and Web Service specification (if that is the architecture adopted). At higher levels, the abstract services will form part of broader application services or JISC Service Components. Abstract services may involve layers of Terminology Services, for example a Search Interface or Harvest Interface might take keywords from a lower level Terminology Service.
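The layering described above can be illustrated with a minimal sketch. All names here (KeywordLookup, InMemoryLookup, SearchInterface) are hypothetical and are not taken from the JISC IE documents; the sketch only assumes that an abstract service is an interface, that a binding is one concrete realisation of it, and that a higher-level Search Interface obtains its controlled terms from a lower-level terminology service.

```python
from typing import Protocol


class KeywordLookup(Protocol):
    """Abstract low-level terminology service: free-text keyword -> controlled terms."""

    def lookup(self, keyword: str) -> list[str]: ...


class InMemoryLookup:
    """One possible binding of KeywordLookup, backed by a plain dictionary.

    A web service binding would expose the same operation over a network protocol.
    """

    def __init__(self, entries: dict[str, list[str]]) -> None:
        self._entries = {key.lower(): value for key, value in entries.items()}

    def lookup(self, keyword: str) -> list[str]:
        return list(self._entries.get(keyword.lower(), []))


class SearchInterface:
    """Higher-level service that delegates term resolution to a lower layer."""

    def __init__(self, terminology: KeywordLookup, documents: dict[str, set[str]]) -> None:
        self._terminology = terminology
        self._documents = documents  # document id -> set of controlled index terms

    def search(self, keyword: str) -> list[str]:
        terms = set(self._terminology.lookup(keyword))
        return sorted(doc for doc, index in self._documents.items() if index & terms)
```

Because SearchInterface depends only on the abstract operation, a different binding of the lower-level service could be substituted without changing the higher layer.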

 

Recommendation: Within the context of eFramework develop a hierarchical layered set of protocols for TS and standard bindings to (various) APIs


For each service, various standards and protocols apply. They generally fall into two contexts:

a) standards relating to the description, structuring and functions of vocabulary systems/schemes themselves (eg SKOS Core, VDEX, Zthes, SKOS API, BSI and NISO standards, etc.) – see the various vocabulary related standards described in Section 6.
b) standards related to the TS application context: searching, harvesting, alerting, and other abstract services.


Not all possible bindings and combinations of services are considered here. Some key reference implementations for different types of terminology service would be a useful future development.

 

As an initial step, the next section provides more detail on selected TS. It builds on and extends previous efforts, which deal with terminology services with broad brush strokes (eg Powell 2005b, Nov). However, it is still far from complete. The distinction between 'business processes' and individual abstract services (Powell 2005a, Feb) has not been followed, since this adaptation to the DLF approach is not widely adopted at this time.

4.3.2      Groups (and layers) of abstract terminology services

Three broad groups of abstract terminology services are described below (the third in less detail). In order to illustrate that hierarchical layers of terminology services are necessary, lower level terminology service options for some relevant cases are detailed (in italics). These would be called as lower level services, as part of the implementation. The SKOS API (see Section 6.4.1) is used to express the low level terminology services. This has a Java Web Service binding but is also expressed as a binding-independent protocol (and thus could have an HTTP implementation, say). In the listing below, they could be considered, more or less, as both a specification of a low level terminology service and one possible binding of it. Another binding of the same low level terminology service would be possible. Some lower level OAI harvesting service examples are also shown.

 

The first of the three groups concerns abstract services related to entire vocabulary schemes/systems. The distinction between discovering (identifying) a suitable vocabulary and retrieving metadata about it, versus retrieving member concepts and terms of a vocabulary, tends to be overlooked. The differentiation between services relating to a complete vocabulary scheme and its metadata and services relating to member terms (the second group below) is fundamental, as is the differentiation from services related to the (more or less seamless) application of terminologies in other services.

For example, an abstract service described as 'Vocabulary harvest interface' (Powell 2005b, Nov) does not specify whether it concerns only the 'harvesting' of individual terms and other information about the vocabulary, or groups of terms, or perhaps the harvesting of the complete vocabulary. (The last would tend to be an exception, considering current practice and the rights situation regarding vocabularies.)
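The difference between harvesting a group of terms and harvesting a complete vocabulary can be made concrete with OAI-PMH requests. The sketch below only builds request URLs. The verb and the metadataPrefix and set arguments are standard OAI-PMH; the base URL, the set name and the idea that a vocabulary provider exposes sub-hierarchies as OAI sets are assumptions made purely for illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL.

    Without set_spec the request covers the whole repository (a complete
    harvest); with set_spec it is a selective harvest of one set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and set name, for illustration only.
complete = list_records_url("https://vocab.example.org/oai")
subset = list_records_url("https://vocab.example.org/oai", set_spec="hierarchy:transport")
```

A harvest of individual known terms would more naturally use a look-up operation than a harvesting protocol, which is one reason the abstract service needs to state its scope.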

Another intended contribution of the description below is to illustrate that terminologies may comprise entities other than terms: i.e. concepts and relationships, and that services can serve such entities from either one or several different vocabularies.

1  Services related to the vocabulary (encoding) systems/schemes

11  discover suitable scheme in vocabulary registry
      and:
       search, browse, harvest, alert, upload/deposit, edit etc.

12 disclose selected or complete information (metadata) about scheme(s)

      getSupportedSemanticRelationsByThesaurus(URI thesaurus)

13  statistics (e.g. information about size and usage levels)


2 Services related to member terms/concepts/relationships

   from one or several vocabulary systems

21 discover/search member

      term/concept/relationship/translations/mappings/structures

      [(authority) look-up] in one or several vocabulary systems

      and:
      search, browse, harvest, alert, upload/deposit, edit etc.
 
      getConceptsMatchingKeywordByThesaurus(keyword, URI thesaurus)
      getConceptsMatchingRegex(regexp)


22 disclose or harvest terms/concepts/relationships/translations/mappings/structures
     (known item)

      getConcept(uri)
      getConceptByExternalID(externalID, URI thesaurus)
      getConceptByPreferredLabel(preferredLabel, URI thesaurus)


23  browse in networks of terminology

24  disclose subsets of the topological environment of terms/concepts/relationships;
      several terms/concepts and semantic relationships between them; a synset;  

      translations; mappings; subsets of hierarchies; a concept and all related terms;

      a classification and all related information; a term and all translations; etc.

      getSupportedSemanticRelationsByThesaurus(URI thesaurus)
      getAllConceptRelativesByThesaurus(concept, URI thesaurus)
      getConceptRelativesByThesaurus(concept, relation, URI thesaurus)
      getConceptRelativesByPath(concept, relation, URI thesaurus, int distance)

      getTopConcepts(concept, URI thesaurus)
      getTopmostConcepts(URI thesaurus)

      (OAI) harvest of sets


25  harvest a complete vocabulary/mapping set

      (OAI) complete harvesting function

26  upload/deposit member terms/concepts/relationships/translations/mappings

27  edit member terms/concepts/relationships/translations/mappings

28  alert about new or changed

      member terms/concepts/relationships/translations/mappings


3 Services related to the application of terminology in other services

   (a rough illustration)

31  automatic indexing
32  term or keyphrase extraction
33  named entity recognition, data mining
34  automatic translation of term or document
35  query enhancement, query expansion
36  automatic classification
37  automatic mapping
38  semantic reasoning
...
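As a rough illustration of how some of the group 2 operations might behave, the following sketch implements a small in-memory thesaurus. It is not the SKOS API: the method names are loosely modelled on the operations listed above, the signatures are simplified (a single thesaurus, so no thesaurus URI argument), and the relation names "broader" and "narrower" and the example URIs are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    uri: str
    preferred_label: str
    # relation name -> URIs of related concepts, e.g. {"broader": ["ex:vehicles"]}
    relations: dict[str, list[str]] = field(default_factory=dict)


class Thesaurus:
    def __init__(self, concepts: list[Concept]) -> None:
        self._by_uri = {concept.uri: concept for concept in concepts}

    def get_concept(self, uri: str) -> Optional[Concept]:
        """Known-item retrieval (compare service 22)."""
        return self._by_uri.get(uri)

    def concepts_matching_keyword(self, keyword: str) -> list[Concept]:
        """Case-insensitive substring search over preferred labels (compare service 21)."""
        needle = keyword.lower()
        return [c for c in self._by_uri.values() if needle in c.preferred_label.lower()]

    def concept_relatives_by_path(self, uri: str, relation: str, distance: int) -> list[Concept]:
        """Follow one relation type for up to `distance` steps, breadth-first (compare service 24)."""
        seen = {uri}
        frontier = [uri]
        found: list[Concept] = []
        for _ in range(distance):
            next_frontier = []
            for current in frontier:
                concept = self._by_uri.get(current)
                if concept is None:
                    continue
                for target in concept.relations.get(relation, []):
                    if target not in seen and target in self._by_uri:
                        seen.add(target)
                        next_frontier.append(target)
                        found.append(self._by_uri[target])
            frontier = next_frontier
        return found

    def topmost_concepts(self) -> list[Concept]:
        """Concepts with no broader concept."""
        return [c for c in self._by_uri.values() if not c.relations.get("broader")]
```

A web service binding would wrap these operations in request and response messages; the in-memory form is used here only to keep the example self-contained.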

Vocabulary searching and browsing functions, integrated into an information service, are not fundamentally different from external terminologies used for these purposes (i.e. the abstract terminology services in group 2 above). Technically, they could be invoked as (web) services. Even the suggestion of terms from a controlled vocabulary can be accomplished by sending a suitable request to a web service such as 21, 22 or 24.

 

It is doubtful whether vocabularies completely integrated within a data or document collection, to the extent there is no separate representation or access, could be the basis for useful web services. However, the performance of external web service calls is a critical issue for integrated, responsive search and browse interfaces. It remains to be seen whether and which kind of web service architectures will deliver this kind of response. In some circles, there is interest in light(er)weight approaches to web services (eg SRU and REST protocols), due to concern about the overhead imposed by SOAP-based approaches for interactive Web applications, when response time is an issue. This is also taken up in Section 6.4.1 in the discussion of protocols, specialised for user-facing terminology services. The next section gives an illustrative example of layered terminology services from group 2 above, in context of terminology-assisted search.

 

Recommendation: Develop open source, reference terminology web service implementations

4.3.3      Illustration of TS assisted search process

This section gives a breakdown of some of the detailed steps in the terminology-aware search process, as a further illustration of the layers of services that can be involved in searching and the iterative nature of that process. Figure 2 attempts to combine search and browsing operations. It assumes that vocabularies are in a Registry and associated with specific Collections. An initial user search statement needs to be expressed as controlled terminology, either via browsing, or by resolving the initial terms into controlled terms (disambiguating if necessary). In some cases, browsing can trigger a query, otherwise a query is formulated with concepts from the vocabulary. Results can be inspected for query reformulation purposes and different forms of query expansion can be applied. Saved queries or results can form the basis for starting the process over again. The diagram is a simplified version of the reference thesaurus search model, discussed by Blocks et al. (2006). The ultimate aim would be to map the lower level terminology services, outlined in the previous section, into the diagram’s search process ‘work flow’.

 

Figure 2. Illustrative diagram of controlled terminology search process.

Simplification of diagram in (Blocks et al. 2006)
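Part of the search process described above (resolving a user term to a controlled concept, disambiguating if necessary, optionally expanding the query and then retrieving) can be sketched as follows. This is an illustrative simplification, not the model of Blocks et al. (2006); the function names, the data structures and the rule of expanding over narrower concepts only are all assumptions.

```python
from typing import Callable, Optional


def resolve(term: str, entry_terms: dict[str, list[str]]) -> list[str]:
    """Map a free-text term to candidate controlled concepts (possibly several)."""
    return entry_terms.get(term.strip().lower(), [])


def expand(concept: str, narrower: dict[str, list[str]], depth: int) -> set[str]:
    """Add narrower concepts up to `depth` levels below the starting concept."""
    expanded = {concept}
    frontier = [concept]
    for _ in range(depth):
        frontier = [n for c in frontier for n in narrower.get(c, []) if n not in expanded]
        expanded.update(frontier)
    return expanded


def assisted_search(term: str,
                    entry_terms: dict[str, list[str]],
                    narrower: dict[str, list[str]],
                    index: dict[str, set[str]],
                    depth: int = 0,
                    choose: Optional[Callable[[list[str]], str]] = None) -> list[str]:
    """Resolve, disambiguate, expand and retrieve; returns sorted document ids."""
    candidates = resolve(term, entry_terms)
    if not candidates:
        return []
    if len(candidates) > 1:
        if choose is None:
            raise ValueError(f"ambiguous term {term!r}: {candidates}")
        concept = choose(candidates)  # stands in for the user's disambiguation step
    else:
        concept = candidates[0]
    wanted = expand(concept, narrower, depth)
    return sorted(doc for doc, concepts in index.items() if concepts & wanted)
```

Re-running assisted_search with a larger depth, or with a different choice at the disambiguation step, corresponds to the iterative query reformulation noted above.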

4.3.4      Terminology Web Services review

This section contains a brief review of work to date on terminology services with web service bindings.

 

The OCLC Terminology Services project is described in Section 5.3.6. Salient aspects are briefly reviewed here. Multiple vocabularies are made available via a range of web services (based upon SRW/SRU with the Zthes profile, the MS Research Pane and REST and SOAP protocols). Encoding formats are HTML, MARC-XML and MARC21 Authorities or Classification formats. Other OCLC web services are the Name Authority Service as an add-on to DSpace and the LAF (LC Name Authority File) web service which was developed in the context of a collaboration with the ePrints UK project (cf. section 3.3) (LC Name). Vizine-Goetz (2003) discusses OCLC mapping services, while Vizine-Goetz et al. (2006) describe their SOA architecture and use of Microsoft’s Research Pane.

 

HILT3 has plans to develop web services based upon SRW and SKOS Core (see Section 5.1.4). See also section 3.2.4 on the Becta Vocabulary Studio, which provides a web service interface to its Vocabulary Bank.

 

Various web services activities form part of a wider ECOinformatics Initiative (http://ecoinfo.eionet.europa.eu/), which has seen Environmental Thesaurus and Terminology Workshops. The CSA/NBII Biocomplexity Thesaurus Web Services project have adapted the SKOS API to search and browse the Biocomplexity Thesaurus. Not all functions have been implemented but an efficient keyword search element has been added (CSA/NBII). A multilingual demonstrator is also available. The GEMET multilingual environmental thesaurus was developed by EKOLab using the T-REKS (Thesaurus-based Reference Environmental Knowledge System) model. EIONET (European Environment Information and Observation Network) have developed a web service API for accessing an extended version of GEMET, using an SKOS representation (GEMET). Tudhope and Binding (2006) discuss initial experiences with a web service browser, using a restricted set of the SKOS API functions.

 

The California Environmental Digital Library Network (CalEDLN) uses a web service API for interaction between SKOS encoded vocabularies and metadata editing and search interfaces. The API supports access and browsing of distributed vocabularies, term matching and thesaurus functionality for thematic keywords, California place names, resource types, and person/organization names. The software solutions are open source (CalEDLN). This is a further development of the CERES Protocol work mentioned in Section 6.5.1.

 

DLESE has developed an NSDL Strand Map Service as a Networked Knowledge Organization and visualization system for K-12 Education (the REST-based web service protocol generates the visualizations), taking the form of an enriched end user thesaurus. It invites both teachers and learners to make connections between important ideas (Sumner 2005).

MelvilSoap is a web service from Die Deutsche Bibliothek, offering DDC 22 in German. It allows users to query Melvil, the German WebDewey, via a SOAP interface. It is intended to be used in classification work and to support searching German databases carrying DDC classification. The service requires subscription (Melvil).

MeSHine (EUTROPA.de) is a web service using MeSH in German and other languages to search the Internet (Google web-APIs) and Pubmed/Medline (using the Entrez-SOAP utilities of NLM – see also Section 5.2.5); SOAP/REST access to other vocabularies can be developed. The vocabularies are structured in an XML message format (MeSHine).

Zisman et al. (2002) discuss experiences from applying Web service wrappers in an 'information bus' approach to the development of a prototype system that integrated various UN FAO data sources with disparate organisation and structure.

 

There is some current work looking to integrate web services more fully with Semantic Web efforts. There is a W3C Semantic Web Services Interest Group, concerned with enhanced semantic description and choreography of web services. The ongoing Web Services and Semantics Project (IST-FP6-004308) is an EC specific support action in this general area.

 

Recommendations:

Collaborate with international efforts in terminology web services

Develop a range of TS-based search and browsing tools

4.4      Mapping

 

Mapping is a key requirement for semantic interoperability in heterogeneous environments. Although schemas, frameworks and tools can help, detailed mapping work at the concept level is necessary, requiring a combination of intellectual work and automated assistance. Significant effort is required for useful results.

 

Although some major integrated vocabularies exist (for example, UMLS, GEMET), it is often the case that construction of purpose built integrated vocabularies is not practical. Accordingly, mapping between vocabularies is important for facilitating access to information resources in different contexts, for different purposes and for different user communities.

 

It is sometimes possible, however, to map to an appropriate switching vocabulary. There has been experience with projects, such as HILT and Renardus, mapping to DDC as a central spine (see Section 5). The Becta Vocabulary Studio dynamically maps terms in its Vocabulary Bank to a central spine of concepts (see section 3.2.4). There is also ongoing research into employing top level core ontologies as integrative frameworks between different domain vocabularies and heterogeneous datasets within broad domains (see eg Doerr et al. 2003 and the CIDOC CRM, developed for the museum community and being extended to libraries and archives). It is also possible to make use of linguistic resources, such as lexical databases (eg WordNet) and linguistic ontologies (eg Navaretta et al. 2006), to assist mapping efforts.
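The idea of a switching vocabulary can be shown with a minimal sketch in which each vocabulary is mapped once to a central spine and translation between two vocabularies goes through it. The spine identifiers and terms are invented placeholders (they are not DDC notations), and the function name is hypothetical.

```python
def translate_via_spine(term: str,
                        to_spine: dict[str, list[str]],
                        from_spine: dict[str, list[str]]) -> list[str]:
    """Translate a source-vocabulary term into target-vocabulary terms via the spine.

    to_spine:   source term -> spine classes it has been mapped to
    from_spine: spine class -> target-vocabulary terms mapped to that class
    """
    targets: list[str] = []
    for spine_class in to_spine.get(term, []):
        for target in from_spine.get(spine_class, []):
            if target not in targets:  # keep first-seen order, drop duplicates
                targets.append(target)
    return targets


# Placeholder data for two vocabularies mapped to the same spine.
to_spine = {"Motor cars": ["spine:10"], "Lorries": ["spine:11"]}
from_spine = {"spine:10": ["Automobiles"], "spine:11": ["Trucks"]}
```

With n vocabularies, this arrangement needs n mappings to the spine rather than a mapping for every pair of vocabularies, at the price of any loss of specificity introduced by the spine itself.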

 

The DELOS Report D5.3.1: Semantic Interoperability in Digital Library Systems (Patel et al. 2005, section 6) discusses these issues in more detail. It compares information science and ontology-based mapping methodological approaches, concluding they are intrinsically fairly similar. The HILT project reports also review different mapping approaches.

 

Zeng and Chan (2004) provide an extensive recent review of mapping work. They identify several methodological options, prominent among these being:

a)      Derivation/Modeling of a specialized or simpler vocabulary from an existing complex vocabulary. For example, facet analysis can play a key role in facilitating semantic interoperability by deconstructing and systematising complex, pre-coordinated Subject Headings that might otherwise prove intractable for mapping purposes. The OCLC FAST project (FAST) has converted LCSH headings via a simplified syntax into a faceted representation. 

b)      Translation/Adaptation from a vocabulary in a different language.

c)      Satellite and Leaf Node Linking of a specialised thesaurus to a large, general thesaurus. This is a cost effective method for augmenting a widely used general vocabulary with more specific local concepts and terms. In time, the additions may be adopted by the general thesaurus editors but in the meantime they should always be identified as local.

d)      Direct Mapping  between concepts in different controlled vocabularies, usually with an intellectual review.

e)       Co-occurrence Mapping between two vocabularies based on their mutual occurrences within the indexing of items within a collection. Co-occurrence mappings are considered looser than direct mapping made by experts.

f)        Switching language used as an intermediary. It can be a new system (e.g. UMLS Metathesaurus) or an existing system. This is one of the most frequently used approaches, see for example use of the DDC in projects HILT and Renardus.
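Option (e), co-occurrence mapping, lends itself to a small worked sketch. Given items indexed with terms from two vocabularies, it proposes candidate mappings from the share of a source term's items that also carry a given target term. The scoring rule, the threshold and the function name are assumptions chosen for illustration and do not represent any particular project's method.

```python
from collections import Counter
from typing import Iterable


def cooccurrence_mappings(records: Iterable[tuple[set[str], set[str]]],
                          min_share: float = 0.5) -> list[tuple[str, str, float]]:
    """Propose (term_a, term_b, share) candidates from doubly indexed items.

    records: one (terms from vocabulary A, terms from vocabulary B) pair per item.
    share:   fraction of items indexed with term_a that also carry term_b.
    """
    pair_counts: Counter = Counter()
    a_counts: Counter = Counter()
    for terms_a, terms_b in records:
        for a in set(terms_a):
            a_counts[a] += 1
            for b in set(terms_b):
                pair_counts[(a, b)] += 1
    candidates = []
    for (a, b), count in pair_counts.items():
        share = count / a_counts[a]
        if share >= min_share:
            candidates.append((a, b, share))
    # Order by source term, then strongest candidate first.
    return sorted(candidates, key=lambda c: (c[0], -c[2], c[1]))
```

Candidates produced in this way are statistical suggestions, which is consistent with their being considered looser than direct mappings; an intellectual review step could follow.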

 

Of course, there are also variants and combinations of these approaches. In practice, the success (and cost) of a vocabulary mapping operation will tend to depend on the congruence of the vocabularies to be mapped. Relevant factors include the degree of overlap, degree of pre/post-coordination, similarity in structure and level of specificity, the target application and context of use (for more details, see the discussion in Patel et al. 2005, Section 6.2.1).

 

Some significant projects have worked in this area – see Section 5 for outlines of the HILT, OCLC and Renardus projects. The UN FAO are investing resources into mapping work with the Agrovoc thesaurus (Liang and Fini 2006). The initial HILT project concluded that one high level vocabulary was not feasible for JISC purposes and has piloted terminology services at the collection level for UK higher educational communities, via mapping to a DDC spine. Similarly based on DDC, the Renardus project created a common 'switching' structure to support a cross-browsing service (Koch et al. 2003). OCLC (providers of the DDC) have developed several mappings between major vocabularies (both intellectual and statistical), now available as terminology web services (OCLC Terminology Services, Vizine-Goetz et al. 2003). The OAI protocol is used to provide access to a vocabulary with mappings, via a browser to human users and through the OAI-PMH web service mechanisms to machines. Both direct mappings and co-occurrence mappings are provided, depending on the situation.

 

Part 4 of the draft BSI Standard on Structured Vocabularies is concerned with interoperability and mapping between vocabularies and gives some useful examples, both monolingual and multilingual. It also has a discussion of the impact on retrieval of different options. This is an important consideration, particularly when no exact equivalent concept exists, and it is necessary to map to a broader or narrower concept, a partially overlapping concept, or to a (Boolean) combination of concepts. It distinguishes mapping for index terms, search terms, pre-coordinated strings, one to many, many to one mappings, etc. Different types of mapping relationships and types of inexact equivalences (partial mappings) are discussed. Set-based approaches to mapping are outlined by Renardus (Koch et al. 2003), with regard to classification schemes. (Note that there may be differences in mapping approaches for different types of KOS, eg classifications versus thesauri.) Doerr (2001) proposes an extended set of mapping relationships and discusses mapping issues generally. This was an influence on the draft SKOS-Mapping Schema, which describes an RDF thesaurus vocabulary extension for defining inter-thesaurus mappings and equivalence relationships, although it has yet to see serious application to evaluate its proposals.
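The retrieval impact of inexact mappings can be illustrated by a sketch in which each mapping records its type and, for one to many mappings, the Boolean operator combining the targets. The type names, the Mapping structure and the translate_query function are hypothetical; they are not the vocabulary of the BSI draft or of the SKOS-Mapping Schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class MatchType(Enum):
    EXACT = "exact"
    BROADER = "broader"    # target concept is broader than the source concept
    NARROWER = "narrower"  # target concept is narrower than the source concept
    PARTIAL = "partial"    # target concept overlaps the source concept


@dataclass(frozen=True)
class Mapping:
    source: str
    targets: tuple[str, ...]
    match: MatchType
    operator: str = "OR"   # how several targets are combined: "AND" or "OR"


def translate_query(concept: str,
                    mappings: Iterable[Mapping],
                    allowed: frozenset = frozenset({MatchType.EXACT})) -> str:
    """Build a target-vocabulary query string from the mappings of the allowed types."""
    clauses = []
    for mapping in mappings:
        if mapping.source == concept and mapping.match in allowed:
            joined = f" {mapping.operator} ".join(mapping.targets)
            clauses.append(f"({joined})" if len(mapping.targets) > 1 else joined)
    return " OR ".join(clauses)
```

Restricting allowed to exact matches returns nothing for a concept that has only a broader equivalent, whereas admitting broader mappings retrieves more items that are likely to be less specific; choosing between these is the kind of retrieval trade-off referred to above.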

 

Recommendations:

Investigate and compare different mapping approaches and granularities in pilot projects

Develop a range of TS-based tools to assist in creating mappings

Investigate the potential for standard mapping relationships and a mapping protocol

Collaborate with international efforts in mapping services

 

4.5      Automatic classification and indexing

Automatic classification and indexing tools (see Section 3.2.1.1 on the distinction) are important for addressing the potential resource overheads in applying TS to indexed collections and repositories. Some tools are emerging that should be investigated for JISC purposes. Many argue that a combination of intellectual and automatic methods is currently an optimal approach (eg Hagedorn 2001). Human input can be used to design the vocabularies used by subsequent automatic stages and to review automatic results intellectually.

 

In a recent review of automatic subject classification methods, Golub (2006a) distinguishes three discipline-based approaches: text categorization using AI machine-learning techniques; document clustering using (information retrieval) statistical techniques; document classification using controlled vocabularies. Analysis of citation patterns reveals that the three approaches have tended not to overlap. However this may now be changing. Medelyan and Witten (2006), from the University of Waikato, report on a combination of thesaurus-based indexing with naïve Bayes machine learning methods for domain-specific keyphrase extraction that achieves results close to the inter-indexer consistency found in professional human indexing. Their new Kea++ algorithm is available under an open source license.
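The thesaurus-based side of such a combination can be sketched as a candidate identification step: only phrases that match a thesaurus entry term are kept, and non-preferred terms are conflated to their descriptor. This is not the Kea++ algorithm; the function name, the tokenisation and the example thesaurus are assumptions, and the machine learning stage that would rank the candidates is omitted.

```python
import re
from collections import Counter


def candidate_keyphrases(text: str, thesaurus: dict[str, str], max_words: int = 3) -> Counter:
    """Count thesaurus descriptors whose entry terms occur in the text.

    thesaurus maps lower-cased entry terms (preferred or non-preferred)
    to the preferred descriptor they lead to.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts: Counter = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in thesaurus:
                counts[thesaurus[phrase]] += 1
    return counts


# Invented example entries; "land degradation" is treated as a non-preferred term.
example_thesaurus = {
    "soil erosion": "Soil erosion",
    "erosion": "Erosion",
    "land degradation": "Soil erosion",
}
```

The resulting counts could serve as one feature among several for a learned ranking of candidates.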

 

In a review for HILT, Russell and Day (2001) briefly review some commercial automatic classification tools: Autonomy, Interwoven, Semio, Wordmap. Other commercial products include Collexis with its automatic “fingerprinting” and OCLC’s Connexion interactive cataloguing software. The JCDL 2006 workshop on metadata tools for digital resource repositories provides a list of exhibitors, some of whom offer indexing tools. Lancaster (2003) is a standard text on vocabulary based indexing and classification generally.

 

Various research projects have explored vocabulary-based subject classification and some automatic tools are freely available. Larson (1992) conducted early experiments using the Library of Congress Classification. OCLC’s longstanding automatic classification project has also investigated automatic web page classification using the DDC and Library of Congress Classification. Their Scorpion project applied the text of a web page as a query to the DDC knowledge base and the resulting tools have been used as classification support in CORC and OCLC’s Connexion cataloguing software. A Scorpion demo and software are available under a research license (OCLC Automatic Classification).

 

The iVia/INFOMINE project at UC Riverside have experimented using LCSH with machine learning based on a large training set. Paynter (2005) discusses corresponding evaluation methods and tools. The iVia and DataFountains tools for focused crawling and automatic classification are available under an open source license.

 

Golub (2006b) investigates the problems faced in applying KOS to text-based subject classification of Web pages. A selection of mis-classified Web pages is analysed in great detail to uncover why the automatic methods assigned inappropriate classes and illustrative examples are discussed. The underlying method combines a classification scheme with a corresponding thesaurus to give a rich set of resources for the algorithm. The techniques are based on the automatic classification approach developed by the DESIRE project for a subject gateway in the Engineering domain (Koch and Ardö 2000). They are now being applied by the University of Lund in the EC ALVIS project. Various demonstrators and tools are available from Lund’s KnowLib (Knowledge Discovery and Digital Library Research) Group, applying terminologies to classification of harvested fulltext web documents. The ALVIS project offers open source tools with automatic topic classification, including DESIRE’s COMBINE (the Combine Harvesting Robot, “an open system for crawling [harvesting and threshing (indexing)] Internet resources”), used by the Swedish web archive.
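A much simplified sketch of vocabulary-based classification by term matching follows. Each class carries weighted terms (such as captions and thesaurus terms), a document is scored by the weighted number of matches, and classes at or above a threshold are assigned. This is an illustration of the general idea only and is not the DESIRE, ALVIS or Scorpion algorithm; the weights, the threshold, the class notations and the function name are all assumptions.

```python
import re


def classify(text: str,
             class_terms: dict[str, dict[str, float]],
             threshold: float = 1.0) -> list[tuple[str, float]]:
    """Score classes by weighted term matches and return those at or above the threshold.

    class_terms: class notation -> {lower-cased term: weight}
    """
    normalised = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {}
    for notation, terms in class_terms.items():
        score = 0.0
        for term, weight in terms.items():
            # Whole-word (or whole-phrase) matches only.
            matches = re.findall(rf"\b{re.escape(term)}\b", normalised)
            score += weight * len(matches)
        if score >= threshold:
            scores[notation] = score
    # Highest score first; ties broken by notation.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented classes and weights for illustration.
example_classes = {
    "C1": {"bridge": 1.0, "suspension bridge": 2.0},
    "C2": {"card game": 2.0, "bridge": 0.5},
}
```

The example classes share the ambiguous term "bridge", which is the sort of case where a low threshold would assign an inappropriate class; the detailed analysis of mis-classified pages described above concerns failures of this general kind.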

 

Recommendation: