The role of classification schemes in Internet resource description and discovery
Work Package 3 of Telematics for Research project DESIRE (RE 1004)
As classification is a time-consuming and expensive process, it is obvious that investigations into the use of automated solutions are worthwhile. At the same time, classification is an activity that requires a significant level of human expertise, abstract thinking and understanding, which is not easy to replace with artificial intelligence or expert systems.
There are no known examples of traditional library classification being taken over completely by computer software.
However, knowledge structuring in the Internet has to cope with far larger numbers of documents, exponential growth rates and a high risk of changes occurring in documents that already exist.
This is the background for the development of a growing number of research projects and experimental systems, trying to support knowledge structuring activities on the Net. Most of these projects use methods of derived indexing, i.e. they extract information from the documents and use it for the structuring tasks.
Very few seem to make use of traditional library classification systems, i.e. universal or subject-specific schemes constructed a priori over many years by co-operative organisations, independently of the contents of the documents that actually exist in particular collections. This approach is called assigned indexing: an indexing language is devised in advance, and the appropriate concepts or notations are assigned to each document.
Among the derived indexing methods and projects are:
Many of these methods and tools could be used to improve quality controlled subject gateways, although they are not necessarily useful for building and structuring them from scratch.
Gerry McKiernan, Iowa State University, offers a comprehensive collection of pointers to such projects and systems, including short descriptions, citations and addresses following a study in the Summer of 1996 (McKiernan 1997). The clearinghouse is called Project Aristotle(sm) - Automated Categorization of Web Resources <URL:http://www.public.iastate.edu/~CYBERSTACKS/Aristotle.htm>.
In keeping with the focus of this report, the following account mentions only methods and projects that closely connect automatic routines with the use of established library classification systems in Internet services.
The Nordic WAIS/WWW Project at Lund University Library <URL:http://www.ub2.lu.se/W4.html> was the first project to apply simple methods of automatic classification in order to improve the discovery and retrieval of Internet resources (from Summer 1993 to Summer 1994). It focused on a rather coherent single type of resource: about 700 WAIS databases, for which more or less detailed descriptions of their content could be extracted automatically from the Internet (Ardö et al. 1994).
The part of the project that dealt with "Automatic indexing and classification of WAIS databases" <URL:http://www.ub2.lu.se/autoclass.html> consisted of three elements:
WWW, WAIS and gopher front-ends provided access to the service from all three protocols.
An enhanced WWW/WAIS gateway allowed searching multiple WAIS databases in the same subject area simultaneously from HTML forms and a primitive relevance feedback <URL:http://www.ub2.lu.se/auto_new/UDC.html>.
Since early 1996, the service has not been updated, because all directories of servers listing WAIS databases have ceased operation and WAIS databases today are "hidden" behind WWW pages, which makes it very difficult to discover them and collect any information about them.
The still usable automatic classification module of the software takes information from the keyword field and free-text parts of the selected database descriptions to construct a list of keywords for every WAIS database. This list is augmented with subject terms for the same database collected from a number of different sites on the net.
A list of suggested classifications is constructed by comparing the words from the collected descriptive vocabulary with UDC's vocabulary (although only a very limited part of it during the project).
When a match is found, the corresponding UDC classification (in the beginning restricted to the top level) is added to the list of suggested classifications with a weighting depending on which source the matching word originates from. Keywords in the subject field, for instance, have higher weights than ordinary words in the description field.
From the list of suggested classifications the final classifications are decided using a heuristic method based on the accumulated weights for each proposed classification and on the number of proposed classifications. Adjusting the weightings is essential in order to improve the quality of the classification process.
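The weighting-and-threshold heuristic described above can be sketched roughly as follows. The vocabulary fragment, field weights and threshold below are invented for illustration and do not reproduce the project's actual data:

```python
# Hypothetical sketch of the weighted keyword-to-UDC matching described above.
# A tiny stand-in for (a fragment of) the UDC vocabulary: term -> notation.
UDC_VOCABULARY = {
    "physics": "53",
    "chemistry": "54",
    "biology": "57",
    "library": "02",
}

# Words from different source fields carry different weights (invented values)
SOURCE_WEIGHTS = {"subject": 3.0, "keyword": 2.0, "description": 1.0}

def suggest_classifications(fields, threshold=2.0):
    """fields: mapping of source field name -> list of words.
    Accumulate a weight per matching UDC notation, then keep those
    whose accumulated weight passes the threshold."""
    scores = {}
    for source, words in fields.items():
        weight = SOURCE_WEIGHTS.get(source, 1.0)
        for word in words:
            notation = UDC_VOCABULARY.get(word.lower())
            if notation:
                scores[notation] = scores.get(notation, 0.0) + weight
    return sorted(n for n, s in scores.items() if s >= threshold)

fields = {
    "subject": ["Physics", "chemistry"],
    "description": ["introductory", "physics", "course"],
}
print(suggest_classifications(fields))  # ['53', '54']
```

A word that appears only in a low-weight field (e.g. "physics" in the description alone) would fall below the threshold, which mirrors the report's point that adjusting the weightings is essential to the quality of the result.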
An evaluation of the outcome showed surprisingly few mis-classifications. In spite of efforts to use classification information from other sources, the result depended primarily on the quality of the original database descriptions provided by the authors/publishers.
The whole process is kept as automatic as possible. The methodology is not tied to UDC but could be used with other classification schemes, like the Library of Congress Classification, to produce different views of the resources.
Project GERHARD (German Harvest Automated Retrieval and Directory) is run by Oldenburg University Library: <URL:http://gerhard.bis.uni-oldenburg.de/>
GERHARD intends to create a robot-generated index of WWW resources in Germany and to build a browsing structure by subject automatically. The project runs for one year (until the end of 1997; a prototype is expected by May 1997) and is supported by the Deutsche Forschungsgemeinschaft (DFG).
GERHARD uses a method similar to, but linguistically more advanced than, that of the Nordic WAIS/WWW Project, and applies it to a much larger and more heterogeneous set of documents.
Computer linguistic methods are used (as developed by the project partner ISIV, the Institute of Computer Linguistics, Osnabrück University) to match the natural language content of the indexed documents to a library classification system.
The enlarged, multilingual version of UDC used at the ETH Library Zürich was chosen as the most suitable classification system for the task, since it is optimised for computer use and display (Loth 1996).
The automatic classification process consists of two parts: the computer linguistic analysis and the matching to the specially prepared UDC scheme. The natural language content of HTML pages is segmented into suitable entities, words and phrases, and compared with a dictionary created from the UDC system. The comparison results in a set of UDC notations for every document which are ranked and weighted statistically according to frequencies and to the structure of the document.
The relevant parts of the documents' contents are indexed, together with the resulting classification notations, into a database open to direct searching. In addition, a UDC subject tree for all documents is dynamically generated and provided as a browsing structure.
This makes it possible to integrate the index and the browsing structure, allowing the user to jump from individual hits in the search results to the wealth of related documents in the proper sections of the classification system (the same feature is offered in the EELS service).
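As a rough illustration of the two-part process, the following sketch segments a page into words and two-word phrases and ranks the matching notations by frequency. The mini-dictionary is invented, and the real system's linguistic analysis is far more sophisticated than this plain string matching:

```python
import re
from collections import Counter

# Invented mini-dictionary derived from a UDC-like scheme: phrase -> notation
UDC_DICTIONARY = {
    "neural networks": "004.8",
    "architecture": "72",
    "building": "69",
}

def classify(html_text, top_n=2):
    """Segment the text into words and two-word phrases, match them against
    the dictionary and rank the resulting notations by frequency."""
    text = re.sub(r"<[^>]+>", " ", html_text).lower()  # strip HTML tags
    words = re.findall(r"[a-z]+", text)
    segments = words + [" ".join(pair) for pair in zip(words, words[1:])]
    hits = Counter(UDC_DICTIONARY[s] for s in segments if s in UDC_DICTIONARY)
    return [notation for notation, _ in hits.most_common(top_n)]

page = "<p>Neural networks in architecture: architecture and building.</p>"
print(classify(page))
```

In GERHARD the ranking also takes the structure of the document into account (e.g. terms in titles counting for more), which this frequency-only sketch omits.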
The most important project in the area of automatic classification is OCLC's research project Scorpion <URL:http://purl.oclc.org/scorpion/>.
"Scorpion is a research project at OCLC exploring the indexing and cataloging of electronic resources. Since subject information is the key to advanced retrieval, browsing, and clustering, the primary focus of Scorpion is the building of tools for automatic subject recognition based on well known schemes like the Dewey Decimal System (DDC)" (Shafer 1996).
In Scorpion, the document requiring indexing is treated as a query against a DDC knowledge base consisting of the vocabulary, notations, structure and relations of the classification system. The input terms for the query vary according to the type of documents to be classified, from title and subject headings only to large parts of the documents' vocabulary. The results of the search, using ranked retrieval, then become the subjects of the document and can, for instance, be used by a human cataloguer for classification support. The knowledge base for DDC is the Electronic Support System (ESS) used by the Electronic Dewey for Windows CD-ROM product from OCLC Forest Press (cf. 2.1 The Dewey Decimal System). The intention is to increasingly exploit linkages between Dewey classes and LC Subject Headings as well as other subject schemes gathered from different databases and conversion programs.
So far, the project has explored the use of DDC as a concept definition source. DDC was deemed very suitable for the task because it demonstrated a high degree of class integrity, meaning that its subject definitions are unambiguous and well defined. DDC would therefore be a suitable system for automatic classification (Thompson et al. 1997).
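A hedged sketch of this "document as query" idea follows; the mini knowledge base and its vocabulary are invented for illustration and are not taken from the ESS:

```python
# Each DDC class is represented by a small invented term set; the document's
# terms act as the query, and classes are ranked by vocabulary overlap.
DDC_KNOWLEDGE_BASE = {
    "004": {"computer", "science", "software", "data"},
    "025": {"library", "cataloging", "classification", "indexing"},
    "540": {"chemistry", "chemical", "elements"},
}

def rank_classes(document_terms, top_n=1):
    """Score every class against the document's terms and return the
    best-ranked classes that match at all."""
    terms = set(t.lower() for t in document_terms)
    ranking = sorted(
        DDC_KNOWLEDGE_BASE.items(),
        key=lambda item: len(terms & item[1]),
        reverse=True,
    )
    return [ddc for ddc, vocab in ranking[:top_n] if terms & vocab]

doc = ["Indexing", "and", "cataloging", "of", "library", "resources"]
print(rank_classes(doc))  # ['025']
```

The returned classes would then be offered to a human cataloguer as suggestions, in line with the semi-automatic use envisaged by the project.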
A similar study of the characteristics of LC Classification could build upon the older experiments by Larson (1992). He tried to use partial-match retrieval techniques between various elements of new records and a database of classification clusters generated from titles and subject headings from previously classified MARC records. Between 46% and 86% of the books were classified correctly. The outcome of the experiment was to recommend a semi-automatic method of classification for books.
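A toy version of Larson's cluster-matching approach might look like the following; the records and classes below are invented, and the real experiments used full MARC titles and subject headings rather than a handful of title words:

```python
from collections import Counter, defaultdict

# Hypothetical training data: (title words, assigned LCC class) pairs
# standing in for previously classified MARC records.
records = [
    (["introduction", "to", "algebra"], "QA"),
    (["linear", "algebra", "and", "geometry"], "QA"),
    (["history", "of", "medieval", "europe"], "D"),
    (["europe", "in", "the", "middle", "ages"], "D"),
]

# Build one term cluster per class from the titles of its records
clusters = defaultdict(Counter)
for words, lcc in records:
    clusters[lcc].update(w.lower() for w in words)

def suggest_class(title_words):
    """Partial-match the new title against each cluster and return the
    best-scoring class, as a rough analogue of Larson's method."""
    words = [w.lower() for w in title_words]
    scores = {lcc: sum(c[w] for w in words) for lcc, c in clusters.items()}
    return max(scores, key=scores.get)

print(suggest_class(["modern", "algebra"]))  # QA
```

The 46%-86% accuracy range Larson reports suggests why the recommendation was for semi-automatic rather than fully automatic classification.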
KBS_media Lab, Lund: <URL:http://delphi.kstr.lth.se/kbs/projects/kbscross.html>
There are a couple of projects trying to apply neural network technology in order to improve the learning behaviour of software supporting navigation, browsing and organising systems. The project KBS-CROSS is attempting to develop automatic cross-referencing between classification systems with neural networks.
The goal is to produce a computerised tool that cross-references the LC Classification system with the UDC within the domain of building and architecture.
The practical application will be to allow Lund University Library's cataloguers and indexers to index documents faster and more accurately by suggesting UDC notations based on the LCC/LCSH data already associated with the document. Conversely, literature searchers will be offered the use of an LCC/LCSH system for searches in literature databases indexed with UDC. This tool would enable an OPAC or another search system to properly handle different knowledge representations.
Data from the Swedish union catalogue LIBRIS is used together with neural network and text processing tools. The resulting code and data is deployed in a demonstration tool in a Dynamic Knowledge Net. The tool's performance is being continuously evaluated by indexing specialists.
The possibility of automatically converting existing classifications of documents (OPAC or database records, documents in Internet services, etc.) into another scheme used in a quality controlled subject gateway could become valuable support for the classification task. This method is occasionally used in co-operative cataloguing projects and union catalogues, and sometimes even in individual OPACs, whenever cataloguing records using a different classification scheme are imported or exchanged.
Pica automatically converts Library of Congress Classification into Nederlandse Basisclassificatie (BC) notations when LC records are loaded into the Pica OPAC. Similar software is used as a concordance from UDC to BC.
For the Swedish national classification system SAB, used mainly by arts, humanities and social science libraries in the research libraries' union catalogue LIBRIS, there exists a printed conversion table from Dewey to SAB. However, it is not very up to date (DDC ed. 19 and SAB 6th ed., 1987).
If no "official" conversion tables are available, the classification task could still be improved by extracting linkages between different classification systems (possible in LIBRIS), or between indexing terms and classifications for the same object, from existing databases (the LC catalogue, EELS, etc.) and using them as a conversion algorithm.
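A minimal sketch of such linkage extraction, assuming invented records that carry both a DDC and an SAB notation (as might be harvested from a union catalogue such as LIBRIS), could count co-occurrences and suggest the most frequent pairing:

```python
from collections import Counter, defaultdict

# Hypothetical records carrying both a DDC and an SAB notation; the codes
# and pairings below are invented for illustration.
records = [
    {"ddc": "510", "sab": "T"},
    {"ddc": "510", "sab": "T"},
    {"ddc": "510", "sab": "Ta"},
    {"ddc": "940", "sab": "K.4"},
]

# Count how often each DDC class co-occurs with each SAB class
cooccurrence = defaultdict(Counter)
for rec in records:
    cooccurrence[rec["ddc"]][rec["sab"]] += 1

def convert(ddc):
    """Suggest the SAB class most often paired with the given DDC class,
    or None if the class never occurred in the training records."""
    pairs = cooccurrence.get(ddc)
    return pairs.most_common(1)[0][0] if pairs else None

print(convert("510"))  # T
```

Unlike a hand-maintained printed table, such a mapping tracks actual cataloguing practice, but it inherits any inconsistencies present in the source database.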
Ingetraut Dahlberg's idea, put forward at a panel during the Thirty-sixth Allerton Institute (Wyly 1995, p. 77), of offering the Internet searcher a black box containing a range of concordances between classification schemes, exploiting classification data from different systems for improved search support, would be the ultimate step as far as conversion is concerned.
She had already proposed, in 1982, the outline of a so-called "Information Coding Classification" as a first step towards a new general system, intended as a global switching mechanism between all classification systems and the databases that make use of them (Dahlberg 1995, p. 31-34).
Automatic classification processes are a necessity if large robot-generated services are to offer a good browsing structure for their documents or advanced filtering techniques as well as proper query expansion tools to improve the search process.
Even if we restrict ourselves to the use of a priori classification of documents with traditional library schemes there are a couple of promising approaches. However, large-scale reliable systems are not to be expected in the immediate future.
Quality controlled subject gateways of limited size could use methods of shared classification and scheme conversion. The work of human cataloguers and classifiers could be supported by many of the above-mentioned techniques and tools as semi-automatic systems. Clustering and selection methods, agents and self-organising maps are the most promising tools, as are the possible outcomes of project Scorpion.
Page maintained by: UKOLN Metadata Group
Last updated: 14-May-1997