Electronic thesis and dissertation services:
semantic interoperability, subject access, multilinguality

Traugott Koch, UKOLN

E-Thesis Workshop, Amsterdam 2006-01-19/20

1 Introduction

Purpose and content of the paper, limitations:

This short paper is written to provide some background information to inform and structure the discussion at the E-Thesis Workshop (organised by JISC, CURL and SURF) in Amsterdam, January 19-20, 2006.
The aim of the event seems to be to investigate steps to increase interoperability between E-Thesis and Dissertation (ETD) data/services in different European countries, to discuss potential cooperative services and to prepare a possible research and development project.

The paper focuses on the aspects of subject access, interoperability and multilinguality. It consists mainly of assorted thoughts and pointers, firstly because there is not yet a definition of intended goals, services, available data, context, participants or audience, and secondly because no set of commonly accepted knowledge and obvious solutions is readily available for such a task. The content remains at the level of problems and options and the break-out group using this material will most probably start out there as well.

The content is further limited by the fact that I have never been directly involved in an ETD project. My experience is based mainly on EU projects dealing with knowledge organization of web-based documents, data sets and traditional publications (Renardus, DESIRE, EULER, eBank UK et al) and on selected efforts in Sweden, Denmark and Germany.
The country presentations preceding the break-out sessions at the workshop will probably provide an overview of data, services, plans and discussions from most of Europe, which could not be gathered in time for this background paper.

The international "Electronic Thesis and Dissertation Initiative" with its "Networked Digital Library of Theses and Dissertations" (NDLTD) has highly similar experiences and needs, however.
Ed Fox presented unsolved problems of interoperability, knowledge organisation and subject access in NDLTD to the NKOS workshop at ECDL 2005, hoping for fast solutions. Nobody at the workshop could really promise that. [Fox et al 2005]

As soon as the purpose and scope of the cooperation are defined, a thorough documentation of the available data and metadata, subject access systems, formats and access conditions, workflows, uses and re-uses is required in order to gain sufficient insight into the scale of interoperability problems. Only on such a basis can real options, R&D tasks and solutions be developed.

Related communities and developments:

We need to be aware that the problems of subject access, interoperability and multilinguality are being, and for some time have been, discussed and addressed in a highly similar way in several other, partly overlapping contexts and communities, apart from our rather narrow (digital) library related community:

The CRIS (Current Research Information Systems) community (research projects and their output)
The OAI (Open Archives Initiative) community (variety of open archives and services based on metadata harvested via the OAI-PMH protocol)
The Institutional Repository/eprints community (using software packages such as DSpace or Eprints)

The discussion regarding ETDs needs to learn from the experiences of these communities and engage in discussions and potentially common developments together with them.
The reason is not only highly similar data, services and ambitions, but also the fact that ETD data will appear in all of these contexts and basic metadata should need to be produced only once, close to the source, and be re-used and augmented in other service contexts (e.g. ETD records might draw from or be moved into a library OPAC with print-only TDs; they might be copied into institutional repositories and university-wide eprint databases, move into local, national or international research information systems, TD databases or OAI-based aggregator services, appear in well indexed specialized subject-based services etc.).

The history of local TD services often illustrates this situation quite well. The Lund University Dissertations service [Lund University Dissertations], for example, was established to support the workflow of the dissertation process at the university and to allow an easier integration of print-only TDs and ETDs into one coherent database. Its metadata profile and subject access solutions were then adapted to a Swedish National Research Information System (SAFARI), which no longer exists, and to connected international subject access vocabularies ([CERIF 1991 classification] and [Ortelius]), which are no longer maintained or supported. Now there are plans to integrate the Dissertation database with the new eprint service of the University and to adapt to national recommendations for academic e-Publication [SVEP], both requiring dramatically different subject indexing. Printed TDs appear in the library OPAC as well but with only very shallow subject indexing.

In DEFF, Denmark's Electronic Research Library [DEFF], TDs appear in the Danish National Research Database [Danish National Research], a virtual national union catalogue with records from local OPACs and in Global E-prints, an eprint/preprint and research archive database (harvested via the OAI-PMH), all with different metadata profiles, subject indexing and search access. At this time, there is no cross-search option available.

Local workflows and interconnections with eprint platforms lead to tool developments which often might need to be adapted to general interoperability agreements and services rather than to local policies and software platforms alone. Prominent tool projects related to ETDs are the Virginia Tech ETD-db system [Virginia Tech] and Tapir, the Thesis Alive Plugin for Institutional Repositories (for DSpace) at the Edinburgh University Library [Tapir].

Because of these contexts we should not ignore the majority of printed TDs. Their metadata is digitally available and normally manually catalogued by expert staff. Even ETDs are often catalogued like that and their metadata is available in the same databases and will be a source for eprint services as well. OCLC's WorldCat, for example, contains about 70,000 ETDs among 4.2 million TDs.

TD specifics:

Compared to the document types dominating the services of the related communities, TDs feature several specific characteristics which might be important for our tasks: they are very specialized in their content and therefore often not classified and indexed deeply enough; they are normally not read, annotated or recommended by many people; extended abstracts are normally available, and in an increasing number of cases even the fulltext of the document; titles and abstracts are sometimes available in two languages, and languages other than English still dominate, especially in the humanities and social sciences and in many European countries.

Alternatives:

Efforts like this, trying to discuss subject access and (semantic) interoperability and possible improvements in this respect, always have to justify themselves against rather simplified thinking about solutions, i.e.:

1) Search engines such as Google and their services, e.g. Google Scholar are sufficient for most relevant purposes and we should not develop any services of our own
2) Social tagging with its simple, informal and web-based approach is the only feasible and sustainable solution when it comes to subject indexing and classification

Our discussions and solutions will ultimately need to demonstrate superiority with regard to some important criteria, or claim clear complementarity with these approaches, in order to be further pursued.

Potential actions, e.g.:

Develop service scenarios/use cases for ETD records and services or adapt existing ones

Establish cooperation forum

Find maintenance agency for agreed principles, rules, vocabularies

2 Interoperability

Interoperability is an important issue in all information systems and services. Without syntactic interoperability, data and information cannot be handled properly with regard to formats, encodings, properties, values, data types etc., and can be neither merged nor exchanged. Without semantic interoperability, the meaning of the used language, terminology and metadata values cannot be negotiated or correctly understood.
Accomplishing a sufficient level of interoperability of both kinds is a prerequisite to the creation of any meaningful and useful information service based upon heterogeneous and (often) distributed resources.

Definitions of concepts related to interoperability, as well as prerequisites to and methods of enhancing semantic interoperability in Digital Libraries, have been published in a state-of-the-art report by Delos Workpackage 5 last year [Patel, Koch et al 2005]. Even though Digital Libraries represent a much broader type of service, most of the report is relevant to ETD services as well.

2.1 Syntactic interoperability:

Even though both kinds of interoperability are required, this paper will not deal with the necessary level of syntactic interoperability for an ETD service or any kind of service where ETD records are shared, retrieved or augmented.

Agreed upon and suitable

architectures (e.g. Web services as in the JISC Information Environment Architecture, client-server, peer-to-peer or centralised approaches);
transfer solutions/protocols (e.g. OAI or other harvesting, deposit/active submission or retrieval protocols such as SRW/U);
data model;
record format and syntax (e.g. XML, RDF, XML-MARC);
unique identifiers (e.g. URI, PURL, DOI, Handle) etc.

need to be selected like in every other web based information service.
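As a minimal illustration of what such a protocol agreement means in practice, the following Python sketch builds an OAI-PMH ListRecords request for selective harvesting. The verb and the parameter names (metadataPrefix, set, from) are those of the OAI-PMH; the repository address and the set name "etd" are hypothetical and only serve the example.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec      # restrict the harvest to one set
    if from_date is not None:
        params["from"] = from_date    # only records changed since this date
    return base_url + "?" + urlencode(params)


# "etd" is an assumed set name; real repositories define their own sets.
url = list_records_url(BASE_URL, set_spec="etd", from_date="2005-01-01")
print(url)
```

A service provider would issue such requests against every participating repository, which is only possible if all of them have agreed on the protocol, the metadata format and the set structure beforehand.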

All cooperative, national (e.g. Germany), or international (e.g. NDLTD) TD/ETD services had to make such agreements and decisions in order to offer any service at all. They have formulated requirements, standards and recommendations as have related national eprint initiatives (such as DINI in Germany [DINI], DEFF in Denmark, SVEP in Sweden, ePrints UK [ePrints UK]).

2.2 Semantic interoperability:

The Delos Semantic Interoperability report [Patel, Koch et al 2005] distinguishes semantic interoperability at three levels of abstraction:

1. Data structures, be it for metadata, content data, collection management data, service description data
2. Categorical data, i.e. data that refer to universals, such as classification, typologies and general subjects
3. Factual data, i.e. data that refer to particulars, such as people, items, places

Data structures:

In the field of repositories, most awareness seems to exist regarding efforts at level 1, data structures. The OAI-PMH, on a protocol level, points to the necessity of such efforts by making a few elements from unqualified DC the default of OAI compliance, thus forcing data providers to perform the necessary data element mapping between their own structures and these common elements.
SVEP in Sweden establishes a recommended metadata profile (even for reporting of metadata to the national library; for student and exam papers; for common research reporting), ePrints UK recommends establishing a common metadata application profile [UK Recommended] and DINI has a general recommendation for the main Dublin Core elements, but focusing on the presence of certain data elements such as subject, publication and document type.
Quite close, especially to the scope of the DINI recommendations, comes the new Danish national Metadata Exchange Format for Documents (MXD) [Metadata Exchange] which is about to be implemented.
Creating a common metadata profile for services integrating distributed heterogeneous resources and databases is now a quite established method (cf. EU project Renardus) [Renardus].
Registries provide access to such mappings or crosswalks [Patel, Koch et al 2005 4.5.3] for reference and re-use.
Among the TD services, NDLTD (ETD Metadata Standard) [ETD Metadata] and the German "Online dissertations and post-doctoral theses" (MetaDiss) [MetaDiss] feature common metadata profiles, both at least in parts conforming to Dublin Core. Traditionally catalogued TDs and ETDs will of course follow OPAC cataloguing rules (AACR2 and similar) with adapted formats (flavours of MARC) and local subject indexing practices and systems.

Any higher level ETD service will need to agree on a (core) metadata profile for the purpose of the service, and contributing data sources will have to semantically map and export/expose their data accordingly. Such a profile cannot simply be copied from the existing profiles but would probably be a core of them, adapted and expanded according to the specific characteristics of the intended service.
Exported or harvested individual ETD records should however carry the richest possible metadata with them from the source, to be re-used, transformed, upgraded, mapped or cut out according to different re-use scenarios.
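The data element mapping discussed above can be sketched as a simple crosswalk in Python. The local field names and the sample record are invented for illustration; the target names are unqualified Dublin Core elements. Fields without a mapping are kept aside rather than dropped, in line with the point that records should carry the richest possible metadata from the source.

```python
# Hypothetical crosswalk from a local thesis record to unqualified Dublin Core.
# The local field names are invented; the targets are Dublin Core element names.
CROSSWALK = {
    "thesis_title": "title",
    "author_name": "creator",
    "defence_year": "date",
    "keywords": "subject",
    "classification": "subject",
    "abstract_en": "description",
    "degree_type": "type",
}


def to_dublin_core(local_record):
    """Map a local record to DC elements; every DC element is repeatable."""
    dc = {}
    unmapped = {}
    for field, value in local_record.items():
        values = value if isinstance(value, list) else [value]
        target = CROSSWALK.get(field)
        if target is None:
            unmapped[field] = values  # kept so that no source metadata is lost
        else:
            dc.setdefault(target, []).extend(values)
    return dc, unmapped


record = {
    "thesis_title": "Subject access to electronic theses",
    "author_name": "Doe, Jane",
    "defence_year": "2005",
    "keywords": ["interoperability", "classification"],
    "classification": "ddc:020",
    "supervisor": "Roe, Richard",
}
dc_record, leftover = to_dublin_core(record)
print(dc_record)
print(leftover)
```

Note how keywords and classification both collapse into the single DC element "subject": the mapping is lossy unless the vocabulary encoding scheme is carried along, which is one reason why a core profile alone is not sufficient.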

Categorical data:

Classifications and other knowledge organization systems (KOS), referring to 2. Categorical data above, have to a much lesser degree been standardised or mapped as the basis for semantically improved access in the repository context.
DINI and SVEP made certain recommendations (see section 3 below) and especially DINI tries to move towards one common classification system (DDC), at least for set categorization of OAI compliant document collections.

Among the aggregation services developed for UK ePrint repositories (JISC ePrints UK project [ePrints UK]), OCLC Research started to experiment with an automatic classification service to enable routing of metadata for eprints to the discipline-focused hubs of the RDN. [Dempsey et al 2004]
Treatment of heterogeneity when integrating data sources has been and still is researched in projects supported by the German Research Foundation, especially by Informationszentrum Sozialwissenschaften in Bonn. [Hellweg et al 2001; Krause 2004] The Berlin Open Forum on Metadata Registries in April 2005 discussed semantic registry support, especially in the language engineering field, which would be an important prerequisite for semantic enhancements regarding categorical data. Categorical data provides the subject access, discussed in more detail in section 3 below.

Factual data:

The third level, factual data, has been addressed by efforts to standardise with the help of name authority databases and gazetteers and other geographic name authorities. Data mining techniques can be instrumental as indexing and mapping support.
As a value-adding service OCLC Research developed Name Authority control (name look-up service) as metadata creation support to be hooked up with templates in the DSpace repository software package. Created as web services this and other developments of the Metadata Switch project/ Terminology services project at OCLC [OCLC Metadata Switch project] can provide remote semantic interoperability enhancing functionality pluggable into local applications.
OCLC Pica is commissioned to develop a "National Author Thesaurus" for the Dutch national repository initiative DARE.
Several European projects support development and integration of name authority records emanating primarily from national libraries (e.g. LEAF Linking and Exploring Authority Files) [LEAF].
ETD services should cooperate with the existing library efforts and higher education authority efforts regarding factual authorities. Author and institutional names and locations in ETDs come from the world of traditional publishing and the university organisational information systems and are much easier to establish and verify than authors of documents on the open Internet.

Metadata enhancement:

Metadata enhancement can take place at all three levels. Different strategies for metadata enhancement involve semantic interoperability efforts. Cornell University leads such developments for the National Science Digital Library (NSDL) project in the USA, together with partners such as INFOMINE at the University of California, Riverside, and its iVia Virtual Library software. Metadata augmentation can comprise enriching metadata records with subject headings (e.g. LCSH) and keywords. Transformation services can correct degraded terms from controlled vocabularies, recognize values from recommended vocabularies and ascribe the appropriate vocabulary encoding scheme to the statements. New metadata values can be generated based on mappings between schemas or vocabularies [Hillmann et al 2004].
Guidelines published in an article on the improvement of metadata quality in ePrint archives in the context of the ePrints UK project [Guy et al 2004] underline the importance of early decisions on the usage and granularity of controlled vocabularies, their consistent application and the importance of built-in support for them in metadata editing tools.
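The transformation step mentioned above (recognising degraded terms and ascribing a vocabulary encoding scheme) might look roughly like the following Python sketch. It is not the NSDL implementation; the three-term vocabulary, the scheme label and the normalisation rules are assumptions chosen to keep the example small.

```python
# Tiny stand-in for a controlled vocabulary; a real service would load LCSH
# or a similar system. Scheme label and terms are illustrative assumptions.
VOCABULARY = {
    "scheme": "LCSH",
    "terms": ["Information retrieval", "Digital libraries", "Metadata"],
}


def normalise(term):
    """Build a comparison key: lower case, single spaces, no trailing punctuation."""
    return " ".join(term.lower().split()).rstrip(".,;")


# Index the preferred forms by their normalised key.
LOOKUP = {normalise(t): t for t in VOCABULARY["terms"]}


def enhance_subjects(raw_subjects):
    """Attach a scheme to recognised terms and restore their preferred form."""
    enhanced = []
    for raw in raw_subjects:
        preferred = LOOKUP.get(normalise(raw))
        if preferred is not None:
            enhanced.append({"value": preferred, "scheme": VOCABULARY["scheme"]})
        else:
            enhanced.append({"value": raw, "scheme": None})  # uncontrolled keyword
    return enhanced


result = enhance_subjects(["digital  LIBRARIES.", "metadata", "e-theses"])
for statement in result:
    print(statement)
```

Terms that cannot be matched stay in the record as uncontrolled keywords, so the enhancement adds information without discarding what the source provided.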

Potential actions, e.g.:

Agree on core metadata (Application Profile) solution incl. value encoding (compatible w. NDLTD, built upon most advanced national "rules")

3 Subject Access

3.1 Problems:

After a period in the beginning of the development of Internet information services, where systematic knowledge organization, subject indexing and classification were seen as replaced by fulltext indexing (as the search engines do) and by navigation via hypertext linking, we have lately seen a renaissance of rather "traditional" subject access. Increasingly, it is realised that knowledge organization with controlled vocabularies/KOS (Knowledge Organization Systems) such as thesauri, classification systems, taxonomies, gazetteers, ontologies etc. is at least as useful for resource discovery and information navigation as it was in the printed world.
In particular, systematic navigation and topical browsing through collections, filtering by category and semantically precise search are totally dependent on well-designed subject access systems and interoperability solutions, preferably using established and controlled vocabularies.

In the Digital Library community a report from the EU project DESIRE summarised potential benefits and early applications in 1997. [Koch, Day 1997]
The OAI community has held more detailed discussions at least since the Geneva workshop in 2003, and even the international ETD initiative and the NDLTD project describe the need to offer mapping, browsing, clustering and visualisation and are doing related research [Fox et al 2005].

Even when thirteen repositories in a single subject area (Library and Information Science) are integrated, as in the DL-Harvest project, the need for unified subject access solutions is obvious:
"Each archive uses a different scheme for subjects. ...the subject fields are not used consistently in the archives. ... The lack of a shared controlled vocabulary across repositories means that service providers and data providers need to collaborate. More experimentation is needed with regard to automatic and human-mediated metadata generation and enhancement, and full-text indexing in these contexts." [Coleman, Roback 2005]

Many TD metadata records do not seem to carry sufficient subject information to build advanced discovery services, as can be seen in NDLTD [NDLTD], in OCLC's XTCat [XTCat], in the Danish National Research Database [Danish National Research Database] or even in national library union catalogues. An analysis will show that some sources don't carry any subject information at all, others only a discipline name, only uncontrolled keywords assigned by the authors or a very shallow classification, whereas others provide keywords, discipline and classifications from several systems (general and subject specific), as in parts of the German Dissertations Online [Online dissertations].
One reason for the often very shallow indexing and classification, even for traditional TD cataloguing in OPACs, is that TDs deal with highly specialised and frequently interdisciplinary topics and methods.

When several such heterogeneous sources are to be joined and integrated, the gaps and inconsistencies multiply. Additional problems are the granularity and language of the subject indexing and the level of maintenance and change of the subject indexing systems used. The EU Renardus project [Renardus] documented such heterogeneity among the participating subject gateway services, all of which at least featured rather rich subject information [Renardus usage].

Missing or highly heterogeneous subject information might even be a reason for the frequently missing or limited subject based searching and browsing provided in many TD services at all scales. Sometimes there are no such options at all or one relies on accidental hits in an "all information" search; it is sometimes impossible to limit a search to TDs or combine such a limit with a subject category filter; index look-up for assistance with finding relevant classes or keywords in the database is quite rare, as is a systematic subject category browsing system.

Another factor hampering integration might be the fact that most well-maintained, general and broadly covering KOS are privately owned and require agreements, licenses and even fees to be used. There seem to be no major open systems that are broad and deep as well as stable and maintained in the long term.

3.2 Approaches:

In principle, the following practical approaches regarding subject access are possible:

A) Use one common system, (re)classify/(re)index all documents (manually or with automated support). Select an established or create a new system
B) Map the heterogeneous systems in use to a common "switching" system which provides integrated subject access
C) Support indexing/classification in several different systems which might provide increasing overlap and partial interoperability "from below"
D) Do nothing; rely on fulltext indexing, citation indexing from Google and similar and on success with social tagging and other discovery approaches

A) Use one common system, (re)classify/(re)index all documents (manually or with automated support). Select an established or create a new system

This is a solution suited for a university level TD or a subject TD database but rarely for a national or international service, considering that many records already carry different subject terms and classes from a different context. Even if applied, the common system will not be able to serve all purposes and perform well in all contexts where a TD record will be re-used.
Re-indexing or -classifying is normally not feasible because of prohibitive costs for manual efforts and given that automatic classification and indexing is not performing sufficiently well (yet?) in our situation (a sufficiently deep automatic classification of TDs from all academic fields using an established hierarchical system).

Countries with a common library classification (such as Norway) or subject heading system for most academic libraries will of course benefit from this approach and have a TD service with a common system without any extra efforts. The same is true in the case of Germany which is moving to the common use of DDC (Dewey Decimal Classification). Other systems will convert to it, so TDs are just part of a common approach and the mandatory classification with (slightly adapted) two top-levels from DDC [DDC-Sachgruppen] is the obvious solution. Subsequently, TDs will integrate seamlessly into all other services built upon the same classification system.

For eprints, the German DINI organization (German Initiative for Networked Information) recommends that OAI sets be defined using a two-stage hierarchy, i.e. within a subject classification, here DDC, e.g. ddc:020 (setSpec) [DINI 2003a]. In the context of the quality control through a certification process for repositories offered by DINI [DINI 2003b], the minimum requirements for a certificate include, among others, the availability of a defined policy for subject indexing, and verbal indexing of the documents with keywords or classificatory indexing.
Recommendations which might become future requirements are:

Use of DDC for classification
At least one additional standardized system of verbal or classificatory indexing, general or subject specific
Keywords in English
Abstracts in German and English
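The set naming convention mentioned above (e.g. ddc:020) can be sketched in Python as follows. In an OAI-PMH setSpec the colon separates the levels of the set hierarchy, so "ddc:020" names the subset "020" within the set "ddc". The rule used here to pick the subset, truncating a DDC number to its tens-level division, is a simplifying assumption; the actual DINI list of subject groups is not reproduced and may contain exceptions.

```python
import re


def ddc_set_spec(ddc_notation):
    """Return a 'ddc:NNN' setSpec for a DDC number.

    Assumption: the subset is the DDC division, i.e. the first two digits
    followed by 0. The real list of subject groups may deviate from this.
    """
    match = re.match(r"^(\d{3})(\.\d+)?$", ddc_notation.strip())
    if match is None:
        raise ValueError(f"not a DDC number: {ddc_notation!r}")
    return "ddc:" + match.group(1)[:2] + "0"


def group_by_set(records):
    """Group (identifier, DDC number) pairs by their setSpec."""
    sets = {}
    for identifier, notation in records:
        sets.setdefault(ddc_set_spec(notation), []).append(identifier)
    return sets


# Identifiers and class numbers are invented sample data.
sample = [("thesis-1", "025.04"), ("thesis-2", "020"), ("thesis-3", "516.3")]
print(group_by_set(sample))
```

With such sets in place, a harvester can request only the theses of one subject area instead of the whole repository.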

"In the long run, indexing not based on a controlled vocabulary will not be sufficient to guarantee good recall and precision in searches for publications. To permit quality searches across multiple document repositories, international indexing systems should be used."

The proposal for Swedish eprints, incl. ETDs, [SVEP] adapts existing systems which, however, are not widely used in related services: a subject classification for scientific publications, based on existing systems at the Central Statistical Office and the Science Council [Central Statistical]. It includes rough mappings to a few other commonly used Knowledge Organization systems to support transition. It is recommended to be used for the specification of sets for selective OAI harvesting as well.

The ePrints UK project initiated work on improving semantic interoperability between and subject access to (English-language) eprints, experimenting with terminology web services and automatic classification approaches developed by OCLC and based upon DDC. [ePrints UK]

In Internet and digital contexts, established systems are often, with good reasons, perceived as inappropriate and not in sync with scientific paradigms and the fast changing vocabulary of researchers and users. Therefore, many initiatives undertake to construct new systems, without considering the need for long-term maintenance and interoperability with heritage systems.
CERIF (Common European Research Information Format) [CERIF] has been developed since the late eighties as a recommendation to member states of the EU (European Union). It is intended for use in CRISs (Current Research Information Systems). From the beginning it included a Research Classification Scheme (even intended for TDs) [CERIF 1991 Discipline Classification Schema], consisting of about 370 classes in three different levels and translated into several languages, which was never updated and not very well adopted. Soon, CERIF turned towards the use of a similar system, the Ortelius Thesaurus on Higher Education (1100 terms in five levels) [Ortelius], which met the same fate. Recent CRIS conferences and seminars no longer say much about classification and subject access, even though the web pages claim that new initiatives are "under discussion 2005" related to the CISTRANA project that has created the IST ERA Taxonomy (limited to IT and electronics technologies in a wider range of implementation areas) [IST ERA].
CERIF 1991 classification was translated and adopted by the common Swedish Research Information service SAFARI which does not even exist anymore and subsequently by e.g. Lund University Dissertations [Lund University Dissertations] in order to be visible and integrated in SAFARI (see above). Its category structure is now an isolated island instead of widely interoperable.

As with most classification systems that end-users, and especially researchers, are expected to use to classify their documents, CERIF was widely disliked and perceived as dramatically incomplete and not in agreement with discipline terminology.
Especially in the areas of Law, Language and Literature, local changes were enforced by Lund faculty in spite of the complementary local institutional categorization, thus establishing differences from the national and European system.

The Danish Metadata Exchange Format for Documents (MXD) [Metadata Exchange] does not propose any common solution yet but has at least a placeholder for both classification and keywords.

For TD only services, it might be tempting, rather than using a scientific topical classification, to use the structure of academic disciplines and national educational systems. They are, however, highly different between universities and from country to country and thus not suitable for higher level services. Such a solution would, again, hamper interoperability with many related services.

Keyword indexing:

In principle, keyword indexing, using controlled keyword lists, subject heading systems (such as the Library of Congress Subject Headings LCSH) or thesauri, could be used to accomplish interoperable subject access as well.

The assignment of keywords certainly aids in improved and more precise retrieval. That is the reason why most services recommend adding keywords to the description of TDs, in addition and complementary to a classification. However, in services independent from library catalogues, a specific keyword indexing system is rarely mandated. There are very few suitable general international keyword systems available and it is not clear how feasible e.g. LCSH is for the purpose. In the German Dissertationen online, the common library-based Schlagwortnormdatei (SWD) with its verbal indexing rules is recommended. As with the DDC classification, this accomplishes a certain interoperability with other documents in the National Bibliography and German library catalogues.

From the viewpoint of creating integrated access, mapping controlled or uncontrolled keywords to another keyword system is methodologically more difficult and far more time-consuming than mapping classification systems.

To assist keyword-based retrieval, there are methods available incl. open source software, to extract and weight key-terms and -phrases from the available TD information: abstracts and in the case of ETDs: the fulltext of the document.
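One common family of such methods weights terms by TF-IDF: a term scores high if it is frequent in one abstract but rare across the collection. The following Python sketch is a minimal version of this idea, not any particular software package; the stopword list and the three sample abstracts are invented for illustration.

```python
import math
import re
from collections import Counter

# Very small stopword list, sufficient for the sample data only.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "for", "is", "on", "with"}


def tokenize(text):
    """Lower-case the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def key_terms(abstracts, top_n=3):
    """Rank the terms of each abstract by TF-IDF weight."""
    docs = [Counter(tokenize(a)) for a in abstracts]
    n_docs = len(docs)
    # Number of abstracts in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in doc)
    ranked = []
    for doc in docs:
        total = sum(doc.values())
        weights = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in doc.items()
        }
        # Sort by descending weight, then alphabetically for stable output.
        ranked.append(sorted(weights, key=lambda t: (-weights[t], t))[:top_n])
    return ranked


abstracts = [
    "Classification of theses with the Dewey Decimal Classification",
    "Mapping of thesauri for theses",
    "Harvesting metadata of theses with OAI",
]
ranked_terms = key_terms(abstracts)
print(ranked_terms[0])
```

The word "theses" occurs in every abstract and therefore receives a weight of zero, which illustrates why collection-wide statistics matter when extracting key terms from a TD-only database.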

The fulltext indexing of freely accessible documents by search engines fulfills such a purpose very well, even considering problems with non-textual materials quite frequently appearing in ETDs. So the extra benefit of pursuing fulltext indexing in ETD services without the use of a controlled vocabulary system is doubtful.

Virginia Tech University seems to have abandoned LCSH for author-assigned keywords and replaced LC classification with a generic home-grown one for their ETD cataloguing, as compared with traditional policies.
In a related manner, they investigate the summarisation of individual publications with concept maps for NDLTD. [Fox et al 2005] This is to be done by extracting and translating, if necessary, phrases from ETDs, to establish concept relationships and to combine ontologies to cover broader scopes of knowledge in the collection. The prospects of such an approach are hard to judge. Again, searching for and discovering specific documents will be improved by opening up author terminology, but cross-discipline and cross-individual interoperability might be very hard to accomplish.
Thus, this is seen as complementary: NDLTD feels the need for a common subject (discipline) classification to support browsing by categories. They want to engage in automatic classification research performed on the fulltext, structure or citations of the ETDs and apply machine learning methods.

B) Map to a common "switching" system

One of the more promising approaches to accomplish interoperability between heterogeneous collections and subject access systems is to map to a common "switching" system with sufficient breadth/coverage and depth to support mapping between large and small, highly specialised and shallow systems. In the case of mapping classification systems the outcome would be one common browsing structure mediating between separate local classification systems and services, without the need to duplicate classification at document level or to replace all the different local classifications. Loss of precision would be reduced as mappings are only done bilaterally and single-directionally, from the common system to each participating system. There is only one step of mediation between every two original systems.
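The switching idea can be sketched in a few lines of Python. Mappings are stored in one direction only, from the common system to each local system, so that n participating systems need n mapping tables rather than one for every pair. The service names and local class notations below are invented; only the role of the common system as a hub is taken from the text.

```python
# Mappings go one way only: from the common switching system to each local
# system. All class notations below are invented for illustration.
SWITCH_TO_LOCAL = {
    "gateway_a": {"020": ["LIS"], "510": ["MATH"]},
    "gateway_b": {"020": ["Bibl.1", "Bibl.2"], "530": ["Phys"]},
}


def browse(switch_class):
    """Resolve one class of the switching system to local classes per service."""
    return {
        service: mapping[switch_class]
        for service, mapping in SWITCH_TO_LOCAL.items()
        if switch_class in mapping
    }


def translate(source, local_class, target):
    """Go from one local system to another through the switching system."""
    # Find the switching classes that map to the given local class ...
    hubs = [s for s, local in SWITCH_TO_LOCAL[source].items() if local_class in local]
    # ... and follow each of them into the target system.
    result = []
    for hub in hubs:
        result.extend(SWITCH_TO_LOCAL[target].get(hub, []))
    return result


print(browse("020"))
print(translate("gateway_a", "LIS", "gateway_b"))
```

A user browsing the common structure at class 020 is led to the matching local classes in every service, and any two local systems are connected through exactly one step of mediation.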

In the EU project Renardus (cf. the remaining demonstrator and project documents at [Renardus]), concluded in 2002, we aimed at providing coherent access to (multilingual) web resources as selected and catalogued by many European Subject Gateways (incl. the UK RDN, DutchESS, DEFF, Finnish and German gateways etc.) from about 15 different distributed servers. The solution for integrated subject access was to map from the Dewey Decimal Classification, as a "switching language" (which appeared to be one of very few candidates for this purpose), to the participating gateways' classification and indexing systems. Thus, DDC could be used as a browsing interface (and search filter) accessing the (multilingual) resources in the local gateways [Koch et al 2001].

All resources were of course classified/indexed in one system selected by the local Subject Gateway according to some kind of controlled vocabulary. During the project, bilateral intellectual mappings between DDC and the local systems were carried out in cooperation.
We did some experiments with semi-automatic classification and mapping which underlined the need for a high quality controlled vocabulary for any decent results in this respect.
The initial intellectual mapping is of course a major and expensive task. It needs to be maintained and updated each time used parts of the two involved classification systems change.
With regard to sustainability, it was hoped that the owners of the established classification systems could be convinced of the benefits of these mappings even for their own databases, tools and services and would take over the task in the longer perspective.
Incidentally, German dissertations were part of Renardus, as the only source that was not a subject gateway.

On a national level, German subject-oriented digital libraries aim for interoperability in the common science portal Vascoda [Vascoda]. Accompanying research is funded by the Department of Education and Research (BMBF) in a project on model creation and treatment of heterogeneity as part of a Network of Excellence in New Services, Standardisation and Metadata (2004-2007), building upon the research at IZ Sozialwissenschaften/GESIS in Bonn [Hellweg et al 2001 and Krause 2004].

C) Support indexing/classification in several different systems

As a less ambitious alternative to increase interoperability in TD services, indexing/classification in several different systems might be supported, trying to increase overlap and build partial interoperability "from below". This approach will certainly not lead to any coherent solution, though.

This would be one alternative if anybody wanted to improve e.g. international services which have widely gathered TDs/ETDs with a large degree of subject indexing heterogeneity. Examples are OCLC's XTCat [XTCat] and the different NDLTD Union Catalogue flavours [NDLTD Search].

XTCat is a collection of 4.2 million theses and dissertations extracted from OCLC's union catalogue WorldCat. This collection was created some time ago, and hasn't been actively updated. The records available are mechanical transformations of the MARC records taken from WorldCat and involve no enrichment or authority control. XTCat is an OAI repository and as such has no form of access related to DDC, LCSH, etc. The only OAI "set" defined is for the ETD subset.
XTCat is independent of the NDLTD Union Catalog collection as presented by OCLC, which is dedicated to electronic theses and dissertations harvested from universities around the world (in this catalogue there are about 23 000 European ETDs among 145 000 in total). There is, however, an overlap of 70 000 ETDs between the two.
Both databases can be harvested via OAI (and other access protocols). The Elsevier Scirus service, the VTLS TD service [ETD Union Catalogue] and OAIster have been doing this for example to create their own services. XTCat is of course searchable via the WorldCat interface.
Without further analysis it is not possible to tell what degree of subject indexing (DDC, LCSH and similar) there is in these databases. Given the source, it might be quite high in XTCat in any case.
No metadata enrichment of those records, regarding name authorities or classification and indexing, has taken place so far, but it would be perfectly feasible.
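
The kind of analysis mentioned above could start from a small harvesting script. The following is a sketch only: the endpoint URL, the set name "ETD" and the sample response are placeholders invented for illustration, not the actual addresses or data of XTCat or the NDLTD Union Catalog. It builds an OAI-PMH ListRecords request and measures, for a harvested response in unqualified Dublin Core, the share of records carrying any dc:subject element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Placeholder endpoint; the real base URLs of the repositories are not given here.
BASE_URL = "https://example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # A resumptionToken is an exclusive argument: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def subject_coverage(response_xml: str) -> float:
    """Share of records in one ListRecords response that have a dc:subject."""
    root = ET.fromstring(response_xml)
    records = root.findall(".//oai:record", NS)
    if not records:
        return 0.0
    with_subject = sum(
        1 for record in records if record.find(".//dc:subject", NS) is not None
    )
    return with_subject / len(records)


# Invented two-record response, used instead of a live request.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata><oai_dc:dc>
      <dc:title>Invented thesis A</dc:title>
      <dc:subject>Number theory</dc:subject>
    </oai_dc:dc></metadata></record>
    <record><metadata><oai_dc:dc>
      <dc:title>Invented thesis B</dc:title>
    </oai_dc:dc></metadata></record>
  </ListRecords>
</OAI-PMH>"""
```

Here list_records_url(BASE_URL, set_spec="ETD") gives the request for a hypothetical ETD set, and subject_coverage(SAMPLE) reports that half of the sample records carry a subject. A real study would also have to distinguish which vocabulary each subject value comes from, which unqualified Dublin Core does not record.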

The subject access systems in the two databases could be a valuable source for any semi-automated classification or multilingual access efforts applied to other TDs.

Probably with similar motives, German Dissertations online recommends classifying and indexing TDs (or keeping such indexing) in subject specific systems (in addition to the general DDC and SWD indexing) such as: MSC and ZDM in Mathematics, PACS in Physics, ACM in Computing or MeSH as subject heading indexing in Medicine. Even two frequently used general systems, Basisklassifikation and Regensburger Verbundklassifikation, are accepted and stored. DINI as well will recommend further controlled vocabularies for German eprints.

Many disciplines have common and well established subject indexing systems which, when used for indexing of ETDs, allow specific documents to integrate into subject based services of different kinds. When these subject specific systems are mapped to general systems such as DDC, documents indexed with these specific systems only can automatically be correctly placed in broad general services without any extra effort.

A general recommendation should be to keep all subject indexing information in TD records and carry it with the records into new contexts. This will make many of the here mentioned operations easier.
A related recommendation is to classify (and index) in as much detail as can be afforded, and as close to the source of the publication as possible, to improve later options for the document to be re-used in both general and specialised services.
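
The two points above can be sketched together. The record structure, the function names and the small mapping table are assumptions made for illustration and are not an authoritative concordance: a record keeps every subject entry with its scheme, and a general DDC class is derived from a subject specific notation by longest-prefix lookup and added alongside the original indexing, never in place of it.

```python
from dataclasses import dataclass, field

# Illustrative excerpt only: (scheme, notation prefix) -> DDC class.
SPECIFIC_TO_DDC = {
    ("MSC", "11"): "512.7",    # number theory
    ("MSC", "35"): "515.353",  # partial differential equations
}


@dataclass
class Subject:
    scheme: str            # e.g. "MSC", "SWD", "DDC"
    value: str             # notation or heading
    derived: bool = False  # True if added by mapping rather than by an indexer


@dataclass
class Record:
    title: str
    subjects: list[Subject] = field(default_factory=list)


def derive_ddc(subject: Subject):
    """Longest-prefix lookup of a subject specific notation in the mapping table."""
    for end in range(len(subject.value), 0, -1):
        ddc = SPECIFIC_TO_DDC.get((subject.scheme, subject.value[:end]))
        if ddc is not None:
            return ddc
    return None


def enrich(record: Record) -> Record:
    """Return a copy keeping all original subjects and appending derived DDC classes."""
    existing = {s.value for s in record.subjects if s.scheme == "DDC"}
    added = []
    for subject in record.subjects:
        ddc = derive_ddc(subject)
        if ddc is not None and ddc not in existing:
            existing.add(ddc)
            added.append(Subject("DDC", ddc, derived=True))
    return Record(record.title, record.subjects + added)
```

A thesis indexed only with MSC "11A41" and an SWD heading would thus gain a derived DDC entry for general browsing, while both original entries travel on with the record. Marking derived values lets later services weigh them differently from intellectual indexing.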

D) Do nothing. Social tagging

"Do nothing" here means relying on fulltext indexing or citation indexing from Google and similar services, on success with social tagging systems and on other discovery approaches.

The existence of this (workshop) initiative indicates that such a strategy is not seen as sufficient by the organisers and participants. This view has the support of NDLTD and national initiatives to support eprints and institutional repositories as well as of parts of the CRIS and OAI communities.

As far as social (or community) tagging is concerned, the special characteristics of TDs lead to low initial expectations that this approach will contribute to useful general discovery services. TDs are very specialised publications for which largely unstructured tagging will not be very efficient. TDs are normally read and used by only a few people, and most of them would never feature in general social bookmarking or recommender systems.

We will have to watch the system closest to TDs so far, CiteUlike, a bookmarking service especially for academic papers [CiteUlike]. It is, however, quite probable that it will remain a rather complementary discovery/alert instrument regarding TDs.

In order to play an active role in such a clarification, one could of course start experimenting with subject tagging using controlled vocabulary in a subservice for ETDs or attached to subject based services and evaluate performance and usability.

Potential actions, e.g.:

Issue general recommendation to keep all subject indexing information in TD records and carry it with the records into new contexts.

Issue related recommendation to classify (and index) in as much detail as can be afforded, and as close to the source of the publication as possible, to improve later options for the documents to be re-used in both general and specialised services.

Start experimenting with "social" subject tagging using controlled vocabulary in a subservice for ETDs or attached to subject based services and evaluate performance and usability.

Agree on minimum subject access principles and features

Investigate browsing system/classification and granularity

Stimulate development of (semi-)automatic classification methods

Stimulate development of (semi-)automatic classification mapping methods

4 Multilingual Access

4.1 Problems:

It is hard to provide any meaningful background information when the ambitions of intended cooperations and services and the language distribution of existing TDs/ETDs are unknown.

Multilinguality can refer to different aspects of an information system or to all of them at once:

The documents' metadata
The documents' fulltext
The service's user interface
The users' queries
The subject indexing systems

One can aim for systems allowing and handling mixed languages or systems which require bi- or multilinguality throughout, for all documents and features, or, as Carol Peters formulates [Peters 2005]:

Multiple language access, browsing, display, e.g. in multilingual portals, supporting monolingual search in several different languages
Cross-language information discovery and retrieval, based on translation, dictionaries, thesauri etc.

Challenges she mentions are:

Querying of multilingual collection in one language against documents in many other languages
Filtering, selecting, ranking retrieved documents (in multiple languages)
Presenting retrieved information in an interpretable and exploitable fashion

Translation resources are e.g. machine translation, parallel/comparable corpora and bilingual dictionaries.

4.2 Approaches:

Because of recommendations, e.g. by DINI for German eprints, and in many local systems of non-English speaking countries, we can expect certain metadata to appear frequently in two different languages, typically in the local language and in English, one of the two being a translation. Most often, that is the title and the abstract of a document, sometimes keywords as well. Recommending English language keywords will definitely help multilingual access. For further improvement, bilingual or multilingual indexing systems can be used or created from scratch or by translation efforts. Classification systems can normally be operated with language neutral codes and are especially suited to be used for such a purpose.

Multilingual user interfaces are obviously much easier to provide than e.g. user query translation; query translation, in turn, might be more feasible than full metadata translation or translation of complete documents.

The best option for multilingual subject access seems to be the usage of multilingual subject indexing systems. In Europe there is a rich experience with multilingual subject specific thesauri, e.g. FAO's documents as indexed with the five languages of the AGROVOC thesaurus [AGROVOC]; EU documents using multilingual vocabularies like EURODICAUTOM [EURODICAUTOM]; the European Education Thesaurus [European Education]; the GEneral Multilingual Environmental Thesaurus, GEMET, or the European Treasury Browser Thesaurus (European Schoolnet EU project) [European Treasury].

Countries with several official languages often have such systems, e.g. Finland's library-based bilingual subject heading system or the trilingual UDC classification in Switzerland (ETH Zuerich).
General classification systems are often translated into several languages, not only the already mentioned CERIF 1991 classification, but even more so large systems such as DDC (available in up to 30 different languages, to differing degrees of completeness and up-to-dateness).

When such systems are used and the licenses are in place, multilingual access is merely a question of a suitable supportive user interface, using the vocabulary openly or behind the scenes. It can be very useful, almost replacing the need to translate queries or documents.
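
A minimal sketch of this idea, assuming invented records and a tiny excerpt of captions for two top-level DDC classes (the function and variable names are hypothetical): because documents carry language neutral notations, a term entered in any supported language resolves to a notation and retrieves documents regardless of their language, and the interface only has to pick the caption for the user's language.

```python
# Captions for two top-level DDC classes in three languages (illustrative excerpt).
CAPTIONS = {
    "510": {"en": "Mathematics", "de": "Mathematik", "fr": "Mathématiques"},
    "530": {"en": "Physics", "de": "Physik", "fr": "Physique"},
}

# Invented records: document language and language neutral class notation.
RECORDS = [
    {"title": "Invented thesis A", "lang": "de", "ddc": "510"},
    {"title": "Invented thesis B", "lang": "en", "ddc": "530"},
    {"title": "Invented thesis C", "lang": "fr", "ddc": "510"},
]


def classes_for_term(term: str) -> list[str]:
    """Find class notations whose caption in any language equals the user's term."""
    needle = term.casefold()
    return sorted(
        code
        for code, captions in CAPTIONS.items()
        if any(caption.casefold() == needle for caption in captions.values())
    )


def search(term: str) -> list[str]:
    """Retrieve titles in all languages via the notation behind the term."""
    codes = set(classes_for_term(term))
    return [record["title"] for record in RECORDS if record["ddc"] in codes]


def browse_label(code: str, ui_lang: str, fallback: str = "en") -> str:
    """Caption to display for a class in the user's interface language."""
    captions = CAPTIONS[code]
    return captions.get(ui_lang, captions[fallback])
```

A query for "Mathematik" would return the German and the French thesis alike, without translating the query or the documents. Exact caption matching is a simplification; a real service would also need synonyms, index terms and deeper levels of the classification.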

As an experiment in the Renardus project, we developed a pilot demonstrating multilingual access and browsing in the DDC, and thus, providing multilingual access to multilingual resources [Renardus ML]. It was of course limited to the languages and depths of classification provided by DDC/OCLC for the purpose and featured only very rudimentary user interface solutions.

Multilinguality can also be provided through mapping of systems in different languages: a project cooperation run by the Conference of European National Librarians had started to create manual mappings between the German SWD, the French Rameau and the English LCSH in some limited subject areas and to expose them in a prototype and retrieval demonstrator [MACS Multilingual Access to Subject Headings].

Potential actions, e.g.:

Agree on ambitions for multilinguality

Investigate practical methods to improve multilingual access

Summary

This paper provides background information and pointers to a discussion about potential Electronic Thesis and Dissertation services and semantic interoperability, subject access and multilinguality aspects in this context. It points out that available data and the aims of cooperation or new services need to be analyzed before more focused advice is possible. Several other communities have discussed and are developing highly similar features which need to be considered. Coordination with them is necessary. Every approach needs to be aware that TD records are created in different contexts and, to an even larger extent, will be (re)-used in multiple ways and systems.

Semantic interoperability needs to operate on a data structure (metadata profile), categorical data (classification, indexing, KOS) and factual data (authority values) level. Metadata enhancement in the context of services, after initial creation, is an important activity.

Regarding subject access, mapping of heterogeneous systems to one common "switching" system for integrated browsing and search filtering and for a certain multilingual access seems to be the interoperability approach to favour at this time. This approach not only requires many practical agreements between cooperating partners and local/national recommendations, but also needs to support and participate in larger scale experimentation, research and development with a practical focus, e.g. (semi)-automatic classification and indexing, classification mapping methods and end-user usability studies. The exploitation of subject access data in many existing TD and broader systems seems far from sufficiently well designed to be of much benefit to end-users' resource discovery.

References/Notes

AGROVOC thesaurus http://www.fao.org/aims/ag_intro.htm

Central Statistical Office and the Science Council categories http://www.ub.uu.se/epub/categories/

CERIF http://www.eurocris.org/en/taskgroups/cerif/

CERIF 1991 Discipline Classification Schema ftp://ftp.cordis.lu/pub/cerif/docs/CERIFOJ.DOC

CiteUlike http://www.citeulike.org/

Coleman, A. and Roback, J. (2005). Open Access Federation for Library and Information Science. DLIST and DL-Harvest. In: D-Lib Magazine, Dec. 2005
http://www.dlib.org/dlib/december05/coleman/12coleman.html

Danish National Research Database http://forskningsbasen.deff.dk/?lang=eng

DDC-Sachgruppen der Deutschen Bibliographie http://opac.dbf.ddb.de:8280/hermes/help/du/hilfe_sachgruppen.html#ddcsg

DEFF, Denmark's Electronic Research Library: separated search services http://www.deff.dk/default.aspx?lang=english

Dempsey, Lorcan, Childress, Eric, Godby, Carol Jean, Hickey, Thomas B., Vizine-Goetz, Diane and Young, Jeff (2004). Metadata switch: thinking about some metadata management and knowledge organization issues in the changing research and learning landscape.
Available at: http://www.oclc.org/research/publications/archive/2004/dempsey-mslitaguide.pdf

DINI (German Initiative for Networked Information) (2003a). Electronic Publishing in Higher Education. How to design OAI Interfaces -Recommendations- October 2003 http://www.dini.de/documents/OAI-Empfehlungen-Okt2003-en.pdf
DINI (German Initiative for Networked Information) (2003b). DINI-Certificate. Document and Publication Repositories. October 2003 http://www.dini.de/documents/Zertifikat-en.pdf

ePrints UK http://www.rdn.ac.uk/projects/eprints-uk/ and http://www.ukoln.ac.uk/projects/eprints-uk/

ETD Metadata Standard http://www.ndltd.org/standards/metadata/current.html

EURODICAUTOM http://europa.eu.int/eurodicautom/Controller

European Education Thesaurus http://www.eurydice.org/TeeForm/FrameSet_EN.htm

European Treasury Browser Thesaurus http://www.eun.org/eun.org2/eun/en/etb/content.cfm?lang=en&ov=7208

Fox, E. A., Zhang, B. and Richardson, R. (2005). Networked Knowledge Organization for the NDLTD. Presentation at the NKOS Workshop at ECDL 2005.
http://www2.db.dk/nkos2005/Fox-Zhang-Richardson.pdf

GEMET, GEneral Multilingual Environmental Thesaurus http://www.mu.niedersachsen.de/cds/etc-cds_neu/library/select.html

Guy, M., Powell, A. and Day, M. (2004). Improving the Quality of Metadata in Eprint Archives. In: Ariadne 38, Jan. 2004
http://www.ariadne.ac.uk/issue38/guy/

Hellweg, Heiko, Jürgen Krause, Thomas Mandl, Jutta Marx, Matthias N.O. Müller, Peter Mutschke, Robert Strötgen (2001). Treatment of Semantic Heterogeneity in Information Retrieval. In: IZ-Arbeitsbericht Nr. 23, May 2001, 47 p.
http://www.gesis.org/Publikationen/Berichte/IZ_Arbeitsberichte/pdf/ab_23.pdf

Hillmann, D. I., Dushay, N. and Phipps, J. (2004). Improving Metadata Quality: Augmentation and Recombination. DC2004 Conference, Shanghai.
http://metamanagement.comm.nsdl.org/Metadata_Augmentation--DC2004.html

IST ERA Taxonomy http://www.cistrana.org/files/IST_ERA_taxonomy.pdf

Koch, Traugott and Day, Michael (1997). The role of classification schemes in Internet resource description and discovery (EU Project DESIRE. Deliverable D3.2.3)
http://www.lub.lu.se/desire/radar/reports/D3.2.3/

Koch, Traugott, Neuroth, Heike and Day, Michael (2001). Renardus: Cross-browsing European subject gateways via a common classification system (DDC). In: "Subject Retrieval in a Networked Environment". Proceedings of the IFLA Satellite Meeting sponsored by the IFLA Section on Classification and Indexing and the IFLA Section on Information Technology, 14-16 August 2001, Dublin, OH, USA.
UBCIM Publications - New Series Vol. 25, Muenchen 2003. pp25-33.
Manuscript at: http://www.lub.lu.se/~traugott/drafts/preifla-final.html

Krause, Juergen (2004). Standardization, Heterogeneity and the Quality of Content Analysis: a key conflict of digital libraries and its solution. IFLA Journal 4/2004, pp.310-318
http://www.ifla.org/V/iflaj/lj-4-2004.pdf

LEAF project http://www.crxnet.com/leaf/

Lund University Dissertations http://theses.lub.lu.se/postgrad/?lang=en

MACS (Multilingual Access to Subject Headings) http://laborix.kub.nl/prj/macs/macsd.html

Metadata Exchange Format for Documents (MXD) http://www.forskningsdatabasen.dk/cgi-bin/egwddf2/6969/screen.tcl/name=about5&service=ddf2&context1=a&lang=eng

MetaDiss, based upon the Dublin Core Metadata Element Set (DCMES) http://deposit.ddb.de/metadiss.htm

NDLTD Networked Digital Library of Thesis and Dissertations http://www.ndltd.org

NDLTD Search and Browse ETDs http://www.ndltd.org/browse.en.html
ETD Union Catalogue http://zippo.vtls.com/cgi-bin/ndltd/chameleon

OCLC Metadata Switch project http://www.oclc.org/research/projects/mswitch/default.htm

Online dissertations and post-doctoral theses (since July 1998) http://deposit.ddb.de/netzpub/web_online-hochschulschriften_e.htm (info) at Die Deutsche Bibliothek http://deposit.ddb.de/index_e.htm (incl. Links to search)
Based on the "Dissertationen online" project funded by the DFG http://www.dissonline.de/index.htm

Ortelius Thesaurus on Higher Education http://www.cordis.lu/cerif/src/toolkit.htm#subject

Patel, Manjula, Koch, Traugott, Doerr, Martin and Tsinaraki, Chrisa (2005). Semantic Interoperability in Digital Library Systems.
DELOS Network of Excellence on Digital Libraries, European Union, Sixth Framework Programme. Deliverable D5.3.1 June 2005. 71 pp.
Available at: http://www.ukoln.ac.uk/ukoln/staff/t.koch/publ/SI-in-DLs.doc

Peters, Carol (2005). Multilingual Information Access for Digital Libraries. Presentation at Delos brainstorming meeting, Juan les Pins, December 5-6, 2005.

Renardus. http://www.renardus.org/
Renardus Project Archive and Associated Research and Development. 2002. http://www.renardus.org/about_us/project_archive.html

Renardus ML http://renardus.lub.lu.se/ml-browse/

Renardus usage of subject vocabularies:
Koch, Traugott (2000) Cross-browsing in Renardus: Usage of subject vocabularies at Renardus gateways http://www.lub.lu.se/renardus/class.html
Renardus classification systems used http://www.lub.lu.se/renardus/D6.1class.html
Renardus keyword systems used http://www.lub.lu.se/renardus/D6.1keyw.html

SVEP (Coordination of e-Publishing within Swedish Higher Education) http://www.ub.uu.se/epub/recommendations/rek1_0.pdf (in Swedish only)

Tapir http://sourceforge.net/projects/tapir-eul/
Jones, R. (2004). The Tapir: Adding E-Thesis Functionality to DSpace. In: Ariadne 41. http://www.ariadne.ac.uk/issue41/jones/intro.html

UK Recommended E-Theses Metadata Set http://www2.rgu.ac.uk/library/guidelines/metadata.html

Vascoda http://www.vascoda.de/

Virginia Tech ETD-db system http://scholar.lib.vt.edu/ETD-db/

XTCat http://www.oclc.org/research/projects/etd/default.htm
http://alcme.oclc.org/xtcat/index.html

Traugott Koch
Created: 2006-01-10
Last modified: 2006-01-24
URL: http://www.ukoln.ac.uk/ukoln/staff/t.koch/publ/e-thesis-200601.html

Contents:

1 Introduction

Purpose and content of the paper, limitations:

Related communities and developments:

TD specifics:

Alternatives:

2 Interoperability

2.1 Syntactic interoperability:

2.2 Semantic interoperability:

3 Subject Access

3.1 Problems:

3.2 Approaches:

A) Use one common system, (re)classify/(re)index all documents (manually or with automated support). Select an established or create a new system

B) Map to a common "switching" system

C) Support indexing/classification in several different systems

D) Do nothing. Social tagging

4 Multilingual Access

Summary

References/Notes
