Koch, Traugott: Terminology and subject access issues reg. eBank UK

Report, August 2006

Contents:

1. Introduction
2. Purpose of subject access and terminology in eBank UK. Requirements
3. Terminologies
4. Keyword assignment/indexing
5. Ontology development?
6. Data and text mining applications
7. Discovery: Searching, browsing, linking
8. Conclusions
9. References

1. Introduction

The purpose of this report is to describe issues and to list possible investigations and activities related to subject access and terminologies for a potential next phase of the eBank UK project (http://www.ukoln.ac.uk/projects/ebank-uk/) and for improvement of the eCrystals repository and the eBank aggregator.

eBank solutions are relevant for and influence, beyond the project's own services, related data repositories, publication repositories, aggregators and other services down the line that will re-use or interact with eBank data. At this time, eBank records appear in a few general information sources, such as OAIster and Google. Our solutions, i.e. standardised named entities such as InChIs or author names, and keywords, not only influence other services; they should be actively re-used elsewhere for improved interoperability and a better cost-benefit relation.

The issues regarding searching eCrystals metadata are probably highly similar to those related to searching metadata for publications; compared with fulltext searching of textual publications, however, searching in data files, data sets and data holdings is most probably considerably different.

Other e-Science projects and repositories will face very similar issues of terminology usage and subject access. This text focuses on access to the eCrystals data repository, its holdings, files and the metadata about them as displayed in the repository's HTML pages that represent the data holdings and point to its component files. The report makes no assumptions about general applicability of any of the problems and solutions described. It is published outside the project to stimulate readers to compare and to investigate similar terminology and subject access issues in their own projects.

2. Purpose of subject access and terminology in eBank UK. Requirements

Subject indexing, subject access features and terminologies are needed to assist in discovery and retrieval in the eCrystals repository [eCrystals repository] and to allow a continuation of the search for identical and similar material in other sources and databases (for a relatively early discussion related to OAI repositories in the UK see [Simpson]). Most metadata elements in the eCrystals repository can be used to search for a "known object": data holdings produced by certain authors, belonging to a given institution, created in a certain year. To identify a given data holding/crystal structure report (HTML pages in the repository summarising information on the crystal structure and linking to the related data files which make up the data holding), especially considering that some authors are represented with many reports already in this repository, InChI, Chemical formula (both subject elements) and IUPAC name (used as title element) have to be specified. Here, we run into many issues related to standardisation, representation and the treatment of search strings in different search services and elements. Advice and browsing options are required to assist a user even in known-object searching (cf. the search section below).

For subject or topical searching, the eCrystals repository offers the subject elements Keywords and Compound Class (only useful as a very rough search filter). Again, searching and browsing for topics needs to be supported with appropriate features and help information to be really useful even for discipline experts (see search section below). For fully developed subject access, more fine-grained classes of compounds, classes of molecules and components of structures would probably need to be more easily browsable and searchable, including search by structural features and substructure search via options to search certain strings or to search with graphical support.

Apart from the use of terminologies and subject access features for discovery and retrieval, the same features are to be used by the creators of metadata, e.g. authors selecting keywords or specifying an InChI (using the IUPAC InChI generator tool offered as support by the eCrystals repository). Certain support features and searching/browsing devices might very well be the same for both groups of "users", people providing and managing data and metadata and people searching the repository.

The original selection of subject access features and terminologies was based on the experience of Crystallography researchers. The initial requirement study for eBank UK [Day] did not go into such details, nor was it amended after the first versions of the real repository became available. It seems quite important, now that some initial experience has been gained and before a planned extension into a discipline-wide federated service, to go back and define again who is supposed to search and use the repository and eBank services, for what purpose and with what level of knowledge. First discovery and retrieval experiences could now be evaluated and improved requirements and specifications could be formulated. As soon as other data sources reuse the Southampton data (eCrystals holds at this time only data from the crystallographers at Southampton University), link to it or cross-search it (or vice versa), additional requirements emanating from heterogeneous sources, purposes and users will need to be investigated and the eCrystals solution will need to be adapted. After such new requirement and usage studies, we need to go back to the question of what kind of subject access and terminology support is needed to satisfy present and future needs of all kinds of users.

For many of the intended and future uses of a repository, questions of interoperability need to be investigated and solved. That does not only influence metadata semantics and syntaxes, but also the relationships between standards or other solutions regarding chemical names, formulas, structures and their representation and also research aspects and application contexts and similar, as expressed in keywords. They need to be discoverable from other related terminologies, naming and representation schemes (and vice versa) and from the query vocabulary of different groups of (potential) users, a goal which can be addressed e.g. by mapping or conversion between terminologies and schemes.

For these tasks, we will have a better basis once other Crystallography data joins eBank, other independent Crystallography repositories emerge, repositories from related disciplines (other subdisciplines in Chemistry, Mineralogy, Biosciences, Pharmacology, Materials Science in Engineering etc.) start interacting, or cross-linking and cross-searching are developed between eCrystals, other data repositories and publication repositories (from research institutions, publishers, scientific societies etc.).

Like probably everybody else in the fields of e-Science and Digital Libraries, we know very little about the requirements for subject-based discovery of data sets. In addition, these requirements and usage contexts are probably quite different in different disciplines. Regarding Crystallography and our eBank repository, we will need to investigate this issue in more detail. The pedagogical evaluation of the eBank project phase 2 might come up with some insights or could be extended to cover such details. There might be literature about information searching in Crystallography. We need to describe proper, realistic and detailed use cases involving data discovery. Not least, capturing user queries (and failures), user studies and analysis of usage logs of the repository itself or similar information services could provide valuable insights.
All of this will help us to understand the role our present subject access features and discovery options play and assist in developing complementary features, improved support and user interface solutions.

In a data model, and conceptually, the assignment of keywords, for example, plays a different role when used for discovery of data holdings and sets than it plays in searching publications/text or in finding smaller entities inside data files. Much finer-grained indexing, e.g. of data inside files, is probably not feasible with traditional indexing. The NERC DataGrid project is presently engaged in intensive development work to solve these problems, applying a vocabulary server, terminology mapping and a terminology web service to support NDG's discovery portal [Lawrence]. Cooperation with NDG and other e-Science projects engaged in terminology developments would certainly be helpful. Other techniques will probably have to be explored, such as data mining (see section 6) and data markup. Both will be different from the same techniques applied to text, which are rather well developed. We should find out, in cooperation with the JISC funded SPECTRa project, how to apply e.g. the Chemical Markup Language (CML) for such purposes and try to adapt their tools.

Among the new conceptual problems is, for example, what exactly the eCrystals keywords express. As far as can be judged from their appearance within repository records, they do not describe the data holding, features of the data in the files, or even methods and techniques used to generate the data. Rather, they seem to indicate usage and research contexts in which the data holding/crystal structure might be relevant, predominantly from the full discipline of Chemistry. So, these keywords are probably not to be found anywhere in the data files and are not directly comparable to keywords depicting the topics of a paper. Thus, any future cross-search, e.g. in the eBank aggregator as provided by UKOLN [eBank UK aggregator], has to be carefully designed. The submission and search interfaces need to provide appropriate information as well.

When more advanced subject access features are implemented, user studies, logs etc. have to be applied again, to be able to judge the benefits arising from them and to further improve the solutions.

A systematic study of terminology and subject access issues needs to take at least the following categories into account:

Methods:
Entities. Mining. Higher level hypothesis and knowledge generation

Services:
Searching, browsing, harvesting, awareness/news, data extraction

Users:
Disciplines: Crystallography; related disciplines (Chemistry, Mineralogy, Biosciences, Pharmacology, Materials Science in Engineering); general
Levels: researchers, learners, general public

Document types:
Data, data holdings, publications, learning objects; metadata.

Data provision and service entities:
a) one repository
b) group of repositories (subject, institutional)
c) third party and aggregator services (subject, institutional, general)

3. Terminologies

Based on the need for terminologies in different contexts, as described above, we have to look for suitable candidate vocabularies in the discipline, in related disciplines or, for limited purposes, even general systems. For interoperability and in order to keep down creation and maintenance costs, it is only in very rare cases advisable to create new terminologies from scratch. Even if based on existing terminologies, the costs and efforts required to adapt and maintain such terminologies are considerable, and such steps should only be taken by scientific societies and other institutions with a "global" long-term commitment and status in the community/discipline (or by established corporate entities in certain cases).
In the following, we will look at candidate terminologies for the eBank repository and services and describe what is used at this time and what might be explored in the future. Other Crystallography or Chemistry related services might possibly have an interest in the details of the remainder of this section.

To our knowledge there are no specific terminologies for our purposes developed and available in Crystallography (the CIF dictionary, see subsection 3 below, is about metadata element semantics only). Regarding topical vocabularies, there seems to be no detailed classification or categorisation scheme available (Compound Class is only an extremely shallow categorisation for filtering purposes), nor any larger ontology covering the discipline (see section 5). Whether any use can be made of parts of established classification systems such as DDC and UDC from the library world, or of systems such as Chemical Abstracts from the world of commercial database and abstracting and indexing services, has to be investigated at a later time.

Regarding keyword indexing, a list of keywords from IUCr (International Union of Crystallography) is used by the eBank project (for details see part 1 under "What eBank UK currently uses" below).

For future work, we need to investigate alternative terminological options from all of Chemistry or other terminologies which might contain Crystallography concepts and terms, find out about who is maintaining them and what their status and level of usage and acceptance is. We need to study their coverage i.e. topics, methodologies, techniques, materials, related disciplines etc. and the quality of their content and construction.
This includes the Chemical Abstracts system used in the CA database and owned by the company with the same name. Probably, there is terminology in the CA system used to index Crystallography in chemical articles or even the Crystallography publications themselves. Suitable terminologies could be hidden in other related services and databases. The Cambridge Structural Database (CSD), hosted at CCDC [CSD] seems to contain bibliographic information and CIF metadata, but no topical vocabulary.

Specialised dictionaries and encyclopedias, textbooks and learning models could be useful sources for terms and especially definitions of terms and concepts.
Didactical and e-Learning vocabularies might be available for this (though not known to us) or related disciplines or could be extracted from e.g. subject centres in e-Learning or portals such as PSIGate or the RDN Engineering hub.

All of these terminologies will most probably need to be amended to cover specific terms for a data repository, its granular data, the machinery and workflow used and the kind of crystal structure analysis which is the main content of the eBank repository.

Regarding authority solutions for named entities, eBank uses several syntax encoding schemes from Chemistry which are more or less standardised: InChI (International Chemical Identifier) for chemical compounds, Chemical Formula and IUPAC Chemical Names (details see below). A certain involvement in standardisation efforts regarding such schemes, at least via IUCr, is clearly important for large subject repository efforts like eBank.

Regarding name authorities (person names and institutional names), membership information from professional societies such as IUCr could possibly be used or information from universities with relevant institutions, provided integrity legislation does not prevent such re-use.
Another option is cooperation with library and publication oriented name authority projects (e.g. LEAF, OCLC, Library of Congress etc.). Finally, such authority files and services could be based upon text mining from publications and other suitable sources (see section 6).

What eBank UK currently uses:

At this time, eBank UK uses and needs, according to the Crystallographers and repository providers, terminology in three areas:

1) Keywords covering usage /problem contexts, "what crystallographers do", methods, procedures, techniques etc.
= IUCr keywords

The maintainers of the repository have been adapting a keyword list provided by the International Union of Crystallography which contains about 4000 terms covering all of Chemistry. This list is compiled from the entries of members of the World Directory of Crystallographers (where members can, without any restriction or guidance, fill in terms describing their areas of interest and expertise when joining), cf. [World Directory]. As such, this list contains uncontrolled terms. The project has created a selection of these terms for usage primarily as metadata creation support in the repository deposition tool (the list is at: [eBank UK keywords]). Neither the principles of the selection or adaptation nor the frequency of term use in the present repository are known to the author of this report. The list as a whole seems too large as support for indexing in this repository, however. Because of the adaptation, and since it has not been standardised by the community, the list is identified as the vocabulary "Ebank UK keywords" in the eBank UK namespace [eBank UK namespace].

Further steps to convert this list into a proper controlled list according to the rules of terminology creation might be advisable. Whether it can serve, maybe together with free keywords assigned to other IUCr publications (e.g. some Acta Crystallographica series), as the basis for a new Crystallography terminology such as a thesaurus, a classification system of some kind (introducing hierarchies) or even an ontology has to be decided later.
Because of its character and the fact that IUCr acts as a major publisher in the discipline, there might be a possibility for connections to publication vocabularies and other databases, especially in the whole field of Chemistry.

On the road to proper discipline-wide, sustainable and interoperable terminology systems, it is an advantage that IUCr appeared to be prepared to support further development of the list and to host and maintain it.

The problems for aggregator services such as the eBank aggregator [eBank UK aggregator] are far more wide-reaching than having a decent topical terminology in the eCrystals repository. Today, searching both bibliographic publications (from IUCr only at this time) and metadata about the data holdings of eCrystals is severely hampered by the fact that the two metadata sets have very little in common. Matches are mostly based on authors' last names and, when it comes to topics, on accidental identity between words from the abstracts of publications and keywords in the data repository. In order to become interoperable, other data repositories or publication databases would either have to use the same keyword list as eCrystals, or this list would have to be mapped to all the other subject access systems in important related sources, which would be a huge and ongoing task. In addition, the specialised crystallographic terminology needs to map back to more general Chemistry or generic systems covering many disciplines.
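
The mapping idea in the paragraph above can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the crosswalk entries are invented examples (not taken from the eBank UK keyword list or from any other vocabulary), and the function names expand_query and record_matches are hypothetical. It only shows how a maintained mapping would let a query in one vocabulary find records indexed with another, instead of relying on accidental word identity.

```python
# Hypothetical crosswalk: repository keyword -> equivalent terms used elsewhere.
# The entries are invented examples, not real eBank UK or IUCr vocabulary data.
CROSSWALK = {
    "polymorphism": {"crystal polymorphism", "polymorphs"},
    "hydrogen bonding": {"hydrogen bonds", "h-bonding"},
}


def expand_query(term, crosswalk=CROSSWALK):
    """Return the query term plus all mapped equivalents, in both directions."""
    key = term.strip().lower()
    expanded = {key}
    for keyword, equivalents in crosswalk.items():
        if key == keyword or key in equivalents:
            expanded.add(keyword)
            expanded.update(equivalents)
    return expanded


def record_matches(record_terms, query_term, crosswalk=CROSSWALK):
    """True if any subject term of a record equals the query or a mapped equivalent."""
    wanted = expand_query(query_term, crosswalk)
    return any(t.strip().lower() in wanted for t in record_terms)


# A publication indexed with "Hydrogen bonds" is found by the repository keyword.
print(record_matches(["Hydrogen bonds"], "hydrogen bonding"))
```

The sketch also makes the maintenance cost visible: every new source vocabulary adds entries to the crosswalk that somebody has to create and keep up to date.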

2) Names of the objects of study and their components: crystal structures, chemical compounds

2.1 IUPAC (International Union of Pure and Applied Chemistry) Chemical Names.
They are used in the eCrystals repository as titles for the data holdings. It is a syntax encoding scheme, containing rules about how to name, but not a vocabulary enumerating the full list of possible values.

Chemical expertise and software support is needed to properly create such an IUPAC Chemical Name following current IUPAC conventions for their generation, as provided in the Colour Books on chemical nomenclature, terminology, symbols and units (nomenclature guides developed in the early 90's for organic compounds, inorganic chemistry, macromolecular and biochemical nomenclature). The eCrystals repository deposit interface offers these guidelines to the data holding submitter. A plug-in tool to support creation of proper names is available as free software, but not yet integrated into the deposit system.

The Gold Book [Compendium] contains 7000 terms (and phrases), with authoritative definitions, spanning the whole range of chemistry. It is a controlled vocabulary, maintained by IUPAC and soon to be made available in XML format. eBank does not use the Gold Book at this time.

2.2 A second scheme used for specifying named entities is the InChI (International Chemical Identifier), an open standard for chemical structure representation [Stein et al; IUPAC InChI site]. InChI allows molecular structure to be encoded as a text string and, importantly for crystal structure data, allows the phase of a compound and the crystalline phase descriptor to be included (for further details about the eBank usage cf. [Coles et al 2006]). At this time, InChIs cannot yet describe all crystal structure data.
Quite a few tools are available for the generation of InChIs. The eBank deposition toolbox uses the IUPAC InChI generator to convert an uploaded structure file into an InChI string and displays it in an animated 2D form so that the submitter can check its integrity.
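
To make the "text string" character of an InChI concrete, the following minimal Python sketch splits an identifier into its slash-separated layers. It is an illustration under simplifying assumptions, not part of the eBank toolbox: the function name split_inchi is hypothetical, the formula layer is assumed to be present, and only the first occurrence of each prefixed layer is kept. The example string is the identifier for ethanol.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and one-letter-prefixed layers.

    Simplified sketch: assumes a formula layer is present and keeps only the
    first occurrence of each layer prefix.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        # The layer directly after the version carries the chemical formula.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Later layers start with a prefix letter, e.g. c (connections), h (hydrogens).
            layers.setdefault(part[0], part[1:])
    return layers


ethanol = split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

Because the layers are ordered from general (formula) to specific, a search service could in principle compare only the leading layers to find related compounds, which is one reason such identifiers are attractive for linking.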

2.3 A third syntax encoding scheme is used in the eBank subject elements, the Chemical Formula. The deposit system provides guidelines for its composition. It is a convention in Crystallography based on the standard documented as CIF dictionary. Other chemists could potentially create chemical formulas in a different way.
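
The risk that differently written formulas fail to match can be reduced by comparing element counts instead of raw strings. The Python sketch below is a minimal illustration under stated assumptions: it handles only flat formulas (element symbols with optional integer counts, with or without spaces), not parentheses, charges or fractional counts, and the function names element_counts and same_composition are hypothetical.

```python
import re

# One element symbol (capital letter, optional lowercase letter) and an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Return a dict of element -> count for a flat formula such as 'C2 H6 O'."""
    compact = formula.replace(" ", "")
    counts = {}
    pos = 0
    while pos < len(compact):
        match = TOKEN.match(compact, pos)
        if match is None:
            raise ValueError("unsupported formula syntax: %r" % formula)
        symbol, number = match.group(1), match.group(2)
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
        pos = match.end()
    return counts


def same_composition(formula_a, formula_b):
    """True if both formulas describe the same element counts."""
    return element_counts(formula_a) == element_counts(formula_b)


# Spaced and compact spellings with a different element order compare as equal.
print(same_composition("C2 H6 O", "OC2H6"))
```

Note that equal element counts identify composition only, not structure; isomers share a formula, which is why the formula complements rather than replaces the InChI.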

In order to uniquely identify chemical compounds, CAS numbers (Chemical Abstracts) are quite well established. They are completely numerical but do not convey any further information. This makes them in principle easier to use for linking and searching purposes and could reduce errors. To assign them to data sets, the numbers need to be looked up. Requiring new CAS numbers might, however, involve fees. We will have to investigate if it makes sense in the context of eBank to add or at least map to CAS numbers in order to find identical compounds in other, especially commercial databases.
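
One practical reason why CAS numbers could reduce errors, not discussed further in this report, is that the final digit of a CAS Registry Number is a check digit: the preceding digits are weighted 1, 2, 3 and so on from right to left, and the sum modulo 10 must equal the last digit. The Python sketch below shows such a validation step as it might be used before mapping or linking; the function name is_valid_cas is hypothetical, and passing the check only means the number is well formed, not that it is assigned to the intended compound.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Check the format and the check digit of a CAS Registry Number string."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # water
print(is_valid_cas("7732-18-4"))  # same number with a mistyped last digit
```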

IUPAC chemical names are more generic and "fuzzy", whereas InChIs, containing a hierarchical structure, are more precise and specific. The former provide, however, superior human readability, if one can say this at all about an IUPAC name. InChIs might be easier to work with in digital services.
All three schemes together provide a quite rich approach to describe the chemical compounds, their structure and components. If the rules are strictly followed, if validation can be applied and the practice in other databases and sources is equivalent, then there is a good potential to base cross-searching, cross-browsing and linking operations on these schemes and, thus, to implement an important stepping stone to semantic interoperability (cf. chapter 7).

3) The data, dataset and file parameters and features:
=CIF dataset parameters/features.

The final result of a crystal structure determination is a Crystallographic Information File (CIF) [Brown]. The terminology used in CIF files for describing the elements/fields is specified in dataset-specific CIF dictionaries (in HTML, expected to be done in XML).
"The Core CIF Dictionary is a set of data names designed to cover the requirements of archiving and exchanging raw and processed data and derived structural results for single-crystal small-molecule and inorganic crystal studies." In addition, it comprises specifications for bibliographic metadata related to journal publication. [IUCr Crystallographic]

This is all part of the "Crystallographic Information Framework, a broader system of exchange protocols based on data dictionaries and relational rules expressible in different machine-readable manifestations, including, but not restricted to, Crystallographic Information File and XML". CIF is a standard now (Version 1.1, working specification 2003). The standard, the dictionaries and related publications are all available at [IUCr Crystallographic].

The author of this report is not sure whether the eBank Dataset Types, which in the metadata records specify categories of the experimental process (workflow stages such as processing, solution, validation etc.), and the slightly different headings for groups of files and individual files, which are used in the crystal structure report HTML pages, are taken from the CIF standard/dictionary, from another source or invented by the project.

The eBank deposit toolbox performs several value-adding processes to the data files, e.g. validation of the structure with the webservice CHECKCIF and translation of the structure into CML format, thus, improving discovery and reuse of the data.
The terminology used in CIF files is crucial for all data discovery and retrieval from inside data files in the eCrystals repository.

Maintenance problem, sustainability:

Future studies regarding requirements and user experiences as proposed in chapter 2, and evolving services across repositories and databases, will make it possible to decide to what degree the terminologies and subject access features now in use need to be modified, further "controlled" (on the road to controlled vocabularies) or standardised. Only in an international context of professional societies can it be decided whether new terminologies need to be created, whether they will be adopted and understood by users and whether the benefits justify the costs.

As indicated above, IUCr and to a certain degree IUPAC, seem to be prepared to take responsibility for long-term maintenance and hosting of such terminologies even though sufficient funding is not necessarily assured. Their role is even more crucial if new terminologies and maybe formal ontologies were developed and mappings for interoperability with other (sub)disciplines and communities needed to be created and permanently updated.
Everything needs to be done and support provided to enable proper standardisation focusing on all components needed for digital repositories, documentation, preservation and exchange of data and interoperable services, even beyond the obvious techniques of a semantic web.

4. Keyword assignment/indexing

eBank UK keywords based on the IUCr list (see section 3, part 1) are now offered as a long unstructured alphabetical list of terms to authors/submitters of crystal structure reports for selection. It is unclear what the criteria and rules are for keyword assignment in the repository; no guidelines or rules are provided to submitters and metadata creators. The relationship between the keywords and the content of the data holding needs to be declared and described to the submitters. They need to know which and how many terms to use in order to enhance consistency. There are general rules for document indexing (and even an ISO standard) which might give some guidance. Both extremes are unsuitable: too few terms and too many. There are general rules about coverage, exhaustivity and specificity in indexing as well. The general granularity of the indexing has to be decided too, depending on present and future needs and applications and related to the costs of indexing (time). All of these specifications heavily influence search and retrieval, not only in this repository but in all further related and co-searched sources down the line.

There are also several quality problems to consider: who is best suited to do term assignment, what level of errors is observed, how can errors be corrected, how can a sufficient level of consistency in indexing be assured etc.? One could investigate several support and validation features based on semi-automatic indexing (which would need to be developed together with suitable partners). A further goal would be to have such a tool propose most of the relevant terms automatically.
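
A very simple form of the semi-automatic support mentioned above can be sketched in Python as follows. This is a minimal illustration, not a proposal for the actual tool: it assumes that some free text about the data holding (e.g. a title or an abstract of a related paper) is available, the function name suggest_keywords is hypothetical, and the controlled terms in the example are invented rather than taken from the eBank UK keyword list. It proposes those controlled terms that literally occur in the text, most frequent first, and leaves the final choice to the submitter.

```python
import re


def _tokens(text):
    """Lower-case a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def suggest_keywords(text, controlled_terms, limit=5):
    """Propose controlled terms that occur as whole words or phrases in the text."""
    words = _tokens(text)
    scored = []
    for term in controlled_terms:
        needle = _tokens(term)
        n = len(needle)
        if n == 0:
            continue
        # Count positions where the term's word sequence occurs in the text.
        hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == needle)
        if hits:
            scored.append((hits, term))
    # Most frequent first; ties are ordered alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1].lower()))
    return [term for _, term in scored[:limit]]


description = ("Hydrogen bonding and polymorphism in a copper complex; "
               "hydrogen bonding dominates the packing.")
terms = ["Polymorphism", "Hydrogen bonding", "Catalysis", "Copper"]
print(suggest_keywords(description, terms))
```

Literal matching of this kind misses synonyms and spelling variants, which is exactly where a controlled vocabulary with non-preferred terms, or the mining techniques discussed in section 6, would have to come in.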

For interoperability and as a background, it would be very useful to investigate the indexing principles and terminologies used in other related databases such as CSD, the Cambridge Structural Database, or by other journals in the discipline (e.g. IUCr journals) or by Chemical Abstracts to index publications/journal articles.

Important areas of future development are:

1) Exploration of the benefit of text and data markup, e.g. using CML, the Chemical Markup Language [Open Source] for subject access and the potential opportunities arising from a combination of text markup with keyword indexing and traditional "fulltext" searching.

2) Experimentation with methods of participatory indexing (social tagging) and folksonomies (categorization and clustering done by users), maybe initially carried out by well-defined research groups in specified research areas. A discipline such as Crystallography and research data are among the best suited settings to explore the benefits of these approaches.

5. Ontology development?

It is often said that decent digital information systems need to feature ontologies. Representatives of this discipline and the repository providers have talked about such ambitions as well. They wanted to turn the list of 4000 terms into an ontology. The expectation of being able to do this, and moreover in a short period of time, might be due to a differing understanding of the term Ontology. Applying certain measures of control to a vocabulary or organising it into a hierarchy of some kind doesn't, according to a precise, semantic web related understanding, make a vocabulary an ontology.

We should only talk about an ontology when precise and rich semantic relationships are linked to concepts, primarily for the purpose of logical reasoning, encoded in OWL, KIF or similar languages [cf. Patel, Koch et al 2005, ch. 4]. This is, however, a very time-consuming, expensive and intellectually demanding task.
We need to find out whether there is, in the foreseeable future, a need for ontologies and logical reasoning on data and data holdings in the discipline and who is carrying out and funding such activities.

It is not obvious that an ontology is needed to support data discovery and retrieval in Crystallography data repositories. For these purposes, improved terminology creation, development and usage, e.g. turning a vocabulary into a real controlled terminology or creating hierarchies in the form of a classification system or thesaurus, should be sufficient. There is no agreement about whether and when such a controlled vocabulary system turns into an ontology. The author of this report supports a view of Knowledge Organisation Systems (KOS) as a continuum of increasing levels of control and precision. So, we will have to specify what effort to apply and for what reasons.
For "normal" browsing and searching, a (maybe existing) classification system would even be preferable. Again, such a system is hard to create for already existing terminologies, by people not involved in the creation and maintenance of the terms and it is even harder to maintain. Normally, the advice is to leave this job to institutions with a long-term engagement in the field, with broad knowledge and financial commitment.

6. Data and text mining applications

It is quite probable that some kind of text and data mining needs to be applied to eBank, to improve the repository, the aggregator, subject access and terminology use. There are several different possible purposes for applying mining, and the specific techniques needed and the relevant approaches will differ accordingly.

We will need text mining in order to identify, link, search, browse, index and extract named entities, their substructures and components, such as author and institutional names and compounds and their components, via identifying InChIs, IUPAC Chemical Names and chemical formulae incl. their parts from textual materials and publications. Probably equally important will be the extraction of relevant topical terms and phrases for the same purposes already mentioned. Apart from the eCrystals repository, other collections such as corpora of text and data and reference works will be needed. Approaches such as co-reference resolution will probably be relevant in our discipline as well, to identify variants of names for the same object. We need to find out about further domain-specific problems in this context.

Results can be used

a) to support submitters' keyword assignment and named entity indexing;
b) to allow and improve certain automatic indexing of the repository content;
c) to support more advanced searching and browsing of the repository;
d) to allow metadata validation and enhancement operations in the repository;
e) to perform b, c and d on the content of other repositories, especially collections of related publications;
f) to allow cross-searching, cross-browsing and linking between several data and text publication repositories;
g) to identify potential candidate terms for the creation of a suitable and topical domain terminology (to improve, update or replace the present eBank keywords) and to contribute to the building of domain-specific authority files.

Such authority files can subsequently be used in mining and indexing operations again.
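
Purpose g) can be illustrated with a deliberately simple sketch: candidate terms are taken to be the words and two-word phrases that occur in at least two documents. Document frequency is used here as a stand-in for the more refined statistical or rule-based techniques a real project would choose; the stopword list, the function name candidate_terms and the sample titles are all invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}


def candidate_terms(documents, min_docs=2):
    """Return words and two-word phrases occurring in at least min_docs documents."""
    doc_freq = Counter()
    for doc in documents:
        # Lower-case words, keeping internal hyphens ("single-crystal").
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", doc.lower())
        seen = set()
        for i, tok in enumerate(tokens):
            if tok in STOPWORDS:
                continue
            seen.add(tok)
            # Add a two-word phrase when the next token is not a stopword.
            if i + 1 < len(tokens) and tokens[i + 1] not in STOPWORDS:
                seen.add(tok + " " + tokens[i + 1])
        # Count each term once per document.
        doc_freq.update(seen)
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


# Invented titles, not taken from eCrystals.
titles = [
    "Hydrogen bonding in the crystal structure of a copper complex",
    "Crystal structure and hydrogen bonding of an organic salt",
    "Powder diffraction study of a copper complex",
]
for term in candidate_terms(titles):
    print(term)
```

A list produced this way is only raw material: domain experts would still have to select, normalise and relate the candidates before they enter a terminology or an authority file.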

Using data mining, we will need to look at the same features plus additional ones which might be needed for knowledge extraction and hypothesis creation, data selection and comparison, correlations, calculations etc. Open, large-scale computational access to scholarly literature and data, incl. unreleased and non-public information, as recently discussed by Clifford Lynch [Lynch], is crucial for accelerating the rate of scientific discovery and the growth of scholarship, for extracting new knowledge, discovering correlations and creating hypotheses. Lynch points out that Chemistry is in a favorable position to explore such options early and successfully, and makes the role of terminologies crystal clear: "Areas such as biomedicine or chemistry, where much of the literature is relatively well-structured and where a base of investment in the ontologies, specialized vocabularies and vocabulary mappings and similar tools has been extensive, would likely be fertile ground for early advances." Lynch also points out the important role of markup in facilitating computational processing (and mentions Murray-Rust).

Thorough investigations with suitable partners will be needed to identify useful approaches and techniques and to specify the level and character of artificial intelligence algorithms needed for the tasks (probably a combination of rule-based and statistical analysis). We will also need to find out to what degree the data and values of the discipline are really suited to mining and knowledge extraction.

There is quite a lot of software and tools available, some of it open source. The GATE project [GATE] could e.g. help with automatic tagging of personal names (it is now part of the Greenstone DL software as well). The US project NORA and the University of Illinois at Urbana-Champaign use a tool for rapid, flexible mining and machine learning incl. visualization, called D2K Data to Knowledge [D2K], which might be useful.
The Perseus Project at Tufts University has specialised for a considerable time in named entity recognition/mining [Perseus]. They are planning an open source release and offering a service. Similarly experienced partners would be CiteSeer and NSDL/INFOMINE or OCLC.
Other potential UK partners could be the University of Sheffield, which plans to use text mining on digital 18th Cent. materials [Sheffield], and the National Centre for Text Mining [National Centre].

As usual, cost/benefit and usage studies have to reveal whether mining is worthwhile.

7. Discovery: Searching, browsing, linking

The main purpose of subject access/indexing and the use of terminologies is not to provide repository functionality or data curation/preservation but to improve discovery of data via features such as searching, browsing and linking. Already at this early stage, discovery works differently in each of the few services where eBank records are made available, e.g. the repository, the aggregator, OAIster or Google. Other search engines, OAI aggregators and literature databases will follow soon, and a certain consistency and correctness needs to be accomplished.

At this time, the only linking available is in the eBank aggregator where a few example links from selected bibliographic records of articles published in IUCr journals are manually created to corresponding records for crystal structure reports in the eCrystals repository. (Note reg. the text in this section: the aggregator has been changed since this report was written)

Browsing is on offer in the repository only, but so far limited to browsing by year, author and compound class. Browsing by subject, i.e. keywords, is not available yet. There is no browsing option for eBank records available at the eBank aggregator or other services "down the line", such as OAIster or Google.

Incidentally, browsing by author reveals that there is no control of person names at this time: identical authors sometimes appear with two to three different name variants. To correct the search result, a certain indexing principle has been applied, and this requires a given query syntax (on which the interface gives advice). Other services that have harvested the repository, however, apply different indexing and query principles and consequently produce wrong results. This can be seen e.g. in the aggregator and in OAIster, where the variants give highly different results.
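
To make the name variant problem concrete, here is a minimal sketch that groups variants under a crude key of surname plus first initial. The key, the function names and the sample variants are assumptions for illustration and do not describe the indexing principle actually used in the repository. The heuristic assumes non-empty names with single-word surnames, and it would wrongly merge two different people who share a surname and first initial, which is why an authority file with human review is still needed.

```python
import re
from collections import defaultdict


def name_key(name):
    """Reduce a personal name to (surname, first initial), both lower-case."""
    name = name.strip()
    if "," in name:
        # "Surname, Given names" order.
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # "Given names Surname" order; assumes a single-word surname.
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    # First alphabetic character of the given names, if there is one.
    first_letter = re.search(r"[^\W\d_]", given)
    initial = first_letter.group(0).lower() if first_letter else ""
    return surname.lower(), initial


def group_variants(names):
    """Map each key to the list of name strings that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)


# Invented variants for illustration.
variants = ["Coles, S.J.", "Coles, Simon J.", "S. J. Coles", "Day, M."]
for key, names in group_variants(variants).items():
    print(key, names)
```

A harvester that applied the same key on its side would at least bring the variants together in one result set, whatever query syntax its interface uses.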

In the public version of the repository (eCrystals), a "Subject" element cannot be searched (yet), and eBank keywords are not searchable (yet). Among the elements carrying subject information, Advanced search only allows searching the Title element (IUPAC names). Simple search, however, addresses both the Title element and an Abstract element (which is not displayed) and a few others combined. (The eBank aggregator, although consisting primarily of the same records as the repository, applies quite different options, e.g. titles are not searchable, but all of the subject elements are.)

Regarding the Title element (IUPAC Name): search arguments such as integers etc. are omitted; search results are not correct if the full title is used; certain string searching, with literals, works (in Advanced search). Different search systems index such complicated strings differently and apply different default search principles (e.g. complete string only, certain or all substrings, from the beginning of the string with right truncation), which leads to incorrect results.
Guidance for searching the title and the combined elements is missing (it exists reg. author and date elements, though); in the aggregator, there is no such support yet. None of the services offers a browsable dictionary lookup for any element, e.g. author names, apart from the element Compound Class. Initially, there was a decision not to develop "advanced features" in the aggregator due to limited resources. The features mentioned here need, however, to be in place in any decent repository and aggregator; otherwise the data on offer cannot be reasonably well explored.
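
The effect of differing indexing principles on IUPAC-style titles can be shown with a small sketch. It compares three assumed strategies: a word index that drops digits and punctuation, a complete-string match, and a match from the beginning of the string with right truncation. The strategy names, the search function and the two sample titles are illustrative only and do not reproduce the behaviour of eCrystals, the aggregator or OAIster.

```python
import re


def words(text):
    """Alphabetic words only; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def search(titles, query, strategy):
    """Return the titles matching query under the given indexing strategy."""
    q = query.lower()
    if strategy == "words":
        wanted = set(words(q))
        if not wanted:
            return []
        # Every query word must occur among the title's words.
        return [t for t in titles if wanted <= set(words(t))]
    if strategy == "exact":
        return [t for t in titles if t.lower() == q]
    if strategy == "prefix":
        return [t for t in titles if t.lower().startswith(q)]
    raise ValueError("unknown strategy: " + strategy)


# Two compound names that differ only in a locant.
titles = ["2-(4-chlorophenyl)ethanol", "1-(4-chlorophenyl)ethanol"]

for strategy in ("words", "exact", "prefix"):
    print(strategy, search(titles, "1-(4-chlorophenyl)ethanol", strategy))
```

With the full title as query, the word index returns both compounds because the locants have been discarded, while the other two strategies return only the intended one. A query for "chlorophenyl" alone behaves the other way round: the word index finds both titles and the two string-based strategies find none.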

Named entities such as an InChI could theoretically result in precise hits of documents containing these entities in search engines like Google, provided they tolerate such complex and specifically formatted strings for searching [cf. Coles et al 2005]. Another requirement is that different authors or systems use identical values for identical structures or at least are consistent for a substructure or a partial string. The eCrystals repository did for a short while feature such a click-through from the InChIs listed on the data holding HTML pages to a Google search, but this seems not to be supported by Google anymore.

It is not meaningful to do this evaluation in any more detail at this point. Further development has started, but needs to be intensified. Searching this kind of material is complicated and the present state does not offer a good demonstrator. The same is true for the eBank aggregator.

Here are a few further issues which need to be investigated and dealt with in the near future:

Regarding the content to be searched, we need to identify what needs to be searchable and how to accomplish it. Is it useful to be able to search (and browse) substructures in IUPAC names, components of InChIs and chemical formulas? How can we decompose these codes for machine indexing and searching? Do we need to add graphical search (input of substructures in graphical form) and how can we accomplish that, what existing tools could be used?
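
One possible starting point for the decomposition question is sketched below: an InChI is split at its slashes into version, formula and prefixed layers, and a simple formula is broken into element counts that could be indexed separately. The function names are invented, and the sketch makes simplifying assumptions: the second segment is taken to be the formula, and multi-component formulas (with "." separators or leading multipliers) are not handled.

```python
import re


def split_inchi(inchi):
    """Split an InChI string into (version, formula, layers)."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    version = parts[0]
    # Assumption: the segment after the version is the chemical formula.
    formula = parts[1] if len(parts) > 1 else ""
    # Later segments begin with a one-letter prefix such as "c" or "h".
    # A prefix may occur more than once, so values are kept in lists.
    layers = {}
    for part in parts[2:]:
        if part:
            layers.setdefault(part[0], []).append(part[1:])
    return version, formula, layers


def element_counts(formula):
    """Count atoms per element in a simple formula such as 'C2H6O'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


version, formula, layers = split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
print(version, formula, layers)
print(element_counts(formula))
```

Indexing the formula and the element counts as separate fields would allow queries such as "all structures containing oxygen" without any understanding of the connectivity layer; true substructure searching needs dedicated chemistry tools and is outside the scope of this sketch.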

What needs to be searchable inside the dataholding and its files as opposed to metadata or summaries such as in the HTML pages/crystal structure reports?

What meaningful browsing has to be added, e.g. the list of keywords, structured vocabularies, compound and structure components, names etc.?

Because of the indirect relationship between keywords and the content in data files (as described in chapter 2), it is especially important to expand the keyword list to cover the core of Crystallography research and, thus, to make better links to what the data really cover.
In our case, especially, it is absolutely necessary to make the list of keywords readily available in the search interface for searching and browsing, even hierarchical browsing, which requires providing some semantic structure for the list, relationships between concepts and terms etc. Basically the same features are needed to support submitters and service staff when assigning keywords.
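
What "some semantic structure" could mean in the simplest case is sketched here: a mapping from each term to its narrower terms, used both for displaying a browsable hierarchy and for expanding a search to everything below a chosen term. The vocabulary is a hypothetical example and not the eBank keyword list; the sketch also assumes that the hierarchy contains no cycles.

```python
# Hypothetical mini-vocabulary: term -> list of narrower terms.
NARROWER = {
    "diffraction": ["X-ray diffraction", "neutron diffraction"],
    "X-ray diffraction": [
        "single-crystal X-ray diffraction",
        "powder X-ray diffraction",
    ],
}


def expand(term, narrower=NARROWER):
    """Return term plus all terms below it in the hierarchy, depth first."""
    result = [term]
    for child in narrower.get(term, []):
        result.extend(expand(child, narrower))
    return result


def print_tree(term, narrower=NARROWER, depth=0):
    """Print the hierarchy below term, indented by level, for browsing."""
    print("  " * depth + term)
    for child in narrower.get(term, []):
        print_tree(child, narrower, depth + 1)


print_tree("diffraction")
print(expand("X-ray diffraction"))
```

The same structure can support submitters: when a narrow term is assigned, the interface can show its broader terms for confirmation, and a search on a broad term can optionally include all narrower ones.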

Discovery and user interface features need to be adapted to different user groups of the services: to experts and Crystallography researchers, to colleagues from related disciplines (Chemistry, Mineralogy, Pharmacology, Biotechnology, Materials Science etc.), to students, to general users etc.

Suitable levels of granularity need to be specified for searching and browsing, adapted to the data available and different use cases.
Regarding the terminologies used, we need to study their quality and usability, how they are understood and applied by both authors/metadata creators and different kinds of users who search the services.

The services need to be developed in a way that meaningful and correct cross-searching and browsing with other related repositories and literature databases can be accomplished. This includes making the repository content available to search protocols other than OAI-PMH and Z39.50, to the degree required for such purposes.

8. Conclusions

The issues elaborated on in this report need to be discussed in the eBank project, both related to future project and funding applications and to the repository and services which will be made available independent of any projects: what do we need and want to do and how to prioritise. Several of the issues will be relevant to repositories and services in other disciplines as well.

Improvement of metadata creation support, of subject access and user interface features, and the development of terminologies enabling improved interoperability, as well as active participation in the development of standards in the narrower and wider discipline, seem imperative.

If this project and the eCrystals repository initiative want to follow leading developments in the e-Science and Digital Library field and create more value out of the research efforts, most of the other suggestions would need to be addressed in the rather near future, especially knowledge extraction, text and data mining, integration between markup, mining, indexing and retrieval, integration/linking between literature, reference information and data, large-scale computational access etc.

9. References

Brown, I.D. and McMahon, B. (2002). CIF: The computer language of Crystallography. In: Acta Cryst., 2002, B58, pp.317-324

CSD Cambridge Structural Database http://www.ccdc.cam.ac.uk/products/csd/

Coles, S.J., Day, N.E., Murray-Rust, P., Rzepa, H.S. and Zhang, Y. (2005). Enhancement of the chemical semantic web through the use of InChI identifiers. In: Org. Biomol. Chem., 2005:10, pp.1832-1834.

Coles, S. J. et al. (2006). An e-Science environment for service crystallography - from submission to dissemination. In: Journal of Chemical Information and Modeling, Special Issue on eScience, Forthcoming 2006. pp. 17-18
http://www.ukoln.ac.uk/projects/ebank-uk/dissemination/JCIMrevised2-Coles.doc

Compendium of Chemical Terminology, IUPAC Recommendations (2nd Edn.), Oxford: Blackwell Scientific Publications. (1997)
IUPAC nomenclature. In: Wikipedia http://en.wikipedia.org/wiki/IUPAC_nomenclature
IUPAC Compendium of Chemical Terminology, Online version of the "Gold Book", from IUPAC: http://www.iupac.org/publications/compendium/index.html
Online version from RSC allows free text searching:
http://www.chemsoc.org/cgi-shell/empower.exe?DB=goldbook

D2K http://alg.ncsa.uiuc.edu/do/tools/d2k

Day, M. (2004). EBank UK project scenarios and user requirements.
http://www.ukoln.ac.uk/projects/ebank-uk/requirements/scenarios.html

eBank UK aggregator demo http://eprints-uk.rdn.ac.uk/ebank-demo/

eBank UK keywords http://ecrystals.chem.soton.ac.uk/key_A.html

eBank UK namespace http://www.ukoln.ac.uk/projects/ebank-uk/schemas/terms/

eCrystals repository http://ecrystals.chem.soton.ac.uk/

GATE project http://gate.ac.uk/

IUCr Crystallographic Information Framework
http://www.iucr.org/iucr-top/cif/

IUPAC InChI site http://www.iupac.org/inchi/

Lawrence, Bryan (2006). Distributed data, distributed governance, distributed vocabularies: The NERC Data Grid. Presentation at: Semantic Workshop, Edinburgh, June 2006. http://home.badc.rl.ac.uk/lawrence/lawrence/presentations/2006/06/08/NDG_SemanticEdinburgh06.ppt

Lynch, Clifford (2006). Open Computation: Beyond Human-Reader-Centric Views of Scholarly Literatures.
http://www.cni.org/staff/clifford_publications.html

National Centre for Text Mining, UK. http://www.nactem.ac.uk/

NORA project http://www.noraproject.org

OAIster http://oaister.umdl.umich.edu/

OpenSource Site for CML http://cml.sourceforge.net/
P. Murray-Rust and H. S. Rzepa et al., Chemical Markup, XML, and the Worldwide Web. 1-6. In: J. Chem. Inf. Comput. Sci., 1999-2006
espec.:
P. Murray-Rust and H. S. Rzepa, Chemical Markup, XML and the World-Wide Web. 4. CML Schema, J. Chem. Inf. Comput. Sci. 2003, 43, 757-772. http://dx.doi.org/10.1021/ci0256541

Patel, Manjula, Koch, Traugott, Doerr, Martin and Tsinaraki, Chrisa (2005). Semantic Interoperability in Digital Library Systems.
DELOS Network of Excellence on Digital Libraries, European Union, Sixth Framework Programme. Deliverable D5.3.1 Febr./June 2005. 70 pp.
http://delos-wp5.ukoln.ac.uk/project-outcomes/SI-in-DLs/

Perseus project http://www.perseus.tufts.edu/

Sheffield http://www.hrionline.ac.uk/armadillo/sources.html

Simpson, Pauline (Jan 2003). Subject categorization.
http://tardis.eprints.org/discussion/

Stein, Stephen E., Heller, Stephen R. and Tchekhovskoi, Dmitrii (2003). An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier, in Proceedings of the 2003 International Chemical Information Conference (Nimes), Infonortics, pp. 131-143.

World Directory of Crystallographers term list
http://wdc.iucr.org/user/help/scientific_research_interests.html

Content by: Traugott Koch of UKOLN.
Created: 2005-05-09
This version last modified: 2006-11-23
Email comments to: t.koch@ukoln.ac.uk
URL: http://www.ukoln.ac.uk/projects/ebank-uk/dissemination/termino-public.html