eBank UK Feasibility Report on Data Set Description and Schema to access the results of experiments in crystallography

Produced by the eBank UK project team
Workpackage Number: 6
Workpackage Title: Supporting studies, study 2
21 October 2004

The eBank UK project has produced a prototype demonstrator of a service based on Eprints software to provide access to the detailed results of scientific experiments in chemistry, and in particular crystallography. To present these complex data in a retrievable and meaningful way, they must be described through metadata and appropriate metadata schemas that allow the information to be harvested and re-published by other services through alternative interfaces. The challenge faced by the project is the complexity and volume of data that are to be made accessible from the principal points in the network dissemination chain - institutional archives, aggregators, service providers, portals, and prospectively other data providers such as publishers and digital libraries. In both respects, the design of the metadata containers and schema is critical, and is perhaps the key contribution of the first phase of the project (to September 2004). The report describes the metadata and schema adopted during the initial phase of the project, and shows how the records so described are presented in the demonstrators. The advantages and limitations of the approach are briefly evaluated with a view to generalising the schema for the presentation of experimental data from other science disciplines through other service providers, work that will be followed up during phase 2 of the project.

A journal publication describing the results of scientific work is typically a distillation of experimental data. This description is aimed at a wider audience than the immediate peers of the authors, so placing the work in its primary context and reducing the data to the most significant results is critical in making the work more widely known. Those immediate peers, however, may require access to more of the original data produced in the work, to verify reproducibility or to build on those data, for example.

Modern science can produce large volumes of data as computational tools enable experiments to be performed more frequently and more efficiently. In crystallography in the 1960s a PhD student might have investigated three or so structures. Now this number can be analysed in a single morning, yet the publishing protocols for reporting this work are essentially unchanged.

For as long as publication has been detached from the means of production and the format of the data, managing and providing access to full experimental data has not been simple. Journals, especially those based on print formats, do not have the space for such data. In crystallography just 300,000 crystal structures are documented in database archives, against an estimated 1.5 million known structures: roughly 20% of the data generated in crystallographic work reaches the public domain, owing to publication bottlenecks. Furthermore, more than 24 million chemical compounds are known. As high-throughput technologies, automation and e-science become embedded in chemical and crystallographic working routines, the publication bottleneck can only become more severe.

The task is now assisted by the emergence of electronic networks. Experimental data are produced electronically, and so are immediately amenable to network distribution. What needs to be done is to mark up the data so that they can be discovered and made available to both machine and human readers. This is the process of creating metadata. While the Internet and the World Wide Web offer standard protocols for distribution, now being supplemented for the type of scientific data sources described here by e-science and grid technologies, particular subjects require specialised metadata and means of discovery.

Dublin Core (DC) is a metadata standard that has emerged to describe the 'core', or essential, elements of a bibliographic record, say of an item that might be found in an academic library. A mechanism designed to improve discovery of such records is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). If the library is considered to be the example data provider, the OAI-PMH allows independent data services to 'harvest' the DC records into a database and enables these records to be searched alongside records from other selected data providers. Cross-searching techniques, by contrast, generally send specific search requests in parallel to different sources (by some specified protocol) and combine the various responses into a single result. Search services built on harvested metadata instead carry out local searches on the pre-harvested metadata. DC and OAI-PMH together provide a minimum level of interoperability between data providers and diverse service providers.
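The harvesting model described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the eBank implementation: the base URL is a placeholder, the sample response is hand-written and heavily abbreviated, and the function names are hypothetical. Only the ListRecords verb, the oai_dc metadata prefix and the XML namespaces are taken from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and simple Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url):
    """Build the harvest request a service provider sends to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query

# Hand-written, abbreviated response standing in for a live network call.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure report</dc:title>
          <dc:creator>A. N. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvest(xml_text):
    """Convert a ListRecords response into a list of dictionaries (the local store)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "creator": rec.findtext(".//dc:creator", namespaces=NS),
        })
    return records

def search(records, term):
    """Search the pre-harvested records locally; no request reaches the data provider."""
    term = term.lower()
    return [r for r in records if term in (r["title"] or "").lower()]

# Placeholder base URL; a real harvester would fetch this URL over HTTP.
request_url = list_records_url("http://archive.example.org/oai")
local_store = harvest(SAMPLE_RESPONSE)
hits = search(local_store, "crystal")
```

The point of the design is that the search function touches only the local store. A cross-search service would instead forward each query to every source at search time and merge the responses.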

Basic DC does not describe specialised subject terms, but can be extended by means of 'qualified' DC. In this case a schema is devised to describe the extended terms. If a schema definition, an XML document describing the schema terms, is linked from a DC record, then readers, typically machine readers but also some human readers, can make sense of the extended terminology.

In crystallography there are a number of different ways to describe the subject of the data sets. Experiments revolve around a single molecule which can be thought of as the ‘topic’ of the experiments. There are a number of established ways of identifying molecules, which include internationally recognised methods of specifying their formulas or names. In chemistry a very important identifier used for exchange of chemical information is the IUPAC-NIST Chemical Identifier (INChI). An INChI encodes a lot of chemistry that cannot be expressed logically by any other means. These different vocabularies have been incorporated into the schema through the encoding schemes facility of qualified DC.
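The use of encoding schemes can be illustrated with a short Python sketch that builds a record in which each value carries an xsi:type attribute naming the vocabulary it is drawn from. The sketch rests on several assumptions: the "ex" namespace URI and the scheme names ChemicalFormula and InChI are placeholders rather than the terms defined in the eBank schema, the bare record wrapper element is invented for the example, and water is used purely as a simple molecule whose identifier is easy to check.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Hypothetical namespace for the subject-specific encoding schemes (placeholder URI).
EX = "http://example.org/terms/"

ET.register_namespace("dc", DC)
ET.register_namespace("xsi", XSI)

def add_typed(parent, name, scheme, value):
    """Add a DC element whose xsi:type names the encoding scheme of its value."""
    el = ET.SubElement(parent, "{%s}%s" % (DC, name))
    el.set("{%s}type" % XSI, "ex:" + scheme)
    el.text = value
    return el

record = ET.Element("record")
# Declare the "ex" prefix by hand, because it appears only inside attribute values.
record.set("xmlns:ex", EX)
add_typed(record, "subject", "ChemicalFormula", "H2O")
add_typed(record, "identifier", "InChI", "InChI=1S/H2O/h1H2")

def values_by_scheme(rec):
    """What a machine reader might do: index the values by their encoding scheme."""
    found = {}
    for el in rec:
        scheme = el.get("{%s}type" % XSI)
        if scheme:
            found[scheme.split(":", 1)[1]] = el.text
    return found

xml_text = ET.tostring(record, encoding="unicode")
schemes = values_by_scheme(ET.fromstring(xml_text))
```

A reader that does not recognise a scheme can still treat the value as a plain DC subject or identifier, which is what allows qualified records to degrade gracefully to basic DC.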

This report describes the schema implemented by the eBank UK project to disseminate e-data reports describing crystallographic structures (eCrystallographyDataReports). Schemas are not new; the aim of this report is to justify the design of the schema in terms of the underlying experimental processes and the network of services it is designed to serve. Some of the services envisaged are new and not widely known in the chemistry community; others are better known and offer standards-based services but are not well connected within the network of services; still others do not yet exist in practice, and so are replicated experimentally by the project. The schema is thus pivotal in making connections between current and anticipated network services. The report assesses the feasibility of the initial implementation in terms of its use by the respective services, and makes recommendations for improvements in the next stage of the project.

In the eBank project the actual and prospective service partners include:

Institutional archives based on Eprints software, adapted to manage e-data reports as well as original eprints, author-produced copies of the published papers that are associated with the data reports; in this case the exemplar institutional archive is based at Southampton University (Crystal Structure Report Archive http://ecrystals.chem.soton.ac.uk/)
Aggregator services, such as eBank UK, which focus on harvesting records of e-data reports (http://eprints-uk.rdn.ac.uk/ebank-demo/)
Portals that present services such as search within a broader science context; in this case the exemplar interface is PSIgate (http://www.psigate.ac.uk/ebank/), a subject-based hub within the JISC Resource Discovery Network (RDN)
Crystallographic databases, such as the Cambridge Structural Database (CSD) at the Cambridge Crystallographic Data Centre, the accepted international repository for crystallographic data
Identifier services, e.g. a server that can take a crystallographic record and return an INChI for the item described
Journal publishers, such as the International Union of Crystallography (IUCr); published journal papers, eprints and e-data reports need to be linked using established linking standards such as OpenURL

Objective of this report

This report was produced as a deliverable to JISC, as set out in the original proposal for the eBank project, where the deliverable was described as follows.

This study will consider the description of (multimedia) datasets within the chemistry domain particularly with regard to the perceived hierarchy of data and metadata from raw data up to “published results”. A variety of issues will be addressed, including identifying common attributes of a dataset and relating these to domain-specific characteristics, managing legacy data, metadata created at source by laboratory equipment and the relationship to data curation activities. The Combechem project will be used as a case study and metadata from three sources (e-Lab book, crystallography data and physical chemistry data) will be investigated. Outcomes of the study will be a report and a draft schema for describing chemistry datasets.

The original project proposal described work to be done in the year from September 2003. More recently funding was agreed to extend the project