eBank UK Feasibility Report on Dataset Description and Schema to Access the Results of Experiments in Crystallography

Produced by the eBank UK project team

29 November 2004

The eBank UK project has produced a prototype demonstrator of a service based on EPrints.org software providing access to the detailed results of scientific experiments in crystallography. To present this complex data in a retrievable and meaningful way requires that it is described by metadata using appropriate metadata schemas that support harvesting and re-use by other services through alternative interfaces. The challenge faced by the project is the complexity and volume of data that are to be made accessible from the principal points in the network dissemination chain - institutional archives, aggregators, service providers, portals, and prospectively other data providers such as publishers and digital libraries. The design of the metadata schemas is critical to the success of the demonstrator, and is perhaps the key contribution of the first phase of the project (to September 2004). The report describes the metadata schemas adopted during the initial phase of the project, and shows how the metadata records based on these schemas are presented in the demonstrators. The advantages and limitations of the approach are briefly evaluated with a view to the appropriateness of the schemas for the presentation of experimental data from other science disciplines through other service providers, which will be investigated during phase 2 of the project.

A journal publication describing the results of scientific work is typically a distillation of experimental data. The publication is aimed at a wider audience than the immediate peers of the authors, so placing the work in its primary context and reducing the data to the most significant results is critical in making the work more widely known. Those immediate peers, however, may require access to more of the original data produced in the work, to verify reproducibility or to build on those data, for example.

Modern science can produce large volumes of data as computational tools enable experiments to be performed more frequently and more efficiently. In crystallography in the 1960s a PhD student might have investigated three or so structures. Now this number can be analysed in a single morning, yet the publishing protocols for reporting this work are essentially unchanged.

As long as publication has been detached from the means of production and format of this data, managing and providing access to full experimental data has not been simple. Although some journals have attempted to store data relating to published articles, typically this data is only a partial set of the complete dataset, and many journals, especially those based on print formats, do not have the space for any such data. In crystallography just 300,000 crystal structures are documented in database archives, against an estimated 1.5 million known structures: less than 20% of data generated in crystallographic work is reaching the public domain due to publication bottlenecks. Furthermore, there are in excess of 24 million chemical compounds known. As high-throughput technologies, automation and e-science become embedded in chemical and crystallographic working routines, the publication bottleneck issue can only become more severe.

A potential solution is offered by the emergence of electronic networks. Experimental data are produced electronically, so are immediately amenable to network distribution. What needs to be done is to describe the data, preferably by automatically produced metadata, so that the data can be discovered and made available to both machine and human readers. While the Internet and the World Wide Web offer standard protocols for distribution of documents, now being supplemented for the type of scientific data sources described here by e-science and grid technologies, particular domains require specialised metadata and means of discovery. Dublin Core (DC) is a metadata standard that has emerged to provide the 'core', or essential, elements needed to describe a variety of resources, such as an item that might be found in an academic library. A mechanism designed to improve distribution of such records is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). If the library is considered to be the example data provider, the OAI-PMH allows independent data services to 'harvest' the DC records into a database and enable these records to be searched alongside records from other selected data providers. Cross-searching techniques generally send specific search requests in parallel to different sources (by some specified protocol) and combine the various responses into a result for the cross-search. In contrast, search services built on harvested metadata carry out local searches on the pre-harvested metadata. DC and OAI provide a minimum level of interoperability between data providers and diverse service providers.
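The harvesting model described above can be illustrated with a minimal sketch in Python. It builds an OAI-PMH ListRecords request, parses simple DC records from a response, and runs a local search over the harvested records. The base URL, the sample response and the function names are assumptions made for illustration only; they are not taken from the eBank services.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hand-written sample response (illustrative only, not real eBank data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure report</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract the identifier and DC titles of each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        ident = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles = [t.text for t in rec.iter(f"{{{DC_NS}}}title")]
        records.append({"identifier": ident, "titles": titles})
    return records


def search(index, term):
    """Local search over pre-harvested records (no remote request)."""
    term = term.lower()
    return [r for r in index
            if any(term in (t or "").lower() for t in r["titles"])]


if __name__ == "__main__":
    # Hypothetical base URL; a real harvester would fetch this over HTTP.
    print(list_records_url("http://example.org/oai"))
    harvested = parse_records(SAMPLE)
    print(search(harvested, "crystal"))
```

The point of the sketch is the division of labour: the request goes to the data provider once, while every later search runs against the service provider's local copy of the metadata.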

Basic DC does not include specialised domain-specific terms, but can be extended by means of 'qualified' DC. In this case a schema is devised to describe the extended terms. An XML schema can be drawn up as a template for records using the extended term set, thus facilitating machine-to-machine (m2m) use of records conforming to the extended set. An RDF schema describing the extended term set would enable machine interpretation of property and sub-property relationships, and such a schema also allows tools to relay the semantics of terms to human readers.

Based on an analysis of user requirements the project found that crystallography datasets might be usefully described using a number of properties in addition to 'simple Dublin Core'. Experiments revolve around a single molecule which can be thought of as the ‘topic’ of the experiments. There are a number of established ways of identifying molecules, which include internationally recognised methods of specifying their formulas or names. In chemistry a very important identifier used for exchange of chemical information is the IUPAC-NIST Chemical Identifier (INChI). An INChI encodes a lot of chemistry that cannot be expressed easily by any other means. These different vocabularies have been incorporated into the schema through the encoding schemes facility of qualified DC.
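As a rough illustration of how an encoding scheme can be attached to a qualified DC value, the sketch below writes a DC element whose xsi:type names the vocabulary the value is drawn from. The choice of dc:subject, the namespace URI and the scheme name are assumptions for illustration; the terms actually used by the project are defined in the eBank schema documents. The identifier shown is that of water, chosen only as a simple example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
# Hypothetical namespace standing in for the eBank terms schema.
EBANK_NS = "http://example.org/ebankterms/"

ET.register_namespace("dc", DC_NS)
ET.register_namespace("xsi", XSI_NS)


def value_with_scheme(element_name, value, scheme):
    """Create a DC element whose xsi:type names an encoding scheme."""
    el = ET.Element(f"{{{DC_NS}}}{element_name}")
    # Declare the prefix by hand: ElementTree only declares namespaces
    # that appear in element or attribute names, not in attribute values.
    el.set("xmlns:ebankterms", EBANK_NS)
    el.set(f"{{{XSI_NS}}}type", f"ebankterms:{scheme}")
    el.text = value
    return el


if __name__ == "__main__":
    # Water, used purely as an illustrative molecule.
    subject = value_with_scheme("subject", "InChI=1S/H2O/h1H2", "InChI")
    print(ET.tostring(subject, encoding="unicode"))
```

A harvester that does not recognise the scheme can still treat the value as a plain string, which is what keeps qualified records usable by simple DC services.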

This report describes the schema implemented by the eBank UK project to export metadata describing crystallographic datasets. The design of the metadata schema is based on the characteristics of the experimental data it describes, and the services that will be supported by using the metadata. In the eBank project the actual and prospective service partners include:

Institutional archives based on EPrints.org software, adapted to manage e-data reports as well as original eprints, author-produced copies of the published papers that are associated with the data reports; in this case the exemplar institutional archive is based at Southampton University (Crystal Structure Report Archive http://ecrystals.chem.soton.ac.uk/)
Aggregator services such as eBank UK, that focus on harvesting records of e-data reports (http://eprints-uk.rdn.ac.uk/ebank-demo/)
Portals that present services such as search within a broader science context; in this case the exemplar interface is PSIgate (http://www.psigate.ac.uk/ebank/), a subject-based hub within the JISC Resource Discovery Network (RDN)
Crystallographic databases, such as the Cambridge Structural Database (CSD) at the Cambridge Crystallographic Data Centre, the accepted international repository for crystallographic data
Identifier services, e.g. a server that can take a crystallographic record and return an INChI for the item described
Journal publishers, such as the International Union of Crystallography (IUCr); published journal papers, eprints and e-data reports need to be linked using established linking standards such as OpenURL

The underlying science process: crystallography workflow

A data archive has been created at the University of Southampton, built on Eprints, OAI-compliant software that was designed to manage author deposit of papers. The user interface has been adapted to manage the deposit of datasets produced by crystallographers from the National Crystallography Service (NCS) at Southampton. During the deposit process, metadata about the datasets is entered manually or generated automatically. Since there are inherent relationships between datasets, the metadata is designed to reflect the scheme of the experimental procedure, outlined in Figure 1. In the case of crystallography, datasets are related by sequence since they are generated (by measurement or analysis) from a series of sequential stages in the experimental process.

Figure 1. Generalised workflow for crystallography experiments

Datasets do not need to be stored at a single location such as the Eprints archive at NCS. By using OAI-based DC, interoperability conditions mean that datasets stored at different locations can be accessed by users as though they were from a single 'virtual' archive, depending on the OAI service provider used.

Metadata schema

'Local' institutional archive

The metadata schema resulting from analysis of the workflow (Table 1) captures the files generated during the course of the experiment. Each of the files is stored in the host, or local, archive for access by users mediated by a single e-data report. This report links to the individual data files as well as other relevant sources, such as eprints, and possibly external structure databases, and presents an interactive visualisation of the derived structure (Figure 2).

**Table 1: Representation of the crystallography experiment schema, indicating all the files generated during the course of the experiment**
Stage name	Description of the stage	File	File type	File description	Metadata name	Metadata data type
Initialisation	Mount new sample on diffractometer Parameterisation to set up data collection	.htm i.kcd *_sample.jpg	HTML BINARY JPG	Metadata for crystallography expt Unit cell determination images Image of sample	Morphology Solvent Sample_image	STRING (SET) STRING .JPG
Collection	Collect data	s.kcd _crystal.jpg	BINARY JPG	Diffraction images Image of crystal	Temperature Crystal_image	*INTEGER .JPG
Processing	Process and correct images	.hkl .htm _0KL.jpg _H0L.jpg *_HK0.jpg	ASCII HTML JPG JPG JPG	Derived dataset Report file Synthesised image compiled from .kcd’s Synthesised image compiled from .kcd’s Synthesised image compiled from .kcd’s	Cell_a Cell_b Cell_c Cell_alpha Cell_beta Cell_gamma Crystal_system Completeness	NUMBER NUMBER NUMBER NUMBER NUMBER NUMBER STRING (SET) INTEGER (%)
Solution	Solve structure	.prp xs.lst	ASCII ASCII	Log of symmetry determination process Solution log file	Space_group Figure_of_merit	STRING (SET) NUMBER
Refinement	Refine structure	xl.lst .res	ASCII ASCII	Final refinement listing Output coordinates	R1_obs wR2_obs R1_all wR2_all	NUMBER NUMBER NUMBER NUMBER
CIF	Produce CIF	.cif _checkcif.htm .cml	ASCII HTML CML	Final results Automatic validation results Final results (with chemical content)	Formula_moiety	*STRING
Report	Generate e-Data report	.html	HTML	Publication format (HTML/XHTML)	eDataReport_type Authors Affiliations Formula_empirical Compound_name CCDC_Code Compound_class Keywords Available_data Related_publications	CRYSTAL STRUCTURE STRING STRING STRING STRING STRING STRING (SET) STRING (SET) *STRING (SET) STRING
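The sequential relationship between stages in Table 1 can be sketched as a simple ordered data structure. The stage names, descriptions and metadata names below are copied from the table; the class and function names are hypothetical and do not reflect how the archive software stores this information.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stage:
    name: str
    description: str
    metadata: List[str] = field(default_factory=list)


# Stages in experimental order, as listed in Table 1.
WORKFLOW = [
    Stage("Initialisation", "Mount new sample on diffractometer",
          ["Morphology", "Solvent", "Sample_image"]),
    Stage("Collection", "Collect data", ["Temperature", "Crystal_image"]),
    Stage("Processing", "Process and correct images",
          ["Cell_a", "Cell_b", "Cell_c", "Cell_alpha", "Cell_beta",
           "Cell_gamma", "Crystal_system", "Completeness"]),
    Stage("Solution", "Solve structure", ["Space_group", "Figure_of_merit"]),
    Stage("Refinement", "Refine structure",
          ["R1_obs", "wR2_obs", "R1_all", "wR2_all"]),
    Stage("CIF", "Produce CIF", ["Formula_moiety"]),
    Stage("Report", "Generate e-Data report",
          ["eDataReport_type", "Authors", "Affiliations",
           "Formula_empirical", "Compound_name", "CCDC_Code",
           "Compound_class", "Keywords", "Available_data",
           "Related_publications"]),
]


def stage_for(metadata_name: str) -> Optional[Stage]:
    """Return the stage at which a metadata item is recorded, if any."""
    for stage in WORKFLOW:
        if metadata_name in stage.metadata:
            return stage
    return None


def preceding_stages(stage_name: str) -> List[str]:
    """Names of the stages whose output a given stage depends on."""
    names = [s.name for s in WORKFLOW]
    return names[:names.index(stage_name)]


if __name__ == "__main__":
    print(stage_for("Space_group").name)
    print(preceding_stages("Processing"))
```

Keeping the stages in a list makes the 'related by sequence' property explicit: the position of a stage determines which earlier datasets it was derived from.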

Figure 2. eCrystallographyDataReport shown to a user (partial view) via the adapted Eprints.org archive interface (Note. The crystal diagram is presented using a Java applet and can be manipulated interactively)

Distribution to aggregator services and portals

To enhance the visibility of archived datasets, the exposed metadata can be harvested by independent service providers such as aggregators and portals. In this project there are two demonstrator services: the eBank UK aggregator service at Bath University, in effect a specialist aggregator and search service for e-data reports; and PSIgate, the physical sciences hub of the JISC Resource Discovery Network, which offers search results in a broader science context.

Only the metadata need to be harvested, rather than the full datasets, as the reports link to the constituent data files in the original archive. For this purpose the e-data report is represented by a DC schema designed for dissemination via an OAI interface. Table 2 shows the schema elements presented to the OAI interface for the exchange of eBank data between data provider and service provider. Explanations of the elements and how they map to user requirements are given in the Appendix (part 1).

**Table 2: Schema elements for eCrystallographyDataReports presented to the OAI interface within a data archive (data provider)**
Data Name	Data Description	Data Type	XML wrapped content
EPrint_type	'Crystal Structure'	String	Phrase 'Crystal Structure'
Authors	ePrint creator(s)	String	ePrint authors 'Surname, Christian name, initial'
Affiliations	Institution(s) of creator(s)	String	Various authors addresses
Formula_empirical	Total atom count	String	Atom symbols with their total count (can be real number) subscript
Compound_name	IUPAC Chemical name	String	Chemical name with text & integers
CCDC_Code	Cambridge Structural Database identifier	String	6 character code (may become numeric in future)
Compound_class	Chemical category	String (set)	1 word descriptor of chemical category
Available_data	Actual data available for various ePrint stages (Y/N)	Y/N Toggle	Y or N presence of data associated with RAW & RESULTS stages
Related_publications	Other output containing this compound/structure	String	Literature reference link
Publication_date	Date of releasing ePrint to eBank/world	String	Date of public release of ePrint
Last_revised_date	Date ePrint last revised	String	Date of latest modification to ePrint
Keywords	Categories	String (set?)	Phrase describing chemical relevance
Scheme	2D diagram	String	Two dimensional structural diagram as SMILES string
IChI	International Chemical Identifier	String	Unique compound identifier (contains some structural information)
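A service provider receiving such records might run a light consistency check before indexing them. The sketch below uses only the element names and constraints stated in Table 2 (the Y/N toggle and the 6 character CCDC code); the function name and the decision to report problems as a list of strings are assumptions, and the schema documents remain the authoritative definition.

```python
# Element names as listed in Table 2.
TABLE2_FIELDS = {
    "EPrint_type", "Authors", "Affiliations", "Formula_empirical",
    "Compound_name", "CCDC_Code", "Compound_class", "Available_data",
    "Related_publications", "Publication_date", "Last_revised_date",
    "Keywords", "Scheme", "IChI",
}


def check_record(record):
    """Return a list of problems found in a record given as a dict."""
    problems = []
    for name in record:
        if name not in TABLE2_FIELDS:
            problems.append(f"unknown element: {name}")
    available = record.get("Available_data")
    if available is not None and available not in ("Y", "N"):
        problems.append("Available_data must be 'Y' or 'N'")
    code = record.get("CCDC_Code")
    if code is not None and len(code) != 6:
        problems.append("CCDC_Code must be a 6 character code")
    return problems


if __name__ == "__main__":
    # "ABCDEF" is a placeholder, not a real CCDC code.
    example = {"EPrint_type": "Crystal Structure",
               "Available_data": "Y",
               "CCDC_Code": "ABCDEF"}
    print(check_record(example))
```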

The search interface presented by the eBank UK demo is shown in Figure 3a. A similar search interface offered by PSIgate is shown in Figure 3b. The PSIgate search uses an RDN-include type mechanism: search requests run scripts on the eBank UK server. Although a stylesheet is used to reformat the data, the portal has no control over what data are passed across. Service providers can re-present records, such as the one shown in Figure 2, ideally supplemented with additional information, for example links to other relevant sources such as published papers and library holdings, or other information on which the provider holds data.

Figure 3. eBank demo service provider search interfaces: a, presented through the eBank UK; b, from PSIgate

E-data reports are represented as records in an XML format, defined and constrained by the adopted schema. An eCrystallographyDataReport might not be commonly encountered by a digital library OAI harvester, which would need to refer to the eBank schema to understand its contents. An eBank record conforms to the schema described above, although additional 'layers' might be added to provide richer metadata for other service providers. For example, while the current eBank data might be harvested by specialist crystallography services, more general providers of digital library services might require additional information to be able to handle such data.

With increasingly complex digital objects becoming available for harvesting, such as objects with multiple components and multiple metadata components, 'containers' are needed to transport not just the core data but the additional components too. The Metadata Encoding and Transmission Standard (METS) is such a container and provides an XML document format for encoding metadata necessary for both management of digital library objects within a repository and exchange of such objects between repositories. METS recognises that describing digital objects requires an increasingly complex series of metadata descriptions - administrative, structural and technical metadata, for example. Other proposed 'containers' for describing complex objects include MPEG21 Digital Item Declaration Language (DIDL) and content packaging standards from elearning organisations such as the IMS Global Learning Consortium.
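The container idea can be sketched by wrapping a set of DC fields in a skeletal METS document. The element and attribute names used (mets, dmdSec, mdWrap with MDTYPE, xmlData, structMap, div with DMDID) come from the METS schema; the function name and field values are hypothetical, and the result is far simpler than the eBank record given in the Appendix.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
DC_NS = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("mets", METS_NS)
ET.register_namespace("dc", DC_NS)


def wrap_in_mets(dc_fields, dmd_id="DMD1"):
    """Wrap (name, value) DC pairs in a skeletal METS document."""
    mets = ET.Element(f"{{{METS_NS}}}mets")
    # Descriptive metadata section holding the DC record.
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    for name, value in dc_fields:
        ET.SubElement(xml_data, f"{{{DC_NS}}}{name}").text = value
    # Structural map: a single division pointing back at the metadata.
    struct = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    ET.SubElement(struct, f"{{{METS_NS}}}div", {"DMDID": dmd_id})
    return mets


if __name__ == "__main__":
    doc = wrap_in_mets([("title", "Example crystal structure report"),
                        ("type", "Crystal Structure")])
    print(ET.tostring(doc, encoding="unicode"))
```

Further sections for administrative or technical metadata could be added alongside the dmdSec in the same way, which is what makes the container approach attractive for multi-component objects.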

A schematic view of the metadata exchanged in the eBank UK project using OAI-PMH with METS wrapper elements is shown in Figure 4. An example eBank record that includes a METS layer is shown in the Appendix (part 3).

Figure 4. Schematic view of metadata exchanged in eBank project using OAI-PMH

This record contains declarations linking to the eBank XML schema definitions (.xsd). Two .xsd documents have been created for the eBank demo, defining

a DC container element (http://www.rdn.ac.uk/oai/ebank/20040611/ebank_dc.xsd)
eBank terminology, in particular the crystallography terms used (http://www.rdn.ac.uk/oai/ebank/20040611/ebankterms.xsd)

The latter includes the key definitions. The terms used will be replaced by official types from the bodies concerned, e.g. IUPAC, CCDC, when such types become available. Both documents can also be found in the Appendix (part 2).

Assessment

The eBank project was originally funded by JISC for one year from September 2003. In that time eBank has demonstrated how new infrastructure can be built on existing and emerging services to integrate and disseminate new sources of data, in this case research data generated by solving crystallography structures. Although within the design process some consideration has been given to wider applicability, the project acknowledges that the current schemas have been developed within a particular area of chemistry. The next phase of the project intends to explore whether the existing schemas can be applied to other areas of chemistry and beyond. Creating a more generic scientific schema will depend on gaining consensus on a generic scientific data model. To reach such a consensus will require, as was the case with crystallography, an intimate understanding of the underlying experimental processes that are to be represented, and active involvement in the relevant science communities.

Within the limited confines of the eBank infrastructure, it has been shown that data from experiments can be produced and structured for effective dissemination from the data producer to a local archive for storage and then on to aggregator and discovery services. The project has provided a demonstrator of a search interface based on harvested metadata, which can in future be evaluated against user requirements.

The principal strengths of the current eBank approach, as revealed by its application of a dataset description and schema, and future requirements, are listed below.

Strengths

Improved access to scientific data and timely dissemination of data results
Streamlined access to data through a common interface
Re-use of data by other scientists
New ways to assess and validate scientific results
Curation of data built into scientists’ workflow

Future work is required in a number of areas.

Future work

Engage crystallography community to encourage use of 'open data archives'
Apply the data model to other scientific data
Promote standards-based approach to other scientific communities
Work towards a generic data model for scientific datasets, with a consensus on an associated metadata schema
Embed eBank more robustly in the scientific workflow
Link datasets and published articles
Work with publishers to explore mutual benefits and services

Alternative data models also need to be considered. Within the UK research communities the Council for the Central Laboratory of the Research Councils (CCLRC) has developed a data model that attempts to describe the relationship between experiments, investigators, data holdings, datasets, data files, logical and physical locations. This model will be considered further in eBank phase 2.

Future plans include working closely with IUCr and CCDC to integrate the eBank approach into chemistry-related publications so this is the globally accepted route for publishing crystal structures. Initial discussions with chemistry publishers such as the American Chemical Society (ACS) and Taylor and Francis, a learned society and commercial publisher respectively, indicate that the eBank open access OAI-based approach to accessing crystal structures is one solution to the current publication bottleneck problem.

Conclusion

The first phase of the eBank project has been successful in demonstrating proof-of-concept in the linking of research data generated in the discrete area of combinatorial chemistry to derived ePrints and scientific papers. By building services on eData reports more information can be conveyed to users than was previously possible by journal papers alone. Increasingly such services will be seen as part of the enhanced scholarly communication that network technologies make possible. eBank can be seen as one of the emerging network services based on grid and e-science technologies, gaining also from the growth of open access based on institutional archives. Through the combination of e-science and open access, services such as eBank can deliver the kind of data volumes now required, together with the means for search, discovery and access. The eBank project has shown that a framework built on standard metadata components makes it feasible for different services in the scholarly communication chain to cooperate. The initial implementation is focussed on crystallography, but more generic applications are anticipated.

eBank is not just about chemistry, or even crystallography, although these disciplines provide a very good exemplar. It is about how to structure e-data reports and how to use this structure to make these data accessible from the principal points in the network dissemination chain - institutional archives, aggregators, service providers, portals and, prospectively, other data providers such as CCDC, publishers and digital libraries.

Acknowledgements

The following people have contributed to the work described in this report:
(University of Southampton) Leslie Carr, Simon Coles, Jeremy Frey, Christopher Gutteridge, Steve Hitchcock, Mike Hursthouse; (UKOLN, University of Bath) Michael Day, Monica Duke, Rachel Heery, Liz Lyon, Andy Powell; (PSIgate, University of Manchester) John Blunden-Ellis, Paul Meehan

Appendix

Elements in the eBank Schema
by Monica Duke and Rachel Heery

This document defines the schema for the exchange of eBank data between data provider and service provider in the eBank UK project, with explanations for the elements and mapping to user requirements.

At the time of writing this report had reached Version 3 (modified 21st September 2004)
Download the (Word) document from: http://www.ukoln.ac.uk/projects/ebank-uk/schemas/ebank-schema.doc