EBank UK project scenarios and user requirements

Michael Day
UKOLN, University of Bath

Version 1.0, 14 December 2004

1. Introduction

The eBank UK project has been funded by the Joint Information Systems Committee (JISC) (http://www.jisc.ac.uk/) as part of its Semantic Grid and Autonomic Computing Programme. The project is led by UKOLN at the University of Bath in partnership with the Universities of Southampton and Manchester. The project is investigating the role of aggregator services in linking metadata describing scientific papers to datasets made available from repositories (Heery, et al., 2004; Lyon, et al., 2004). The primary subject focus of the project is chemistry and eBank is working with the University of Southampton project Combechem, which is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) (http://www.epsrc.ac.uk/) as an e-science testbed (Frey, et al., 2003; http://www.combechem.org/).

The eBank project is advocating a 'publication at source' philosophy based on open access principles. It is based on the idea that research teams could routinely deposit datasets in institutional repositories as part of the data creation and processing workflow. This would facilitate the subsequent linking of datasets to peer-reviewed papers or the results data published in specialised databases. Storing multiple states of datasets in institutional repositories would also enable users to find, retrieve and reanalyse data produced at earlier stages of the experimental workflow, helping to document the exact processes carried out as part of these and providing information on the provenance of data, e.g. for providing links back to records in laboratory notebooks or information on the preparation of materials.

For the development of a demonstrator system, the eBank project decided to focus on the sub-discipline of crystallography, as this has a well-defined data creation workflow and a tradition of sharing results data in an internationally accepted standard, the Crystallographic Information File (CIF) adopted by the International Union of Crystallography (Hall, Allen & Brown, 1991; Brown & McMahon, 2002; http://www.iucr.org/cif/). In addition, secondary services like the Cambridge Structural Database (CSD) provide facilities for the acquisition, storage, validation, retrieval, analysis and visualisation of small-molecule crystal structures, again mostly available in CIF format (Allen, 2002; http://www.ccdc.cam.ac.uk/). Many crystallographic journals encourage (or require) the submission of structures in CIF format and the CSD acts as an official data depository on behalf of a number of these.

The architecture of the eBank demonstrator is based on metadata harvesting and uses the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (http://www.openarchives.org/). This protocol is based on defining the interactions between two classes of service. OAI 'data providers' act as repositories of resources and make selected metadata about those resources available so that one or more 'service providers' can harvest them. In turn, service providers can harvest metadata from multiple data providers and provide a single interface that aggregates records into a single virtual collection.
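The interaction between the two classes of service can be illustrated with a minimal sketch of what a service provider might do: build a ListRecords request and extract a few Dublin Core fields from the response. The repository address, the sample response and the function names are hypothetical and are used only for illustration; this is not the eBank UK implementation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and simple Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a service provider sends to harvest records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hypothetical response from a data provider (not real eBank data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:62</identifier>
        <datestamp>2004-11-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure report</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return identifier, title and creators for each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        creators = [c.text for c in record.iter(f"{{{DC_NS}}}creator")]
        records.append({"identifier": identifier, "title": title, "creators": creators})
    return records


if __name__ == "__main__":
    print(list_records_url("http://repository.example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would also need to handle resumption tokens, deleted records and error responses, all of which are omitted here.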

For the eBank UK demonstrator, a test 'data provider' repository of crystallography datasets was created at the University of Southampton and populated with test metadata (http://ecrystals.chem.soton.ac.uk/). This "Crystal Structure Report Archive" is physically (and conceptually) distinct from the eBank aggregator service and has its own user requirements, e.g. for the deposit and validation of data at relevant stages of the data creation and processing workflow, for searching and access control, etc. These requirements will not be discussed in this paper. The crystal structure report repository uses a specially enhanced version of the EPrints.org software developed at the University of Southampton (http://software.eprints.org/) and metadata about the datasets can be entered manually or generated automatically. The repository has its own interface (Figure 1) - distinct from that of the eBank UK aggregator service - and users can search the repository using a range of bibliographic and chemical parameters, or browse it by date, creator name or class of compound.

screenshot of prototype search interface

Figure 1: Prototype search interface for the University of Southampton Crystal Structure Report Archive (November 2004)

The remainder of this document will discuss the user requirements for the eBank UK demonstrator and outline some potential user scenarios.

2. Scenarios

Scenarios are stories that describe the possible behaviour of a system from a user perspective, intended to inform the process of designing information systems (Gonçalves, et al., 2004, p. 274). In eBank, technical meetings were used to elaborate and discuss a small number of potential user scenarios.

2.1 Existing scenarios

To better understand what the eBank UK project is proposing, it is useful to look at how crystallographic data is already disseminated through the journal system and third party services provided by organisations like the Cambridge Crystallographic Data Centre (CCDC) (Allen, et al., 1979; Allen, 2002). The exchange of crystal structure data between chemists has been facilitated by the existence of the Crystallographic Information File (CIF), a standard maintained by the International Union of Crystallography (IUCr). An existing scenario might help illustrate the current position.

Dr. Z is reading a printed copy of the paper:

Charmant, J. P. H., Norman, N. C., Orpen, A. G. & Whittell, G. R. (2004). "Carbonyldichlorohydridobis(triphenylphosphine)iridium(III)." Acta Cryst., E60, m162-m163. Available to subscribers from: http://journals.iucr.org/e/issues/2004/02/00/dn6111/index.html

Dr. Z would like to obtain data about this structure for comparison with others produced by her own group. Acta Crystallographica Section E: Structure Reports Online is an electronic-only journal published by IUCr and is intended for the rapid communication of concise structure reports. All papers in this journal are accompanied by the results data in CIF format. In order to obtain the CIF, Dr. Z can use her Web browser to go to the Structure Reports Online database (http://journals.iucr.org/e/) and look for the paper; she can either browse by journal issue or search for relevant terms. From the results page, Dr. Z should be able to find the paper, and - assuming that she or her institution has a valid subscription to the journal - can download the accompanying CIF, as well as any other required information (e.g., three-dimensional views). If Dr. Z needs any additional information, she has the option of corresponding with the lead author at the School of Chemistry at the University of Bristol, whose contact details are provided in both the paper and CIF.

In this scenario, each CIF is linked to a published peer-reviewed paper. Dr. Z can find the data by searching the Structure Reports Online database for the metadata associated with the paper, specifically for author names and terms used in titles, keywords, abstracts and the full text of papers. Results can also be filtered by journal name and date. In addition, the database allows Dr. Z to search for a specific bibliographic reference. The paper also has a unique identifier, a Digital Object Identifier (DOI), so instead of having to search or browse the IUCr journals Web site, Dr. Z could also just enter the relevant identifier (doi:10.1107/S1600536804000479) into a DOI resolver service (e.g., http://dx.doi.org/) to find the paper.
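The identifier-based route can be sketched in a few lines: the function below turns a DOI written in the "doi:" form used above into the address that would be entered into the resolver service. The function name is hypothetical, and the sketch only constructs the URL; it does not contact the resolver.

```python
def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Turn a DOI such as 'doi:10.1107/...' into a resolver URL."""
    doi = doi.strip()
    # Drop the optional 'doi:' label, which is not part of the identifier itself.
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return resolver + doi


if __name__ == "__main__":
    print(doi_to_url("doi:10.1107/S1600536804000479"))
```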

Alternatively, Dr. Z would be able to look for and retrieve the data in CIF format from the Cambridge Structural Database (CSD), which is the official depository for data published in over fifty major chemistry journals, including those published by IUCr (Allen, 2002). The ConQuest interface to CSD would give Dr. Z an additional range of search options based on the chemistry of the structures (Bruno, et al., 2002; http://www.ccdc.cam.ac.uk/products/csd_system/conquest/).

It is perhaps worth noting that the IUCr as a publisher has a very positive attitude to making data available. Other publishers have more restrictive policies, so the scenario outlined here may not apply to the datasets published by other journals.

2.2 eBank UK scenarios

Currently, therefore, there are several different ways for chemists to find and retrieve the data that has been published in crystallographic structure reports. However, it is not always easy to find and retrieve those datasets that have not been formally published in peer-reviewed journals. The increasing use of high-throughput instruments, like those provided by the EPSRC National Crystallography Service (NCS) at the University of Southampton (http://www.soton.ac.uk/~xservice/), means that not all structures are now able to be published in a timely manner (e.g., Allen, 2004; Hursthouse, 2004). There is, therefore, a growing need for new and innovative ways of making crystallographic data available, e.g. through open-access repositories provided by the research groups that produce them. The eBank project set up a demonstration "Crystal Structure Report Archive" (http://ecrystals.chem.soton.ac.uk/) at the University of Southampton to help develop this idea. Another potential problem with the current system is that it is highly geared towards the dissemination of CIFs and the peer-reviewed papers that are based on them. Getting access to datasets produced at earlier stages of the data creation process is currently impossible without the direct co-operation of the research teams that produce them. While, in many cases, this may be desirable, eBank was interested in demonstrating the possibility of linking structure reports to those datasets produced earlier in the data creation workflow (Figure 2).

picture of workflow process in crystallography

Figure 2: Generalised workflow for crystallography experiments

The eBank UK project was not looking to reinvent the exact functionality developed by IUCr journals or the CSD, but to investigate whether the OAI-PMH could be used to support the creation of institutional repositories designed for crystallographic datasets and the third-party aggregator services that could facilitate the retrieval of datasets published in this way.

2.2.1 Scenario 1: Linking from paper to dataset

Professor Y of the University of Poppleton is reading the paper:

Besli, S., et al., (2002). "Structural investigations of phosphorus-nitrogen compounds. 5. Relationships between molecular parameters of 2,2-diphenyl-4,6-cis-oxytetra(ethyleneoxy)-4,6-R2-cyclotriphosphazatrienes (R = Cl, OCH2CF3, OPh, OMe, NHPh, NHBut) and substituent basicity constants."

The paper is linked to a crystal structure report on "2,2-Diphenyl-4,6-cis-oxy(tetraethyleneoxy)-4,6-bis(2,2,2-trifluoroethoxy)cyclotriphosphazatriene." In this case, the original paper was published in Acta Crystallographica Section B (B58, 1067-1073), so a CIF of this structure is available in IUCr's Structure Reports Online database (doi:10.1107/S0108768102018608). In order to obtain this, Prof. Y could just follow the procedure followed by Dr. Z in the above scenario. Instead, Prof. Y has already read the supplementary information on the preparation of the compounds, and wonders whether any of the earlier data files would be available. Knowing that the eBank service would provide access to this information, if available, Prof. Y points her Web browser at the eBank service and uses the search interface to look for the first author's name ("Besli"). The results page lists the paper that she has already consulted and two structure reports, one linked to the relevant paper as a "related dataset." Following the link to the dataset related to the paper (http://ecrystals.chem.soton.ac.uk/archive/00000062/) takes Prof. Y to the Crystal Structure Report Archive at Southampton, which gives further information on the structure, some administrative information, and a three-dimensional image (Figure 3). Linked to this page are the CIF and a list of other data files produced at earlier stages of the experimental workflow. While journals and the CSD only make the final results CIF dataset available, eBank is able to provide Prof. Y with access to ALL of the datasets generated during the course of the experiment. Prof. Y decides which one of these she needs to consult and downloads the relevant file for reanalysis.
Because she has access rights to all publicly available data linked to the eBank system, and because (in this case) the research team that produced the original data have given permission for its free distribution through eBank, the professor can download the relevant files for processing on her own computer. There may be terms and conditions that limit exactly what Prof. Y is able to do with the data, and a means of presenting these (or some authentication system) would need to be available at the time of download.

screenshot of an ecrystal report interface

Figure 3: Crystal Structure Report for: 2,2-Diphenyl-4,6-cis-oxy(tetraethyleneoxy)-4,6-bis(2,2,2-trifluoroethoxy)cyclotriphosphazatriene (November 2004)

Potentially, the eBank aggregator could also provide links to those papers that have reused (or cited) datasets. If Prof. Y used the downloaded data to produce a new structure report and paper, these could be deposited in the University of Poppleton's institutional repository. If both the paper and the structure report provide a citation to the URI used by Southampton's Crystal Structure Report Archive, it might be possible to match the links to provide a link from the new structure report and paper to the old structure. In this case, on searching again for "Besli", the metadata for the structure report discussed above would be provided with "related paper" links to the original article by Besli, et al., and the new paper and structure reports produced by Prof. Y and her team. This is the kind of automatic linking functionality that will be explored by the second phase of the eBank UK project.

2.2.2 Scenario 2: Searching for datasets

Dr. X is looking for some crystal structures produced by a research group based at another UK university. He knows the general type of compound that he is looking for, but does not know the exact chemical formula, the IUPAC name, or the CCDC code for the structures he needs. He elects to search for these structures through the eBank aggregator and connects to the service. He searches for the name of the research group head in the "author" field, limiting his search by the compound class "Organic" and by date to retrieve only the most recent structures submitted to the system. The results provide two lists, one for crystal structure reports, the other for publications. Because Dr. X has specified the compound class, the search retrieves details of only 15 structure reports. He browses the results, checking the formula and IUPAC name fields for information on the chemical makeup of the crystals. He is interested in just five of the structures, so for each of these he follows the link to the locally hosted structure report repository to view the more detailed metadata and to see exactly what data is available for download. If Dr. X wishes to download any of these datasets for reanalysis, he may have to fulfil the authentication requirements of the local system (data provider).
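The kind of search described in this scenario (an author name, restricted by compound class and by date) can be sketched as a filter over harvested metadata records. The record structure, field names and sample data below are hypothetical and do not reflect the actual eBank schema or implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class StructureReport:
    """Hypothetical, simplified structure report metadata."""
    title: str
    authors: list
    compound_class: str
    deposited: date


def search(reports, author, compound_class, since):
    """Match on author substring, exact compound class and deposit date."""
    author = author.lower()
    hits = [
        r for r in reports
        if any(author in a.lower() for a in r.authors)
        and r.compound_class == compound_class
        and r.deposited >= since
    ]
    # Most recently deposited structures first.
    return sorted(hits, key=lambda r: r.deposited, reverse=True)


# Invented sample records, for illustration only.
SAMPLE_REPORTS = [
    StructureReport("Report A", ["Example, A.", "Sample, B."], "Organic", date(2004, 10, 1)),
    StructureReport("Report B", ["Example, A."], "Inorganic", date(2004, 11, 1)),
    StructureReport("Report C", ["Example, A."], "Organic", date(2004, 11, 15)),
    StructureReport("Report D", ["Other, C."], "Organic", date(2004, 11, 20)),
    StructureReport("Report E", ["Example, A."], "Organic", date(2003, 1, 5)),
]

if __name__ == "__main__":
    for report in search(SAMPLE_REPORTS, "example", "Organic", date(2004, 1, 1)):
        print(report.deposited, report.title)
```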

It is perhaps worth noting that for the pilot, eBank was not trying to emulate the more sophisticated search functionality of the CSD or even the IUCr Structure Reports Online database. The latter permits searching on bibliographic-type metadata and the full-text of papers. CSD provides more advanced ways of searching for chemical information. CCDC's free CIF depository request service (http://www.ccdc.cam.ac.uk/) enables retrieval by CCDC deposit codes or the bibliographic citation of a related publication. It would be possible for users to search first the CSD for specific chemical features and then use the information retrieved (e.g., CCDC codes or author names) to search eBank.

3. eBank User Requirements

The eBank UK proof-of-concept demonstrator is based on the broad architecture defined by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). This divides implementers into two main groups: data providers, who maintain one or more repositories that support the protocol as a means of exposing metadata, and service providers, who issue protocol requests to data providers and use the metadata thus harvested as "a basis for building value-added services" (http://www.openarchives.org/documents/FAQ.html). The OAI-PMH mandates that data providers support simple Dublin Core (DC) metadata but allows them to also support any other metadata format, provided it can be encoded as XML. Basic DC is not sufficient to describe crystallographic structures, so eBank has had to produce its own metadata schema to support this (for a discussion of issues, see: http://www.ukoln.ac.uk/projects/ebank-uk/schemas/ebank-reportv1_1.html).
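As an illustration of how a service provider might take advantage of this, the sketch below reads a ListMetadataFormats response and selects a richer format when the data provider offers one, falling back to the mandatory simple Dublin Core format (metadataPrefix "oai_dc") otherwise. The prefix "example_crystal", the schema addresses and the function names are hypothetical; the prefix actually used by eBank is not assumed here.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical ListMetadataFormats response from a data provider.
SAMPLE_FORMATS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>example_crystal</metadataPrefix>
      <schema>http://repository.example.org/schemas/example_crystal.xsd</schema>
      <metadataNamespace>http://repository.example.org/schemas/example_crystal/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def supported_prefixes(xml_text):
    """List the metadataPrefix values a data provider advertises."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.iter(f"{{{OAI_NS}}}metadataPrefix")]


def choose_prefix(prefixes, preferred):
    """Use the richer format if offered, otherwise the mandatory oai_dc."""
    return preferred if preferred in prefixes else "oai_dc"


if __name__ == "__main__":
    prefixes = supported_prefixes(SAMPLE_FORMATS)
    print(prefixes)
    print(choose_prefix(prefixes, "example_crystal"))
```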

For the eBank demonstrator, crystallographic datasets will be made available as structure reports through OAI 'data providers' like the pilot Southampton Crystal Structure Reports Archive, which was developed using the EPrints.org software. User requirements for this service will include the deposit and validation of submissions, as well as various kinds of search and linking functionality. The data provider repositories are, however, not the main focus of this report.

The eBank UK aggregator (service provider) demonstrator is based on an OAI-based architecture that enables the harvesting of metadata from multiple databases of crystallographic data (data providers). For the pilot, metadata about datasets will be harvested from the University of Southampton's Crystal Structure Report Archive. This will be linked with metadata about research papers harvested from elsewhere, and the pilot aggregator will provide an integrated search and browse interface. For the purpose of the project, UKOLN was responsible for developing this demonstrator. The project will also explore the possibility of including versions of this aggregator service in third party services, specifically PSIgate (http://www.psigate.ac.uk/), part of the UK's Resource Discovery Network (RDN). The use of the OAI-PMH to expose metadata means that - in theory - other service providers could be developed to take advantage of datasets made available in this way. This is, however, beyond the scope of this document.

3.1 Search functionality

The prototype eBank UK system will enable searches to be undertaken on:

UKOLN developed a pilot eBank search interface with these basic search elements (Figure 4).

screenshot of elements in the pilot search interface

Figure 4: Prototype search interface for the eBank UK aggregator service (November 2004)

There may be a need for more detailed ways of searching the research papers harvested by eBank, e.g. searching on specific bibliographic information or retrieval by citation. This was not implemented in the pilot.

The results page should display the basic metadata about each retrieved structure report and paper. The structure report metadata would need to include its title, a list of the people associated with its creation, their institution [not implemented in the pilot], the date of deposit (or release), the empirical formula, IUPAC name and CCDC code, the compound class and any assigned keywords. The results should also provide links to the richer metadata held in the structure report repository and to any papers (published or unpublished) that claim to describe the structure. Users also said that the results summary should include an indication of which type of datasets were available for a specific report, i.e.:

Papers would be described with appropriate bibliographical metadata, e.g. title, abstract, author(s), and source journal details, and would also be linked to full-text or publishers' Web sites and to related datasets.

The pilot eBank service developed by UKOLN gives an idea of how this results interface might look in action (Figure 5).

picture of results interface

Figure 5: Prototype results interface for the eBank UK aggregator service (November 2004)

3.2 Linking functionality

The eBank service will need to harvest metadata on any links between datasets (structure reports), published papers and e-prints. The key to doing this accurately will be the consistent use of URIs or other unique identifiers. The system must be able to identify when a paper 'cites' a structure report, then capture this information for embedding in the eBank results page. In this way it should be possible to link from a particular structure report to all of the other structure reports, papers or e-prints that make use of it in any way. Conversely, it should be possible to trace back from published papers, to e-prints, to data files at various stages of processing.
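One way to picture this requirement is as a reverse index built over the harvested metadata: for each known URI, the index lists the records that cite it. The sketch below assumes that every record carries its own URI and a list of the URIs it references; the record structure, the "uri" and "references" field names and the example.org addresses are hypothetical. Matching is by exact string comparison, which is why consistent use of identifiers matters.

```python
from collections import defaultdict


def build_link_index(records):
    """Map each known URI to the URIs of the records that cite it."""
    known = {r["uri"] for r in records}
    cited_by = defaultdict(list)
    for record in records:
        for target in record.get("references", []):
            # Only link to targets that have themselves been harvested.
            if target in known:
                cited_by[target].append(record["uri"])
    return dict(cited_by)


# Illustrative records; only the ecrystals address comes from this report.
SAMPLE_RECORDS = [
    {"uri": "http://ecrystals.chem.soton.ac.uk/archive/00000062/", "references": []},
    {
        "uri": "http://repository.example.org/papers/1",
        "references": ["http://ecrystals.chem.soton.ac.uk/archive/00000062/"],
    },
    {
        "uri": "http://repository.example.org/reports/7",
        "references": [
            "http://ecrystals.chem.soton.ac.uk/archive/00000062/",
            "http://unknown.example.org/x",
        ],
    },
]

if __name__ == "__main__":
    for target, sources in build_link_index(SAMPLE_RECORDS).items():
        print(target, "is cited by", sources)
```

The same index can be read in the opposite direction to trace back from a paper to the structure reports and data files it draws on.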

3.3 Embedding eBank in third-party services

A final requirement is to be able to 'embed' eBank functionality (e.g. its search facility) into third party services. To demonstrate the potential of this, the pilot search service was added to the PSIgate interface (Figure 6) using the 'include' technology developed by the RDN (http://www.rdn.ac.uk/rdn-i/).

picture of search interface at PSIgate

Figure 6: Prototype search interface for the eBank UK aggregator service embedded in the PSIgate interface (November 2004)

References

Allen, F. H. (2002). "The Cambridge Structural Database: a quarter of a million crystal structures and rising." Acta Cryst., B58, 380-388.

Allen, F. H. (2004). "High-throughput crystallography: the challenge of publishing, storing and using the results." Crystallography Reviews, 10, 3-15.

Allen, F. H., Bellard, S., Brice, M. D., Cartwright, B. A., Doubleday, A., Higgs, H., Hummelink, T., Hummelink-Peters, B. G., Kennard, O., Motherwell, W. D. S., Rodgers, J. R., & Watson, D. G. (1979). "The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information." Acta Cryst., B35, 2331-2339.

Brown, I. D., & McMahon, B. (2002). "CIF: the computer language of crystallography." Acta Cryst., B58, 317-324.

Bruno, I. J., Cole, J. C., Edgington, P. R., Kessler, M., Macrae, C. F., McCabe, P., Pearson, J. & Taylor, R. (2002). "New software for searching the Cambridge Structural Database and visualizing crystal structures." Acta Cryst., B58, 389-397.

Frey, J. G., Bradley, M., Essex, J. W., Hursthouse, M. B., Lewis, S. M., Luck, M. M., Moreau, L. A. V. M., De Roure, D. C., Surridge, M., & Welsh, A. H. (2003). "Combinatorial chemistry and the Grid." In Berman, F., Fox, G., & Hey, A. J. G. (eds.), Grid computing: making the global infrastructure a reality. Chichester: Wiley, 945-962.

Gonçalves, M. A., Fox, E. A., Watson, L. T., & Kipp, N. A. (2004). "Streams, structures, spaces, scenarios, societies (5S): a formal model for digital libraries." ACM Transactions on Information Systems, 22, 270-312.

Hall, S. R., Allen, F. H. & Brown, I. D. (1991). "The Crystallographic Information File: a new standard archive file for crystallography." Acta Cryst., A47, 655-685.

Heery, R., Duke, M., Day, M., Lyon, L., Hursthouse, M. B., Frey, J. G., Coles, S. J., Gutteridge, C., & Carr, L. A. (2004). "Integrating research data into the publication workflow: the eBank UK experience." PV-2004: Ensuring the Long-Term Preservation and Adding Value to the Scientific and Technical Data, Frascati, Italy, 5-7 October 2004. Retrieved October 16, 2004, from: http://www.ukoln.ac.uk/projects/ebank-uk/dissemination/PV2004-heery.pdf

Hursthouse, M. B. (2004). "High-throughput chemical crystallography (HTCC): meeting and greeting the combichem challenge." Crystallography Reviews, 10, 85-96.

Lyon, L., Heery, R., Duke, M., Coles, S., Frey, J., Hursthouse, M., Carr, L., & Gutteridge, C. (2004). "eBank UK: linking research data, scholarly communication and learning." Third UK e-Science Programme All Hands Meeting (AHM 2004), Nottingham, UK, 31 August - 3 September 2004. Retrieved October 16, 2004, from: http://www.allhands.org.uk/proceedings/papers/237.pdf

Acknowledgements

The following people contributed to the work described in this report: Michael Day, Monica Duke, Rachel Heery, Liz Lyon, Andy Powell (UKOLN, University of Bath); Simon Coles, Jeremy Frey, Mike Hursthouse (School of Chemistry, University of Southampton), Leslie Carr, Chris Gutteridge, Steve Hitchcock (School of Electronics and Computer Science, University of Southampton); John Blunden-Ellis, Paul Meehan (PSIgate, University of Manchester).