eBank UK technical requirements for aggregator service and cross-searching web interface
This document describes the technical requirements for an aggregator and search service developed as part of eBank UK, a JISC-funded project that is part of the Semantic Grid Programme.
The project builds on the technical architecture currently being deployed within the ePrints UK Project, as described in a recent Ariadne article. The architecture supports the harvesting of metadata from eprint archives in UK academic institutions and elsewhere using the OAI Protocol for Metadata Harvesting (OAI-PMH). The eBank UK Project will augment this work by also harvesting metadata about research data from institutional 'e-data repositories'. Initially this will encompass data made available by Combechem, but it will include data from other sources in the longer term. Metadata records harvested from e-data repositories will be stored in the central database alongside the eprint metadata records gathered as part of the ePrints UK Project.
A web interface to the central database of eprint and research data metadata will be hosted at the Resource Discovery Network (RDN), and a CGI-based mechanism will be made available to embed the service into the PSIGate web site.
The software requirements for the 'e-data repositories' are being developed by Southampton, and modifications will be made to the eprints.org software to provide storage for, and metadata descriptions of, the research data output. These requirements are outside the scope of this document.
The metadata schema requirements for research data are being investigated jointly by Southampton and UKOLN. [Project Metadata Schema Page]
The user requirements are being documented in a separate work package. [Project User Requirements Page]
The system consists of e-data and e-print repositories which expose metadata about e-data and e-prints. The metadata links the e-prints to the e-data records. The metadata is harvested into a central database, where it is indexed and made available for searching via a web interface. Remote interfaces that give access to searching of the central database will also be supported using an embedding mechanism.
Several forms of publications may refer to datasets, and these include peer-reviewed articles, fast track Letters formats, as well as other sources, e.g.
Secondary literature:
The publications and the datasets are described by metadata, and the metadata is made available for harvesting (as e-print and e-data records) via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Note: Some of the data sets may be stored in, and made available for download from, an e-data repository; others may be located outside the eBank UK system, e.g. in a data storage facility managed by third parties.
The requirements for the e-print and e-data repositories are:
The system is agnostic about whether repositories expose metadata about e-prints or e-data or both and about whether the repositories hold metadata and content (e-data and/or e-prints) or just metadata. The system is also agnostic about whether content-holding repositories store the metadata embedded into the content or separately from it. The only requirement of repositories is that they are able to expose metadata using the appropriate schema and the OAI-PMH.
The metadata schema requirements for data sets will be the subject of further study
within the project and will be treated in a separate document. Here,
some of the discussion to date is outlined since the metadata
schema adopted may have some impact on the specific implementation
details of the architecture.
Metadata about e-data is defined in the CLRC Scientific Metadata Model as
"all the information, additional to the raw data itself, which a potential user of the data would need to know to be able to make full and accurate use of the data in a subsequent scientific analysis"
From discussions at the kick-off meeting it was decided to start with a minimal set of metadata and build up to more detailed descriptions. The initial minimal list was documented as:
The metadata from the e-prints archive and the e-data archives is harvested using the Open Archives Initiative Protocol for Metadata Harvesting (version 2). The harvested metadata is stored in a central repository. The frequency of harvesting will initially depend on the frequency of updates of the repositories that are being harvested, but more regular harvesting may be required in the longer term.
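The harvesting step described above can be illustrated with a minimal Python sketch. It assumes a repository that exposes unqualified Dublin Core (`oai_dc`) records; the base URL, identifiers and sample response are hypothetical, and the function names are illustrative rather than part of any existing toolkit. Only the request arguments (`verb`, `metadataPrefix`, `from`, `resumptionToken`) and the XML namespaces come from OAI-PMH version 2.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, token=None):
    """Build an OAI-PMH v2 ListRecords request URL."""
    if token:
        # resumptionToken is an exclusive argument: it replaces all the others.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # incremental harvest since this date
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI_NS + "record"):
        header = rec.find(OAI_NS + "header")
        records.append({
            "identifier": header.findtext(OAI_NS + "identifier"),
            "datestamp": header.findtext(OAI_NS + "datestamp"),
            "titles": [t.text for t in rec.iter(DC_NS + "title")],
        })
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return records, (token or None)  # an empty token means the list is complete


# Hypothetical response from an e-data repository (not real project data).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:data-1</identifier>
        <datestamp>2003-11-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    records, token = parse_list_records(SAMPLE_RESPONSE)
    print(records, token)
    print(list_records_url("http://example.org/oai", token=token))
```

A harvester would repeat the request with each returned resumption token until none is returned, and would pass the date of the previous harvest as `from_date` so that only new or changed records are fetched.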
There is a requirement for the research data to be referenced from the publications or learning materials so that a user can trace back from an information source to the original data.
Links are required between the publication and dataset metadata and the
publications and datasets themselves. These links are:
publication metadata --> publication
publication metadata --> dataset
dataset metadata --> publication
dataset metadata --> dataset
These links are required to identify and locate the publications and datasets,
and should be based on existing standards and mechanisms such as URIs
(e.g. URL, PURL or DOI), and/or OpenURLs (e.g. in electronic versions of publications, which might include a DOI where one exists).
Currently
citations (e.g. from a publication to another publication, or citing a dataset)
typically occur in a human-readable form (i.e. plain text),
but increasingly they will take other forms, such as those mentioned above,
which this project is investigating.
Note: This is an open issue that needs discussion: which form of URI should be adopted,
and what are the pros and cons of each?
Note 2: Publication metadata with these links will need to be made available - are there any existing ones? If not, some will need to be created once the form of link has been decided.
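As an illustration of the four links listed above, the following Python sketch models a metadata record that points both to the resource it describes and to related resources of either kind. The class, field names and URIs are hypothetical; the form of the identifiers (URL, PURL, DOI or OpenURL) is deliberately left open, since that decision has not yet been made.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    identifier: str    # identifier of the metadata record itself
    kind: str          # "publication" or "dataset"
    resource_uri: str  # locates the described publication or dataset
    related_uris: dict = field(default_factory=dict)  # kind -> list of URIs


def links(record):
    """List (source, target_kind, uri) triples for one metadata record."""
    source = record.kind + " metadata"
    # Link from the metadata to the resource it describes.
    out = [(source, record.kind, record.resource_uri)]
    # Links from the metadata to related publications and datasets.
    for target_kind, uris in sorted(record.related_uris.items()):
        for uri in uris:
            out.append((source, target_kind, uri))
    return out


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    eprint = MetadataRecord(
        identifier="oai:example.org:eprint-12",
        kind="publication",
        resource_uri="http://example.org/eprints/12",
        related_uris={"dataset": ["http://example.org/data/34"]},
    )
    for triple in links(eprint):
        print(triple)
```

Keeping the link targets as opaque strings means that the record structure would not need to change whichever form of URI is eventually chosen.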
The searching capabilities are still to be defined in detail, depending on the outcome of the user requirements. At this stage it is assumed that searching within specific fields will be needed, so the indexing and searching engine must support fielded searches over the metadata records. Some technical requirements follow from the user requirements, which will answer questions such as: do searches need to be limited to one set of metadata (i.e. e-prints or e-data)? There are also dependencies between the metadata schema chosen and the indexing and searching engine that can be applied. The search interface will be available via Z39.50 (Bath Profile), one of the protocols defined within the JISC Information Environment architecture, and via SRW, the Search/Retrieve Web Service protocol (also recommended in the JISC IE).
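To make the idea of fielded searching concrete, here is a small Python sketch of an in-memory index that supports searches within a named field and an optional restriction to one set of metadata. It is purely illustrative: it is not how the project's search engine works, and the class name, field names and records are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FieldIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> record ids
        self.kinds = {}                   # record id -> "eprint" or "edata"

    def add(self, record_id, kind, fields):
        self.kinds[record_id] = kind
        for name, value in fields.items():
            for term in tokenize(value):
                self.postings[(name, term)].add(record_id)

    def search(self, field, query, kind=None):
        """Boolean AND over the query terms within a single field."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(
            *(self.postings.get((field, t), set()) for t in terms)
        )
        if kind is not None:
            # Limit the search to one set of metadata.
            hits = {r for r in hits if self.kinds[r] == kind}
        return sorted(hits)


if __name__ == "__main__":
    index = FieldIndex()
    index.add("ep1", "eprint", {"title": "Crystal structure of aspirin", "creator": "Smith"})
    index.add("ed1", "edata", {"title": "Aspirin crystal structure data", "creator": "Jones"})
    print(index.search("title", "crystal aspirin"))
    print(index.search("title", "aspirin", kind="edata"))
```

The `kind` argument corresponds to the open user-requirements question of whether searches need to be limited to e-prints or to e-data.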
The Web Interface allows end-users to carry out searches against the e-print and e-data metadata,
through a web browser.
This is the human-facing part of the system that carries search
requests to the eBank system, interacting with the searching
mechanisms, and displays results to users.
The detailed requirements will develop from the user requirements.
An iterative style of development is being adopted with cycles of user
testing and feedback, building incremental functionality into the interface
as this becomes available from the underlying systems,
and in response to better elucidation of the requirements.
The Web Interface will conform to Web Accessibility guidelines (the W3C Web Accessibility Initiative Recommendations - any particular level?).
The aim is to be accessible by a variety of browsers, hardware systems
(PDAs as well ??), automated programs and end users.
The Web Site must be usable by browsers that support W3C recommendations such as HTML,
Cascading Style Sheets (CSS) and the Document Object Model (DOM). The
appearance of the website will be controlled by the use of style sheets
in line with W3C architecture and accessibility recommendations.
Text-based content will be delivered as XHTML 1.0 or HTML 4.
Accessibility requirements adapted from NOF guidelines
The scope of the interface includes presentation of the results
showing e-prints and e-data that match the search criteria,
and their availability. Displaying the data sets themselves is not within scope;
the user may retrieve the data sets from their location (the link to this location needs to be investigated).
The support of interaction with the data sets
using data-specific applications (e.g. by using plug-ins) is considered to be out of scope for the purposes of this pilot.
The embedding mechanism will enable the eBank UK searches to be embedded within an external website. The user will be able to enter requests from the external web site. The mechanism will relay the request to the eBank service and return results to the user in an adaptable form that allows the results to be presented within the user's site. The mechanism will be based on CGI; WSRP will also be investigated.
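The request-and-relay flow described above can be sketched in Python as follows. The sketch assumes that the external site passes the user's query in a `q` parameter and that results come back as an unstyled HTML fragment the host site can present as it wishes; the parameter name, the `ebank-results` class and the `search` callable are hypothetical, not the project's actual interface.

```python
from html import escape
from urllib.parse import parse_qs


def render_fragment(query_string, search):
    """Turn a CGI query string into an embeddable HTML fragment.

    `search` is any callable that takes the query text and returns a list
    of dicts with "title" and "url" keys (a stand-in for the eBank service).
    """
    params = parse_qs(query_string)
    query = params.get("q", [""])[0].strip()
    if not query:
        return '<div class="ebank-results"><p>No query given.</p></div>'
    items = []
    for hit in search(query):
        # Escape everything that comes from harvested metadata.
        items.append(
            '<li><a href="%s">%s</a></li>'
            % (escape(hit["url"], quote=True), escape(hit["title"]))
        )
    if not items:
        return '<div class="ebank-results"><p>No results.</p></div>'
    return '<div class="ebank-results"><ul>' + "".join(items) + "</ul></div>"


if __name__ == "__main__":
    def fake_search(query):
        # Hypothetical result, for illustration only.
        return [{"title": "Aspirin crystal structure data",
                 "url": "http://example.org/data/34"}]

    print(render_fragment("q=aspirin", fake_search))
```

Returning a fragment rather than a complete page leaves the surrounding layout and styling to the external site.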
Access management for data sets may be required, for example to provide access to referees during the refereeing process, or to limit access to authorised persons. Access management is considered out of scope for the pilot development; however, the user requirements will document authentication and authorisation needs. Hooks will be made available where possible, for example by noting the levels of access needed and developing the metadata schemas accordingly.
The e-data repositories will be made available by modification of the
eprints.org software. These repositories will be co-ordinated
by Southampton.
An existing OAI-PMH toolkit will be used for the harvesting and to support the
central repository. The central repository of harvested metadata records consists of a database which is hosted at the RDN.
An existing implementation of this set-up is available in the
ePrints UK project, which harvests e-prints metadata using the ARC software.
It is planned that eBank UK will re-use the ePrints UK installation and
central repository of e-prints, but will provide additional mechanisms for
harvesting e-data metadata from the e-data repositories into the central database.
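The idea of holding e-data records alongside e-print records in one central database can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration under assumed table and column names; it does not reflect the schema used by the ARC software or by the ePrints UK installation.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store holding harvested records of both kinds."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               identifier   TEXT PRIMARY KEY,  -- OAI identifier
               kind         TEXT NOT NULL,     -- 'eprint' or 'edata'
               source       TEXT NOT NULL,     -- base URL of the repository
               datestamp    TEXT NOT NULL,     -- datestamp from the OAI header
               metadata_xml TEXT NOT NULL      -- harvested metadata, as received
           )"""
    )
    return conn


def upsert(conn, identifier, kind, source, datestamp, metadata_xml):
    """Insert a record, replacing any earlier copy with the same identifier."""
    conn.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?, ?, ?, ?)",
        (identifier, kind, source, datestamp, metadata_xml),
    )
    conn.commit()


def count_by_kind(conn):
    """Return a mapping such as {'edata': 1, 'eprint': 1}."""
    return dict(conn.execute("SELECT kind, COUNT(*) FROM records GROUP BY kind"))


if __name__ == "__main__":
    # Hypothetical repositories and records, for illustration only.
    store = open_store()
    upsert(store, "oai:example.org:eprint-12", "eprint",
           "http://example.org/eprints/oai", "2003-10-01", "<dc/>")
    upsert(store, "oai:example.org:data-34", "edata",
           "http://example.org/data/oai", "2003-11-01", "<dc/>")
    print(count_by_kind(store))
```

Keying the table on the OAI identifier means that re-harvesting an updated record overwrites the earlier copy rather than duplicating it.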
The indexing and searching facilities will be provided by Cheshire.
Cheshire is, at its heart, an SGML search engine. It supports full-text documents and many different types of query, including relevance-ranked, Boolean and mixtures of the two. It can run in server mode, in which case it uses the Z39.50 protocol, or behind a CGI front end so that it can be accessed from a web browser.
It is open source, and freely available for use by academic or non-commercial organisations. It is written and maintained primarily by Ray Larson at UC Berkeley.
The use of the ARC toolkit and the Cheshire system in tandem to
deliver search services is proven and documented at
http://www.rdn.ac.uk/publications/resourcefinder/arch/.
The metadata samples will be provided by Combechem (and contacts)
initially, then through WP5.
UKOLN, University of Bath http://www.ukoln.ac.uk / Intelligence, Agents, Multimedia Group, School of Electronics and Computer Science, University of Southampton / School of Chemistry, University of Southampton / PSIGate, University of Manchester