
Discovering Online Resources. Metadata and Middleware:
A Systems Architecture for Cross-Domain Discovery

Daniel Greenstein, Arts and Humanities Data Service Executive (Daniel.Greenstein@ahds.ac.uk) and
Robin Murray, Fretwell-Downing Informatics (rjmurray@fdgroup.co.uk)


Contents

  1. Introduction
  2. The AHDS's HTTP/Z39.50 Gateway
  3. Conclusion


1 Introduction

The interpretation of the Dublin Core discussed previously allows a degree of common description to be applied to information resources that are otherwise documented according to different (domain-specific) standards and practices. Implementation issues remain to be addressed, however, if its full benefits are to be realised and access integrated to disparate scholarly collections. The implementation described here, to be developed for the Arts and Humanities Data Service, goes some way in addressing these issues. Using the Dublin Core-styled metadata outlined in Chapter 3 it will trial the reiterative search and retrieve model for cross-domain discovery that emerged from the MODELS 4 workshop (Russell, this volume). In particular, it will enable users to:

The solution uses software based upon the Z39.50 network applications protocol (Library of Congress 1997a). That software acts as a mediating layer between, on the one hand, a World Wide Web interface from which users query a range of different catalogue databases and to which merged result sets are returned, and on the other, the underlying catalogue databases themselves. From the user's point of view, this 'middleware' irons out any differences that may exist in the underlying databases (e.g. in their native record structure, query language, and record syntax). Put very briefly, systems based upon the Z39.50 network applications protocol typically comprise Z39.50 clients and Z39.50 targets. Clients issue user-supplied queries to targets and integrate the result sets that the targets return. Targets are associated with underlying databases. They receive the queries issued from the Z39.50 client, translate them into queries comprehensible to the native database, issue the queries to the native database, retrieve a result from the native database, and translate the result into a format that can be passed back to the client. Z39.50 clients can themselves behave as Z39.50 targets. The AHDS's Z39.50 client, for example, is designed to search in parallel across any one or several of the AHDS Service Providers' Z39.50-enabled catalogue databases. When queried remotely from a third-party Z39.50 client it will act as a Z39.50 target, enabling the remote client to query the Service Providers' catalogue databases as a virtual uniform database.
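The division of labour between clients and targets can be illustrated with a minimal sketch. It does not implement the Z39.50 protocol itself; the class names, field names, and sample records are all hypothetical and serve only to show a target translating a common query into native terms, and a client merging the result sets it receives. Because the client exposes the same search interface as a target, it can in turn be queried by another client.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A result in the common format passed back to the client."""
    title: str
    creator: str
    source: str


class Target:
    """Hypothetical stand-in for a target fronting one native database."""

    def __init__(self, name, rows, field_map):
        self.name = name
        self.rows = rows            # native records: dicts keyed by native field names
        self.field_map = field_map  # common field name -> native field name

    def search(self, field, term):
        # Translate the common field name into the native one, query the
        # native records, then translate each hit back to the common format.
        native_field = self.field_map[field]
        hits = [r for r in self.rows if term.lower() in r[native_field].lower()]
        return [Record(title=r[self.field_map["title"]],
                       creator=r[self.field_map["creator"]],
                       source=self.name)
                for r in hits]


class Client:
    """Issues one query to several targets and merges the result sets."""

    def __init__(self, targets):
        self.targets = targets

    def search(self, field, term):
        merged = []
        for target in self.targets:
            merged.extend(target.search(field, term))
        return merged


# Illustrative data only.
texts = Target("texts",
               [{"tei_title": "Hamlet", "tei_author": "Shakespeare, William"}],
               {"title": "tei_title", "creator": "tei_author"})
films = Target("films",
               [{"name": "Hamlet (film clip)", "maker": "Olivier, Laurence"},
                {"name": "Swan Lake", "maker": "Unknown"}],
               {"title": "name", "creator": "maker"})

gateway = Client([texts, films])
results = gateway.search("title", "hamlet")

# A client has the same search interface as a target, so a remote client
# can treat the whole gateway as a single virtual database.
remote_results = Client([gateway]).search("title", "hamlet")
```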

Until now, Z39.50-based systems have been used to integrate access to distributed online databases whose native record structures share a great deal in common. Systems have been developed, for example, for libraries, archives, museums, and social science data archives where databases conform more or less closely to a particular data description standard: MARC for libraries (Library of Congress 1997b), the Encoded Archival Description for archives (EAD 1997), the data standard recommended by the Consortium for the Computer Interchange of Museum Information for museums (CIMI 1997), and the standard study description for social science data archives (DDI 1997). In these cases, the Z39.50 systems act principally to iron out differences which stem from the underlying database hardware and software and the implications those differences have for record syntax, query language, etc. As yet, Z39.50 systems are not normally used to iron out differences which exist between catalogue databases because their records are formatted according to fundamentally different data description standards. Z39.50 systems do not as yet enable users to search across the catalogue databases that are maintained in different scholarly, curatorial, or other domains.

Representing a microcosm of this more generic cross-domain discovery problem, the AHDS is positioned to trial potentially generalisable applications of Z39.50. Its collections are geographically distributed amongst five Service Providers and intrinsically interdisciplinary - Service Providers collect, manage, and re-distribute digital resources of interest to specific academic communities. The Service Providers will develop their own on-line catalogue databases which will differ by necessity. In cataloguing its electronic texts, for example, the Oxford Text Archive conforms to the Guidelines recommended by the Text Encoding Initiative (TEI 1995) for the use of SGML (Standard Generalized Markup Language - a standard formalism for encoding electronic texts). For its database software, the Text Archive will use an SGML-aware search and retrieval engine which will enable users to carry out sophisticated searches of its holdings, either within individual texts, across selections of texts, or across the entire collection. The Archaeology Data Service, on the other hand, catalogues its holdings according to standards more suitable to archaeological information resources, including that recommended by the National Geospatial Data Framework (NGDF 1997). Rather than storing its records as SGML-encoded texts, the Archaeology Data Service will store them in a tabular database which can be queried with SQL (Structured Query Language). Across the AHDS as a whole, we expect each Service Provider to adopt the record structure which most adequately describes its holdings given their particular structure, provenance, and intellectual contents. We further expect the development of at least two, possibly three, generically different online database implementations: two SQL-based (at the Archaeology Data Service and at the Visual Arts Data Service), two SGML-based (at the Oxford Text Archive and the History Data Service), and one possibly based on object-oriented database software (at the Performing Arts Data Service).

The Service Providers' adoption of different catalogue record structures is essential if they are adequately to describe their holdings - historical databases are necessarily described differently from music recordings. The diverse range of database software and hardware platforms reflects Service Providers' responsiveness to their users' particular search and retrieval needs and to the infrastructure and expertise which exists in their host institutions. Owing to the intrinsic interdisciplinarity of humanities research, however, the AHDS also needs to allow users to search simultaneously across its distributed, interdisciplinary, and differently catalogued holdings. Hypothetically, it must enable a user interested in Shakespeare to discover an electronic copy of a Shakespearean play (at the Oxford Text Archive), a digitised film clip of Olivier's performance as Hamlet (at the Performing Arts Data Service), or a database with information about 300 years' worth of British Shakespearean performances (at the History Data Service). The interpretation of the Dublin Core outlined in Chapter 3 provides a mechanism for expressing elements which are used commonly to describe the information resources in our distributed collections. The Z39.50-based tools described below will enable users to benefit from that commonality and to search for and locate information resources across the domains occupied by our interdisciplinary and mixed-media holdings.

2 The AHDS's HTTP/Z39.50 Gateway

2.1 System overview

The Gateway will enable users to query the AHDS's distributed holdings in an integrated fashion. A high-level schematic drawing showing a possible implementation of the Gateway and its relation to internal and external systems is provided in Figure 4.1. It shows the Gateway as a single point of entry to the online catalogue databases developed by the five AHDS Service Providers. It can also search other Z39.50-enabled databases, and is accessed by end users through a World Wide Web interface. As well as having a direct user interface through the World Wide Web, the Gateway may also provide its own Z39.50 target which would allow remote Z39.50 clients to view the databases of all the AHDS Service Providers as a virtual uniform database. 'User Profiles' and 'Database Profiles' make up the Gateway's knowledge bank about approved users and known databases respectively. 'Database Centroids' are experimental tools which may assist users in selecting which of the several databases known to the Gateway they should usefully include in any particular query; they need not detain us here (Knight and Hamilton 1997; Panotzki 1996).

Figure 4.1 Schematic of the AHDS Gateway and its relation to internal and external systems.


2.2 Overall architecture

A possible architecture for the Gateway is depicted in Figure 4.2. It shows the Gateway comprising four inter-related services or functions. These are described in greater detail below.

Figure 4.2 AHDS Gateway architecture.


2.2.1 User services access layer
This layer is responsible for the Gateway's user interface. It comprises four components which are described below.

A web gateway allows users to access the Gateway from any standard web browser.

An information landscape definition provides users with a contextualised map of the information and services available from the Gateway.

Dynamic interface definition ensures that the user interface changes to reflect the services available to users as they move through the landscape.
This is especially important where query forms and returned result sets are concerned. If the user elects to search a number of underlying databases, the elements or fields (e.g. creator, title, subject) which are presented to the user for searching and which are returned in result sets will be those that are commonly supported by the selected underlying databases. As database selections change, so may the elements or fields that are presented to the user for searching and for formatting uniform result sets. Given the AHDS Service Providers' widespread adoption of a common Dublin Core-styled element set for resource description, any search involving one or several of the AHDS Service Provider catalogue databases and any results returned from such a search, will present elements from that set. Since the Gateway will in time enable users to search catalogue databases and other online information resources that are not maintained by the AHDS and so not necessarily conversant with the AHDS's common element set, dynamic interface definition is vital to the Gateway's function.

Authentication may be implemented, if required, to ensure that the Gateway is accessed only by bona fide users.
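The behaviour of dynamic interface definition described above can be sketched as a simple set intersection: the fields offered on a search form are those supported by every database the user has selected. The function name, database names, and field lists below are hypothetical and chosen only for illustration.

```python
def common_fields(selected, supported):
    """Return the fields supported by every selected database, sorted by name."""
    field_sets = [set(supported[name]) for name in selected]
    if not field_sets:
        return []
    return sorted(set.intersection(*field_sets))


# Hypothetical profiles: database name -> searchable fields.
SUPPORTED = {
    "text_archive": ["creator", "title", "subject", "language"],
    "archaeology": ["creator", "title", "subject", "coverage"],
    "external_library": ["creator", "title"],
}

# Two databases sharing a common element set offer more searchable fields...
ahds_only = common_fields(["text_archive", "archaeology"], SUPPORTED)

# ...than a selection that includes a database outside that element set.
with_external = common_fields(
    ["text_archive", "archaeology", "external_library"], SUPPORTED)
```

As the selection widens to include databases that do not share the common element set, the form narrows to the fields that remain in common.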

2.2.2 Service description layer
This layer enables the Gateway to smooth out anomalies between underlying databases. It comprises two principal components, an 'Explain' proxy service and a record syntax conversion facility.

The Explain proxy service may store profiles of databases that are known to and thus searchable by the Gateway. A database's profile will include information about its contents and record structure, the kinds of queries it supports, and the format in which results are returned. The Z39.50 Explain function is meant to generate such information about a database whenever it is queried by a Z39.50 client. To do this, both the Z39.50 clients and targets involved in the query need to support Z39.50 Explain. Presently, however, Z39.50 Explain is still in a developmental stage. Its capabilities are not entirely understood, nor is it implemented universally with Z39.50-aware systems. The Explain proxy service is an interim measure which will ensure that the Gateway has appropriate knowledge of the systems it is intended to interact with until such time as Explain is more fully developed and universally implemented.

Record syntax conversion. Even within the AHDS, Service Providers will use different catalogue databases and these will return records to the Gateway in multiple record formats. The record syntax conversion facility will obscure this heterogeneity from the user's point of view by converting incoming record structures into some standard internal format which will display single uniform result sets to users.
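One way to picture record syntax conversion is as a lookup table that maps each incoming record format onto a single internal set of elements. The sketch below assumes, purely for illustration, that native records arrive as flat dictionaries; the format names and native field names are invented and do not describe any Service Provider's actual catalogue.

```python
# Hypothetical syntax schemas: record format -> {common element: native field}.
SYNTAX_SCHEMAS = {
    "sgml_header": {"title": "titleStmt.title", "creator": "titleStmt.author"},
    "sql_row": {"title": "site_name", "creator": "depositor"},
}


def convert(record, record_format, schemas=SYNTAX_SCHEMAS):
    """Convert one native record (a flat dict) to the common internal format."""
    schema = schemas[record_format]
    # Elements missing from the native record become None rather than being
    # dropped, so every converted record carries the same keys.
    return {element: record.get(native) for element, native in schema.items()}


text_record = convert(
    {"titleStmt.title": "Hamlet", "titleStmt.author": "Shakespeare, William"},
    "sgml_header")
site_record = convert({"site_name": "Example excavation archive"}, "sql_row")
```

Because both converted records carry the same keys, they can be displayed together in one uniform result set.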

2.2.3 Distributed access layer
This layer provides facilities to query multiple remote databases in parallel. It consists of two major components.

A parallel query manager propagates a user's query to multiple remote databases by spawning Z39.50 clients. Results from each client are collated and passed to the upper layers of the Gateway (and so to the user) as a single result set.

Z39.50 client(s) independently manage query sessions with a remote database.
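The distributed access layer might be sketched with a thread pool standing in for the spawned Z39.50 clients. This is an assumption made for illustration, not a description of the Gateway's implementation: each 'client' here is simply a hypothetical callable that takes a query and returns a list of records.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_search(clients, query):
    """Run `query` against every client concurrently and collate the results.

    `clients` maps a database name to a callable that takes the query and
    returns a list of records. Returns (results, status), where `status`
    records the outcome for each database.
    """
    results, status = [], {}
    if not clients:
        return results, status
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        futures = {name: pool.submit(search, query)
                   for name, search in clients.items()}
        for name, future in futures.items():
            try:
                records = future.result()
            except Exception as exc:
                # One failing database should not sink the whole query.
                status[name] = f"failed: {exc}"
                continue
            status[name] = f"{len(records)} hits"
            results.extend(records)
    return results, status


# Hypothetical stand-ins for per-database query sessions.
def search_texts(query):
    return [{"title": "Hamlet", "source": "texts"}]


def search_unreachable(query):
    raise ConnectionError("no response")


results, status = parallel_search(
    {"texts": search_texts, "offline": search_unreachable}, "hamlet")
```

Keeping a per-database status alongside the merged results allows the interface to report progress and failures for each database separately.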

2.2.4 Administration database
The Gateway's administration database manages administrative and configuration data for the operation of the Gateway and comprises two components.

Database profiles describe the addresses and capabilities of remote databases which can be searched by the Gateway and inform the Explain proxy service described above.

Syntax schemas describe the record structures that may be returned by remote targets and inform record syntax conversion as described above.
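A database profile, and the way an Explain proxy might answer questions from stored profiles rather than by asking the target, can be sketched as follows. The class names, attribute names, and host names are hypothetical; the source does not specify how profiles are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseProfile:
    """Hypothetical profile entry; attribute names are illustrative only."""
    name: str
    host: str
    port: int
    searchable_fields: tuple
    record_format: str


class ExplainProxy:
    """Answers capability questions from stored profiles, not from the target."""

    def __init__(self, profiles):
        self._profiles = {p.name: p for p in profiles}

    def explain(self, name):
        return self._profiles[name]

    def supports(self, name, field_name):
        return field_name in self._profiles[name].searchable_fields


# Placeholder hosts; 210 is the port registered for Z39.50.
profiles = [
    DatabaseProfile("texts", "texts.example.org", 210,
                    ("creator", "title", "subject"), "sgml_header"),
    DatabaseProfile("sites", "sites.example.org", 210,
                    ("title", "coverage"), "sql_row"),
]
proxy = ExplainProxy(profiles)
```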

2.3 System interaction and component requirements

The following section outlines the features and requirements of the AHDS Gateway with reference to an example user session.
2.3.1
From a standard web browser, the user logs into or visits the Gateway and is presented with a contextualised map of the services that are available from the Gateway. The Web Gateway component of the 'User Access Layer' is responsible for providing this interface, either through standard CGI-type technology or by downloading Java applets to the client, or both. The Web Gateway will typically support: static HTML pages providing information and instruction; HTML forms allowing users to enter queries against selected databases; session support enabling the Gateway to maintain state information so that, for example, large result sets can be passed to the web browser a page at a time; and dynamic HTML page generation for hit lists and full record detail pages. The map or information landscape is produced by the Landscape Definition component of the User Access Layer.
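The session support mentioned above, by which a large result set is passed to the browser a page at a time, might look something like the following sketch. The class and its interface are assumptions made for illustration; how the Gateway actually maintains state is not specified here.

```python
class ResultSession:
    """Holds a full result set server-side and serves it one page at a time."""

    def __init__(self, records, page_size=10):
        if page_size < 1:
            raise ValueError("page_size must be at least 1")
        self.records = list(records)
        self.page_size = page_size

    @property
    def page_count(self):
        # Ceiling division without floating-point arithmetic.
        return -(-len(self.records) // self.page_size)

    def page(self, number):
        """Return page `number` (1-based); an out-of-range page is empty."""
        if number < 1:
            return []
        start = (number - 1) * self.page_size
        return self.records[start:start + self.page_size]


session = ResultSession(range(25), page_size=10)
last_page = session.page(3)
```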
2.3.2
The Gateway presents a list of database targets for the user to select from. The database targets will include the catalogue databases mounted by the AHDS Service Providers and will grow to include other online information resources of interest to humanities scholars. The list may be presented by service name (e.g. Visual Arts Data Service, Medieval University Archive). It may also be presented by keywords or even images which are used to characterise the underlying information, for example, by discipline area or subject (e.g. dance or philosophy), by historical period (e.g. the nineteenth century) or place (e.g. Britain), even by the resources' physical location (e.g. in Sheffield). In such cases, by selecting keywords or images the user automatically includes a range of catalogue databases in any subsequent search. To support this function, the Landscape Definition may consist of site-configurable, static HTML pages allowing the provision of a tailored environment. It must also interact dynamically with the Profile Database to allow intelligent searching of selected targets.
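Selecting databases by keyword rather than by service name can be sketched as a lookup against a tagged landscape. The database names and keywords below are hypothetical examples, not the Gateway's actual configuration.

```python
# Hypothetical landscape: each database is tagged with descriptive keywords.
LANDSCAPE = {
    "performing_arts": {"dance", "music", "britain"},
    "history": {"nineteenth century", "britain"},
    "visual_arts": {"painting", "nineteenth century"},
}


def databases_for(keywords, landscape=LANDSCAPE):
    """Return, sorted by name, every database tagged with any chosen keyword."""
    chosen = {k.lower() for k in keywords}
    return sorted(name for name, tags in landscape.items() if tags & chosen)


by_place = databases_for(["Britain"])
by_subject = databases_for(["dance", "painting"])
```

Choosing a keyword thus includes a range of catalogue databases in the subsequent search without the user having to name them.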
2.3.3
The user enters a query. Having selected a set of databases, the user is presented with a search form which allows the user to enter a query against named elements or searchable fields which the underlying databases commonly support. The list of searchable elements will either conform to the AHDS's core element set or dynamically be determined by the Gateway with reference to the Database Profiles or by implementing Z39.50 Explain. The Dynamic Interface Definition component is responsible for querying the Explain Proxy Databases and ascertaining which attributes are supported by the selected databases. Using its knowledge of the capabilities of the underlying databases the Service Description Layer may adapt the query to a more optimal expression for each database.
2.3.4
Remote databases are queried in parallel. The Distributed Access Layer will establish Z39.50 associations for each of the selected databases and independently progress the search against them. The status of each database should be displayed on the user interface to inform the user of the progress of their search.
2.3.5
A unified result set is displayed to the user. The Parallel Query Manager constructs a unified result set for the parallel query. This may involve converting record syntaxes from heterogeneous databases for presentation to the user in a unified format.
2.3.6
The user acts on the results of the search. This may involve obtaining more information about a resource which is referred to in the result set (e.g. by browsing any fuller information that is returned but not displayed in the uniform result set), saving the result set to disk or printing it out, or even ordering a resource (more information about the AHDS's resource ordering system and how it interacts with the Gateway is available from the AHDS's Web pages). The user may also wish to refine the search by issuing another query or issuing the same query across a different range of databases.

3 Conclusion

Though relatively complex, the system described above is intended to realise a simple vision: to enable scholars to find information resources which are appropriate to their needs irrespective of where, by whom, or in what format they are stored. Initially, the system will provide access to a small number of scholarly humanities collections, but it will be extensible, and in time provide access to a wider range of information resources. Indeed, the system's extensibility is required if scholarly communities are to take full advantage of network technologies and the proliferation of online information resources. Integrating access to such resources will require more than metadata and middleware, however. There are other challenges, some of which are only just coming to light as we gain some limited insight into resource discovery in extensively distributed and cross-domain environments. Some of these challenges are addressed in the next chapter, which sets AHDS/UKOLN work on Metadata in a wider context.


Last modified: Monday, 17-Nov-97 16:52:01 GMT by D. Greenstein
URL: http://www.ahds.ac.uk/public/arlist.html


This page was originally part of the Arts and Humanities Data Service (AHDS) Website: http://ahds.ac.uk/public/metadata/disc_06.html
Rescued (courtesy of the Internet Archive) and migrated to the UKOLN Website: 08-Apr-2011; Last updated: 06-May-2011.
The content is identical, but changes have been made to the HTML in an attempt to make it validate, and some links have been updated or deactivated.
