DESIRE Handbook: Section 3, Technical implementation

Standards

Introduction

Information gateways, in common with other Internet-based services, need to make important decisions about the use of standards. Standards can be utilised for many of the functions that an information gateway typically offers. For example, there are existing standards for communication protocols and search protocols, for metadata formats, for the content of bibliographic records (cataloguing) and for subject classification schemes.

Standards can be viewed as falling into three categories:

For the purpose of this chapter’s overview, the term standard refers to all three.

Background

Standards are a mechanism to clearly define the workings of a system, thus making it known or “open”. Standards define the interfaces between components in systems and the data-structures that pass across those interfaces. By building or using systems that conform to open standards we allow components that have been developed by others to interface with ours. Standards are the mechanism of co-operative development. The use of systems that are not standards-based normally leads to proprietary, single vendor solutions.

Because information gateways are complex, co-operative systems, using standards makes sense. However, some of the standards relating to interoperability are still the subject of ongoing research. This is particularly true of index sharing among servers, which provides the ‘forward knowledge’ required for query routing, and of the ordering and ranking of results from multiple search targets. Information gateways are likely to want to make use of such standards, but should do so with care and in the knowledge that they are still under development.

Structured data

Information gateways are designed to share ‘information’ with the end-user - a query is received and some results are returned. A standard such as HTML allows us to return that information in a way that can be displayed by any Web browser that supports HTML. However, it does not enable the client system or end-user to re-use that data within their local systems. By returning structured data, based on standards such as XML, ROADS or LDAP records or one of several Z39.50 record syntaxes, we allow the client or intermediate system to re-use the data. As Brian Kernighan has noted:

The problem with ‘What You See Is What You Get’ is what you see is all you’ve got.

The re-use of the structured data may be as simple as merging the results from several different services or it may be more complex, such as loading the data into a local spreadsheet or database for further processing.
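
As a minimal sketch of the simpler case, the following Python fragment merges the results returned by two services. It assumes, purely for illustration, that each service returns records as blank-line-separated 'Attribute: value' lines (a simplified stand-in, not the exact ROADS, LDAP or Z39.50 syntax) and that every record carries a URI attribute. The function names and sample data are hypothetical.

```python
def parse_records(text):
    """Parse blank-line-separated 'Attribute: value' records into dicts."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the record being built, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(":")
        current[name.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def merge_results(*result_sets):
    """Merge record lists from several services, keeping the first
    record seen for each URI (assumes every record has a URI)."""
    merged = {}
    for records in result_sets:
        for record in records:
            merged.setdefault(record["URI"], record)
    return list(merged.values())


# Hypothetical output from two services describing overlapping resources.
service_a = """Title: Example A
URI: http://example.org/a

Title: Example B
URI: http://example.org/b
"""
service_b = """Title: Example B (duplicate)
URI: http://example.org/b
"""

merged = merge_results(parse_records(service_a), parse_records(service_b))
for record in merged:
    print(record["URI"], "-", record["Title"])
```

The same merge would not be possible with HTML result pages without first scraping the attribute values back out of the display markup.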

** Case study - structured vs. unstructured content delivery **

The value of an information gateway is enhanced when the content is provided in a structured, machine-usable format and search protocols are implemented. For example, consider the JISC Current Content Collection Web-site - a set of linked HTML pages describing the rich set of data collections made available by the Joint Information Systems Committee on behalf of the UK Higher Education community. On an experimental basis, the descriptions are now available as a ROADS database, allowing them to be searched. More importantly, the use of ROADS allows these descriptions to be cross-searched with other similar ROADS databases. Furthermore, because the data is returned in a structured format, the database could form the basis of the first stage of resource discovery by some ‘intelligent’ broker or gateway.

JISC Current Content Collection: <URL:http://www.jisc.ac.uk/collections/>

ROADS-based version: <URL:http://roads.ukoln.ac.uk/jisc-ccc/cgi-bin/search.pl>

Standards - a quick tour

The remainder of this chapter outlines the key areas where standards are most likely to be of importance to information gateways. Note, however, that the intention is not to provide an exhaustive list of standards or to describe each standard in detail, but simply to introduce the key standards-related areas. A more detailed 'state of the art' survey of relevant technologies and standards has been produced by the PRIDE project (1999).

Search and retrieve

Information gateways typically allow end-users to search for, or browse through, catalogues of Internet resource descriptions. Browsing is often implemented, behind the scenes, as a specialised form of searching. Search and retrieve protocols provide the mechanism by which a client can pass a search request to a server and have the results returned to it. The client is typically remote from the server and communicates with it across the Internet. It is usually associated with a real (i.e. human) end-user, although this is not necessarily the case, for example where the client is an intelligent agent. The end-user may also be remote from the client, as is the case with the ROADS Web-based WHOIS++ client.

The standards most directly relevant to information gateways in this area are LDAP, WHOIS++ and Z39.50.

CROSS REFERENCE Interoperability chapter

Forward knowledge

In a multiple-service environment, additional functionality is needed in order to support cross-searching, in which a client or gateway (broker) passes a single query to multiple services. Searches may be performed in series (one after the other) or in parallel (all at once). More complex systems allow one service to pass the query on to another service, using what is known as a referral. Such referrals may be based on ‘forward knowledge’ – where information about the contents of each service database is shared between servers, allowing them to determine which services are likely to be able to satisfy a particular search request.
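
The difference between searching in series and in parallel can be sketched in Python as follows. This is an illustration only: each 'service' here is a hypothetical local function standing in for a remote WHOIS++, LDAP or Z39.50 target, and no real protocol is spoken.

```python
from concurrent.futures import ThreadPoolExecutor


def search_in_series(query, services):
    """Send the query to each service in turn, one after the other."""
    results = []
    for search in services:
        results.extend(search(query))
    return results


def search_in_parallel(query, services):
    """Send the query to all services at once and gather the results.

    pool.map returns results in the order of `services`, so the merged
    list has the same order as the serial version.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(services))) as pool:
        result_lists = list(pool.map(lambda search: search(query), services))
    return [hit for hits in result_lists for hit in hits]


def make_service(name, titles):
    """Build a hypothetical service that matches the query against titles."""
    def search(query):
        return [f"{name}: {title}" for title in titles
                if query.lower() in title.lower()]
    return search


services = [
    make_service("A", ["Chemistry links", "Physics links"]),
    make_service("B", ["Organic chemistry"]),
]

serial_hits = search_in_series("chemistry", services)
parallel_hits = search_in_parallel("chemistry", services)
print(serial_hits)
print(parallel_hits)
```

With real remote targets, the parallel form generally takes about as long as the slowest service rather than the sum of all of them, at the cost of having to handle time-outs and partial failures.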

The standards that are relevant here include ‘centroids’ - the information that forms the basis of ‘forward knowledge’ - and CIP - the protocol that is used to share centroids between servers (Panotzki 1996; Allen and Mealling 1998).

WHOIS++ supports referrals that are based on centroids and CIP (RFC 1913). LDAP originally only supported non-intelligent referrals, i.e. not based on forward-knowledge. Recent work has led to the integration of centroid and CIP technologies with LDAP. Z39.50 does not support centroids and CIP natively, though there is some experimental work investigating how it might make use of these standards.
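
The idea behind centroid-based routing can be illustrated with a short Python sketch. It assumes a much simplified centroid (the set of unique word tokens that occur for each attribute in a server's records) and does not attempt to implement CIP or the WHOIS++ wire format. The server names, records and function names are hypothetical.

```python
def build_centroid(records):
    """Summarise a server's records as unique word tokens per attribute."""
    centroid = {}
    for record in records:
        for attribute, value in record.items():
            centroid.setdefault(attribute, set()).update(value.lower().split())
    return centroid


def route_query(attribute, term, centroids):
    """Return the servers whose centroid suggests they may hold a match."""
    term = term.lower()
    return [server for server, centroid in centroids.items()
            if term in centroid.get(attribute, set())]


# Hypothetical servers, each sharing only its centroid, not its records.
centroids = {
    "server-a": build_centroid([{"Title": "Organic chemistry resources"}]),
    "server-b": build_centroid([{"Title": "Medieval history archive"}]),
}

targets = route_query("Title", "chemistry", centroids)
print(targets)
```

A server holding these centroids can refer a query to the servers listed in `targets` and skip the others. A match in the centroid indicates only that a server may hold relevant records, not that it will.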

By developing protocol gateways between WHOIS++, LDAP and Z39.50 it is possible to intelligently route queries between services that support different protocols.

** Case study - Project Isaac **

The Internet Scout Project is located at the University of Wisconsin-Madison (USA).

Their research involves providing gateway facilities between their LDAP based collections of Internet resource descriptions and those based on other standards. Currently they have implemented LDAP to WHOIS++ gateways using the CIP protocol (Internet Scout Project 1999).

Project Isaac: <URL:http://scout.cs.wisc.edu/research/>

Record syntaxes

Record syntaxes allow search results to be returned in defined formats. Some protocols, such as Z39.50, support multiple record syntaxes and allow the client to stipulate, as part of the query, the record syntax that the server should use. Other protocols, such as WHOIS++ and LDAP, have record syntaxes built into them – results are always returned using the same syntax.

The record syntax standards commonly used within subject gateways include WHOIS++/ROADS templates, LDAP records and the GRS-1 and various MARC standards associated with Z39.50.
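
The following Python sketch illustrates the general idea of a client stipulating a record syntax. It is not an implementation of Z39.50: the syntax labels 'template' and 'xml', the element names and the sample record are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def as_template(record):
    """Render the record as simple 'Attribute: value' lines."""
    return "\n".join(f"{name}: {value}" for name, value in record.items())


def as_xml(record):
    """Render the record as a flat XML element (hypothetical element names)."""
    root = ET.Element("record")
    for name, value in record.items():
        ET.SubElement(root, name.lower()).text = value
    return ET.tostring(root, encoding="unicode")


# The record syntaxes this hypothetical server is able to return.
SERIALISERS = {"template": as_template, "xml": as_xml}


def render(record, syntax):
    """Return the record in the syntax requested by the client."""
    if syntax not in SERIALISERS:
        raise ValueError(f"unsupported record syntax: {syntax}")
    return SERIALISERS[syntax](record)


record = {"Title": "Example resource", "URI": "http://example.org/"}
print(render(record, "template"))
print(render(record, "xml"))
```

A server with a single built-in syntax corresponds to a `SERIALISERS` table with only one entry: the client has no choice to make.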

CROSS REFERENCE Metadata formats chapter

Cataloguing rules

Given a particular record syntax, cataloguers must be consistent in the way they use the various attributes (or metadata fields) that are available to them. Without consistent usage, the collection of records that make up an information gateway database will not form a coherent whole. Furthermore, converting between record syntaxes, and sharing and merging records, will become problematic.

CROSS REFERENCE Cataloguing chapter

Subject classification

Manually created information gateways typically sort the resources that they describe into subject areas using classification schemes. There are many standard classification schemes and thesauri that can be used for this purpose. The choice is important, particularly if collaboration and cross-searching with other information gateways is envisaged. Gateways based on different classification schemes may require complex mapping tables in order to interoperate successfully.
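
A mapping table of this kind can be sketched in Python as follows. The two schemes, 'A' and 'B', and all of their class codes are invented for the example. Real mappings are usually far larger and less clean, with one-to-many and missing correspondences of the sort shown here.

```python
# Hypothetical mapping from classes in scheme A to classes in scheme B.
# One source class may map to several target classes, or to none at all.
MAPPING_A_TO_B = {
    "A:chemistry": ["B:Q10"],
    "A:materials": ["B:Q10", "B:T20"],
}


def translate_classes(classes, mapping):
    """Translate class codes, preserving order and noting unmapped codes."""
    translated, unmapped = [], []
    for code in classes:
        targets = mapping.get(code)
        if not targets:
            unmapped.append(code)
            continue
        for target in targets:
            if target not in translated:
                translated.append(target)
    return translated, unmapped


translated, unmapped = translate_classes(
    ["A:chemistry", "A:materials", "A:folklore"], MAPPING_A_TO_B
)
print(translated)
print(unmapped)
```

The unmapped codes are returned rather than silently dropped, so that a cross-searching gateway can decide whether to fall back to a keyword search for them.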

CROSS REFERENCE Classification chapter

Harvesting

Some information gateways are partially or wholly built by harvesting information from resources that are available on the Web. Typically, these gateways use robots to gather in that information automatically (e.g. Cao 1997). In the main, the standard Web communication protocol, HTTP, is used to transfer information from the remote Web server to the robot. Information providers, typically the resource authors, often make information available for the robot to collect by embedding it into HTML resources using the META tag. Increasingly this embedded metadata utilises the Dublin Core (DC) metadata element set. Recently, the World Wide Web Consortium have developed the Resource Description Framework (RDF) in order to provide a standard XML-based metadata syntax for use on the Web (W3C 1999). In the future we are likely to see remotely held RDF-based resource descriptions gathered into information gateways using RDF-aware robots.

In the future RDF is also likely to become a standard record syntax for passing results back from the server to the client although currently such usage is highly experimental.
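
The harvesting of embedded metadata described above can be sketched with the Python standard library. The example assumes that Dublin Core elements are embedded as META tags whose names begin with 'DC.' and parses a local HTML string rather than fetching a page over HTTP. The class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect the content of META tags whose name starts with 'DC.'."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            # Elements may be repeated, so keep a list of values per name.
            self.metadata.setdefault(name, []).append(content)


def extract_dc(html):
    """Return the embedded Dublin Core metadata found in an HTML string."""
    parser = DublinCoreExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata


page = """<html><head>
<title>Example</title>
<meta name="DC.Title" content="An example resource">
<meta name="DC.Creator" content="A. N. Author">
<meta name="keywords" content="not Dublin Core, so ignored">
</head><body></body></html>"""

metadata = extract_dc(page)
print(metadata)
```

A production robot would also need to fetch pages politely, follow links and cope with malformed markup, none of which is shown here.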

CROSS REFERENCE Harvesting chapter

Conclusions

This chapter has attempted to highlight the need for information gateways to use standards. In part this allows them to provide information in formats that are both human-readable and machine re-usable. More generally, it allows information gateways to become part of a seamless mesh of interconnected networked resources. Adherence to existing standards, at all levels of the information gateway, provides a basis for co-operation, both between information gateways and across information gateways and other services.

References

Cao, Y., 1997, Combine: A Distributed Open System Architecture for WWW Resource Harvesting and Indexing.
<URL: http://www.ub2.lu.se/combine/>

RFC 1913, 1996, Architecture of the Whois++ Index Service. (C. Weider, J. Fullton and S. Spero). Internet Engineering Task Force, Network Working Group, February.
<URL: ftp://ftp.isi.edu/in-notes/rfc1913.txt>

Allen, J. and Mealling, M., 1998, The architecture of the Common Indexing Protocol (CIP). FIND Working Group, Internet-Draft, 18 November.
<URL: ftp://ftp.isi.edu/internet-drafts/draft-ietf-find-cip-arch-02.txt>

Internet Scout Project, 1999, Project Isaac Architecture Overview for Collaborators.
<URL: http://scout.cs.wisc.edu/research/arch/index.html>

Panotzki, P., 1996, Complexity of the Common Indexing Protocol.
<URL: http://www.bunyip.com/research/papers/1996/cip/cip.html>

Dublin Core Metadata Initiative, 1999.
<URL: http://purl.org/dc/>

World Wide Web Consortium, 1999, Resource Description Framework (RDF) Model and Syntax.
<URL: http://www.w3.org/TR/REC-rdf-syntax/>

PRIDE Project, 1999, Requirements and Success Factors.
Work Package 2 of Telematics for Libraries project PRIDE (LB 5624).
<URL: http://www.ukoln.ac.uk/metadata/pride/wp2/d221/>

Glossary

ANSI

American National Standards Institute

BSI

British Standards Institution

CEN

European Committee for Standardization

CIP

Common Indexing Protocol

HTTP

Hypertext Transfer Protocol

IAFA

Internet Anonymous FTP Archive

IETF

Internet Engineering Task Force

IFLA

International Federation of Library Associations and Institutions

ISO

International Organization for Standardization

LDAP

Lightweight Directory Access Protocol

NISO

National Information Standards Organization

RDF

Resource Description Framework

ROADS

Resource Organisation And Discovery in Subject-based services

W3C

World Wide Web Consortium

WHOIS++

A 'lightweight' Internet protocol for information retrieval

Z39.50

An ANSI/NISO-developed protocol for information retrieval, also known as ISO 23950

Credits

Louis Daly, UKOLN: The UK Office for Library and Information Networking, University of Bath, Bath BA2 7AY, UK. - <l.s.daly@ukoln.ac.uk>

Andy Powell, UKOLN: The UK Office for Library and Information Networking, University of Bath, Bath BA2 7AY, UK. - <a.powell@ukoln.ac.uk>

Michael Day, UKOLN: The UK Office for Library and Information Networking, University of Bath, Bath BA2 7AY, UK. - <m.day@ukoln.ac.uk>