Authors: Lorcan Dempsey, Brian Kelly, Anne Mumford, Ian Peacock, Alan Robiette, and Chris Rusbridge on behalf of CEI

Editor: Lorcan Dempsey

Version 2.0, 27 October 1998

Document creator: Brian Kelly

This document replaces the earlier version:

Lorcan Dempsey, Anne Mumford, Alan Robiette, Chris Rusbridge on behalf of FIGIT (Version 1.0, 26 February 1996).

Acknowledgments

Version 2

The authors gratefully acknowledge the helpful input of Andrew Cormack (University of Wales Cardiff), Matthew Dovey (Oxford University), Jon Knight (University of Loughborough), Alan Lock (TASI, University of Bristol), Andy Powell (UKOLN, University of Bath) and Jane Williams (TASI, University of Bristol). We would also like to thank those who commented in detail on the pre-release draft.

Version 1

The authors gratefully acknowledge the helpful comments made on earlier drafts by members of FIGIT, the JISC Standards Working Group, and by Martin Hamilton and Jon Knight of Loughborough University. The helpful advice of Bill Tuck, independent consultant, and Brian Green, BIC, is also acknowledged. Kelly Russell provided valuable assistance in preparing the final text.

Introduction

eLib and standards

This document provides recommendations for the selection and use of standards in eLib projects. Use of recommended standards is strongly encouraged where relevant, and projects will be asked to justify alternative or competing choices to the Programme Director.

The document is structured as follows. There is a general discussion of specific standards, which for convenience has been clustered in these areas:

(Note: this document takes a broad view of what a standard is, using it to cover industry and community agreements as well as de facto and de jure standards, as such.)

It is intended that this document will be revised to reflect best practice and new developments. There are several points of contact and discussion:

General discussion of the issues it raises will be carried out on the lis-elib-tech Mailbase list.
Please send any queries about the use of standards in eLib projects to Chris Rusbridge, Programme Director (c.a.rusbridge@warwick.ac.uk).
Please send any editorial or content comments about the document itself to elib-standards@ukoln.ac.uk.

eLib projects should note that the following sources of advice are available to them in relation to standards issues:

UK Web Focus (UKOLN) <URL:http://www.ukoln.ac.uk/>: A post funded by the JISC to provide advice and referral about emerging web technologies.
UK Interoperability Focus (UKOLN) <URL:http://www.ukoln.ac.uk/>: A post funded by JISC and the British Library Research and Innovation Centre to provide consensus-making and other services to support interoperability between library and information systems. (Beginning late 1998.)
TASI (ILRT) <URL:http://www.tasi.ac.uk/>: A JISC-funded service which provides technical advice and consultancy on the digital creation, storage and networked delivery of image-related information.
The Arts and Humanities Data Service <URL:http://www.ahds.ac.uk/>: A service funded by JISC working on behalf of the academic community to support the management of scholarly digital resources.

Flavours of formats and standards

The use of standards can pose several well-known problems:

Several versions of a standard may be in use. An 'interpreter' may not be able to process a file if the 'generator' has used a different version of the relevant standard, for example.
Sometimes, suppliers may offer 'value added' versions of standards in their implementations. This may constrain any user of the data to use a particular supplier's software.
A standard may not be well specified and may be differently implemented in software.
Some standards may have more features than are likely to be used in practice; inter-working problems may arise where different subsets are used in different implementations.

For these reasons, it is common to have an additional layer or layers of agreement above the 'base' standards. A set of functions or elements may be defined by an agreeing group together with any limits on parameters (character sets permitted, header information, set of standards which are allowed) and any agreed extra features. Examples of specifications come from both standards groups and industry or research consortia.

Working together

It is important that eLib projects work together in relevant areas to ensure interoperability and avoid unnecessary divergence. It is expected for example that 'clumps' or document delivery projects will be able to interwork in relevant areas. The eLib office will advise on these issues.

Recommendations Summary

This section summarises the main standards and recommendations of this report.

Relevant standards for basic applications: telnet, FTP, SMTP/MIME, Internet media types
Comments: In general eLib projects will work in an Internet environment and use Internet applications. Any project with a particular reason to opt for X.400 or non-standard encodings such as uuencode must address the problem of gatewaying to SMTP/MIME mail. FTP is the preferred method of file access where security may be a concern; otherwise NFS, SMB and HTTP are viable alternatives.

Relevant standards for search and retrieve: Z39.50, LDAP, WHOIS++
Comments: Projects are encouraged to use Z39.50 v.3 to provide access to bibliographic databases, and to adhere to the MODELS Interoperability Profile (and to its successors which will be aligned with international activity). Projects implementing Z39.50 in other domains will be prepared to demonstrate to the Programme Director that they are conforming to relevant agreements. Other search services are expected to provide Z39.50 access where possible. Subject gateways using the ROADS software are accessible via WHOIS++.
Consensus: One of the roles of the Interoperability Focus will be to provide a consensus-making forum for Z39.50 and ILL profile activities.

The proposed Resource Discovery Network Centre (JISC Circular 10/98) will take a role in coordinating standard approaches for the ANR subject gateways. The ROADS project provides a framework for discussing search and retrieve issues for gateways using WHOIS++.

Relevant standards for document requesting: Z39.50 item order, ILL, HTTP, EDIFACT.
Comments: Standard document requesting systems are still not in routine production use.
Consensus: One of the roles of the Interoperability Focus will be to provide a consensus-making forum for Z39.50 and ILL profile activities.

User interface

Comments: It is anticipated that end-user access to services will typically be through the Web. Public eLib services should be able to be delivered to common Web browsers.

Data interchange

Relevant standards for raster graphics formats: JPEG, TIFF, GIF, PNG, Group 4 fax, CGM
Comments: Vector graphics are not supported by many web browsers. JPEG is supported by most browsers and is an accepted format for photographic-type images. A lossless format such as GIF or PNG should be used where JPEG causes distortion (for example on line drawings). PNG was introduced because of a patent on the LZW compression algorithm used by GIF, the effects of which seem uncertain. Note that PNG images cannot be animated.
Consensus: TASI provides advice in this area. TASI recommends being careful when interchanging between lossless and lossy formats. TASI recommends that lossy formats are treated as terminal formats (for end use); any further manipulation should be carried out on non-terminal formats (working copies of the original, or as close to the original as possible).

Relevant standards for page description: PostScript Level 2, PDF
Comments: PostScript may be viewed with the publicly available Ghostscript viewer. PDF is a proprietary format, but a freeware plugin and viewer are available.

Relevant standards for structured documents: SGML, HTML, XML
Comments: Vendor specific HTML extensions should not be used. XML is of strategic importance and progress in relevant application areas should be monitored. Projects should not develop their own DTDs unless there are strong reasons for so doing.
Consensus: CEI will work with projects on the adoption of common document type definitions.

Relevant standards for moving images/3-D: MPEG, AVI, GIF89A, QuickTime, RealVideo, VivoActive, VRML
Comments: MPEG-1 offers an accepted cross-platform solution to digital video with audio in most cases. Proprietary formats may be required in the case of streaming video (RealVideo and VivoActive).

Relevant standards for audio formats: AU, MP3, WAV, RAM/RM, MIDI/MODs
Comments: AU is commonly used across the Internet and is supported on a number of platforms. This format is suitable for small samples. For larger samples MP3 offers good reproduction with high compression. The proprietary RAM/RM format may be required for streaming audio.

Relevant standards for electronic document delivery: Projects should consider the GEDI proposals.
Consensus: The JEDDS project is a joint initiative between eLib and several Australian and New Zealand organisations which is exploring a standards based integrated solution.

Metadata

Relevant standards for resource description: Dublin Core, WHOIS++ templates, US-MARC, TEI headers, other community or domain specific approaches.
Comments: eLib strongly recommends that resources should have associated metadata. The use of Dublin Core is recommended unless there are strong reasons to use an alternative approach. Users of community-specific approaches will be prepared to demonstrate to the Programme Director that they conform to relevant agreements.
The subject gateways use WHOIS++ templates.
Consensus: The eLib Access to Network Resources projects discuss issues of common concern through the cousns Mailbase discussion list.

The proposed Resource Discovery Network Centre (JISC Circular 10/98) will take a role in coordinating standard approaches for the subject services.

Relevant standards for collection and service description: There is a variety of application-domain specific approaches, and it is not yet appropriate to make a recommendation.
Consensus: The Interoperability Focus will have a role in assisting the development of shared approaches within the UK.

Relevant standards for resource identification: (URN), PURL, DOI, SICI
Comments: eLib projects are expected to provide a URL for public services, and a URN when these become stable. Given the current state of development no hard recommendation is made for a specific persistent identifier.

Security, authentication and payment services

Comments: These are still immature technologies and few definite recommendations can be made. It may be wise to monitor CERT and vendor security announcements to keep security up to par.

Relevant standards for e-mail privacy: PGP

Further work: eLib projects cannot determine their own authentication structure; it should be a national (or at least a JANET) activity. The ATHENS service should be used by projects in the immediate future pending a longer term distributed standards-based approach.

1. Applications Services

Introduction

The JANET/SuperJANET network which serves the UK higher education community adopted the Internet (TCP/IP) protocols as the preferred standard for basic data transmission towards the end of 1993.

This section covers the following applications areas:

Basic applications

Terminal access

Telnet is the preferred protocol for terminal access across the IP network.

File access

A number of methods can be used to access and transfer files across the Internet. FTP is the preferred method for sensitive material. Other methods include NFS (available on most modern Unix workstations and commercially for PCs and Macs), SMB (called the Lanmanager protocols on PCs) and HTTP, though these methods are less secure.

Electronic mail

The situation is more complicated for electronic mail since two mail protocols are officially supported. These are the Internet SMTP protocol running over TCP/IP, and the ISO standard X.400 (1988) also running over TCP/IP.

At present the Internet standard, Simple Mail Transfer Protocol (SMTP), is the prevalent mail protocol used in the academic community. Although SMTP itself does not support non-text message parts, the MIME extensions to SMTP provide many of the same facilities as are defined in X.400 (1988). Since most popular Internet mail software now exists in versions which support MIME, and many campuses have implemented MIME as a local standard, this has become the most popular approach to the provision of multi-media mail facilities.

Any project with a particular reason to opt for X.400 or non-standard encodings such as uuencode must address the problem of gatewaying to SMTP/MIME mail.
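
As an illustration of the MIME approach described above, the following is a minimal sketch in Python of a message which carries a plain-text part and a binary attachment. The addresses, subject line, filename and the function name build_article_mail are hypothetical, chosen only for the example; the sketch builds the message but does not send it.

```python
from email.message import EmailMessage


def build_article_mail(sender, recipient, pdf_bytes):
    """Build a MIME message with a text body and one binary attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Requested article"  # hypothetical subject line
    # Plain-text part: this alone could travel over unextended SMTP.
    msg.set_content("The requested article is attached.")
    # Binary part: MIME labels it with an Internet media type and encodes
    # it so that it survives text-oriented mail transport.
    msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf",
                       filename="article.pdf")
    return msg


message = build_article_mail("sender@example.org", "reader@example.org",
                             b"%PDF-1.2 dummy content")
part_types = [part.get_content_type() for part in message.walk()]
```

Here part_types lists the media type of the container and of each part, which is the information a receiving mail client uses to decide how to present the message.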

Search and request applications

Search and retrieve - Z39.50

Z39.50 is an information retrieval protocol, which originated in the bibliographic world though is now used more widely. The protocol has facilities for managing queries and returning results. Through the use of various structured formats for searching and exchange, Z39.50 applications have an understanding of the semantics of the data they handle.

Z39.50 is the preferred protocol for making bibliographic servers available on the network. The 'clump' projects are implementing Z39.50 based applications which will allow cross-searching of catalogues and other resources. eLib projects should implement Z39.50 version 3 where possible, and are encouraged to work together to ensure that their implementations are interoperable.

Bibliographic applications should use the MODELS Interoperability Profile. International discussions are likely to see a convergence between this profile and other profiles in use in the bibliographic community. Z39.50 is also being used in the archival, geospatial and museum domains: projects in these areas should be able to demonstrate that they conform to appropriate domain-specific agreements where they exist.

Z39.50 is recommended as the general search protocol where relevant.

Search and retrieve - directory services

Directory services allow one to manage access to distributed collections of metadata. They allow users to search for resources (people, network resources, etc.) by some combination of attributes.

A number of dedicated White and Yellow Pages type directory service protocols are in use on the Internet, such as CCSO nameserver, WHOIS, RWHOIS, WHOIS++, finger, and LDAP (the Lightweight Directory Access Protocol). There is quite a lot of interest in LDAP, and Netscape and Microsoft are incorporating it in their products.

WHOIS++ is being used by some of the subject gateways within eLib which use the ROADS software. This implements the Common Indexing Protocol to provide a framework for query routing.

The Harvest system provides a framework for experiment with communicating servers, which distribute indexing and collection in various ways.

Document request services

Items will be requested from several sources: libraries, document suppliers, publishers, and other emerging providers. It would already be useful if there were a standard way of communicating requests, however they originate. This will especially be the case in some of the service scenarios proposed within eLib projects, where requests may need to be sent to several potential sources before they are satisfied.

However, on inspection, it is clear that the request is only one part of a whole process, which will require a range of transactions. The range of these transactions and the extent to which they mesh with commercial transactions will vary from project to project.

The ILL (Interlibrary Loan) protocol (ISO 10160/1) was developed to handle the multiple linked transactions involved in document request operations involving several parties. It is conceptually similar to EDI agreements (see below) and includes provision for: definition of required data elements, definition of a set of messages and their relationships, and a syntax for structuring the messages.

The system to system communication of structured messages allows a greater range of operations to be automated, including procedures for tracking, recalls and so on which are at present manual or mixed. Its use in interactive services for the request of documents requires further investigation.

However, the ILL protocol has not been widely implemented. There is now renewed interest in it as distributed applications begin to be built. BLDSC is developing an ART/ILL-protocol gateway.

An alternative approach has been proposed through the Z39.50 item order extended service, for one to one operation. Several US and European projects and services are looking at the implementation of this approach, but again production services do not seem to be common.

Several systems have also implemented simpler e-mail based request systems.

A possible alternative might be EDI. Electronic data interchange (EDI) is commonly defined as the application-to-application transfer of business documents between computers. EDI could provide a solution using internationally established standard transaction messages that have been tailored for the book and serials sector by Book Industry Communication (BIC) in the UK or by EDItEUR, the European Book Sector EDI Group managed by BIC. The EDItEUR messages, based on EDIFACT, are being developed in close collaboration with BISAC and SISAC in the United States, who plan to migrate to them in 1997. The advantages of EDI include its wide usage and support and its ability to deal with the whole supply chain, including reconciliation of invoices. EDI is being implemented by library system vendors and a range of suppliers.

There should be a concerted approach between projects with document request components to ensure inter-working of request management systems.

The World Wide Web

The World Wide Web (often referred to as "the Web") means different things to different people. The typical end user will probably regard a web application as a service which is accessible using a web browser. An application developer could regard a web application as one which is based on simple delivery of HTML files, or as one in which a non-web application is integrated with a web server.

This brief section gives a high-level view of the web. Details on other aspects of the web are given elsewhere in the document.

The Web as a delivery mechanism

It is anticipated that end-user access to services will typically be through the Web. Some services will be directly Web-based. A growing number will be provided through backend services accessed via CGI.

CGI, the Common Gateway Interface, is a protocol for executing programs on a web server which can return results to the browser as HTML. Usually CGI programs are used in conjunction with HTML forms so that the program may take input from the user via the web browser.
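
The exchange described above can be sketched as a small CGI program. This is a minimal illustration in Python rather than a recommended implementation: the form field name "term" and the function name handle_request are assumptions made for the example. It relies on the CGI convention that, for a form submitted with the GET method, the server passes the form fields to the program in the QUERY_STRING environment variable and reads the response from the program's standard output.

```python
import html
import os
from urllib.parse import parse_qs


def handle_request(environ):
    """Return a complete CGI response: header block, blank line, HTML body."""
    # Decode the form fields supplied by the web server.
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    term = fields.get("term", [""])[0]  # "term" is a hypothetical field name
    # Escape user input before placing it in the HTML which is returned.
    body = "<html><body><p>You searched for: {}</p></body></html>".format(
        html.escape(term))
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # When run by the web server, write the response to standard output.
    print(handle_request(os.environ), end="")
```

Keeping the request handling in a function which takes the environment as a parameter means that the logic can be exercised without a web server.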

A greater variety of helper applications will also be developed. Developers should ensure that end-user services can be delivered to common web browsers.

It is not suggested that the Web will replace more specialist applications such as WHOIS++, SQL or Z39.50, rather that for some interim period users will interact with distributed services through a forms interface and have results returned for processing by their web browser. Distributed object approaches may alter this model.

Software available on the client

Services available on the Web should normally be accessible using "standard" Web browsers. It is difficult to define what is meant by a "standard" browser. To interpret this as meaning that services should be designed for old versions of browsers could mean that needed functionality would have to be omitted and that the processing power and functionality available to large numbers of users of the service would not be exploited.

eLib projects should adopt a pragmatic approach to this issue. New technologies should not be used for their own sake. On the other hand new technologies which are needed in order to provide needed functionality should not be ignored if the technologies are not universally available.

eLib projects need to be aware of the issues involved in maintaining client software within institutions. It is not always easy to install the latest versions of software across a large number of desktop machines.

Note that it may not be possible to assume that browser plugin software is widely and easily used, even if the software is freely available.

Techniques such as Content Negotiation and server-side interrogation of the user agent may be used to check the software available on the client machine and to deliver documents which can be used correctly at the client.
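
A simplified sketch of server-side content negotiation is given below in Python. It reads the HTTP Accept header sent by the browser and picks, from the formats the service holds, the one the client rates most highly. The function names parse_accept and choose_variant are hypothetical, and the sketch ignores parts of the HTTP rules (for example media type parameters other than q), so it should be read as an illustration of the idea only.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, quality) pairs."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        quality = 1.0  # the default when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        prefs.append((parts[0].lower(), quality))
    return prefs


def choose_variant(accept_header, available):
    """Return the available media type the client rates highest, or None."""
    prefs = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in available:
        major = media_type.split("/")[0]
        q = 0.0
        # The most specific match wins: exact type, then "type/*", then "*/*".
        for pattern in (media_type, major + "/*", "*/*"):
            matches = [pq for pt, pq in prefs if pt == pattern]
            if matches:
                q = matches[0]
                break
        if q > best_q:
            best, best_q = media_type, q
    return best
```

For example, a browser which sends "image/png, image/gif;q=0.8" would be given the PNG version of an image where both are held, and the GIF version where only GIF is held.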

eLib projects should ensure that they monitor the browser software, version and client machines which access their services.

Web server issues

Many eLib services will make use of the web to deliver their services. The services may involve the provision of simple HTML resources, the provision of resources in other formats, the use of CGI to run software on the server or the use of gateways to provide access to other services, such as backend database software.

eLib services should be aware of the resource implications if the service were to move to another platform (server software, such as CERN httpd server to Apache httpd server, or server operating system, such as Unix to Windows NT). HTTP/1.1 is the latest version of HTTP, offering a number of advantages over HTTP/1.0 (such as persistent connections). HTTP/1.1 is not yet fully supported by browsers and servers.

Web programming issues

A variety of Web programming languages and Web programming models can be employed. CGI (Common Gateway Interface) scripts are often employed to provide integration with back-end systems, such as databases. Interaction with the end-user can be provided by client-side scripting languages such as JavaScript and VBScript. Java can be used to provide both back-end integration and interaction with the end user. ActiveX can also be used to provide interactive services.

Dynamic HTML

Dynamic HTML (DHTML) may be regarded as an extension of the capabilities of Web scripting languages. DHTML is based on a Document Object Model (DOM) which enables scripting languages to access anything on a Web page. Although DHTML enables richly functional cross-platform applications to be developed, backwards compatibility will be an issue. In addition Netscape's and Microsoft's implementations of DHTML are currently not fully compatible.

ActiveX

ActiveX was introduced by Microsoft in December 1995 as a rival to Java applets. However ActiveX does not have the cross-platform appeal of Java and its security model is much weaker than Java's. For these reasons the use of ActiveX is not currently recommended.

JavaScript

JavaScript is a scripting language, initially developed by Netscape. Microsoft's implementation of JavaScript, JScript, is not fully compatible with JavaScript. In addition different versions of the Netscape browser provide support for different implementations of JavaScript. More recently JavaScript has been standardised by ECMA (European Computer Manufacturers Association) and is becoming known as ECMAScript.

2. Data Interchange Standards

Some General Issues

There are many formats for exchanging and storing information. Some of these cover information which can be viewed but not edited. Other formats are concerned with the information (numbers, text, images) but make no reference to how the information is to be viewed or laid out. In the latter case the way the information appears to the user is a function of how it is viewed, for example on paper or on the screen. Some formats are appropriate for text and others for pictures. Some offer compact methods of storage and others less so. It is important to appreciate these differences when choosing a format for storage, particularly for archiving purposes where the eventual use to which the information is to be put may not be known.

One could group format types in various ways. It is useful in the context of the eLib programme to identify the following types:

Simple ASCII text with printing characters and no formatting other than spaces, tabs and line structure. This is unlikely to be of major importance within the eLib programme and is not discussed further.
Proprietary 'closed' word processing formats. These are not encouraged within the context of the eLib programme and are not discussed further.
Graphics formats, for example TIFF, GIF and PNG.
Page description formats, for example PostScript and PDF.
Files containing information about layout and content, for example RTF and TeX.
Structured information where layout is a separate exercise, for example SGML.
Formats for document delivery
Moving images, 3D information and audio.

In addition this section includes some treatment of:

Character sets
Compressed data.

Graphics Formats

Images are generally stored in one of two fundamental formats, a raster format or a vector format. Raster formats essentially store the pixels in the raster lines of the image and may contain any number of colours (fixed by the individual format). The colour space of a format is the model that internally represents the colours within the image (up to a set maximum) based on combinations of some primitives, such as triples of red, green and blue (the RGB colour space) or triples of hue, saturation and value (the HSV colour space). TASI (Technical Advisory Service on Images) recommends that the images remain in the capture colour space and are only converted as needed. Raster formats are the most popular to be found on the Internet, but can suffer from a number of disadvantages: 'jagged' edges, poor resolution, the treatment of text information as image data and the inability to scale without distortion. A number of common raster formats are described below.
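
The relationship between the RGB and HSV colour spaces mentioned above can be illustrated with the colorsys module from the Python standard library. The helper name rgb8_to_hsv is hypothetical, and the sketch assumes 8-bit colour components.

```python
import colorsys


def rgb8_to_hsv(red, green, blue):
    """Convert 8-bit RGB components (0-255) to an HSV triple in 0.0-1.0."""
    return colorsys.rgb_to_hsv(red / 255.0, green / 255.0, blue / 255.0)


pure_red = rgb8_to_hsv(255, 0, 0)      # hue 0.0, fully saturated, full value
mid_grey = rgb8_to_hsv(128, 128, 128)  # saturation 0.0: a grey has no hue
```

The two triples describe the same colours in different terms. Storing a converted image with a limited number of bits per component may round values, which is one practical reason to keep a working copy in the capture colour space.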

GIF

The Graphics Interchange Format (GIF) defines a protocol which supports the hardware independent, online transmission of raster graphics data (i.e. images). It uses a version of the LZW compression algorithm for its compression.
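
To give a feel for how LZW compression works, the sketch below implements the basic textbook form of the algorithm in Python. It is not the GIF variant: GIF packs codes into variable-width bit fields and adds special clear and end-of-information codes, none of which are shown here. The function names are hypothetical.

```python
def lzw_compress(data):
    """Return a list of integer codes for the byte string `data`."""
    # The table starts with one entry for every possible single byte.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate           # keep extending the match
        else:
            codes.append(table[current])  # emit the longest known string
            table[candidate] = next_code  # learn the new, longer string
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the original bytes from the codes made by lzw_compress."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the string which is about to be defined.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)
```

The decompressor rebuilds the same string table as the compressor, so no table needs to be transmitted, and the process is lossless: the original data is recovered exactly.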

GIF is defined in terms of data streams which in turn are composed of blocks and sub-blocks representing images and graphics, together with the essential control information required in order to render the resultant image on the target output device. The format is defined on the assumption that an error-free transport level protocol is used for communication i.e. no error detection facilities are provided.
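
As a small illustration of this block structure, the following Python sketch decodes the fixed-size blocks which open a GIF data stream: the signature and version, followed by the Logical Screen Descriptor. The field layout is as given in the published GIF89a specification; the function name and the dictionary keys are hypothetical. The image blocks which follow are not parsed.

```python
import struct


def read_gif_header(data):
    """Decode the header and Logical Screen Descriptor of a GIF data stream."""
    if len(data) < 13:
        raise ValueError("too short to be a GIF data stream")
    if data[0:3] != b"GIF":
        raise ValueError("not a GIF data stream")
    version = data[3:6].decode("ascii")  # "87a" or "89a"
    # Little-endian width and height, a packed flags byte, the background
    # colour index and the pixel aspect ratio.
    width, height, packed, background, aspect = struct.unpack(
        "<HHBBB", data[6:13])
    has_global_table = bool(packed & 0x80)
    # The low three bits give the table size as a power of two.
    entries = 2 ** ((packed & 0x07) + 1) if has_global_table else 0
    return {
        "version": version,
        "width": width,
        "height": height,
        "global_colour_table_entries": entries,
    }
```

An interpreter can use the version field to decide whether it is able to process the rest of the stream, which relates to the versioning problem noted in the introduction.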

GIF has recently gained in popularity owing to its extensive use in the Web.

Although GIF is subject to a patent claim by Unisys, they have granted a limited, non-exclusive, royalty-free license for its use in certain computer software viewers.

JPEG

The need for standards which can represent grey scale and colour images in an economical and efficient way was acknowledged by the setting up of the Joint Photographic Experts Group (JPEG). This is a joint committee of ISO and CCITT. The JPEG standard is an ongoing project within ISO and CCITT as more efficient algorithms are sought for future versions of the standard.

The ISO standards covering the JPEG format define a number of different types of JPEG: "baseline JPEG", "lossless JPEG" and the recently introduced "JPEG-LS". "Baseline JPEG" is by far the most popular format in use at present. "Lossless JPEG" has become virtually obsolete and it remains to be seen how "JPEG-LS" (a "lossless" format) will be adopted for use on the Internet.

"Baseline JPEG", although able to achieve visual indistinguishability from its original photo, is mathematically "lossy". The lossy compression is designed to take advantage of weaknesses in the human visual system to store detailed pictures such as "real-life" scenes (rather than, say, line-art). The degree of loss of data can be controlled by choice of parameters which are defined in the JPEG specification.

PNG

The Portable Network Graphics (PNG - 'ping') format is designed to be a successor to GIF. It was developed during 1995 and has been adopted by CompuServe to replace GIF. From the web perspective, PNG has a number of advantages over GIF including: alpha channels (variable transparency), gamma correction (cross-platform control of image brightness), two-dimensional interlacing and slightly better compression. PNG supports palette-based images (8-bit), up to 48-bit ('truecolor') RGB or 16-bit greyscale. Unlike baseline JPEG, saving and re-saving a PNG image will not degrade its quality, making it a useful format for storage at intermediate stages of editing. Final edits of photographic images are probably better converted to JPEG. In contrast to GIF, multiple image support is not a feature of PNG. However, a PNG-like extension format called MNG is currently in development.

TIFF

Aldus Corporation designed and made public the Tagged Image File Format (TIFF) in 1986. Although initially targeted at desktop publishing applications, it has been widely implemented on many computing platforms and has become a de-facto industry standard format.

TIFF is a tag-based format for storing and interchanging raster images. Typically TIFF describes data that comes from scanners, frame grabbers and paint programs. It is capable of describing bilevel, grayscale, palette-color and full-color image data in several color spaces. There are four basic TIFF internal formats: bitmap, grey-scale, palette-color and RGB-color. Each of these internal formats supports its own options for compression.

TIFF is widely used, but it should be noted that there are problems with different versions and different flavours: implementers add their own tags, which interpreters do not necessarily understand.
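
The tag problem can be made concrete with a short Python sketch which lists the tags in the first image file directory (IFD) of a TIFF file and marks those it does not recognise. The header and directory layout follow the published TIFF specification; the function name is hypothetical, and KNOWN_TAGS is a deliberately tiny assumed subset of the baseline tags, used only for illustration.

```python
import struct

# A small, illustrative subset of baseline TIFF tags (not a complete list).
KNOWN_TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
}


def list_tiff_tags(data):
    """Return (tag_number, name_or_None) for each entry in the first IFD."""
    order = data[0:2]
    if order == b"II":
        endian = "<"  # little-endian byte order
    elif order == b"MM":
        endian = ">"  # big-endian byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = []
    for index in range(count):
        start = ifd_offset + 2 + index * 12
        # Each directory entry is 12 bytes: tag, field type, value count,
        # and the value itself or an offset to it.
        tag, field_type, n_values, value = struct.unpack(
            endian + "HHII", data[start:start + 12])
        tags.append((tag, KNOWN_TAGS.get(tag)))
    return tags
```

A tag reported with the name None is one this reader does not understand. A robust interpreter would normally skip such tags rather than reject the file, but any information carried only in them is lost to that interpreter.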

PhotoCD

PhotoCD is a proprietary format devised by Kodak for the digital storage of constrained resolution images on CD. These images can be viewed at a number of resolutions and can be manipulated using image processing software. This format tends to use the PhotoYCC colour space.

Group 4 fax

CCITT have developed a series of compression mechanisms for transmitting black and white images across telephone lines which are used by fax machines. The standards are officially known as CCITT Recommendations T.4 and T.6 but are more commonly known as Group 3 and Group 4 compression respectively. Group 4 fax is likely to be used in some eLib document delivery projects.
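
These compression mechanisms start from the observation that a scanned black and white page consists mostly of long runs of pixels of the same colour. The Python sketch below shows only that first step, turning one scanline into alternating white and black run lengths. It is not an implementation of T.4 or T.6, which go on to encode the runs (and, in the two-dimensional modes, their position relative to the previous line) as variable-length code words. The function name and the 0/1 pixel representation are assumptions made for the example.

```python
def run_lengths(scanline):
    """Return alternating white/black run lengths for one bilevel scanline.

    `scanline` is a sequence of 0 (white) and 1 (black) pixels. The first
    run is always counted as white, so a line which starts with a black
    pixel begins with a white run of length zero.
    """
    runs = []
    colour = 0  # the colour of the run currently being counted
    length = 0
    for pixel in scanline:
        if pixel == colour:
            length += 1
        else:
            runs.append(length)
            colour = pixel
            length = 1
    runs.append(length)
    return runs


line = [0] * 3 + [1] * 2 + [0] * 4
runs = run_lengths(line)
```

Here the nine pixels of line are reduced to three numbers; on a real page, where white runs may be hundreds of pixels long, the saving is much greater.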

Vector formats

The alternative to raster formats, vector formats, can result in smaller files and better representation. CGM is the International Standard for storage and exchange of 2D graphical data. Although initially a vector format, it has been extended to include raster capabilities. Formats likely to be heard of include the WMF (Windows Meta File) format found on PCs that supports Version 1 of CGM:1992 and the PGML vector format that has recently been submitted to the W3C. Common web browsers do not currently contain native support for vector formats.

Page Description Formats

Introduction

There are document formats which seek to preserve the image of an entire printed page. The page can be stored using a page description language (PDL) where the text and graphics are stored together with output information such as layout, font, linestyle etc. This means that more information is stored than within a Raster format. PDLs are designed for presentation of complete, formatted, final-form (non-revisable) page images on output printing devices. 'Virtual paper' is a good metaphor for PDLs. Most PDLs are oriented toward presentation of pages on laser printers. PostScript and the Portable Document Format (PDF) are described below as these are likely to be important within the eLib programme.

PostScript

PostScript is a Page Description Language (PDL) designed by Adobe Systems Inc. PostScript is the most successful of the commercial PDLs and has had a heavy influence on the final appearance of the Standardised Page Description Language (SPDL, an ISO standard). It offers the advantages of potentially high resolution colour output - that is, it is as good as a printed paper copy - and this means more flexibility in that the scale can be changed without loss of information.

As the 'language' part of PDL suggests, PostScript is a true interpretive programming language. Its Level-1 definition includes over 270 operators, which go far beyond basic graphics presentation (definition and maintenance of 'dictionaries', boolean and arithmetic operators, etc.). The more recently released Level-2 definition contains over 420 operators.

PostScript uses programming language constructs and paradigms: procedures, variables, conditional logic, etc. This creates powerful expressive capabilities. The trade-off is that, compared to more object-oriented graphics formats, a PostScript graphics file is very difficult to edit or modify. Although device-independent, the PostScript imaging model demands Raster devices for presentation. The language is implemented on powerful onboard microprocessors on many Raster devices (PostScript demands a lot of memory and computational power to interpret).

Encapsulated PostScript (EPS) is a (large) subset of PostScript which allows storage of information in the PostScript language but excludes any size or positioning information. This means that a part of a page can be brought into another document. This is most frequently used for the inclusion of graphics within documents where these have been produced in a different package from the one used for producing the text.

PostScript Level 2 is likely to be useful within the eLib programme for delivery of high quality page descriptions which are for printing only (although equivalence cannot be guaranteed across printers). Where page fidelity is needed together with the ability to annotate and interact with a document then PDF, discussed below, is likely to be the format needed.

Portable document format and Acrobat

Adobe have further developed the PostScript concept to define their Portable Document Format (PDF) which links with a suite of software called Acrobat. PDF and Acrobat are likely to be important within eLib. This position has been strengthened by the fact that there is a freely available reader, the later versions of which support motion video.

PDF extends PostScript Level 2 to allow the addition of links within and between documents, annotations, thumbnails of pages, and chapter outlines which link to specific pages for access. The basic metaphor again is the page. This can be very attractive to publishers who wish to define a house style or who wish to have an online version of a paper journal. One such example is the Electronic Publishing Journal from Wiley which is described in Smith et al (1994).

The Acrobat software has a number of modules. These are:

Acrobat Reader
Acrobat PDF Writer to interface to other software
Acrobat Exchange which allows some editing and permits annotations, links and security controls to documents
Acrobat Search for searching documents
Acrobat Type Manager which ensures suitable use of fonts and resolution for devices
Acrobat Distiller for conversion from PostScript or EPS
Adobe Acrobat for Workgroups to enable co-operative working on documents
Acrobat Catalog for creating indexes for subsequent use in the Search package
Acrobat Capture for interfacing to scanners and enabling the scanned page to be converted into PDF. This does character recognition and also maps the printed font into an Adobe font

This software is available for the PC, Macintosh and Unix platforms. Adobe have indicated that they intend to extend PDF to encompass SGML files, in order to be able to preserve structure within the file.

Formats containing layout and content

Another storage possibility is storing the layout of a document and the contents of that document separately. This is the approach taken by the Open Document Architecture (ODA) standard. In ODA the layout of the document is stored. This might include pages, title areas, places for pictures etc. The standard also allows the logical structure of the document to be defined. This may be chapters, paragraphs etc. which are then linked to the layout of the document. Alongside this the standard allows various content architectures to be positioned into places on the 'page' (this page could be a piece of paper or a screen and the layout may vary depending on the output medium). ODA standardises a number of content architectures. One of these is the Computer Graphics Metafile standard. Another is a Raster format based on the CCITT fax standard (see Annex A).

It is unclear what future there is for the ODA standard. It is not recommended for use in the eLib programme.

Rich Text Format (RTF), which can be output from and input to many word processing packages, is another example of this form. RTF is not as portable as many believe, with problems even between different versions of the same software. It is possibly useful in private correspondence but not for general use in eLib projects.

LaTeX/TeX, used by many mathematicians and computer scientists, is another example of marked up text which contains information about both structure (paragraphs, mathematical symbols) and layout. As mentioned above, TeX is often used within other formats to enable the inclusion of mathematical symbols.

Structured information

Structured information contains embedded mark-up (often known as tags) which describes the information but which does not include any layout information.

SGML

The Standard Generalised Mark-up Language (SGML) provides a meta-language, or syntax, for writing rigorous, descriptive definitions of documents. It is independent of any system, device, language or application and allows individuals or groups of people in user communities to write their own types of documents within a standard framework. The information may include information beyond text and this may be image data stored in fax format or may be a CGM file. This standard separates document definition from subsequent access and viewing and allows information to be accessed in ways not predicted at the time of mark-up.

Each SGML document contains three parts. The first is an SGML declaration which describes the environment in which the document needs to be processed and may include information about which character sets are to be used. The second part is the Document Type Definition (DTD) which describes the logical model for the document and defines references to entities which may be referenced, such as a fax image or CGM file. The third part is the document stream itself.

The SGML standard has an associated standard called the Document Style Semantics and Specification Language (DSSSL) which gives rules of presentation and style for the logical document components, for example headers, footers, chapter headings etc., which are defined in the document. The document may then be output via a page description language such as PostScript.

SGML is flexible. There is little restriction - too little many would argue - on the mark-up used. The standard is beginning to be taken on and used in publishing applications, by the US Department of Defence and by providers of some experimental online information services. It is gaining momentum in the marketplace especially as the multimedia standard called HyTime, which is an SGML application, seems to be being taken up by a number of key players in the market. The Text Encoding Initiative, a set of document type definitions produced within the academic community for the encoding of humanities research datasets, is also providing some impetus.

It is anticipated that SGML will be of use to several projects, even where materials may output to XML for delivery. Projects are encouraged to work together to agree or, where necessary, develop document type definitions. Several existing DTDs are likely to be relevant.

Text Encoding Initiative

The Text Encoding Initiative (TEI) is an international research project which aims to develop and disseminate guidelines for the encoding and interchange of machine readable texts within the humanities. These guidelines are available and offer a comprehensive mark-up capability using SGML which has application beyond the humanities.

HTML

Hypertext Markup Language (HTML) is simply a DTD which prescribes formats for presentation and display. Hypertext documents in the World Wide Web are written in HTML. eLib projects will make heavy use of HTML and should use HTML 3.2 and HTML 4.0. Microsoft, Netscape and other vendor-specific extensions are deprecated.

XML

The Extensible Markup Language (XML) is an application profile of SGML and was developed to address deficiencies in HTML. The extensibility of XML obviates the need to propose additions to HTML or to use existing, but non-ideal, methods of markup. XML allows a markup structure to be tailored around the nature of the data to be represented. Previously SGML was the primary method of defining new document types, but although powerful, the syntax of SGML is complex. XML therefore introduces two new degrees of freedom to document creation on the web: freedom from an inflexible document type (HTML) and freedom from the complexity associated with defining new document types. Designed for ease of implementation and interoperability with HTML and SGML, the goal is that XML will enable its documents to be served, requested and processed in the way that HTML documents currently are.

An XML document essentially defines a data object that is composed of XML entities. An entity, like a variable, may store either text or binary data. Text within a document will either be a part of the data described or part of the markup. The markup encodes a definition of the document structure so that an XML processor will have access to the appropriate content and structure.
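The separation of markup from data can be sketched with a few lines of Python using the standard library's xml.etree.ElementTree parser. This is an illustration only: the element and attribute names (article, id, title, author) are invented for the example and are not taken from any eLib document type.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document; the element names are illustrative only.
SAMPLE = """<?xml version="1.0"?>
<article id="a1">
  <title>Standards in eLib</title>
  <author>A. N. Author</author>
</article>"""


def summarise(xml_text):
    """Return (id, title, author) from a document shaped like SAMPLE."""
    root = ET.fromstring(xml_text)
    # The markup gives the processor access to structure as well as content.
    return (root.get("id"), root.findtext("title"), root.findtext("author"))


print(summarise(SAMPLE))
```

The processor never sees layout information here; how the title or author should be displayed is left to a separate style sheet.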

XML is a data object and has no concept of layout; in fact XML documents may not be designed for graphical rendering, being used primarily as a 'data-container'. On the Web however, it is likely that most XML documents will require a graphical representation and XSL was developed for this purpose. XSL, the XML Style Sheet Language, provides a method of defining layout, i.e. how a rendered XML document should appear in an XML browser.

Currently, browsers have only limited native support for XML (e.g. Internet Explorer 4.0 supports CDF, an XML application). Further support should be seen in the releases of Internet Explorer version 5 and Netscape version 5. To view XML documents with current browsers, an intermediate client plug-in program may be used, using Javascript, Java, ActiveX or similar. Another approach is server-side conversion of XML to HTML.

A number of XML applications (formalised XML document structures) exist. Popular applications include CDF (Channel Definition Format) developed by Microsoft for use with push technologies, CML (Chemical Markup Language) developed at Nottingham University, MathML (Maths Markup Language) a recent W3C recommendation and RDF (Resource Description Framework) a framework for describing metadata applications.

XML-related development is currently very active. This includes support technologies, from which XLink (providing a rich hyperlinking mechanism) and XPointer (providing access to components of XML resources) have recently emerged as well as application level technologies.

Moving images

MPEG

The Moving Picture Experts Group is another joint ISO and CCITT group who aim to develop international standards for compression, decompression, processing and coded representation of video and associated audio. "MPEG video" currently consists of two standards, MPEG-1 and MPEG-2. Another, MPEG-4, is in the process of development. The one-time standard for HDTV, MPEG-3, was dropped since it was realised that MPEG-2 could cover the requirements.

The algorithms used for encoding and decoding may be implemented in software or in hardware (e.g. for decoding there are MPEG-decoder cards for PCs and dedicated hardware such as CD-i).

The MPEG-1 standard was released in 1992 optimised for the delivery of audio and video over low bit-rate channels (around 1.5Mbits/s) such as CD-ROMs (including Video-CD and CD-i) and digital telephone networks (T1s, Video on Demand and so on). The standard defines a system of dealing with video (based on H.261 [CCITT 1988-1990] and JPEG) and audio (based on MUSICAM technology). MPEG-1 was optimised to work at resolutions of 352x240 pixels at 30 frames/sec (NTSC) or 352x288 pixels at 25 frames/sec (PAL). MPEG-1 resolution is not however limited to these sizes and may go as high as 4095x4095 pixels at 60 frames/sec. Interlaced video applications (for example broadcast television) are not directly supported by MPEG-1. MPEG-1 is the standard used by a number of common (software-based) MPEG utilities. The strength of MPEG-1 is its high compression ratio with relatively high quality.
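To give a feel for the compression involved, the following Python sketch estimates the uncompressed bit rate at the two resolutions quoted above and compares it with a 1.5 Mbit/s channel. It assumes 12 bits per pixel (8-bit samples with colour information stored at reduced resolution), which is an assumption of this example rather than a figure from the text, and it ignores the share of the channel taken by audio.

```python
def raw_video_bit_rate(width, height, frames_per_sec, bits_per_pixel=12):
    """Uncompressed video bit rate in bits per second.

    bits_per_pixel=12 is an assumed figure for this illustration.
    """
    return width * height * frames_per_sec * bits_per_pixel


pal = raw_video_bit_rate(352, 288, 25)   # 352x288 at 25 frames/sec
ntsc = raw_video_bit_rate(352, 240, 30)  # 352x240 at 30 frames/sec
target = 1_500_000                       # around 1.5 Mbit/s channel

print(pal, ntsc)                  # both come to 30412800 bits per second
print(round(pal / target, 1))     # roughly 20 to 1
```

Under these assumptions both picture formats carry the same raw data rate, and fitting it into the channel requires a reduction of about twenty to one.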

MPEG-2 supports a wider range of resolutions and bit-rates than MPEG-1 and was designed with a different purpose in mind, though is 'back-compatible' in that MPEG-2 decoders should play-back MPEG-1 streams. The MPEG-2 standard was finalised in 1994 and addressed the issue of interlaced-video and scalability. This creates the potential to represent high quality digital video - higher quality images at higher bit rates. MPEG-2 streams at lower MPEG-1 bit rates will have lower quality than an MPEG-1 stream, but at its specified bit-rates (between 3-10Mbits/sec), MPEG-2 at 720x486 pixels NTSC delivers true broadcast quality video. MPEG-2 is used chiefly by broadcast and cable companies who require broadcast quality digital video.

GIF89a

The GIF89a format allows multiple images to be encoded within a single file with a specified frame-flick rate. GIF89a files are supported without plug-ins by many web browsers and are loaded only once. The limitations include a maximum of 256 colours; audio is not supported. Photographs are better compressed by the JPEG method used in MPEG.

AVI and QuickTime

AVI (Audio Video Interleaved), a.k.a. VFW (Video for Windows), is a Microsoft proprietary container format. Typically, the quality is moderate, low VHS standard at 15fps with mono sound in a quarter screen window. However, different codecs can obtain different results; for example, Intel's Indeo4 can produce MPEG-1 quality.

QuickTime was developed by Apple. Players or plugins are widely available for various platforms. QuickTime 3 opens MPEG, DV and AVI. The more fully-featured QuickTime tools, such as vector graphics, animation and 3D sprites, are still limited to the Mac platform.

Streaming Video

RealVideo is a proprietary format optimised for streaming (i.e. played while the file is still downloading) over the internet. Real Networks claim near-CD quality audio. The quality is downgraded from a QuickTime movie. Performance is better with fast connections (higher than 56k) but it will work at 28.8.

VivoActive is another proprietary streaming format. Vivo claims FM-quality audio. Independent reviews claim good video quality for a streaming format.

MBONE

Mention should also be made of the MBONE, although it is a medium not a format. The MBONE stands for IP Multicast Backbone on the internet. IP-Multicast is the class-D (224.0.0.0 - 239.255.255.255) addressing scheme in IP. IP Multicast-based routing allows distributed applications to achieve time-critical "real-time" communications over wide area IP networks, such as audio/visual teleconferencing. Special routers ("mrouters") distribute multicast data-streams efficiently to subscribed hosts. Further information on the JANET MBONE service can be found at <URL:http://www.ja.net/documents/mbone.html>.
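The class-D range quoted above corresponds to the network prefix 224.0.0.0/4. As a small illustration, the Python standard library's ipaddress module can test whether an address falls in that range; the function name is_class_d is ours, and the sample addresses are arbitrary.

```python
import ipaddress

# 224.0.0.0/4 spans 224.0.0.0 to 239.255.255.255, the class-D range.
CLASS_D = ipaddress.IPv4Network("224.0.0.0/4")


def is_class_d(address):
    """Return True if the dotted-quad address is an IP multicast address."""
    return ipaddress.IPv4Address(address) in CLASS_D


for candidate in ("224.2.127.254", "239.255.255.255", "192.0.2.1"):
    print(candidate, is_class_d(candidate))
```

The same check is available directly as the is_multicast attribute of an IPv4Address object.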

Three dimensional information

Many eLib projects will only be concerned with 2 dimensional pictures (even if they represent a 3 dimensional scene).

Various packages produce 3 dimensional files as output. Some of these relate to the CAD area though are more generally used for file exchange. IGES, AutoCAD DXF and PHIGS Archive files are examples of formats which exist but are probably of little relevance to the eLib programme.

Perhaps of more interest is VRML (Virtual Reality Modelling Language). This seems to be emerging as the standard for 3D objects within the Web. VRML is built on the Open Inventor technology from Silicon Graphics. VRML provides a graphical mark-up allowing links to other VRML and HTML resources. There are some public domain viewers.

Audio

A selection of sound format files popular on the internet are described below.

AU

A sound format developed by Sun/NeXT and possibly the most popular format for samples. Many utilities on a number of platforms support playing AU files. It is not highly compressed.

MP3

MP3 stands for MPEG audio layer 3. It is a highly compressed format (typically 1:12 for stereo files) which works by utilising weaknesses of human audio perception. Defined within the MPEG-1 and MPEG-2 standards, this format is popular where high quality (CD audio quality) reproduction of long samples is required. MP3s are also amenable to streaming and are supported by software on a number of platforms.
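The compression figure can be put in context with a short calculation. The sketch below assumes the usual CD audio parameters (44,100 samples per second, 16 bits per sample, two channels), which are not stated in the text above, and applies the typical ratio quoted for stereo files.

```python
# Assumed CD audio parameters (not taken from the text above).
SAMPLE_RATE = 44_100      # samples per second per channel
BITS_PER_SAMPLE = 16
CHANNELS = 2

cd_bit_rate = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS  # bits per second
mp3_bit_rate = cd_bit_rate // 12                        # typical ratio quoted

print(cd_bit_rate)    # 1411200 bits per second, about 1.4 Mbit/s
print(mp3_bit_rate)   # 117600 bits per second, about 118 kbit/s
```

On these assumptions a minute of stereo audio shrinks from roughly 10 MB to under 1 MB.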

WAV

The full name for this format is "RIFF WAVE"; it was developed by Microsoft and IBM. Its inclusion on Microsoft Windows machines has made it popular. Like AU files, WAVs are not highly compressed.

RAM/RM

The RealAudio streaming audio format. This provides (depending on the version) from AM-quality over 14.4 kbps up to near-CD quality over ISDN and LAN connections.

MIDI/MODs

Two types of format are popular for playing audio sequences, digital modules and MIDI. A digital module is a music file typically with the extension MOD, S3M, FAR or MTM. These formats contain musical patterns and a selection of sound samples, so that music can be reproduced from the patterns with the appropriate sample (instrument). MIDI files are essentially patterns and sequence information to drive a MIDI device. MIDI files may be played on a computer via a program or plugin which will emulate a number of the instruments defined in the MIDI standard.

Document delivery formats

The GEDI agreements are recommended for use in electronic document delivery projects.

This is the format for Generic Electronic Document Interchange (GEDI), overseen by ISO TC46/SC4 (the committee responsible for the standardization of computer applications in information and documentation). The working draft (Jan 1998) of this proposal suggests that files should be interchanged as images, using the TIFF, JPEG and PDF data interchange formats, and describes associated metadata. Information about GEDI can be tracked on the OII Standards Pages <URL:http://www2.echo.lu/oii/en/oiistand.html>

Character sets

If data is stored as a page image, special characters and symbols can be stored in Raster format along with the text. This may be appropriate where there is no requirement (ever) to discover the meaning of the symbols (for example chemical or mathematical symbols) or text (which may include diacritics or a range of character sets, for example Kanji).

There are some other solutions. One is to use an extended character set which moves beyond the standard printing characters. ISO have been working in this area for many years as a number of projects require standards for character sets to underpin their work. ISO 646 was the first of these standards and defines 128 representations of characters for digital storage (using 7 bits to define each character). ASCII is the American definition of ISO 646 (that is it has the dollar as its currency symbol). ISO 2022 moves on from that in allowing the same number of characters to be defined at any one time but permits character set switching giving sufficient access to Latin alphabets but does not address non Latin alphabets at all. In recognising this need, ISO developed first of all a series of 8 bit character sets in ISO 8859 and more recently a 32 bit character set in ISO 10646. ISO 8859 defines a series of nine 256-character alphabets, each of which consists of the US ASCII characters in the first 128 positions and a series of accented characters or non-Roman characters in the second 128 positions.

Between them the ISO 8859 character sets cover the great majority of European languages, including both Roman and Cyrillic alphabets for eastern European languages, together with Greek, Hebrew and Arabic. The far greater number of characters in the ISO 10646 standard is intended to address all of the world's written languages, including the wide range of characters and ideograms used in Asian languages, but its use in practical applications is at present negligible and it should be regarded as a standard for the future.
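The practical consequence of the ISO 8859 design is that a byte in the upper half of the table means different things depending on which part of the standard is assumed. The following Python sketch decodes the same byte value under parts 1 (Latin), 5 (Cyrillic) and 7 (Greek); the function name and the choice of byte values are ours, for illustration only.

```python
def decode_in_parts(byte_value, parts=(1, 5, 7)):
    """Decode one byte under several ISO 8859 parts.

    Returns a dict mapping the part number to the decoded character.
    """
    raw = bytes([byte_value])
    return {part: raw.decode("iso8859-%d" % part) for part in parts}


# Positions below 128 are the US ASCII characters in every part.
print(decode_in_parts(0x41))

# Positions 128 and above differ from part to part.
print(decode_in_parts(0xE9))
```

A file of 8-bit text is therefore only interpretable if the receiving system knows, or is told, which ISO 8859 part was used to create it.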

It is also possible to use some form of 'markup' which uses printing characters in a pre-defined way which enables the interpreter to recognise that this is a special symbol. An example of such markup is used in the text processing language called TeX. This language is used by many mathematicians and computer scientists when writing papers as it allows the use of symbols such as integral sign, summation symbols and so on. XML will be very important in this area.

Work is currently taking place in the Internet standardisation process to incorporate chemical symbols within MIME and Web standards. HTML 4.0 also addresses the need for extended character sets.

Character sets for textual files transferred via FTP or mail should normally be restricted to those specified in the MIME standard, i.e. US-ASCII for 7-bit text and ISO 8859-n (n = 1-9) for 8-bit text. The use of ISO 8859-1, also commonly known as ISO Latin-1, is particularly commended wherever applicable in view of its widespread availability on end-user devices such as terminals, personal computer systems and printers. Use of other ISO 8859-n variants may well require provision of additional fonts and/or special software (e.g. for right-to-left languages) on many such devices.

Compression

Compression should be mentioned in this section due to the space (and bandwidth & time) advantages of storing (and transferring) compressed formats. An archiving utility will represent a directory hierarchy and its contents in a single file from which the original hierarchy and contents may be cloned. A compression utility utilises an algorithm to encode a file into a smaller one (typically 30-60% smaller) whilst retaining enough information to re-build the original file.

Different tools may use different algorithms and encodings, which means that the compressed data must be un-compressed with the complement of the compression tool. A number of common utilities exist to allow archiving and/or compression. Traditionally many were born as platform-specific and have tended to remain popular on that platform. A number of popular, cross-platform tools are described below with a bias towards those that originated on Unix servers.

compress

A widely used and available Unix compression utility. Compressed files are recognised by their .Z extension.

gzip

Initially a Unix compression utility but now available for other platforms. Compressed files are recognised by their .gz extension.
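As a minimal sketch of lossless compression, the Python standard library's gzip module can compress a block of data and rebuild the original from it. The sample text is invented and deliberately repetitive, so the saving shown is much larger than would be typical for real documents.

```python
import gzip


def round_trip(data):
    """Compress data in gzip format, then restore it.

    Returns (compressed_bytes, restored_bytes).
    """
    compressed = gzip.compress(data)
    restored = gzip.decompress(compressed)
    return compressed, restored


# Highly repetitive sample data, chosen only for illustration.
text = b"eLib standards guidelines. " * 200
compressed, restored = round_trip(text)

print(len(text), len(compressed))   # the compressed form is far smaller
print(restored == text)             # the original is rebuilt exactly
```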

jar

A Java archive, a feature of JDK1.1 designed to be a platform independent format to transfer collections of Java files such as class files. Jar is very similar to tar (below) though is based around zip. Jar files have the extension .jar.

tar

An archiving utility commonly found on Unix systems. Archived files are recognised by their .tar extension. Often tar files are compressed and have extensions such as .tar.gz, .tgz or .tar.Z.
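The two-step combination of archiving and compression can be sketched with Python's standard tarfile module, which writes a tar archive and passes it through gzip in one operation. The file names and contents below are hypothetical, and the archive is built in memory purely to keep the example self-contained.

```python
import io
import tarfile


def make_tar_gz(files):
    """Build a gzip-compressed tar archive in memory.

    files maps a path inside the archive to its content as bytes.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def list_tar_gz(data):
    """Return the member names stored in a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        return archive.getnames()


# Hypothetical directory hierarchy with two small files.
data = make_tar_gz({"docs/a.txt": b"alpha", "docs/b.txt": b"beta"})
print(list_tar_gz(data))
```

The directory hierarchy is preserved in the member names, which is what allows the original structure to be recreated on extraction.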

zip

An archiving and compression utility. Very common on PCs (pkzip, WinZip) but with counterparts on many platforms (including Unix zip). Zip files have the extension .zip. Note that zip and gzip are not synonymous.

3. Metadata: Resource Identification and Description

The preceding discussion has dealt mainly with standards for storing 'data'. Here we discuss 'metadata' which we use inclusively to refer to names, locations, descriptive and evaluative data which facilitate identification, discovery, selection, location, use and exploitation of resources. Metadata will support programmatic and human use of data.

A wide variety of metadata models exist. Particular subject, professional or other communities may have more or less well developed metadata models. For example, within the library and bibliographic community an elaborate apparatus of metadata for (largely print) bibliographic items exists. Approaches exist for the description of geospatial data, archival collections, electronic texts, museum objects, social science data sets, and so on. These vary in scope, implementation, and design. At the same time, several simpler formats have emerged which support generic resource discovery. These include the Dublin Core, and the WHOIS++ templates in use by several of the eLib subject gateways. There is also considerable activity surrounding metadata for rights management.

This is an area of rapid evolution and a full account is not desirable here.

Resource identification

Network developments - uniform resource identifiers

The URL (Uniform Resource Locator) underpins the use of the Web. A bibliographic analogue is a library shelf mark. Locators are not permanent (objects may move, or may be accessed differently) and an object may have several locators.

There is clearly a need for another identifier which is not contingent on such accidental features as location or access method. A URN (Uniform Resource Name) is such an identifier; it is a persistent object identifier, assigned by a 'publisher' or some authorising agent. A bibliographic analogue is the ISBN. It will allow resources to be referenced without regard to their location or access method. It is anticipated that 'publishers' will create URNs for the resources they produce or authorise other agencies to do so. There will have to be a registry process, for both publisher names, and URNs. The use of URNs assumes a resolution service, which dynamically resolves URNs to URLs. The IETF URN Working Group is currently defining a persistent identifier for information resources known as the Uniform Resource Name. The URN and the more familiar URL together make up the set of resource identifiers known as Uniform Resource Identifiers (URIs) that are used to identify and locate information on the Web. The requirements for URNs are defined by RFC 1737.
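The idea of a resolution service can be sketched as a simple lookup from a name to one or more locators. The Python fragment below assumes the general "urn:namespace:name" form for a URN; the lookup table, the namespace used and the example.org URLs are all hypothetical, and a real resolution service would be a distributed network service rather than an in-memory table.

```python
def parse_urn(urn):
    """Split a URN of the assumed form urn:<namespace>:<name>.

    Returns (namespace, name); raises ValueError if the form is wrong.
    """
    scheme, namespace, name = urn.split(":", 2)
    if scheme.lower() != "urn" or not namespace or not name:
        raise ValueError("not a URN: %r" % urn)
    # The scheme and namespace are treated as case-insensitive here.
    return namespace.lower(), name


# Hypothetical resolver table: one persistent name, several locations.
RESOLVER = {
    ("isbn", "0-306-40615-2"): [
        "http://www.example.org/books/0306406152",
        "http://mirror.example.org/books/0306406152",
    ],
}


def resolve(urn):
    """Return the list of URLs currently registered for a URN."""
    return RESOLVER.get(parse_urn(urn), [])


print(resolve("URN:ISBN:0-306-40615-2"))
```

If the resource moves, only the table entry changes; references that cite the name remain valid.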

The Digital Object Identifier is a system developed by Bowker and CNRI (the Corporation for National Research Initiatives) in the US, following a Request for Proposals for technologies to identify digital content issued by the Association of American Publishers. The DOI system has three components: the identifier, the directory and the database. The system allows identifiers to be assigned at various levels, and for other systems (e.g. SICI, ISSN) to be included. The directory is a distributed system based on CNRI's Handle system (one proposed solution to the URN requirement) and provides a mapping from DOIs to URLs. URLs may point directly to objects, or to a 'response screen' which provides further details about the object. Individual publishers may maintain databases which contain data about objects. A DOI foundation has been set up to manage and develop the system. The DOI is currently being trialled. It has initially been aimed at the 'traditional' publishing industry, and there are plans to use the DOI as the basis for copyright management systems.

eLib services should be prepared to adopt persistent identifiers when they are stabilised.

Bibliographic identifiers

A widely applied standard apparatus exists for the identification of monographs, the ISBN.
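One reason identifiers such as the ISBN work well in automated systems is that the final character is a check character, so many transcription errors can be detected. The following Python sketch validates a ten-character ISBN using the modulus 11 rule; the function name is ours, and the sample numbers are used purely as test values.

```python
def is_valid_isbn10(isbn):
    """Check a ten-character ISBN using the modulus 11 check character.

    Hyphens and spaces are ignored; 'X' in the last position counts as 10.
    """
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char in "Xx" and position == 9:
            value = 10
        elif char.isdigit():
            value = int(char)
        else:
            return False
        # Weights run from 10 for the first character down to 1 for the last.
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10("0-306-40615-2"))
print(is_valid_isbn10("0-306-40615-3"))
```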

In the case of serials, items need to be identified at (at least) three different levels: title, issue, and article. The ISSN caters for the first of these, although it is only available on recent serials.

The Serial Item and Contribution Identifier (Z39.56 or 'sicky'), has been developed to assist in the unique identification of serial 'Items' (including issues) and 'Contributions' (including articles). This is a US national standard initiated by SISAC which is beginning to be widely adopted in the serials industry. The SICI extends the ISSN with other data derived from issue and article details. The SICI may be created by the originator of the published object (publisher) or by some third party (subscription agent, document supplier).

eLib projects would benefit from the ability to uniquely identify articles, however that is achieved. The SICI seems to be the most fruitful approach here but presents implementation issues.

Resource description

Specialist approaches: community, discipline, domain

As noted above a range of specialist approaches is in use within particular professional or disciplinary domains. Quite often, there are elaborate apparatuses for content encoding which include a metadata component. There are major ongoing standards and consensus making initiatives. Examples include EAD and ISAD(G) for archives, TEI for electronic texts, and FGDC for geospatial data. Several eLib projects will be using specialist approaches of this or other types. Where this is the case, they should be prepared to establish to the Programme Director's satisfaction that they are adhering to best practice within their domain.

Generic resource description for resource discovery

WHOIS++ templates

The IAFA (Internet Anonymous FTP Archive) Group of the IETF (now disbanded) produced recommendations for the description of resources on anonymous FTP archives. A number of objects are identified ('user', 'organisation', 'siteinfo', 'document', 'image', and so on), and templates consisting of multiple attribute-value pairs defined for each. These have been adapted for use in the WHOIS++ protocol. They are used by services in the UK and elsewhere which make use of the ROADS system, an eLib project which provides a set of tools for managing information gateways. They are also used in other contexts.
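A template of attribute-value pairs is straightforward to process. The Python sketch below reads a record written as "Attribute: value" lines into a dictionary, allowing an attribute to repeat. The attribute names and values in the sample are illustrative only and are not claimed to match the attribute sets defined for IAFA or ROADS templates; continuation lines are not handled.

```python
def parse_template(text):
    """Parse 'Attribute: value' lines into a dict of lists.

    Repeated attributes accumulate in the order they appear.
    """
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so values may contain colons.
        attribute, _, value = line.partition(":")
        record.setdefault(attribute.strip(), []).append(value.strip())
    return record


# Hypothetical record; attribute names are for illustration only.
SAMPLE = """Template-Type: DOCUMENT
Title: eLib Standards Guidelines
Keywords: standards
Keywords: metadata
URI: http://www.example.org/standards/
"""

record = parse_template(SAMPLE)
print(record["Title"], record["Keywords"])
```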

Dublin Core

The Dublin Core is a simple metadata element set which has generated considerable international interest. It was developed to support resource discovery on the web and is still under development through an international series of workshops. At the time of writing, agreement has been reached on a simple set of 15 unqualified elements, and on how to encode Dublin Core in HTML 2.0 and HTML 4.0. Agreement on a set of qualifiers which refine the meaning of the elements is almost complete. In an interesting recent development, it has been decided to deepen the data model on which Dublin Core rests, and this work is now ongoing, to be formally expressed in RDF (Resource Description Framework, a W3C initiative described below).

eLib projects should not seek to develop new approaches but build on work being done elsewhere. Projects should associate Dublin Core metadata with all substantive web pages belonging to their project, following best practice as described on the Dublin Core home page.
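As a hedged sketch of what embedding Dublin Core in a web page might look like, the following Python function generates HTML META elements using the commonly seen "DC." prefix on element names. The exact recommended syntax should be taken from the Dublin Core home page; the function name and the sample values here are hypothetical.

```python
from html import escape


def dublin_core_meta(elements):
    """Render (element, value) pairs as HTML META elements.

    The 'DC.' name prefix is an assumed convention for this sketch.
    """
    lines = []
    for element, value in elements:
        # Escape the value so that characters such as & and " are safe.
        content = escape(value, quote=True)
        lines.append('<meta name="DC.%s" content="%s">' % (element, content))
    return "\n".join(lines)


# Hypothetical description of a project page.
print(dublin_core_meta([
    ("title", "Example Project Home Page"),
    ("creator", "Example Project Team"),
    ("subject", "electronic libraries & standards"),
]))
```

The generated elements would be placed in the HEAD of each substantive page, where they are invisible to readers but available to indexing software.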

RDF

RDF stands for the Resource Description Framework. RDF is an activity coordinated by the World Wide Web Consortium.

RDF is a specialist XML application which provides a generic framework for describing metadata. A range of application areas are currently being addressed.

Bibliographic resource description

Libraries have concentrated on the creation and exchange of metadata which describes monograph resources. More or less widely used agreements exist for description (ISBD), cataloguing (AACR2), classification (LC, DDC, UDC, ...), subject headings (LCSH, ...), and so on. Agreements exist for the designation of such data in machine-readable form (*-MARC, CCF, ...) and for its encoding for exchange (ISO 2709). These are not further discussed here.

However, various areas are outstanding and need further attention.

Serials data

There is no consistent approach to the description of serials, at the level of the title, the issue or the article. Nor is there a consistent approach to the recording of holdings data. Libraries typically record brief data about titles, in a variety of ways. Serials data is sometimes integrated into the catalogue, sometimes not. Table of contents providers, abstracting and indexing services, and other data suppliers record descriptions about articles in different ways.

Several eLib projects propose resource sharing scenarios, or 'distributed' document delivery services. Given the data that is currently available, these systems would rely on a high level of manual intervention. Automation will require improvement of available data and, in some proposed scenarios, the matching of records (e.g. matching of records returned from BIDS against a local holdings file, the holdings file of another library, and so on).

US-MARC defines a format for serial holdings data and several approaches exist for description of articles. This is an area in which CEI encourages projects to work together towards consensus and is supporting further investigation. BIC (Book Industry Communication) has also initiated some work in this area.

UKOLN has commissioned several supporting studies for eLib in this area. Further information can be found at the MODELS supporting studies page.

Terms of availability

A variety of approaches exist to describing the terms of availability of a resource. This is an area which requires more work. Again, eLib projects are encouraged to seek consensus with other projects in this area. A report on terms of availability data, prepared by David Martin and Mark Bide, is available at the MODELS supporting studies page.

'Collections'

A distinction between an 'item' and a 'collection' is common in several contexts. Different communities of interest and metadata models address levels of aggregation and granularity in different ways.

Within the library community there has been limited effort in describing collections, or the collections of metadata which represent them (catalogues). Schemes such as Conspectus, developed by the Research Libraries Group, exist, but are not widely deployed. However it seems obvious that a corollary of making catalogues and other resources more widely available is the creation of metadata which would allow a user, or user agent, to select them as resources of potential interest. This saves the time of the user, but may also assist sensible use of network resources.

With regard to the collection itself there is little consistent guidance in terms of special collections, subject strengths, ILL or external reader policies, and so on. With regard to the databases which represent collections (catalogues) there is not much experience of describing them in ways which would assist selection or query routing. A similar situation obtains with other databases (e.g. abstracting and indexing services). These issues are the subject of another eLib supporting study which is under preparation at the time of writing. It has been recognised as an important area needing attention. The Interoperability Focus will provide support in this area.

'Services' or 'applications'

Distributed services will benefit from well understood approaches to machine-readable descriptions of machine interfaces. Several approaches exist. CEI will work with projects to identify a preferred approach.

Classification of network resources

No recommendation is made here.

Resource evaluation

Various proposals exist in the Web world for resource evaluation. No recommendation is made here.

4. Security, authentication and payment services

Security overview

A simple summary of the state of security on the Internet at the current time is that it is inadequate, and that systems designed to improve the situation are immature. A variety of systems for doing commerce over the Internet do exist, but are not generally widely implemented. This section discusses some of the background involved.

It is recommended that eLib projects monitor CERT and vendor security announcements and implement relevant security measures. eLib projects should also address the most appropriate level of security for their project.

A commercial transaction

Imagine you are in a large company, bidding for a multi-million pound contract, where the bid has to be sent by email. In view of the possibility of further litigation, and to protect your bid, you and the bid's recipient will need to be confident of several things:

no other bidder (i.e. no-one else) can intercept and read your bid; only the intended recipient can read it
the version read is exactly the version sent, i.e. it has not been tampered with in transit
the recipient can be certain it comes from you
you cannot later deny you sent the bid
the recipient cannot later deny the bid was received
the times of various parts of the transaction can be reliably certified
these facts can be proved in a court.

You may also wish to ensure that another bidder cannot tell whether or not you have made the bid, by watching email traffic.

This little scenario contains many of the important issues in email security. Standard email provides no confidence in any of these issues, as with relatively little ingenuity, everything can be intercepted, faked, re-played, or denied.

One quite often sees services offered on the Internet asking you to quote your credit card numbers in the clear. You should think carefully before doing this, as it is potentially very insecure (although not necessarily much less secure than many other aspects of the credit card system, such as telephone use, touch-tone phone use in hotels, etc.).

Not all of the issues raised by the scenario above are required all of the time. Often, one only wants a simple minimum of protection from casual prying eyes. And systems designed to cope with the case described (where high costs including sending the transaction via specially equipped notary systems can be justified) are likely to be very different from those designed to cope with micro-transactions, where small fractions of a penny might be charged for accessing some information, and the transaction cost must be even more minute.

Government controls

While good network citizens may be very concerned that the Internet is insecure, governments have different concerns. Most of the suggested improvements to Internet security are highly secure, much more than is needed by the average citizen, but perfect for those who might wish to hide illegal activities, or even transmit espionage. Consequently, many governments have classified encryption as a weapons technology, and apply controls. The most painfully obvious of these are the US export controls, which are much more severe and far-reaching than is generally realised.

It is not legal to export software products containing strong encryption technology from the US without a licence, usually granted only for systems constrained in some way so that they cannot be used for other than limited purposes: authentication is permitted, for example, but encryption is not. Nevertheless, legal versions of most cryptographic systems can be obtained from sources outside the US.

France also has severe controls, and any encryption without a licence is prohibited.

Symmetric and public key encryption

In symmetric encryption, the same key is used to encrypt and decrypt the information. This means that the key is a shared secret, and the information is compromised if either end is compromised; the key must also be transported by some other means. You need a different key for each pair of correspondents, as well, or the likelihood of the secret becoming known increases greatly. Shared password systems are not secure!
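
The need for a different key for each pair of correspondents grows quickly with the size of the group. The following sketch simply counts the pairs; the names are hypothetical.

```python
from itertools import combinations

def pairwise_keys(correspondents):
    """List every pair of correspondents that needs its own shared secret key."""
    return list(combinations(sorted(correspondents), 2))

pairs = pairwise_keys(["alice", "bob", "carol", "dave"])
print(len(pairs))  # 4 correspondents need 4 * 3 / 2 = 6 separate keys
```

In general n correspondents need n(n-1)/2 keys, each of which must be transported securely by some other means.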

Public key, or asymmetric systems work very differently. In such a system, I have a private key (usually for de-crypting) and a public key (usually for encrypting). The normal way they are used is as follows: to send a secure message to me, you encrypt it with my public key (which is widely available); it can only be de-crypted with my private key, which I keep very secure. The most widely used public key cryptosystem is called RSA, after the initials of its authors (Rivest, Shamir and Adleman); the RSA algorithms are protected by patent in the US but not elsewhere in the world.
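
The relationship between the two keys can be sketched with a toy RSA key pair. The primes below are deliberately tiny and chosen only for illustration; real keys use primes hundreds of digits long, and real systems add padding which is omitted here.

```python
# Toy RSA with deliberately tiny primes; NOT secure, for illustration only.
p, q = 61, 53
n = p * q                    # public modulus (3233)
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # public exponent, chosen coprime to phi
d = pow(e, -1, phi)          # private exponent: modular inverse of e (Python 3.8+)

def encrypt(message, public_exponent=e, modulus=n):
    """Encrypt an integer smaller than the modulus with the public key."""
    return pow(message, public_exponent, modulus)

def decrypt(ciphertext, private_exponent=d, modulus=n):
    """Recover the integer with the private key."""
    return pow(ciphertext, private_exponent, modulus)

ciphertext = encrypt(65)           # anyone may do this with the public key (e, n)
recovered = decrypt(ciphertext)    # only the holder of d can undo it
```

Only (e, n) is published; d stays with the key owner.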

Authentication

Usernames

The standard means of authentication in computer systems has been the username/password combination. To identify yourself to a shared computer such as a Unix system or a Novell Netware server, you provide a username you have been assigned, and a password which (usually) is known only to you. The system checks the password against a database it holds; usually the password is encrypted in the database (in the case of Unix, the encryption is now quite "weak", with the huge advances in computing performance; it cannot easily be decrypted, but dictionary attacks, where huge numbers of passwords from a dictionary are encrypted, are now easy and quite common).

For this reason the choice of password is important - anything found in an English dictionary can be guessed. Passwords consisting of randomly chosen characters are almost as bad because they are difficult to remember and tend to be written down. A good password should look random, but have some meaning which can aid its recall.
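
A dictionary attack of the kind described above can be sketched in a few lines. SHA-256 is used here as a stand-in for the Unix password function, and the salt and word list are hypothetical; the point is only that the attacker never decrypts anything, but hashes guesses until one matches.

```python
import hashlib

def hash_password(password, salt):
    """One-way hash of a salted password (SHA-256 stands in for Unix crypt)."""
    return hashlib.sha256((salt + password).encode("utf-8")).hexdigest()

def dictionary_attack(stored_hash, salt, wordlist):
    """Hash each candidate word and return the one matching the stored hash, if any."""
    for word in wordlist:
        if hash_password(word, salt) == stored_hash:
            return word
    return None

salt = "ab"  # hypothetical salt stored alongside the hash
wordlist = ["apple", "secret", "password", "zebra"]  # stand-in for a full dictionary

weak = dictionary_attack(hash_password("secret", salt), salt, wordlist)
strong = dictionary_attack(hash_password("Xk7#qv!mT2", salt), salt, wordlist)
```

Here the dictionary word is recovered while the random-looking password is not.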

This username/password scheme provides relatively little protection in most cases. Computer systems are highly vulnerable to hacker attack, and passwords can often be quite easily cracked. Passwords can travel over the networks in clear text form. Terminal sessions are often abandoned while still logged in, or temporarily empty offices (with logged in terminal sessions) can be entered.

While the protection offered by these schemes is usually enough for local use, it may not be enough to convince peers at another site, who know nothing of the security procedures in place at your site. So it forms an inadequate long term basis for authentication.

One-time password systems, such as Bellcore's S/Key (being standardised by the IETF OTP working group), provide much better protection. This is because each password may only be used once - in order to authenticate yourself a second time, you will need a new password. The one-time passwords are typically generated by dedicated hardware or password calculator programs, which are available for most Macs, PCs and Unix boxes.
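
The idea behind S/Key-style one-time passwords can be sketched as a chain of one-way function applications. This is a minimal sketch under stated assumptions: SHA-256 replaces the hash function used by the real system, and the class and variable names are hypothetical.

```python
import hashlib

def h(data):
    """One application of the one-way function (SHA-256 here; an assumption)."""
    return hashlib.sha256(data).digest()

def chain(secret, count):
    """Apply the one-way function `count` times to the secret."""
    value = secret
    for _ in range(count):
        value = h(value)
    return value

class Server:
    """Stores only the last accepted value, never the user's secret."""
    def __init__(self, initial_value):
        self.stored = initial_value

    def login(self, one_time_password):
        # A valid password hashes to the stored value; it then replaces it,
        # so the same password is rejected if replayed.
        if h(one_time_password) == self.stored:
            self.stored = one_time_password
            return True
        return False

secret = b"correct horse"        # hypothetical user secret
server = Server(chain(secret, 100))
first = chain(secret, 99)        # password for the first login
second = chain(secret, 98)       # password for the second login
```

An eavesdropper who captures one password cannot derive the next, because that would require inverting the one-way function.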

Whilst one-time password systems provide protection against passwords being grabbed during the initial authentication phase for protocols such as Telnet and FTP, they do not protect against network traffic monitoring - which may yield further passwords and sensitive information. There has been some deployment of network protocols which use strong encryption to protect against this danger, notably ssh - which attempts to provide a secure mechanism for remote login, remote command execution and file transfer.

These days, most people use personal computers as their primary interface, without any form of authentication other than access. Many but not all do have some level of username/password authentication through systems such as Novell Netware.

Kerberos

Project Athena at MIT tackled the problem of authentication over an insecure network many years ago, very successfully, with the Kerberos system. Unfortunately, the system did not become widely adopted, partly because it was applicable only to users of shared computers and did not support personal computers. The adoption of Kerberos as a basis for providing a security service within the OSF Distributed Computing Environment (DCE), and the intention of Microsoft to include support in the next major operating system releases, suggest that future support by vendors is likely.

Kerberos remains appropriate for authentication within a single organisation but is not (yet) recommended as an authentication system for services to be used by arbitrary institutions.

Digital signatures

Digital signatures provide a way that you can be sure a piece of information created by someone has not been altered. They depend on public key encryption.

To form a digital signature, I construct a digest from the information to be signed. This digest is a short piece of information derived from the full information by an algorithm; an example is a checksum, although more complex algorithms such as MD5 are usually used. Then I encrypt this digest with my private key, and append it to the full information, with a specification of the digest algorithm used and information about my public key. Now anyone can decrypt this signature with my public key, create a digest from the full information and compare it with the digest from the signature. If they match, the full information has not been altered since it was signed by me.
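
The signing and checking steps described above can be sketched as follows. This is an illustration only: the key is a toy RSA key built from two small well-known primes, SHA-256 is substituted for MD5 as the digest algorithm, and the padding used by real signature schemes is omitted.

```python
import hashlib

# Toy RSA key built from two well-known Mersenne primes; far too small for real use.
p, q = 2**61 - 1, 2**31 - 1
n = p * q
e = 65537                            # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (Python 3.8+)

def digest(message):
    """Reduce the message to a short integer digest (SHA-256, reduced modulo n)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(message):
    """Encrypt the digest with the private key to form the signature."""
    return pow(digest(message), d, n)

def verify(message, signature):
    """Decrypt the signature with the public key and compare digests."""
    return pow(signature, e, n) == digest(message)

bid = b"Our bid is 1,000,000 pounds"   # hypothetical message
signature = sign(bid)
```

Anyone holding the public key (e, n) can call verify; any change to the message changes its digest, so the comparison fails.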

Certificates

A certificate is a signed piece of structured information identifying the owner of a public key. If you trust the signer of the certificate and you trust the public key owner to keep the private key secure, then you can trust that something signed with the private key corresponding to the public key given in the certificate actually was signed by that person.

Quite complicated systems may have to be built so that you can be sure you can trust the person who signed the certificate, and further complicated systems (certificate revocation lists, or CRLs) are needed to cover the case where the private key has been compromised, e.g. copied by someone from an insecure computer system! Some of these systems are described in RFC 1422, which covers a hierarchical system based on institutional Certification Authorities (CAs) which would provide certificates for individuals; CAs in turn are certified by Policy CAs, which are certified by the Internet PCA Registration Authority. There have been some delays in setting this up because of concerns about legal liability.
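
The hierarchical model can be sketched as a walk from an individual's certificate up through its issuers to a trusted root, consulting a revocation list at each step. This sketch shows the structure only: checking the signature on each certificate is deliberately omitted, and all names and serial numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Certificate:
    subject: str
    issuer: str
    serial: int

def chain_is_trusted(subject, certificates, root, revoked_serials):
    """Walk from a subject up through its issuers until the trusted root is reached."""
    by_subject = {cert.subject: cert for cert in certificates}
    seen = set()
    current = subject
    while current != root:
        cert = by_subject.get(current)
        # Fail on a missing certificate, a revoked one, or a loop in the chain.
        if cert is None or cert.serial in revoked_serials or current in seen:
            return False
        seen.add(current)
        current = cert.issuer
    return True

certificates = [
    Certificate("A. User", "Example University CA", 1001),
    Certificate("Example University CA", "Example Policy CA", 2001),
    Certificate("Example Policy CA", "Root Registration Authority", 3001),
]
```

A revoked certificate anywhere in the chain invalidates everything issued beneath it.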

The generally agreed mechanism for distributing certificates has been the X.500 directory system (and certificates are usually formatted according to X.509, with particular options selected, see RFC 1422). Although certificates may be distributed by almost any means, including in the email message itself, a factor in the low penetration of secure systems has been the low acceptance of X.500. It is not certain whether current systems will evolve to a high degree of acceptance.

eLib projects cannot determine their own authentication structure; it should be a national (or at least a JANET) activity. The Advisory Committee on Networking (ACN) has decided to do a study on national requirements for authentication technology.

ATHENS 3

In the absence of a strong authentication system covering UK HEIs, JISC is establishing an interim authentication system called ATHENS 3. Designed by NISS and implemented robustly with replicated security servers, ATHENS 3 provides a common username/password scheme across UK HE. Users can be authorised to use a variety of services or datasets with the same username/password, meaning that life is simpler for the user and for local administrators.

Information about ATHENS 3 is available at <URL:http://www.athens.ac.uk/>.

ATHENS 3 is recommended for new JISC network-based services.

Security systems

Email

Normal Internet mail is fundamentally insecure, and should be treated as such. It is easily spoofed, rather less easily intercepted, and liable to end up on unexpected screens if, for example, mis-directed (copies often sent to postmaster).

PGP

PGP, although its use by individuals is growing rapidly, is not yet an Internet standard and the author was for some time under threat of indictment on a charge of violating US export controls. Recent indications are that the latter threat has now been lifted. Implementations of PGP which are legal in most non-US countries, avoiding the use of the patented RSA libraries and making use of code created outside the US, are available for most common computing platforms.

PGP offers good encryption capabilities, but "weak" authentication. Certificates can be signed by anyone; the idea is to create a web of trust. If I trust A, who has signed a certificate for B, in theory I can trust B and even C who has a certificate signed by B. This is adequate for personal privacy in most cases (normal email is based on much less trustworthy systems), but not adequate for commercial transactions, at least of any size. It would generally be adequate for transmitting credit card information (but ensure date/time and other transaction details are included to avoid a replay attack: just encrypting "my credit card no is xxx" could be dangerous!).

In its simplest use, PGP-encrypted messages are normally sent as ASCII text via SMTP mail. The use of PGP in conjunction with MIME is discussed below under MOSS.

PGP is likely to be adopted by the UK HE community; there are reports on a pilot programme available at <URL:http://tech.ukerna.ac.uk/pgp/secemail/>.

GNUPG

GNUPG is a complete and free GNU replacement for PGP. Because it does not use the IDEA or RSA encryption algorithms it can be used without any restrictions. GNUPG is in most aspects compatible with the OpenPGP draft.

PEM

Privacy-enhanced mail (PEM) has been under development for many years. It has now reached the status of an Internet standard, which includes both encryption and an authentication structure of certificates. Few implementations exist, and very few are commercially available outside the US. The authentication structure is based on a hierarchy of Certification Authorities (CAs), rooted at the Internet CA and with intermediate Policy CAs. There have been legal problems in setting up the ICA, and few PCAs exist.

PEM is only suitable for standard Internet mail text messages, and is not compatible with MIME encoding. This partly explains its low penetration, as mailer development has moved towards MIME with its ability to transmit non-text messages.

The PEM RFCs are:

Privacy Enhancement for Internet Electronic Mail: Part I: Message Encryption and Authentication Procedures RFC 1421
Privacy Enhancement for Internet Electronic Mail: Part II: Certificate-Based Key Management RFC 1422
Privacy Enhancement for Internet Electronic Mail: Part III: Algorithms, Modes, and Identifiers RFC 1423
Privacy Enhancement for Internet Electronic Mail: Part IV: Key Certification and Related Services RFC 1424

Full PEM should not be used by eLib projects as such because of the previous recommendations concerning PGP.

MOSS

MIME Object Security Services (MOSS) provides a generalised framework for encrypting and/or authenticating body parts of MIME messages, including non-text body and multipart MIME messages. RFC 1848 specifies MOSS and is on track to become an Internet standard. RFC 1847 specifies the multipart/signed and multipart/encrypted MIME types. MOSS can in principle be implemented for any encryption technology: the best approach to doing this using PGP is the subject of two current drafts, <draft-elkins-pem-pgp-02.txt> and <draft-kazu-pgp-mime-00.txt>, of which the work by Michael Elkins seems currently to be the preferred approach. This is an area of active development which will be monitored by CEI.

Web

As a medium designed for the publication of documents to a wide audience, the World Wide Web was not equipped with many facilities for controlling access to documents. As the web has evolved, the need for stronger security facilities and access controls has become highlighted.

The stateless nature of HTTP/1.0 contrasts with protocols such as FTP where user authentication occurs at the login process and authentication persists until logout. HTTP/1.0 was designed so that servers did not have to retain any information from one connection to the next; this means that authentication information, if required, must also be included with each request. As long as the same authentication applies to a set of documents (for example, below some directory path in the URL) then the browser may store and re-issue the appropriate credentials at each request. Any such system using the same credentials is vulnerable to replay attacks, but one-time password systems such as S/Key would not be appropriate for the large number of requests issued for a typical set of web pages. In HTTP/1.0 the username/password combination is encoded but not encrypted, making it vulnerable to network sniffing.
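
The difference between encoding and encryption can be seen in a short sketch of the basic authentication header. The function names are hypothetical; the encoding itself is ordinary base64, which anyone can reverse without a key.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")

def sniff_credentials(header_value):
    """Recover the username and password from a captured header; no key is needed."""
    encoded = header_value.split(" ", 1)[1]
    username, password = base64.b64decode(encoded).decode("utf-8").split(":", 1)
    return username, password

header = basic_auth_header("Aladdin", "open sesame")
captured = sniff_credentials(header)
```

Because the same header is re-sent with every request, a single captured request is enough to recover the credentials.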

The persistent connections offered by HTTP/1.1 are still not the kind of session over which strong authentication can be deployed, although HTTP/1.1 introduces an improved method called digest authentication. This method incorporates calculating an MD5 digest based upon the username, password and a value returned by the server (called a nonce). Though this improves the situation, digest authentication still does not provide a strong authentication mechanism. Stronger security can only be incorporated via cryptographic methods. This includes Netscape's Secure Socket Layer (SSL), a TCP-based security mechanism described in the TCP/IP section. SHTTP is no longer in widespread use.
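
The digest calculation can be sketched as follows. This is a minimal sketch of the general shape of the scheme rather than a complete implementation: the inclusion of the realm, request method and URI follows the digest authentication specification as understood here, and all sample values are hypothetical.

```python
import hashlib

def md5_hex(text):
    """Return the MD5 digest of a string as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def digest_response(username, realm, password, nonce, method, uri):
    """Compute the digest a client returns in place of its password."""
    ha1 = md5_hex("%s:%s:%s" % (username, realm, password))   # secret part
    ha2 = md5_hex("%s:%s" % (method, uri))                      # request part
    return md5_hex("%s:%s:%s" % (ha1, nonce, ha2))

# The server issues a fresh nonce; the password itself never crosses the network.
first = digest_response("user", "example realm", "secret", "nonce-1", "GET", "/index.html")
second = digest_response("user", "example realm", "secret", "nonce-2", "GET", "/index.html")
```

Because the nonce changes, a captured response cannot simply be replayed against a later challenge, although the content of the request and response remains unencrypted.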

HTTP-NG should support more secure authentication and encryption.

TCP/IP

Security was not a major concern in the development of the basic TCP/IP Internet protocol suite, but is being addressed in the development of IP version 6 (RFC 1883). Unfortunately, although the security features of IPng (see RFC 1825 and related documents) are mandatory, they are still subject to the US export restrictions. Development and deployment are likely to be protracted.

The Secure Socket Layer protocol developed by Netscape has been adopted by the IETF under the name Transport Layer Security (TLS). The model is based on the idea of providing security at a low level (the transport layer) which provides an encrypted TCP connection between client and server machines. Protocols and applications above TCP in the protocol stack are thus secure.
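
The layering can be sketched with a present-day TLS library: an ordinary TCP connection is opened and then wrapped, and the application protocol runs unchanged over the wrapped connection. The function names are hypothetical, and the sketch assumes a client that verifies the server certificate against the Certificate Authorities installed on the local system.

```python
import socket
import ssl

def make_context():
    """Create a client context that verifies the server certificate and host name."""
    return ssl.create_default_context()

def fetch_over_tls(host, request, port=443):
    """Open a TCP connection, wrap it in TLS, and send an application request."""
    context = make_context()
    with socket.create_connection((host, port)) as tcp:
        # Everything sent after this point is encrypted below the application layer.
        with context.wrap_socket(tcp, server_hostname=host) as secure:
            secure.sendall(request)
            return secure.recv(4096)
```

The application request (for example an HTTP request) is passed as bytes exactly as it would be over plain TCP.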

Arrangements must be made for SSL channels to pass through firewalls and other proxies which may also provide routes for unwelcome visitors. Since the channel is encrypted, there is no method of monitoring what is passing through the firewall.

SSL provides confidentiality and authentication of request and response messages. It can be used to exchange certificates to authenticate the server and client machines; however, these assume the presence of (commercial) third party Certificate Authorities which may not be appropriate within the UK. No record is kept of each authentication so non-repudiation is not possible. The main problems with SSL are the low level of security available in the export version and the difficulty of interacting with application-specific intermediaries such as proxies and caches. Despite its strong commercial support, SSL is not a complete solution to web security.

Payment systems

Given the fundamental insecurity of the Internet, it is perhaps surprising that so much Internet-based commerce does take place. Recently there has been a flurry of announcements of schemes for commerce over the Internet, ranging from the simple to the technologically very complex smart card systems. A good reference to the current state of play is an article in the June 1995 issue of Byte magazine.

At this time no recommendation is possible on payment systems to be used by eLib projects. It is desirable that a study be done to determine if such a recommendation can be made.

One concern is that current systems are widely diverse, and generally require both buyer and seller to use the same system. There is the possibility that each buyer may have to use many payment systems in order to do transactions with many different sellers, unless schemes can be devised to provide clearance of funds between payment systems.

Clifford Neuman suggests "Recently proposed, announced and implemented Internet payment systems can be grouped into three broad classes: electronic currency systems, credit-debit systems, and systems supporting secure presentation of credit card numbers."

In the first group he identifies systems such as Digicash and NetCash; in the second, NetBill, First Virtual and NetCheque. The main contender in the third group above is Secure Electronic Transactions (SET).

First Virtual is interesting as it requires no encryption, is moderately cheap, and is operating now. It is very much a trust-based system, somewhat akin to shareware, but might prove a good start for reasonable-sized transactions.
