eLib standards guidelines

Lorcan Dempsey, Anne Mumford, Alan Robiette, Chris Rusbridge on behalf of FIGIT

Version 1.0, 26 February 1996

General

Introduction

This document provides recommendations for the selection and use of standards in eLib projects. Use of suggested standards is strongly encouraged where relevant, and projects will be asked to justify alternative or competing choices to the Programme Director.

It is intended that this document will be continually revised to reflect best practice and new developments. General discussion of the issues it raises will be carried out on lis-elib, a mailbase list. Please send any queries about the use of standards in eLib projects to Chris Rusbridge, Programme Director. Please send any comments about the page itself to Lorcan Dempsey, UKOLN. A selective summary appears at the end of the document.

The discussion focuses on the following areas:

data communications services; data interchange standards; metadata; search and retrieve services; security, authentication and payment services.

Flavours of formats and standards

A significant problem with many standards is that several versions are in use. An 'interpreter' may not be able to process a file if the 'generator' has used an earlier version of the relevant standard, for example. There may also be problems where suppliers offer 'value added' versions of standards in their implementations. This may constrain any user of the data to use a particular supplier's software. There may be problems caused by standards not being well specified and writers of software interpreting them differently. Finally, some standards may have more features than are likely to be used in practice; inter-working problems may arise where different subsets are used in different implementations.

This problem is usually addressed by specifying a set of functions or elements which are permitted by an agreeing group together with any limits on parameters (character sets permitted, header information, set of standards which are allowed) and any agreed extra features. Examples of specifications come from both standards groups and industry or research consortia.

One of the most widely accepted specifications in the electronic publishing area is that adopted under the CALS programme from the US Department of Defence (the meaning of the acronym is currently Continuous Acquisition and Logistics Support but this has changed over the years). CALS specifies a suite of standards suitable for the online storage of information which concerns both the production and maintenance aspects of military equipment. It therefore involves standards for engineering design and manufacture as well as electronic publishing. The intention is that the whole process from design to maintenance should be available and documented electronically. The standards include IGES and STEP for the engineering side and Standard Generalised Mark-up Language (SGML) and Computer Graphics Metafile (CGM) for the publishing and documentation aspects. Accompanying the set of selected standards are a series of specifications as to how the standards will be used.

Another example is the Group on Electronic Document Interchange (GEDI) proposals selected by a consortium of library and document supply groups. It includes a header format and recommended page formats including TIFF. This approach is being refined in practice to reflect experience gained and changing circumstances and is gaining acceptance in a number of projects. The GEDI profile is maintained by Pica.

More formal methods have been developed within the International Organization for Standardization (ISO) in the form of International Standardized Profiles (ISPs), which specify the way standards will be used by application groups. Examples include profiles for CGM, ODA and a set of standards for medical imaging.

For these reasons it is important that eLib projects work together in relevant areas to ensure interoperability and avoid unnecessary divergence.

Data communications standards

Introduction

The JANET/SuperJANET network which serves the UK higher education community adopted the Internet (TCP/IP) protocols as the preferred standard for basic data transmission towards the end of 1993. The consequence of this decision is that earlier strategies based on an evolution towards ISO OSI protocols have been superseded; with few exceptions the ISO standards are now of little practical interest to higher education institutions.

Terminal access and file transfer

Telnet is the preferred protocol for terminal access across the IP network, and FTP the corresponding file transfer protocol. For terminal access there is a gateway offering protocol conversion between telnet and X.29; the address of this is telnet.ja.net. For file transfer the gateway is ft-relay.ac.uk; in addition to converting between Internet FTP and the obsolescent Blue Book file transfer protocol, this gateway also converts between either of the latter protocols and ISO FTAM.

Electronic mail

The situation is more complicated for electronic mail since two mail protocols are officially recommended and supported. These are the Internet SMTP protocol running over TCP/IP, and the ISO standard X.400 (1988) also running over TCP/IP. The reason for preferring the later, 1988, version of X.400 for use by the academic community is that it supports non-text formats for mail messages or parts of mail messages; the earlier version, X.400 (1984), is restricted to text-only messages. X.400 (1984) is permitted in the JANET community but not encouraged.

At present the Internet standard, Simple Mail Transfer Protocol (SMTP), is the prevalent mail protocol used in the academic community. Although SMTP itself does not support non-text message parts, the recently developed MIME extensions to SMTP provide many of the same facilities as are defined in X.400 (1988). Since most popular Internet mail software now exists in versions which support MIME, and many campuses have implemented MIME as a local standard, this has become the most popular approach to the provision of multi-media mail facilities.

In the commercial world X.400 is more widely used (normally, to date, in the older 1984 version), although with the rapid growth of commercial interest in the Internet SMTP is gaining ground there also. The academic community will clearly retain an interest in X.400 both for its own sake and for inter-working with commercial X.400 networks.

The United Kingdom Education and Research Networking Association (UKERNA) maintains a mail protocol conversion service which provides gatewaying between SMTP, both versions of X.400 and the obsolescent Grey Book mail protocol. The location of the gateway is mail-relay.ja.net. With regard to the eLib programme the recommendation is that projects should use SMTP, with MIME for handling non-text message parts, as the primary standard, since this gives access to the largest number of academic users. Encodings other than those defined in the MIME standard, such as uuencode/uudecode and BinHex, are discouraged for the transmission of non-text enclosures.

Any project with a particular reason to opt for X.400 or non-standard encodings such as uuencode must address the problem of gatewaying to SMTP/MIME mail.
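As an illustration of the recommendation above, the following Python sketch builds a mail message with a text body and a non-text enclosure, using the MIME support in the email package of the Python standard library. The addresses, subject, file name and the four placeholder bytes standing in for a TIFF image are all assumptions made for the example. Actual transmission over SMTP is omitted because it depends on the local mail relay.

```python
from email.message import EmailMessage


def build_mime_message(sender, recipient, subject, text, enclosure, filename):
    """Return a MIME message with a text part and one image/tiff enclosure."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # The text body becomes the first body part of the message.
    msg.set_content(text)
    # Binary data is labelled with a MIME media type and, by default,
    # carried with the MIME base64 transfer encoding.
    msg.add_attachment(enclosure, maintype="image", subtype="tiff",
                       filename=filename)
    return msg


# Hypothetical addresses and placeholder image bytes, for illustration only.
message = build_mime_message(
    "author@example.ac.uk",
    "reader@example.ac.uk",
    "Scanned page",
    "The scanned page is attached as a TIFF image.",
    b"II*\x00",
    "page1.tif",
)
print(message.as_string())
```

The enclosure travels as a base64 body part labelled image/tiff, which is the MIME-defined alternative to uuencode or BinHex.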

Delivery Mechanisms

It is likely that the World Wide Web will be the main delivery mechanism for the eLib programme. Developments within the Internet standards processes for new Internet Media Types, for example chemical symbols for MIME and the Web, should be watched by those involved in electronic journals and document delivery. Other developments which are ongoing and are relevant include the adoption of CGM as a vector format for MIME and the Web and the inclusion of the PNG format. Both of these are described below.

Reference to the GEDI agreements was made above. Document delivery projects should consider current GEDI practice. The GEDI profile is used in Ariel and in the projects and services that make use of it.

See below for a detailed discussion of Security Issues.

Data Interchange Standards

Some General Issues

Types of data interchange standards

There are many formats for exchanging information. Some of these cover information which can be viewed but not edited. Other formats are concerned with the information (numbers, text, pictures) but make no reference to how the information is to be viewed, or laid out. In the latter case the way the information appears to the user is a function of how it is viewed, for example on paper or on the screen. Some formats are appropriate for text and others for pictures. Some offer compact methods of storage and others less so. It is important to appreciate these differences when choosing a format for storage, particularly for archiving purposes where the eventual use to which the information is put may not be known. (See Annex A for a detailed discussion of compression methods).

Beyond ASCII text

Within the eLib programme there is a need for the storage and representation of a range of characters and symbols not available in ASCII. These can be stored using extended character sets but software and file formats need to be able to handle them. Most of the ISO standards have had to address these issues and allow the selection of extended character sets.

If data is stored as a page image these characters can be stored in Raster format along with the text. This may be appropriate where there is no requirement (ever) to discover the meaning of the symbols (for example chemical or mathematical symbols) or text (which may include diacritics or a range of character sets, for example Kanji).

There are some other solutions. One is to use an extended character set which moves beyond the standard printing characters. ISO have been working in this area for many years, as a number of projects require standards for character sets to underpin their work. ISO 646 was the first of these standards and defines 128 character representations for digital storage (using 7 bits to define each character). ASCII is the American definition of ISO 646 (that is, it has the dollar as its currency symbol). ISO 2022 goes further: it allows the same number of characters to be defined at any one time but permits character set switching. This gives sufficient access to Latin alphabets but does not address non-Latin alphabets at all. Recognising this need, ISO developed first a series of 8 bit character sets in ISO 8859 and more recently a 32 bit character set in ISO 10646. ISO 8859 defines a series of nine 256-character alphabets, each of which consists of the US ASCII characters in the first 128 positions and a series of accented characters or non-Roman characters in the second 128 positions.

Between them the ISO 8859 character sets cover the great majority of European languages, including both Roman and Cyrillic alphabets for eastern European languages, together with Greek, Hebrew and Arabic. The far greater number of characters in the ISO 10646 standard is intended to address all of the world's written languages, including the wide range of characters and ideograms used in Asian languages, but its use in practical applications is at present negligible and it should be regarded as a standard for the future.
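The structure of the ISO 8859 family can be checked with a short Python sketch. It relies on the codec tables shipped with the Python standard library, which is an assumption about the reader's environment and not part of the standards themselves. The sketch tests that the lower 128 positions match US-ASCII in several parts of ISO 8859, and reports which character the same upper-half byte value denotes in each.

```python
import unicodedata

# Codec names as used by the Python standard library (an assumption of
# this sketch): Latin-1, Latin-2, Latin/Cyrillic and Latin/Greek.
PARTS = ["iso8859-1", "iso8859-2", "iso8859-5", "iso8859-7"]


def lower_half_matches_ascii(codec):
    """True if byte values 0-127 decode exactly as they do in US-ASCII."""
    lower = bytes(range(128))
    return lower.decode(codec) == lower.decode("ascii")


def decode_byte(value, codec):
    """Decode a single byte value under the given character set."""
    return bytes([value]).decode(codec)


for part in PARTS:
    character = decode_byte(0xE9, part)
    # Print the character's name so the output is plain ASCII.
    print(part, lower_half_matches_ascii(part), unicodedata.name(character))
```

A file of 8 bit text is therefore ambiguous unless the part of ISO 8859 in use is recorded alongside it.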

It is also possible to use some form of 'markup' which uses printing characters in a pre-defined way, enabling the interpreter to recognise that a special symbol is intended. An example of such markup is used in the text processing language called TeX. This language is used by many mathematicians and computer scientists when writing papers as it allows the use of symbols such as the integral sign, summation symbols and so on. SGML is an increasingly important markup language (see further discussion of SGML below). The precise syntax and semantics of the markup are specified in the Document Type Definition (DTD, which is discussed in more detail below).

Work is currently taking place in the Internet standardisation process to incorporate chemical symbols within MIME and Web standards. HTML 3 is also addressing the need for extended character sets.

Character sets for textual files transferred via FTP or mail should normally be restricted to those specified in the MIME standard, i.e. US-ASCII for 7-bit text and ISO 8859-n (n = 1-9) for 8-bit text. The use of ISO 8859-1, also commonly known as ISO Latin-1, is particularly commended wherever applicable in view of its widespread availability on end-user devices such as terminals, personal computer systems and printers. Use of other ISO 8859-n variants may well require provision of additional fonts and/or special software (e.g. for right-to-left languages) on many such devices.
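One way of applying this restriction in software is sketched below in Python. The function name and the order in which candidates are tried are assumptions of the example: it prefers US-ASCII, then ISO 8859-1, then the remaining ISO 8859-n parts, and reports failure when none of them can represent the text.

```python
# Candidate labels in order of preference: 7-bit text first, then
# ISO 8859-1 (Latin-1), then the other parts up to ISO 8859-9.
CANDIDATES = ["us-ascii"] + ["iso-8859-%d" % n for n in range(1, 10)]


def choose_mime_charset(text):
    """Return the first candidate charset able to encode text, or None."""
    for charset in CANDIDATES:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    return None


print(choose_mime_charset("Plain text"))
print(choose_mime_charset("caf\u00e9"))
print(choose_mime_charset("\u65e5\u672c"))
```

Text for which the function returns None (the last call above uses two Kanji characters) falls outside the recommended sets and would need separate treatment.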

Document format types

Introduction

One could group format types in various ways. It is useful in the context of the eLib programme to identify the following types:

Simple ASCII text with printing characters and no formatting other than spaces, tabs and line structure. This is unlikely to be of major importance within the eLib programme and is not discussed further.
Proprietary 'closed' word processing formats. These are not encouraged within the context of the eLib programme and are not discussed further.
Raster formats, for example TIFF, GIF and PNG.
Page description formats, for example PostScript and PDF.
Files containing information about layout and content, for example ODA, RTF and TeX.
Structured information where layout is a separate exercise, for example SGML and LaTeX.
Storing graphics, for example image data through TIFF and vector data through CGM.
Moving images and 3D information.
Container documents - based on object-oriented technologies, these documents move towards user interface and program integration functions.

All but the first two of these are discussed together with examples in the text below.

Raster Formats

These involve storing the pixels, or dots, which make up the Raster image and may be monochrome, greyscale or colour. The disadvantage is that the page consists of a series of dots (and sometimes, for example with the fax formats, only black and white ones) at a fixed resolution. The information contained in the document, i.e. the content of the text and graphics, has been lost. However we do have a page which can be attractively laid out and include pictures, symbols and tables. It is at a fixed, and frequently fairly low, resolution.
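The effect of resolution and colour depth on storage can be estimated with simple arithmetic, sketched here in Python. The page size (A4, taken as roughly 8.27 by 11.69 inches) and the two resolutions are assumptions chosen for illustration. The figures are for uncompressed data and ignore any header or row padding that a particular format may add.

```python
def pixels(inches, dots_per_inch):
    """Number of dots along one edge at the given resolution."""
    return round(inches * dots_per_inch)


def raster_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size in bytes of a raster image, rounded up."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8


# Assumed example: one A4 page at a low and a higher resolution.
for dpi in (100, 300):
    width = pixels(8.27, dpi)
    height = pixels(11.69, dpi)
    for label, depth in (("monochrome", 1), ("greyscale", 8), ("colour", 24)):
        print(dpi, "dpi", label, raster_size_bytes(width, height, depth), "bytes")
```

Tripling the resolution multiplies the storage by nine, which is one reason page images are so often kept at fairly low resolution.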

GIF

The Graphics Interchange Format (GIF) defines a protocol which supports the hardware independent, online transmission of Raster graphics data (i.e. images). It uses a version of the LZW compression algorithm (see Annex A).

GIF is defined in terms of data streams which in turn are composed of blocks and sub-blocks representing images and graphics, together with the essential control information required in order to render the resultant image on the target output device. The format is defined on the assumption that an error-free transport level protocol is used for communication i.e. no error detection facilities are provided.
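To make the idea of a data stream with leading control information concrete, the following Python sketch reads the first 13 bytes of a GIF stream: the six-byte signature followed by the logical screen descriptor. The function name and the sample bytes are assumptions of the example. The sketch does not go on to read the image blocks or to decompress anything.

```python
import struct


def parse_gif_header(data):
    """Read the signature and logical screen descriptor of a GIF stream."""
    if len(data) < 13:
        raise ValueError("too short to be a GIF data stream")
    # Six signature bytes, then little-endian width and height, a packed
    # flags byte, the background colour index and the pixel aspect ratio.
    signature, width, height, packed, background, aspect = struct.unpack(
        "<6sHHBBB", data[:13])
    if signature not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF data stream")
    has_colour_table = bool(packed & 0x80)
    # The low three bits give the global colour table size as 2^(n+1).
    table_entries = 2 ** ((packed & 0x07) + 1) if has_colour_table else 0
    return {
        "version": signature[3:].decode("ascii"),
        "width": width,
        "height": height,
        "global_colour_table_entries": table_entries,
        "background_index": background,
    }


# Hand-built sample header: a 640 x 480 screen with a 256-entry colour table.
sample = b"GIF89a" + struct.pack("<HHBBB", 640, 480, 0xF7, 0, 0)
print(parse_gif_header(sample))
```

Note that nothing in the header allows a damaged stream to be detected, which reflects the assumption of an error-free transport mentioned above.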

GIF has recently gained in popularity owing to its extensive use in the Web.

Although the GIF is subject to a patent claim by Unisys, they have granted a limited, non-exclusive, royalty-free license for its use in certain computer software viewers.

PNG

The Portable Network Graphics (PNG - 'ping') format is designed to be a successor to GIF. It was developed during 1995 and has been adopted by CompuServe to replace GIF. It is a lossless Raster format which supports 24 bit colour and offers better compression than GIF. A toolkit for writing and reading PNG files is freely available, as is the specification. eLib projects which intend to store and display graphical images should examine the specification further.

TIFF

Aldus Corporation designed and made public the Tagged Image File Format (TIFF) in 1986. Although initially targeted at desktop publishing applications, it has been widely implemented on many computing platforms and has become a de facto industry standard format.

The TIFF definition is based on the concept of 'tags'. Tags simply provide information about the Raster image (one of the tags is a pointer to the compressed content of the image itself). Examples range from such critical information as the compression type, size, and bit order of the compressed image, to purely informational items such as author, date and time, source software, etc. TIFF revision 5.0 defined some 45 tags.
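The tag mechanism can be illustrated with a minimal Python sketch which reads the first image file directory of a TIFF file and lists its tags. It is a simplified reader under stated assumptions: only single SHORT and LONG values are decoded, only three well-known tag numbers are named, and the sample file is hand-built for the example (including a made-up private tag, 65000).

```python
import struct

# A few well-known tag numbers; anything else is reported by number.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 259: "Compression"}


def read_tiff_tags(data):
    """Return the tags found in the first image file directory (IFD)."""
    byte_order = data[:2]
    if byte_order == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif byte_order == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file: unknown byte order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: bad magic number")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for index in range(entry_count):
        start = ifd_offset + 2 + 12 * index
        # Each 12-byte entry: tag number, field type, value count, value.
        tag, field_type, count, raw = struct.unpack(
            endian + "HHI4s", data[start:start + 12])
        if field_type == 3 and count == 1:      # one SHORT
            value = struct.unpack(endian + "H", raw[:2])[0]
        elif field_type == 4 and count == 1:    # one LONG
            value = struct.unpack(endian + "I", raw)[0]
        else:                                   # left undecoded in this sketch
            value = raw
        tags[TAG_NAMES.get(tag, tag)] = value
    return tags


# Hand-built sample: header, four directory entries, end-of-directory marker.
sample = (
    b"II" + struct.pack("<HI", 42, 8)
    + struct.pack("<H", 4)
    + struct.pack("<HHIHH", 256, 3, 1, 1728, 0)   # ImageWidth
    + struct.pack("<HHIHH", 257, 3, 1, 2200, 0)   # ImageLength
    + struct.pack("<HHIHH", 259, 3, 1, 4, 0)      # Compression: 4 = Group 4 fax
    + struct.pack("<HHII", 65000, 4, 1, 7)        # made-up private tag
    + struct.pack("<I", 0)
)
print(read_tiff_tags(sample))
```

The private tag is reported only by number: a reader which does not know its meaning can skip it, but cannot make use of it.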

TIFF is widely used, but it should be noted that there are problems with different versions and different flavours: people add their own tags, which interpreters do not necessarily understand.

As noted above, the TIFF specification is used within the GEDI set of selected standards.

PhotoCD

PhotoCD is a proprietary format devised by Kodak for the digital storage of high resolution images on CD. These images can be viewed at a range of resolutions and can be manipulated using image processing software.

Group 4 fax

Page Description Formats

Introduction

There are document formats which seek to preserve the image of an entire printed page. The page can be stored using a page description language (PDL) where the text and graphics are stored together with output information such as layout, font, linestyle etc. This means that more information is stored than within a Raster format. PDLs are designed for presentation of complete, formatted, final-form (non-revisable) page images on output printing devices. 'Virtual paper' is a good metaphor for PDLs. Most PDLs are oriented toward presentation of pages on laser printers. PostScript and the Portable Document Format (PDF) are described below as these are likely to be important within the eLib programme.

PostScript

PostScript is a Page Description Language (PDL) designed by Adobe Systems Inc. PostScript is the most successful of the commercial PDLs and has had a heavy influence on the final appearance of the Standardised Page Description Language (SPDL, an ISO standard). It offers the advantages of potentially high resolution colour output - that is, it is as good as a printed paper copy - and this means more flexibility in that the scale can be changed without loss of information.

As the 'language' part of PDL suggests, PostScript is a true interpretive programming language. Its Level-1 definition includes over 270 operators, which go far beyond basic graphics presentation (definition and maintenance of 'dictionaries', boolean and arithmetic operators, etc.). The more recently released Level-2 definition contains over 420 operators.

PostScript uses programming language constructs and paradigms: procedures, variables, conditional logic, etc. This creates powerful expressive capabilities. The trade off is that, compared to more object-oriented graphics formats, a PostScript graphics file is very difficult and impractical to edit or modify. Although device-independent, the PostScript imaging model demands Raster devices for presentation. The language is implemented on powerful onboard micro processors on many Raster devices (PostScript demands a lot of memory and computational power to interpret).

Encapsulated PostScript (EPS) is a (large) subset of PostScript which allows storage of information in the PostScript language but excludes any size or positioning information. This means that a part of a page can be brought into another document. This is most frequently used for the inclusion of graphics within documents where these have been produced in a different package from the one used for producing the text.

PostScript Level 2 is likely to be useful within the eLib programme for delivery of high quality page descriptions which are for printing only (although equivalence cannot be guaranteed across printers). Where page fidelity is needed together with the ability to annotate and interact with a document then PDF, discussed below, is likely to be the format needed.

Portable Document Format and Acrobat

Adobe have further developed the PostScript concept to define their Portable Document Format (PDF) which links with a suite of software called Acrobat. PDF and Acrobat are likely to be important within eLib. This position has been strengthened by the fact that there is a public domain reader.

PDF extends PostScript Level 2 to allow the addition of links within and between documents, annotations, thumbnails of pages, and chapter outlines which link to specific pages for access. The basic metaphor again is the page. This can be very attractive to publishers who wish to define a house style or who wish to have an online version of a paper journal. One such example is the Electronic Publishing Journal from Wiley which is described in Smith et al (1994).

The Acrobat software has a number of modules. These are:

Acrobat Reader
Acrobat PDF Writer to interface to other software
Acrobat Exchange which allows some editing and permits annotations, links and security controls to documents
Acrobat Search for searching documents
Acrobat Type Manager which ensures suitable use of fonts and resolution for devices
Acrobat Distiller for conversion from PostScript or EPS
Adobe Acrobat for Workgroups to enable co-operative working on documents
Acrobat Catalog for creating indexes for subsequent use in the Search package
Acrobat Capture for interfacing to scanners and enabling the scanned page to be converted into PDF. This does character recognition and also maps the printed font into an Adobe font

This software is available for the PC, Macintosh and Unix platforms. Adobe have indicated that they intend to extend PDF to encompass SGML files, in order to be able to preserve structure within the file.

Formats containing layout and content

Another storage possibility is storing the layout of a document and the contents of that document separately. This is the approach taken by the Open Document Architecture (ODA) standard. In ODA the layout of the document is stored. This might include pages, title areas, places for pictures etc. The standard also allows the logical structure of the document to be defined. This may be chapters, paragraphs etc. which are then linked to the layout of the document. Alongside this the standard allows various content architectures to be positioned into places on the 'page' (this page could be a piece of paper or a screen and the layout may vary depending on the output medium). ODA standardises a number of content architectures. One of these is the Computer Graphics Metafile standard. Another is a Raster format based on the CCITT fax standard (see Annex A).

It is unclear what future there is for the ODA standard. It is not recommended for use in the eLib programme.

Rich Text Format (RTF), which can be output from and input to many word processing packages, is another example of this form. RTF is not as portable as many believe, with problems arising even between different versions of the same software. It is possibly useful in private correspondence but not for general use in eLib projects.

LaTeX/TeX, used by many mathematicians and computer scientists, is another example of marked up text which contains information about both structure (paragraphs, mathematical symbols) and layout. As mentioned above, TeX is often used within other formats to enable the inclusion of mathematical symbols.

Structured information

Structured information contains embedded mark-up (often known as tags) which describes the information but which does not include any layout information. Discussion here focuses on SGML.

SGML

The Standard Generalised Mark-up Language (SGML) provides a meta-language, or syntax, for writing rigorous, descriptive definitions of documents. It is independent of any system, device, language or application and allows individuals or groups of people in user communities to write their own types of documents within a standard framework. The information may include information beyond text, such as image data stored in fax format or a CGM file. This standard separates document definition from subsequent access and viewing and allows information to be accessed in ways not predicted at the time of mark-up.

Each SGML document contains three parts. The first is an SGML declaration which describes the environment in which the document needs to be processed and may include information about which character sets are to be used. The second part is the Document Type Definition (DTD) which describes the logical model for the document and defines references to entities which may be referenced, such as a fax image or CGM file. The third part is the document stream itself.

The SGML standard has an associated standard called the Document Style Semantics and Specification Language (DSSSL) which gives rules of presentation and style for the logical document components, for example headers, footers, chapter headings etc., which are defined in the document. The document may then be output via a page description language such as PostScript.

SGML gives the most flexibility of our options. There is little restriction - too little many would argue - on the mark-up used. The standard is beginning to be adopted in publishing applications, by the US Department of Defence and by providers of some experimental online information services. It is gaining momentum in the marketplace, especially as the multimedia standard called HyTime, which is an SGML application, is apparently being taken up by a number of key players in the market. The Text Encoding Initiative, a set of document type definitions produced within the academic community for the encoding of humanities research datasets, is also providing some impetus.

It is anticipated that SGML will be a key standard for eLib. Projects are encouraged to work together to agree or, where necessary, develop document type definitions. Several existing DTDs are likely to be relevant.

Text Encoding Initiative

The Text Encoding Initiative (TEI) is an international research project which aims to develop and disseminate guidelines for the encoding and interchange of machine readable texts within the humanities. These guidelines are available and offer a comprehensive mark-up capability using SGML which has application beyond the humanities.

HTML

Hypertext Mark-up Language (HTML) is simply a DTD which prescribes formats for presentation and display. Hypertext documents in the World Wide Web are written in HTML. eLib projects will make heavy use of HTML and should use HTML 2, moving to HTML 3 when it is stable. Netscape and other vendor-specific extensions are deprecated.

Storing graphics

Pictures can be stored using Raster formats as described above. There are however severe limitations with using this approach as the diagrams can have 'jagged' edges and may lack adequate detail due to poor resolution. The use of vector graphics can result in much smaller files and better representation. There are current moves to get the CGM standard incorporated into the standard Web tools for these reasons.

The CGM is the International Standard for storage and exchange of 2D graphical data. Although initially a vector format, it has been extended through two upwardly compatible extensions to include Raster capabilities and provides a very useful format for combined Raster and vector images.

CGM also has profile rules and a Model Profile to attempt to solve the problem of flavours of standards. Four International Standardized Profiles (ISPs) are being developed for CGM. These are being used as the basis for defining the way that CGM will be used within MIME compliant email and within the Web. Some initial work on registering CGM for use within MIME and the Web has recently commenced.

Beyond the page image

Much of the discussion above has focused on formats concerned with the storage of information which we currently can view on paper. The online form allows us to be 'free' of the paper metaphor and to store and access a range of types of information which enhance understanding. These include moving images, three dimensional information and sound.

Moving images

The Moving Picture Experts Group (MPEG) is another joint ISO and CCITT group, which is addressing the compression of both video and audio. Like JPEG (see Annex A), MPEG uses the discrete cosine transformation for encoding frames. Every tenth frame is transferred in full with the following nine frames being transmitted as 'changes' to that reference frame. The compression technique utilises recent work on colour perception to ensure that only the information which is most relevant to the human visual system is transmitted. Compression and decompression chips are available. The JISC, through the New Technologies Initiative, has just (October 1995) funded hardware for a service at the University of Manchester for MPEG compression. This service will be available nationally to convert from other formats to MPEG.
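The principle of sending a reference frame in full and the following frames as changes can be sketched in Python as below. This is a deliberately simplified illustration of the idea as described above, not the MPEG algorithm: the discrete cosine transformation, colour perception modelling and all bitstream details are omitted, and a 'frame' is assumed to be a plain list of pixel values of fixed length.

```python
def encode_sequence(frames, group_size=10):
    """Store every group_size-th frame in full, the rest as changes."""
    encoded = []
    reference = None
    for index, frame in enumerate(frames):
        if index % group_size == 0:
            reference = list(frame)
            encoded.append(("full", list(frame)))
        else:
            # Record only the positions that differ from the reference frame.
            changes = {position: value
                       for position, (value, old) in enumerate(zip(frame, reference))
                       if value != old}
            encoded.append(("changes", changes))
    return encoded


def decode_sequence(encoded):
    """Rebuild the original frames from full frames and change records."""
    frames = []
    reference = None
    for kind, payload in encoded:
        if kind == "full":
            reference = list(payload)
            frames.append(list(payload))
        else:
            frame = list(reference)
            for position, value in payload.items():
                frame[position] = value
            frames.append(frame)
    return frames


# Twelve tiny four-pixel frames in which only two pixels ever change.
frames = [[i, 0, 0, i % 3] for i in range(12)]
encoded = encode_sequence(frames)
print(encoded[0], encoded[1])
```

The saving comes from the change records being much smaller than full frames whenever successive pictures are similar.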

Other formats which might be appropriate for the storage of moving images include Quicktime Movies and Video for Windows.

Three dimensional information

Many eLib projects will only be concerned with 2 dimensional pictures (even if they represent a 3 dimensional scene).

Various packages produce 3 dimensional files as output. Some of these relate to the CAD area though are more generally used for file exchange. IGES, AutoCAD DXF and PHIGS Archive files are examples of formats which exist but are probably of little relevance to the eLib programme.

Perhaps of more interest is VRML (Virtual Reality Modelling Language). This seems to be emerging as the standard for 3D objects within the Web. VRML is built on the Open Inventor technology from Silicon Graphics. VRML provides a graphical mark-up allowing links to other VRML and HTML resources. There are some public domain viewers.

Sound

No recommendation is made in this area.

Container documents

The advent of distributed object technologies has introduced new types of documents which will become increasingly important. OpenDoc and OLE are the principal approaches here, but it is too early to make any recommendations.

Metadata: resource identification and description

The preceding discussion has dealt mainly with standards for storing 'data'. Here we discuss 'metadata' which we use inclusively to refer to names, locations, descriptive and evaluative data which facilitate identification, discovery, selection, location and use of resources.

As a simplifying convenience, we can identify two main domains of interest to eLib in which an apparatus for the creation and exchange of metadata has been created. One is in the context of Network Information Discovery and Retrieval systems on the Internet. This is of relatively recent origin and is still the subject of research and development: it is not yet clear in many cases what is actually required. The other is within the library and bibliographic community where an elaborate array of metadata for (largely print) bibliographic items exists. These areas are not explained fully here, and are considered below only where they represent an issue that eLib needs to address.

The points of contact between these two domains are growing in number.

Resource identification

Network developments - uniform resource identifiers

The URL (Uniform Resource Locator) is fundamental to recent advances in use of the network, underpins the use of the Web, and will increasingly be used in other contexts. A bibliographic analogue is a library shelf mark. Locators are not permanent (objects may move, or may be accessed differently) and an object may have several locators. URLs have been defined for the basic Internet services (e.g. HTTP, telnet, FTP, ...). URLs for Z39.50 resources are being defined.
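The way a locator combines an access method with a location can be seen by splitting a URL into its components, as in the following Python sketch. It uses urlsplit from the standard library's urllib.parse module. The host names and paths are invented for the example.

```python
from urllib.parse import urlsplit


def describe_locator(url):
    """Split a URL into access method (scheme), host and path."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


# Hypothetical locators for the same report, reached by different services.
for url in (
    "http://www.example.ac.uk/elib/report.html",
    "ftp://ftp.example.ac.uk/pub/elib/report.ps",
    "telnet://opac.example.ac.uk/",
):
    print(describe_locator(url))
```

Each component is an accident of where and how the object happens to be served, which is why a locator cannot act as a permanent identifier.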

There is clearly a need for another identifier which is not contingent on such accidental features as location or access method. A Uniform Resource Name (URN) is such an identifier; it is a persistent object identifier, assigned by a 'publisher' or some authorising agent. A bibliographic analogue is the ISBN. It will allow resources to be referenced without regard to their location or access method. It is anticipated that 'publishers' will create URNs for the resources they produce or authorise other agencies to do so. There will have to be a registry process for both publisher names and URNs. The use of URNs assumes a resolution service, which dynamically resolves URNs to URLs. Neither URNs nor such resolution services are in routine production use; both are still the subject of research and development.
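The relationship between names, locators and a resolution service can be sketched as a simple lookup table in Python. Everything here is hypothetical: the class, the 'urn:example:' naming pattern and the URLs are assumptions made for illustration, since no URN syntax or resolution protocol has yet stabilised.

```python
class UrnResolver:
    """Toy resolution service mapping persistent names to current locators."""

    def __init__(self):
        self._locations = {}

    def register(self, urn, url):
        """Record one more location for a named resource."""
        self._locations.setdefault(urn, []).append(url)

    def relocate(self, urn, old_url, new_url):
        """Replace a locator when the object moves; the name is unchanged."""
        urls = self._locations[urn]
        urls[urls.index(old_url)] = new_url

    def resolve(self, urn):
        """Return the current locators for a name (empty list if unknown)."""
        return list(self._locations.get(urn, []))


resolver = UrnResolver()
name = "urn:example:elib-report-1"   # hypothetical name
resolver.register(name, "http://www.example.ac.uk/elib/report.html")
resolver.register(name, "ftp://ftp.example.ac.uk/pub/elib/report.ps")
resolver.relocate(name,
                  "http://www.example.ac.uk/elib/report.html",
                  "http://archive.example.ac.uk/reports/1.html")
print(resolver.resolve(name))
```

A citation that quotes the name continues to work after the move, because only the resolver's table has to change.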

eLib projects should be able to supply a URL for public services, and be prepared to adopt URNs when they are stabilised.

Bibliographic identifiers

A widely applied standard apparatus exists for the identification of monographs, the ISBN.

In the case of serials, items need to be identified at (at least) three different levels: title, issue, and article. The ISSN caters for the first of these, although it is only available on recent serials.

The Serial Item and Contribution Identifier (SICI, Z39.56, or 'sicky') has been developed to assist in the unique identification of serial 'Items' (including issues) and 'Contributions' (including articles). This is a US national standard initiated by SISAC which is beginning to be widely adopted in the serials industry. The SICI extends the ISSN with other data derived from issue and article details. The SICI may be created by the originator of the published object (publisher) or by some third party (subscription agent, document supplier). Recently, Elsevier and some other publishers have developed the Publisher Item Identifier which fits into the SICI framework.

Considerable work has also been done on automatically deriving article identifiers from the bibliographic data in an article. The Universal Standard Bibliographic Code (USBC) is a notable example here.

eLib projects would benefit from the ability to uniquely identify articles, however that is achieved. The SICI seems to be the most fruitful approach here but presents implementation issues. This is one area where further work on standards will be carried out by FIGIT.

Resource description

Network resource description

This is still an area of research and experiment. Several approaches have evolved independently, each of which involves a selection of attributes and an encoding. Some of the more important are described here.

Text Encoding Initiative headers

The TEI is a major international project with input from those concerned with the creation, use and exchange of electronic texts in the humanities. It has produced a framework for the documentation and interchange of electronic texts based on SGML. The guidelines describe a TEI header, which includes such data as title, edition, size, publication, description of the source document, revisions and so on. The header was designed with reference to library descriptive standards (for example International Standard Bibliographic Description and the Anglo-American Cataloguing Rules, 2nd edition). Such headers will be part of the files they describe; however provision is also made for 'independent headers' that can be exchanged separately. The relationship between such headers and Machine Readable Catalogue (MARC) records is described in the Guidelines. The headers are not yet widely deployed, but their importance will grow with the implementation and take-up of the TEI Guidelines.

MARC records

US-MARC now caters for the description of network resources, and includes, among other things, a field for the URL. In the Netherlands, the Royal Library and Pica have adapted Pica+. Each of these MARC formats is now being used in projects to 'catalogue' network resources. There do not seem to be any steps to adapt UK-MARC in similar ways, although it is proposed that UK- and US-MARC converge over the next three years.

IAFA/WHOIS++ templates

The IAFA (Internet Anonymous FTP Archive) Group of the IETF (now disbanded) produced recommendations for the description of resources on anonymous FTP archives. A number of objects are identified ('user', 'organisation', 'siteinfo', 'document', 'image', and so on), and templates consisting of multiple attribute-value pairs are defined for each. These are very simple templates which the authors recognise will need to be refined in the light of implementation experience. They are not only used to describe files on FTP archives. They are used by Bunyip Information Services, in ALIWEB, by NetEc, and in the Dutch InfoServices Project. In the UK, they have been adopted by SOSIG (Social Science Information Gateway) and OMNI (Organised Medical Networked Information), and are proposed for use in the other eLib subject-based network services. Tools to support these services are being developed in the ROADS (Resource Organisation and Discovery in Subject-based services) project, also funded by the ISSC within the eLib framework. Bunyip and others are working on WHOIS++ templates based on the IAFA structure.
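To give a feel for how simple an attribute-value template is to process, the following is a minimal sketch of a reader for one record. The attribute names in the sample, the record content and the treatment of indented lines as continuations are assumptions made for illustration; they are not taken from the IAFA recommendations or from the ROADS software.

```python
SAMPLE = """\
Template-Type: DOCUMENT
Title: An example report
URI: http://www.example.org/report.html
Description: A short description that
 continues on a second line.
Keywords: metadata, example
"""


def parse_template(text):
    """Parse one 'Attribute: value' record into a list of (name, value) pairs.

    A list is returned rather than a dictionary so that repeated
    attributes are preserved in their original order.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and pairs:
            # Assumed convention: an indented line continues the previous value.
            name, value = pairs[-1]
            pairs[-1] = (name, value + " " + line.strip())
        else:
            # Split on the first colon only, so URLs in values survive intact.
            name, _, value = line.partition(":")
            pairs.append((name.strip(), value.strip()))
    return pairs


record = dict(parse_template(SAMPLE))
```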

Some other approaches

There are a number of other approaches.

One of potential interest is the Summary Object Interchange Format (SOIF), which is defined within the Harvest project and is based on IAFA templates and BibTeX. This is composed of data automatically extracted from resources by a number of methods. URCs (Uniform Resource Characteristics) are being developed but there is as yet no consensus about their structure or purpose: it may be that this work will concentrate on structure and syntax rather than semantics of particular element sets.

eLib projects should not seek to develop new approaches but build on work being done elsewhere. It is not appropriate to recommend a definite metadata format here. The eLib Access to Network Resources projects discuss issues of common concern through COUSNS (Committee of UK Subject-based Network Services). There is a discussion list for this: cousns@mailbase.ac.uk (a mailbase list).

Bibliographic resource description

Libraries have concentrated on the creation and exchange of metadata which describes monograph resources. More or less widely used schemes exist for description (ISBD), cataloguing (AACR2), classification (LC, DDC, UDC, ...), subject headings (LCSH, ...), and so on. Schemes exist for the designation of such data in machine-readable form (*-MARC, CCF, ...) and for its encoding for exchange (ISO 2709). These are not further discussed here.

However, various areas are outstanding and need further attention.

Serials data

There is no consistent approach to the description of serials, at the level of the title, the issue or the article. Nor is there a consistent approach to the recording of holdings data. Libraries typically record brief data about titles, in a variety of ways. Serials data is sometimes integrated into the catalogue, sometimes not. Table of contents providers, abstracting and indexing services, and other data suppliers record descriptions about articles in different ways.

Several eLib projects propose resource sharing scenarios, or 'distributed' document delivery services. Given the data that is currently available, these systems would rely on a high level of manual intervention. Automation will require improvement of available data and, in some proposed scenarios, the matching of records (e.g. matching of records returned from BIDS against a local holdings file, the holdings file of another library, and so on).

US-MARC defines a format for serial holdings data and several approaches exist for description of articles. This is an area in which FIGIT encourages projects to work together towards consensus and is supporting further investigation. BIC (Book Industry Communication) has also initiated some work in this area.

Terms of availability

A variety of approaches exist to describing the terms of availability of a resource. This is an area which requires more work. Again, eLib projects are encouraged to seek consensus with other projects in this area.

Collections

At the level of describing the catalogue itself, or the collection(s) it represents, as a resource, there has been limited library effort. Schemes such as Conspectus, developed by the Research Libraries Group, exist, but are not widely deployed. However it seems obvious that a corollary of making catalogues and other resources more widely available is the creation of metadata which would allow a user, or user agent, to select them as resources of potential interest. There is little consistent guidance in terms of special collections, subject strengths, ILL or external reader policies, and so on.

Classification of network resources

For historical reasons, several UK services use UDC (Universal Decimal Classification) to classify network resources. Other eLib services should not feel constrained by this precedent.

Resource evaluation

Various proposals exist in the Web world for resource evaluation. No recommendation is made here.

Search, request and retrieve services

Search and retrieve services

Web

It is anticipated that end-user access to services will typically be through the Web. Some services will be directly Web-based. A growing number will be provided through backend services accessed by CGI interfaces. A greater variety of helper applications will also be developed. Developers should ensure that end-user services can be delivered to common web browsers.

It is not suggested that the Web will replace more specialist applications such as WHOIS++, SQL or Z39.50, rather that for some interim period users will interact with distributed services through a forms interface and have results returned for processing by their web browser. Effectively what is happening is that the web is providing added value terminal and file transfer facilities.

Z39.50

Z39.50 is an information retrieval protocol, which originated in the bibliographic world. The protocol has facilities for managing queries and returning results. It also includes a mechanism for switching between query languages. This technology is strategically important for at least two reasons:

End-users cannot effectively use endlessly proliferating different user interfaces. Z39.50 does not prescribe a standard user interface; it offers a standard way for a particular user interface to communicate with servers. The interface may be implemented in association with a standalone client, or may be part of an existing product, such as an OPAC. This is the type of application that is currently doing most to drive Z39.50 development. Of course many of these services will be delivered to users through HTTP-Z39.50 gateways.
It will support communication between applications. Examples of such links are between a union catalogue and circulation systems to determine availability, or between a search system and local holdings files, to determine location. Applications like this are not yet common but will become more important.

Through the use of various structured formats for searching and exchange, Z39.50 applications have an understanding of the semantics of the data they handle making them potentially quite powerful.

Z39.50 is the preferred protocol for making bibliographic servers available on the network, and for constructing the inter-working applications discussed above. eLib projects should implement Z39.50 version 3 where possible, and are encouraged to work together to ensure that their implementations are interoperable. It is recognised that the deployment of Z39.50 is at very early stages in the UK and FIGIT will consider how to support interoperability and further implementation as it begins to be used more widely in eLib projects.

Z39.50 is an ANSI standard. A parallel ISO standard exists in Search and Retrieve (ISO 10162/3). This has not been widely deployed and its use is not encouraged in eLib projects.

Document request services

Items will be requested from several sources: libraries, document suppliers, publishers, and other emerging providers. It would already be useful if there were a standard way of communicating requests, however they originate. This will especially be the case in some of the service scenarios proposed within eLib projects, where requests may need to be sent to several potential sources before they are satisfied.

However, on inspection, it is clear that the request is only one part of a whole process, which will require a range of transactions. The range of these transactions and the extent to which they mesh with commercial transactions will vary from project to project.

The ILL (Interlibrary Loan) protocol (ISO 10160/1) was developed to handle the multiple linked transactions involved in document request operations involving several parties. It is conceptually similar to EDI agreements and includes provision for: definition of required data elements, definition of a set of messages and their relationships, and a syntax for structuring the messages.

The ILL protocol seems to have much to offer requesting services, especially as these become more distributed. The system to system communication of structured messages allows a greater range of operations to be automated, including procedures for tracking, recalls and so on which are currently manual or mixed. Its use in interactive services for the request of documents requires further investigation.

However, the ILL protocol has not been widely implemented. There is now renewed interest in it as distributed applications begin to be built.

An alternative approach has been proposed through the Z39.50 item order extended service, for one to one operation. Several US and European projects and services are looking at the implementation of this approach, but again no production services exist at the time of writing.

Several systems have also implemented simpler e-mail based request systems.

Another possible solution might be EDI, using internationally established standard transaction messages that have been tailored for the book and serials sector by Book Industry Communication (BIC) in the UK or by EDItEUR, the European Book Sector EDI Group managed by BIC. The EDItEUR messages, based on EDIFACT, are being developed in close collaboration with BISAC and SISAC in the United States, who plan to migrate to them in 1997. The advantages of EDI include its wide usage and support and its ability to deal with the whole supply chain, including reconciliation of invoices. EDI is being implemented by library system vendors and a range of suppliers.

This is an area where FIGIT plans some exploratory work.

There should be a concerted approach between projects with document request components to ensure inter-working of request management systems.

Directory services

These are a special case of search and retrieve services which allow one to manage distributed collections of metadata. They allow users to search for resources (people, network resources, etc.) by some combination of attributes.

A number of dedicated White and Yellow Pages type directory service protocols are in use on the Internet, such as CCSO nameserver, WHOIS, RWHOIS, WHOIS++, finger, SOLO, and LDAP. No clear favourite has emerged yet, but much work is being done - particularly within the IETF.

WHOIS++ is being used by some of the access to network resources services within the eLib framework. The Harvest system provides a framework for experiment with communicating servers, which distribute indexing and collection in various ways.

It is not appropriate (yet) to make any recommendations in this area. WHOIS++ is being used by those eLib access to network resources services which are using the ROADS software.

Security, authentication and payment services

Security overview

A simple summary of the state of security on the Internet at the current time is that it is inadequate, and that systems designed to improve the situation are immature and in general untried. A variety of systems for doing commerce over the Internet do exist, but are not generally widely implemented. This section discusses some of the background involved.

A commercial transaction

Imagine you are in a large company, bidding for a multi-million pound contract, where the bid has to be sent by email. In view of the possibility of further litigation, and to protect your bid, you and the bid's recipient will need to be confident of several things:

no other bidder (i.e. no-one else) can intercept and read your bid; only the intended recipient can read it
the version read is exactly the version sent, i.e. it has not been tampered with in transit
the recipient can be certain it comes from you
you cannot later deny you sent the bid
the recipient cannot later deny the bid was received
the times of various parts of the transaction can be reliably certified
these facts can be proved in a court.

You may also wish to ensure that another bidder cannot tell whether or not you have made the bid, by watching email traffic.

This little scenario contains many of the important issues in email security. Standard email provides no confidence in any of these issues, as with relatively little ingenuity, everything can be intercepted, faked, re-played, or denied.

One quite often sees services offered on the Internet asking you to quote your credit card numbers in the clear. You should think carefully before doing this, as it is potentially very insecure (although not necessarily much less secure than many other aspects of the credit card system, such as telephone use, touch-tone phone use in hotels, etc.).

Not all of the issues raised by the scenario above are required all of the time. Often, one only wants a simple minimum of protection from casual prying eyes. And systems designed to cope with the case described (where high costs including sending the transaction via specially equipped notary systems can be justified) are likely to be very different from those designed to cope with micro-transactions, where small fractions of a penny might be charged for accessing some information, and the transaction cost must be even more minute.

Government controls

While good network citizens may be very concerned that the Internet is insecure, governments have different concerns. Most of the suggested improvements to Internet security are highly secure, much more than is needed by the average citizen, but perfect for those who might wish to hide illegal activities, or even transmit espionage. Consequently, many governments have classified encryption as a weapons technology, and apply controls. The most painfully obvious of these are the US export controls, which are much more severe and far-reaching than is generally realised.

It is not legal to export software products containing strong encryption technology from the US without a licence, usually granted only for systems constrained in some way so that they cannot be used for other than limited purposes: authentication is permitted, for example, but encryption is not. Nevertheless, legal versions of most cryptographic systems can be obtained from sources outside the US.

France also has severe controls, and any encryption without a licence is prohibited.

Symmetric and public key encryption

In symmetric encryption, the same key is used to encrypt and decrypt the information. This means that the key is a shared secret, and the information is compromised if either end is compromised; the key must also be transported by some other means. You need a different key for each pair of correspondents, as well, or the likelihood of the secret becoming known increases greatly. Shared password systems are not secure!
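Two points in this paragraph can be illustrated with a toy example: the same key both encrypts and decrypts, and the number of keys grows quickly if every pair of correspondents needs its own. The XOR scheme below is an assumption chosen purely for brevity (a random key as long as the message, used once); it is not a recommendation of any particular cipher.

```python
import secrets


def xor_with_key(data, key):
    """Combine data with a key byte by byte; applying it twice restores the data."""
    if len(key) < len(data):
        raise ValueError("key must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


def pairwise_keys(correspondents):
    """Number of distinct shared keys if every pair has its own: n(n-1)/2."""
    return correspondents * (correspondents - 1) // 2


message = b"meet at noon"
key = secrets.token_bytes(len(message))   # the shared secret
ciphertext = xor_with_key(message, key)   # sender encrypts
recovered = xor_with_key(ciphertext, key) # recipient decrypts with the same key

keys_for_ten = pairwise_keys(10)          # 45
keys_for_hundred = pairwise_keys(100)     # 4950
```

Anyone who obtains the key at either end can read the traffic, and the key itself still has to reach the other party by some trusted route.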

Public key, or asymmetric systems work very differently. In such a system, I have a private key (usually for de-crypting) and a public key (usually for encrypting). The normal way they are used is as follows: to send a secure message to me, you encrypt it with my public key (which is widely available); it can only be de-crypted with my private key, which I keep very secure. The most widely used public key cryptosystem is called RSA, after the initials of its authors (Rivest, Shamir and Adleman); the RSA algorithms are protected by patent in the US but not elsewhere in the world.
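The arithmetic behind RSA can be shown with deliberately tiny numbers. This is a textbook sketch only: the primes 61 and 53 and the exponent 17 are illustrative assumptions, real keys are hundreds of digits long, and practical systems add padding that is omitted here. The three-argument pow with a negative exponent requires Python 3.8 or later.

```python
# Toy key generation with very small primes (illustration only).
p, q = 61, 53
n = p * q                    # modulus, part of both keys: 3233
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # public exponent, chosen coprime to phi
d = pow(e, -1, phi)          # private exponent: inverse of e modulo phi (2753)

PUBLIC_KEY = (e, n)          # may be published widely
PRIVATE_KEY = (d, n)         # kept secret by the owner


def encrypt(message_number, public_key=PUBLIC_KEY):
    """Anyone can encrypt a number smaller than n using the public key."""
    exponent, modulus = public_key
    return pow(message_number, exponent, modulus)


def decrypt(cipher_number, private_key=PRIVATE_KEY):
    """Only the holder of the private key can recover the number."""
    exponent, modulus = private_key
    return pow(cipher_number, exponent, modulus)


cipher = encrypt(65)         # 2790
plain = decrypt(cipher)      # 65
```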

Authentication

Usernames

The standard means of authentication in computer systems has been the username/password combination. To identify yourself to a shared computer such as a UNIX system or a Novell Netware server, you provide a username you have been assigned, and a password which (usually) is known only to you. The system checks the password against a database it holds; usually the password is encrypted in the database (in the case of UNIX, the encryption is now quite weak, with the huge advances in computing performance; it cannot easily be decrypted, but dictionary attacks, where huge numbers of passwords from a dictionary are encrypted, are now easy and quite common).

This username/password scheme provides relatively little protection in most cases. Computer systems are highly vulnerable to hacker attack, and passwords can often be quite easily cracked. Passwords travel over the networks in clear text form. Terminal sessions are often abandoned while still logged in, or temporarily empty offices (with logged in terminal sessions) can be entered.

While the protection offered by these schemes is usually enough for local use, it will often not be enough to convince peers at another site, who know nothing of the security procedures in place at your site. So it forms a completely inadequate basis for authentication.

One-time password systems, such as Bellcore's S/Key (being standardised by the IETF OTP working group), provide much better protection. This is because each password may only be used once - in order to authenticate yourself a second time, you will need a new password. The one-time passwords are typically generated by dedicated hardware or password calculator programs, which are available for most Macs, PCs and Unix boxes.
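The idea behind a hash-chain one-time password scheme of this kind can be sketched as follows. This is not the S/Key algorithm itself: SHA-256 is assumed here in place of the hash S/Key uses, and the seed handling, output folding and word encoding of the real system are omitted. The secret and chain length are hypothetical.

```python
import hashlib


def step(value):
    """One application of the one-way function."""
    return hashlib.sha256(value).digest()


def chain(secret, count):
    """Apply the one-way function `count` times to the secret."""
    value = secret
    for _ in range(count):
        value = step(value)
    return value


class Server:
    """Holds only the last accepted value, never the user's secret."""

    def __init__(self, initial_value):
        self.stored = initial_value

    def verify(self, candidate):
        # A valid password hashes to the stored value; it then replaces it,
        # so the same password cannot be accepted a second time.
        if step(candidate) == self.stored:
            self.stored = candidate
            return True
        return False


secret = b"hypothetical user secret"
server = Server(chain(secret, 100))      # server is initialised with H^100(secret)

first = server.verify(chain(secret, 99))   # accepted
replay = server.verify(chain(secret, 99))  # rejected: already used
second = server.verify(chain(secret, 98))  # the next password in the sequence
```

An eavesdropper who captures a password learns only a value the server will no longer accept, since working backwards along the chain would mean inverting the one-way function.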

Whilst one-time password systems provide protection against passwords being grabbed during the initial authentication phase for protocols such as Telnet and FTP, they do not protect against network traffic monitoring - which may yield further passwords and sensitive information. There has been some deployment of network protocols which use strong encryption to protect against this danger, notably ssh - which attempts to provide a secure mechanism for remote login, remote command execution and file transfer.

These days, most people use personal computers as their primary interface, without any form of authentication other than access. Many but not all do have some level of username/password authentication through systems such as Novell Netware.

Kerberos

Project Athena at MIT tackled the problem of authentication over an insecure network many years ago, very successfully, with the Kerberos system. Unfortunately, the system has not been widely adopted, partly because it was applicable only to users of shared computers, and did not support personal computers. An extended version of Kerberos is capable of operation beyond the confines of one campus or institution, but does not appear likely to be more successful than the older version, despite being included in the OSF Distributed Computing Environment (DCE) specifications.

Digital signatures

Digital signatures provide a way that you can be sure a piece of information created by someone has not been altered. They depend on public key encryption.

To form a digital signature, I construct a digest from the information to be signed. This digest is a short piece of information derived from the full information by an algorithm; an example is a checksum, although more complex algorithms such as MD5 are usually used. Then I encrypt this digest with my private key, and append it to the full information, with a specification of the digest algorithm used and information about my public key. Now anyone can decrypt this signature with my public key, create a digest from the full information and compare it with the digest from the signature. If they match, the full information has not been altered since it was signed by me.
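The steps just described can be sketched with a toy key. Several assumptions are made for brevity: the RSA key is far too small for real use, SHA-256 stands in for MD5 as the digest algorithm, and the digest is reduced modulo the toy modulus so that it fits. None of this reflects how a production signature system is built.

```python
import hashlib

# Toy RSA key pair (illustration only; real keys are vastly larger).
N, E, D = 3233, 17, 2753     # modulus, public exponent, private exponent


def digest(message):
    """Short value derived from the full message (reduced to fit the toy key)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message):
    """Signer: encrypt the digest with the private key."""
    return pow(digest(message), D, N)


def verify(message, signature):
    """Anyone: decrypt the signature with the public key and compare digests."""
    return pow(signature, E, N) == digest(message)


bid = b"We bid 1,000,000 pounds for the contract."
signature = sign(bid)

genuine = verify(bid, signature)               # True
forged = verify(bid, (signature + 1) % N)      # False: signature does not match
```

With a realistic digest and key size, any change to the message would alter the digest and the comparison would fail, which is what gives the recipient confidence that the information has not been altered.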

Certificates

A certificate is a signed piece of structured information identifying the owner of a public key. If you trust the signer of the certificate and you trust the public key owner to keep the private key secure, then you can trust that something signed with the private key corresponding to the public key given in the certificate actually was signed by that person.

Quite complicated systems may have to be built so that you can be sure you can trust the person who signed the certificate, and further complicated systems (certificate revocation lists, or CRLs) are needed to cover the case where the private key has been compromised, e.g. copied by someone from an insecure computer system! Some of these systems are described in RFC 1422, which covers a hierarchical system based on institutional Certification Authorities (CAs) which would provide certificates for individuals; CAs in turn are certified by Policy CAs, which are certified by the Internet PCA Registration Authority. There have been some delays in setting this up because of concerns about legal liability.
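The hierarchy and revocation checks described above amount to walking from a certificate up through its issuers until a trusted authority is reached. The sketch below shows only that walk; checking the signature on each certificate is deliberately omitted, and the names, the data layout and the single revocation set are hypothetical simplifications rather than anything specified in RFC 1422.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    subject: str   # whose public key this certificate vouches for
    issuer: str    # who signed it


def chain_is_trusted(subject, certificates, trusted_roots, revoked=frozenset()):
    """Follow issuers upward from `subject` until a trusted root is reached.

    Returns False if a certificate is missing, revoked, or the chain loops.
    Signature verification on each certificate is NOT performed here.
    """
    seen = set()
    current = subject
    while current not in trusted_roots:
        if current in seen or current in revoked or current not in certificates:
            return False
        seen.add(current)
        current = certificates[current].issuer
    return current not in revoked


# Hypothetical hierarchy: individual -> institutional CA -> policy CA -> root.
CERTS = {
    "alice": Certificate("alice", "University CA"),
    "University CA": Certificate("University CA", "Policy CA"),
    "Policy CA": Certificate("Policy CA", "Registration Authority"),
}
ROOTS = {"Registration Authority"}

ok = chain_is_trusted("alice", CERTS, ROOTS)
after_revocation = chain_is_trusted("alice", CERTS, ROOTS, revoked={"University CA"})
unknown = chain_is_trusted("mallory", CERTS, ROOTS)
```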

The generally agreed mechanism for distributing certificates has been the X.500 directory system (and certificates are usually formatted according to X.509, with particular options selected, see RFC 1422). Although certificates may be distributed by almost any means, including in the email message itself, a factor in the low penetration of secure systems has been the low acceptance of X.500. It is not certain whether current systems will evolve to a high degree of acceptance.

eLib projects cannot determine their own authentication structure; it should be a national (or at least a JANET) activity. The Advisory Committee on Networking (ACN) has decided to do a study on national requirements for authentication technology.

Security systems

Email

Normal Internet mail is fundamentally insecure, and should be treated as such. It is easily spoofed, rather less easily intercepted, and liable to end up on unexpected screens if, for example, mis-directed (copies often sent to postmaster).

PGP

PGP, although its use by individuals is growing rapidly, is not yet an Internet standard and the author was for some time under threat of indictment on a charge of violating US export controls. Recent indications are that the latter threat has now been lifted. Implementations of PGP which are legal in most non-US countries, avoiding the use of the patented RSA libraries and making use of code created outside the US, are available for most common computing platforms.

PGP offers good encryption capabilities, but weak authentication. Certificates can be signed by anyone; the idea is to create a 'web of trust'. If I trust A, who has signed a certificate for B, in theory I can trust B and even C who has a certificate signed by B. This is adequate for personal privacy in most cases (normal email is based on much less trustworthy systems), but not adequate for commercial transactions, at least of any size. It would generally be adequate for transmitting credit card information (but ensure date/time and other transaction details are included to avoid a replay attack: just encrypting 'my credit card no is xxx' could be dangerous!).

In its simplest use, PGP-encrypted messages are normally sent as ASCII text via SMTP mail. The use of PGP in conjunction with MIME is discussed below under MOSS.

PGP is likely to be adopted by the UK HE community; there are reports on a pilot programme available at URL:http://tech.ukerna.ac.uk/pgp/secemail/.

PEM

Privacy-enhanced mail (PEM) has been under development for many years. It has now reached the status of an Internet standard, which includes both encryption and an authentication structure of certificates. Few implementations exist, and very few are commercially available outside the US. The authentication structure is based on a hierarchy of Certification Authorities (CAs), rooted at the Internet CA and with intermediate Policy CAs. There have been legal problems in setting up the ICA, and few PCAs exist.

PEM is only suitable for standard Internet mail text messages, and is not compatible with MIME encoding. This partly explains its low penetration, as mailer development has moved towards MIME with its ability to transmit non-text messages.

The PEM RFCs are:

Privacy Enhancement for Internet Electronic Mail: Part I: Message Encryption and Authentication Procedures RFC 1421
Privacy Enhancement for Internet Electronic Mail: Part II: Certificate-Based Key Management RFC 1422
Privacy Enhancement for Internet Electronic Mail: Part III: Algorithms, Modes, and Identifiers RFC 1423
Privacy Enhancement for Internet Electronic Mail: Part IV: Key Certification and Related Services RFC 1424

Full PEM should not be used by eLib projects as such because of the previous recommendations concerning PGP.

MOSS

MIME Object Security Services (MOSS) provides a generalised framework for encrypting and/or authenticating body parts of MIME messages, including non-text body and multipart MIME messages. RFC 1848 specifies MOSS and is on track to become an Internet standard. RFC 1847 specifies the multipart/signed and multipart/encrypted MIME types. MOSS can in principle be implemented for any encryption technology: the best approach to doing this using PGP is the subject of two current drafts, <draft-elkins-pem-pgp-02.txt> and <draft-kazu-pgp-mime-00.txt>, of which the work by Michael Elkins seems currently to be the preferred approach. This is an area of active development which will be monitored by FIGIT.

Web

The security of the web system is weak; the security capabilities built in to the first versions of the HTTP protocol were inadequate and rarely implemented. The stateless nature of web protocols makes it difficult to implement security, which should be cheap enough to re-check on every request.

It is possible to extend web protocols to create pseudo-sessions, by providing an identifier after an initial authentication transaction. This identifier would associate later transactions with the earlier authentication. This is adequate for many purposes such as access to BIDS, for example. Unfortunately it does interfere with caching protocols.
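A pseudo-session of this kind can be sketched as a table of identifiers held by the server. The account store, the function names and the response strings below are hypothetical, and the plain-text password comparison is for brevity only; how the identifier is carried between browser and server (for example in the URL or a form field) is left open.

```python
import secrets

USERS = {"alice": "correct horse"}   # hypothetical account store (illustration only)
SESSIONS = {}                        # identifier -> authenticated username


def login(username, password):
    """Initial authentication transaction: returns an identifier on success."""
    if USERS.get(username) != password:
        return None
    session_id = secrets.token_urlsafe(16)   # unpredictable identifier
    SESSIONS[session_id] = username
    return session_id


def handle_request(path, session_id):
    """Later, otherwise stateless, requests are tied back to the login."""
    user = SESSIONS.get(session_id)
    if user is None:
        return "401 please log in"
    return f"200 {path} for {user}"


sid = login("alice", "correct horse")
accepted = handle_request("/search", sid)
rejected = handle_request("/search", "not-a-real-identifier")
```

If the identifier is carried in the URL, every user's request for the same page looks different to an intermediate cache, which is one way the interference with caching noted above can arise.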

SSL

Netscape have devised a protocol called Secure Sockets Layer (SSL). This is not strictly a web protocol, having nothing to do with HTTP; it is a session layer between HTTP and TCP. It is not an Internet standard, but is being offered to the IETF standardisation process; it is implemented in the latest versions of the Netscape browsers and secure versions of the Netsite web servers.

SSL might eventually be redundant when security is built in to later versions of TCP/IP, as expected. Authentication is based on a public key certificate system.

S-HTTP

Secure HTTP is a developing proposal which now integrates another proposal called SHEN. It uses a certificate-based authentication structure, and does extend the basic web protocol. It is implemented by a few browsers, and Netscape have said they will support it. It is being used in the CommerceNet pilot in the US.

TCP/IP

Security was not a major concern in the development of the basic TCP/IP Internet protocol suite, but is being addressed in the development of the so-called IP next generation (IPng - officially referred to as IP version 6, cf. RFC 1883). Unfortunately, although the security features of IPng (see RFC 1825 and related documents) are mandatory, they are still subject to the US export restrictions. Development and deployment are likely to be protracted.

Payment systems

Given the fundamental insecurity of the Internet, it is perhaps surprising that so much Internet-based commerce does take place. Recently there has been a flurry of announcements of schemes for commerce over the Internet, ranging from the simple to the technologically very complex smart card systems. A good reference to the current state of play is an article in the June 1995 issue of Byte magazine.

At this time no recommendation is possible on payment systems to be used by eLib projects. It is desirable that a study be done to determine if such a recommendation can be made.

One concern is that current systems are widely diverse, and generally require both buyer and seller to use the same system. There is the possibility that each buyer may have to use many payment systems in order to do transactions with many different sellers, unless schemes can be devised to provide clearance of funds between payment systems.

Clifford Neuman suggests "Recently proposed, announced and implemented Internet payment systems can be grouped into three broad classes: electronic currency systems, credit-debit systems, and systems supporting secure presentation of credit card numbers."

In the first group he identifies systems such as Digicash and NetCash; in the second, NetBill, First Virtual and NetCheque. Visa/Microsoft and Mastercard/Netscape are working on different systems in the third group.

First Virtual is interesting as it requires no encryption, is moderately cheap, and is operating now. It is very much a trust-based system, somewhat akin to shareware, but might prove a good start for reasonable-sized transactions.

Selective summary

This section brings together the main standards and recommendations of this report.

Data communications services

Relevant standards: TCP/IP, telnet, FTP, SMTP/MIME, Internet media types, X.400 (1988)

Comments: In general eLib projects will work in an Internet environment and use Internet applications. Any project with a particular reason to opt for X.400 or non-standard encodings such as uuencode must address the problem of gatewaying to SMTP/MIME mail.

User interface

Comments: It is anticipated that end-user access to services will typically be through the Web. Public eLib services should be deliverable to common Web browsers.

Search, request and retrieve services

Relevant standards for search and retrieve: Z39.50

Comments: Production Z39.50 systems are not yet in place in the UK. Projects are encouraged to use Z39.50 v.3 to provide access to bibliographic databases.

Consensus: eLib implementors of Z39.50 should collaborate to ensure interworking.

Relevant standards for document requesting: Z39.50 item order, ILL, HTTP, EDIFACT.

Comments: Standard document requesting systems are still not in place. FIGIT will work with services through the MODELS project to explore preferred future directions.

Metadata

Relevant standards for resource description: US-MARC, IAFA, TEI headers

Comments: This is an area in which there is still much research and development and where it is premature to suggest one preferred approach.

Consensus: The eLib Access to Network Resources projects discuss issues of common concern through COUSNS (Committee of UK Subject-based Network Services). There is a discussion list for this: cousns@mailbase.ac.uk (a mailbase list).

Relevant standards for resource identification: URL, (URN), SICI

Comments: eLib projects are expected to provide a URL for public services, and a URN when these become stable. URLs for the eLib projects will preferably be in the eLib domain.

Further work: eLib will benefit from the existence of persistent identifiers of information objects, print or electronic. Investigation of the deployment of URNs, PURLs and SICIs will be taken forward.

Data interchange

Relevant standards for vector graphics: CGM

Relevant standards for raster graphics: JPEG, TIFF, GIF, PNG, Group 4 fax

Relevant standards for page description: PostScript Level 2, PDF

Relevant standards for structured documents: SGML, HTML

Comments: Vendor specific HTML extensions should not be used.

Consensus: FIGIT will work with projects on the adoption of common document type definitions.

Relevant standards for moving images/3-D: MPEG, VRML

Relevant standards for electronic delivery: Projects should consider the GEDI proposals.

Consensus: The JEDDS project is a joint initiative between eLib and several Australian and New Zealand organisations which is exploring a standards based integrated solution.

Security, authentication and payment services

Comments: These are still immature technologies and few definite recommendations can be made.

Relevant standards for e-mail privacy: PGP

Further work: eLib projects cannot determine their own authentication structure; it should be a national (or at least a JANET) activity. The Advisory Committee on Networking (ACN) has decided to do a study on national requirements for authentication technology. UKERNA is investigating the use of PGP on JANET. Issues here are of generic interest and FIGIT will work with the ACN and other relevant bodies to move things forward.

Acknowledgements

The authors gratefully acknowledge the helpful comments made on earlier versions by members of FIGIT, the JISC Standards Working Group, and by Martin Hamilton and Jon Knight of Loughborough University of Technology. The helpful advice of Bill Tuck, independent consultant, and Brian Green, BIC, is also acknowledged. Kelly Russell provided valuable assistance in preparing the final version.

Annex A

Compression

Compression is carried out to reduce the volume of data being transferred or stored. This is particularly relevant to image data (be this an image of a picture or a raster image of text) which can be very inefficient and time consuming to transfer across networks.

Compressing and decompressing graphical images is currently a topic of considerable interest. There have been a number of advances in algorithms for compression, motivated by the increased use of networks and the need to transfer image data at speed. The commercial success of fax compression (for bi-tonal images) suggests that grey scale and colour compression techniques could also become quite important. The potential for transferring video information economically without loss of viewing definition, and the requirement for transferring high definition TV images, have added to the pressure to develop standards for compression and decompression.

There are two basic types of compression worth noting here, known as 'lossy' and 'lossless'. Lossless compression reduces the size of the file while allowing the original image to be reconstructed exactly on delivery. Lossy compression means that some information in the image is lost. The loss may not be apparent to the viewer, as many techniques are based on the human visual system and sacrifice those parts of the picture which the human eye cannot see at a particular resolution. This is fine if the only purpose is to view the picture. Within the context of the eLib programme it may be important to store the uncompressed image within an archive and to distinguish that from the delivery of a compressed image. If the archive copy is a lossily compressed image, the loss of fidelity may limit future uses of the information.

The above methods can be combined with further general-purpose compression of the files, for example using tools such as zip and gzip. However, this usually has little effect, because data which has already been compressed contains little remaining redundancy.
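The limited benefit of compressing a file a second time can be illustrated with a minimal sketch. It uses Python's standard zlib module (the same DEFLATE method used by gzip); the word list and sample data are invented purely for illustration and are not taken from any eLib project.

```python
import random
import zlib

# Hypothetical sample data: pseudo-random text drawn from a small vocabulary.
rng = random.Random(0)
words = [b"image", b"raster", b"pixel", b"colour",
         b"archive", b"network", b"file", b"standard"]
data = b" ".join(rng.choice(words) for _ in range(20000))

once = zlib.compress(data)    # first pass removes most of the redundancy
twice = zlib.compress(once)   # second pass has very little left to remove

print(len(data), len(once), len(twice))
```

On data of this kind the first pass shrinks the file considerably, while the second pass leaves the size almost unchanged and may even increase it slightly.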

Compression Methods

Raster files tend to be very large. This is particularly true if the image is stored in colour. As a result, various compression techniques have emerged which allow the image to be reduced in size. As discussed above, this may result in some loss of information ('lossy' compression). Some techniques which are in wide use are described below.

Run-length encoding

Run-length encoding achieves compression by storing a count followed by the value which recurs. For example, rather than replicating a red pixel 20 times, the value for red is stored once together with a count of 20. This scheme is typically used for storing image data which contains long runs of the same colour.
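The idea can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the use of (count, value) pairs are assumptions made here, and real image formats define their own packed byte layouts.

```python
from itertools import groupby

def rle_encode(pixels):
    """Collapse each run of identical values into a (count, value) pair."""
    return [(len(list(run)), value) for value, run in groupby(pixels)]

def rle_decode(pairs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in pairs:
        out.extend([value] * count)
    return out

row = ["red"] * 20 + ["blue"] * 3 + ["red"] * 2
encoded = rle_encode(row)
print(encoded)  # [(20, 'red'), (3, 'blue'), (2, 'red')]
assert rle_decode(encoded) == row
```

Note that a row with no repeated neighbours would grow rather than shrink under this scheme, which is why it suits images with long runs.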

Huffman encoding

Huffman codes use the probability of occurrence of a value as the basis for assigning shorter codes to more frequent values. The codes are defined in advance in the original scheme, although Gallager has proposed an adaptive Huffman coding scheme. The Computer Graphics Metafile standard uses Huffman codes within the character encoding related to the incremental coding for point lists.
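A minimal sketch of how such codes can be derived is given below. It assumes that the frequencies are counted from the data itself, whereas standards such as CCITT fax use tables fixed in advance; the function name and the sample string are hypothetical.

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Return a dict mapping each symbol in data to a bit string."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups, prefixing 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("aaaaaaaabbbccd")
for symbol, code in sorted(codes.items()):
    print(symbol, code)
```

In this sample the most frequent symbol receives a one-bit code and the rarest symbols receive three-bit codes, and no code is a prefix of another, so the bit stream can be decoded unambiguously.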

LZW Compression

The first version of this encoding method was proposed by Lempel and Ziv. It takes advantage of the potential for the repetition of string information within the data; this makes it a useful method for encoding image data. The scheme works by building a table of potential string values together with a knowledge of the permissible word length of the file. Each string is considered in turn by the 'compressor', which extends the string by one value on each iteration and looks to see if the extended string occurs in the table. If it does not, the code for the longest string already in the table is stored in the file and the extended string is added to the table.
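The table-building process can be sketched as follows. This is a simplified illustration, not the variant used by any particular file format: it assumes an initial table of the 256 single-byte values, lets the table grow without limit, and returns a list of integer codes rather than packing them into a fixed or variable word length.

```python
def lzw_compress(data):
    """Compress a bytes object into a list of integer codes."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate            # keep extending the string
        else:
            codes.append(table[current])   # emit longest known string
            table[candidate] = len(table)  # learn the extended string
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes

def lzw_decompress(codes):
    """Rebuild the bytes from codes, recreating the same table."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code refers to the string that is about to be added.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)

sample = b"TOBEORNOTTOBEORTOBEORNOT"
packed = lzw_compress(sample)
print(len(sample), "bytes ->", len(packed), "codes")
assert lzw_decompress(packed) == sample
```

The decompressor never needs the table to be transmitted, because it can rebuild it from the codes in the same order as the compressor created it.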

There has been some recent discussion regarding this algorithm and whether it is covered by a Unisys Corporation patent (US Patent 4558302). This was originally with regard to the potential inclusion of LZW compression in the CGM and Posix standards and subsequently regarding the use of the method within GIF files being distributed via the web. This compression technique was removed from the extensions to the CGM standard in 1992 for this reason. The resolution of this discussion is that commercial software and shareware will require a licence to use this compression technique. Freeware and decompression software used in viewers will not require a licence.

CCITT Fax

CCITT have developed a series of compression mechanisms, used by fax machines, for transmitting black and white images across telephone lines. The standards are officially known as CCITT Recommendations T.4 and T.6 but are more commonly known as Group 3 and Group 4 compression respectively. CCITT have adopted a formulation of the Huffman coding scheme as their basis and, as most people see in their day-to-day business, it does afford an acceptable level of reproduction of the original image. Group 4 is more efficient and is adopted within many document exchange applications. CALS, for example, uses Group 4 for document image exchange. Unlike Group 4, Group 3 includes synchronisation information which enables error detection, and it is used in most fax machines.

JPEG

The need for standards which can represent grey scale and colour images in an economical and efficient way was acknowledged by the setting up of the Joint Photographic Experts Group (JPEG). This is a joint committee of ISO and CCITT. The JPEG standard is an ongoing project within ISO and CCITT as more efficient algorithms are sought for future versions of the standard.

The JPEG standard offers both 'lossless' and 'lossy' compression, although we will confine ourselves here to the lossy compression which comprises most of its use. The lossy compression is designed to take advantage of weaknesses in the human visual system and to store pictures which are designed to be viewed by humans (as opposed to machines). The degree of loss of data can be controlled by choice of parameters which are defined in the JPEG specification.

The JPEG compression method takes an image and subdivides it into blocks of 8 by 8 pixels. Each block has a discrete cosine transformation applied to it, which resolves the block into two-dimensional cosine functions. The result is that the 64 pixel values in one block are transformed into 64 coefficients of these cosine functions. The compression works on the basis that not all 64 two-dimensional cosine functions are equally important in the reconstruction of the pixel block. The human eye is unable to differentiate at the pixel level when a whole image is being viewed, i.e. the coefficients are a more accurate representation than can be assimilated visually. On average it has been shown that only about 5 of the coefficients (out of 64) are actually needed to give an apparently exact replication of the original picture. If a slight degradation of the image is permitted then this can be reduced further. The reduction in the coefficients is known as 'quantization'. The values then have to be encoded, and the standard requires that a Huffman encoding is available in all systems. The application is responsible for controlling the quantization and which Huffman tables are used.
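The transform and quantization steps can be illustrated with a small sketch. It is not a JPEG implementation: it assumes a single uniform quantization step of 16 instead of the per-coefficient quantization tables of the standard, omits the Huffman encoding stage, and uses an invented smooth 8 by 8 block as test data.

```python
import math

N = 8  # JPEG block size

def _scale(k):
    """Normalisation factor for an orthonormal DCT."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def _basis(pos, freq):
    """One-dimensional cosine basis function sampled at pos."""
    return math.cos((2 * pos + 1) * freq * math.pi / (2 * N))

def dct_2d(block):
    """Transform 64 pixel values into 64 cosine coefficients."""
    return [[_scale(u) * _scale(v) * sum(
                block[x][y] * _basis(x, u) * _basis(y, v)
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct_2d(coeffs):
    """Inverse transform: rebuild pixel values from coefficients."""
    return [[sum(_scale(u) * _scale(v) * coeffs[u][v]
                 * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

def quantize(coeffs, step):
    """Round each coefficient to a multiple of step (the lossy part)."""
    return [[round(c / step) for c in row] for row in coeffs]

def dequantize(levels, step):
    return [[q * step for q in row] for row in levels]

# Hypothetical smooth block: brightness rises gently across the block.
block = [[100 + 4 * x + 2 * y for y in range(N)] for x in range(N)]
coeffs = dct_2d(block)
levels = quantize(coeffs, 16)
kept = sum(1 for row in levels for q in row if q != 0)
restored = idct_2d(dequantize(levels, 16))
max_error = max(abs(restored[x][y] - block[x][y])
                for x in range(N) for y in range(N))
print(kept, "of 64 coefficients are non-zero; max pixel error",
      round(max_error, 2))
```

For a smooth block such as this one, only a handful of quantized coefficients are non-zero, yet the restored pixels differ from the originals by only a few grey levels.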

Fractal Compression

Fractal compression methodology has been developed for both still and moving pictures by Michael Barnsley, and the technology is marketed through Iterated Systems. This proprietary format is widely used within packages. It has the advantage of allowing pictures to be zoomed without the loss of resolution that occurs with other formats.

Quadtrees

An image can be subdivided into a number of distinct areas where each area represents a single colour (or grey scale value). Each area is recursively divided into sub-areas until the division is complete (each sub-area representing a single pixel value) or until a predefined level of division is achieved. The values associated with the sub-areas are then stored rather than individual pixel values.
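The recursive subdivision can be sketched as below. The sketch makes several assumptions for illustration: the image is square with a side that is a power of two, pixel values are plain integers, a uniform area is stored as its single value, and a divided area is stored as a list of four sub-areas in the order top-left, top-right, bottom-left, bottom-right. It always divides down to uniform areas and does not implement a predefined limit on the level of division.

```python
def build_quadtree(image, x=0, y=0, size=None):
    """Return an int for a uniform area, or a list of four subtrees."""
    if size is None:
        size = len(image)
    first = image[y][x]
    uniform = all(image[r][c] == first
                  for r in range(y, y + size)
                  for c in range(x, x + size))
    if uniform:
        return first  # a single pixel is always uniform
    half = size // 2
    return [build_quadtree(image, x, y, half),                # top-left
            build_quadtree(image, x + half, y, half),         # top-right
            build_quadtree(image, x, y + half, half),         # bottom-left
            build_quadtree(image, x + half, y + half, half)]  # bottom-right

def count_leaves(node):
    """Number of values that have to be stored for this tree."""
    if isinstance(node, list):
        return sum(count_leaves(child) for child in node)
    return 1

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 1],
         [0, 0, 1, 1]]
tree = build_quadtree(image)
print(tree)                # [0, 1, 0, [0, 1, 1, 1]]
print(count_leaves(tree))  # 7 stored values instead of 16 pixels
```

Three of the four quadrants here are uniform and are each stored as one value; only the bottom-right quadrant has to be divided down to individual pixels.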

The Electronic Libraries Programme (eLib) was funded by the Joint Information Systems Committee (JISC)
Page version: 1;
Web page maintained by UKOLN Systems Team and hosted by UKOLN - feedback to systems@ukoln.ac.uk .
Site last revised on: Tuesday, 24-Nov-1998 14:21:19 UTC