A review of metadata: a survey of current resource description formats

Work Package 3 of Telematics for Research project DESIRE (RE 1004)

IAFA (Internet Anonymous FTP Archive) templates were designed by the IAFA working group of the IETF (Internet Engineering Task Force), and guidelines were published in the form of an Internet Draft in July 1995. (Peter Deutsch, Alan Emtage, Martijn Koster, M Stumpf. Publishing information on the Internet with anonymous FTP. Internet Draft. (working draft now expired). <URL:http://info.webcrawler.com/mak/projects/iafa/iafa.txt>). Template formats were drawn up for the various categories of information present on FTP archives: images, documents, sounds; services such as mailing lists and databases; as well as mailing list archives, Usenet archives, datasets and software packages. Bunyip are now leading development of a whois++ White Pages directory system, Digger, which uses whois++ templates, a variation on the IAFA templates.

Constituency of use

Much of the driving force behind the development of the templates came from private companies, in particular from Bunyip as part of their development of Internet navigational tools and directory services; and from Martijn Koster at Nexor as a personal initiative. The aim of the IAFA template designers was to construct a record format which could be used by FTP archive administrators to describe the various resources available from their own archives.

IAFA templates were designed to facilitate effective access to FTP (file transfer protocol) archives by describing the contents and services available from the archive. Over the last few years many organisations wanting to allow access to their data, whether documents, datasets, images or software, have made them available as archives accessed by anonymous FTP. The IAFA template format has now been developed for use with the whois++ protocol, chiefly at the instigation of Bunyip, who are developing directory service software conformant to this protocol.

The original intention was that each FTP site administrator would be responsible for ensuring that IAFA templates were available for each file on their archive. This information would be available to individuals visiting the archive and, if FTP archive sites followed a common set of indexing and cataloguing guidelines, software (such as Harvest) could pick up the records automatically. This is in fact happening in some implementations of the IAFA/whois++ templates, although in others records are being created centrally. The recently developed directory service software, whois++, allows search and retrieval of databases created in this way, and also offers the possibility of searching across multiple databases. (RFC 1835. P. Deutsch, R. Schoultz, P. Faltstrom, C. Weider. Architecture of the whois++ service. <URL:ftp://ds.internic.net/rfc/rfc1835.txt>). Experimental work is being done using the Common Indexing Protocol (CIP), which gathers together a 'centroid', or summary, from each of a number of databases to form an 'index server'. The index server contains an index of all unique attribute values contributed by the centroids, and searches can be referred from one index server to another by interlinking the servers in a mesh. (RFC 1914. P. Faltstrom, R. Schoultz, C. Weider. How to interact with a whois++ mesh. IETF Proposed standard protocol, February 1996).
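The centroid idea described above can be illustrated with a minimal sketch. It is not the CIP or whois++ wire format: records are modelled as plain Python dictionaries, the attribute name "Template-Type", the server names and the sample values are assumptions made for illustration, and whole attribute values are indexed rather than whatever tokenisation a real server might apply.

```python
from collections import defaultdict


def build_centroid(records):
    """Summarise one database as the unique values seen per (template type, attribute)."""
    centroid = defaultdict(set)
    for record in records:
        template = record["Template-Type"]  # assumed attribute name
        for attribute, value in record.items():
            if attribute != "Template-Type":
                centroid[(template, attribute)].add(value.lower())
    return centroid


def build_index(centroids):
    """Merge the centroids of several named servers into one lookup table."""
    index = defaultdict(set)
    for server, centroid in centroids.items():
        for (template, attribute), values in centroid.items():
            for value in values:
                index[(template, attribute, value)].add(server)
    return index


def route_query(index, template, attribute, value):
    """Return the servers whose centroid contains the requested value."""
    return sorted(index.get((template, attribute, value.lower()), set()))


# Two hypothetical databases, each held on its own server.
db_a = [{"Template-Type": "DOCUMENT", "Keywords": "economics"}]
db_b = [
    {"Template-Type": "DOCUMENT", "Keywords": "economics"},
    {"Template-Type": "DOCUMENT", "Keywords": "medicine"},
]
index = build_index({
    "server-a": build_centroid(db_a),
    "server-b": build_centroid(db_b),
})
```

An index server built this way only records which servers hold a value, so a query is referred onward to those servers rather than answered from the index itself.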

Supporters of IAFA templates have widened the original aim, and the intention now is to devise a record format simple enough to be generated by the wide variety of individuals and organisations involved with creating resources on the Internet, whether on web servers or FTP archives. The underlying philosophy is that it must be the information providers who create metadata records if indexing of the Internet is to be a viable proposition. Given the instability of network resources, the alternative of creating records centrally would be a high-cost option.

Ease of creation

The main advantage of the IAFA templates is that they are easy to create. IAFA templates are designed for use in a distributed system of record creation and storage, so the simplicity of the records has been an underlying criterion in the design. They have also been designed in relation to the objects they are trying to describe and are not hidebound by practices relating to non-electronic data.

Format issues

Designation

Resource types are identified by template type, and within each template type there are recommended attribute names to identify appropriate data elements. Template types which describe 'document like objects' (i.e. Document, Dataset, Mailing list archive, Usenet archive, Software package, Image, Video) all contain the same recommended attributes; other template types (Service, Mirror, Site configuration, Logical archives) contain their own specific attributes. The simplicity of the record structure is paramount: there is no allowance for identification of subfields, nor for 'qualifiers' to be attached to attributes.

Each record can only have one template type, but any of the other data elements can be repeated. It is intended that template types and data elements should be extensible, although extensions would not be inter-operable unless agreed between implementations.

Every time an individual or organisation occurs in a record, a number of common data elements are required to describe them, e.g. name, address, telephone number, e-mail address. These logically grouped data elements are termed clusters in the guidelines. The IAFA guidelines define the content of clusters for both individuals and organisations. Clusters can save indexing time because the details are created once and then referred to by a unique handle, although how the cluster information is incorporated into the record depends on the implementation. Further proposals to extend the use of clusters have been circulated by Bunyip as part of the development of more detailed White Pages templates for use with the whois++ protocol. These proposals suggest definitions of further clusters at a lower level for names, phone numbers and addresses. In addition it is proposed that all clusters would include record management details.

Each record and cluster within the database is identified by a string of characters and/or digits unique within the system on which it resides.
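One way an implementation might incorporate cluster details into a record by handle is sketched below. The guidelines leave this to the implementation, so everything here is an assumption: the "-Handle" suffix, the field names "Name" and "Email", and the handle "user-001" are hypothetical and chosen only to make the idea concrete.

```python
def resolve_cluster(record, prefix, clusters):
    """Return a copy of record with '<prefix>-Handle' expanded into cluster fields.

    If the handle is missing or unknown, the record is returned unchanged
    so that the reference is not lost.
    """
    handle_key = prefix + "-Handle"  # hypothetical naming convention
    handle = record.get(handle_key)
    if handle not in clusters:
        return dict(record)
    expanded = {key: value for key, value in record.items() if key != handle_key}
    for field, value in clusters[handle].items():
        expanded[prefix + "-" + field] = value
    return expanded


# Cluster details are entered once and stored under a handle unique to this system.
clusters = {
    "user-001": {"Name": "A. N. Example", "Email": "a.example@example.org"},
}
record = {
    "Template-Type": "DOCUMENT",
    "Title": "Example report",
    "Author-Handle": "user-001",
}
resolved = resolve_cluster(record, "Author", clusters)
```

Keeping the details in one place means a change of address is made once in the cluster rather than in every record that refers to it.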

Within the IAFA definition the repetition of attributes is achieved through the mechanism of variants. The first occurrence of an attribute is variant-1, the second variant-2. Related groups of attributes that are repeated are linked by the variant number, e.g.

class-v1
class-scheme-v1
class-v2
class-scheme-v2

Within the whois++ schema the order in which attributes are stored is significant and links are maintained in this way.
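The variant-number linking used in the IAFA definition can be sketched as a small grouping step. This is an illustration only: the "-vN" suffix follows the example above, while the attribute values are placeholders rather than real classification codes.

```python
import re
from collections import defaultdict

# Matches an attribute name ending in a variant suffix such as "-v1".
VARIANT = re.compile(r"^(?P<name>.+)-v(?P<number>\d+)$")


def group_variants(pairs):
    """Group (attribute, value) pairs by their variant number."""
    groups = defaultdict(dict)
    for attribute, value in pairs:
        match = VARIANT.match(attribute)
        if match:
            groups[int(match.group("number"))][match.group("name")] = value
    return dict(groups)


pairs = [
    ("class-v1", "first-class-value"),
    ("class-scheme-v1", "first-scheme"),
    ("class-v2", "second-class-value"),
    ("class-scheme-v2", "second-scheme"),
]
grouped = group_variants(pairs)
```

Because the link is carried in the attribute name, the grouping does not depend on the order of the lines, unlike the order-based approach described for the whois++ schema.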

Encoding

Records are held in simple ASCII text format. The syntax and semantics of data element names and values have been restricted to facilitate automated collection and indexing. Data elements are defined as attribute/value pairs and are of variable length. Attributes, record start and finish, and continuation lines are recognised by the structure of the text and by the insertion of defined 'special' characters. For example, continuation lines are signified by the first character being '+' or '-', and records are delimited by blank lines.
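A minimal sketch of reading records in this style follows. Several details are assumptions rather than statements from the guidelines: the attribute name and value are taken to be separated by the first colon, "Template-Type" is used as the attribute carrying the template type, and '+' and '-' continuation lines are treated alike and joined to the previous value with a single space.

```python
def parse_templates(text):
    """Split text into blank-line-delimited records of (attribute, value) pairs."""
    records = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the record in progress.
            if current:
                records.append(current)
                current = []
        elif line[0] in "+-" and current:
            # Continuation line: extend the value of the previous attribute.
            name, value = current[-1]
            current[-1] = (name, (value + " " + line[1:].strip()).strip())
        else:
            # Assumed separator: the first colon on the line.
            name, _, value = line.partition(":")
            current.append((name.strip(), value.strip()))
    if current:
        records.append(current)
    return records


SAMPLE = """\
Template-Type: DOCUMENT
Title: A review of metadata
Description: A survey of current
+ resource description formats

Template-Type: SERVICE
Title: Example catalogue
"""

records = parse_templates(SAMPLE)
```

Pairs are kept as an ordered list rather than a dictionary so that repeated attributes, and the order in which they were stored, are preserved.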

Effort has been made to ensure the templates are 'human readable' which means less processing is required to make the data understandable. This helps to ensure there is a low entry cost to implement the templates. Attribute names are therefore written in full.

Content

Basic descriptive elements

The content is deliberately limited in detail in order to ensure the record is simple to create. The content includes simplified bibliographic fields (title, author, publisher, language). It is possible to distinguish personal and corporate authors by the choice of either the user or organization cluster.

The IAFA templates distinguish persons and organisations by appending the USER (person) cluster or the ORGANIZATION cluster to a particular element, e.g.

Author-(USER*)
Publisher-(ORGANIZATION*)

Subject description

There is provision in all templates for a free text description of the resource to be included. In addition there is a keywords attribute for additional subject terms. In ROADS usage of the templates, further fields have been added to record a subject classification and the classification scheme used.

URIs

URLs are used for location of resources.

Resource format and technical characteristics

There are a number of different template types defined within the guidelines to describe the variety of network resources available:

Document
Dataset
Mailing list archive
Usenet archive
Software package
Image

Other template types are designed for use in the context of FTP archives to provide information about a particular FTP site:

Site configuration information
Logical archives configuration
Service (e.g. on-line catalogues, information servers)
Mirror (details of sites which mirror files including information on frequency of update from the source)

The configuration files would be relevant for the automatic collection of records, and in a broader context, the service template would be used to describe free-standing resources.

Templates for 'document like objects' include attributes for the size, format, character set and method of access. The guidelines specify that different versions of the same resource are described as variants. If a resource has 'the same intellectual content' it is taken to be the same resource regardless of language or text format (ASCII, Adobe, PostScript, etc.).

Host administrative details

The content of the record is designed to take advantage of the context in which the record will be used, so URL and e-mail links to authors and publishers are included.

Provenance/source

A source attribute can be used to describe details of the original form of the object.

Terms of availability/copyright

Templates for services include attributes for authentication, registration, charging policy, access policy and access times. Templates for 'document like objects' include a copyright attribute.

Administrative metadata

The content includes detailed record management information including the date a record was created, the date for review as well as details of the creator of the record. This allows for the development of automated record maintenance procedures. It allows system administrators to keep track of rapidly changing resources and allows for quality checks to be carried out at regular intervals.

Rules for construction of these data elements

The guidelines acknowledge that the content of particular fields must be standardised to allow for effective indexing and retrieval. Rules for the form of content are defined for the following data elements:

e-mail addresses: RFC 822
host names: RFC 1034
host IP addresses: defined in guidelines
numeric values: defined in guidelines
dates/times: RFC 822 amended by RFC 1123
telephone numbers: defined in guidelines
latitude/longitude: defined in guidelines
personal names: BibTeX (see separate entry for BibTeX)
formats of resource: RFC 1521

The diverse locations of these rules, and the relative lack of detail compared to traditional cataloguing manuals, will inevitably lead to inconsistencies in practice. It remains to be seen whether the indexing and retrieval software can ameliorate the inconsistencies or whether 'simplified cataloguing rules' will need to be drawn up.
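Some of the content rules listed above lend themselves to mechanical checking, which is one way software might reduce inconsistency. The sketch below is illustrative only: it uses Python's standard date parser as a stand-in for the RFC 822/1123 date rule, a deliberately rough pattern for 'type/subtype' format names rather than the full RFC 1521 grammar, and the attribute names "Record-Date" and "Format" are hypothetical.

```python
import re
from email.utils import parsedate_tz

# Rough approximation of a "type/subtype" format name; not the full RFC 1521 grammar.
CONTENT_TYPE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9.+-]*/[A-Za-z0-9][A-Za-z0-9.+-]*$")


def is_valid_date(value):
    """True if the standard library can parse the value as an RFC 822 style date."""
    return parsedate_tz(value) is not None


def is_valid_format(value):
    """True if the value looks like a 'type/subtype' format name."""
    return CONTENT_TYPE.match(value) is not None


def check_record(record, rules):
    """Return the names of attributes whose values fail their rule."""
    return [
        name for name, check in rules.items()
        if name in record and not check(record[name])
    ]


# Hypothetical attribute names mapped to the checks above.
rules = {"Record-Date": is_valid_date, "Format": is_valid_format}
good = {"Record-Date": "Wed, 10 Jun 1998 12:00:00 GMT", "Format": "text/plain"}
bad = {"Record-Date": "sometime in June", "Format": "plain text"}
```

Checks of this kind catch malformed values but cannot detect inconsistent cataloguing choices, such as different forms of the same personal name.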

Multi-lingual issues

The IAFA guidelines state that text within the template is assumed to be in English using the standard ASCII character set although, using the whois++ protocol, it is possible to change character set within a template for a particular attribute pair by means of a system message. In the European context a more sophisticated means of character set negotiation is needed, but this could be overcome by having an agreed character set between particular clients and servers not subject to on-line negotiation.

Ability to represent relationships

There is no means of indicating a parent/child relation between documents (analytics), nor of linking documents with 'continued as' or 'replaced by'. However, such links could be provided through keyword and subject descriptor searching rather than being built into the record structure.

Protocol issues

IAFA/whois++ templates are associated with the whois++ directory service protocol (RFC 1835. P. Deutsch, R. Schoultz, P. Faltstrom and C. Weider. Architecture of the whois++ service. IETF, August 1995 <URL:ftp://ds.internic.net/rfc/rfc1835.txt>). This protocol fits closely with the IAFA template structure in that it passes attribute/value pairs and allows limits on search by template type, attribute, value or handle.
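The kinds of search limit mentioned above (template type, attribute, value, handle) can be illustrated with an in-memory filter. This is not the whois++ query syntax or wire protocol: records are modelled as dictionaries, and the attribute names "Template-Type" and "Handle", along with the sample data, are assumptions for illustration.

```python
def search(records, template=None, attribute=None, value=None, handle=None):
    """Return the records that satisfy every constraint that was supplied."""
    results = []
    for record in records:
        if template is not None and record.get("Template-Type") != template:
            continue
        if handle is not None and record.get("Handle") != handle:
            continue
        if attribute is not None and attribute not in record:
            continue
        if value is not None:
            # Search one named attribute, or every attribute if none was named.
            fields = [attribute] if attribute is not None else list(record)
            if not any(value.lower() in record[field].lower() for field in fields):
                continue
        results.append(record)
    return results


records = [
    {"Template-Type": "DOCUMENT", "Handle": "doc-001", "Title": "Parallel computing notes"},
    {"Template-Type": "SERVICE", "Handle": "svc-001", "Title": "Economics gateway"},
    {"Template-Type": "DOCUMENT", "Handle": "doc-002", "Title": "Economics working paper"},
]
documents = search(records, template="DOCUMENT")
```

Because both the templates and the queries are expressed as attribute/value pairs, each constraint maps directly onto a field of the record without any intermediate schema.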

Progress towards international standardisation

The documentation for IAFA templates is in the form of an Internet Draft 'Publishing Information on the Internet with Anonymous FTP' (Peter Deutsch, Alan Emtage, Martijn Koster, M Stumpf. Publishing information on the Internet with anonymous FTP. Internet Draft. (working draft now expired). <URL:http://info.webcrawler.com/mak/projects/iafa/iafa.txt>). This document is a working draft which has no status as a standard; however, it is a well developed exploration of a metadata record format specifically designed for Internet use. Both Bunyip and ROADS project workers are putting effort into a revised form to incorporate developments with the whois++ template. Convergence of the IAFA and whois++ template structures is likely as more implementations interoperate in a whois++ mesh (RFC 1914. P. Faltstrom, R. Schoultz & C. Weider. How to interact with a whois++ mesh. IETF Proposed standard protocol, February 1996). Implementation of services using the templates should also provide impetus to further development and modification of the template, and will provide justification for progress along the standards track.

As yet there is no agreed mechanism for controlling amendments and additions to the template structure. Establishing a means to communicate and control changes to the templates would be an essential step in the move towards a standard. Until then the tendency is for attributes to proliferate and for the overall structure to remain unstable.

Implementations

There are now several implementations using IAFA/whois++ templates. The ALIWEB search system was the first to implement IAFA templates, and it did so in the context for which they were originally designed. ALIWEB was set up as an experimental approach to providing access to FTP archives. Although ALIWEB was technically successful, the effort required to encourage FTP administrators to create records describing their archives could not be sustained, and ALIWEB was integrated into the already established CUI W3 Catalog in order to encourage information providers. The future of ALIWEB remains uncertain but at present it is operational and is mirrored at various sites world-wide. ALIWEB is at <URL:http://web.nexor.co.uk/public/aliweb/aliweb.html>.

Bunyip are now leading development of a whois++ White Pages directory system. Within the eLib framework, three projects are so far using the ROADS software: SOSIG (Social Science Information Gateway) <URL:http://sosig.ac.uk>, OMNI (Medical Information Gateway) <URL:http://omni.ac.uk> and ADAM (Art, Design, Architecture and Media) <URL:http://adam.ac.uk>. ROADS uses IAFA templates for description of resources, and the current release (version 1 in beta test Oct 1996) incorporates the whois++ protocol.

Within the UK there are also other implementations. The Internet Parallel Computing Archive (IPCA) at the University of Kent uses IAFA templates for a database containing information on parallel computing (David Beckett. IAFA templates in use as Internet metadata. <URL:http://www.w3.org/pub/Conferences/WWW4/Papers/52/>). At the University of Manchester, a volunteer effort, NetEc, provides a database of resources in economics using the IAFA template as the basis for the record structure <URL:http://cs6400.mcc.ac.uk/NetEc.html>.

Page maintained by: UKOLN Metadata Group
Last updated: 10-Jun-1998

IAFA/WHOIS++ Templates

Environment of use

Documentation