Metadata: background and issues paper

AC96/2-1

DOCUMENT DETAILS

Document title	Metadata issues paper
Author(s)	Lorcan Dempsey
Version	0.1
Date	11-Nov-96
Status	Work in progress: draft for discussion

DOCUMENT HISTORY

Version	Date	Status	Description	Distribution
0.1	11-Nov-96	Work in progress	Draft for Discussion	UKOLN Advisory Committee

Note

This document has been prepared at the request of the UKOLN Advisory Committee. It is very much a work in progress. It relies heavily on several of the reports and papers produced as part of UKOLN’s metadata project work and listed in the references.

INTRODUCTION

However, our feeling is that at this point ‘metadata’ as a descriptive term has become so debased by overuse (and means so many different things in different communities and contexts) that it is now virtually meaningless without extensive qualification; unfortunately, it has also become a very fashionable term. The very vagueness of the term metadata today makes it easy to offer sophisticated-sounding proposals about using metadata in various ways which seem to be almost impossible to reduce to practice, or which are extremely pedestrian when actually implemented. Clifford Lynch, Avra Michelson, Craig Summerhill, Cecilia Preston (Lynch et al 1995)

What a supreme irony that those who proclaim and pursue vision are the least likely to attain it. … And – often – those who are later considered visionary were earlier considered nerds. Robert Venturi (Venturi, 1996)

This document aims to outline some of the issues surrounding the design and deployment of metadata, with special reference to current UK developments. It seems to me that it is possible to advance a partial vision in relation to a partial view of metadata: that there should be unified access to the UK scholarly resource, for example. This is partial because of the constrained view it takes of the functions of metadata, looking only at resource discovery, and of the user community, focusing only on scholarly resources. A much larger vision sees metadata as pervasively disseminated throughout the network to describe people, services, software components and data, in support of self-describing, dynamically reconfigurable distributed systems and services (later examples should make this clearer).

To attempt a grander vision, though, would be to fall foul of the charges laid in the first quote above and to submit to the hubris implied in the second. For example, to have promoted a future view of developments three years ago might have ignored the utterly transforming influence of the Web. We are still in a technical construction phase in which the visionary nerd can have major unanticipated influence, in which many theoretical issues remain unresolved and in which organisational and business issues are yet to be addressed. At the same time, significant commercial and research interest is now focused on issues in this area. (Of particular interest here might be Netscape’s work on Resource Description Messages (Hardy 1996))

Accordingly, the larger vision is not attempted here. An opening section approaches a definition by way of example. In subsequent sections, the focus is narrowed to a particular ‘type’ of metadata, that which describes ‘information and document-like resources’. This choice is determined by its presumed readers’ interests, but, more importantly, by its author’s competences.

Within this scope, the paper notes some directions and suggests some areas in which concerted UK attention would be useful. It provides significant background material so that there is enough context for the reader to relate these issues to wider developments and trends and to have some sense of some of the environments in which these discussions are taking place. It suggests some priorities for attention and action.

WHAT IS METADATA?

‘Metadata is data which describes attributes of a resource’: this is the routine definition, though it doesn’t take us very far. This section provides some examples of metadata and its use before proposing a fuller definition.

An approach by example

What does metadata look like?

Many resources include self-descriptive data. Documents typically carry descriptive data: title, author, maybe an abstract, and so on. In a digital version these may be marked by some form of tagging. There may be provision for including some form of metadata in the file. For example an HTML document allows a limited amount of data to be placed in the Head. A title can be put in and conventions exist or are being defined for fuller data. Alta Vista has some recommendations, which if followed, will enable it to recognise data for harvesting into its search engine. In an SGML context, there are agreements for the encoding of article headers or, within the Text Encoding Initiative, of quite rich metadata for electronic texts. Some image files, PNG or TIFF, may also contain some form of structured, descriptive data.
An interesting document type is an Internet e-mail message which contains simple metadata in attribute value pairs in the mail header. Programs can use this data to provide threaded discussion archives, searches for names or topics, and so on.
The Electronic Libraries Programme has funded several so-called subject-based information gateways. These provide databases of resource descriptions which facilitate discovery of resources in particular subject areas. Typically, the records provide enough descriptive information about network services (often web-sites, but also news-groups, individual documents, and so on) to allow some precision in searching and some (human) judgment of relevance before committing to retrieving a resource. Their records are rather like those in abstracting and indexing services, with some additional material for technical and service characteristics.
Archivists, records offices, data archives, businesses, and government organisations are interested in ‘records’. Records are not merely data; they need to satisfy requirement for evidence, what ‘it means for written testimony about an act in the past to be considered trustworthy in the future’ (Bearman, 1995). Such data may be critical for items such as patient records, data sets, business transactions and legal documents as well as scholarly materials. ‘Records require associated metadata which allows a user to audit that the records are comprehensive, identifiable, complete, and authorised’ (Bearman, 1995). Data about how to extract information content, about provenance, use history, ownership, terms and conditions of use established by the creator, and so on, needs to be considered in the light of indefinite future use. Preservation raises major metadata issues.
Statistical data sets need to include supporting documentation to make them useful and usable, which covers such aspects as survey design, processing and analysis, and the data set itself.
A database management system contains a data dictionary which preserves the integrity of the database by constraining the operations that can take place on data. Metadata is stored in a data dictionary.
Z39.50 is an information retrieval protocol. Z39.50 servers can provide data about the databases they make accessible and the facilities they support through an ‘explain’ database. This might include something about terms of availability or more technical data about supported searches, and so on to allow a client to make sensible decisions.
It is widely expected that future network systems will be based on distributed objects. Monolithic applications will be disaggregated into components. New applications will be built from multiple autonomous components specialised for particular tasks, put together as and when required. Discussion of the network computer, Java, and ActiveX are bringing this scenario closer - at least in the magazines. Ideally, one would want an application to discover a component when required, and not to have to have advance hardwired knowledge of all possible requirements. CORBA (Common Object Request Broker Architecture) is one proposed framework for managing interactions between distributed objects. Metadata is an important feature of the CORBA architecture. All objects in a CORBA compliant system must be self describing using a C++-based language: the Interface Definition Language. These descriptions are stored in the Interface Repository, and describe the operations and data types supported by the object. Entries in the repository store enough information about an object to allow other objects to interact with it.

What types of things do users need to know about resources?

A user needs to discover the existence of resources. This is the primary current focus of discussions: resource discovery. However, much of the discussion assumes a flat space of unrelated resources.
A user may need to know rather more about a resource than some basic description if it is to be useful within a business or research context. Variously, its provenance, archival history, and various forms of intellectual responsibility; its integrity and authenticity; its relationship to other resources; particular domain-specific features; and so on.
A user as provider or potential provider needs to know about intellectual property rights attaching to a resource, about levels of use made of it.
A user may need to know whether a resource is fit for use at various levels: whether it is possible to extract its information content with available tools; whether it can be rendered with available equipment; whether a document is a textbook or a scholarly monograph; and so on. Additionally, for example, a potential user of an engineering model might need to know what functional goals a device model had, what design assumptions underlay it, and so on.
A user as customer needs to know under what terms and conditions a resource may be made available. As the Internet becomes a more mature environment, this becomes more important: many of the resources of interest will need to be paid for. However, rather more than a mere listing of price may be needed. Terms and conditions may be situational: they may depend on particular characteristics of a user (frequent user; member of staff of a particular institutions; … ) or on particular characteristics of the use (a discount for night-time use to achieve global load balancing; … ) and so on. Terms and conditions metadata are seen as crucial, although poorly developed.
A user as client or agent needs to know about technical interfaces, access protocols, searches and formats supported, etc.
A user as parent may wish to know what ‘rating’ a particular server has received.
A future user ….. who knows what a future user might have expected his or her forebears to provide?

A definition

These examples confirm that we are looking at a diffuse environment of use. Many others could be offered. However, at this stage we can offer a preliminary fuller definition of metadata:

metadata is data associated with objects which relieves their potential users of having to have full advance knowledge of their existence or characteristics.

A user might be a program or a person.

As the proportion of the intellectual record which appears on the network grows appropriate metadata is seen as a central part of a mature information, business and technical environment. In an indefinitely large resource space, users need to have advance knowledge which allows them discover resources, know what terms they are available under, assess their potential usefulness, be assured of their authenticity, and so on. Metadata needs to be directed at human users, but increasingly it needs to be addressed at programmatic users. The ability to store searches and user profiles, to consolidate retrieved results from several resources, to filter and summarise, to pass off some of the drudgery of information seeking to programs will be increasingly necessary. These services may be a prelude to more capable agents, autonomous programs which act on behalf of users in distributed, heterogeneous environments. Metadata will assist effective human use of resources; it will be essential for effective program use of resources. Metadata is knowledge which allows human and automated users behave intelligently.

REFERENCE MODELS

At this stage, it would be nice to step back and present an organised view, or a reference model which outlined the objects and concepts of interest and the relationships between them: an ontology, to use a borrowed philosophical term now gaining currency. A general view is not readily available. This is for one obvious reason: a general approach has not been taken to design and development of metadata formats across domains. Even when we narrow our focus as suggested above, a variety of organisational and discipline-specific initiatives are in place between which there are different levels of mutual knowledge or influence. Different approaches have been taken in different domains; in some cases frameworks are only now being put in place; organising principles may not be explicit. Clearly, certain application areas require very different approaches, but I suspect that there is significant redundancy in several areas. This may not be an issue now, but certainly will be when it comes to interworking across domains and this is being recognised.

A shared ontology, which conceptualised the objects and relationships that needed to be represented in particular metadata formalisms would clarify understanding and facilitate future mapping between domains. Some domain specific metadata approaches are listed below, but first it might be useful to outline a rough sketch of a model generalising the approach of one particular domain, the library community.

Libraries have evolved very full theoretical, technical and organisational apparatuses for resource description and discovery, particularly for books, and there is a full body of experience which can be prospected. However, the aim here is not to suggest that libraries have built on an explicit ontological base for their work (one could argue that the reverse is true), but to give us a handle on some terms and concepts for later discussion and a comparative perspective. (Some readers may prefer to skip the detail of the following section.)

A look at books

Underpinning library cataloguing practice is a three-fold conception of the book:

The copy. This is the actual physical item which is handled and read. Traditionally, cataloguers have not been very interested in the copy per se unless special circumstances -- if it is very old or rare, or has had an interesting owner – mean it is accorded some special attention. Some copy-level is assigned to assist in its effective management as part of a collection: a shelfmark or other locating information and a control number. The technology of print means that it one of multiple identical copies of a publication. The book in hand was equal to all other copies as a representative of the publication. I have a copy of At swim two birds by me as I write.
The publication. For the reason suggested above, the main library interest focuses on the ‘publication’ or ‘title’ of which the copy is a manifestation. Library practice is to start with the published object, manifest in a copy in hand. The core of library metadata is a physical description derived from this copy - author, title, publisher, place and date of publication, dimensions, format, and so on. In theory, the book is self-describing. Author and title are transcribed as on the copy. The physical object is central as an instance of a Publication class. My copy of At swim two birds is the ‘first Four Square edition 1962’.
The work. However a publication is only one possible manifestation of a work. There are many other manifestations of At swim two birds. The work is the intellectual content which may be embedded in a variety of publications. A library cataloguer adds headings or access points which allow the publication description to be retrieved. These headings relate to the work: typically author, title and maybe subject are added. Individual publications may represent author or title names slightly differently; by applying special rules, the aim is to regularise the form in which these are noted in access points. In practice there is often a further stage. To ensure consistency among the access points, an authority list may be used which records preferred regular forms of headings and relationships between them. For example, Flann O’Brien is the author of At swim two birds. But this is only one of the names under which he is known, he is also Myles na gCopaleen and Brian Nolan. An authority list would establish one of these as authoritative and link the others to it. In some cases, a ‘uniform title’ may be assigned to an item where the work has some variety in title (the bible is the obvious case).

So (in theory) the library constructed metadata consists of descriptive data derived from the publication, some copy-specific data, and headings. Headings are supposed to operate at the ‘work’ level. They are metadata which aims to relieve the user of having to know in advance the individual characteristics of all the manifestations of a work, the different versions of an author name, or all the items on a particular subject.

Cataloguing also provides for relationships to be expressed between these three ‘objects’ (work, publication, copy). Barbara Tillett identifies a variety of relationships following a review of cataloguing codes: equivalence (copy, reproduction), derivative (editions, translations, ..), descriptive (commentary, criticism, …), whole-part, accompanying (e.g. parts of a kit), sequential, shared characteristic (same author, publisher, etc.). A variety of linking devices have been developed to express these relationships, influenced, she notes by the technology used to create the catalogue.

So, provision is made for description of individual published objects and for integrating them into the collections of which they are a part by means of collocation of headings and references. These integrating mechanisms work more or less well depending on the intellectual effort applied to their creation. However, there are a number of problems.

Emphasis is given to the published object, starting from its format and physical description. As we move into an environment where, increasingly, ‘content’ may be manifest in several different formats this is a disadvantage and is one of the issues facing those using MARC for cataloguing of electronic resources. The means for bringing together works and indicating relationships do not always work very well and this will create an issue for merging results across many databases. Duplicates will have to be identified. But there is also the ‘Humphrey Clinker problem’, a term coined by OCLC for the fact that a search on a large database like OCLC’s retrieves a large variety of manifestations of the same work. The user is presented with a long listing, based on publication, which has to be read through. The user really wants access at the work level, and an indication of the relationships between publications.

A very provisional table of data elements widening the emphasis to the book world is presented in an Appendix. What it shows is three slightly overlapping categories of data:

Intellectual content data: describing the characteristics of the work itself, without reference to any particular delivery channel.
- Author/other responsibility, title, subject descriptor, narrative description, genre/category/level, review/rating/evaluation, date of creation, ….
Publication or source data: describing the formal and business characteristics of a particular manifestation of the work.
- Identifier, publisher, edition, distributor, terms and conditions, intellectual property rights, format/structure, interface/interchange
Copy-specific: supporting the use of specific manifestations of publications - who owns them, who uses them, where are they, and where there is some special interest (where a copy has some historical or other interest), aspects of provenance, use or marking, integrity/authenticity (in relation to the source).
- Identifier, date of receipt/creation, transaction use, provenance/history, location, owner, integrity

A copy inherits publisher/source and content data. To describe a particular resource one wants to say something about its content, the particular business and technical characteristics of the publisher/source, and something about its status, to assure a potential user of what it is.

This might be generalised as a descriptive model:

Content

Title
Evaluative
Subject
Level
…
‘agent-link’
‘link’

Publication/source

Identifier
Type
Terms and conditions
Format/structure
rface/interchange
‘agent-link’
‘link’

Copy

Identifier
Authenticity
History of use
Format/structure
...
‘agent-link’
‘link’

Agents might be people, organisations or other entities. Agent links might be: author, investigator, editor, translator, designer, technical contact, data centre manager, webmaster, intellectual property rights owner, host, publisher, distributor and so on. Other links can be imagined (is part of, etc).

This is a very hasty outline. The author is aware of obvious deficiencies and would not publish in this form. It derives from a limited ontology (based on the discussion in Dempsey 1990 and Dempsey 1989), influenced by the interesting bibliographic analysis in Heaney (1995) and is not full enough to be persuasive. It is oriented towards the (now famous) document-like object. It is presented as a gesture towards what would be a useful cross-domain exercise: the identification of the objects of interest and the relationships between them prior to any representation in particular metadata models. Any models would have to be tested against the emerging reality of a very complex network world. Much of the work would have to go into defining the link attributes - in what ways should resources be related?

It does have the virtue though of disentangling some of the levels at which objects of interest exist: much current discussion assumes that objects live in a very redundant flat space. Bearman (1995) outlines a similar process based on a much fuller investigation and study of the archival literature.

(Note: We may attempt some work like this in the context of the AHDS/UKOLN Metadata workshops: explore where there is underlying agreement across domains and where there is divergence.)

THE VARIETY OF RESOURCES

Much of the earlier work on resource discovery had a very simple file-based view of a resource. We still use ‘resource’ quite casually thought it can cover a wide spectrum of possibilities. A resource might be a file, or a database, or a record in a database, or the metadata about a database. Increasingly, however, resources will be a complex of data and services which are opaque to a human or robot user. A high level interface may be available which hides the actual way in which the service is provided. A resource may be opportunistic or fugitive: existing only in response to a particular conjunction of events or a particular query. For example, customised or tailored services may appear which are based some underlying database whose characteristics are unknown. Resources will be mutable and dynamic. And again, when we look beyond the information realm, the diversity is significant. A typology of resource types might be interesting.

AN APPROXIMATE TYPOLOGY OF METADATA INITIATIVES

One can suggest an approximate grouping along a metadata spectrum which becomes successively richer in terms of fullness and structure. For purposes of analysis, we propose three bands within this spectrum, which allows us to sketch some shared characteristics across groups of formats. These are discussed under several heads below.

Environment of use

Indexes of extracted material - self describing without created metadata

Band one includes relatively unstructured data, typically automatically extracted from resources and indexed for searching. The data has little explicit semantics and does not support fielded searching.

Currently, this data is created by the web crawlers. Many such services exist, and several global services are in heavy use. If a user is looking for a known item, they can be reasonably effective. Because they are global in scope and operate on limited descriptions they are less effective for discovery. A user may find many resources, but may have to sift through them and will miss many potentially relevant resources because they are not indexed with appropriate terms. Nor, in many cases, is the description full enough to allow the user make relevance judgements in advance of actually retrieving the resource. Typically, crawlers are not selective about the resources they index: they often aim for comprehensiveness at some level within their target area, whether that is the world or some part of it. For these reasons, they have some limitations as discovery services. These issues are well known and such services are seeking to enhance the metadata on which they operate: different services have different conventions to allow authors of web pages to include various categories of metadata which can then be collected. There is also some discussion about a common representation for the exchange of such metadata between indexes and other services and the harvesting of fuller metadata.

Because of the way they are designed crawlers parallel the disorganised nature of the web itself - they operate exclusively at the ‘copy’ level described above. They are not equipped to recognise duplicated files or relationships between resources. Because they rely on documents themselves for indexing they do not provide data about the status of a resource: whether it is fit for purpose or whether it is what it purports to be and so on. They only cover publicly available web pages – less visible resources (whether because they are commercial, or because they are hidden behind CGI interfaces, or others) are less well covered.

Generic discovery services - manually created descriptions

Band two includes data which contains full enough description to allow a user to assess the potential utility or interest of a resource without having to retrieve it or connect to it. The data is structured and supports fielded searching. Typically they are simple enough to be created by non-specialist users, or not to require significant discipline-specific knowledge. Descriptions tend to be of discrete objects and do not capture multiple relationships between objects. Typically, but not essentially, descriptions are manually created, or are manual enhancements of automatically extracted descriptions, and they include a variety of descriptive and other attributes. They may be created to be loaded directly into a discovery service or to be harvested.

Services using these types of formats include OCLC’s NetFirst (based on its own internal format) and the UK Electronic Libraries Programme subject-based information gateways (some of which use their own internal format; some use IAFA templates). Often, these services involve some selectivity in what they describe and may have more or less explicit criteria for selection. For these reasons, they may be expensive to create, again driving an interest in author- or publisher- generated description and automatic extraction techniques such as those piloted by Essence as part of the Harvest.

These services often focus on the server level, creating records for repositories or collections of resources. Interesting recent discussion over the Dublin Core and the eLib subject based services has exposed the tension between simplicity and structure - which is mentioned below in discussion of the records they use.

Rich metadata - domain specific requirements

Our third Band includes fuller descriptive formats which may be used for location and discovery, but also have a role in documenting objects or, very often, collections of objects. Typically, they are associated with research or scholarly activity, require specialist knowledge to create and maintain, and cater for specialist domain-specific requirements. They are expressive enough to capture a variety of relationships at different levels. Developments include the ICPSR SGML codebook initiative to describe social science data sets, the Encoding Archive Description, Content Standards for Digital Geospatial Metadata and Computer Interchange of Museum Information. As might be expected with large international disciplines, there may be more than one initiative in a domain. For example, CIMI proposes a framework for the creation, search and retrieve of metadata, but there are other approaches within the museums community. The Geospatial area is quite well developed, and the Federal Geographic Data Committee (FGDC) has been steering the Content Standards for mapping and geopsatial data. Four characteristics of metadata guided development: discovery (to find what is available), fitness for use (does a data set meet a particular need), access (data needed to acquire a set a data, and transfer (data needed to process and use a set of data) (ref needed). As noted above, there are now metadata initiatives across all major scholarly disciplines as they prepare for effective digital use of their materials. For example, there are several initiatives within different branches of engineering and the IEEE has been organising metadata workshops. There is a biological initiative which is looking at the use of FGDC and MARC for use within the biological sciences (ref needed). And so on.

Comments

It should be clear that these are not watertight categories, especially as implementations may vary. GILS (Government Information Locator System - an approach to providing access to government information in the US based on a Z39.50 profile) and CIMI object descriptions might be considered to be in the middle band for example.

Against this background one can note some trends, especially across the boundaries of these bands. Author or site produced metadata will become more important for many purposes. This may be harvested unselectively, or only from selected sites. An important motivation for this is to overcome some of the deficiencies of current crawlers without a provider incurring the cost of record creation. In some respects, the crawlers will assume some of the characteristics of the middle band. (Alta Vista, for example encourages people to embed metadata. Within the Desire project, the Nordic Web Index, a Nordic ‘search engine’, is being enhanced to be ‘metadata aware’: where it is available it will harvest discovered embedded metadata. More of this kind of thing later.)

At the same time, communities using the richer ‘documentation’ formats will wish to disclose information about their resources to a wider audience. How best to achieve this will have to be worked out: perhaps ‘discovery’ records will be exported into other systems. These trends suggest that the middle band will become more important as a general-purpose access route, maybe with links to richer domain-specific records in some cases.

Format issues

Metadata formats

There is no currently no widely-used standard for data in band 1, though there are moves to develop a shared format for exchange, perhaps based on SOIF (the record syntax used by Harvest, a set of tools for creating distributed indexes of network resources). There is also a trend noted above to enhance the data collected by these services in various ways, making them better suited to discovery.

The middle band metadata used in discovery services tends to be based on simple record structures influenced by RFC-822 style attribute-value pairs. Formats here do not contain elaborate internal structure, do not easily represent hierarchical or other aggregated objects, nor, typically, do they express the variety of relationships which might exist between objects. This is usually by design: there is a necessary trade-off between simplicity and expressiveness. Also, their purpose is to be hospitable to the non-specialist description of information objects of different types and from different domains and so are not concerned with the very specific requirements of any one domain. Of the formats service formats which we examine here, IAFA templates are perhaps the most detailed. There are templates for different types of object (document, user, logical archive, etc.), and there has been some consideration given to ‘clusters’ of data which are likely to be repeated across records and to variants within records.

There has been much recent interest in the Dublin Core, a list of 13 data elements with some qualifiers, which has been developed to act as a simple description format (Dempsey and Weibel 1996). A separate section below looks at this is a little more detail.

Although the bulk of the formats in this range follow an attribute-value pair structure, it has been agreed that an SGML DTD will be developed for the Dublin Core. Because of some similarity of construction and content across formats in this band, conversion between them, though inevitably lossy, is feasible

The documentation band contains some very full frameworks for the description of multiple aspects of objects and collections of objects. In some cases, the frameworks describe metadata objects as one type only of information object: they are concerned with ‘information content’ also. Typically, work is proceeding within an SGML context and the example of the Text Encoding Initiative has been quite influential. Within the social science, museums, archives and geospatial data communities work is progressing on establishing DTDs. These may relate to collection level description, item level description, and allow various levels of aggregation and linkage appropriate to the domain. They cater for a very full range of attributes appropriate to documenting data sets or other resources. These can be distinguished from the range in the middle band by fullness (they go into more detail), structure (they contain richer structuring devices), and specialism (they may be specific to the relevant domain).

It seem likely that specialist users will want to search such data directly, but that to make data more visible to more general ‘discovery’ tools, there may be export of data in some of the simpler formats used in the middle band. Indeed, the Dublin Core has been explicitly positioned as a basis for semantic interoperability across richer formats, although it has not been widely used in this context.

Comments

Various obvious contrasts between the bands are clear. The web crawlers currently operate at a very fine-grained level: they see a world of pages. The services in the middle band face an interesting development challenge: to reconcile the economic and service goals of simplicity and a generic approach with the desire to make descriptive practices responsive to the relatedness of the information world at various levels. The domain specific approaches tend to focus on the description of particular ‘collections’ and to capture some of this relatedness, but are currently in various stages of development and resources may not be yet visible through general purpose tools.

Protocol issues

Middle band discovery services are being delivered through emerging distributed searching and directory approaches on the Internet, notably Harvest, Whois++, LDAP, and Dienst. There is some use of Z39.50 also, notably for GILS.

An unknown factor is the influence of Netscape’s ongoing work based on Harvest technologies. It is working with something called Resource Description Messages, as a framework for search and retrieve of metadata. (Hardy 1996)

Many band three approaches are in early stages. However, there has been some discussion of using Z39.50 for search and retrieve in several cases. In particular, there has been some interest in the Z39.50 profile for access to digital collections .

Implementations

Standards-based resource discovery services are also in early stages. Examination of the descriptions collected in Dempsey and Heery (1996) shows that many formats are still under development or are not widely implemented.

In Band 3, the ‘documentation category’, in particular, communities of users are working towards consensus and in some cases robust interoperating implementations are some time away.

In Band 2, the ‘discovery category’, IAFA/Whois++ templates are in use in several projects, and are deployed in Whois++ directory services. Dublin Core is being piloted in several projects, but an agreed syntax is only now being defined. RFC-1807 is used within the NCSTRL project . SOIF (a record syntax used in Harvest) is widely used as the internal format for Harvest, but there is no agreed ‘content’ definitions. LDIF (the record used in LDAP, the Lightweight Directory Access Protocol) is in a similar position, lacking an agreed set of schema for resource description. LDIF and SOIF have attracted much interest as a result of Netscape’s decision to base its directory server and catalog server products on LDAP and Harvest respectively.

Dublin Core and Warwick Framework

The Dublin Core is a simple resource description format. It has attracted considerable attention recently, partly because of the eloquence and consensus-building activity of it proponent, Stu Weibel, but importantly because it has situated itself as a potential solution for three pressing requirements. The first is to have a generally acceptable simple resource description format which is hospitable to the description of a wide range of resources. Following recent discussions, the Dublin Core (a list of 13 elements with some qualifying structure) is being adapted to take on board some of the concerns of those with an interest in image metadata and to address some structural and content issues. The second target use is to provide a semantic base for metadata embedded in HTML (and subsequently other) documents. The third target use is to provide a base for semantic interoperability between richer metadata domains. Richer record formats might map a core set of data onto Dublin Core to provide a common set of elements for discovery purposes - this might be implemented in various service and technical environments. Dublin Core is attracting significant attention. It should be noted it only looks at one aspect of metadata - simple description - and that a concrete syntax for it has only recently been defined, and not much implemented.

UKOLN and OCLC jointly organised a conference in Spring 1996 to examine various general metadata issues and the Dublin Core in particular. The venue was Warwick and a new requirement was identified and scoped (Dempsey and Weibel, 1996), which resulted in the Warwick Framework proposal.

Satisfying the need for competing, overlapping, and complementary metadata models requires an architecture that will accommodate a wide variety of seperately maintained metadata models. It was concluded that an architecture for the interchange of metadata packages was required. A package is conceived as a metadata object specialized for a particular purpose. A Dublin Core-based record might be one package, a MARC record another, terms and conditions another, and so on. Such discrete packages might be numerous and varied in content and even source. Users or software agents would need the ability to aggregate these discreet metadata packages in a conceptual framework, hence the notion of a container-package architecture.

This architecture should be modular, to allow for differently typed metadata objects; extensible, to allow for new metadata types; distributed, to allow external metadata objects to be referenced; recursive, to allow metadata objects to be treated as 'information content' and have metadata objects associated with them.

Although there is wide agreement that this is a sensible direction, the Warwick Framework has not been implemented at the time of writing, and certain issues remain outstanding. There is still considerable intradomain discussion; the need for inter-domain exchange is recognised but the applications framework for this is not in place. Because of the variety of metadata approaches that have been discussed here, and the variety of metadata requirements which may exceed the current provision of any one format, the Warwick Framework has attracted a lot of interest as a simple but potentially very powerful architectural component.

THE CREATION OF METADATA

The discipline or control exercised over the production of collections of resources will improve as the web becomes a more mature publishing environment. There will be managed repositories of information objects. Such repositories may be managed by information producing organisations themselves, universities for example, by traditional and ‘new’ commercial publishers, or by other organisations (the Arts and Humanities Data Service in the UK, for example, or industrial and other research organisations, archives, image libraries, and so on). This is not to suggest that the existing permissive electronic publishing environment will not continue to exist in parallel. One concern of a managed repository will be that its contents are consistently disclosed and that descriptions are promulgated in such a way that potential users, whoever they might be, are alerted to potentially relevant resources in that repository.

Different repositories will have different requirements and priorities. Examples are a social science data archive, a university web site, a commercial publisher’s collection of electronic journals, an archival finding list, and so on. Objects on a university web-site may be briefly and simply described. A data archive may need extensive documentation.

There will be a variety of metadata creators. These fall into three broad categories: ‘authors’, repository managers, and third party creators. As its importance becomes more apparent, ‘authors’ are likely to create descriptive metadata: a major incentive for this will be agreement about the use of the META tags in HTML documents for embedding metadata which will be harvested by programs. Following various discussions earlier this year, partly prompted by the Warwick metadata workshop, an agreed convention for embedding metadata without breaking existing browsers or robots may be soon available. Descriptive data will be similarly embedded in other objects by those responsible for their creation. Metadata will also be created by repository managers, who have some responsibility for a resource and the data that describes it. Third party creators (including, for example, the eLib information gateways) create metadata for resources which they themselves may not manage or store.

Metadata may sit separately from the resources it describes; in some cases, it may be included as part of the resource. Embedded HTML tags is probably the simplest example of the latter case, but it is common in some of the domain-specific SGML frameworks mentioned above. For example, a TEI (Text Encoding Initiative) header needs to accompany conformant TEI documents. However, independent TEI headers may also exist, which describe documents which may be physically remote.

Metadata, once created may be shared with others. Take for example, author-created metadata embedded in HTML documents. This may be collected by robot or other means. Value will be added to this data at various stages along whatever use chain it traverses: by a local repository manager, by subject-based services, by crawler-based indexing services, by various other intermediary services. These intermediary services might include librarians and others who now invest in current awareness and SDI (selective dissemination of information) services, as well, maybe, as current abstracting and indexing services. Many authors may only provide basic information: typically they will not be conversant with controlled subject descriptor schemes, record all intellectual or formal relationships with other resources, and so on.

A different use chain might be traversed by fuller metadata associated with the scholarly edition of an electronic text, for example. Full documentary metadata would be available to assist in the analysis and use of the text, but a subset might be output to a general purpose discovery service. There might be a link back to the fuller metadata from the shorter record.

A number of factors, including the perceived value of a resource, will determine the relative balance between author-produced, added value and third-party original descriptions in different scenarios. The metadata ecology and economy is still in development.

A CONVERGENCE OF COMMUNITIES?

The last section focused on the development of metadata in the network environment. Within the BIBLINK project UKOLN is looking at metadata developments over a wider range of activities. It is interesting to note the few points of contact that exist between ‘conventional’ publishing activities and the networked world in terms of discussion about resource description, unique identifiers and other issues.

A LOOK AT DISTRIBUTED ACCESS TO METADATA

A variety of approaches

The importance of metadata to future network environments has already been sketched. Programs will handle it in various way depending on the context. Here we look briefly at search and retrieve in a network information environment.

Programs will collect and manipulate data in a variety of ways, passing it off to other applications in the process. Data may be collected and served up in various environments, converted on the fly to appropriate formats: it may be collected by robot, added to a local 'catalogue', or pulled into a subject- based service. The metadata we have been talking about refers to network information resources. This will need to be integrated at some level with the large, albeit highly fragmented, metadata resource for the print literature. There may also be metadata about people, about courses, about research departments and about other objects. Programs might periodically look for resources that match a particular user profile, might search for people with a particular research interest, and so on.

These developments will take place in a rapidly changing distributed environment in which directory protocols (e.g. Whois++, LDAP), search and retrieve protocols (e.g.Z39.50), Harvest, and a variety of other approaches will be deployed. These will be hidden from the user by the web, itself likely to be transformed by the integration of Java and distributed object technologies.

We can identify a scale of interoperability among services. At one extreme, they may be closed or proprietary, and not interoperate with other services. In the area under investigation this is the typical situation. For example, search services tend to be proprietary and closely coupled with the databases they provide access to. Each CD-ROM or each telnet service is differently accessed, and the user is not hidden from this difference. More recently, some shallow level of interworking may have been introduced by making services available through similar web interfaces. However, the burden of selecting between the services and consolidating content rests with the user.

If services can be accessed through standard server interfaces then greater interoperability is possible. Different applications could use the same, or common, client services to access a range of services in a consistent manner. For example a library OPAC, a subject information gateway, an online host, could use Z39.50-based services to access a variety of information sources in a consistent way. Similarly, if they themselves provide Z39.50 server interfaces, their own sources are available to be accessed in a consistent way by other applications. Several of the protocols in this area were mentioned above. The ROADS project is developing a distributed systems framework based on WHOIS++, an Internet search and retrieve protocol initially developed for directory applications. The aim is to provide the eLib subject gateways with the ability to act autonomously with individual web interfaces, while also being able to accessed as a collective unit within a WHOIS++ framework. WHOIS++ will provide a consistent search and retrieve interface and a technique for routing queries to relevant servers.

It might be interesting to look at the subject gateways in a little more detail. They are part of a bigger picture. There are several other discovery services planned. There are also likely to be a variety of cases where services/organisations want to provide access to a depository of materials through a database of descriptions: pre-print archives, training materials, image collections, electronic journal collections. And so on. The provision of a similar type of descriptive metadata is seen as a central component of the AHDS, of other ongoing European and eLib projects, and of a range of other initiatives. Some of these projects/services will want to retain control of their data; some may be happy to let it go to host services. However, whatever organisational patterns emerge it is very clear that all a user's requirements will not be met by one or two centralised services. Even if all of the subject gateway data were on a centralised service, it only provides a partial solution. Users will not merely want to cross-search between the eLib services, they will want to cross-search between them and other services. In fact, the latter is probably a more important requirement in some cases. And some eLib services will want maybe to operate small distributed frameworks to bring in related resources, etc. ROADS is motivated by the reality of this inevitably distributed environment and aims to deliver some components of an integrating systems framework.

But it is clear that not everybody will be looking at WHOIS++. We are at the beginning of this stage, where a common set of protocols and data formats will improve interworking, supplemented by descriptive metadata and a federating layer of middleware which hides underlying protocol and format differences. (UKOLN’s MODELS project is looking at what the components of such middleware might be and how it will be constructed.)

The third level in our simple hierarchy is that of open, distributed objects where objects might be located anywhere on the network, exist as ‘plug and play’ components, publicise information about themselves, and are available for discovery by other objects which may wish to use their services. It will be a while before such services are introduced in this area, although systems and services influenced by distributed object technologies are now beginning to appear.

As noted above, we are talking about metadata at two levels here: the searchable repositories of metadata such as the eLib gateways, and metadata which describes these repositories themselves.

[insert something here from MODELS to expand on this? A few pictures?]

SOME ISSUES AND DIRECTIONS

I hope I have shown some of the complexity of the current situation. This list collects some of the issues that might be addressed in the communities in which we work. UKOLN is already active in some of the areas touched on.

This list is certainly not exhaustive, and is left deliberately open-ended to facilitate discussion. I would propose developing a firmer list on the basis of input from the committee.

In general I favour some high level coordination - however that is managed. Things may be brought together in various ways with current approaches but a lot of effort is being dissipated.

Policy and strategy

Proposals for a Network Information Development Agency or a UK-CNIDR have been considered by the ISSC for some time and some reports have discussed the role of such an agency. In its past year UKOLN has assumed some of the responsibilities such an agency might have and with the UK Web Focus it will assume even more. It is also working in collaboration with the subject-based services and the Arts and Humanities Data Service and others. However, there are still issues that need to be addressed at systemic level, several of them listed below, and which would benefit from some concerted action. JISC services emphasise the service aspect rather than the R&D, yet clearly this is an area requiring R&D: we are moving into new areas. It would be useful if the CEI were to review current capabilities and assess whether the needs which prompted discussion of the Agency still exist. CEI will make a significant investment in metadata activity, either as separately funded work, or in association with other funded services, and commands a significant collective intellectual and service resource. A mechanism which supported effective leverage of that resource in concerted ways would be of benefit. (An important role would be to supply CEI with neutral advice, although some of the concerns about overlap and competition seem to have abated since the last AC meeting.)
The appearance of a greater amount of the Intellectual Record in electronic form and on the networks means that national libraries are taking an interest in the context of their National Bibliography roles. In a broader context, a couple of Nordic national libraries, for example, are proposing to take a periodic snap-shot of their country’s web pages (pull every page) and store it for posterity. UKOLN is involved in an EU project, BIBLINK, coordinated by the BL looking at the equivalent of CIP for electronic data. It would be interesting to explore what plans the BL had in this area and to see whether there was any possible liaison with other resource description initiatives. Similarly, the BL has overlapping interests with other archival organisations and has the significant challenge of deciding how to describe any electronic legal deposit material in such a way that it is available for future inspection. (BL/JISC cooperative activity building on other preservation work?).
The Library Association and the Library and Information Commission have submitted a bid to the Millennium Commission for monies to connect every public library to the Internet and to provide some base level of equipment. It has focused on infrastructure rather than on services, expecting services to follow. If the bid is successful (and even if it is not!) the public library community needs to begin to develop useful services. The reengineering of some business, community and reference services in a metadata/resource discovery environment would be a sensible step and establish network services with greater visibility.
There is considerable scope for cross-sectoral awareness raising. UKOLN will be organising domain-specific metadata workshops with the AHDS during early 1997 which should assist this. A sensible link between this world and the ‘book’ world would be useful, whatever form it takes.

Disclosure

Discussion places emphasis on discovery. However, it might be useful to also emphasise disclosure: the consistent disclosure of institutional resources. Organisations are going to adopt a more managed approach to information provision. Provision of structured metadata which can be picked up and used by several services would confer advantages, both for providers and potential users. Much of this will happen anyway, but there are some organised environments where the process can be shepherded. Some examples:

Individual universities, and the higher education and research system as a whole, will be making more information about their courses, more of their research results, and more training and course materials available on the network in more or less controlled environments. In an increasingly competitive environment, the JISC should have an interest in the organised, effective disclosure of the results of UK research and teaching. The subject services can play an obvious role here, but can also encourage the development of well-managed information repositories within their domains, and provide a focus for collection of metadata.
The CCTA is an important channel for dissemination of government information.
The need to think about CIP in an electronic environment has been mentioned.
The subject gateways are becoming aware of central resources in their areas. One direction would be for them to promote good practice among those resources.
There are two issues here: What is good practice? And what incentive is there for organisations to disclose their resources in this way? The answer to the first involves some investigation but it seems to me that we are almost there (see the section on embedding/harvesting metadata). The answer to the second involves the construction of services which make use of this metadata, and again some pointers occur elsewhere in this list. As with most things, the challenge is to find an organisational framework for moving things forward.

Harvesting/embedding

UKOLN is doing some work in this area:

It is working to enhance ROADS with a robot which will harvest embedded metadata. This is to give the ‘cataloguers’ a head start when they are creating records. This work is taking place in the context of the ROADS and Desire projects.
At the same time it has been participating in the discussions about embedded metadata based on Dublin Core. Metadata is one of the strands we expect the UK Web Focus to take an interest in and as soon as there is agreement we will be looking at ways of promoting sensible strategies for embedded metadata.

These two strands address some chicken and egg situations.

However it would be useful to consider wider contexts for this type of activity and explore what agencies should be pursuing initiatives. For example, the ACN has commissioned a study of robot and search engine technologies. If any UK search engine is launched, then it should be ‘metadata aware’. We will be making representations to this effect when it becomes clear if any initiative is to be launched.

At a more general level, there is scope for some work which looked at ways of cutting down the load imposed by robots. Martin Hamilton and Dan La Liberte have proposed extensions to HTTP to explicitly add support for downloading server indexes.

One can identify some of the components of a toolkit for creating and harvesting metadata in distributed contexts. A project which brought something together and tested it in a natural ‘disclosure’ community would be useful.

Training and awareness

We are involved in several initiatives here. For such an important topic there needs to be more work done here. We hope to build a link with Netskills to promote some aspects of good practice. Much could be done. It would benefit from a wider framework of priorities and objectives (e.g. an iniative to disclose resources of UK universities).

Gap between manual and automated

I have described a variety of approaches which largely depend on manual processes: the creation of the subject gateways, for example, and a variety which depend on automated approaches. The former can be quite costly; the latter quite crude. There are a variety of things which could be attempted once some infrastructure is in place to close the gap between the two. For example, robot generated citation exercises (following links to a certain depth and counting them) could generate sets of core resources from a seed set of known resources. Current subject gateways could be enriched by supplementary databases of indexes harvested from the resources that have been quality assured and described more fully in the main database. There are many other ways in which one could imagine enhancing services which would repay investment (the current subject services do not have significant technical development work at their disposal).

Interworking R&D

An ontology of metadata objects: Some research work which modeled the concepts, objects and relationships of interest across various domains would be very useful (domains: serials publishing, archives, geospatial data, statistical data sets, museums, etc, …). The aim would be to exploit work already done, but to put it in a larger comparative perspective.
A typology of relationships: There has been much discussion within the Dublin Core developers and among the eLib subject gateways about relationships and the need to build links to records. This may really be a component of the last item as might be a typology of resource types.
Assess the usefulness of the Dublin Core for semantic interoperability.: Some work is already taking place here - it will be examined in the context of the UKOLN/AHDS workshops. Much discussion suggests physical databases of Dublin Core records sitting in front of richer resources. It might be interesting to investigate also the feasibility of a Dublin Core attribute set.
Demonstrators: It seems to me there is really a need for one or two large scale projects which look at creating integrated access to a variety of resources based on standard approaches. These will allow various issues to be explored in practice and real experience of the challenges to be gained. Within a MODELS context and from various other strands there have been some suggestions for moving ahead. How best to make something of the ambition of one or two of the US Digital Library Initiative projects is something for discussion. However, even if that does not happen, there are a variety of smaller things which could be done which would show what was possible. UKOLN is developing a simple application which will search Z39.50 and WHOIS++ databases in parallel, and is going to show a unified search for art resources across several different types of services.
Proof of concept for the Warwick Framework.
The MODELS project has produced the notion of ‘clumps’ – virtually aggregated catalogues for the purposes of resource discovery – and some work will be done on this. The notion should be extended to include other metadata resources.

Other technology R&D areas

URN and URN resolution protocols
Not much has been said here about URNs. Clearly, they will do much to resolve certain issues - e.g. bringing together ‘copies’. There are several technologies on the table. However, again, much of the burden will be organisational: creating frameworks and structures. There may be some merit for some groups (e.g. subject gateways) to experiment with PURL servers or one of the URN proposals.
Investigate the concept of a collection. Is there enough commonality to make searching of ‘collection level’ descriptions a feasible option.
Potentially important and could do with more experience of: Harvest and centroids.
Heaney (1995) represents an interesting and persuasive view of how library practice could be adapted to overcome some of the deficiencies of MARC, including a migration path from MARC. Would it be feasible to explore moving in this direction in the context of a nationally funded service like COPAC.

REFERENCES

(Bearman and Sochats 1995) Bearman, D. and Sochats, K. Metadata requirements for evidence. Draft 10/95.

(Dempsey 1989) Dempsey, L. Bibliographic records: use of data elements in the book world. Centre for Bibliographic Management, Bath University Library, 1989. BNB Research Fund Report No. 40.

(Dempsey 1990) Dempsey, L. Bibliographic access: patterns and developments. In: Bibliographic access in Europe: first international conference. Lorcan Dempsey (ed.) Gower, 1990

(Dempsey 1996a) Dempsey, L. Meta Detectors. Ariadne, No. 3, 1996.

(Dempsey 1996b) Dempsey, L. ROADS to Desire: some UK and other European metadata and resource discovery projects. D-Lib Magazine, July/August 1996.

(Dempsey 1996c) Dempsey, L. Kicking up the dust in a chaotic storehouse. The Times Higher Education Supplement, Multimedia, 11 October 1996, p. iv.

(Dempsey and Heery 1996) Dempsey, L. and Heery, R. Metadata: an overview of current resource description practice. (Peer review draft of deliverable for Work Package 3 of Telematics for Research project DESIRE). September 1996.

(Dempsey and Weibel 1996) Dempsey, L. and Weibel, S. The Warwick Metadata Workshop: a framework for the deployment of resource description D-Lib Magazine, July/August 1996.

(Hardy 1996) Hardy, D. Resource Description Messages (RDM): Technical Specification: Draft 1.0b3.

(Heaney 1995) Heaney, M. Object-oriented cataloging. Information Technology and Libraries, September 1995 pp.135-153.

(Heery 1996a) Heery, R. ROADS: Resource Organisation and Discovery in Subject-based Services. Ariadne, No. 3, 1996.

(Heery 1996b) Heery, R. Review of metadata formats. Program,Vol. 30, no.4, October 1996, pp. 345-373.

(Heery 1996c) Heery, R. ROADS templates: how they are used. September 1996.

(Lynch et al 1995) Lynch, C. and Michelson, A. and Summerhill, C and Preston, C. The nature of the NIDR challenge. Draft of April 10, 1995 Note: Latest draft available from

(Orfali et al 1996) Orfali, R. and Harkey, D. and Edwards, J. The essential distributed objects survival guide. John Wiley, 1996

(Venturi 1996) Venturi, R. Iconography and electronics upon a generic architecture: a view from the drafting room. MIT Press, 1996

APPENDIX

Selective book trade elements

Work -
intellectual object Title - published object Copy - Handled object
Author/other responsibility Heading/authority As written
Title Heading/authority As written
Subject descriptor Heading/authority -
Abstract/description Abstract
Genre/category Yes
Evaluation Review, etc
Date of creation Yes Yes Yes
Identifier - ISBN Control no./unique identifier
Publisher/place - Yes
Date of publication Yes
Edition (Edition history)
Distributor
Terms and conditions - Price Loan policy/situational
Intellectual property rights Copyright
Format - Yes
Physical characteristics Publication specific
Date of receipt Yes
Transaction use - - Circulation data
Provenance/history - - Some contexts including historical bibliography
Location - - Building/shelf number

	Work - intellectual object	Title - published object	Copy - Handled object
Author/other responsibility	Heading/authority	As written
Title	Heading/authority	As written
Subject descriptor	Heading/authority	-
Abstract/description	Abstract
Genre/category	Yes
Evaluation	Review, etc
Date of creation	Yes	Yes	Yes
Identifier	-	ISBN	Control no./unique identifier
Publisher/place	-	Yes
Date of publication		Yes
Edition (Edition history)
Distributor
Terms and conditions	-	Price	Loan policy/situational
Intellectual property rights		Copyright
Format	-	Yes
Physical characteristics		Publication specific
Date of receipt		Yes
Transaction use	-	-	Circulation data
Provenance/history	-	-	Some contexts including historical bibliography
Location	-	-	Building/shelf number

Distributed at the UKOLN Advisory Committee meeting 19th November 1996. Not for further circulation. AC/96-1 17

UKOLN Information Strategy V 0.3, Draft for discussion, 11 November 1996. 2

Web page by Isobel Stark, 13th November1996