| Document title | Metadata issues paper |
|---|---|
| Author(s) | Lorcan Dempsey |
| Version | 0.1 |
| Date | 11-Nov-96 |
| Status | Work in progress: draft for discussion |

| Version | Date | Status | Description | Distribution |
|---|---|---|---|---|
| 0.1 | 11-Nov-96 | Work in progress | Draft for discussion | UKOLN Advisory Committee |
Note
This document has been prepared at the request of the UKOLN Advisory Committee. It is very much a work in progress. It relies heavily on several of the reports and papers produced as part of UKOLN’s metadata project work and listed in the references.
However, our feeling is that at this point ‘metadata’ as a descriptive term has become so debased by overuse (and means so many different things in different communities and contexts) that it is now virtually meaningless without extensive qualification; unfortunately, it has also become a very fashionable term. The very vagueness of the term metadata today makes it easy to offer sophisticated-sounding proposals about using metadata in various ways which seem to be almost impossible to reduce to practice, or which are extremely pedestrian when actually implemented.
Clifford Lynch, Avra Michelson, Craig Summerhill, Cecilia Preston (Lynch et al 1995)
What a supreme irony that those who proclaim and pursue vision are the least likely to attain it. … And – often – those who are later considered visionary were earlier considered nerds.
Robert Venturi (Venturi, 1996)
This document aims to outline some of the issues surrounding the design and deployment of metadata, with special reference to current UK developments. It seems to me that it is possible to advance a partial vision in relation to a partial view of metadata: that there should be unified access to the UK scholarly resource, for example. This is partial because of the constrained view it takes of the functions of metadata, looking only at resource discovery, and of the user community, focusing only on scholarly resources. A much larger vision sees metadata as pervasively disseminated throughout the network to describe people, services, software components and data, in support of self-describing, dynamically reconfigurable distributed systems and services (later examples should make this clearer).
To attempt a grander vision, though, would be to fall foul of the charges laid in the first quote above and to submit to the hubris implied in the second. For example, to have promoted a future view of developments three years ago might have ignored the utterly transforming influence of the Web. We are still in a technical construction phase in which the visionary nerd can have major unanticipated influence, in which many theoretical issues remain unresolved and in which organisational and business issues are yet to be addressed. At the same time, significant commercial and research interest is now focused on issues in this area. (Of particular interest here might be Netscape’s work on Resource Description Messages (Hardy 1996))
Accordingly, the larger vision is not attempted here. An opening section approaches a definition by way of example. In subsequent sections, the focus is narrowed to a particular ‘type’ of metadata, that which describes ‘information and document-like resources’. This choice is determined by its presumed readers’ interests, but, more importantly, by its author’s competences.
Within this scope, the paper notes some directions and suggests some areas in which concerted UK attention would be useful. It provides significant background material so that there is enough context for the reader to relate these issues to wider developments and trends and to have some sense of some of the environments in which these discussions are taking place. It suggests some priorities for attention and action.
‘Metadata is data which describes attributes of a resource’: this is the routine definition, though it doesn’t take us very far. This section provides some examples of metadata and its use before proposing a fuller definition.
What does metadata look like?
What types of things do users need to know about resources?
These examples confirm that we are looking at a diffuse environment of use. Many others could be offered. However, at this stage we can offer a preliminary fuller definition of metadata:
metadata is data associated with objects which relieves their potential users of having to have full advance knowledge of their existence or characteristics.
A user might be a program or a person.
As the proportion of the intellectual record which appears on the network grows, appropriate metadata is seen as a central part of a mature information, business and technical environment. In an indefinitely large resource space, users need advance knowledge which allows them to discover resources, know what terms they are available under, assess their potential usefulness, be assured of their authenticity, and so on. Metadata needs to be directed at human users, but increasingly it also needs to be addressed to programmatic users. The ability to store searches and user profiles, to consolidate retrieved results from several resources, to filter and summarise, and to pass off some of the drudgery of information seeking to programs will be increasingly necessary. These services may be a prelude to more capable agents: autonomous programs which act on behalf of users in distributed, heterogeneous environments. Metadata will assist effective human use of resources; it will be essential for effective program use of resources. Metadata is knowledge which allows human and automated users to behave intelligently.
At this stage, it would be nice to step back and present an organised view, or a reference model which outlined the objects and concepts of interest and the relationships between them: an ontology, to use a borrowed philosophical term now gaining currency. A general view is not readily available. This is for one obvious reason: a general approach has not been taken to design and development of metadata formats across domains. Even when we narrow our focus as suggested above, a variety of organisational and discipline-specific initiatives are in place between which there are different levels of mutual knowledge or influence. Different approaches have been taken in different domains; in some cases frameworks are only now being put in place; organising principles may not be explicit. Clearly, certain application areas require very different approaches, but I suspect that there is significant redundancy in several areas. This may not be an issue now, but certainly will be when it comes to interworking across domains and this is being recognised.
A shared ontology, which conceptualised the objects and relationships that need to be represented in particular metadata formalisms, would clarify understanding and facilitate future mapping between domains. Some domain-specific metadata approaches are listed below, but first it might be useful to outline a rough sketch of a model generalising the approach of one particular domain, the library community.
Libraries have evolved very full theoretical, technical and organisational apparatuses for resource description and discovery, particularly for books, and there is a full body of experience which can be prospected. However, the aim here is not to suggest that libraries have built on an explicit ontological base for their work (one could argue that the reverse is true), but to give us a handle on some terms and concepts for later discussion and a comparative perspective. (Some readers may prefer to skip the detail of the following section.)
Underpinning library cataloguing practice is a three-fold conception of the book:
So (in theory) the library-constructed metadata consists of descriptive data derived from the publication, some copy-specific data, and headings. Headings are supposed to operate at the ‘work’ level. They are metadata which aims to relieve the user of having to know in advance the individual characteristics of all the manifestations of a work, the different versions of an author name, or all the items on a particular subject.
Cataloguing also provides for relationships to be expressed between these three ‘objects’ (work, publication, copy). Barbara Tillett identifies a variety of relationships following a review of cataloguing codes: equivalence (copy, reproduction), derivative (editions, translations, ..), descriptive (commentary, criticism, …), whole-part, accompanying (e.g. parts of a kit), sequential, shared characteristic (same author, publisher, etc.). A variety of linking devices have been developed to express these relationships, influenced, she notes, by the technology used to create the catalogue.
So, provision is made for description of individual published objects and for integrating them into the collections of which they are a part by means of collocation of headings and references. These integrating mechanisms work more or less well depending on the intellectual effort applied to their creation. However, there are a number of problems.
Emphasis is given to the published object, starting from its format and physical description. As we move into an environment where, increasingly, ‘content’ may be manifest in several different formats this is a disadvantage and is one of the issues facing those using MARC for cataloguing of electronic resources. The means for bringing together works and indicating relationships do not always work very well and this will create an issue for merging results across many databases. Duplicates will have to be identified. But there is also the ‘Humphrey Clinker problem’, a term coined by OCLC for the fact that a search on a large database like OCLC’s retrieves a large variety of manifestations of the same work. The user is presented with a long listing, based on publication, which has to be read through. The user really wants access at the work level, and an indication of the relationships between publications.
A very provisional table of data elements widening the emphasis to the book world is presented in an Appendix. What it shows is three slightly overlapping categories of data:
A copy inherits publisher/source and content data. To describe a particular resource one wants to say something about its content, the particular business and technical characteristics of the publisher/source, and something about its status, to assure a potential user of what it is.
This might be generalised as a descriptive model:
Agents might be people, organisations or other entities. Agent links might be: author, investigator, editor, translator, designer, technical contact, data centre manager, webmaster, intellectual property rights owner, host, publisher, distributor and so on. Other links can be imagined (is part of, etc).
This is a very hasty outline. The author is aware of obvious deficiencies and would not publish in this form. It derives from a limited ontology (based on the discussion in Dempsey 1990 and Dempsey 1989), influenced by the interesting bibliographic analysis in Heaney (1995) and is not full enough to be persuasive. It is oriented towards the (now famous) document-like object. It is presented as a gesture towards what would be a useful cross-domain exercise: the identification of the objects of interest and the relationships between them prior to any representation in particular metadata models. Any models would have to be tested against the emerging reality of a very complex network world. Much of the work would have to go into defining the link attributes - in what ways should resources be related?
It does have the virtue though of disentangling some of the levels at which objects of interest exist: much current discussion assumes that objects live in a very redundant flat space. Bearman (1995) outlines a similar process based on a much fuller investigation and study of the archival literature.
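The descriptive model sketched above might be caricatured in code. The following is a hypothetical sketch only: the class names, link labels and example titles are illustrative and are not drawn from any of the formats discussed.

```python
# A hypothetical sketch of the agent/resource/link model outlined above.
# Names are illustrative only.

class Agent:
    """A person, organisation or other entity associated with a resource."""
    def __init__(self, name, kind="person"):
        self.name = name
        self.kind = kind

class Resource:
    """An object of interest (work, publication, copy, data set, ...)."""
    def __init__(self, title):
        self.title = title
        self.agent_links = []     # (role, Agent): author, publisher, host, ...
        self.resource_links = []  # (relationship, Resource): is-part-of, translation-of, ...

    def link_agent(self, role, agent):
        self.agent_links.append((role, agent))

    def link_resource(self, relationship, other):
        self.resource_links.append((relationship, other))

# Example: a translation linked back to the original work, so that the
# relationship is explicit rather than left to be inferred from flat records.
original = Resource("Expedition of Humphry Clinker")
translation = Resource("L'Expedition de Humphry Clinker")
translation.link_resource("translation-of", original)
translation.link_agent("author", Agent("Tobias Smollett"))
translation.link_agent("publisher", Agent("Some Press", kind="organisation"))
```

Much of the real work, as noted above, would lie in agreeing the vocabulary of link attributes rather than in any such mechanical representation.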
(Note: We may attempt some work like this in the context of the AHDS/UKOLN Metadata workshops: explore where there is underlying agreement across domains and where there is divergence.)
Much of the earlier work on resource discovery had a very simple file-based view of a resource. We still use ‘resource’ quite casually, though it can cover a wide spectrum of possibilities. A resource might be a file, or a database, or a record in a database, or the metadata about a database. Increasingly, however, resources will be a complex of data and services which are opaque to a human or robot user. A high-level interface may be available which hides the actual way in which the service is provided. A resource may be opportunistic or fugitive: existing only in response to a particular conjunction of events or a particular query. For example, customised or tailored services may appear which are based on some underlying database whose characteristics are unknown. Resources will be mutable and dynamic. And again, when we look beyond the information realm, the diversity is significant. A typology of resource types might be interesting.
One can suggest an approximate grouping along a metadata spectrum which becomes successively richer in terms of fullness and structure. For purposes of analysis, we propose three bands within this spectrum, which allows us to sketch some shared characteristics across groups of formats. These are discussed under several heads below.
Band one includes relatively unstructured data, typically automatically extracted from resources and indexed for searching. The data has little explicit semantics and does not support fielded searching.
Currently, this data is created by the web crawlers. Many such services exist, and several global services are in heavy use. If a user is looking for a known item, such services can be reasonably effective. Because they are global in scope and operate on limited descriptions, they are less effective for discovery. A user may find many resources, but may have to sift through them, and will miss many potentially relevant resources because they are not indexed with appropriate terms. Nor, in many cases, is the description full enough to allow the user to make relevance judgements in advance of actually retrieving the resource. Typically, crawlers are not selective about the resources they index: they often aim for comprehensiveness at some level within their target area, whether that is the world or some part of it. For these reasons, they have some limitations as discovery services. These issues are well known, and such services are seeking to enhance the metadata on which they operate: different services have different conventions to allow authors of web pages to include various categories of metadata which can then be collected. There is also some discussion about a common representation for the exchange of such metadata between indexes and other services, and about the harvesting of fuller metadata.
Because of the way they are designed, crawlers parallel the disorganised nature of the web itself - they operate exclusively at the ‘copy’ level described above. They are not equipped to recognise duplicated files or relationships between resources. Because they rely on documents themselves for indexing, they do not provide data about the status of a resource: whether it is fit for purpose, or whether it is what it purports to be, and so on. They only cover publicly available web pages – less visible resources (whether because they are commercial, because they are hidden behind CGI interfaces, or for other reasons) are less well covered.
Band two includes data which contains full enough description to allow a user to assess the potential utility or interest of a resource without having to retrieve it or connect to it. The data is structured and supports fielded searching. Typically, these descriptions are simple enough to be created by non-specialist users, or do not require significant discipline-specific knowledge. Descriptions tend to be of discrete objects and do not capture multiple relationships between objects. Typically, but not essentially, descriptions are manually created, or are manual enhancements of automatically extracted descriptions, and they include a variety of descriptive and other attributes. They may be created to be loaded directly into a discovery service or to be harvested.
Services using these types of formats include OCLC’s NetFirst (based on its own internal format) and the UK Electronic Libraries Programme subject-based information gateways (some of which use their own internal format; some use IAFA templates). Often, these services involve some selectivity in what they describe and may have more or less explicit criteria for selection. For these reasons, they may be expensive to create, again driving an interest in author- or publisher-generated description and automatic extraction techniques such as those piloted by Essence as part of the Harvest system.
These services often focus on the server level, creating records for repositories or collections of resources. Interesting recent discussion over the Dublin Core and the eLib subject based services has exposed the tension between simplicity and structure - which is mentioned below in discussion of the records they use.
Our third band includes fuller descriptive formats which may be used for location and discovery, but which also have a role in documenting objects or, very often, collections of objects. Typically, they are associated with research or scholarly activity, require specialist knowledge to create and maintain, and cater for specialist domain-specific requirements. They are expressive enough to capture a variety of relationships at different levels. Developments include the ICPSR SGML codebook initiative to describe social science data sets, the Encoded Archival Description, the Content Standards for Digital Geospatial Metadata, and the Computer Interchange of Museum Information (CIMI). As might be expected with large international disciplines, there may be more than one initiative in a domain. For example, CIMI proposes a framework for the creation, search and retrieval of metadata, but there are other approaches within the museums community. The geospatial area is quite well developed, and the Federal Geographic Data Committee (FGDC) has been steering the Content Standards for mapping and geospatial data. Four characteristics of metadata guided development: discovery (to find what is available), fitness for use (does a data set meet a particular need), access (data needed to acquire a set of data), and transfer (data needed to process and use a set of data) (ref needed). As noted above, there are now metadata initiatives across all major scholarly disciplines as they prepare for effective digital use of their materials. For example, there are several initiatives within different branches of engineering, and the IEEE has been organising metadata workshops. There is a biological initiative which is looking at the use of FGDC and MARC within the biological sciences (ref needed). And so on.
It should be clear that these are not watertight categories, especially as implementations may vary. GILS (the Government Information Locator Service, an approach to providing access to government information in the US based on a Z39.50 profile) and CIMI object descriptions might be considered to be in the middle band, for example.
Against this background one can note some trends, especially across the boundaries of these bands. Author- or site-produced metadata will become more important for many purposes. This may be harvested unselectively, or only from selected sites. An important motivation for this is to overcome some of the deficiencies of current crawlers without a provider incurring the cost of record creation. In some respects, the crawlers will assume some of the characteristics of the middle band. (Alta Vista, for example, encourages people to embed metadata. Within the Desire project, the Nordic Web Index, a Nordic ‘search engine’, is being enhanced to be ‘metadata aware’: where embedded metadata is available, it will be harvested. More of this kind of thing later.)
At the same time, communities using the richer ‘documentation’ formats will wish to disclose information about their resources to a wider audience. How best to achieve this will have to be worked out: perhaps ‘discovery’ records will be exported into other systems. These trends suggest that the middle band will become more important as a general-purpose access route, maybe with links to richer domain-specific records in some cases.
There is currently no widely-used standard for data in band one, though there are moves to develop a shared format for exchange, perhaps based on SOIF (the record syntax used by Harvest, a set of tools for creating distributed indexes of network resources). There is also a trend, noted above, to enhance the data collected by these services in various ways, making them better suited to discovery.
The middle band metadata used in discovery services tends to be based on simple record structures influenced by RFC-822 style attribute-value pairs. Formats here do not contain elaborate internal structure, do not easily represent hierarchical or other aggregated objects, nor, typically, do they express the variety of relationships which might exist between objects. This is usually by design: there is a necessary trade-off between simplicity and expressiveness. Also, their purpose is to be hospitable to the non-specialist description of information objects of different types and from different domains, and so they are not concerned with the very specific requirements of any one domain. Of the service formats which we examine here, IAFA templates are perhaps the most detailed. There are templates for different types of object (document, user, logical archive, etc.), and there has been some consideration given to ‘clusters’ of data which are likely to be repeated across records and to variants within records.
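By way of illustration, a record in this attribute-value style might look as follows. This is a hypothetical example: the field names are modelled loosely on those found in IAFA-style templates, and an actual template would differ in detail.

```text
Template-Type:  DOCUMENT
Title:          Metadata issues paper
Author-Name:    Lorcan Dempsey
Description:    Discussion paper on metadata formats and
                services for networked resource discovery.
Keywords:       metadata, resource discovery
URI:            (location of the resource)
```

The flat, human-readable structure is what makes such records cheap to create and easy to harvest, and equally what limits their ability to express relationships between objects.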
There has been much recent interest in the Dublin Core, a list of 13 data elements with some qualifiers, which has been developed to act as a simple description format (Dempsey and Weibel 1996). A separate section below looks at this in a little more detail.
Although the bulk of the formats in this range follow an attribute-value pair structure, it has been agreed that an SGML DTD will be developed for the Dublin Core. Because of some similarity of construction and content across formats in this band, conversion between them, though inevitably lossy, is feasible.
The documentation band contains some very full frameworks for the description of multiple aspects of objects and collections of objects. In some cases, the frameworks describe metadata objects as only one type of information object: they are concerned with ‘information content’ also. Typically, work is proceeding within an SGML context, and the example of the Text Encoding Initiative has been quite influential. Within the social sciences, museums, archives and geospatial data communities, work is progressing on establishing DTDs. These may relate to collection-level description and item-level description, and allow various levels of aggregation and linkage appropriate to the domain. They cater for a very full range of attributes appropriate to documenting data sets or other resources. They can be distinguished from the range in the middle band by fullness (they go into more detail), structure (they contain richer structuring devices), and specialism (they may be specific to the relevant domain).
It seems likely that specialist users will want to search such data directly, but that to make data more visible to more general ‘discovery’ tools, there may be export of data in some of the simpler formats used in the middle band. Indeed, the Dublin Core has been explicitly positioned as a basis for semantic interoperability across richer formats, although it has not been widely used in this context.
Various obvious contrasts between the bands are clear. The web crawlers currently operate at a very fine-grained level: they see a world of pages. The services in the middle band face an interesting development challenge: to reconcile the economic and service goals of simplicity and a generic approach with the desire to make descriptive practices responsive to the relatedness of the information world at various levels. The domain specific approaches tend to focus on the description of particular ‘collections’ and to capture some of this relatedness, but are currently in various stages of development and resources may not be yet visible through general purpose tools.
Middle band discovery services are being delivered through emerging distributed searching and directory approaches on the Internet, notably Harvest, Whois++, LDAP, and Dienst. There is some use of Z39.50 also, notably for GILS.
An unknown factor is the influence of Netscape’s ongoing work based on Harvest technologies. It is working with something called Resource Description Messages, a framework for the search and retrieval of metadata. (Hardy 1996)
Many band three approaches are in early stages. However, there has been some discussion of using Z39.50 for search and retrieval in several cases. In particular, there has been some interest in the Z39.50 profile for access to digital collections.

Standards-based resource discovery services are also in early stages. Examination of the descriptions collected in Dempsey and Heery (1996) shows that many formats are still under development or are not widely implemented.

In Band 3, the ‘documentation category’, in particular, communities of users are working towards consensus, and in some cases robust interoperating implementations are some time away.

In Band 2, the ‘discovery category’, IAFA/Whois++ templates are in use in several projects, and are deployed in Whois++ directory services. Dublin Core is being piloted in several projects, but an agreed syntax is only now being defined. RFC-1807 is used within the NCSTRL project.

The Dublin Core is a simple resource description format. It has attracted considerable attention recently, partly because of the eloquence and consensus-building activity of its proponent, Stu Weibel, but more importantly because it has positioned itself as a potential solution for three pressing requirements. The first is to have a generally acceptable simple resource description format which is hospitable to the description of a wide range of resources. Following recent discussions, the Dublin Core (a list of 13 elements with some qualifying structure) is being adapted to take on board some of the concerns of those with an interest in image metadata and to address some structural and content issues. The second target use is to provide a semantic base for metadata embedded in HTML (and subsequently other) documents. The third target use is to provide a base for semantic interoperability between richer metadata domains. Richer record formats might map a core set of data onto Dublin Core to provide a common set of elements for discovery purposes; this might be implemented in various service and technical environments. It should be noted that the Dublin Core only looks at one aspect of metadata - simple description - and that a concrete syntax for it has only recently been defined, and not much implemented.
UKOLN and OCLC jointly organised a conference in Spring 1996 to examine various general metadata issues and the Dublin Core in particular. The venue was Warwick, and a new requirement was identified and scoped (Dempsey and Weibel, 1996), which resulted in the Warwick Framework proposal.
Satisfying the need for competing, overlapping, and complementary metadata models requires an architecture that will accommodate a wide variety of separately maintained metadata models. It was concluded that an architecture for the interchange of metadata packages was required. A package is conceived as a metadata object specialised for a particular purpose. A Dublin Core-based record might be one package, a MARC record another, terms and conditions another, and so on. Such discrete packages might be numerous and varied in content and even source. Users or software agents would need the ability to aggregate these discrete metadata packages in a conceptual framework, hence the notion of a container-package architecture.
This architecture should be modular, to allow for differently typed metadata objects; extensible, to allow for new metadata types; distributed, to allow external metadata objects to be referenced; and recursive, to allow metadata objects to be treated as ‘information content’ and have metadata objects associated with them.
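The container-package idea might be caricatured in a few lines of code. This is a hypothetical sketch only: the Warwick Framework proposal defines no such API, and the names and package types shown are merely the examples mentioned above.

```python
# A hypothetical sketch in the spirit of the Warwick Framework's
# container-package architecture. Nothing here is defined by the proposal
# itself; names are illustrative.

class Package:
    """A metadata object specialised for a particular purpose."""
    def __init__(self, package_type, content):
        self.package_type = package_type  # e.g. "dublin-core", "marc"
        self.content = content            # opaque to the container

class Reference(Package):
    """A 'distributed' package: a pointer to externally held metadata."""
    def __init__(self, package_type, uri):
        super().__init__(package_type, content=None)
        self.uri = uri

class Container(Package):
    """Aggregates packages. Being itself a Package, containers can nest
    (the 'recursive' property), and new package types can be added freely
    (the 'extensible' property)."""
    def __init__(self):
        super().__init__("container", content=None)
        self.packages = []

    def add(self, package):
        self.packages.append(package)

    def find(self, package_type):
        return [p for p in self.packages if p.package_type == package_type]

# One container might hold a Dublin Core-based record, a MARC record, and
# a reference to externally maintained terms and conditions.
c = Container()
c.add(Package("dublin-core", {"Title": "Metadata issues paper"}))
c.add(Package("marc", "<marc record>"))
c.add(Reference("terms-and-conditions", "http://example.org/terms"))
```

The point of the sketch is only that the container is indifferent to the content of each package, which is what allows competing and complementary metadata models to coexist.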
Although there is wide agreement that this is a sensible direction, the Warwick Framework has not been implemented at the time of writing, and certain issues remain outstanding. There is still considerable intradomain discussion; the need for inter-domain exchange is recognised but the applications framework for this is not in place. Because of the variety of metadata approaches that have been discussed here, and the variety of metadata requirements which may exceed the current provision of any one format, the Warwick Framework has attracted a lot of interest as a simple but potentially very powerful architectural component.
The discipline or control exercised over the production of collections of resources will improve as the web becomes a more mature publishing environment. There will be managed repositories of information objects. Such repositories may be managed by information-producing organisations themselves, universities for example, by traditional and ‘new’ commercial publishers, or by other organisations (the Arts and Humanities Data Service in the UK, for example, or industrial and other research organisations, archives, image libraries, and so on). This is not to suggest that the existing permissive electronic publishing environment will not continue to exist in parallel. One concern of a managed repository will be that its contents are consistently disclosed and that descriptions are promulgated in such a way that potential users, whoever they might be, are alerted to potentially relevant resources in that repository.
Different repositories will have different requirements and priorities. Examples are a social science data archive, a university web site, a commercial publisher’s collection of electronic journals, an archival finding list, and so on. Objects on a university web site may be briefly and simply described. A data archive may need extensive documentation.
There will be a variety of metadata creators. These fall into three broad categories: ‘authors’, repository managers, and third-party creators. As its importance becomes more apparent, ‘authors’ are likely to create descriptive metadata: a major incentive for this will be agreement about the use of META tags in HTML documents for embedding metadata which will be harvested by programs. Following various discussions earlier this year, partly prompted by the Warwick metadata workshop, an agreed convention for embedding metadata without breaking existing browsers or robots may soon be available. Descriptive data will be similarly embedded in other objects by those responsible for their creation. Metadata will also be created by repository managers, who have some responsibility for a resource and the data that describes it. Third-party creators (including, for example, the eLib information gateways) create metadata for resources which they themselves may not manage or store.
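A hypothetical illustration of the kind of convention under discussion follows. The element names here echo the Dublin Core, but no naming convention had been finally agreed at the time of writing, so the details are illustrative only.

```html
<!-- Hypothetical example of embedded descriptive metadata in an HTML
     document head; element naming conventions were still under
     discussion at the time of writing. -->
<HTML>
<HEAD>
<TITLE>Metadata issues paper</TITLE>
<META NAME="DC.title"   CONTENT="Metadata issues paper">
<META NAME="DC.creator" CONTENT="Lorcan Dempsey">
<META NAME="DC.date"    CONTENT="1996-11-11">
<META NAME="DC.subject" CONTENT="metadata; resource discovery">
</HEAD>
<BODY>
...
</BODY>
</HTML>
```

Because existing browsers ignore unrecognised META tags, such a convention can be adopted without breaking current pages or robots.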
Metadata may sit separately from the resources it describes; in some cases, it may be included as part of the resource. Metadata embedded in HTML tags is probably the simplest example of the latter case, but it is common in some of the domain-specific SGML frameworks mentioned above. For example, a TEI (Text Encoding Initiative) header needs to accompany conformant TEI documents. However, independent TEI headers may also exist, which describe documents which may be physically remote.
Metadata, once created, may be shared with others. Take, for example, author-created metadata
embedded in HTML documents. This may be collected by robot or other means. Value will be added to
this data at various stages along whatever use chain it traverses: by a local repository manager, by
subject-based services, by crawler-based indexing services, by various other intermediary services. These
intermediary services might include librarians and others who now invest in current awareness and SDI
(selective dissemination of information) services, and perhaps also current abstracting and indexing
services. Many authors may provide only basic information: typically they will not be conversant with
controlled subject descriptor schemes, nor will they record all intellectual or formal relationships with other
resources, and so on.
A different use chain might be traversed by fuller metadata associated with the scholarly edition of
an electronic text, for example. Full documentary metadata would be available to assist in the analysis
and use of the text, but a subset might be output to a general purpose discovery service. There might be
a link back to the fuller metadata from the shorter record.
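This second use chain can be sketched simply: a subset of the full documentary record is output to a discovery service, with a pointer back to the fuller metadata. All field names, values and the identifier URL below are invented for illustration.

```python
# A sketch of deriving a short discovery record from fuller
# documentary metadata. All fields and the URL are hypothetical.
full_record = {
    "title": "The Canterbury Tales: electronic edition",
    "creator": "Geoffrey Chaucer",
    "editor": "N. F. Blake",
    "encoding-description": "TEI P3, with editorial apparatus",
    "source-description": "Hengwrt manuscript",
    "revision-history": ["1995-03: first release", "1996-06: corrections"],
    "identifier": "http://example.ac.uk/texts/ct/header.xml",  # hypothetical
}

# Only a subset is passed to a general-purpose discovery service,
# with a link back to the full documentation.
DISCOVERY_FIELDS = ("title", "creator", "identifier")
discovery_record = {k: full_record[k] for k in DISCOVERY_FIELDS}
discovery_record["fuller-metadata"] = full_record["identifier"]
print(sorted(discovery_record))
```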
A number of factors, including the perceived value of a resource, will determine the relative balance
between author-produced, added-value, and third-party original descriptions in different scenarios. The
metadata ecology and economy are still developing.
A CONVERGENCE OF COMMUNITIES?

The last section focused on the development of metadata in the network environment. Within the
BIBLINK project UKOLN is looking at metadata developments over a wider range of activities. It is
interesting to note the few points of contact that exist between ‘conventional’ publishing activities and
the networked world in terms of discussion about resource description, unique identifiers and other
issues.
A LOOK AT DISTRIBUTED ACCESS TO METADATA

The importance of metadata to future network environments has already been sketched. Programs
will handle it in various ways depending on the context. Here we look briefly at search and retrieve in a
network information environment.
Programs will collect and manipulate data in a variety of ways, passing it off to other applications in
the process. Data may be collected and served up in various environments, converted on the fly to
appropriate formats: it may be collected by robot, added to a local 'catalogue', or pulled into a subject-
based service. The metadata we have been talking about refers to network information resources. This
will need to be integrated at some level with the large, albeit highly fragmented, metadata resource for
the print literature. There may also be metadata about people, about courses, about research
departments and about other objects. Programs might periodically look for resources that match a
particular user profile, might search for people with a particular research interest, and so on.
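Such profile matching might be sketched as follows. The records, subjects and profile are invented for illustration; a real SDI service would work over far richer metadata.

```python
# A sketch of an SDI-style profile match: a program periodically
# compares new resource descriptions against a stored user profile.
# All records and interests are invented for illustration.
records = [
    {"title": "Review of metadata formats",
     "subjects": {"metadata", "cataloguing"}},
    {"title": "Jet propulsion notes",
     "subjects": {"engineering"}},
]
profile = {"user": "example-user", "interests": {"metadata", "z39.50"}}

# Alert the user to any record sharing a subject with their interests.
alerts = [r["title"] for r in records
          if r["subjects"] & profile["interests"]]
print(alerts)   # ['Review of metadata formats']
```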
These developments will take place in a rapidly changing distributed environment in which directory
protocols (e.g. Whois++, LDAP), search and retrieve protocols (e.g. Z39.50), Harvest, and a variety of
other approaches will be deployed. These will be hidden from the user by the web, itself likely to be
transformed by the integration of Java and distributed object technologies.
A variety of approaches

We can identify a scale of interoperability among services. At one extreme, they may be closed or
proprietary, and not interoperate with other services. In the area under investigation this is the typical
situation. For example, search services tend to be proprietary and closely coupled with the databases
they provide access to. Each CD-ROM or telnet service is accessed differently, and this difference is
not hidden from the user. More recently, some shallow level of interworking may have been
introduced by making services available through similar web interfaces. However, the burden of
selecting between the services and consolidating content rests with the user.
If services can be accessed through standard server interfaces then greater interoperability is
possible. Different applications could use the same, or common, client services to access a range of
services in a consistent manner. For example, a library OPAC, a subject information gateway, or an online
host could use Z39.50-based services to access a variety of information sources in a consistent way.
Similarly, if they themselves provide Z39.50 server interfaces, their own sources are available to be
accessed in a consistent way by other applications. Several of the protocols in this area were mentioned
above. The ROADS project is developing a distributed systems framework based on WHOIS++, an
Internet search and retrieve protocol initially developed for directory applications. The aim is to provide
the eLib subject gateways with the ability to act autonomously with individual web interfaces, while also
being able to be accessed as a collective unit within a WHOIS++ framework. WHOIS++ will provide a
consistent search and retrieve interface and a technique for routing queries to relevant servers.
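In outline, the routing technique works from summaries (‘centroids’) of the index terms each server holds: a query is forwarded only to servers whose summary matches. The server names and terms below are invented, and real WHOIS++ centroids are considerably richer than this sketch.

```python
# A simplified sketch of WHOIS++-style query routing. Each server
# publishes a "centroid" (a summary of the index terms it holds); a
# query is forwarded only to servers whose centroid matches. All
# server names and terms are invented for illustration.
centroids = {
    "sosig.example": {"sociology", "economics", "politics"},
    "omni.example": {"medicine", "nursing", "anatomy"},
    "eevl.example": {"engineering", "mechanics", "electronics"},
}

def route_query(terms, centroids):
    """Return the servers whose centroids contain every query term."""
    terms = {t.lower() for t in terms}
    return sorted(s for s, c in centroids.items() if terms <= c)

print(route_query(["economics"], centroids))   # ['sosig.example']
```

A query for a term held nowhere is forwarded to no server at all, which is the source of the load saving over broadcasting every query to every server.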
It might be interesting to look at the subject gateways in a little more detail. They are part of a
bigger picture: several other discovery services are planned, and there are likely to be a variety of
cases where services or organisations want to provide access to a repository of materials through a
database of descriptions - pre-print archives, training materials, image collections, electronic journal
collections, and so on. The provision of a similar type of descriptive metadata is seen as a central
component of the AHDS, of other ongoing European and eLib projects, and of a range of other
initiatives. Some of these projects/services will want to retain control of their data; some may be happy
to let it go to host services. However, whatever organisational patterns emerge, it is very clear that not
all of a user's requirements will be met by one or two centralised services. Even if all of the subject gateway
data were on a centralised service, it would provide only a partial solution. Users will not merely want to
cross-search between the eLib services, they will want to cross-search between them and other services.
In fact, the latter is probably a more important requirement in some cases. And some eLib services may
want to operate small distributed frameworks to bring in related resources. ROADS is
motivated by the reality of this inevitably distributed environment and aims to deliver some components
of an integrating systems framework.
But it is clear that not everybody will be looking at WHOIS++. We are at the beginning of this second
stage, where a common set of protocols and data formats will improve interworking, supplemented by
descriptive metadata and a federating layer of middleware which hides underlying protocol and format
differences. (UKOLN’s MODELS project is looking at what the components of such middleware might
be and how they might be constructed.)
The third level in our simple hierarchy is that of open, distributed objects where objects might be
located anywhere on the network, exist as ‘plug and play’ components, publicise information about
themselves, and are available for discovery by other objects which may wish to use their services. It will
be a while before such services are introduced in this area, although systems and services influenced by
distributed object technologies are now beginning to appear.
As noted above, we are talking about metadata at two levels here: the searchable repositories of
metadata such as the eLib gateways, and metadata which describes these repositories themselves.
[insert something here from MODELS to expand on this? A few pictures?]
SOME ISSUES AND DIRECTIONS

I hope I have shown some of the complexity of the current situation. This list collects some of the
issues that might be addressed in the communities in which we work. UKOLN is already active in some
of the areas touched on.
This list is certainly not exhaustive, and is left deliberately open-ended to facilitate discussion. I
would propose developing a firmer list on the basis of input from the committee.
Policy and strategy

In general I favour some high-level coordination - however that is managed. Things may be brought
together in various ways with current approaches, but a lot of effort is being dissipated.
Disclosure

Discussion places emphasis on discovery. However, it might also be useful to emphasise disclosure:
the consistent disclosure of institutional resources. Organisations are going to adopt a more managed
approach to information provision. Provision of structured metadata which can be picked up and used
by several services would confer advantages, both for providers and potential users. Much of this will
happen anyway, but there are some organised environments where the process can be shepherded. Some
examples:
There are two issues here: What is good practice? And what incentive is there for organisations to
disclose their resources in this way? The answer to the first involves some investigation but it seems to
me that we are almost there (see the section on embedding/harvesting metadata). The answer to the
second involves the construction of services which make use of this metadata, and again some pointers
occur elsewhere in this list. As with most things, the challenge is to find an organisational framework for
moving things forward.
UKOLN is doing some work in this area:
These two strands address some chicken-and-egg situations.
However, it would be useful to consider wider contexts for this type of activity and to explore which
agencies should be pursuing initiatives. For example, the ACN has commissioned a study of robot and
search engine technologies. If any UK search engine is launched, it should be ‘metadata aware’. We
will be making representations to this effect when it becomes clear whether any initiative is to be launched.
At a more general level, there is scope for work which looks at ways of cutting down the
load imposed by robots. Martin Hamilton and Dan La Liberte have proposed extensions to HTTP to
explicitly add support for downloading server indexes.
One can identify some of the components of a toolkit for creating and harvesting metadata in
distributed contexts. A project which brought something together and tested it in a natural ‘disclosure’
community would be useful.
Training and awareness

We are involved in several initiatives here. For such an important topic, more work needs to be
done. We hope to build a link with Netskills to promote some aspects of good practice. Much
could be done, and it would benefit from a wider framework of priorities and objectives (e.g. an initiative to
disclose the resources of UK universities).
Gap between manual and automated

I have described a variety of approaches which largely depend on manual processes (the creation of
the subject gateways, for example) and a variety which depend on automated processes. The former
can be quite costly; the latter quite crude. Once some infrastructure is in place, a variety of things
could be attempted to close the gap between the two. For example, robot-generated citation
exercises (following links to a certain depth and counting them) could generate sets of core resources
from a seed set of known resources. Current subject gateways could be enriched by supplementary
databases of indexes harvested from the resources that have been quality assured and described more
fully in the main database. There are many other ways in which one could imagine enhancing services
which would repay investment (the current subject services do not have significant technical
development work at their disposal).
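The citation exercise mentioned above might be sketched as follows: follow links outward from a seed set of known resources to a fixed depth, counting how often each resource is cited. The link graph and resource names here are invented for illustration; a real exercise would fetch and parse live pages.

```python
# A sketch of a robot-generated "citation" exercise: follow links from
# a seed set to a fixed depth and count citations. The link graph and
# names are invented for illustration.
from collections import Counter

links = {   # page -> pages it links to (hypothetical identifiers)
    "seed-a": ["x", "y"],
    "seed-b": ["y", "z"],
    "x": ["y"],
    "y": [],
    "z": ["y"],
}

def citation_counts(seeds, links, depth):
    """Count citations reachable from the seed set within `depth` hops."""
    counts = Counter()
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            for target in links.get(page, []):
                counts[target] += 1
                next_frontier.append(target)
        frontier = next_frontier
    return counts

counts = citation_counts(["seed-a", "seed-b"], links, depth=2)
print(counts.most_common(1))   # the most-cited page: a candidate core resource
```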
REFERENCES

(Bearman and Sochats 1995) Bearman, D. and Sochats, K. Metadata requirements for evidence. Draft, 10/95.
(Dempsey 1989) Dempsey, L. Bibliographic records: use of data elements in the book world. Centre for Bibliographic Management, Bath University Library, 1989. BNB Research Fund Report No. 40.
(Dempsey 1990) Dempsey, L. Bibliographic access: patterns and developments. In: Bibliographic access in Europe: first international conference. Lorcan Dempsey (ed.) Gower, 1990.
(Dempsey 1996a) Dempsey, L. Meta Detectors. Ariadne, No. 3, 1996.
(Dempsey 1996b) Dempsey, L. ROADS to Desire: some UK and other European metadata and resource discovery projects. D-Lib Magazine, July/August 1996.
(Dempsey 1996c) Dempsey, L. Kicking up the dust in a chaotic storehouse. The Times Higher Education Supplement, Multimedia, 11 October 1996, p. iv.
(Dempsey and Heery 1996) Dempsey, L. and Heery, R. Metadata: an overview of current resource description practice. (Peer review draft of deliverable for Work Package 3 of Telematics for Research project DESIRE). September 1996.
(Dempsey and Weibel 1996) Dempsey, L. and Weibel, S. The Warwick Metadata Workshop: a framework for the deployment of resource description. D-Lib Magazine, July/August 1996.
(Hardy 1996) Hardy, D. Resource Description Messages (RDM): Technical Specification: Draft 1.0b3.
(Heaney 1995) Heaney, M. Object-oriented cataloging. Information Technology and Libraries, September 1995, pp. 135-153.
(Heery 1996a) Heery, R. ROADS: Resource Organisation and Discovery in Subject-based Services. Ariadne, No. 3, 1996.
(Heery 1996b) Heery, R. Review of metadata formats. Program, Vol. 30, No. 4, October 1996, pp. 345-373.
(Heery 1996c) Heery, R. ROADS templates: how they are used. September 1996.
(Lynch et al 1995) Lynch, C., Michelson, A., Summerhill, C. and Preston, C. The nature of the NIDR challenge. Draft of April 10, 1995. Note: latest draft available from
(Orfali et al 1996) Orfali, R., Harkey, D. and Edwards, J. The essential distributed objects survival guide. John Wiley, 1996.
(Venturi 1996) Venturi, R. Iconography and electronics upon a generic architecture: a view from the drafting room. MIT Press, 1996.
Distributed at the UKOLN Advisory Committee meeting 19th November 1996. Not for further circulation. AC/96-1
UKOLN Information Strategy V 0.3, Draft for discussion, 11 November 1996.
Web page by Isobel Stark, 13th November 1996
Other technology R&D areas

Not much has been said here about URNs. Clearly, they will do much to resolve certain issues - e.g.
bringing together ‘copies’. There are several technologies on the table. However, again, much of the
burden will be organisational: creating frameworks and structures. There may be some merit in some
groups (e.g. subject gateways) experimenting with PURL servers or one of the URN proposals.
APPENDIX: Selective book trade elements
Element | Work (intellectual object) | Title (published object) | Copy (handled object)
---|---|---|---
Author/other responsibility | Heading/authority | As written | 
Title | Heading/authority | As written | 
Subject descriptor | Heading/authority | - | 
Abstract/description | Abstract | | 
Genre/category | Yes | | 
Evaluation | Review, etc | | 
Date of creation | Yes | Yes | Yes
Identifier | - | ISBN | Control no./unique identifier
Publisher/place | - | Yes | 
Date of publication | | Yes | 
Edition | | (Edition history) | 
Distributor | | | 
Terms and conditions | - | Price | Loan policy/situational
Intellectual property rights | Copyright | | 
Format | - | Yes | 
Physical characteristics | | Publication specific | 
Date of receipt | | | Yes
Transaction use | - | - | Circulation data
Provenance/history | - | - | Some contexts including historical bibliography
Location | - | - | Building/shelf number