A JISC/BRITISH LIBRARY Workshop as part of the Electronic Libraries Programme (eLib)
Organised by UKOLN
27th and 28th November 1995 at the University of Warwick
British Library R&D Report 6238
This report may be reproduced by photocopying for consultation by interested parties inside and outside higher education institutions.
© The British Library Board 1996
© Joint Information Systems Committee of the Higher Education Funding Bodies 1996
The opinions expressed in this report are those of the contributors and
not necessarily those of the British Library or JISC.
British Library R&D Reports are published by the British Library Research and Development Department and may be purchased as photocopies or microfiche from:
the British Thesis Service,
British Library Document Supply Centre,
This report of the workshop was prepared by The Marc Fresko Consultancy
Telephone 0181 645 0080 E-mail email@example.com
This workshop was sponsored by JISC and the British Library, and organised by UKOLN. JISC is sponsoring many projects related to electronic libraries. Some of these projects are already creating digital resources, such as teaching materials resource banks and a range of electronic journals. In the near future, the Arts and Humanities Data Service expects to let contracts which will create data resources; and a call for proposals for World Wide Web pre-print services has recently been announced. JISC is active on behalf of the university community in negotiating licences for a large range of datasets, including electronic journals. The British Library has many reasons to be interested: as a major research library, as the leading source of funding for library-related research and development, and as a legal deposit library. Its declared intention is to develop collections in electronic formats and to deliver an increasing number of services to its users electronically. Like JISC, the British Library Research and Development Department is supporting electronic library research in a number of institutions.
The workshop was held shortly after the CPA/RLG Task Force in the US issued a draft report on digital archiving. The contents of this draft acted as a touchstone for the workshop, shaping ideas and prompting discussion (particularly on the applicability of its recommendations in the UK). We are grateful to the Task Force for the work which went into producing this draft.
This level of activity serves to underline the importance of digital information, and the rate at which it is being generated. The effectiveness of our efforts on preservation, including our early steps in this workshop, will have a marked impact on the long term credibility of the use of digital media for academic purposes.
The workshop was organised in the UK, with highly valued contributions from the USA. The discussion centred on UK needs and structures, but we recognise that activity in this area increasingly takes place in an international context.
The aims of the workshop were to:
We believe that these aims were met. Attendees listened attentively to several original and well-informed presentations from experienced practitioners. These presentations are reproduced in the body of this report. Resulting from these presentations, and from the lively discussions which followed them, eighteen action points were identified (see "Action Points" below). These points will form a realistic framework for us to structure our early activities to promote understanding and knowledge of this important field.
However, the Action Points presented below are but a beginning. Much of the discussion thus far has inevitably been very generic, too generic in some cases for specific conclusions to be drawn. The quality of our debates will be improved by increasing specificity, by developing and using some stratification and models of the realm of preservation; and the value of "learning by doing" must equally be recognised.
Eighteen potential action points emerged during the workshop. All were discussed at some stage, and received a measure of agreement; however, it is, naturally, too soon for all the points to have unanimous or unqualified support. Accordingly, they are presented here in a "raw" format, before full development. The British Library and JISC expect to take these points as a basis for initiating further activity, once they have been appropriately refined.
This report is an account of the presentations at the Long Term Preservation of Electronic Materials Workshop held as part of the Electronic Libraries Programme (eLib). Most of the accounts were assembled by the author using notes taken during the workshop and copies of speakers' notes, slides and handouts; they are not formal papers submitted by the presenters. Two exceptions are noted in the body of the report.
I am grateful to all participants and contributors for their co-operation in supplying materials. In all cases, significant questions and answers discussed after the formal presentations have been integrated into the accounts of the presentations. All credit for the information in this report belongs to the speakers; all blame for errors or inaccuracies remains mine.
Associate University Librarian for Technical and Networked Information Services, Rutgers University Libraries
This section is an edited version of a paper supplied by Peter Graham. Preliminary forms of this material were presented at the ALCTS Institute: The Electronic Library (October, 1993 and 1994) and at a task force meeting of the Coalition for Networked Information (November, 1993). In a different form this paper was published as "Requirements for the digital research library" in College and Research Libraries (July, 1995).
This paper is about what it means to be a research library in the electronic age. It draws on the traditional definition of libraries, particularly research libraries. It goes on to describe several components of libraries in the electronic environment, covering on the way preservation challenges which are new to electronic resources. The emphasis throughout is on the need for custodians of a digital research library to provide for the preservation of information at every step, in contrast to the print environment where these issues may be put off for some time. The paper concludes by describing the commitments needed to effect such a library, some of which are wholly new.
INTRODUCTION: DIGITAL RESEARCH LIBRARIES
The topic of this paper is the long-term preservation of electronic information (note: not electronic materials). Both archives and research libraries have this as their concern. It should be noted in passing that our concern is not merely technological, but social. Paul de Man is said to have remarked "Technology burns history, leaving no material residue." Most of the following addresses the research library rôle in preserving our culture, but will have some relevance to the archival community as well.
What in fact is a digital research library? The answer merges the histories, capabilities and missions of research librarianship and of computing science to produce a new service meeting long term needs.
The mission of research libraries is to acquire information, organise it, make it available and preserve it. This has been their significant, distinctive and successful rôle with print and other artefactual materials for the past several hundred years. An implicit mission of computing science has been to make the benefits of computing technology of use to society at large. These missions, needs and capabilities must now come together to assure the continuity of scholarship. It will take conscious, planned efforts within both librarianship and computing to make this happen.
The primary requirement for a digital research library is that from the start it be committed to organising, storing and providing electronic information for periods of time longer than human lives. A library is not simply a network full of databases nor is it simply a building full of books. A digital research library is a collection of electronic information organised for the long term.
Many libraries of all kinds around the world now provide an increasing volume of scholarly information to their clients to meet current information needs. However, research libraries have only begun to take on the provision, organisation and preservation of information for the long term, that is, with the same long-term commitment they made for print materials.
THE NECESSITY FOR A SOLUTION
Until long term commitments are made, many currently proposed solutions will have only temporary effects. For example, discussion of cataloguing network resources will remain tentative, for until resources being catalogued have a permanent network presence (whether at fixed or virtual locations), the cataloguing that points to them must also have an ephemeral quality. Similarly the expensive products of recent valuable digitising demonstration projects will be at risk after only a few years if tools and commitments are not in place for the preservation of what has been achieved.
More important, the willingness of the scholarly community to give serious weight to electronic information depends upon scholarly trust in such information being dependably available, with its authenticity and integrity maintained. Such change is bound up with the future of electronic journals in the academic tenure process. The ability of the academy to count on long term, secure existence of electronic scholarly work will be an important determinant of the prestige and success of academic electronic publishing. Thus both libraries and universities have a stake in helping electronic publishing to succeed, and therefore have an interest in establishing secure digital research libraries.
Reader needs will continue to be what they long have been. Readers will want information to be reliably locatable, so that when they go there (whether personally or on the net) they can expect to find what they are looking for. Readers will want information easily accessible: the cataloguing must be clear and accurate, and the information must be promptly retrievable. In the electronic environment the needs for access tools will be more evident, and readers will expect appropriate and standard software to be readily available. Readers will expect information that was placed in the library's care a long time ago to be available; and they will expect the integrity of the information they get from the library to be assured.
Implementation of a digital research library will require two major specific components and three kinds of new commitments. The two major components are the electronic repository, and the access tools and policies. The three major commitments are organisational, fiscal and institutional. In the following, the two components are discussed at more length, yet as technical problems they are probably the easiest to solve; they will only cost money. The institutional commitments will be much more difficult to achieve.
In spite of the greater space it will take, the description of the components will be in a cursory form. Each could be developed in great detail, but at the moment the outline and overall programme are most important. Early implementations will test many of these assumptions and will add more requirements to the list. Work needs to begin.
COMPONENTS OF A DIGITAL RESEARCH LIBRARY
A digital research library will be manifest to its users as collections of information existing in various places and accessible through the use of widely available tools. A locus of information may be called the electronic storage repository.
Over time, we will learn how collection development plays out in an access environment as well as in an ownership environment. It is sometimes loosely proposed (seldom by librarians) that libraries need not acquire electronic information, for it will be available somewhere on the network. Such proposals ignore the obvious truth that some institution must still, in the end, take responsibility for information, and that this has always been a definition of the library responsibility.
There will be many electronic storage repositories, responding both to requirements of redundancy and to the individual needs of institutions. In contrast to print collections, it is unlikely that there will be a high degree of content duplication across large electronic repositories, since for most purposes existence in a single place allows world-wide access. Aside from their actual contents, however, repositories that are part of a digital research library will have many common characteristics. Some of these are described below.
Even an initial repository will comprise many gigabytes of information, growing quickly to millions of electronic documents. Disk storage is cheap and the possible resources are plentiful.
Sources and Potential Participants
It is easy to cite numbers of electronic scholarly resources that now exist. A few are noted here only as examples:
These are only examples. Nothing, of course, should be selected automatically; collection development policies should be adapted and followed. The continuing substantial costs of providing electronic information will require that electronic collection decisions be made even as carefully and parsimoniously as for print.
Backup/restore procedures must be in place. They must be automated and economical, for libraries are never likely to have expensive labour available in quantity. Backups must be multi-generational, using remote storage, with regular disaster simulations and tests.
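As an illustration of what "multi-generational" might mean in practice, the following is a minimal sketch of a grandfather-father-son retention rule. The policy itself (seven daily, four weekly and twelve monthly copies) and the function name are assumptions made for this example; the paper does not prescribe any particular rotation.

```python
from datetime import date, timedelta

def generations_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the backup dates retained under a simple
    grandfather-father-son rotation (an illustrative policy only)."""
    dates = sorted(set(backup_dates), reverse=True)   # newest first
    keep = set(dates[:daily])                         # most recent daily copies
    sundays = [d for d in dates if d.weekday() == 6]  # Monday is 0, Sunday is 6
    keep.update(sundays[:weekly])                     # most recent weekly copies
    firsts = [d for d in dates if d.day == 1]
    keep.update(firsts[:monthly])                     # most recent monthly copies
    return sorted(keep)

# Sixty consecutive daily backups ending on 1 March 1996.
history = [date(1996, 3, 1) - timedelta(days=n) for n in range(60)]
retained = generations_to_keep(history)
```

Everything not returned by the function could be recycled; the retained generations would still need to be split between local and remote storage, and exercised in the disaster simulations mentioned above.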
"Staging" refers to the prioritised use of different mechanical methods of storing data as it waits to be recalled. All data does not need to be immediately available on the fastest and most expensive storage media. Alternatives for providing immediate online access to the enormous potential volume of scholarly information need to be provided. What can be off-line, and how can it be retrieved?
Data Structure Standards
In a repository, does information simply exist as is (as first created) or is complementary information associated with it? Widely differing possibilities include SGML (Standard Generalised Markup Language) headers, ICPSR codebooks, picture captions, hypertext links and early software versions for use with data files. OCLC and NCSA jointly held a recent conference at which a number of core elements of metadata were defined (the "Dublin Core").
It will be important to establish standards for the number of repository locations necessary to assure long-term existence of electronic information and access to it. The issues concern longevity of information but also the dynamic interplay between costs of network bandwidth and response time, and costs of storage. Geographic location, nationalism and regionalism will likely play a rôle (at least intercontinentally, and probably intracontinentally). Major institutions may separately or consortially establish repositories. It is not yet clear how much redundancy of their components will be desirable among them. In addition, it seems likely that many library consortia will be formed on the basis of joint contracts with information vendors, also leading to information redundancy. One location won't do for a major electronic document or set; will two, or three? How many?
The electronic repository must be preserved. Preservation of information needs to be looked at from at least three points of view: medium preservation, technology preservation and intellectual preservation. The problem, and what is new about preservation in the electronic environment, is that electronic information must now be dealt with separately from its medium. This can be illustrated by an analogy, one which is very oversimplified, as readers will be aware: if a book is placed on a closet shelf, and the closet door is closed for 500 years, then at the end of that time one can, broadly speaking, open that door and read the book. With an electronic resource one does not have that confidence after ten years, and for several reasons.
The artefact or medium can decay. Medium preservation is the concern for preserving the medium on which information is stored, such as tapes, disks, optical disks, CD-ROMs and the like. Backup is appropriate, as is copying to other devices of the same kind, a technique which we know of as "refreshing"; we speak of refreshing a tape by copying its contents to another similar tape. In the current climate of protection of intellectual property rights, copyright concerns must be recognised (recent proposed USA legislation has many flaws, but it does recognise this need).
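Refreshing, as described here, is a copy followed by a check that the copy is faithful. The sketch below shows that idea for a single file, using only the Python standard library; the function name and the choice of a byte-by-byte comparison are assumptions made for illustration, not a procedure taken from the paper.

```python
import filecmp
import shutil

def refresh(source_path, target_path):
    """Copy a file to a new medium of the same kind and confirm that the
    copy is byte-for-byte identical before the old medium is retired."""
    shutil.copyfile(source_path, target_path)
    # shallow=False forces a comparison of contents, not just file metadata.
    if not filecmp.cmp(source_path, target_path, shallow=False):
        raise OSError("refreshed copy does not match the source")
    return target_path
```

Note that this only guards against errors introduced by the copy itself; it cannot tell whether the source had already decayed before it was refreshed.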
More problematic than medium decay are the rapid changes in the means of recording, in the storage formats and in the software that allows electronic information to be of use. We need to be aware of technology obsolescence as even more of a problem than medium decay, and undertake steps of technology preservation. Rather than simply refreshing, we also need to speak of migration: of migrating information forward through technology stages as they become available and as the old technologies cease being supported by vendors and the user community.
There remains a third preservation requirement, intellectual preservation, which addresses the integrity and authenticity of the information as originally recorded. Preservation of the media and of the software technologies will serve only part of the need if the information content has been corrupted from its original form, whether by accident or design. The need for intellectual preservation arises because the great asset of digital information is also its great liability: the ease with which an identical copy can be made, quickly and flawlessly, is paralleled by the ease with which an undetectable change may be made.
Here are some of the intellectual preservation questions that arise for a researcher using electronic information: How can I be sure that what I am viewing is what I want to see? How do I know that the document I have found is the same one that you read and made reference to in your footnote? How can I be sure that the document I now read has not been changed since the last time I read it? Note that in this instance backup is not the issue; rather, it is how we know which version we have or don't have.
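One way to make these questions concrete is a content fingerprint: a short digest computed from a document when it is deposited or cited, and recomputed later for comparison. The sketch below is an illustration only. It assumes a present-day hash function (SHA-256, from Python's standard `hashlib` module), which is not a technique named in the paper, and it assumes the recorded fingerprint is itself kept somewhere trustworthy.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Return a hexadecimal SHA-256 digest of the document's bytes."""
    return hashlib.sha256(content).hexdigest()

def is_unchanged(content: bytes, recorded_fingerprint: str) -> bool:
    """True if the document still matches the fingerprint recorded earlier."""
    return fingerprint(content) == recorded_fingerprint

cited = b"The text as it stood when the footnote was written."
recorded = fingerprint(cited)            # stored alongside the citation
altered = cited.replace(b"stood", b"stands")
```

Here `is_unchanged(cited, recorded)` holds while `is_unchanged(altered, recorded)` does not. A fingerprint shows that a change occurred, but not which of the kinds of change listed below was responsible.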
There are at least three kinds of possible changes:
1. Accidental change: for example, data loss during transfer, accidents during updating, saving the wrong version.
2. Intended change (well meaning):
a. New versions or drafts (authorial texts, legislative bills);
b. Structural changes: updating Books In Print or a telephone directory;
c. Interactive documents, e.g. hypertexts with note-taking capabilities.
3. Intended change (fraud): e.g. of one's own work to cover one's tracks or change evidence; or of another's work. Possible examples: political papers, laboratory notebooks, historical rewriting, legal documents, contracts.
Whatever technique is used must provide generality, flexibility, ease of use, privacy protection where desired, openness of documents where desired, low cost - and functionality over long periods of time on the human scale.
So far, this paper has dealt with the electronic repository component of the digital research library. The second major component comprises the Access Tools and Policies.
Usage and Retrieval Mechanisms.
The digital research library must of course support the full panoply of present access tools (for example online catalogues and OPACs, FTP, gopher, and certainly the World Wide Web and its multiple browser clients). The digital research library must also prepare to support the new access tools that are likely to appear regularly, in particular the implementations of Uniform Resource Names (URNs) and Uniform Resource Characteristics (URCs).
The granularity of documents needs to be addressed: how may one retrieve only part of a document when the full document may be of substantial size (for example the full text of Moby-Dick or of a legal code, or a presentation of many images from which one is desired)? Must documents be pre-coded (or pre-marked) to allow such granular access, or can access-time mechanisms be made available?
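To illustrate the "access-time" alternative, the sketch below locates one section of a plain text when it is requested, by scanning for heading lines, rather than relying on markup added in advance. The heading pattern and function name are assumptions made for this example; real documents would need a pattern suited to their own conventions, and pre-marked text (for example in SGML) would make the boundaries explicit instead.

```python
import re

def extract_section(text, number, heading_pattern=r"^CHAPTER (\d+)\b.*$"):
    """Return the text of one numbered section, found at access time by
    scanning for heading lines; return None if no such section exists."""
    headings = list(re.finditer(heading_pattern, text, flags=re.MULTILINE))
    for i, match in enumerate(headings):
        if int(match.group(1)) == number:
            # The section runs to the next heading, or to the end of the text.
            end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
            return text[match.start():end].strip()
    return None

sample = (
    "CHAPTER 1. Loomings\nCall me Ishmael.\n"
    "CHAPTER 2. The Carpet-Bag\nI stuffed a shirt or two.\n"
)
second = extract_section(sample, 2)
```

The trade-off is the one posed in the question above: scanning at access time needs no preparation but depends on regular headings, while pre-coding costs effort up front and gives dependable boundaries.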
We need to be able to provide documents which change rapidly, for example ANSI standards, monthly statistical reports and draft document versions. Therefore we will need to develop techniques for dynamic documents and consequent archiving and labelling, as well as flags indicating obsolescence or suppression (or conversely indicating status as an authorised version). A form of SGML may be appropriate in some cases, for example the format proposed by the TEI (Text Encoding Initiative). The URNs and URCs referred to above are likely to be part of the solution to this problem.
Providing access to voluminous information is an intellectual problem that historically has been solved in the print environment by abstracting and indexing services and by library cataloguing. We have developed extensive rules and procedures to ensure consistency and accuracy. These tools, adapted to suit new needs, will work for electronic information as well. They should be linked to the new retrieval mechanisms so that users can smoothly navigate from location of information to retrieval of it without having to shift their mode of use. Early mechanisms will probably link catalogue records to documents using tools such as the WWW, the Uniform Resource Locator, and the MARC 856 field. SGML may offer other possibilities for linking of certain documents through its document description techniques. In any case, there eventually will need to be consensus both on the display of physical electronic locations in bibliographic records and on representation of virtual locations.
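The link from a catalogue record to a document can be pictured as follows. This is a simplified sketch: the record is an ordinary Python dictionary rather than the MARC exchange format, the URL is a placeholder under the reserved example.org domain, and the function name is hypothetical. The only MARC detail relied upon is that field 856 carries an electronic location, with the address held in subfield u.

```python
# A simplified in-memory record: tag -> list of fields,
# each field a dict of subfield code -> value. Not the MARC wire format.
record = {
    "245": [{"a": "Moby-Dick"}],
    "856": [{"u": "http://repository.example.org/texts/moby-dick"}],
}

def electronic_locations(record):
    """Return every address found in subfield u of the record's 856 fields."""
    return [field["u"] for field in record.get("856", []) if "u" in field]

locations = electronic_locations(record)
```

A catalogue interface could present each returned address as a link, so that the reader moves from the bibliographic record to the document without changing tools.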
If the digital research library's catalogue system works well, users will be able to search for information, locate bibliographic records for desiderata, and use those records directly to draw the desired information to their workstation. Where an authentication technique is available, we must provide means for including and testing the certification. Standards for such cataloguing and remote access still need to be developed, particularly for providing catalogue access to non-owned materials.
A digital research library should from the outset be intended for access from multiple remote locations. Presumably the bibliographic utilities, such as OCLC and RLIN, will play their accustomed rôle. Internet-wide access should generally be possible to a digital research library. In initial pilot implementations it may be advisable for a few libraries to plan and develop a closed set of catalogue and access mechanisms to their own individual libraries' electronic collections. In the process they will create catalogue records that allow access to electronic information of their own. We will need procedures for dissemination of such catalogue records. It is not only a technical matter, but a policy matter, for such catalogue records will then provide access to local holdings for non-local readers. There are compensation, capacity and intellectual property considerations here.
Fees and freedom
In practice these are often linked issues. Standards and techniques will be necessary to solve a knot of interconnected problems surrounding access and ownership, including:
Much of what has been described so far is merely technical. The outlines of solutions are becoming clear even if the details remain to be worked out (and the non-trivial matters of cost have admittedly not been dealt with). More difficult will be the social compacts, that is the agreements on standards, on intellectual property and on access modes.
Most difficult of all to achieve, if electronic preservation and access are to be accomplished on any significant scale, will be the long term commitments to these goals by institutions. Nothing makes clearer that a library is an organisation, rather than a building or a collection, than the requirement for institutional commitment so that electronic information can have more than a fleeting existence. In this context, we should welcome the recent statement by the CPA/RLG Task Force on Archiving of Digital Information in their draft report that "the key that unlocks the path to the digital environment is not technological but organisational".
Three kinds of commitment will be necessary:
Commitment to organisational change
The organisation of libraries is already changing as electronic information increasingly becomes part of their charge. Most research libraries now have substantial systems departments. Some libraries locate the responsibility for electronic information distinctly from that for print. Other libraries see the forms as inseparable and include electronic responsibilities along with artefactual responsibilities in assignments for collection development, cataloguing and public service.
What is new will be the permanent assignment of staff responsibility for the long term maintenance of electronic information within a library. There is no obvious artefactual parallel for this responsibility: the various departments of circulation, stack maintenance, preservation and physical plant now share it for print. Nor are there present parallels in academic computing centres, where staffs typically focus on technological advance and availability, leaving data to the users. The electronic preservation responsibility will be focused as it will require technical expertise likely to be located in a single functional area.
It is by no means clear that this functional area will be what we used to call the library's systems department. As libraries move more into the electronic environment the traditional tripartite division of libraries into public services, technical services and collection development will continue but in more fluid arrangements. People who combine bibliographic understanding, problem-solving abilities and process orientation have often been found in technical services but also elsewhere in libraries. Such librarians will take on the demanding new technical, collection and service responsibilities for long term support of digital collections. It is also becoming clear that the traditional computing community is fertile with ideas, analysis and skills that will be important to electronic library goals, for example in the work of the Internet Engineering Task Force working groups on Uniform Resource Identifiers and in the work of Jerome Saltzer of MIT.
The permanent existence of a digital research library will require assured continuity in operational funding. Almost any other library activity can survive a funding hiatus of a year or more. Acquisitions, building maintenance, and preservation can be suspended, or an entire staff can be dispersed and a library shut down for several years, and the artefactual collections will more or less survive. But digital collections, like the online catalogue, require continual maintenance if they are to survive more than a very brief interruption of power, environmental control, backup, technological advance and related technical care.
Our online catalogues are cheap compared to what the digital research libraries will cost. The hardware and software maintenance costs for online catalogues have reached a rough steady state, and the capital costs for new OPACs are decreasing relative to the capabilities provided. The catalogue size will continue to increase, but catalogue records are small relative to the information to which they refer. Digital research libraries, however, as a proportion of the library's supply of information, will grow for the foreseeable future, and the quantity of information requiring care will become considerable (and much larger than the catalogue). Unit costs of storage are likely to continue falling for some time, which will make the financial burden manageable. And, in the US at least, staffing costs are not expected to increase, simply because overall staff growth in most libraries is likely to be consciously restrained for the foreseeable future; reassignments, however, are likely.
Long term funding will be required to assure long term care. Libraries and their parent institutions will need to develop new fiscal tools and use familiar fiscal tools for new purposes. Public institutions, usually constrained to annual funding, will have particular difficulties; existing procedures for capital or plant funding may provide precedents. One familiar technique is the endowment. It has been difficult to obtain private funding for endowments and services rather than books and mortar, but it is possible (it will not be hard in the current environment to create "virtual bronze plaques" that can be guaranteed to be around for some years). Institutions might also build endowments out of operating funds over periods of time. In the UK at least there appears to be some evidence that the government may actually want to listen to needs. In the US there is almost none.
Some revenue streams associated with digital research libraries may be practical. Consortial arrangements may allow for lease or purchase of shares in a digital research library. Shorter term access might be provided to other institutions on a usage basis. Access could be sold to certain classes of users, for example businesses, non-local clienteles, or specific information projects. New relations with publishers, presently difficult to perceive through the mists rising from intellectual property, might result in fee income for storage of electronically published materials for the copyright lifetime, during which publishers collect usage fees. With commitment and imagination in the long term, fiscal tools will be found.
All these are instrumental means of accomplishing the greatest requirement, that of conscious, planned institutional commitment to preserve that part of human culture which will flower in electronic form. The advent of electronic information will by no means create museums of the book out of libraries, as so often alleged. Instead, it will force the realisation of the distinct rôle of research libraries in preserving information rather than simply artefacts. Our traditional museums have always preserved artefacts (often beautiful) that embody information, while libraries have preserved information that has been embedded in artefacts which are only occasionally of aesthetic interest in themselves. The advent of electronic information will accentuate the difference between these rôles, as libraries and their parent institutions take on the responsibility for preservation of information in non-artefactual form.
For the past century most research libraries have been associated with universities, and this connection seems likely to continue in the immediate future. But whatever the institutional parent of a major library, an institution wishing to benefit from electronic information will have to make a constitutional commitment to providing resources. Michael Buckland, of the University of California at Berkeley, has distinguished between a library's rôle and its mission. Buckland suggests that although the rôle of a library is to facilitate access to information, its mission is to support the mission of its parent institution. The implication is that if a university or a state wishes to continue relying on support from its library, it will have to make commitments to support the library's rôle. In the electronic environment, this means new and long-standing financial commitments which the library and its governors together must identify and establish.
The commitment will have to be clearly and publicly made if scholars and other libraries are to have confidence that a given digital research library is indeed likely to exist for the long term. Guidelines or standards will be desirable that define what is meant by a long term commitment. Such standards will define which electronic repositories of data can qualify to be termed part of a digital research library. Just as donors of books, manuscripts and archives now look for demonstration of long term care and commitment, so too will scholars and publishers as they create electronic information and require for it a home.
Establishing a digital research library continues the research library rôle. For a major library or a university to do so should be considered as natural as acquiring the next book or cataloguing the next journal. Not to do so will be an abdication of that responsibility. The skills and understandings of both the library and computing communities will be essential in carrying out this goal of preserving the human record in the electronic environment.
The tasks call not so much on new knowledge or new technologies as upon our collective informed commitment; that is, upon will. It is clear that the new knowledge being created in electronic form will not survive by itself if nothing is done. Specific action is necessary. It is extremely gratifying for an American to see a conference such as the present event convened, apparently with some level of support from the academic community and the government, to plan for long-term preservation of electronic information. As so often in human history, it is concerted human action which answers the pessimists; technology need not burn history, for we as librarians shall see to it that there persists an informing residue.
Following is a short list of references which may be useful.
PROF. DENISE LIEVESLEY
Director, The Data Archive, University of Essex
This paper presents a number of factors and changes which are shaping data archives, and the challenges facing them. It goes on to discuss The Data Archive as an instance of an established archive, and to explore important issues of policy and strategy.
THE IMPORTANCE OF DATA ARCHIVES
Data archives are playing, and will continue to play, a growing rôle in our electronic and cultural institutions. Their importance is increasing in parallel with the explosive growth in the volume of electronic data being created. At the same time, the cost of collecting data is increasing, and perceptions are changing: the value of using and re-using data is becoming accepted, as is the importance of making research "transparent". Combining this with the specialist skills and equipment needed to preserve data makes a strong case for the importance of data archives.
The very nature of data makes it unique; it is inexhaustible, non-renewable and non-substitutable. That is to say, data cannot be exhausted no matter how much it is used (by contrast with information held in almost any non-digital form); it frequently can only be captured at a specific time, and once that chance has passed the potential data has vanished forever; and nothing other than data can serve the function of data.
A TIME OF GREAT CHANGE
In common with other functions in the academic sector, archiving is seeing many changes:
Usage of the Internet continues to rise, bringing issues (among others) of distributed computing, requirements for a wider range of software, and new media.
The university sector is growing, and is increasingly obtaining external funding for academic research. The population of users is becoming more diverse, and a range of "data brokers" is appearing in line with users' increasing demands for data. Correspondingly, data providers are increasingly interested in giving access to their own data.
New generations of researchers have greater expectations for access to digital data. The data itself increasingly is being gathered in the private sector, as data collection agencies are privatised. Privatisation and the trend towards "open government" both bring the need for fair competition.
Simultaneously, users, data and archives increasingly have perspectives which are not limited to national horizons. This has complicated the environment for data sharing, since legislation on access, data protection, confidentiality and copyright varies from state to state; European legislation is also becoming relevant. The internationalisation of data has additionally created a new culture among some data users: for example, it has fostered the ethos of data being a public good.
The challenge facing data archives is how to anticipate the new whilst continuing to provide a service. This challenge is giving rise to a great deal of debate, centring on questions such as:
Various new services are being considered in response to the challenge; these in turn raise issues, including:
THE DATA ARCHIVE AT THE UNIVERSITY OF ESSEX
An instance of a significant archive is the Data Archive at the University of Essex. Funded by the ESRC, JISC and the University of Essex, it exists in order to promote wider and more informed use of data in teaching and research, and to preserve the data so that it remains accessible over time. Its functions are:
THE COMMUNITIES SERVED
Archives serve a number of communities. These can be divided broadly into two groups: data producers (including data owners and funders as well as those who actually produce and deposit the data) and data users (eg teachers and researchers).
It is vital to obtain and retain the support of data producers. In summary, we can do this by offering numerous benefits to producers, namely:
Note that the Archive (and the ESRC) will normally support effective information services rather than dataset acquisition. It will only support acquisition if there is a high demand, a cost-effective mechanism for maintenance and support, and perhaps a key leverage rôle for its support.
The benefits to users are more clear-cut. They are:
This range of benefits should raise a question in our minds: what are users' priorities? Indeed, do we know enough about these priorities? What are their requirements for acquisition, media and formats, speed of service, timeliness, quality, and charges?
A selection policy is essential. It is not practical to accept all potential data without some selection, as there is simply too much data available. The policy must be based on users' likely needs and demands, and also:
Of course, an integral part of an acquisitions policy is a list of rejection criteria. These should include:
Quality of data is difficult to pin down. Quality can only be measured in terms of fitness for purpose, and different purposes have different needs. Equally, the method of informing users about quality needs to be determined. A related issue is achieving a balance between measuring or improving data, and getting data out faster; experience shows that different European archives are reaching different conclusions.
Finally, we have to remember to consider the legal liability we may have for distributing defective data. At present, the position is not entirely clear.
THE IMPORTANCE OF DOCUMENTATION
Documentation is best produced by depositors, but they need persuasion, help and guidance to encourage them to produce and deposit it. The format(s) in which documentation is kept need to be considered; appropriate format and media are important for ease of use and of delivery.
The preservation of documentation is also critical: data without its documentation is greatly reduced in worth, if not entirely worthless. Documentation therefore becomes a resource in its own right.
Documentation does not stop at objective description of the data; it can also include contextual data. For example, the value of a data set on a family income survey can be greatly enhanced by the availability of documentation on the income tax structure in force at the time of the survey. Deposit of contextual documentation should be encouraged, though such encouragement may not be easy.
The Aims of Preservation
The preservation must include the following:
The key challenges facing data preservation are mainly to do with growth and change. They are:
A preservation strategy must take these factors and challenges into account. The strategy needs to incorporate a data management strategy, which may include the relationship between data preservation and distribution (eg the data being preserved may not be the same as that being distributed). The strategy must also recognise the wide range of access rates which different data sets might experience.
Factors other than data and preservation must be taken into account. The opportunities for staff training and development are key to the success of an organisation, for example. And flexibility is definitely required in an archive; "Change is usually stressful and often unwelcome - but less so if provision for it has been built into the culture and ethos of the entire organisation" (Osborne and Gaebler, "Re-inventing Government"). Finally, nurturing a sense of "service orientation" is critical.
Partnerships are essential for the success of data archives. These partnerships can take a number of forms; they can, for example, allow the sharing of expertise through existing archives; involve the use of third party distributors (eg the Manchester centre); exploit the expert knowledge of data sources and needs; and result in the building of new facilities (eg R-Cade).
Director, Arts & Humanities Data Service Executive
This paper is founded on the recognition that structures and policies of the Arts & Humanities Data Service will gradually develop. It sets out the structure which is envisaged for the AHDS, and some of the key functions which it will perform. Finally, a possible stratified collection policy is outlined.
The Arts and Humanities Data Service is a new national service funded by the Joint Information Systems Committee (JISC) of the Higher Education Funding Councils. The AHDS's mission is to co-ordinate access to, and facilitate the creation and use of, electronic resources in the arts and humanities by offering a range of services to higher education institutions and their members. This paper is presented by the Executive-designate of the AHDS. The appointment and the establishment of the AHDS are both recent, and it follows that the ideas below are at a relatively early stage in their development.
STRUCTURE OF THE AHDS
Over the next three years, the Executive expects to develop a structure of three organisational "branches". Each branch will include one distinct type of service provider. The three types of service provider will be:
The current level of funding permits only the establishment of data supply services. User support and network service providers will be established only with additional funding (not necessarily from the JISC) or through association with existing services (not necessarily funded by the JISC). In either case, they will be established only after feasibility studies have been conducted in these two relatively under-evaluated areas; the Executive is currently commissioning these studies for completion by October 1996.
These services will focus on particular types of data irrespective of their origins and use within the academic community. They will focus on collecting, describing/documenting, cataloguing, and preserving electronic information in their respective domains. They will promote relevant standards for data creation, description, and preservation, and will develop guides to good practice for would-be data creators. They will negotiate access to similar data whether commercially produced or stored at other public or semi-public sites.
Five services are envisaged. They will focus on:
Data Integration and Catalogue Access
At least one service provider will be established to implement on a system-wide basis network access mechanisms to AHDS data and metadata (catalogues, training materials, etc.) wherever these are stored. It will integrate the facilities and data stored by disparate services (of the data supply and user support types) in a distributed system which will provide a genuinely seamless user environment.
Other Organisational Structures
Essential collaboration between service providers of each type will be assured through a Service Providers' Forum in which managers from each service provider will participate.
RATIONALE FOR THE ORGANISATIONAL MODEL
The model seeks efficiencies in two ways.
It concentrates expertise which is necessary to collect, catalogue, document, and preserve different types of electronic information (viz. databases, texts, image, time-based, and GIS/site-mapping data). An alternative model which would loosely tie data supply services to academic disciplines would introduce inefficiencies. For example, three data services (one supplying historians, one supplying archaeologists and art historians, and one supplying literary and linguistic scholars) would all have to develop the expertise necessary to create, catalogue, preserve, and describe database materials which emerge in substantial numbers from each of the disciplines. Concentrating that methodological expertise in one supplier makes economic sense.
The model also conforms to the logic of humanities research and teaching. For example, historians as end users will not be interested in databases exclusively. They will require access to material supplied by the textual, image, time-based, and site-mapping data services. Philosophers, art historians, literary scholars, etc. will have similarly wide ranging data needs. According to this model, subject-based user support services will provide their respective communities with relevant training and other materials for data known to the AHDS irrespective of how, or by which service supplier, they are held. The end user will therefore have greatest contact with his/her user support service which will travel freely across the range of different data suppliers. The end user need not know that there are several underlying data suppliers. An alternative model would staff each data-supplying service with personnel sufficient to support the same whole range of humanities disciplines. This model, however, would introduce substantial overlaps and inefficiencies.
COLLECTION POLICY FOR ELECTRONIC TEXTS
A policy is essential. It will emerge progressively and collaboratively as the AHDS develops its ideas together with both providers and users. It is to be hoped that the policy will evolve towards a stratified and prioritised approach. One possible policy approach for texts can be summarised as building a series of collections, defined roughly and tentatively as:
Of course, there may be other holdings in addition to these collections. The key to success will be to recognise the value of different digital resources and to prioritise them by means of extensive negotiations.
This paper introduces the current situation regarding legal deposit, and explores The British Library's proposals for extensions to encompass electronic and other non-print publications. The scope of proposed legislation, access mechanisms and possible repositories are discussed.
The requirement in the UK for legal deposit stems from a 1911 Act of Parliament. This Act specifies that The British Library, along with the other specified deposit libraries, is entitled to copies of publications. The Act applies only to printed publications; though its intent, as generally understood, is to ensure the preservation and availability of publications generally.
Of course, publications are now being produced on many media other than paper. The limitation to paper will, if continued, mean that the proportion of publications being deposited will decrease; and potentially some particularly important reference works might not be deposited where they are published solely on (say) CD-ROM. This clearly would cause a problem for those interested in preserving our intellectual heritage.
The British Library's approach to this problem is to press for new legislation which requires deposit for publications on non-print media. The legislation should cover both current media and media yet to be devised.
The British Library is recommending that new primary legislation should not be limited to any medium or media. Rather, the Statute should provide for subsidiary legislation to allow for the coverage of future new media.
A proposal is currently being drafted, with the intention of submitting it for consideration by the current Parliament.
WHAT IS A "PUBLICATION"?
With the huge variety of electronic formats, media and business models, the meaning of the term "published" is no longer as obvious as with printed media. Therefore we need to agree on a definition for this term in the context of a requirement for legal deposit.
The British Library suggests as a working definition that any hand-held item which is offered for sale should be considered as "published". Further, sale is not a necessary element of publication, so some free items would also be considered as "published". Obviously there is scope for some grey areas with this kind of definition, and so The Library proposes the establishment of an independent public body to decide whether specific borderline items should or should not be subject to legal deposit. This body would, we assume, be constituted of unpaid citizens, some of whom might be drawn from professional institutions, The Royal Society, The British Academy, etc.
The proposed legislation recognises that no single institution need cope with all possible deposit media. However, it seems reasonable to think that the existing deposit libraries may form a basis for the new deposit scheme. Each deposit library is currently formulating for the Secretary of State its views on how responsibilities might be divided.
Legal deposit repositories will have to be extremely selective in their choice of electronic publications to accept for deposit if the scale of the deposit operation is to remain manageable. We have identified three classes of non-paper publications:
The technological problems associated with the first two can be regarded as largely soluble. However, the technological and economic context associated with the third is in too great a state of flux for legislation to be appropriate at this stage.
As we all recognise, electronic media bring with them considerable intellectual property issues. For example, one deposited copy of a valuable electronic publication might be made available to a wide population by network; this obviously would be a concern to its publisher.
There are also financial considerations. Whereas the marginal cost of depositing one copy of a printed publication is generally negligible, the costs associated with depositing an electronic item may be substantial. There may be costs resulting from the need to provide documentation and from the removal of copy protection, for example. Consequently a repository can be expected to seek a trade-off between immediate access and long term availability.
Currently, our discussions centre on the levels of public access to deposited items. Given the extreme positions of (1) consultation at one terminal only and (2) unlimited and free remote access, The Library's working party has chosen a compromise. The current proposals are considering:
We recognise the difficulties of preserving digital documents, and we recognise also that we have not yet determined how best to achieve preservation. There are two fundamentally different approaches, namely (1) continual migration to new media and formats and (2) provision of original support environments by hardware and software emulation.
One particularly complex area is that of online databases. As examples of the complexity, we can point out that online databases exist in a multitude of formats; they are constantly and rapidly being changed; and they are made available in a number of forms (which may or may not correspond to our understanding of "publishing" as described above). This makes them too challenging to be considered for legal deposit today, though one day all should be taken within its scope. It is already too late to preserve some publications, yet it is too early for legal deposit to be practical. We shall have to wait until the economics of this form of publishing are less turbulent and better understood.
Associate Professor, School of Information and Library Studies, University of Michigan
This paper summarises the findings of a Task Force in the USA which has recently produced a draft report on the Archiving of Digital Information. The need for a national infrastructure of digital archives is argued. Critical issues (operating environment, migration strategies, intellectual property and finances) are examined. The paper ends with a summary of the draft report's recommendations.
We tend to dwell on the problems of digital preservation; we can easily overlook some of the unique benefits that digital storage can bring. By way of introduction, it is interesting to reflect on some of these benefits, lest we take them too much for granted. As an example, the author's notebook computer stores ten years' worth of work: between 600 and 700 large documents. The entire collection was recently migrated from another computer in about one hour. It is routinely backed up in a quarter of that time. These are performance levels which we simply cannot hope to emulate with paper or non-digital formats. Benefits such as this are inherent in digital formats; they will allow us to perform some functions more easily than before, and others which we could not previously perform at all.
DRAFT REPORT OF THE TASK FORCE ON ARCHIVING OF DIGITAL INFORMATION
The author was a member of the task force established by The Commission on Preservation and Access (CPA) and The Research Libraries Group (RLG). The task force issued a draft report on 25th August 1995 for comment. This draft will be used as a basis for this presentation.
Information on obtaining and commenting on the draft is given at the end of the paper.
THE TASK FORCE
The task force was charged with the following duties:
THREE KEY CONCEPTS
Three key concepts developed by the task force are used throughout the report. These are defined below.
"Repositories of digital information that are responsible for storing and ensuring, through the exercise of various migration strategies, the long term accessibility of the nation's social, economic, cultural and intellectual heritage instantiated in digital form."
Note that this definition distinguishes digital archives from libraries. Whereas libraries have access as a main objective, archives' priorities centre around storage and preservation.
"A set of organised tasks designed to achieve the periodic transfer of digital material from one hardware/software configuration to another, or from one generation of computer technology to a subsequent generation."
The Task Force adopted this definition instead of the concept of "refreshing" which had been used in its original terms of reference, because refreshing was felt to be insufficient in scope.
"Retaining the ability to display, retrieve, manipulate and use the digital information in the face of constantly changing technology."
THE NEED FOR A DEEP INFRASTRUCTURE
The Task Force concluded that a national infrastructure is called for. This should include a number of recognised repositories. Recognition would be achieved through certification by an independent authority. A fail-safe mechanism would be needed, for example to "rescue" data if an archive closes.
Other mechanisms will be needed, for example to direct data producers who find themselves unable to maintain particular data sets, and for archives to proactively seek out data sets in danger of being "orphaned".
The following critical issues have been identified and are examined below:
The operating environment will be conditioned by the diversity of attributes which describe it. Archives will have to contend with:
The Task Force takes the view that owners, creators and copyright holders have the initial responsibility for archiving their data sets. This is not to say that they will look after every object which they should care for; they represent only a first line of defence. Pressure should be applied in this area, particularly on publishers.
Certified archives will have a right and the responsibility to exercise aggressive rescues of endangered data sets.
Migration Strategies for Digital Information
Migration strategies will have to be developed. The nature of these strategies will depend on the relevant application environments, on the formats involved, and on the degrees of functionality sought.
Migration will take into account the need to change media, change formats, and in some cases incorporate standards. It is quite possible that this will be achieved by specialist processing centres or bureaux, possibly consortium-owned.
One action which would greatly ease the task of migration would be the incorporation of migration paths into new software.
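The notion of a migration path can be pictured with a minimal sketch. The Python below is illustrative only and is not drawn from the Task Force report: the format names, the `MIGRATION_STEPS` table and the `find_migration_path` function are all hypothetical. It assumes that an archive records which one-step conversions it can perform, and it searches that record for the shortest chain of conversions leading from an obsolete format to a current one.

```python
from collections import deque

# Hypothetical record of the one-step conversions an archive can perform.
# Keys are source formats; values list the formats reachable in one step.
MIGRATION_STEPS = {
    "wordproc-v1": ["wordproc-v2"],
    "wordproc-v2": ["wordproc-v3", "plain-text"],
    "wordproc-v3": ["sgml"],
    "plain-text": [],
    "sgml": [],
}


def find_migration_path(source, target, steps=MIGRATION_STEPS):
    """Return the shortest chain of formats from source to target.

    Uses a breadth-first search over the declared conversions and
    returns None when no chain of conversions reaches the target.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == target:
            return path
        for next_format in steps.get(current, []):
            if next_format not in seen:
                seen.add(next_format)
                queue.append(path + [next_format])
    return None
```

Calling `find_migration_path("wordproc-v1", "sgml")` walks through each intermediate version in turn, while a request with no declared route returns `None`, which is one way an archive might flag material at risk of being stranded. A real strategy would also have to weigh the loss of functionality at each step, which this sketch ignores.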
The relevant legislation is the US Copyright Act Section 108, which protects intellectual property while allowing libraries to make copies of protected material for preservation purposes. However, this legislation did not anticipate the need to copy digital documents for the same purpose; consequently legislative changes will be called for before data can be migrated with impunity.
Similarly, other preservation-related activities may require the permission of copyright holders, under present rules. These activities are:
Clearly, this would be an onerous responsibility. Requiring owner authorisation for each preservation action could undermine the effectiveness of an archives network.
The Task Force therefore proposes that digital archives would not be required to seek authorisation to create a copy or to store, migrate and manage that copy. Intellectual property owners would retain control over the making of new copies in other circumstances, however.
We also propose that any work which is not protected intellectual property can be accessed, used and disseminated according to the terms of the archive; but for any work which is thus protected, the actions would require agreement of the rights holder.
Costs and Financing
The questions of cost, and relative cost, are complex. A large number of interrelated factors contribute to the cost of digital archiving, namely the costs of:
Cost models therefore must include consideration of the functions which are included. They must also allow for predictions of change in technologies and costs over long time periods. Trends caused by the archives themselves will also affect the costs: for example, once operations are routine and predictable, we can anticipate that unit costs would greatly decrease; and there may be some economies of scale.
The costs of an archive must of course be matched by its revenues. A model will need to allow for potential income from tax and accounting incentives, user fees and subscriptions.
RECOMMENDATIONS TO CPA AND RLG
The Task Force is currently reviewing its draft recommendations, with a view to strengthening some in response to comments. At this point, the recommendations include the following:
The next steps will be to reconvene the Task Force, early in 1996. The final report will then be issued, and we look forward to the recommendations being implemented, hopefully in 1996.
Copies of the draft report are available from:
FTP server: ftp://lyra.stanford.edu/
Task Force discussion list: comments can be addressed to ARCHTF-L, the open listserv sponsored by the Task Force. To subscribe to the listserv send the message subscribe archtf-l to firstname.lastname@example.org
Director of Collections and Preservation at the British Library
This section is an edited version of a paper supplied by Mirjam Foot
This paper poses two basic questions. First: what is a preservation policy for digital material? Second: does it differ from a preservation policy for "conventional" library and archive material and, if so, in what way? It examines many detailed issues related to these two fundamental questions, including the influences of other factors (collection purpose, format, medium etc) on preservation policy.
PRESERVATION
Definitions
In the UK, preservation as it applies to conventional library materials is usually defined as "all managerial and financial considerations including storage and accommodation provision, staffing levels, policies, techniques and methods involved in preserving library and archive materials and the information contained therein" (NPO Glossary). The Oxford English Dictionary puts it more succinctly as the art of "keeping safe", "keeping alive", "maintaining" and "retaining". In a digital context, we have to look anew at this definition, as an extra dimension has to be taken into account.
According to the OED, policy is "a course or general plan of action". In other words, a preservation policy at its most basic is a plan of action for safe keeping. Such a plan of action should address the questions of what needs to be preserved, why, for what purpose, and for how long.
In order to address these questions we shall have to look at the function and purpose of the collections themselves, and of those of the institutions in which they are kept. For example:
It is not necessary to spell out for this audience that although the answers to the questions of what, why and for how long differ with the aims and purpose of the institution or the collection in question, they are also influenced by the nature of the material itself. If we consider a broadly-based international collection, comprising original sources and secondary material, basic research material and ephemera, we will encounter a wide variety of formats. These can include:
Different formats and different media demand different technical solutions as well as different storage conditions, but the aim and the purpose of a library itself and its functions determine its preservation policy which should cover all formats and all media. This policy then steers the preservation programme, which sets out the order in which collections or items will be preserved and the method by which this should be done.
Relationship Between Purpose and Preservation Needs
If we talk about the purpose of a collection as a determining element in its preservation needs, we need to look at other library and archive functions that are closely linked to preservation, such as acquisition, retention and access. How strong these links are and what their relative importance is depends on the purpose of the library or archive in question, as well as on the nature of the material.
Although the aims and purposes of the various kinds of libraries and archives vary enormously, they all have some basic objectives in common. All libraries and archives acquire material (or have at one stage in their existence done so), mostly with the aim of making it available at some time or other; and all want to retain some of it for a shorter or longer period of time, some in perpetuity. If we assume that all research libraries want to make their collections available for use now or in the future, they will have to ensure that those collections can be used and are in a fit state to be used. This "fit state" applies both to the information contained in these collections and, in many cases and certainly for conventional material, to their actual format, to their physical entity.
When talking about digital materials, there is an extra dimension that needs to be preserved, and that is the dimension of access. For conventional materials the human body provides its own access mechanism. Moreover this is renewed with each new generation. For digital material this is not the case. Eyes alone are not much use when faced with any of the formats or media in which digital data is presented. Unless we have a separate usable and maintainable access mechanism, we simply cannot use the acquired data.
The question of what to preserve is answered in part by the reason why an item is acquired. If it is acquired in order to serve a community of undergraduates for one or at the most two years, there may be reasons for acquiring multiple copies, but little reason for preserving the copies once they have served their very limited purpose. If on the other hand an item is acquired for permanent addition to and retention in a collection, its preservation becomes as important as its acquisition. Short term use may still call for a short term conservation fix; it does not call for a controlled long term preservation policy.
If we consider a national deposit library as a library of "last resort" for publications which otherwise may disappear, and as a place where the entire "published archive" of a nation is kept together and is recorded, the principles of selection and acquisition of material are the same whether we talk about conventional or electronic formats.
The way in which these formats are selected and acquired will vary. As a matter of principle, all publications, whether conventional or electronic, should at least be considered for acquisition in a deposit collection. In practice selectivity is forced upon us by constraints in resources, storage space, handling capacity and funding. Selectivity may also be influenced by technical capacity. It has been said that the selection of electronic publications should be limited to those that can be acquired, handled and stored locally by the library. However, in a digital environment one could equally well argue that giving access to publications that reside elsewhere also fulfils one of the major purposes of any library, namely to make information available to its users (although it is not a deposit function).
Dynamic documents (such as frequently-updated online databases) pose an acquisition problem that we do not face with conventional texts. Although one may argue for selective acquisition that is frequent enough to preserve all information contained in such a publication during its lifetime, prohibitive costs may well compel a much greater selectivity aimed at only acquiring representative samples (however difficult it may be to decide what is representative).
Format and Medium
The format in which the information is presented should not influence its selection or non-selection, as a format that cannot be easily handled may be converted to one that the library or archive can handle. This may be problematic, but it should be attempted; time and effort should be spent to achieve it. Nor should the medium be regarded as a selection criterion. Here again, the information content may be transferred to another medium that can be accommodated. Selection criteria relating to the intrinsic value or importance of the material to be acquired will be the same for conventional and for electronic material. In libraries where maximum access to the most up-to-date material is the prime objective, however, selection criteria may well be guided by medium or format.
RETENTION VS. PRESERVATION: CONVENTIONAL MATERIALS
The question of retention is inextricably linked with preservation. It is technically possible, at least for conventional material, to preserve an item virtually for ever (provided it has not been neglected beyond rescue to begin with). The decision of whether or not an item will be retained needs to be made, as well as the decision of whether an item needs to be retained in its original format or in surrogate form. In many cases, the format is as important as, sometimes more important than, the information it contains. Format alone can provide information over and above its contents and there are library and archive users who have a real need to consult the material in its original format. For many users a surrogate will suffice and can at times be preferred. The decision whether to retain the original once a surrogate has been made is not clear cut.
RETENTION VS. PRESERVATION: ELECTRONIC MATERIALS
While for conventional material we can still make the distinction between retention and preservation, for electronic material such a distinction no longer applies. The main reason is the lack of longevity of the storage media for electronic information, coupled with the imminent obsolescence of their retrieval hardware and software. Simply "leaving things as they are" is not an option for digital collections. The choice whether to retain the document as an artefact, or to retain the information it contains, or both, is less of a real choice with electronic material. If we try to keep electronic publications as artefacts (i.e. exactly as received from the publisher) they will eventually become inaccessible and their contents will be lost. On the other hand, if we attempt to retain the content, many aspects of the visual presentation and perhaps even of the "functionality" of the electronic document will be lost. We may also lose what Peter Graham has called the "integrity and authenticity of the information as originally recorded". Experience so far seems to indicate that in the long run the intellectual content of an electronic publication is all we can retain and we shall have to accept (at least for the time being) that certain interactive, dynamic and presentational aspects of the original cannot be retained.
In parallel with conventional publications, the off-line digital publication as a physical object is itself an expression of a part of our culture. It could therefore be argued that we must try to retain at least a representative sample of such physical objects and of their retrieval mechanism, in the knowledge that once the latter have broken down or can no longer be replaced, we will end up not as a functioning library or archive but as a museum of dead digital dodos.
RELATIONSHIP BETWEEN ACCESS AND PRESERVATION
The need for access has already been mentioned several times. Many libraries and archives take the amount of use that is made of their collections as an indication of their preservation needs. One can argue that the nature and purpose of the use, rather than the amount of use an item may get or is expected to get, is of paramount importance when making retention and preservation decisions. To give low use as a reason for neglect or non-preservation is dangerous. Some material may not be in immediate demand nor in frequent demand, but it may be needed by someone at some stage to increase knowledge or improve understanding. If we believe this, then the model proposed for digital preservation by Donald Waters as the "just-in-time" model (versus the just-in-case model of conventional preservation) is one that should be used only in awareness of its limitations. The increasing tendency in some parts of the library world away from collections in favour of access reduces the chances of our long term ability to fulfil the research needs of future generations.
Nevertheless, the question of why an item or a collection should be preserved is closely linked to considerations of use and considerations of access. Only if we want to create a time capsule is there any point in preserving material to which access is withheld and even then, a time capsule is only of value if people know what it contains or if it is opened one day.
Most libraries and archives have a rôle that is wider than that of guardian of the cultural heritage. They have the duty to make their collections available to those who need to use them, now and in many cases also in the future. Providing access to the collections while preserving them for future use can, at least for conventional material, be seen as two conflicting aims. There are indeed kinds of access that defeat or prevent future use, in the same way as there are preservation methods that inhibit instant access. Nevertheless, such conflicts can be resolved and if the need for, and the purpose of, access are considered carefully, the dilemma between access and preservation is not quite so acute. Per contra, for digital material we can argue that access can assist preservation.
Unlike conventional items, electronic items do not deteriorate through use, but if they are not used for a long period, they may prove not to work any longer (because of deterioration of the mechanism and/or technical obsolescence). While not in itself sufficient, a high level of systematic access helps to check the usability of electronic publications.
The kind of use, the kind of access that is needed, influences preservation decisions and preservation methods. It has already been pointed out that with electronic material we may not have the choice to preserve both content and physical integrity. We do, however, have the choice whether to preserve electronic documents in digital format, on-line or off-line, and whether we "convert" them (for the purpose of long term retention) to non-electronic media. These choices will to some extent be steered by the medium and format of the publication, but also by the type of access that is needed. In many cases electronic publications cannot be preserved as originally received, whether this is because the medium will not survive, or because the technical environment becomes obsolete, or for intrinsic reasons (for instance networked publications by definition cannot be acquired and stored in their original medium so have to be converted to another). If access is needed to the content only, irrespective of any other functional considerations, the cowardly way out may be to convert from electronic media to paper or microfilm. However, such a strategy may only be valid for publications which are not true electronic documents but are just non-interactive static documents distributed on an electronic medium. For dynamic, interactive documents and multimedia, such conversion is not an option.
If we want to preserve publications as electronic publications there are two basic options for their archival storage: either off-line, storing them as physical objects, or on-line, on a database. These options provide different kinds of access. In the case of storage off-line, access needs to be provided first of all through reference in a catalogue, then by fetching the object and putting it into a suitable reading device. On-line storage implies on-line access, and a reference in a catalogue will give an on-line storage location, allowing direct access to the publication. If distributed or networked access is necessary, the on-line storage option will be preferable.
Having discussed what to preserve and why, the vexed question of how to preserve may well be asked. This has not been covered here, partly because the author does not feel qualified to do so, at least not for digital material, and partly because the "how" is not really part of a preservation policy.
ECONOMIC AND MANAGEMENT CONSIDERATIONS
It is however relevant to mention two more considerations that will influence a preservation policy, for any sort of material, namely economic and management considerations. Although human intellect, human understanding, historical and technical knowledge, common sense, energy and a will to succeed are all vital, no preservation policy, no preservation programme, however well conceived, stands a chance of being implemented without sufficient funding. But preservation is only one of many library and archive functions that cry out for funding. In order to find a proper balance between the funding of preservation and other functions, we must again consider how they are related. Historically, libraries have looked at the balance of funding between acquisitions and preservation, between access and preservation, and sometimes between public services and preservation. In recent decades the balance of funding between computing and telecommunication services and preservation has also been considered. However, when we talk about the preservation of electronic material, the latter distinction may well disappear.
Lack of resources has always stood in the way of the successful implementation of a preservation policy or strategy and will certainly do so no less for electronic material. Perhaps the situation is even worse. At least once one has conserved a book, one can be reasonably satisfied of its continued existence (provided the item is properly stored and not over-handled). Similarly, once one has made a microfilm, provided the film and its production methods are of archival quality and it is stored in the right conditions, the contents of a book or manuscript will be preserved for about 300 years. However, this is not the case with electronic material. Long term access to such material requires an ongoing commitment to reformat, refresh or migrate data, and only if libraries and archives are willing to commit long term funding and long term effort should they embark on the acquisition and maintenance of electronic collections. To do otherwise is irresponsible. Planning for long term preservation of electronic material is made even more difficult because of the rapid changes in technology and the impossibility of predicting what the state of technology will be, even in the medium term.
THE IMPORTANCE OF COLLECTIONS
Notwithstanding all these uncertainties and all these problems, there is one thing that remains certain, and that is the importance of the data itself, or, in old fashioned terminology, of the library and archive collections and their continued existence. The collections form a library and archive's most valuable and most important asset, and the provision of access to those collections their most important duty. The argument has been presented in the past that in an electronic environment a library will become an information broker, an institution that does not own the data but simply enables access to them. If that is our future rôle, someone will have to ensure that the data remains accessible and usable.
Technology will help. It will continue to improve and to become more and more useful and affordable. We must seize it when appropriate, but we must not think that it provides an answer to all our problems; at least thus far it has failed to do so. The answer may lie in the human ingenuity to develop and use it, but we must also endeavour to make the best possible use of the available resources; we must ensure that we do not duplicate efforts; we must combine to work together, to share the responsibility for preserving our cultural heritage, and we must be selective, in the full knowledge that selectivity is almost certain to damage future research. It is therefore the more important to be selective in the context of a national or international preservation strategy.
It seems fitting to close with words from Northrop Frye: "Society, like the individual, becomes senile in proportion as it loses its continuous memory". In an electronic age these words are not merely a warning, they are a threat.
Technical Director, Cimtech Ltd.
This paper discusses the practical implications of the tasks required to preserve digital works. It presents two main options (on-line and off-line storage), and relates these to The British Library's needs. Finally, estimated costs are stated.
This paper stems from a consultancy study carried out by Cimtech for The British Library. The study examined the issues which surround the preservation of digital materials. It started with a literature review, then moved on to a review of the preservation process, developed a statement of objectives, reviewed the preservation options, and considered the resource requirements.
OBJECTIVES FOR THE LIBRARY
The starting point is expressed neatly by David Martin: "Any document which is published within the UK shall be eligible to be designated for legal deposit". Though few would disagree with this basic idea, in the electronic age it does require an understanding of the scope of the term "published". It is proposed that in this context we consider:
DIGITAL PUBLICATION MANAGEMENT REQUIREMENTS
The Library's requirements can simply be categorised in conventional form as falling under the following headings:
Identify publications/publishers; sign agreements; enforce deposit; maintain list of classes for deposit; update list of exclusions.
Log receipt; assign accession number; check documentation; count copies; check permissions; check media; forward copies to deposit libraries; pass on.
Check media; send out; virus check; read documentation; load data; run tests; repeat for copies; check keys to usage restrictions; download data; technical notes; pass on.
Link accession record, publication, documentation, notes; view and inspect; create bibliographic record and profile; record storage location of data and documentation.
Label publication; store data online and back-up, or download and store off-line and record location.
Provide users with access to publications, manually at standalone workstations, on-line at local workstations or at deposit library workstations via a wide area network.
Some of these tasks are familiar to libraries from the handling of books. Others are unique to digital material; of these some (eg checking documentation) can represent enormous levels of human effort.
OPTIONS FOR MANAGING DIGITAL MATERIALS
The Cimtech consultancy study identified two options for ways in which The British Library can handle digital materials.
Option 1: On-Line
This option has the following features:
Option 2: Off-Line
This option has the following features:
Although this approach is followed at the Library of Congress to an extent, it is not practical for very high volumes. It also raises problems of media, security and standards.
THE LIBRARY'S DIGITAL PRESERVATION REQUIREMENTS
The main requirements are to ensure that data is not lost, and to ensure that the data can be interpreted in the future.
Ensure Data is Not Lost
This can be taken to mean that the data is preserved for "digital archaeologists" of future generations to decode. This would mean that no effort is made to make the data accessible or usable for immediate or medium term access.
Preservation of this type can be effected by copying the data to CD-R (Compact Disk Recordable) platters. The platters would be stored in controlled conditions, and the data would be refreshed by copying to new platters (or other media) every ten years.
Making suitable assumptions, we estimate a cost of approximately £47 p.a. to archive an item in this way over 25 years.
This approach presumes that some issues can be overcome. For example, some data cannot be easily copied from some media (eg some existing CD-ROMs); and current CD-R technology does not automatically perform read-after-write checking (and so something needs to be done to ensure the integrity of CD-R copies).
In the absence of complete, issue-free solutions to the problems, the challenge is to start managing the data now, on the assumption that the answers for long term preservation will emerge naturally.
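The integrity concern raised above (CD-R writers that do not perform read-after-write checking) can be illustrated with a minimal sketch of one present-day way to verify a copy: compute a checksum of the source and of the copy, and compare the two. This is not a method described in the Cimtech study. The function names, the choice of SHA-256 and the idea of comparing two files by path are all assumptions made purely for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks so that large images do not have
    to fit in memory. (Hypothetical helper, not from the study.)
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(source_path, copy_path):
    """Return True if the copy has the same checksum as the source."""
    return sha256_of(source_path) == sha256_of(copy_path)
```

Storing the checksum alongside the accession record would also allow the same comparison to be repeated at each later refresh, without needing the original medium to be readable.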
Ensure Data Can be Interpreted in Future
There are three ways in which we can manage data to make sure that future generations will be able to make use of it:
The Cimtech study developed the following cost estimates for preservation:
Annual cost to manage and preserve a paper monograph: £5 p.a.
Annual cost to manage and preserve a CD-ROM off-line: £95 p.a. This is in addition to the £47 p.a. estimate for refreshing the media, as explained above. Note that one CD-ROM holds the equivalent of about twenty paper monographs. The cost includes an allocation of the costs of providing PC workstations for access.
In the longer term, costs would increase.
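To make the comparison between these figures easier to follow, the arithmetic can be sketched as below. The sketch uses only the numbers quoted above. It assumes that the £95 and £47 figures simply add, that "about twenty" is taken as exactly twenty, and that totals over the 25-year horizon are undiscounted; none of these assumptions is stated in the study, and the function names are illustrative.

```python
# Figures quoted in the estimates above (pounds per annum).
PAPER_MONOGRAPH_PA = 5
CDROM_OFFLINE_PA = 95
MEDIA_REFRESH_PA = 47

# Assumptions for this sketch only.
MONOGRAPHS_PER_CDROM = 20      # "about twenty" in the text
HORIZON_YEARS = 25             # horizon used for the media-refresh estimate
REFRESH_INTERVAL_YEARS = 10    # data recopied "every ten years"


def cdrom_total_pa():
    """Combined annual cost of one off-line CD-ROM, assuming the two figures add."""
    return CDROM_OFFLINE_PA + MEDIA_REFRESH_PA


def cdrom_pa_per_monograph_equivalent():
    """Annual CD-ROM cost spread over the monographs it is said to hold."""
    return cdrom_total_pa() / MONOGRAPHS_PER_CDROM


def undiscounted_total(annual_cost, years=HORIZON_YEARS):
    """Simple total over the horizon, ignoring inflation and discounting."""
    return annual_cost * years


def refresh_years(horizon=HORIZON_YEARS, interval=REFRESH_INTERVAL_YEARS):
    """Years, counted from the initial copy, in which the data is recopied."""
    return list(range(interval, horizon + 1, interval))


print(cdrom_total_pa())                        # 142
print(cdrom_pa_per_monograph_equivalent())     # 7.1
print(undiscounted_total(MEDIA_REFRESH_PA))    # 1175
print(refresh_years())                         # [10, 20]
```

On these assumptions a CD-ROM costs £142 p.a. in total, or about £7.10 p.a. per monograph equivalent, against £5 p.a. for a paper monograph.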
There are no clear solutions yet which answer The British Library's needs for long term digital preservation without significant drawbacks.
For digital publications which are "similar" to paper documents, the pragmatic approach will be to convert all unformatted text into ASCII format for preservation; and to convert formatted text to a portable, platform-independent format such as Adobe's PDF (Acrobat) format. Ideally, a completely open format will be adopted.
This approach carries some risk of charges that publications are being republished and/or corrupted by the changes introduced in the preservation process. Both are to be avoided as far as possible.
Clearly, the state of the technological art in the field of digital preservation means that we have to tread very carefully when taking long term decisions. The high costs and risks point to the need to be very selective in preserving digital works. Copying of some to paper or microfiche may remain the most desirable option.
This is an enormously diverse subject, which cannot adequately be covered in a discussion of only one hour. The description or definition of "strategy" in this context was the subject of some discussion, and there was no attempt to develop a formal, complete definition.
The principal conclusion concerned the importance of establishing and maintaining a momentum.
Notwithstanding the relevance of adopting a strategy, it was felt that some actions should be initiated as soon as possible, so that there is not an inordinate delay while a thorough strategy is produced.
The higher education sector has its own needs; it will have to create its own solutions, rather than relying entirely on other institutions such as The British Library. Major public libraries may, however, have an important rôle to play.
Other conclusions were:
Many ideas and issues, but fewer concrete conclusions, emerged from this group. Two divergent views were represented in the group, namely:
As with the syndicate on Collection, some of the discussion centred on the (unanswered) question of which objects should be preserved; in particular, the issue of whether unusable digital objects (eg those needing obsolete IT components) should be preserved for future generations.
Five positive suggestions were developed:
This syndicate identified three major headings for practical implications, namely Management, Resources and Technology. The limited time available restricted discussion to Management and Resource issues.
There was considerable debate on the meaning of the term "publishing", with a conclusion that the debate is more of concern to national libraries than to the higher education sector.
Key issues were:
Chair: Lynne Brindley, British Library of Political and Economic Science, UK
12.00 - 13.00 Arrival and Registration
13.00 - 14.00 Lunch
14.00 - 14.15 Introduction - Lynne Brindley
14.15 - 15.00 Preserving the Digital Library - Peter Graham, Rutgers University Libraries, US
15.00 - 15.30 Tea
15.30 - 16.00 Strategies for Managing Electronic Archives - Denise Lievesley, ESRC Data Archive, UK
16.00 - 16.30 Collection Policies - Daniel Greenstein, Arts & Humanities Data Service Executive, UK
16.30 - 17.30 Syndicate A: Strategies and Syndicate B: Collection
17.30 - 17.45 Report from Syndicate A
17.45 - 18.00 Report from Syndicate B
18.00 - 19.00 Free
19.00 for 19.30 Dinner
After dinner: Legal Deposit: The British Library Experience - Sir Anthony Kenny, Chairman of The British Library Board, UK
Chair: Nigel Macartney, British Library, UK
09.00 - 10.00 Preserving Digital Information - Margaret Hedstrom, University of Michigan, US
10.00 - 10.30 Preservation Policies - Mirjam Foot, British Library Collections and Preservation, UK
10.30 - 11.00 Practical Implications - Tony Hendley, CIMTECH Limited, UK
11.00 - 11.20 Coffee
11.20 - 12.20 Syndicate C: Preservation Policy and Syndicate D: Practical Implications
12.20 - 12.35 Report from Syndicate C
12.35 - 12.50 Report from Syndicate D
12.50 - 13.30 Summary of outcomes and conclusions - Lynne Brindley
Roy Baker, University of London Computer Centre
Lynne Brindley, British Library of Political & Economic Science
David Buckle, OCLC Europe
Lou Burnard, Oxford University Computing Services
Terry Cannon, British Library Research & Development Department
Reg Carr, Leeds University Library
Julia Chruszcz, University of Manchester Computer Centre
Ann Clarke, British Library
Alice Colban, JISC Secretariat
Margaret Croucher, British Library Research & Development Department
Marilyn Deegan, de Montfort University
Richard Field, University of Edinburgh
Mirjam Foot, British Library Collections & Preservation Department
Marc Fresko, Imaging & Information Technology Consultant
Hazel Gott, UKOLN
Peter Graham, Rutgers University Libraries
Daniel Greenstein, Arts & Humanities Data Service
Rhidian Griffiths, The National Library of Wales
Margaret Hedstrom, University of Michigan
Tony Hendley, CIMTECH Limited
Bjørn Henrichsen, Norwegian Social Science Data Services
Andrew Jordan, University of Huddersfield
Sir Anthony Kenny, The British Library Board
Geraldine Kenny, National Preservation Office, British Library
Denise Lievesley, ESRC Data Archive
Nigel Macartney, British Library Research & Development Department
John Mahoney, British Library Research & Development Department
Ann Matheson, National Library of Scotland
Simon Musgrave, ESRC Data Archive
Bernard Naylor, University of Southampton Library
Seamus Ross, The British Academy
Chris Rusbridge, The Electronic Libraries Programme
Anne Thurston, University College London
Frank Wright, Ordnance Survey