Repositories and preservation cluster session 2006-03-28 report

Repositories and preservation cluster session

28th March 2006

Informal report from the Repositories and preservation cluster session at the Second JISC Digital Repositories Programme Meeting, 27th-28th March 2006, Warwick. Participants in the session are encouraged to add to or comment on this document, either by editing directly or adding comments to the associated talk page.

Introduction

Steve Hitchcock introduced the meeting and its aim to enable collaboration between projects. The discussion group included projects that were not identified as cluster members, and these were encouraged to join if relevant. Thinking about new areas and projects for future JISC funding was also encouraged.

Helen Hockx-Yu, JISC funding

Helen gave a recap of the 12 workpackages for forthcoming JISC funding, outlining some of the preservation elements. For example, workpackage 8 – Tools and innovations, could encompass work on exploring new technologies for digital preservation, such as auto-extraction of metadata, cost model tools, assessment tools, retention / appraisal tools, ingest tools, personal desktop preservation, lifecycle management support tools and implementations of PREMIS. Workpackage 11 – Preservation assessment, will address preservation gaps through later calls.

Questions for Helen touched on whether there is space to address the users’ desktop, through supporting the uptake of XML and increasing the availability of XML tools. Although this was not seen as an agreement that XML is the preservation format, the openness of the XML standard is a significant consideration. By default, rather than by policy, PDF is widely used by repositories, and the implications of this are as yet unknown.

Discussion

Steve Hitchcock opened the discussion session with a short presentation on the content-service provider model and talked about projects that are developing practical applications based on this model. Referring to the OAIS functional model, Steve noted that repositories are well advanced in fulfilling data management functions, but preservation elements are lacking. Existing expertise, such as that brought together by the cluster, can be used to plug this gap. After this introductory context, discussion kicked off with some debate around the definition of curation and preservation. The following sections represent the broad areas of discussion, with some key points noted:

Semantics and cluster remit

  • The remit of the cluster encompasses both preservation and curation.
  • Terminology confusion exists with respect to the meaning of preservation, curation and archiving.
    • ‘Archiving’ has specific meanings in the Archive and ICT communities.
    • Curation is taken to be the management of information to ensure future use.
    • Preservation is largely dedicated to ensuring the longevity of pre-existing objects or data.
  • A forthcoming article by Neil Beagrie on this topic will be circulated to the cluster when published.
  • There is a role for curation activity and ‘small p’ preservation in a repository.

Lifecycle and advocacy

  • Changing the attitude of users and embedding an awareness of preservation at the point of creation are key issues for repositories and preservation.
  • Repositories need a business case and commitment to preservation from the outset.
  • Repositories are a start, rather than a curation-solution, but repositories and preservation must co-exist. Joining up services across repositories in the coming years must include preservation services.
  • The OA movement and the use of repositories encourage academics to take responsibility for their assets.
  • Curation should be embedded into workflow through awareness-raising, best practice and tools, rather than through ‘scare’ tactics.
    • For repositories, where improved access is the primary focus, avoiding barriers to deposit is crucial. Preservation must be advocated as an additional, even ‘selfish’, benefit of deposit, rather than an additional task.
      • e.g., use of accessible, long-term formats might offer immediate gain
      • searching within documents might be enhanced by particular formats
      • there might also be increased exposure to Google, benefits for the RAE process, or benefits offered by capturing metadata
    • Often these ‘organisational’ benefits are recognised by academics after they start using repositories and it can be difficult to convince, in advance, that extra work is a good thing.
  • Traditionally, preservation has been carried out by librarians and archivists, without academics needing any awareness.
  • There are lessons to be learnt from records management about lifecycle and retention; part of curation is knowing what and when to preserve.

Preservation metadata

  • Without metadata, preservation and curation decisions cannot be made, particularly for non-text materials such as images.
  • Capturing metadata must begin at the point of creation.
  • The importance of metadata for current and future use must be demonstrated.
  • Using a repository for self-archiving forces the capture of essential metadata.

IPR and DRM

  • There needs to be a coordinated approach to collecting, particularly where collections are under the control of agencies without a preservation remit. Transferring materials to other agencies raises:
    • intellectual property questions.
    • DRM issues.

Assessment factors and risk

  • In the digital realm, preservation needs to be forward-looking and preventative.
  • Determining the value of assets:
    • e.g. through the ESPIDA scorecard system
    • evidence-based categorisation of retention periods, e.g. using citation as a measure of usefulness.
    • future and historical significance can’t always be predicted
    • ‘collecting communities’ may be needed to identify the long-term value of material that might otherwise be discarded.
    • if storage is cheap, should everything be preserved? or is appraisal needed?
    • if preservation is done for the short-term, long-term preservation is still possible and may be more economic in future.
    • repositories are part of a changing landscape, but they do exist and they do have content for us to work with.
  • There are significant economic implications of preserving and migrating data - convincing arguments must be made regarding the need for preservation if funding is to be secured.
  • Discussion of long-term preservation is, by its nature, speculative
    • evidence of future success is not available.
    • expanding knowledge through research enables us to make best guesses.
    • Failures and loss are inevitable, as they are in the analogue world.
  • How much current content is under threat and how much of the threat is over-estimated?
    • Are there any practical examples of formats that present serious problems?
    • It is undesirable to divert funds to problems that have been overstated.
    • There are many stories of a failure to manage ‘media’.
    • There are examples of data loss resulting from file format migration.
    • There is documentation of the vulnerability of bits.
      • Repositories deal with this through systematic replication. Preservation has historically been split off from the day-to-day management of repositories. This is a mistake; together, management and preservation become curation.
  • Content provider perspective, from MIDESS, SPECTRa, eBank/R4L:
    • MIDESS: looking at non-text multimedia materials. Formats for preservation are an issue.
      • For example, various, mainly proprietary, video formats are used and large file sizes make it uneconomic to store multiple copies.
      • Lossless multi-platform open standards are not necessarily the most highly-compressed and are unlikely to be the formats chosen by creators. Film archives face similar problems.
      • A JISC-commissioned archiving study by the AHDS will report on this area soon.
    • SPECTRa are storing chemical data
      • currently most of this is simply lost because there is no mechanism for saving it
      • creators don’t appreciate the value of what they have, because they are creating a small part of a larger whole – added-value is perceived by curators and collectors.
    • R4L/eBank
      • scientific data can provide examples of disappearing formats.
  • File format registries need to be embedded into the repository infrastructure.
  • There need to be mechanisms for transferring assets that are no longer used by the originating repository, possibly through the use of third-party services, such as AHDS.

Trust and certification

  • Lightweight ‘feel-good’ approaches may be better than prescriptive certification regimes.
  • Levels of certification will be explored by the DCC.
  • Certification needs to answer the following question:
    • What would it give me as a repository administrator?
    • Certification must offer incentives, not barriers
      • e.g. it might bring additional funding, kudos or be a lead into an aggregated service
  • ArXiv is trusted, but it wasn’t developed with certification in mind.
  • German work in this area has tackled certification from the outset, but certification must be driven by what repositories are doing in practice.
  • How can we provide tools to make certification easier?

Out of this discussion, some ways forward for the Repositories and preservation cluster were identified and are outlined on the accompanying presentation (Repositories and preservation.ppt), presented to the plenary session by Gareth Knight.