Cedars Access Issues Working Group

Cedars Preservation Metadata Elements

Cedars Project Document AIW02

Andy Stone
Oxford University Computing Service, 13 Banbury Road, Oxford, OX2 6NN, UK.
andy.stone@computing-services.oxford.ac.uk
http://www.oucs.ox.ac.uk/

Michael Day
UKOLN: The UK Office for Library and Information Networking, University of Bath, Bath BA2 7AY, UK.
m.day@ukoln.ac.uk
http://www.ukoln.ac.uk/

Cedars Document Number: AIW02

Status: Current

Created: 25-Feb-1999

Updated: 25-Feb-1999

Availability: Project Only

1 Introduction
2 The OAIS model
3 The draft Cedars architecture
4 Cedars metadata elements
    4.1 Content Information
    4.2 Representation Information
    4.3 Preservation Description Information
    4.4 Packaging Information
    4.5 Descriptive Information
5. Cedars Metadata Elements (First draft) - Tabular Format
6. References
7. Document History

1 Introduction

This document builds upon the work carried out for Cedars Project Document AIW01 Metadata for preservation (Day 1998). That report identified and outlined a number of metadata initiatives and projects with relevance to digital preservation. This document synthesises metadata elements taken from some of these initiatives and places them in the framework of the Reference Model for an Open Archival Information System (OAIS) published by the Consultative Committee for Space Data Systems (1998).

This is a first attempt at identifying preservation metadata elements for the Cedars project within the general framework provided by the OAIS model.

The metadata schemes most used in this document are:

PANDORA – National Library of Australia. PANDORA logical data model, Version 2. Canberra: NLA, 10 November 1997. <URL:http://www.nla.gov.au/pandora/ldmv.html>
Pittsburgh Project – Bearman, D. and Sochats, K. Metadata requirements for evidence. University of Pittsburgh, School of Information Science, 1996. <URL:http://www.lis.pitt.edu/~nhprc/BACartic.html>
RLG – Research Libraries Group Working Group on Preservation Issues of Metadata. Final Report. Mountain View, Ca.: RLG, May 1998. <URL:http://www.rlg.org/preserv/metadata.html>

2 The OAIS model

The OAIS model identifies what it calls a Taxonomy of archival information object classes that includes the following types of data:

Content Information
Representation Information
Preservation Description Information (broken down into Reference, Context, Provenance, and Fixity information)
Packaging Information
Descriptive Information

Using this structure should make it easier to identify the components of the different Information Packages which OAIS refers to (which reflect actions at different stages of an archive). Metadata elements detailed in the PANDORA data model, the Pittsburgh metadata requirements document and the RLG Working Group report were mapped on to the OAIS taxonomy in order to help identify specific metadata elements.

3 The draft Cedars architecture

The metadata specification will be an important part of the Cedars paper demonstrator. It will therefore need to fit into the general framework of the proposed Cedars demonstrator architecture. David Holdsworth’s draft architecture outlines two essential outcomes of a preservation demonstrator:

Ensuring that one can find the preserved digital object
Ensuring that one can successfully access relevant aspects of this object once it has been found

Holdsworth has already noted the requirement for unique identifiers and has proposed a Cedars Reference Identifier (CRID).

Metadata, therefore, has several distinct roles within the Cedars architecture:

Resource discovery within the Cedars demonstrator. Finding the preserved digital object will be the function of the Descriptive Information (in OAIS terms). For the demonstrator, this could take one of several different forms: MARC, EAD, etc. For the purposes of this document the Descriptive Information will conform to simple Dublin Core as defined in RFC 2413.

Retrieval from the Cedars demonstrator. Accessing the preserved digital object will depend upon Representation Information (in OAIS terminology) or structural metadata. The Representation Information (RI) may need to contain the whole specification for the format in which the object is ‘encoded’, e.g. for TIFF images or PDF files together with specifications of software environments (operating systems) and hardware. It is expected that the Cedars Preservation Strategies Working Group (PSWG) will contribute to the specification of RI. In this document, RI will just document relevant formats, operating environments and hardware.

Rights. Rights management will be an important part of any preservation demonstrator and could be used to manage access to objects stored in the archive. In OAIS terms, rights metadata are part of the Provenance Information of the Preservation Description Information (PDI). Rights metadata may need to be kept separately from the rest of the PDI within the Cedars demonstrators. It is expected that the Cedars Content Issues Working Group will contribute to the specification of rights metadata within the context of the paper demonstrator.

Preservation description. The remainder of the PDI gives information about References (chiefly identifiers), Context, and Fixity. Fixity Information is concerned with the authentication of the archived object and will need to be the subject of further study. Context Information deals with relationships between (and within) resources, granularity issues and will document the history of an archived object with relation to format conversions, preservation strategies and usage. Further Provenance Information within the PDI might contain information about the context of creation of the archived object and the digitisation capture process.

Note that an Archive Information Package (AIP) will comprise the Digital Object together with its RI and PDI.
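
As a rough illustration of this composition, the following Python sketch models an AIP as a Digital Object bundled with its RI and PDI. The class names, field names and the CRID value are assumptions made for the example; neither the OAIS model nor the Cedars architecture prescribes them.

```python
from dataclasses import dataclass, field


@dataclass
class PreservationDescriptionInformation:
    """PDI, split into the four categories named by the OAIS model."""
    reference: dict = field(default_factory=dict)   # identifiers, e.g. a CRID
    context: dict = field(default_factory=dict)     # relationships and history
    provenance: dict = field(default_factory=dict)  # custody and rights
    fixity: dict = field(default_factory=dict)      # authentication data


@dataclass
class ArchiveInformationPackage:
    """An AIP: the Digital Object together with its RI and PDI."""
    digital_object: bytes
    representation_information: dict
    pdi: PreservationDescriptionInformation


# Hypothetical package; the identifier value is invented.
aip = ArchiveInformationPackage(
    digital_object=b"raw bit-stream",
    representation_information={"file_type": "TIFF"},
    pdi=PreservationDescriptionInformation(
        reference={"CRID": "example-crid-0001"}
    ),
)
```

Keeping the four PDI categories as separate fields would make it straightforward to hold rights metadata apart from the rest of the PDI, should that prove necessary.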

The following list of elements is intended to encourage discussion about the metadata required by the Cedars demonstrators and will lead to the eventual production of a Cedars metadata specification. The elements are part-derived from metadata schemes with a different focus from that of Cedars.

A proposed metadata record for one resource, the Wife of Bath’s Prologue on CD-ROM, will be added to aid discussion.

4 Cedars metadata elements

4.1 Content Information

Content Information identifies the bits comprising the primary Digital Object; i.e. the raw file. This may consist of:

File Identifier (the unique identifier which indicates the data object in question – e.g. the CRID which David Holdsworth has referred to)
File Size (size of file in kilobytes, megabytes, etc.)

Other information regarding the Digital Object can be found under the Representation Information and Preservation Description Information (PDI).

4.2 Representation Information

Representation Information is defined by the OAIS model as "information that maps data into more meaningful concepts", and as "accompanying a digital object, or sequence of bits … to provide additional meaning". In general, it can be said that "Data interpreted using its Representation Information" yields information.

It should be noted that Representation Information stored in digital form will require Representation information of its own to be understood. This recursion can be broken in one of two ways: through a physical document, or through software. The latter is felt to be a risky option, since there is no guarantee that such software will continue to exist. This identifies a need for a two-pronged approach: to indicate the required software (and, potentially, hardware emulators) to view files; and provision of physical documentation on standards for software formats, particularly in cases where they may be in decline.
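
The recursion described above can be sketched in Python as a chain of Representation Information records, each pointing to the record needed to understand it, which ends when a physical document is reached. The class, its fields and the example records are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepresentationInfo:
    name: str
    physical: bool = False  # True for a printed (non-digital) document
    described_by: Optional["RepresentationInfo"] = None


def representation_chain(ri):
    """Follow described_by links until a physical document ends the recursion."""
    chain = []
    while ri is not None:
        chain.append(ri.name)
        if ri.physical:
            return chain
        ri = ri.described_by
    # Reached digital RI that nothing else describes: the recursion is unbroken.
    raise ValueError("chain ends in digital RI with no further description")


# Hypothetical example: a format specification held as a PDF file,
# which is itself documented by a printed specification.
tiff_spec = RepresentationInfo(
    "TIFF specification (PDF file)",
    described_by=RepresentationInfo("PDF specification (printed)", physical=True),
)
chain = representation_chain(tiff_spec)
```

A check of this kind could be run at ingest to flag objects whose Representation Information never reaches physical documentation.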

There are varying levels of representation: collections may come with additional levels of functionality; although a description may be preserved of this, the implementation may or may not also be preserved.

Since there will be a long-term issue of format migration, this could be devolved to a dedicated "technology watch unit" which can identify and recommend migration action.

Proposed RI Elements:

4.2.1 Language

For content, an international standard such as NISO Z39.53 could be used. This includes older languages, which may be useful for textual resources (e.g. in OTA). Available: <URL:http://www.oasis-open.org/cover/nisoLang3-1994.html>

4.2.2 File Information:

There are two main strands to this; the physical storage format used (called "file format" in Pittsburgh), and the type of file formatting ("file rendering" in Pittsburgh) used.

Storage format: Type. Examples would be diskette (3.5 or 5.25 inch), CD-ROM, information server, HFS path, etc. This may be important since the underlying software may require the emulation of mounting diskettes, for example.
Storage format: Version
Storage format: Specification (link to reference resource; should be derived from type and version)

Type of file format:
Name of File Type (e.g. TIFF)
Version
Number of files
Total file size
Documentation (link to reference resource; should be derived from type and version)

A list of file formats would need to be maintained; formatting metadata may be required to understand how logical access to the bit-stream takes place.

File Modality (from Pittsburgh)

A higher level of description, similar to Dublin Core ‘Type’, i.e. text, graphic, sound, video, etc.

If a type of encoding is not initially recognised, one could be made aware of its generic type so that exploratory work can be done to assess the feasibility of accessing the file.

4.2.3 Modality-Specific Metadata

There will be format-specific technical information. For example:

4.2.3.1 Text-based data:

Character Set (e.g. ASCII, Unicode)
Encoding
Associated DTD (for SGML/XML documents)

4.2.3.2 Graphic data

The following have been modified from RLG:

Resolution (e.g. 600dpi)
Colour (e.g. 16-bit)
Colour Management
Colour Bar/Grey Scale Bar

4.2.3.3 Audio data

CEDARS is not addressing multimedia, but examples might include:

Sampling Frequency (e.g. 8kHz, 44.1kHz)
File Encoding (e.g. AIFF, AU, WAV – will require associated package to play back file)
Bits (e.g. 8-bit, 16-bit)
Mono/Stereo

4.2.4 Dependencies (adapted from Pittsburgh)

Resources may have dependencies on particular applications, software environments and hardware. These dependencies will, where appropriate, need to be recorded.

Application Dependency (e.g. Yes/No)
Application Name (e.g. OpenText, MS-Word)
Application Version (e.g. version 5.0)
Notes (e.g. "Versions of OpenText later than 5.0 are unable to handle SGML files")

Application dependencies could also refer to Plug-ins required, as mentioned by PANDORA’s "Online Format" field.

[Note: Is it within the remit of CEDARS to maintain the location of application required? At the librarians’ focus group in December, the notion of an application repository was mentioned. This would imply applications would have their own identifiers (and consequently locations) so they could be accessed].

Software Environment Dependency

Dependencies on particular operating systems and environments.

Software Environment Name (e.g. Windows, Macintosh, UNIX, none)
Software Environment Version (e.g. Windows 3.1, 95, 98, Mac System 7, System 8.1, AIX, IRIX)

Hardware Dependency

This may be audio or visual minimum requirements. However, since CEDARS is not addressing time-based media, this may not be as essential as previously thought:

[Note: This could also be used to record dependencies on particular hardware, e.g. a PC with 486 processor (minimum)?]

Hardware Name (e.g. Mouse, Joystick, Monitor)
Hardware Type (e.g. Pointer Device, Colour Monitor)
Hardware Description (e.g. mouse needed to interact with software; images served are large, high-quality colour, etc.)
Specification (e.g. suggested minimum size 17", refresh rate 75Hz, dot pitch 0.28mm)
Modality Served (e.g. text, graphics, sound, video)

4.2.4.5 Compression

Name (e.g. TIFF-LZW, ZIP, TAR)
Method
Required uncompression package (e.g. graphics viewer package, PKUNZIP, etc)

If the uncompression package is not available, a reference to its specification will be needed.
If an application repository exists, compression/uncompression packages should be included in this.

4.2.4.6 Location

The location of file on archive system - an identifier (i.e. CRID)

4.3 Preservation Description Information

4.3.1 Reference Information

This identifies and (if necessary) describes one or more mechanisms used to provide assigned identifiers for the Content Information. It also provides identifiers which allow outside systems to refer unambiguously to particular Content Information.

Identifier Name (e.g. CRID, URL, ISBN, Author, Title, etc.)
Identifier Value

Examples: CEDARS Identifier (a unique identifier within the CEDARS scheme - e.g. the CRID); Internet URIs or URNs; ISBN/ISSN numbers (for electronic texts where applicable). Preference must be given to persistent and unique identifiers.
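
A minimal Python sketch of these name/value pairs follows, with one possible way of expressing the preference for persistent and unique identifiers. The ranking in PERSISTENT_NAMES and the example values are assumptions for illustration, not part of the draft element set.

```python
from dataclasses import dataclass

# Assumption: identifier names treated as persistent, in order of preference.
PERSISTENT_NAMES = ("CRID", "URN", "ISBN", "ISSN")


@dataclass(frozen=True)
class Identifier:
    name: str   # Identifier Name
    value: str  # Identifier Value


def preferred_identifier(identifiers):
    """Return the highest-ranked persistent identifier, or else the first given."""
    for name in PERSISTENT_NAMES:
        for ident in identifiers:
            if ident.name == name:
                return ident
    return identifiers[0]


# Hypothetical identifiers for one object.
identifiers = [
    Identifier("URL", "http://example.org/some-text"),
    Identifier("CRID", "example-crid-0001"),
]
best = preferred_identifier(identifiers)
```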

4.3.2 Context Information

Defined as the relationships of the content information to its environment. This includes background information on why the Content Information was created, and how it relates to other Content Information objects elsewhere.

The main metadata types refer to: granularity, formatting history, preservation strategy history, usage, and format processing.

4.3.2.1 Granularity

Different types of archive will have different levels of granularity, e.g. in ILEJ, there will be publication level, issue level and page level. It needs to be clear which one is meant. PANDORA refers to Publication Identifier and Copy Identifier. There may be a role for the Dublin Core Relation element.

Level of granularity (this could be defined as part of the archival schema; examples need to be worked upon using the exemplars we have)
Related item (e.g. list of CRID/URNs of items)
Relation to item (e.g. it is an element/component of it)
Is item required? (e.g. will the object in question make sense without it, e.g. links required to understand object within collection)
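
The three relation elements above might be recorded together, as in the following Python sketch. The field names and identifier values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    related_item: str  # CRID or URN of the related item
    relation: str      # e.g. "is component of"
    required: bool     # the "Is item required?" element


def required_items(relations):
    """List the related items without which the object would not make sense."""
    return [r.related_item for r in relations if r.required]


# Hypothetical relations for a single page image.
relations = [
    Relation("example-crid-issue-01", "is component of", True),
    Relation("example-crid-thumbnail-01", "has thumbnail", False),
]
needed = required_items(relations)
```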

Online data, while usually output as text-based files read by a client (e.g. Web browser), may only be accessible via server-based programs. For example, the ILEJ collection is available via the Web, with a variety of searchable functions. If it was felt important for this functionality (or part of it) to be ‘preserved’, there would be a trade-off between any completeness of functionality and cost. (Note: this observation also refers to dependency on software to provide FUNCTIONALITY which has already been covered – this is distinct from the granularity of the digital object files)

Additionally, the ILEJ and Oxford Early Manuscripts projects both raised issues regarding associated files. The uncompressed scanned TIFF file of a manuscript page may not be appropriate, nor is the full context of the digital artefact known. There needs to be the facility to link to the relevant files in order for a contextual understanding to be made (e.g. ILEJ may link the OCR text file with the scanned image and thumbnail)

In instances such as these, the judgement of what constitutes the digital object is important.

Perhaps each archive needs to document this complete functionality, noting whether functional instances of it (e.g. specific CGI search routines) have been preserved and stating how these were used.

4.3.2.2 Formatting History

History: Submission:
RLG: Date - when the digital preservation copy was created.
History: Change:
The RLG metadata refers to "Change History" - changes made to the formatting of the data under a migration strategy, or to decisions regarding its preservation.
History: Change: Formatting:
A value known as "Current File/Format" was in a previous draft; however, this is redundant if the submission event is recorded as the first change – if this is a digitisation process, the original medium can be

If the Current File/Format fields have a "no" value, then information about the file/format type (and version) converted to; reason for conversion; date converted; and conversion agent is also captured. It can be envisaged that this information will be retained as conversions take place over time. This allows a history trail of files/formats to be kept.

FormattingChange: date
FormattingChange: previous format (e.g. paper master, CD-ROM, HTML)
FormattingChange: process (e.g. scanning, HFS backup, conversion to XML)
FormattingChange: reason (e.g. creation of digital master, change in storage policy, archive decides to convert HTML docs into XML)
FormattingChange: conversion-agent (e.g. OUCS, ILEJ Project Officer)
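
The history trail could be sketched in Python as a list of change events from which the sequence of formats is reconstructed. The converted_to field is an assumed name for the "file/format type converted to" mentioned above, and the dates are invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FormattingChange:
    changed_on: date       # FormattingChange: date
    previous_format: str   # FormattingChange: previous format
    converted_to: str      # format converted to (assumed field name)
    process: str           # FormattingChange: process
    reason: str            # FormattingChange: reason
    conversion_agent: str  # FormattingChange: conversion-agent


def format_trail(changes):
    """Return the formats an object has passed through, oldest first."""
    ordered = sorted(changes, key=lambda c: c.changed_on)
    if not ordered:
        return []
    return [ordered[0].previous_format] + [c.converted_to for c in ordered]


# Illustrative history, deliberately supplied out of date order.
history = [
    FormattingChange(date(1999, 6, 1), "TIFF", "JPEG", "conversion",
                     "change in storage policy", "OUCS"),
    FormattingChange(date(1998, 3, 1), "paper master", "TIFF", "scanning",
                     "creation of digital master", "ILEJ Project Officer"),
]
trail = format_trail(history)
```

Because each event records both ends of a conversion, the trail can be rebuilt by sorting on date alone.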

PANDORA mentions that the physical characteristics of tangible formats will need to be kept in the archive.

4.3.2.3 Preservation Strategy History

History: Change: Preservation Strategy:
Pittsburgh: Disposition Requirements Metadata
Removal Authority - who is authorised to purge an item from the archive system, e.g. the Oxford Text Archive.
Retention Policy Citation - textual information of organisation’s policy for record retention, e.g. link to relevant document or inclusion of relevant text in metadata. Could be a URI.
This is similar to PANDORA: Decision Rule (refers to the decision for keeping a record archived)
Retention Authority Issuance - textual information on legislative/government laws/regulations for record retention
Retention Period End Time (e.g. 31/12/2015, indefinite)
Also PANDORA: Selection Status (e.g. withdrawn, selected, not selected)
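
As a small illustration of how a Retention Period End Time might be evaluated, the Python sketch below assumes the value is either the word "indefinite" or a date written day/month/year, as in the example above. The function name and the choice of date format are assumptions; an ISO 8601 format could equally be used.

```python
from datetime import date, datetime


def retention_expired(end_time, today):
    """Return True if the retention period has ended before 'today'."""
    if end_time.strip().lower() == "indefinite":
        return False
    # Assumed format: dd/mm/yyyy, e.g. 31/12/2015.
    end = datetime.strptime(end_time, "%d/%m/%Y").date()
    return today > end


expired = retention_expired("31/12/2015", date(2016, 1, 1))
```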

4.3.2.4 Usage

Pittsburgh: Use History, breaks down to:

Use Type - e.g. viewed, copied, edited, filed, indexed, classified, sent, disposed – identifies usage types permitted.
Use Instance Time - when data was used, e.g. 09/02/99 15:45.
Use Instance User - who used the data – although this could be just at the level of organisation; e.g. user ID number, guest, email of user.

The Use-Evidential-Consequences appears to be more related to the evidential focus of the Pittsburgh project. Here preservation is more concerned with frequency of usage. While this may not be the sole determinant of preservation strategy, it may be of interest (e.g. if usage is a determinant of the continuing preservation of the objects). Similarly, at what level would users be defined (individuals, groups, geography, anonymous? Need to check with Rights (Content Issues) Working Group).
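
If frequency of usage is of interest, it could be derived from the recorded use instances, as in this Python sketch. The record layout is an assumption, and the user values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class UseInstance:
    use_type: str  # Use Type, e.g. "viewed", "copied"
    time: str      # Use Instance Time, kept as recorded
    user: str      # Use Instance User, e.g. a user ID or "guest"


def usage_frequency(instances):
    """Count how often each type of use has been recorded."""
    return Counter(i.use_type for i in instances)


# Hypothetical usage log for one archived object.
log = [
    UseInstance("viewed", "09/02/99 15:45", "guest"),
    UseInstance("viewed", "10/02/99 09:10", "user-0042"),
    UseInstance("copied", "10/02/99 09:12", "user-0042"),
]
frequency = usage_frequency(log)
```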

4.3.2.5 Format Processing:

PANDORA: Process - the formal states a tangible format undergoes before and/or during each time its content is made available to user; e.g. testing by installation and virus checking, conversion from TIFF to JPEG format, rendering from SGML to HTML.
PANDORA: Process Type - the processes a given tangible format may undergo.
PANDORA: Process Agent - what utilities have been used (e.g. Dr Solomons, Norton, TIFF2JPG utility, MS PhotoShop)
PANDORA: Publisher-Preferred Browser (e.g. Netscape 4.0, JPEG viewer)

Format processing would need to be monitored; a "technology watch" facility would be useful to update this. Also, changes in format imply changes required in format processing.

4.3.3 Provenance Information

This documents the history of the Content Information; i.e. who has had custody of the Content Information since it was originated, and what was its source. This would include information about the processes that generated it. If the content was not "born digital", information about digitisation processes could be held here. Custodial information is also likely to refer to the rights implications regarding access to the content. Rights metadata issues will be further looked at by the Cedars CIWG.

[We will also need to include some kind of field which indicates if the content should be made visible to archive staff querying the system, in addition to whether end users can access it].

4.3.3.1 Publisher Information

Publisher
Publisher Name (e.g. Oxford University Press)
Publisher Contact Details (e.g. address, email, fax, telephone number)
Rights Management
Access Profile
Access Rights

4.3.3.2 Capture Procedure

[PANDORA deals primarily with Internet publications – I feel this section relates more to ongoing capture of on-line publications, which is a specific instance of an ACQUISITIONS PROCEDURE].

Capture Procedure Type
Gathering Frequency - Sampling techniques (e.g. weekly, monthly, quarterly, annually, one-off)
Gathering Regime (e.g. the frequency and extent of capture of a publication)

4.3.3.3 Rights information

Event Type (e.g. Publication selected, No publisher response after x weeks, Publisher Permission obtained, Null publisher response)
Negotiation Status (e.g. First Contact Made, Follow-up contact made, Agreement Finalised, On hold pending further negotiation, Archived without agreement, De-selected)
Note: event type and negotiation status are interlinked; the examples cited are from the PANDORA model.
Organisation (body which negotiates on behalf of publisher – e.g. ASCAP in the case of musicians)
Copyright Warning
Publisher Copyright Statement
[Rights input from CIWG].
User Class (e.g. from a list of designated user types – this would need to be developed in conjunction with the archive(s) we are working with and the Rights group)
Restriction Status (T&C – e.g. restricted only to archivists, i.e. no public access)
Access Conditions (T&C – e.g. may only be accessed by universities)
Use Conditions (T&C – e.g. only for research purposes)

4.3.4 Fixity Information

This documents the authentication mechanism, and provides any authentication keys which may be needed, for example:

Validation Key
Encryption
Digital watermark

This would require the following metadata items:

Fixity-type (the mechanism, e.g. digital watermark)
Fixity-key (e.g. key value, if applicable)
Fixity-value
Fixity-info-location (e.g. of documentation on the authentication mechanism)
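
One way these items could be populated is sketched below in Python, using a message digest as the fixity mechanism. A digest is not among the mechanisms listed above; it and the SHA-256 algorithm are assumptions chosen only because they are simple to demonstrate with the standard hashlib module.

```python
import hashlib


def make_fixity(data, algorithm="sha256"):
    """Build a fixity record for a bit-stream using a message digest."""
    return {
        "fixity_type": algorithm,  # the mechanism used
        "fixity_key": None,        # a plain digest needs no key
        "fixity_value": hashlib.new(algorithm, data).hexdigest(),
        "fixity_info_location": None,  # would point to documentation
    }


def verify_fixity(data, fixity):
    """Recompute the digest and compare it with the stored Fixity-value."""
    digest = hashlib.new(fixity["fixity_type"], data).hexdigest()
    return digest == fixity["fixity_value"]


record = make_fixity(b"raw bit-stream")
unchanged = verify_fixity(b"raw bit-stream", record)
```

A keyed mechanism would fill Fixity-key as well; the record layout stays the same.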

4.4 Packaging Information

This is defined as information which, logically or actually, binds and relates the components of the package into an identifiable entity on specific media, e.g. on a CD-ROM, packaging info may include the ISO-9660 volume/file structure of a CD-ROM. These choices are the subject of local archive definitions or conventions.

The "Granularity" information in the PDI Context Information (section 4.3.2.1) may have relevance in this section. In this way, the binding of content and functionality can be expressed and be kept distinct.

4.5 Descriptive Information

This is the metadata used by access aids (which locate, analyse, retrieve, etc. from an OAIS), for example:

Title
Author/Creator
Description
Contributor
Date
Language

This could be generally derived from the Content Information and PDI – so file identification/location, and other descriptive information will be replicated here. There may also be a role for metadata in existing schemes (MARC, EAD, etc) or in core formats like Dublin Core.
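
The replication described here might look like the following Python sketch, which copies identifiers from the Content Information and the PDI Reference Information into a simple Dublin Core style record. The function, the dictionary keys and the example values are all assumptions for illustration.

```python
def derive_descriptive_info(content_info, pdi_reference, catalogued):
    """Build a descriptive record from existing metadata plus catalogued fields."""
    record = {"identifier": [content_info["file_identifier"]]}
    # Replicate any further identifiers held in the PDI Reference Information.
    for _name, value in pdi_reference:
        if value not in record["identifier"]:
            record["identifier"].append(value)
    # Title, creator, etc. are assumed to be supplied by a cataloguer.
    record.update(catalogued)
    return record


# Hypothetical inputs for one archived object.
descriptive = derive_descriptive_info(
    {"file_identifier": "example-crid-0001", "file_size": "650 MB"},
    [("CRID", "example-crid-0001"), ("URL", "http://example.org/some-text")],
    {"title": "Example title", "language": "eng"},
)
```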

5. Cedars Metadata Elements - Tabular Format

The following table is an alternative presentation of the draft metadata elements identified to date, with the addition of thoughts as to whether each suggested element will be required, if they can be repeatable, and the likely source of this metadata.

Content Information
Element	Examples/Definitions	Comments	Req.	Rep.	Source
File Identifier	A unique identifier which indicates each data object being archived.	In the Cedars context this would most likely be an internally defined identifier or CRID.	Yes	No	Cedars nameserver
File Size	The size of file in kilobytes, megabytes, etc.	There may be a need for a content standard for this field.	?	No	?
Representation Information
Element	Examples/Definitions	Comments	Req.	Rep.	Source
Language	The language of the content of the data object being archived. NISO Z39.53 codes could be used.	To be used where language is a significant part of the resource.	No	Yes
Storage format: Type	The initial storage format of the data object being archived. Examples might be: Diskette 3.5 inch Diskette 5.25 inch CD-ROM Information server HFS path	This may be important since the underlying software may require the emulation of mounting diskettes, for example.	Yes	Yes	To be supplied as pre-ingest data?
Storage format: Version	The version of storage format being used				To be supplied as pre-ingest data?
Storage format: Specification	A link to documentation about the source format.	The link would need to be to a reference list of formats that could be derived from the ‘type’ and ‘version’ fields	Yes	?
Name of File Type	The name of the file type. e.g. TIFF HTML Proprietary Collection	A reference list of file formats would need to be maintained; formatting metadata may be required to understand how logical access to the bit-stream takes place. Collection would indicate this particular package contains a collection of files	Yes	Yes
Version of File Type	The version of file type	The reference list of file formats would include the differences between different versions of file types
Number of files	The total number of files that comprise the archive object
Total file size	The total size (in KB, MB etc,) of the files that comprise the archive object.	This would give an indication as to how much file space this resource would require to host, and how much data would be transferred from the archive.
Documentation	A link to the documentation about the file format (should be derived from type and version).	The link would need to be to a reference list of formats that could be derived from the ‘type’ and ‘version’ fields
File Modality	This is a higher level of description, which is similar to the Dublin Core ‘Type’ element, i.e. text, graphic, sound, video, etc. Adapted from the Pittsburgh metadata set.	If a type of encoding is not initially recognised, one could be made aware of its generic type so that exploratory work can be done to assess the feasibility of accessing the file.
Character Set	The character set which the data uses. (e.g. ASCII, Unicode)	For text-based data only
Associated DTD	The Document Type Definition which the text requires for interpretation. The value would be the CRID of the DTD.	This is only required for text-based data which contain markup tags; i.e. SGML, HTML, XML.
Encoding		For text-based data only
Resolution	The graphics resolution of an image; e.g. 600dpi	For graphics data only
Colour	The colour depth of an image; e.g. 1-bit (black and white), 16-bit	This will allow users to be aware of the minimum requirement for viewing graphics.	Yes	No	Possibly from metadata within graphics files, or from pre-ingest information supplied
Colour Management	Identify system, if any, that is used to improve consistency of color across capture, display and output of an image. EXAMPLES: Photo CD; OptiCal (color management system); Profile/80 (color sync profile maker); Softproof (Photoshop Plugin)	For graphics data only	No	No
Colour Bar/Grey Scale Bar	This links to a reference image which would allow colour comparison to take place; e.g. Kodak Q13 or Q14 Colour Separation Guide and Grey Scale; Kodak Q60 Colour Input Target	An image of the colour bar may be required within the archive, in order for comparisons to be made. For graphics data only	No	No	This would need to be supplied by the depositor, at pre-ingest stage.
Sampling Frequency	The sampling frequency at which audio was sampled: e.g. 8kHz, 44.1kHz	This is only required for audio data	Yes	No
File Encoding	The type of encoding used to create a digital audio file. For example, AIFF, AU, WAV.	This may be the same as file format name. This is only required for audio data.	Yes	No
Bits	The number of bits each audio sample contains; e.g. 8-bit, 16-bit.	This is only required for audio data.	Yes	No
Mono/Stereo	Self-explanatory.	This is only required for audio data.	Yes	No
Application-Dependency	This is used to indicate whether or not a file or collection is dependent on a specific application. This will be a Yes/No value.	Where some resources are given in a stand-alone format, the component data files may be in a proprietary format which requires a specific viewer and is not easily converted.	Yes	Yes
Application Name	The name of the application which a file or collection is dependent on; e.g. OpenText, DynaText, MS-Word.	In some collections, an application may not be required to view the files which comprise a collection, but may be required to achieve the functionality of the collection. This would mean while the collection is not dependent on the application, it can be noted that added value can be obtained from it using a particular application. This can inform future preservation decisions.	Yes	?
Application Version	Version of the application which the file or collection is dependent on (e.g. version 5.0)		Yes	No
Notes	Text information which is of importance to the successful use of the application; e.g. "Versions of OpenText later than 5.0 are unable to handle SGML files"	This can also be used to describe the functionality which the application adds to the collection.	Yes	No
Software Environment Dependency	Indicates whether the collection is dependent on a particular operating system. Value: yes/no	Dependencies on particular operating systems and environments.	Yes	Yes
Software Environment Name	Name of the operating system which is required for using the digital archive (e.g. Windows, Macintosh, UNIX, none)	In the absence of a particular operating system, an emulator might be used to facilitate use. This could be considered an application dependency, although this matching could be a distinct function, for ease of maintenance.	No	Yes
Software Environment Version	The version of the operating system in question: e.g. Windows 3.1, 95, 98, Mac System 7, System 8.1, AIX, IRIX	A list of operating systems and versions could be kept. This may also help with keeping track of emulator compatibility.	No	Yes
Hardware Dependency	Indicates if any specific hardware is required for the successful use of the resource (yes/no).	This may be audio or visual minimum requirements. However, since CEDARS is not addressing time-based media (i.e. audio and video), this may not be as essential as previously thought:	Yes	Yes
Hardware Name	Name of the hardware device required: e.g. Mouse, Joystick, Monitor	The examples here are trivial, but could include a specific device	Yes	No
Hardware Type	Type of hardware device; e.g. Pointer Device, Colour Monitor		Yes	No
Hardware Description	Description of why hardware is required. (e.g. mouse needed to interact with software; images served are large, high-quality colour, etc.)	The description of the hardware dependency will explain why it has been critical to list it. Some value judgement will need to be made as to what can be regarded as superfluous (e.g. monitor, mouse)	Yes	No
Specification	Provides specification of required hardware; e.g. suggested minimum size 17", refresh rate 75Hz, dot pitch 0.28mm		Yes	No
Modality Served	Describes the type of resource the hardware is facilitating; e.g. text, graphics, sound, video, input		No	Yes
Compression	Have files been compressed; yes/no value	With proprietary file formats, it may not be possible to know this.	No	No
Name	Name of type of compression used (e.g. TIFF-LZW, ZIP, TAR)		Yes	No
Method	Description of parameters used, package and process used, etc. e.g. "compressed to 80% quality JPEG files using GraphicConverter on Macintosh as a batch process"		Yes	No
Required uncompression package	Name of software package required to decompress the files; e.g. graphics viewer package, PKUNZIP, etc.	This would be another instance of a file dependency.
Location	location of file on archive system - an identifier (i.e. CRID)
Preservation Description Information
PDI: Context Information
Element	Examples/Definitions	Comments	Req.	Rep.	Source
Identifier Name	Name of type of identifier; e.g. CRID, URL, ISBN, Author, Title, etc.	Preference must be given to persistent and unique identifiers. Internal references may be used within a package which do not correspond to newest incarnation; original filenames may be useful in these cases.
Identifier Value	Value of identifier for this package.
Level of granularity	This has been identified as an important item of metadata. Describing this will vary according to collections, and so needs to be defined on a case-by-case basis.	(this could be defined as part of the archival schema; examples need to be worked upon using the exemplars we have)	Yes	Yes
Related item	Items related to this one: e.g. list of CRID/URNs of items		No	Yes
Relation to item	Description of the relation to an item: e.g. it is an element/component of it	Describes a hierarchy/network of relations: we may decide that a textual description at collection level may suffice, if this is too much to model.	No	Yes
Is item required?	Will the object in question make sense without it, e.g. links required to understand object within collection. Yes/No value.
History: Submission: Date	Date when the digital preservation copy was created; should follow ISO 8601 date format		Yes	No
History: Change: Formatting:
The RLG metadata refers to "Change History": changes made to the formatting of the data under a migration strategy, or to decisions regarding its distinct preservation. If the Current File/Format fields have a "no" value, then information about the file/format type (and version) converted to, the reason for conversion, the date converted, and the conversion agent is also captured. It can be envisaged that this information will be retained as conversions take place over time. This allows a history trail of files/formats to be kept.
FormattingChange: date	Date of change, in ISO 8601 format (CCYY-MM-DD). Example: 1985-04-12		No	Yes
FormattingChange: previous format	The previous format which data was stored in.	This may be a redundant field; if you have date and formats held, this could be inferred through a query.	No	Yes
FormattingChange: process	Process undergone to change formats.		No	Yes
FormattingChange: reason	Reason for changing format.	This could be a change in storage policy, or file format in danger of being outdated.	No	Yes
FormattingChange: conversion agent	Details of who carried out the work.		No	Yes
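The comment on "previous format" suggests that the field could be inferred from the dates and formats already held. A minimal sketch of that idea follows. The class name, field names, `new_format` attribute and sample values are assumptions made for illustration and are not part of the Cedars element set.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class FormattingChange:
    """One entry in a hypothetical format change history."""
    changed_on: date        # FormattingChange: date (ISO 8601)
    new_format: str         # format converted to (assumed field)
    process: str            # FormattingChange: process
    reason: str             # FormattingChange: reason
    conversion_agent: str   # FormattingChange: conversion agent


def previous_format(history: List[FormattingChange],
                    original_format: str,
                    when: date) -> Optional[str]:
    """Return the format in use immediately before the change dated `when`."""
    current = original_format
    for change in sorted(history, key=lambda c: c.changed_on):
        if change.changed_on == when:
            return current
        current = change.new_format
    return None  # no change is recorded on that date


# Illustrative data only.
history = [
    FormattingChange(date.fromisoformat("1985-04-12"), "TIFF",
                     "batch conversion", "storage policy change", "archive staff"),
    FormattingChange(date.fromisoformat("1999-02-25"), "JPEG",
                     "batch conversion", "format in danger of being outdated",
                     "archive staff"),
]
print(previous_format(history, "PICT", date(1999, 2, 25)))  # prints TIFF
```

If the history is always held in full, the stored "previous format" value is redundant, as the comment notes. Holding it explicitly may still be useful if earlier entries could be lost.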
Preservation Strategy History
History: Change: Preservation Strategy:	This would link to changes in policy which would affect the preservation strategy.		No	No
Removal Authority	Defines who is authorised to purge an item from the archive system, e.g. the Oxford Text Archive.	This must point to a user ID within Cedars.	No	No
Retention Policy Citation	Textual information on the organisation’s policy for record retention, e.g. link to relevant document or inclusion of relevant text in metadata. Could be a URI.	This is similar to PANDORA: Decision Rule (refers to the decision for keeping a record archived)	No	No
Retention Authority Issuance	Textual information on legislative/government laws/regulations for record retention	This may not be necessary for Cedars frameworks, as this is from an evidential requirements framework.	No	No
Retention Period End Time	Date which indicates the end of the retention period (e.g. 2012-03-31 or indefinite), in ISO 8601 format	Again, this is from evidential requirements, and may not be necessary for Cedars.	No	No
Selection Status	Status of item’s selection within archive (e.g. withdrawn, selected, not selected)	Allows us to keep track of what has been withdrawn from archive.	No	No
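Since the Retention Period End Time may hold either an ISO 8601 date or the value "indefinite", software reading the field has to handle both cases. The sketch below shows one way this might be done. The function names are hypothetical, and treating "indefinite" as `None` is an assumption.

```python
from datetime import date
from typing import Optional


def parse_retention_end(value: str) -> Optional[date]:
    """Parse a Retention Period End Time; None stands for 'indefinite'."""
    text = value.strip()
    if text.lower() == "indefinite":
        return None
    # Raises ValueError if the text is not a valid CCYY-MM-DD calendar date.
    return date.fromisoformat(text)


def retention_expired(value: str, today: date) -> bool:
    """True only when a definite end date lies before `today`."""
    end = parse_retention_end(value)
    return end is not None and today > end
```

A value whose month or day is out of range is rejected rather than silently accepted, which gives a simple validity check at submission time.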
Usage
Use-Type	Description of usage of the item - e.g. viewed, copied, edited, filed, indexed, classified, sent, disposed – identifies usage types permitted.	This will require a set of keywords to be identified (?)	No	Yes
Use Instance Time	Notes when the data was used (ISO 8601), e.g. 1999-02-09T15:45.		No	Yes
Use Instance User	who used the data – although this could be just at the level of organisation; e.g. user ID number, guest, email of user	The Use Evidential Consequences appears to be more related to the evidential focus of the Pittsburgh project. Here preservation is more concerned with frequency of usage. While this may not be the sole determinant of preservation strategy, it may be of interest (e.g. if usage is a determinant of the continuing preservation of the objects). Similarly, at what level would users be defined (individuals, groups, geography, anonymous? Need to check with Rights (Content Issues) Working Group).	No	Yes
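If frequency of usage is to inform preservation decisions, the three Usage elements could be held as simple records and summarised. The following is only a sketch: the record layout, user identifiers and sample values are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical usage records: (Use-Type, Use Instance Time, Use Instance User)
usage_log = [
    ("viewed", "1999-02-09T15:45", "guest"),
    ("copied", "1999-02-09T16:10", "user-0042"),
    ("viewed", "1999-02-10T09:30", "user-0042"),
]


def usage_frequency(log):
    """Count how often each Use-Type occurs in the log."""
    return Counter(use_type for use_type, _, _ in log)


def uses_since(log, since):
    """Count uses at or after the ISO 8601 date or date-time `since`."""
    cutoff = datetime.fromisoformat(since)
    return sum(1 for _, when, _ in log
               if datetime.fromisoformat(when) >= cutoff)


print(usage_frequency(usage_log)["viewed"])  # prints 2
print(uses_since(usage_log, "1999-02-10"))   # prints 1
```

The user field is kept as an opaque string here, so the question of the level at which users are defined (individual, group, anonymous) is left open.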
Format Processing
Records the formal states a tangible format undergoes before and/or during each time its content is made available to a user; e.g. testing by installation and virus checking, conversion from TIFF to JPEG format, rendering from SGML to HTML.
Process Type	The processes a given tangible format may undergo
Process Agent	Documents which particular utilities have been used (e.g. Dr Solomons, Norton, TIFF2JPG utility, MS PhotoShop)
Publisher Preferred Browser	Recommended version of browser (e.g. Netscape 4.0, JPEG viewer)	Format processing would need to be monitored; a "technology watch" facility would be useful to update this. Also, changes in format imply changes required in format processing.
PDI: Provenance Information
This documents the history of the Content Information; i.e. who has had custody of the Content Information since it was originated, and what its source was. This would include information about the processes that generated it. If the content was not "born digital", information about digitisation processes could be held here. Custodial information is also likely to refer to the rights implications regarding access to the content. Rights metadata issues will be examined further by the Cedars CIWG.
Element	Examples/Definitions	Comments	Req.	Rep.	Source
Publisher Information
Publisher Name	The name of a publisher. e.g. Oxford University Press	Publisher in this context could mean information provider.
Publisher Contact Details	Contact details (address, email, fax, telephone number) for the publisher	CIWG input needed
Rights Management	Text field or link to rights management information
Access Profile
Access Rights
Capture Procedure
Capture Procedure Type	Describes how the archived object was originally created.
Gathering Frequency	Sampling techniques (e.g. weekly, monthly, quarterly, annually, one-off)
Gathering Regime	(e.g. the frequency and extent of capture of a publication)
Rights information
Event Type	Type of event in the rights acquisition process (e.g. Publication selected, No publisher response after x weeks, Publisher Permission obtained, Null publisher response)	This was taken from the PANDORA model; needs input from CIWG.
Negotiation Status	Type of event in the rights negotiation process (e.g. First Contact Made, Follow-up contact made, Agreement Finalised, On hold pending further negotiation, Archived without agreement, De-selected)	Note: event type and negotiation status are interlinked; the PANDORA model examples are cited. Organisation (body which negotiates on behalf of publisher – e.g. ASCAP in the case of musicians)
Copyright Warning	Information on the copyright of the item.	Requires input from CIWG
Publisher Copyright Statement	Statement from publisher regarding copyright.	Requires input from CIWG
User Class	(e.g. from a list of designated user types – this would need to be developed in conjunction with the archive(s) we are working with and the Rights group)	Requires input from CIWG
Restriction Status	(T&C – e.g. restricted only to archivists, i.e. no public access)	Requires input from CIWG
Access Conditions	(T&C – e.g. may only be accessed by universities)	Requires input from CIWG
Use Conditions	(T&C – e.g. only for research purposes)	Requires input from CIWG
PDI: Fixity Information
This documents authentication mechanisms, and provides any authentication keys which may be needed, for example: validation keys, encryption, digital watermarks.
Element	Examples/Definitions	Comments	Req.	Rep.	Source
Authentication type	The authentication mechanism in use: e.g. digital watermark	Needs more research.
Authentication key	The authentication key (if appropriate)
Authentication value	The value e.g. a hash value
Authentication documentation	Information on, or a link to information on, the authentication mechanism in use
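As an illustration of how the four fixity elements might be filled in when the authentication value is a hash, the sketch below computes a message digest with Python's standard `hashlib` module. The choice of SHA-256, the dictionary keys and the function names are assumptions; the document itself notes that this area needs more research.

```python
import hashlib


def fixity_record(data: bytes, algorithm: str = "sha256") -> dict:
    """Build a hypothetical fixity entry for a byte stream."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {
        "authentication_type": "message digest (" + algorithm + ")",
        "authentication_key": None,  # a plain hash needs no key
        "authentication_value": digest,
        "authentication_documentation": "Python hashlib documentation",
    }


def verify(data: bytes, record: dict, algorithm: str = "sha256") -> bool:
    """Recompute the digest and compare it with the stored value."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == record["authentication_value"]


record = fixity_record(b"archived content")
print(verify(b"archived content", record))  # prints True
print(verify(b"altered content", record))   # prints False
```

A keyed mechanism, such as a digital signature, would also populate the Authentication key element, which is left empty here.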
Packaging Information
This is defined as information which, logically or actually, binds and relates the components of the package into an identifiable entity on specific media. For example, on a CD-ROM, packaging information may include the ISO 9660 volume/file structure. These choices are the subject of local archive definitions or conventions. The "Granularity" information in PDI-Reference may have relevance in this section. In this way, the binding of content and functionality can be expressed and be kept
Descriptive Information
This could be generally derived from the Content Information and PDI – so file identification/location, and other descriptive information will be replicated here. There may also be a role for metadata in existing schemes (MARC, EAD, etc) or in core formats like Dublin Core.
Element	Examples/Definitions	Comments	Req.	Rep.	Source
Title	Title of archived information object		Yes	Yes
Author/Creator	Creator of intellectual content of the archived information object		Yes	Yes
Description	Brief textual description of the archived information object.	This could contain an abstract.	Yes	Yes
Contributor	Other contributors to the intellectual content of the archived information object.		Yes	Yes
Date	Date that object was made available in its current form. ISO 8601 format.	This could be problematical – there will be a need to define specific types of dates.	Yes	Yes
Language	Language of resource. Probably in NISO Z39.53.		Yes	Yes
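Every element in the table above is marked as both required and repeatable, so a record could hold a list of values per element and be checked for completeness. A minimal sketch, in which the dictionary layout and sample values are assumptions for illustration:

```python
# Element names from the Descriptive Information table; all are required.
REQUIRED_ELEMENTS = [
    "Title", "Author/Creator", "Description",
    "Contributor", "Date", "Language",
]


def missing_elements(record: dict) -> list:
    """Return the required elements that are absent or have no values."""
    return [name for name in REQUIRED_ELEMENTS if not record.get(name)]


# Illustrative record; each element holds a list because it is repeatable.
record = {
    "Title": ["Example archived object"],
    "Author/Creator": ["A. N. Author"],
    "Description": ["Brief abstract of the object."],
    "Date": ["1999-02-25"],
    "Language": ["eng"],
}
print(missing_elements(record))  # prints ['Contributor']
```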

6. References

Consultative Committee for Space Data Systems, 1998, Reference Model for an Open Archival Information System (OAIS), ed. L. Reich and D. Sawyer. CCSDS 650.0-W-4.0. White Book, Issue 4, 17 September.
Latest version available from: <URL:http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html>

Day, M.W., 1998, Metadata for Preservation. CEDARS Project Document AIW01. <URL:http://www.ukoln.ac.uk/metadata/cedars/AIW01.html>

7. Document History

Version	Date	Comments
Version 1	25-Feb-1999	First Draft by Andy Stone and Michael Day.

Cedars is a CURL Project funded by the Joint Information Systems Committee through its Electronic Libraries Programme (eLib).

Created and maintained by: Andy Stone (OUCS) and Michael Day (UKOLN) of the Cedars Project.
Page created: 25-Feb-1999.
Last updated: 05-Mar-1999.