Cedars Access Issues Working Group


Cedars Preservation Metadata Elements

Cedars Project Document AIW02

Andy Stone
Oxford University Computing Service, 13 Banbury Road, Oxford, OX2 6NN, UK.
andy.stone@computing-services.oxford.ac.uk
http://www.oucs.ox.ac.uk/

Michael Day
UKOLN: The UK Office for Library and Information Networking, University of Bath, Bath BA2 7AY, UK.
m.day@ukoln.ac.uk
http://www.ukoln.ac.uk/

Cedars Document Number: AIW02  Status: Current  Created: 25-Feb-1999  Updated: 25-Feb-1999  Availability: Project Only 

Contents:

1 Introduction
2 The OAIS model
3 The draft Cedars architecture
4 Cedars metadata elements
    4.1 Content Information
    4.2 Representation Information
    4.3 Preservation Description Information
    4.4 Packaging Information
    4.5 Descriptive Information
5. Cedars Metadata Elements (First draft) - Tabular Format
6. References
7. Document History

1 Introduction

This document builds upon the work carried out for Cedars Project Document AIW01 Metadata for preservation (Day 1998). That report identified and outlined a number of metadata initiatives and projects with relevance to digital preservation. This document synthesises metadata elements taken from some of these initiatives and places them in the framework of the Reference Model for an Open Archival Information System (OAIS) published by the Consultative Committee for Space Data Systems (1998).

This is a first attempt at identifying preservation metadata elements for the Cedars project within the general framework provided by the OAIS model.

The metadata schemes most used in this document are:

2 The OAIS model

The OAIS model identifies what it calls a Taxonomy of archival information object classes that includes the following types of data:

Using this structure should make it easier to identify the components of the different Information Packages which OAIS refers to (which reflect actions at different stages of an archive). Metadata elements detailed in the PANDORA data model, the Pittsburgh metadata requirements document and the RLG Working Group report were mapped on to the OAIS taxonomy in order to help identify specific metadata elements.

3 The draft Cedars architecture

The metadata specification will be an important part of the Cedars paper demonstrator. It will therefore need to fit into the general framework of the proposed Cedars demonstrator architecture. David Holdsworth’s draft architecture outlines two essential outcomes of a preservation demonstrator:

Holdsworth has already noted the requirement for unique identifiers and has proposed a Cedars Reference Identifier (CRID).

Metadata, therefore, has several distinct roles within the Cedars architecture:

Resource discovery within the Cedars demonstrator. Finding the preserved digital object will be the function of the Descriptive Information (in OAIS terms). For the demonstrator, this could take one of several different forms: MARC, EAD, etc. For the purposes of this document the Descriptive Information will conform to simple Dublin Core as defined in RFC 2413.

Retrieval from the Cedars demonstrator. Accessing the preserved digital object will depend upon Representation Information (in OAIS terminology) or structural metadata. The Representation Information (RI) may need to contain the whole specification for the format in which the object is ‘encoded’, e.g. for TIFF images or PDF files together with specifications of software environments (operating systems) and hardware. It is expected that the Cedars Preservation Strategies Working Group (PSWG) will contribute to the specification of RI. In this document, RI will just document relevant formats, operating environments and hardware.

Rights. Rights management will be an important part of any preservation demonstrator and could be used to manage access to objects stored in the archive. In OAIS terms, rights metadata are part of the Provenance Information of the Preservation Description Information (PDI). Rights metadata may need to be kept separately from the rest of the PDI within the Cedars demonstrators. It is expected that the Cedars Content Issues Working Group will contribute to the specification of rights metadata within the context of the paper demonstrator.

Preservation description. The remainder of the PDI gives information about References (chiefly identifiers), Context, and Fixity. Fixity Information is concerned with the authentication of the archived object and will need to be the subject of further study. Context Information deals with relationships between (and within) resources, granularity issues and will document the history of an archived object with relation to format conversions, preservation strategies and usage. Further Provenance Information within the PDI might contain information about the context of creation of the archived object and the digitisation capture process.

Note that an Archive Information Package (AIP) will comprise the Digital Object together with its RI and PDI.
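As a rough illustration of the note above, the following Python sketch models an AIP as a Digital Object bundled with its RI and PDI. The class names, field names and the CRID value are hypothetical and are not taken from the OAIS model or from any Cedars specification; the sketch only shows how the three parts might be held together.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RepresentationInformation:
    # In this document RI just records formats, operating environments
    # and hardware; the field names here are assumptions.
    file_format: str
    software_environment: str = "none"
    hardware: List[str] = field(default_factory=list)


@dataclass
class PreservationDescriptionInformation:
    # The four PDI categories discussed in section 4.3.
    reference: Dict[str, str] = field(default_factory=dict)
    context: Dict[str, str] = field(default_factory=dict)
    provenance: Dict[str, str] = field(default_factory=dict)
    fixity: Dict[str, str] = field(default_factory=dict)


@dataclass
class ArchiveInformationPackage:
    digital_object: bytes
    ri: RepresentationInformation
    pdi: PreservationDescriptionInformation


# Illustrative values only.
aip = ArchiveInformationPackage(
    digital_object=b"raw bytes of the archived file",
    ri=RepresentationInformation(file_format="TIFF",
                                 software_environment="Windows 95"),
    pdi=PreservationDescriptionInformation(
        reference={"CRID": "example-crid-0001"}),
)
```

Keeping the PDI categories as separate mappings would also make it straightforward to store rights metadata apart from the rest of the PDI, should that prove necessary.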

The following list of elements is intended to encourage discussion about the metadata required by the Cedars demonstrators and will lead to the eventual production of a Cedars metadata specification. The elements are part-derived from metadata schemes with a different focus to Cedars.

A proposed metadata record for one resource, the Wife of Bath’s Prologue on CD-ROM, will be added to aid discussion.

4 Cedars metadata elements

4.1 Content Information

Content Information identifies the bits comprising the primary Digital Object; i.e. it is the raw file. This may consist of:

Other information regarding the Digital Object can be found under the Representation Information and Preservation Description Information (PDI).

4.2 Representation Information

Representation Information is defined by the OAIS model as "information that maps data into more meaningful concepts", and as "accompanying a digital object, or sequence of bits … to provide additional meaning". In general, it can be said that "Data interpreted using its Representation Information" yields information.

It should be noted that Representation Information stored in digital form will require Representation Information of its own to be understood. This recursion can be broken in one of two ways: through a physical document, or through software. The latter is felt to be a risky option, since there is no guarantee that such software will continue to exist. This identifies a need for a two-pronged approach: indicating the required software (and, potentially, hardware emulators) to view files; and providing physical documentation on standards for software formats, particularly in cases where they may be in decline.
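The recursion described above can be pictured as a chain of RI records, each described by the next, which ends only when a physical document or a piece of software is reached. The sketch below is a minimal illustration under that assumption; the class, the `form` values and the example names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RepresentationInfo:
    name: str
    form: str  # "digital", "physical" or "software" (assumed values)
    described_by: Optional["RepresentationInfo"] = None


def resolve_chain(ri: RepresentationInfo) -> List[str]:
    """Follow RI links until a physical document or software breaks the recursion."""
    chain: List[str] = []
    seen = set()  # guards against circular references
    current: Optional[RepresentationInfo] = ri
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        chain.append(current.name)
        if current.form in ("physical", "software"):
            return chain
        current = current.described_by
    raise ValueError("chain does not end in a physical document or software")


# Illustrative chain: a digital specification described by a printed manual.
printed = RepresentationInfo("printed PDF reference manual", "physical")
spec = RepresentationInfo("TIFF specification held as a PDF file", "digital",
                          described_by=printed)
chain = resolve_chain(spec)
```

A chain that ends in digital RI with nothing describing it raises an error, which is one way an archive could detect objects whose interpretation is not yet anchored.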

There are varying levels of representation: collections may come with additional levels of functionality; although a description may be preserved of this, the implementation may or may not also be preserved.

Since there will be a long-term issue of format migration, this could be devolved to a dedicated "technology watch unit" which can identify and recommend migration action.

Proposed RI Elements:

4.2.1 Language

For content, an international standard such as NISO Z39.53 could be used. This includes older languages, which may be useful for textual resources (e.g. in OTA). Available: <URL:http://www.oasis-open.org/cover/nisoLang3-1994.html>

4.2.2 File Information:

There are two main strands to this; the physical storage format used (called "file format" in Pittsburgh), and the type of file formatting ("file rendering" in Pittsburgh) used.

Storage format: Type. Examples would be diskette (3.5, 5.25 inch), CD-ROM, information server, HFS path, etc. This may be important since the underlying software may require the emulation of mounting diskettes, for example.
Storage format: Version
Storage format: Specification (link to reference resource; should be derived from type and version)

Type of file format:
Name of File Type (e.g. TIFF)
Version
Number of files
Total file size
Documentation (link to reference resource; should be derived from type and version)

A list of file formats would need to be maintained; formatting metadata may be required to understand how logical access to the bit-stream takes place.
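The Documentation and Specification links are described above as derivable from the type and version fields. One way to read this is as a lookup in a maintained reference list of formats. The sketch below assumes such a list exists as a simple mapping; the registry contents, the example.org links and the function name are hypothetical placeholders.

```python
from typing import Optional

# Hypothetical reference list of file formats, keyed by (name, version).
# The entries and links are placeholders, not a real Cedars registry.
FORMAT_REGISTRY = {
    ("TIFF", "6.0"): "https://example.org/formats/tiff-6.0",
    ("HTML", "4.0"): "https://example.org/formats/html-4.0",
}


def documentation_link(file_type: str, version: str) -> Optional[str]:
    """Derive the Documentation link from the type and version fields."""
    return FORMAT_REGISTRY.get((file_type.upper(), version))


link = documentation_link("tiff", "6.0")
missing = documentation_link("PDF", "1.2")  # not in the list, so None
```

A missing entry is itself useful information: it flags a format for which the archive holds no reference documentation.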

File Modality (from Pittsburgh)

A higher level of description, similar to Dublin Core ‘Type’, i.e. text, graphic, sound, video, etc.

If a type of encoding is not initially recognised, one could be made aware of its generic type so that exploratory work can be done to assess the feasibility of accessing the file.

4.2.3 Modality-Specific Metadata

There will be format-specific technical information. For example:

4.2.3.1 Text-based data:

Character Set (e.g. ASCII, Unicode)
Encoding
Associated DTD (for SGML/XML documents)

4.2.3.2 Graphic data

The following have been modified from RLG:

Resolution (e.g. 600dpi)
Colour (e.g. 16-bit)
Colour Management
Colour Bar/Grey Scale Bar

4.2.3.3 Audio data

CEDARS is not addressing multimedia, but examples might include:

Sampling Frequency (e.g. 8kHz, 44.1kHz)
File Encoding (e.g. AIFF, AU, WAV – will require associated package to play back file)
Bits (e.g. 8-bit, 16-bit)
Mono/Stereo

4.2.4 Dependencies (adapted from Pittsburgh)

Resources may have dependencies on particular applications, software environments and hardware. These dependencies will, where appropriate, need to be recorded.

Application Dependency (e.g. Yes/No)
Application Name (e.g. OpenText, MS-Word)
Application Version (e.g. version 5.0)
Notes (e.g. "Versions of OpenText later than 5.0 are unable to handle SGML files")

Application dependencies could also refer to Plug-ins required, as mentioned by PANDORA’s "Online Format" field.

[Note: Is it within the remit of CEDARS to maintain the location of the applications required? At the librarians’ focus group in December, the notion of an application repository was mentioned. This would imply that applications would have their own identifiers (and consequently locations) so they could be accessed].

Software Environment Dependency

Dependencies on particular operating systems and environments.

Software Environment Name (e.g. Windows, Macintosh, UNIX, none)
Software Environment Version (e.g. Windows 3.1, 95, 98, Mac System 7, System 8.1, AIX, IRIX)

Hardware Dependency

This may be audio or visual minimum requirements. However, since CEDARS is not addressing time-based media, this may not be as essential as previously thought:

[Note: This could also be used to record dependencies on particular hardware, e.g. a PC with 486 processor (minimum)?]

Hardware Name (e.g. Mouse, Joystick, Monitor)
Hardware Type (e.g. Pointer Device, Colour Monitor)
Hardware Description (e.g. mouse needed to interact with software; images served are large, high-quality colour, etc.)
Specification (e.g. suggested minimum size 17", refresh rate 75Hz, dot pitch 0.28mm)
Modality Served (e.g. text, graphics, sound, video)

4.2.4.5 Compression

Name (e.g. TIFF-LZW, ZIP, TAR)
Method
Required uncompression package (e.g. graphics viewer package, PKUNZIP, etc)

If the uncompression package is not available, a reference to its specification will be needed.
If an application repository exists, compression/uncompression packages should be included in this.

4.2.4.6 Location

The location of the file on the archive system - an identifier (i.e. CRID)

4.3 Preservation Description Information

4.3.1 Reference Information

This identifies and (if necessary) describes one or more mechanisms used to provide assigned identifiers for the Content Information. It also provides identifiers which allow outside systems to refer unambiguously to particular Content Information.

Identifier Name (e.g. CRID, URL, ISBN, Author, Title, etc.)
Identifier Value

Examples: CEDARS Identifier (a unique identifier within the CEDARS scheme - e.g. the CRID); Internet URIs or URNs; ISBN/ISSN numbers (for electronic texts where applicable). Preference must be given to persistent and unique identifiers.
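The preference for persistent and unique identifiers could be expressed as a ranking over Identifier Name values. The sketch below is one possible reading; the ranking itself, the function name and the example values are assumptions made for illustration and have not been agreed by the project.

```python
from typing import Dict, List, Tuple

# Hypothetical ordering: a lower rank means more persistent and unique.
PREFERENCE: Dict[str, int] = {"CRID": 0, "URN": 1, "ISBN": 2, "ISSN": 2, "URL": 3}


def preferred_identifier(identifiers: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (Identifier Name, Identifier Value) pair with the best rank.

    Names not in PREFERENCE (e.g. Author, Title) rank last.
    """
    if not identifiers:
        raise ValueError("no identifiers supplied")
    return min(identifiers, key=lambda pair: PREFERENCE.get(pair[0], len(PREFERENCE)))


# Illustrative values only.
identifiers = [
    ("Title", "The Wife of Bath's Prologue"),
    ("URL", "http://example.org/wbp"),
    ("CRID", "example-crid-0001"),
]
best = preferred_identifier(identifiers)
```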

4.3.2 Context Information

Defined as the relationships of the content information to its environment. This includes background information on why the Content Information was created, and how it relates to other Content Information objects elsewhere.

The main metadata types refer to: granularity, formatting history, preservation strategy history, usage, and format processing.

4.3.2.1 Granularity

Different types of archive will have different levels of granularity, e.g. in ILEJ, there will be publication level, issue level and page level. It needs to be clear which one is meant. PANDORA refers to Publication Identifier and Copy Identifier. There may be a role for the Dublin Core Relation element.

Level of granularity (this could be defined as part of the archival schema; examples need to be worked upon using the exemplars we have)
Related item (e.g. list of CRID/URNs of items)
Relation to item (e.g. it is an element/component of it)
Is item required? (e.g. will the object in question make sense without it, e.g. links required to understand object within collection)
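The four granularity elements above could be recorded per related item. The following sketch assumes one record per relation and uses invented identifiers and granularity labels; it is meant only to show how the "Is item required?" flag might be queried.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RelatedItem:
    identifier: str  # CRID or URN of the related item (invented values below)
    relation: str    # e.g. "is component of"
    required: bool   # will the object make sense without this item?


def required_items(relations: List[RelatedItem]) -> List[str]:
    """List the identifiers the object cannot be understood without."""
    return [r.identifier for r in relations if r.required]


# A page-level object and its relations (illustrative only).
level_of_granularity = "page"
relations = [
    RelatedItem("example-crid-issue-12", "is component of", required=True),
    RelatedItem("example-crid-page-3-thumbnail", "has thumbnail", required=False),
]
needed = required_items(relations)
```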

Online data, while usually output as text-based files read by a client (e.g. Web browser), may only be accessible via server-based programs. For example, the ILEJ collection is available via the Web, with a variety of searchable functions. If it was felt important for this functionality (or part of it) to be ‘preserved’, there would be a trade-off between any completeness of functionality and cost. (Note: this observation also refers to dependency on software to provide FUNCTIONALITY which has already been covered – this is distinct from the granularity of the digital object files)

Additionally, the ILEJ and Oxford Early Manuscripts projects both raised issues regarding associated files. The uncompressed scanned TIFF file of a manuscript page may not be appropriate, nor is the full context of the digital artefact known. There needs to be the facility to link to the relevant files in order for a contextual understanding to be made (e.g. ILEJ may link the OCR text file with the scanned image and thumbnail)

In instances such as these, the judgement of what constitutes the digital object is important.

Perhaps each archive needs to document this complete functionality, recording whether functional instances of it (e.g. specific CGI search routines) have been preserved and stating how these were used.

4.3.2.2 Formatting History

History: Submission:
RLG: Date - when the digital preservation copy was created.
History: Change:
The RLG metadata refers to "Change History" - changes made to the formatting of the data under a migration strategy, or to decisions regarding its preservation.
History: Change: Formatting:
A value known as "Current File/Format" was in a previous draft; however, this is redundant if the submission event is recorded as the first change – if this is a digitisation process, the original medium can be

If the Current File/Format fields have a "no" value, then information about the file/format type (and version) converted to; reason for conversion; date converted; and conversion agent is also captured. It can be envisaged that this information will be retained as conversions take place over time. This facilitates a history trail to be kept of files/formats.

FormattingChange: date
FormattingChange: previous format (e.g. paper master, CD-ROM, HTML)
FormattingChange: process (e.g. scanning, HFS backup, conversion to XML)
FormattingChange: reason (e.g. creation of digital master, change in storage policy, archive decides to convert HTML docs into XML)
FormattingChange: conversion-agent (e.g. OUCS, ILEJ Project Officer)
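If each change records the format converted to, the "previous format" of any change can be inferred from the ordered history trail rather than stored. The sketch below illustrates this under that assumption; the field names, dates and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class FormattingChange:
    changed_on: date
    converted_to: str
    process: str
    reason: str
    conversion_agent: str


def format_transitions(original_format: str,
                       history: List[FormattingChange]) -> List[Tuple[str, str]]:
    """Infer (previous format, new format) pairs from the dated trail."""
    ordered = sorted(history, key=lambda change: change.changed_on)
    formats = [original_format] + [change.converted_to for change in ordered]
    return list(zip(formats[:-1], formats[1:]))


# Illustrative trail; dates and agents are made up.
history = [
    FormattingChange(date(1999, 2, 9), "JPEG", "conversion from TIFF to JPEG",
                     "change in storage policy", "OUCS"),
    FormattingChange(date(1998, 6, 1), "TIFF", "scanning",
                     "creation of digital master", "ILEJ Project Officer"),
]
transitions = format_transitions("paper master", history)
```

The last pair in the result also gives the current format, which is why a separate "Current File/Format" field may be unnecessary.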

PANDORA mentions that the physical characteristics of tangible formats will need to be kept in the archive.

4.3.2.3 Preservation Strategy History

History: Change: Preservation Strategy:
Pittsburgh: Disposition Requirements Metadata
Removal Authority - who’s authorised to purge an item from the archive system, e.g. the Oxford Text Archive.
Retention Policy Citation - textual information of organisation’s policy for record retention, e.g. link to relevant document or inclusion of relevant text in metadata. Could be a URI.
This is similar to PANDORA: Decision Rule (refers to the decision for keeping a record archived)
Retention Authority Issuance - textual information on legislative/government laws/regulations for record retention
Retention Period End Time (e.g. 31/12/2015, indefinite)
Also PANDORA: Selection Status (e.g. withdrawn, selected, not selected)
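A Retention Period End Time that may be either a date or the word "indefinite" needs a small amount of interpretation before it can be acted upon. The sketch below assumes the day/month/year form used in the example above; the function name is hypothetical.

```python
from datetime import date, datetime


def retention_expired(end_time: str, today: date) -> bool:
    """Return True once the retention period has ended.

    Assumes dd/mm/yyyy dates, as in the example "31/12/2015";
    "indefinite" never expires.
    """
    if end_time.strip().lower() == "indefinite":
        return False
    end = datetime.strptime(end_time.strip(), "%d/%m/%Y").date()
    return today > end


expired = retention_expired("31/12/2015", date(2016, 1, 1))
still_kept = retention_expired("indefinite", date(2016, 1, 1))
```

An expired period would not by itself trigger removal; that decision would still rest with the Removal Authority.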

4.3.2.4 Usage

Pittsburgh: Use History, breaks down to:

Use Type - e.g. viewed, copied, edited, filed, indexed, classified, sent, disposed – identifies usage types permitted.
Use Instance Time - when data was used, e.g. 09/02/99 15:45.
Use Instance User - who used the data – although this could be just at the level of organisation; e.g. user ID number, guest, email of user.

The Use-Evidential-Consequences element appears to be more related to the evidential focus of the Pittsburgh project. Here preservation is more concerned with frequency of usage. While this may not be the sole determinant of preservation strategy, it may be of interest (e.g. if usage is a determinant of the continuing preservation of the objects). Similarly, at what level would users be defined: individuals, groups, geography, anonymous? (Need to check with the Rights (Content Issues) Working Group.)
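Frequency of usage can be derived from Use History records without keeping a separate counter. The sketch below assumes each record is a (Use Type, Use Instance Time, Use Instance User) triple with day-first timestamps as in the example "09/02/99 15:45"; the records themselves are invented.

```python
from collections import Counter
from datetime import datetime
from typing import List, Tuple

UseRecord = Tuple[str, str, str]  # (Use Type, Use Instance Time, Use Instance User)

# Invented records for illustration.
use_history: List[UseRecord] = [
    ("viewed", "09/02/99 15:45", "guest"),
    ("copied", "10/02/99 09:10", "guest"),
    ("viewed", "11/02/99 11:30", "user-0042"),
]


def parse_use_time(text: str) -> datetime:
    """Parse a day-first timestamp such as '09/02/99 15:45'."""
    return datetime.strptime(text, "%d/%m/%y %H:%M")


def usage_frequency(history: List[UseRecord]) -> Counter:
    """Count how often each Use Type occurs."""
    return Counter(use_type for use_type, _, _ in history)


frequency = usage_frequency(use_history)
first_use = parse_use_time(use_history[0][1])
```

Counting by Use Instance User instead of Use Type would work the same way, once the level at which users are defined has been settled.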

4.3.2.5 Format Processing:

PANDORA: Process - the formal states a tangible format undergoes before and/or during each time its content is made available to user; e.g. testing by installation and virus checking, conversion from TIFF to JPEG format, rendering from SGML to HTML.
PANDORA: Process Type - the processes a given tangible format may undergo.
PANDORA: Process Agent - what utilities have been used (e.g. Dr Solomons, Norton, TIFF2JPG utility, MS PhotoShop)
PANDORA: Publisher-Preferred Browser (e.g. Netscape 4.0, JPEG viewer)

Format processing would need to be monitored; a "technology watch" facility would be useful to update this. Also, changes in format imply changes required in format processing.

4.3.3 Provenance Information

This documents the history of the Content Information; i.e. who has had custody of the Content Information since it was originated, and what its source was. This would include information about the processes that generated it. If the content was not "born digital", information about digitisation processes could be held here. Custodial information is also likely to refer to the rights implications regarding access to the content. Rights metadata issues will be further looked at by the Cedars CIWG.

[We will also need to include some kind of field which indicates if the content should be made visible to archive staff querying the system, in addition to whether end users can access it].

4.3.3.1 Publisher Information

Publisher
Publisher Name (e.g. Oxford University Press)
Publisher Contact Details (e.g. address, email, fax, telephone number)
Rights Management
Access Profile
Access Rights

4.3.3.2 Capture Procedure

[PANDORA deals primarily with Internet publications – I feel this section relates more to ongoing capture of on-line publications, which is a specific instance of an ACQUISITIONS PROCEDURE].

Capture Procedure Type
Gathering Frequency - Sampling techniques (e.g. weekly, monthly, quarterly, annually, one-off)
Gathering Regime (e.g. the frequency and extent of capture of a publication)

4.3.3.3 Rights information

Event Type (e.g. Publication selected, No publisher response after x weeks, Publisher Permission obtained, Null publisher response)
Negotiation Status (e.g. First Contact Made, Follow-up contact made, Agreement Finalised, On hold pending further negotiation, Archived without agreement, De-selected)
Note: event type and negotiation status are interlinked; the examples cited are from the PANDORA model.
Organisation (body which negotiates on behalf of publisher – e.g. ASCAP in the case of musicians)
Copyright Warning
Publisher Copyright Statement
[Rights input from CIWG].
User Class (e.g. from a list of designated user types – this would need to be developed in conjunction with the archive(s) we are working with and the Rights group)
Restriction Status (T&C – e.g. restricted only to archivists, i.e. no public access)
Access Conditions (T&C – e.g. may only be accessed by universities)
Use Conditions (T&C – e.g. only for research purposes)

4.3.4 Fixity Information

This documents the authentication mechanism and provides any authentication keys which may be needed, for example:

This would require the following metadata items:

Fixity-type (the mechanism, e.g. digital watermark)
Fixity-key (e.g. key value, if applicable)
Fixity-value
Fixity-info-location (e.g. of documentation on the authentication mechanism)
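As an illustration of how Fixity-type and Fixity-value might be populated and checked, the sketch below uses a SHA-256 checksum computed with Python's standard hashlib module. The choice of a checksum is an assumption made for the example (the element list above mentions a digital watermark); no Fixity-key is needed for an unkeyed checksum, so it is omitted.

```python
import hashlib
from typing import Dict

FIXITY_TYPE = "SHA-256 checksum"  # assumed mechanism, not a Cedars decision


def make_fixity(content: bytes) -> Dict[str, str]:
    """Build Fixity-type and Fixity-value for a digital object."""
    return {
        "Fixity-type": FIXITY_TYPE,
        "Fixity-value": hashlib.sha256(content).hexdigest(),
    }


def verify_fixity(content: bytes, fixity: Dict[str, str]) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    if fixity["Fixity-type"] != FIXITY_TYPE:
        raise ValueError("unsupported fixity mechanism: " + fixity["Fixity-type"])
    return hashlib.sha256(content).hexdigest() == fixity["Fixity-value"]


record = make_fixity(b"raw bytes of the archived file")
unchanged = verify_fixity(b"raw bytes of the archived file", record)
altered = verify_fixity(b"raw bytes of the archived file!", record)
```

A checksum of this kind detects accidental or deliberate alteration of the bits, but it does not by itself establish who created the object.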

4.4 Packaging Information

This is defined as information which, logically or actually, binds and relates the components of the package into an identifiable entity on specific media, e.g. on a CD-ROM, packaging info may include the ISO-9660 volume/file structure of a CD-ROM. These choices are the subject of local archive definitions or conventions.

The "Granularity" information in the PDI Context Information may have relevance in this section. In this way, the binding of content and functionality can be expressed and be kept distinct.

4.5 Descriptive Information

This is the metadata used by access aids (which locate, analyse, retrieve, etc. from an OAIS), for example:

Title
Author/Creator
Description
Contributor
Date
Language

This could be generally derived from the Content Information and PDI – so file identification/location, and other descriptive information will be replicated here. There may also be a role for metadata in existing schemes (MARC, EAD, etc) or in core formats like Dublin Core.
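Deriving Descriptive Information from fields already held elsewhere could amount to selecting those that correspond to simple Dublin Core elements. The sketch below filters a flat record against the fifteen Dublin Core element names; the alias table, the function name and the example record are assumptions made for illustration.

```python
from typing import Dict

# The fifteen elements of simple Dublin Core.
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}

# Hypothetical mapping from field names used in this document.
ALIASES = {"Author/Creator": "Creator", "CRID": "Identifier"}


def to_dublin_core(record: Dict[str, str]) -> Dict[str, str]:
    """Keep only the fields that map onto simple Dublin Core elements."""
    descriptive = {}
    for name, value in record.items():
        element = ALIASES.get(name, name)
        if element in DC_ELEMENTS:
            descriptive[element] = value
    return descriptive


# Illustrative record; the values are examples only.
record = {
    "Title": "The Wife of Bath's Prologue on CD-ROM",
    "Author/Creator": "Geoffrey Chaucer",
    "CRID": "example-crid-0001",
    "Application Name": "DynaText",
}
descriptive = to_dublin_core(record)
```

Fields such as Application Name are dropped because they belong to the Representation Information rather than to resource discovery.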


5. Cedars Metadata Elements - Tabular Format

The following table is an alternative presentation of the draft metadata elements identified to date, with the addition of thoughts as to whether each suggested element will be required, if they can be repeatable, and the likely source of this metadata.

Content Information
Element Examples/Definitions Comments Req. Rep. Source
File Identifier A unique identifier which indicates each data object being archived. In the Cedars context this would most likely be an internally defined identifier or CRID. Yes No Cedars nameserver
File Size  The size of file in kilobytes, megabytes, etc. There may be a need for a content standard for this field. ? No ?
Representation Information
Element Examples/Definitions Comments Req. Rep. Source
Language The language of the content of the data object being archived. NISO Z39.53 codes could be used. To be used where language is a significant part of the resource. No Yes  
Storage format: Type  The initial storage format of the data object being archived. Examples might be: Diskette 3.5 inch; Diskette 5.25 inch; CD-ROM; Information server; HFS path. This may be important since the underlying software may require the emulation of mounting diskettes, for example. Yes Yes To be supplied as pre-ingest data?
Storage format: Version The version of storage format being used       To be supplied as pre-ingest data?
Storage format: Specification  A link to documentation about the source format. The link would need to be to a reference list of formats that could be derived from the ‘type’ and ‘version’ fields Yes ?  
Name of File Type  The name of the file type, e.g. TIFF; HTML; Proprietary; Collection. A reference list of file formats would need to be maintained; formatting metadata may be required to understand how logical access to the bit-stream takes place. "Collection" would indicate that this particular package contains a collection of files. Yes Yes  
Version of File Type The version of file type The reference list of file formats would include the differences between different versions of file types      
Number of files The total number of files that comprise the archive object        
Total file size The total size (in KB, MB etc,) of the files that comprise the archive object. This would give an indication as to how much file space this resource would require to host, and how much data would be transferred from the archive.      
Documentation A link to the documentation about the file format (should be derived from type and version). The link would need to be to a reference list of formats that could be derived from the ‘type’ and ‘version’ fields      
File Modality  This is a higher level of description, which is similar to the Dublin Core ‘Type’ element, i.e. text, graphic, sound, video, etc. Adapted from the Pittsburgh metadata set. If a type of encoding is not initially recognised, one could be made aware of its generic type so that exploratory work can be done to assess the feasibility of accessing the file.      
Character Set The character set which the data uses (e.g. ASCII, Unicode). For text-based data only      
Associated DTD  The Document Type Definition which the text requires for interpretation. The value would be the CRID of the DTD. This is only required for text-based data which contain markup tags; i.e. SGML, HTML, XML.      
Encoding   For text-based data only      
Resolution  The graphics resolution of an image; e.g. 600dpi For graphics data only      
Colour  The colour depth of an image; e.g. 1-bit (black and white), 16-bit This will allow users to be aware of the minimum requirement for viewing graphics.  Yes No Possibly from metadata within graphics files, or from pre-ingest information supplied
Colour Management Identify system, if any, that is used to improve consistency of color across capture, display and output of an image. EXAMPLES: Photo CD; OptiCal (color management system); Profile/80 (color sync profile maker); Softproof (Photoshop Plugin) For graphics data only No No  
Colour Bar/Grey Scale Bar This links to a reference image which would allow colour comparison to take place; e.g. Kodak Q13 or Q14 Colour Separation Guide and Grey Scale; Kodak Q60 Colour Input Target. An image of the colour bar may be required within the archive, in order for comparisons to be made. For graphics data only. No No This would need to be supplied by the depositor, at pre-ingest stage.
Sampling Frequency  The sampling frequency at which audio was sampled: e.g. 8kHz, 44.1kHz This is only required for audio data Yes No  
File Encoding The type of encoding used to create a digital audio file. For example, AIFF, AU, WAV. This may be the same as file format name. This is only required for audio data. Yes No  
Bits  The number of bits each audio sample contains; e.g. 8-bit, 16-bit. This is only required for audio data. Yes No  
Mono/Stereo Self-explanatory. This is only required for audio data. Yes No  
Application-Dependency  This is used to indicate whether or not a file or collection is dependent on a specific application. This will be a Yes/No value. Where some resources are given in a stand-alone format, the component data files may be in a proprietary format which requires a specific viewer and is not easily converted.  Yes Yes  
Application Name  The name of the application which a file or collection is dependent on; e.g. OpenText, DynaText, MS-Word. In some collections, an application may not be required to view the files which comprise a collection, but may be required to achieve the functionality of the collection. This would mean while the collection is not dependent on the application, it can be noted that added value can be obtained from it using a particular application. This can inform future preservation decisions.  Yes ?  
Application Version  Version of the application which the file or collection is dependent on (e.g. version 5.0)   Yes No  
Notes  Text information which is of importance to the successful use of the application; e.g. "Versions of OpenText later than 5.0 are unable to handle SGML files" This can also be used to describe the functionality which the application adds to the collection. Yes No  
Software Environment Dependency Indicates whether the collection is dependent on a particular operating system. Value: yes/no Dependencies on particular operating systems and environments. Yes Yes  
Software Environment Name Name of the operating system which is required for using the digital archive (e.g. Windows, Macintosh, UNIX, none) In the absence of a particular operating system, an emulator might be used to facilitate use. This could be considered an application dependency, although this matching could be a distinct function, for ease of maintenance.  No Yes  
Software Environment Version  The version of the operating system in question: e.g. Windows 3.1, 95, 98, Mac System 7, System 8.1, AIX, IRIX A list of operating systems and versions could be kept. This may also help with keeping track of emulator compatibility. No Yes  
Hardware Dependency Indicates if any specific hardware is required for the successful use of the resource (yes/no). This may be audio or visual minimum requirements. However, since CEDARS is not addressing time-based media (i.e. audio and video), this may not be as essential as previously thought. Yes Yes  
Hardware Name  Name of the hardware device required: e.g. Mouse, Joystick, Monitor The examples here are trivial, but could include a specific device  Yes No  
Hardware Type  Type of hardware device; e.g. Pointer Device, Colour Monitor   Yes No  
Hardware Description  Description of why hardware is required. (e.g. mouse needed to interact with software; images served are large, high-quality colour, etc.) The description of the hardware dependency will explain why it has been critical to list it. Some value judgement will need to be made as to what can be regarded as superfluous (e.g. monitor, mouse) Yes No  
Specification  Provides specification of required hardware; e.g. suggested minimum size 17", refresh rate 75Hz, dot pitch 0.28mm   Yes No  
Modality Served  Describes the type of resource the hardware is facilitating; e.g. text, graphics, sound, video, input   No Yes  
Compression Have files been compressed; yes/no value With proprietary file formats, it may not be possible to know this. No No  
Name  Name of type of compression used (e.g. TIFF-LZW, ZIP, TAR)   Yes No  
Method Description of parameters used, package and process used, etc. e.g. "compressed to 80% quality JPEG files using GraphicConverter on Macintosh as a batch process"   Yes No  
Required uncompression package  Name of software package required to decompress the files; e.g. graphics viewer package, PKUNZIP, etc. This would be another instance of a file dependency.      
Location location of file on archive system - an identifier (i.e. CRID)        
Preservation Description Information
PDI: Context Information
Element Examples/Definitions Comments Req. Rep. Source
Identifier Name  Name of type of identifier; e.g. CRID, URL, ISBN, Author, Title, etc. Preference must be given to persistent and unique identifiers. Internal references may be used within a package which do not correspond to newest incarnation; original filenames may be useful in these cases.      
Identifier Value Value of identifier for this package.        
Level of granularity This has been identified as an important item of metadata. Describing this will vary according to collections, and so needs to be defined on a case-by-case basis. (This could be defined as part of the archival schema; examples need to be worked upon using the exemplars we have.) Yes Yes  
Related item  Items related to this one: e.g. list of CRID/URNs of items   No Yes  
Relation to item  Description of the relation to an item: e.g. it is an element/component of it Describes a hierarchy/network of relations: we may decide that a textual description at collection level may suffice, if this is too much to model. No Yes  
Is item required?  Will the object in question make sense without it, e.g. links required to understand object within collection. Yes/No value.        
History: Submission:

Date Date when the digital preservation copy was created; should follow ISO 8601 date format   Yes No  
History: Change: Formatting:
The RLG metadata refers to "Change History": changes made to the formatting of the data under a migration strategy, or to decisions regarding its preservation. If the Current File/Format fields have a "no" value, then information about the file/format type (and version) converted to, the reason for conversion, the date converted, and the conversion agent is also captured. It is envisaged that this information will be retained as conversions take place over time, allowing a history trail of files/formats to be kept.
FormattingChange: date Date of change, in ISO 8601 format (CCYY-MM-DD). Example: 1985-04-12   No Yes  
FormattingChange: previous format The previous format which data was stored in. This may be a redundant field; if you have date and formats held, this could be inferred through a query. No Yes  
FormattingChange: process Process undergone to change formats.    No Yes  
FormattingChange: reason Reason for changing format. This could be a change in storage policy, or file format in danger of being outdated. No Yes  
FormattingChange: conversion agent Details of who carried out the work.   No Yes  
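The comment on "FormattingChange: previous format" suggests that the field could be inferred from the dates and formats already held. A minimal sketch of that inference follows, assuming each change entry also records the format converted to. The class, field names and sample values are hypothetical illustrations, not part of the Cedars element set.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class FormattingChange:
    """One repeatable FormattingChange entry (field names are illustrative)."""
    changed_on: date        # FormattingChange: date (ISO 8601, CCYY-MM-DD)
    new_format: str         # format the data was converted to (assumed field)
    process: str            # FormattingChange: process
    reason: str             # FormattingChange: reason
    conversion_agent: str   # FormattingChange: conversion agent


def previous_format(original_format: str,
                    history: List[FormattingChange],
                    change: FormattingChange) -> str:
    """Infer the format held immediately before `change` from the history."""
    earlier = [c for c in history if c.changed_on < change.changed_on]
    if not earlier:
        # No earlier conversion: the data was still in its original format.
        return original_format
    return max(earlier, key=lambda c: c.changed_on).new_format


# Hypothetical history trail for one archived file, originally stored as TIFF.
history = [
    FormattingChange(date.fromisoformat("1985-04-12"), "JPEG",
                     "batch conversion", "change in storage policy",
                     "archive staff"),
    FormattingChange(date.fromisoformat("1999-02-25"), "PNG",
                     "batch conversion", "format in danger of being outdated",
                     "archive staff"),
]
```

Under these assumptions, holding the previous format explicitly would only duplicate what the ordered history already implies.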
Preservation Strategy History
History: Change: Preservation Strategy: This would link to changes in policy which would affect the preservation strategy.   No No  
Removal Authority - Defines who is authorised to purge an item from the archive system, e.g. the Oxford Text Archive. This must point to a user ID within Cedars. No No  
Retention Policy Citation - Textual information of organisation’s policy for record retention, e.g. link to relevant document or inclusion of relevant text in metadata. Could be a URI. This is similar to PANDORA: Decision Rule (refers to the decision for keeping a record archived) No No  
Retention Authority Issuance Textual information on legislative/government laws/regulations for record retention. This may not be necessary for Cedars frameworks, as this is from an evidential requirements framework. No No  
Retention Period End Time Date which indicates the end of the retention period, in ISO 8601 format (e.g. 2012-03-09 or indefinite). Again, this is from evidential requirements, and may not be necessary for Cedars. No No  
Selection Status Status of item’s selection within archive (e.g. withdrawn, selected, not selected) Allows us to keep track of what has been withdrawn from archive. No No  
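The Retention Period End Time element takes either an ISO 8601 date or the value "indefinite". A minimal sketch of how such a value might be interpreted is given below; the function names are hypothetical, and treating "indefinite" as "never expires" is an assumption rather than something this document specifies.

```python
from datetime import date
from typing import Optional


def parse_retention_end(value: str) -> Optional[date]:
    """Return the end date, or None when retention is indefinite.

    Raises ValueError if the value is not a valid CCYY-MM-DD date.
    """
    if value.strip().lower() == "indefinite":
        return None
    return date.fromisoformat(value.strip())


def retention_expired(value: str, today: date) -> bool:
    """True once the retention period has ended (assumed semantics)."""
    end = parse_retention_end(value)
    return end is not None and today > end
```

A calendar-aware parser also rejects impossible dates, which a plain text field would accept silently.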
Usage
Use-Type Description of usage of the item - e.g. viewed, copied, edited, filed, indexed, classified, sent, disposed – identifies usage types permitted. This will require a set of keywords to be identified (?) No Yes  
Use Instance Time Notes when the data was used (ISO 8601), e.g. 1999-02-09T15:45.   No Yes  
Use Instance User who used the data – although this could be just at the level of organisation; e.g. user ID number, guest, email of user The Use Evidential Consequences appears to be more related to the evidential focus of the Pittsburgh project. Here preservation is more concerned with frequency of usage. While this may not be the sole determinant of preservation strategy, it may be of interest (e.g. if usage is a determinant of the continuing preservation of the objects). Similarly, at what level would users be defined (individuals, groups, geography, anonymous? Need to check with Rights (Content Issues) Working Group).  No Yes  
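Since the interest here is in frequency of usage, the repeatable usage elements could be reduced to simple counts per item. The following is a minimal sketch under that assumption; the record layout, identifiers and sample values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class UseInstance:
    """One recorded use of an archived item (illustrative field names)."""
    item_id: str        # identifier of the item used, e.g. a CRID
    use_type: str       # Use-Type keyword, e.g. "viewed", "copied"
    used_at: datetime   # Use Instance Time (ISO 8601)
    user: str           # Use Instance: user ID, organisation, or "guest"


def usage_frequency(instances: Iterable[UseInstance]) -> Counter:
    """Count how many times each item has been used."""
    return Counter(i.item_id for i in instances)


# Hypothetical usage log.
log = [
    UseInstance("crid-0001", "viewed", datetime.fromisoformat("1999-02-09T15:45"), "guest"),
    UseInstance("crid-0001", "copied", datetime.fromisoformat("1999-02-10T09:30"), "user-42"),
    UseInstance("crid-0002", "viewed", datetime.fromisoformat("1999-02-11T11:00"), "guest"),
]
frequency = usage_frequency(log)
```

Counting by item does not depend on how finely users are identified, so the open question about user levels can be settled separately.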
Format Processing
Records the formal states a tangible format undergoes before and/or during each time its content is made available to user; e.g. testing by installation and virus checking, conversion from TIFF to JPEG format, rendering from SGML to HTML.
Process Type The processes a given tangible format may undergo        
Process Agent Documents which particular utilities have been used (e.g. Dr Solomon's, Norton, TIFF2JPG utility, Adobe Photoshop)        
Publisher Preferred Browser Recommended browser or viewer and version (e.g. Netscape 4.0, JPEG viewer) Format processing would need to be monitored; a "technology watch" facility would be useful to update this. Also, changes in format imply changes required in format processing.      
PDI: Provenance Information
This documents the history of the Content Information; i.e. who has had custody of the Content Information since it was originated, and what its source was. This would include information about the processes that generated it. If the content was not "born digital", information about digitisation processes could be held here. Custodial information is also likely to refer to the rights implications regarding access to the content. Rights metadata issues will be examined further by the Cedars CIWG.
Element Examples/Definitions Comments Req. Rep. Source
Publisher Information
Publisher Name The name of a publisher; e.g. Oxford University Press Publisher in this context could mean information provider.      
Publisher Contact Details Contact details (address, email, fax, telephone number) for the publisher CIWG input needed      
Rights Management Text field or link to rights management information        
Access Profile          
Access Rights          
Capture Procedure
Capture Procedure Type Describes how the archived object was originally created.        
Gathering Frequency Sampling techniques (e.g. weekly, monthly, quarterly, annually, one-off)        
Gathering Regime (e.g. the frequency and extent of capture of a publication)        
Rights information
Event Type Type of event in the rights acquisition process (e.g. Publication selected, No publisher response after x weeks, Publisher Permission obtained, Null publisher response) This was taken from the PANDORA model; needs input from CIWG.      
Negotiation Status Type of event in the rights negotiation process (e.g. First Contact Made, Follow-up contact made, Agreement Finalised, On hold pending further negotiation, Archived without agreement, De-selected) Note: event type and negotiation status are interlinked; the PANDORA model examples are cited.      
Organisation Body which negotiates on behalf of publisher (e.g. ASCAP in the case of musicians)        
Copyright Warning Information on the copyright of the item. Requires input from CIWG      
Publisher Copyright Statement Statement from publisher regarding copyright. Requires input from CIWG      
User Class From a list of designated user types (this would need to be developed in conjunction with the archive(s) we are working with and the Rights group) Requires input from CIWG      
Restriction Status (T&C – e.g. restricted only to archivists, i.e. no public access) Requires input from CIWG      
Access Conditions (T&C – e.g. may only be accessed by universities) Requires input from CIWG      
Use Conditions (T&C – e.g. only for research purposes) Requires input from CIWG      
PDI: Fixity Information
This documents authentication mechanisms, and provides any authentication keys which may be needed, for example:
  • Validation Keys
  • Encryption
  • Digital watermarks
Element Examples/Definitions Comments Req. Rep. Source
Authentication type  The authentication mechanism in use: e.g. digital watermark Needs more research.      
Authentication key The authentication key (if appropriate)        
Authentication value The value e.g. a hash value        
Authentication documentation Information on, or link to information on, the authentication mechanism in use        
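As an illustration of how the fixity elements might be populated when the authentication value is a hash value, the sketch below computes a message digest over the content and checks it later. The choice of SHA-256, the dictionary keys and the helper names are assumptions made for the example; this document does not prescribe a particular mechanism.

```python
import hashlib
from typing import Dict, Optional


def fixity_record(data: bytes, algorithm: str = "sha256") -> Dict[str, Optional[str]]:
    """Build an illustrative set of fixity elements for some content."""
    return {
        "authentication_type": algorithm,   # mechanism in use (assumed: a digest)
        "authentication_key": None,         # no key needed for a plain digest
        "authentication_value": hashlib.new(algorithm, data).hexdigest(),
        "authentication_documentation": "Python hashlib module documentation",
    }


def verify(data: bytes, record: Dict[str, Optional[str]]) -> bool:
    """Recompute the digest and compare it with the stored value."""
    algorithm = record["authentication_type"]
    return hashlib.new(algorithm, data).hexdigest() == record["authentication_value"]


content = b"archived content bytes"   # hypothetical content
record = fixity_record(content)
```

Any change to the content, however small, produces a different digest, so `verify` returns False for altered data.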
Packaging Information
This is defined as information which, logically or actually, binds and relates the components of the package into an identifiable entity on specific media, e.g. on a CD-ROM, packaging info may include the ISO-9660 volume/file structure of a CD-ROM. These choices are the subject of local archive definitions or conventions.

The "Granularity" information in PDI-Reference may have relevance in this section. In this way, the binding of content and functionality can be expressed and kept.

Descriptive Information
This could be generally derived from the Content Information and PDI – so file identification/location, and other descriptive information will be replicated here. There may also be a role for metadata in existing schemes (MARC, EAD, etc) or in core formats like Dublin Core.
Element Examples/Definitions Comments Req. Rep. Source
Title Title of archived information object   Yes Yes  
Author/Creator Creator of intellectual content of the archived information object   Yes Yes  
Description Brief textual description of the archived information object. This could contain an abstract. Yes Yes  
Contributor Other contributors to the intellectual content of the archived information object.   Yes Yes  
Date Date that object was made available in its current form. ISO 8601 format. This could be problematical – there will be a need to define specific types of dates. Yes Yes  
Language Language of resource. Probably in NISO Z39.53.   Yes Yes  
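The descriptive elements listed above correspond closely to Dublin Core elements of the same names. A minimal sketch of such a mapping follows, assuming every element is required and repeatable as the table indicates. The "DC." prefix convention, the function name and the sample record are illustrative assumptions.

```python
from typing import Dict, List

# Cedars descriptive element -> Dublin Core element name.
DC_MAPPING = {
    "Title": "title",
    "Author/Creator": "creator",
    "Description": "description",
    "Contributor": "contributor",
    "Date": "date",
    "Language": "language",
}


def to_dublin_core(record: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map a descriptive record to Dublin Core, checking required elements."""
    missing = [name for name in DC_MAPPING if not record.get(name)]
    if missing:
        raise ValueError("missing required elements: " + ", ".join(missing))
    # Each element is repeatable, so values are kept as lists.
    return {"DC." + DC_MAPPING[name]: list(record[name]) for name in DC_MAPPING}


# Hypothetical record describing this document.
example = {
    "Title": ["Cedars Preservation Metadata Elements"],
    "Author/Creator": ["Andy Stone", "Michael Day"],
    "Description": ["First draft of preservation metadata elements."],
    "Contributor": ["Cedars Access Issues Working Group"],
    "Date": ["1999-02-25"],
    "Language": ["eng"],
}
dc = to_dublin_core(example)
```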

6. References

Consultative Committee for Space Data Systems, 1998, Reference Model for an Open Archival Information System (OAIS), ed. L. Reich and D. Sawyer. CCSDS 650.0-W-4.0. White Book, Issue 4, 17 September.
Latest version available from: <URL:http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html>

Day, M.W., 1998, Metadata for Preservation. CEDARS Project Document AIW01. <URL:http://www.ukoln.ac.uk/metadata/cedars/AIW01.html>

7. Document History

Version  Date  Comments 
Version 1  25-Feb-1999  First Draft by Andy Stone and Michael Day. 

Cedars is a CURL Project funded by the Joint Information Systems Committee through its Electronic Libraries Programme (eLib).


Created and maintained by: Andy Stone (OUCS) and Michael Day (UKOLN) of the Cedars Project.
Page created: 25-Feb-1999.
Last updated: 05-Mar-1999.