Andy Stone
Oxford University Computing Service, 13 Banbury Road, Oxford, OX2 6NN, UK.
andy.stone@computing-services.oxford.ac.uk
http://www.oucs.ox.ac.uk/
Michael Day
UKOLN: The UK Office for Library and Information Networking, University
of Bath, Bath BA2 7AY, UK.
m.day@ukoln.ac.uk
http://www.ukoln.ac.uk/
Cedars Document Number: AIW02 | Status: Current | Created: 25-Feb-1999 | Updated: 25-Feb-1999 | Availability: Project Only |
This document builds upon the work carried out for Cedars Project Document AIW01 Metadata for preservation (Day 1998). That report identified and outlined a number of metadata initiatives and projects with relevance to digital preservation. This document synthesises metadata elements taken from some of these initiatives and places them in the framework of the Reference Model for an Open Archival Information System (OAIS) published by the Consultative Committee for Space Data Systems (1998).
This is a first attempt at identifying preservation metadata elements for the Cedars project within the general framework provided by the OAIS model.
The metadata schemes most used in this document are:
The OAIS model identifies what it calls a Taxonomy of archival information object classes that includes the following types of data:
Using this structure should make it easier to identify the components of the different Information Packages which OAIS refers to (which reflect actions at different stages of an archive). Metadata elements detailed in the PANDORA data model, the Pittsburgh metadata requirements document and the RLG Working Group report were mapped on to the OAIS taxonomy in order to help identify specific metadata elements.
The metadata specification will be an important part of the Cedars paper demonstrator. It will therefore need to fit into the general framework of the proposed Cedars demonstrator architecture. David Holdsworth’s draft architecture outlines two essential outcomes of a preservation demonstrator:
Holdsworth has already noted the requirement for unique identifiers and has proposed a Cedars Reference Identifier (CRID).
Metadata, therefore, has several distinct roles within the Cedars architecture:
Resource discovery within the Cedars demonstrator. Finding the preserved digital object will be the function of the Descriptive Information (in OAIS terms). For the demonstrator, this could take one of several different forms: MARC, EAD, etc. For the purposes of this document the Descriptive Information will conform to simple Dublin Core as defined in RFC 2413.
Retrieval from the Cedars demonstrator. Accessing the preserved digital object will depend upon Representation Information (in OAIS terminology) or structural metadata. The Representation Information (RI) may need to contain the whole specification for the format in which the object is ‘encoded’, e.g. for TIFF images or PDF files together with specifications of software environments (operating systems) and hardware. It is expected that the Cedars Preservation Strategies Working Group (PSWG) will contribute to the specification of RI. In this document, RI will just document relevant formats, operating environments and hardware.
Rights. Rights management will be an important part of any preservation demonstrator and could be used to manage access to objects stored in the archive. In OAIS terms, rights metadata are part of the Provenance Information of the Preservation Description Information (PDI). Rights metadata may need to be kept separately from the rest of the PDI within the Cedars demonstrators. It is expected that the Cedars Content Issues Working Group will contribute to the specification of rights metadata within the context of the paper demonstrator.
Preservation description. The remainder of the PDI gives information about References (chiefly identifiers), Context, and Fixity. Fixity Information is concerned with the authentication of the archived object and will need to be the subject of further study. Context Information deals with relationships between (and within) resources, granularity issues and will document the history of an archived object with relation to format conversions, preservation strategies and usage. Further Provenance Information within the PDI might contain information about the context of creation of the archived object and the digitisation capture process.
Note that an Archive Information Package (AIP) will comprise the Digital Object together with its RI and PDI.
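The composition of an AIP described above can be sketched as a simple data structure. This is only an illustrative outline: the class and field names, the sample CRID value and the use of plain dictionaries are assumptions made for the example, not part of the Cedars architecture or the OAIS model.

```python
from dataclasses import dataclass, field


@dataclass
class PreservationDescription:
    """PDI, split into the four categories used in this document."""
    reference: dict = field(default_factory=dict)   # identifiers, e.g. a CRID
    context: dict = field(default_factory=dict)     # relationships, change history
    provenance: dict = field(default_factory=dict)  # custody, rights, capture
    fixity: dict = field(default_factory=dict)      # authentication details


@dataclass
class ArchiveInformationPackage:
    """AIP = Digital Object + Representation Information + PDI."""
    digital_object: bytes
    representation_info: dict
    pdi: PreservationDescription


# Hypothetical example values, for illustration only.
aip = ArchiveInformationPackage(
    digital_object=b"raw bytes of the archived file",
    representation_info={"file_type": "TIFF", "version": "6.0"},
    pdi=PreservationDescription(reference={"CRID": "example-crid-0001"}),
)
```

Keeping the PDI as a separate object would make it straightforward to store rights metadata apart from the rest of the package, should that prove necessary.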
The following list of elements is intended to encourage discussion about the metadata required by the Cedars demonstrators and will lead to the eventual production of a Cedars metadata specification. The elements are derived in part from metadata schemes whose focus differs from that of Cedars.
A proposed metadata record for one resource, the Wife of Bath’s Prologue on CD-ROM, will be added to aid discussion.
Content information identifies the bits comprising the primary Digital Object; i.e. it is the raw file. This may consist of:
Other information regarding the Digital Object can be found under the Representation Information and Preservation Description Information (PDI).
Representation Information is defined by the OAIS model as "information that maps data into more meaningful concepts", and as "accompanying a digital object, or sequence of bits … to provide additional meaning". In general, it can be said that "Data interpreted using its Representation Information" yields information.
It should be noted that Representation Information stored in digital form will require Representation Information of its own to be understood. This recursion can be broken in one of two ways: through a physical document, or through software. The latter is felt to be a risky option, since there is no guarantee that such software will continue to exist. This identifies a need for a two-pronged approach: to indicate the software (and, potentially, hardware emulators) required to view files; and to provide physical documentation on standards for software formats, particularly in cases where they may be in decline.
There are varying levels of representation: collections may come with additional levels of functionality; although a description may be preserved of this, the implementation may or may not also be preserved.
Since there will be a long-term issue of format migration, this could be devolved to a dedicated "technology watch unit" which can identify and recommend migration action.
Proposed RI Elements:
4.2.1 Language
For content, an international standard such as NISO Z39.53 could be used. This includes older languages, which may be useful for textual resources (e.g. in OTA). Available: <URL:http://www.oasis-open.org/cover/nisoLang3-1994.html>
4.2.2 File Information:
There are two main strands to this; the physical storage format used (called "file format" in Pittsburgh), and the type of file formatting ("file rendering" in Pittsburgh) used.
Storage format: Type. Examples would be diskette (3.5, 5.25 inch), CD-ROM, information server, HFS path, etc. This may be important since the underlying software may require the emulation of mounting diskettes, for example.
Storage format: Version
Storage format: Specification (link to reference
resource; should be derived from type and version)
Type of file format:
Name of File Type (e.g. TIFF)
Version
Number of files
Total file size
Documentation (link to reference resource; should
be derived from type and version)
A list of file formats would need to be maintained; formatting metadata may be required to understand how logical access to the bit-stream takes place.
File Modality (from Pittsburgh)
A higher level of description, similar to Dublin Core ‘Type’, i.e. text, graphic, sound, video, etc.
If a type of encoding is not initially recognised, one could be made aware of its generic type so that exploratory work can be done to assess the feasibility of accessing the file.
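The idea of deriving the documentation link from the file type and version, with File Modality as a fallback for unrecognised encodings, can be sketched as follows. The registry contents, the link values and the function name are hypothetical; a real reference list of formats would have to be maintained by the archive.

```python
# Hypothetical reference list: (file type name, version) -> specification link.
FORMAT_REGISTRY = {
    ("TIFF", "6.0"): "registry:tiff-6.0-spec",
    ("HTML", "4.0"): "registry:html-4.0-spec",
}


def documentation_for(name, version, modality):
    """Return the specification link derived from type and version.

    If the format is not in the reference list, fall back to the generic
    File Modality so that exploratory work can still be planned.
    """
    link = FORMAT_REGISTRY.get((name, version))
    if link is not None:
        return {"status": "known", "documentation": link}
    return {"status": "unrecognised", "modality": modality}
```

For example, `documentation_for("TIFF", "6.0", "graphic")` finds the registered specification, whereas an unlisted format only reports its modality.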
4.2.3 Modality-Specific Metadata
There will be format-specific technical information. For example:
4.2.3.1 Text-based data:
Character Set (e.g. ASCII, Unicode)
Encoding
Associated DTD (for SGML/XML documents)
4.2.3.2 Graphic data
The following have been modified from RLG:
Resolution (e.g. 600dpi)
Colour (e.g. 16-bit)
Colour Management
Colour Bar/Grey Scale Bar
4.2.3.3 Audio data
CEDARS is not addressing multimedia, but examples might include:
Sampling Frequency (e.g. 8kHz, 44.1kHz)
File Encoding (e.g. AIFF, AU, WAV – will require
associated package to play back file)
Bits (e.g. 8-bit, 16-bit)
Mono/Stereo
4.2.4 Dependencies (adapted from Pittsburgh)
Resources may have dependencies on particular applications, software environments and hardware. These dependencies will, where appropriate, need to be recorded.
Application Dependency (e.g. Yes/No)
Application Name (e.g. OpenText, MS-Word)
Application Version (e.g. version 5.0)
Notes (e.g. "Versions of OpenText later than 5.0
are unable to handle SGML files")
Application dependencies could also refer to Plug-ins required, as mentioned by PANDORA’s "Online Format" field.
[Note: Is it within the remit of CEDARS to maintain the location of the applications required? At the librarians’ focus group in December, the notion of an application repository was mentioned. This would imply that applications would have their own identifiers (and consequently locations) so that they could be accessed.]
Software Environment Dependency
Dependencies on particular operating systems and environments.
Software Environment Name (e.g. Windows, Macintosh,
UNIX, none)
Software Environment Version (e.g. Windows 3.1,
95, 98, Mac System 7, System 8.1, AIX, IRIX)
Hardware Dependency
This may be audio or visual minimum requirements. However, since CEDARS is not addressing time-based media, this may not be as essential as previously thought:
[Note: This could also be used to record dependencies on particular hardware, e.g. a PC with 486 processor (minimum)?]
Hardware Name (e.g. Mouse, Joystick, Monitor)
Hardware Type (e.g. Pointer Device, Colour Monitor)
Hardware Description (e.g. mouse needed to interact
with software; images served are large, high-quality colour, etc.)
Specification (e.g. suggested minimum size 17",
refresh rate 75Hz, dot pitch 0.28mm)
Modality Served (e.g. text, graphics, sound,
video)
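One possible use of the dependency elements listed above is to check a resource against a target platform. The following is a minimal sketch under the assumption that each dependency is recorded as a kind, a name and an optional version; the `Dependency` class and `unmet_dependencies` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    kind: str          # "application", "software environment" or "hardware"
    name: str          # e.g. "OpenText", "Windows", "Mouse"
    version: str = ""  # e.g. "5.0", "95"; empty when not applicable


def unmet_dependencies(required, available):
    """Return the recorded dependencies that the target platform lacks."""
    return [dep for dep in required if dep not in available]


# Dependencies recorded for a resource (example values from the lists above).
required = [
    Dependency("application", "OpenText", "5.0"),
    Dependency("software environment", "Windows", "95"),
    Dependency("hardware", "Mouse"),
]
# What a hypothetical delivery platform can offer.
available = {
    Dependency("software environment", "Windows", "95"),
    Dependency("hardware", "Mouse"),
}
missing = unmet_dependencies(required, available)
```

Here `missing` contains only the OpenText 5.0 entry, which might prompt a search of an application repository or a decision to migrate the resource.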
4.2.4.5 Compression
Name (e.g. TIFF-LZW, ZIP, TAR)
Method
Required uncompression package (e.g. graphics
viewer package, PKUNZIP, etc)
If the uncompression package is not available, a reference
to its specification will be needed.
If an application repository exists, compression/uncompression
packages should be included in this.
4.2.4.6 Location
The location of file on archive system - an identifier (i.e. CRID)
4.3.1 Reference Information
This identifies and (if necessary) describes one or more mechanisms used to provide assigned identifiers for the Content Information. It also provides identifiers which allow outside systems to refer unambiguously to particular Content Information.
Identifier Name (e.g. CRID, URL, ISBN, Author, Title, etc.)
Identifier Value
Examples: CEDARS Identifier (a unique identifier within the CEDARS scheme - e.g. the CRID); Internet URIs or URNs; ISBN/ISSN numbers (for electronic texts where applicable). Preference must be given to persistent and unique identifiers.
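The preference for persistent and unique identifiers could be expressed as a ranking over Identifier Name values. The sketch below is illustrative only: the ranking order, the sample identifier values and the function name are assumptions, not part of the Cedars proposal.

```python
# Assumed preference order: a lower number means more persistent and unique.
# The ranking itself is illustrative and would need to be agreed by the project.
PREFERENCE = {"CRID": 0, "URN": 1, "ISBN": 2, "ISSN": 2, "URL": 3, "Title": 4, "Author": 5}


def preferred_identifier(identifiers):
    """Pick the (name, value) pair with the best rank; unknown names rank last."""
    return min(identifiers, key=lambda pair: PREFERENCE.get(pair[0], len(PREFERENCE)))


# Hypothetical Identifier Name / Identifier Value pairs for one package.
identifiers = [
    ("Title", "The Wife of Bath's Prologue on CD-ROM"),
    ("URL", "http://www.example.org/wife-of-bath"),
    ("CRID", "example-crid-0001"),
]
```

With these sample values, `preferred_identifier(identifiers)` selects the CRID pair.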
4.3.2 Context Information
Defined as the relationships of the content information to its environment. This includes background information on why the Content Information was created, and how it relates to other Content Information objects elsewhere.
The main metadata types refer to: granularity, formatting history, preservation strategy history, usage, and format processing.
4.3.2.1 Granularity
Different types of archive will have different levels of granularity, e.g. in ILEJ, there will be publication level, issue level and page level. It needs to be clear which one is meant. PANDORA refers to Publication Identifier and Copy Identifier. There may be a role for the Dublin Core Relation element.
Level of granularity (this could be defined as
part of the archival schema; examples need to be worked upon using the
exemplars we have)
Related item (e.g. list of CRID/URNs of items)
Relation to item (e.g. it is an element/component
of it)
Is item required? (e.g. will the object in question
make sense without it, e.g. links required to understand object within
collection)
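The relation elements above could be used to work out which other items must accompany an object for it to make sense. This is a sketch under the assumption that each relation is stored with a flag for "Is item required?"; the class, the function and the sample item names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    related_item: str   # CRID or URN of the related item
    relation: str       # e.g. "is component of", "has thumbnail"
    required: bool      # will the object make sense without it?


def required_items(start, relations_by_item):
    """Collect every item that must accompany `start`, following required links."""
    needed, stack = set(), [start]
    while stack:
        item = stack.pop()
        for rel in relations_by_item.get(item, []):
            target = rel.related_item
            if rel.required and target != start and target not in needed:
                needed.add(target)
                stack.append(target)
    return needed


# Hypothetical page/issue/journal hierarchy with an optional thumbnail.
relations_by_item = {
    "page-12": [Relation("issue-3", "is component of", True),
                Relation("page-12-thumb", "has thumbnail", False)],
    "issue-3": [Relation("journal-A", "is component of", True)],
}
```

In this example the page requires its issue and, through it, the journal, while the thumbnail is left out because it is not marked as required.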
Online data, while usually output as text-based files read by a client (e.g. Web browser), may only be accessible via server-based programs. For example, the ILEJ collection is available via the Web, with a variety of searchable functions. If it was felt important for this functionality (or part of it) to be ‘preserved’, there would be a trade-off between any completeness of functionality and cost. (Note: this observation also refers to dependency on software to provide FUNCTIONALITY which has already been covered – this is distinct from the granularity of the digital object files)
Additionally, the ILEJ and Oxford Early Manuscripts projects both raised issues regarding associated files. The uncompressed scanned TIFF file of a manuscript page may not be appropriate, nor is the full context of the digital artefact known. There needs to be the facility to link to the relevant files in order for a contextual understanding to be made (e.g. ILEJ may link the OCR text file with the scanned image and thumbnail)
In instances such as these, the judgement of what constitutes the digital object is important.
Perhaps each archive needs to document this complete functionality and, where functional instances of it have been preserved (e.g. specific CGI search routines), to state how these were used.
4.3.2.2 Formatting History
History: Submission:
RLG: Date - when the digital preservation copy
was created.
History: Change:
The RLG metadata refers to "Change History" - changes
made to the formatting of the data under a migration strategy, or to decisions
regarding its preservation.
History: Change: Formatting:
A value known as "Current File/Format" was in a previous
draft; however, this is redundant if the submission event is recorded as
the first change – if this is a digitisation process, the original medium
can be
If the Current File/Format fields have a "no" value, then information about the file/format type (and version) converted to; reason for conversion; date converted; and conversion agent is also captured. It can be envisaged that this information will be retained as conversions take place over time, so that a history trail of files/formats can be kept.
FormattingChange: date
FormattingChange: previous format (e.g. paper
master, CD-ROM, HTML)
FormattingChange: process (e.g. scanning, HFS
backup, conversion to XML)
FormattingChange: reason (e.g. creation of digital
master, change in storage policy, archive decides to convert HTML docs
into XML)
FormattingChange: conversion-agent (e.g. OUCS,
ILEJ Project Officer)
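The history trail formed by successive FormattingChange entries can be sketched as a list of records from which the current format is inferred rather than stored. The `converted_to` field is an assumption (the list above names only the previous format), and the dates and the second conversion are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FormattingChange:
    when: date
    previous_format: str
    converted_to: str      # assumed field: the format produced by the change
    process: str
    reason: str
    conversion_agent: str


def current_format(history):
    """Infer the current format from the most recent change in the trail."""
    if not history:
        return None
    return max(history, key=lambda change: change.when).converted_to


# Illustrative trail: the submission event is recorded as the first change.
history = [
    FormattingChange(date(1999, 2, 9), "paper master", "TIFF", "scanning",
                     "creation of digital master", "ILEJ Project Officer"),
    FormattingChange(date(2004, 6, 1), "TIFF", "JPEG", "batch conversion",
                     "change in storage policy", "OUCS"),
]
```

Because the latest entry determines the result, a separate "Current File/Format" value would not need to be maintained alongside the trail.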
PANDORA mentions that the physical characteristics of tangible formats will need to be kept in the archive.
4.3.2.3 Preservation Strategy History
History: Change: Preservation Strategy:
Pittsburgh: Disposition Requirements Metadata
Removal Authority - who is authorised to purge an item from the archive system, e.g. the Oxford Text Archive.
Retention Policy Citation
- textual information of organisation’s policy for record retention, e.g.
link to relevant document or inclusion of relevant text in metadata. Could
be a URI.
This is similar to PANDORA: Decision Rule (refers to
the decision for keeping a record archived)
Retention Authority Issuance - textual information
on legislative/government laws/regulations for record retention
Retention Period End Time (e.g. 31/12/2015, indefinite)
Also PANDORA: Selection Status (e.g. withdrawn,
selected, not selected)
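A Retention Period End Time could be tested mechanically if dates are recorded consistently. The sketch below assumes ISO 8601 dates (CCYY-MM-DD), as proposed for the date fields in the table of elements, or the literal value "indefinite"; the function name is hypothetical.

```python
from datetime import date


def retention_expired(end_time, today):
    """Return True once the Retention Period End Time has passed.

    `end_time` is either the string "indefinite" or an ISO 8601 date
    string (CCYY-MM-DD); `today` is a datetime.date.
    """
    if end_time == "indefinite":
        return False
    return date.fromisoformat(end_time) < today
```

For instance, `retention_expired("2015-12-31", date(2016, 1, 1))` is true, while an "indefinite" period never expires.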
4.3.2.4 Usage
Pittsburgh: Use History, breaks down to:
Use Type - e.g. viewed, copied, edited, filed,
indexed, classified, sent, disposed – identifies usage types permitted.
Use Instance Time - when data was used, e.g. 09/02/99
15:45.
Use Instance User - who used the data – although
this could be just at the level of organisation; e.g. user ID number, guest,
email of user.
The Use-Evidential-Consequences appears to be more related to the evidential focus of the Pittsburgh project. Here preservation is more concerned with frequency of usage. While this may not be the sole determinant of preservation strategy, it may be of interest (e.g. if usage is a determinant of the continuing preservation of the objects). Similarly, at what level would users be defined (individuals, groups, geography, anonymous? Need to check with Rights (Content Issues) Working Group).
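If frequency of usage is to inform preservation decisions, the Use History entries could be summarised per Use Type. This is a minimal sketch that assumes each use instance is held as a (use type, time, user) tuple; the sample entries and user identifiers are invented.

```python
from collections import Counter


def usage_frequency(use_history):
    """Count Use Instances per Use Type from (use_type, time, user) tuples."""
    return Counter(use_type for use_type, _time, _user in use_history)


# Hypothetical use instances for one archived object.
use_history = [
    ("viewed", "1999-02-09T15:45", "guest"),
    ("viewed", "1999-02-10T09:30", "user-0042"),
    ("copied", "1999-02-10T09:31", "user-0042"),
]
```

Counting by Use Type alone avoids having to settle the level at which users are defined before usage figures become useful.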
4.3.2.5 Format Processing:
PANDORA: Process - the formal states a tangible
format undergoes before and/or during each time its content is made available
to user; e.g. testing by installation and virus checking, conversion from
TIFF to JPEG format, rendering from SGML to HTML.
PANDORA: Process Type - the processes a given
tangible format may undergo.
PANDORA: Process Agent - what utilities have been
used (e.g. Dr Solomons, Norton, TIFF2JPG utility, MS PhotoShop)
PANDORA: Publisher-Preferred Browser (e.g. Netscape
4.0, JPEG viewer)
Format processing would need to be monitored; a "technology watch" facility would be useful to update this. Also, changes in format imply that changes are required in format processing.
4.3.3 Provenance Information
This documents the history of the Content Information; i.e. who has had custody of the Content Information since it was originated, and what was its source. This would include information about the processes that generated it. If the content was not "born digital", information about digitisation processes could be held here. Custodial information is also likely to refer to the rights implications regarding access to the content. Rights metadata issues will be looked at further by the Cedars CIWG.
[We will also need to include some kind of field which indicates if the content should be made visible to archive staff querying the system, in addition to whether end users can access it].
4.3.3.1 Publisher Information
Publisher
Publisher Name (e.g. Oxford University Press)
Publisher Contact Details (e.g. address, email,
fax, telephone number)
Rights Management
Access Profile
Access Rights
4.3.3.2 Capture Procedure
[PANDORA deals primarily with Internet publications – I feel this section relates more to ongoing capture of on-line publications, which is a specific instance of an ACQUISITIONS PROCEDURE].
Capture Procedure Type
Gathering Frequency - Sampling techniques (e.g.
weekly, monthly, quarterly, annually, one-off)
Gathering Regime (e.g. the frequency and extent
of capture of a publication)
4.3.3.3 Rights information
Event Type (e.g. Publication selected, No publisher
response after x weeks, Publisher Permission obtained, Null publisher response)
Negotiation Status (e.g. First Contact Made, Follow-up
contact made, Agreement Finalised, On hold pending further negotiation,
Archived without agreement, De-selected)
Note: event type and negotiation status are interlinked; the examples cited are from the PANDORA model.
Organisation (body which negotiates on behalf
of publisher – e.g. ASCAP in the case of musicians)
Copyright Warning
Publisher Copyright Statement
[Rights input from CIWG].
User Class (e.g. from a list of designated user
types – this would need to be developed in conjunction with the archive(s)
we are working with and the Rights group)
Restriction Status (T&C – e.g. restricted
only to archivists, i.e. no public access)
Access Conditions (T&C – e.g. may only be
accessed by universities)
Use Conditions (T&C – e.g. only for research
purposes)
4.3.4 Fixity Information
This documents the authentication mechanism and provides any authentication keys which may be needed, for example:
Fixity-type (the mechanism, e.g. digital watermark)
Fixity-key (e.g. key value, if applicable)
Fixity-value
Fixity-info-location (e.g. of documentation on
the authentication mechanism)
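As one possible illustration of how the fixity elements might be populated and checked, the sketch below uses a message digest rather than the digital watermark given as an example above. The choice of SHA-256, the field names and both functions are assumptions; the mechanism actually used by Cedars remains the subject of further study.

```python
import hashlib


def make_fixity(data, algorithm="sha256"):
    """Build a fixity record for a byte string using a message digest."""
    digest = hashlib.new(algorithm, data).hexdigest()
    # A plain digest needs no key, so the Fixity-key is left empty.
    return {"fixity_type": algorithm, "fixity_key": None, "fixity_value": digest}


def verify_fixity(data, record):
    """Recompute the digest and compare it with the stored Fixity-value."""
    recomputed = hashlib.new(record["fixity_type"], data).hexdigest()
    return recomputed == record["fixity_value"]


record = make_fixity(b"archived bit-stream")
```

Calling `verify_fixity` with the unchanged bytes returns true; any alteration to the bit-stream makes the comparison fail.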
Packaging Information is defined as information which, logically or actually, binds and relates the components of the package into an identifiable entity on specific media; e.g. on a CD-ROM, packaging information may include the ISO-9660 volume/file structure. These choices are the subject of local archive definitions or conventions.
The "Granularity" information in PDI-Reference may have relevance in this section. In this way, the binding of content and functionality can be expressed and be kept distinct.
Descriptive Information is the metadata used by access aids (which locate, analyse, retrieve, etc. from an OAIS), for example:
Title
Author/Creator
Description
Contributor
Date
Language
This could be generally derived from the Content Information and PDI – so file identification/location, and other descriptive information will be replicated here. There may also be a role for metadata in existing schemes (MARC, EAD, etc) or in core formats like Dublin Core.
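The replication of values from the Representation Information and PDI into a descriptive record can be sketched as below. The function name, the dictionary keys and the decision to take Language from the RI and the identifier from the PDI Reference Information are assumptions made for illustration; the element names follow simple Dublin Core.

```python
def descriptive_information(representation_info, pdi_reference, supplied):
    """Assemble a simple Dublin Core-style record.

    Language and Identifier are replicated from the RI and PDI; the
    remaining elements come from cataloguing data supplied at ingest.
    """
    record = {name: supplied.get(name, "") for name in
              ("Title", "Creator", "Description", "Contributor", "Date")}
    record["Language"] = representation_info.get("Language", "")
    record["Identifier"] = pdi_reference.get("CRID", "")
    return record
```

A record in another scheme, such as MARC or EAD, could be produced in the same way by changing the target element names.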
The following table is an alternative presentation of the draft metadata elements identified to date, with the addition of thoughts as to whether each suggested element will be required, if they can be repeatable, and the likely source of this metadata.
Content Information | |||||
Element | Examples/Definitions | Comments | Req. | Rep. | Source |
File Identifier | A unique identifier which indicates each data object being archived. | In the Cedars context this would most likely be an internally defined identifier or CRID. | Yes | No | Cedars nameserver |
File Size | The size of file in kilobytes, megabytes, etc. | There may be a need for a content standard for this field. | ? | No | ? |
Representation Information | |||||
Element | Examples/Definitions | Comments | Req. | Rep. | Source |
Language | The language of the content of the data object being archived. NISO Z39.53 codes could be used. | To be used where language is a significant part of the resource. | No | Yes | |
Storage format: Type | The initial storage format of the data object being archived. Examples might be: Diskette 3.5 inch; Diskette 5.25 inch; CD-ROM; Information server; HFS path | This may be important since the underlying software may require the emulation of mounting diskettes, for example. | Yes | Yes | To be supplied as pre-ingest data? |
Storage format: Version | The version of storage format being used | To be supplied as pre-ingest data? | |||
Storage format: Specification | A link to documentation about the source format. | The link would need to be to a reference list of formats that could be derived from the ‘type’ and ‘version’ fields | Yes | ? | |
Name of File Type | The name of the file type, e.g. TIFF; HTML; Proprietary; Collection | A reference list of file formats would need to be maintained; formatting metadata may be required to understand how logical access to the bit-stream takes place. "Collection" would indicate that this particular package contains a collection of files | Yes | Yes | |
Version of File Type | The version of file type | The reference list of file formats would include the differences between different versions of file types | |||
Number of files | The total number of files that comprise the archive object | ||||
Total file size | The total size (in KB, MB etc,) of the files that comprise the archive object. | This would give an indication as to how much file space this resource would require to host, and how much data would be transferred from the archive. | |||
Documentation | A link to the documentation about the file format (should be derived from type and version) | The link would need to be to a reference list of formats that could be derived from the ‘type’ and ‘version’ fields | |||
File Modality | This is a higher level of description, which is similar to the Dublin Core ‘Type’ element, i.e. text, graphic, sound, video, etc. Adapted from the Pittsburgh metadata set. | If a type of encoding is not initially recognised, one could be made aware of its generic type so that exploratory work can be done to assess the feasibility of accessing the file. | |||
Character Set | The character set which the data uses (e.g. ASCII, Unicode) | For text-based data only | |||
Associated DTD | The Document Type Definition which the text requires for interpretation. The value would be the CRID of the DTD | This is only required for text-based data which contain markup tags; i.e. SGML, HTML, XML. | |||
Encoding | For text-based data only | ||||
Resolution | The graphics resolution of an image; e.g. 600dpi | For graphics data only | |||
Colour | The colour depth of an image; e.g. 1-bit (black and white), 16-bit | This will allow users to be aware of the minimum requirement for viewing graphics. | Yes | No | Possibly from metadata within graphics files, or from pre-ingest information supplied |
Colour Management | Identify system, if any, that is used to improve consistency of color across capture, display and output of an image. EXAMPLES: Photo CD; OptiCal (color management system); Profile/80 (color sync profile maker); Softproof (Photoshop Plugin) | For graphics data only | No | No | |
Colour Bar/Grey Scale Bar | This links to a reference image which would allow colour comparison to take place; e.g. Kodak Q13 or Q14 Colour Separation Guide and Grey Scale; Kodak Q60 Colour Input Target | An image of the colour bar may be required within the archive, in order for comparisons to be made. For graphics data only | No | No | This would need to be supplied by the depositor, at pre-ingest stage. |
Sampling Frequency | The frequency at which audio was sampled: e.g. 8kHz, 44.1kHz | This is only required for audio data | Yes | No | |
File Encoding | The type of encoding used to create a digital audio file. For example, AIFF, AU, WAV. | This may be the same as file format name. This is only required for audio data. | Yes | No | |
Bits | The number of bits each audio sample contains; e.g. 8-bit, 16-bit. | This is only required for audio data. | Yes | No | |
Mono/Stereo | Self-explanatory. | This is only required for audio data. | Yes | No | |
Application-Dependency | This is used to indicate whether or not a file or collection is dependent on a specific application. This will be a Yes/No value. | Where some resources are given in a stand-alone format, the component data files may be in a proprietary format which requires a specific viewer and is not easily converted. | Yes | Yes | |
Application Name | The name of the application which a file or collection is dependent on; e.g. OpenText, DynaText, MS-Word. | In some collections, an application may not be required to view the files which comprise a collection, but may be required to achieve the functionality of the collection. This would mean while the collection is not dependent on the application, it can be noted that added value can be obtained from it using a particular application. This can inform future preservation decisions. | Yes | ? | |
Application Version | Version of the application on which the file or collection is dependent (e.g. version 5.0) | Yes | No | ||
Notes | Text information which is of importance to the successful use of the application; e.g. "Versions of OpenText later than 5.0 are unable to handle SGML files" | This can also be used to describe the functionality which the application adds to the collection. | Yes | No | |
Software Environment Dependency | Indicates whether the collection is dependent on a particular operating system. Value: yes/no | Dependencies on particular operating systems and environments. | Yes | Yes | |
Software Environment Name | Name of the operating system which is required for using the digital archive (e.g. Windows, Macintosh, UNIX, none) | In the absence of a particular operating system, an emulator might be used to facilitate use. This could be considered an application dependency, although this matching could be a distinct function, for ease of maintenance. | No | Yes | |
Software Environment Version | The version of the operating system in question: e.g. Windows 3.1, 95, 98, Mac System 7, System 8.1, AIX, IRIX | A list of operating systems and versions could be kept. This may also help with keeping track of emulator compatibility. | No | Yes | |
Hardware Dependency | Indicates if any specific hardware is required for the successful use of the resource (yes/no). | This may be audio or visual minimum requirements. However, since CEDARS is not addressing time-based media (i.e. audio and video), this may not be as essential as previously thought. | Yes | Yes | |
Hardware Name | Name of the hardware device required: e.g. Mouse, Joystick, Monitor | The examples here are trivial, but could include a specific device | Yes | No | |
Hardware Type | Type of hardware device; e.g. Pointer Device, Colour Monitor | Yes | No | ||
Hardware Description | Description of why hardware is required. (e.g. mouse needed to interact with software; images served are large, high-quality colour, etc.) | The description of the hardware dependency will explain why it has been critical to list it. Some value judgement will need to be made as to what can be regarded as superfluous (e.g. monitor, mouse) | Yes | No | |
Specification | Provides specification of required hardware; e.g. suggested minimum size 17", refresh rate 75Hz, dot pitch 0.28mm | Yes | No | ||
Modality Served | Describes the type of resource the hardware is facilitating; e.g. text, graphics, sound, video, input | No | Yes | ||
Compression | Have files been compressed; yes/no value | With proprietary file formats, it may not be possible to know this. | No | No | |
Name | Name of type of compression used (e.g. TIFF-LZW, ZIP, TAR) | Yes | No | ||
Method | Description of parameters used, package and process used, etc. e.g. "compressed to 80% quality JPEG files using GraphicConverter on Macintosh as a batch process" | Yes | No | ||
Required uncompression package | Name of software package required to decompress the files; e.g. graphics viewer package, PKUNZIP, etc. | This would be another instance of a file dependency. | |||
Location | location of file on archive system - an identifier (i.e. CRID) | ||||
Preservation Description Information | |||||
PDI: Context Information | |||||
Element | Examples/Definitions | Comments | Req. | Rep. | Source |
Identifier Name | Name of type of identifier; e.g. CRID, URL, ISBN, Author, Title, etc. | Preference must be given to persistent and unique identifiers. Internal references may be used within a package which do not correspond to the newest incarnation; original filenames may be useful in these cases. | |||
Identifier Value | Value of identifier for this package. | ||||
Level of granularity | This has been identified as an important item of metadata. Describing this will vary according to collections, and so needs to be defined on a case-by-case basis. | (this could be defined as part of the archival schema; examples need to be worked upon using the exemplars we have) | Yes | Yes | |
Related item | Items related to this one: e.g. list of CRID/URNs of items | No | Yes | ||
Relation to item | Description of the relation to an item: e.g. it is an element/component of it | Describes a hierarchy/network of relations: we may decide that a textual description at collection level may suffice, if this is too much to model. | No | Yes | |
Is item required? | Will the object in question make sense without it, e.g. links required to understand object within collection. Yes/No value. | ||||
History: Submission: Date | Date when the digital preservation copy was created; should follow ISO 8601 date format | Yes | No | ||
History: Change: Formatting: | |||||
The RLG metadata refers to "Change History" - changes made to the formatting of the data under a migration strategy, or to decisions regarding its preservation. If the Current File/Format fields have a "no" value, then information about the file/format type (and version) converted to; reason for conversion; date converted; and conversion agent is also captured. It can be envisaged that this information will be retained as conversions take place over time, so that a history trail of files/formats can be kept. | |||||
FormattingChange: date | Date of change, in ISO 8601 format (CCYY-MM-DD). Example: 1985-04-12 | No | Yes | ||
FormattingChange: previous format | The format in which the data was previously stored. | This may be a redundant field; if the dates and formats are held, it could be inferred through a query. | No | Yes | |
FormattingChange: process | Process undergone to change formats. | No | Yes | ||
FormattingChange: reason | Reason for changing format. | This could be a change in storage policy, or file format in danger of being outdated. | No | Yes | |
FormattingChange: conversion agent | Details of who carried out the work. | No | Yes | ||
Preservation Strategy History | |||||
History: Change: Preservation Strategy: | This would link to changes in policy which would affect the preservation strategy. | No | No | ||
Removal Authority - | Defines who is authorised to purge an item from the archive system, e.g. the Oxford Text Archive. | This must point to a user ID within Cedars. | No | No | |
Retention Policy Citation - | Textual information on the organisation’s policy for record retention, e.g. link to relevant document or inclusion of relevant text in metadata. Could be a URI. | This is similar to PANDORA: Decision Rule (refers to the decision for keeping a record archived) | No | No | |
Retention Authority Issuance | Textual information on legislative/government laws/regulations for record retention | This may not be necessary for Cedars frameworks, as this is from an evidential requirements framework. | No | No | |
Retention Period End Time | Date which indicates the end of the retention period (e.g. 2012-03-09 or indefinite), in ISO 8601 format | Again, this is from evidential requirements, and may not be necessary for Cedars. | No | No | |
Selection Status | Status of item’s selection within archive (e.g. withdrawn, selected, not selected) | Allows us to keep track of what has been withdrawn from archive. | No | No | |
Usage | |||||
Use-Type | Description of usage of the item - e.g. viewed, copied, edited, filed, indexed, classified, sent, disposed – identifies usage types permitted. | This will require a set of keywords to be identified (?) | No | Yes | |
Use Instance Time | Notes when the data was used (ISO 8601), e.g. 1999-02-09T15:45. | No | Yes | ||
Use Instance User | Who used the data, although this could be just at the level of organisation; e.g. user ID number, guest, email of user | The Use Evidential Consequences element appears to be more related to the evidential focus of the Pittsburgh project. Here preservation is more concerned with frequency of usage. While this may not be the sole determinant of preservation strategy, it may be of interest (e.g. if usage is a determinant of the continuing preservation of the objects). Similarly, at what level would users be defined (individuals, groups, geography, anonymous)? Need to check with Rights (Content Issues) Working Group. | No | Yes | |
Format Processing | |||||
Records the formal states a tangible format undergoes before and/or during each occasion on which its content is made available to a user; e.g. testing by installation and virus checking, conversion from TIFF to JPEG format, rendering from SGML to HTML. | |||||
Process Type | The processes a given tangible format may undergo | ||||
Process Agent | Documents which particular utilities have been used (e.g. Dr Solomons, Norton, TIFF2JPG utility, MS PhotoShop) | ||||
Publisher Preferred Browser | Recommended version of browser or viewer (e.g. Netscape 4.0, JPEG viewer) | Format processing would need to be monitored; a "technology watch" facility would be useful to update this. Also, changes in format imply changes in the format processing required. | |||
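The comment on "FormattingChange: previous format" earlier in this table suggests that the previous format could be inferred from the dates and formats already held. The following is a minimal Python sketch of that idea, not part of the Cedars specification: the class, the field names (in particular `converted_to`) and the helper functions are hypothetical, and dates are assumed to be held as CCYY-MM-DD strings.

```python
import datetime
import re
from dataclasses import dataclass
from typing import List, Tuple

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_iso_date(value: str) -> datetime.date:
    """Return a date for a CCYY-MM-DD string, or raise ValueError."""
    if not ISO_DATE.fullmatch(value):
        raise ValueError(f"not in CCYY-MM-DD form: {value!r}")
    # fromisoformat also rejects impossible dates such as month 13 or day 99
    return datetime.date.fromisoformat(value)


@dataclass
class FormattingChange:
    """One repeatable FormattingChange entry (field names are illustrative)."""
    changed_on: str        # FormattingChange: date
    converted_to: str      # format converted to (hypothetical field name)
    process: str           # FormattingChange: process
    reason: str            # FormattingChange: reason
    conversion_agent: str  # FormattingChange: conversion agent


def previous_formats(history: List[FormattingChange],
                     original_format: str) -> List[Tuple[str, str, str]]:
    """Return (date, previous format, new format) for each change, in date order."""
    ordered = sorted(history, key=lambda c: parse_iso_date(c.changed_on))
    trail = []
    current = original_format
    for change in ordered:
        trail.append((change.changed_on, current, change.converted_to))
        current = change.converted_to
    return trail
```

Under these assumptions the previous format never has to be stored: it is derived by walking the change history in date order, starting from the format in which the object was submitted.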
PDI: Provenance Information | |||||
This documents the history of the Content Information, i.e. who has had custody of the Content Information since it was originated, and what its source was. This would include information about the processes that generated it. If the content was not "born digital", information about digitisation processes could be held here. Custodial information is also likely to refer to the rights implications regarding access to the content. Rights metadata issues will be examined further by the Cedars CIWG. | |||||
Element | Examples/Definitions | Comments | Req. | Rep. | Source |
Publisher Information | |||||
Publisher Name | The name of a publisher, e.g. Oxford University Press | Publisher in this context could mean information provider. | |||
Publisher Contact Details | Contact details (address, email, fax, telephone number) for the publisher | CIWG input needed | |||
Rights Management | Text field or link to rights management information | ||||
Access Profile | |||||
Access Rights | |||||
Capture Procedure | |||||
Capture Procedure Type | Describes how the archived object was originally created. | ||||
Gathering Frequency | Sampling techniques (e.g. weekly, monthly, quarterly, annually, one-off) | ||||
Gathering Regime | (e.g. the frequency and extent of capture of a publication) | ||||
Rights information | |||||
Event Type | Type of event in the rights acquisition process (e.g. Publication selected, No publisher response after x weeks, Publisher Permission obtained, Null publisher response) | This was taken from the PANDORA model; needs input from CIWG. | |||
Negotiation Status | Type of event in the rights negotiation process (e.g. First Contact Made, Follow-up contact made, Agreement Finalised, On hold pending further negotiation, Archived without agreement, De-selected) | Note: event type and negotiation status are interlinked; the PANDORA model examples are cited. Organisation (body which negotiates on behalf of publisher – e.g. ASCAP in the case of musicians) | |||
Copyright Warning | Information on the copyright of the item. | Requires input from CIWG | |||
Publisher Copyright Statement | Statement from publisher regarding copyright. | Requires input from CIWG | |||
User Class | (e.g. from a list of designated user types – this would need to be developed in conjunction with the archive(s) we are working with and the Rights group) | Requires input from CIWG | |||
Restriction Status | (T&C – e.g. restricted only to archivists, i.e. no public access) | Requires input from CIWG | |||
Access Conditions | (T&C – e.g. may only be accessed by universities) | Requires input from CIWG | |||
Use Conditions | (T&C – e.g. only for research purposes) | Requires input from CIWG | |||
PDI: Fixity Information | |||||
This documents authentication mechanisms, and provides any authentication keys which may be needed, for example: | |||||
Element | Examples/Definitions | Comments | Req. | Rep. | Source |
Authentication type | The authentication mechanism in use: e.g. digital watermark | Needs more research. | |||
Authentication key | The authentication key (if appropriate) | ||||
Authentication value | The value e.g. a hash value | ||||
Authentication documentation | Information on, or a link to information on, the authentication mechanism in use | ||||
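As an illustration of how the Authentication type, key and value elements might be populated when the mechanism is a hash, the sketch below uses Python's standard `hashlib` module. The choice of SHA-256, the dictionary keys and the function names are assumptions made for this example; the table above does not prescribe any particular mechanism.

```python
import hashlib
import hmac


def fixity_record(data: bytes, algorithm: str = "sha256") -> dict:
    """Build illustrative fixity elements for a byte stream."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {
        "authentication_type": f"hash:{algorithm}",  # mechanism in use
        "authentication_key": None,                  # a plain hash needs no key
        "authentication_value": digest,              # the hash value itself
    }


def verify(data: bytes, record: dict) -> bool:
    """Recompute the hash named in the record and compare it with the stored value."""
    algorithm = record["authentication_type"].split(":", 1)[1]
    actual = hashlib.new(algorithm, data).hexdigest()
    return hmac.compare_digest(actual, record["authentication_value"])
```

A check of this kind could be run whenever an object is copied or migrated, with a failed comparison indicating that the stored bytes have changed since the record was made.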
Packaging Information | |||||
This is defined as information which, logically or actually, binds and relates the components of the package into an identifiable entity on specific media; e.g. for a CD-ROM, packaging information may include the ISO 9660 volume/file structure. These choices are the subject of local archive definitions or conventions. The "Granularity" information in PDI-Reference may have relevance in this section. In this way, the binding of content and functionality can be expressed and be kept | |||||
Descriptive Information | |||||
This could be generally derived from the Content Information and PDI – so file identification/location, and other descriptive information will be replicated here. There may also be a role for metadata in existing schemes (MARC, EAD, etc) or in core formats like Dublin Core. | |||||
Element | Examples/Definitions | Comments | Req. | Rep. | Source |
Title | Title of archived information object | Yes | Yes | ||
Author/Creator | Creator of intellectual content of the archived information object | Yes | Yes | ||
Description | Brief textual description of the archived information object. | This could contain an abstract. | Yes | Yes | |
Contributor | Other contributors to the intellectual content of the archived information object. | Yes | Yes | ||
Date | Date that object was made available in its current form. ISO 8601 format. | This could be problematical – there will be a need to define specific types of dates. | Yes | Yes | |
Language | Language of resource. Probably in NISO Z39.53. | Yes | Yes |
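The Req. and Rep. columns above lend themselves to a mechanical check. The following Python sketch, offered only as an assumption about how such a check might look, records the six descriptive elements with the values given in the table and reports missing required elements, unexpected repeats and unknown element names. The record layout (element name mapped to a list of values) and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Element name -> (required, repeatable), copied from the Req. and Rep. columns.
DESCRIPTIVE_ELEMENTS: Dict[str, Tuple[bool, bool]] = {
    "Title": (True, True),
    "Author/Creator": (True, True),
    "Description": (True, True),
    "Contributor": (True, True),
    "Date": (True, True),
    "Language": (True, True),
}


def check_record(record: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems found in a descriptive record (empty if none)."""
    problems = []
    for name, (required, repeatable) in DESCRIPTIVE_ELEMENTS.items():
        values = record.get(name, [])
        if required and not values:
            problems.append(f"missing required element: {name}")
        if not repeatable and len(values) > 1:
            problems.append(f"element may not repeat: {name}")
    for name in record:
        if name not in DESCRIPTIVE_ELEMENTS:
            problems.append(f"unknown element: {name}")
    return problems
```

The same table-driven approach would extend to the other element sets once their Req. and Rep. values have been agreed.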
Consultative Committee for Space Data Systems, 1998, Reference Model for an Open Archival Information System (OAIS), ed. L. Reich and D. Sawyer. CCSDS 650.0-W-4.0. White Book, Issue 4, 17 September. Latest version available from: <URL:http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html>
Day, M.W., 1998, Metadata for Preservation. CEDARS Project Document AIW01. <URL:http://www.ukoln.ac.uk/metadata/cedars/AIW01.html>
Version | Date | Comments |
---|---|---|
Version 1 | 25-Feb-1999 | First Draft by Andy Stone and Michael Day. |
Cedars is a CURL Project funded by the Joint Information Systems Committee through its Electronic Libraries Programme (eLib).
Created and maintained by: Andy Stone (OUCS) and Michael Day (UKOLN) of the Cedars Project.
Page created: 25-Feb-1999.
Last updated: 05-Mar-1999.