A review of metadata: a survey of current resource description formats
Work Package 3 of Telematics for Research project DESIRE (RE 1004)
The Text Encoding Initiative (TEI) Guidelines were published in 1994 as the result of an international research project that started in 1987. The Guidelines consist of a 1,400-page manual, available in print or as an electronic document on the Internet.
Burnard describes the goal of the TEI project as
to define a set of generic guidelines for the representation of textual materials in electronic form, in such a way as to enable researchers in any discipline to interchange and re-use resources, independently of software, hardware, and application area. (Lou Burnard. The Text Encoding Initiative Guidelines. <URL:ftp://info.ox.ac.uk/pub/ota/TEI/doc/teij31.sgml>).
TEI is a joint project sponsored by three professional bodies: the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The project was funded jointly by the US National Endowment for the Humanities and the European Union 3rd Framework Programme for Linguistic Research and Engineering. At present the project has two more years of funding from the US for tutorial and dissemination work. Members of the academic community in the US and Europe have taken part in the project, forming a number of committees to consider different aspects of the encoding guidelines.
The TEI aimed to reach agreement on encoding text across a range of disciplines. The TEI Guidelines, despite their origins in the humanities and linguistics, were designed to form an extensible framework which could be used to describe all kinds of texts.
The TEI Guidelines specify that every TEI text must be preceded by a TEI header that describes the text. The header specification was formulated as part of the project by the Committee on Text Documentation, comprising librarians and archivists from Europe and North America, and the overall layout is grounded in a cataloguing tradition.
The TEI header can be used in different operational settings. Firstly it can exist as part of a conformant text. In this context the header might be created by the author or publisher as part of the original encoding; or it might be created during the TEI encoding of an existing document when it is used in a research or archival environment. Researchers can use the header in the process of textual analysis or, as is the case in a growing number of text archives, TEI headers are used as a means of bibliographic control.
The TEI Guidelines suggest that headers can be used in a second way by those libraries, research sites and indeed text archives who wish to build up databases of records referring to TEI encoded text held at remote sites. The Guidelines lay down a framework for 'independent headers', that is headers that can be stored separately from the text to which they refer. Independent headers are free-standing TEI headers which can be used in catalogues or databases to refer to a remote TEI encoded text.
A third possibility, not outlined in the Guidelines, is that independent headers could be used to describe networked resources which are not necessarily themselves TEI encoded. It is in this third context that independent headers could be described as metadata in the sense defined in this review. (It is assumed that metadata should be capable of describing any networked resource, not that there must be a necessary relation between the structure of the electronic data in the resource and the metadata format.)
The level of difficulty in creating TEI headers depends on the amount of detailed information entered in the header, and the conformance of the content to external rules such as AACR2. If an independent header is to be created which contains the same content as a MARC record with the same adherence to cataloguing practice then the same level of skill would be required as for library cataloguing. If the header is to include details on encoding, profile and revision (see below) then this also requires detailed knowledge of the text. However the ethos of the TEI Guidelines is flexibility: the level of encoding detail can suit the requirements of the situation. Thus it would be possible for an author or 'publisher' of an electronic text to create a simple TEI header. This header could then be elaborated if required by an archive administrator.
Although the Guidelines recommend that TEI independent headers should be detailed, this recommendation is in the context of an archive. It would be possible for metadata records to be created using simplified content not in conformance to AACR2. Indeed the need for a simplified version of the full guidelines has been recognised. A subset comprising a 'manageable selection' of the full DTD has now been issued as TEI Lite (Lou Burnard. What is TEI Lite? <URL:ftp://info.ox.ac.uk:80/~archive/teij31/WHAT.html>). This subset includes the majority of the TEI core tag set and is designed to be sufficient to handle most texts to a reasonable level of detail. TEI Lite is in use by the Oxford Text Archive for the encoding of its own texts.
TEI headers conform to SGML, which is specified in the international standard ISO 8879:1986.
The TEI Guidelines define textual features in terms of Standard Generalized Markup Language (SGML) elements and attributes, grouped into sets of tags. SGML aims to provide for mark-up of text in schemas which are hardware, software and application independent. SGML allows for a family of encoding schemes each with their own document type definition (DTD). TEI is a particular instance of a DTD; one that offers an extensible framework consisting of a core set of features with a variety of optional additions. Within TEI it is possible to build a customised DTD, appropriate to the document being encoded, by declaration of tag sets being used. The independent header has its own auxiliary DTD set out in the Guidelines.
SGML provides a framework for defining data in terms of elements and attributes. In SGML schemes these terms have particular meanings, different from their usage in other metadata formats. An element is a textual unit such as a paragraph; within the header an element would be a unit such as a title or author. An attribute gives information about a particular occurrence of an element and is structured as an attribute/value pair. For example, in the Profile Description the <textClass> element holds the subject headings themselves, and the controlled vocabulary used is identified by the scheme attribute of the <keywords> element, as in <keywords scheme=LCSH>.
The various elements in TEI are grouped into tag sets:
· core sets: elements likely to be needed by all documents
· base sets: element sets appropriate for particular classes of document e.g. verse, prose, drama
· additional sets: elements appropriate for the specialised or detailed treatment of text in different classes of document.
· auxiliary sets: elements with specialised roles e.g. the independent header DTD
The tag sets are extensible to enable mark up of new sorts of material.
The TEI header forms one of the two core tag sets available by default to all TEI DTDs. Presence of the TEI header is mandatory in a TEI encoded text. The TEI header is made up of:
· File Description: the bibliographic characteristics of the document and its source
· Encoding Description: editorial decisions regarding treatment of the text and details of the editorial process as well as decisions on the treatment of blank lines, indents etc.
· Profile Description: additional non-bibliographic information giving the context in which the text was produced e.g. language, details of participants, subject classification
· Revision Description: details of updates, amendments to the text.
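The four-part layout listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header is written as XML rather than the SGML the Guidelines define, the function name build_header and the sample values are hypothetical, and the three optional parts are left as empty placeholders, which a real DTD would probably not accept.

```python
import xml.etree.ElementTree as ET


def build_header(title, publisher, source_note):
    """Build a skeleton header with the four parts named in the text.

    Hypothetical helper: XML is used here for convenience, although the
    Guidelines discussed in this review define the header in SGML.
    """
    header = ET.Element("teiHeader")

    # File Description: bibliographic details of the document and its source.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "publisher").text = publisher
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source_note

    # Encoding, Profile and Revision Descriptions: empty placeholders only.
    for name in ("encodingDesc", "profileDesc", "revisionDesc"):
        ET.SubElement(header, name)
    return header


header = build_header(
    "An example text",
    "An example archive",
    "Transcribed from a print edition.",
)
print(ET.tostring(header, encoding="unicode"))
```

A simple header of this kind is what an author or publisher might supply, leaving the placeholder parts to be filled in later by an archive administrator.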
Within the header, elements may be indicated as being in free prose, or as being structured statements.
The independent header has the same structure as the TEI header but more guidelines on content. The independent header has more mandatory and recommended elements and the Guidelines recommend it should contain structured information rather than unstructured prose.
The File Description is the only mandatory part of the header. It contains a bibliographic description of the resource in the form of title, edition, publication and series statements. Within each element there is detailed bibliographic information e.g. the title statement includes information on intellectual responsibility, specifying author, sponsor, funder, principal researcher, and other contributions. However, the form of the author's name (personal, corporate or meeting) is not recorded.
Within the File Description the title, publication and source are mandatory for all TEI headers, but several more elements are recommended for independent headers.
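A check for the three mandatory statements could look something like the sketch below. It assumes an XML serialisation of the header and treats the mandatory parts as the titleStmt, publicationStmt and sourceDesc elements; the function name missing_mandatory and the sample record are hypothetical, and the check is far weaker than validation against the DTD.

```python
import xml.etree.ElementTree as ET

# Statements the text describes as mandatory inside the File Description.
REQUIRED_IN_FILE_DESC = ("titleStmt", "publicationStmt", "sourceDesc")


def missing_mandatory(header_xml):
    """Return the paths of mandatory parts absent from a header (hypothetical helper)."""
    header = ET.fromstring(header_xml)
    file_desc = header.find("fileDesc")
    if file_desc is None:
        # Without a File Description nothing else can be checked.
        return ["fileDesc"]
    return [
        "fileDesc/" + name
        for name in REQUIRED_IN_FILE_DESC
        if file_desc.find(name) is None
    ]


# Sample record with the publication statement deliberately left out.
sample = """<teiHeader>
  <fileDesc>
    <titleStmt><title>An example text</title></titleStmt>
    <sourceDesc><p>Transcribed from a print edition.</p></sourceDesc>
  </fileDesc>
</teiHeader>"""

print(missing_mandatory(sample))  # ['fileDesc/publicationStmt']
```

A database of independent headers could run a check like this on incoming records, extending the tuple with whichever recommended elements it chooses to require.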
The File Description contains detailed structured information and is modelled on the standards and practice of the library cataloguing tradition.
Within the Guidelines there is consideration of the conversion of TEI headers to USMARC records. The Guidelines include detailed suggestions for mapping particular TEI elements to USMARC tags, but acknowledge that human intervention would be required to create a quality MARC record. There is no attempt in TEI markup to identify the author 'main entry', neither is the personal name format prescribed. Much of the non-bibliographic information would have no definitive resting place in MARC and would need to be moved to Notes fields.
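The general shape of such a conversion can be sketched as follows. The mapping table is an illustrative assumption, not the mapping given in the Guidelines: it uses the MARC tags 245 (title statement), 260 (publication details) and 500 (general note), the function name to_marc_like is hypothetical, and the header is assumed to be available as XML. As the text notes, the output would still need a cataloguer's attention, for example to choose a main entry.

```python
import xml.etree.ElementTree as ET

# Illustrative mapping only; the Guidelines' own suggestions are more detailed.
FIELD_MAP = {
    "fileDesc/titleStmt/title": "245",
    "fileDesc/publicationStmt/publisher": "260",
}
NOTE_TAG = "500"  # general note, used here for non-bibliographic parts


def to_marc_like(header_xml):
    """Return (tag, text) pairs from a header; a sketch, not a MARC record."""
    header = ET.fromstring(header_xml)
    fields = []
    for path, tag in FIELD_MAP.items():
        for element in header.findall(path):
            text = (element.text or "").strip()
            if text:
                fields.append((tag, text))
    # Parts with no obvious MARC home are flattened into note fields.
    for part in ("encodingDesc", "profileDesc", "revisionDesc"):
        node = header.find(part)
        if node is not None:
            text = " ".join("".join(node.itertext()).split())
            if text:
                fields.append((NOTE_TAG, text))
    return fields


sample = """<teiHeader>
  <fileDesc>
    <titleStmt><title>An example text</title></titleStmt>
    <publicationStmt><publisher>An example archive</publisher></publicationStmt>
    <sourceDesc><p>Transcribed from a print edition.</p></sourceDesc>
  </fileDesc>
  <profileDesc><langUsage><language>English</language></langUsage></profileDesc>
</teiHeader>"""

fields = to_marc_like(sample)
for tag, text in fields:
    print(tag, text)
```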
In the independent header the usefulness of the profile, encoding and revision descriptions would be limited for analysis purposes unless the text was TEI encoded. Much of their usefulness depends on pointers in the electronic text to the header, relating information together.
There is no provision for including location information, library call numbers or electronic addresses within the header. There is no consideration within the Guidelines of the description of services so there is no provision for host administrative details. However the flexible nature of TEI means that the tag sets could be extended to include this information.
In the Profile Description there is a <textClass> element to identify the subject headings for a text. If a controlled vocabulary is used for the subject keywords then the scheme is identified by an attribute e.g. <keywords scheme=LCSH>; for classification numbers, schemes are identified in a similar way e.g. <classCode scheme=DDC19>; if a user defined scheme is used, the category is identified by the <catRef> element.
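Reading the scheme attribute back out of a record might look like the sketch below. It assumes an XML rendering of the Profile Description with quoted attribute values, the element spelling classCode, and keywords held in a list of item elements; the function name subject_terms, the headings and the classification number are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# Placeholder record; the headings and class number are not real cataloguing data.
sample = """<profileDesc>
  <textClass>
    <keywords scheme="LCSH">
      <list>
        <item>Example subject heading</item>
        <item>Another example heading</item>
      </list>
    </keywords>
    <classCode scheme="DDC19">000</classCode>
  </textClass>
</profileDesc>"""


def subject_terms(profile_xml):
    """Return (scheme, value) pairs for keywords and classification codes."""
    profile = ET.fromstring(profile_xml)
    terms = []
    for keywords in profile.iter("keywords"):
        scheme = keywords.get("scheme")
        for item in keywords.iter("item"):
            terms.append((scheme, (item.text or "").strip()))
    for code in profile.iter("classCode"):
        terms.append((code.get("scheme"), (code.text or "").strip()))
    return terms


for scheme, value in subject_terms(sample):
    print(scheme, value)
```

Keeping the scheme alongside each value lets a catalogue index headings from different vocabularies separately.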
Any revision history of the resource itself can be included in the Revision Description. All changes to the machine readable data should be recorded in this part of the header.
There is no provision for data about the header itself to be included in the header.
Information on the source of the electronic text should be included in the source description. The editorial declaration within the encoding description also allows for an explanation of editorial policy in the encoding of the text e.g. if spellings were corrected.
Where structured information is included in appropriate elements then the Guidelines give rules which follow AACR2 and ISBD. Those elements that are unstructured contain free text.
The profile description is used to specify languages used in the document.
The ability to express multi-lingual characters depends on the implementation of the TEI header. The Guidelines do not specify any one particular character set; as with all SGML markup, the Guidelines are software and application independent.
The source description allows for analytic references to be included, where an item is part of a larger collection. The type attribute can be used to distinguish the main title from subordinate, parallel or other titles.
The role of the TEI independent header is so flexible that it can include large amounts of detail to enable analysis of text or it can be used in a simplified version to provide a known audience with bibliographic access to a collection of documents. This flexibility might well lead to difficulties if record creation occurs in a distributed model as the level of tagging complexity, the richness of the record content, might vary considerably. It is desirable that all the headers in a particular database should have a comparable level of detail. Unless there is uniformity in the level of detail across the database, retrieval will suffer. This difficulty in controlling the level of detail would increase in a distributed environment and could lead to problems with interoperability and record sharing.
Independent headers can be manipulated, searched and retrieved by any software that deals with SGML records e.g. Panorama, but as yet there is no provision within Internet search and retrieve protocols for TEI headers. Some research work is proposed to incorporate SGML DTDs into the experimental URCs.
There are few implementations. The majority of present implementations are in humanities archives e.g. the Oxford Text Archive (<URL:http://sable.ox.ac.uk/ota/>) and the Electronic Text Center at the University of Virginia (<URL:http://www.lib.virginia.edu/etext/ETC.html>).
Related European projects include EAGLES, the Expert Advisory Group on Language Engineering Standards (<URL:http://coral.lili.uni-bielefeld.de/~gibbon/EAGLES/rwpaper/node5.html>), and Multext-East (<URL:http://nl.ijs.si/ME/>).