SARA (SGML Aware Retrieval Application) Workshop, 29th June 1994



This report was prepared for The British Library Research and Development Department by

Marc Fresko
Imaging and Information Technology Consultancy

Little Brownings 45A Welcomes Road Kenley Surrey CR8 5HA
Tel. 0181 645 9835 Fax 0181 645 9835 Email 100116.1152@compuserve.com

The illustration depicts the results of an enquiry made with SARA on a subset of the British National Corpus. In this illustration, the SGML tags are displayed at the user's request. CLAWS C5 wordclass codes have been used: the element w indicates a word, the attribute VVN indicates a past participle, PRP a preposition, and so on. The user could instead have selected a display without the SGML codes.

Contents

Overview
Background to the BNC and SARA
Requirements for SARA
SARA features
Architecture of SARA
Possible future developments
Workshop attendees' assessments
Sources of further information
Workshop programme
Attendees

Overview

A workshop was held in Oxford on 29th June 1994 to launch SARA, the SGML Aware Retrieval Application, into the academic community. This brief report summarises the workshop. It is intended for academic and other researchers who may be interested in using SARA. The workshop programme and a list of those attending are at the end of this report.

SARA is a software product which has been developed as a part of the British National Corpus (BNC) project. It is a powerful tool designed to:

Background to the BNC and SARA

The British National Corpus is a three-year project to build a 100 million word corpus of contemporary (mostly 1975 to present) spoken and written British English. The corpus has been built from a wide variety of sources, including fiction and non-fiction books, academic journals, popular periodicals, unpublished materials, radio broadcasts, and transcriptions of spoken conversations. The figure of 100 million words has now been attained, and the tagging process is in its final stages: each document is described with SGML tags, and each word is annotated with part-of-speech and other information. The entire tagged corpus is due to be released later in 1994. It will be freely available within the academic community. It is anticipated that the corpus will be used for diverse purposes, including reference book publishing, linguistic research, and the development of systems for natural language processing and artificial intelligence.

One of the requirements identified during the project was for software which will allow a wide range of users to perform analyses using the corpus. It was decided to develop software for this purpose, as no available package met the requirements (described below); this software is SARA.

The BNC project has been funded jointly by a grant from the Department of Trade and Industry and the (then) Science and Engineering Research Council's Joint Framework for Information Technology (JFIT). The project participants are Oxford University Press, Longman Group UK Ltd., Chambers Harrap, the British Library and the universities of Oxford (Oxford University Computing Services) and Lancaster (Unit for Computer Research in the English Language). Development of SARA has been funded by the British Library Research & Development Department.

Requirements for SARA

The main requirements for SARA software were identified as:

These requirements became an integral part of the design objectives of SARA.

SARA features

SARA allows enquiries on the content and structure of a corpus. The precise searches and enquiries possible on a given corpus will of course depend upon the nature and completeness of the markup applied to it. However, indicatively, SARA supports features such as the following:

By way of illustration, some of the markup in the BNC relates to the social class of the speaker (in the case of spoken words); markup is also used to signify parts of speech. Thus, in the case of the BNC, SARA can be used to formulate a query equivalent to: How often do speakers of social class C1 use the word "input" as a verb?
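The kind of query described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SARA's query language or the BNC's storage format: the `Token` and `Utterance` types, the `count_word_as_verb` function and the sample data are all hypothetical. The sketch also assumes that lexical verb codes in the CLAWS C5 tagset begin with "VV" (as VVN does), and that each spoken utterance carries the social class of its speaker.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Token:
    """One word plus its wordclass code (hypothetical structure)."""
    word: str
    wordclass: str  # CLAWS C5-style code, e.g. "VVN" for a past participle


@dataclass(frozen=True)
class Utterance:
    """A spoken utterance annotated with the speaker's social class."""
    social_class: str
    tokens: Tuple[Token, ...]


def count_word_as_verb(utterances: Iterable[Utterance],
                       word: str,
                       social_class: str) -> int:
    """Count uses of `word` tagged as a lexical verb by the given class.

    Assumption: lexical verb codes start with "VV".
    """
    target = word.lower()
    return sum(
        1
        for utterance in utterances
        if utterance.social_class == social_class
        for token in utterance.tokens
        if token.word.lower() == target and token.wordclass.startswith("VV")
    )


# Invented sample data; the tags other than the "VV" codes are illustrative.
sample = [
    Utterance("C1", (Token("We", "PNP"), Token("input", "VVB"),
                     Token("the", "AT0"), Token("data", "NN0"))),
    Utterance("C1", (Token("The", "AT0"), Token("input", "NN1"),
                     Token("was", "VBD"), Token("wrong", "AJ0"))),
    Utterance("AB", (Token("They", "PNP"), Token("input", "VVD"),
                     Token("it", "PNP"))),
]

c1_count = count_word_as_verb(sample, "input", "C1")
```

Here the second C1 utterance uses "input" as a noun, so it does not contribute to the count. A real enquiry would of course run against the indexed corpus rather than an in-memory list.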

Architecture of SARA

The SARA system consists of three components:

Possible future developments

As SARA is at an early stage of its development, feedback is actively sought from volunteer beta testers. Possible enhancements have already been identified, in the areas of advanced TCP/IP support (to enhance performance for certain types of enquiry); subcorpus selection; host-based features; smoother lexicon integration; and documentation on the product and on how to index corpora other than the BNC. Arrangements for future development and support are currently being finalised.

Workshop attendees' assessments

There was a general feeling that SARA can serve a useful purpose as a high-speed, SGML-based query tool. Attendees had opportunities for hands-on experimentation, and found that the system functions well. For most queries, response times were good (albeit with a corpus much smaller than the entire BNC), and the ease-of-use objective has been met. Almost all those present were sufficiently interested in the product to sign up as beta testers.

One attendee commented that some aspects of SARA are specific to the BNC, a point with which the developers agreed. Ideally, future development would make the system fully generic, so that it can be used with other corpora without restriction.

The question of ongoing support was discussed, together with the linked issue of placing the source code in the public domain. A number of attendees commented that a reliable support structure is essential if they are to release SARA for widespread use in their universities. Mechanisms for support are currently being considered.

Sources of further information

For further details about the BNC, please enquire by electronic mail to natcorp@computing-services.ox.ac.uk.

For further details about SARA, please enquire by electronic mail to Lou Burnard at lou@computing-services.ox.ac.uk.

A set of demonstration diskettes for SARA is available from Lou Burnard at the above address (the demonstration requires a PC with Windows and Winsock).

Workshop programme

Attendees

British Library Reports and Papers