SARA (SGML Aware Retrieval Application) Workshop

This report was prepared for The British Library Research and Development Department by

Marc Fresko
Imaging and Information Technology Consultancy
Little Brownings 45A Welcomes Road Kenley Surrey CR8 5HA
Tel. 0181 645 9835 Fax 0181 645 9835 Email 100116.1152@compuserve.com

The illustration shows the results of an enquiry made with SARA on a subset of the British National Corpus. Here the SGML tags are displayed, at the user's request. CLAWS C5 wordclass codes have been used: the element w marks a word, the attribute VVN marks a past participle, PRP a preposition, and so on. The user could instead have selected a display without the SGML codes.

Overview
Background to the BNC and SARA
Requirements for SARA
SARA features
Architecture of SARA
Possible future developments
Workshop attendees' assessments
Sources of further information
Workshop programme
Attendees

Overview

A workshop was held in Oxford on 29th June 1994 to launch SARA, the SGML Aware Retrieval Application, to the academic community. This brief report summarises the workshop. It is intended for academic and other researchers who may be interested in using SARA. The workshop programme and a list of those attending are at the end of this report. SARA is a software product which has been developed as part of the British National Corpus (BNC) project. It is a powerful tool designed to:

allow rapid and sophisticated analysis of the BNC and other text corpora encoded using SGML;
allow the academic community access to the BNC as easily as possible.

On the day of the workshop, SARA was officially released in beta test form.

Background to the BNC and SARA

The British National Corpus is a three-year project to build a 100 million word corpus of contemporary (mostly 1975 to present) spoken and written British English. The corpus has been built from a wide variety of sources, including fiction and non-fiction books, academic journals, popular periodicals, unpublished materials, radio broadcasts, and transcriptions of spoken conversations. The figure of 100 million words has now been attained; the process of tagging each document and word is in its final stages (each document is described with SGML tags, and each word is likewise annotated with part-of-speech and other information). The entire tagged corpus is due to be released later in 1994. It will be freely available within the academic community. It is anticipated that the corpus will be used for diverse purposes, including reference book publishing, linguistic research, and the development of systems for natural language processing and artificial intelligence.

One of the requirements identified during the project was for software which will allow a wide range of users to perform analyses using the corpus. It was decided to develop software for this purpose, as no available package met the requirements (described below); this software is SARA.

The BNC project has been funded jointly by a grant from the Department of Trade and Industry and the (then) Science and Engineering Research Council's Joint Framework for Information Technology (JFIT). The project participants are Oxford University Press, Longman Group UK Ltd., Chambers Harrap, the British Library and the universities of Oxford (Oxford University Computing Services) and Lancaster (Unit for Computer Research in the English Language). Development of SARA has been funded by the British Library Research & Development Department.

Requirements for SARA

The main requirements for SARA software were identified as:

Speed: the ability to perform enquiries rapidly on very large corpora (eg the BNC with 100 million words);
SGML-awareness: the ability to search not only on content but also on SGML elements;
Ease of use: usability without requiring any understanding of SGML;
Wide access: usability with commonly available hardware/communications configurations, without the need for exotic or especially powerful configurations or connections;
Generalisation: usability with any corpus, ie with no assumptions specific to the BNC built into the software;
Public domain source: availability of the SARA source code in the public domain once it is released.

These requirements became an integral part of the design objectives of SARA.

SARA features

SARA allows enquiries on the content and structure of a corpus. The precise searches and enquiries possible on a given corpus will of course depend upon the nature and completeness of the markup applied to it. However, indicatively, SARA supports features such as the following:

Searches on words, truncated words and phrases
Searches on SGML tags and attributes
Combinatorial Boolean operations
Frequency counts
Lexicon, to allow identification of similar words (eg gumboot, gum-boot, gum-boots etc)
Refinement
Storing searches
Limiting scope of queries
Presentation of results: with or without SGML markup; in page or concordance format; with optional use of colour to enhance the display
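
A few of the features listed above can be illustrated with a minimal sketch in Python. It is not SARA code: the sample sentence, the function names and the display format are all assumptions made for illustration, and it works on a plain list of words rather than on an SGML-encoded corpus.

```python
from collections import Counter

# Hypothetical miniature text; not BNC data.
TEXT = "the cat sat on the mat and the dog sat by the door"


def truncated_search(tokens, prefix):
    """Return the distinct words that begin with the given prefix."""
    return sorted({tok for tok in tokens if tok.startswith(prefix)})


def concordance(tokens, keyword, width=2):
    """Return one keyword-in-context line per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


tokens = TEXT.split()
freq = Counter(tokens)                   # frequency counts
matches = truncated_search(tokens, "do")  # truncated-word search
kwic = concordance(tokens, "sat")         # concordance format

print(freq["the"])
print(matches)
for line in kwic:
    print(line)
```

In the concordance format each hit appears on its own line with a few words of context either side, which makes it easy to compare many occurrences at a glance.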

By way of illustration, some of the markup in the BNC relates to the social class of the speaker (in the case of spoken words); markup is also used to signify parts of speech. Thus, in the case of the BNC, SARA can be used to formulate a query equivalent to: How often do speakers of social class C1 use the word "input" as a verb?
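
The logic of such a query can be sketched in Python as follows. This is an illustrative sketch only, not the way SARA evaluates queries: the records are invented, the in-memory layout (one social class per utterance, words as pairs of word and wordclass code) is an assumption, and the test for "verb" relies on the CLAWS C5 convention that lexical verb codes begin with VV (VVN, mentioned earlier, is one of them).

```python
# Hypothetical, simplified records; not BNC data and not BNC markup.
# Each utterance carries the speaker's social class and a list of
# (word, CLAWS C5 wordclass code) pairs.
UTTERANCES = [
    {"social_class": "C1",
     "words": [("we", "PNP"), ("input", "VVB"), ("the", "AT0"), ("data", "NN0")]},
    {"social_class": "C1",
     "words": [("the", "AT0"), ("input", "NN1"), ("was", "VBD"), ("wrong", "AJ0")]},
    {"social_class": "AB",
     "words": [("they", "PNP"), ("input", "VVD"), ("it", "PNP")]},
    {"social_class": "C1",
     "words": [("she", "PNP"), ("has", "VHZ"), ("input", "VVN"), ("everything", "PNI")]},
]


def count_word_as_verb(utterances, word, social_class):
    """Count uses of `word` tagged as a lexical verb by speakers of one class."""
    total = 0
    for utterance in utterances:
        if utterance["social_class"] != social_class:
            continue  # restrict the scope of the query to one social class
        for token, code in utterance["words"]:
            # Assumption: any code starting with "VV" counts as a verb use.
            if token.lower() == word and code.startswith("VV"):
                total += 1
    return total


c1_verb_uses = count_word_as_verb(UTTERANCES, "input", "C1")
print(c1_verb_uses)
```

The query combines a condition on content (the word itself) with two conditions on markup (the speaker's class and the wordclass code), which is the kind of combination that SGML-awareness makes possible.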

Architecture of SARA

The SARA system consists of three components:

The Client Program: this component is the part of SARA which is visible to users. It consists of a user-friendly interface running on an MS-DOS PC with Windows software. It allows users to formulate and refine queries and to receive and navigate around responses. It is written in Microsoft C++, and is connected to the Server Program (see below) using Winsock software. The client is written with no knowledge of the SGML markup on the corpus being worked on; this information is passed to it by the server each time a connection is established between them. This allows the client to display dialogue boxes etc. which accurately reflect the SGML tags used to describe the documents and words in the corpus. The client additionally features extensive online help.
The Server Program: this component performs the queries requested by the user through the client program, then returns responses to the client. It executes on a Unix computer or on an MS-DOS PC which is also running the client; it is written in ANSI C. At present the server is running on a SparcStation at Oxford University Computing Services, and can be accessed over the Internet; however, it can also be run on other systems.
The Index Program: this component takes as input the corpus, and builds from it the numerous indexes used by SARA to effect complex searches with acceptable speed. It is used once (for a stable corpus), or periodically (for a changing corpus), before SARA is used for analysis. The approach chosen for indexing is to index at word level, with the inclusion of a dictionary.
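
The idea of indexing at word level with a dictionary can be sketched in Python as below. This is a generic positional index written for illustration, on the assumption that the dictionary maps each distinct word to a compact identifier; the report does not describe SARA's actual index structures, and all names here are hypothetical.

```python
from collections import defaultdict


def build_index(tokens):
    """Build a dictionary (word -> id) and postings (id -> word positions)."""
    dictionary = {}
    postings = defaultdict(list)
    for position, token in enumerate(tokens):
        # Assign the next free id the first time a word is seen.
        word_id = dictionary.setdefault(token, len(dictionary))
        postings[word_id].append(position)
    return dictionary, postings


def find_phrase(dictionary, postings, phrase):
    """Return the start positions of a phrase, using only the index."""
    words = phrase.split()
    if not words or any(word not in dictionary for word in words):
        return []
    position_sets = [set(postings[dictionary[word]]) for word in words]
    first_word_positions = postings[dictionary[words[0]]]
    return [
        start
        for start in first_word_positions
        if all(start + k in position_sets[k] for k in range(len(words)))
    ]


# Hypothetical miniature text; not BNC data.
tokens = "the cat sat on the mat and the dog sat by the door".split()
dictionary, postings = build_index(tokens)

print(find_phrase(dictionary, postings, "sat on"))
print(find_phrase(dictionary, postings, "the dog sat"))
```

Because the index is built once in advance, a search consults only the positions of the words in the query and does not need to scan the whole text, which is what keeps searches on a very large corpus acceptably fast.
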
The use of a client-server architecture, and particularly of a Windows-based client, means that the minimum hardware requirement for using SARA to work with the BNC is low, in line with the design objectives.
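
The division of knowledge between client and server described above can be sketched in Python. The sketch is a loose illustration, not the SARA protocol: it replaces the Winsock connection with a direct function call, uses JSON as a stand-in message format, and the element and attribute names are hypothetical.

```python
import json


class Server:
    """Stands in for the server, which alone knows the corpus markup."""

    def __init__(self, markup):
        self._markup = markup

    def handshake(self):
        # Sent once per connection, so the client need not hard-code any tags.
        return json.dumps({"type": "markup", "elements": self._markup})


class Client:
    """Stands in for the client, which starts with no markup knowledge."""

    def __init__(self):
        self.elements = {}

    def connect(self, server):
        message = json.loads(server.handshake())
        self.elements = message["elements"]

    def dialogue_choices(self):
        """List the element.attribute pairs a query dialogue could offer."""
        return sorted(
            f"{element}.{attribute}"
            for element, attributes in self.elements.items()
            for attribute in attributes
        )


# Hypothetical markup description: element name -> attribute names.
server = Server({"w": ["wordclass"], "u": ["speaker"]})
client = Client()
choices_before = client.dialogue_choices()
client.connect(server)
choices_after = client.dialogue_choices()

print(choices_before)
print(choices_after)
```

A client built this way can be pointed at a server holding a differently marked-up corpus without being changed, which is in keeping with the generalisation requirement.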

Possible future developments

As SARA is at an early stage of its development, feedback is actively sought from volunteer beta testers. Possible enhancements have already been identified, in the areas of advanced TCP/IP support (to enhance performance for certain types of enquiry); subcorpus selection; host-based features; smoother lexicon integration; and documentation on the product and on how to index corpora other than the BNC. Arrangements for future development and support are currently being finalised.

Workshop attendees' assessments

There was a general feeling that SARA can serve a useful purpose as a high-speed, SGML-based query tool. Attendees had opportunities for hands-on experimentation, and found that the system functions well. For most queries, response times were good (albeit with a corpus much smaller than the entire BNC), and the ease of use objective has been met. Almost all those present were sufficiently interested in the product to sign up as beta testers.

One attendee commented that SARA still contains some features specific to the BNC, a point the developers accepted; ideally, future development would make the system fully generic, so that it can be used with other corpora without limitation.

The question of ongoing support was discussed, together with the linked issue of placing the source code in the public domain. A number of attendees commented that a reliable support structure is essential if they are to release SARA for widespread use in their universities. Mechanisms for support are currently being considered.

Sources of further information

For further details about the BNC, please enquire by electronic mail to natcorp@computing-services.ox.ac.uk.

For further details about SARA, please enquire by electronic mail to Lou Burnard at lou@computing-services.ox.ac.uk.

A set of demonstration diskettes for SARA is available from Lou Burnard at the above address (the demonstration requires a PC with Windows and Winsock).

Workshop programme

10:15 Welcome and introduction to BNC and TEI ... Lou Burnard and Gavin Burnage
11:15 What SARA is and how it works ... Tony Dodd
12:45 Lunch
14:00 Hands-on demonstration
16:00 Review and discussion

Attendees

Glynis Baguley, Oxford University Computing Services
Gavin Burnage, Oxford University Computing Services
Lou Burnard, Oxford University Computing Services
Tony Dodd, SARA Project
Paul Ellison, University of Exeter
Steve Fligelstone, University of Lancaster
Nick Kemp, University of Sheffield
Mark Magennis, Glasgow University
Michael Oakes, University of Lancaster
Nick Porter, Survey of English Usage, University College London
Brian Robinson, University of Hertfordshire
Richard Tucker, University of Cambridge
Bill Worthington, University of Hertfordshire

British Library Reports and Papers
