Creating a search from Scan results

By Janifer Gatenby, Geac Computers
Version 1 July 9th 1999

The requirement

Scan returns results that consist of brief records representing rows from an
ordered list.  The results can be presented to an end user, enabling him or
her to browse forward and optionally backwards, then select a line for
further information or processing.  When a line is selected, the system will
format a search request in order to request an actual record.  Typical
examples are scans (browses) on AUTHOR, SUBJECT and TITLE.


Database Models

The method of constructing a search from scan results needs to accommodate
several database models.  The possible models are:

*       The scan index is derived from a database that is totally separate
from the bibliographic database.  Example: an authority database at LC and a
bibliographic database somewhere else.

*       The scan index is derived from a database that is separate from the
bibliographic database but it contains pointers to the linked bibliographic
records.  The same data occurs in both databases. Example: an Integrated
Library Management System with an authority database linking to a
bibliographic database of full MARC records that contains authorised data,
including authors and subjects.

*       The scan index is derived from a database, e.g. an authority
database, that is inter-linked with the bibliographic database: the records
in the bibliographic database contain links to the associated database and
vice versa, with no repetition of data.  Example: an Integrated Library
Management System with an authority database linking to a bibliographic
database.  To construct a full record, it is necessary to integrate data
from both databases.

*       The scan index is derived from the bibliographic database. Example:
a title index.


The way that Z39.50 scan "formally works" at the moment only really suits
the first model.  To do a follow-on search, the origin uses the same USE
attributes for the search as it used for the scan, and uses the data from
TERM as the search data.
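
The conventional follow-on search just described can be sketched as below.
This is only an illustration, not part of the proposal: the query is written
in PQF (the prefix query notation of the YAZ toolkit), which is an assumption
made here for readability, and the function name and the index-to-attribute
table are hypothetical.  The Use values shown are the Bib-1 values for
title, author and subject.

```python
# Bib-1 Use attribute values for the three typical scan indexes.
BIB1_USE = {"author": 1003, "title": 4, "subject": 21}


def follow_on_query(index: str, term: str) -> str:
    """Build a PQF query that reuses the scan's Use attribute and its TERM.

    Hypothetical helper: assumes the origin remembers which index it scanned
    and that backslash escaping is accepted inside a quoted PQF term.
    """
    use = BIB1_USE[index]
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'@attrset bib-1 @attr 1={use} "{escaped}"'


if __name__ == "__main__":
    # The TERM of the selected scan line becomes the search data.
    print(follow_on_query("title", "Psychology"))
```

Nothing in such a query identifies the particular entry that the user
selected; it only repeats the text of the term.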

The problem with using TERM for the follow-on search is that the resulting
search may not be precise enough.  There are a number of reasons for this.
Firstly, the term may have been truncated and so lack significant words that
are important for precision.  Secondly, the target may not support position
attributes such as FIRST IN FIELD or the structure attribute PHRASE, so the
search is constructed in an imprecise way and can retrieve unexpected
records even when a single, seemingly unique line has been selected from a
SCAN.  This is exacerbated when the TERM itself has multiple occurrences,
e.g. for a title that consists only of a common word such as "Psychology".
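
The loss of precision can be illustrated with a small simulation.  The
titles and record numbers below are invented purely for this sketch, and the
two matching functions are simplified stand-ins for how a target might
behave with and without support for PHRASE and FIRST IN FIELD; they do not
describe any real system.

```python
# Hypothetical title records, invented purely for illustration.
TITLES = {
    101: "Psychology",
    102: "Psychology of learning",
    103: "Introduction to psychology",
    104: "Abnormal psychology",
}


def words(text: str) -> list[str]:
    """Split a title or term into lower-case words."""
    return text.lower().split()


def word_match(term: str) -> list[int]:
    """Target without PHRASE / FIRST IN FIELD: term words may occur anywhere."""
    wanted = words(term)
    return [rid for rid, title in TITLES.items()
            if all(w in words(title) for w in wanted)]


def phrase_first_in_field(term: str) -> list[int]:
    """Target honouring PHRASE and FIRST IN FIELD: term words open the field."""
    wanted = words(term)
    return [rid for rid, title in TITLES.items()
            if words(title)[:len(wanted)] == wanted]


if __name__ == "__main__":
    print(word_match("Psychology"))             # all four sample records
    print(phrase_first_in_field("Psychology"))  # records 101 and 102 only
```

Even with both attributes supported, the selected line still maps to more
than one of the sample records, which is the situation a record identifier
would avoid.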

What is required is a means of using database links, where they exist, to
improve the precision of the follow-on search.

Scan elements

Which data elements of the scan results can an origin use in order to
construct a follow-on search?  The scan results contain TERM, which
represents the data that was matched against the scan attributes and is
normally the data that is used by the target for sequencing the scan
results.  The scan results also include DISPLAY TERM that gives the display
version of the term, e.g. data in upper and lower case, including
diacritical marks and initial articles.  The other element that could carry
significant information is otherTermInfo.

When database models 2, 3  or 4 are in use, it is desirable to send some
retrieval information in the SCAN results.

The Proposal

The proposal is to include this retrieval information in otherTermInfo in
the form of a Z39.50 URL.  The URLs given should relate to each term
occurrence.  This means that the "docid" given in otherTermInfo relates to
the term occurrence and not to database records related to the term, such as
bibliographic occurrences of an authority record.  The identifier also needs
to contain an identifier type to indicate whether it is identifying an
authority, bibliographic or holding record*.
(Personally, I would rather see the docid definition within the Z39.50 URL
broken into three elements, namely attribute set, attribute and identifier,
rather than just being defined as attribute set Bib1, Use attribute 12.
However, this requires a change to the URL, which has already been
registered as an RFC.)

Structure of the Z39.50 URL, indicating contents in the context of data to
be returned in otherTermInfo:

zscheme       = "z39.50r" (always)
Database      = name of database to search at the server host site
Docid         = the bibliographic or authority or holding identifier,
                preceded by identifier type
Elementset    = Blank - unspecified, therefore origin's choice
Recordsyntax  = Blank - unspecified, therefore origin's choice

Examples:

1.      A title scan performed on an index of a bibliographic database
(bib.file) produces a scan entry with a term occurrence of 5.  In
otherTermInfo, there are 5 urls containing identifiers of 5 bibliographic
records (111, 222, 333, 444 and 555), preceded by something (BREF)
indicating that they are bibliographic identifiers.  This could be a USE
attribute.

z39.50r://bib.file/BREF111;  z39.50r://bib.file/BREF222  etc.
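
On the target side, producing one URL per term occurrence for example 1
might look like the sketch below.  The function name is hypothetical, the
scheme spelling follows the structure list ("z39.50r"), and the layout of
database name followed by the typed identifier simply mirrors the example;
no host, element set or record syntax is included.

```python
def term_occurrence_urls(database, id_type, record_ids, scheme="z39.50r"):
    """Return one URL per record identifier, in the layout of example 1.

    Hypothetical helper: id_type is the prefix (e.g. "BREF") that marks the
    identifiers as bibliographic, authority or holding identifiers.
    """
    return [f"{scheme}://{database}/{id_type}{rid}" for rid in record_ids]


if __name__ == "__main__":
    # Five occurrences of the title term, hence five URLs in otherTermInfo.
    for url in term_occurrence_urls("bib.file", "BREF",
                                    [111, 222, 333, 444, 555]):
        print(url)
```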

2.      An authority scan, e.g. author or subject, performed on an index of
an authority database (auth.file) produces a scan entry with a term
occurrence of 1.  There are actually 3 bibliographic records associated with
this authority record.  In otherTermInfo of the scan response, there is one
url containing the identifier of the authority record (4544), preceded by
something (AREF) indicating that it is an authority identifier.

        z39.50r://auth.file/AREF4544
        z39.50r://bib.file/AREF4544

Where an authority file is linked to a bibliographic file as per database
models 2 and 3, it is possible to:

*       Scan the authority file, then search the authority file, e.g. to
retrieve a MARC authority record (first url used)
*       Scan the authority file, then search the bibliographic file, e.g. to
retrieve the MARC bibliographic record or records associated with the
authority record (second url used)
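
An origin choosing between the two URLs of example 2 could proceed roughly
as follows.  This is a sketch under stated assumptions: the URL layout is
the one used in the examples, the identifier type is assumed to be an
alphabetic prefix followed by a numeric identifier, and all names
(TermUrl, parse_term_url, url_for_database) are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class TermUrl(NamedTuple):
    scheme: str
    database: str
    id_type: str      # e.g. "AREF" or "BREF"
    identifier: str   # e.g. "4544"


# Assumed layout: scheme://database/<alphabetic type><numeric identifier>
_URL = re.compile(
    r"^(?P<scheme>[^:/]+)://(?P<database>[^/]+)/"
    r"(?P<id_type>[A-Za-z]+)(?P<identifier>\d+)$"
)


def parse_term_url(url: str) -> TermUrl:
    """Split a URL taken from otherTermInfo into its components."""
    match = _URL.match(url.strip())
    if match is None:
        raise ValueError(f"unrecognised URL layout: {url!r}")
    return TermUrl(**match.groupdict())


def url_for_database(urls, database: str) -> Optional[TermUrl]:
    """Pick the first URL that points at the database the origin wants."""
    for url in urls:
        parsed = parse_term_url(url)
        if parsed.database == database:
            return parsed
    return None


if __name__ == "__main__":
    other_term_info = ["z39.50r://auth.file/AREF4544",
                       "z39.50r://bib.file/AREF4544"]
    print(url_for_database(other_term_info, "auth.file"))  # authority record
    print(url_for_database(other_term_info, "bib.file"))   # linked bib records
```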


Upon receiving such a URL, the client initiates a session at the specified
host, port 210 (if necessary).  It follows with a Search Request with the
following parameters:

element set name      = as desired
preferredRecordSyntax = as desired
Query:
attribute set         = Bib-1
use attribute         = 1032 (docId)
structure attribute   = 104 (urx)
term                  = "BREF111" or "BREF222" or "AREF4544" etc.
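
The mapping from a URL to these search parameters can be sketched as below.
The function names and the dictionary keys are hypothetical, the URL layout
is the one used in the examples above, and the final query is again shown in
PQF (YAZ prefix notation) only as an assumed, convenient way of writing it
down.  In PQF, attribute type 1 is Use and type 4 is Structure.

```python
def search_request_from_url(url: str) -> dict:
    """Derive the follow-on Search Request parameters from one URL.

    Hypothetical helper: element set name and record syntax are left out
    because the proposal leaves them to the origin's choice.
    """
    scheme, sep, rest = url.strip().partition("://")
    database, slash, docid = rest.partition("/")
    if not (scheme and sep and database and slash and docid):
        raise ValueError(f"unrecognised URL layout: {url!r}")
    return {
        "database": database,
        "attribute_set": "Bib-1",
        "use": 1032,        # docId
        "structure": 104,   # urx
        "term": docid,      # identifier type plus identifier, e.g. "BREF111"
    }


def to_pqf(request: dict) -> str:
    """Write the query part of the request in PQF."""
    return (f'@attrset bib-1 @attr 1={request["use"]} '
            f'@attr 4={request["structure"]} "{request["term"]}"')


if __name__ == "__main__":
    request = search_request_from_url("z39.50r://bib.file/BREF111")
    print(request["database"])
    print(to_pqf(request))
```

Because the term is a record identifier rather than the text of the scan
line, the search does not depend on the target's support for position or
structure attributes on the original index.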