Extending ROADS to Support Retrieval by URI


[This document is a part of DC-ROADS: ROADS as a (Dublin Core) Metadata Management Environment ]

Introduction

In many situations it would be useful to retrieve the metadata in a ROADS database that is associated with a particular URI. There is often no viable way to obtain the handle of the resource in order to retrieve its associated metadata.

ROADS records cannot be retrieved by URI because URIs do not survive the indexing process intact. This document describes how this problem can be addressed and deals with the problems that are introduced by indexing URIs.

The solution requires modifications to Index.pm and mkinv.pl, the module and script that control the creation of and access to the index. Two additional configuration files are also introduced.

Approach

Indexing

The first problem to address is that ROADS splits terms up based on punctuation. URIs need to survive intact. ROADS does allow the splitting to be controlled by setting the $IndexSplitPattern variable. However, we only want to give special treatment to URI fields (and not necessarily all of those).

We therefore need to allow a different split pattern to be specified for each attribute. (This may need to be extended to allow a different pattern for each attribute in each template.)

This is achieved using a configuration file, indexspecial, located in the config directory by default. The file contains IAFA-style entries (attribute name: pattern). The mkinv.pl script that performs the indexing reads this file and uses the specified split pattern for each attribute that appears in it; the default ROADS split pattern is used for attributes that do not have a specified pattern.
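The per-attribute lookup described above can be sketched as follows. This is an illustration in Python, not the Perl code in mkinv.pl. The default pattern shown is only a stand-in for the real $IndexSplitPattern, and the case-insensitive attribute lookup and the function names are assumptions.

```python
import re

# Assumed stand-in for the default ROADS split pattern: split on any run
# of non-alphanumeric characters. The real $IndexSplitPattern may differ.
DEFAULT_SPLIT = r"[^A-Za-z0-9]+"

def parse_indexspecial(text):
    """Read 'attribute name: pattern' lines into a dict (hypothetical helper)."""
    patterns = {}
    for line in text.splitlines():
        name, sep, pattern = line.partition(":")
        if sep and pattern.strip():
            # Lower-casing the attribute name is an assumption.
            patterns[name.strip().lower()] = pattern.strip()
    return patterns

def split_terms(attribute, value, patterns):
    """Split a value using the attribute's own pattern, else the default."""
    pattern = patterns.get(attribute.lower(), DEFAULT_SPLIT)
    return [term for term in re.split(pattern, value) if term]

patterns = parse_indexspecial("URI: \\s+\n")
print(split_terms("URI", "http://www.ukoln.ac.uk/metadata/", patterns))
print(split_terms("Title", "Metadata at UKOLN", patterns))
```

With the whitespace pattern the URI is kept as a single term, while the Title value is still split by the default pattern.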

Unfortunately, allowing URIs to survive intact in the index file causes problems. For a start, the colon is a special character in the index file and URIs contain colons. %-encoding of URIs goes some way towards resolving this problem, but basic %-encoding is not sufficient since it leaves colons unencoded. The current implementation therefore %-encodes every character that matches [^-A-Za-z0-9], that is, everything except letters, digits and the hyphen.
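A minimal sketch of this encoding in Python is given below. The character class is the one quoted above. The function name, the use of upper-case hex digits and the treatment of non-ASCII characters as UTF-8 bytes are assumptions, not details taken from the ROADS implementation.

```python
import re

# Every byte that is not a letter, a digit or a hyphen gets encoded.
UNSAFE = re.compile(rb"[^-A-Za-z0-9]")

def percent_encode(uri):
    """%-encode all characters matching [^-A-Za-z0-9] (hypothetical helper)."""
    data = uri.encode("utf-8")  # assumption: non-ASCII is encoded as UTF-8
    encoded = UNSAFE.sub(lambda m: b"%%%02X" % m.group()[0], data)
    return encoded.decode("ascii")

print(percent_encode("http://www.ukoln.ac.uk/metadata/"))
# http%3A%2F%2Fwww%2Eukoln%2Eac%2Euk%2Fmetadata%2F
```

The result contains no colons, and no punctuation other than % and the hyphen.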

%-encoding URIs does not mean that the default split pattern can be used, because that pattern would split on the % characters. Since all whitespace characters within a URI will be %-encoded, it is sufficient to split on whitespace (this also permits multiple whitespace-separated %-encoded URIs to be stored against a single attribute).
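The difference between the two split patterns can be seen in a small Python illustration. The example.org URIs are made up, and the punctuation-based pattern is only an assumed stand-in for the default ROADS split pattern.

```python
import re

# Two %-encoded URIs stored against one attribute, separated by a space.
stored = "http%3A%2F%2Fexample%2Eorg%2Fa http%3A%2F%2Fexample%2Eorg%2Fb"

# Splitting on whitespace keeps each encoded URI whole.
whole = stored.split()

# Splitting on punctuation (assumed default behaviour) breaks at every %.
broken = [t for t in re.split(r"[^A-Za-z0-9]+", stored) if t]

print(whole)
print(broken[:3])  # ['http', '3A', '2F']
```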

At this stage encoded URIs survive intact in the index and do not contain special characters.

Unfortunately, the % characters used in the %-encoding cause a problem when searching the index. Queries containing %-encoded strings appear to be acceptable in WHOIS++ (this still needs confirmation) and pass through to the server correctly. Problems occur only when the query reaches the Index.pm module. This was remedied by adding the % character to the list of acceptable characters in the regular expressions used in the term subroutine in Index.pm. It is hoped that this does not cause any unforeseen problems.
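The effect of the change can be illustrated in Python. The two character classes below are hypothetical; they are not the actual regular expressions from the term subroutine in Index.pm, which are not reproduced in this document.

```python
import re

# Hypothetical character classes for an acceptable search term,
# before and after % is added to the list of acceptable characters.
TERM_BEFORE = re.compile(r"[-A-Za-z0-9]+")
TERM_AFTER = re.compile(r"[-A-Za-z0-9%]+")

def accepts(pattern, term):
    """Return True if the whole term is made of acceptable characters."""
    return pattern.fullmatch(term) is not None

encoded = "http%3A%2F%2Fexample%2Eorg"
print(accepts(TERM_BEFORE, encoded))  # False
print(accepts(TERM_AFTER, encoded))   # True
```

Ordinary terms such as "metadata" are accepted by both patterns, so under these assumptions the change only widens what is accepted.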

With these modifications to ROADS indexing it is possible to retrieve ROADS templates by URI.

This approach also supports multiple URIs associated with the same record and expressed as variants. The record can be retrieved by any of the URIs.

Multiple records with the same URI will all be returned. If a unique response is required then care should be taken to ensure that there are no duplicates; the ROADS admin centre has a tool for checking for duplicate URIs.

Encoding URIs

In order for the above approach to work, URIs must be %-encoded before they are entered into the database.

This approach is acceptable if the entries are generated automatically, but it is not suitable if they are created by hand. If records containing %-encoded strings are loaded into the CGI-based ROADS editor then they will be decoded when they are saved.

A more general approach is to add support for %-encoding to ROADS itself. This was achieved by modifying mkinv.pl to look for attributes to be encoded in the configuration file config/percentencode, which contains attribute-encoding pairs. Terms corresponding to attributes that have a specified encoding are %-encoded before being entered into the index. Combining this technique with an appropriate split pattern (as described above) allows selected attributes to be %-encoded in the index.
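How the two configuration files might work together at indexing time is sketched below in Python. The format of the "encoding" half of each pair is not given in this document; the sketch assumes it is a regular expression matching the characters to encode. The default split pattern and all function names are likewise assumptions.

```python
import re

DEFAULT_SPLIT = r"[^A-Za-z0-9]+"  # assumed stand-in for the ROADS default

def parse_pairs(text):
    """Read 'attribute: value' lines, as in indexspecial or percentencode."""
    pairs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            pairs[name.strip().lower()] = value.strip()
    return pairs

def encode_term(term, unsafe):
    """%-encode every byte of each match of the 'unsafe' pattern."""
    def repl(match):
        return b"".join(b"%%%02X" % byte for byte in match.group())
    data = term.encode("utf-8")  # assumption: UTF-8 for non-ASCII input
    return re.sub(unsafe.encode("ascii"), repl, data).decode("ascii")

def index_terms(attribute, value, encodings, splits):
    """Split the value first, then encode each term if the attribute asks for it."""
    key = attribute.lower()
    terms = [t for t in re.split(splits.get(key, DEFAULT_SPLIT), value) if t]
    unsafe = encodings.get(key)
    if unsafe is not None:
        terms = [encode_term(t, unsafe) for t in terms]
    return terms

splits = parse_pairs("URI: \\s+")               # as in indexspecial
encodings = parse_pairs("URI: [^-A-Za-z0-9]")   # assumed percentencode format

print(index_terms("URI", "http://example.org/a", encodings, splits))
print(index_terms("Title", "A title", encodings, splits))
```

Only the index terms are encoded here; the value held in the record itself is left untouched.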

This approach has the added advantage that the existing URI attribute can be used for both indexing and pointing to the resource during searching or browsing via the ROADS interface. If the actual entry was encoded then it couldn't be used to refer to the resource when searching or browsing the ROADS database; a separate %-encoded attribute would need to be introduced.

Of course, if you want a value to appear in the index both encoded and unencoded (with appropriate split patterns in each case) then the attribute will need to be repeated in the template. This is unlikely to occur very often in practice and the workaround could easily be automated when it does.

Retrieving a ROADS Record by URI

In order to retrieve a ROADS record by URI it is necessary to encode the URI using the same algorithm used to encode URIs as they are entered into ROADS. Note that care should be taken when encoding URIs that may already be partially %-encoded!
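The pitfall with partially encoded URIs can be shown with a short Python example. Decoding before encoding is only one possible policy, offered here as an assumption; whatever policy is chosen must be applied identically when indexing and when querying. The example.org URIs and the function names are made up.

```python
import re
from urllib.parse import unquote

UNSAFE = re.compile(rb"[^-A-Za-z0-9]")

def percent_encode(uri):
    """%-encode all characters matching [^-A-Za-z0-9] (hypothetical helper)."""
    data = uri.encode("utf-8")
    return UNSAFE.sub(lambda m: b"%%%02X" % m.group()[0], data).decode("ascii")

def normalise(uri):
    """One possible policy: undo any existing %-encoding, then encode once."""
    return percent_encode(unquote(uri))

raw = "http://example.org/a b"
partly = "http://example.org/a%20b"

# Encoding directly treats the existing % as data, so the two forms differ.
print(percent_encode(raw))     # http%3A%2F%2Fexample%2Eorg%2Fa%20b
print(percent_encode(partly))  # http%3A%2F%2Fexample%2Eorg%2Fa%2520b

# Decoding first makes both forms produce the same index term.
print(normalise(raw) == normalise(partly))  # True
```

Decoding first has its own cost: a URI in which a literal % is followed by two hex digits would be altered, so this policy may not suit every collection.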

The WHOIS++ query should specify an exact match to avoid partial URI matches. Of course, there may be cases where a partial match is useful.
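A query of this kind might be built as shown below. The sketch assumes the RFC 1835 style of an attribute=value term with a local constraint attached after a semicolon. The attribute name URI and the function names are assumptions, and this form has not been checked against a ROADS server.

```python
import re

UNSAFE = re.compile(rb"[^-A-Za-z0-9]")

def percent_encode(uri):
    """%-encode all characters matching [^-A-Za-z0-9] (hypothetical helper)."""
    data = uri.encode("utf-8")
    return UNSAFE.sub(lambda m: b"%%%02X" % m.group()[0], data).decode("ascii")

def exact_uri_query(uri, attribute="URI"):
    """Build an attribute-specific WHOIS++ search term with an exact-match constraint."""
    return "%s=%s;search=exact" % (attribute, percent_encode(uri))

print(exact_uri_query("http://example.org/a"))
# URI=http%3A%2F%2Fexample%2Eorg%2Fa;search=exact
```

Because the encoded value contains only letters, digits, hyphens and % characters, it contains none of the characters that are special in a WHOIS++ query.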

Software

Hopefully, this functionality will be available in a future version of the ROADS distribution.

Please get in touch with the author if you would like to use this technique with a current version of ROADS.

Summary

We have suggested an approach to retrieving ROADS records by URI. An extra attribute is added to a record for each URI by which the record should be retrievable; the value of this attribute is the URI %-encoded to a particular specification.

The mkinv.pl program was extended to allow index split patterns to be specified on a per-attribute basis and to %-encode the terms of specified attributes.

A minor change was made to Index.pm to allow % characters to appear in search strings.

Using this approach it is possible to retrieve a ROADS record by URI provided the URI is first encoded to the specification given in this document.

[Next: Browser Extension for ROADS Metadata Management ]


Web page authored and maintained by:
Page created on: 10-May-99
Page last revised on: Monday, 05-Jul-1999 14:01:01 UTC
