4 The WebWatch Robot


Robot Software

The main software requirement was a robot that could satisfy the WebWatch criteria. Since the majority of available robots were optimised for indexing, most did not meet them.

Initially, the Harvest gatherer [4] was used as the WebWatch robot. The gatherer is the robot component of the Harvest indexing suite of programs. It identifies resources using configurable file-naming heuristics (generally file suffixes) and the Unix file utility, which determines file types. When a resource is recognised, it may be summarised by a type-specific summariser. Such a summariser takes the appropriate resource (e.g. an HTML document) as input and outputs a summary in SOIF (Summary Object Interchange Format). Thus, when a robot crawl is completed, the end product is a file of SOIF summaries for the encountered resources.
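As an illustration of what a summariser emits, the following minimal Python sketch serialises one resource as a SOIF-style template. It assumes the commonly described SOIF layout (a template type and URL, followed by attributes whose values are prefixed by their size in bytes). The function name to_soif and the attribute names in the usage note are hypothetical and are not taken from the Harvest or WebWatch code.

```python
def to_soif(template_type, url, attributes):
    """Serialise one resource summary as a SOIF-style template.

    Assumed layout: '@TYPE { URL', then one 'Name{size}:<TAB>value'
    line per attribute, then a closing '}'.
    """
    lines = ["@%s { %s" % (template_type, url)]
    for name, value in attributes.items():
        # The size is the length of the value in bytes, not characters.
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Calling to_soif("FILE", "http://www.example.org/", {"Title": "Home"}) would produce a three-line template whose middle line is Title{4}: followed by a tab and the value. Because each value carries its byte count, a reader can skip over values that themselves contain newlines or braces.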

The gatherer may be configured to crawl depth-first or breadth-first and can ignore resources based upon type or upon a regular expression matched against each URL. Generally, the gatherer crawls through the initial server (although it can be configured to crawl any number of additional servers found from links). Default constraints on robot navigation (such as maximum depth and maximum URLs) can also be altered.
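The crawl options described above can be sketched in Python as follows. This is not the gatherer's algorithm; it is a simplified model, run over an in-memory link graph rather than live HTTP, that shows how crawl order, a URL exclusion pattern, a maximum depth and a maximum URL count interact. The function crawl_order and all of its parameter names are assumptions made for illustration.

```python
import re
from collections import deque

def crawl_order(links, start, breadth_first=True, max_depth=2,
                max_urls=100, exclude=None):
    """Return URLs in the order a simple robot would visit them.

    links maps each URL to the list of URLs it links to (hypothetical
    test data standing in for fetched pages).
    """
    skip = re.compile(exclude) if exclude else None
    pending = deque([(start, 0)])
    seen = {start}
    visited = []
    while pending and len(visited) < max_urls:
        # A queue gives breadth-first order; a stack gives depth-first.
        url, depth = pending.popleft() if breadth_first else pending.pop()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target in seen or (skip and skip.search(target)):
                continue
            seen.add(target)
            pending.append((target, depth + 1))
    return visited
```

With a graph such as {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"]}, the breadth-first setting visits both top-level pages before either of their children, whereas the depth-first setting exhausts one branch before starting the next.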

After the initial WebWatch crawl, we observed that the gatherer had a number of deficiencies with respect to our evolving robot criteria. These were:

* No resource identification by HTTP MIME types

* Non-configurable link extraction from HTML documents

* No differentiation between inline objects and hyperlinked objects

* The long-winded nature of altering the gatherer's C source code
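To make the first three deficiencies concrete, here is a minimal Python sketch of the behaviour a robot would need: identifying a resource from its HTTP Content-Type header, and extracting links under a configurable rule table that separates inline objects from hyperlinked ones. It uses the standard html.parser module. The names LINK_RULES, LinkExtractor, extract_links and mime_type, and the particular tag and attribute pairs chosen, are assumptions for illustration and do not describe the WebWatch robot's actual code.

```python
from html.parser import HTMLParser

# Assumed, editable table: (tag, attribute) -> kind of link.
LINK_RULES = {
    ("a", "href"): "hyperlink",
    ("img", "src"): "inline",
    ("frame", "src"): "inline",
}

def mime_type(content_type):
    """Return the bare media type from an HTTP Content-Type header value."""
    return content_type.split(";", 1)[0].strip().lower()

class LinkExtractor(HTMLParser):
    """Collect (kind, url) pairs according to a rule table."""

    def __init__(self, rules=LINK_RULES):
        super().__init__()
        self.rules = rules
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser supplies tag and attribute names in lower case.
        for name, value in attrs:
            kind = self.rules.get((tag, name))
            if kind and value:
                self.found.append((kind, value))

def extract_links(html, rules=LINK_RULES):
    parser = LinkExtractor(rules)
    parser.feed(html)
    parser.close()
    return parser.found
```

Keeping the rules in a table means that link extraction can be reconfigured without touching the parsing code, which is the kind of change that was awkward to make in the gatherer.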

Although we attempted to resolve some of these problems by altering the gatherer code, this quickly proved to be a long-winded process. Consequently, the next version of the robot was custom-written in Perl. The Perl-based robot has gone through three versions as it was optimised for our purposes. The Perl robots operate in a similar fashion to the gatherer described above, but with the identified deficiencies corrected. The output is still SOIF.

Recently, a Perl implementation of the Harvest gatherer, called Harvest-NG, has been released. Based on informal communication with the Harvest-NG developers and on the Harvest-NG documentation, Harvest-NG should be considered for future WebWatch-type work.

Processing Software

In the SOIF output, each template represents a single resource. A template contains resource-specific information; for example, an HTML summary records all the HTML elements and attributes used. Some information is common to all resources, such as the collection time and the HTTP headers associated with the resource.

Each SOIF template is mapped onto a comma-separated values (CSV) record. All such records can then be loaded into a spreadsheet or statistics package (we have been using a mixture of Microsoft Excel and SPSS for Windows) for statistical analysis.
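The mapping from templates to CSV records might look like the following Python sketch. It assumes the SOIF layout in which each attribute line has the form Name{size}: followed by a tab and a value of exactly that many bytes. The functions parse_soif and soif_to_csv, and the column names in the usage note, are hypothetical; the WebWatch processing scripts themselves are not reproduced here.

```python
import csv
import io
import re

# Assumed SOIF layout: '@TYPE { URL' header, then 'Name{size}:<TAB>value' lines.
HEADER = re.compile(rb"@(\S+) \{ (\S+)\n")
ATTRIBUTE = re.compile(rb"([^\s{}]+)\{(\d+)\}:\t")

def parse_soif(data):
    """Parse SOIF-style bytes into a list of dicts, one per template."""
    records = []
    pos = 0
    while True:
        head = HEADER.search(data, pos)
        if head is None:
            break
        record = {"Template-Type": head.group(1).decode("utf-8"),
                  "URL": head.group(2).decode("utf-8")}
        pos = head.end()
        while True:
            attr = ATTRIBUTE.match(data, pos)
            if attr is None:
                break  # reached the closing brace of this template
            size = int(attr.group(2))
            start = attr.end()
            # Read exactly 'size' bytes so that values may span lines.
            record[attr.group(1).decode("utf-8")] = data[start:start + size].decode("utf-8")
            pos = start + size + 1  # skip the newline after the value
        records.append(record)
    return records

def soif_to_csv(data, columns):
    """Write one CSV row per template, keeping only the chosen columns."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(parse_soif(data))
    return out.getvalue()
```

For example, soif_to_csv(data, ["URL", "Content-Type", "Title"]) yields a header row followed by one row per resource, with an empty cell wherever a template lacks an attribute. Fixing the column list in advance keeps every row the same width, which is what a spreadsheet or statistics package expects.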

Availability of the WebWatch Software

The WebWatch robot software will be made freely available for non-commercial use. It should be noted, however, that installing and using the software requires Unix expertise. In addition, we are unable to provide any support for its use.

Further information about the software is available at
<URL: http://www.ukoln.ac.uk/web-focus/webwatch/software/ >