UKOLN WebWatch

Architectural Issues

Data Collection

Our robot is modelled on the "look" of Harvest. It takes a list of RootNodes (for expansion) and LeafNodes (for individual summaries). For HTML documents, we summarise all tag information (including attributes and counts), the HTTP header information and the sizes of included objects such as scripts and stylesheets. Images are summarised by recording their format, number of colours and so on. New types can be defined via a URL recognition scheme and a MIME type recognition scheme (and, we hope, eventually a UNIX file(1)-based scheme), and each type can have its own summariser. For items without a specific summariser we simply give the file size. The robot output consists of concatenated SOIF records.
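
By way of illustration, the summary of an HTML page might be emitted as a SOIF record along the following lines. The attribute names and values here are hypothetical, not our actual schema; in SOIF, the number in braces gives the byte length of the value that follows:

    @FILE { http://www.ukoln.ac.uk/
    content-type{9}:	text/html
    server{12}:	Apache/1.2.5
    tag-count-img{2}:	14
    tag-count-table{1}:	3
    }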

Development is ongoing and aims to be as open as possible, allowing for extension and adaptation. We hope to gradually rework the software into a more modular form (in the Perl sense) so that the modules become resources for other robot developers. Other ideas include a GUI and/or an interface via a browser; this could involve a port to a more platform-independent language such as Java.

Data Analysis & Tools

As mentioned on the previous page, our final analyses have been of CSV records in off-the-shelf packages. One problem with this is that, depending on the needs of the analysis, a differently structured record is produced for each trawl, which means writing a new tailored Perl script each time.
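
A minimal sketch of the kind of tailored conversion involved, assuming single-line attribute values and a hypothetical field list chosen for one particular trawl:

    #!/usr/bin/perl -w
    # soif2csv.pl - flatten concatenated SOIF records into CSV rows.
    use strict;

    # Hypothetical field list; a different trawl would need a different one.
    my @fields = qw(content-type server tag-count-img);

    $/ = "\n}\n";    # read one SOIF record at a time
    while (my $record = <>) {
        my %attr;
        # Capture each "name{size}:<tab>value" line (single-line values only).
        while ($record =~ /^([\w-]+)\{\d+\}:\t(.*)$/mg) {
            $attr{$1} = $2;
        }
        print join(',', map { defined $attr{$_} ? $attr{$_} : '' } @fields), "\n";
    }

Changing the analysis means editing the field list (and often the pattern match as well), which is exactly the rewriting burden described above.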

Looking towards scalability, compatibility and interaction (human or machine), we have been considering inserting our data into a database which could then be queried via standard SQL.
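
As a sketch of what this might look like, assuming a hypothetical table named "pages" with one row per summarised page (the Perl DBI module provides the database interface; the MySQL driver and connection details are likewise assumptions):

    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:webwatch', 'user', 'password',
                           { RaiseError => 1 });

    # e.g. how many trawled pages report each server type?
    my $sth = $dbh->prepare(
        'SELECT server, COUNT(*) FROM pages GROUP BY server');
    $sth->execute;
    while (my ($server, $count) = $sth->fetchrow_array) {
        print "$server: $count\n";
    }
    $dbh->disconnect;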

Analysis of server log files will be used to determine information such as user agents (and, from these, to infer client platforms). These results can be evaluated per server and/or amalgamated to produce a larger body of results. There is plenty of existing software for producing server statistics.
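
For the user-agent counts, something along these lines would do, assuming the common "combined" log format in which the user agent is the last quoted field on each line:

    #!/usr/bin/perl -w
    # Tally user-agent strings from an access log in combined format.
    use strict;

    my %agents;
    while (<>) {
        # The user agent is the final double-quoted field on the line.
        $agents{$1}++ if /"([^"]*)"\s*$/;
    }
    for my $agent (sort { $agents{$b} <=> $agents{$a} } keys %agents) {
        print "$agents{$agent}\t$agent\n";
    }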
