UKOLN - WebWatch - Technical Considerations

Disk space

Disk space has not been an issue so far, though it could begin to influence policy as our crawls become more ambitious. Archives of previous crawls (raw and processed data) will also grow over time; we have been considering keeping these archives on CD-ROM or archiving them in some other way.

Indexing frequency

We generally index at off-peak times, at night and/or over weekends. With a suitable delay between requests, we have been able to index each community in a single run (note that our current traversal algorithm does not hit each server repeatedly anyway).
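As an illustration only, the scheduling policy described above might be sketched as follows. The off-peak window (22:00 to 06:00 on weekdays, all day at weekends) and the function names are assumptions made for this example, not values taken from the WebWatch robot.

```python
from datetime import datetime

# Assumed off-peak window; the actual hours used by the robot are not stated.
OFF_PEAK_START_HOUR = 22
OFF_PEAK_END_HOUR = 6


def is_off_peak(now: datetime) -> bool:
    """Return True at weekends, or at night on weekdays."""
    if now.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return now.hour >= OFF_PEAK_START_HOUR or now.hour < OFF_PEAK_END_HOUR


def seconds_to_wait(last_request: float, now: float, delay: float) -> float:
    """Return how long to sleep so that requests are at least `delay` seconds apart."""
    return max(0.0, delay - (now - last_request))
```

A crawl loop would call is_off_peak() before starting a run and sleep for seconds_to_wait() before each request.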

LAN/WAN bottleneck issues

Only one request is issued per resource required, and we use a local Squid cache. In the future this cache could become a sibling in a cache hierarchy.
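The following is a minimal sketch of the two ideas above: remembering which resources have already been queued so each is requested only once, and sending requests through a local cache. The proxy address (Squid's default port 3128 on localhost), the class name FetchQueue and the helper make_proxy_opener are assumptions for illustration and do not describe the robot's actual code.

```python
import urllib.request
from urllib.parse import urldefrag

# Assumed address of the local Squid cache (3128 is Squid's default port).
SQUID_PROXY = "http://localhost:3128"


def make_proxy_opener(proxy: str = SQUID_PROXY) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP requests through the local cache."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    return urllib.request.build_opener(handler)


class FetchQueue:
    """Queue of URLs in which each resource is queued at most once."""

    def __init__(self) -> None:
        self._seen = set()
        self._pending = []

    def add(self, url: str) -> bool:
        """Queue `url` unless it was queued before; return True if it was added."""
        key, _fragment = urldefrag(url)  # a fragment does not name a new resource
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def pop(self) -> str:
        """Return the next URL to fetch, in the order it was added."""
        return self._pending.pop(0)

    def __len__(self) -> int:
        return len(self._pending)
```

Each URL popped from the queue would then be fetched with make_proxy_opener().open(url), so that repeated runs are served from the cache where possible.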

Robot exclusion

The latest version of our robot, 2.01, was designed with the Robots Exclusion Protocol (REP) in mind, and this will be implemented where possible. We plan shortly to add support for robot directives implemented via meta tags.
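To make the two mechanisms concrete, here is a hedged sketch using the Python standard library: one function checks a URL against the rules in a robots.txt file, and a small parser collects directives from a robots meta tag. The user-agent string "WebWatch" and all function and class names are hypothetical; this is not the robot's own implementation.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

USER_AGENT = "WebWatch"  # hypothetical user-agent name for this example


def allowed_by_robots_txt(robots_lines, url, agent=USER_AGENT):
    """Check `url` against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_meta_directives(html):
    """Return the set of robots meta directives in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A robot following this sketch would skip any URL for which allowed_by_robots_txt() returns False, and would neither index a page carrying "noindex" nor follow its links if it carries "nofollow".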
