3 Background to Robot Technologies

Introduction

Robots are automated software agents designed to carry out work that is repetitive or, because of sheer scale, not possible as an individual human undertaking.

The definition of a software agent is subject to debate, but one relevant notion of an autonomous agent is "a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future".

Autonomous agents are not really a new phenomenon on the Internet. Unix, the operating system on which the Internet was born, uses network daemons, which are essentially simple automata for dealing with network transactions. Newsbots, IRCbots and other application-specific autonomous agents are now commonplace on the Internet.

Agents and the Web

The WebWatch project is concerned with automated Web agents. These are robots that traverse the Web, navigating through hyperlinks and performing a job, such as recognising and indexing resources. Such Web crawling robots are known by a variety of aliases, including crawlers, trawlers, spiders and worms.

The current generation of Web robots is, perhaps surprisingly, large. A glance at the Web Robots Pages [1] reveals a list of over 160 well-known robots. These robots are written in a variety of different languages. Some may use libraries that are written to aid Web agent design. Two popular libraries used for this purpose are libwww for the 'C' language and LWP for Perl.

The majority of robots navigating the Web at this time are used by search engines to index keywords and other metadata in order to build indices for resource retrieval. Other robots are available to maintain hypertext structures or to summarise resources in other ways (e.g. monitoring technologies rather than indexing).

Web robot behaviour can be broadly classified into:

* Exploratory Web crawling (e.g. resource discovery or broken link discovery)

* Directed Web crawling (e.g. of URLs submitted to a search engine)

* Restricted Web crawling (e.g. all URLs within a DNS domain).
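
The three behaviours listed above differ mainly in which links a robot is allowed to follow from its starting URLs. The following is a minimal sketch in Python, not a description of any particular robot: the LINKS table is a hypothetical in-memory link graph standing in for real HTTP fetches, and the names crawl and in_domain are illustrative assumptions.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for pages fetched over HTTP.
LINKS = {
    "http://www.example.ac.uk/": ["http://www.example.ac.uk/a",
                                  "http://other.example.com/"],
    "http://www.example.ac.uk/a": ["http://lib.example.ac.uk/"],
    "http://other.example.com/": ["http://other.example.com/x"],
    "http://lib.example.ac.uk/": [],
    "http://other.example.com/x": [],
}


def in_domain(url, domain):
    """True if the URL's host is the given DNS domain or a sub-domain of it."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def crawl(seeds, follow=lambda url: True, max_depth=None):
    """Breadth-first traversal from the seed URLs.

    follow    -- predicate deciding whether a discovered link may be queued
    max_depth -- number of link hops allowed from a seed (None = unlimited)
    """
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in LINKS.get(url, []):
            if link not in seen and follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


START = ["http://www.example.ac.uk/"]

# Exploratory: follow every link that is discovered.
exploratory = crawl(START)

# Directed: visit only the submitted URLs, following no links.
directed = crawl(START, max_depth=0)

# Restricted: follow only links that stay within one DNS domain.
restricted = crawl(START, follow=lambda u: in_domain(u, "example.ac.uk"))
```

In this sketch the exploratory crawl reaches all five pages, the directed crawl visits only the submitted URL, and the restricted crawl stays on hosts under example.ac.uk.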

The Role of Robots in the Web

Maintaining the Web

The dynamic nature of the Web, and the fact that it is composed of administratively independent "Web sites", lead to the danger that states of disrepair can arise, because checking a large body of pages by hand can be too big and difficult an undertaking. Regular maintenance is nevertheless necessary to keep up with changes in the structure of Web-space and changes in technologies and standards. For example:

* When pages change, come into existence or cease to exist, references to those pages must be correspondingly changed, created or removed.

* In order that pages are displayed as required, there must be conformance to standards such as HTML.
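
As an illustration of the first maintenance task above, a link-checking robot needs to extract the references from a page and compare them with the pages that actually exist. The sketch below uses Python's standard html.parser and urllib.parse modules; the class LinkExtractor, the function broken_links and the example URLs are assumptions made for illustration, and a real checker would test each link with an HTTP request rather than a fixed set of known pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def broken_links(page_url, html, existing_pages):
    """Return the links in the page that point at pages not known to exist."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return [link for link in parser.links if link not in existing_pages]


# Hypothetical page and set of existing pages.
PAGE_URL = "http://www.example.org/dir/a.html"
PAGE_HTML = '<p><a href="b.html">B</a> <a href="/gone.html">Gone</a></p>'
EXISTING = {"http://www.example.org/dir/b.html"}

dangling = broken_links(PAGE_URL, PAGE_HTML, EXISTING)
```

Here the reference to /gone.html would be reported so that it can be changed or removed.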

Understanding the Web

As a widely used and strategically important communications medium, the Web warrants analysis of its structure to provide insight into its nature and to monitor its development. The results of these analyses can help to solve the problems that are associated with the Web. Monitoring change and growth may help to predict future trends and to recognise the potential or necessity for enhancements and developments. This may equally be true on the smaller scale of Web areas under the same administrative control.

This is an area in which the World Wide Web Consortium is actively involved through the HTTP/NG's Web Characterisation working group [2].

Benefits

Spiders offer an effective means of obtaining a user's view of remote Web-space. Simulating a human user, they have the advantage of being able to cover large subspaces of the Web repeatedly and to keep accurate logs and summaries. Additionally, they can be run for long periods or at regular intervals without fatigue, and can monitor changes that may not be apparent to a normal user.

General benefits include:

* User satisfaction from search-directed access to resources and easier browsability (via maintenance and advancements of the Web resulting from analyses).

* Reduced network traffic in document space resulting from search-directed access.

* Effecting archiving/mirroring, and populating caches (to produce associated benefits).

* Monitoring and informing users of changes to relevant areas of Web-space.

* "Schooling" network traffic into localised neighbourhoods through having effected mirroring, archiving or caching.

* Multi-functional robots can perform a number of the above tasks, perhaps simultaneously.

Robot Ethics

All ethical spider users should balance the advantages gained against the disadvantages that may arise from the use of a Web crawler. There are some generally accepted guidelines for producing an ethical crawler [3]. These guidelines, known as the Robot Exclusion protocol (REP), are designed to minimise problems for users, servers and networks. It is also considered ethical to use a spider that will parse and obey robot exclusion protocols. Robot exclusion refers to methods that can be implemented server-side to direct robot accesses.
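
A common server-side exclusion method is a robots.txt file listing the paths that robots are asked not to visit. As a rough sketch of how a spider might parse and obey such a file, the example below uses the RobotFileParser class from Python's standard urllib.robotparser module. The robots.txt content, the robot name "ExampleBot" and the URLs are hypothetical; a real spider would fetch the file from the server it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a server might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)

# The robot asks before each request whether the URL may be fetched.
allowed_public = checker.can_fetch("ExampleBot", "http://www.example.org/index.html")
allowed_private = checker.can_fetch("ExampleBot", "http://www.example.org/private/x.html")
```

With the rules shown, the robot would be permitted to fetch the index page but not anything under /private/.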

Disadvantages of Web Robots

Network Performance

Robots traditionally have a bad press in discussions on bandwidth, even though the functions of some well-written and ethical robots are ultimately to conserve bandwidth.

There are points to consider on the bandwidth front, since robots can span relatively large portions of Web-space over short periods. Bottlenecks can arise locally through high bandwidth consumption, particularly if the robot is in frequent or permanent use, or if it is used during network peak times. The problem is exacerbated if the frequency of requests for resources is unregulated.

Server-side Concerns

So-called "rapid-fire" requests (successive HTTP requests to a single server without delays) have been shown to be very resource consuming for a server under current HTTP implementations (in fact, this is the basis of several "denial of service" attacks). Here again, an unregulated robot can cause problems. Suitable delays and an ethical traversal algorithm can help resolve this.
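
One simple way to introduce suitable delays is to record when each server was last contacted and to pause before contacting it again. The sketch below is an assumption-laden illustration rather than a recommended implementation: the class name PoliteScheduler, the method wait_turn and the default delay of one second are all hypothetical choices, and the clock and sleep functions are passed in so that the behaviour can be checked without actually waiting.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Enforces a minimum delay between successive requests to one host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay   # seconds between requests to a host
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}       # host -> time of the previous request

    def wait_turn(self, url):
        """Pause if the URL's host was contacted too recently.

        Returns the number of seconds spent waiting.
        """
        host = urlsplit(url).hostname or ""
        waited = 0.0
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited


# Demonstration with a simulated clock, so that no real time passes.
now = [0.0]
scheduler = PoliteScheduler(
    min_delay=2.0,
    clock=lambda: now[0],
    sleep=lambda seconds: now.__setitem__(0, now[0] + seconds),
)
first = scheduler.wait_turn("http://a.example.org/1")       # no wait needed
second = scheduler.wait_turn("http://a.example.org/2")      # same host: waits
other_host = scheduler.wait_turn("http://b.example.org/1")  # different host
```

Requests to different hosts are not delayed by one another, so the robot can keep working while leaving each individual server time to recover.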

The skewing of server logs is another issue that causes concern. A robot that indexes an entire site will distort the logs if it does not supply a recognisable "user-agent" with its requests, since its accesses may then be hard to distinguish from those of regular users.
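
To make its accesses identifiable, a robot can send a User-Agent header naming itself, and optionally a From header giving a contact address. The following sketch builds such a request with Python's standard urllib.request module without sending it; the function build_request, the robot name "ExampleBot" and the contact address are hypothetical.

```python
from urllib.request import Request


def build_request(url, robot_name, version, contact):
    """Create a request that identifies the robot and its operator."""
    agent = f"{robot_name}/{version}"
    return Request(url, headers={"User-Agent": agent, "From": contact})


request = build_request(
    "http://www.example.org/",
    robot_name="ExampleBot",
    version="1.0",
    contact="robot-admin@example.org",
)

# Request stores header names with only the first letter capitalised.
sent_agent = request.get_header("User-agent")
```

A server administrator can then filter log entries on the agent name to separate robot traffic from that of human visitors.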

Unethical Robots

A small number of rogue robots are in use. The tasks of these robots are such that they are particularly unwelcome to server administrators. Such tasks include email culling (the production of large lists of email addresses that can be sold to advertisers) and copyright violation through the copying of entire sites.

Additionally, robots can count towards a site's "hit quota" and consume bandwidth for which the site may have to pay.