6 Observations


During the analysis of the results of the trawls, the WebWatch project made a number of observations which will be of interest to users of the Web. These are listed below.

For Information Providers

Directory Structure

During the analysis of UK Public Library Web sites in particular, it was noted that a number of Web sites used a very flat directory structure, with the library departmental menu page located in the root directory of the Web site, alongside other departmental menu pages. This makes it difficult to restrict the coverage of robot software, such as indexing and auditing tools, to the relevant area.

Recommendation: The menu page for departments should be located beneath the departmental directory. For example, http://www.foo.ac.uk/library/library.html and not http://www.foo.ac.uk/library.html

Broken Links

During all of the trawls broken links were found.

Recommendation: Institutions should systematically run link checkers on their Web sites, and define procedures for fixing broken links.
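
A minimal link-checking sketch in Perl, using the LWP and HTML::LinkExtor modules, is given below. It checks only the links found on a single page; the starting URL is illustrative, and a production checker would also need to recurse through the site, obey robots.txt and throttle its requests.

    #!/usr/bin/perl -w
    # Check the links on a single page and report any which are broken.
    use strict;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $page = shift || 'http://www.foo.ac.uk/library/library.html';
    my $ua   = LWP::UserAgent->new;
    $ua->agent('LinkCheck-sketch/0.1');

    my $response = $ua->get($page);
    die "Cannot fetch $page: ", $response->status_line, "\n"
        unless $response->is_success;

    # Collect the href/src attributes found in the page
    my $extractor = HTML::LinkExtor->new;
    $extractor->parse($response->content);
    $extractor->eof;

    my %seen;
    for my $link ($extractor->links) {
        my ($tag, %attrs) = @$link;
        for my $partial (values %attrs) {
            my $url = URI->new_abs($partial, $page)->as_string;
            next unless $url =~ /^http/i;     # skip mailto:, ftp:, etc.
            next if $seen{$url}++;
            my $check = $ua->head($url);      # HEAD requests keep traffic low
            printf "BROKEN %s %s\n", $check->code, $url
                unless $check->is_success;
        }
    }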

Metadata

Relatively little metadata was found, even on key Web pages such as institutional and departmental entry points. Although there may be problems with widespread deployment of metadata (difficulties of maintaining the metadata, lack of tools to create, edit and validate the metadata, volatility of metadata standards, etc.), the deployment of a small amount of metadata using stable conventions (primarily "AltaVista" metadata) will improve the quality of resource discovery.

Recommendation: Institutions should deploy "AltaVista" type metadata on a small number of key entry points.
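
As an illustration, "AltaVista" type metadata consists of simple description and keywords elements in the head of the page; the values below are invented for the example.

    <head>
      <title>University of Foo Library</title>
      <meta name="description" content="Entry point for the University of Foo Library Web pages">
      <meta name="keywords" content="library, catalogue, opening hours, University of Foo">
    </head>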

HTML Conformance

Web pages were found which failed to validate against published HTML DTDs (Document Type Definitions). Although browsers tend to be tolerant of HTML errors, other user agents, such as indexing and auditing robots, may have problems in processing such pages.

Recommendation: Institutions should develop and implement procedures for ensuring that HTML pages conform to a published DTD.
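
As an illustration (the choice of DTD here is arbitrary), a page declares the DTD it is written against in a DOCTYPE declaration on its first line; a validation service or SGML parser can then be used to check the page against that DTD.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
        "http://www.w3.org/TR/REC-html40/loose.dtd">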

Use of Frames

The initial version of the WebWatch robot, which used the Harvest software, could not process Web sites which made use of frames. Although the latest version of the WebWatch robot can handle such sites, many indexing robots cannot, and Web sites which make use of frames may therefore not be indexed.

Recommendation: Information providers should be aware of the problems that frames have on robot software, and should either avoid use of frames, or make use of techniques to overcome this problem.
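
One such technique is to provide a NOFRAMES section, which gives non-frames user agents (including many robots) a route into the content. A minimal sketch, with invented file names:

    <frameset cols="20%,80%">
      <frame src="menu.html">
      <frame src="contents.html">
      <noframes>
        <body>
          <p><a href="contents.html">Contents (non-frames version)</a></p>
        </body>
      </noframes>
    </frameset>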

Size of Institutional Entry Points

During the trawl of UK University entry points the largest entry points were found to contain animated GIFs. Animated GIFs not only result in large file sizes; they can also cause accessibility problems.

Recommendation: Institutions should be aware of the dangers in the use of animated GIFs.

Use of "Splash Screens"

During the trawl of UK University entry points a small number of pages were found to contain "splash screens". Splash screens may cause accessibility problems. In addition, although modern browsers support them, older browsers and other user agents may not.

Recommendation: Institutions should be aware of the dangers in the use of "splash screens".

Use of Java and JavaScript to Provide Scrolling Text

During the trawl of UK University entry points a small number of pages were found to make use of Java and JavaScript. Subsequent manual analysis showed that in some cases Java and JavaScript were used to provide scrolling text. Although this may be of use in certain circumstances, scrolling text can cause accessibility problems.

Recommendation: Institutions should be aware of the dangers in the use of Java or JavaScript to provide scrolling text.

Use of Hyperlinks

The trawl of UK University entry points indicates differences in approaches to the provision of hyperlinks on the main institutional entry point. Increasingly it appears that organisations are reducing the number of hyperlinks on key pages.

Recommendation: Institutions should be aware of the dangers in providing too many hyperlinks.

Use of Hyperlinks to Remote Resources

With the introduction of charging for network traffic from the US, increasing numbers of Universities are reducing the number of links to US resources.

Recommendation: Institutions should provide hyperlinks to UK resources where possible.

For Webmasters

System Configuration

Web sites were found with misconfigured server configuration files. For example, on several Web sites image files were configured with a MIME type of text/html. Although this has no noticeable effect on commonly used browsers, it can cause problems with robot software (for example, indexing software may index the contents of the binary file).

Recommendation: Webmasters should ensure that server configuration files are configured correctly.
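
For example, on the Apache server the mapping from file extension to MIME type is controlled by the mime.types file and by AddType directives in the configuration files; the entries below are illustrative.

    # Ensure image files are served with image MIME types, not text/html
    AddType image/gif  .gif
    AddType image/jpeg .jpeg .jpg
    AddType text/html  .html .htm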

Server Software

In the initial survey of UK University entry points, several unusual Web server software packages were found, such as Microsoft's Personal Web Server. In the second trawl the use of such software had declined.

Recommendation: Webmasters should take note of the profile of server software used within their community, and be aware of the implications of running unusual software (such as possible lack of support and expertise).

HTTP/1.1

Analysis of the trawls of UK University entry points indicates that about 50% use servers which support HTTP/1.1. HTTP/1.1 has many performance benefits for the server and the network over HTTP/1.0.

Recommendation: Webmasters should consider deploying server software which implements HTTP/1.1.
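
As an illustration, an HTTP/1.1 client sends a Host header and, because connections are persistent by default, can issue several requests over a single TCP connection, whereas an HTTP/1.0 client typically opens a new connection for each request. The host and paths below are invented.

    GET /index.html HTTP/1.1
    Host: www.foo.ac.uk

    GET /images/logo.gif HTTP/1.1
    Host: www.foo.ac.uk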

The robots.txt File

Many Web sites do not have a robots.txt file. A robots.txt file can improve the server performance by ensuring that certain areas of the Web site are not indexed. It can improve the quality of the contents of search engines by ensuring that areas of the Web site containing poor quality information are not indexed.

Recommendation: Webmasters should create a robots.txt file and configure it so that robots do not access resources unnecessarily.
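
A minimal robots.txt file, placed in the root directory of the Web server, might look like the following; the directory names are illustrative.

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /drafts/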

Monitoring Robots

The existence of unethical robots may present a problem to the administrator of a Web site. Webmasters should monitor their server logs in order to identify unethical robots.

Recommendation: Webmasters should consider installing software to report on visits by robots.
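
A minimal sketch in Perl is given below: it reports the user-agents which have requested the robots.txt file, assuming the server writes its logs in the Apache "combined" format (which records the user-agent as the last quoted field).

    #!/usr/bin/perl -w
    # Usage: perl robot-report.pl access_log
    use strict;

    my %robots;
    while (<>) {
        next unless m{"(?:GET|HEAD) /robots\.txt };
        my ($agent) = /"([^"]*)"$/;        # last quoted field is the user-agent
        $robots{$agent || 'unknown'}++;
    }

    foreach my $agent (sort { $robots{$b} <=> $robots{$a} } keys %robots) {
        print "$robots{$agent}\t$agent\n";
    }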

URL Conventions

The trawl of eLib project Web sites indicated that several projects used URLs for their main entry point which were either long or made use of personal home directories (e.g. the ~name convention). There is a danger that such projects will later be given a more persistent URL, and that access via the original URL will be discontinued.

Recommendation: URLs for projects which are likely to have a long life-span should aim to be persistent.
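
Where a project does later move to a more permanent location, the original URL can be kept working with a server redirect. A minimal sketch for the Apache server, with invented paths:

    # The old personal-directory URL permanently redirects to the project's new home
    Redirect permanent /~jsmith/myproject http://www.foo.ac.uk/projects/myproject/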

For Robot Developers

Writing Robot Software

Implementers of robots should follow the guidelines on robot ethics.

Recommendation: Robot developers should reuse existing spiders rather than creating new ones.
Robots should identify themselves to servers.
Robots should be thoroughly tested locally before being unleashed on the Web.
Robots should retrieve only resources that they are interested in.
The results of using a robot should be disseminated. A sketch which illustrates several of these points is given below.
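
The sketch below, built on the LWP::RobotUA module, follows several of these guidelines: it identifies the robot to servers, honours robots.txt and spaces out its requests. The agent name, contact address and target URL are illustrative.

    #!/usr/bin/perl -w
    use strict;
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new('WebWatch-sketch/0.1', 'webmaster@foo.ac.uk');
    $ua->delay(1);                      # wait at least one minute between requests

    my $response = $ua->get('http://www.foo.ac.uk/');
    if ($response->is_success) {
        print "Fetched ", length($response->content), " bytes\n";
    } else {
        print "Request refused or failed: ", $response->status_line, "\n";
    }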

Memory Leakage

Implementers of robots should be aware of memory leakage problems in versions of the LWP module and Perl itself.

Recommendation: Robot developers should be aware of the possible need for a controller process to kill and re-start the robot on large trawls.
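
A minimal sketch of such a controller is given below; the robot command name and its --resume option are assumptions for the purposes of the example.

    #!/usr/bin/perl -w
    # Re-start the robot if it dies part-way through a large trawl
    # (for example, after running out of memory).
    use strict;

    while (1) {
        my $status = system('./webwatch-robot', '--resume');
        last if $status == 0;           # clean exit: the trawl is complete
        warn "Robot exited with status $status; re-starting in 60 seconds\n";
        sleep 60;
    }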

Restarting Trawls

There may be a need to restart trawls if memory leakages cause a trawl to be abandoned.

Recommendation: A history file should be built up for the robot's use, allowing a trawl to be stopped and later re-started from the same point. Such a file should include the canonicalised URL for a resource, a content-based checksum (such as MD5) and some resource metadata, such as an expiry date. An error log file is also very useful.
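
A minimal sketch in Perl of such a history file, assuming a simple tab-separated format of canonicalised URL, MD5 checksum of the content and an expiry date:

    use strict;
    use URI;
    use Digest::MD5 qw(md5_hex);

    # Record a visited resource in the history file
    sub record_visit {
        my ($url, $content, $expires) = @_;
        my $canonical = URI->new($url)->canonical;
        open my $hist, '>>', 'history.log' or die "history.log: $!";
        print $hist join("\t", $canonical, md5_hex($content), $expires || '-'), "\n";
        close $hist;
    }

    # On re-start, load the history so already-visited URLs can be skipped
    sub load_history {
        my %seen;
        open my $hist, '<', 'history.log' or return %seen;
        while (<$hist>) {
            chomp;
            my ($canonical, $md5, $expires) = split /\t/;
            $seen{$canonical} = { md5 => $md5, expires => $expires };
        }
        close $hist;
        return %seen;
    }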

Trawling Guidelines

Implementers of robots should adopt trawling strategies which avoid retrieving resources unnecessarily.

Recommendation: Hierarchical crawling can be useful where collections within Web sites are grouped by directory. After directory and DNS canonicalisation, each URL should be checked against the entries in the history file, and HTTP headers such as "Expires" should be examined. There should also be default limits on the total number of URLs encountered and on crawl depth, to prevent useless crawling of a "black hole". Stop-lists based on resource type and URL regular expressions can also be used to prevent unnecessary trawling.
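
A minimal sketch in Perl of the checks which might be applied to each candidate URL before it is fetched: canonicalisation, a history lookup, limits on the number of URLs and on depth, and a stop-list of URL patterns. The limits and patterns are illustrative.

    use strict;
    use URI;

    my $MAX_URLS  = 5000;    # default ceiling on URLs encountered in one trawl
    my $MAX_DEPTH = 10;      # default ceiling on crawl depth
    my @STOP_LIST = (qr{/cgi-bin/}, qr{\.(?:gif|jpe?g|zip|exe)$}i);

    my %history;             # canonicalised URLs already seen on this trawl
    my $url_count = 0;

    sub should_fetch {
        my ($url, $depth) = @_;
        my $canonical = URI->new($url)->canonical->as_string;

        return 0 if $url_count >= $MAX_URLS;     # guard against "black holes"
        return 0 if $depth > $MAX_DEPTH;
        return 0 if $history{$canonical}++;      # already encountered
        return 0 if grep { $canonical =~ $_ } @STOP_LIST;

        $url_count++;
        return 1;
    }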

Registration of Robot

Implementers of robots should register their robot.

Recommendation: Robot developers should use the form at <URL: http://www.botwatch.com/addbots/ > to register their robot.

User-Agent Negotiation

Web servers which make use of server-side scripting may choose to serve different content to "unusual" user-agents (i.e. user-agents other than Netscape and Internet Explorer).

Recommendation: Robot developers should be aware that Web servers may choose to serve different content to "unusual" user-agents.

For Protocol Developers

Web Collections

Developers of robot indexing and auditing software require more sophisticated mechanisms for defining collections of related resources than are currently provided.

Recommendation: Protocol developers should develop site mapping protocols which will enable information providers and Webmasters to define areas which robot software should access.

Robot Exclusion Protocol

A more sophisticated robot exclusion protocol is required which (a) can be managed by information providers without requiring the intervention of a Webmaster and (b) provides greater control.

Recommendation: Protocol developers should develop a more sophisticated robot exclusion protocol.