UKOLN - WebWatch - eLib Report

The third major WebWatch crawl took place in November, 1997. The WebWatch robot software analysed eLib project web sites. A report on the analysis follows.

Background

The WebWatch project analyzed eLib project websites as defined at <URL:http://www.ukoln.ac.uk/services/elib/projects/> following a trawl that took place on 3 occasions in November 1997. This report gives a summary of the findings. The report is intended primarily for eLib project webmasters, but eLib project managers may also find it of interest.

The Trawl

The trawl took place on 14/15, 21/22 and 25 of November. Although it was initially intended to carry out the trawl in one run, the size of eLib project websites revealed a number of problems with the WebWatch robot and so a number of runs were needed.

Since the runs provided different sets of data, this report is based on a combination of data. Note that eLib project websites were not completely covered.

Two files of data from separate trawls have been analyzed which contain samples of HTML data, images and other resources.

Initialization

Fifty-five eLib project sites were cited for crawling. Some of these were not indexed as fully as intended as a result of problems including time-outs and various interpretation difficulties (see later).

Where relevant, we refer to % of sites rather than actual numbers to avoid misinterpretations over our analysis of two differing summary files.

Of the 55 sites considered:

1 no longer existed
35 had no robots.txt at the top level (http://foo.ac.uk/robots.txt)
11 had their own domain name (in which we include omni.ac.uk, sosig.ac.uk etc.)
13 had a significant machine name (e.g. cain.ulst.ac.uk. Not including naming schemes like omni.ac.uk).
3 contained a tilde (~) in their path
18 had entry points at the top level (/) (e.g. http://intarch.ac.uk/).

Initial observations

The following are expressed as percentages of all HTML-typed files encountered.

19% contained a question mark, '?', in the path.
12% contained meta elements compliant with search engine recommendations (such as Alta-Vista).
5.5% contained meta elements based on Dublin Core metadata specifications.
1% contained meta elements based on PICS metadata specifications.
0.1% contained meta elements with the http-equiv='refresh' attribute.

Analyses

Web Server Software

Figure 1 shows a pie chart of the Web servers encountered.

HTTPD server usage
Figure 1 - HTTP server software usage

The top three web servers are Apache(~62%), NCSA(~22%) and Netscape-Enterprise(~17%).

File sizes

Figure 2 shows the frequency of HTML file sizes.

Frequency distribution of HTML filesizes
Figure 2 - HTML file size

The majority of HTML documents were under 25 Kb, the mean being about 10 Kb. There were a number of extreme values for HTML file-size (the standard deviation is roughly 49 Kb), on closer inspection these are usually large indices or server logs. The smallest document encountered was 49 bytes - a CGI generated error message and the largest document was about 1 Mb - site server statistics.

Figure 3 shows the frequency of image file sizes encountered.

Frequency distribution of Image filesizes
Figure 3 - Image file size

The rough attenuation of the tail compared to HTML file sizes is not really an effect of the different interval-sizes. This might be due to our incomplete trawling of some sites.

We encountered mostly JPEG and GIF formats, the later being roughly 9% more prevalent than the former. The GIF size distribution is of a similar shape to Figure 3 (but with mean 18,732 bytes). The JPEG size distribution is slightly different and had mean 34,865 bytes.

Figure 4 shows the frequency distribution of the size of the entry point to the site. This is defined as the sum of the filesizes of the HTML page and all inline components (mostly inline images).

Figure 4 - Size of entry point to site

The mean value is 5 Kb. Over a 28.8 Kbps modem connection, this would take about 1.4 seconds to download.

General HTML element usage

Figure 5 shows the number of unique HTML elements found on each page.

Frequency distribution of unique
HTML elements (per page)
Figure 5 - Number of unique HTML elements per page

The number of unique HTML elements on each page of HTML trawled peaks at 25. However we suspect that the distribution is approximately normal and that the peaks at 9 and 25 are due to 'in-house style' of a number of sites that were trawled completely, dwarfing others that were incompletely trawled.

In contrast, Figure 6 shows the count of total number of HTML elements per page.

Frequency distribution of total HTML elements
(per page)
Figure 6 - Total number of HTML elements per page

The mean number of elements for Figure 5 is 187, with standard deviation 532.

The ten most popular elements used per page are shown in Figure 6.

Figure 7 - Top ten HTML elements (per page)

Figure 7 shows that the A element is by far the most widely used HTML element in the eLib project pages which were analyzed. This could be due to extensive linking to external resources, or extensive cross-linking within the website. The Access to Network Resources (ANR) projects (such as OMNI, which was completely indexed,) are likely to contain large numbers of hyperlinks. It is perhaps surprising that eLib projects generally contain such a high proportion of hyperlinks.

Use of the META element

We looked at specific uses of the META tag, namely for HTTP-EQUIV=refresh, search-engine metadata specifications (e.g. as recommended by Alta-Vista), Dublin Core metadata specifications and PICS metadata specifications. Around 19% of all trawled pages of HTML contained such instances of META usage. See Figure 7 (HTTP-EQUIVs are not shown).

Figure 8 shows usage of the META element.

Chart of usage of the META element
Figure 8 - Use of the META element

A more in-depth look at the use of Dublin Core metadata is presented in Figure 9.

Chart of DC metadata elements
Figure 9 - Number of DC attribute values per page

As can be seen from Figure 9, up to 18 DC metadata attributes per page were used. eLib pages containing DC metadata tended to make extensive use of the DC attributes, with only a small number using a handful of values.

Use of inline scripts

We monitored usage of the SCRIPT element. Event handlers within tags were not analyzed on this trawl.

A brief summary of the SCRIPT element is shown in Figure 10.

Chart of use of the SCRIPT element
Figure 10 - Use of the SCRIPT element

Analysis of links

An analysis of absolute URL references (i.e. http://foo.com/blah.html) within the usual hyperlink elements (A, AREA, LINK, MAP) provides information on the top-level domains linked to.

Figure 11 shows the ten most popular linked-to domains. Every link in each document was considered and the top-ten calculated. Note that the y-axis is logarithmic.

Top ten domains linked-to (overall)
Figure 11 - Top ten linked-to top level domains (evaluated overall)

The Robot

We have analyzed two files of data from separate trawls and have a large sample of HTML, images and other resources. When all bugs are ironed out from the software one file will suffice. Some problems with the new version of the robot meant that some sites were not being trawled completely. It may be useful to bear this in mind while considering this analysis.

Recommendations

Based on this crawl we are able to make a number of recommendations:

Set your server to generate appropriate RCs when there is an error. Robots normally index or audit resources if they receive an appropriate response code (RC). Servers that are not configured to produce the correct response code may cause problems if a robot accesses your site. For example, in a small number of cases our robot audited an error message from a CGI script.
Within a page, use relative links or ensure the machine name is the same as the publicized machine name. Since the WebWatch robot (like most other robots) does not perform DNS lookups it will assume a different hostname is an external host. For example for the site http://www.ambridge.ac.uk/ all hyperlinks should be of the form http://www.ambridge.ac.uk/papers/paper.html not http://ambridge.ac.uk/papers/paper.html.

Issues

A number of issues emerge from the survey.

Domain Naming: Some eLib project sites had their own domain name or used a significant machine name in the domain. A small number had an entry point at the top level. However most projects had an entry point which was located within the institution's directory structure and a small number used the tilde (~) convention. The persistency of these URLs may be a concern in the longer term.
Metadata: Dublin Core metadata does not yet appear to be widely used. This is perhaps due to the uncertainty of the HTML conventions for embedding Dublin Core metadata within HTML documents, which has only recently been resolved.
Server Usage: Most eLib project websites use one of a small number of server software packages (Apache, NCSA, Netscape). Projects which make use of a little-used server package may wish to consider migrating to a more widely deployed package.

Feedback

We would appreciate any feedback or comments, especially from the eLib community. Comments should be sent to webwatch@ukoln.ac.uk

Future Trawls

Plans for future trawls include:

a more detailed analysis of the use of technologies (e.g. Java, Javascript, SGML, etc)
a more detailed analysis of the use of hyperlinks
a more detailed analysis of the use of Dublin Core metadata
a detailed report on broken links and other errors
validation of HTML (and possibly stylesheets etc.)

An object-oriented re-write of the software (currently under development) will simplify the addition of future enhancements. Ideas for this version include validation and detailed reports of errors.

Report of WebWatch crawl of eLib web sites