Appendix 5 Trawl of eLib Projects

This appendix is based on an article originally published on the web. See <URL: http://www.ukoln.ac.uk/web-focus/webwatch/reports/elib-nov1997/ >.

This trawl took place in November 1997.

Report of WebWatch Trawl of eLib Web Sites

The third major WebWatch crawl took place in November, 1997. The WebWatch robot software analysed eLib project web sites. A report on the analysis follows.

Background

The WebWatch project analyzed eLib project web sites as defined at <URL: http://www.ukoln.ac.uk/services/elib/projects/ > following a trawl that took place on 3 occasions in November 1997. This report gives a summary of the findings. The report is intended primarily for eLib project webmasters, but eLib project managers may also find it of interest.

The Trawl

The trawl took place on 14/15, 21/22 and 25 of November. Although it was initially intended to carry out the trawl in one run, the size of eLib project websites revealed a number of problems with the WebWatch robot and so a number of runs were needed.

Since the runs provided different sets of data, this report is based on a combination of data. Note that eLib project websites were not completely covered.

Two files of data from separate trawls have been analyzed which contain samples of HTML data, images and other resources.

Initialization

Fifty-five eLib project sites were identified for crawling. Some of these were not indexed as fully as intended because of problems including time-outs and various interpretation difficulties (see later).

Where relevant, we refer to percentages of sites rather than actual numbers, to avoid misinterpretation arising from our analysis of two differing summary files.

Of the 55 sites considered:

* 1 no longer existed

* 35 had no robots.txt at the top level (http://foo.ac.uk/robots.txt)

* 11 had their own domain name (in which we include omni.ac.uk, sosig.ac.uk etc.)

* 13 had a significant machine name (e.g. cain.ulst.ac.uk, not including naming schemes like omni.ac.uk).

* 3 contained a tilde (~) in their path

* 18 had entry points at the top level (/) (e.g. http://intarch.ac.uk/).
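As an illustration only, the URL-based characteristics in the list above (a tilde in the path, an entry point at the top level, the location of robots.txt) could be derived from an entry-point URL roughly as follows. This is a minimal sketch in Python, not the WebWatch robot's own code; the function name describe_entry_point and the example URL under foo.ac.uk are hypothetical.

```python
from urllib.parse import urlsplit


def describe_entry_point(url):
    """Return simple URL features of the kind tallied in the list above.

    Hypothetical helper for illustration; not part of the WebWatch robot.
    """
    parts = urlsplit(url)
    path = parts.path or "/"  # treat "http://host" like "http://host/"
    return {
        "host": parts.hostname,
        "has_tilde": "~" in path,
        "top_level_entry": path == "/",
        # robots.txt is only looked for at the top level of the server
        "robots_url": f"{parts.scheme}://{parts.netloc}/robots.txt",
    }


if __name__ == "__main__":
    print(describe_entry_point("http://intarch.ac.uk/"))
    print(describe_entry_point("http://foo.ac.uk/~someuser/project/"))
```

Whether a robots.txt file actually exists still has to be established by requesting the robots_url; the sketch only shows where the request would be sent.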

Initial observations

The following are expressed as percentages of all HTML-typed files encountered.

* 19% contained a question mark, '?', in the path.

* 12% contained meta elements compliant with search engine recommendations (such as Alta-Vista).

* 5.5% contained meta elements based on Dublin Core metadata specifications.

* 1% contained meta elements based on PICS metadata specifications.

* 0.1% contained meta elements with the http-equiv='refresh' attribute.

Analyses

Web Server Software

Figure A5-1 shows a pie chart of the Web servers encountered.

Figure A5-1 - HTTP Server Software Usage

The top three web servers are Apache (~62%), NCSA (~22%) and Netscape-Enterprise (~17%).
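Server software is normally identified from the Server header of the HTTP response. The sketch below shows one way such headers could be reduced to a product name and tallied; it assumes the usual "Product/version" form of the header, and the sample header values are invented for illustration rather than taken from the trawl data.

```python
from collections import Counter


def server_product(server_header):
    """Reduce a Server header such as 'Apache/1.2.4 extra' to 'Apache'."""
    token = server_header.strip().split(" ", 1)[0]  # first product token
    return token.split("/", 1)[0] or "unknown"      # drop the version part


def tally_servers(headers):
    """Count how often each server product appears in a list of headers."""
    return Counter(server_product(h) for h in headers)


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    sample = ["Apache/1.2.4", "NCSA/1.5.2", "Netscape-Enterprise/2.01", "Apache/1.1.3"]
    print(tally_servers(sample).most_common())
```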

File Sizes

Figure A5-2 shows the frequency of HTML file sizes.

Figure A5-2 - HTML File Size

The majority of HTML documents were under 25 Kb, the mean being about 10 Kb. There were a number of extreme values for HTML file size (the standard deviation is roughly 49 Kb); on closer inspection, these are usually large indices or server logs. The smallest document encountered was 49 bytes (a CGI-generated error message) and the largest was about 1 Mb (site server statistics).

Figure A5-3 shows the frequency of image file sizes encountered.

Figure A5-3 - Image File Size

The rough attenuation of the tail compared to HTML file sizes is not really an effect of the different interval-sizes. This might be due to our incomplete trawling of some sites.

We encountered mostly JPEG and GIF formats, the latter being roughly 9% more prevalent than the former. The GIF size distribution is of a similar shape to Figure A5-3 (but with mean 18,732 bytes). The JPEG size distribution is slightly different and had mean 34,865 bytes.

Figure A5-4 shows the frequency distribution of the size of the entry point to the site. This is defined as the sum of the file sizes of the HTML page and all inline components (mostly inline images).

Figure A5-4 - Size of Entry Point to Site

The mean value is 5 Kb. Over a 28.8 Kbps modem connection, this would take about 1.4 seconds to download.
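The download estimate can be checked with a short calculation. The sketch below assumes that 1 Kb means 1,024 bytes and that the full 28.8 Kbps is available for data, ignoring latency and protocol overhead, so it gives a lower bound rather than a measured time.

```python
def download_seconds(size_bytes, link_bits_per_second):
    """Idealised transfer time: payload bits divided by the link rate."""
    return size_bytes * 8 / link_bits_per_second


entry_point_bytes = 5 * 1024   # mean entry point size (assuming 1 Kb = 1,024 bytes)
modem_bps = 28_800             # 28.8 Kbps modem

print(round(download_seconds(entry_point_bytes, modem_bps), 1))  # prints 1.4
```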

General HTML element usage

Figure A5-5 shows the number of unique HTML elements found on each page.

Figure A5-5 - Number of Unique HTML Elements per Page

The number of unique HTML elements on each page of HTML trawled peaks at 25. However we suspect that the distribution is approximately normal and that the peaks at 9 and 25 are due to 'in-house style' of a number of sites that were trawled completely, dwarfing others that were incompletely trawled.

In contrast, Figure A5-6 shows the count of total number of HTML elements per page.

Figure A5-6 - Total number of HTML Elements per Page

The mean number of elements for Figure A5-6 is 187, with standard deviation 532.
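The per-page counts of unique and total elements could be gathered along the following lines. This is a minimal sketch using Python's standard html.parser module, not the method used by the WebWatch robot; the names ElementCounter and element_stats are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class ElementCounter(HTMLParser):
    """Count start tags by element name (names are lower-cased by the parser)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def element_stats(html_text):
    """Return (unique element count, total element count, ten most common)."""
    parser = ElementCounter()
    parser.feed(html_text)
    parser.close()
    counts = parser.counts
    return len(counts), sum(counts.values()), counts.most_common(10)


if __name__ == "__main__":
    page = "<HTML><BODY><P><A HREF='a'>x</A> <A HREF='b'>y</A></P></BODY></HTML>"
    print(element_stats(page))
```

Counting start tags only means that an element is counted once whether or not its end tag is present, which suits HTML of this period where end tags are often omitted.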

The ten most popular elements used per page are shown in Figure A5-7.

Figure A5-7 - Top ten HTML Elements (per page)

Figure A5-7 shows that the A element is by far the most widely used HTML element in the eLib project pages which were analyzed.

This could be due to extensive linking to external resources, or extensive cross-linking within the website. The Access to Network Resources (ANR) projects (such as OMNI, which was completely indexed) are likely to contain large numbers of hyperlinks. It is perhaps surprising that eLib projects generally contain such a high proportion of hyperlinks.

Use of the META Element

We looked at specific uses of the META element, namely HTTP-EQUIV=refresh, search-engine metadata specifications (e.g. as recommended by Alta-Vista), Dublin Core metadata specifications and PICS metadata specifications. Around 19% of all trawled pages of HTML contained such instances of META usage. See Figure A5-8 (HTTP-EQUIVs are not shown).

Figure A5-8 shows usage of the META element.

Figure A5-8 - Use of the <META> Element
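The four kinds of META usage could be distinguished roughly as sketched below. The matching rules are assumptions made for illustration: names beginning "DC." are treated as Dublin Core, the names "description" and "keywords" as search-engine metadata, and HTTP-EQUIV values of "refresh" and "PICS-Label" as refresh and PICS respectively. The rules actually applied by the WebWatch robot are not given in this report.

```python
from html.parser import HTMLParser


class MetaClassifier(HTMLParser):
    """Record which kinds of META usage appear in a page (assumed rules)."""

    def __init__(self):
        super().__init__()
        self.kinds = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lower-cased; lower-case the values as well.
        attr = {name: (value or "").lower() for name, value in attrs}
        name = attr.get("name", "")
        http_equiv = attr.get("http-equiv", "")
        if http_equiv == "refresh":
            self.kinds.add("refresh")
        elif http_equiv == "pics-label":
            self.kinds.add("pics")
        elif name.startswith("dc."):
            self.kinds.add("dublin-core")
        elif name in ("description", "keywords"):
            self.kinds.add("search-engine")


def meta_kinds(html_text):
    """Return the set of META usage kinds found in html_text."""
    parser = MetaClassifier()
    parser.feed(html_text)
    parser.close()
    return parser.kinds


if __name__ == "__main__":
    page = (
        '<META NAME="DC.title" CONTENT="Example">'
        '<META NAME="keywords" CONTENT="eLib, metadata">'
    )
    print(sorted(meta_kinds(page)))
```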

A more in-depth look at the use of Dublin Core metadata is presented in Figure A5-9.

Figure A5-9 - Number of DC Attribute Values per Page

As can be seen from Figure A5-9, up to 18 DC metadata attributes per page were used. eLib pages containing DC metadata tended to make extensive use of the DC attributes, with only a small number using a handful of values.

Use of Inline Scripts

We monitored usage of the SCRIPT element. Event handlers within tags were not analyzed on this trawl.

A brief summary of the SCRIPT element is shown in Figure A5-10.

Figure A5-10 - Use of the <SCRIPT> Element

Analysis of Links

An analysis of absolute URL references (i.e. http://foo.com/blah.html) within the usual hyperlink elements (A, AREA, LINK, MAP) provides information on the top-level domains linked to.

Figure A5-11 shows the ten most popular linked-to domains. Every link in each document was considered and the top-ten calculated. Note that the y-axis is logarithmic.

Figure A5-11 - Top Ten Linked-to Top Level Domains (Evaluated Overall)
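A tally of linked-to top-level domains could be produced as in the sketch below, which collects href values from A, AREA and LINK elements, keeps only absolute http URLs, and counts the last label of each hostname. It is an illustration under those assumptions rather than the robot's own implementation; a link to a numeric IP address would need to be filtered out separately.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Elements whose href attribute is treated as a hyperlink (an assumption).
LINK_ELEMENTS = {"a", "area", "link"}


class LinkCollector(HTMLParser):
    """Collect href values from the hyperlink elements of a page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in LINK_ELEMENTS:
            return
        for name, value in attrs:
            if name == "href" and value:
                self.urls.append(value)


def top_level_domains(html_text, n=10):
    """Return the n most common top-level domains among absolute links."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    tlds = Counter()
    for url in collector.urls:
        parts = urlsplit(url)
        # Relative links have no scheme or hostname and are skipped.
        if parts.scheme in ("http", "https") and parts.hostname:
            tlds[parts.hostname.rsplit(".", 1)[-1]] += 1
    return tlds.most_common(n)


if __name__ == "__main__":
    page = (
        '<A HREF="http://www.ukoln.ac.uk/">UKOLN</A>'
        '<A HREF="papers/paper.html">relative link</A>'
    )
    print(top_level_domains(page))
```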

The Robot

We have analyzed two files of data from separate trawls and have a large sample of HTML, images and other resources. When all bugs are ironed out from the software one file will suffice. Some problems with the new version of the robot meant that some sites were not being trawled completely. It may be useful to bear this in mind while considering this analysis.

Recommendations

Based on this crawl we are able to make a number of recommendations:

* Set your server to generate appropriate response codes (RCs) when there is an error. Robots normally index or audit resources if they receive an appropriate response code. Servers that are not configured to produce the correct response code may cause problems if a robot accesses your site. For example, in a small number of cases our robot audited an error message from a CGI script.

* Within a page, use relative links or ensure the machine name is the same as the publicized machine name. Since the WebWatch robot (like most other robots) does not perform DNS lookups, it will assume a different hostname is an external host. For example, for the site http://www.ambridge.ac.uk/ all hyperlinks should be of the form http://www.ambridge.ac.uk/papers/paper.html, not http://ambridge.ac.uk/papers/paper.html.
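The hostname recommendation above can be illustrated with a short sketch of how a robot might decide whether a link is internal by comparing hostname strings only. The function name is_internal is hypothetical and the comparison rule is an assumption; the point is that www.ambridge.ac.uk and ambridge.ac.uk are different strings, so the second form is treated as an external host.

```python
from urllib.parse import urljoin, urlsplit


def is_internal(link, page_url):
    """True if link resolves to the same hostname string as page_url.

    No DNS lookup is made, so two names for the same machine do not match.
    """
    resolved = urljoin(page_url, link)  # resolves relative links against the page
    return urlsplit(resolved).hostname == urlsplit(page_url).hostname


if __name__ == "__main__":
    site = "http://www.ambridge.ac.uk/"
    for link in (
        "papers/paper.html",
        "http://www.ambridge.ac.uk/papers/paper.html",
        "http://ambridge.ac.uk/papers/paper.html",
    ):
        print(link, is_internal(link, site))
```

Relative links always resolve to the page's own hostname, which is why they are the safer choice.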

Issues

A number of issues emerge from the survey.

Domain Naming

Some eLib project sites had their own domain name or used a significant machine name in the domain. A small number had an entry point at the top level. However, most projects had an entry point located within the institution's directory structure, and a small number used the tilde (~) convention. The persistence of these URLs may be a concern in the longer term.

Metadata

Dublin Core metadata does not yet appear to be widely used. This is perhaps due to the uncertainty of the HTML conventions for embedding Dublin Core metadata within HTML documents, which has only recently been resolved.

Server Usage

Most eLib project web sites use one of a small number of server software packages (Apache, NCSA, Netscape). Projects which make use of a little-used server package may wish to consider migrating to a more widely deployed package.

Future Trawls

Plans for future trawls include:

* More detailed analyses of the use of technologies (e.g. Java, JavaScript, SGML, etc.).

* More detailed analyses of the use of hyperlinks.

* More detailed analyses of the use of Dublin Core metadata.

* Detailed reports on broken links and other errors.

* Validation of HTML (and possibly stylesheets etc.).

An object-oriented re-write of the software (currently under development) will simplify the addition of future enhancements. Ideas for this version include validation and detailed reports of errors.