UKOLN - WebWatch - Reports - Third Crawl of UK Academic Entry Points

Introduction

On 25 November 1998, the WebWatch robot crawled the entry points for UK academic Web sites. This report is an analysis of the findings. This is the third Web crawl of the UK HEI entry points and completes a series of three snapshots of this community. The first crawl is available from the reports area of the WebWatch pages [1] and the second was published in the Journal of Documentation [2].

The input file of URLs obtained from NISS for the previous crawl was used. Of the 170 sites in this list, 150 sites were successfully crawled. Network/connection errors, out of date URLs and so on account for the 20 unexplored sites.

Size Metrics

Figure 1 shows a histogram of the total size of entry points. Total size is defined as the size of the HTML page plus its inline images. Linked resources that modern browsers may also download (external style sheets, external client-side scripts, resources requiring 'plugins' and background images) are not included.

Figure 1 - Total Size of Entry Points

The sizes range from around 5Kb (<URL: http://www.rcm.ac.uk/>) to around 200Kb (<URL: http://www.kiad.ac.uk/>). The second large outlier at 192Kb corresponds to <URL: http://www.scot.ac.uk/>.

Hyperlinks

Figure 2 shows the number of hyperlinks within each site. These are obtained from the A element and from image map AREA elements. This data may include duplicate URLs where more than one hyperlink to the same URL exists.

Figure 2 - Total Number of Hyperlink Elements per Site

Note that the outlier corresponds to <URL: http://www.rhbnc.ac.uk/>.
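
The counting rule described above (every A or AREA element that carries a hyperlink, duplicates included) can be sketched as follows. This is an illustration only, not the WebWatch robot's code: the class name, function name and sample page are hypothetical, and the sketch assumes that an element counts as a hyperlink only when it has an HREF attribute.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Collect HREF values from A and AREA elements, keeping duplicates."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before this call.
        if tag in ("a", "area"):
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def count_links(html_text):
    """Return the number of hyperlink elements found in one page."""
    parser = LinkCounter()
    parser.feed(html_text)
    parser.close()
    return len(parser.links)


# Hypothetical entry point: two links to the same URL, one image map
# area, and one named anchor that is not a hyperlink.
SAMPLE_PAGE = """
<a href="/about/">About</a>
<A HREF="/about/">About us</A>
<map name="nav"><area href="/library/" shape="rect" coords="0,0,10,10"></map>
<a name="top">Top of page</a>
"""

print(count_links(SAMPLE_PAGE))
```

Because duplicates are kept, the two links to /about/ are both counted, which matches the caveat that the data may include duplicate URLs.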

HTTP Servers

Figure 3 shows a pie chart of the server software encountered during the crawl. This information is based upon the HTTP Server header returned by the web server.

Figure 3 - Server Software Encountered

The Other category consists of the following servers:

Server	Count
Borderware	1
Lotus Domino	1
Novell	2
OSU	2
SWS-1.0	1
WebSTAR	4
WinHttpd	1

Figure 4 - Components of the 'Other' slice from Figure 3

A more detailed table of the servers found is shown in Figure 5.

Server	Count
Apache/1.0.0	1
Apache/1.0.3	1
Apache/1.1.1	1
Apache/1.1.3	1
Apache/1.2.0	2
Apache/1.2.1	2
Apache/1.2.1 PHP/FI-2.0b12	1
Apache/1.2.4	7
Apache/1.2.4 FrontPage/3.0.2	1
Apache/1.2.5	12
Apache/1.2.6	6
Apache/1.2b10	2
Apache/1.2b7	1
Apache/1.3.0 (Unix)	7
Apache/1.3.0 (Unix) Debian/GNU	1
Apache/1.3.0 (Unix) PHP/3.0	1
Apache/1.3.1 (Unix)	6
Apache/1.3.2 (Unix)	1
Apache/1.3.3	2
Apache/1.3.3 (Unix)	6
Apache/1.3.3 Ben-SSL/1.28 (Unix) PHP/3.0.5 mod_perl/1.16	1
Apache/1.3.3 UUOnline/1.4 (Unix)	1
Apache/1.3a1	1
Apache/1.3b3	1
Apache/1.3b5	1
BorderWare/2.	1
CERN/3.0	8
CERN/3.0A	3
HTTPS/2.12	1
Lotus-Domino/4.5	1
Microsoft-IIS/2.0	3
Microsoft-IIS/3.0	4
Microsoft-IIS/4.0	15
Microsoft-Internet-Information-Server/1.0	1
NCSA/1.	2
NCSA/1.4.	1
NCSA/1.5.1	3
NCSA/1.5.2	7
Netscape-Communications/1.1	1
Netscape-Communications/1.12	1
Netscape-Enterprise/2.01	3
Netscape-Enterprise/2.0a	2
Netscape-Enterprise/3.0	4
Netscape-Enterprise/3.0F	2
Netscape-Enterprise/3.0K	1
Netscape-Enterprise/3.5-For-NetWare	1
Netscape-Enterprise/3.5.1	4
Netscape-FastTrack/2.0	1
Netscape-FastTrack/2.01	1
Netscape-FastTrack/2.0a	1
Netscape-FastTrack/2.0c	1
Novell-HTTP-Server/2.5R	1
Novell-HTTP-Server/3.1R	1
OSU/1.9b	1
OSU/3.2	1
SWS-1.0	1
WebSTAR	2
WebSTAR/1.2.5 ID/13089	1
WebSTAR/2.0 ID/44693	1
WinHttpd/1.4a (Shareware Non-Commercial License)	1
Total	150

Figure 5 - Table of all Servers Encountered
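
The detailed Server strings in Figure 5 have to be grouped by product before they can be summarised as in Figure 3. A minimal sketch of one way to do this is given below. The grouping rule (take whatever precedes the first "/" in the first token of the header) is an assumption made for illustration and is not necessarily the rule used to produce Figure 3; the function name is hypothetical and only a few rows of Figure 5 are reproduced.

```python
from collections import Counter


def server_family(server_header):
    """Return the product name from an HTTP Server header value.

    Assumption: the product name is the part of the first
    whitespace-separated token that precedes the first "/",
    so "Apache/1.3.3 (Unix)" gives "Apache".
    """
    tokens = server_header.split()
    if not tokens:
        return "Unknown"
    return tokens[0].split("/", 1)[0]


# A few (header, count) rows copied from Figure 5.
ROWS = [
    ("Apache/1.2.5", 12),
    ("Apache/1.3.0 (Unix) PHP/3.0", 1),
    ("Microsoft-IIS/4.0", 15),
    ("CERN/3.0", 8),
    ("CERN/3.0A", 3),
    ("WebSTAR", 2),
    ("WebSTAR/2.0 ID/44693", 1),
]

families = Counter()
for header, count in ROWS:
    families[server_family(header)] += count

for name, total in families.most_common():
    print(name, total)
```

A rule this simple keeps product names that differ only in spelling apart (for example Microsoft-IIS and Microsoft-Internet-Information-Server), so a real summary would need an explicit mapping for such cases.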

Of these servers, 40% used HTTP/1.0 and 60% used HTTP/1.1.

The Queso [3] software was used to estimate the platforms in use. The high level results are summarised in Figure 6. A more detailed breakdown is presented in Figure 7.

OS	Estimated Min	Estimated Max
UNIX	97	108
OS2	0	5
MacOS	6	11
Netware	3	3
Windows NT/95/98	20	20
Other	7	7
Unknown	6	6

Figure 6 - Operating Systems as Reported by Queso

Note that the 'Other' category in Figure 6 corresponds to the Queso output category 'Cisco...' (see Figure 7), and the 'Unknown' category corresponds to the Queso output categories 'Unknown OS', 'Firewalled host/port or network congestion' and 'Dead Host, Firewalled Port or Unassigned IP'.

Note that the estimated minimum and maximum values in Figure 6 may be skewed because of the Queso unknowns referred to above.

Operating system	Count
BSDi or IRIX	1
Berkeley: Digital, HPUX, SunOs4, AIX3, OS/2 WARP-4, others...	5
Berkeley: HP-UX B.10.20	1
Berkeley: IRIX 5.x	3
Berkeley: usually Digital Unix, OSF/1 V3.0, HP-UX 10.x	14
Berkeley: usually HP/UX 9.x	1
Berkeley: usually SunOS 4.x, NexT	5
Cisco 11.2(10a), HP/3000 DTC, BayStack Switch	7
Dead Host, Firewalled Port or Unassigned IP	2
FreeBSD, NetBSD, OpenBSD	1
IBM AIX 4	2
IRIX 6.x	2
Linux 1.3.xx, 2.0.0 to 2.0.34	5
Linux 2.0.35 to 2.0.9	1
MacOS-8	6
Novell Netware TCP/IP	3
Reliant Unix from Siemens-Nixdorf	1
Solaris 2.x	60
Standard: Solaris 2.x, Linux 2.1.???, MacOS	5
Windows 95/98/NT	20
Firewalled Solaris 2.x	1
Firewalled host/port or network congestion	3
Unknown OS	1
Total	150

Figure 7 - Queso Output

Metadata Profile

The attributes of the HTML META element were examined for known metadata conventions. Figure 8 shows the results.

Metadata	No. `META` elements	No. sites
PICS	1	1
HTTP-EQUIV="Refresh"	9	9
Reply-To	3	3
Search Engine	190	95
Dublin Core	102	11
HTTP-EQUIV="(Dublin Core)"	8	1

Figure 8 - Types of Metadata Encountered
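
A profile of this kind can be produced by classifying each META element by its NAME or HTTP-EQUIV attribute. The sketch below is illustrative only. The matching rules are assumptions, since the report does not list them: in particular, 'Search Engine' is taken here to mean NAME="keywords" or NAME="description", Dublin Core is taken to mean a "DC." prefix, and PICS is taken to mean HTTP-EQUIV="PICS-Label". All names in the code are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


def classify_meta(attrs):
    """Map one META element's attributes to a Figure 8 style category.

    The rules below are illustrative assumptions, not the robot's rules.
    Returns None for META elements that match no category.
    """
    name = (attrs.get("name") or "").lower()
    http_equiv = (attrs.get("http-equiv") or "").lower()
    if http_equiv == "refresh":
        return 'HTTP-EQUIV="Refresh"'
    if http_equiv == "pics-label":
        return "PICS"
    if http_equiv.startswith("dc."):
        return 'HTTP-EQUIV="(Dublin Core)"'
    if name.startswith("dc."):
        return "Dublin Core"
    if name in ("keywords", "description"):
        return "Search Engine"
    if name == "reply-to" or http_equiv == "reply-to":
        return "Reply-To"
    return None


class MetaProfiler(HTMLParser):
    """Count META elements per category for a single page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            category = classify_meta(dict(attrs))
            if category is not None:
                self.counts[category] += 1


def profile_meta(html_text):
    profiler = MetaProfiler()
    profiler.feed(html_text)
    profiler.close()
    return profiler.counts


# Hypothetical page head used only to exercise the sketch.
SAMPLE_HEAD = """
<meta name="keywords" content="university, courses">
<meta name="description" content="Home page">
<META NAME="DC.Title" CONTENT="Example University">
<meta http-equiv="Refresh" content="5; URL=/home.html">
<meta name="generator" content="some editor">
"""

print(dict(profile_meta(SAMPLE_HEAD)))
```

Counting per page in this way gives the 'No. META elements' column; the 'No. sites' column would count each site once per category, however many matching elements it contains.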

Technologies

Scripting

29 pages used the SCRIPT element to include a client-side script block. Of these, 23 pages included the attribute-value LANGUAGE="JavaScript".

All HTML elements were searched for the set of defined JavaScript event handlers. The results are shown in Figure 9.

Handlers	Count	Sites
`onChange`	1	1
`onClick`	13	4
`onLoad`	10	8
`onMouseOver`	320	36

Figure 9 - Event Handlers Encountered
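
The search for event handlers amounts to checking the attribute names of every element against a known set. A minimal sketch is shown below. Only the four handlers that appear in Figure 9 are listed, whereas the survey searched for the full set of defined handlers; the names in the code and the sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Handlers taken from Figure 9, keyed by their lower-case form.
# A fuller survey would list every defined handler here.
HANDLERS = {
    "onchange": "onChange",
    "onclick": "onClick",
    "onload": "onLoad",
    "onmouseover": "onMouseOver",
}


class HandlerCounter(HTMLParser):
    """Count event handler attributes on any element of a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lower-cased, so matching is case-insensitive.
        for name, _value in attrs:
            if name in HANDLERS:
                self.counts[HANDLERS[name]] += 1


def count_handlers(html_text):
    counter = HandlerCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts


# Hypothetical page fragment used only to exercise the sketch.
SAMPLE_BODY = """
<body onLoad="init()">
<a href="/a" onMouseOver="swap(1)" onMouseOut="swap(0)">A</a>
<a href="/b" onmouseover="swap(2)">B</a>
<input type="button" onClick="go()">
</body>
"""

print(dict(count_handlers(SAMPLE_BODY)))
```

The onMouseOut attribute in the sample is ignored because it is not in the set being searched for, which shows how the reported counts depend on the handler list used.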

Java

Two Java applets were referenced by the site <URL: http://www.uwic.ac.uk/>.

The site <URL: http://www.luton.ac.uk/> referenced a plugin using the OBJECT element.

Splash Screens and Frames

21 sites used framesets. 10 sites used HTTP-EQUIV="refresh" to update the entry point page.

Cachability

Figure 10 shows a summary of the cachability of crawled resources.

Cachable resources	72.5% of HTML pages, 80.9% of images
Non-cachable resources	4.4% of HTML pages, 0.2% of images

Figure 10 - Cachability of Resources Encountered

Additionally, 40% of HTML pages and 45% of images contained the HTTP/1.1 Etag header.

A resource is defined as cachable if:

It contains an Expires header showing that the resource has not expired
It contains a Last-Modified header with a modification date more than one day before the robot crawl
It contains the Cache-control: public header

A resource is defined as not cachable if:

It contains an Expires header showing that the resource has expired
It contains a Last-Modified header with a modification date coinciding with the day of the robot crawl
It contains the Cache-control: no-cache or Cache-control: no-store headers
It contains the Pragma: no-cache header

The cachability of a resource is not determined if the resource used the HTTP/1.1 Etag header, since this would have required additional testing at the time of the crawl, which was not carried out.
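
The rules above can be expressed as a small classification function. The sketch below is one possible reading of them, not the code used for the survey. The report does not say which rule wins when several apply, so the order used here (Etag first, then the 'not cachable' rules, then the 'cachable' rules) is an assumption, as are the function names and the sample crawl time. Resources matching no rule are reported as undetermined.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Parse an HTTP date into UTC; return None if missing or malformed."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def classify(headers, crawl_time):
    """Return 'cachable', 'not cachable' or 'undetermined' for one resource.

    Assumed precedence: Etag, then non-cachable rules, then cachable rules.
    """
    h = {key.lower(): value.strip() for key, value in headers.items()}
    if "etag" in h:
        return "undetermined"

    directives = {d.strip().lower() for d in h.get("cache-control", "").split(",")}
    pragma = h.get("pragma", "").lower()
    expires = parse_http_date(h.get("expires"))
    modified = parse_http_date(h.get("last-modified"))

    # Rules for resources that are not cachable.
    if expires is not None and expires <= crawl_time:
        return "not cachable"
    if modified is not None and modified.date() == crawl_time.date():
        return "not cachable"
    if directives & {"no-cache", "no-store"}:
        return "not cachable"
    # HTTP/1.0 spells the directive "no-cache"; "nocache" is accepted too.
    if pragma in ("no-cache", "nocache"):
        return "not cachable"

    # Rules for resources that are cachable.
    if expires is not None and expires > crawl_time:
        return "cachable"
    if modified is not None and crawl_time - modified > timedelta(days=1):
        return "cachable"
    if "public" in directives:
        return "cachable"
    return "undetermined"


# Assumed time of day for the crawl on 25 November 1998.
CRAWL = datetime(1998, 11, 25, 12, 0, tzinfo=timezone.utc)

print(classify({"Last-Modified": "Mon, 02 Nov 1998 09:30:00 GMT"}, CRAWL))
print(classify({"Last-Modified": "Wed, 25 Nov 1998 08:00:00 GMT"}, CRAWL))
print(classify({"ETag": '"abc"', "Cache-Control": "public"}, CRAWL))
```

The third outcome explains why the percentages in Figure 10 do not sum to 100%: resources with an Etag header, or with none of the headers listed, fall into neither category.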

Comparisons with Previous Crawls

Server Profiles

As shown in Figure 11, the Apache and Microsoft servers have shown increasing adoption. The Netscape server has fluctuated (perhaps due to a period of experimentation). The NCSA and CERN servers have shown a decrease in usage.

The growth of Apache and Microsoft servers has also resulted in a decrease of the 'Other' category, i.e. sites are subscribing to the more popular servers.

Figure 11 - Use of Server Software Over Three Crawls

A chart showing the growth of various servers is shown in Figure 12. This chart shows the contribution to growth for the periods Oct 1997 - Jul 1998 and Jul 1998 - Nov 1998. Note that negative growth is interpreted as decline.

Figure 12 - Growth of Servers over Three Crawls

Size of Entry Points

A set of sites was isolated, for which reliable measurements of size exist for two previous web crawls. The results are shown in Figure 13.

Figure 13 - Changes in Size of Entry Points

Note that a majority of sites have not undergone great fluctuations in size. The outlier corresponds to <URL: http://www.scot.ac.uk>. The pages for this site are different since this site has become part of a larger institution.

"Splash Screens"

The number of institutional entry points which make use of "splash screens" or redirects has shown a steady increase, from five sites (October 1997) to seven sites (July 1998) to ten sites in the current crawl (November 1998).

Hyperlink Profiles

The domains referenced by hyperlinks in the three crawls have been dominated by ac.uk, and the share of this domain has shown an overall increase. Figure 14 shows the contribution of different types of domain name as a percentage of all hyperlinks found.

Domain	October 1997	July 1998	November 1998
`Total .uk`	97.31%	97.13%	98.00%
`ac.uk`	96.63%	95.94%	97.68%
`net`	0.30%	0.16%	0.11%
`com`	0.82%	0.61%	0.63%
`org`	0.34%	0.08%	0.18%
Other	0.15%	0.08%	0.10%
IP address	0.00%	0.12%	0.04%
Badly formed URL	1.10%	1.72%	0.91%

Figure 14 - Domains Referenced in Hyperlinks

Note that in Figure 14 the ac.uk data is a subset of the .uk data.
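
A breakdown like the one in Figure 14 can be produced by classifying the host name of each hyperlink. The sketch below is illustrative and makes several assumptions that the report does not state: links are taken to have been resolved to absolute URLs already, a URL with no recognisable host is treated as badly formed, and the function name and sample URLs are hypothetical. The 'Total .uk' row of Figure 14 would be the sum of the "ac.uk" and "other .uk" categories returned here.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_link(url):
    """Return the Figure 14 style category for one absolute URL."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return "Badly formed URL"
    if not host:
        return "Badly formed URL"

    # Hosts given as numeric addresses form their own category.
    try:
        ipaddress.ip_address(host)
        return "IP address"
    except ValueError:
        pass

    if host.endswith(".ac.uk"):
        return "ac.uk"
    if host.endswith(".uk"):
        return "other .uk"
    top_level = host.rsplit(".", 1)[-1]
    if top_level in ("net", "com", "org"):
        return top_level
    return "Other"


# Hypothetical links used only to exercise the sketch.
for link in (
    "http://www.ukoln.ac.uk/web-focus/",
    "http://www.example.co.uk/",
    "http://www.example.com/",
    "http://192.0.2.10/index.html",
    "http:/missing-slash.html",
):
    print(link, "->", classify_link(link))
```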

Use of Metadata

In each crawl, we have looked for search-engine (SE) type metadata and Dublin Core (DC) metadata. The findings over the three crawls are shown in Figure 15.

Figure 15 - Trends in Metadata Usage

Figure 15 shows that the use of Dublin Core metadata has increased considerably over the three crawls, from one site in October 1997 to 11 sites in November 1998.

References

[1] A Survey of UK Academic Library Web Sites <URL: http://www.ukoln.ac.uk/web-focus/webwatch/reports/hei-lib-may1998/>
[2] How Is My Web Community Doing? Monitoring Trends in Web Service Provision, Journal of Documentation, Vol. 55 No. 1, January 1999, pp 82-95.
[3] Queso <URL: http://www.apostols.org/projectz/queso/>