UKOLN Third Crawl of UK Academic Entry Points



Introduction

On 25 November 1998, the WebWatch robot crawled the entry points for UK academic Web sites. This report is an analysis of the findings. This is the third Web crawl of the UK HEI entry points and completes a series of three snapshots of this community. The first crawl is available from the reports area of the WebWatch pages [1] and the second was published in the Journal of Documentation [2].

The crawl reused the input file of URLs obtained from NISS for the previous crawl. Of the 170 sites in this list, 150 sites were successfully crawled. Network or connection errors, out-of-date URLs and similar problems account for the 20 sites that were not crawled.

Size Metrics

Figure 1 shows a histogram of the total size of entry points. Total size is defined as the size of the HTML page plus its inline images. Linked resources that modern browsers may also download, including external style sheets, external client-side scripts, resources requiring 'plugins' and background images, are not included.


Figure 1 - Total Size of Entry Points

The sizes range from around 5Kb (<URL: http://www.rcm.ac.uk/>) to around 200Kb (<URL: http://www.kiad.ac.uk/>). The second large outlier, at 192Kb, corresponds to <URL: http://www.scot.ac.uk/>.
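The size definition above can be illustrated with a minimal sketch. It is not the WebWatch robot's code: the class and function names are hypothetical, the `size_of` lookup stands in for whatever mechanism fetches an image and reports its byte count, and counting each distinct image once is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class InlineImageCollector(HTMLParser):
    """Collects the SRC of every IMG element (the 'inline images')."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = set()  # assumption: each distinct image counts once

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.add(urljoin(self.base_url, src))


def total_size(base_url, html_bytes, size_of):
    """Return HTML size plus the size of each inline image, in bytes.

    size_of is a caller-supplied function mapping a URL to a byte count
    (hypothetical; a real crawler would issue an HTTP request here).
    """
    collector = InlineImageCollector(base_url)
    collector.feed(html_bytes.decode("latin-1"))
    return len(html_bytes) + sum(size_of(u) for u in collector.image_urls)


# Example with made-up sizes; the BODY background image is not counted.
page = (b'<html><body background="bg.gif"><img src="logo.gif">'
        b'<IMG SRC="/img/crest.gif"><img src="logo.gif"></body></html>')
sizes = {
    "http://www.example.ac.uk/logo.gif": 2000,
    "http://www.example.ac.uk/img/crest.gif": 3500,
}
entry_point_size = total_size("http://www.example.ac.uk/", page, sizes.__getitem__)
```

Background images, style sheets and scripts are ignored simply because only IMG elements are inspected, which mirrors the exclusions listed in the definition.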

Hyperlinks

Figure 2 shows the number of hyperlinks within each site. These are obtained from the A element and from image map AREA elements. This data may include duplicate URLs where more than one hyperlink to the same URL exists.


Figure 2 - Total Number of Hyperlink Elements per Site

Note that the outlier corresponds to <URL: http://www.rhbnc.ac.uk/>.
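As a rough illustration of the counting rule described for Figure 2, the sketch below collects the HREF of every A and AREA element without removing duplicates. The names are hypothetical, and skipping A elements that have no HREF (named anchors) is an assumption rather than something the report states.

```python
from html.parser import HTMLParser


class HyperlinkCounter(HTMLParser):
    """Records the HREF of every A and AREA element, duplicates included."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area"):
            href = dict(attrs).get("href")
            # Assumption: A elements without HREF are anchors, not hyperlinks.
            if href is not None:
                self.hrefs.append(href)


def count_hyperlinks(html):
    """Return (hyperlink elements, distinct URLs) for one page."""
    counter = HyperlinkCounter()
    counter.feed(html)
    return len(counter.hrefs), len(set(counter.hrefs))


sample = ('<a href="/a">A</a><a name="top"></a>'
          '<map name="m"><area href="/a"><area href="/b"></map>')
elements, distinct = count_hyperlinks(sample)
```

The first number corresponds to the figure's measure (hyperlink elements), while the second shows how much smaller the count would be if duplicate URLs were collapsed.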

HTTP Servers

Figure 3 shows a pie chart of the server software encountered during the crawl. This information is based upon the HTTP Server header returned by the web server.


Figure 3 - Server Software Encountered
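A grouping of Server header values into chart slices might look like the following sketch. The family list is taken from the server names discussed in this report (Apache, Microsoft, Netscape, NCSA, CERN, with everything else as 'Other'); the prefix-matching rule and the function name are assumptions, since the report does not describe how the grouping was done.

```python
from collections import Counter

# Families named in this report; anything else is treated as 'Other'.
FAMILIES = ("Apache", "Microsoft", "Netscape", "NCSA", "CERN")


def server_family(server_header):
    """Map a raw HTTP Server header value to a server family."""
    header = server_header.strip()
    for family in FAMILIES:
        # Assumption: the product name at the start identifies the family.
        if header.startswith(family):
            return family
    return "Other"


# A few header values of the kind returned by the crawled sites.
headers = [
    "Apache/1.3.3 (Unix)",
    "Apache/1.2.5",
    "Microsoft-IIS/4.0",
    "Netscape-Enterprise/3.5.1",
    "NCSA/1.5.2",
    "CERN/3.0A",
    "WebSTAR/2.0 ID/44693",
    "SWS-1.0",
]
slices = Counter(server_family(h) for h in headers)
```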

The Other category consists of the following servers:

Server          Count
Borderware      1
Lotus Domino    1
Novell          2
OSU             2
SWS-1.0         1
WebSTAR         4
WinHttpd        1

Figure 4 - Components of the 'Other' slice from Figure 3

A more detailed table of the servers found is shown in Figure 5.

Server                                                    Count
Apache/1.0.0                                              1
Apache/1.0.3                                              1
Apache/1.1.1                                              1
Apache/1.1.3                                              1
Apache/1.2.0                                              2
Apache/1.2.1                                              2
Apache/1.2.1 PHP/FI-2.0b12                                1
Apache/1.2.4                                              7
Apache/1.2.4 FrontPage/3.0.2                              1
Apache/1.2.5                                              12
Apache/1.2.6                                              6
Apache/1.2b10                                             2
Apache/1.2b7                                              1
Apache/1.3.0 (Unix)                                       7
Apache/1.3.0 (Unix) Debian/GNU                            1
Apache/1.3.0 (Unix) PHP/3.0                               1
Apache/1.3.1 (Unix)                                       6
Apache/1.3.2 (Unix)                                       1
Apache/1.3.3                                              2
Apache/1.3.3 (Unix)                                       6
Apache/1.3.3 Ben-SSL/1.28 (Unix) PHP/3.0.5 mod_perl/1.16  1
Apache/1.3.3 UUOnline/1.4 (Unix)                          1
Apache/1.3a1                                              1
Apache/1.3b3                                              1
Apache/1.3b5                                              1
BorderWare/2.1                                            1
CERN/3.0                                                  8
CERN/3.0A                                                 3
HTTPS/2.12                                                1
Lotus-Domino/4.5                                          1
Microsoft-IIS/2.0                                         3
Microsoft-IIS/3.0                                         4
Microsoft-IIS/4.0                                         15
Microsoft-Internet-Information-Server/1.0                 1
NCSA/1.2
NCSA/1.4.1
NCSA/1.5.1                                                3
NCSA/1.5.2                                                7
Netscape-Communications/1.1                               1
Netscape-Communications/1.12                              1
Netscape-Enterprise/2.01                                  3
Netscape-Enterprise/2.0a                                  2
Netscape-Enterprise/3.0                                   4
Netscape-Enterprise/3.0F                                  2
Netscape-Enterprise/3.0K                                  1
Netscape-Enterprise/3.5-For-NetWare                       1
Netscape-Enterprise/3.5.1                                 4
Netscape-FastTrack/2.0                                    1
Netscape-FastTrack/2.01                                   1
Netscape-FastTrack/2.0a                                   1
Netscape-FastTrack/2.0c                                   1
Novell-HTTP-Server/2.5R                                   1
Novell-HTTP-Server/3.1R                                   1
OSU/1.9b                                                  1
OSU/3.2                                                   1
SWS-1.0                                                   1
WebSTAR                                                   2
WebSTAR/1.2.5 ID/13089                                    1
WebSTAR/2.0 ID/44693                                      1
WinHttpd/1.4a (Shareware Non-Commercial License)          1
Total                                                     150

Figure 5 - Table of all Servers Encountered

Of these servers, 40% used HTTP/1.0 and 60% used HTTP/1.1.

The Queso [3] software was used to estimate the operating system platforms in use. The high-level results are summarised in Figure 6. A more detailed breakdown is presented in Figure 7.

                    Estimated
OS                  Min    Max
UNIX                97     108
OS2                 0      5
MacOS               6      11
Netware             3      3
Windows NT/95/98    20     20
Other               7      7
Unknown             6      6

Figure 6 - Operating Systems as Reported by Queso

Note that the 'Other' category in Figure 6 corresponds to the Queso output category 'Cisco...' (see Figure 7), and the 'Unknown' category corresponds to the Queso output categories 'Unknown OS', 'Firewalled host/port or network congestion' and 'Dead Host, Firewalled Port or Unassigned IP'.

Note that the estimated minimum and maximum values in Figure 6 may be skewed because of the Queso unknowns referred to above.

Operating system                                                Count
BSDi or IRIX                                                    1
Berkeley: Digital, HPUX, SunOs4, AIX3, OS/2 WARP-4, others...   5
Berkeley: HP-UX B.10.20                                         1
Berkeley: IRIX 5.x                                              3
Berkeley: usually Digital Unix, OSF/1 V3.0, HP-UX 10.x          14
Berkeley: usually HP/UX 9.x                                     1
Berkeley: usually SunOS 4.x, NexT                               5
Cisco 11.2(10a), HP/3000 DTC, BayStack Switch                   7
Dead Host, Firewalled Port or Unassigned IP                     2
FreeBSD, NetBSD, OpenBSD                                        1
IBM AIX 4                                                       2
IRIX 6.x                                                        2
Linux 1.3.xx, 2.0.0 to 2.0.34                                   5
Linux 2.0.35 to 2.0.9                                           1
MacOS-8                                                         6
Novell Netware TCP/IP                                           3
Reliant Unix from Siemens-Nixdorf                               1
Solaris 2.x                                                     60
Standard: Solaris 2.x, Linux 2.1.???, MacOS                     5
Windows 95/98/NT                                                20
Firewalled Solaris 2.x                                          1
Firewalled host/port or network congestion                      3
Unknown OS                                                      1
Total                                                           150

Figure 7 - Queso Output
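The report does not state how the Min and Max estimates in Figure 6 were derived from the Queso output. One plausible reading, sketched below, is that a Queso label naming a single operating system family counts towards both the minimum and the maximum for that family, while a label naming several families counts only towards the maximum of each. The mapping of labels to families is an assumption made for illustration; under it the MacOS and OS2 rows of Figure 6 are reproduced, and the UNIX row is not attempted here.

```python
# Counts copied from the Queso output table; only the rows that mention
# MacOS or OS/2 are included.
QUESO_ROWS = [
    ("MacOS-8", 6),
    ("Standard: Solaris 2.x, Linux 2.1.???, MacOS", 5),
    ("Berkeley: Digital, HPUX, SunOs4, AIX3, OS/2 WARP-4, others...", 5),
]

# Assumed mapping from each Queso label to the families it could denote.
CANDIDATES = {
    "MacOS-8": {"MacOS"},
    "Standard: Solaris 2.x, Linux 2.1.???, MacOS": {"UNIX", "MacOS"},
    "Berkeley: Digital, HPUX, SunOs4, AIX3, OS/2 WARP-4, others...": {"UNIX", "OS2"},
}


def estimate_range(rows, candidates, family):
    """Return (min, max) site counts for one operating system family."""
    # Minimum: only labels that can mean this family and nothing else.
    low = sum(n for label, n in rows if candidates[label] == {family})
    # Maximum: every label that could possibly mean this family.
    high = sum(n for label, n in rows if family in candidates[label])
    return low, high


macos_range = estimate_range(QUESO_ROWS, CANDIDATES, "MacOS")
os2_range = estimate_range(QUESO_ROWS, CANDIDATES, "OS2")
```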

Metadata Profile

The attributes of the HTML META element were examined for known metadata conventions. Figure 8 shows the results.

Metadata                      No. META elements   No. sites
PICS                          1                   1
HTTP-EQUIV="Refresh"          9                   9
Reply-To                      3                   3
Search Engine                 190                 95
Dublin Core                   102                 11
HTTP-EQUIV="(Dublin Core)"    8                   1

Figure 8 - Types of Metadata Encountered
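The kind of classification behind Figure 8 can be sketched as follows. The matching rules are assumptions, not the rules the WebWatch robot used: search-engine metadata is taken to mean NAME="keywords" or NAME="description", Dublin Core is recognised by a "DC." prefix, and PICS by HTTP-EQUIV="PICS-Label". All function and class names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

SEARCH_ENGINE_NAMES = {"keywords", "description"}  # assumed convention


def classify_meta(attrs):
    """Return the metadata category for one META element, or None."""
    name = (attrs.get("name") or "").lower()
    http_equiv = (attrs.get("http-equiv") or "").lower()
    if http_equiv == "pics-label":
        return "PICS"
    if http_equiv == "refresh":
        return 'HTTP-EQUIV="Refresh"'
    if "reply-to" in (name, http_equiv):
        return "Reply-To"
    if http_equiv.startswith("dc."):
        return 'HTTP-EQUIV="(Dublin Core)"'
    if name.startswith("dc."):
        return "Dublin Core"
    if name in SEARCH_ENGINE_NAMES:
        return "Search Engine"
    return None


class MetaProfiler(HTMLParser):
    """Counts META elements per recognised metadata convention."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            category = classify_meta(dict(attrs))
            if category:
                self.counts[category] += 1


def profile_metadata(html):
    profiler = MetaProfiler()
    profiler.feed(html)
    return profiler.counts


sample = ('<META NAME="keywords" CONTENT="university, courses">'
          '<META NAME="DC.Title" CONTENT="Home page">'
          '<META HTTP-EQUIV="Refresh" CONTENT="5; URL=home.html">'
          '<meta name="generator" content="editor">')
profile = profile_metadata(sample)
```

Running the profiler once per entry point and counting the pages with a non-zero total for a category would give the "No. sites" column; summing the counts gives the "No. META elements" column.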

Technologies

Scripting

29 pages used the SCRIPT element to include a client-side script block. Of these, 23 pages included the attribute-value LANGUAGE="JavaScript".

All HTML elements were searched for the set of defined JavaScript event handlers. The results are shown in Figure 9.

Handlers       Count   Sites
onChange       1       1
onClick        13      4
onLoad         10      8
onMouseOver    320     36

Figure 9 - Event Handlers Encountered
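A search of this kind can be sketched by inspecting the attributes of every element. The handler set below is the list of intrinsic event attributes from HTML 4.0, which is assumed here to approximate the "set of defined JavaScript event handlers"; the report does not list the set it used, and the names in the code are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# HTML 4.0 intrinsic event attributes (assumed to match the report's set).
EVENT_HANDLERS = {
    "onload", "onunload", "onclick", "ondblclick", "onmousedown",
    "onmouseup", "onmouseover", "onmousemove", "onmouseout", "onfocus",
    "onblur", "onkeypress", "onkeydown", "onkeyup", "onsubmit",
    "onreset", "onselect", "onchange",
}


class HandlerCounter(HTMLParser):
    """Counts event handler attributes on all elements of a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lower-cased, so onMouseOver matches too.
        for name, _value in attrs:
            if name in EVENT_HANDLERS:
                self.counts[name] += 1


def count_handlers(html):
    parser = HandlerCounter()
    parser.feed(html)
    return parser.counts


sample = ('<body onLoad="init()">'
          '<a href="x.html" onMouseOver="on(1)" onMouseOut="off(1)">x</a>'
          '<a href="y.html" onMouseOver="on(2)">y</a></body>')
handlers = count_handlers(sample)
```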

Java

Two Java applets were referenced by the site <URL: http://www.uwic.ac.uk/>.

The site <URL: http://www.luton.ac.uk/> referenced a plugin using the OBJECT element.

Splash Screens and Frames

21 sites used framesets. 10 sites used HTTP-EQUIV="refresh" to update the entry point page.

Cachability

Figure 10 shows a summary of the cachability of crawled resources.

Cachable resources        72.5% of HTML pages, 80.9% of images
Non-cachable resources    4.4% of HTML pages, 0.2% of images

Figure 10 - Cachability of Resources Encountered

Additionally, 40% of HTML pages and 45% of images were served with the HTTP/1.1 ETag header.

A resource is defined as cachable if:

A resource is defined as not cachable if:

The cachability of a resource was not determined if the resource used the HTTP/1.1 ETag header, since this would have required additional testing at the time of the crawl, which was not carried out.

Comparisons with Previous Crawls

Server Profiles

As shown in Figure 11, the Apache and Microsoft servers have shown increasing adoption. The Netscape server has fluctuated (perhaps due to a period of experimentation). The NCSA and CERN servers have shown a decrease in usage.

The growth of Apache and Microsoft servers has also resulted in a decrease of the 'Other' category, i.e. sites are subscribing to the more popular servers.


Figure 11 - Use of Server Software Over Three Crawls

A chart showing the growth of various servers is shown in Figure 12. This chart shows the contribution of growth for the period Oct 1997 - Jul 1998 and Jul 1998 - Nov 1998. Note that negative growth is interpreted as decline.


Figure 12 - Growth of Servers over Three Crawls

Size of Entry Points

A set of sites was isolated, for which reliable measurements of size exist for two previous web crawls. The results are shown in Figure 13.


Figure 13 - Changes in Size of Entry Points

Note that a majority of sites have not undergone great fluctuations in size. The outlier corresponds to <URL: http://www.scot.ac.uk>. The pages for this site are different since this site has become part of a larger institution.

"Splash Screens"

The number of institutional entry points which make use of "splash screens" or redirects has increased steadily, from five sites (October 1997) to seven sites (July 1998) to ten sites in the current crawl.

Hyperlink Profiles

The domains referenced by hyperlinks in the three crawls have been dominated by ac.uk and this domain has shown an overall increase. Figure 14 shows the contribution of different types of domain name as a percentage of all hyperlinks in the site.

Domain              October 1997   July 1998   November 1998
Total .uk           97.31%         97.13%      98.00%
ac.uk               96.63%         95.94%      97.68%
net                 0.30%          0.16%       0.11%
com                 0.82%          0.61%       0.63%
org                 0.34%          0.08%       0.18%
Other               0.15%          0.08%       0.10%
IP address          0.00%          0.12%       0.04%
Badly formed URL    1.10%          1.72%       0.91%

Figure 14 - Domains Referenced in Hyperlinks

Note that in Figure 14 the ac.uk data is a subset of the .uk data.
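The domain categories of Figure 14 could be assigned to a hyperlink along the lines of the sketch below. It assumes that relative links have already been resolved to absolute URLs, and it treats a URL with no usable host name as badly formed; both points, and the function name, are assumptions rather than the report's stated method. Because ac.uk is a subset of .uk, the sketch returns "Other .uk" for the remainder, so the "Total .uk" row would be the sum of the two.

```python
import re
from urllib.parse import urlsplit

DOTTED_QUAD = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def classify_domain(url):
    """Return the Figure 14 style category for one absolute URL."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return "Badly formed URL"
    if not host or "." not in host:
        return "Badly formed URL"
    if DOTTED_QUAD.match(host):
        return "IP address"
    if host.endswith(".ac.uk"):
        return "ac.uk"
    if host.endswith(".uk"):
        return "Other .uk"
    top_level = host.rsplit(".", 1)[1]
    if top_level in ("net", "com", "org"):
        return top_level
    return "Other"


examples = {
    "http://www.ukoln.ac.uk/web-focus/": classify_domain("http://www.ukoln.ac.uk/web-focus/"),
    "http://www.bbc.co.uk/": classify_domain("http://www.bbc.co.uk/"),
    "http://www.w3.org/": classify_domain("http://www.w3.org/"),
    "http://192.0.2.10/index.html": classify_domain("http://192.0.2.10/index.html"),
    "http:/www.example.ac.uk/": classify_domain("http:/www.example.ac.uk/"),
}
```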

Use of Metadata

In each crawl, we have looked for search-engine (SE) type metadata and Dublin-Core (DC) metadata. The findings over the 3 crawls are shown in Figure 15.


Figure 15 - Trends in Metadata Usage

Figure 15 shows that the use of Dublin Core metadata has increased considerably over the three crawls, from one site in October 1997 to 11 sites in November 1998.

References

  1. A Survey of UK Academic Library Web Sites <URL: http://www.ukoln.ac.uk/web-focus/webwatch/reports/hei-lib-may1998/>
  2. How Is My Web Community Doing? Monitoring Trends in Web Service Provision, Journal of Documentation, Vol. 55, No. 1, January 1999, pp. 82-95.
  3. Queso <URL: http://www.apostols.org/projectz/queso/>