Appendix 7 Third Trawl of UK Academic Entry Points

Introduction

On 25 November 1998 the WebWatch robot trawled the entry points for UK academic Web sites. This report is an analysis of the findings. This is the third Web crawl of the UK HEI entry points and completes a series of three snapshots of this community. The first crawl is available from the reports area of the WebWatch pages [1] and the second was published in the Journal of Documentation [2].

The input file of URLs obtained from NISS for the previous crawl was used. Of the 170 sites in this list, 150 sites were successfully crawled. Network and connection errors, out-of-date URLs and similar problems account for the 20 unexplored sites.

Size Metrics

Figure A7-1 shows a histogram of the total size of entry points. Total size is defined as the size of the HTML page together with its inline images. A number of linked resources which may be downloaded by modern browsers, including external style sheets, external client-side scripts, resources requiring 'plugins' and background images, are not included.

Figure A7-1 - Total Size of Entry Points

The sizes range from around 5Kb (<URL: http://www.rcm.ac.uk/>) to around 200Kb (<URL: http://www.kiad.ac.uk/>). The second large outlier at 192Kb corresponds to <URL: http://www.scot.ac.uk/>.

Hyperlinks

Figure A7-2 shows the number of hyperlinks within each site. These are obtained from the A element and from image map AREA elements. This data may include duplicate URLs where more than one hyperlink to the same URL exists.

Note that the outlier corresponds to <URL: http://www.rhbnc.ac.uk/>.

Figure A7-2 - Total Number of Hyperlink Elements per Site
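
The counting rule described above (every A and AREA element carrying a link, with duplicates kept) can be illustrated with a minimal sketch. This is not the WebWatch robot's code; the class and function names are hypothetical, and the sketch assumes that only elements with a non-empty HREF attribute count as hyperlinks.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Collect HREF values from A and AREA elements, keeping duplicates."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def count_hyperlinks(html_text):
    """Return the number of hyperlink elements found in one page."""
    parser = LinkCounter()
    parser.feed(html_text)
    parser.close()
    return len(parser.links)


sample = (
    '<a href="/a">A</a> <A HREF="/a">A again</A> '
    '<map name="m"><area shape="rect" coords="0,0,1,1" href="/b"></map> '
    '<a name="top">named anchor, no link</a>'
)
print(count_hyperlinks(sample))
```

Because duplicates are kept, two links to the same URL count twice, which matches the caveat given for Figure A7-2.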

HTTP Servers

Figure A7-3 shows a pie chart of the server software encountered during the crawl. This information is based upon the HTTP Server header returned by the web server.

Figure A7-3 - Server Software Encountered

The Other category consists of the servers listed in Table A7-1.

Server          Count
Borderware      1
Lotus Domino    1
Novell          2
OSU             2
SWS-1.0         1
WebSTAR         4
WinHttpd        1

Table A7-1 - Components of the 'Other' slice from Figure A7-3

A more detailed table of the servers found is shown in Table A7-2.

Server                                                    Count
Apache/1.0.0                                              1
Apache/1.0.3                                              1
Apache/1.1.1                                              1
Apache/1.1.3                                              1
Apache/1.2.0                                              2
Apache/1.2.1                                              2
Apache/1.2.1 PHP/FI-2.0b12                                1
Apache/1.2.4                                              7
Apache/1.2.4 FrontPage/3.0.2                              1
Apache/1.2.5                                              12
Apache/1.2.6                                              6
Apache/1.2b10                                             2
Apache/1.2b7                                              1
Apache/1.3.0 (Unix)                                       7
Apache/1.3.0 (Unix) Debian/GNU                            1
Apache/1.3.0 (Unix) PHP/3.0                               1
Apache/1.3.1 (Unix)                                       6
Apache/1.3.2 (Unix)                                       1
Apache/1.3.3                                              2
Apache/1.3.3 (Unix)                                       6
Apache/1.3.3 Ben-SSL/1.28 (Unix) PHP/3.0.5 mod_perl/1.16  1
Apache/1.3.3 UUOnline/1.4 (Unix)                          1
Apache/1.3a1                                              1
Apache/1.3b3                                              1
Apache/1.3b5                                              1
BorderWare/2.                                             1
CERN/3.0                                                  8
CERN/3.0A                                                 3
HTTPS/2.12                                                1
Lotus-Domino/4.5                                          1
Microsoft-IIS/2.0                                         3
Microsoft-IIS/3.0                                         4
Microsoft-IIS/4.0                                         15
Microsoft-Internet-Information-Server/1.0                 1
NCSA/1.                                                   2
NCSA/1.4.                                                 1
NCSA/1.5.1                                                3
NCSA/1.5.2                                                7
Netscape-Communications/1.1                               1
Netscape-Communications/1.12                              1
Netscape-Enterprise/2.01                                  3
Netscape-Enterprise/2.0a                                  2
Netscape-Enterprise/3.0                                   4
Netscape-Enterprise/3.0F                                  2
Netscape-Enterprise/3.0K                                  1
Netscape-Enterprise/3.5-For-NetWare                       1
Netscape-Enterprise/3.5.1                                 4
Netscape-FastTrack/2.0                                    1
Netscape-FastTrack/2.01                                   1
Netscape-FastTrack/2.0a                                   1
Netscape-FastTrack/2.0c                                   1
Novell-HTTP-Server/2.5R                                   1
Novell-HTTP-Server/3.1R                                   1
OSU/1.9b                                                  1
OSU/3.2                                                   1
SWS-1.0                                                   1
WebSTAR                                                   2
WebSTAR/1.2.5 ID/13089                                    1
WebSTAR/2.0 ID/44693                                      1
WinHttpd/1.4a (Shareware Non-Commercial License           1
Total                                                     150

Table A7-2 - Table of all Servers Encountered
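
Grouping the detailed Server header values into product families, as done for the pie chart, can be sketched as follows. The grouping rule used here (take the text before the first space, then before the first "/") is an assumption made for illustration; the report does not state how the robot's output was grouped, and the function names are hypothetical.

```python
from collections import Counter


def server_family(server_header):
    """Return the product name of a Server header value.

    Assumed rule: drop everything after the first space, then everything
    after the first '/'. An empty header is reported as 'Unknown'.
    """
    parts = server_header.split()
    if not parts:
        return "Unknown"
    return parts[0].split("/")[0]


def tally_families(headers):
    """Count how many responses fall into each product family."""
    return Counter(server_family(h) for h in headers)


# A few values taken from the table above, one entry per responding site.
observed = [
    "Apache/1.2.5",
    "Apache/1.3.0 (Unix) PHP/3.0",
    "Microsoft-IIS/4.0",
    "CERN/3.0A",
    "WebSTAR/2.0 ID/44693",
]
for family, count in tally_families(observed).most_common():
    print(family, count)
```

A real grouping would need further rules, for example to merge the Netscape-Enterprise and Netscape-FastTrack products into a single Netscape family.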

Of these servers, 40% used HTTP/1.0 and 60% used HTTP/1.1.

The Queso [3] software was used to estimate the operating system platforms of the servers. The high-level results are summarised in Table A7-3. A more detailed breakdown is presented in Table A7-4.


                      Estimated
OS                    Min     Max
Unix                  97      108
OS2                   0       5
MacOS                 6       11
Netware               3       3
Windows NT/95/98      20      20
Other                 7       7
Unknown               6       6

Table A7-3 - Operating Systems as Reported by Queso

Note that the 'Other' category in Table A7-3 corresponds to the Queso output category 'Cisco...' (see Table A7-4), and the 'Unknown' category corresponds to the Queso output categories 'Unknown OS', 'Firewalled host/port or network congestion' and 'Dead Host, Firewalled Port or Unassigned IP'.

Note that the estimated minimum and maximum values in Table A7-3 may be skewed because of the Queso unknowns referred to above.

Operating System                                               Count
BSDi or IRIX                                                   1
Berkeley: Digital, HPUX, SunOs4, AIX3, OS/2 WARP-4, others...  5
Berkeley: HP-UX B.10.20                                        1
Berkeley: IRIX 5.x                                             3
Berkeley: usually Digital Unix, OSF/1 V3.0, HP-UX 10.x         14
Berkeley: usually HP/UX 9.x                                    1
Berkeley: usually SunOS 4.x, NexT                              5
Cisco 11.2(10a), HP/3000 DTC, BayStack Switch                  7
Dead Host, Firewalled Port or Unassigned IP                    2
FreeBSD, NetBSD, OpenBSD                                       1
IBM AIX 4                                                      2
IRIX 6.x                                                       2
Linux 1.3.xx, 2.0.0 to 2.0.34                                  5
Linux 2.0.35 to 2.0.9                                          1
MacOS-8                                                        6
Novell Netware TCP/IP                                          3
Reliant Unix from Siemens-Nixdorf                              1
Solaris 2.x                                                    60
Standard: Solaris 2.x, Linux 2.1.???, MacOS                    5
Windows 95/98/NT                                               20
Firewalled Solaris 2.x                                         1
Firewalled host/port or network congestion                     3
Unknown OS                                                     1
Total                                                          150

Table A7-4 - Queso Output

Metadata Profile

The attributes of the HTML <META> element were examined for known metadata conventions. Table A7-5 shows the results.

Metadata                      Number of META elements    No. sites
PICS                          1                          1
HTTP-EQUIV="Refresh"          9                          9
Reply-To                      3                          3
Search Engine                 190                        95
Dublin Core                   102                        11
HTTP-EQUIV="(Dublin Core)"    8                          1

Table A7-5 - Types of Metadata Encountered
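
A profile of this kind can be produced by inspecting the NAME and HTTP-EQUIV attributes of each META element. The sketch below is illustrative only: the matching rules (for example, treating NAME values beginning with "DC." as Dublin Core and the NAME values "keywords" and "description" as search engine metadata) are assumptions, since the report does not list the conventions the robot tested for, and all function and class names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed convention: NAME values aimed at search engines.
SEARCH_ENGINE_NAMES = {"keywords", "description"}


def classify_meta(attrs):
    """Map one META element's attributes to a row label of the table."""
    name = (attrs.get("name") or "").lower()
    equiv = (attrs.get("http-equiv") or "").lower()
    if equiv == "pics-label":
        return "PICS"
    if equiv == "refresh":
        return 'HTTP-EQUIV="Refresh"'
    if "reply-to" in (name, equiv):
        return "Reply-To"
    if name.startswith("dc."):
        return "Dublin Core"
    if equiv.startswith("dc."):
        return 'HTTP-EQUIV="(Dublin Core)"'
    if name in SEARCH_ENGINE_NAMES:
        return "Search Engine"
    return "Other"


class MetaProfiler(HTMLParser):
    """Count META elements on one page by assumed metadata convention."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.counts[classify_meta(dict(attrs))] += 1


def profile_page(html_text):
    parser = MetaProfiler()
    parser.feed(html_text)
    parser.close()
    return parser.counts


page = (
    '<meta name="DC.Title" content="Example University">'
    '<meta name="keywords" content="university, courses">'
    '<meta http-equiv="Refresh" content="5; URL=/home/">'
)
print(profile_page(page))
```

Summing these per-page counters over all sites gives the element counts, and counting the pages on which a label occurs at least once gives the site counts.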

Technologies

Scripting

29 pages used the <SCRIPT> element to include a client-side script block. Of these, 23 pages included the attribute-value LANGUAGE="JavaScript".

All HTML elements were searched for the set of defined JavaScript event handlers. The results are shown in Table A7-6.

Handlers       Count    Sites
onChange       1        1
onClick        13       4
onLoad         10       8
onMouseOver    320      36

Table A7-6 - Event Handlers Encountered
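
The two columns of Table A7-6 (total occurrences, and number of sites with at least one occurrence) can be derived as in the following sketch. It checks only the four handlers that appear in the table, because the full set searched for is not listed in the report; the names used and the input format (a mapping of site URL to HTML text) are assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Only the handlers reported in the table; HTMLParser lower-cases attribute
# names, so the lookup keys are lower case.
HANDLERS = {
    "onchange": "onChange",
    "onclick": "onClick",
    "onload": "onLoad",
    "onmouseover": "onMouseOver",
}


class HandlerCounter(HTMLParser):
    """Count event-handler attributes on any element of one page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            if name in HANDLERS:
                self.counts[HANDLERS[name]] += 1


def tally_handlers(pages):
    """pages maps site URL -> HTML text. Returns (count, sites) Counters."""
    total, sites = Counter(), Counter()
    for html_text in pages.values():
        parser = HandlerCounter()
        parser.feed(html_text)
        parser.close()
        total.update(parser.counts)         # every occurrence
        sites.update(parser.counts.keys())  # one per site using the handler
    return total, sites


pages = {
    "http://a.example/": '<body onLoad="init()"><a href="#" onMouseOver="x()">1</a>'
                         '<a href="#" onMouseOver="x()">2</a></body>',
    "http://b.example/": '<a href="#" onmouseover="y()">3</a>',
}
total, sites = tally_handlers(pages)
print(total["onMouseOver"], sites["onMouseOver"])
```

The large gap between the count and the number of sites for onMouseOver is what this two-level tally is meant to expose: a small number of sites attach the handler to many elements.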

Java

Two Java applets were referenced by the site <URL: http://www.uwic.ac.uk/>.

The site <URL: http://www.luton.ac.uk/> referenced a plugin using the OBJECT element.

Frames and "Splash Screens"

A total of 21 sites used framesets to provide a framed interface to the institutional entry point.

A total of 10 sites used HTTP-EQUIV="refresh" to provide a client-side redirect from a "splash screen" at the entry point.

Cachability

Table A7-7 shows a summary of the cachability of crawled resources.

Cachable resources        72.5% of HTML pages, 80.9% of images
Non-cachable resources    4.4% of HTML pages, 0.2% of images

Table A7-7 - Cachability of Resources Encountered

Additionally, 40% of HTML pages and 45% of images were served with the HTTP/1.1 ETag header.

A resource is defined as cachable if:

* It contains an Expires header showing that the resource has not expired

* It contains a Last-Modified header with a modification date more than one day before the robot crawl

* It contains the Cache-control: public header

A resource is defined as not cachable if:

* It contains an Expires header showing that the resource has expired

* It contains a Last-Modified header with a modification date coinciding with the day of the robot crawl

* It contains the Cache-control: no-cache or Cache-control: no-store headers

* It contains the Pragma: no-cache header

The cachability of a resource is not determined if the resource used the HTTP/1.1 ETag header, since this would have required additional testing at the time of the trawl, which was not carried out.
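
The definitions above can be expressed as a small classification function. The sketch below is not the code used for the trawl. The report does not say how conflicting headers were resolved, so the order of the tests is an assumption: ETag is checked first, then the "not cachable" rules, then the "cachable" rules, and anything else is left undetermined. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def _parse_http_date(value):
    """Parse an HTTP date header; return None if missing or unparseable."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:  # treat zone-less dates as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def classify_cachability(headers, crawl_time):
    """Return 'cachable', 'not cachable' or 'undetermined'.

    headers: mapping of response header name -> value.
    crawl_time: timezone-aware datetime of the robot's request.
    """
    h = {k.lower(): v for k, v in headers.items()}
    if "etag" in h:  # assumed to take precedence over all other rules
        return "undetermined"

    directives = {d.strip().lower() for d in h.get("cache-control", "").split(",")}
    expires = _parse_http_date(h.get("expires"))
    modified = _parse_http_date(h.get("last-modified"))

    if (directives & {"no-cache", "no-store"}
            or "no-cache" in h.get("pragma", "").lower()
            or (expires is not None and expires <= crawl_time)
            or (modified is not None and modified.date() == crawl_time.date())):
        return "not cachable"

    if ("public" in directives
            or (expires is not None and expires > crawl_time)
            or (modified is not None and crawl_time - modified > timedelta(days=1))):
        return "cachable"

    return "undetermined"


crawl = datetime(1998, 11, 25, 12, 0, tzinfo=timezone.utc)
print(classify_cachability({"Last-Modified": "Mon, 23 Nov 1998 10:00:00 GMT"}, crawl))
print(classify_cachability({"Pragma": "no-cache"}, crawl))
```

Resources that match none of the rules, or that carry an ETag header, fall into neither category, which is consistent with the percentages in Table A7-7 not summing to 100%.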

Comparisons with Previous Crawls

Server Profiles

As shown in Figure A7-4, the Apache and Microsoft servers have shown increasing adoption. The Netscape server has fluctuated (perhaps due to a period of experimentation). The NCSA and CERN servers have shown a decrease in usage.

Figure A7-4 - Use of Server Software Over Three Crawls

The growth of Apache and Microsoft servers has also resulted in a decrease of the 'Other' category, i.e. sites are subscribing to the more popular servers.

A chart showing the growth of various servers is shown in Figure A7-5. This chart shows the growth in each of the periods Oct 1997 - Jul 1998 and Jul 1998 - Nov 1998. Note that negative growth is interpreted as decline.

Figure A7-5 - Growth of Servers over Three Crawls

Size of Entry Points

A set of sites was isolated, for which reliable measurements of size exist for two previous web crawls. The results are shown in Figure A7-6.

Figure A7-6 - Changes in Size of Entry Points

Note that a majority of sites have not undergone great fluctuations in size. The outlier corresponds to <URL: http://www.scot.ac.uk/ >. The pages for this site are different since this site has become part of a larger institution.

"Splash Screens"

The number of institutional entry points which make use of "splash screens" or redirects has shown a steady increase from five sites (October 1997) to seven sites (July 1998) to ten sites in the current trawl (November 1998).

Hyperlink Profiles

The domains referenced by hyperlinks in the three crawls have been dominated by ac.uk, and this domain has shown an overall increase. Table A7-8 shows the contribution of different types of domain name as a percentage of all hyperlinks found.

Domain              October 1997    July 1998    November 1998
Total .uk           97.31%          97.13%       98.00%
ac.uk               96.63%          95.94%       97.68%
net                 0.30%           0.16%        0.11%
com                 0.82%           0.61%        0.63%
org                 0.34%           0.08%        0.18%
Other               0.15%           0.08%        0.10%
IP address          0.00%           0.12%        0.04%
Badly formed URL    1.10%           1.72%        0.91%

Table A7-8 - Domains Referenced in Hyperlinks

Note that in Table A7-8 the ac.uk figures are a subset of the Total .uk figures.
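
Assigning each hyperlink to one of the rows of Table A7-8 can be sketched as follows. The rules are assumptions made for illustration: relative links are resolved against the entry point's own URL, and any link for which no host can be extracted is counted as badly formed. The report does not define either point, and the function name and example base URL are hypothetical.

```python
import ipaddress
from urllib.parse import urljoin, urlsplit


def domain_category(href, base_url):
    """Return the table row that a hyperlink would be counted under."""
    try:
        host = urlsplit(urljoin(base_url, href)).hostname
    except ValueError:  # raised for malformed bracketed hosts, for example
        return "Badly formed URL"
    if not host:  # assumption: no host means badly formed
        return "Badly formed URL"
    try:
        ipaddress.ip_address(host)
        return "IP address"
    except ValueError:
        pass  # not a literal IP address, so treat it as a domain name
    host = host.rstrip(".")
    if host.endswith(".ac.uk"):
        return "ac.uk"
    if host.endswith(".uk"):
        return "other .uk"
    tld = host.rsplit(".", 1)[-1]
    return tld if tld in ("net", "com", "org") else "Other"


base = "http://www.example.ac.uk/"  # hypothetical entry point
for link in ("/prospectus/", "http://www.bbc.co.uk/", "http://www.w3.org/",
             "http://192.0.2.1/x", "http://[bad"):
    print(link, "->", domain_category(link, base))
```

In this sketch the "Total .uk" row is the sum of the "ac.uk" and "other .uk" categories.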

Use of Metadata

In each crawl, we have looked for search-engine (SE) type metadata and Dublin-Core (DC) metadata. The findings for the three crawls are shown in Figure A7-7.

Figure A7-7 - Trends in Metadata Usage

Figure A7-7 shows that the use of Dublin Core metadata has increased considerably over the three crawls, from one site in October 1997 to 11 sites in November 1998.

References

1. A Survey of UK Academic Library Web Sites
<URL: http://www.ukoln.ac.uk/web-focus/webwatch/reports/hei-lib-may1998/ >

2. How Is My Web Community Doing? Monitoring Trends in Web Service Provision,
Journal of Documentation, Vol. 55 No. 1 January 1999, pp 82-95

3. Queso
<URL: http://www.apostols.org/projectz/queso/ >