PRIDE - LB 5624 - D2.2.1 PRIDE Requirements and Success Factors

PRIDE Requirements and Success Factors
Work Package 2 of Telematics for Libraries project PRIDE (LB 5624)

2.5 Harvesting

2.5.1 WWW Spiders

2.5.1.1 Introduction

2.5.1.1.1 Definition of a spider

Spiders (a.k.a. robots, web-crawlers etc.) are automated software agents designed to carry out work that is repetitive or, because of sheer scale, not possible as an individual undertaking.

There are a variety of notions of software agents. One relevant definition of an autonomous agent [19] is "a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what is senses in the future". Agent technology is currently an area of increasing popularity. Agents are incorporating various levels of "intelligence" and protocols are being proposed or adapted for agent presence on the Web.

Autonomous agents are not a new phenomena on the Internet. UNIX, the operating system on which the Internet was first developed, utilises network daemons that are essentially simple automata for dealing with network transactions. To deal with later development of the Internet, newsbots, IRCbots and various application-specific autonomous agents emerged.

WWW spiders are automated web agents that traverse the Web, navigating though links and performing some activity, such as recognising and indexing resources or noting broken links. The current generation of web crawlers is, perhaps surprisingly, large. A glance at the Web Robots Pages reveals a list of over 160 well-known robots.

Search engines use Web spiders to index keywords and other metadata. This is used to build indices that can then be exploited for resource discovery and retrieval purposes. Other robots exist to maintain hypertext structures or summarise resources in other ways.

Web Spider behaviour could be broadly classified into:

Exploratory Web crawling (e.g. resource discovery or broken link discovery)
Directed Web crawling (e.g. of URLs submitted to a search engine)
Restricted Web crawling (e.g. all URLs within a DNS domain).

2.5.1.1.2 The role of robots in the WWW

Understanding the Web. Arguably, the uniqueness of the Web warrants analysis of its structure to provide insight into its nature and to monitor its development. The results of these analyses can aid solution of the correspondingly unique problems that are associated with the web. Monitoring change and growth may help predict future trends and development and recognise the potential or necessity for enhancements and developments. This may be true on the small scale of Web areas under the same administrative control.

Maintaining the Web. The dynamic nature of the Web, and the fact that it is composed of administratively independent "modules" leads to the danger that states of disrepair can arise because extensive checking it too big an undertaking. This maintenance is necessary to keep up with changes in the structure of web-space and changes in technologies and standards. For example:

When pages change, come into existence or cease to exist, references to those pages must be correspondingly changed, created or removed.

In order that pages are displayed as required, there must be conformance to standards such as HTML. On a large scale, such validation can be automated.

Exploiting the Web. The Web is an increasingly rich source of information. However, an information-space must have certain properties to be effective in its role, such as sufficient indexing of available resources. Taking areas of web-space as information-space, we can impose more structure and distinguish document-space and index-space, associated with local and global navigation respectively [1]. Effective entry points into document-space obviously require a rich and well-constructed index-space. The production of index-space must then include mechanisms for resource discovery and cataloguing.

Mirroring and archiving arise in connection with moulding a more human-centric document-space that is easily accessible, cheap to use and has some history.

2.5.1.2 Benefits

Spiders offer an effective solution to obtaining a user view of remote web-space. Simulating a human user, they have the advantage of being able to repetitively cover large subspaces of the Web and keep accurate logs and summaries. Additionally, they can be run for long periods or at regular periods without fatiguing and can monitor changes that may not be apparent to a normal user.

General benefits include:

User satisfaction from search directed access to resources and easier browsibility (via maintenance and advancements of the Web resulting from analyses)
Reduced network traffic in document space resulting from search-directed access
Effecting archiving/mirroring, and populating caches (to produce associated benefits)
Monitoring and informing users of changes to relevant areas of web-space
"Schooling" network traffic into localised neighbourhoods through having effected mirroring, archiving or caching

Multi-functional robots can perform any number of the above tasks, perhaps simultaneously.

2.5.1.3 Implementation issues

2.5.1.3.1 Robot ethics

All ethical spider users should balance the advantages gained against the disadvantages that may arise from the use of a Web crawler.

There are some generally accepted guidelines for producing an ethical crawler. These are designed to minimise problems for users, servers and networks. They include:

Re-using existing spiders rather than creating new ones
Spiders should identify themselves to servers
Spiders should be thoroughly tested locally before being unleashed on the wider web
There should be frequency limits on requests to servers
Spiders should retrieve only resources that they are interested in
Spiders should monitor and log their performance during a browsing session
The results of using a spider should be disseminated

It is also considered ethical to use a spider that will parse and obey robot exclusion protocols. Robot exclusion refers to methods that can be implemented server-side to direct robot accesses. There are advantages for both sides: the server can hide material not suitable for indexing, or completely block certain robots from the site and the robot is saved from traversing directories of irrelevant material (e.g. icons or binaries) or from stumbling into a "black hole".

Currently, two common robot exclusion protocols exist. The scope of the first method is server-wide and implements a set of directives in a top level /robots.txt file. The second method is implemented with the HTML META element per page.

2.5.1.3.2 Problems caused through use of spiders

Network performance

Robots traditionally have a bad press in discussions on bandwidth, even though the functions of some well-written and ethical robots are ultimately to conserve bandwidth.

There are points to consider on the bandwidth front, since robots can span relatively large portions of Web-space over short periods. Bottlenecks can arise locally though high bandwidth consumption, particularly if the robot is in frequent or permanent use, or if it is used during network peak times. The problem is exacerbated if the frequency of requests for resources is unregulated.

Server-side concerns

So-called "rapid-fire" requests (successive HTTP requests to a single server without delays) have been shown to be very resource consuming for a server under current HTTP implementations (in fact, this is the basis of several "denial of service" attacks). Here again, an unregulated robot can cause problems. Suitable delays and an ethical traversal algorithm can help resolve this.

The skewing of server logs is another issue that causes concern. A robot that indexes an entire site will obviously distort logs; if a recognised "user-agent" is not supplied these may be hard to distinguish from regular users.

Unethical robots

A small number of rogue robots are in use. The tasks of these robots are such that they are particularly unwelcome by servers. Such tasks include email culling, for the production of large lists of email addresses that can be sold to advertiser and copyright violation through copying entire sites. Additionally robots can contribute to a sites "hit quota" and consume bandwidth which the site may pay for.

Remedies

Spider implementers should follow the guidelines on robot ethics. Server administrators have various options to reduce adverse effect from robot hits, these include:

Using a server that supports keeping the HTTP connection open
Using load-balancing software for large server-distributed services
Recognising robot user-agents in server logs or using log analysis software that recognises robots
Recognising robots disguised as other browsers by repetitive browsing patterns
Robots may be excluded from accessing the server by:
Using Robot Exclusion Protocols to block robots from parts of the server
Recognising unwelcome robots by IP address and rejecting access at the TCP level.

2.5.1.4 Software

This section gives details of software that can function as a WWW spider. The main function of these spiders is indexing. A number of tools also exist to check for broken hyperlinks.

A variety of indexing spider software exists across different platforms. Here we concentrate on free software, much of which is UNIX-based. Note that some web servers now include indexing functionality. See the `Botspot' site <URL:http://www.botspot.com/main.html> robot software.

Popular libraries

Lib-www for the C language was the first popular library used for building robots. Lib-www-Perl (LWP) is the popularly used equivalent for the Perl programming language. Use of such libraries makes it much easier to construct Web traversing tools. Note that it is considered unethical to write a robot from scratch if one already exists to perform the required tasks.

Excite for Web Servers (EWS): <URL:http://www.excite.com/navigate>

FreeScout

Glimpse: <URL:http://glimpse.cs.arizona.edu/>

Harvest

Harvest is an integrated set of tools used to gather, extract, organise, search, cache, and replicate relevant information across the Internet. With modest effort, users can tailor Harvest to digest information in many different formats from many different machines, and offer custom search services on the Web.

Funding for Harvest ended in August 1996, and the project is now officially over, but the software lives on:

As free software

Maintained by Harvest users. The latest release is version 1.5. Some parts of Harvest (e.g. Glimpse) are still maintained and extended by its original team. Some parts of the software have developed into products in their own right - the Squid Web caching software was developed from the Harvest Internet Object Cache.

As commercial derivatives

Netscape's Catalog Server is based on the Harvest design. There is also commercial effort to promote co-operative caching based on the Harvest object cache.

There has been recent work on re-coding the gathered component of Harvest in Perl using LWP. This work is known as harvest-ng.

Ht://Dig: <URL:http://htdig.sdsu.edu>

ICE: <URL:http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html>

Isearch: <URL:http://www.cnidr.org/>

SWISH: <URL:http://www.best.com/~pjl/software.html;> <URL:http://sunsite.berkeley.edu/SWISH-E/>

Thunderstone Webinator: <URL:http://www.thunderstone.com/webinator>

2.5.1.4.1 Search engines and commercial products

Many commercial search engines provide an Intranet version of their indexing software, which includes a spider. Examples include Excite for Web Servers (EWS) which is freely available and the Infoseek Ultraseek system, which requires a license.

AltaVista Search Extensions: <URL:http://www.altavista.software.digital.com>

Excalibut: <URL:http://www.xrs.com>

Fulcrum: <URL:http://www.fulcrum.com/>

InQuizit: <URL:http://www.itpinc.com/>

Infoseek Ultraseek Server: <URL:http://software.infoseek.com/>

InText: <URL:http://intext.com>

Maxum Phantom: <URL:http://www.maxum.com>

Netscape Compass Server: <URL:http://home.netscape.com/compass/v3.0/evalguide/>

Open Text: <URL:http://www.open.text.com/>

Verity Information Server: <URL:http://www.verity.com/>

WiseBot: <URL:http://www.tetranetsoftware.com>

2.5.1.4.2 Link checking software

This list is reproduced from a list prepared by Helen Varley Sargan.

Dr Watson: <URL:http://addy.com/watson/>

Inspector Web: <URL:http://greenpac.com/inspector/sample.shtml>

LinkChecker: <URL:http://www.ag.ohio-state.edu/~jray/linkcheck2.html>

Netmechanic: <URL:http://www.netmechanic.com/link_check.htm>

URL Checker: <URL:http://www2.ucsc.edu/cats/sc/services/www/urlchecker.shtml>

Web Check: <URL:http://test.vsource.com/webcheck.html>

Astra SiteManager (windows): <URL:http://www.ns.dk/mercury/astra/astradata.htm>

CyberSpyder Link Test (windows): <URL:http://www.cyberspyder.com/cslnkts1.html>

HTML Powertools (windows NT): <URL:http://www.tali.com/tools.html>

InfoLink Link Checker (windows 95/NT): <URL:http://www.biggbyte.com/>

LinkBot Pro (windows 95/NT): <URL:http://www.tetranetsoftware.com/>

LinScan (UNIX/NT): <URL:http://www.elsop.com/linkscan/>

MOMSpider (UNIX): <URL:http://www.ics.uci.edu/pub/websoft/MOMspider/>

2.5.1.5 Standards

The area of spiders is sadly lacking in official standards. The robots.txt standard is now an expired Internet draft. There is some work on organising new REP, which would become a standard.

2.5.1.6 Comparisons and relevance

PRIDE may use spider software to reap information that may form indices to aid resource discovery and retrieval.

2.5.1.7 Future development

Though reports of spiders eventually forming part of the IP stack are perhaps fanciful, there is support for the idea that automated Web agents will become essential tools as the Web continues to grow. Some speculate that they will facilitate a more user-oriented view of the Web.

Features of spiders should continue to grow, based upon the lessons learned from the current Web crawlers. More interestingly, we will see the next generation of spiders emerge. `Personal agents' which scour the net in accordance with user-specified search requirements and related technologies have recently emerged. Whether such developments will produce significantly more network traffic remains to be seen.

The need for effective resource discovery is also giving rise to "intelligent" agents. This introduces ideas such as intelligent filtering, intelligent discovery and agent interaction and collaboration. In search terms, this will allow optimisation of results though context or concept type queries. The coupling of AI and the Web looks set to produce an interesting outcome.

The development of new robots must be mirrored by the development of methods to guide them through sites. The current methods of robot exclusion may not be sufficient in the future. There is some work on looking at robot guidance protocols, which may be based on technologies such as RDF.

2.5.1.8 Related information

Hermans, B. Intelligent Software Agents on the Internet. An Inventory of Currently Offered Functionality in the Information Society and a Prediction of (Near) Future Developments, <URL:http://www.firstmonday.dk/issues/issue2_3/ch_123/index.html>

Aas, G. Fielding, R. Koster, M. LWP, <URL:http://www.ics.uci.edu/pub/websoft/libwww-perl/>

W3C. LibWWW - the W3C Sample Code Library, <URL:http://www.w3.org/Library/>

Boisot, M.H. Information Space. A framework for learning in organisations, institutions and culture. Routledge 1995.

Touch, J. Haidermann, J. Obraczka, J. Analysis of HTTP Performance, <URL:http://www.is..edu/lsam/publications/http-perf>

Eichmann, D. Ethical Web Agents, <URL:http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/eichmann/ethical/eichmann.html>

Koster, M. Guidelines for Robot Writers, <URL:http://info.webcrawler.com/mak/projects/robots/guidelines.html>

Robots exclusion, <URL:http://info.webcrawler.com/mak/projects/robots/exclusion.html>

2.5.2 Z39.50 Explain

2.5.2.1 Introduction

ANSI/NISO Z39.50 standard is an information retrieval protocol specification (see section 2.4.3). One of its newly added features, the Explain facility, can be used to help automate harvesting of Z39.50 databases.

The Explain facility appeared in the Z39.50-1995 (version 3) protocol. The facility allows interrogation of a target by an origin for various meta-information about that target. This may include details on Z39.50 server-implementation, databases available for searching, search attribute sets, diagnostic sets and supported record syntax.

Explain is layered above the Z39 search service. Data are retrieved using the Z39 Present service using structured retrieval records:

"The target maintains this information in a database that the origin may access via the Z39.50 Search and Present facilities.

This "explain" database appears to the origin as any other database supported by the target, but it has a well-known name and a pre-defined record syntax. Also, certain search terms, corresponding to information categories, are predefined in order to allow a semantic level of interoperability. Each information category has its own record layout, and all are included in the Explain syntax." [1]

2.5.2.2 Implementation

Explain data is stored in a database, which is accessed using a database name specified as IR-Explain-1. Clients query the server using the Explain attribute set Exp-1. An Explain database may be an adaptation of an existing bibliographic DBMS or a system integrated with the Z39.50 server.

The facility provides for increased interoperability of client and server during a session. The client may dynamically configure itself in anticipation of a session with a target, based on server-provided information rather than relying on a pre-set configuration. This is useful when the server deviates from suggested defaults and removed overhead of having to reset client configurations when querying different servers. The meta-information returned to the client can also include human readable information such as descriptions of databases and times that the target is in operation.

The criteria for clients to support Explain includes support for searching with the Exp-1 attribute set, support for the Explain record syntax and (for dynamic configuration) the ability to self-configure based upon information retrieved from the target. Servers must contain an up-to-date IR-Explain-1 database, and support queries for Explain records.

Only a handful of vendors have implemented the Explain service at this time. The reasons for this are likely to be the relatively small number of servers supporting Z39.50-1995, the complexity of the Explain subsystem and pre-programmed vendor specific clients.

UK-ZUG, <URL:http://www.mailbase.ac.uk/lists/uk-zug/>

Collection Level Description. A review of existing practice, <URL:http://www.ukoln.ac.uk/metadata/cld/study/toc/>

1999-01-22

PRIDE Requirements and Success Factors