Indexing Your Web Server(s)

Helen Varley Sargan
University of Cambridge Computing Service

Once a web server has more than one page, for the sake of webmaster and user alike there needs to be a search engine available. Finding a search engine that suits your requirements is not particularly easy. In the past, searches were accomplished with flat indexes generated by a tool that followed your directory structure. Over the past few years search tools have moved over to being spider-based, generating the index by spidering through links on web servers. This change has removed one problem area, since spider-based tools can easily index a group of specified servers, but has brought into focus another, often ignored, one: controlling any robot or spider's access to your web-based information.

Current situation

Looking at UK HE sites in July/August 1999 (results published in Ariadne <http://www.ariadne.ac.uk/issue21/webwatch/> in September 1999), Brian Kelly found the following search engines in use:

Name (Total): Details

ht://dig (25): Bath - Birkbeck - Bradford - Brighton - Bristol - Brunel - City - Coventry - Durham - Glasgow - Goldsmiths - Imperial - Keele - Kent - Leicester - London School of Hygiene - LSE - Manchester - Oxford - Portsmouth - Salford - UMIST - Worcester - York

eXcite (19): Aberdeen - Birmingham - Central Lancashire - Edge Hill - Exeter - Glasgow Caledonian - Kingston - Loughborough - UMIST - MMU - Nottingham - Northern College - Oxford Brookes - Sunderland - St George's - Thames Valley - UWE - Wolverhampton - Worcester

Microsoft (12): Aberystwyth - Canterbury Christ Church - Essex - Liverpool John Moore - Manchester Business School - NTU - Middlesex - Paisley - Scottish Agricultural College - Southampton Institute - UWIC - Westminster

Harvest (8): Anglia - DMU - Cranfield - Liverpool - Queen's University Belfast - Reading - Southampton - Swansea

Ultraseek (7): Cambridge - Edinburgh - Newport - Royal Holloway - Sussex - Ulster - UNL

SWISH / SWISH-E (5): KCL - Lancashire - London Guildhall - Sheffield Hallam - UCE

Thunderstone's Webinator (4): Newcastle - UEA - NWI - Sheffield

Netscape (Compass/Web Publisher) (3): Bangor - LMU - UCL

wwwwais (formerly available from <http://www.eit.com/>) (3): Cardiff - Hull - UWCM

FreeFind (Remote Index) (2): Northampton - St Mary's College

Muscat (1): Surrey

Maestro (1): Dundee

AltaVista (Product) (1): MMU

AltaVista (Public Service) (1): Derby

WebStar (1): SOAS

WebFind (1): TASC

Other (Not known, home-grown) (6): University of London - Open University - South Bank - Surrey Institute - Queen Margaret University College - UNN

None (or not easily found) (59): Abertay - Aston - Bath Spa - Bolton - Bournemouth - Bretton - Bucks - CSSD - Cheltenham - Chester - Chichester - COGS - Dartington - East London - Falmouth - Glamorgan - Glasgow School of Arts - Greenwich - Harper Adams - Heriot-Watt - Herts - Huddersfield - Institute of Education - Kent Institute - King Alfred's - Lampeter - Lincolnshire - Liverpool Hope - London Business School - London Institute - Napier - Newman - North Riding - Northern School - Norwich - Plymouth - Ravensbourne - Ripon - RGU - Roehampton - Royal Academy - Royal College of Art - Royal College of Music - Royal Northern College of Music - Royal Scottish Academy of Music and Drama - Royal Veterinary College - St Andrews - St Mark - Stirling - Strathclyde - Swansea Institute - Trinity College of Music - Trinity College - Warwick - Welsh College - Westhill College - Westminster College - Writtle - Wye

Web site or search engine not available at time of survey (1): Staffordshire

Total: 160

It would be interesting to know when these search engines were installed and whether the software is being maintained. Several questions spring to mind:

· Is the version of Muscat used by Surrey the free version that was available for a time?

· Are the users of Excite happy with its security record, and with the fact that development seems to have ceased?

· Are users of local search engines that don't use robots.txt happy with what other search engines can index on their sites (you have got a robots.txt file, haven't you)?

Different kinds of search facilities

There are several very different kinds of mechanism that a search facility can use, and it is this mechanism that determines the functionality of the search and the extent of the index. Broadly, search backends are either direct indexes, created by scanning the local or networked file structure (for instance Microsoft Index Server), or robot/spider-generated indexes, which cover linked files only and are controlled by the robots.txt file on the web server. The former reflects the files present, whereas the latter reflects the active structure.

Because the first strategy is not 'Internet mediated', it is suitable for networked or stand-alone servers, and it is the technology used by earlier generations of current tools, such as SWISH (and its derivatives SWISH++ and SWISH-E). The index will probably have to reside on the web server, which can be problematic if it becomes large and heavily used.

The second strategy allows much more flexible indexing of a number of servers. In general the approach is easy to administer and problem-free: controls over which servers are indexed, and how frequently, lie with the indexing software, but there are also several means for the server administrator to control how the server is indexed (see below). Problems can arise with built-in or add-on indexing software, such as Microsoft Site Server, Netscape Catalog Server or the WebStar indexing software, which may be vendor-specific. This is because of differences in the way web servers respond when the indexing robot approaches them (the API of the web server). The APIs of the above servers are subtly different from, say, Apache's, and indexing software written with a particular API in mind may baulk at unexpected server responses. This problem is more likely to arise when you are indexing a large number of servers (and so encounter more types of server software in the process).

Coping with indexer controls

There will be some directories that you do not want your indexer to look at and index. With an indexer such as older versions of SWISH or Harvest, you have to add specific controls to tell the indexer where to go and where not to go. With a spider- or robot-based indexer, indexing is controlled through a number of means that will be observed by the Internet indexers, such as AltaVista, Go, and HotBot, as well as by your local indexer. Obviously, if you can satisfy all your indexing requirements with one set of controls it will save you work in the long run. These controls, each illustrated with an example below, are:

· the robots.txt file

· description and keywords meta tags in individual files

· the robots meta tag, giving noindex and nofollow directives (and combinations of them) in individual files
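
To make the first of these concrete, here is a minimal, illustrative robots.txt file; the server name and directories are hypothetical and should be replaced with your own. It sits at the top level of the server and is read by local and Internet robots alike.

    # robots.txt - lives at the top level of the server,
    # e.g. http://www.example.ac.uk/robots.txt (hypothetical address)
    # These rules apply to all robots, local and external
    User-agent: *
    Disallow: /cgi-bin/
    # hypothetical areas you do not want indexed:
    Disallow: /drafts/
    Disallow: /private/

    # A named robot can be given its own record, for instance to let a
    # local indexer (hypothetical name) index everything:
    # User-agent: LocalIndexer
    # Disallow: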

These controls are observed to a greater or lesser degree by search engines at large; SearchEngineWatch will give you all the information you need about this (http://www.searchenginewatch.com/webmasters/features.html). All 'proper' search engines will observe a robots.txt file and do what it says, and almost all will observe the robots meta tag. You cannot depend upon the description and keywords meta tags being used - Google, NorthernLight, Lycos and Excite ignore them - but if they work for your local search facility and make it more valuable, they must be worth pursuing. No support for Dublin Core metadata should be assumed.
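
By way of illustration, the page-level controls above are ordinary meta tags in the head of each HTML file; the values shown here are hypothetical.

    <head>
    <title>Departmental regulations</title>
    <!-- description and keywords: used by some engines and many local
         indexers, ignored by others such as Google and NorthernLight -->
    <meta name="description" content="Regulations for postgraduate study in the department">
    <meta name="keywords" content="regulations, postgraduate, examinations">
    <!-- robots: ask robots neither to index this page nor to follow its links -->
    <meta name="robots" content="noindex,nofollow">
    </head>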

At another level, access to branches of a web server can be limited by the server software. Combining access control with use of metadata can give information to those within the access domain and some limited information to those outside.
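
As a sketch of what this looks like in practice, an Apache server (mentioned earlier) can restrict a branch to local addresses with a fragment like the one below; the directory path and domain are hypothetical, and other server software has equivalent mechanisms.

    # httpd.conf fragment (Apache 1.3-style syntax, hypothetical paths)
    <Directory /usr/local/httpd/htdocs/internal>
        Order deny,allow
        Deny from all
        # only hosts within the local (hypothetical) domain may read this branch
        Allow from .example.ac.uk
    </Directory>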

Finally, it is a useful lesson that if you don't want people to read files, then they shouldn't be on the web server. Adding a new indexing facility should remind people to 'spring clean' their files and remove all the information that is no longer pertinent.

Security implications

There have been security alerts in the past, most notably with the free version of Excite (January 1998), which now has a patch to correct the problem. Bear in mind potential security problems stemming from the underlying OS, particularly if you are running an indexing facility on a separate server and that server does not run the OS you use most often. Windows NT is a particular minefield for the uninitiated.

First make a shopping list

It is essential that you start off the exercise with a clear idea of what you are looking for. Things to think about may include the following:

· Do I want to (and am I able to) run this on the web server, on a separate machine, or have someone else host it?

· What platform do I want to use (is there the expertise or facilities for using a different platform)?

· How many servers do I want to index (a ballpark figure for the number of pages to be indexed is useful here too)?

· Is the data to be indexed subject to frequent change and, if so, in part or as a whole?

· What manpower and/or money is available for the project?

· What types of file do I want indexed (just HTML, or PDF, Office files, etc. as well)?

· What type of search facilities do I want to offer (keyword, phrase, natural language, constrained searches)?

The answers to some of these may be obvious but you may have to discuss others. Starting with a well-defined list will clarify where you may make compromises if need be.

Categories of search products

There are several broad types of search facility, suitable for different circumstances, as follows:

· free services hosted remotely

· products built into (or added onto) your web server software

· free search facilities

· commercial search facilities

Free services

The number of pages indexed may be limited (500, 5,000 or unlimited in these examples) and indexes will probably be deleted if they are not used for between 5 and 7 days. Access to indexing is prey to Internet traffic and server availability (note also that these services are all in the US, so they may generate incoming transatlantic traffic for UK users). You may get advertising appearing on your search results page. They may be a stop-gap solution for small UK HE institutions. See:

· Atomz search (http://www.atomz.com/) chargeable for indexes above 500 entries or above 5,000 requests per month

· FreeFind (http://www.freefind.com/)

· Pinpoint (http://pinpoint.netcreations.com/)

· Thunderstone (http://index.thunderstone.com/texis/indexsite/)

· Tippecanoe (http://www.tippecanoe.com/) from August 1999

· Whatuseek IntraSearch (http://www.whatuseek.com/)

Built-in search facilities

Several types of server software come with a built-in search facility, so before you go any further it is worth double-checking, especially if you are using a Microsoft or Netscape server, or using WebStar version 3 or higher, WebTen 2.1.10 (both Macintosh - WebTen now comes with ht://dig) or WebSite Pro (Windows NT). In addition, more sophisticated add-on search facilities are available for the Microsoft and Netscape server products.

Free search facilities

Some of the products listed below will index either a single server or a group of servers (these are marked with **). If there are no asterisks this does not necessarily mean that multiple servers can't be indexed - the available information might simply not mention it.

Alkaline
Version: 1.3 (July 99)
Platforms: Linux (Intel/Alpha), FreeBSD, SGI Irix, Solaris, BSDI, BSD/OS, Windows NT
Searchable document formats: HTML, ASCII; filters for PDF; user-defined filters may be made
Notes: Free to non-commercial sites

Excite For Web Servers (EWS) (http://www.excite.com/navigate/)
Version: 1. Oct 96; the patched version is 1.1.1
Platforms: Solaris 2.4 or x86, SGI Irix 5.3, HP-UX 9.05, IBM AIX 3.2 (v 1.0 only), BSDI 2.0, Linux, SunOS
Searchable document formats: HTML, ASCII
Notes: Known security bug needing patches dated 14 January 1998. NT versions not available (Aug 99).

**freeWAIS-sf (http://ls6-www.cs.uni-dortmund.de/ir/projects/freeWAIS-sf/)
Version: 2.2.12 (April 99)
Platforms: Only tested on SunOS 5.6 and Linux
Memory and disk space: Not known
Searchable document formats: HTML, ASCII
Notes: Not supported, but newsgroup information usually good. SF-Gate provides a web interface (http://ls6-www.cs.uni-dortmund.de/ir/projects/SFgate/).

Glimpse & Webglimpse (http://webglimpse.net/)
Version: 4.12.5 and 1.7.5 respectively (July 1999)
Platforms: UNIX of various sorts, with more coming
Searchable document formats: HTML, ASCII
Notes: New effort (March 99) not connected with the original developers. Free for non-commercial use. Webglimpse is the spider, Glimpse the indexing software.

**Harvest (http://www.tardis.ed.ac.uk/harvest/)
Version: 1.5 (1997); last patch produced June 1998
Platforms: UNIX
Searchable document formats: 'a wide variety'
Notes: Original development ended, but co-operative development of a newer version is ongoing. Works with local gatherers feeding a central broker rather than a spider model. The new version no longer uses Glimpse for indexing.

**ht://Dig (http://www.htdig.org/)
Version: 3.1.2 (March 1999)
Platforms: Sun Solaris 2.5 SPARC (using gcc/g++ 2.7.2), Sun SunOS 4.1.4 SPARC (using gcc/gcc 2.7.0), HP/UX A.09.01 (using gcc/g++ 2.6.0), IRIX 5.3 (SGI C++ compiler, unknown version), Debian Linux 2.0 (using egcs 1.1b)
Memory and disk space: disk space approx 12KB for each document for the wordlist database, 7.5KB without the wordlist
Searchable document formats: HTML, ASCII
Notes: You will need a Unix machine, a C compiler and a C++ compiler, with libstdc++ installed and Berkeley 'make'. Will index multiple servers understanding the HTTP 1.0 protocol. Developer site at http://dev.htdig.org/

ICE (http://www.objectweaver.de/ice/)
Version: 1.5b3r1 (Sept 1998); new release summer 99
Platforms: Anything running Perl
Notes: Requires Perl and runs as a CGI gateway. Email support from the author.

Isearch (http://www.cnidr.org/ - no longer maintained; http://www.etymon.com/Isearch/)
Version: v1.42 (available Aug 98)
Platforms: Unix machines from Linux PCs to Crays
Searchable document formats: Wide range of document types, with facilities to add new types
Notes: No support - development is through the second website. Mailing list active for development. See also the Advanced Search Facility project, supplying a resource location system free of charge (http://www.etymon.com/asf/).

Lycos/Inmagic site spider (http://www.lycos.com/software/software-intranet.html)
Platforms: Windows NT
Notes: Free of charge. Commercially supported version available from Inmagic (see below).

SWISH++ (http://www.best.com/~pjl/software.html)
Version: 3.0 (July 1999)
Platforms: UNIX, Windows NT
Memory and disk space: disk space approx 1-5% of the size of the HTML data
Searchable document formats: ASCII, HTML; Office files extracted and PDF files filtered before indexing
Notes: Needs Unix supporting the mmap(2) system call, a C++ compiler, and a version of the STL (Standard Template Library). Unsupported, but information available on discussion lists and newsgroups.

SWISH-E (SWISH Enhanced) (http://sunsite.berkeley.edu/SWISH-E/)
Version: 1.3.1 or 1.3.2, depending on platform (Jan 1999)
Platforms: UNIX, Linux (Intel), Windows (all 32-bit varieties)
Memory and disk space: disk space approx 1-5% of the size of the HTML data
Searchable document formats: ASCII, HTML
Notes: Information and compiled versions available from http://www.geocities.com/CapeCanaveral/Lab/1652/software.html#swishe. Support via discussion lists, newsgroups and the website.

Thunderstone Webinator (http://www.thunderstone.com/webinator/)
Version: 2.5
Platforms: Lots of Unix flavours, Linux (Intel), Windows NT x86
Searchable document formats: ASCII, HTML; other formats in the commercial versions
Notes: Version 2.5 is incompatible with indexes from previous versions. Technical support via a mailing list. The free version is limited to 10,000 pages per index.

Tippecanoe Systems (http://www.tippecanoe.com/)
Notes: New version due late 99. Offering a free service from summer 99.

Webglimpse - see Glimpse, above

 

Commercial products (not a complete list)

All of these products will cost real money, but many vendors will negotiate a price, so do not be put off asking about prices or immediately write off using a commercial product. The money spent may well be saved by staff having no development work to do and having access to ready technical support. Many of these products have a limited-time trial version for you to assess before you commit yourself to buying, but you may have to pre-register to get access to trial software. Information on web sites varies enormously, but check the basic facts there before you go any further with an assessment.

Commercial products are marketed primarily to companies, not to academic institutions, and information about them reflects this. It may not be readily apparent how or if the software will work in your particular environment until you investigate, particularly if you are seeking to index a group of independent servers that are not on an intranet, or are wishing to produce indexes of several subgroups of information.

Some of these products will support metadata, but the information is not readily available, so no information about metadata has been recorded. Support for Dublin Core metadata is almost non-existent.

ALISE (http://www.alise.com/)
Version/Price: 2.0; starting at US$2000
Platforms: Visual Basic

AltaVista Intranet Search and Developers Kit (http://www.altavista-software.com/)
Version/Price: Price scales on size of index. Academic discount.
Platforms: Alpha NT, Windows NT, DIGITAL UNIX, Sun Solaris
Searchable document formats: Over 200

Excalibur (http://www.xrs.com/)
Version/Price: 6.7
Platforms: Windows/NT, Sun/Solaris, IBM/AIX, Hewlett-Packard/HP-UX, Silicon Graphics/IRIX, Digital Alpha/UNIX, Digital Alpha/NT
Searchable document formats: Over 200
Notes: Spider and WebExpress products may be suitable

FastSearch (http://www.fast.no/)
Version/Price: April 99
Platforms: Solaris; Intel: NT, Linux, BSD (FreeBSD), Solaris; Alpha: Digital Unix, NT
Notes: Comes with a hardware option for generating and searching extremely large indexes. Demos available.

InQuizit (http://www.inquizit.com/)
Platforms: Windows 95, NT, UNIX

**Infoseek Ultraseek Server (http://software.infoseek.com/)
Version/Price: 3.1 (Aug 99); large discount for academic use
Platforms: Sun Solaris 2.5 and above, Linux, Windows NT 4.0 and above
Searchable document formats: Many
Notes: Numerous awards

**Inmagic/Lycos Site Spider (http://www.inmagic.com/)
Platforms: Windows NT
Searchable document formats: Wide range

InText (http://intext.com/)
Platforms: Windows NT, UNIX
Searchable document formats: Wide range

Limit Point (http://www.limit-point.com/)
Version/Price: Boolean Search 2.2 (summer 99); US$297
Platforms: Macintosh

Maxum Phantom (http://www.maxum.com/)
Version/Price: 2.2 (July 99); education price US$296.25
Platforms: Windows NT, Macintosh
Searchable document formats: Wide range
Notes: Full documentation, FAQ and mailing list support. Email support as well.

Mondosearch (http://www.mondosearch.com/)
Version/Price: 3.31
Platforms: Windows NT
Notes: Indexes frames

Muscat (http://www.muscat.com/)
Platforms: Many
Notes: Almost no technical information

Open Text (http://www.opentext.com/)
Notes: A knowledge management system rather than an indexing product

Oracle (http://www.oracle.com/)
Notes: Indexer for sites generated by an Oracle database

Quadralay WebWorks Search (http://www.quadralay.com/)
Notes: Not currently available (Aug 99)

PCDocs/Fulcrum SearchServer (http://www.pcdocs.com/)
Platforms: Windows NT
Searchable document formats: Over 200
Notes: Supports Korean and Asian languages, and Java

Site Surfer (http://www.devtech.com/SiteSurfer/)
Version/Price: 1.0 (Feb 99); about US$250
Platforms: Any with version 1.1.5 or higher of a Java runtime to build the applet
Searchable document formats: Many
Notes: Java applet 1.1 or later, so will only work with Java-enabled browsers. Will also give site maps and indexes.

Thunderstone Webinator (http://www.thunderstone.com/webinator/)
Version/Price: 2.5
Platforms: Solaris SPARC, Linux Intel, SGI Irix 5/6, Unixware, Solaris x86, BSDI 4.0, SGI Irix 4, AT&T SVR4 386, SunOS 4, SCO 5, DEC Alpha Unix 4, HP-UX 10, Windows NT x86, IBM AIX 4.2; other OSs may be available on demand
Searchable document formats: The two commercial versions give different file support - see the web information. PDF plug-in available at extra cost.
Notes: Technical support via a mailing list. The free version is limited to 10,000 pages per index.

Verity Information Server (http://www.verity.com/)
Version/Price: Very expensive!
Platforms: Windows NT, major UNIX systems
Searchable document formats: Any document format, including databases

WiseBot (http://www.tetranetsoftware.com)
Version/Price: 2.0
Platforms: Windows 95/98/NT
Notes: Java search engine. Free trial version.

Synopsis

Search engine software that is available free of charge is generally either a cut-down version of a commercial product that is limited to producing a small index (Lycos Site Spider), or a product that may require quite advanced expertise to set it up correctly and keep it running smoothly (there are, of course, exceptions to this). Maintenance of products is a problem area - for a server manager to install and configure a search engine only to find its development is discontinued or it is turned into a commercial product is a blow. Many free products are for Unix platforms, since this is where the expertise and enthusiasm for free software lies. Excite generated much interest from the less technical managers of Unix systems, but it became apparent that it was not well maintained (a security hole was identified later and it has not been updated since) and that the support offered was on a commercial basis.

The Perl-based search engines suffer from the disadvantage that the whole index needs to be loaded before a search can be done, and these products might have a limited life once more engines written in Java are available. Java-based search engines have the problem that users have to be running Java-enabled browsers to use them, and many users prefer to disable Java because of security problems. Several other Perl- and Java-based search engines are available beyond those listed here - see the list at http://www.searchtools.com/tools/tools.html

While they require some technical expertise, SWISH-E and ht://Dig do accomplish the job with no direct cost and little day-to-day intervention. Two diverged versions of SWISH are available, the other being SWISH++; both (at present) are being actively updated.

The only way to find commercial products that really are suitable for your needs is to pay close attention to making your 'shopping list', investigate the available information about the products' capabilities, then talk to the local contact. We were seeking a product that was well supported, had a good interface, ran under Unix, was essentially self-managing, and could index a large number of diverse web servers. The product that appeared most suitable was Ultraseek. We were able to download a trial version (restricted to one month's use) and use it to confirm its suitability before buying a licence for the product. I would suggest that if you cannot use the product on a trial basis first, you shouldn't buy it.

If your shopping list does not match ours then the final choice will probably not be the same.

With all this in mind, here are brief case studies of sites using three different search engines, all of which are more or less satisfied with their performance (the current maintainer of the Oxford search engine is not the person who set it up and finds a certain amount of obscurity in its settings).

Case Studies

Microsoft Site Server - University of Essex

Platform: Windows NT

Number of servers searched: 16

Number of entries: approx 11,500

File types indexed: Office files, html and txt. Filters available for other formats

Index updating: Configured with windows task scheduler. Incremental updates possible.

Constrained searches possible: Yes

Configuration: follows robots.txt but can take a 'back door' route as well. Obeys robots meta tag

Logs and reports: Creates reports on crawling progress. Log analysis not included, but can be written as add-ons (ASP scripts)

Pros: Free of charge with Windows NT.

Cons: Needs high level of Windows NT expertise to set up and run it effectively. May run into problems indexing servers running diverse server software. Not compatible with Microsoft Index server (a single server product). Likes to create several catalog files, which may create network problems when indexing many servers.

ht://Dig - University of Oxford

Platform: Unix

Number of servers searched: 131

Number of entries: approx 43,500 (indexing a maximum of 9 levels down on any server)

File types indexed: Office files, html and txt. Filters available for other formats

Index updating: Configured to reindex after a set time period. Incremental updates possible.

Constrained searches possible: Yes but need to be configured on the ht://dig server

Configuration: follows robots.txt but can take a 'back door' route as well.

Logs and reports: none generated in an obvious manner, but probably available somehow.

Pros: Free of charge. Wide number of configuration options available.

Cons: Needs high level of Unix expertise to set up and run it effectively. Index files are very large.
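
For anyone evaluating ht://Dig, the crawl in a setup of this kind is driven from a single configuration file. The sketch below is illustrative only: the paths and server names are hypothetical, only a handful of the many available attributes are shown, and the attribute names should be checked against the ht://Dig documentation for the version you install.

    # htdig.conf - minimal sketch with hypothetical values
    database_dir:   /usr/local/htdig/db
    start_url:      http://www.example.ac.uk/
    limit_urls_to:  .example.ac.uk
    exclude_urls:   /cgi-bin/ .cgi
    max_hop_count:  9

    # reindexing would then be run periodically, for example:
    #   htdig -i -c /usr/local/htdig/conf/htdig.conf
    #   htmerge -c /usr/local/htdig/conf/htdig.conf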

Ultraseek - University of Cambridge

Platform: Unix

Number of servers searched: 232

Number of entries: approx 188,000

File types indexed: Many formats, including PDF, html and txt.

Index updating: Intelligent incremental reindexing dependent on the frequency of updates of the files - can be limited to time periods and/or days of the week. Manual incremental updates easily done.

Constrained searches possible: Yes; easily configured by users, and a search can be added to the configuration as a known constrained search, thereby taking a shortcut in processing.

Configuration: follows robots.txt and meta tags. Configurable weighting given to terms in title and meta tags. Thesaurus add-on available to give user-controlled alternatives to search terms entered (especially suitable for obscure local names)

Logs and reports: Logs and reports available for every aspect of use - search terms, number of terms, servers searched, etc.

Pros: Very easy to install and maintain. Gives extremely good results in a problematic environment. Technical support excellent.

Cons: Relatively expensive.

Conclusions

If you have a search solution that is more than about 18 months old, it is time to review it. How often does it stop working, how long does it take to fix, and have you tested it for accuracy recently (is it doing what you think it's doing, and how often is the index being updated)? New products and new versions of existing products have made facilities available that you are probably not using currently. Look at how you could use searching to improve your site - offering constrained searches over certain bodies of information could help users find appropriate information immediately. Moving to a spider- or robot-based indexer could change the way you run your website and how it is indexed by the major external search engines, and make your index more reliable for users.

Further information

For further information see:

BotSpot http://www.botspot.com/

Search Engine Watch http://www.searchenginewatch.com/

Search Tools http://www.searchtools.com/tools/tools.html

Web Compare http://webcompare.internet.com/

Web Developers Virtual Library http://WWW.Stars.com/