ROADSHarvester v1a1 README

ROADSHarvester version 1 alpha 1

README

Version: $Id: README.html,v 1.20 1997/02/19 15:21:25 lisap Exp $

This file is also available at <URL:http://www.ukoln.ac.uk/roads/harvester/v1a1/README.html>

What is this

This is the first alpha release of the ROADSHarvester package. This package provides a combine-harvester for the ROADS software - adding the following functionality:

- Automatic generation of metadata to 'pump-prime' ROADS records as part of the process of manually creating resource descriptions. Records created in this way can currently be based on either the DOCUMENT or the SERVICE template type.
- Web robot based bulk-harvesting of records into a ROADS database based on the URLs listed in another ROADS database. Typically, all the records created in this way will be based on the DOCUMENT template type.

In the current release all this functionality relies on the Harvest suite of tools. These must be installed on the system that is running ROADS prior to installing this package.

Please note...

This documentation is not yet fully complete!

...but it should be enough for you to get things working. And finally...

THIS IS ALPHA SOFTWARE - Use with care!

Dependencies

- ROADS version 1 (or above)
- Harvest version 1.4 (pl 2 ?)
- Perl MD5 package (from CPAN)

Installation

I'm going to assume that you've already installed ROADS and Harvest. If you're really reading this without having already installed the ROADS software then I suggest that you start at the very beginning (it's a very good place to start) and read the ROADS home page at <URL:http://www.ukoln.ac.uk/roads/>. Full details of how to install Harvest are given in the "Harvest User's Manual" that is available from the Harvest home page at <URL:http://harvest.cs.colorado.edu/>.

Here's a quick overview of what you're going to have to do. I strongly suggest that you read all the way through this document before actually doing anything!

- Configure ROADS databases for your ROADSHarvester generated resource descriptions
- Convert any existing records in your ROADS database as necessary
- Set up a Harvest gatherer for ROADS
- Install the ROADSHarvester package
- Configure the new Harvest gatherer
- Set up some 'cron' jobs

OK, here we are in more detail...

Configure databases for your ROADSHarvester generated resource descriptions
The ROADSHarvester relies on multiple 'logical' ROADS databases in order to work:
- Normal, hand-built resource descriptions are put into one database - referred to in this document as the main database.
- Descriptions that are individually harvested are put into another - by default called 'Harvested'. These records are typically moved into the main database very quickly as part of the normal procedure of generating them. Any that are left lying around will be removed on a regular basis by a 'cron' job (see later).
- Robot generated (bulk-harvested) descriptions are put into a third database - by default called 'BulkHarvested'.
A full description of setting up multiple ROADS databases can be found in the ROADS FAQ at <URL:http://www.ukoln.ac.uk/roads/faq/>. The rest of these instructions assume that you are not already using multiple databases and that the single database listed in your 'config/databases' file is your main database.
Briefly, you need to add two new lines to your 'config/databases' file, one for the new 'Harvested' database and one for the new 'BulkHarvested' database. So, if your existing 'config/databases' file looks like this:
```
    # Known Whois++ servers
    #
    # Format is:
    #
    #   friendly name:hostname:port number:value of Destination:server handle
    #
    # The destination attribute may be left blank
    #
    Science Resources:roads.ukoln.ac.uk:6664::roadsbathacuk01
```
then you need to add two lines at the end, e.g.:
```
    Harvested Science Resources:roads.ukoln.ac.uk:6664:Harvested:ukoln01
    Robot generated Science Resources:roads.ukoln.ac.uk:6664:BulkHarvested:ukoln01
```
Remember that there is no way of explicitly searching for records with an empty 'Destination' attribute (though you can search for 'ALL' records). Having enabled the use of 'Destination' in some records, you will therefore normally want to add a 'Destination' to every record. So, as well as adding the two new lines to 'config/databases', you also need to add a 'Destination' to the existing line.
All in all the new 'config/databases' file should look something like this:
```
    # Known Whois++ servers
    #
    # Format is:
    #
    #   friendly name:hostname:port number:value of Destination:server handle
    #
    # The destination attribute may be left blank
    #
    Science Resources:roads.ukoln.ac.uk:6664:Science:roadsbathacuk01
    Harvested Science Resources:roads.ukoln.ac.uk:6664:Harvested:ukoln01
    Robot generated Science Resources:roads.ukoln.ac.uk:6664:BulkHarvested:ukoln01
```
Remember that whenever you create a new resource description from now on you need to fill in the 'Destination' attribute. The simplest way to do this is to edit the 'config/outlines' files and set a default value for 'Destination' in each file.
Finally you need to make sure that you can search on the 'Destination' attribute. Make sure that 'Destination' is listed in 'config/search-restrict'.
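As a quick sanity check on the edited file, the sketch below lists any entries whose 'Destination' field (the fourth colon-separated field) is still empty. It is not part of ROADS or the ROADSHarvester; 'empty_destinations' is a hypothetical helper name, and the sketch assumes that friendly names contain no colons.
```shell
# empty_destinations FILE
# Print the friendly name of every entry in a 'config/databases'-style
# file whose Destination field (4th colon-separated field) is empty.
empty_destinations() {
    awk -F: '
        /^[ \t]*#/ { next }          # skip comment lines
        /^[ \t]*$/ { next }          # skip blank lines
        NF >= 5 && $4 == "" {        # five fields, Destination left empty
            sub(/^[ \t]+/, "", $1)   # trim leading whitespace from the name
            print $1
        }
    ' "$1"
}
```
Run against the original single-line example above, 'empty_destinations config/databases' would print 'Science Resources'. Once every line has a 'Destination' it prints nothing.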
Convert any existing records in your ROADS database as necessary
If you have any existing records in your ROADS database then you need to add a 'Destination' to each one. You can do this by hand using a text editor, with the ROADS Web-based update form, or with the 'bin/newdest.pl' program. For example, typing:
```
    bin/newdest.pl -n Science
```
will look through all your records and put 'Science' into any empty 'Destination' attributes.
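To see which records still need attention, before or after the conversion, something like the following sketch may help. It is not part of the package and makes loud assumptions: that each record is a plain-text file of 'Attribute: value' lines in a single directory, and that the attribute is written 'Destination:' at the start of a line. 'records_missing_destination' is a hypothetical helper name.
```shell
# records_missing_destination DIR
# List the files in DIR that have no 'Destination:' line with a
# non-blank value. ASSUMPTION: one plain-text record per file,
# made up of 'Attribute: value' lines.
records_missing_destination() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # A record is "filled in" if Destination has a non-blank value.
        if ! grep -Eiq '^Destination:[[:space:]]*[^[:space:]]' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```
If this prints nothing for your record directory, every record already has a 'Destination'.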
Configure a Harvest gatherer for ROADS
The "Harvest User's Manual" describes how to set up a new gatherer. Follow these instructions to create a new gatherer for ROADS - the rest of these instructions assume that you've called your gatherer 'ROADS'. Make sure that there is a 'lib' sub-directory in the 'ROADS' gatherer directory. If not, go to the gatherer directory and type:
```
    mkdir lib
```
Install the ROADSHarvester package
The ROADSHarvester software distribution is packaged up as a compressed UNIX tar file and is available from <URL:ftp://ftp.ukoln.ac.uk/roads/roadsharvester-v1a1.tar.Z>. To extract the installation from this file use:
```
    zcat roadsharvester-v1a1.tar.Z | tar xvf -
```
Make sure that you use binary mode if you transfer the compressed tar file across the network using FTP; otherwise it will be damaged in transit and will not be usable.
Once the tar file has been expanded you should find that you have a directory called 'roadsharvester-v1a1'.
To install the software, enter the 'roadsharvester-v1a1' directory and type:
```
    ./Configure
```
If this doesn't work, try
```
    perl ./Configure
```
You may need to give the full path to Perl, e.g.
```
    /usr/local/bin/perl ./Configure
```
This runs a configuration script which will find out about your machine, ask you some simple questions about where ROADS and Harvest are installed and then proceed to install the software. The ROADSHarvester software does NOT need to run as the "root" superuser on your machine but it is important that the user the Harvest gatherer runs as (e.g. harvest) is able to read the files which the ROADS software generates and create records in the ROADS 'source' directory. It is probably simplest if the Harvest gatherer is run using the same username as that used by the ROADS software.

Configure the new Harvest gatherer
Having installed the ROADSHarvester package you can now go back and finish off setting up the Harvest gatherer. Go to the Harvest gatherer directory and modify the configuration file, e.g. ROADS.cf. Add the following directive to the top section:

    Post-Summarizing: lib/roadsrules

Replace the current <RootNodes> list with:

    |lib/roadslist.pl Depth=2 URL=250 Delay=2 Host=1

where the 'Depth', 'URL', 'Delay' and 'Host' values should be set according to your own requirements as detailed in the "Harvest User's Manual".

A completed configuration file might look like this:

    #
    #  ROADS.cf - configuration file for the
    #  ROADS Gatherer
    #
    #  Created by webmaster@ukoln.ac.uk on Sat Mar 16 21:16:13 GMT 1996
    #
    Gatherer-Name:  ROADS
    Gatherer-Port:  9111
    Top-Directory:  /usr/local/harvest/gatherers/ROADS
    Post-Summarizing: lib/roadsrules
    
    <RootNodes>
    |lib/roadslist.pl Depth=2 URL=250 Delay=2 Host=1
    
    </RootNodes>
    
    <LeafNodes>
    # Enter URLs for LeafNodes here
    
    </LeafNodes>
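
Before running the gatherer it may be worth checking that both changes actually made it into the configuration file. The following is a minimal sketch, not part of the package; 'check_gatherer_config' is a hypothetical helper name, and it only looks for the two lines described above without checking that the 'roadslist.pl' line sits inside '<RootNodes>'.

```shell
# check_gatherer_config FILE
# Print "ok" if FILE contains both ROADSHarvester additions,
# otherwise print what is missing and return non-zero.
check_gatherer_config() {
    cf=$1
    status=0
    if ! grep -Eq '^[[:space:]]*Post-Summarizing:[[:space:]]*lib/roadsrules' "$cf"; then
        echo "missing: Post-Summarizing: lib/roadsrules"
        status=1
    fi
    if ! grep -Eq '^[[:space:]]*\|lib/roadslist\.pl' "$cf"; then
        echo "missing: |lib/roadslist.pl line"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "ok"
    return "$status"
}
```

For example, 'check_gatherer_config ROADS.cf' run in the gatherer directory.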

Set up some 'cron' jobs
You're almost finished! The final task is to set up a couple of 'cron' jobs to remove 'old' ROADSHarvester generated records from the ROADS databases. You don't have to set the 'cron' jobs up now. It might be best to have a play with the ROADSHarvester first and check that everything else is working. If you do this, don't forget to come back later and set up your 'cron' jobs.
I suggest that you run both jobs on a nightly basis. In both cases you can use the 'bin/removeold.pl' program to do this.
The first 'cron' job removes old records from the 'Harvested' database. In this case 'old' means more than one day old because records in the 'Harvested' database are normally moved immediately into your main database. Here's an example 'cron' entry:
```
    0 5 * * * /usr/local/roads/bin/removeold.pl -d Harvested 1
```
The second 'cron' job removes old records from the 'BulkHarvested' database. In this case you need to match the length of time that records stay in the ROADS database with the length of time records stay in the Harvest gatherer database. This is specified by the gatherer's 'Time-To-Live' configuration variable, which is typically set to 1 month. Set the time to live for records in the ROADS 'BulkHarvested' database to slightly longer than this value, for example:
```
    30 5 * * * /usr/local/roads/bin/removeold.pl -d BulkHarvested 32
```
which will remove records that are over 32 days old.
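The figure of 32 can be worked out from the gatherer setting rather than guessed. The sketch below assumes that 'Time-To-Live' is expressed in seconds (check this against the "Harvest User's Manual" for your version); 'ttl_to_days' is a hypothetical helper name and the default two-day margin simply mirrors the example above.
```shell
# ttl_to_days SECONDS [MARGIN_DAYS]
# Convert a Time-To-Live in seconds (ASSUMED unit) into the number of
# days to pass to removeold.pl, rounding up and adding a safety margin.
ttl_to_days() {
    ttl_seconds=$1
    margin_days=${2:-2}
    # Round up to whole days, then add the margin.
    echo $(( (ttl_seconds + 86399) / 86400 + margin_days ))
}
```
For a 'Time-To-Live' of 2592000 seconds (30 days), 'ttl_to_days 2592000' gives 32.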

That's it, installation of the ROADSHarvester package is now complete!

Using the ROADS Harvester

{ This section needs a lot more work - and will ultimately expand into some sort of user's guide. For the time being here's a very brief overview of how to use what you've just installed... }

Individual harvesting

The ROADSHarvester offers a new CGI script in the 'admin-cgi' directory called 'harvesttemp.pl'. Access this script using your Web browser.

This user interface is pretty much naff at the moment but is primarily there to demonstrate functionality. It will be improved in future releases!

Select the type of template you want to create - only 'DOCUMENT' and 'SERVICE' are available currently. Type the URL of the resource you want to catalogue into the text box. Click on the 'Submit selection' button. After a brief pause while the resource is harvested you will be given a second form. Click the 'Update harvested ROADS record now' button and you'll be put into a normal 'mktemp.pl' update form. At the very least you must change the 'Destination' attribute from 'Harvested' to the name of your main database.

Finally submit the record to the ROADS database as normal.

Bulk-Harvesting

You've more or less done it all by following the installation procedure above. You can run the new ROADS Harvest gatherer by hand using the RunGatherer script in the gatherer directory. When you're happy, set up a cron job to run the gatherer on a regular basis, say weekly, as described in the "Harvest User's Manual".

The starting URLs for the gatherer are taken from all the 'URI-v?' attributes listed in all the records in your main database.

Currently the ROADS database is not re-indexed automatically when the gatherer finishes so you'll have to remember to run 'bin/mkinv.pl', e.g.

    bin/mkinv.pl -a

if you run the gatherer by hand. If you set up a cron job to start the gatherer then add the above line to it as well.
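
One way to keep the two steps together is a small wrapper that a cron job can call. This is only a sketch: 'gather_and_reindex' is a hypothetical name, the two directories are taken from the examples in this document and will need adjusting, and it assumes that 'RunGatherer' returns a non-zero exit status on failure.

```shell
# Locations taken from the examples in this README; override as needed.
GATHERER_DIR=${GATHERER_DIR:-/usr/local/harvest/gatherers/ROADS}
ROADS_DIR=${ROADS_DIR:-/usr/local/roads}

# Run the gatherer, then re-index ROADS only if the gatherer succeeded.
gather_and_reindex() {
    ( cd "$GATHERER_DIR" && ./RunGatherer ) || return 1
    ( cd "$ROADS_DIR" && bin/mkinv.pl -a )
}
```

Save this in a script that ends with a call to 'gather_and_reindex' and point your weekly cron entry at that script.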

List of files installed with the ROADSHarvester

{ Again, this section needs expansion }

ROADS files:

    bin/listURIs.pl
    bin/harvest.pl
    bin/soif2roads.pl
    bin/newdest.pl
    bin/removeold.pl
    admin-cgi/
    admin-cgi/harvesttemp.pl
    config/multilingual/
    config/multilingual/UK-English/
    config/multilingual/UK-English/harvesttemp/
    config/multilingual/UK-English/harvesttemp/addedok.html
    config/multilingual/UK-English/harvesttemp/cannotcreatereal.html
    config/multilingual/UK-English/harvesttemp/cannotcreatetmp.html
    config/multilingual/UK-English/harvesttemp/cannotindex.html
    config/multilingual/UK-English/harvesttemp/fail.html
    config/multilingual/UK-English/harvesttemp/handleexists.html
    config/multilingual/UK-English/harvesttemp/introform.html

Harvest files:

    harvest/examples/
    harvest/examples/ROADS.cf
    harvest/lib-gatherer/
    harvest/lib-gatherer/rebuild.pl
    harvest/lib-gatherer/roadspostproc.pl
    harvest/lib-gatherer/roadsrules
    harvest/lib-gatherer/roadslist.pl

Problems?

Problems parsing HTML

Harvest bases its parsing of HTML Web pages on the SGML DTD for HTML 2.0. In many cases Web pages do not conform to this (or any other!) DTD. This may well result in ROADS records with less information in them than you would otherwise expect.

Some pages, like the UKOLN home page for example, don't seem to parse well at all :-(. This may result in some gibberish in the ROADS records that are created. (Set the 'Language-v1' attribute to 'gibberish' :-) ). Typically you may see bits of JavaScript code or odd HTML tags finding their way into the 'Description'.

The only real way round this one is for someone to re-write the current Harvest HTML parser.

Destinations with spaces in them

There may be problems if you try to use a Destination attribute with spaces in it like 'Bulk Harvested'. Best avoided for the time being!

Long keyword lists

Depending on the content of the Web pages it is summarising, Harvest can sometimes generate very long keyword lists. I'm not currently sure how best to handle this. Reviewing your ROADS stoplist in 'config/stoplist' may help a little.

Andy Powell - a.powell@ukoln.ac.uk