ROADSHarvester version 1 alpha 1

README

Version: $Id: README.html,v 1.20 1997/02/19 15:21:25 lisap Exp $

This file is also available at <URL:http://www.ukoln.ac.uk/roads/harvester/v1a1/README.html>

What is this

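This is the first alpha release of the ROADSHarvester package. This package provides a combine-harvester for the ROADS software, adding two pieces of functionality: individual harvesting of a single resource into a ROADS record, and bulk-harvesting driven by a Harvest gatherer. Both are described under "Using the ROADS Harvester" below.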

In the current release all this functionality relies on the Harvest suite of tools. These must be installed on the system that is running ROADS prior to installing this package.

Please note...

This documentation is not yet fully complete!
...but it should be enough for you to get things working. And finally...
THIS IS ALPHA SOFTWARE - Use with care!

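Dependencies

ROADS and the Harvest suite of tools must both be installed before this package (see above). The installation script is run with Perl.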

Installation

I'm going to assume that you've already installed ROADS and Harvest. If you're really reading this without having already installed the ROADS software then I suggest that you start at the very beginning (it's a very good place to start) and read the ROADS home page at <URL:http://www.ukoln.ac.uk/roads/>. Full details of how to install Harvest are given in the "Harvest User's Manual" that is available from the Harvest home page at <URL:http://harvest.cs.colorado.edu/>.

Here's a quick overview of what you're going to have to do. I strongly suggest that you read all the way through this document before actually doing anything!

  1. Configure ROADS databases for your ROADSHarvester generated resource descriptions
  2. Convert any existing records in your ROADS database as necessary
  3. Set up a Harvest gatherer for ROADS
  4. Install the ROADSHarvester package
  5. Configure the new Harvest gatherer
  6. Set up some 'cron' jobs
OK, here we are in more detail...
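  1. Configure ROADS databases for your ROADSHarvester generated resource descriptions
    The ROADSHarvester relies on multiple 'logical' ROADS databases in order to work: your main database, a 'Harvested' database for individually harvested records and a 'BulkHarvested' database for records generated by the Harvest gatherer.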

    A full description of setting up multiple ROADS databases can be found in the ROADS FAQ at <URL:http://www.ukoln.ac.uk/roads/faq/>. The rest of these instructions assume that you are not already using multiple databases and that the single database listed in your 'config/databases' file is your main database.

    Briefly, you need to add two new lines to your 'config/databases' file, one for the new 'Harvested' database and one for the new 'BulkHarvested' database. So, if your existing 'config/databases' file looks like this:

        # Known Whois++ servers
        #
        # Format is:
        #
        #   friendly name:hostname:port number:value of Destination:server handle
        #
        # The destination attribute may be left blank
        #
        Science Resources:roads.ukoln.ac.uk:6664::roadsbathacuk01
    

    then you need to add two lines at the end, e.g.:

        Harvested Science Resources:roads.ukoln.ac.uk:6664:Harvested:ukoln01
        Robot generated Science Resources:roads.ukoln.ac.uk:6664:BulkHarvested:ukoln01
    

    It is important to remember that there is no way of explicitly searching for records with an empty 'Destination' attribute (though you can search for 'ALL' records). So, once some records use 'Destination', you will normally want every record to have one. As well as adding the two new lines to 'config/databases', you therefore also need to add a 'Destination' to the existing line.

    All in all the new 'config/databases' file should look something like this:

        # Known Whois++ servers
        #
        # Format is:
        #
        #   friendly name:hostname:port number:value of Destination:server handle
        #
        # The destination attribute may be left blank
        #
        Science Resources:roads.ukoln.ac.uk:6664:Science:roadsbathacuk01
        Harvested Science Resources:roads.ukoln.ac.uk:6664:Harvested:ukoln01
        Robot generated Science Resources:roads.ukoln.ac.uk:6664:BulkHarvested:ukoln01
    

    Remember that whenever you create a new resource description from now on you need to fill in the 'Destination' attribute. The simplest way to do this is to edit the 'config/outlines' files and set a default value for 'Destination' in each file.

    Finally you need to make sure that you can search on the 'Destination' attribute. Make sure that 'Destination' is listed in 'config/search-restrict'.
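
A missing field or an empty 'Destination' in 'config/databases' is easy to overlook, so here is a minimal sketch of a check you could run after editing the file. It assumes the five colon-separated fields described above; 'check_databases' is a hypothetical helper, not part of ROADS.

```shell
#!/bin/sh
# Sketch only: check_databases is a hypothetical helper, not a ROADS tool.
# It assumes the five colon-separated fields shown in the file's comments.
check_databases() {
    # $1 = path to the databases file
    awk -F: '
        # skip comment lines and blank lines
        /^[ \t]*#/ || /^[ \t]*$/ { next }
        NF != 5 {
            printf "line %d: expected 5 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        # field 4 is the value of Destination
        $4 == "" {
            printf "line %d: empty Destination for \"%s\"\n", NR, $1
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Calling it as 'check_databases config/databases' prints one line per problem and returns a non-zero status if any were found.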

  2. Convert any existing records in your ROADS database as necessary
    If you have any existing records in your ROADS database then you need to add a 'Destination' to each one. You can do this by hand using a text editor, with the ROADS Web-based update form, or with the 'bin/newdest.pl' program. For example, typing:

        bin/newdest.pl -n Science
    

    will look through all your records and put 'Science' into any empty 'Destination' attributes.
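
To see which records still lack a 'Destination', before or after running 'bin/newdest.pl', something like the following sketch may help. It rests on an assumption that is not stated in this document: that each record is a plain text file of 'Attribute: value' lines held in a single directory. 'list_missing_destination' is a hypothetical helper, not part of ROADS.

```shell
#!/bin/sh
# Sketch only: list_missing_destination is a hypothetical helper.
# Assumption: one record per file, made up of "Attribute: value" lines.
list_missing_destination() {
    # $1 = directory holding the record files
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # A record is fine if it has a Destination line with a value in it.
        grep -q '^Destination:[[:space:]]*[^[:space:]]' "$f" ||
            printf '%s\n' "$f"
    done
}
```

Pass it the directory that holds your records; no output means that every record has a 'Destination'.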

  3. Set up a Harvest gatherer for ROADS
    The "Harvest User's Manual" describes how to set up a new gatherer. Follow these instructions to create a new gatherer for ROADS - the rest of these instructions assume that you've called your gatherer 'ROADS'. Make sure that there is a 'lib' sub-directory in the 'ROADS' gatherer directory. If not, go to the gatherer directory and type:
        mkdir lib
    

  4. Install the ROADSHarvester package
    The ROADSHarvester software distribution is packaged up as a compressed UNIX tar file and is available from <URL:ftp://ftp.ukoln.ac.uk/roads/roadsharvester-v1a1.tar.Z>. To extract the installation from this file use:

        zcat roadsharvester-v1a1.tar.Z | tar xvf -
    

    Make sure that you use binary mode if you transfer the compressed tar file across the network using FTP, otherwise it will be damaged in transit and will not be usable.

    Once the tar file has been expanded you should find that you have a directory called 'roadsharvester-v1a1'.

    To install the software, enter the 'roadsharvester-v1a1' directory and type:

        ./Configure
    

    If this doesn't work, try

        perl ./Configure
    

    You may need to give the full path to Perl, e.g.

       /usr/local/bin/perl ./Configure
    

    This runs a configuration script which will find out about your machine, ask you some simple questions about where ROADS and Harvest are installed and then proceed to install the software. The ROADSHarvester software does NOT need to run as the "root" superuser on your machine but it is important that the user the Harvest gatherer runs as (e.g. harvest) is able to read the files which the ROADS software generates and create records in the ROADS 'source' directory. It is probably simplest if the Harvest gatherer is run using the same username as that used by the ROADS software.
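
The access requirement above can be tested directly. The sketch below is meant to be run as the user the Harvest gatherer runs as; 'can_use_source_dir' is a hypothetical helper and the path you pass it depends on where ROADS is installed on your machine.

```shell
#!/bin/sh
# Sketch only: can_use_source_dir is a hypothetical helper.
# Run it as the gatherer user (e.g. harvest).
can_use_source_dir() {
    # $1 = the ROADS 'source' directory
    # Succeeds only if the directory exists and the current user can
    # read it, create files in it and enter it.
    [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ] && [ -x "$1" ]
}
```

For example, 'can_use_source_dir /usr/local/roads/source || echo "check permissions"', where the path is only an illustration.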

  5. Configure the new Harvest gatherer
    Having installed the ROADSHarvester package you can now go back and finish off setting up the Harvest gatherer. Go to the Harvest gatherer directory and modify the configuration file, e.g. ROADS.cf. Add the following directive to the top section:

        Post-Summarizing: lib/roadsrules
    

    Replace the current <RootNodes> list with:

        |lib/roadslist.pl Depth=2 URL=250 Delay=2 Host=1
    

    where the 'Depth', 'URL', 'Delay' and 'Host' values should be set according to your own requirements as detailed in the "Harvest User's Manual".
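
Once both changes are made, a quick check along these lines can confirm that they are in the file. This is a sketch rather than a full parser of the Harvest configuration format, and 'check_gatherer_cf' is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: check_gatherer_cf is a hypothetical helper.
check_gatherer_cf() {
    # $1 = path to the gatherer configuration file, e.g. ROADS.cf
    rc=0
    grep -q '^[[:space:]]*Post-Summarizing:[[:space:]]*lib/roadsrules' "$1" ||
        { echo "missing: Post-Summarizing: lib/roadsrules"; rc=1; }
    # The roadslist.pl line must sit between <RootNodes> and </RootNodes>.
    sed -n '/^[[:space:]]*<RootNodes>/,/^[[:space:]]*<\/RootNodes>/p' "$1" |
        grep -q '^[[:space:]]*|lib/roadslist\.pl' ||
        { echo "missing: |lib/roadslist.pl line in <RootNodes>"; rc=1; }
    return "$rc"
}
```

Run it from the gatherer directory as 'check_gatherer_cf ROADS.cf'; it prints a line for each directive it cannot find.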

    A completed configuration file might look like this:

        #
        #  ROADS.cf - configuration file for the
        #  ROADS Gatherer
        #
        #  Created by webmaster@ukoln.ac.uk on Sat Mar 16 21:16:13 GMT 1996
        #
        Gatherer-Name:  ROADS
        Gatherer-Port:  9111
        Top-Directory:  /usr/local/harvest/gatherers/ROADS
        Post-Summarizing: lib/roadsrules
        
        <RootNodes>
        |lib/roadslist.pl Depth=2 URL=250 Delay=2 Host=1
        
        </RootNodes>
        
        <LeafNodes>
        # Enter URLs for LeafNodes here
        
        </LeafNodes>
    

  6. Set up some 'cron' jobs
    You're almost finished! The final task is to set up a couple of 'cron' jobs to remove 'old' ROADSHarvester generated records from the ROADS databases. You don't have to set the 'cron' jobs up now. It might be best to have a play with the ROADSHarvester first and check that everything else is working. If you do this, don't forget to come back later and set up your 'cron' jobs.

    I suggest that you run both jobs on a nightly basis. In both cases you can use the 'bin/removeold.pl' program to do this.

    The first 'cron' job removes old records from the 'Harvested' database. In this case 'old' means more than one day old because records in the 'Harvested' database are normally moved immediately into your main database. Here's an example 'cron' entry:

         0 5 * * * /usr/local/roads/bin/removeold.pl -d Harvested 1
    

    The second 'cron' job removes old records from the 'BulkHarvested' database. In this case you need to match the length of time that records stay in the ROADS database with the length of time records stay in the Harvest gatherer database. This is specified by the gatherer's 'Time-To-Live' configuration variable, which is typically set to 1 month. Set the time to live for records in the ROADS 'BulkHarvested' database to slightly longer than this value, for example:

        30 5 * * * /usr/local/roads/bin/removeold.pl -d BulkHarvested 32
    

    which will remove records that are over 32 days old.
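
The figure of 32 days can be worked out from the gatherer's setting. The sketch below assumes that 'Time-To-Live' is given in seconds (check the "Harvest User's Manual" for your version) and adds a margin of two days, which is a choice made here and not a ROADS default. 'retention_days' is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: retention_days is a hypothetical helper.
# Assumption: the gatherer's Time-To-Live is expressed in seconds.
retention_days() {
    # $1 = Time-To-Live in seconds
    # $2 = safety margin in days (defaults to 2)
    ttl=$1
    margin=${2:-2}
    # Round the time to live up to whole days, then add the margin.
    echo $(( ($ttl + 86399) / 86400 + $margin ))
}
```

With a time to live of 30 days, 'retention_days 2592000' prints 32, which is the value used in the 'cron' entry above.
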
That's it, installation of the ROADSHarvester package is now complete!

Using the ROADS Harvester

{ This section needs a lot more work - and will ultimately expand into some sort of user's guide. For the time being here's a very brief overview of how to use what you've just installed... }

Individual harvesting

The ROADSHarvester offers a new CGI script in the 'admin-cgi' directory called 'harvesttemp.pl'. Access this script using your Web browser.

This user interface is pretty much naff at the moment but is primarily there to demonstrate functionality. It will be improved in future releases!

Select the type of template you want to create - only 'DOCUMENT' and 'SERVICE' are available currently. Type the URL of the resource you want to catalogue into the text box. Click on the 'Submit selection' button. After a brief pause while the resource is harvested you will be given a second form. Click on 'Update harvested ROADS record now' and you'll be put into a normal 'mktemp.pl' update form. At the very least you must change the 'Destination' attribute from 'Harvested' to the name of your main database.

Finally submit the record to the ROADS database as normal.

Bulk-Harvesting

You've more or less done it all by following the installation procedure above. You can run the new ROADS Harvest gatherer by hand using the RunGatherer script in the gatherer directory. When you're happy, set up a cron job to run the gatherer on a regular basis, say weekly, as described in the "Harvest User's Manual".

The starting URLs for the gatherer are taken from all the URI-v?'s listed in all the records in your main database.
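
The package builds this list itself, but a rough preview of what it is likely to contain can be had with a sketch like the one below. It assumes, as nothing in this document confirms, that each record is a plain text file of 'Attribute: value' lines in one directory. 'list_start_urls' is a hypothetical helper and makes no claim to match the package's own scripts exactly.

```shell
#!/bin/sh
# Sketch only: list_start_urls is a hypothetical helper.
# Assumption: one record per file, made up of "Attribute: value" lines.
list_start_urls() {
    # $1 = directory holding the record files
    # $2 = Destination value of your main database, e.g. Science
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # Only look at records that belong to the main database.
        grep -q "^Destination:[[:space:]]*$2[[:space:]]*\$" "$f" || continue
        # Print the value of every URI-v1, URI-v2, ... attribute.
        sed -n 's/^URI-v[0-9][0-9]*:[[:space:]]*//p' "$f"
    done | sort -u
}
```

The output is one URL per line with duplicates removed, which gives an idea of how much the gatherer will be asked to fetch.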

Currently the ROADS database is not re-indexed automatically when the gatherer finishes so you'll have to remember to run 'bin/mkinv.pl', e.g.

    bin/mkinv.pl -a
if you run the gatherer by hand. If you set up a cron job to start the gatherer then add the above line to it as well.
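
A small wrapper keeps the two steps together. This sketch assumes that RunGatherer returns only once gathering has finished and exits with a non-zero status on failure, neither of which is confirmed here; 'run_gatherer_and_reindex' is a hypothetical helper and both directories are whatever applies on your system.

```shell
#!/bin/sh
# Sketch only: run_gatherer_and_reindex is a hypothetical helper.
# Assumption: RunGatherer returns when gathering is complete.
run_gatherer_and_reindex() {
    # $1 = gatherer directory, $2 = ROADS installation directory
    ( cd "$1" && ./RunGatherer ) || return 1
    # Re-index the ROADS database only if the gatherer run succeeded.
    ( cd "$2" && bin/mkinv.pl -a )
}
```

Put the function and a call to it, such as 'run_gatherer_and_reindex /usr/local/harvest/gatherers/ROADS /usr/local/roads', in a script and start that script from 'cron' in place of RunGatherer itself.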

List of files installed with the ROADSHarvester

{ Again, this section needs expansion }

ROADS files:

    bin/listURIs.pl
    bin/harvest.pl
    bin/soif2roads.pl
    bin/newdest.pl
    bin/removeold.pl
    admin-cgi/
    admin-cgi/harvesttemp.pl
    config/multilingual/
    config/multilingual/UK-English/
    config/multilingual/UK-English/harvesttemp/
    config/multilingual/UK-English/harvesttemp/addedok.html
    config/multilingual/UK-English/harvesttemp/cannotcreatereal.html
    config/multilingual/UK-English/harvesttemp/cannotcreatetmp.html
    config/multilingual/UK-English/harvesttemp/cannotindex.html
    config/multilingual/UK-English/harvesttemp/fail.html
    config/multilingual/UK-English/harvesttemp/handleexists.html
    config/multilingual/UK-English/harvesttemp/introform.html

Harvest files:

    harvest/examples/
    harvest/examples/ROADS.cf
    harvest/lib-gatherer/
    harvest/lib-gatherer/rebuild.pl
    harvest/lib-gatherer/roadspostproc.pl
    harvest/lib-gatherer/roadsrules
    harvest/lib-gatherer/roadslist.pl

Problems?

Problems parsing HTML

Harvest bases its parsing of HTML Web pages on the SGML DTD for HTML 2.0. In many cases Web pages do not conform to this (or any other!) DTD. This may well result in ROADS records with less information in them than you would otherwise expect.

Some pages, like the UKOLN home page for example, don't seem to parse well at all :-(. This may result in some gibberish in the ROADS records that are created. (Set the 'Language-v1' attribute to 'gibberish' :-) ). Typically you may see bits of JavaScript code or odd HTML tags finding their way into the 'Description'.

The only real way round this one is for someone to re-write the current Harvest HTML parser.

Destinations with spaces in them

There may be problems if you try to use a Destination attribute with spaces in it like 'Bulk Harvested'. Best avoided for the time being!
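
To spot such values before they cause trouble, a check like the following sketch can be run over 'config/databases'. It assumes the five colon-separated fields used by that file, with 'Destination' as the fourth; 'destinations_with_spaces' is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: destinations_with_spaces is a hypothetical helper.
destinations_with_spaces() {
    # $1 = path to the databases file
    # Skip comment lines, then print any Destination (field 4)
    # that contains a space.
    awk -F: '!/^[ \t]*#/ && NF >= 4 && $4 ~ / / { print $4 }' "$1"
}
```

No output from 'destinations_with_spaces config/databases' means that none of the listed destinations contains a space.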

Long keyword lists

Depending on the content of the Web pages it is summarising, Harvest can sometimes generate very long keyword lists. I'm not currently sure how best to handle this. Reviewing your ROADS stoplist in 'config/stoplist' may help a little.


Andy Powell - a.powell@ukoln.ac.uk