This file is also available at <URL:http://www.ukoln.ac.uk/roads/harvester/v1a1/README.html>
In the current release all this functionality relies on the Harvest suite of tools. These must be installed on the system that is running ROADS prior to installing this package.
Please note...
This documentation is not yet complete, but it should be enough for you to get things working. And finally...
THIS IS ALPHA SOFTWARE - Use with care!
Here's a quick overview of what you're going to have to do.
A full description of setting up multiple ROADS databases can be found in the ROADS FAQ at <URL:http://www.ukoln.ac.uk/roads/faq/>. The rest of these instructions assume that you are not already using multiple databases and that the single database listed in your 'config/databases' file is your main database.
Briefly, you need to add two new lines to your 'config/databases' file, one for the new 'Harvested' database and one for the new 'BulkHarvested' database. So, if your existing 'config/databases' file looks like this:
# Known Whois++ servers
#
# Format is:
#
# friendly name:hostname:port number:value of Destination:server handle
#
# The destination attribute may be left blank
#
Science Resources:roads.ukoln.ac.uk:6664::roadsbathacuk01
then you need to add two lines at the end, e.g.:
Harvested Science Resources:roads.ukoln.ac.uk:6664:Harvested:ukoln01
Robot generated Science Resources:roads.ukoln.ac.uk:6664:BulkHarvested:ukoln01
It is important to remember that there is no way of explicitly searching for records with an empty 'Destination' attribute (though you can search for 'ALL' records). So, having enabled the use of 'Destination' in some records, you will normally want to add a 'Destination' to every record. As well as adding the two new lines to 'config/databases', you therefore also need to add a 'Destination' value to the existing line.
All in all the new 'config/databases' file should look something like this:
# Known Whois++ servers
#
# Format is:
#
# friendly name:hostname:port number:value of Destination:server handle
#
# The destination attribute may be left blank
#
Science Resources:roads.ukoln.ac.uk:6664:Science:roadsbathacuk01
Harvested Science Resources:roads.ukoln.ac.uk:6664:Harvested:ukoln01
Robot generated Science Resources:roads.ukoln.ac.uk:6664:BulkHarvested:ukoln01
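Because an empty 'Destination' cannot be searched for, it can help to check the file for entries that still lack one. The following is only a minimal sketch: 'empty_destinations' is a hypothetical helper (not part of ROADS), and it assumes that no field contains a literal colon.

```shell
# Hypothetical helper, not part of ROADS: report entries in a
# 'config/databases' style file whose fourth colon-separated field
# ('Destination') is empty, and lines that do not have five fields.
empty_destinations() {
    awk -F: '
        /^[ \t]*#/ || /^[ \t]*$/ { next }   # skip comments and blank lines
        NF != 5  { printf "line %d: expected 5 fields, found %d\n", NR, NF; next }
        $4 == "" { printf "line %d: empty Destination for \"%s\"\n", NR, $1 }
    ' "$1"
}

# Example (path is relative to the ROADS installation directory):
#   empty_destinations config/databases
```

No output means every entry has five fields and a non-empty 'Destination'.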
Remember that whenever you create a new resource description from
now on you need to fill in the 'Destination' attribute.
The simplest way to do this is to edit the 'config/outlines' files and set
a default value for 'Destination' in each file.
Finally you need to make sure that you can search on the 'Destination' attribute. Make sure that 'Destination' is listed in 'config/search-restrict'.
To update your existing records, running
bin/newdest.pl -n Science
will look through all your records and put 'Science' into any empty 'Destination' attributes.
mkdir lib
zcat roadsharvester-v1a1.tar.Z | tar xvf -
Make sure that you use binary mode if you transfer the compressed tar
file across the network using FTP; otherwise it will be damaged in
transit and will not be usable.
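If you are unsure whether the file survived the transfer, you can test it before unpacking. This is a sketch only: 'archive_ok' is a hypothetical helper, and it assumes that GNU gzip is installed (gzip can read files produced by 'compress', so 'gzip -dc' does the same job as 'zcat' here).

```shell
# Hypothetical helper: succeed only if the compressed file passes
# gzip's integrity test and its contents can be listed by tar.
# Nothing is extracted.
archive_ok() {
    gzip -t "$1" 2>/dev/null && gzip -dc "$1" | tar tf - >/dev/null 2>&1
}

# Example:
#   archive_ok roadsharvester-v1a1.tar.Z && echo "archive looks intact"
```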
Once the tar file has been expanded you should find that you have a directory called 'roadsharvester-v1a1'.
To install the software, enter the 'roadsharvester-v1a1' directory and type:
./Configure
If this doesn't work, try
perl ./Configure
You may need to give the full path to Perl, e.g.
/usr/local/bin/perl ./Configure
This runs a configuration script which will find out about your machine, ask you some simple questions about where ROADS and Harvest are installed and then proceed to install the software. The ROADSHarvester software does NOT need to run as the "root" superuser on your machine but it is important that the user the Harvest gatherer runs as (e.g. harvest) is able to read the files which the ROADS software generates and create records in the ROADS 'source' directory. It is probably simplest if the Harvest gatherer is run using the same username as that used by the ROADS software.
Post-Summarizing: lib/roadsrules
Replace the current <RootNodes> list with:
|lib/roadslist.pl Depth=2 URL=250 Delay=2 Host=1
where the 'Depth', 'URL', 'Delay' and 'Host' values should be set according to
your own requirements as detailed in the "Harvest User's Manual".
A completed configuration file might look like this:
#
# ROADS.cf - configuration file for the
# ROADS Gatherer
#
# Created by webmaster@ukoln.ac.uk on Sat Mar 16 21:16:13 GMT 1996
#
Gatherer-Name: ROADS
Gatherer-Port: 9111
Top-Directory: /usr/local/harvest/gatherers/ROADS
Post-Summarizing: lib/roadsrules
<RootNodes>
|lib/roadslist.pl Depth=2 URL=250 Delay=2 Host=1
</RootNodes>
<LeafNodes>
# Enter URLs for LeafNodes here
</LeafNodes>
I suggest that you run both jobs on a nightly basis. In both cases you can use the 'bin/removeold.pl' program to do this.
The first 'cron' job removes old records from the 'Harvested' database. In this case 'old' means more than one day old because records in the 'Harvested' database are normally moved immediately into your main database. Here's an example 'cron' entry:
0 5 * * * /usr/local/roads/bin/removeold.pl -d Harvested 1
The second 'cron' job removes old records from the 'BulkHarvested' database.
In this case you need to match the length of time that records stay
in the ROADS database with the length of time records stay in the Harvest
gatherer database. This is specified by the gatherer's 'Time-To-Live'
configuration variable, which is typically set to 1 month.
Set the time to live for records in the ROADS 'BulkHarvested' database
to slightly longer than this value, for example:
30 5 * * * /usr/local/roads/bin/removeold.pl -d BulkHarvested 32
which will remove records that are over 32 days old.
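The figure of 32 can be worked out from the gatherer's setting. The sketch below assumes that 'Time-To-Live' is given in seconds (check the "Harvest User's Manual" for your version). 'ttl_to_days' is a hypothetical helper and the two-day safety margin is an arbitrary choice.

```shell
# Hypothetical helper: convert a Time-To-Live in seconds into a
# day count for removeold.pl, rounding up to whole days and adding
# a safety margin (default 2 days).
ttl_to_days() {
    ttl_seconds=$1
    margin_days=${2:-2}
    echo $(( (ttl_seconds + 86399) / 86400 + margin_days ))
}

ttl_to_days 2592000   # 30 days of seconds plus the margin; prints 32
```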
This user interface is pretty much naff at the moment but is primarily there to demonstrate functionality. It will be improved in future releases!
Select the type of template you want to create (only 'DOCUMENT' and 'SERVICE' are currently available). Type the URL of the resource you want to catalogue into the text box and click on the 'Submit selection' button. After a brief pause while the resource is harvested you will be given a second form. Click the 'Update harvested ROADS record now' button and you'll be put into a normal 'mktemp.pl' update form. At the very least you must change the 'Destination' attribute from 'Harvested' to the name of your main database.
Finally submit the record to the ROADS database as normal.
The starting URLs for the gatherer are taken from all the URI-v?'s listed in all the records in your main database.
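To get a rough idea of which URLs will be used, something like the following may help. It is only an illustration, not how 'bin/listURIs.pl' itself works: 'list_uris' is a hypothetical helper, and it assumes that each record is a plain text file with one 'Attribute: value' pair per line.

```shell
# Hypothetical helper: print the value of every 'URI-v<N>:' attribute
# found in the files under the directory given as $1.
list_uris() {
    find "$1" -type f -exec sed -n 's/^URI-v[0-9][0-9]*:[[:space:]]*//p' {} +
}

# Example (the directory name is an assumption about your installation):
#   list_uris /usr/local/roads/source | sort -u
```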
Currently the ROADS database is not re-indexed automatically when the gatherer finishes so you'll have to remember to run 'bin/mkinv.pl', e.g.
bin/mkinv.pl -a
if you run the gatherer by hand.
If you set up a cron job to start the gatherer then add the above line to
it as well.
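One way to combine the two steps is a small wrapper script. This is a sketch under stated assumptions: 'gather_then_index' is a hypothetical helper, the gatherer is assumed to be started by a single command that returns a non-zero exit status on failure, and skipping the re-index after a failed run is a design choice you may not want.

```shell
# Hypothetical wrapper: run the gatherer command, then re-index ROADS.
#   $1          = ROADS installation directory
#   remaining   = the command that starts your gatherer
gather_then_index() {
    roads_dir=$1
    shift
    "$@" || { echo "gatherer failed; skipping re-index" >&2; return 1; }
    ( cd "$roads_dir" && bin/mkinv.pl -a )
}

# Example (both paths are assumptions about your installation):
#   gather_then_index /usr/local/roads /usr/local/harvest/gatherers/ROADS/RunGatherer
```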
ROADS files:
bin/listURIs.pl
bin/harvest.pl
bin/soif2roads.pl
bin/newdest.pl
bin/removeold.pl
admin-cgi/
admin-cgi/harvesttemp.pl
config/multilingual/
config/multilingual/UK-English/
config/multilingual/UK-English/harvesttemp/
config/multilingual/UK-English/harvesttemp/addedok.html
config/multilingual/UK-English/harvesttemp/cannotcreatereal.html
config/multilingual/UK-English/harvesttemp/cannotcreatetmp.html
config/multilingual/UK-English/harvesttemp/cannotindex.html
config/multilingual/UK-English/harvesttemp/fail.html
config/multilingual/UK-English/harvesttemp/handleexists.html
config/multilingual/UK-English/harvesttemp/introform.html
Harvest files:
harvest/examples/
harvest/examples/ROADS.cf
harvest/lib-gatherer/
harvest/lib-gatherer/rebuild.pl
harvest/lib-gatherer/roadspostproc.pl
harvest/lib-gatherer/roadsrules
harvest/lib-gatherer/roadslist.pl
Some pages, like the UKOLN home page for example, don't seem to parse well at all :-(. This may result in some gibberish in the ROADS records that are created. (Set the 'Language-v1' attribute to 'gibberish' :-) ). Typically you may see bits of JavaScript code or odd HTML tags finding their way into the 'Description'.
The only real way round this one is for someone to re-write the current Harvest HTML parser.