UKOLN RDFHarvester: a PRIDE tool

Background: The RDFHarvester is one small part of the PRIDE Directory Toolkit. It is an automated agent (a robot) designed to run in the background, for instance as a 'cron job' on Unix. These HTML pages and the scripts they link to are not required to run the agent; they exist to give a visual view of how the RDFHarvester works. If they are of interest, the scripts are available.

Contextual overview: a look at the context for an xml:rdf harvest

The RDFHarvester reads its configuration from the PRIDE directory. From the 'config' directory entry it learns what to do: where in the PRIDE directory to find the list of WWW locations holding data to be harvested, and where in the PRIDE directory tree to write each resulting entry. In other words, it makes a basic configuration read followed by a read of the things to do. Once it 'knows what to do' (i.e. where on the web to find directory content information in RDF format), it systematically visits each WWW location, gathers the specified data, and converts it to LDIF format for the PRIDE directory.
View the RDFHarvester "config" entry in the directory
View the set of source entries in the directory (the list of locations to be harvested)
View the contents of an existing Source entry
View a sample xml:rdf instance mapped to (PRIDE directory) LDIF format
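
The read-visit-convert cycle described above can be sketched roughly as follows. This is an illustration only, not the RDFHarvester's own code: the function names, the `cn=itemN` naming scheme, the `o=example` suffix, and the use of `labeledURI` for the resource address are all assumptions made for the example, and the PRIDE schema may map RDF properties quite differently. The sketch also skips details a real LDIF writer needs, such as `objectClass` lines and base64 encoding of unsafe values.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A tiny RDF/XML description used only to exercise the sketch.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:title>Example document</dc:title>
    <dc:creator>A. Author</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def rdf_to_records(rdf_xml):
    """Return one dict per rdf:Description element found in the document."""
    root = ET.fromstring(rdf_xml)
    records = []
    for desc in root.iter("{%s}Description" % RDF_NS):
        # Accept both the namespaced and the bare 'about' attribute.
        about = desc.get("{%s}about" % RDF_NS) or desc.get("about", "")
        props = []
        for child in desc:
            name = child.tag.split("}", 1)[-1]  # '{uri}title' -> 'title'
            value = (child.text or "").strip()
            if value:
                props.append((name, value))
        records.append({"about": about, "props": props})
    return records


def record_to_ldif(record, target_dn, index):
    """Render one record as an LDIF entry below target_dn (hypothetical naming)."""
    lines = ["dn: cn=item%d,%s" % (index, target_dn), "cn: item%d" % index]
    if record["about"]:
        lines.append("labeledURI: %s" % record["about"])
    for name, value in record["props"]:
        lines.append("%s: %s" % (name, value))
    return "\n".join(lines) + "\n"


def harvest(sources, target_dn, fetch):
    """Visit each source, convert every description, and return LDIF text.

    'fetch' is any callable taking a URL and returning RDF/XML text, so the
    network access can be swapped out (for example when testing).
    """
    entries = []
    for url in sources:
        for record in rdf_to_records(fetch(url)):
            entries.append(record_to_ldif(record, target_dn, len(entries) + 1))
    return "\n".join(entries)  # entries are separated by a blank line
```

Calling `harvest(["http://example.org/a.rdf"], "cn=harvested-data,o=example", lambda url: SAMPLE_RDF)` returns a single LDIF entry for the sample description. Passing the fetch step in as a parameter keeps the conversion logic independent of how the WWW locations are actually retrieved.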


A Sample Run: using the above configuration values

You are not required to do anything except press the button.
To keep the above configuration settings in mind, the critical values are repeated below.

view a subtree to delete specific items

View the contents of all entries currently in the HarvestedData location of the directory
This button will delete the entry/entries just added, putting the directory back as it was so the sample can be re-run (NOT YET implemented)


Configure the RDFHarvester: change to new configuration

When the values in the RDFHarvester config and source entries are changed, the robot will perform as the new values dictate. If errors are made in the config/source entries, the original values can easily be reloaded (feel free to experiment).
This is the clean-up button. It resets the default PrideRDFHarvesterConfig and PrideRDFSourceConfig entries to the values shown in the above diagram and instructions.

It also deletes all extra config entries and all extra source entries as well as everything under the 'cn=harvested-data' entry!

View and edit the RDFHarvester "config" entry in the directory: you will be presented with a form containing a second set of default values (referring to different directory source entries that, in turn, should point to other valid xml:rdf descriptions). **Remember**: your config entry has two critical values, the directory location of a valid 'source' entry and the 'target' location of your newly harvested data.
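
As a rough illustration of checking those two critical values before a run, the sketch below parses a simple LDIF entry and reports whichever value is missing. The attribute names `sourceLocation` and `targetLocation` are hypothetical placeholders, not the names used by the actual PrideRDFHarvesterConfig entry, and the parser handles only plain `name: value` lines (no base64 values or continuation lines).

```python
def parse_entry(ldif_text):
    """Parse one simple LDIF entry into {attribute (lowercased): [values]}."""
    entry = {}
    for line in ldif_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and LDIF comments
        name, _, value = line.partition(":")
        entry.setdefault(name.strip().lower(), []).append(value.strip())
    return entry


def check_config(entry, source_attr="sourcelocation", target_attr="targetlocation"):
    """Return a list of problems; an empty list means both values are present.

    The default attribute names are assumptions made for this example.
    """
    problems = []
    for label, attr in (("source", source_attr), ("target", target_attr)):
        values = [v for v in entry.get(attr.lower(), []) if v]
        if not values:
            problems.append("missing %s location (%s)" % (label, attr))
    return problems
```

For an entry that names a source location but no target, `check_config(parse_entry(text))` returns a one-item list describing the missing target, which could be shown to the user before the robot is started.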

View an example of a Source configuration entry. This entry, among other things, holds the WWW address of xml:rdf resource descriptions ready to be harvested into PRIDE. By adding the relevant information here, a resource description owner is, in effect, making the described resources available to PRIDE.

If you have used a name other than 'config' for your new 'config' entry,
enter the new name (this will be passed to the RDFHarvester). Click to run the RDFHarvester with the new settings
View the contents of all entries currently in the HarvestedData location of the directory

view a subtree