NAME

dcbot.pl - Robot to harvest Dublin Core metadata


SYNOPSIS

  dcbot.pl [-d] [-f format] [-g] [-l ldat-path] [-p] [-u URL]
           [-w WHOIS-command]


DESCRIPTION

dcbot.pl is a Perl program that generates Dublin Core elements in various formats from the metadata embedded in a nominated HTML page.

It was derived from the original UKOLN DC-dot program, which is intended to run as a CGI script under a Web server. With the Web functionality removed, the resulting code can easily be used in a batch scripting environment, e.g. to harvest large volumes of Dublin Core metadata for later indexing.
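As a rough illustration of the batch use mentioned above, the following sketch runs dcbot.pl once per URL listed in a file. The DCBOT variable, the harvest_urls function name and the one-URL-per-line input file are assumptions made for this example; only the -f and -u options come from this manual page.

```shell
#!/bin/sh
# Hypothetical batch wrapper: DCBOT and harvest_urls are illustrative names,
# not part of dcbot.pl itself.
DCBOT="${DCBOT:-dcbot.pl}"

harvest_urls() {
    # $1 = file with one URL per line, $2 = output format (default SOIF)
    url_file="$1"
    format="${2:-SOIF}"
    while IFS= read -r url; do
        # Skip blank lines and lines starting with '#'
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the URL list
        "$DCBOT" -f "$format" -u "$url" </dev/null || echo "failed: $url" >&2
    done < "$url_file"
}
```

It might be called as `harvest_urls urls.txt SOIF > records.soif`, where both file names are placeholders.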

Note that the Web functionality of DC-dot also includes a Dublin Core editor component, which for obvious reasons is not present in this program!


OPTIONS

-d
Enables debugging, which is switched off by default.

-f format
Selects the output format for the results generated by dcbot.pl. The available choices are listed under OUTPUT FORMATS.

-g
Attempt to ``guess'' the Publisher of the resource using WHOIS to look up their domain contact details.

-l ldat-path
The path to the ldat program, used to extract metadata from Microsoft Office documents. This defaults to ``ldat''.

-p
Attempt to use WHOIS to look up the domain registration information for the site in question if it is not provided in the embedded Dublin Core META tags.

-u URL
The URL of the HTML page which is to be fetched and parsed for embedded Dublin Core metadata.

-w WHOIS-command
The path to the whois command, used if the -g option is given. This defaults to ``whois -h whois.arin.net''.


OUTPUT FORMATS

UNIMARC
UNIMARC format - not fully operational yet, so use with caution!

USMARC
US MARC

SOIF
The Harvest system's Summary Object Interchange Format

TEI
Using the Text Encoding Initiative's TEI SGML DTD

IAFA
IAFA templates suitable for loading into a ROADS server

GILS
GILS GRS-1 records suitable for loading into a Z39.50 server

XML
Using the Dublin Core basic XML schema

OLSTF
OLIB OLSTF format for importing into the NewsAgent system

RDF
Using the Dublin Core basic RDF schema

AbbreviatedRDF
Abbreviated RDF

BIBLINK
A plain-text version of the record, suitable for importing into the BIBLINK system

IMS
Using the IMS project's XML DTD

TEXT
Synonym for BIBLINK

SPT
Use the Scout Portal Toolkit bulk import/export format
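
A wrapper script may want to reject a mistyped format name before fetching anything. The sketch below checks a name against the list above. The function name is hypothetical, and the exact-case comparison is an assumption, since this page does not say whether dcbot.pl itself treats format names case-sensitively.

```shell
# Hypothetical helper: succeeds only for the format names documented above.
is_documented_format() {
    case "$1" in
        UNIMARC|USMARC|SOIF|TEI|IAFA|GILS|XML|OLSTF|RDF|AbbreviatedRDF|BIBLINK|IMS|TEXT|SPT)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```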


EXAMPLE

  $ dcbot.pl -f SOIF -u http://www.lboro.ac.uk/ 
  @FILE { http://www.lboro.ac.uk/
  Description{75}:        Loughborough University offers degree
    programmes and world class research. 
  Last-Modification-Time{10}:     1032908400
  Time-to-Live{7}:        2419200
  Refresh-Rate{6}:        604800
  Gatherer-Name{6}:       DC-dot
  Type{23}:       text/html || 8415 bytes
  File-Size{4}:   8415
  MD5{32}:        aef5f97ecfbd0a2018d105761e09cc52
  Keywords{106}:  University; England; postgraduate; undergraduate;
    degree programme; course; research; teaching; prospectus
  Title{23}:      Loughborough University
  Type{4}:        Text
  }
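
For later indexing it can be handy to pull a single attribute out of SOIF output like the record above. The following is a minimal sketch, assuming each attribute starts a line in the form Name{size}: value. The soif_field function name is made up for this example, and it prints only the first line of each matching value, so wrapped values such as Description are truncated.

```shell
# Hypothetical helper: print the first line of every value whose
# attribute name equals $1, reading SOIF text from standard input.
soif_field() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)        # drop leading indentation
            prefix = name "{"
            # Match "Name{size}:" plus the whitespace before the value
            if (index(line, prefix) == 1 &&
                match(line, /^[^{]+[{][0-9]+[}]:[ \t]*/)) {
                print substr(line, RLENGTH + 1)
            }
        }
    '
}
```

An attribute that occurs more than once, such as Type in the record above, produces one output line per occurrence.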


BUGS

The dcbot.pl program has an external dependency on the Harvest system's Summary Object Interchange Format (SOIF) parser code.

It is assumed that Harvest has been installed in /usr/local/harvest. If this is not the case, either set the environment variable HARVEST_HOME to the Harvest top level directory, or edit the dcbot.pl code to reflect its location.
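
A wrapper script could check the Harvest location before invoking dcbot.pl. This sketch honours HARVEST_HOME when it is set and otherwise falls back to the documented default. The function name and the error message are illustrative only, and the check merely confirms that the directory exists, not that the SOIF parser code is actually present in it.

```shell
# Hypothetical helper: print the Harvest directory to use, or fail
# with a message on stderr if that directory does not exist.
resolve_harvest_home() {
    # Prefer HARVEST_HOME if set, else the default install location
    dir="${HARVEST_HOME:-/usr/local/harvest}"
    if [ -d "$dir" ]; then
        printf '%s\n' "$dir"
    else
        echo "Harvest not found in $dir; set HARVEST_HOME" >&2
        return 1
    fi
}
```

One possible use is `HARVEST_HOME=$(resolve_harvest_home) && export HARVEST_HOME` ahead of the dcbot.pl call.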


COPYRIGHT

The original DC-dot program is Copyright (C) 1997 UKOLN, University of Bath, UK, and was written by UKOLN's Andy Powell. This derivative was created by Martin Hamilton in 2002.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

This work was partially supported by grants from the UK Electronic Libraries Programme (eLib), the European Commission Telematics for Research Programme, and the joint JISC/NSF Digital Libraries Programme (IMesh Toolkit project).

UKOLN is funded by Resource: The Council for Museums, Archives & Libraries (the organisation succeeding the Library and Information Commission), the Joint Information Systems Committee (JISC) of the Higher and Further Education Funding Councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath where it is based.