UKOLN A bug in WebWatch robot software



WebWatch uses a robot to gather the web pages it needs. The robot was initially based on the Harvest gatherer and was then rewritten in Perl. Further ideas, and bugs in that version, led us to develop a new robot in object-oriented Perl, with classes that provide general robot functions.

Later versions of the robot software (complete re-writes of the initial robot) have exhibited problems with memory consumption.

To track down this problem, the program was successively cut down and simplified to the version shown below. This program eventually halts on our (UNIX) system with "Out of memory!".

Monitoring the program with top(1) shows its memory usage rising from roughly 4,000 KB to over 150 MB in under 10 minutes.

When the program is run under the Perl debugger (perl -d), it stops with messages along the lines of '100 levels deep in subroutine' while inside various LWP modules: initially URI::URL and, once that was removed, HTTP::Headers (used by HTTP::Request and HTTP::Response).

Is this an LWP problem?

The following is an iterative agent that follows every link it sees unless the link has already been visited. It stops with the problem described above, and was tested with two different versions of LWP and two different versions of Perl. The loop that iteratively follows links is the while loop labelled TRAVERSE.

#!/usr/bin/perl

use LWP::UserAgent;
use Fcntl;
use NDBM_File;
use HTTP::Request;
use URI::URL;

$url='http://www.ukoln.ac.uk/';
print "TRAVERSING $url \n\n";
sleep 2;
$ua=LWP::UserAgent->new();
$ua->agent('WebWatchTest');
$ua->proxy('http', 'http://scoop.ukoln.ac.uk:3128');
$request = HTTP::Request->new('GET', $url);
$response = $ua->simple_request($request);
$content=$response->content();
undef $request;
if ($response->is_success()) {
 print "Got the node $url with STATUS == SUCCESS \n";
}
else {
 print "$url STATUS == PROBLEM! \n";
}
$VISITED{$url}=1;
undef $response;
$external=&pluck($content, $url);
undef $content;
undef $node;

@links=@{$external};
$stop_flag=0;

$dbm=tie (%VISITED, "NDBM_File", "/tmp/visitedurls", O_RDWR|O_CREAT, 0644) || die "Oops\n";
TRAVERSE:
 while ($stop_flag==0) {
INNER:
   for $url (@links) {
	if ($VISITED{$url}==1) {  #if the URL has been visited skip it
		print "$url STATUS == VISITED! \n";
		next INNER;
	}
	$request=HTTP::Request->new('GET', $url);
	$response=$ua->simple_request($request);
	$content=$response->content();
	undef $request;
	if ($response->is_success()) { 
		print "Got $url with STATUS == SUCCESS \n";
	}
	else {
		print "$url STATUS == PROBLEM! \n";
	}
	undef $response;
	$VISITED{$url}=1;
	$external=&pluck($content, $url); #get URLs from document
	undef $content;
	push @temp, @{$external}; # $external is a reference to a list of
				  # links to follow 
	undef $external;	
   }

   print "Done one Loop \n\n";
   undef @links;
   if (@temp) { #if the last pages produced more URL references
	$stop_flag=0;
	@links=@temp;	  #then put them in @links
	undef @temp;
	next TRAVERSE;    #and let it go again...
   }
   else {
	$stop_flag=1;
   }
 }

exit(0);
  
#this subroutine uses a program called 'HTMLurls' to find links to follow
#I don't think this is the cause of the problem - in the proper robot a 
#module dealing with URLs is used for this.

sub pluck {
 my ($content, $url) = @_;
 $|=1;
 open INF, "> /tmp/webwatch$$.html";
 print INF $content;
 close INF;

 open DINF, "./HTMLurls /tmp/webwatch$$.html |";
 for (<DINF>) {
	if (/^http/) {
		push @nurls, $_;
	}
	else {
		my $myurl=URI::URL->new($_, $url);
		my $fullurl=$myurl->abs($bhref)->as_string();
 		my $frag=$myurl->frag();
	        $fullurl =~ s/\#$frag//g if defined($frag);
		push @nurls, $fullurl;
	}	
 }

 close DINF; 
 return \@nurls;
}
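
One thing worth ruling out before blaming LWP: in the listing above, pluck pushes onto @nurls without declaring it with my, so @nurls is a package variable that is never emptied between calls. If that reading is right, every call returns a reference to the same, ever-growing array, and the line "push @temp, @{$external}" re-queues every URL found so far. The sketch below reproduces only that pattern, with no network access. The names pluck_global and pluck_lexical and the sample links are hypothetical, made up for illustration, and this is a possible contributor to check rather than a confirmed diagnosis of the memory growth.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our @nurls;    # package variable shared by every call, as in the listing

# Mimics the pattern in pluck: push onto the shared array, never empty it.
# (Hypothetical stand-in; takes the "found" links as arguments instead of
# running HTMLurls.)
sub pluck_global {
    my @found = @_;
    push @nurls, @found;
    return \@nurls;
}

# Same job with a lexical array: each call starts from an empty list.
sub pluck_lexical {
    my @found = @_;
    my @urls;
    push @urls, @found;
    return \@urls;
}

# Pretend every page yields the same two links (illustrative data only).
my @page_links = ('http://example.org/a', 'http://example.org/b');

my (@global_sizes, @lexical_sizes);
for my $call (1 .. 3) {
    push @global_sizes,  scalar @{ pluck_global(@page_links) };
    push @lexical_sizes, scalar @{ pluck_lexical(@page_links) };
}

print "shared array sizes:  @global_sizes\n";     # prints 2 4 6
print "lexical array sizes: @lexical_sizes\n";    # prints 2 2 2
```

In the traversal loop the effect would compound, because the whole returned list is appended to @temp on every call, so the queue for the next pass can grow much faster than the number of distinct pages. Declaring the array with my inside pluck is a cheap experiment to see whether the memory growth changes.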