NAME
WWW::SimpleRobot - a simple web robot for recursively following links on web pages.
SYNOPSIS
    use WWW::SimpleRobot;
    my $robot = WWW::SimpleRobot->new(
        URLS            => [ 'http://www.perl.org/' ],
        FOLLOW_REGEX    => "^http://www.perl.org/",
        DEPTH           => 1,
        TRAVERSAL       => 'depth',
        VISIT_CALLBACK  => sub {
            my ( $url, $depth, $html, $links ) = @_;
            print STDERR "Visiting $url\n";
            print STDERR "Depth = $depth\n";
            print STDERR "HTML = $html\n";
            print STDERR "Links = @$links\n";
        },
    );
    $robot->traverse;
    my @urls  = @{ $robot->urls };
    my @pages = @{ $robot->pages };
    for my $page ( @pages )
    {
        my $url               = $page->{url};
        my $depth             = $page->{depth};
        my $modification_time = $page->{modification_time};
    }
DESCRIPTION
A simple Perl module for recursively crawling web pages. For a more elaborate interface,
see WWW::Robot. This module uses LWP::Simple to fetch pages and HTML::LinkExtor to
extract the links from them. Only the href attributes of anchor tags are extracted.
Each extracted link is checked against the FOLLOW_REGEX regex to decide whether it
should be followed, and a HEAD request is made to confirm that it points to a
'text/html' page before it is visited.
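The fetch / extract / filter step described above can be sketched as a standalone
script. This is only an illustration of the approach, not the module's internal code;
the starting URL and follow pattern are the same hypothetical values used in the
SYNOPSIS.

    use strict;
    use warnings;

    use LWP::Simple qw( get head );
    use HTML::LinkExtor;

    my $url          = 'http://www.perl.org/';
    my $follow_regex = qr{^http://www\.perl\.org/};

    my $html = get( $url );
    die "failed to fetch $url\n" unless defined $html;

    # collect only the href attributes of anchor tags
    my @links;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @links, $attr{href} if $tag eq 'a' and $attr{href};
        },
        $url,       # base URL, so relative links come back absolute
    );
    $extor->parse( $html );

    # keep only links that match the follow regex and look like HTML pages
    for my $link ( @links )
    {
        next unless $link =~ $follow_regex;
        my ( $content_type ) = head( $link );
        next unless $content_type and $content_type =~ m{^text/html};
        print "would follow $link\n";
    }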
BUGS
This robot doesn't respect the Robot Exclusion Protocol
(http://info.webcrawler.com/mak/projects/robots/norobots.html) (naughty
robot!), and it does no exception handling when a page can't be fetched; it
simply skips that page and goes on to the next one!
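If you need the exclusion protocol, one possible workaround (a sketch, not part of
this module) is to check each seed URL against the site's robots.txt with
WWW::RobotRules before passing it to the robot; the agent name and URLs below are
hypothetical.

    use strict;
    use warnings;

    use LWP::Simple qw( get );
    use WWW::RobotRules;

    # hypothetical agent name and start URL
    my $rules = WWW::RobotRules->new( 'MyRobot/1.0' );
    my $url   = 'http://www.perl.org/';

    # fetch and parse the site's robots.txt before seeding the robot
    my $robots_url = 'http://www.perl.org/robots.txt';
    my $robots_txt = get( $robots_url );
    $rules->parse( $robots_url, $robots_txt ) if defined $robots_txt;

    if ( $rules->allowed( $url ) )
    {
        # safe to pass $url to WWW::SimpleRobot->new( URLS => [ $url ], ... )
    }
    else
    {
        warn "robots.txt disallows $url\n";
    }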
AUTHOR
Ave Wrigley <Ave.Wrigley@itn.co.uk>
COPYRIGHT
Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.