NAME

Dezi::Aggregator::Spider - web aggregator

SYNOPSIS

use Dezi::Aggregator::Spider;
my $spider = Dezi::Aggregator::Spider->new(
    indexer => Dezi::Indexer->new
);

$spider->indexer->start;
$spider->crawl( 'http://swish-e.org/' );
$spider->indexer->finish;

DESCRIPTION

Dezi::Aggregator::Spider is a web crawler similar to the spider.pl script in the Swish-e 2.4 distribution. Internally, Dezi::Aggregator::Spider uses LWP::RobotUA to do the hard work. See Dezi::Aggregator::Spider::UA.

METHODS

See Dezi::Aggregator.

new( params )

All params have their own get/set methods too. They include:

agent string

Get/set the user-agent string reported by the user agent.

email string

Get/set the email string reported by the user agent.

use_md5 1|0

Flag indicating whether each URI's content should be fingerprinted with MD5 and compared against content already seen. Useful if the same content is available under multiple URIs and you only want to index it once.
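The fingerprinting idea can be sketched with the core Digest::MD5 module. This is only an illustration of the technique, not the module's internal code: the %seen_fingerprint hash is a hypothetical in-memory stand-in for md5_cache(), and the URIs and bodies are made-up sample data.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for md5_cache(): fingerprint => first URI seen.
my %seen_fingerprint;

# Returns 1 the first time a given body is seen, 0 for later duplicates.
sub is_new_content {
    my ( $uri, $content ) = @_;
    my $fingerprint = md5_hex($content);
    return 0 if exists $seen_fingerprint{$fingerprint};
    $seen_fingerprint{$fingerprint} = $uri;
    return 1;
}

# Sample data: the first two URIs serve identical content.
my @fetched = (
    [ 'http://example.com/',           '<html>home</html>' ],
    [ 'http://example.com/index.html', '<html>home</html>' ],
    [ 'http://example.com/about',      '<html>about</html>' ],
);

# Keep only the URIs whose content has not been seen before.
my @to_index = map { $_->[0] } grep { is_new_content(@$_) } @fetched;
print "$_\n" for @to_index;
```

With this sample data only the first and third URIs are kept, because the second serves the same bytes as the first.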

uri_cache cache_object

Get/set the Dezi::Cache-derived object used to track which URIs have been fetched already.

md5_cache cache_object

If use_md5() is true, this Dezi::Cache-derived object tracks the URI fingerprints.

file_rules File_Rules_or_ARRAY

Apply a File::Rules object in uri_ok(). File_Rules_or_ARRAY should be a File::Rules object or an array of strings suitable for passing to File::Rules->new().

queue queue_object

Get/set the Dezi::Queue-derived object for tracking which URIs still need to be fetched.

ua lwp_useragent

Get/set the Dezi::Aggregator::Spider::UA object.

max_depth n

How many levels of links to follow. NOTE: This value counts the number of links followed from the first argument passed to crawl().

Default is unlimited depth.

max_time n

This optional key sets the maximum time, in seconds, to spider. Spidering for this host will stop after max_time seconds and move on to the next server, if any. The default is to not limit by time.

max_files n

This optional key sets the max number of files to spider before aborting. The default is to not limit by number of files. This is the number of requests made to the remote server, not the total number of files to index (see max_indexed). This count is displayed at the end of indexing as Unique URLs.

This feature can (and perhaps should) be used to prevent runaway spidering on a web site where dynamic content may generate an endless supply of unique URLs.

max_size n

This optional key sets the max size of a file read from the web server. This defaults to 5,000,000 bytes. If the size is exceeded the resource is truncated per LWP::UserAgent.

Set max_size to zero for unlimited size.

modified_since date

This optional parameter will skip any URIs that do not report having been modified since date. The Last-Modified HTTP header is used to determine modification time.
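As a rough illustration of the comparison involved, the sketch below parses a Last-Modified value with the core Time::Piece module and compares it to a cutoff. It is not the module's implementation. The helper names are hypothetical, the cutoff is expressed in epoch seconds purely for this example (the format that modified_since itself accepts is not specified here), and treating a missing header as "not modified" is this sketch's own choice.

```perl
use strict;
use warnings;
use Time::Piece;

# Parse an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT" into epoch seconds.
sub http_date_to_epoch {
    my ($date) = @_;
    return Time::Piece->strptime( $date, '%a, %d %b %Y %H:%M:%S GMT' )->epoch;
}

# Returns 1 if the Last-Modified value is later than the cutoff, else 0.
# A missing header counts as "not modified" (an assumption of this sketch).
sub modified_after {
    my ( $last_modified, $cutoff_epoch ) = @_;
    return 0 unless defined $last_modified;
    return http_date_to_epoch($last_modified) > $cutoff_epoch ? 1 : 0;
}

my $cutoff = http_date_to_epoch('Thu, 01 Jan 2015 00:00:00 GMT');

for my $header ( 'Wed, 21 Oct 2015 07:28:00 GMT', 'Mon, 01 Dec 2014 12:00:00 GMT' ) {
    my $verdict = modified_after( $header, $cutoff ) ? 'fetch' : 'skip';
    print "$header: $verdict\n";
}
```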

keep_alive 1|0

This optional parameter enables keep-alive requests. This can dramatically speed up spidering and reduce the load on the server being spidered. The default is to not use keep-alives, although enabling them will probably be the right thing to do.

To get the most out of keep-alives, you may want to set up your web server to allow a lot of requests per single connection (e.g., MaxKeepAliveRequests on Apache). Apache's default is 100, which should be sufficient.

When a connection is not closed the spider does not wait the "delay" time when making the next request. In other words, there is no delay in requesting documents while the connection is open.

Note: you must have at least libwww-perl-5.53_90 installed to use this feature.

delay n

Get/set the number of seconds to wait between making requests. Default is 5 seconds (a very friendly delay).

timeout n

Get/set the number of seconds to wait before considering the remote server unresponsive. The default is 10.

authn_callback code_ref

CODE reference to fetch username/password credentials when necessary. See also credentials.

credential_timeout n

Number of seconds to wait before skipping manual prompt for username/password.

credentials user:pass

String with username:password pair to be used when prompted by the server.

follow_redirects 1|0

By default, 3xx responses from the server will be followed when they are on the same hostname. Set to false (0) to not follow redirects.

TODO

remove_leading_dots 1|0

Microsoft server hack.

same_hosts array_ref

ARRAY ref of hostnames to be treated as identical to the original host being spidered. By default the spider will not follow links to different hosts.
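Several of the parameters above can be combined in one call to new(). The following is a sketch only: it uses parameters documented in this section, but the agent and email strings are placeholders and the numeric limits are arbitrary values chosen for illustration. It requires Dezi::App to be installed.

```perl
use strict;
use warnings;
use Dezi::Indexer;
use Dezi::Aggregator::Spider;

my $spider = Dezi::Aggregator::Spider->new(
    indexer    => Dezi::Indexer->new,
    agent      => 'my-crawler/0.1',       # placeholder user-agent string
    email      => 'admin@example.com',    # placeholder contact address
    delay      => 1,                      # seconds between requests
    timeout    => 10,                     # seconds before giving up on a server
    max_depth  => 2,                      # links followed from the start URI
    max_files  => 500,                    # cap on requests made
    keep_alive => 1,                      # reuse connections
    same_hosts => ['www.swish-e.org'],    # treat as the same site
);

$spider->indexer->start;
$spider->crawl('http://swish-e.org/');
$spider->indexer->finish;
```

Because every parameter also has a get/set method, a value such as delay could instead be changed after construction with $spider->delay(2).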

BUILD

Initializes a new spider object. Called by new().

uri_ok( uri )

Returns true if uri is acceptable for inclusion in an index. The 'ok-ness' of the uri is based on its base, robot rules, and the spider configuration.

add_to_queue( uri )

Add uri to the queue.

next_from_queue

Return next uri from queue.

left_in_queue

Returns queue()->size().

remove_from_queue( uri )

Calls queue()->remove(uri).

get_doc

Returns the next URI from the queue() as a Dezi::Indexer::Doc object, or the error message if there was one.

Returns undef if the queue is empty or max_depth() has been reached.

get_authorized_doc( uri, response )

Called internally when the server returns a 401 or 403 response. Will attempt to determine the correct credentials for uri based on the previous attempt in response and on what you have configured in credentials or authn_callback, or what you enter when manually prompted.

looks_like_feed( http_response )

Called internally to perform naive heuristics on http_response to determine whether it looks like an XML feed of some kind, rather than an HTML page.

looks_like_sitemap( http_response )

Called internally to perform naive heuristics on http_response to determine whether it looks like an XML sitemap feed, rather than an HTML page.

crawl( uri )

Implements the required crawl() method. Recursively fetches uri and its child links to a depth set in max_depth().

Will quit after max_files() unless max_files==0.

Will quit after max_time() seconds unless max_time==0.
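The interaction between the depth and file limits can be sketched with a breadth-first walk over an in-memory link graph. This is not the module's implementation: crawl_sketch() and the %links graph are hypothetical, no HTTP requests are made, and passing undef for unlimited depth is a convention of this sketch only.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real pages and their links.
my %links = (
    '/'      => [ '/a', '/b' ],
    '/a'     => [ '/a/1', '/' ],
    '/b'     => ['/b/1'],
    '/a/1'   => ['/a/1/x'],
    '/b/1'   => [],
    '/a/1/x' => [],
);

# Visit pages breadth-first from $start. Links are not followed beyond
# $max_depth hops (undef means unlimited), and the walk stops once
# $max_files pages have been visited (0 means unlimited).
sub crawl_sketch {
    my ( $start, $max_depth, $max_files ) = @_;
    my @queue = ( [ $start, 0 ] );
    my %seen  = ( $start => 1 );
    my @visited;
    while ( my $item = shift @queue ) {
        my ( $uri, $depth ) = @$item;
        last if $max_files && @visited >= $max_files;
        push @visited, $uri;
        next if defined $max_depth && $depth >= $max_depth;
        for my $child ( @{ $links{$uri} || [] } ) {
            next if $seen{$child}++;    # skip URIs already queued
            push @queue, [ $child, $depth + 1 ];
        }
    }
    return @visited;
}

my @pages = crawl_sketch( '/', 1, 0 );
print join( ' ', @pages ), "\n";
```

With a depth of 1 the walk visits the start page and its direct links only. The %seen hash plays the role that uri_cache() plays in the spider: it keeps the link from /a back to / from being queued twice.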

write_log( args )

Passes args to Dezi::Utils::write_log().

write_log_line([char, width])

Pass through to Dezi::Utils::write_log_line().

AUTHOR

Peter Karman, <perl@peknet.com>

BUGS

Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi-App. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Dezi

COPYRIGHT AND LICENSE

Copyright 2008-2015 by Peter Karman

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

http://swish-e.org/