NAME
Dezi::Aggregator::Spider - web aggregator
SYNOPSIS
use Dezi::Aggregator::Spider;
my $spider = Dezi::Aggregator::Spider->new(
    indexer => Dezi::Indexer->new,
);
$spider->indexer->start;
$spider->crawl( 'http://swish-e.org/' );
$spider->indexer->finish;
DESCRIPTION
Dezi::Aggregator::Spider is a web crawler similar to the spider.pl script in the Swish-e 2.4 distribution. Internally, Dezi::Aggregator::Spider uses LWP::RobotUA to do the hard work. See Dezi::Aggregator::Spider::UA.
METHODS
See Dezi::Aggregator.
new( params )
All params have their own get/set methods too. They include:
- agent string
-
Get/set the user-agent string reported by the user agent.
- email string
-
Get/set the email string reported by the user agent.
- use_md5 1|0
-
Flag as to whether each URI's content should be fingerprinted and compared. Useful if the same content is available under multiple URIs and you only want to index it once.
- uri_cache cache_object
-
Get/set the Dezi::Cache-derived object used to track which URIs have been fetched already.
- md5_cache cache_object
-
If use_md5() is true, this Dezi::Cache-derived object tracks the URI fingerprints.
- file_rules File_Rules_or_ARRAY
-
Apply a File::Rules object in uri_ok(). File_Rules_or_ARRAY should be a File::Rules object or an array of strings suitable for passing to File::Rules->new().
- queue queue_object
-
Get/set the Dezi::Queue-derived object for tracking which URIs still need to be fetched.
- ua lwp_useragent
-
Get/set the Dezi::Aggregator::Spider::UA object.
- max_depth n
-
How many levels of links to follow. NOTE: This value describes the number of links from the first argument passed to crawl.
Default is unlimited depth.
- max_time n
-
This optional key sets the maximum time to spider. Spidering for this host will stop after max_time seconds and move on to the next server, if any. The default is to not limit by time.
- max_files n
-
This optional key sets the max number of files to spider before aborting. The default is to not limit by number of files. This is the number of requests made to the remote server, not the total number of files to index (see max_indexed). This count is displayed at the end of indexing as Unique URLs.
This feature can (and perhaps should) be used when spidering a web site where dynamic content may generate unique URLs, to prevent runaway spidering.
- max_size n
-
This optional key sets the max size of a file read from the web server. This defaults to 5,000,000 bytes. If the size is exceeded the resource is truncated per LWP::UserAgent.
Set max_size to zero for unlimited size.
- modified_since date
-
This optional parameter will skip any URIs that do not report having been modified since date. The Last-Modified HTTP header is used to determine modification time.
- keep_alive 1|0
-
This optional parameter will enable keep-alive requests. This can dramatically speed up spidering and reduce the load on the server being spidered. The default is to not use keep-alives, although enabling them will probably be the right thing to do.
To get the most out of keep-alives, you may want to set up your web server to allow a lot of requests per single connection (e.g., MaxKeepAliveRequests on Apache). Apache's default is 100, which should be good.
When a connection is not closed the spider does not wait the "delay" time when making the next request. In other words, there is no delay in requesting documents while the connection is open.
Note: you must have at least libwww-perl-5.53_90 installed to use this feature.
- delay n
-
Get/set the number of seconds to wait between making requests. Default is 5 seconds (a very friendly delay).
- timeout n
-
Get/set the number of seconds to wait before considering the remote server unresponsive. The default is 10.
- authn_callback code_ref
-
CODE reference to fetch username/password credentials when necessary. See also credentials.
- credential_timeout n
-
Number of seconds to wait before skipping manual prompt for username/password.
- credentials user:pass
-
String with username:password pair to be used when prompted by the server.
- follow_redirects 1|0
-
By default, 3xx responses from the server will be followed when they are on the same hostname. Set to false (0) to not follow redirects.
-
TODO
- remove_leading_dots 1|0
-
Microsoft server hack.
- same_hosts array_ref
-
ARRAY ref of hostnames to be treated as identical to the original host being spidered. By default the spider will not follow links to different hosts.
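The use_md5 and md5_cache options above describe fingerprinting content so that the same document reachable under several URIs is indexed only once. The following is a minimal sketch of that idea, not the module's implementation: the %seen_fingerprint hash is a hypothetical stand-in for a Dezi::Cache-derived object, is_new_content() is an illustrative helper, and the example.com URIs and bodies are made up.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for md5_cache: fingerprint => first URI seen.
my %seen_fingerprint;

# Returns 1 the first time a given body is seen, 0 for any repeat.
sub is_new_content {
    my ( $uri, $content ) = @_;
    my $fingerprint = md5_hex($content);
    return 0 if exists $seen_fingerprint{$fingerprint};
    $seen_fingerprint{$fingerprint} = $uri;
    return 1;
}

# Made-up fetch results: [ uri, body ]. The first two bodies are identical.
my @fetched = (
    [ 'http://example.com/',           '<html>home</html>' ],
    [ 'http://example.com/index.html', '<html>home</html>' ],
    [ 'http://example.com/about',      '<html>about</html>' ],
);

# Keep only the URIs whose content has not been fingerprinted before.
my @to_index = map { $_->[0] } grep { is_new_content( @$_ ) } @fetched;
print "$_\n" for @to_index;
```

In this sketch the second URI is skipped because its body hashes to the same MD5 digest as the first.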
BUILD
Initializes a new spider object. Called by new().
uri_ok( uri )
Returns true if uri is acceptable for including in an index. The 'ok-ness' of the uri is based on its base, robot rules, and the spider configuration.
add_to_queue( uri )
Add uri to the queue.
next_from_queue
Return next uri from queue.
left_in_queue
Returns queue()->size().
remove_from_queue( uri )
Calls queue()->remove(uri).
get_doc
Returns the next URI from the queue() as a Dezi::Indexer::Doc object, or the error message if there was one.
Returns undef if the queue is empty or max_depth() has been reached.
get_authorized_doc( uri, response )
Called internally when the server returns a 401 or 403 response. Will attempt to determine the correct credentials for uri based on the previous attempt in response and what you have configured in credentials, authn_callback or when manually prompted.
looks_like_feed( http_response )
Called internally to perform naive heuristics on http_response to determine whether it looks like an XML feed of some kind, rather than an HTML page.
looks_like_sitemap( http_response )
Called internally to perform naive heuristics on http_response to determine whether it looks like an XML sitemap feed, rather than an HTML page.
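To give a sense of what "naive heuristics" could mean for the two methods above, here is a hedged sketch. The function names, the content-type patterns, and the root-element checks are all assumptions made for illustration; the checks the module actually performs may differ. To stay self-contained, the functions take a content-type string and a body string rather than a response object.

```perl
use strict;
use warnings;

# Hypothetical heuristic: an XML content type plus a sitemap root element.
sub sketch_looks_like_sitemap {
    my ( $content_type, $body ) = @_;
    return 0 unless $content_type =~ m{\bxml\b}i;
    return $body =~ m{<(?:urlset|sitemapindex)\b}i ? 1 : 0;
}

# Hypothetical heuristic: a feed-specific content type, or an XML
# content type plus an RSS, Atom, or RDF root element.
sub sketch_looks_like_feed {
    my ( $content_type, $body ) = @_;
    return 1 if $content_type =~ m{\b(?:rss|atom)\+xml\b}i;
    return 0 unless $content_type =~ m{\bxml\b}i;
    return $body =~ m{<(?:rss|feed|rdf:RDF)\b}i ? 1 : 0;
}

my $sitemap_body = '<?xml version="1.0"?><urlset><url></url></urlset>';
print "sitemap\n" if sketch_looks_like_sitemap( 'application/xml', $sitemap_body );
print "feed\n"    if sketch_looks_like_feed( 'text/xml', '<rss version="2.0">' );
```

With a real HTTP::Response object, the two arguments would come from its content_type and decoded_content methods.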
crawl( uri )
Implements the required crawl() method. Recursively fetches uri and its child links to a depth set in max_depth().
Will quit after max_files() unless max_files==0.
Will quit after max_time() seconds unless max_time==0.
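The interaction of max_depth() and max_files() described above can be sketched as a breadth-first walk over a link graph. This is an illustration under stated assumptions, not the module's code: %links is a hypothetical in-memory graph standing in for real HTTP fetches, crawl_sketch() is an invented helper, and an undefined $max_depth or a $max_files of 0 means "no limit" in this sketch.

```perl
use strict;
use warnings;

# Hypothetical link graph: each URI maps to the links found on that page.
my %links = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/c' ],
    '/b' => [ '/c', '/' ],
    '/c' => [ '/d' ],
    '/d' => [],
);

# Breadth-first crawl from $start, returning URIs in the order visited.
sub crawl_sketch {
    my ( $start, $max_depth, $max_files ) = @_;
    my %seen  = ( $start => 1 );
    my @queue = ( [ $start, 0 ] );    # each entry is [ uri, depth ]
    my @visited;

    while ( my $item = shift @queue ) {
        my ( $uri, $depth ) = @$item;

        # Stop once the request budget is spent.
        last if $max_files && @visited >= $max_files;
        push @visited, $uri;

        # Do not follow links from pages already at the depth limit.
        next if defined $max_depth && $depth >= $max_depth;

        for my $child ( @{ $links{$uri} || [] } ) {
            next if $seen{$child}++;    # skip URIs already queued
            push @queue, [ $child, $depth + 1 ];
        }
    }
    return @visited;
}

print join( ' ', crawl_sketch( '/', 1 ) ), "\n";
```

Depth here counts links from the starting URI, which matches the note under max_depth: a limit of 1 visits the start page and the pages it links to directly.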
write_log( args )
Passes args to Dezi::Utils::write_log().
write_log_line([char, width])
Pass through to Dezi::Utils::write_log_line().
AUTHOR
Peter Karman, <perl@peknet.com>
BUGS
Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi-App. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Dezi
You can also look for information at:
Mailing list
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
COPYRIGHT AND LICENSE
Copyright 2008-2009 by Peter Karman
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.