NAME
dezibot - parallel web crawler
SYNOPSIS
# crawl 2 sites
% dezibot http://dezi.org http://swish-e.org
# crawl a list of sites
% dezibot --urls file_with_urls
# pass in stored config
% dezibot --config botconfig.pl
# crawl in parallel
% dezibot --workers 5 --urls file_with_urls
DESCRIPTION
dezibot is a command line tool wrapping the Dezi::Bot module.
dezibot can:
read from a config file or take options on the command line
read URLs from a file or from @ARGV
spawn multiple parallel spiders
OPTIONS
The following options are supported.
--help
Print this message.
--debug
Spew lots of information to stderr. Overrides any setting in --config.
--verbose
Print some status information to stderr. Overrides any setting in --config.
--config file
Read config from file using Config::Any. The parsed config is passed directly to Dezi::Bot->new().
--urls file
Read URLs to crawl from file. Lines starting with whitespace or # are ignored.
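The filtering rule for --urls can be sketched in a few lines. This is an illustration in Python, not dezibot's own code: the function name parse_url_lines is hypothetical, and skipping empty lines is an assumption, since the text above only mentions lines that start with whitespace or #.

```python
def parse_url_lines(lines):
    """Return the URLs from an iterable of lines, applying the --urls rule.

    Hypothetical helper for illustration only; not part of dezibot.
    """
    urls = []
    for line in lines:
        # Drop the trailing newline but keep leading whitespace,
        # because leading whitespace is what marks a line as ignored.
        line = line.rstrip("\r\n")
        # Assumption: empty lines are skipped as well.
        if not line or line[0].isspace() or line[0] == "#":
            continue
        urls.append(line)
    return urls


if __name__ == "__main__":
    sample = [
        "# sites to crawl\n",
        "http://dezi.org\n",
        "  http://ignored.example\n",
        "http://swish-e.org\n",
    ]
    for url in parse_url_lines(sample):
        print(url)
```

Under this reading, a URL can be disabled temporarily either by prefixing it with # or by indenting it.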
--workers n
Spawn n workers to crawl in parallel. The default is to crawl serially. If n is less than the number of URLs, the list of URLs will be sliced and apportioned among the n workers according to --pool_size.
--pool_size n
The maximum number of URLs per worker. The default is to divide the number of URLs by the number of workers, but you might want to set n to a lower number in order to minimize wait time between crawls.
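The interaction between --workers and --pool_size can be illustrated with a short Python sketch. It is not dezibot's implementation: the function name slice_urls is hypothetical, and rounding the default pool size up (so that the default yields at most one slice per worker) is an assumption, since the text above does not say how a remainder is handled.

```python
import math


def slice_urls(urls, workers, pool_size=None):
    """Split urls into consecutive slices of at most pool_size items.

    Hypothetical helper for illustration only; not part of dezibot.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if pool_size is None:
        # Default described above: number of URLs divided by number of
        # workers. Assumption: round up, and never go below 1.
        pool_size = max(1, math.ceil(len(urls) / workers))
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return [urls[i:i + pool_size] for i in range(0, len(urls), pool_size)]


if __name__ == "__main__":
    urls = ["http://site%d.example" % i for i in range(10)]
    # Default pool size with 3 workers: slices of 4, 4 and 2 URLs.
    print([len(s) for s in slice_urls(urls, workers=3)])
    # A smaller pool size gives more, shorter slices than workers.
    print([len(s) for s in slice_urls(urls, workers=3, pool_size=2)])
```

With a smaller pool size there are more slices than workers, so each slice is shorter and no single worker is tied up with a long list.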
AUTHOR
Peter Karman, <karman at cpan.org>
BUGS
Please report any bugs or feature requests to bug-dezi-bot at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi-Bot. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Dezi::Bot
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
COPYRIGHT & LICENSE
Copyright 2013 Peter Karman.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.