NAME

dezibot - parallel web crawler

SYNOPSIS

# crawl 2 sites
% dezibot http://dezi.org http://swish-e.org

# crawl a list of sites
% dezibot --urls file_with_urls

# pass in stored config
% dezibot --config botconfig.pl

# crawl in parallel
% dezibot --workers 5 --urls file_with_urls

DESCRIPTION

dezibot is a command-line tool that wraps the Dezi::Bot module.

dezibot can:

  • read from a config file or take options on the command line

  • read URLs from a file or from @ARGV

  • spawn multiple parallel spiders

OPTIONS

The following options are supported.

--help

Print this message.

--debug

Spew lots of information to stderr. Overrides any setting in --config.

--verbose

Print some status information to stderr. Overrides any setting in --config.

--config file

Read config from file using Config::Any. The parsed config is passed directly to Dezi::Bot->new().

--urls file

Read URLs to crawl from file. Lines starting with whitespace or # are ignored.
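
As a rough illustration of the rule above, the following sketch builds a sample URL file and uses grep to preview which lines would survive. The file contents are hypothetical, and the grep pattern only approximates the documented rule; it is not dezibot's own parsing code.

```shell
# Hypothetical sample file of URLs; contents are for illustration only.
urls_file=$(mktemp)
cat > "$urls_file" <<'EOF'
# search engines
http://dezi.org
  http://example.com/ignored-because-indented
http://swish-e.org
EOF

# Approximate the documented rule: drop lines that start with
# whitespace or with '#'. This is an assumption about the filter,
# not dezibot's actual implementation.
kept=$(grep -v -E '^([[:space:]]|#)' "$urls_file")
printf '%s\n' "$kept"

rm -f "$urls_file"
```

Only the two unindented, uncommented URLs remain, so a comment or a leading space is enough to disable an entry temporarily.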

--workers n

Spawn n workers to crawl in parallel. The default is to crawl serially. If n is less than the number of URLs, the list of URLs will be sliced and apportioned among the n workers according to --pool_size.

--pool_size n

The maximum number of URLs per worker. The default is the number of URLs divided by the number of workers, but you may want to set n lower to minimize wait time between crawls.
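
The arithmetic behind --workers and --pool_size can be sketched as follows. The numbers are hypothetical and chosen to divide evenly; how dezibot rounds an uneven division, and how it hands slices to workers, is not stated here, so treat this as an illustration rather than a description of its internals.

```shell
urls=12        # hypothetical number of URLs in the list
workers=4      # value given to --workers

# Documented default: number of URLs divided by number of workers.
default_pool=$(( urls / workers ))

# A smaller, explicit --pool_size caps each slice at fewer URLs,
# so the same list is cut into more slices (ceiling division).
pool_size=2
slices=$(( (urls + pool_size - 1) / pool_size ))

echo "default pool size: $default_pool"
echo "slices with --pool_size $pool_size: $slices"
```

Under these assumptions, the corresponding invocation would be:

% dezibot --workers 4 --pool_size 2 --urls file_with_urls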

AUTHOR

Peter Karman, <karman at cpan.org>

BUGS

Please report any bugs or feature requests to bug-dezi-bot at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi-Bot. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Dezi::Bot

COPYRIGHT & LICENSE

Copyright 2013 Peter Karman.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.