NAME
WWW::Crawler::Mojo - A web crawling framework for Perl
SYNOPSIS
use strict;
use warnings;
use WWW::Crawler::Mojo;
my $bot = WWW::Crawler::Mojo->new;
$bot->on(res => sub {
my ($bot, $scrape, $job, $res) = @_;
$bot->enqueue($_) for $scrape->('#context');
});
$bot->enqueue('http://example.com/');
$bot->crawl;
DESCRIPTION
WWW::Crawler::Mojo is a web crawling framework for those who are familiar with Mojo::* APIs.
Although the module is well tested only for focused crawling at this point, you can also use it for endless crawling if you take special care of memory usage.
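As an illustration of a focused crawl, the following sketch enqueues only the newly discovered jobs that stay on one host and within a depth limit. It assumes each job exposes url (an object with a host method) and depth, as WWW::Crawler::Mojo::Job is understood to do. The names $ALLOWED_HOST, $MAX_DEPTH, in_scope and focused_res_handler are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical crawl limits; adjust them for your own site.
my $ALLOWED_HOST = 'example.com';
my $MAX_DEPTH    = 3;

# Returns true if a job stays on the allowed host and within the depth limit.
# Assumption: $job->url->host and $job->depth are available on job objects.
sub in_scope {
    my ($job) = @_;
    my $host = $job->url->host // '';
    return $host eq $ALLOWED_HOST && $job->depth <= $MAX_DEPTH;
}

# A res handler that enqueues only the in-scope jobs found on the page.
sub focused_res_handler {
    my ($bot, $scrape, $job, $res) = @_;
    $bot->enqueue($_) for grep { in_scope($_) } $scrape->();
}
```

To wire it in, pass \&focused_res_handler to $bot->on(res => ...) before calling $bot->crawl. Bounding the crawl this way is also one simple way to keep the queue, and therefore memory usage, from growing without limit.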
ATTRIBUTES
WWW::Crawler::Mojo inherits all attributes from Mojo::EventEmitter and implements the following new ones.
clock_speed
Interval of the main event loop, in milliseconds. Defaults to 0.25.
$bot->clock_speed(2);
my $clock = $bot->clock_speed; # 2
html_handlers
Sets the HTML handlers of the scraper. Defaults to WWW::Crawler::Mojo::ScraperUtil::html_handler_presets.
$bot->html_handlers( {
'a[href]' => sub { return $_[0]->{href} },
'img[src]' => sub { return $_[0]->{src} },
} );
max_conn
Maximum number of concurrent connections.
$bot->max_conn(5);
say $bot->max_conn; # 5
max_conn_per_host
Maximum number of concurrent connections per host.
$bot->max_conn_per_host(5);
say $bot->max_conn_per_host; # 5
queue
The job queue. Defaults to a WWW::Crawler::Mojo::Queue::Memory object.
$bot->queue(WWW::Crawler::Mojo::Queue::Memory->new);
$bot->queue->enqueue($job);
shuffle
An interval in seconds at which to shuffle the job queue. It is also evaluated as a boolean to enable or disable the feature. Defaults to undef, meaning disabled.
$bot->shuffle(5);
say $bot->shuffle; # 5
ua
A WWW::Crawler::Mojo::UserAgent instance.
my $ua = $bot->ua;
$bot->ua(WWW::Crawler::Mojo::UserAgent->new);
ua_name
Name of the crawler, used in the User-Agent header.
$bot->ua_name('my-bot/0.01 (+https://example.com/)');
say $bot->ua_name; # 'my-bot/0.01 (+https://example.com/)'
EVENTS
WWW::Crawler::Mojo inherits all events from Mojo::EventEmitter and implements the following new ones.
req
Emitted right before the crawler sends a request to a server. The callback takes 3 arguments.
$bot->on(req => sub {
my ($bot, $job, $req) = @_;
# DO NOTHING
});
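A req handler can be used to adjust a request before it is sent. The sketch below adds a header to every outgoing request. It assumes that $req behaves like a Mojo::Message::Request, so that headers and header are available; this document does not state the type of $req, so treat that as an assumption. The handler name tag_request and the header value are hypothetical.

```perl
use strict;
use warnings;

# A req handler that tags every outgoing request with an extra header.
# Assumption: $req provides headers->header like Mojo::Message::Request does.
sub tag_request {
    my ($bot, $job, $req) = @_;
    $req->headers->header('Accept-Language' => 'en');
}
```

Register it by passing \&tag_request to $bot->on(req => ...).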
res
Emitted when the crawler receives a response from a server. The callback takes 4 arguments.
$bot->on(res => sub {
my ($bot, $scrape, $job, $res) = @_;
if (...) {
$bot->enqueue($_) for $scrape->();
} else {
# DO NOTHING
}
});
$bot
WWW::Crawler::Mojo instance.
$scrape
Scraper code reference for the current document. The code takes a CSS selector for the context as an optional argument and returns new jobs.
for my $job ($scrape->($context)) {
$bot->enqueue($job)
}
Optionally, you can specify the scraping target container as a CSS selector, or as an array reference of selectors.
@jobs = $scrape->('#container');
@jobs = $scrape->(['#container1', '#container2']);
$job
WWW::Crawler::Mojo::Job instance.
$res
Mojo::Message::Response instance.
empty
Emitted when the queue length reaches zero.
$bot->on(empty => sub {
my ($bot) = @_;
say "Queue is drained out.";
});
error
Emitted when the user agent returns no status code for a request. This is possibly caused by network errors or unresponsive servers.
$bot->on(error => sub {
my ($bot, $error, $job) = @_;
say "error: $error";
if (...) { # until failure occurs 3 times
$bot->requeue($job);
}
});
Note that HTTP error responses such as 404 or 500 cannot be caught with this event. Use the res event for that case instead.
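Since HTTP error responses arrive through the res event, a res handler can inspect the status code itself. The following is a minimal sketch that records error responses and follows links only on other pages. It relies on $res being a Mojo::Message::Response, whose code method returns the HTTP status. The names @failed and status_aware_handler are hypothetical, and $job->url is assumed to stringify to the requested URL.

```perl
use strict;
use warnings;

my @failed;    # collects "status url" strings for later inspection

# A res handler that records HTTP error responses and scrapes
# links only from pages that did not return an error status.
sub status_aware_handler {
    my ($bot, $scrape, $job, $res) = @_;
    my $code = $res->code // 0;
    if ($code >= 400) {
        my $url = $job->url;    # assumption: stringifies to the URL
        push @failed, "$code $url";
        return;
    }
    $bot->enqueue($_) for $scrape->();
}
```

Register it by passing \&status_aware_handler to $bot->on(res => ...). After the crawl finishes, @failed holds one entry per error response.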
start
Emitted right before the crawl starts.
$bot->on(start => sub {
my $self = shift;
...
});
METHODS
WWW::Crawler::Mojo inherits all methods from Mojo::EventEmitter and implements the following new ones.
crawl
Starts the crawling loop.
$bot->crawl;
init
Initializes crawler settings.
$bot->init;
process_job
Processes a job.
$bot->process_job;
say_start
Displays starting messages to STDOUT.
$bot->say_start;
scrape
Parses a web page or CSS and discovers the links in it. With the optional argument following $job, you can specify a CSS selector for the container within which URLs are collected.
$bot->scrape($res, $job);
$bot->scrape($res, $job, $selector);
$bot->scrape($res, $job, [$selector1, $selector2]);
stop
Stops crawling.
$bot->stop;
enqueue
Appends one or more URLs or WWW::Crawler::Mojo::Job objects.
$bot->enqueue('http://example.com/index1.html');
OR
$bot->enqueue($job1, $job2);
OR
$bot->enqueue(
'http://example.com/index1.html',
'http://example.com/index2.html',
'http://example.com/index3.html',
);
requeue
Appends one or more URLs or jobs for retry. This accepts the same arguments as the enqueue method.
$bot->on(error => sub {
my ($bot, $msg, $job) = @_;
if (...) { # until failure occurs 3 times
$bot->requeue($job);
}
});
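The retry condition left open above could be filled in with a per-URL failure counter. This is only a sketch of one possible approach, not part of the module: $MAX_FAILURES, %failures, should_retry and retry_handler are hypothetical names, and $job->url is assumed to stringify to the job's URL.

```perl
use strict;
use warnings;

my $MAX_FAILURES = 3;    # hypothetical limit: give up on the 3rd failure
my %failures;            # failure count per URL

# Counts one failure for the URL and returns true while the
# total is still below $MAX_FAILURES.
sub should_retry {
    my ($url) = @_;
    return ++$failures{$url} < $MAX_FAILURES;
}

# An error handler that requeues a job until its failure budget is spent.
sub retry_handler {
    my ($bot, $error, $job) = @_;
    my $url = $job->url;    # assumption: stringifies to the URL
    if (should_retry("$url")) {
        $bot->requeue($job);
    }
    else {
        warn "giving up on $url: $error\n";
    }
}
```

Register it by passing \&retry_handler to $bot->on(error => ...). Keeping the counter keyed by URL string rather than by job object means the count survives even if the job is rebuilt when it is requeued.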
collect_urls_html
Collects URLs out of HTML.
$bot->collect_urls_html($dom, sub {
my ($uri, $dom) = @_;
});
EXAMPLE
https://github.com/jamadam/WWW-Flatten
AUTHOR
Keita Sugama, <sugama@jamadam.com>
COPYRIGHT AND LICENSE
Copyright (C) jamadam
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.