NAME
WWW::Leech::Walker - small web content grabbing framework
SYNOPSIS
  use WWW::Leech::Walker;
  use LWP::UserAgent;

  my $walker = new WWW::Leech::Walker({
      ua     => new LWP::UserAgent(),
      url    => 'http://example.tld',
      parser => $www_leech_parser_params,
      state  => {},
      logger => sub { print shift() },

      raw_preprocessor => sub {
          my $html       = shift;
          my $walker_obj = shift;

          return $html;
      },

      filter => sub {
          my $urls       = shift;
          my $walker_obj = shift;

          # ... filter urls
          return $urls;
      },

      processor => sub {
          my $data       = shift;
          my $walker_obj = shift;

          # ... process grabbed data
      }
  });

  $walker->leech();
DESCRIPTION
WWW::Leech::Walker walks through a given website, parsing content and generating structured data. Its declarative interface makes Walker something of a small framework.
This module is designed to extract data from sites with a particular structure: an index page (or any other page provided as the root) contains links to individual pages representing the items to be grabbed. The index page may also contain 'paging' links (e.g. http://example.tld/?page=2) which lead to pages with a similar structure. The closest example is a product category page with links to individual products and to its 'sub-pages'.
All required parameters are set as constructor arguments. The remaining methods are used to start or stop the grabbing process and to invoke the logger (see below).
DETAILS
- new($params)

  $params must be a hashref providing all of the required data.
- ua

  An LWP-compatible user-agent object.
- url

  The starting URL.
- post_data

  URL-encoded POST data. By default Walker fetches the list pages using the GET method; if post_data is set, POST is used instead. Requests fetching individual item pages still use GET.
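
  For example, if the item list is produced by a search form, the first request can be turned into a POST like this (the endpoint and form fields here are made up for illustration):

    # List pages will be requested via POST with this body;
    # individual item pages are still fetched via GET.
    my $walker = new WWW::Leech::Walker({
        ua        => new LWP::UserAgent(),
        url       => 'http://example.tld/search',   # hypothetical search endpoint
        post_data => 'category=books&sort=new',     # hypothetical form fields
        parser    => $www_leech_parser_params,
        processor => sub { my ($item) = @_; },      # process items as usual
    });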
- parser

  Parameters for WWW::Leech::Parser.
- state

  Optional user-filled value. Walker does not use it directly; it is passed to the user callbacks instead. Defaults to an empty hashref.
- logger

  Optional logging callback. Whenever something happens, Walker runs this subroutine, passing it a message.
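
  A minimal sketch of a logger that timestamps each message (any subroutine accepting a message string will do); it is then passed to the constructor as logger => $logger:

    use POSIX qw(strftime);

    # Prepend a timestamp to every message reported by Walker.
    my $logger = sub {
        my $message = shift;
        print strftime('[%Y-%m-%d %H:%M:%S] ', localtime()), $message, "\n";
    };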
- filter

  Optional URL-filtering callback. When Walker collects the list of item-page URLs it passes that list to this subroutine, with the Walker object as the second argument and an arrayref of the links' text as the third. Walker expects the callback to return the filtered list. An empty list is okay.
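
  A sketch of a filter that keeps only item-looking URLs and skips duplicates; the URL pattern and the 'seen' state key are assumptions, not part of the module:

    filter => sub {
        my ($urls, $walker, $link_texts) = @_;

        my @keep;
        for my $url (@$urls) {
            next unless $url =~ m{/product/\d+};         # assumed item URL layout
            next if $walker->{'state'}{'seen'}{$url}++;  # skip already-seen URLs
            push @keep, $url;
        }
        return \@keep;
    },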
- processor

  This callback is invoked after an individual item page has been parsed and converted to a hashref. The hashref is passed to the processor to be saved or processed in some other way.
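
  For instance, a processor might append each item to a JSON Lines file (a sketch; the keys of the hashref depend entirely on the parser configuration):

    use JSON::PP ();

    # Append every parsed item to a JSON Lines file.
    my $processor = sub {
        my ($item, $walker) = @_;

        open my $fh, '>>', 'items.jsonl' or die "items.jsonl: $!";
        print {$fh} JSON::PP->new->canonical->encode($item), "\n";
        close $fh;
    };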
- raw_preprocessor

  This optional callback is invoked after a page has been retrieved but before parsing starts. Walker expects it to return a scalar containing the (possibly modified) page source.
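
  A sketch of a preprocessor that cleans the raw markup before parsing (the substitution itself is only an illustration):

    raw_preprocessor => sub {
        my ($html, $walker) = @_;

        # Illustration only: strip <script> blocks that would confuse the parser.
        $html =~ s{<script\b.*?</script>}{}gis;
        return $html;
    },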
- next_page_link_post_process

  This optional callback allows the user to alter the 'next page' URL. Usually these URLs look like 'http://example.tld/list?page=2' and need no changes, but sometimes such links are javascript calls like 'javascript:gotoPageNumber(2)'. The source URL is passed as-is, before Walker absolutizes it. Walker passes the current page URL as the third argument, which may be useful for links like 'javascript:gotoNextPage()'.

  Walker expects this callback to return a fixed URL.
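
  For example, a pager built on javascript calls could be handled roughly like this (the resulting URL format is an assumption about the target site):

    next_page_link_post_process => sub {
        my ($link, $walker, $current_url) = @_;

        # Turn 'javascript:gotoPageNumber(2)' into a plain paging URL
        # (the query-string format is an assumption).
        if ($link =~ /gotoPageNumber\((\d+)\)/) {
            return "http://example.tld/list?page=$1";
        }
        return $link;
    },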
- leech()

  Starts the process.
- stop()

  Stops the process completely. By default Walker keeps working as long as there are links left. Some sites may contain zillions of pages while only the first million is required; this method allows stopping at some point. See the "CALLBACKS" section below.

  If Walker is restarted with the leech() method it runs as if it were newly created (the 'state', however, is preserved).
- log($message)

  Runs the 'logger' callback with $message as the argument.
- getCurrentDOM()

  Returns the DOM currently being processed.
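
  This can be handy inside a callback when a value not covered by the parser configuration is needed. A sketch, assuming the DOM object provides an XPath findvalue() method; check WWW::Leech::Parser for the actual document class:

    processor => sub {
        my ($item, $walker) = @_;

        # Assumption: the DOM class offers findvalue() (XML::LibXML /
        # HTML::TreeBuilder::XPath style); adjust to the real class.
        my $dom = $walker->getCurrentDOM();
        $item->{'page_title'} = $dom->findvalue('//title');

        # ... save $item as usual
    },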
CALLBACKS
Walker passes callback-specific data as the first argument, itself as the second, and, if there is any, additional data as the third.
When grabbing large sites the process should usually be stopped at some point (unless you really do need all the data). This example shows how to do that using the 'state' property and the stop() method:
  #....
  state => {total_links_amount => 0},

  filter => sub {
      my $links  = shift;
      my $walker = shift;

      if ($walker->{'state'}->{'total_links_amount'} > 1_000_000) {
          $walker->log("Million of items grabbed. Enough.");
          $walker->stop();
          return [];
      }

      $walker->{'state'}->{'total_links_amount'} += scalar(@$links);
      return $links;
  }
  #....
AUTHOR
Dmitry Selverstov
CPAN ID: JAREDSPB
jaredspb@cpan.org
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.