NAME
WWW::Leech::Parser - HTML Page parser used by WWW::Leech::Walker
SYNOPSIS
    use WWW::Leech::Parser;

    my $parser = WWW::Leech::Parser->new({
        'item:link'     => '//a[contains(@class,"item-link")]',
        'nextpage:link' => '//a[contains(@class,"next-page-link")]',
        'fields' => {
            'name'       => '//h1',
            'images[]'   => '//img/@src',
            'comments[]' => {
                type   => 'html',
                xpath  => '//div[@class="comments"]/div',
                filter => sub {
                    my $values     = shift;
                    my $field_defs = shift;
                    # ....
                    return $values;
                }
            }
            # ....
        }
    });

    my $html_string = '...';

    # links to item pages and to the next list page
    my $links_and_next_page_url = $parser->parseList($html_string);

    # data extracted according to the 'fields' rules
    my $item = $parser->parse($html_string);
DESCRIPTION
WWW::Leech::Parser extracts information from a web page using the provided XPath expressions.
Its first job is to get the links to 'sub-pages' and the link to the 'next page' from a links-list page (e.g. search engine results). It also extracts the required data from the given HTML using the rules defined when the object is created.
DETAILS
- new($rules)
-
$rules is a hashref with the following keys:
- item:link
-
XPath extracting links to sub-pages
- nextpage:link
-
XPath extracting the link to the next links-list page
- fields
-
Fields tell the parser how to extract data. They can be provided as an arrayref:
    $fields = [
        { name => 'fieldname1', xpath => '//somenode' },
        { name => 'fieldname2', xpath => '//othernode' },
    ];
Or as a hashref:
    $fields = {
        fieldname1 => '//somenode',
        fieldname2 => { xpath => '//othernode' },
    };
By default the parser uses the text of the first node found as the value of the element. Appending '[]' to the key name switches the parser to 'wantarray' mode, in which it returns an array of values.
Every element can be provided in a simple or a complex form.
The simple form is just a key-value pair where the key is the name of a field and the value is an XPath expression.
In the complex form, a hashref describing the field must be provided. The following keys are recognized:
- xpath
-
Required.
XPath expression for element data.
- type
-
Optional.
One of:
    text   - gets text content only (default)
    html   - extracts all node content as is, including the node itself
    int    - removes non-numeric characters from the text value; not applicable in 'wantarray' mode
    unique - removes duplicates; only applicable in 'wantarray' mode
- filter
-
Optional.
Coderef. The parser runs the filter callback, passing it the extracted value and the field definitions. The field value is replaced with whatever the callback returns.
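As an illustration of the filter key described above, here is a minimal sketch of a callback for a 'comments[]' field. It uses only core Perl and calls the coderef directly instead of going through the parser. The name $clean_comments is hypothetical, and the sketch assumes that in 'wantarray' mode the extracted values arrive as an arrayref, which the text above does not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical filter: trims surrounding whitespace and drops empty entries.
# ASSUMPTION: in 'wantarray' mode the values are passed as an arrayref.
my $clean_comments = sub {
    my ($values, $field_defs) = @_;    # $field_defs is unused in this sketch
    my @cleaned;
    for my $value (@$values) {
        next unless defined $value;
        (my $text = $value) =~ s/^\s+|\s+$//g;
        push @cleaned, $text if length $text;
    }
    return \@cleaned;
};

# Field definition in complex form, using the documented keys.
my $field = {
    xpath  => '//div[@class="comments"]/div',
    type   => 'html',
    filter => $clean_comments,
};

# Call the callback by hand with made-up values to see what it returns.
my $result = $field->{filter}->(
    [ "  <div>first</div> ", "", "<div>second</div>\n" ],
    $field,
);
print join("|", @$result), "\n";
```

Whatever the callback returns becomes the field value, so a filter written like this should return the same kind of structure it received.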
- parseList($html_string)
-
returns list-page links as a hashref:
    {
        links     => [...],      # array of URLs
        next_page => "/page/N"   # next page URL
    }
- parse($html_string)
-
returns a hashref with the data extracted from the page using the 'fields' section of the rules
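The hashref returned by parseList() can be turned into a list of pages to visit next. The sketch below works on a hand-written stand-in for that hashref, so it runs without the module or any network access. The helper urls_to_visit and the URLs are hypothetical, and it is an assumption that next_page is undefined on the last list page.

```perl
use strict;
use warnings;

# Stand-in for a parseList() result; the URLs are made up for illustration.
my $list = {
    links     => [ '/item/1', '/item/2' ],
    next_page => '/page/2',
};

# Hypothetical helper: item pages first, then the next list page if present.
# ASSUMPTION: next_page is undef when there is no further list page.
sub urls_to_visit {
    my ($list) = @_;
    my @queue = @{ $list->{links} || [] };
    push @queue, $list->{next_page} if defined $list->{next_page};
    return @queue;
}

my @queue = urls_to_visit($list);
print "$_\n" for @queue;
```

Each item URL in the queue would then be fetched and its HTML handed to the parser to obtain the field data.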
AUTHOR
Dmitry Selverstov
CPAN ID: JAREDSPB
jaredspb@cpan.org
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.