NAME

WWW::Leech::Parser - HTML Page parser used by WWW::Leech::Walker

SYNOPSIS

use WWW::Leech::Parser;

my $parser = WWW::Leech::Parser->new({
  'item:link' => '//a[contains(@class,"item-link")]',
  'nextpage:link' => '//a[contains(@class,"next-page-link")]',
  'fields' => {
    'name' => '//h1',
    'images[]' => '//img/@src',
    'comments[]' => {
      type => 'html',
      xpath => '//div[@class="comments"]/div',
      filter => sub {
        my $values = shift;      # extracted value(s) for this field
        my $field_defs = shift;  # field definitions

        # ....

        return $values;
      }
    }
    # ....
  }
});

my $html_string = '...';

my $links_and_next_page_url = $parser->parseList($html_string);

my $item = $parser->parse($html_string);

DESCRIPTION

WWW::Leech::Parser extracts information from a web page using the provided XPath expressions.

Its primary use is to get the links to 'sub-pages' and the link to the 'next page' from a links-list page (e.g. search engine results). It also extracts the required data from a given HTML page using the rules defined when the object is created.

DETAILS

new($rules)

$rules is a hashref with the following keys:

item:link

XPath extracting links to sub-pages

nextpage:link

XPath extracting the link to the next links-list page

fields

Fields tell the parser how to extract data. They can be provided as an arrayref:

$fields = [
  {
    name => 'fieldname1',
    xpath => '//somenode'
  },
  {
    name => 'fieldname2',
    xpath => '//othernode'
  }
]

Or a hashref:

$fields = {
  fieldname1 => '//somenode',
  fieldname2 => {
    xpath => '//othernode'
  }
}

By default the parser uses the text of the first matching node as the field's value. Appending '[]' to the key name switches the parser to 'wantarray' mode, in which it returns an array of values instead.
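The two accepted shapes of 'fields' and the '[]' suffix can be pictured with a small normalization routine. This is only a sketch in plain Perl: normalize_fields and the 'wantarray' flag it sets are hypothetical names invented for illustration, not part of WWW::Leech::Parser, and the module's real internals may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): turns either documented
# form of 'fields' into a list of uniform hashrefs.
sub normalize_fields {
    my ($fields) = @_;
    my @defs;

    if ( ref $fields eq 'ARRAY' ) {
        # Arrayref form: each element already carries its own name.
        @defs = map { +{ %$_ } } @$fields;
    }
    else {
        # Hashref form: the key is the name; a plain string value is
        # treated as shorthand for { xpath => $string }.
        for my $name ( sort keys %$fields ) {
            my $value = $fields->{$name};
            my %def = ref $value eq 'HASH' ? %$value : ( xpath => $value );
            push @defs, { %def, name => $name };
        }
    }

    for my $def (@defs) {
        # A trailing '[]' on the name requests an array of values.
        $def->{wantarray} = ( $def->{name} =~ s/\[\]\z// ) ? 1 : 0;
    }
    return \@defs;
}

my $defs = normalize_fields({
    'name'       => '//h1',
    'images[]'   => '//img/@src',
    'comments[]' => { type => 'html', xpath => '//div[@class="comments"]/div' },
});

printf "%-10s wantarray=%d xpath=%s\n", @$_{qw(name wantarray xpath)}
    for @$defs;
```

Keeping one uniform shape means the rest of an extraction loop only has to look at 'xpath', 'type', 'filter' and the array flag, whichever form the caller chose.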

Every element can be provided in a simple or a complex form.

The simple form is just a key-value pair where the key is the name of a field and the value is an XPath expression.

In the complex form, a hashref describing the field must be provided. The following keys are recognized:

xpath

Required.

XPath expression for element data.

type

Optional.

text - gets text content only (default)
html - extracts the whole node as is, including the node itself
int - not applicable in 'wantarray' mode - removes non-numeric characters from the text value
unique - only applicable in 'wantarray' mode - removes duplicates

filter

Optional.

Coderef. The parser runs the filter callback, passing it the extracted value and the field definitions. The field value is replaced with whatever the callback returns.
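The post-processing described for the 'int' and 'unique' types and for the filter callback can be imitated in plain Perl. The sketch below does not call the module: as_int and as_unique are hypothetical stand-ins, the digits-only rule and the first-seen ordering are assumptions, and the empty hashref passed as field definitions is a placeholder because the exact structure the parser passes is not described here.

```perl
use strict;
use warnings;

# Stand-in for the 'int' type: keep digits only (assumption: signs and
# decimal points count as non-numeric and are dropped as well).
sub as_int {
    my ($text) = @_;
    ( my $digits = $text ) =~ s/\D//g;
    return $digits;
}

# Stand-in for the 'unique' type: drop duplicates, keeping first-seen
# order (order preservation is an assumption).
sub as_unique {
    my ($values) = @_;
    my %seen;
    return [ grep { !$seen{$_}++ } @$values ];
}

# A filter callback taking the documented arguments: the extracted value
# (assumed to be an arrayref in 'wantarray' mode) and the field definitions.
my $trim_filter = sub {
    my ( $values, $field_defs ) = @_;
    return [ map { my $v = $_; $v =~ s/^\s+|\s+$//g; $v } @$values ];
};

print as_int('Price: 1 299 USD'), "\n";                                   # 1299
print join( ',', @{ as_unique( [qw(a.png b.png a.png)] ) } ), "\n";       # a.png,b.png
print join( '|', @{ $trim_filter->( [ "  first ", "second\n" ], {} ) } ), "\n";  # first|second
```

A callback like $trim_filter would be supplied under the 'filter' key of a field given in the complex form.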

parseList($html_string)

Returns list-page links as a hashref:

{
  links => [...], # array of URLs
  links_text => [...], # Text inside corresponding 'a' tags
  next_page => "/page/N" # next page URL
}

parse($html_string)

Returns a hashref with the data extracted from the page using the 'fields' section of the rules.
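The two return values can be consumed with ordinary hashref handling. The following sketch works on hand-written sample data shaped like the structures described above, so it runs without the module. list_lines and item_lines are hypothetical helpers, and two points are assumptions: that 'next_page' is undefined when there is no further page, and that result keys drop the '[]' suffix.

```perl
use strict;
use warnings;

# Pair each link with its text; 'links' and 'links_text' are parallel arrays.
sub list_lines {
    my ($list) = @_;
    my @lines;
    for my $i ( 0 .. $#{ $list->{links} } ) {
        push @lines, sprintf '%s -> %s', $list->{links_text}[$i], $list->{links}[$i];
    }
    # Assumption: next_page is undef on the last list page.
    push @lines, sprintf 'next: %s', $list->{next_page}
        if defined $list->{next_page};
    return @lines;
}

# Render an item hashref; array-valued fields come from 'wantarray' mode.
sub item_lines {
    my ($item) = @_;
    my @lines;
    for my $field ( sort keys %$item ) {
        my $value = $item->{$field};
        my $shown = ref $value eq 'ARRAY' ? join( ', ', @$value ) : $value;
        push @lines, sprintf '%s: %s', $field, $shown;
    }
    return @lines;
}

# Hypothetical sample data in the shape parseList() and parse() are
# described as returning.
my $list = {
    links      => [ '/item/1', '/item/2' ],
    links_text => [ 'First item', 'Second item' ],
    next_page  => '/page/2',
};
my $item = {
    name   => 'Sample item',
    images => [ '/img/1.png', '/img/2.png' ],
};

print "$_\n" for list_lines($list), item_lines($item);
```

In real use, $list and $item would come from $parser->parseList($html_string) and $parser->parse($html_string) respectively.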

AUTHOR

Dmitry Selverstov
CPAN ID: JAREDSPB
jaredspb@cpan.org

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.