NAME

HTML::Feature - Extract Feature Sentences From HTML Documents

SYNOPSIS

use HTML::Feature;

# simple usage

my $feature = HTML::Feature->new;
my $result  = $feature->parse("http://www.perl.com");

print "Title:"        , $result->title, "\n";
print "Description:"  , $result->desc , "\n";
print "Featured Text:", $result->text , "\n";


# You can chain several engine modules. If one module cannot extract any text, the next one is tried.

my $feature = HTML::Feature->new( 
  engines => [
    'HTML::Feature::Engine::LDRFullFeed',
    'HTML::Feature::Engine::GoogleADSection',
    'HTML::Feature::Engine::TagStructure',
  ],
);

my $result = $feature->parse($url);


# You can also place your own custom engine module anywhere in the list.

my $feature = HTML::Feature->new( 
  engines => [
    'Your::Custom::Engine::Module'
  ],
);

DESCRIPTION

This module extracts blocks of feature sentences from an HTML document.

As of version 3.0, three engines are provided.

1. LDRFullFeed

  Uses wedata's database, which is compatible with LDR Full Feed.
    see -> http://wedata.net/help/about ( Japanese only )

2. GoogleADSection

  Extracts the text marked by 'Google AD Section' HTML comment tags.

3. TagStructure

  Default engine. It guesses and extracts feature sentences from the HTML tag structure.
  Unlike other modules that perform similar tasks, this engine extracts blocks without morphological analysis and relies on simple statistical processing instead.
  Because of this, HTML::Feature::Engine::TagStructure has an advantage over similar modules: it can be applied to documents in any language.
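To illustrate the idea behind the GoogleADSection engine, here is a minimal sketch that collects the text between google_ad_section_start and google_ad_section_end comment markers. It is not the engine's actual code: the helper name extract_ad_sections is hypothetical, and the tag stripping is deliberately naive.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::Feature: returns the text found
# between <!-- google_ad_section_start --> and <!-- google_ad_section_end -->.
sub extract_ad_sections {
    my ($html) = @_;
    my @sections;
    while ( $html =~ m{<!--\s*google_ad_section_start\s*-->(.*?)<!--\s*google_ad_section_end\s*-->}gs ) {
        my $chunk = $1;
        $chunk =~ s/<[^>]+>//g;      # drop tags (naive, for illustration only)
        $chunk =~ s/\s+/ /g;         # collapse runs of whitespace
        $chunk =~ s/^\s+|\s+$//g;    # trim both ends
        push @sections, $chunk if length $chunk;
    }
    return @sections;
}

my $html = <<'HTML';
<html><body>
<div id="nav">Home | About</div>
<!-- google_ad_section_start -->
<p>Main article <b>text</b>.</p>
<!-- google_ad_section_end -->
<div id="footer">Copyright</div>
</body></html>
HTML

my @found = extract_ad_sections($html);
print "$_\n" for @found;
```

Navigation and footer text fall outside the markers, so only the article body is printed. Pages without the markers yield an empty list, which is the case where a following engine in the chain would be needed.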

METHODS

new

Instantiates a new HTML::Feature object. Takes the following parameters:

 my $f = HTML::Feature->new(%param);

 my $f = HTML::Feature->new(
     engines      => [ class_name1, 
                       class_name2,       # backend engine module (default: 'TagStructure') 
                       class_name3 ], 

     user_agent   => 'my-agent-name',     # LWP::UserAgent->agent (default: 'libwww-perl/#.##') 
     http_proxy   => 'http://proxy:3128', # http proxy server (default: '')
     timeout      => 10,                  # set the timeout value in seconds. (default: 180)

     not_decode   => 1,                   # if this value is 1, HTML::Feature does not decode the HTML document (default: '')
     not_encode   => 1,                   # if this value is 1, HTML::Feature does not encode the result value  (default: '') 

     element_flag => 1,                   # if this value is 1, an HTML::Element object is returned as part of the result (default: '') 
 );

engines

Specifies the class names of the engines that you want to use.

HTML::Feature is designed to accept several different engines. To customize the behavior of HTML::Feature, specify your own engine class in this parameter.
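The fallback behavior of an engine list can be pictured with the following sketch. It assumes nothing about the real engine interface: each engine is modeled as a plain code reference, and the names run_engines, first, second and third are hypothetical.

```perl
use strict;
use warnings;

# Each "engine" is a [name, code reference] pair standing in for an engine
# class. The names and the return convention are assumptions for illustration.
my @engines = (
    [ first  => sub { return undef } ],    # extracts nothing
    [ second => sub { return '' } ],       # extracts an empty string
    [ third  => sub { return 'text from the third engine' } ],
);

# Try each engine in order and stop at the first one that yields text.
sub run_engines {
    my ( $html, @chain ) = @_;
    for my $entry (@chain) {
        my ( $name, $code ) = @$entry;
        my $text = $code->($html);
        return ( $name, $text ) if defined $text && length $text;
    }
    return;    # no engine produced any text
}

my ( $used, $text ) = run_engines( '<html>...</html>', @engines );
print "$used: $text\n";
```

Ordering matters in such a chain: the more specific engines go first, and a general-purpose one goes last as a catch-all.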

parse

my $result = $f->parse($url);
# or
my $result = $f->parse($html_ref);
# or
my $result = $f->parse($http_response);

Parses the given argument. The argument can be a URL, a string of HTML (which must be passed as a scalar reference), or an HTTP::Response object. HTML::Feature detects the argument type and delegates to the appropriate method (see below).
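The three argument types can be told apart roughly as follows. This is only a sketch of the kind of check involved, not the module's actual dispatch code; pick_parser is a hypothetical helper, and the response object is a stand-in blessed hash rather than a real HTTP::Response.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper: reports which parse_* method an argument would be
# routed to, based only on the argument types described above.
sub pick_parser {
    my ($arg) = @_;
    return 'parse_response' if blessed($arg) && $arg->isa('HTTP::Response');
    return 'parse_html'     if ref($arg) eq 'SCALAR';
    return 'parse_url';
}

my $html = '<html><body><p>Hello</p></body></html>';

# Stand-in object for illustration; the HTTP::Response module is not loaded.
my $fake_response = bless {}, 'HTTP::Response';

print pick_parser('http://www.perl.com'), "\n";
print pick_parser( \$html ),              "\n";
print pick_parser($fake_response),        "\n";
```

Passing HTML as a scalar reference is what makes it distinguishable from a URL, since both would otherwise be plain strings.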

parse_url($url)

Parses a URL. This method uses LWP::UserAgent to fetch the given URL.

parse_html($html)

Parses a string containing HTML.

parse_response($http_response)

Parses an HTTP::Response object.

front_parser

Accessor method that returns the HTML::Feature::FrontParser object.

engine

Accessor method that returns the HTML::Feature::Engine object.

AUTHOR

Takeshi Miki <miki@cpan.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO