NAME
HTML::Feature - Extract Feature Sentences From HTML Documents
SYNOPSIS
use HTML::Feature;
# simple usage
my $feature = HTML::Feature->new;
my $result = $feature->parse("http://www.perl.com");
print "Title:" , $result->title, "\n";
print "Description:" , $result->desc , "\n";
print "Featured Text:", $result->text , "\n";
# You can specify several engine modules, which are tried in order.
# (If one module cannot extract text, the next one is called.)
my $feature = HTML::Feature->new(
    engines => [
        'HTML::Feature::Engine::LDRFullFeed',
        'HTML::Feature::Engine::GoogleADSection',
        'HTML::Feature::Engine::TagStructure',
    ],
);
my $result = $feature->parse($url);
# You can also place your own custom engine module anywhere in the list.
my $feature = HTML::Feature->new(
    engines => [
        'Your::Custom::Engine::Module'
    ],
);
DESCRIPTION
This module extracts blocks of feature sentences out of an HTML document.
As of version 3.0, three engines are provided.
1. LDRFullFeed
Uses wedata's database, which is compatible with LDR Full Feed.
See http://wedata.net/help/about (Japanese only).
2. GoogleADSection
Extracts text using the 'Google AD Section' HTML comment tags.
3. TagStructure
The default engine. It guesses and extracts feature sentences based on the HTML tag structure.
Unlike other modules that perform similar tasks, this module by default extracts blocks without using morphological analysis; instead, it uses simple statistical processing.
Because of this, HTML::Feature::Engine::TagStructure has an advantage over other similar modules in that it can be applied to documents in any language.
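As a rough illustration of the comment-tag approach described for the second engine, the sketch below pulls out the text enclosed between Google's section-targeting comments. It is not the code of HTML::Feature::Engine::GoogleADSection; the helper name extract_ad_sections and the crude tag stripping are assumptions made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Feature): returns the text found
# between Google section-targeting comments in an HTML string.
sub extract_ad_sections {
    my ($html) = @_;
    my @blocks;
    while ( $html =~ m{<!--\s*google_ad_section_start\s*-->(.*?)<!--\s*google_ad_section_end\s*-->}gs ) {
        my $block = $1;
        $block =~ s/<[^>]+>//g;      # crude tag stripping, for illustration only
        $block =~ s/\s+/ /g;         # collapse runs of whitespace
        $block =~ s/^\s+|\s+$//g;    # trim both ends
        push @blocks, $block if length $block;
    }
    return join "\n", @blocks;
}

my $html = <<'HTML';
<html><body>
<div id="nav">Home | About</div>
<!-- google_ad_section_start -->
<p>This is the <b>main</b> article text.</p>
<!-- google_ad_section_end -->
<div id="footer">Copyright</div>
</body></html>
HTML

print extract_ad_sections($html), "\n";    # This is the main article text.
```

A page without these comments yields an empty string here, which is the kind of situation in which falling back to another engine becomes useful.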
METHODS
new
Instantiates a new HTML::Feature object. Takes the following parameters:
my $f = HTML::Feature->new(%param);
my $f = HTML::Feature->new(
    engines => [ class_name1,
                 class_name2,   # backend engine module (default: 'TagStructure')
                 class_name3 ],
    user_agent   => 'my-agent-name',      # LWP::UserAgent->agent (default: 'libwww-perl/#.##')
    http_proxy   => 'http://proxy:3128',  # http proxy server (default: '')
    timeout      => 10,                   # set the timeout value in seconds. (default: 180)
    not_decode   => 1,                    # if this value is 1, HTML::Feature does not decode the HTML document (default: '')
    not_encode   => 1,                    # if this value is 1, HTML::Feature does not encode the result value (default: '')
    element_flag => 1,                    # flag of HTML::Element object as returned value (default: '')
);
- engines
Specifies the class names of the engines that you want to use, as an array reference.
HTML::Feature is designed to accept several different engines. If you want to customize the behavior of HTML::Feature, specify your own engine in this parameter.
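The idea of trying engines one after another can be sketched as follows. This is a simplified illustration and not HTML::Feature's internal code: the engines here are hypothetical closures rather than classes, and the names run_engines, comment_marker and first_paragraph are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for engines: each takes an HTML string and returns
# the extracted text, or undef when it finds nothing. Real engines are
# classes; closures keep this sketch self-contained.
my @engines = (
    [ comment_marker => sub {
        my ($html) = @_;
        return $html =~ m{<!--\s*start\s*-->(.*?)<!--\s*end\s*-->}s ? $1 : undef;
    } ],
    [ first_paragraph => sub {
        my ($html) = @_;
        return $html =~ m{<p>(.*?)</p>}s ? $1 : undef;
    } ],
);

# Try each engine in order and stop at the first one that yields text.
sub run_engines {
    my ($html) = @_;
    for my $entry (@engines) {
        my ( $name, $code ) = @$entry;
        my $text = $code->($html);
        return ( $name, $text ) if defined $text && length $text;
    }
    return;    # no engine matched
}

my ( $name, $text ) = run_engines('<div><p>Hello, world.</p></div>');
print "$name: $text\n";    # first_paragraph: Hello, world.
```

Ordering the list from the most specific extractor to the most general one means the general engine only runs when the specific ones find nothing.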
parse
my $result = $f->parse($url);
# or
my $result = $f->parse($html_ref);
# or
my $result = $f->parse($http_response);
Parses the given argument. The argument can be a URL, a string of HTML (which must be passed as a scalar reference), or an HTTP::Response object. HTML::Feature detects the argument type and delegates to the appropriate method (see below).
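One plausible way to tell the three argument types apart is shown below. This is only a sketch of the detection step under that assumption, not the module's actual implementation; classify_argument is a hypothetical helper name.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical classifier (not HTML::Feature's own code): names the kind of
# argument that a parse()-style method has been given.
sub classify_argument {
    my ($arg) = @_;
    return 'response' if blessed($arg) && $arg->isa('HTTP::Response');
    return 'html'     if ref($arg) eq 'SCALAR';
    return 'url'      if defined $arg && !ref($arg);
    die "Unsupported argument type\n";
}

my $html = '<html><body><p>Hi</p></body></html>';
print classify_argument('http://www.perl.com'), "\n";    # url
print classify_argument( \$html ), "\n";                  # html
```

Requiring a scalar reference for HTML is what makes this check unambiguous: a plain string can then always be treated as a URL.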
parse_url($url)
Parses a URL. This method uses LWP::UserAgent to fetch the given URL.
parse_html($html)
Parses a string containing HTML.
parse_response($http_response)
Parses an HTTP::Response object.
front_parser
Accessor method that returns the HTML::Feature::FrontParser object.
engine
Accessor method that returns the HTML::Feature::Engine object.
AUTHOR
Takeshi Miki <miki@cpan.org>
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.