NAME
HTML::Feature - Extract Feature Sentences From HTML Documents
SYNOPSIS
use HTML::Feature;
my $f = HTML::Feature->new(enc_type => 'utf8');
my $result = $f->parse('http://www.perl.com');
# or $f->parse($html);
print "Title:" , $result->title(), "\n";
print "Description:" , $result->desc(), "\n";
print "Featured Text:", $result->text(), "\n";
print "HTML Element:", $result->element->as_HTML, "\n";
# a simpler method is,
use HTML::Feature qw(feature);
print scalar feature('http://www.perl.com');
# very simple!
DESCRIPTION
This module extracts blocks of feature sentences from an HTML document.
Unlike other modules that perform similar tasks, this module by default extracts blocks without morphological analysis; it uses simple statistical processing instead.
Because of this, HTML::Feature has an advantage over other similar modules in that it can be applied to documents in any language.
METHODS
new()
my $f = HTML::Feature->new(%param);
my $f = HTML::Feature->new(
engine => $class, # backend engine module (default: 'TagStructure')
max_bytes => 5000, # maximum number of bytes per node to analyze (default: '')
min_bytes => 10, # minimum number of bytes per node to analyze (default: '')
enc_type => 'euc-jp', # encoding of return values (default: 'utf-8')
http_proxy => 'http://proxy:3128', # http proxy server (default: '')
);
Instantiates a new HTML::Feature object. Takes the following parameters:
- engine
Specifies the class name of the engine that you want to use.
HTML::Feature is designed to accept different engines that change its behavior. To customize the behavior of HTML::Feature, specify your own engine in this parameter.
The rest of the arguments are directly passed to the HTML::Feature::Engine object constructor.
parse()
my $result = $f->parse($url);
# or
my $result = $f->parse($html);
# or
my $result = $f->parse($http_response);
Parses the given argument, which can be a URL, a string of HTML, or an HTTP::Response object. HTML::Feature detects the argument type and delegates to the appropriate method (see below).
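The dispatch described above can be pictured with the following minimal sketch. It is not the module's actual implementation: guess_parse_method() is a hypothetical helper, and the rule that a string starting with http:// or https:// counts as a URL is an assumption made for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper, not part of HTML::Feature: classify an argument and
# return the name of the parse_* method that would handle it.
sub guess_parse_method {
    my ($arg) = @_;

    # An object that is (or inherits from) HTTP::Response.
    return 'parse_response'
        if blessed($arg) && $arg->isa('HTTP::Response');

    # Assumption: a plain string starting with http:// or https:// is a URL.
    return 'parse_url'
        if defined $arg && !ref($arg) && $arg =~ m{\A https?:// }xi;

    # Anything else is treated as a string of HTML.
    return 'parse_html';
}

print guess_parse_method('http://www.perl.com'), "\n";
print guess_parse_method('<html><body>Hi</body></html>'), "\n";
```

If you already know what kind of input you have, calling parse_url(), parse_html(), or parse_response() directly avoids any guessing.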
parse_url($url)
Parses a URL. This method uses LWP::UserAgent to fetch the given URL.
parse_html($html)
Parses a string containing HTML.
parse_response($http_response)
Parses an HTTP::Response object.
extract()
$data = $f->extract(url => $url);
# or
$data = $f->extract(string => $html);
HTML::Feature::extract() has been deprecated and exists for backwards compatibility only. Use HTML::Feature::parse() instead.
extract() extracts blocks of feature sentences from the given document, and returns a data structure like this:
$data = {
title => $title,
description => $desc,
block => [
{
contents => $contents,
score => $score
},
.
.
]
}
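As an illustration of working with this structure, the sketch below picks a single block out of the list. The sample data is made up, best_block() is a hypothetical helper, and the idea that a higher score marks a more relevant block is an assumption that this document does not state.

```perl
use strict;
use warnings;

# Made-up sample data in the shape documented above; real values would come
# from extract().
my $data = {
    title       => 'Example page',
    description => 'An example description',
    block       => [
        { contents => 'Navigation links',     score => 1.5 },
        { contents => 'Main article text',    score => 9.2 },
        { contents => 'Footer and copyright', score => 0.8 },
    ],
};

# Hypothetical helper: return the block with the highest score, or undef if
# there are no blocks. Assumes a higher score means a more relevant block.
sub best_block {
    my ($data) = @_;
    my ($best) = sort { $b->{score} <=> $a->{score} } @{ $data->{block} || [] };
    return $best;
}

my $best = best_block($data);
print $best->{contents}, "\n" if $best;
```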
feature()
feature() is a simple wrapper that performs new() and parse() in one step. If you do not need complex operations, calling this function is sufficient. In scalar context, it returns only the feature text. In list context, it returns additional metadata as a hash.
This function is exported on demand.
use HTML::Feature qw(feature);
print scalar feature($url); # print featured text
my %data = feature($url); # list context: returns a hash
print $data{title};
print $data{desc};
print $data{text};
print $data{element}->as_HTML;
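The context-sensitive return value can be illustrated with Perl's built-in wantarray. The following is only a sketch of the technique: feature_like() is a hypothetical stand-in with hard-coded values, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for feature(): the result is hard-coded here, whereas
# the real function would obtain it through new() and parse().
sub feature_like {
    my %result = (
        title => 'Example title',
        desc  => 'Example description',
        text  => 'Example featured text',
    );

    # wantarray is true in list context and false (but defined) in scalar
    # context, so the caller's context selects the return value.
    return wantarray ? %result : $result{text};
}

my $text = feature_like();   # scalar context: the text only
my %data = feature_like();   # list context: key/value pairs
print "$text\n";
print "$data{title}\n";
```

This is why the earlier example writes scalar feature($url): print supplies list context to its arguments, so scalar is needed to request the text alone.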
AUTHOR
Takeshi Miki <miki@cpan.org>
Special thanks to Daisuke Maki
COPYRIGHT AND LICENSE
Copyright (C) 2007 Takeshi Miki
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.