NAME

HTML::ContentExtractor - extract the main content from a web page by analysing the DOM tree

VERSION

Version 0.03

SYNOPSIS

use HTML::ContentExtractor;
use LWP::UserAgent;

my $extractor = HTML::ContentExtractor->new();
my $agent = LWP::UserAgent->new;

my $url='http://sports.sina.com.cn/g/2007-03-23/16572821174.shtml';
my $res=$agent->get($url);
my $HTML = $res->decoded_content();

$extractor->extract($url,$HTML);
print $extractor->as_html();
print $extractor->as_text();

DESCRIPTION

Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts the user from the actual content. This module reduces the noise content in web pages and thus identifies the content-rich regions.

A web page is first parsed by an HTML parser, which corrects the markup and creates a DOM (Document Object Model) tree. The tree is then traversed depth-first; noise nodes are identified and removed, and what remains is the main content. Three kinds of nodes are removed: useless nodes (script, style, etc.); container nodes (table, div, etc.) whose link/text ratio is higher than a threshold, where the link/text ratio is the ratio of the number of links to the number of non-linked words; and nodes that contain any string from the predefined spam string list.
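The pruning pass described above can be sketched in plain Perl. This is only an illustration under stated assumptions, not the module's actual code: the hash-based node layout, the helper names (count_links_and_words, is_noise, prune, text_of), the single spam string, the choice to test spam strings against a node's own text only, and the handling of containers that have links but no plain words are all hypothetical.

```perl
use strict;
use warnings;

# Toy DOM node: { tag => ..., text => ..., children => [...] }.
# A stand-in for a real parser tree, NOT the module's internal representation.
my %IGNORE    = map { $_ => 1 } qw(script style);   # "useless" tags
my %CONTAINER = map { $_ => 1 } qw(table div);      # container tags
my $RATIO     = 0.05;                               # documented default threshold
my @SPAM      = ('Copyright');                      # hypothetical spam string

# Count links and non-linked words at and below a node.
sub count_links_and_words {
    my ($node, $in_link) = @_;
    $in_link ||= $node->{tag} eq 'a';
    my $links = $node->{tag} eq 'a' ? 1 : 0;
    my $words = 0;
    if (!$in_link && defined $node->{text}) {
        my @w = split ' ', $node->{text};
        $words = @w;
    }
    for my $child (@{ $node->{children} || [] }) {
        my ($l, $w) = count_links_and_words($child, $in_link);
        $links += $l;
        $words += $w;
    }
    return ($links, $words);
}

# Decide whether a node is noise under the three rules.
sub is_noise {
    my ($node) = @_;
    return 1 if $IGNORE{ $node->{tag} };
    my $own = defined $node->{text} ? $node->{text} : '';
    return 1 if grep { index($own, $_) >= 0 } @SPAM;   # simplification: own text only
    if ($CONTAINER{ $node->{tag} }) {
        my ($links, $words) = count_links_and_words($node);
        # Assumption: a container with links but no plain words counts as noise.
        return 1 if $links && ($words == 0 || $links / $words > $RATIO);
    }
    return 0;
}

# Depth-first traversal that drops noise children in place.
sub prune {
    my ($node) = @_;
    my @kept;
    for my $child (@{ $node->{children} || [] }) {
        next if is_noise($child);
        prune($child);
        push @kept, $child;
    }
    $node->{children} = \@kept;
    return $node;
}

# Collect the remaining text in document order.
sub text_of {
    my ($node) = @_;
    my @parts = defined $node->{text} ? ($node->{text}) : ();
    push @parts, map { text_of($_) } @{ $node->{children} || [] };
    return join ' ', @parts;
}

my $page = { tag => 'body', children => [
    { tag => 'div', children => [
        { tag => 'a', text => 'Home' },
        { tag => 'a', text => 'Sports' },
        { tag => 'a', text => 'News' },
    ] },
    { tag => 'script', text => 'var x = 1;' },
    { tag => 'div', children => [
        { tag => 'p', text => 'The home side won the match after a late goal in the second half.' },
    ] },
    { tag => 'p', text => 'Copyright 2007 Example Site' },
] };

my $main = text_of(prune($page));
print "$main\n";   # prints the article sentence only
```

With the default threshold of 0.05, a container needs more than 20 non-linked words per link to survive, which is why link-heavy navigation blocks are dropped while article text is kept.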

Note that the input HTML should be encoded in UTF-8, and so should the spam words. This lets the module handle web pages in any language (I've used it to process English, Chinese, and Japanese web pages).
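If a page arrives in another encoding, it can be converted with the core Encode module before extraction. The sketch below is a minimal illustration: the Latin-1 sample bytes and the variable names are made up, and this document does not say whether the module expects a decoded character string (as returned by decoded_content in the synopsis) or UTF-8 octets, so both forms are shown.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the bytes of "café" in ISO-8859-1 (Latin-1).
my $raw = "caf\xE9";

# Decode the legacy bytes into a Perl character string ...
my $chars = decode('ISO-8859-1', $raw);

# ... and, if UTF-8 octets are needed, encode the characters as UTF-8.
my $utf8 = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8);
# prints "4 characters, 5 UTF-8 bytes"
```

The same conversion applies to spam words read from a file in a legacy encoding.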

$e = HTML::ContentExtractor->new(%options);

Constructs a new HTML::ContentExtractor object. The optional %options hash can be used to set the options listed below.

$e->table_tags();
$e->table_tags(@tags);
$e->table_tags(\@tags);

Gets or sets the array of table tags. These tags are treated as container tags.

$e->ignore_tags();
$e->ignore_tags(@tags);
$e->ignore_tags(\@tags);

Gets or sets the array of ignore tags. Elements with these tags are removed.

$e->spam_words();
$e->spam_words(@strings);
$e->spam_words(\@strings);

Gets or sets the list of spam words. Elements that contain any of these strings are removed.

The link/text ratio threshold can also be read or set; the default is 0.05.

$e->min_text_len();
$e->min_text_len($len);

Gets or sets the minimum text length; the default is 20. An element whose text is shorter than this value is removed.

$e->extract($url,$HTML);

Performs the extraction. Note that the input $HTML must be encoded in UTF-8.

$e->as_html();

Returns the extraction result in HTML format.

$e->as_text();

Returns the extraction result in text format.
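The methods above might be combined as follows on an in-memory page, without any network access. This is a sketch under assumptions: the sample HTML, the example.com URL and the chosen option values are made up, and it is assumed that calling a setter replaces the current value rather than appending to it. The script skips the extraction when the module is not installed.

```perl
use strict;
use warnings;

# Skip gracefully when HTML::ContentExtractor is not installed.
my $have_module = eval { require HTML::ContentExtractor; 1 };

# Hypothetical sample page (plain ASCII, so it is valid UTF-8).
my $html = <<'HTML';
<html><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/sports">Sports</a></div>
<div><p>The home side won the match after a late goal in the second half,
and the coach praised the players for their patience and discipline.</p></div>
<p>Copyright 2007 Example Site</p>
</body></html>
HTML

if ($have_module) {
    my $e = HTML::ContentExtractor->new();

    # Tune the documented options through their accessors.
    $e->min_text_len(10);
    $e->spam_words('Copyright', 'All rights reserved');   # assumed to replace the list

    # The URL is a placeholder for the page's address.
    $e->extract('http://example.com/article.html', $html);
    print $e->as_text(), "\n";
}
else {
    print "HTML::ContentExtractor is not installed; skipping.\n";
}
```

Reading the accessors back without arguments (for example $e->min_text_len()) is a quick way to confirm which values are in effect before calling extract.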

AUTHOR

Zhang Jun, <jzhang533 at gmail.com>

COPYRIGHT & LICENSE

Copyright 2007 Zhang Jun, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.