NAME

HTML::ExtractMain - Extract the main content of a web page

VERSION

Version 0.63

SYNOPSIS

    use HTML::ExtractMain qw( extract_main_html );

    my $html = <<'END';
    <div id="header">Header</div>
    <div id="nav"><a href="/">Home</a></div>
    <div id="body">
        <p>Foo</p>
        <p>Baz</p>
    </div>
    <div id="footer">Footer</div>
    END

    my $main_html = extract_main_html($html, output_type => 'xhtml');
    if (defined $main_html) {
        # do something with $main_html here
        # $main_html is '<div id="body"><p>Foo</p><p>Baz</p></div>'
    }

EXPORT

extract_main_html is exported only on request.

FUNCTIONS

extract_main_html

extract_main_html takes HTML content and uses the Readability algorithm to detect the main body of the page, usually skipping headers, footers, navigation, and similar elements.

The first argument is either an HTML string or an HTML::TreeBuilder tree. (If passed a tree, the tree will be modified and destroyed, so do not reuse it afterwards.)
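As a minimal sketch of the tree form of the first argument, the example below parses a page with HTML::TreeBuilder's new_from_content constructor and hands the resulting tree to extract_main_html. The sample markup and the variable names are illustrative assumptions, not part of this module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::ExtractMain qw( extract_main_html );

# Illustrative sample page (an assumption for this sketch).
my $html = <<'END';
<div id="header">Header</div>
<div id="body">
    <p>Foo</p>
    <p>Baz</p>
</div>
<div id="footer">Footer</div>
END

# Parse the page ourselves, then pass the tree instead of a string.
my $tree = HTML::TreeBuilder->new_from_content($html);

# No output_type is given, so the default (xhtml) applies.
my $main_html = extract_main_html($tree);

# $tree has been modified and destroyed by the call above;
# do not use it again past this point.
if ( defined $main_html ) {
    print $main_html, "\n";
}
else {
    print "No main content found\n";
}
```

If the tree is still needed afterwards, one option is to keep the original HTML string and parse it a second time.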

Remaining arguments are optional and represent key/value options. The available options are:

output_type

This determines the format of the returned data. If not specified, xhtml is used. Valid formats are:

xhtml
html
tree

If tree is selected, then an HTML::Element object will be returned instead of a string.
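The tree output type can be sketched as follows. The example requests an HTML::Element and then calls as_trimmed_text and as_HTML, which are standard HTML::Element methods rather than anything specific to this module. The sample markup is an illustrative assumption.

```perl
use strict;
use warnings;
use HTML::ExtractMain qw( extract_main_html );

# Illustrative sample page (an assumption for this sketch).
my $html = <<'END';
<div id="header">Header</div>
<div id="body">
    <p>Foo</p>
    <p>Baz</p>
</div>
<div id="footer">Footer</div>
END

# Ask for an HTML::Element object rather than a serialized string.
my $element = extract_main_html( $html, output_type => 'tree' );

if ( defined $element ) {
    # Plain text of the extracted block, with surrounding whitespace trimmed.
    my $text = $element->as_trimmed_text;

    # Serialized markup of the same block.
    my $markup = $element->as_HTML;

    print "Text:   $text\n";
    print "Markup: $markup\n";
}
else {
    print "No main content found\n";
}
```

Working with the element directly avoids a second parse when the extracted content needs further processing, such as collecting links or stripping tags.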

If the HTML's main content is found, it is returned in the chosen output format. The returned HTML/XHTML will not match the input exactly: source formatting, such as indentation, is removed.

If no main block of content can be identified, extract_main_html returns undef.
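Because the return value may be undef, callers may want a fallback. The sketch below wraps the call in a hypothetical helper, main_content_or_original, which is not part of this module; it returns the original input when no main content is detected.

```perl
use strict;
use warnings;
use HTML::ExtractMain qw( extract_main_html );

# Hypothetical helper (not provided by HTML::ExtractMain):
# returns the extracted main content as HTML, or the original
# input unchanged when extraction yields undef.
sub main_content_or_original {
    my ($html) = @_;

    my $main = extract_main_html( $html, output_type => 'html' );

    return defined $main ? $main : $html;
}

my $result = main_content_or_original('<p>Just one short paragraph.</p>');
print $result, "\n";
```

Falling back to the full input is only one choice; returning an empty string or raising an error may suit other callers better.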

AUTHOR

Anirvan Chatterjee, <anirvan at cpan.org>

BUGS

Please report any bugs or feature requests to bug-html-extractmain at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-ExtractMain. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc HTML::ExtractMain

You can also look for information at:

RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ExtractMain

AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/HTML-ExtractMain

CPAN Ratings

http://cpanratings.perl.org/d/HTML-ExtractMain

Search CPAN

http://search.cpan.org/dist/HTML-ExtractMain/

ACKNOWLEDGEMENTS

The Readability algorithm is ported from Arc90's JavaScript original, built as part of the excellent Readability application, online at http://lab.arc90.com/experiments/readability/, repository at http://code.google.com/p/arc90labs-readability/.

COPYRIGHT & LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
