NAME
HTML::TableContentParser - Do interesting things with the contents of tables.
SYNOPSIS
use HTML::TableContentParser;
my $p = HTML::TableContentParser->new();
my $html = read_html_from_somewhere();
my $tables = $p->parse( $html );
for my $t (@$tables) {
    for my $r (@{$t->{rows}}) {
        print 'Row:';
        for my $c (@{$r->{cells}}) {
            print " [$c->{data}]";
        }
        print "\n";
    }
}
DESCRIPTION
This package parses tables out of HTML. The return from the parse is a reference to an array containing the tables found.
Tables appear in the output in the order in which they are encountered. If a table is nested inside a cell of another table, it will appear after the containing table in the output, and any connection between the two will be lost. As of version 0.200_01, the appearance of a nested table should not cause any truncation of the containing table.
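The ordering rule for nested tables can be checked with a small script. This is only a sketch: the HTML fragment and the id attribute values are made up for illustration, and it assumes a version of the module at or after 0.200_01. Going by the description above, the table with id "outer" should be reported before the one with id "inner".

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Hypothetical input: one table nested inside a cell of another.
my $html = <<'EOD';
<table id="outer">
  <tr><td>before</td><td>
    <table id="inner"><tr><td>nested</td></tr></table>
  </td></tr>
</table>
EOD

my $tables = HTML::TableContentParser->new()->parse( $html );

# Attributes of the <table> tag are keys of its hash, so {id} is
# enough to tell the two tables apart.
for my $i ( 0 .. $#$tables ) {
    my $id = defined $tables->[$i]{id} ? $tables->[$i]{id} : '(no id)';
    print "Table $i: $id\n";
}
```

Because the connection between the two tables is lost, nothing in the output says that "inner" was found inside "outer"; a caller who needs that relationship has to recover it some other way.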
The following tags are processed by this module: <table>, <caption>, <tr>, <th>, and <td>. In the return from the parse method, each tag is represented by a hash reference, having the tag's attributes as keys, and the attribute values as values. In addition, the following keys will be provided:
<table>
- caption: the <caption> tag, if any
- headers: a reference to an array containing all the <th> tags, in the order encountered
- rows: a reference to an array containing all the <tr> tags, in the order encountered
<caption>
- data: the content of the <caption> tag
<tr>
- cells: a reference to an array containing all the <td> tags, in the order encountered, with undef representing any <th> tags encountered. Trailing undef values will be dropped, and the entire key will be absent unless actual <td> tags are found in the row.
  Note that prior to version 0.299_01, <th> tags were not represented at all.
- headers: new with version 0.299_01, this is a reference to an array containing all the <th> tags in the row, in the order encountered, with undef representing any <td> tags. Trailing undef values will be dropped, and the entire key will be absent unless actual <th> tags are found in the row.
  It is the understanding of the current author (TRW) that in valid HTML <th> tags must occur inside a <tr> element, so they need to be recognized there, rather than (or in addition to) in isolation.
<th>
- data: the content of the <th> tag
<td>
- data: the content of the <td> tag
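Since {headers} and {cells} each use undef as a placeholder for the other kind of tag, the two arrays of a row can be merged back into a single list in column order. The following is a minimal sketch of that idea, not part of the module: row_values() is a hypothetical helper, and the sample row is built by hand to match the structure described above rather than produced by an actual parse.

```perl
use strict;
use warnings;

# Hypothetical helper: merge the {headers} and {cells} arrays of one
# <tr> hash into a single list of cell contents, in column order.
sub row_values {
    my ( $row ) = @_;
    # Either key may be absent, so fall back to an empty array.
    my @headers = @{ $row->{headers} || [] };
    my @cells   = @{ $row->{cells}   || [] };
    my $n = @headers > @cells ? @headers : @cells;
    my @out;
    for my $i ( 0 .. $n - 1 ) {
        # At each position at most one of the two arrays holds a hash.
        my $tag = $headers[$i] || $cells[$i];
        push @out, defined $tag ? $tag->{data} : undef;
    }
    return @out;
}

# Hand-built sample shaped like <tr><th>Name</th><td>Moe</td></tr>:
# the trailing undef is dropped from {headers}, and the leading undef
# in {cells} stands for the <th>.
my $row = {
    headers => [ { data => 'Name' } ],
    cells   => [ undef, { data => 'Moe' } ],
};
print join( ' | ', row_values( $row ) ), "\n";    # prints "Name | Moe"
```

In real use the argument would be one element of a table's {rows} array, for example row_values( $tables->[0]{rows}[0] ).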
METHODS
This module is a subclass of HTML::Parser. It provides only one new method, classic(), which is an accessor for the attribute of the same name. The following inherited (or overridden) methods may profitably be called by the user.
new
my $p = HTML::TableContentParser->new();
This static method instantiates the parser object. The only supported argument is
- classic: If this argument is set to 1, <th> tags are handled in the pre-0.299_01 way. That is, the <tr> hash will not contain a {headers} key, and its {cells} key will not contain any undef values corresponding to <th> elements.
  If this argument is set to 0, you get the behavior documented for 0.299_01 and after.
  If this argument is undef or omitted, the value of $HTML::TableContentParser::CLASSIC is used.
  No other values are supported -- that is, the author reserves them, and the behavior when you use them may change without warning.
classic
This method returns the value of the classic attribute, whether specified or defaulted.
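The effect of the classic argument can be seen by parsing the same row both ways. This sketch assumes that the argument is passed to new() as a name/value pair (classic => $value), which the text above does not state explicitly, and that the installed version is 0.299_01 or later. The HTML fragment is made up. Going by the descriptions above, the non-classic parse should report a {headers} key and two entries in {cells}, and the classic parse no {headers} key and one entry.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Hypothetical input: one row holding a <th> followed by a <td>.
my $html = '<table><tr><th>Name</th><td>Moe</td></tr></table>';

for my $classic ( 0, 1 ) {
    # Assumption: arguments to new() are name/value pairs.
    my $p   = HTML::TableContentParser->new( classic => $classic );
    my $row = $p->parse( $html )->[0]{rows}[0];
    printf "classic=%d: headers key %s, %d entries in cells\n",
        $p->classic() ? 1 : 0,
        exists $row->{headers} ? 'present' : 'absent',
        scalar @{ $row->{cells} || [] };
}
```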
parse
my $tables = $p->parse( $html );
This method parses the given HTML. The return is a reference to an array containing all the tables found.
GLOBALS
The following global variables, properly localized, can be used to modify the behavior of this module.
$HTML::TableContentParser::CLASSIC
This variable provides the default value of the classic argument to new(), and is subject to the same restrictions.
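"Properly localized" here means changing the variable with local, so that the previous value comes back when the enclosing scope exits. The sketch below assumes that new() reads the variable once, when the parser is constructed; classic_parser() is a hypothetical wrapper written for this example, not part of the module.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Hypothetical wrapper: build a parser with the classic default in
# effect, without changing the default for the rest of the program.
sub classic_parser {
    local $HTML::TableContentParser::CLASSIC = 1;
    return HTML::TableContentParser->new();
}

my $old = classic_parser();
my $new = HTML::TableContentParser->new();    # sees the restored default

print 'localized default: ', ( $old->classic() ? 'classic' : 'current' ), "\n";
print 'normal default:    ', ( $new->classic() ? 'classic' : 'current' ), "\n";
```

Passing classic to new() directly is usually simpler; the global is mainly useful when the call to new() is buried in code that cannot easily be changed.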
$HTML::TableContentParser::DEBUG
If set to 1, causes debug output to STDERR (via warn()). Setting this to any true value (including 1) is unsupported in the sense that the behavior of this module in response to any true value is explicitly undocumented, and can change without notice.
EXPORTS
Nothing.
CAVEATS, BUGS, and TODO
The rowspan and colspan attributes are reported but ignored. That is,
<tr><td colspan="2">Moe</td><td>Howard</td></tr>
occupies three columns in the HTML table, but only two entries are made in the {cells} value of the hash that represents this row.
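Because the colspan attribute is still reported as a key of the cell's hash, a caller can do the column bookkeeping after the parse. The following is a sketch of one way to do that, not something the module provides: expand_colspan() is a hypothetical helper, the sample row is built by hand to mirror the example above, and the attribute key is assumed to be the lower-case name colspan. rowspan is not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of a row's {cells}, padded
# with undef so that each cell takes up as many slots as its colspan.
sub expand_colspan {
    my ( $row ) = @_;
    my @out;
    for my $cell ( @{ $row->{cells} || [] } ) {
        # undef entries (placeholders for <th>) count as one column.
        my $span = defined $cell && defined $cell->{colspan}
            ? $cell->{colspan}
            : 1;
        # Treat anything that is not a positive integer as 1.
        $span = 1 unless $span =~ /\A[1-9][0-9]*\z/;
        push @out, defined $cell ? $cell->{data} : undef;
        push @out, ( undef ) x ( $span - 1 );
    }
    return @out;
}

# Hand-built sample shaped like
# <tr><td colspan="2">Moe</td><td>Howard</td></tr>
my $row = {
    cells => [
        { colspan => 2, data => 'Moe' },
        { data => 'Howard' },
    ],
};
my @columns = expand_colspan( $row );
print scalar( @columns ), " columns\n";    # prints "3 columns"
```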
Please file bug reports at https://rt.cpan.org/Public/Dist/Display.html?Name=HTML-TableContentParser, https://github.com/trwyant/perl-HTML-TableContentParser/issues, or in electronic mail to wyant at cpan dot org.
SEE ALSO
This module is a very specific tool to address a very specific problem. One of the following modules may better address your needs.
HTML::Parser. This is a general HTML parser, which forms the basis for this module.
HTML::TreeBuilder. This is a general HTML parser, with methods to search and traverse the parse tree once generated.
Mojo::DOM in the Mojolicious distribution. This is a general HTML/XML DOM parser, with methods to search the parse tree using CSS selectors.
AUTHOR
Simon Drabble <sdrabble@cpan.org>
Thomas R. Wyant, III wyant at cpan dot org
COPYRIGHT AND LICENSE
Copyright (C) 2002 Simon Drabble
Copyright (C) 2017-2021 Thomas R. Wyant, III
This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0. For more details, see the full text of the licenses in the directory LICENSES.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.