NAME
Regexp::IgnoreHTML - Let us ignore the HTML tags when parsing HTML text
SYNOPSIS
use Regexp::IgnoreHTML;
my $rei = Regexp::IgnoreHTML->new($text,
"<!-- __INDEX__ -->");
# split the wanted text from the unwanted text
$rei->split();
# use substitution function
$rei->s('(var)_(\d+)', '$2$1', 'gi');
$rei->s('(\d+):(\d+)', '$2:$1');
# merge back to get the resulting text
my $changed_text = $rei->merge();
DESCRIPTION
Inherits from Regexp::Ignore and implements the get_tokens method. The tokens returned by get_tokens are all the HTML tags.
Note that for some HTML code, it might be better to use a different get_tokens than this one. Suppose, for example, we have the following code:
<table>
<tr>
<td>Hi</td><td>There</td>
</tr>
</table>
The "cleaned" text generated by this class's get_tokens method will look like:
HiThere
If we try to match the word "hit", we might mistakenly match the "HiT" of "HiThere".
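The false match can be reproduced without the module. The following is a minimal sketch in plain Perl. The tag-stripping regex and the whitespace removal are simplified stand-ins for what get_tokens does, assumed here for illustration only, and are not the module's actual implementation.

```perl
use strict;
use warnings;

my $html = "<table>\n<tr>\n<td>Hi</td><td>There</td>\n</tr>\n</table>\n";

# Simplified stand-in for get_tokens: treat anything between < and > as a tag.
(my $cleaned = $html) =~ s/<[^>]*>//g;

# Drop the line breaks left between the tags so only the visible text remains.
$cleaned =~ s/\s+//g;

print "cleaned: $cleaned\n";

# The word "hit" never appears in the rendered page, yet the pattern matches.
my $false_match = $cleaned =~ /hit/i ? 1 : 0;
print "matches /hit/i: $false_match\n";
```

Because nothing separates the contents of the two table cells once the tags are gone, the pattern finds "HiT" across the cell boundary.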
One way to solve this is to place a space after each piece of clean text. However, this might change how your results look (for example, inside a <pre> block).
Another way is to place the space only after certain tags (so after <td> but not after <pre>). See the access method space_after_non_text_characteristics_html for more details about this possibility.
The class Regexp::IgnoreTextCharacteristicsHTML provides an implementation of get_tokens that marks as unwanted only those HTML tags that are text characteristics tags (like <b>, which makes the text bold). After all, we do not expect to see a line like the following:
<td>H</td><td>ello</td>
In some cases, the Regexp::IgnoreTextCharacteristicsHTML class provides a good solution for parsing HTML text.
ACCESS METHODS
- space_after_non_text_characteristics_html ( BOOLEAN )
If true (false by default), a space token is placed after any tag that is not a text characteristics tag. The text characteristics tags, after which no space is placed, are: <B>, <BASEFONT>, <BIG>, <BLINK>, <CITE>, <CODE>, <EM>, <FONT>, <I>, <KBD>, <PLAINTEXT>, <S>, <SMALL>, <STRIKE>, <STRONG>, <SUB>, <SUP>, <TT>, <U>, <VAR>, <A>, <SPAN>, and <WBR>.
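The effect of this option can be approximated in plain Perl. This is a rough sketch only: the helper clean_with_spaces and its regex are hypothetical and written for this illustration, not taken from the module. It assumes that every tag is removed, and that a single space is left in place of each tag that is not in the list above.

```perl
use strict;
use warnings;

# Text characteristics tags, copied from the list above.
my %text_tag = map { $_ => 1 } qw(
    B BASEFONT BIG BLINK CITE CODE EM FONT I KBD PLAINTEXT S SMALL
    STRIKE STRONG SUB SUP TT U VAR A SPAN WBR
);

# Hypothetical helper: strip all tags, leaving one space in place of every
# tag that is not a text characteristics tag.
sub clean_with_spaces {
    my ($html) = @_;
    $html =~ s{<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>}
              { $text_tag{ uc $1 } ? '' : ' ' }ge;
    return $html;
}

my $cells = clean_with_spaces('<td>Hi</td><td>There</td>');
my $bold  = clean_with_spaces('H<b>ell</b>o');

print "[$cells]\n";
print "[$bold]\n";
```

Under these assumptions the two table cells stay separated, so a search for "hit" no longer matches across them, while a word that is split only by a <b> tag remains intact.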
AUTHOR
Rani Pinchuk, <rani@cpan.org>
COPYRIGHT
Copyright (c) 2002 WAM!NET EOC Belgium N.V. & Rani Pinchuk. All rights reserved. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.