NAME

Regexp::IgnoreHTML - Let us ignore the HTML tags when parsing HTML text

SYNOPSIS

  use Regexp::IgnoreHTML;

  my $rei = Regexp::IgnoreHTML->new($text,
                                    "<!-- __INDEX__ -->");
  # split the wanted text from the unwanted text
  $rei->split();  

  # use substitution function
  $rei->s('(var)_(\d+)', '$2$1', 'gi');
  $rei->s('(\d+):(\d+)', '$2:$1');

  # merge back to get the resulting text
  my $changed_text = $rei->merge();

DESCRIPTION

Inherits from Regexp::Ignore and implements the get_tokens method. The tokens returned by get_tokens are all the HTML tags.

Note that for some HTML code, it might be better to use a different get_tokens than this one. Suppose, for example, we have the following code:

<table>
  <tr>
    <td>Hi</td><td>There</td>
  </tr>
</table>

The "cleaned" text that will be generated after using the get_tokens method that comes from this class will look like:

HiThere

If we try to match the word "hit", we might by mistake match the HiT of HiThere.

One way to solve this is to place a space after each piece of clean text. However, this might change how your results look (for example, inside a <pre> block).

Another way is to place the space only after certain tags (so after <td> but not after <pre>). See the access method space_after_non_text_characteristics_html for more details about this possibility.
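The false match can be reproduced without the module. The following is a minimal sketch in plain Perl; strip_all_tags is a hypothetical helper written for this illustration, not part of Regexp::IgnoreHTML, and it only approximates the effect of treating every tag as unwanted.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Regexp::IgnoreHTML): drops every HTML
# tag and keeps only the text between the tags.
sub strip_all_tags {
    my ($html) = @_;
    (my $clean = $html) =~ s/<[^>]*>//g;
    return $clean;
}

my $html  = '<td>Hi</td><td>There</td>';
my $clean = strip_all_tags($html);

print "cleaned: $clean\n";                           # cleaned: HiThere
print "false match on: $1\n" if $clean =~ /(hit)/i;  # false match on: HiT
```

The two table cells run together once the tags are gone, so a case-insensitive search for "hit" succeeds even though no cell contains that word.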

The class Regexp::IgnoreTextCharacteristicsHTML provides an implementation of get_tokens that marks as unwanted only those HTML tags that are text characteristics tags (like <b>, which makes the text bold). After all, we do not expect to see a line like the following:

<td>H</td><td>ello</td>

In some cases, the Regexp::IgnoreTextCharacteristicsHTML class provides a good solution for parsing HTML text.
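To illustrate the idea of ignoring only text characteristics tags, here is a rough sketch in plain Perl. It does not use Regexp::IgnoreTextCharacteristicsHTML itself; strip_text_tags and the short tag list are assumptions made for this example, and the real class may recognise a different set of tags.

```perl
use strict;
use warnings;

# Illustrative subset of text characteristics tags (an assumption for this
# sketch; the real class may use a different list).
my $TEXT_TAGS = qr/(?:b|i|u|em|strong)/i;

# Hypothetical helper: removes only opening and closing text
# characteristics tags and leaves every other tag in place.
sub strip_text_tags {
    my ($html) = @_;
    (my $clean = $html) =~ s{</?$TEXT_TAGS\b[^>]*>}{}g;
    return $clean;
}

print strip_text_tags('H<b>ell</b>o'), "\n";               # Hello
print strip_text_tags('<td>Hi</td><td>There</td>'), "\n";  # <td>Hi</td><td>There</td>
```

Because the <td> tags stay in the text, "Hi" and "There" remain separated and a search for "hit" no longer matches across the cell boundary.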

ACCESS METHODS

space_after_non_text_characteristics_html ( BOOLEAN )

If true (the default is false), a space token will be placed after any tag that is not a text characteristics tag. To be specific, the text characteristics tags are: <B>, <BASEFONT>, <BIG>, <BLINK>, <CITE>, <CODE>, <EM>, <FONT>, <I>, <KBD>, <PLAINTEXT>, <S>, <SMALL>, <STRIKE>, <STRONG>, <SUB>, <SUP>, <TT>, <U>, <VAR>, <A>, <SPAN>, and <WBR>.
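The effect of this option can be approximated in plain Perl as follows. This is only a sketch under stated assumptions: strip_tags_with_spacing is a hypothetical helper, it replaces each tag with a space rather than inserting a separate space token, and it does not handle HTML comments. The tag list is the one given above.

```perl
use strict;
use warnings;

# Text characteristics tags, taken from the list above (lower-cased).
my %TEXT_TAG = map { $_ => 1 } qw(
    b basefont big blink cite code em font i kbd plaintext s small
    strike strong sub sup tt u var a span wbr
);

# Hypothetical helper: removes text characteristics tags outright and
# replaces every other tag with a single space.
sub strip_tags_with_spacing {
    my ($html) = @_;
    (my $clean = $html) =~ s{<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>}
                            { $TEXT_TAG{ lc($1) } ? '' : ' ' }ge;
    return $clean;
}

print strip_tags_with_spacing('H<b>ell</b>o'), "\n";  # Hello
my $cells = strip_tags_with_spacing('<td>Hi</td><td>There</td>');
print "no false match\n" unless $cells =~ /hit/i;     # no false match
```

Words split by a text characteristics tag are joined again, while text from neighbouring cells stays separated by whitespace.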

AUTHOR

Rani Pinchuk, <rani@cpan.org>

COPYRIGHT

Copyright (c) 2002 WAM!NET EOC Belgium N.V. & Rani Pinchuk. All rights reserved. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

perl, perlop, perlre, Regexp::Ignore.