NAME

Regexp::IgnoreHTML - Let us ignore the HTML tags when parsing HTML text

SYNOPSIS

  use Regexp::IgnoreHTML;

  my $rei = Regexp::IgnoreHTML->new($text,
                                    "<!-- __INDEX__ -->");
  # split the wanted text from the unwanted text
  $rei->split();  

  # use substitution function
  $rei->s('(var)_(\d+)', '$2$1', 'gi');
  $rei->s('(\d+):(\d+)', '$2:$1');

  # merge back to get the resulting text
  my $changed_text = $rei->merge();

DESCRIPTION

Inherits from Regexp::Ignore and implements the get_tokens method. The tokens returned by get_tokens are all the HTML tags.

Note that for some HTML code, it might be better to use a different get_tokens than this one. Suppose, for example, we have the following code:

<table>
  <tr>
    <td>Hi</td><td>There</td>
  </tr>
</table>

The "cleaned" text produced by this class's get_tokens method will look like:

HiThere

If we try to match the word "hit", we might mistakenly match the "HiT" of "HiThere".
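
The false match described above can be reproduced with a few lines of plain Perl. This is only a sketch: the tag-stripping regex is a naive stand-in chosen for illustration, not the module's actual get_tokens implementation, and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Naive stand-in for token removal (an assumption, not the module's code):
# delete anything that looks like an HTML tag.
my $html = '<td>Hi</td><td>There</td>';
(my $cleaned = $html) =~ s/<[^>]*>//g;

# The two table cells are now glued together, so a case-insensitive
# search for "hit" matches across the former cell boundary.
my $false_match = $cleaned =~ /hit/i ? 1 : 0;

print "cleaned: $cleaned\n";
print "matches 'hit': $false_match\n";
```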

One way to solve this is to place a space after each piece of cleaned text. However, this might change the appearance of your results (for example, inside a <pre> block).

Another way is to place the space only after certain tags (so after <td> but not after <pre>). See the access method space_after_non_text_characteristics_html for more details about this possibility.
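
The idea of adding a space only after some tags might be sketched as follows. This is a simplified, stand-alone illustration and not the module's implementation: the %inline set is a short hypothetical subset of text characteristics tags, and strip_with_spaces is an invented helper name.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated set of text characteristics tags (lower-cased).
my %inline = map { $_ => 1 } qw(big cite code kbd s strike tt var a);

# Remove every tag; leave a single space behind unless the tag is inline.
sub strip_with_spaces {
    my ($html) = @_;
    $html =~ s{<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>}{ $inline{lc $1} ? '' : ' ' }ge;
    return $html;
}

my $cells = strip_with_spaces('<td>Hi</td><td>There</td>');
my $word  = strip_with_spaces('<td>H<code>ell</code>o</td>');

print "[$cells]\n";   # the cells stay separated, so /hit/i no longer matches
print "[$word]\n";    # the inline tag leaves no gap inside the word
```

The trade-off is the one mentioned above: the extra spaces keep neighbouring cells apart, but they also alter the text wherever a non-inline tag appears.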

The class Regexp::IgnoreTextCharacteristicsHTML provides an implementation of get_tokens that marks as unwanted only those HTML tags that are text characteristics tags (such as the tags that make text bold). After all, we do not expect to see a line like the following:

<td>H</td><td>ello</td>

In some cases, the Regexp::IgnoreTextCharacteristicsHTML class provides a good solution for parsing HTML text.
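
To make the difference concrete, here is a rough stand-alone sketch of ignoring only text characteristics tags while leaving every other tag in the text. It does not use Regexp::IgnoreTextCharacteristicsHTML itself; the %inline set and the strip_inline_only helper are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated set of text characteristics tags (lower-cased).
my %inline = map { $_ => 1 } qw(big cite code kbd s strike tt var a);

# Drop inline tags only; any other tag is kept in place unchanged.
sub strip_inline_only {
    my ($html) = @_;
    $html =~ s{(<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>)}{ $inline{lc $2} ? '' : $1 }ge;
    return $html;
}

my $cells = strip_inline_only('<td>Hi</td><td>There</td>');
my $word  = strip_inline_only('<p>H<tt>ell</tt>o</p>');

print "$cells\n";   # structural tags remain, so "Hi" and "There" stay apart
print "$word\n";    # the word split by an inline tag is whole again
```

Because the structural tags stay in the text, a pattern such as /hit/i cannot match across two table cells, while a word interrupted by an inline tag can still be matched as a whole.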

ACCESS METHODS

space_after_non_text_characteristics_html ( BOOLEAN ): If true (the default is false), a space token will be placed after any tag that is not a text characteristics tag. The text characteristics tags, which do not get a space token, include <BASEFONT>, <BIG>, <BLINK>, <CITE>, <CODE>, <KBD>, <PLAINTEXT>, <S>, <STRIKE>, <TT>, <VAR>, and <A>, among others.

AUTHOR

Rani Pinchuk, <rani@cpan.org>
