NAME

Regexp::IgnoreTextCharacteristicsHTML - Ignore text-characteristic HTML tags when parsing HTML text

SYNOPSIS

  use Regexp::IgnoreTextCharacteristicsHTML;

  my $rei =
    Regexp::IgnoreTextCharacteristicsHTML->new($text,
                                               "<!-- __INDEX__ -->");
  # split the wanted text from the unwanted text
  $rei->split();  

  # use substitution function
  $rei->s('(var)_(\d+)', '$2$1', 'gi');
  $rei->s('(\d+):(\d+)', '$2:$1');

  # merge back to get the resulting text
  my $changed_text = $rei->merge();
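
To see what the two patterns in the synopsis do on their own, they can be tried with Perl's built-in s/// operator on a plain string. This is only a sketch of the regular expressions themselves: it does not use the module, and the sample text is made up for illustration.

```perl
use strict;
use warnings;

# Sample text invented for this illustration.
my $text = 'VAR_12 starts at 10:45 and 11:30';

# Swap the name and the number; /g replaces every match, /i ignores case.
$text =~ s/(var)_(\d+)/$2$1/gi;

# Swap the numbers around the colon; without /g only the first match changes.
$text =~ s/(\d+):(\d+)/$2:$1/;

print "$text\n";   # prints: 12VAR starts at 45:10 and 11:30
```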

DESCRIPTION

Inherits from Regexp::Ignore and implements the get_tokens method. The tokens that get_tokens returns as unwanted are the HTML tags that control text characteristics, namely: <B>, <BASEFONT>, <BIG>, <BLINK>, <CITE>, <CODE>, <EM>, <FONT>, <I>, <KBD>, <PLAINTEXT>, <S>, <SMALL>, <STRIKE>, <STRONG>, <SUB>, <SUP>, <TT>, <U>, <VAR>, <A>, <SPAN>, and <WBR>.
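
As a rough illustration of which tags fall into this set, the list above can be turned into a single pattern. The pattern below is an assumption made for this sketch, not the one the module uses internally, and it does not cope with a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Tag names taken from the list in the DESCRIPTION.
my @tags = qw(B BASEFONT BIG BLINK CITE CODE EM FONT I KBD PLAINTEXT S SMALL
              STRIKE STRONG SUB SUP TT U VAR A SPAN WBR);

# Hypothetical pattern: an opening or closing tag whose name is one of @tags,
# optionally followed by attributes. \b keeps <B> from matching <BR>.
my $alternation = join '|', @tags;
my $tag_re      = qr{</?(?:$alternation)\b[^>]*>}i;

my $html = 'The <b>quick</b> <font size="2">brown</font> <p>fox</p>';

# Remove the text-characteristic tags and leave every other tag alone.
(my $plain = $html) =~ s/$tag_re//g;

print "$plain\n";   # prints: The quick brown <p>fox</p>
```

Note that <p> survives, because it is a structural tag rather than a text-characteristic one.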

It also treats as unwanted tokens any HTML remarks (comments) and any remarks that MS Word creates when saving a document as HTML. This behaviour can be changed using the class members IGNORE_HTML_REMARKS and IGNORE_WORD_REMARKS.

ACCESS METHODS

ignore_html_remarks ( BOOLEAN )

If true (the default), the get_tokens method treats HTML remarks as unwanted tokens, so any <!-- ... --> is ignored. Call this method before calling split.

ignore_word_remarks ( BOOLEAN )

If true (the default), the get_tokens method treats MS Word remarks as unwanted tokens, so any <![ ... ]> is ignored. Call this method before calling split.
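
The two kinds of remark can be pictured with a pair of simple patterns. These are hypothetical patterns written for this sketch, based only on the <!-- ... --> and <![ ... ]> shapes described above; the module's own patterns may differ.

```perl
use strict;
use warnings;

# Hypothetical patterns for the two kinds of remark.
my $html_remark_re = qr/<!--.*?-->/s;   # <!-- ... -->
my $word_remark_re = qr/<!\[.*?\]>/s;   # <![ ... ]>

# Sample text invented for this illustration.
my $text = 'a<!-- note -->b<![if !supportLists]>c<![endif]>d';

# Drop both kinds of remark, roughly what happens when both flags are true.
(my $stripped = $text) =~ s/$html_remark_re|$word_remark_re//g;

print "$stripped\n";   # prints: abcd
```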

do_not_ignore ( TAGS )

TAGS is a list of strings, each the name of a tag. For example:

  ("B", "FONT")

Tags passed to this method will not be ignored by the object.

tags_to_ignore ( TAGS )

TAGS is a list of strings, each the name of a tag (see do_not_ignore above for an example). The tags passed to this method will be ignored by the object. You may pass tags that are already ignored, tags that were excluded by an earlier call to do_not_ignore, or entirely new tags; all of them will be ignored. In list context, the method returns a list of all the tags that will be ignored.
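
The interplay between do_not_ignore and tags_to_ignore can be modelled with a plain hash used as a set. The two subroutines below are stand-ins written for this sketch: they only mimic the behaviour described above and are not the module's methods. The starting set is shortened for brevity, and folding tag names to upper case is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the object's set of ignored tags (shortened).
my %ignored = map { $_ => 1 } qw(B I FONT);

# Model of do_not_ignore: remove the given tags from the set.
sub do_not_ignore {
    delete $ignored{ uc($_) } for @_;
    return;
}

# Model of tags_to_ignore: add the given tags, then return the whole set.
sub tags_to_ignore {
    $ignored{ uc($_) } = 1 for @_;
    my @all = sort keys %ignored;
    return @all;
}

do_not_ignore('B', 'FONT');                 # B and FONT are no longer ignored
my @now = tags_to_ignore('FONT', 'SPAN');   # FONT is ignored again, SPAN is new

print "@now\n";   # prints: FONT I SPAN
```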

AUTHOR

Rani Pinchuk, <rani@cpan.org>

COPYRIGHT

Copyright (c) 2002 Ockham Technology N.V. & Rani Pinchuk. All rights reserved. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

perl, perlop, perlre, Regexp::Ignore.