NAME
Regexp::IgnoreTextCharacteristicsHTML - Let us ignore the HTML tags when parsing HTML text
SYNOPSIS
    use Regexp::IgnoreTextCharacteristicsHTML;

    my $rei = Regexp::IgnoreTextCharacteristicsHTML->new($text,
                                                         "<!-- __INDEX__ -->");

    # split the wanted text from the unwanted text
    $rei->split();

    # use the substitution function
    $rei->s('(var)_(\d+)', '$2$1', 'gi');
    $rei->s('(\d+):(\d+)', '$2:$1');

    # merge back to get the resulting text
    my $changed_text = $rei->merge();
DESCRIPTION
Inherits from Regexp::Ignore and implements the get_tokens method. The tokens that get_tokens returns as unwanted are the text-characteristics HTML tags, specifically: <B>, <BASEFONT>, <BIG>, <BLINK>, <CITE>, <CODE>, <EM>, <FONT>, <I>, <KBD>, <PLAINTEXT>, <S>, <SMALL>, <STRIKE>, <STRONG>, <SUB>, <SUP>, <TT>, <U>, <VAR>, <A>, <SPAN>, and <WBR>.
It also treats as unwanted tokens any HTML remarks (comments) and any remarks that MS Word creates when saving a document as HTML. This behaviour can be changed using the class members IGNORE_HTML_REMARKS and IGNORE_WORD_REMARKS.
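The idea of wanted and unwanted tokens can be illustrated with a minimal, self-contained sketch. It is not the module's get_tokens implementation: the helper classify_tokens, its option keys html_remarks and word_remarks, and the patterns used for tags and remarks are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): split $html into
# [is_unwanted, text] pairs. Option keys and patterns are assumptions.
sub classify_tokens {
    my ($html, $opt, @tags) = @_;
    my @alt;
    if (@tags) {
        my $names = join '|', map { quotemeta($_) } @tags;
        push @alt, "</?(?:$names)\\b[^>]*>";            # listed tags, opening or closing
    }
    push @alt, '<!--.*?-->'   if $opt->{html_remarks};  # HTML remarks: <!-- ... -->
    push @alt, '<!\\[.*?\\]>' if $opt->{word_remarks};  # Word remarks: <![ ... ]>
    return ([0, $html]) unless @alt;                    # nothing to ignore

    my $re       = join '|', @alt;
    my $unwanted = qr/$re/is;
    my @tokens;
    # split with a capturing group keeps the separators in the result list
    for my $piece (split /($unwanted)/, $html) {
        next if $piece eq '';
        push @tokens, [ ($piece =~ /\A$unwanted\z/ ? 1 : 0), $piece ];
    }
    return @tokens;
}

my $html   = 'var_<b>12</b><!-- note --> and <![if !x]>more';
my @tokens = classify_tokens($html, { html_remarks => 1, word_remarks => 1 },
                             qw(B I FONT));
my $wanted = join '', map { $_->[1] } grep { !$_->[0] } @tokens;

print "$wanted\n";                                          # prints "var_12 and more"
print "raw matches\n"    if $html   =~ /var_\d+/;           # not printed: <b> is in the way
print "wanted matches\n" if $wanted =~ /var_\d+/;           # printed
```

The last two lines show why ignoring such tags is useful: a pattern like var_\d+ fails on the raw HTML because a tag interrupts the text, but matches once the unwanted tokens are set aside.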
ACCESS METHODS
- ignore_html_remarks ( BOOLEAN )
If true (the default), the get_tokens method treats HTML remarks as unwanted tokens, so any <!-- ... --> will be ignored. Call this method before calling split.
- ignore_word_remarks ( BOOLEAN )
If true (the default), the get_tokens method treats Word remarks as unwanted tokens, so any <![ ... ]> will be ignored. Call this method before calling split.
- do_not_ignore ( TAGS )
TAGS is a list of strings, each the name of a tag. For example:
    ("B", "FONT")
Tags passed to this method will not be ignored by the object.
-
TAGS is a list of strings, each the name of a tag. See do_not_ignore above for an example. Tags passed to this method will be ignored by the object. You can pass tags that are already ignored, tags that were cancelled by a call to do_not_ignore, or entirely new tags. All of them will be ignored. In list context, it returns a list of all the tags that will be ignored.
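The bookkeeping described for the last two methods can be pictured with a small stand-alone sketch. It does not use the module: the %ignored hash and the functions stop_ignoring and start_ignoring are hypothetical names, and storing tag names upper-cased is an assumption rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the object's ignore list (assumption: names are
# stored upper-cased so that "b" and "B" refer to the same tag).
my %ignored = map { $_ => 1 } qw(B I FONT);

# Illustrative counterpart of do_not_ignore: stop ignoring the given tags.
sub stop_ignoring {
    delete @ignored{ map { uc($_) } @_ };
    return;
}

# Illustrative counterpart of the second method: ignore the given tags, and in
# list context return every tag that is now ignored.
sub start_ignoring {
    $ignored{ uc($_) } = 1 for @_;
    return unless wantarray;
    return sort keys %ignored;
}

stop_ignoring('B', 'FONT');               # only I is still ignored
my @now = start_ignoring('b', 'TABLE');   # a cancelled tag and a new tag
print "@now\n";                           # prints "B I TABLE"
```

Here a cancelled tag (B) and a brand-new tag (TABLE) are both added back, while FONT stays cancelled because it was not passed again.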
AUTHOR
Rani Pinchuk, <rani@cpan.org>
COPYRIGHT
Copyright (c) 2002 Ockham Technology N.V. & Rani Pinchuk. All rights reserved. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.