NAME

HTML::HiLiter - highlight words in an HTML document just like a felt-tip HiLiter

SYNOPSIS

use HTML::HiLiter;

my $hiliter = HTML::HiLiter->new(
    word_characters   => '\w\-\.',
    ignore_first_char => '\-\.',
    ignore_last_char  => '\-\.',
    tag               => 'span',
    colors            => [ '#FFFF33', 'yellow', 'pink' ],
    tag_filter        => \&yourtagcode,     # a CODE reference: no parentheses
    text_filter       => \&yourtextcode,    # a CODE reference: no parentheses
    query             => 'foo bar or "some phrase"',
);

$hiliter->run($some_file_or_URL);

DESCRIPTION

HTML::HiLiter is designed to make highlighting search queries in HTML easy and accurate. HTML::HiLiter was designed for CrayDoc 4, the Cray documentation server.

As of version 0.14, HTML::HiLiter has been completely rewritten with a new API, using Search::Tools.

REQUIREMENTS

The following are required:

  • Perl version 5.8.3 or later (for proper UTF-8 support).

  • Search::Tools 0.25 or later.

  • HTML::Parser

Required to use the HTTP option in the run() method:

  • HTTP::Request

  • LWP::UserAgent

FEATURES

A cornucopia of features.

  • HTML::HiLiter parses HTML chunk by chunk, buffering all text within an HTML block element before applying highlighting to the buffer.

    The default behavior is to print() all the HTML, highlighted or not, as soon as it is evaluated. To change that behavior, set the print_stream parameter in new() to a false value; the HTML is then cached and returned as a scalar string from run().

    Otherwise, you can direct the print() to a filehandle with the fh() param/method.

  • Turn highlighting off on a per-tagset basis with the custom HTML "nohiliter" attribute. Set the attribute to a true value (like 1) to turn off highlighting for the duration of that tag.

  • Ample debugging. Set the debug param to a level between 1 and 3, and lots of debugging info will be printed within HTML comments <!-- -->.

  • Smart context. Won't highlight across an HTML block element like a <p></p> tagset or a <div></div> tagset. (IMHO, your indexing software shouldn't consider matches for phrases that span across those tags either.)

  • Rotating colors. Each query term gets its own color. The default is four different colors, which will repeat if you have more than four terms in a single query. You can define more or different colors in the new() object call.

  • CSS support. You can alter the highlighting markup used with the tag, class, style and text_color parameters. See the documentation for Search::Tools::HiLiter.
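
The color rotation described above amounts to modular indexing into the color list. The following is a minimal sketch in plain Perl, not the module's actual implementation; the palette, the terms, and the colors_for_terms() helper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical palette: four entries, to mirror the "four colors" default.
my @colors = ( '#FFFF33', '#99FFFF', '#FF99FF', '#FF9933' );

# Assign one color per distinct term, wrapping around to the start of
# the palette when there are more terms than colors.
sub colors_for_terms {
    my ( $colors, @terms ) = @_;
    my %color_for;
    my $i = 0;
    for my $term (@terms) {
        next if exists $color_for{$term};    # a repeated term keeps its color
        $color_for{$term} = $colors->[ $i++ % @$colors ];
    }
    return \%color_for;
}

my @terms = qw(foo bar baz quux zap);
my $map   = colors_for_terms( \@colors, @terms );
printf "%-5s => %s\n", $_, $map->{$_} for @terms;
```

With five terms and four colors, the fifth term reuses the first color.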

METHODS

new()

Create a HiLiter object.

Any parameter that can be passed to Search::Tools::HiLiter can be passed to HTML::HiLiter. In addition, the following HTML::HiLiter-specific parameters are supported:

fh

The filehandle to send output to. Defaults to STDOUT. If print_stream is false, output is buffered instead of being printed.

hiliter

Set a Search::Tools::HiLiter object for HTML::HiLiter to use. If you do not set one, one will be created based on the other parameters you pass.

tag_filter

A CODE reference of your choosing for filtering HTML tags as they pass through the HTML::Parser. See FILTERS.

text_filter

A CODE reference of your choosing for filtering HTML text as it passes through the HTML::Parser. See FILTERS.

buffer_limit

When the number of characters in the HTML buffer exceeds the value of buffer_limit, the buffer is printed without highlighting being attempted. The default is 2**16 characters. Make this higher at your peril. Most HTML will not exceed that in a <p> tagset, for example.
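
The buffer_limit rule can be pictured as a simple length check made before highlighting. This is only a sketch of the documented behavior, not the module's code; flush_buffer() and the $mark callback are hypothetical stand-ins for the real buffer handling and highlighter.

```perl
use strict;
use warnings;

my $buffer_limit = 2**16;    # the documented default

# Return the buffer untouched when it exceeds the limit; otherwise hand
# it to the highlighter callback. Both names are hypothetical.
sub flush_buffer {
    my ( $buffer, $limit, $highlight ) = @_;
    return $buffer if length($buffer) > $limit;    # too large: no highlighting
    return $highlight->($buffer);
}

# Toy highlighter that wraps every "foo" in a span.
my $mark = sub {
    my $s = shift;
    $s =~ s/(foo)/<span>$1<\/span>/g;
    return $s;
};

print flush_buffer( 'foo bar', $buffer_limit, $mark ), "\n";    # highlighted
print flush_buffer( 'foo bar', 3,             $mark ), "\n";    # passed through as-is
```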

print_stream

Default value true (1). Print highlighted HTML as the HTML::Parser encounters it. If true, use a select() in your script to print somewhere besides the perl default of STDOUT.

NOTE: Set this to 0 (false) only if you are highlighting small chunks of HTML (i.e., smaller than buffer_limit). See run().

BUILD

Called internally by new().

query

Get the Search::Tools::Query object created in new().

style_header( html )

If set, html will be applied just after the opening <head> tag while parsing. This is to allow insertion of CSS or other head-appropriate markup.

apply_hiliting( string )

Passes string through Search::Tools::HiLiter->light() and returns string highlighted.

Queries

This method is deprecated. See the query param to new() instead.

run( $file | $url | \$html )

run() takes either a file name, a URL (indicated by a leading 'http://'), or a scalar reference to a string of HTML text.
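
A minimal sketch of calling run() on an in-memory string, assuming print_stream is set to false so that run() returns the highlighted HTML instead of printing it. The sample HTML, the file name and the URL are hypothetical. The require is wrapped in eval only so that the script still runs where the module is not installed.

```perl
use strict;
use warnings;

my $html = '<p>foo and <b>bar</b> appear here</p>';
my $highlighted;

if ( eval { require HTML::HiLiter; 1 } ) {
    my $hiliter = HTML::HiLiter->new(
        query        => 'foo bar',
        print_stream => 0,           # buffer and return instead of printing
    );

    # A scalar reference tells run() that this is HTML text, not a file name.
    $highlighted = $hiliter->run( \$html );
    print $highlighted, "\n";
}

# The other two documented forms take a plain string instead of a reference:
#   $hiliter->run('/path/to/page.html');          # hypothetical file name
#   $hiliter->run('http://example.com/page');     # hypothetical URL; needs LWP::UserAgent
```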

Run

For backwards compatibility, Run() is an alias for run().

FILTERS

text_filter and tag_filter are two optional parameters that allow you to filter the contents of your HTML beyond normal highlighting. Each parameter takes a CODE reference.

text_filter should expect these parameters in this order:

parserobj, dtext, text, offset, length

tag_filter should expect these parameters in this order:

parserobj, tag, tagname, offset, length, offset_end, attr, text

Both should return a scalar string of text. tag_filter should return a set of attributes. text_filter may return whatever you want. See EXAMPLES and the HTML::Parser documentation for what these parameters mean and for more about writing filters.
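
As an illustration of the argument order, here is a hypothetical text_filter that collapses runs of whitespace. It assumes that the returned string is what gets passed on in place of the original text; check the examples/ directory before relying on that. The sub is called directly here so that the sketch runs without the module.

```perl
use strict;
use warnings;

# Hypothetical text_filter. Arguments arrive in the documented order:
# parserobj, dtext, text, offset, length.
sub squeeze_whitespace {
    my ( $parser, $dtext, $text, $offset, $length ) = @_;
    ( my $clean = $text ) =~ s/\s+/ /g;    # work on a copy of the raw text
    return $clean;                         # a scalar string
}

# Direct call with dummy values, just to show the calling convention.
my $raw = "some   phrase\n\twith  gaps";
print squeeze_whitespace( undef, $raw, $raw, 0, length $raw ), "\n";
```

To use it, pass the reference without parentheses, as in text_filter => \&squeeze_whitespace. Writing \&squeeze_whitespace() would call the sub and take a reference to its return value.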

EXAMPLES

See examples/ directory in source distribution.

HISTORY

Yet another highlighting module?

My goal was a complete, exhaustive, tear-your-hair-out effort to highlight HTML. No other modules I found on the web supported nested tags within words and phrases, or character entities. Cray uses the standard DocBook stylesheets from Norm Walsh et al. to generate HTML. These stylesheets produce valid HTML but often fool the other highlighters I found.

The problem became most evident when we started using Swish-e. Swish-e does such a good job at converting entities and doing phrase matching that we found ourselves in a dilemma: Swish-e often gave valid search results that mere mortal highlighters could not match in the source HTML -- not even the SWISH::*Highlight modules.

With the exception of the 'nohiliter' attribute, I think I follow the W3C HTML 4.01 specification. Please prove me wrong.

A prime example of where this module succeeds and other modules fail:

The query 'bold in the middle' should match this HTML:

<p>some phrase <b>with <i>b</i>old</b> in&nbsp;the middle</p>

GOOD highlighting:

<p>some phrase <b>with <i><span>b</span></i><span>old</span></b><span>
in&nbsp;the middle</span></p>

BAD highlighting:

<p>some phrase <b>with <span><i>b</i>old</b> in&nbsp;the middle</span></p>

No module I tried in my tests could even find that as a match (let alone perform bad highlighting on it), even though indexing programs like Swish-e would consider a document with that HTML a valid match.

Should you use this module?

I would suggest not using HTML::HiLiter if your HTML is fairly simple, since in HTML::HiLiter, speed has been sacrificed for accuracy and rich features. Check out HTML::Highlight instead.

Unlike other highlighting code I've found, HTML::HiLiter supports nested tags and character entities, such as might be found in technical documentation or HTML generated from some other source (like DocBook SGML or XML).

The goal is server-side highlighting that looks as if you used a felt-tip marker on the HTML page. You shouldn't need to know what the underlying tags and entities and encodings are: you just want to easily highlight some text as your browser presents it.

TODO

  • More tests.

  • Restore highlighting of link text, which was dropped in 0.14 with the Search::Tools rewrite. Highlight IMG tags where ALT attribute matches query??

KNOWN BUGS AND LIMITATIONS

Will not highlight literal parentheses ().

Phrases that contain stopwords may not highlight correctly. It's more a problem of *which* stopword the original doc used and is not an intrinsic problem with the HiLiter, but noted here for completeness' sake.

AUTHOR

Peter Karman, karman@cray.com

Thanks to the Swish-e developers, in particular Bill Moseley for graciously sharing time, advice and code examples.

Comments and suggestions are welcome.

COPYRIGHT

###############################################################################
#    CrayDoc 4
#    Copyright (C) 2004 Cray Inc swpubs@cray.com
#
#    This program is free software; you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation; either version 2 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program; if not, write to the Free Software
#    Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
###############################################################################

SUPPORT

Send email to swpubs@cray.com.

SEE ALSO

Search::Tools, HTML::Parser