NAME

HTML::Index::Search - Perl module for searching an index of HTML files

SYNOPSIS

use HTML::Index::Search;

my $store = HTML::Index::Store->new();
my $search = HTML::Index::Search->new( STORE => $store );
my @results = $search->search( $q, { SOUNDEX => 1 } );
my @words = $search->get_words( $q );

DESCRIPTION

This module is the complement to the HTML::Index::Create module: it searches the inverted index that HTML::Index::Create builds, using a query string containing words and Boolean logic. A search returns a list of tokens corresponding to the name attributes of the HTML::Index::Document objects that were indexed by the HTML::Index::Create object. The words extracted from the query string can be retrieved after the search with the get_words method.

OPTIONS

VERBOSE

Print diagnostic messages to STDERR.

STORE

An object that is an HTML::Index::Store (or a subclass of it).

METHODS

search

This method takes a query string as its first argument. The query string is a whitespace-separated list of words, optionally connected by Boolean terms (or, and, not; case insensitive) and optionally grouped using parentheses. Any terms that are not connected by Booleans are assumed to be ANDed. Here are some examples:

some stuff
some AND stuff
some and stuff
some OR stuff
some AND stuff AND NOT more
( more AND stuff ) OR ( sample AND stuff )
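
The implicit-AND rule can be illustrated with a small sketch. The function normalise_query below is hypothetical and is not part of HTML::Index::Search; it assumes that an AND is implied wherever one operand (a word or a closing parenthesis) is directly followed by the start of another (a word, an opening parenthesis, or NOT).

```perl
use strict;
use warnings;

# Hypothetical helper: make implied ANDs explicit and upper-case the Booleans.
sub normalise_query {
    my ($query) = @_;

    # Tokens are single parentheses or runs of other non-space characters.
    my @tokens = $query =~ /([()]|[^\s()]+)/g;
    my @out;
    for my $token (@tokens) {
        my $is_operator = $token =~ /^(?:and|or|not)$/i;
        $token = uc $token if $is_operator;
        if (@out) {
            # The previous token ends an operand unless it is a Boolean or "(".
            my $prev_ends_operand = $out[-1] !~ /^(?:AND|OR|NOT|\()$/;
            my $starts_operand    = $token eq '('
                || $token eq 'NOT'
                || ( !$is_operator && $token ne ')' );
            push @out, 'AND' if $prev_ends_operand && $starts_operand;
        }
        push @out, $token;
    }
    return join ' ', @out;
}

print normalise_query('some stuff'), "\n";       # some AND stuff
print normalise_query('some and stuff'), "\n";   # some AND stuff
print normalise_query('(more stuff) OR (sample stuff)'), "\n";
```

Under these assumptions the first two examples in the list above normalise to the same query, which is why they return the same results.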

For those who are interested: the inverted index is stored as a set of bit vectors. The entry for each word is a scalar, the nth bit of which is 1 or 0 depending on whether that word appears in the nth file. This is not the most compact storage method, but it makes the processing of Boolean queries very simple, using bitwise arithmetic. Also, since the bit vectors are generally sparse, they compress well with standard compression (in this case Compress::Gzip; see HTML::Index::Compress).
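
As a rough illustration of the bit vector idea, the sketch below builds a tiny in-memory index with Perl's vec function and evaluates two of the example queries with bitwise string operators. It is not the module's actual implementation: the sample files, the %index hash, and the helpers bits_for, bits_not and matching_files are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical sample corpus; the array position is the file number.
my @files = (
    'some stuff',         # file 0
    'some more stuff',    # file 1
    'sample stuff',       # file 2
    'more samples',       # file 3
);

# One bit string per word: bit n is set if the word occurs in file n.
my %index;
for my $n ( 0 .. $#files ) {
    for my $word ( split ' ', lc $files[$n] ) {
        $index{$word} = '' unless exists $index{$word};
        vec( $index{$word}, $n, 1 ) = 1;
    }
}

# A vector with every file's bit set, used to size and mask the others.
my $all = '';
vec( $all, $_, 1 ) = 1 for 0 .. $#files;

# Vector for a word, padded with zero bytes to the length of $all.
sub bits_for {
    my ($word) = @_;
    my $bits = exists $index{$word} ? $index{$word} : '';
    return $bits | ( "\0" x length($all) );
}

# NOT: complement, then mask off the unused bits in the last byte.
sub bits_not {
    my ($bits) = @_;
    return ~$bits & $all;
}

# File numbers whose bit is set in the vector.
sub matching_files {
    my ($bits) = @_;
    return grep { vec( $bits, $_, 1 ) } 0 .. $#files;
}

# some AND stuff AND NOT more
my @and_not = matching_files(
    bits_for('some') & bits_for('stuff') & bits_not( bits_for('more') ) );

# ( more AND stuff ) OR ( sample AND stuff )
my @grouped = matching_files(
    ( bits_for('more') & bits_for('stuff') )
    | ( bits_for('sample') & bits_for('stuff') ) );

print "and_not: @and_not\n";    # and_not: 0
print "grouped: @grouped\n";    # grouped: 1 2
```

AND, OR and NOT each map onto a single bitwise operator, so a parsed query can be evaluated without visiting the documents themselves.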

The second argument to search is an optional options hashref. Currently the only option available is SOUNDEX (value true or false). If true, the search is done via a soundex algorithm, so the result set contains all documents that contain words that sound like the words in the query string by this measure.
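
To show what matching "by sound" can mean, here is a minimal sketch of a soundex lookup. The text above does not say which soundex implementation the module uses, so the soundex_code function below (a simplified American Soundex) and the %by_sound table are assumptions for illustration only.

```perl
use strict;
use warnings;

# Simplified American Soundex: the first letter followed by three digits.
sub soundex_code {
    my ($word) = @_;
    ( my $letters = uc $word ) =~ s/[^A-Z]//g;
    return '' if $letters eq '';

    # Map letters to digit classes; 0 marks vowels and Y, 9 marks H and W.
    ( my $codes = $letters )
        =~ tr/BFPVCGJKQSXZDTLMNRAEIOUYHW/11112222222233455600000099/;

    my $prev = substr( $codes, 0, 1 );
    my $out  = '';
    for my $code ( split //, substr( $codes, 1 ) ) {
        next if $code eq '9';    # H and W do not separate equal codes
        $out .= $code if $code ne '0' && $code ne $prev;
        $prev = $code;           # a vowel resets the previous code
    }
    return substr( substr( $letters, 0, 1 ) . $out . '000', 0, 4 );
}

# Hypothetical list of indexed words, grouped by their soundex code.
my @indexed_words = qw( smith robert stuff );
my %by_sound;
push @{ $by_sound{ soundex_code($_) } }, $_ for @indexed_words;

# Indexed words that share a soundex code with the query word.
sub sound_alikes {
    my ($query_word) = @_;
    return @{ $by_sound{ soundex_code($query_word) } || [] };
}

print join( ',', sound_alikes('smyth') ),  "\n";    # smith
print join( ',', sound_alikes('rupert') ), "\n";    # robert
```

With such a table, a query word is replaced by the set of indexed words that share its code before the Boolean logic is applied.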

get_words

This method simply returns the list of words (not including Booleans) extracted from the most recently searched query string. It is used by HTML::Index::Search::CGI to generate a summary with the keywords highlighted.
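
The following sketch shows one way such a word list could be extracted from a query string and used for highlighting. The functions extract_words and highlight are hypothetical and are not the code used by get_words or HTML::Index::Search::CGI; the sketch also assumes that repeated words are kept and that the summary text is already safe to embed in HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: query words, without Booleans or parentheses.
sub extract_words {
    my ($query) = @_;
    my @tokens = grep { length $_ } split /[\s()]+/, $query;
    return grep { !/^(?:and|or|not)$/i } @tokens;
}

# Hypothetical helper: wrap every whole-word occurrence in <b> tags.
sub highlight {
    my ( $text, @words ) = @_;
    return $text unless @words;
    my $alternatives = join '|', map { quotemeta($_) } @words;
    $text =~ s/\b($alternatives)\b/<b>$1<\/b>/gi;
    return $text;
}

my @words = extract_words('( more AND stuff ) OR ( sample AND stuff )');
print join( ',', @words ), "\n";    # more,stuff,sample,stuff

# <b>Some</b> sample <b>stuff</b> and more.
print highlight( 'Some sample stuff and more.', 'some', 'stuff' ), "\n";
```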

SEE ALSO

HTML::Index

AUTHOR

Ave Wrigley <Ave.Wrigley@itn.co.uk>

COPYRIGHT

Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.