NAME
HTML::Index::Search - Perl module for searching an index of HTML files
SYNOPSIS
use HTML::Index::Store;
use HTML::Index::Search;
my $store = HTML::Index::Store->new();
my $search = HTML::Index::Search->new( STORE => $store );
my @results = $search->search( $q, [ SOUNDEX => 1 ] );
my @words = $search->get_words( $q );
DESCRIPTION
This module is the complement to the HTML::Index::Create module. It searches the inverted index created by that module, based on a query string containing words and Boolean logic. The search returns a set of results consisting of the tokens corresponding to the name attributes of the HTML::Index::Document objects that were indexed by the HTML::Index::Create object. The words extracted from the query string can be accessed after the search using the get_words method.
OPTIONS
- VERBOSE
Print various diagnostic output to STDERR.
- STORE
An object which ISA HTML::Index::Store (that is, an instance of HTML::Index::Store or of a subclass).
METHODS
- search
This method takes a query string as its first argument. The query string is a whitespace-separated list of words, optionally connected by Boolean terms (or, and, not; case insensitive) and optionally grouped using parentheses. Any terms that are not connected by Booleans are assumed to be AND'ed. Here are some examples:
    some stuff
    some AND stuff
    some and stuff
    some OR stuff
    some AND stuff AND NOT more
    ( more AND stuff ) OR ( sample AND stuff )
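The implicit-AND rule can be sketched as a small normalisation pass over the query tokens. This is only an illustration under assumptions: normalise_query is a hypothetical helper, not part of HTML::Index::Search, and the module's own parser may work quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): upper-case the Boolean
# terms and insert an explicit AND between adjacent operands.
sub normalise_query {
    my ($q) = @_;
    $q =~ s/([()])/ $1 /g;    # make sure parentheses are separate tokens
    my @out;
    for my $t ( split ' ', $q ) {
        my $tok = $t =~ /^(?:and|or|not)$/i ? uc $t : $t;
        if (@out) {
            my $prev = $out[-1];
            # The previous token ends an operand if it is ')' or a plain word.
            my $prev_ends = $prev eq ')' || $prev !~ /^(?:AND|OR|NOT|\()$/;
            # This token starts an operand if it is '(', NOT or a plain word.
            my $tok_starts = $tok eq '(' || $tok eq 'NOT'
                || $tok !~ /^(?:AND|OR|\))$/;
            push @out, 'AND' if $prev_ends && $tok_starts;
        }
        push @out, $tok;
    }
    return join ' ', @out;
}

print normalise_query('some stuff'), "\n";    # some AND stuff
```

Under this sketch, "some stuff", "some and stuff" and "some AND stuff" all normalise to the same explicit form.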
For those who are interested: the inverted index is stored as a set of bitvectors. The entry for each word is a scalar whose nth bit is set to 1 or 0 depending on whether that word appears in the nth file. This is not the most compact storage method, but it makes the processing of Boolean queries very simple, using bitwise arithmetic. Also, since the bitvectors are generally sparse, they compress well with standard compression (in this case Compress::Gzip - see HTML::Index::Compress).
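As a rough illustration of the bitvector idea, the following sketch builds one bit string per word with Perl's built-in vec and evaluates Boolean queries with the string bitwise operators. The sample files, the %index hash and the files_for helper are all hypothetical; the module's real storage goes through an HTML::Index::Store object and is not shown here.

```perl
use strict;
use warnings;

# Hypothetical corpus: the array position is the file number.
my @files = (
    'some stuff here',      # file 0
    'some more stuff',      # file 1
    'a sample of stuff',    # file 2
    'nothing relevant',     # file 3
);

# One bitvector per word: bit n is set if the word occurs in file n.
my %index;
for my $n ( 0 .. $#files ) {
    for my $word ( split ' ', lc $files[$n] ) {
        $index{$word} = '' unless defined $index{$word};
        vec( $index{$word}, $n, 1 ) = 1;
    }
}

# Pad all vectors to the same length so the bitwise operators line up.
my $bytes = int( ( @files + 7 ) / 8 );
$_ .= "\0" x ( $bytes - length $_ ) for values %index;

# Unknown words get an all-zero vector.
my $empty      = "\0" x $bytes;
my $vector_for = sub { exists $index{ $_[0] } ? $index{ $_[0] } : $empty };

# List the file numbers whose bit is set in a vector.
sub files_for {
    my ($vec) = @_;
    return grep { vec( $vec, $_, 1 ) } 0 .. $#files;
}

# some AND stuff
my @and = files_for( $vector_for->('some') & $vector_for->('stuff') );
# some OR sample
my @or = files_for( $vector_for->('some') | $vector_for->('sample') );
# stuff AND NOT more
my @and_not = files_for( $vector_for->('stuff') & ~$vector_for->('more') );

print "AND: @and\nOR: @or\nAND NOT: @and_not\n";
```

Each Boolean operator maps onto a single bitwise operation over whole vectors, which is the simplicity the paragraph above refers to.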
The second argument to search is an options hashref. Currently the only option available is SOUNDEX (a true or false value). If true, the search is done via a soundex algorithm, so the result set contains all documents containing words that sound like the words in the query string by this measure.
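To show what "sounding alike" means here, the sketch below computes the classic four-character American Soundex code in plain Perl. It is an assumption that the module uses this variant; soundex_code is a hypothetical helper, and the module may rely on a different implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: classic American Soundex (first letter + 3 digits).
sub soundex_code {
    my ($word) = @_;
    my $w = uc $word;
    $w =~ s/[^A-Z]//g;
    return '' unless length $w;

    # Letter-to-digit table; vowels, H, W and Y have no code.
    my %code;
    @code{ split //, 'BFPV' }     = (1) x 4;
    @code{ split //, 'CGJKQSXZ' } = (2) x 8;
    @code{ split //, 'DT' }       = (3) x 2;
    $code{L} = 4;
    @code{ split //, 'MN' } = (5) x 2;
    $code{R} = 6;

    my @letters = split //, $w;
    my $first   = shift @letters;
    my $out     = $first;
    my $prev    = $code{$first} || 0;
    for my $c (@letters) {
        my $d = $code{$c} || 0;
        # Emit a digit only when it differs from the previous code.
        $out .= $d if $d && $d != $prev;
        # H and W do not separate equal codes; vowels do.
        $prev = $d unless $c eq 'H' || $c eq 'W';
    }
    return substr( $out . '000', 0, 4 );
}

# Two spellings that share a code would match each other in a soundex search.
print soundex_code('Robert'), ' ', soundex_code('Rupert'), "\n";    # R163 R163
```

With codes like these, a soundex search can compare the code of each query word against the codes of the indexed words instead of comparing the words themselves.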
- get_words
This method simply returns the list of words (not including Booleans) extracted from the most recently searched query string. It is used by HTML::Index::Search::CGI to generate a summary with the keywords highlighted.
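The word extraction can be approximated in a few lines. The following is a sketch under assumptions, not the module's implementation: query_words is a hypothetical function, and whether get_words removes duplicates or changes case is not stated here, so the sketch does neither.

```perl
use strict;
use warnings;

# Hypothetical helper: return the search words in a query string,
# skipping parentheses and the Boolean terms and, or and not.
sub query_words {
    my ($q) = @_;
    my @tokens = grep { length } split /[\s()]+/, $q;
    return grep { !/^(?:and|or|not)$/i } @tokens;
}

my @words = query_words('( more AND stuff ) OR ( sample and stuff )');
print "@words\n";    # more stuff sample stuff
```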
SEE ALSO
AUTHOR
Ave Wrigley <Ave.Wrigley@itn.co.uk>
COPYRIGHT
Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.