NAME
Search.pm - provide framework for multiple searches
SYNOPSIS
use Search::TextSearch;
use Search::Glimpse;
DESCRIPTION
This module is a base class interfacing to search engines. It defines an interface that can be used by any of the Search search modules, such as Search::Glimpse or Search::TextSearch, which are the standard ones included with the module.
It exports no routines, just provides methods for the other classes.
Search Comparisons
                                 Text       Glimpse
                                 -------    --------
Speed                            Medium     Dependent on uniqueness of terms
Requires add'l software          No         Yes
Search individually for fields   Yes        Yes
Search multiple files            Yes        Yes
Allows spelling errors           No         Yes
Fully grouped matching           No         No
Methods
Only virtual methods are supported by the Search::Base class. Those search modules that inherit from it may export static methods if desired. The methods fall into three groups: the "Global Parameter Method", the "Column Setting Methods", and the "Row Setting Methods". See also Search::Glimpse and Search::TextSearch.
Global Parameter Method
$s->global(param,val);
$status = $s->global(param);
%globals = $s->global();
Allows setting of the parameters that are global to the search. The standard ones are listed below.
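The three calling forms above can be sketched with a small stand-in class. This is only an illustration of the accessor pattern, not the Search::Base source: the package name My::SearchSketch and its internals are hypothetical, and the two defaults shown are the ones quoted in the parameter descriptions of this document.

```perl
use strict;
use warnings;

# Hypothetical stand-in class; NOT the real Search::Base implementation.
package My::SearchSketch;

sub new {
    my ($class, %opts) = @_;
    # Defaults quoted from the parameter descriptions in this document.
    my $self = { global => { match_limit => 50, index_delim => "\t", %opts } };
    return bless $self, $class;
}

sub global {
    my ($self, $param, @val) = @_;
    # No arguments: return a copy of every global as a flat hash.
    return %{ $self->{global} } unless defined $param;
    # Two arguments: set the parameter first.
    $self->{global}{$param} = $val[0] if @val;
    # One or two arguments: return the current value.
    return $self->{global}{$param};
}

package main;

my $s = My::SearchSketch->new;
$s->global('match_limit', 25);            # set
my $limit   = $s->global('match_limit');  # get one value
my %globals = $s->global();               # get everything
print "match_limit is $limit\n";
```

Keeping all globals in one hash is what makes the no-argument form cheap: it simply returns a copy of that hash.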
- base_directory
-
Those engines which look for matches in index files can read this to get the base directory of the images.
- case_sensitive
-
This is a global version of the cases field. If set, the search engine should return only matches which exactly match, including distinction between lower- and upper-case letters. The default is not set, to ignore case.
- error_page
-
A page or filename to be displayed when a user error occurs. Passed along with a single string parameter to the error_routine, as in:
&{$self->{global}->{error_routine}} ($self->{global}->{error_page}, $msg);
- error_routine
-
Reference to a subroutine which will send errors to the user. The default is '\&croak'.
- exact_match
-
Strings sent as match specifications will have double quotes put around them, meaning the words must be found in the order they are put in. Any double quotes contained in the string will be silently deleted.
- first_match
-
The number of the first match returned in a set from more_matches. This tells the calling program where to start its increment.
- head_skip
-
Used for the TextSearch module to indicate lines that should be skipped at the beginning of the file. Allows a header to be skipped if it should not be searched.
- index_delim
-
The delimiter character to terminate the return code in an ascii index file. In Search::Glimpse and Search::TextSearch, the default is "\t", or TAB. This is also the default for {return_delim} if that is not set.
If field-matching is being used, this is the character/string used for splitting the fields. If properly escaped, and {return_delim} is used for joining fields, it can be a regular expression -- Perl style.
- index_file
-
A specification of an index file (or files) to search. The usage is left to the module -- it could, for example, be an anonymous array (as in Search::TextSearch) or wild-card specification for multiple indices (as in Search::Glimpse).
- log_routine
-
A reference to a subroutine to log an error or status indication. By default, it is '\&Carp::carp'.
- match_limit
-
The number of matches to return. Not to be confused with max_matches, at which number the search will terminate. Additional matches will be stored in the file pointed to by save_dir, session_id, and search_mod. The default is 50.
- matches
-
Set by the search routine to indicate the number of matches in the last search. If the engine can return the total number of matches (without the data) then that is the result.
- min_string
-
The minimum size of search string that is supported. Using a size of less than 4 for Glimpse, for example, is not wise.
- next_pointer
-
The pointer to the next list of matches. This is for engines that can return only a subset of the elements starting from a given element, and is used to build the next list of matches.
- next_url
-
The base URL that should be used to invoke the more_matches function. Provided as an object-contained scratchpad value for the calling routine -- it will not be used or modified by Search::Base.
There are a couple of useful ways to use this to invoke the proper more_matches search that are shown in the example search CGI provided with this module set. Both involve setting the next_url and combining it with a random session_id and search_mod.
- or_search
-
If set, the search engine should return matches which match any of the search patterns. The default is not set, requiring matches to all of the keywords for a match return.
- overflow
-
Set by the search routine if it matched the maximum number before reaching the end of the search. Set to undef if not supported.
- record_delim
-
This sets the type of record that will be searched. For the Search::TextSearch module, this works the same as in Perl: the default is the value of $/ at the time of object creation.
For the Search::Glimpse module, the following mappings occur:
$/ = "\n"        Glimpse default
$/ = "\n\n"      -d '$$'
$/ = ""          -d '$$'
$/ = undef       -d 'NeVAiRbE'
anything else    passed on as is (-d 'whatever')
One useful pattern is '^From ', which will make each email message in a folder a separate record.
If you are doing this, and expect to be doing field returns, it will probably be useful to set "\n\n" or "\n" as the default index_delim. If used in combination with the obscure anonymous hash definition of return_fields, you can search and return mail headers on each message that matches.
- return_delim
-
The delimiter character to join fields that are cut out of the matched line/paragraph/page. The default is to set it to be the same as {index_delim} if not explicitly set.
- return_fields
-
The fields that will be returned from the line/paragraph/page matched. This is not to be confused with the fields setting -- it will not affect the matching, only the returned fields. The default (when it is undefined) is to return only the first field. There are several options for this field.
If the value is an ARRAY reference, an integer list of the columns to be returned is assumed.
If the value is a HASH reference, then all words found AFTER the key of the hash (with assumed but not required whitespace as a separator) up to the value of the item (used as a delimiter) are returned. The following example will print the value of the From:, Reply-To: and Date: headers from any message in your (UNIX) system mailbox that contains 'foobar'.
# Hash keys containing ':' must be quoted to be valid Perl
$s = new Search::TextSearch(
    return_fields => {
        'From:'     => "\n",
        'Reply-To:' => "\n",
        'Date:'     => "\n",
    },
    record_delim => "\nFrom ",
    search_file  => $ENV{MAIL},
);
print $s->search('foobar');
- return_file_name
-
All return options will be ignored, and the file names of any matches will be returned instead. The limit match-to-field routines are still enabled for Search::TextSearch, but not for Glimpse, since the 'glimpse -l' option is used for that.
- save_dir
-
The directory that search caches (for the more_matches function) will be saved in. Only applies to file save mode.
- search_mod
-
This is used to develop a unique search save id for a user with a consistent session_id. For the more_matches function.
- search_port
-
The port (passed to glimpse with the -K option) that is to be used for a network-attached search server.
- search_server
-
The host name of a network-attached search server, passed to glimpse with the -J option.
- session_id
-
This is used to determine the save file name or hash key used to cache a search (for the more_matches function).
- speed
-
The speed of search desired, as an integer value from 1 to 10. Those engines that have a faster method to search (possibly at a cost of comprehensiveness) can adjust their routines accordingly.
- spelling_errors
-
Those engines that support "tolerant matching" can honor this parameter to set the number of spelling errors that they will allow for. This can slow search dramatically on big databases. Ignored by search engines that don't have the capability.
- substring_match
-
If set, the search engine should return partial, or substring, matches. The default is not set, to indicate whole word matching. This can slow search dramatically on big databases.
- uneval_routine
-
A reference to a subroutine to save the search parameters to a cache. By default, it is '\&uneval', the routine supplied with Search::Base.
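How index_delim, return_fields (in its ARRAY form), and return_delim interact can be sketched as a split, a slice, and a join. The helper cut_fields below is hypothetical and is not part of the Search modules. It assumes zero-based column numbers, as the "fields 0 and 2" example in this document suggests, and follows the documented defaults: TAB for index_delim, index_delim for return_delim, and the first field only when return_fields is undefined.

```perl
use strict;
use warnings;

# Hypothetical helper; illustrates the documented behaviour only.
sub cut_fields {
    my ($record, %opt) = @_;
    my $index_delim  = defined $opt{index_delim}  ? $opt{index_delim}  : "\t";
    # return_delim falls back to index_delim when not set.
    my $return_delim = defined $opt{return_delim} ? $opt{return_delim} : $index_delim;
    # Default is to return only the first field (assumed to be column 0).
    my @wanted = $opt{return_fields} ? @{ $opt{return_fields} } : (0);

    # index_delim is used as a regular expression for splitting.
    my @fields = split /$index_delim/, $record;
    return join $return_delim,
        map { defined $fields[$_] ? $fields[$_] : '' } @wanted;
}

my $line = "alpha  beta gamma";
my $cut  = cut_fields($line,
    index_delim   => '\s+',
    return_delim  => ' ',
    return_fields => [0, 2],
);
print "$cut\n";
```

Setting return_delim explicitly matters here: a regular expression such as '\s+' is useful for splitting but would be joined into the output literally if it were also used as the join string.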
METHODS
Virtual methods provided
- more_matches
-
Given a file with return codes from previous searches, one per line, returns an array of the matches for the requested page. Opens the file in directory save_dir, with session information appended (the session_id and search_mod), and returns match_limit matches, starting at next_pointer.
- search
-
This is the main method defined in the individual search engine. You can submit a single parameter for a quick search, which will be interpreted as the one and only search specification, overriding any settings residing in the specs array. Options can be specified at object creation, or separately with the global method. Or a search_spec can be specified, which will temporarily override the setting in specs (for that invocation only).
Otherwise, the parameters are named search options as documented above. Examples:
# Simple search with default options for 'foobar' in the named file
$s = new Search::TextSearch(search_file => '/var/adm/messages');
@found = $s->search('foobar');

# Search for 'foobar' in /var/adm/messages, return only fields 0 and 2
# where fields are separated by spaces
$s = new Search::TextSearch;
@found = $s->search(
    search_file   => '/var/adm/messages',
    search_spec   => 'foobar',
    return_fields => [0,2],
    return_delim  => ' ',
    index_delim   => '\s+'
);

# Search for 'foobar' in any file containing 'messages' in
# the default glimpse index, return the file names
$s = new Search::Glimpse;
@found = $s->search(
    search_spec      => 'foobar',
    search_file      => 'messages',
    return_file_name => 1,
);

# Same as above except use glimpse index located in /var directory
$s = new Search::Glimpse;
@found = $s->search(
    search_spec      => 'foobar',
    base_directory   => '/var',
    search_file      => 'messages',
    return_file_name => 1,
);

# Search all files in /etc
# Return file names with lines that have 'foo' in field 1
# and 'bar' in field 3, with case sensitivity
# (using the default field delimiter of \t)
$s = new Search::TextSearch;
$s->rowpush(1, 'foo');    # rowpush takes the field first, then the spec
$s->rowpush(3, 'bar');
chomp(@files = `ls /etc`);
@found = $s->search(
    search_file      => [@files],
    case_sensitive   => 1,
    return_file_name => 1,
);

# Same as above using direct access to specs/fields
$s = new Search::TextSearch;
$s->specs('foo', 'bar');
$s->fields(1, 3);
chomp(@files = `ls /etc`);
@found = $s->search(
    search_file      => [@files],
    case_sensitive   => 1,
    return_file_name => 1,
);

# Repeat search with above settings, except for specs,
# if less than 4 matches are found
if (@found < 4) {
    @found = $s->search('foo');
}
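The paging that more_matches performs can be sketched as reading the cached return codes and slicing out one page. The helper page_of_matches below is hypothetical and is not the module's own code. Two details are assumptions made for the sketch: the cache file is named by joining session_id and search_mod with a dot, and next_pointer is a zero-based offset.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper, not the module's own more_matches.
sub page_of_matches {
    my (%g) = @_;
    # Assumed file naming: session_id and search_mod joined with a dot.
    my $file = File::Spec->catfile($g{save_dir}, "$g{session_id}.$g{search_mod}");
    open my $fh, '<', $file or die "Cannot open $file: $!";
    chomp(my @codes = <$fh>);    # one return code per line
    close $fh;

    my $first = $g{next_pointer} || 0;    # assumed zero-based offset
    return () if $first > $#codes;
    my $last = $first + $g{match_limit} - 1;
    $last = $#codes if $last > $#codes;
    return @codes[$first .. $last];
}

# Build a small cache file to page through.
my $dir   = tempdir(CLEANUP => 1);
my $cache = File::Spec->catfile($dir, 'abc123.1');
open my $out, '>', $cache or die "Cannot write $cache: $!";
print {$out} "code$_\n" for 1 .. 7;
close $out;

my @page = page_of_matches(
    save_dir    => $dir,
    session_id  => 'abc123',
    search_mod  => 1,
    match_limit => 3,
    next_pointer => 3,
);
print "@page\n";
```

A caller would add match_limit to next_pointer after each page to request the following one.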
Column Setting Methods
Column setting functions allow the setting of a series of columns of match criteria:
$search->specs('foo', 'foo bar', 'foobar');
$search->fields(1, 3, 4);
This is an example for the specs and fields match criteria, which are the search specifications and the fields to search, respectively. Similar functions are provided for mods, links, cases, negates, open_parens, and close_parens.
For the included Search::Glimpse and Search::TextSearch modules, an item will match the above example only if field (or column) 1 contains 'foo', field 3 contains 'foo' and/or 'bar', and field 4 contains 'foobar'. The setting of the case_sensitive, or_search, and substring_match terms will be honored as well.
For simple searches, only one term need be set, and the grouping functions links, open_parens, and close_parens are ignored. In most cases, if the setting for a particular column is not defined for a row, the value in the global setting is used.
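The fallback rule in the last sentence (an undefined per-row setting defers to the global one) can be sketched with the cases column and the case_sensitive global. The data layout and the helper names effective_case and row_regex are hypothetical; the sketch shows only the defaulting logic, not how any shipped engine stores its columns.

```perl
use strict;
use warnings;

# Hypothetical data layout: one array per column, one index per row.
my %global = (case_sensitive => 0);
my @specs  = ('foo', 'Bar');
my @cases  = (undef, 1);    # row 0 has no setting, row 1 asks for case sensitivity

# Resolve the effective case setting for a row:
# use the row's own value if defined, else the global.
sub effective_case {
    my ($row) = @_;
    return defined $cases[$row] ? $cases[$row] : $global{case_sensitive};
}

# Build a regex for a row's spec that honours the resolved setting.
sub row_regex {
    my ($row) = @_;
    my $spec = quotemeta $specs[$row];
    return effective_case($row) ? qr/$spec/ : qr/$spec/i;
}

# Row 0 matches case-insensitively; row 1 requires the exact case.
my @hits = map { 'FOO bar' =~ row_regex($_) ? 1 : 0 } 0 .. $#specs;
print "@hits\n";
```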
- specs
-
The search text, raw, per field. This is the only item that necessarily needs to be set to do a search.
If more than one specification is present, there are three forms of behavior supported by the included Search::TextSearch and Search::Glimpse modules. First, if there are multiple search specifications, they are combined together, just as they would be if separated by spaces (and not quoted). Second, if the number of specs matches the number of fields, each spec must match the field that it is associated with (subject to the or_search and case_sensitive settings within that field). Last, if there are more fields than specs, only the columns in fields are searched for the combined specs.
- fields
-
The column numbers to search, where a column is a field separated by index_delim. In the Search::TextSearch and Search::Glimpse modules, this becomes operative in one of two ways. If the number of specs matches the number of fields, each specification is separately checked against its associated field. If the number of fields is different from the number of specs, all specs are applied, but only to the text in the specified fields. Both first match on all of the text in the row, then filter the match with another routine that checks for matches in the specified fields.
- mods
-
Modifies the match criteria. Recognized modifications might be:
start    Matches when the field starts with the spec
sub      Match substrings.
Not supported in the included modules.
- links
-
The link to the previous row. If there are two fields to search, with two different specs, this determines whether the search is AND, OR, or NEAR. For engines that support it, NEAR matches within $self->global('near') words of the previous word (forward only). Not supported in the included modules.
- cases
-
For advanced search engines that support full associative case-sensitivity. Determines whether the particular match in this set will be case-sensitive or not. If the search engine doesn't support independent case-sensitivity (like the Search::TextSearch and Search::Glimpse modules), the value in case_sensitive will be used. Not supported in the included modules.
- negates
-
Negates the sense of the match for that term. Allows searches like "spec1 AND NOT spec2" or "NOT spec1 AND spec2". Not supported in the included modules.
- open_parens
-
Determines whether a logical parenthesis will be placed on the left of the term. Allows grouping of search terms for more expressive matching, i.e. "(AUTHOR Shakespeare AND TYPE Play) NOT TITLE Hamlet". Not supported in the included modules.
- close_parens
-
Determines whether a logical parenthesis will be placed on the right of the term. Not supported in the included modules.
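The way specs and fields combine, as described under those two entries, can be sketched as a row filter. The function row_matches is hypothetical and simplifies heavily: it assumes zero-based column numbers, always ignores case, matches substrings, and requires every spec to match (no or_search). It shows only how the mode is chosen from the number of specs and fields.

```perl
use strict;
use warnings;

# Hypothetical filter; fields are assumed to be zero-based column numbers.
sub row_matches {
    my ($row, $specs, $fields, $delim) = @_;
    my @cols = split /$delim/, $row;

    if (@$fields && @$specs == @$fields) {
        # One spec per field: each spec must match its own column.
        for my $i (0 .. $#$specs) {
            my $spec = $specs->[$i];
            my $col  = $cols[ $fields->[$i] ];
            return 0 unless defined $col && $col =~ /\Q$spec\E/i;
        }
        return 1;
    }

    # Otherwise the specs are combined and applied only to the listed
    # columns (or to the whole row when no fields are given).
    my $text = @$fields
        ? join(' ', grep { defined $_ } @cols[@$fields])
        : $row;
    for my $spec (@$specs) {
        return 0 unless $text =~ /\Q$spec\E/i;
    }
    return 1;
}

my $row = "1001\tfoo widget\tblue\tbar stool";
print row_matches($row, ['foo', 'bar'], [1, 3], "\t"), "\n";  # spec per field
print row_matches($row, ['stool'],      [1, 3], "\t"), "\n";  # more fields than specs
print row_matches($row, ['blue'],       [1, 3], "\t"), "\n";  # 'blue' is outside the fields
```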
Row Setting Methods
Row setting functions allow the setting of all columns in a row.
$query->rowpush($field,$spec,$case,$mod,$link,$left,$right);
($field,$spec,$case,$mod,$link,$left,$right) = $query->rowpop();
@oldvals = $query->rowset(n,$field,$spec,$case,$mod,$link,$left,$right);
You can ignore the trailing parameters for a simple search. For example:
$field = 'author';
$spec = 'forsythe';
$limit = 25;
$query = new Search::Glimpse;
$query->rowpush($field,$spec);
@rows = $query->search( match_limit => $limit );
This searches the field 'author' for the name 'forsythe', with all other options at their defaults (ignore case, match substrings, not negated, no links, no grouping), and returns at most 25 matches (by setting the match_limit global). For a more complex search, you can add the rest of the parameters as needed.
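One way to picture the relationship between rows and columns is a set of parallel arrays, where rowpush appends one value to every column and rowpop removes the last value from each. The class My::RowSketch below is a hypothetical illustration, not the shipped implementation. The column order follows the rowpush argument list above, and mapping $left and $right to open_parens and close_parens is an assumption.

```perl
use strict;
use warnings;

package My::RowSketch;    # hypothetical; not the shipped class

# Column order follows the rowpush argument list in this document.
my @COLUMNS = qw(fields specs cases mods links open_parens close_parens);

sub new {
    my ($class) = @_;
    my %self = map { ($_ => []) } @COLUMNS;   # one empty array per column
    return bless \%self, $class;
}

# Append one row: push each argument onto its column array.
# Omitted trailing arguments are stored as undef.
sub rowpush {
    my ($self, @vals) = @_;
    push @{ $self->{ $COLUMNS[$_] } }, $vals[$_] for 0 .. $#COLUMNS;
    return scalar @{ $self->{specs} };        # number of rows now stored
}

# Remove the last row and return its values in rowpush order.
sub rowpop {
    my ($self) = @_;
    return map { pop @{ $self->{$_} } } @COLUMNS;
}

package main;

my $q = My::RowSketch->new;
$q->rowpush('author', 'forsythe');            # trailing columns stay undef
$q->rowpush('title',  'jackal', 1);
my ($field, $spec, $case) = $q->rowpop;
print "$field / $spec / $case\n";
```

With this layout, the column setting methods and the row setting methods are two views of the same data: specs and fields read down a column, while rowpush and rowpop read across a row.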
SEE ALSO
glimpse(1), Search::Glimpse(3L), Search::TextSearch(3L)
AUTHOR
Mike Heins, <mikeh@iac.net>