NAME

Search::FreeText - Free text indexing module for medium-to-large text corpuses

SYNOPSIS

my $text = new Search::FreeText(-db => ['DB_File', "stories.db"]);

$text->open_index();
$text->clear_index();
$text->index_document(1, "Hello world");
$text->index_document(2, "World in motion");
$text->index_document(3, "Cruel crazy beautiful world");
$text->index_document(4, "Hey crazy");
$text->close_index();

$text->open_index();
foreach ($text->search("Crazy", 10)) {
    print "$_->[0], $_->[1]\n";
};
$text->close_index();

DESCRIPTION

This module provides free text searching in a relatively open manner. It allows a persistent inverted file index to be constructed and managed (within limits), and then searched fairly efficiently. The module depends on a DBM module of some kind to manage the inverted file. DB_File is usually the best choice, as it is quite fast, quite scalable, and accepts the long values that are needed for performance.

The free text searching algorithm used is the BM25 weighting scheme described in Robertson, S. E., Walker, S., Beaulieu, M. M., Gatford, M., and Payne, A. (1995). Okapi at TREC-4, in NIST Special Publication 500-236, the Fourth Text Retrieval Conference (TREC-4), pages 73-96.

Much of the module depends on an open lexical analysis system, which is implemented by Search::FreeText::LexicalAnalysis. This is where all the word splitting and stemming is handled (Lingua::Stem is used for the stemming).

Using the module is quite simple: you can open an index and close it, and while it is open you add documents as strings, each with a key of your own choosing. You can search the corpus using a string, and you get back a list of matches, each an array holding your own document key and a relevance measure. The keys might be database table keys, URLs, or file names; anything like that will do. This makes Search::FreeText a useful package for implementing fairly efficient, high-quality search systems.

METHODS

new Search::FreeText(arguments...);

Makes a new free text searching object. The following initialization parameters are supported:

-db

Parameters to be passed to the tie function to connect to the database module. The first parameter is assumed to be the name of a Perl module, which is loaded with require.

-filters

A list of filters, which is passed to Search::FreeText::LexicalAnalysis. If none is provided here, the default is used, which is, in order: Search::FreeText::LexicalAnalysis::Heuristics, Search::FreeText::LexicalAnalysis::Tokenize, Search::FreeText::LexicalAnalysis::Stop, and Search::FreeText::LexicalAnalysis::Stem.

-stoplist

This is optional, but if provided, is a big string containing the stop list. The Search::FreeText::LexicalAnalysis::Stop module looks here for a stop list, and if one is provided, it uses it rather than defaulting to its own.

-values

Sets the BM25 parameters. The value should be a hash reference containing values for the keys B, K1, and K3 in the BM25 matching measure. The default values for these parameters are 0.75, 1.2, and 7 respectively.
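
To show how B, K1, and K3 enter the relevance measure, here is a minimal sketch of the commonly published Okapi BM25 term weight. It is not taken from the module's source, so the module's exact arithmetic may differ. The function name bm25_term_weight and all of its argument names are hypothetical.

```perl
use strict;
use warnings;

# Default parameter values quoted in the documentation above.
my %DEFAULTS = (B => 0.75, K1 => 1.2, K3 => 7);

# Weight contributed by one query term to one document, in the commonly
# published BM25 form. All argument names are hypothetical:
#   tf       occurrences of the term in the document
#   qtf      occurrences of the term in the query
#   df       number of documents containing the term
#   n_docs   number of documents in the index
#   doc_len  length of this document
#   avg_len  average document length in the index
#   values   optional hash reference overriding B, K1, and K3
sub bm25_term_weight {
    my (%arg) = @_;
    my %p = (%DEFAULTS, %{ $arg{values} || {} });

    # Inverse document frequency component. In this form it goes negative
    # when the term occurs in more than half of the documents.
    my $idf = log(($arg{n_docs} - $arg{df} + 0.5) / ($arg{df} + 0.5));

    # Length normalisation: B = 0 ignores document length entirely,
    # B = 1 normalises fully against the average length.
    my $k = $p{K1} * ((1 - $p{B}) + $p{B} * $arg{doc_len} / $arg{avg_len});

    # K1 damps repeated occurrences in the document, K3 in the query.
    my $doc_part   = (($p{K1} + 1) * $arg{tf})  / ($k + $arg{tf});
    my $query_part = (($p{K3} + 1) * $arg{qtf}) / ($p{K3} + $arg{qtf});

    return $idf * $doc_part * $query_part;
}

my $weight = bm25_term_weight(
    tf => 2, qtf => 1, df => 3, n_docs => 100, doc_len => 50, avg_len => 50,
);
printf "%.4f\n", $weight;
```

In this form, raising B penalises long documents more strongly, and raising K1 lets repeated occurrences of a term count for more before the weight levels off.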

$self->open_index();

This method is called to open the index database file. Underneath, this calls the tie function, with the parameters passed using the -db keyword when the object was initialized.

$self->close_index();

This method is called to close the index database file.

$self->clear_index();

This method can be used to clear the index database file, which should be open at the time.

$self->index_document(documentkey, string);

This is the method which adds a new document to the index. Your chosen document key can be passed as the first parameter: this value will be passed back to you when a search matches this document, but it can be more or less any string you like. The string is passed to the lexical analyser before the document is added to the free text index.

$self->add_document(documentid, documentkey, documentsize, word);

The internal method which adds a new document to the inverted file database. You shouldn't need to worry about this, as it will be called automatically by index_document.

$self->get_new_document_id(documentkey, documentsize);

An internal method which generates and allocates a new document id for the given document key, and updates the database to include it. This method is called automatically when a document is being indexed.

$self->search_with_callback(words, subroutine);

This is the core of the searching system. words can be either a string or an array of words; if it is a string, the lexical analyser is used to turn it into an array of words. This array is then used to search the index, and for each match the subroutine is called with the Search::FreeText instance as the first parameter (in case it's a method), followed by the document key, relevance measure, database handle, and internal document id. The last two parameters are not to be mucked about with!

$self->search(string, limit);

Searches the free text index, returning up to limit matches. Each match is returned as a reference to a two-element array: the first element is the document key, and the second is a relevance measure. The matches are sorted when they are returned.

Internally, the search method calls the search_with_callback method, but for most purposes, this is an easier way to get the matches that you need. However, under a few circumstances, search_with_callback may be needed to process the matches. For example, if the search needed to be filtered in some way, you could do this by overriding search_with_callback.
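
As an alternative to overriding search_with_callback, filtering can also be sketched by passing it a callback that keeps only the matches you want. The helper below is hypothetical (filtered_search is not part of the module). It assumes that the callback arguments arrive in the order described above, that the callback's return value is ignored, and that a larger relevance value means a better match.

```perl
use strict;
use warnings;

# Collect matches whose document key matches $pattern, best first.
# $index is assumed to be a Search::FreeText object with its index open.
sub filtered_search {
    my ($index, $query, $pattern, $limit) = @_;
    my @matches;

    $index->search_with_callback($query, sub {
        # Argument order as documented above; the database handle and
        # internal document id are received but deliberately left alone.
        my ($self, $key, $relevance, $dbh, $doc_id) = @_;
        push @matches, [$key, $relevance] if $key =~ $pattern;
        return;
    });

    # Assumption: higher relevance is better, so sort in descending order.
    @matches = sort { $b->[1] <=> $a->[1] } @matches;

    # Keep at most $limit matches when a limit was given.
    splice(@matches, $limit) if defined $limit && @matches > $limit;
    return @matches;
}
```

With an open index in $text, a call such as filtered_search($text, "crazy", qr/^http:/, 10) would return up to ten matches whose keys look like URLs, in the same [key, relevance] shape that search returns.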

CHANGES

0.05

Improved documentation to an almost acceptable standard, included tests, and quite a few other cleanups to make this the first essentially usable distribution.

0.04

Fixed a major performance problem in search_with_callback. 99% of the CPU time had nothing to do with searching, due to stupidly large amounts of backtracking in a pattern match, where we just wanted the end part of a string. Used rindex instead to achieve the same effect with a huge performance improvement.

0.03

Alpha-test distribution.

0.02

Fixed the module distribution to contain the proper version of Search.pm, not the version that was autogenerated by h2xs and which trampled the original.

0.01

Beginning of the Search::FreeText class.

AUTHORS

Stuart Watt <S.N.K.Watt@rgu.ac.uk>.

Copyright (c) 2003. The Robert Gordon University. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

VERSION

Version 0.05 - 18th March 2003