NAME
Search::Indexer - full-text indexer
SYNOPSIS
  use Search::Indexer;
  # feed the index
  my $ix = new Search::Indexer(dir => $dir, writeMode => 1);
  while (my ($docId, $docContent) = get_next_document() ) {
    $ix->add($docId, $docContent);
  }
  # search
  my $result = $ix->search('normal_word +mandatory_word -excludedWord "exact phrase"');
  my $scores = $result->{scores};
  my $n_docs = keys %$scores;
  my @best_docs = (sort {$scores->{$b} <=> $scores->{$a}} keys %$scores)[0 .. $max];
  my $killedWords = join ", ", @{$result->{killedWords}};
  # show results
  print "$n_docs documents found, displaying the first $max\n";
  print "words $killedWords were ignored during the search\n" if $killedWords;
  foreach my $docId (@best_docs) {
    my $excerpts = join "\n", $ix->excerpts(doc_content($docId), $result->{regex});
    print "DOCUMENT $docId (score $scores->{$docId}) :\n$excerpts\n\n";
  }
  # boolean search
  my $result2 = $ix->search('word1 AND (word2 OR word3) AND NOT word4');
  # removing a document
  $ix->remove($someDocId);
DESCRIPTION
This module builds a fulltext index for a collection of documents. It provides support for searching through the collection and displaying the sorted results, together with contextual excerpts of the original documents.
Unlike Search::Elasticsearch, which is a client to an indexing server, here we have an embedded index, running in the same process as your application. Index data is stored in BerkeleyDB databases, accessed through a C-code library, so indexing is fast; the storage format uses a compact binary encoding (see "Another Portable Binary Encoding" in perlpacktut), so it can accommodate large collections.
Documents
As far as this module is concerned, a document is just a buffer of plain text, together with a unique identifying number. The caller is responsible for supplying unique numbers, and for converting the original source (HTML, PDF, whatever) into plain text. Metadata about documents (fields like date, author, Dublin Core, etc.) must be handled externally, in a database or any other store. For collections of moderate size, a candidate for storing metadata could be File::Tabular, which uses the same query parser.
Search syntax
Searching requests may include plain terms, "exact phrases", '+' or '-' prefixes, boolean operators and parentheses. See Search::QueryParser for details.
Index files
The indexer uses three files in BerkeleyDB format: a) a mapping from words to wordIds; b) a mapping from wordIds to lists of documents; c) a mapping from pairs (docId, wordId) to lists of positions within the document. This third file holds detailed information and therefore uses more disk space, but it allows us to quickly retrieve "exact phrases" (sequences of adjacent words) in the document. Optionally, this positional information can be omitted, yielding smaller index files but less precise searches (a query for "exact phrase" will be downgraded to a search for all words in the phrase, even if not adjacent).
NOTE: the internal representation in v1.0 has slightly changed from previous versions; existing indexes are not compatible and must be rebuilt.
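To make the role of the positional file concrete, here is a minimal in-memory sketch of the three mappings using plain Perl hashes. The names %word_id, %docs_for_word, %positions and the helper has_phrase are hypothetical and chosen for illustration only; the module's actual on-disk layout is a packed BerkeleyDB format and is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the three index files.
my %word_id       = (quick => 1, fox => 2, jumped => 3);        # a) word   => wordId
my %docs_for_word = (1 => [10, 11], 2 => [10, 11], 3 => [10]);  # b) wordId => docIds
my %positions     = (                                           # c) "docId,wordId" => positions
    "10,1" => [4], "10,2" => [5], "10,3" => [6],  # doc 10: words are adjacent
    "11,1" => [2], "11,2" => [9],                 # doc 11: words present, not adjacent
);

# Returns 1 if @words occur at consecutive positions in document $doc_id.
sub has_phrase {
    my ($doc_id, @words) = @_;
    return 0 unless @words;
    my @ids;
    for my $word (@words) {
        my $id = $word_id{$word};
        return 0 unless defined $id;    # unknown word: the phrase cannot match
        push @ids, $id;
    }
    my $first = shift @ids;
    START: for my $start (@{ $positions{"$doc_id,$first"} || [] }) {
        my $expected = $start;
        for my $id (@ids) {
            $expected++;
            my $pos = $positions{"$doc_id,$id"} || [];
            next START unless grep { $_ == $expected } @$pos;
        }
        return 1;                       # every word found at the expected position
    }
    return 0;
}

print has_phrase(10, qw(quick fox jumped)) ? "match\n" : "no match\n";
print has_phrase(11, qw(quick fox))        ? "match\n" : "no match\n";
```

Without mapping c), only mappings a) and b) remain, so document 11 could not be distinguished from document 10 for the phrase "quick fox".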
Indexing steps
Indexing of a document buffer goes through the following steps:
- terms are extracted, according to the wregex regular expression
- extracted terms are normalized or filtered out by the wfilter callback function. This function can for example remove accented characters, perform lemmatization, suppress irrelevant terms (such as numbers), etc.
- normalized terms are eliminated if they belong to the stopwords list (list of common words to exclude from the index).
- remaining terms are stored, together with the positions where they occur in the document.
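The steps above can be sketched in plain Perl as follows. This is an illustration, not the module's code: the sub index_terms is hypothetical, the filter shown here (lowercasing and dropping pure numbers) is a simplified stand-in for a wfilter callback, and counting filtered-out words in the position numbering is an assumption.

```perl
use strict;
use warnings;

# Assumed stand-ins for the wregex, wfilter and stopwords parameters.
my $wregex  = qr/\p{Word}+/;
my $wfilter = sub {
    my ($word) = @_;
    $word = lc $word;                         # case folding
    return $word =~ /^\d+$/ ? undef : $word;  # drop pure numbers
};
my %stopwords = map { $_ => 1 } qw(the a of and);

# Returns a hashref: normalized term => arrayref of word positions.
sub index_terms {
    my ($buf) = @_;
    my %positions_for;
    my $pos = 0;
    while ($buf =~ /($wregex)/g) {            # step 1: extract terms
        $pos++;                               # assumption: every extracted word takes a position
        my $term = $wfilter->($1);            # step 2: normalize or filter out
        next unless defined $term && length $term;
        next if $stopwords{$term};            # step 3: eliminate stopwords
        push @{ $positions_for{$term} }, $pos; # step 4: store term with its position
    }
    return \%positions_for;
}

my $terms = index_terms("The quick fox and the 2 lazy dogs saw the Fox");
print "$_: @{ $terms->{$_} }\n" for sort keys %$terms;
```

Because "Fox" and "fox" normalize to the same term, both occurrences end up under a single entry with two positions.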
Related modules
This module depends on Search::QueryParser for analyzing requests and on BerkeleyDB for storing the indexes.
This module was originally designed together with File::Tabular; however it can be used independently. In particular, it is used in the Pod::POM::Web application for indexing all local Perl modules and documentation.
METHODS
Class methods
new(arg1 => expr1, ...)
Instantiates an indexer (either for a new index, or for accessing an existing index). Parameters are:
- dir
-
Directory for index files and possibly for the stopwords file. Defaults to the current directory.
- writeMode
-
Flag which must be set to true if the application intends to write into the index.
- wregex
-
Regex for matching a word (qr/\p{Word}+/ by default). Used by both the add and search methods. The regex should not contain any capturing parentheses (use non-capturing parentheses (?: ... ) instead).
- wfilter
-
Ref to a callback sub that may normalize or eliminate a word. The default wfilter performs case folding and translates accented characters into their non-accented form.
- stopwords
-
List of words that will be marked into the index as "words to exclude". Stopwords are stored in the index, so they need not be supplied again when opening an index for searches or updates.
The list may be supplied either as a ref to an array of scalars, or as the name of a file containing the stopwords (full pathname or filename relative to dir).
- fieldname
-
This parameter only affects the search method. Search queries are passed to a general parser (see Search::QueryParser). Then, before being applied to the present indexer module, queries are pruned of irrelevant items. Query items are considered relevant if they have no associated field name, or if the associated field name is equal to this fieldname.
Below are some additional parameters that only affect the "excerpts" method:
- ctxtNumChars
-
Number of characters determining the size of contextual excerpts returned by the "excerpts" method. A contextual excerpt is a part of the document text, containing a matched word surrounded by ctxtNumChars characters to the left and to the right. Default is 35.
- maxExcerpts
-
Maximum number of contextual excerpts to retrieve per document. Default is 5.
- preMatch
-
String to insert in contextual excerpts before a matched word. Default is "<b>".
- postMatch
-
String to insert in contextual excerpts after a matched word. Default is "</b>".
- positions
-
my $indexer = new Search::Indexer(dir => $dir, writeMode => 1, positions => 0);
Truth value to tell whether or not, when creating a new index, word positions should be stored. The default is true.
If you turn it off, index files will be smaller, indexing will be faster, but results will be less precise, because the indexer can no longer find "exact phrases". So if you type "quick fox jumped", the query will be translated into quick AND fox AND jumped, and therefore will retrieve documents in which those three words are present, even if not in the required order or proximity.
- bm25_k1
-
Value of the k1 constant to be used when computing the Okapi BM25 ranking function (https://fr.wikipedia.org/wiki/Okapi_BM25). Default is 1.2.
- bm25_b
-
Value of the b constant to be used when computing the Okapi BM25 ranking function (https://fr.wikipedia.org/wiki/Okapi_BM25). Default is 0.75.
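To show what the bm25_k1 and bm25_b constants control, here is a sketch of the textbook Okapi BM25 contribution of one query term to one document. The sub bm25_term_score and its argument names are hypothetical, and the IDF variant used here is one common formulation; the exact variant computed by this module may differ.

```perl
use strict;
use warnings;

# Score contribution of one query term for one document (textbook Okapi BM25).
# Arguments: n_docs (collection size), doc_freq (documents containing the term),
# term_freq (occurrences in this document), doc_len, avg_len, optional k1 and b.
sub bm25_term_score {
    my (%arg) = @_;
    my $k1      = $arg{k1} // 1.2;   # documented default of bm25_k1
    my $b_param = $arg{b}  // 0.75;  # documented default of bm25_b

    # Inverse document frequency: rare terms weigh more.
    my $idf = log(1 + ($arg{n_docs} - $arg{doc_freq} + 0.5) / ($arg{doc_freq} + 0.5));

    # Length normalization: b controls how strongly long documents are penalized.
    my $norm = 1 - $b_param + $b_param * $arg{doc_len} / $arg{avg_len};

    # k1 controls how quickly repeated occurrences stop adding to the score.
    return $idf * $arg{term_freq} * ($k1 + 1) / ($arg{term_freq} + $k1 * $norm);
}

my %common = (n_docs => 1000, doc_freq => 10, term_freq => 3, avg_len => 200);
my $short  = bm25_term_score(%common, doc_len => 100);
my $long   = bm25_term_score(%common, doc_len => 400);
printf "short doc: %.3f, long doc: %.3f\n", $short, $long;
```

With b set to 0, document length no longer influences the score; with b set to 1, length normalization is applied in full.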
has_index_in_dir($dir)
Checks for presence of the three *.bdb files in the given $dir.
Building the index
add($docId, $buf)
Adds a new document to the index. $docId is the unique identifier for this doc (the caller is responsible for uniqueness). Doc ids need not be consecutive. $buf is a scalar containing the text representation of this doc.
remove($docId [, $buf])
Removes a document from the index. If the index contains word positions (true by default), then only the docId is needed; however, if the index was created without word positions, then the text representation of the document must be given as a scalar string in the second argument (of course this text should be the same as the one that was supplied when calling the "add" method).
Searching the index
search($queryString, [ $implicitPlus ])
Searches the index. The query string may be a simple word or a complex boolean expression, as described above in the "DESCRIPTION" section; precise technical details are documented in Search::QueryParser. The second argument $implicitPlus is optional; if true, all words without any prefix will implicitly take the prefix '+' (all become mandatory words).
The return value is a hashref containing:
- scores
-
hash ref, where keys are docIds of matching documents, and values are the corresponding relevancy scores, computed according to the Okapi BM25 algorithm (https://fr.wikipedia.org/wiki/Okapi_BM25). Documents with the highest scores are the most relevant.
- killedWords
-
ref to an array of terms from the query string which were ignored during the search (because they were filtered out or were stopwords)
- regex
-
ref to a regular expression corresponding to all terms in the query string. This will be useful if you later want to get contextual excerpts from the found documents (see the "excerpts" method).
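As an illustration of consuming this hashref, the sketch below picks the best-scoring documents without reading past the end of the result list. The $result value is a hand-made stand-in shaped like the structure described above (its regex is only a plausible example), and best_docs is a hypothetical helper, not a method of this module.

```perl
use strict;
use warnings;

# Hypothetical result, shaped like the hashref returned by search().
my $result = {
    scores      => { 12 => 2.7, 7 => 4.1, 33 => 0.9 },
    killedWords => ['the'],
    regex       => qr/\b(?:quick|fox)\b/i,
};

# Returns up to $max docIds, best score first (ties broken by docId).
sub best_docs {
    my ($scores, $max) = @_;
    my @sorted = sort { $scores->{$b} <=> $scores->{$a} || $a <=> $b } keys %$scores;
    my $last = $max < @sorted ? $max - 1 : $#sorted;  # never slice past the end
    return @sorted[0 .. $last];
}

my @best = best_docs($result->{scores}, 2);
print "best: @best\n";    # prints "best: 7 12"
print "ignored: @{ $result->{killedWords} }\n" if @{ $result->{killedWords} };
```

Bounding the slice matters when fewer documents match than the caller asked for: an unbounded slice would yield undefined entries.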
excerpts(buf, regex)
Searches buf for occurrences of regex, extracts the occurrences together with some context (a number of characters to the left and to the right), and highlights the occurrences. See parameters ctxtNumChars, maxExcerpts, preMatch, postMatch of the "new" method.
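The behaviour described above can be approximated in a few lines of plain Perl. The sub sketch_excerpts is hypothetical; its option names and defaults mirror the documented parameters, but it is an illustration under those assumptions and not the module's own implementation (for instance, it does not merge overlapping excerpts).

```perl
use strict;
use warnings;

# Returns a list of excerpts: each match of $regex in $buf, highlighted and
# surrounded by up to ctxtNumChars characters of context on each side.
sub sketch_excerpts {
    my ($buf, $regex, %opt) = @_;
    my $ctxt = $opt{ctxtNumChars} // 35;
    my $max  = $opt{maxExcerpts}  // 5;
    my $pre  = $opt{preMatch}     // '<b>';
    my $post = $opt{postMatch}    // '</b>';

    my @excerpts;
    while (@excerpts < $max && $buf =~ /$regex/g) {
        my ($start, $end) = ($-[0], $+[0]);             # offsets of the current match
        my $from   = $start > $ctxt ? $start - $ctxt : 0;
        my $before = substr($buf, $from, $start - $from);
        my $match  = substr($buf, $start, $end - $start);
        my $after  = substr($buf, $end, $ctxt);
        push @excerpts, "$before$pre$match$post$after";
    }
    return @excerpts;
}

my $text = "The quick brown fox jumped over the lazy dog while another fox watched.";
print "$_\n" for sketch_excerpts($text, qr/\bfox\b/i, ctxtNumChars => 10, maxExcerpts => 2);
```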
Other public methods
indexed_words_for_prefix($prefix)
Returns a ref to an array of words found in the dictionary, starting with the given prefix. For example, $ix->indexed_words_for_prefix("foo") will return "foo", "food", "fool", "footage", etc.
dump()
Debugging function that prints indexed words with lists of associated docs.
AUTHOR
Laurent Dami, <dami@cpan.org>
LICENSE AND COPYRIGHT
Copyright 2005, 2007, 2021 Laurent Dami.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.