NAME
OurNet::FuzzyIndex - Inverted search for double-byte characters
SYNOPSIS
use OurNet::FuzzyIndex;
my $idxfile = 'test.idx'; # Name of the database file
my $pagesize = undef; # Page size (twice the size of an average record)
my $cache = undef; # Cache size (undef to use default)
my $subdbs = 0; # Number of child dbs; 0 for none
# Initialize the DB from scratch
unlink $idxfile if -e $idxfile;
my $db = OurNet::FuzzyIndex->new($idxfile, $pagesize, $cache, $subdbs);
# Index a record: key = 'Doc1', content = 'Some text here'
$db->insert('Doc1', 'Some text here');
# Alternatively, parse the content first with different weights
my %words = $db->parse("Some other text here", 5);
%words = $db->parse_xs("Some more texts here", 2, \%words);
# Then index the resulting hash with 'Doc2' as its key
$db->insert('Doc2', \%words);
# Perform a query: the 2nd argument is the match-type flag
my %result = $db->query('search for some text', $MATCH_FUZZY);
# Combine the result with another query
%result = $db->query('more please', $MATCH_NOT, \%result);
# Dump the results; note you have to call $db->getkey each time
foreach my $idx (sort { $result{$b} <=> $result{$a} } keys %result) {
    my $val = $result{$idx};
    print "Matched: ".$db->getkey($idx)." (score $val)\n";
}
# Set database variables
$db->setvar('variable', "fetch success!\n");
print $db->getvar('variable');
# Get all records: the optional 0 says we want an array of keys
print "These records are indexed:\n";
print join(',', $db->getkeys(0));
# Alternatively, get it with its internal index number
my %allkeys = $db->getkeys(1);
DESCRIPTION
OurNet::FuzzyIndex implements a simple consecutive-letter indexing mechanism designed specifically for multi-byte encodings such as Big5 or UTF-8.
It uses DB_File to create an associative mapping from each character to the character that follows it, utilizing DB_BTREE's duplicate key feature to speed up queries. Its scoring algorithm is also geared to reduce the impact of redundant words on a query's result.
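The consecutive-character mapping described above can be sketched in plain Perl. This is a minimal illustration only: it uses an in-memory hash of arrays as a stand-in for DB_BTREE's duplicate keys, and the names %index, index_text and docs_with_pair are hypothetical, not part of the module.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in for the on-disk B-tree: each character maps to a
# list of [next character, document ID] pairs, mimicking duplicate keys.
my %index;

sub index_text {
    my ($doc_id, $text) = @_;
    my @chars = split //, $text;
    for my $i (0 .. $#chars - 1) {
        push @{ $index{ $chars[$i] } }, [ $chars[$i + 1], $doc_id ];
    }
}

# List the documents in which $first is immediately followed by $second.
sub docs_with_pair {
    my ($first, $second) = @_;
    my %seen;
    for my $entry (@{ $index{$first} || [] }) {
        $seen{ $entry->[1] } = 1 if $entry->[0] eq $second;
    }
    return sort keys %seen;
}

index_text(1, "資料庫索引");
index_text(2, "索引檔案");

my @hits = docs_with_pair("索", "引");
print "@hits\n";    # prints "1 2"
```

Looking up one character yields every character that follows it anywhere in the indexed text, so a two-character phrase costs a single key lookup plus a scan of that key's duplicates.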
This module also supports a distributed-database option, which lets each query access only a small portion of the database.
Although this module currently supports only the Big5 encoding internally, you can replace the parse.c module to extend it, or add your own translation maps.
METHODS
OurNet::FuzzyIndex->new($dbfile, [ $pagesize, $cachesize, $split, $submin, $submax ])
The constructor method; normally only needs the first argument.
$self->parse($content, [$weight], [\%words])
Parses $content into two-word chunks, stored as keys in %words, with values equal to their occurrence counts multiplied by $weight (defaults to 1). May also be invoked as a normal function without $self.
Returns the hash (or hash reference in scalar context) representing the parsed words and frequency.
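The behaviour described for parse() can be approximated in pure Perl as follows. This is a simplified sketch under stated assumptions: sketch_parse is a hypothetical name, it chunks by single characters and ignores whitespace, and the module's real parser in parse.c may treat ASCII words and punctuation differently.

```perl
use strict;
use warnings;
use utf8;

# sketch_parse is a hypothetical stand-in, not the module's parse().
sub sketch_parse {
    my ($content, $weight, $words) = @_;
    $weight = 1 unless defined $weight;
    $words ||= {};

    # Assumption: whitespace is dropped before chunking.
    my @chars = grep { /\S/ } split //, $content;
    for my $i (0 .. $#chars - 1) {
        $words->{ $chars[$i] . $chars[$i + 1] } += $weight;
    }

    # Hash in list context, hash reference otherwise.
    return wantarray ? %$words : $words;
}

# Weight 5 for the first text, then add a second text with weight 2.
my %words = sketch_parse("資料庫資料", 5);
sketch_parse("資料", 2, \%words);

print $words{"資料"}, "\n";    # prints 12 (2 * 5 + 1 * 2)
```

Passing the hash reference as the third argument accumulates into the same hash, which is how differently weighted texts (for example a title and a body) can be merged before a single insert().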
$self->parse_xs($content, [$weight], [\%words])
Same as parse(), but implemented in XS.
$self->insert($key, [$content | \%words])
Inserts an entry, given either as unparsed text in $content or as an already-parsed hash in \%words. The $key is the name of the entry in the database.
Returns the database ID of the newly created entry.
$self->query($query, $flag, [\%match])
Perform a query on the database represented by $self; $query contains a free-form query string. The type of query is specified by $flag, as one of the constants below:
- MATCH_FUZZY (default)
  Match the query string with fuzzy scoring heuristics.
- MATCH_EXACT
  Match the exact string $query.
- MATCH_PART
  Match each individual character fuzzily, in addition to normal fuzzy matching.
- MATCH_NOT
  Only match entries that have none of the phrases in the query string.
The %match hash, if specified, contains the result of a previous query(), and indicates that this is a subquery limited by the previous search.
Returns the hash (or hash reference in scalar context) containing the matched entry IDs as keys, and their scores as values.
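The effect of passing a previous result as %match can be pictured with the sketch below. It is not the module's implementation: narrow() is a hypothetical helper, and the way scores are combined (simple addition) is an assumption made only for illustration.

```perl
use strict;
use warnings;

# narrow() is a hypothetical helper, not part of OurNet::FuzzyIndex.
# %$previous: ID => score from an earlier query.
# %$hits:     ID => score for the entries matching the new query string.
# $negate:    true to mimic a MATCH_NOT subquery.
sub narrow {
    my ($previous, $hits, $negate) = @_;
    my %result;
    for my $id (keys %$previous) {
        if ($negate) {
            # Keep earlier matches that the new query did NOT hit.
            $result{$id} = $previous->{$id} unless exists $hits->{$id};
        }
        elsif (exists $hits->{$id}) {
            # Assumption: scores of both passes are simply added.
            $result{$id} = $previous->{$id} + $hits->{$id};
        }
    }
    return wantarray ? %result : \%result;
}

my %first  = (1 => 30, 2 => 20, 3 => 10);
my %second = (2 => 5, 4 => 50);

my %both = narrow(\%first, \%second);       # only ID 2 survives
my %not  = narrow(\%first, \%second, 1);    # IDs 1 and 3 survive

print join(',', map { "$_=$both{$_}" } sort keys %both), "\n";    # 2=25
print join(',', map { "$_=$not{$_}" }  sort keys %not),  "\n";    # 1=30,3=10
```

Either way, an entry that was absent from the earlier result (ID 4 here) never appears in the subquery's result.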
$self->sync()
Synchronizes the in-memory records to disk.
$self->setvar($varname, $value)
Sets a user-defined variable in the database. Such variables do not affect operations on the database.
$self->getvar($varname)
Returns the value of a previously set variable, or undef if no such variable exists.
$self->getvars($partial, [$wanthash])
Get all variables beginning with $partial; returns an array of the variable names, or a hash with the variable values as hash values if $wanthash is true.
$self->getkey($seq)
Returns the name of the entry with $seq as the ID, or undef if there is no such entry. Usually called after a query() to fetch the matched entries.
$self->findkey($key)
Find the ID of the entry with the name $key; the reverse operation of getkey().
$self->delete($key)
Delete the entry with the name $key.
$self->delkey($seq)
Delete the entry with the ID $seq. This function's name is a bit of a misnomer; sorry about that.
$self->getkeys([$wanthash])
Return all entry names as an array, or as a hash with their IDs as hash values if $wanthash is true.
$self->_store($varname, $value)
Private function to store an internal variable to the database. Do not call this directly.
CAVEATS
The query() function uses a time-consuming callback function, _parse_q(), to parse the query string; it is expected to be replaced by a simple function that returns the whole processed list. (Fortunately, most query strings won't be long enough to make a significant difference.)
The MATCH_EXACT flag is misleading; FuzzyIndex cannot tell whether a query matches the content exactly from the information stored in the index file alone. You are encouraged to write your own grep-like post filter.
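A grep-like post filter of the kind suggested above might look like the following sketch. exact_filter() is a hypothetical helper, and it assumes the caller can fetch the original text of an entry by name (here from a plain hash), since the index alone does not hold it.

```perl
use strict;
use warnings;

# exact_filter() is a hypothetical helper, not part of OurNet::FuzzyIndex.
# $fetch is a caller-supplied code reference returning the original text
# for an entry name, or undef if the text is unavailable.
sub exact_filter {
    my ($query, $fetch, @names) = @_;
    return grep {
        my $text = $fetch->($_);
        defined $text && index($text, $query) >= 0;
    } @names;
}

# Stand-in for wherever the original documents are kept.
my %content = (
    Doc1 => 'Some text here',
    Doc2 => 'Some other text here',
);

my @candidates = ('Doc1', 'Doc2');
my @exact = exact_filter('other text', sub { $content{ $_[0] } }, @candidates);
print "@exact\n";    # prints "Doc2"
```

In practice the candidate names would come from calling getkey() on each ID returned by query(), as in the SYNOPSIS loop.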
TODO
Internal handling of locale/unicode mappings
Boolean / selective search using combined MATCH_* flags
Fix bugs concerning sub_dbs, or deprecate them altogether
Use Lingua::ZH::TaBE for better word-segmenting algorithms
SEE ALSO
fzindex, fzquery, OurNet::ChatBot
AUTHORS
Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.
COPYRIGHT
Copyright 2001, 2003 by Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.