NAME
Text::Prefix::XS - Fast prefix searching
SYNOPSIS
use Text::Prefix::XS;
my @haystacks = qw(
garbage
blarrgh
FOO
meh
AA-ggrr
AB-hi!
);
my @needles = qw(AAA AB FOO FOO-BAR);
my $search = prefix_search_create( map uc($_), @needles );
my %seen_hash;
foreach my $haystack (@haystacks) {
if(my $prefix = prefix_search($search, $haystack)) {
$seen_hash{$prefix}++;
}
}
$seen_hash{'FOO'} == 1;
#Compare to:
my $re = join('|', map quotemeta $_, @needles);
$re = qr/^($re)/;
foreach my $haystack (@haystacks) {
my ($match) = ($haystack =~ $re);
if($match) {
$seen_hash{$match}++;
}
}
$seen_hash{'FOO'} == 1;
DESCRIPTION
This module implements something of an trie algorithm for matching (and extracting) prefixes from text strings.
A common application I had was to pre-filter lots and lots of text for a small amount of preset prefixes.
Interestingly enough, the quickest solution until I wrote this module was to use a large regular expression (as in the synopsis)
FUNCTIONS
The interface is relatively simple. This is alpha software and the API is subject to change
prefix_search_create(@prefixes)
Create an opaque prefix search handle. It returns a thingy, which you should keep around.
Internally it will order the elements in the list, with the longest prefix being first.
It will then construct a search trie using a variety of caching and lookup layers.
prefix_search($thingy, $haystack)
Will check $haystack
for any of the prefixes in @needles
passed to "prefix_search_create". If $haystack
has a prefix, it will be returned by this function; otherwise, the return value is undef
PERFORMANCE
This module performs better than regex under any circumstance. In the future, a benchmark table will be posted - but on average, it's about 30-50% quicker than a regex.
This module would be even quicker if there were some way to implement this as an actual OP
rather than an xsub
call. But the performance is quite nice anyway
SEE ALSO
There are quite a few modules out there which aim for a Trie-like search, but they are all either not written in C, or would not be performant enough for this application.
NOTES / TODO
While my implementation is probably sloppy, the simplicity of the search itself makes it very quick and cruftless. When doing a prefix search on a large amount of text, but with a small number of prefixes, the reduction of overhead is the most important optimization for gaining performance.
CAVEATS
Private perl data structures are allocated internally, therefore it wouldn't do good to use this module across threads. Also, memory leaks will ensue if you destroy the search object. But, search handles are expensive to create, and are assumed to be made for relatively static 'needles'.
This is only because the developer is lazy and tired. Most of this should be fixed in a stable release
AUTHOR AND COPYRIGHT
Copyright (C) 2011 M. Nunberg
You may use and distribute this software under the same terms, conditions, and licensing as Perl itself.