NAME

Text::Levenshtein::Flexible - XS Levenshtein distance calculation with bounds and costs

SYNOPSIS

use Text::Levenshtein::Flexible;

DESCRIPTION

Yet another Levenshtein module written in C, but a tad more flexible than the rest.

This module uses code from PostgreSQL's levenshtein distance function to provide the following features on top of plain distance calculation as it is done by Levenshtein::XS and others:

Distance-limited calculation: if a certain maximum distance is exceeded, the algorithm aborts without result. This helps performance in situations where it is clear that results over a certain maximum are not useful.
Configurable costs for insert, delete ans substitute operations. The traditional Levenshtein algorithm assumes equal cost for insertion and deletion but modifying these allows preferential correction of certain errors.

EXPORT

Nothing is exported by default.

Exportable

The following functions can be exported upon request, e.g.:

use Text::Levenshtein::Flexible qw( levenshtein levenshtein_l_all );
levenshtein
levenshtein_c
levenshtein_l
levenshtein_lc
levenshtein_l_all
levenshtein_lc_all

Procedural interface

The functions listed under "Exportable" consitute the module's procedural API. Neither the names nor the huge parameter lists are particularly pretty so the OO interface is usually recommended.

levenshtein($src, $dst)

Plain Levenshtein distance calculation between the two strings $src and $dst. Always returns an integer. If the strings are too long (currently there is a hard-coded limit of 255 characters), the function may die(), so call it in an eval block if this is a possibility.

levenshtein_c($src, $dst, $cost_ins, $cost_del, $cost_sub)

Distance between the two strings $src and $dst using the specified costs for insertion, deletion and substitution respectively. Always returns an integer unless it dies.

levenshtein_l($src, $dst, $max_distance)

Distance between $src and $dst unless it is bigger than $max_distance (think _limit!), in which case undef is returned. May die just like the other functions.

levenshtein_lc($src, $dst, $max_distance, $cost_ins, $cost_del, $cost_sub)

Distance between $src and $dst using the specified costs, up to $max_distance,

levenshtein_l_all($max_distance, $src, @dst)

For an array @dst of strings, return all that are up to $max_distance from $src. The result is a list of 2-element arrays consisting of string-distance pairs. To get a list of strings sorted by distance:

map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
levenshtein_l_all(2, "bar", "foo", "blah", "baz");

Note that since the *_all functions were converted to XS as well, this function delegates to the OO version internally to avoid too much XS code duplication, so the OO interface is preferable for this in any case.

levenshtein_lc_all($max_distance, $cost_ins, $cost_del, $cost_sub, $src, @dst)

For an array @dst of strings, return all that are up to $max_distance from $src when using the specified costs as in levenshtein_c. The result is the same as for levenshtein_l_all and the remark about the OO version applies equally here.

Note that there is no levenshtein_all() function because it is trivial to write using map.

Object-oriented interface

The OO API will usually be more convenient except for trivial calculations because it allows to specify limits and costs once and pass only variable data to object methods. Being implemented in C/XS it is just as fast as the procedural one, or faster in the case of the list functions.

new($max_distance, $cost_ins, $cost_del, $cost_sub)

All four constructor arguments are optional but must be defined if they are used, i.e. you have to specify a number for $max_distance if you want to use the costs. Pass 1 for costs and some number over 255 times the largest of the cost values for $max_distance (passing something significantly bigger doesn't hurt, in case the hardcoded limit for calculations should grow some day) if you don't care.

distance($src, $dst)

Just for orthogonality, this does the same as levenshtein().

distance_c($src, $dst)

Just like levenshtein_c() but using the previously specified costs.

distance_l($src, $dst)

levenshtein_l()'s modern brother.

distance_lc($src, $dst)

The nicer variant of levenshtein_lc().

distance_l_all($src, @dst)

Not quite as ugly but otherwise equivalent to levenshtein_l_all().

distance_lc_all($src, @dst)

Where levenshtein_lc_all() gets really nasty, this does the same in a saner way.

Of course there's no distance_all() method either.

Speed

According to a few completely made-up benchmarks, Text::Levenshtein::Flexible is at least as fast as either Text::Levenshtein::XS or Text::Fuzzy (Core i5 920) and between 25% and 48% faster on some systems (Phenom II X6 1090T). For pure 8-bit character sets, Text::LevenshteinXS is usually a tad faster, but it can't deal with multibyte characters at all. A small benchmark script is included to test on your system, I'd be interested to hear about any unexpectedly good or bad performance.

SEE ALSO

Text::Levenshtein::XS Text::LevenshteinXS Text::Fuzzy

Dont even bother with anything else unless you're more interested in the algorithm than in practical applications as the algorithm is one of the better examples for something reasonably efficient in C that Perl is terrible at.

To find this module's lastest updates that are not on CPAN yet, check https://github.com/mbethke/Text-Levenshtein-Flexible

ACKNOWLEDGEMENTS

All the credit for speed and algorithmic cleverness goes to Joe Conway and Volkan Yazici who wrote the bulk of this module's code, originally for PostgreSQL.

AUTHOR

Matthias Bethke, <matthias@towiski.de>

COPYRIGHT AND LICENSE

Copyright (C) 2014-2018 by Matthias Bethke

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available. Significant portions of the code are (C) PostgreSQL Global Development Group and The Regents of the University of California. All modified versions must retain the file COPYRIGHT included in the distribution.