NAME

String::Similarity::Group - take a list of strings and group them by similarity within a threshold

SYNOPSIS

use String::Similarity::Group ':all';

my @elements = qw/victory victori matrix latrix ooland/;   

my @groups   = groups( 0.8, \@elements );
# ( [ 'victory', 'victori' ], [ 'matrix', 'latrix' ] )

my @loners   = loners( 0.8, \@elements );
# ( 'ooland' )

# Which of the elements most closely matches a string?
my($element, $score) = similarest(\@elements, 'oland');
# ( 'ooland', 0.83 )

DESCRIPTION

Imagine you have a list of filenames and you want to group them by similarity. Pass the minimum similarity required for a match and a list of strings, and you get back an array of groups (array refs of similar elements).

Or you have a list of strings and want to know which one is most similar to a control string.

Or you have a list of names of people and want to know which are the most unique of the bunch.

SUBS

None exported by default.

groups()

The first argument is the similarity minimum. The second argument is an array ref of the strings in question.

Returns an array. Each element of this array is an array ref holding a group of elements whose similarity is at least your similarity minimum argument. An element without at least one match is left out.

my @groups = groups( 0.80, [ qw/vitamins vtamins vitanims profile/] );

lazy matching vs hard matching

As we group, the first element of each group is the 'group leader'. By default, we test each element only against the group leaders and pick the highest match. This greatly reduces the number of similarity comparisons and still provides great results (tests included in the distro).

You may, however, want stricter or lazier matching.

With lazy matching, we stop at the first positive match and classify the element into that group. With the default and with hard matching, we keep evaluating until we have the best possible match before classifying the element. Hard matching is important when you have set a low similarity minimum, to get more accurate results.

groups()

Default. Finds the closest group leader to group by.

groups_hard()

Thorough grouping. Finds the closest element in every group to group by.

groups_lazy()

Laziest grouping. Stops at the first matching group leader without comparing the others.
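The difference between stopping at the first matching leader and picking the best-matching leader can be sketched in plain Perl. This is a minimal illustration under stated assumptions, not the module's implementation: lcs_similarity() is a stand-in score (twice the longest-common-subsequence length divided by the combined length), and group_by_leader() with its $lazy flag is a hypothetical name. Hard matching against every element of every group is not covered here.

```perl
use strict;
use warnings;

# Stand-in similarity (assumption, not taken from String::Similarity):
# 2 * LCS length / combined length, giving a score from 0 to 1.
sub lcs_similarity {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    return 0 unless @s && @t;
    my @prev = (0) x (@t + 1);
    for my $i (1 .. @s) {
        my @cur = (0);
        for my $j (1 .. @t) {
            $cur[$j] = $s[$i - 1] eq $t[$j - 1]
                ? $prev[$j - 1] + 1
                : ($prev[$j] > $cur[$j - 1] ? $prev[$j] : $cur[$j - 1]);
        }
        @prev = @cur;
    }
    return 2 * $prev[-1] / (@s + @t);
}

# Hypothetical helper: compare each string to group leaders only.
# $lazy true:  join the first leader that reaches $min.
# $lazy false: check every leader and join the best-scoring one.
sub group_by_leader {
    my ($min, $strings, $lazy) = @_;
    my @groups;    # each group is an array ref; element 0 is the leader
    for my $str (@$strings) {
        my ($best, $best_score);
        for my $g (@groups) {
            my $score = lcs_similarity($g->[0], $str);
            next if $score < $min;
            if (!defined $best_score || $score > $best_score) {
                ($best, $best_score) = ($g, $score);
            }
            last if $lazy;
        }
        if ($best) { push @$best, $str }
        else       { push @groups, [$str] }
    }
    return grep { @$_ > 1 } @groups;    # drop elements that matched nothing
}

for my $lazy (1, 0) {
    my @g = group_by_leader(0.4, [qw/aaaa bbbb aabbb/], $lazy);
    print(($lazy ? 'lazy' : 'best'), ': ', join(' | ', map { "@$_" } @g), "\n");
}
# lazy: aaaa aabbb
# best: bbbb aabbb
```

With a low minimum such as 0.4, 'aabbb' clears the threshold for the first leader 'aaaa', so the lazy pass stops there, while the best-leader pass goes on to find the closer 'bbbb'. This is the kind of case where a low similarity minimum makes the more thorough variants worthwhile.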

loners()

The first argument is the similarity minimum. The second argument is an array ref of the strings in question.

Returns an array of every element that does not group.

If an element has at least one match, it is left out. For example, if your list has very different strings and you set the minimum high:

my @elements = qw/victory victori couples singling/;
my @loners = loners( 0.9, \@elements );
scalar @loners == 2 or die;
# @loners contains couples, singling

Another way to explain what this does: I have a list of 50 names of people. Which are the most unique ones?

similarest()

Arguments are an array ref, a string, and optionally a similarity minimum. Returns an array: the first element is the element that matches highest, the second is the similarity score.

For example, you have a list of names and want to see which of them is most similar to the name 'Paul':

my @names = qw/Paula James Marcus Gregg/;
my( $closest_name, $score ) = similarest( \@names, 'Paul' );

What if you only want a result when the similarity is at least 0.8?

my @names = qw/Paula James Marcus Gregg/;
my( $closest_name, $score ) = similarest( \@names, 'Paul', 0.8 );

If none matches (all scores are 0, or all scores are below your similarity minimum argument), returns undef.

sort_by_similarity()

Arguments are an array ref, a string, and optionally a similarity minimum. Returns an array or an array ref, depending on context.

Elements are ordered from highest to lowest similarity.

my @a = qw/assumptionatedexpress socount acount/;
my @b = sort_by_similarity( \@a, 'acoun' ); # returns acount socount assumptionatedexpress

my @c = sort_by_similarity( \@a, 'acoun', 0.9 ); # returns acount
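The ordering and the optional minimum described above can be sketched in plain Perl with a score-then-sort pass. This is only an illustration under stated assumptions: lcs_similarity() is a stand-in score (twice the longest-common-subsequence length divided by the combined length) rather than the scoring the module actually uses, and sort_by_score() is a hypothetical name, not part of this package.

```perl
use strict;
use warnings;

# Stand-in similarity (assumption): 2 * LCS length / combined length.
sub lcs_similarity {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    return 0 unless @s && @t;
    my @prev = (0) x (@t + 1);
    for my $i (1 .. @s) {
        my @cur = (0);
        for my $j (1 .. @t) {
            $cur[$j] = $s[$i - 1] eq $t[$j - 1]
                ? $prev[$j - 1] + 1
                : ($prev[$j] > $cur[$j - 1] ? $prev[$j] : $cur[$j - 1]);
        }
        @prev = @cur;
    }
    return 2 * $prev[-1] / (@s + @t);
}

# Hypothetical helper: score every string once, drop those below $min,
# then order the rest from highest to lowest score.
sub sort_by_score {
    my ($strings, $target, $min) = @_;
    $min = 0 unless defined $min;
    return map  { $_->[0] }
           sort { $b->[1] <=> $a->[1] }
           grep { $_->[1] >= $min }
           map  { [ $_, lcs_similarity($_, $target) ] } @$strings;
}

my @names = qw/assumptionatedexpress socount acount/;
print join(' ', sort_by_score(\@names, 'acoun')), "\n";
# acount socount assumptionatedexpress
print join(' ', sort_by_score(\@names, 'acoun', 0.9)), "\n";
# acount
```

Scoring each string once before sorting keeps the number of similarity computations equal to the length of the list, instead of recomputing scores inside the sort comparison.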

similarity minimum

See String::Similarity. This is the minimum score required for a positive match, a float from 0.00 to 1.00.

Varying the similarity minimum

If you relax or tighten the similarity minimum, you get different results.

my @groups   = groups( 0.80, [qw/victory victori matrix latrix ooland/] );
# ( [ 'victory', 'victori' ], [ 'matrix', 'latrix' ] )

If you set the minimum very low, we tolerate just about any match:

my @groups   = groups( 0.05, [qw/victory victori matrix latrix ooland/] );
# ( [ 'victory', 'victori', 'matrix', 'latrix', 'ooland' ] )

Thus we deem there to be one group, which in this case is not very useful.

CAVEATS

This is alpha software. Still in development.

DISCUSSION

If you use the same element twice, it is ignored. This is because of the internals. That means if you do this:

my @groups   = groups( 0.8, [qw/victory victory/] );

Nothing is returned!
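If exact duplicates matter to you, one possible workaround is to detect them yourself before grouping. The following is a minimal sketch in plain Perl; split_duplicates() is a hypothetical helper and not part of this package.

```perl
use strict;
use warnings;

# Hypothetical helper: return the unique strings in first-seen order,
# plus a hash ref counting how often each string occurred.
sub split_duplicates {
    my ($strings) = @_;
    my (%count, @unique);
    for my $str (@$strings) {
        push @unique, $str unless $count{$str}++;
    }
    return (\@unique, \%count);
}

my ($unique, $count) = split_duplicates([qw/victory victory victori/]);
my @repeated = grep { $count->{$_} > 1 } @$unique;

print "unique: @$unique\n";      # unique: victory victori
print "repeated: @repeated\n";   # repeated: victory
```

The unique list is what one might then hand to groups(), while the repeated strings are already known to be exact matches of each other.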

If you lower the similarity minimum a lot, one group tends to accumulate far more matches than the others. This is because internally we test against every element in a group to make a positive group match.

Should this work differently? See any bugs? Have any suggestions? Please contact the AUTHOR.

SEE ALSO

String::Similarity - excellent package for judging the similarity between strings.

gbs - command-line interface, included in this package.

AUTHOR

Leo Charre leocharre at cpan dot org

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the "GNU General Public License" for more details.