NAME
Text::PhraseDistance - A measure of the degree of proximity of 2 given phrases
SYNOPSIS
use Text::PhraseDistance qw(pdistance);
sub distance {
#your own implementation of a distance between strings
#
#that needs 2 strings (2 arguments) and returns a number
}
# otherwise you can use Text::Levenshtein or others, e.g.
# use Text::Levenshtein qw(distance);
my $phrase1="a yellow dog";
my $phrase2="a dog yellow";
my $set="abcdefghijklmnopqrstuvwxyz";
print pdistance($phrase1,$phrase2,$set,\&distance);
DESCRIPTION
This module provides a way to compare two phrases and to measure their proximity. In this context, a phrase is a group of words, each formed from a given set of characters and separated by elements of the complement of that set. E.g. if the set is [abcdefghijklmnopqrstuvwxyz], then "hello, world!" is a phrase whose words are "hello" and "world", while ", " and "!" belong to the complementary set.
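The splitting described above can be sketched as follows. This is only an illustration: split_phrase is a hypothetical helper, not a function exported by Text::PhraseDistance, and the module's internal tokenization may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::PhraseDistance): split a phrase
# into the "words" built from $set and the "words" built from its complement.
sub split_phrase {
    my ($phrase, $set) = @_;
    my $class = quotemeta $set;                # make the set safe inside [...]
    my @words = $phrase =~ /([$class]+)/g;     # runs of characters from the set
    my @gaps  = $phrase =~ /([^$class]+)/g;    # runs of complementary characters
    return (\@words, \@gaps);
}

my ($words, $gaps) = split_phrase("hello, world!", "abcdefghijklmnopqrstuvwxyz");
print join("|", @$words), "\n";   # hello|world
print join("|", @$gaps),  "\n";   # , |!
```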
This module does not provide a "classic" string distance (e.g. Levenshtein), i.e. a way to compare two strings as single entities. Instead, it uses a string distance to compare the words one by one and tries to "match" the ones with the smallest distance. It also calculates a positional distance for every word belonging to the set and for every element of the complementary set. So, for example, for the two phrases:
"a yellow dog"
"a dog yellow"
the Levenshtein distance is 8. Also for the phrases:
"a yellow dog"
"a good cat"
the Levenshtein distance is 8, but the first pair of phrases is much closer than the second.
With the phrase distance implemented in this module, using the Text::Levenshtein as the string distance, the phrases:
"a yellow dog"
"a good cat"
have distance 8, but the phrases:
"a yellow dog"
"a dog yellow"
have distance 2. This is because the module finds a string distance of 0 (there are 3 pairs of words whose minimal string distance is 0) and a positional distance of 2: 0 for the two "a"s, plus 1 for "yellow" in the first phrase compared with "yellow" in the second (i.e. they are 1 position apart), plus 1 for "dog" in the first phrase compared with "dog" in the second.
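The positional part of this example can be reproduced with a small sketch. It is a simplification that assumes every word occurs at most once per phrase and only looks at exact matches; positional_sum is a hypothetical helper, and the module itself pairs words with a matching algorithm rather than a hash lookup.

```perl
use strict;
use warnings;

# Hypothetical helper: sum |i - j| over the words that occur in both phrases,
# where i and j are the word positions in the first and second phrase.
sub positional_sum {
    my ($p1, $p2) = @_;
    my @w1 = split ' ', $p1;
    my @w2 = split ' ', $p2;
    my %pos2;
    $pos2{ $w2[$_] } = $_ for 0 .. $#w2;       # word => position in phrase 2
    my $sum = 0;
    for my $i (0 .. $#w1) {
        next unless exists $pos2{ $w1[$i] };   # skip words without an exact match
        $sum += abs($i - $pos2{ $w1[$i] });
    }
    return $sum;
}

print positional_sum("a yellow dog", "a dog yellow"), "\n";   # 2
```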
These 2 components of the phrase distance (i.e. the string distance and the positional distance) can each be given a cost different from the default (1 for both) to define your own type of phrase distance (see below for the syntax).
There is a third component: a cost that penalizes phrases that have fewer exact matches.
For example
"dinning lamp on" compared with "living lamp on"
has 2 exact matches ("lamp","on") out of the 3 words of the 2nd phrase.
But
"living room lamp on" compared with "living lamp on"
has 3 exact matches ("living","lamp","on") out of the 3 words of the 2nd phrase.
In this case the phrase "dinning lamp on" has 1 "degree" of disadvantage relative to "living room lamp on" when both are compared with "living lamp on".
A visual example is:
 --------   ------   ----
| living | | lamp | | on |
 --------   ------   ----

            ------   ----
  dinning  | lamp | | on |
            ------   ----

 --------        ------   ----
| living | room | lamp | | on |
 --------        ------   ----

"living room lamp on" contains all 3 components of "living lamp on", but "dinning lamp on" contains only 2, so it is penalized by this third component of the distance.
This 3rd component is disabled by default (i.e. it has cost 0), but it can be enabled with a custom cost (see below for the syntax). In that case, the string distance used MUST return 0 for an exact match.
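The exact-match counts quoted in the example can be checked with a short sketch. exact_matches is a hypothetical helper written for illustration; how the module turns this count into a penalty is not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many words of $reference also occur,
# letter for letter, in $candidate (each candidate word is used once).
sub exact_matches {
    my ($candidate, $reference) = @_;
    my %available;
    $available{$_}++ for split ' ', $candidate;
    my $count = 0;
    for my $word (split ' ', $reference) {
        if ($available{$word}) {
            $available{$word}--;
            $count++;
        }
    }
    return $count;
}

print exact_matches("dinning lamp on",     "living lamp on"), "\n";   # 2
print exact_matches("living room lamp on", "living lamp on"), "\n";   # 3
```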
By default, this module sums the phrase distance computed on the words of the set (i.e. those formed from the defined set of characters) and the phrase distance computed on the "words" belonging to the complementary set. To change this behaviour, see below.
The algorithm used to find the distance is the one for the "stable marriage problem". It is a matching algorithm that pairs the elements of two sets on the basis of preference relationships (in this case the string distance between single "words", plus the positional distance, plus optionally the exact match weight).
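To give an idea of how such a matching works, here is a minimal Gale-Shapley sketch over a cost matrix. It is not the module's own code: stable_match is a hypothetical name, the matrix is assumed to be square, and the string cost is a toy stand-in (0 for equal words, otherwise the length of the longer word) rather than a real string distance.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Gale-Shapley on a square cost matrix: $cost->[$i][$j] is the cost of
# pairing word $i of phrase 1 with word $j of phrase 2 (lower is better
# for both sides). Returns an array ref mapping each $i to its partner $j.
sub stable_match {
    my ($cost) = @_;
    my $n = @$cost;
    # Each proposer ranks the acceptors by ascending cost (ties by index).
    my @prefs = map {
        my $i = $_;
        [ sort { $cost->[$i][$a] <=> $cost->[$i][$b] || $a <=> $b } 0 .. $n - 1 ]
    } 0 .. $n - 1;
    my @next   = (0) x $n;        # next preference each proposer will try
    my @holder = (undef) x $n;    # proposer currently held by each acceptor
    my @free   = (0 .. $n - 1);
    while (@free) {
        my $i   = shift @free;
        my $j   = $prefs[$i][ $next[$i]++ ];
        my $cur = $holder[$j];
        if (!defined $cur) {
            $holder[$j] = $i;                     # acceptor was unmatched
        }
        elsif ($cost->[$i][$j] < $cost->[$cur][$j]) {
            $holder[$j] = $i;                     # acceptor prefers the newcomer
            push @free, $cur;
        }
        else {
            push @free, $i;                       # rejected, will try the next one
        }
    }
    my @match;
    $match[ $holder[$_] ] = $_ for 0 .. $n - 1;
    return \@match;
}

my @w1 = qw(a yellow dog);
my @w2 = qw(a dog yellow);
my @cost;
for my $i (0 .. $#w1) {
    for my $j (0 .. $#w2) {
        # Toy string cost (an assumption for this sketch) plus positional cost.
        my $string_cost = $w1[$i] eq $w2[$j] ? 0 : max(length $w1[$i], length $w2[$j]);
        $cost[$i][$j] = $string_cost + abs($i - $j);
    }
}
my $match = stable_match(\@cost);
print join(",", @$match), "\n";   # 0,2,1
```

With this pairing, "a" is matched with "a", "yellow" with "yellow" and "dog" with "dog", and the summed cost of the chosen pairs is 2.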
USAGE
You have to import the pdistance function into the current namespace:
use Text::PhraseDistance qw(pdistance);
then you have to declare your distance function:
sub distance {
#your own implementation of a distance between strings
#
#that needs 2 strings (2 arguments) and returns a number
}
otherwise you can use Text::Levenshtein or others, e.g.
use Text::Levenshtein qw(distance);
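If you prefer to write the distance yourself, the following is one possible sketch: a plain Levenshtein distance with unit costs that takes 2 strings and returns a number, as required above. It is an illustrative implementation, not the code of Text::Levenshtein.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance with unit costs, computed row by row.
sub distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);    # distances from the empty prefix of $s
    for my $i (1 .. @s) {
        my @cur = ($i);             # deleting the first $i characters of $s
        for my $j (1 .. @t) {
            my $subst = $prev[$j - 1] + ($s[$i - 1] eq $t[$j - 1] ? 0 : 1);
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $subst);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print distance("kitten", "sitting"), "\n";   # 3
```

Because it returns 0 for identical strings, a function like this also satisfies the requirement on exact matches when the third cost is enabled.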
You also need the set of characters for the words, e.g.
my $set="abcdefghijklmnopqrstuvwxyz";
and then the two phrases, e.g.:
my $phrase1="a yellow dog";
my $phrase2="a dog yellow";
so you can call the phrase distance:
print pdistance($phrase1,$phrase2,$set,\&distance);
To define a custom distance subroutine that wraps an existing one (e.g. WagnerFischer with a custom cost array), you can use a closure like this:
my $mydistance;
{
    # cost array captured by the closure and passed to the wrapped distance
    my $array_ref = [0, 1, 2];
    $mydistance = sub {
        distance( $array_ref, shift, shift );
    };
}
OPTIONAL PARAMETERS
pdistance($phrase1,$phrase2,$set,\&distance,{-cost=>[1,0,3],-mode=>'set'});
-mode
accepted values are:
complementary   the distance is calculated only from the "words" of the complementary set
both            the distance is calculated from both sets
set             the distance is calculated only from the "words" of the given set
Default mode is 'both'.
-cost
accepted value is a reference to an array of 3 elements: the first is the cost for the string distance, the second is the cost for the positional distance, and the third is the cost that penalizes phrases with fewer exact matches.
Default array is [1,1,0].
THANKS
Many thanks to Stefano L. Rodighiero <larsen at perlmonk.org> for the support and part of the code, and to D. Frankowski and B. Winter for the suggestions.
AUTHOR
Copyright 2002,2003 Dree Mistrut <dree@friuli.to>
This package is free software and is provided "as is" without express or implied warranty. You can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
Text::Levenshtein, Text::WagnerFischer