NAME
Text::Similarity::Overlaps - Score the Matches Found Between Two Strings
SYNOPSIS
# you can measure the similarity between two input strings
# if you don't normalize the score, you get the number of matching words
# if you normalize, you get a score between 0 and 1 that is scaled based
# on the length of the strings
use Text::Similarity::Overlaps;
# my %options = ('normalize' => 1, 'verbose' => 1);
my %options = ('normalize' => 0, 'verbose' => 0);
my $mod = Text::Similarity::Overlaps->new (\%options);
defined $mod or die "Construction of Text::Similarity::Overlaps failed";
my $string1 = 'this is a test for getSimilarityStrings';
my $string2 = 'we can test getSimilarityStrings this day';
my $score = $mod->getSimilarityStrings ($string1, $string2);
print "The number of matching words between string1 and string2 is : $score\n";
# you may want to measure the similarity of two documents
# sentence by sentence - the example below shows how.
# suppose you have two text files, file1.txt and file2.txt,
# each having the same number of sentences.
# split each file into multiple files, so that each
# sentence is in a separate file.
# if file1.txt and file2.txt each have three sentences,
# filex.txt will become sentx1.txt sentx2.txt sentx3.txt
# the loop below just calls getSimilarity() for each pair of sentences
use Text::Similarity::Overlaps;
my %options = ('normalize' => 1, 'verbose' => 1, 'stoplist' => 'stoplist.txt');
my $mod = Text::Similarity::Overlaps->new (\%options);
defined $mod or die "Construction of Text::Similarity::Overlaps failed";
my @file1_sentences = qw / sent11.txt sent12.txt sent13.txt /;
my @file2_sentences = qw / sent21.txt sent22.txt sent23.txt /;
# assumes that both documents have same number of sentences
for (my $i = 0; $i <= $#file1_sentences; $i++) {
    my $score = $mod->getSimilarity ($file1_sentences[$i], $file2_sentences[$i]);
    print "The similarity of $file1_sentences[$i] and $file2_sentences[$i] is : $score\n";
}
my $score = $mod->getSimilarity ('file1.txt', 'file2.txt');
print "The similarity of the two files is : $score\n";
DESCRIPTION
This module computes the similarity of two text documents or strings by searching for literal word token overlaps. At present comparisons are made between entire documents, and finer granularity is not supported. Files are treated as one long input string, so overlaps can be found across sentence and paragraph boundaries.
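To make the idea of literal word token overlaps concrete, here is a minimal sketch in plain Perl. It is not the module's implementation. The helpers tokenize() and count_shared_words() are hypothetical names introduced only for this illustration, and the choices to lowercase the input and to split on non-word characters are assumptions. It simply counts the word tokens that two strings share, matching each occurrence at most once.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Similarity::Overlaps):
# split a string into lowercase word tokens.
# Assumption: tokens are runs of word characters.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /\W+/, lc $text;
}

# Hypothetical helper: count the tokens two strings have in common.
# Each occurrence in the first string can be matched only once.
sub count_shared_words {
    my ($s1, $s2) = @_;
    my %remaining;
    $remaining{$_}++ for tokenize($s1);

    my $shared = 0;
    for my $word (tokenize($s2)) {
        next unless $remaining{$word};
        $remaining{$word}--;    # use up one occurrence
        $shared++;
    }
    return $shared;
}

my $string1 = 'this is a test for getSimilarityStrings';
my $string2 = 'we can test getSimilarityStrings this day';
my $shared  = count_shared_words($string1, $string2);
print "Shared word tokens: $shared\n";    # prints 3 (this, test, getsimilaritystrings)
```

This sketch ignores word order, stop lists and normalization, so its result should not be expected to equal the module's score for arbitrary input.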
Files are first converted into strings by getSimilarity(), then getSimilarityStrings() does the actual processing.
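The two-step flow described above (file to string, then string comparison) can be sketched as follows. This is an illustration under stated assumptions, not the module's code: slurp_as_string() and compare_files() are hypothetical helpers, collapsing whitespace into single spaces is an assumed way of treating a file as one long string, and the string comparison is passed in as a code reference so the example runs without the module installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file into one string.
# Assumption: line breaks are collapsed to single spaces so that
# sentence and paragraph boundaries do not separate tokens.
sub slurp_as_string {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                   # slurp mode: read everything at once
    my $text = <$fh>;
    close $fh;
    $text = '' unless defined $text;
    $text =~ s/\s+/ /g;         # collapse all whitespace runs
    $text =~ s/^ | $//g;        # trim leading and trailing space
    return $text;
}

# Hypothetical helper: convert both files to strings, then hand
# them to a string-level comparison supplied by the caller.
sub compare_files {
    my ($file1, $file2, $compare_strings) = @_;
    return $compare_strings->(slurp_as_string($file1),
                              slurp_as_string($file2));
}

# Demo data: two small temporary files.
my ($fh1, $path1) = tempfile(UNLINK => 1);
print {$fh1} "one two\nthree four\n";
close $fh1;
my ($fh2, $path2) = tempfile(UNLINK => 1);
print {$fh2} "three\nfour five\n";
close $fh2;

# Stand-in comparison: count words of the second string seen in the first.
my $shared_words = sub {
    my ($s1, $s2) = @_;
    my %in_first = map { $_ => 1 } split ' ', $s1;
    return scalar grep { $in_first{$_} } split ' ', $s2;
};

my $result = compare_files($path1, $path2, $shared_words);
print "Shared words across files: $result\n";    # prints 2 (three, four)
```

With the module available, the stand-in comparison could be replaced by sub { $mod->getSimilarityStrings(@_) }, where $mod is an object built as in the SYNOPSIS.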
SEE ALSO
L<http://text-similarity.sourceforge.net>
AUTHOR
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Jason Michelizzi
Last modified by : $Id: Overlaps.pm,v 1.18 2008/04/04 18:30:19 tpederse Exp $
COPYRIGHT AND LICENSE
Copyright (C) 2004-2008 by Jason Michelizzi and Ted Pedersen
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA