NAME

Text::Similarity::Overlaps

SYNOPSIS

          # You may want to measure the similarity of two documents
          # sentence by sentence. The example below shows how. Suppose
          # you have two text files, file1.txt and file2.txt, each with
          # the same number of sentences. Split each file so that every
          # sentence is in a separate file.

          # If file1.txt and file2.txt each have three sentences,
          # filex.txt becomes sentx1.txt, sentx2.txt, and sentx3.txt.

          # The loop below just calls getSimilarity() for each pair of
          # sentence files.

          use Text::Similarity::Overlaps;
          my %options = ('normalize' => 1, 'verbose' => 1, 'stoplist' => 'stoplist.txt');
          my $mod = Text::Similarity::Overlaps->new (\%options);
          defined $mod or die "Construction of Text::Similarity::Overlaps failed";

          my @file1_sentences = qw / sent11.txt sent12.txt sent13.txt /;
          my @file2_sentences = qw / sent21.txt sent22.txt sent23.txt /;

          # assumes that both documents have the same number of sentences

          for my $i (0 .. $#file1_sentences) {
                  my $score = $mod->getSimilarity ($file1_sentences[$i], $file2_sentences[$i]);
                  print "The similarity of $file1_sentences[$i] and $file2_sentences[$i] is : $score\n";
          }

          # the two original files can also be compared as whole documents

          my $score = $mod->getSimilarity ('file1.txt', 'file2.txt');
          print "The similarity of the two files is : $score\n";
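
The example above assumes the per-sentence files already exist. One way to create them is sketched below. The helper write_sentence_files is hypothetical and is not part of Text::Similarity, and the sentence splitting is deliberately naive (it breaks after '.', '!' or '?' followed by whitespace), so abbreviations such as "Dr." would be split incorrectly.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper, not part of Text::Similarity.
# Splits $text into sentences and writes sentence N to "${prefix}N.txt".
# Returns the list of file names that were written.
sub write_sentence_files {
    my ($text, $prefix) = @_;

    # Naive split: break on whitespace that follows '.', '!' or '?'
    my @sentences = grep { /\S/ } split /(?<=[.!?])\s+/, $text;

    my @names;
    for my $n (1 .. scalar @sentences) {
        my $name = "$prefix$n.txt";
        open my $fh, '>', $name or die "Cannot write $name: $!";
        print {$fh} $sentences[$n - 1], "\n";
        close $fh or die "Cannot close $name: $!";
        push @names, $name;
    }
    return @names;
}

# Demo in a temporary directory so nothing is left behind
my $dir   = tempdir(CLEANUP => 1);
my @files = write_sentence_files("First sentence. Second one! A third?", "$dir/sent1");
print scalar(@files), " sentence files written\n";
```

With the prefix "sent1" this produces sent11.txt, sent12.txt and sent13.txt, matching the naming used in the example above.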

DESCRIPTION

This module computes the similarity of two text documents by searching for literal word token overlaps. At present, comparisons are made between entire documents, and the module itself does not support finer granularity; to compare at the sentence level, place each sentence in its own file as shown in the SYNOPSIS. Each file is treated as one long input string, so overlaps can be found across sentence and paragraph boundaries.
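
To make the idea of literal word token overlap concrete, here is a minimal sketch in plain Perl. It is a simplified illustration, not the module's own algorithm or scoring: the function count_token_overlaps is hypothetical, it ignores word order, and it applies no stoplist or normalization. Each text is handled as one long string, so sentence boundaries play no role in the count.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Similarity::Overlaps.
# Counts how many word tokens of $text1 can be paired with an
# identical, not yet used token of $text2.
sub count_token_overlaps {
    my ($text1, $text2) = @_;

    # Lowercase, then split on runs of non-word characters
    my @tokens1 = grep { length } split /\W+/, lc $text1;
    my @tokens2 = grep { length } split /\W+/, lc $text2;

    # How many copies of each token are still available in text 2
    my %remaining;
    $remaining{$_}++ for @tokens2;

    my $overlaps = 0;
    for my $token (@tokens1) {
        if ($remaining{$token}) {
            $remaining{$token}--;
            $overlaps++;
        }
    }
    return $overlaps;
}

my $doc1 = "The cat sat. The dog barked.";
my $doc2 = "A dog barked at the cat.";
print count_token_overlaps($doc1, $doc2), "\n";   # prints 4
```

The four matches are "the", "cat", "dog" and "barked". The second "the" in the first text is not counted because the second text contains the word only once.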

SEE ALSO

AUTHOR

Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu

Jason Michelizzi

Last modified by : $Id: Overlaps.pm,v 1.15 2008/03/21 22:21:11 tpederse Exp $

COPYRIGHT AND LICENSE

Copyright (C) 2004-2008 by Jason Michelizzi and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA