NAME

psame - finds similarities between files or versions of files

SYNOPSIS

psame [options] file1 file2
psame [options] file
psame [options] [-r version] file

The first usage compares the two files. The second usage compares the given file with the latest version from Subversion, CVS or RCS. The third usage compares the file against a given version from Subversion, CVS or RCS.

By default, the output will be a side-by-side view of matching regions with a few lines of context.

MOTIVATION

psame allows the user to find lines in one piece of text (generally from a file) that match some lines in a second piece of text.

USE CASES

Code comparison

The diff(1) command is excellent for finding differences between files, but sometimes similarity is more interesting. A common case is when a chunk of code is moved to another part of the same file. Comparing the old and new versions of the file with diff will then tell you only that text has been deleted in one place and inserted in another. psame, on the other hand, will tell you where the moved code is in the new version. In simple cases the output from diff is clear enough, but comparison with psame can help when there have been many edits.

DESCRIPTION

Options

-b

ignore changes in whitespace

-i

ignore case

-B

ignore blank lines

-s <num>

ignore simple/short lines (i.e. those with fewer than <num> chars)

-y

side-by-side match view (default)

-V

vertical match view

-n

show non-matches instead of matches

-N

show matches and non-matches

-x <wid>

set terminal width in columns (normally guessed)

-r <ver>

compare with version <ver> from SVN, CVS or RCS

-S <num>

only show matches with score higher than <num> (see the SCORE section below)

-C <num>

number of lines of context

-a

apply (a)ll useful options - sets the following options: -b -i -B -s 2 -N -S 3
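
As a rough illustration of how the -b, -i, -B and -s options could affect which lines are compared, here is a minimal Python sketch. It is not psame's code: the function names are hypothetical, and the exact semantics (collapsing whitespace runs for -b, counting characters after trimming for -s) are assumptions.

```python
import re

def normalise(line, ignore_space=False, ignore_case=False):
    """Return the key used to compare a line (hypothetical helper)."""
    if ignore_space:
        # Assumed -b behaviour: collapse whitespace runs and trim the ends.
        line = re.sub(r"\s+", " ", line).strip()
    if ignore_case:
        # Assumed -i behaviour: compare lower-cased text.
        line = line.lower()
    return line

def comparable_lines(lines, ignore_space=False, ignore_case=False,
                     ignore_blank=False, min_chars=0):
    """Return (line_number, key) pairs for the lines that take part in matching."""
    result = []
    for number, line in enumerate(lines, start=1):
        if ignore_blank and line.strip() == "":
            continue  # assumed -B behaviour: blank lines are skipped
        if len(line.strip()) < min_chars:
            continue  # assumed -s behaviour: short lines are skipped
        result.append((number, normalise(line, ignore_space, ignore_case)))
    return result

# Roughly the line-filtering part of -a (-b -i -B -s 2):
print(comparable_lines(["  Foo   bar ", "", "}", "BAZ qux"],
                       ignore_space=True, ignore_case=True,
                       ignore_blank=True, min_chars=2))
```

Keeping the original line numbers alongside the normalised keys means that any reported ranges can still refer to positions in the unmodified files.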

MATCHES

A "match" is some number of consecutive lines in one file (or file version) that are similar to some number of consecutive lines in a second file (or file version). In the simplest case with no options specified, the lines in each file must be identical. As an example, consider these two pieces of text (with added line numbers):

text_1

1. The parrot sketch -
2.  'E's kicked the bucket, 'e's
3.  shuffled off 'is mortal coil, run
4.  down the curtain and joined the
5.  bleedin' choir invisibile!
6.  THIS IS AN EX-PARROT!

text_2

1. 'E's kicked the bucket, 'e's
2. shuffled off 'is mortal coil, run
3. down the curtain and joined the
4. bleedin' choir invisibile!
5. 
6. This is an ex-parrot!

Using the default settings, psame will report this:

match 2..5==1..4
  The parrot sketch -                                                    
   'E's kicked the bucket, 'e's      =  'E's kicked the bucket, 'e's     
   shuffled off 'is mortal coil, run =  shuffled off 'is mortal coil, run
   down the curtain and joined the   =  down the curtain and joined the  
   bleedin' choir invisibile!        =  bleedin' choir invisibile!       
   THIS IS AN EX-PARROT!                                                 
                                        This is an ex-parrot!            

which indicates that four lines from text_1 (i.e. lines 2 to 5) match four lines from text_2 (i.e. lines 1 to 4). Note that psame is case sensitive by default, so line 6 of text_1 does not match line 6 of text_2 in this case.

Adding the -i option makes psame ignore case, so it also finds the last line of each file to be equal:

match 2..5==1..4
  The parrot sketch -                                                    
   'E's kicked the bucket, 'e's      =  'E's kicked the bucket, 'e's     
   shuffled off 'is mortal coil, run =  shuffled off 'is mortal coil, run
   down the curtain and joined the   =  down the curtain and joined the  
   bleedin' choir invisibile!        =  bleedin' choir invisibile!       
   THIS IS AN EX-PARROT!                                                 
                                        This is an ex-parrot!            
match 6..6==6..6
   shuffled off 'is mortal coil, run    down the curtain and joined the  
   down the curtain and joined the      bleedin' choir invisibile!       
   bleedin' choir invisibile!                                            
   THIS IS AN EX-PARROT!             =  This is an ex-parrot!            

In this case psame is reporting two distinct matches - one four lines long and the other one line long.
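
The idea of a match as a run of consecutive equal lines can be sketched in Python as follows. This is only an illustration under simple assumptions (exact line equality, optional case folding, a naive quadratic search); the function name find_matches is hypothetical and this is not necessarily how psame itself finds matches.

```python
def find_matches(lines_1, lines_2, ignore_case=False):
    """Return maximal runs of equal consecutive lines as
    (start_1, end_1, start_2, end_2) tuples with 1-based line numbers."""
    def key(line):
        return line.lower() if ignore_case else line

    a = [key(line) for line in lines_1]
    b = [key(line) for line in lines_2]
    matches = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip pairs that merely continue a run that started earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the run for as long as the following lines are equal.
            length = 1
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i + 1, i + length, j + 1, j + length))
    return matches

text_1 = [
    "The parrot sketch -",
    "'E's kicked the bucket, 'e's",
    "shuffled off 'is mortal coil, run",
    "down the curtain and joined the",
    "bleedin' choir invisibile!",
    "THIS IS AN EX-PARROT!",
]
text_2 = [
    "'E's kicked the bucket, 'e's",
    "shuffled off 'is mortal coil, run",
    "down the curtain and joined the",
    "bleedin' choir invisibile!",
    "",
    "This is an ex-parrot!",
]

print(find_matches(text_1, text_2))                    # case sensitive
print(find_matches(text_1, text_2, ignore_case=True))  # like -i
```

With the two texts above, the case-sensitive call yields the single run 2..5==1..4, and the case-insensitive call additionally yields 6..6==6..6.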

NON-MATCHES

The -n flag reports lines in each file that don't match any lines in the other file. For example, running psame -n on the files above, with no other options, gives:

non matches in text_1:
  1..1:
    The parrot sketch -
  6..6:
     THIS IS AN EX-PARROT!
non matches in text_2:
  5..6:

     This is an ex-parrot!

In this case diff(1) will tell us the same thing, but in other situations we only want to know about lines in file A that don't appear anywhere in file B. An example might be when modifying the order of sections in a manuscript - we would like to check that all sections are still present, even if in a different place.
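
A minimal Python sketch of the non-match idea is shown below: it reports ranges of lines in one text that appear nowhere in the other, regardless of position. The function name non_matching_ranges is hypothetical, lines are compared for exact equality, and this is an illustration rather than psame's actual implementation.

```python
def non_matching_ranges(lines, other_lines):
    """Return 1-based (start, end) ranges of lines absent from other_lines."""
    present = set(other_lines)
    ranges = []
    for number, line in enumerate(lines, start=1):
        if line in present:
            continue
        if ranges and ranges[-1][1] == number - 1:
            # The previous line was also a non-match: extend that range.
            ranges[-1] = (ranges[-1][0], number)
        else:
            ranges.append((number, number))
    return ranges

text_1 = [
    "The parrot sketch -",
    "'E's kicked the bucket, 'e's",
    "shuffled off 'is mortal coil, run",
    "down the curtain and joined the",
    "bleedin' choir invisibile!",
    "THIS IS AN EX-PARROT!",
]
text_2 = [
    "'E's kicked the bucket, 'e's",
    "shuffled off 'is mortal coil, run",
    "down the curtain and joined the",
    "bleedin' choir invisibile!",
    "",
    "This is an ex-parrot!",
]

print(non_matching_ranges(text_1, text_2))
print(non_matching_ranges(text_2, text_1))
```

Because membership is tested against the whole of the other text, a section that has only moved is not reported, which is the behaviour wanted for the manuscript example.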

SCORE

The score of a match is currently the total number of lines this match covers in both files. The -S option for filtering by score is useful for filtering out small matches so that the larger changes can be seen.
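
The scoring rule can be written out as a short Python sketch. The tuple layout (start_1, end_1, start_2, end_2) and the function names are assumptions made for this illustration; the strict "higher than" comparison follows the description of -S above.

```python
def score(match):
    """Total number of lines the match covers in both files."""
    start_1, end_1, start_2, end_2 = match
    return (end_1 - start_1 + 1) + (end_2 - start_2 + 1)

def filter_by_score(matches, threshold):
    """Keep only matches whose score is strictly higher than threshold."""
    return [m for m in matches if score(m) > threshold]

# The two matches from the -i example: 2..5==1..4 and 6..6==6..6
matches = [(2, 5, 1, 4), (6, 6, 6, 6)]
print([score(m) for m in matches])
print(filter_by_score(matches, 3))
```

Under this rule the four-line match scores 8 and the one-line match scores 2, so a threshold of 3 (as set by -a) would keep only the first.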

BUGS

None known

LIMITATIONS

The code works well with small input files (up to 10,000 lines or so), but is too slow and memory intensive for larger files.

TO DO

Output formatting should be done with Perl6::Form or some such and the output needs to be more readable. Suggestions are very welcome.

AUTHOR

Kim Rutherford <kmr+same@xenu.org.uk>