NAME
psame - finds similarities between files or versions of files
SYNOPSIS
psame [options] file1 file2
psame [options] file
psame [options] -r version file
psame [options] -r version_a -r version_b file
The first usage compares the two files. The second compares the latest version from Subversion, CVS or RCS against the given file. The third compares the given version from Subversion, CVS or RCS against the given file. The fourth compares two versions of the given file from Subversion, CVS or RCS.
By default, blank lines, whitespace and case are ignored when comparing. The output will be a side-by-side view of matching regions with a few lines of context.
MOTIVATION
psame was written to give the author an easy way to compare two pieces of text, in particular to find lines in one piece of text (generally from a file) that match lines in a second piece of text.
USES OF PSAME
Code comparison
The diff(1) command is excellent for finding differences between files, but sometimes similarity is more interesting. A common case is when a chunk of code is moved to another part of the same file. Comparing the old and new versions of the file with diff will then report a deletion of text and an insertion. psame, on the other hand, will tell you where the moved code is in the new version. In simple cases the output from diff is clear enough, but comparison with psame can help when there have been many edits.
DESCRIPTION
Options
-b, --dont-ignore-spaces
    don't ignore changes in whitespace within a line
-i, --dont-ignore-case
    don't ignore case when comparing lines
-B, --dont-ignore-blank-lines
    don't ignore blank lines
-M <num>, --minimum-line-length <num>
    ignore simple/short lines (i.e. those with fewer than <num> characters). If the -b flag is active, the line length is tested after removing whitespace. Default: no lines are considered too simple.
-S <num>, --minimum-score <num>
    only show matches with a score higher than <num> (see the SCORE section below). Default: 0.
-y, --side-by-side
    side-by-side match view (default)
-V, --vertical
    vertical match view
-n, --show-only-non-matches
    show non-matches instead of matches
-N, --show-non-matches
    show matches and non-matches
-x <wid>, --terminal-width <wid>
    set terminal width in columns (normally guessed)
-r <ver>, --revision <ver>, -r <ver> -r <ver>
    compare with version(s) from SVN, CVS or RCS
-C <num>, --context <num>
    number of lines of context. Default: 3.
-m <mode>, --mode <mode>
    use <mode> to choose appropriate settings
    text
        choose settings suitable for text/documentation: -M 10 -S 4
    code
        choose settings suitable for code: -i -M 5 -S 0
-v, --version
    display the version number and exit
-h, -?, --help
    show a usage message
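The two modes listed under -m are shorthand for fixed option sets. The sketch below only restates that mapping as data. The names MODE_PRESETS and expand_mode are hypothetical and are not part of psame. How psame combines a mode with options given explicitly on the command line is not documented here, so the sketch does not try to model it.

```python
# Hypothetical illustration of the -m presets documented above.
# The option lists are copied from the text; the names are invented.
MODE_PRESETS = {
    "text": ["-M", "10", "-S", "4"],
    "code": ["-i", "-M", "5", "-S", "0"],
}


def expand_mode(mode):
    """Return the option list that the named mode stands for."""
    try:
        return list(MODE_PRESETS[mode])
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None


if __name__ == "__main__":
    print("psame", *expand_mode("text"), "file1", "file2")
```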
MATCHES
A "match" is some number of consecutive lines in one file (or file version) that are similar to some number of consecutive lines in a second file (or file version). In the simplest case with no options specified, the lines in each file must be identical. As an example, consider these two pieces of text (with added line numbers):
text_1
1. The parrot sketch -
2. 'E's kicked the bucket, 'e's
3. shuffled off 'is mortal coil, run
4. down the curtain and joined the
5. bleedin' choir invisible!
6. THIS IS AN EX-PARROT!
text_2
1. 'E's kicked the bucket, 'e's
2. shuffled off 'is mortal coil, run
3. down the curtain and joined the
4. bleedin' choir invisible!
5.
6. This is an ex-parrot!
Default settings
Using the default settings, psame will report this:
match 2..6==1..6
The parrot sketch -
'E's kicked the bucket, 'e's = 'E's kicked the bucket, 'e's
shuffled off 'is mortal coil, = shuffled off 'is mortal coil,
run down the curtain and joined = run down the curtain and joined
the bleedin' choir invisible! = the bleedin' choir invisible!
>
THIS IS AN EX-PARROT! = This is an ex-parrot!
which indicates that there are five lines from text_1 (i.e. lines 2 to 6) that match six lines from text_2 (i.e. lines 1 to 6). By default psame is case and whitespace insensitive, and blank lines are ignored when comparing files. The "=" symbol indicates a match between two lines. The ">" indicates that text_2 has an extra blank line that has been ignored during the comparison.
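One way to picture the default comparison is as a normalisation step applied to each line before lines are tested for equality. The sketch below is an illustration only, not psame's implementation. The function normalize and its keyword arguments are hypothetical, and it assumes that "ignoring whitespace" means removing all whitespace from a line.

```python
def normalize(lines, ignore_case=True, ignore_space=True, ignore_blank=True):
    """Return (original_line_number, comparison_key) pairs for the lines."""
    keyed = []
    for number, line in enumerate(lines, start=1):
        key = line
        if ignore_space:
            # Assumption: whitespace is removed entirely, not just collapsed.
            key = "".join(key.split())
        if ignore_case:
            key = key.lower()
        if ignore_blank and key.strip() == "":
            continue  # blank lines take no part in the comparison
        keyed.append((number, key))
    return keyed


text_1 = [
    "The parrot sketch -",
    "'E's kicked the bucket, 'e's",
    "shuffled off 'is mortal coil, run",
    "down the curtain and joined the",
    "bleedin' choir invisible!",
    "THIS IS AN EX-PARROT!",
]
text_2 = [
    "'E's kicked the bucket, 'e's",
    "shuffled off 'is mortal coil, run",
    "down the curtain and joined the",
    "bleedin' choir invisible!",
    "",
    "This is an ex-parrot!",
]

keys_1 = [key for _, key in normalize(text_1)]
keys_2 = [key for _, key in normalize(text_2)]

if __name__ == "__main__":
    # Lines 2..6 of text_1 and lines 1..6 of text_2 give the same keys.
    print(keys_1[1:] == keys_2)
```

Calling normalize with ignore_blank=False or ignore_case=False keeps the blank line or the original case in the keys, which loosely corresponds to what the -B and -i options are described as doing.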
Case sensitivity and ignoring blank lines
Adding the -B parameter will produce this output:
match 2..5==1..4
The parrot sketch -
'E's kicked the bucket, 'e's = 'E's kicked the bucket, 'e's
shuffled off 'is mortal coil, run = shuffled off 'is mortal coil, run
down the curtain and joined the = down the curtain and joined the
bleedin' choir invisible! = bleedin' choir invisible!
THIS IS AN EX-PARROT!
This is an ex-parrot!
match 6..6==6..6
shuffled off 'is mortal coil, run down the curtain and joined the
down the curtain and joined the bleedin' choir invisible!
bleedin' choir invisible!
THIS IS AN EX-PARROT! = This is an ex-parrot!
In this case blank lines are significant for the comparison. psame reports two distinct matches - one four lines long and the other one line long.
Adding the -i option as well will make psame respect case. Here is the output:
match 2..5==1..4
The parrot sketch -
'E's kicked the bucket, 'e's = 'E's kicked the bucket, 'e's
shuffled off 'is mortal coil, run = shuffled off 'is mortal coil, run
down the curtain and joined the = down the curtain and joined the
bleedin' choir invisible! = bleedin' choir invisible!
THIS IS AN EX-PARROT!
This is an ex-parrot!
Note that the "This is an ex-parrot!" line doesn't match now.
NON-MATCHES
The -n flag will report lines in each file that don't match any lines in the other file. For example, running psame -n on the files above, with no other options gives:
non matches in text_1:
1..1:
The parrot sketch -
i.e. line 1 of text_1 doesn't occur anywhere in text_2.
In this case diff(1) will tell us the same thing, but in other situations we only want to know about lines in file A that don't appear anywhere in file B. An example might be when modifying the order of sections in a manuscript: we would like to check that all sections are still present, even if in a different place.
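The idea of a non-match can be sketched as a set lookup: a line of file A is reported when its normalised form appears nowhere in file B. This is a simplified illustration, not psame's algorithm. The names normalize_line and non_matching_ranges are hypothetical, and the sketch works line by line, ignoring the -M and -S settings.

```python
def normalize_line(line):
    """Assumed default comparison key: no whitespace, lower case."""
    return "".join(line.split()).lower()


def non_matching_ranges(lines_a, lines_b):
    """Return (first, last) line ranges of lines_a that never occur in lines_b."""
    keys_b = {normalize_line(line) for line in lines_b}
    ranges = []
    for number, line in enumerate(lines_a, start=1):
        key = normalize_line(line)
        if key == "" or key in keys_b:
            continue  # blank lines are ignored; matching lines are skipped
        if ranges and ranges[-1][1] == number - 1:
            ranges[-1] = (ranges[-1][0], number)  # extend the current range
        else:
            ranges.append((number, number))
    return ranges


text_1 = [
    "The parrot sketch -",
    "'E's kicked the bucket, 'e's",
    "shuffled off 'is mortal coil, run",
    "down the curtain and joined the",
    "bleedin' choir invisible!",
    "THIS IS AN EX-PARROT!",
]
text_2 = [
    "'E's kicked the bucket, 'e's",
    "shuffled off 'is mortal coil, run",
    "down the curtain and joined the",
    "bleedin' choir invisible!",
    "",
    "This is an ex-parrot!",
]

if __name__ == "__main__":
    for first, last in non_matching_ranges(text_1, text_2):
        print(f"{first}..{last}")
```

Because the lookup does not depend on position, reordering sections of a manuscript leaves the result empty as long as every line is still present somewhere.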
SCORE
The score of a match is currently the total number of lines this match covers in both files. The -S option for filtering by score is useful for filtering out small matches so that the larger similarity can be seen.
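As a worked example, the score can be derived from the "match A..B==C..D" header lines shown in the MATCHES section. The sketch below is illustrative only. The names score_from_header and filter_by_score are hypothetical, and it assumes that every line in the reported ranges counts towards the score, including blank lines that were ignored during comparison.

```python
import re

# Header format as it appears in the example output above.
MATCH_HEADER = re.compile(r"match (\d+)\.\.(\d+)==(\d+)\.\.(\d+)")


def score_from_header(header):
    """Total number of lines covered in both files (ranges are inclusive)."""
    found = MATCH_HEADER.fullmatch(header.strip())
    if found is None:
        raise ValueError(f"not a match header: {header!r}")
    a_start, a_end, b_start, b_end = map(int, found.groups())
    return (a_end - a_start + 1) + (b_end - b_start + 1)


def filter_by_score(headers, minimum_score=0):
    """Keep headers whose score is higher than minimum_score, as -S is described."""
    return [h for h in headers if score_from_header(h) > minimum_score]


if __name__ == "__main__":
    headers = ["match 2..5==1..4", "match 6..6==6..6"]
    for header in headers:
        print(header, score_from_header(header))
    print(filter_by_score(headers, minimum_score=4))
```

Under these assumptions the two matches reported with -B score 8 and 2, so a minimum score of 4 (as in the text mode) would hide the one-line match.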
BUGS
None known
LIMITATIONS
The code works well with small input files (up to 10,000 lines or so), but is too slow and memory intensive for larger files.
TO DO
Output formatting should be done with Perl6::Form or some such and the output needs to be more readable. Suggestions are very welcome.
AUTHOR
Kim Rutherford <kmr+same@xenu.org.uk>
http://www.xenu.org.uk