NAME

PDL::EditDistance - Wagner-Fischer edit distance and alignment for PDLs.

SYNOPSIS

use PDL;
use PDL::EditDistance;

##-- input PDLs
$a = pdl([map { ord($_) } qw(G U M B O)]);
$b = pdl([map { ord($_) } qw(G A M B O L)]);

$a1 = pdl([0, map { ord($_) } qw(G U M B O)]);
$b1 = pdl([0, map { ord($_) } qw(G A M B O L)]);

##-------------------------------------------------------------
## Levenshtein distance
$dist          = edit_distance_static($a,$b, 0,1,1,1);
($dist,$align) = edit_align_static($a,$b, 0,1,1,1);

##-------------------------------------------------------------
## Wagner-Fischer distance
@costs         = ($costMatch=0,$costInsert=1,$costDelete=1,$costSubstitute=2);
$dist          = edit_distance_static($a,$b, @costs);
($dist,$align) = edit_align_static($a,$b, @costs);

##-------------------------------------------------------------
## General edit distance
$costsMatch = random($a->nelem+1, $b->nelem+1);
$costsIns   = random($a->nelem+1, $b->nelem+1);
$costsDel   = random($a->nelem+1, $b->nelem+1);
$costsSubst = random($a->nelem+1, $b->nelem+1);
@costs         = ($costsMatch,$costsIns,$costDel,$costsSubst);
$dist          = edit_distance_full($a,$b,@costs);
($dist,$align) = edit_align_full($a,$b,@costs);

##-------------------------------------------------------------
## Alignment
$op_match = align_op_match();      ##-- constant
$op_del   = align_op_insert1();    ##-- constant
$op_ins   = align_op_insert2();    ##-- constant
$op_subst = align_op_substitute(); ##-- constant

($apath,$bpath,$pathlen) = edit_bestpath($align);
($ai,$bi,$ops,$pathlen)  = edit_pathtrace($align);

##-------------------------------------------------------------
## Longest Common Subsequence
$lcs = edit_lcs($a,$b);
($ai,$bi,$lcslen) = lcs_backtrace($a,$b,$lcs);

FUNCTIONS

_edit_pdl

Signature: (a(N); [o]apdl(N+1))

Convenience method. Returns a pdl $apdl() suitable for representing $a(), which can be specified as a UTF-8 or byte-string, as an arrays of numbers, or as a PDL. $apdl(0) is always set to zero.

edit_costs

Signature: (PDL::Type type; int N; int M;
            [o]costsMatch(N+1,M+1); [o]costsIns(N+1,M+1); [o]costsDel(N+1,M+1); [o]costsSubst(N+1,M+1))

Convenience method. Ensures existence and proper dimensionality of cost matrices for inputs of length N and M.

_edit_costs

Signature: (PDL::Type type; int N1; int M1;
            [o]costsMatch(N1,M1); [o]costsIns(N1,M1); [o]costsDel(N1,M1); [o]costsSubst(N1,M1))

Low-level method. Ensures existence and proper dimensionality of cost matrices for inputs of length N1-1 and M1-1.

edit_costs_static

Signature: (PDL::Type type; int N; int M;
            staticCostMatch(); staticCostIns(); staticCostSubst();
            [o]costsMatch(N+1,M+1); [o]costsIns(N+1,M+1); [o]costsDel(N+1,M+1); [o]costsSubst(N+1,M+1))

Convenience method.

edit_distance_full

Signature: (a(N); b(M);
            costsMatch(N+1,M+1); costsIns(N+1,M+1); costsDel(N+1,M+1); costsSubst(N+1,M+1);
            [o]dist(N+1,M+1); [o]align(N+1,M+1))

Convenience method. Compute the edit distance matrix for inputs $a() and $b(), and cost matrices $costsMatch(), $costsIns(), $costsDel(), and $costsSubst(). $a() and $b() may be specified as PDLs, arrays of numbers, or as strings.

_edit_distance_full

Signature: (a1(N1); b1(M1); costsMatch(N1,M1); costsIns(N1,M1); costsDel(N1,M1); costsSubst(N1,M1); [o]dist(N1,M1))

Low-level method. Compute the edit distance matrix for input PDLs $a1() and $b1() and cost matrices $costsMatch(), $costsIns(), $costsDel(), and $costsSubst().

The first elements of $a1() and $b1() are ignored.

_edit_distance_full does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

edit_align_full

Signature: (a(N); b(M);
            costsMatch(N+1,M+1); costsIns(N+1,M+1); costsDel(N+1,N+1); costsSubst(N+1,M+1);
            [o]dist(N+1,M+1); [o]align(N+1,M+1))

Convenience method. Compute the edit distance and alignment matrices for inputs $a() and $b(), and cost matrices $costsMatch(), $costsIns(), $costsDel(), and $costsSubst(). $a() and $b() may be specified as PDLs, arrays of numbers, or as strings.

_edit_align_full

Signature: (a1(N1); b1(M1); costsMatch(N1,M1); costsIns(N1,M1); costsDel(N1,M1); costsSubst(N1,M1); [o]dist(N1,M1); byte [o]align(N1,M1))

Low-level method. Compute the edit distance and alignment matrix for input PDLs $a1() and $b1() and cost matrices $costsMatch(), $costsIns(), $costsDel(), and $costsSubst().

The first elements of $a1() and $b1() are ignored.

_edit_align_full does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

edit_distance_static

Signature: (a(N); b(M);
            staticCostMatch(); staticCostIns(); staticCostDel(); staticCostSubst();
            [o]dist(N+1,M+1))

Convenience method. Compute the edit distance matrix for inputs $a() and $b() given a static cost schema @costs = ($staticCostMatch(), $staticCostIns(), $staticCostDel(), and $staticCostSubst()). $a() and $b() may be specified as PDLs, arrays of numbers, or as strings. Functionally equivalent to edit_distance_full($matches,@costs,$dist), but slightly faster.

_edit_distance_static

Signature: (a1(N1); b1(M1); costMatch(); costIns(); costDel(); costSubst(); [o]dist(N1,M1))

Low-level method. Compute the edit distance matrix for input PDLs $a1() and $b1() given a static cost schema @costs = ($costMatch(), $costIns(), $costDel(), $costSubst()). Functionally identitical to _edit_distance_matrix_full($matches,@costs,$dist), but slightly faster.

The first elements of $a1() and $b1() are ignored.

_edit_distance_static does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

edit_align_static

Signature: (a(N); b(M);
            staticCostMatch(); staticCostIns(); staticCostDel(); staticCostSubst();
            [o]dist(N+1,M+1); [o]align(N+1,M+1))

Convenience method. Compute the edit distance and alignment matrices for inputs $a() and $b() given a static cost schema @costs = ($staticCostMatch(), $staticCostIns(), $staticCostDel(), and $staticCostSubst()). $a() and $b() may be specified as PDLs, arrays of numbers, or as strings. Functionally equivalent to edit_align_full($matches,@costs,$dist), but slightly faster.

_edit_align_static

Signature: (a1(N1); b1(M1); costMatch(); costIns(); costDel(); costSubst(); [o]dist(N1,M1); byte [o]align(N1,M1))

Low-level method. Compute the edit distance and alignment matrices for input PDLs $a1() and $b1() given a static cost schema @costs = ($costMatch(), $costIns(), $costDel(), $costSubst()). Functionally identitical to _edit_distance_matrix_full($matches,@costs,$dist), but slightly faster.

The first elements of $a1() and $b1() are ignored.

_edit_align_static does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

align_op_insert1

Signature: ([o]a())

Alignment matrix value constant for insertion operations on $a() string.

align_op_insert1 does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

align_op_insert2

Signature: ([o]a())

Alignment matrix value constant for insertion operations on $a() string.

align_op_insert2 does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

align_op_match

Signature: ([o]a())

Alignment matrix value constant for matches.

align_op_match does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

align_op_substitute

Signature: ([o]a())

Alignment matrix value constant for substitution operations.

align_op_substitute does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

align_op_delete

Alias for align_op_insert1()

align_op_insert

Alias for align_op_insert2()

align_ops

Signature: ([o]ops(4))

Alignment matrix value constants 4-element pdl (match,insert1,insert2,substitute).a

edit_bestpath

Signature: (align(N+1,M+1); [o]apath(N+M+2); [o]bpath(N+M+2); [o]pathlen())

Convenience method. Compute best path through alignment matrix $align(). Stores paths for original input strings $a() and $b() in $apath() and $bpath() respectively. Negative values in $apath() and $bpath() indicate insertion/deletion operations. On completion, $pathlen() holds the actual length of the paths.

_edit_bestpath

Signature: (align(N1,M1); int [o]apath(L); int [o]bpath(L); int [o]len(); int ifinal; int jfinal)

Low-level method. Compute best path through alignment matrix $align() from final index ($ifinal,$jfinal). Stores paths for (original) input strings $a() and $b() in $apath() and $bpath() respectively. Negative values in $apath() and $bpath() indicate insertion/deletion operations. On completion, $pathlen() holds the actual length of the paths.

_edit_bestpath does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

edit_pathtrace

Signature: ( align(N+1,M+1); [o]ai(L); [o]bi(L); [o]ops(L); [o]$pathlen() )

Convenience method. Compute alignment path backtrace through alignment matrix $align() from final index ($ifinal,$jfinal). Stores raw paths for (original) input strings $a() and $b() in $ai() and $bi() respectively. Unlike edit_bestpath(), null-moves for $ai() and $bi() are not stored here as negative values. Returned pdls ($ai,$bi,$ops) are trimmed to the appropriate path length.

_edit_pathtrace

Signature: (align(N1,M1); int [o]ai(L); int [o]bi(L); int [o]ops(L); int [o]len(); int ifinal; int jfinal)

Low-level method. Compute alignment path backtrace through alignment matrix $align() from final index ($ifinal,$jfinal). Stores raw paths for (original) input strings $a() and $b() in $ai() and $bi() respectively. Unlike edit_bestpath(), null-moves for $ai() and $bi() are not stored here as negative values. Returned pdls ($ai,$bi,$ops) are trimmed to the appropriate path length.

_edit_pathtrace does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

edit_lcs

Signature: (a(N); b(M); int [o]lcs(N+1,M+1);)

Convenience method. Compute the longest common subsequence (LCS) matrix for input PDLs $a1() and $b1(). The output matrix $lcs() contains at cell ($i+1,$j+1) the length of the LCS between $a1(0..$i) and $b1(0..$j); thus $lcs($N,$M) contains the length of the LCS between $a() and $b().

_edit_lcs

Signature: (a1(N1); b1(M1); int [o]lcs(N1,M1))

Low-level method. Compute the longest common subsequence (LCS) matrix for input PDLs $a1() and $b1(). The initial (zeroth) elements of $a1() and $b1() are ignored. The output matrix $lcs() contains at cell ($i,$j) the length of the LCS between $a1(1..$i) and $b1(1..$j); thus $lcs($N1-1,$M1-1) contains the length of the LCS between $a1() and $b1().

_edit_lcs does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

lcs_backtrace

Signature: (a(N); b(M); int lcs(N+1,M+1); int ifinal(); int jfinal(); int [o]ai(L); int [o]bi(L); int [o]len())

Convenience method. Compute longest-common-subsequence backtrace through LCS matrix $lcs() for original input strings ($a(),$b()) from final index ($ifinal,$jfinal). Stores raw paths for (original) input strings $a() and $b() in $ai() and $bi() respectively.

_lcs_backtrace

Signature: (a1(N1); b1(M1); int lcs(N1,M1); int ifinal(); int jfinal(); [o]ai(L); [o]bi(L); int [o]len())

Low-level method. Compute longest-common-subsequence backtrace through LCS matrix $lcs() for initial-padded strings ($a1(),$b1()) from final index ($ifinal,$jfinal). Stores raw paths for (original) input strings $a() and $b() in $ai() and $bi() respectively.

_lcs_backtrace does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.

ACKNOWLEDGEMENTS

Perl by Larry Wall.

PDL by Karl Glazebrook, Tuomas J. Lukka, Christian Soeller, and others.

KNOWN BUGS

Probably many.

AUTHOR

Bryan Jurish <moocow@cpan.org>

Copyright (C) 2006-2015, Bryan Jurish. All rights reserved.

This package is free software, and entirely without warranty. You may redistribute it and/or modify it under the same terms as Perl itself, either Perl 5.20.2, or at your option any later version of Perl 5.

SEE ALSO

perl(1), PDL(3perl).