NAME
PDL::EditDistance - Wagner-Fischer edit distance and alignment for PDLs.
SYNOPSIS
use PDL;
use PDL::EditDistance;
##-- input PDLs
$a = pdl([map { ord($_) } qw(G U M B O)]);
$b = pdl([map { ord($_) } qw(G A M B O L)]);
$a1 = pdl([0, map { ord($_) } qw(G U M B O)]);
$b1 = pdl([0, map { ord($_) } qw(G A M B O L)]);
##-------------------------------------------------------------
## Levenshtein distance
$dist = edit_distance_static($a,$b, 0,1,1,1);
($dist,$align) = edit_align_static($a,$b, 0,1,1,1);
##-------------------------------------------------------------
## Wagner-Fischer distance
@costs = ($costMatch=0,$costInsert=1,$costDelete=1,$costSubstitute=2);
$dist = edit_distance_static($a,$b, @costs);
($dist,$align) = edit_align_static($a,$b, @costs);
##-------------------------------------------------------------
## General edit distance
$costsMatch = random($a->nelem+1, $b->nelem+1);
$costsIns = random($a->nelem+1, $b->nelem+1);
$costsDel = random($a->nelem+1, $b->nelem+1);
$costsSubst = random($a->nelem+1, $b->nelem+1);
@costs = ($costsMatch,$costsIns,$costDel,$costsSubst);
$dist = edit_distance_full($a,$b,@costs);
($dist,$align) = edit_align_full($a,$b,@costs);
##-------------------------------------------------------------
## Alignment
$op_match = align_op_match(); ##-- constant
$op_del = align_op_insert1(); ##-- constant
$op_ins = align_op_insert2(); ##-- constant
$op_subst = align_op_substitute(); ##-- constant
($apath,$bpath,$pathlen) = edit_bestpath($align);
($ai,$bi,$ops,$pathlen) = edit_pathtrace($align);
##-------------------------------------------------------------
## Longest Common Subsequence
$lcs = edit_lcs($a,$b);
($ai,$bi,$lcslen) = lcs_backtrace($a,$b,$lcs);
FUNCTIONS
_edit_pdl
Signature: (a(N); [o]apdl(N+1))
Convenience method. Returns a pdl $apdl() suitable for representing $a(), which can be specified as a UTF-8 or byte-string, as an arrays of numbers, or as a PDL. $apdl(0) is always set to zero.
edit_costs
Signature: (PDL::Type type; int N; int M;
[o]costsMatch(N+1,M+1); [o]costsIns(N+1,M+1); [o]costsDel(N+1,M+1); [o]costsSubst(N+1,M+1))
Convenience method. Ensures existence and proper dimensionality of cost matrices for inputs of length N and M.
_edit_costs
Signature: (PDL::Type type; int N1; int M1;
[o]costsMatch(N1,M1); [o]costsIns(N1,M1); [o]costsDel(N1,M1); [o]costsSubst(N1,M1))
Low-level method. Ensures existence and proper dimensionality of cost matrices for inputs of length N1-1 and M1-1.
edit_costs_static
Signature: (PDL::Type type; int N; int M;
staticCostMatch(); staticCostIns(); staticCostSubst();
[o]costsMatch(N+1,M+1); [o]costsIns(N+1,M+1); [o]costsDel(N+1,M+1); [o]costsSubst(N+1,M+1))
Convenience method.
edit_distance_full
Signature: (a(N); b(M);
costsMatch(N+1,M+1); costsIns(N+1,M+1); costsDel(N+1,M+1); costsSubst(N+1,M+1);
[o]dist(N+1,M+1); [o]align(N+1,M+1))
Convenience method. Compute the edit distance matrix for inputs $a() and $b(), and cost matrices $costsMatch(), $costsIns(), $costsDel(), and $costsSubst(). $a() and $b() may be specified as PDLs, arrays of numbers, or as strings.
_edit_distance_full
Signature: (a1(N1); b1(M1); costsMatch(N1,M1); costsIns(N1,M1); costsDel(N1,M1); costsSubst(N1,M1); [o]dist(N1,M1))
Low-level method. Compute the edit distance matrix for input PDLs $a1() and $b1() and cost matrices $costsMatch(), $costsIns(), $costsDel(), and $costsSubst().
The first elements of $a1() and $b1() are ignored.
_edit_distance_full does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
edit_align_full
Signature: (a(N); b(M);
costsMatch(N+1,M+1); costsIns(N+1,M+1); costsDel(N+1,N+1); costsSubst(N+1,M+1);
[o]dist(N+1,M+1); [o]align(N+1,M+1))
Convenience method. Compute the edit distance and alignment matrices for inputs $a() and $b(), and cost matrices $costsMatch(), $costsIns(), $costsDel(), and $costsSubst(). $a() and $b() may be specified as PDLs, arrays of numbers, or as strings.
_edit_align_full
Signature: (a1(N1); b1(M1); costsMatch(N1,M1); costsIns(N1,M1); costsDel(N1,M1); costsSubst(N1,M1); [o]dist(N1,M1); byte [o]align(N1,M1))
Low-level method. Compute the edit distance and alignment matrix for input PDLs $a1() and $b1() and cost matrices $costsMatch(), $costsIns(), $costsDel(), and $costsSubst().
The first elements of $a1() and $b1() are ignored.
_edit_align_full does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
edit_distance_static
Signature: (a(N); b(M);
staticCostMatch(); staticCostIns(); staticCostDel(); staticCostSubst();
[o]dist(N+1,M+1))
Convenience method. Compute the edit distance matrix for inputs $a() and $b() given a static cost schema @costs = ($staticCostMatch(), $staticCostIns(), $staticCostDel(), and $staticCostSubst()). $a() and $b() may be specified as PDLs, arrays of numbers, or as strings. Functionally equivalent to edit_distance_full($matches,@costs,$dist), but slightly faster.
_edit_distance_static
Signature: (a1(N1); b1(M1); costMatch(); costIns(); costDel(); costSubst(); [o]dist(N1,M1))
Low-level method. Compute the edit distance matrix for input PDLs $a1() and $b1() given a static cost schema @costs = ($costMatch(), $costIns(), $costDel(), $costSubst()). Functionally identitical to _edit_distance_matrix_full($matches,@costs,$dist), but slightly faster.
The first elements of $a1() and $b1() are ignored.
_edit_distance_static does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
edit_align_static
Signature: (a(N); b(M);
staticCostMatch(); staticCostIns(); staticCostDel(); staticCostSubst();
[o]dist(N+1,M+1); [o]align(N+1,M+1))
Convenience method. Compute the edit distance and alignment matrices for inputs $a() and $b() given a static cost schema @costs = ($staticCostMatch(), $staticCostIns(), $staticCostDel(), and $staticCostSubst()). $a() and $b() may be specified as PDLs, arrays of numbers, or as strings. Functionally equivalent to edit_align_full($matches,@costs,$dist), but slightly faster.
_edit_align_static
Signature: (a1(N1); b1(M1); costMatch(); costIns(); costDel(); costSubst(); [o]dist(N1,M1); byte [o]align(N1,M1))
Low-level method. Compute the edit distance and alignment matrices for input PDLs $a1() and $b1() given a static cost schema @costs = ($costMatch(), $costIns(), $costDel(), $costSubst()). Functionally identitical to _edit_distance_matrix_full($matches,@costs,$dist), but slightly faster.
The first elements of $a1() and $b1() are ignored.
_edit_align_static does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
align_op_insert1
Signature: ([o]a())
Alignment matrix value constant for insertion operations on $a() string.
align_op_insert1 does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
align_op_insert2
Signature: ([o]a())
Alignment matrix value constant for insertion operations on $a() string.
align_op_insert2 does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
align_op_match
Signature: ([o]a())
Alignment matrix value constant for matches.
align_op_match does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
align_op_substitute
Signature: ([o]a())
Alignment matrix value constant for substitution operations.
align_op_substitute does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
align_op_delete
Alias for align_op_insert1()
align_op_insert
Alias for align_op_insert2()
align_ops
Signature: ([o]ops(4))
Alignment matrix value constants 4-element pdl (match,insert1,insert2,substitute).a
edit_bestpath
Signature: (align(N+1,M+1); [o]apath(N+M+2); [o]bpath(N+M+2); [o]pathlen())
Convenience method. Compute best path through alignment matrix $align(). Stores paths for original input strings $a() and $b() in $apath() and $bpath() respectively. Negative values in $apath() and $bpath() indicate insertion/deletion operations. On completion, $pathlen() holds the actual length of the paths.
_edit_bestpath
Signature: (align(N1,M1); int [o]apath(L); int [o]bpath(L); int [o]len(); int ifinal; int jfinal)
Low-level method. Compute best path through alignment matrix $align() from final index ($ifinal,$jfinal). Stores paths for (original) input strings $a() and $b() in $apath() and $bpath() respectively. Negative values in $apath() and $bpath() indicate insertion/deletion operations. On completion, $pathlen() holds the actual length of the paths.
_edit_bestpath does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
edit_pathtrace
Signature: ( align(N+1,M+1); [o]ai(L); [o]bi(L); [o]ops(L); [o]$pathlen() )
Convenience method. Compute alignment path backtrace through alignment matrix $align() from final index ($ifinal,$jfinal). Stores raw paths for (original) input strings $a() and $b() in $ai() and $bi() respectively. Unlike edit_bestpath(), null-moves for $ai() and $bi() are not stored here as negative values. Returned pdls ($ai,$bi,$ops) are trimmed to the appropriate path length.
_edit_pathtrace
Signature: (align(N1,M1); int [o]ai(L); int [o]bi(L); int [o]ops(L); int [o]len(); int ifinal; int jfinal)
Low-level method. Compute alignment path backtrace through alignment matrix $align() from final index ($ifinal,$jfinal). Stores raw paths for (original) input strings $a() and $b() in $ai() and $bi() respectively. Unlike edit_bestpath(), null-moves for $ai() and $bi() are not stored here as negative values. Returned pdls ($ai,$bi,$ops) are trimmed to the appropriate path length.
_edit_pathtrace does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
edit_lcs
Signature: (a(N); b(M); int [o]lcs(N+1,M+1);)
Convenience method. Compute the longest common subsequence (LCS) matrix for input PDLs $a1() and $b1(). The output matrix $lcs() contains at cell ($i+1,$j+1) the length of the LCS between $a1(0..$i) and $b1(0..$j); thus $lcs($N,$M) contains the length of the LCS between $a() and $b().
_edit_lcs
Signature: (a1(N1); b1(M1); int [o]lcs(N1,M1))
Low-level method. Compute the longest common subsequence (LCS) matrix for input PDLs $a1() and $b1(). The initial (zeroth) elements of $a1() and $b1() are ignored. The output matrix $lcs() contains at cell ($i,$j) the length of the LCS between $a1(1..$i) and $b1(1..$j); thus $lcs($N1-1,$M1-1) contains the length of the LCS between $a1() and $b1().
_edit_lcs does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
lcs_backtrace
Signature: (a(N); b(M); int lcs(N+1,M+1); int ifinal(); int jfinal(); int [o]ai(L); int [o]bi(L); int [o]len())
Convenience method. Compute longest-common-subsequence backtrace through LCS matrix $lcs() for original input strings ($a(),$b()) from final index ($ifinal,$jfinal). Stores raw paths for (original) input strings $a() and $b() in $ai() and $bi() respectively.
_lcs_backtrace
Signature: (a1(N1); b1(M1); int lcs(N1,M1); int ifinal(); int jfinal(); [o]ai(L); [o]bi(L); int [o]len())
Low-level method. Compute longest-common-subsequence backtrace through LCS matrix $lcs() for initial-padded strings ($a1(),$b1()) from final index ($ifinal,$jfinal). Stores raw paths for (original) input strings $a() and $b() in $ai() and $bi() respectively.
_lcs_backtrace does not process bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
ACKNOWLEDGEMENTS
Perl by Larry Wall.
PDL by Karl Glazebrook, Tuomas J. Lukka, Christian Soeller, and others.
KNOWN BUGS
Probably many.
AUTHOR
Bryan Jurish <moocow@cpan.org>
Copyright Policy
Copyright (C) 2006-2015, Bryan Jurish. All rights reserved.
This package is free software, and entirely without warranty. You may redistribute it and/or modify it under the same terms as Perl itself, either Perl 5.20.2, or at your option any later version of Perl 5.
SEE ALSO
perl(1), PDL(3perl).