NAME
Text::Levenshtein::Edlib - XS edit distance and optimal alignment path calculation
SYNOPSIS
use feature 'say';
use Text::Levenshtein::Edlib qw/:all/;
say distance 'kitten', 'sitting'; # prints '3'
say 'Distance > 2!'
if !defined distance 'kitten', 'sitting', 2; # prints 'Distance > 2!'
my $align = align('kitten', 'sitting');
say "Edit distance is: $align->{editDistance}";
say "Alphabet length is: $align->{alphabetLength}";
say "Start locations are: @{$align->{startLocations}}";
say "End locations are: @{$align->{endLocations}}";
say "Alignment path is: @{$align->{alignment}}";
say "Alignment path (in CIGAR format): ", to_cigar $align->{alignment};
say "Alignment path (in extended CIGAR format): ",
to_cigar $align->{alignment}, EDLIB_CIGAR_EXTENDED;
DESCRIPTION
Text::Levenshtein::Edlib is a wrapper around the edlib library that computes Levenshtein edit distance and optimal alignment path for a pair of strings.
It does not handle UTF-8 strings, for those Text::Levenshtein::XS can compute edit distance but not alignment path.
This module has two functions:
- distance($query, $target, [$max_distance, [$mode]])
-
This is the basic interface to the library. It is compatible with the function of the same name in Text::Levenshtein::XS.
It returns the edit distance between the two given strings. If the third argument is specified, and the edit distance is greater than the value of the third argument, then the function finishes the computation early and returns undef. See below for the meaning of the optional $mode argument.
- align($query, $target, [$max_distance, [$mode, [$task]]])
-
This is the full-featured interface to the library.
It returns a hashref with the following keys:
editDistance
-
The edit distance of the two strings.
alphabetLength
-
The number of different characters in the query and target together.
endLocations
-
Array of zero-based positions in target where optimal alignment paths end. If gap after query is penalized, gap counts as part of query (NW), otherwise not.
startLocations
-
Array of zero-based positions in target where optimal alignment paths start, they correspond to endLocations. If gap before query is penalized, gap counts as part of query (NW), otherwise not.
alignment
-
Alignment is found for first pair of start and end locations. Alignment is sequence of numbers: 0, 1, 2, 3. 0 stands for match. 1 stands for insertion to target. 2 stands for insertion to query. 3 stands for mismatch. You can use the
EDLIB_EDOP_*
constants instead of 0, 1, 2, and 3. Alignment aligns query to target from begining of query till end of query. If gaps are not penalized, they are not in alignment.
The third argument, $max_distance, works similarly to the third argument of distance: if the distance is more than its value, this function returns an empty hashref. Default value is -1, which disables this optimization.
The fourth argument, $mode, chooses how Edlib should treat gaps before and after query. The options are:
EDLIB_MODE_NW
(default)-
Global method - gaps are not ignored. This is the standard Levenshtein distance, and is the default if $mode is not specified.
EDLIB_MODE_SHW
-
Prefix method - gaps at query end are ignored. So the edit distance between
AACT
andAACTGGC
is 0, because we can ignore theGGC
at the end. EDLIB_MODE_HW
-
Infix method - gaps at both query start and end are ignored. So the edit distance between
ACT
andCGACTGAC
is 0, becauseCG
at the beginning andGAC
at the end of the target are ignored.
The fifth argument, $task, chooses what we want to compute. The options are:
EDLIB_TASK_PATH
(default, slowest)-
All the keys described above will be computed.
EDLIB_TASK_LOC
-
All keys except for
alignment
will be computed. EDLIB_TASK_DISTANCE
(fastest)-
All keys except for
alignment
andstartLocations
will be computed.
The less the function computes, the faster it runs.
EXPORT
All constants by default. You can export the functions align
, distance
and to_cigar
and any of the constants below. You can use the tags :constants
to export every constant, and :all
to export every constant, align
, distance
and to_cigar
.
Exportable constants
EDLIB_CIGAR_EXTENDED
EDLIB_CIGAR_STANDARD
EDLIB_EDOP_DELETE
EDLIB_EDOP_INSERT
EDLIB_EDOP_MATCH
EDLIB_EDOP_MISMATCH
EDLIB_MODE_HW
EDLIB_MODE_NW
EDLIB_MODE_SHW
EDLIB_STATUS_ERROR
EDLIB_STATUS_OK
EDLIB_TASK_DISTANCE
EDLIB_TASK_LOC
EDLIB_TASK_PATH
SEE ALSO
https://github.com/Martinsos/edlib/, http://martinsosic.com/edlib/edlib_8h.html
AUTHOR
Marius Gavrilescu, <marius@ieval.ro>
COPYRIGHT AND LICENSE
Copyright (C) 2017 by Marius Gavrilescu
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.22.3 or, at your option, any later version of Perl 5 you may have available.