NAME

Text::Levenshtein::Edlib - XS edit distance and optimal alignment path calculation

SYNOPSIS

use feature 'say';

use Text::Levenshtein::Edlib qw/:all/;
say distance 'kitten', 'sitting'; # prints '3'
say 'Distance > 2!'
    if !defined distance 'kitten', 'sitting', 2; # prints 'Distance > 2!'

my $align = align('kitten', 'sitting');
say "Edit distance is: $align->{editDistance}";
say "Alphabet length is: $align->{alphabetLength}";
say "Start locations are: @{$align->{startLocations}}";
say "End locations are: @{$align->{endLocations}}";
say "Alignment path is: @{$align->{alignment}}";
say "Alignment path (in CIGAR format): ", to_cigar $align->{alignment};
say "Alignment path (in extended CIGAR format): ",
    to_cigar $align->{alignment}, EDLIB_CIGAR_EXTENDED;

DESCRIPTION

Text::Levenshtein::Edlib is a wrapper around the edlib library that computes Levenshtein edit distance and optimal alignment path for a pair of strings.

It does not handle UTF-8 strings, for those Text::Levenshtein::XS can compute edit distance but not alignment path.

This module has two functions:

distance($query, $target, [$max_distance, [$mode]])

This is the basic interface to the library. It is compatible with the function of the same name in Text::Levenshtein::XS.

It returns the edit distance between the two given strings. If the third argument is specified, and the edit distance is greater than the value of the third argument, then the function finishes the computation early and returns undef. See below for the meaning of the optional $mode argument.

align($query, $target, [$max_distance, [$mode, [$task]]])

This is the full-featured interface to the library.

It returns a hashref with the following keys:

editDistance

The edit distance of the two strings.

alphabetLength

The number of different characters in the query and target together.

endLocations

Array of zero-based positions in target where optimal alignment paths end. If gap after query is penalized, gap counts as part of query (NW), otherwise not.

startLocations

Array of zero-based positions in target where optimal alignment paths start, they correspond to endLocations. If gap before query is penalized, gap counts as part of query (NW), otherwise not.

alignment

Alignment is found for first pair of start and end locations. Alignment is sequence of numbers: 0, 1, 2, 3. 0 stands for match. 1 stands for insertion to target. 2 stands for insertion to query. 3 stands for mismatch. You can use the EDLIB_EDOP_* constants instead of 0, 1, 2, and 3. Alignment aligns query to target from begining of query till end of query. If gaps are not penalized, they are not in alignment.

The third argument, $max_distance, works similarly to the third argument of distance: if the distance is more than its value, this function returns an empty hashref. Default value is -1, which disables this optimization.

The fourth argument, $mode, chooses how Edlib should treat gaps before and after query. The options are:

EDLIB_MODE_NW (default)

Global method - gaps are not ignored. This is the standard Levenshtein distance, and is the default if $mode is not specified.

EDLIB_MODE_SHW

Prefix method - gaps at query end are ignored. So the edit distance between AACT and AACTGGC is 0, because we can ignore the GGC at the end.

EDLIB_MODE_HW

Infix method - gaps at both query start and end are ignored. So the edit distance between ACT and CGACTGAC is 0, because CG at the beginning and GAC at the end of the target are ignored.

The fifth argument, $task, chooses what we want to compute. The options are:

EDLIB_TASK_PATH (default, slowest)

All the keys described above will be computed.

EDLIB_TASK_LOC

All keys except for alignment will be computed.

EDLIB_TASK_DISTANCE (fastest)

All keys except for alignment and startLocations will be computed.

The less the function computes, the faster it runs.

EXPORT

All constants by default. You can export the functions align, distance and to_cigar and any of the constants below. You can use the tags :constants to export every constant, and :all to export every constant, align, distance and to_cigar.

Exportable constants

EDLIB_CIGAR_EXTENDED
EDLIB_CIGAR_STANDARD
EDLIB_EDOP_DELETE
EDLIB_EDOP_INSERT
EDLIB_EDOP_MATCH
EDLIB_EDOP_MISMATCH
EDLIB_MODE_HW
EDLIB_MODE_NW
EDLIB_MODE_SHW
EDLIB_STATUS_ERROR
EDLIB_STATUS_OK
EDLIB_TASK_DISTANCE
EDLIB_TASK_LOC
EDLIB_TASK_PATH

SEE ALSO

https://github.com/Martinsos/edlib/, http://martinsosic.com/edlib/edlib_8h.html

AUTHOR

Marius Gavrilescu, <marius@ieval.ro>

COPYRIGHT AND LICENSE

Copyright (C) 2017 by Marius Gavrilescu

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.22.3 or, at your option, any later version of Perl 5 you may have available.