NAME

Bio::Cigar - Parse CIGAR strings and translate coordinates to/from reference/query

SYNOPSIS

use 5.014;
use Bio::Cigar;
my $cigar = Bio::Cigar->new("2M1D1M1I4M");
say "Query length is ", $cigar->query_length;
say "Reference length is ", $cigar->reference_length;

my ($qpos, $op) = $cigar->rpos_to_qpos(3);
say "Alignment operation at reference position 3 is $op";

DESCRIPTION

Bio::Cigar is a small library to parse CIGAR strings ("Compact Idiosyncratic Gapped Alignment Report"), such as those used in the SAM file format. CIGAR strings are a run-length encoding which minimally describes the alignment of a query sequence to an (often longer) reference sequence.

Parsing follows the SAM v1 spec for the CIGAR column.

Parsed strings are represented by an object that provides a few utility methods.

ATTRIBUTES

All attributes are read-only.

string

The CIGAR string for this object.

reference_length

The length of the reference sequence segment aligned with the query sequence described by the CIGAR string.

query_length

The length of the query sequence described by the CIGAR string.

ops

An arrayref of [length, operation] tuples describing the CIGAR string. Lengths are integers; the possible operations are listed below.
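To make the tuple structure concrete, the sketch below splits a CIGAR string into [length, operation] pairs with a regular expression. It is only an illustration of the data shape, not the module's actual parser; the helper name parse_cigar_sketch is hypothetical and validation is minimal.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Cigar): split a CIGAR string into
# [length, operation] tuples, e.g. "2M1D" becomes [[2, "M"], [1, "D"]].
sub parse_cigar_sketch {
    my ($string) = @_;
    my @ops;
    # Each element is an integer length followed by one operation character.
    # \G with /gc resumes where the previous match ended and keeps that
    # position when the next match fails.
    while ($string =~ /\G(\d+)([MIDNSHP=X])/gc) {
        push @ops, [ $1 + 0, $2 ];
    }
    # Anything left unmatched means the string is not a valid CIGAR.
    die "Invalid CIGAR string: $string\n"
        unless @ops and (pos($string) // 0) == length $string;
    return \@ops;
}

my $parsed = parse_cigar_sketch("2M1D1M1I4M");
print join(" ", map { sprintf "%d%s", @$_ } @$parsed), "\n";
```

The sketch checks only the shape of each element; it does not enforce the placement rules for H and S given in the notes below the operations table.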

CIGAR operations

The CIGAR operations are given in the following table, taken from the SAM v1 spec:

Op  Description
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
M   alignment match (can be a sequence match or mismatch)
I   insertion to the reference
D   deletion from the reference
N   skipped region from the reference
S   soft clipping (clipped sequences present in SEQ)
H   hard clipping (clipped sequences NOT present in SEQ)
P   padding (silent deletion from padded reference)
=   sequence match
X   sequence mismatch

• H can only be present as the first and/or last operation.
• S may only have H operations between them and the ends of the string.
• For mRNA-to-genome alignment, an N operation represents an intron.
  For other types of alignments, the interpretation of N is not defined.
• Sum of the lengths of the M/I/S/=/X operations shall equal the length of SEQ.
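The last rule above gives the set of operations that advance along the query. Combined with the set that advances along the reference (M/D/N/=/X in the SAM v1 spec), both lengths can be derived from the parsed operations. The following is a minimal sketch under those assumptions; %consumes_query, %consumes_reference and lengths_sketch are illustrative names, not part of Bio::Cigar.

```perl
use strict;
use warnings;

# Operations that advance along the query (SEQ) and along the reference,
# per the SAM v1 spec. H and P advance neither.
my %consumes_query     = map { $_ => 1 } qw(M I S = X);
my %consumes_reference = map { $_ => 1 } qw(M D N = X);

# Hypothetical helper: sum the lengths of the operations that consume
# each sequence. Takes an arrayref of [length, operation] tuples.
sub lengths_sketch {
    my ($ops) = @_;
    my ($qlen, $rlen) = (0, 0);
    for my $tuple (@$ops) {
        my ($len, $op) = @$tuple;
        $qlen += $len if $consumes_query{$op};
        $rlen += $len if $consumes_reference{$op};
    }
    return ($qlen, $rlen);
}

# 2M1D1M1I4M written out as tuples.
my @example_ops = ([2, "M"], [1, "D"], [1, "M"], [1, "I"], [4, "M"]);
my ($qlen, $rlen) = lengths_sketch(\@example_ops);
print "query length $qlen, reference length $rlen\n";   # both are 8 here
```

For this string the deletion adds one base to the reference side and the insertion adds one base to the query side, so the two lengths happen to match.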

CONSTRUCTOR

new

Takes a CIGAR string as the sole argument and returns a new Bio::Cigar object.

METHODS

rpos_to_qpos

Takes a reference position (origin 1, base-numbered) and returns the corresponding position (origin 1, base-numbered) on the query sequence. Indels affect how the numbering maps from reference to query.

In list context, returns a tuple of [query position, operation at position]. Operation is a single-character string. See the table of CIGAR operations.

If the reference position does not map to the query sequence (as with a deletion, for example), returns undef in scalar context or [undef, operation] in list context.
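One way to picture the mapping is as a walk over the operations, counting the bases consumed on each sequence until the requested reference position is reached. The sketch below illustrates that idea on plain [length, operation] tuples. It is an assumption about how such a mapping can be computed, not the module's implementation, and rpos_to_qpos_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Operations that advance along the query and the reference (SAM v1 spec).
my %consumes_query     = map { $_ => 1 } qw(M I S = X);
my %consumes_reference = map { $_ => 1 } qw(M D N = X);

# Hypothetical helper: map a 1-origin reference position (>= 1) to
# (query position, operation). The query position is undef when the
# reference base has no counterpart on the query, as in a deletion.
sub rpos_to_qpos_sketch {
    my ($ops, $rpos) = @_;
    my ($r, $q) = (0, 0);    # bases consumed so far on reference and query
    for my $tuple (@$ops) {
        my ($len, $op) = @$tuple;
        if ($consumes_reference{$op} and $rpos <= $r + $len) {
            # The requested position lies inside this operation.
            return (undef, $op) unless $consumes_query{$op};
            return ($q + ($rpos - $r), $op);
        }
        $r += $len if $consumes_reference{$op};
        $q += $len if $consumes_query{$op};
    }
    # Beyond the aligned reference segment: this sketch returns an empty
    # list; how the module itself treats this case is not covered here.
    return;
}

# 2M1D1M1I4M written out as tuples.
my @walk_ops = ([2, "M"], [1, "D"], [1, "M"], [1, "I"], [4, "M"]);
for my $rpos (3, 4, 5) {
    my ($qpos, $op) = rpos_to_qpos_sketch(\@walk_ops, $rpos);
    printf "reference %d -> query %s (%s)\n", $rpos, $qpos // "undef", $op;
}
```

With 2M1D1M1I4M, reference position 3 falls in the deletion and has no query position, position 4 maps to query position 3, and position 5 maps to query position 5 because the insertion occupies query position 4.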

qpos_to_rpos

Takes a query position (origin 1, base-numbered) and returns the corresponding position (origin 1, base-numbered) on the reference sequence. Indels affect how the numbering maps from query to reference.

In list context, returns a tuple of [reference position, operation at position]. Operation is a single-character string. See the table of CIGAR operations.

If the query position does not map to the reference sequence (as with an insertion, for example), returns undef in scalar context or [undef, operation] in list context.

op_at_rpos

Takes a reference position and returns the operation at that position. Simply a shortcut for calling "rpos_to_qpos" in list context and discarding the first return value.

op_at_qpos

Takes a query position and returns the operation at that position. Simply a shortcut for calling "qpos_to_rpos" in list context and discarding the first return value.

AUTHOR

Thomas Sibley <trsibley@uw.edu>

COPYRIGHT

Copyright 2014- Mullins Lab, Department of Microbiology, University of Washington.

LICENSE

This library is free software; you can redistribute it and/or modify it under the GNU General Public License, version 2.

SEE ALSO

SAMv1 spec