NAME

Bio::Cigar - Parse CIGAR strings and translate coordinates to/from reference/query

SYNOPSIS

use 5.014;
use Bio::Cigar;
my $cigar = Bio::Cigar->new("2M1D1M1I4M");
say "Query length is ", $cigar->query_length;
say "Reference length is ", $cigar->reference_length;

my ($qpos, $op) = $cigar->rpos_to_qpos(3);
say "Alignment operation at reference position 3 is $op";

my $query = "GCAAATGC";
my $ref   = "AAAAGCAAATGC";
my $aln   = $cigar->align($query, $ref, 5);  # align query to pos 5 of ref
say foreach @$aln;

DESCRIPTION

Bio::Cigar is a small library to parse CIGAR strings ("Compact Idiosyncratic Gapped Alignment Report"), such as those used in the SAM file format. CIGAR strings are a run-length encoding which minimally describes the alignment of a query sequence to an (often longer) reference sequence.

Parsing follows the SAM v1 spec for the CIGAR column.

Parsed strings are represented by an object that provides a few utility methods.

ATTRIBUTES

All attributes are read-only.

string

The CIGAR string for this object.

reference_length

The length of the reference sequence segment aligned with the query sequence described by the CIGAR string.

query_length

The length of the query sequence described by the CIGAR string.

ops

An arrayref of [length, operation] tuples describing the CIGAR string. Lengths are integers, possible operations are below.

CIGAR operations

The CIGAR operations are given in the following table, taken from the SAM v1 spec:

Op  Description
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
M   alignment match (can be a sequence match or mismatch)
I   insertion to the reference
D   deletion from the reference
N   skipped region from the reference
S   soft clipping (clipped sequences present in SEQ)
H   hard clipping (clipped sequences NOT present in SEQ)
P   padding (silent deletion from padded reference)
=   sequence match
X   sequence mismatch

• H can only be present as the first and/or last operation.
• S may only have H operations between them and the ends of the string.
• For mRNA-to-genome alignment, an N operation represents an intron.
  For other types of alignments, the interpretation of N is not defined.
• Sum of the lengths of the M/I/S/=/X operations shall equal the length of SEQ.

CONSTRUCTOR

new

Takes a CIGAR string as the sole argument and returns a new Bio::Cigar object.

METHODS

rpos_to_qpos

Takes a reference position (origin 1, base-numbered) and returns the corresponding position (origin 1, base-numbered) on the query sequence. Indels affect how the numbering maps from reference to query.

In list context returns a tuple of [query position, operation at position]. Operation is a single-character string. See the table of CIGAR operations.

If the reference position does not map to the query sequence (as with a deletion, for example), returns undef or [undef, operation].

qpos_to_rpos

Takes a query position (origin 1, base-numbered) and returns the corresponding position (origin 1, base-numbered) on the reference sequence. Indels affect how the numbering maps from query to reference.

In list context returns a tuple of [references position, operation at position]. Operation is a single-character string. See the table of CIGAR operations.

If the query position does not map to the reference sequence (as with an insertion, for example), returns undef or [undef, operation].

op_at_rpos

Takes a reference position and returns the operation at that position. Simply a shortcut for calling "rpos_to_qpos" in list context and discarding the first return value.

op_at_qpos

Takes a query position and returns the operation at that position. Simply a shortcut for calling "qpos_to_rpos" in list context and discarding the first return value.

reversed

Returns a new Bio::Cigar object with a CIGAR string that's the reverse of this one, i.e. the last operation becomes the first, the second-to-last the second, etc. until the first operation becomes the last.

align($query, $reference, $start_pos=1, $reversed=0)

Takes a query sequence and a reference sequence and aligns them according to the CIGAR string, using gap characters (-) for indels and spaces for soft clipping. This is pure string manipulation and as such the match and mismatch operators (= and X) are assumed to be correct for the given input sequences and not verified. Returns an array ref of [query seq, ref seq].

Optionally, the leftmost reference position (origin 1) can be passed, i.e. the query is aligned starting at that position.

When $reversed is given a true value, the reverse complement of the passed query sequence is used to generate the alignment. Only the IUPAC nucleotide codes ATCGU are currently supported for reverse complementation.

AUTHOR

Thomas Sibley <trsibley@uw.edu>

Felix Kühnl <felix@bioinf.uni-leipzig.de>

COPYRIGHT

Copyright 2014- Mullins Lab, Department of Microbiology, University of Washington.

LICENSE

This library is free software; you can redistribute it and/or modify it under the GNU General Public License, version 2.

SEE ALSO

SAMv1 spec