LICENSE
Copyright [1999-2015] Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute
Copyright [2016-2024] EMBL-European Bioinformatics Institute
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
CONTACT
Please email comments or questions to the public Ensembl
developers list at <http://lists.ensembl.org/mailman/listinfo/dev>.
Questions may also be sent to the Ensembl help desk at
<http://www.ensembl.org/Help/Contact>.
NAME
Bio::EnsEMBL::BaseAlignFeature - Baseclass providing a common abstract implementation for alignment features
SYNOPSIS
  my $feat = Bio::EnsEMBL::DnaPepAlignFeature->new(
    -slice        => $slice,
    -start        => 100,
    -end          => 120,
    -strand       => 1,
    -hseqname     => 'SP:RF1231',
    -hstart       => 200,
    -hend         => 220,
    -analysis     => $analysis,
    -cigar_string => '10M3D5M2I',
    -align_type   => 'ensembl'
  );
where $analysis is a Bio::EnsEMBL::Analysis object.
Alternatively if you have an array of ungapped features:
  my $feat =
    Bio::EnsEMBL::DnaPepAlignFeature->new( -features => \@features );
where @features is an array of Bio::EnsEMBL::FeaturePair objects.
There is a method to (re)create ungapped features from the cigar_string:
my @ungapped_features = $feat->ungapped_features();
where @ungapped_features is an array of Bio::EnsEMBL::FeaturePair objects.
Bio::EnsEMBL::BaseAlignFeature inherits from:
Bio::EnsEMBL::FeaturePair, which in turn inherits from:
Bio::EnsEMBL::Feature,
thus methods from both parent classes are available.
The cigar_string is a condensed representation of the matches and gaps
which make up the gapped alignment (where CIGAR stands for
Concise Idiosyncratic Gapped Alignment Report).
CIGAR format is: n <matches> [ x <deletes or inserts> m <matches> ]*
where M = match, D = delete, I = insert; n, m are match lengths;
x is delete or insert length.
Spaces are omitted, thus: "23M4I12M2D1M"
as are counts for any lengths of 1, thus 1M becomes M: "23M4I12M2DM"
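The format rules above can be sketched as a small stand-alone parser. This is only an illustration, not the module's own code: the helper name parse_cigar is hypothetical, and it assumes the string contains nothing but optional counts followed by M, D or I.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::EnsEMBL): split an Ensembl-style
# cigar string into [length, operation] pairs.
# A missing count is read as 1, so "M" is the same as "1M".
sub parse_cigar {
    my ($cigar) = @_;
    my @pieces;
    while ($cigar =~ /\G(\d*)([MDI])/gc) {
        my $length = $1 eq '' ? 1 : $1;
        push @pieces, [ $length + 0, $2 ];
    }
    # Anything left unparsed means the string was not a valid cigar.
    die "Malformed cigar string: $cigar\n"
        if (pos($cigar) // 0) != length $cigar;
    return @pieces;
}

my @pieces = parse_cigar('23M4I12M2DM');
print join(' ', map { "$_->[0]$_->[1]" } @pieces), "\n";
# prints "23M 4I 12M 2D 1M"
```

Both "23M4I12M2D1M" and "23M4I12M2DM" yield the same five pieces, since the omitted count is filled in as 1.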
To make things clearer this is how a blast HSP would be parsed:
>AK014066
Length = 146
Minus Strand HSPs:
Score = 76 (26.8 bits), Expect = 1.4, P = 0.74
Identities = 20/71 (28%), Positives = 29/71 (40%), Frame = -1
Query: 479 GLQAPPPTPQGCRLIPPPPLGLQAPLPTLRAVGSSHHHP*GRQGSSLSSFRSSLASKASA 300
G APPP PQG R P P G + P L + + ++ R +A +
Sbjct: 7 GALAPPPAPQG-RWAFPRPTG-KRPATPLHGTARQDRQVRRSEAAKVTGCRGRVAPHVAP 64
Query: 299 SSPHNPSPLPS 267
H P+P P+
Sbjct: 65 PLTHTPTPTPT 75
The alignment goes from 479 down to 267 in the query sequence on the reverse
strand, and from 7 to 75 in the subject sequence.
The alignment is made up of the following ungapped pieces:
query_seq start 447 , sbjct_seq hstart 7 , match length 33 , strand -1
query_seq start 417 , sbjct_seq hstart 18 , match length 27 , strand -1
query_seq start 267 , sbjct_seq hstart 27 , match length 147 , strand -1
When assembled into a DnaPepAlignFeature where:
(seqname, start, end, strand) refer to the query sequence,
(hseqname, hstart, hend, hstrand) refer to the subject sequence,
these ungapped pieces are represented by the cigar string:
33M3I27M3I147M
with start 267, end 479, strand -1, and hstart 7, hend 75, hstrand 1.
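As a cross-check of the example above, the following sketch rebuilds the three ungapped pieces from the cigar string and the feature coordinates. It is an illustration under stated assumptions, not the module's implementation: the function ungapped_blocks and its argument names are hypothetical, cigar lengths are taken to be in query (DNA) units, three query units are assumed to correspond to one hit (peptide) unit, and hstrand is assumed to be 1.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild ungapped blocks from an Ensembl-style cigar.
# Assumes hstrand is 1 and that cigar lengths are in query units.
sub ungapped_blocks {
    my (%arg) = @_;
    my $unit_ratio = $arg{query_unit} / $arg{hit_unit};   # 3 for DNA vs peptide
    # On the reverse strand the first cigar piece sits at the high end
    # of the query range, so we walk downwards from 'end'.
    my $qpos = $arg{strand} == 1 ? $arg{start} : $arg{end};
    my $hpos = $arg{hstart};
    my @blocks;
    for my $piece ($arg{cigar} =~ /(\d*[MDI])/g) {
        my ($len, $op) = $piece =~ /^(\d*)([MDI])$/;
        $len = 1 if $len eq '';
        my $hlen = $len / $unit_ratio;
        if ($op eq 'M') {
            my $qstart = $arg{strand} == 1 ? $qpos : $qpos - $len + 1;
            push @blocks, { start => $qstart, hstart => $hpos, length => $len };
        }
        # M and I consume query positions; M and D consume hit positions.
        $qpos += $arg{strand} * $len if $op ne 'D';
        $hpos += $hlen               if $op ne 'I';
    }
    return @blocks;
}

my @blocks = ungapped_blocks(
    cigar  => '33M3I27M3I147M',
    start  => 267, end => 479, strand => -1,
    hstart => 7,
    query_unit => 3, hit_unit => 1,
);
printf "query start %d, hstart %d, match length %d\n",
    @{$_}{qw(start hstart length)} for @blocks;
# prints:
# query start 447, hstart 7, match length 33
# query start 417, hstart 18, match length 27
# query start 267, hstart 27, match length 147
```

The piece lengths and gaps add up to 33 + 3 + 27 + 3 + 147 = 213 query positions, which equals 479 - 267 + 1.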
CAVEATS
AlignFeature cigar strings have the opposite 'sense'
('D' and 'I' swapped) compared with Exonerate cigar strings.
Exonerate modules in Bio::EnsEMBL::Analysis use this convention:
The longer genomic sequence specified by:
exonerate: target
AlignFeature: (sequence, start, end, strand)
A shorter sequence (such as EST or protein) specified by:
exonerate: query
AlignFeature: (hsequence, hstart, hend, hstrand)
The resulting AlignFeature cigar strings have 'D' and 'I'
swapped compared with the Exonerate output, i.e.:
exonerate: M 123 D 1 M 11 I 1 M 39
AlignFeature: 123MI11MD39M
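The swap described above can be sketched as a small conversion routine. The function name exonerate_to_ensembl_cigar is hypothetical and the input is assumed to be the space-separated "operation length" pairs shown in the example; this is not taken from the Bio::EnsEMBL::Analysis modules.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a space-separated Exonerate cigar
# ("M 123 D 1 ...") into the AlignFeature form by swapping D and I,
# putting the count first, and dropping counts of 1.
sub exonerate_to_ensembl_cigar {
    my ($exonerate_cigar) = @_;
    my %swap = (M => 'M', D => 'I', I => 'D');
    my @tokens = split ' ', $exonerate_cigar;
    die "Expected operation/length pairs\n" if @tokens % 2;
    my $cigar = '';
    while (my ($op, $len) = splice(@tokens, 0, 2)) {
        die "Unexpected operation '$op'\n" unless exists $swap{$op};
        $cigar .= ($len == 1 ? '' : $len) . $swap{$op};
    }
    return $cigar;
}

print exonerate_to_ensembl_cigar('M 123 D 1 M 11 I 1 M 39'), "\n";
# prints "123MI11MD39M"
```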
METHODS
new
Arg [..] : List of named arguments. (-cigar_string, -features, -align_type) are defined
in this constructor; others are defined in the FeaturePair and
Feature superclasses. Provide either a cigar_string or a list
of ungapped features, not both.
Example : $baf = BaseAlignFeatureSubclass->new(-cigar_string => '3M3I12M', -align_type => 'ensembl');
Description: Creates a new BaseAlignFeature using either a cigar string or
a list of ungapped features. BaseAlignFeature is an abstract
baseclass and should not actually be instantiated - rather its
subclasses should be.
Returntype : Bio::EnsEMBL::BaseAlignFeature
Exceptions : thrown if both feature and cigar string args are provided
thrown if neither feature nor cigar string args are provided
warns if a cigar string is provided without an align_type
Caller : general
Status : Stable
cigar_string
Arg [1] : string $cigar_string
Example : $feature->cigar_string( "12MI3M" );
Description: get/set for attribute cigar_string.
cigar_string describes the alignment:
"xM" stands for x matches (or mismatches),
"xI" for x inserts into the query sequence,
"xD" for x deletions from the query sequence
where the query sequence is specified by (seqname, start, ...)
and the subject sequence by (hseqname, hstart, ...).
An "x" that is 1 can be omitted.
See the SYNOPSIS for an example.
Returntype : string
Exceptions : none
Caller : general
Status : Stable
align_type
Arg [1] : string $align_type
Example : $feature->align_type( "ensembl" );
Description: get/set for attribute align_type.
align_type specifies which cigar string format
is used to describe the alignment.
The default is 'ensembl'.
Returntype : string
Exceptions : none
Caller : general
Status : Stable
alignment_length
Arg [1] : None
Description: returns the alignment length (including indels) based on the align_type ('ensembl', 'mdtag')
Returntype : int
Exceptions :
Caller :
Status : Stable
_ensembl_cigar_alignment_length
Arg [1] : None
Description: return the alignment length (including indels) based on the cigar_string
Returntype : int
Exceptions :
Caller :
Status : Stable
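One plausible reading of "alignment length (including indels)" for an 'ensembl' cigar string is the sum of all piece lengths, with omitted counts taken as 1. The sketch below follows that assumption; the function name cigar_alignment_length is hypothetical and the code is not the method's actual body.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the lengths of every M, D and I piece.
# Assumption: indels count toward the total and a missing count means 1.
sub cigar_alignment_length {
    my ($cigar) = @_;
    my $total = 0;
    while ($cigar =~ /(\d*)[MDI]/g) {
        $total += $1 eq '' ? 1 : $1;
    }
    return $total;
}

print cigar_alignment_length('10M3D5M2I'), "\n";      # prints 20
print cigar_alignment_length('123MI11MD39M'), "\n";   # prints 175
```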
ungapped_features
Args : none
Example : @ungapped_features = $align_feature->ungapped_features;
Description: converts the internal cigar_string into an array of
ungapped feature pairs
Returntype : list of Bio::EnsEMBL::FeaturePair
Exceptions : thrown if cigar_string is not set internally
Caller : general
Status : Stable
strands_reversed
Arg [1] : int $strands_reversed
Description: get/set for attribute strands_reversed
0 means that strand and hstrand are the original strands obtained
from the alignment program used
1 means that strand and hstrand have been flipped as compared to
the original result provided by the alignment program used.
You may want to use the reverse_complement method to restore the
original strandness.
Returntype : int
Exceptions : none
Caller : general
Status : Stable
reverse_complement
Args : none
Description: reverse complements the FeaturePair based on the cigar type,
modifying strand, hstrand and cigar_string accordingly
Returntype : none
Exceptions : none
Caller : general
Status : Stable
_ensembl_reverse_complement
Args : none
Description: reverse complements the FeaturePair for an ensembl cigar string,
modifying strand, hstrand and cigar_string accordingly
Returntype : none
Exceptions : none
Caller : general
Status : Stable
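A minimal sketch of what reverse complementing an 'ensembl' cigar feature could involve: both strands are flipped, the order of the cigar pieces is reversed, and strands_reversed is toggled. The feature here is a plain hash used as a hypothetical stand-in for a real object, and the function name reverse_complement_feature is an assumption for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in: operates on a plain hash reference,
# not on a Bio::EnsEMBL object.
sub reverse_complement_feature {
    my ($feature) = @_;
    # Flip the strand on both the query and the hit.
    $feature->{strand}  *= -1;
    $feature->{hstrand} *= -1;
    # Reverse the order of the cigar pieces, keeping each piece intact.
    my @pieces = $feature->{cigar_string} =~ /(\d*[MDI])/g;
    $feature->{cigar_string} = join '', reverse @pieces;
    # Record whether the strands differ from the aligner's original output.
    $feature->{strands_reversed} = $feature->{strands_reversed} ? 0 : 1;
    return;
}

my %feature = (
    strand => -1, hstrand => 1,
    cigar_string => '33M3I27M3I147M',
    strands_reversed => 0,
);
reverse_complement_feature(\%feature);
print "$feature{strand} $feature{hstrand} $feature{cigar_string}\n";
# prints "1 -1 147M3I27M3I33M"
```

Calling the function a second time restores the original strands and cigar string and sets strands_reversed back to 0.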
transform
Arg 1 : String $coordinate_system_name
Arg [2] : String $coordinate_system_version
Example : $feature = $feature->transform('contig');
$feature = $feature->transform('chromosome', 'NCBI33');
Description: Moves this AlignFeature to the given coordinate system.
If the feature cannot be transformed to the destination
coordinate system undef is returned instead.
Returntype : Bio::EnsEMBL::BaseAlignFeature
Exceptions : thrown on wrong parameters
Caller : general
Status : Medium Risk
_parse_ensembl_cigar
Args : none
Description: PRIVATE (internal) method - creates ungapped features from
internally stored cigar line in ensembl format
Returntype : list of Bio::EnsEMBL::FeaturePair
Exceptions : none
Caller : ungapped_features
Status : Stable
_parse_features
Arg [1] : listref Bio::EnsEMBL::FeaturePair $ungapped_features
Description: creates internal cigar_string and start,end hstart,hend
entries.
Returntype : none, fills in values of self
Exceptions : argument list undergoes many sanity checks - throws under many
invalid conditions
Caller : new
Status : Stable
_hit_unit
Args : none
Description: abstract method; override it in a subclass to return
one or three
Returntype : int (1 or 3)
Exceptions : none
Caller : internal
Status : Stable
_query_unit
Args : none
Description: abstract method; override it in a subclass to return
one or three
Returntype : int (1 or 3)
Exceptions : none
Caller : internal
Status : Stable
_mdtag_alignment_length
Arg [1] : None
Description: return the alignment length (including indels) based on the mdtag (mdz) string
Returntype : int
Exceptions : none
Caller : internal
Status : Stable
_get_mdz_chunks
Arg [1] : mdtag string - MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer: SAM/BAM specification)
Description: parses the mdtag string and groups it into chunks according to their type
eg: MD:Z:35^VIVALE31^GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE10 returns ['35', '^', 'VIVALE', '31', '^', 'GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE', '10']
Returntype : array of strings
Exceptions : none
Caller : internal
Status : Stable
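The chunking shown in the example above can be reproduced with a single regular expression. This is a sketch rather than the method's real body: the function name mdz_chunks is hypothetical, and it assumes the "MD:Z:" prefix is optional and that the remainder follows the pattern quoted in Arg [1].

```perl
use strict;
use warnings;

# Hypothetical helper: split an MD:Z string into runs of digits,
# the '^' deletion marker, and runs of letters.
sub mdz_chunks {
    my ($mdtag) = @_;
    $mdtag =~ s/^MD:Z://;    # drop the SAM tag prefix if present
    my @chunks = $mdtag =~ /(\d+|\^|[A-Z]+)/g;
    return @chunks;
}

print join(',', mdz_chunks('MD:Z:35^VIVALE31^GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE10')), "\n";
# prints "35,^,VIVALE,31,^,GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE,10"
```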
_get_mdz_alignment_length
Arg [1] : array of strings
Description: calculate the alignment length from the given chunks
Returntype : int
Exceptions : none
Caller : internal
Status : Stable
_get_mdz_chunk_type
Arg [1] : char
Description: get the chunk type
Returntype : string
Exceptions : none
Caller : internal
Status : Stable
_mdz_alignment_string
Arg [1] : input sequence
Arg [2] : MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer: SAM/BAM specification)
eg: MD:Z:96^RHKTDSFVGLMGKRALNS0V14
Example : $pf->alignment_strings
Description: Rebuilds the alignment strings of both the seq and hseq sequences
Returntype : array reference containing 2 strings
the first corresponds to seq
the second corresponds to hseq
Exceptions : none
Caller : general
Status : Stable