LICENSE

Copyright [1999-2015] Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute
Copyright [2016-2024] EMBL-European Bioinformatics Institute

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

CONTACT

Please email comments or questions to the public Ensembl
developers list at <http://lists.ensembl.org/mailman/listinfo/dev>.

Questions may also be sent to the Ensembl help desk at
<http://www.ensembl.org/Help/Contact>.

NAME

Bio::EnsEMBL::BaseAlignFeature - Baseclass providing a common abstract implementation for alignment features

SYNOPSIS

  my $feat = Bio::EnsEMBL::DnaPepAlignFeature->new(
    -slice        => $slice,
    -start        => 100,
    -end          => 120,
    -strand       => 1,
    -hseqname     => 'SP:RF1231',
    -hstart       => 200,
    -hend         => 220,
    -analysis     => $analysis,
    -cigar_string => '10M3D5M2I',
    -align_type   => 'ensembl'
  );

where $analysis is a Bio::EnsEMBL::Analysis object.

Alternatively if you have an array of ungapped features:

  my $feat =
    Bio::EnsEMBL::DnaPepAlignFeature->new( -features => \@features );

where @features is an array of Bio::EnsEMBL::FeaturePair objects.

There is a method to (re)create ungapped features from the cigar_string:

  my @ungapped_features = $feat->ungapped_features();

where @ungapped_features is an array of Bio::EnsEMBL::FeaturePair objects.

Bio::EnsEMBL::BaseAlignFeature inherits from:
  Bio::EnsEMBL::FeaturePair, which in turn inherits from:
    Bio::EnsEMBL::Feature,
thus methods from both parent classes are available.


The cigar_string is a condensed representation of the matches and gaps
which make up the gapped alignment (where CIGAR stands for
Concise Idiosyncratic Gapped Alignment Report).

CIGAR format is: n <matches> [ x <deletes or inserts> m <matches> ]*
where M = match, D = delete, I = insert; n, m are match lengths;
x is delete or insert length.

Spaces are omitted, thus: "23M4I12M2D1M"
as are counts for any lengths of 1, thus 1M becomes M: "23M4I12M2DM"
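
The format described above can be read back with a short regular-expression loop. The following is a minimal, standalone sketch and not the module's own parser; the helper name parse_cigar is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Split a compact cigar string into [length, operation] pairs.
# A missing count is read as 1, so "M" is treated as "1M".
# parse_cigar is a hypothetical helper, not part of the Ensembl API.
sub parse_cigar {
    my ($cigar) = @_;
    my @pieces;
    while ( $cigar =~ /\G(\d*)([MDI])/gc ) {
        my $length = $1 eq '' ? 1 : $1;
        push @pieces, [ $length + 0, $2 ];
    }
    # If the scan stopped early, the string held an unexpected character.
    die "Unparsable cigar string '$cigar'\n"
        if ( ( pos($cigar) // 0 ) != length($cigar) );
    return @pieces;
}

my @pieces = parse_cigar('23M4I12M2DM');
print join( ' ', map { "$_->[0]$_->[1]" } @pieces ), "\n";
# prints: 23M 4I 12M 2D 1M
```

The \G anchor with the /c modifier makes the scan stop at the first character that is not part of a valid piece, so malformed input is reported rather than silently skipped.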


To make things clearer this is how a blast HSP would be parsed:

>AK014066
       Length = 146

  Minus Strand HSPs:

  Score = 76 (26.8 bits), Expect = 1.4, P = 0.74
  Identities = 20/71 (28%), Positives = 29/71 (40%), Frame = -1

Query:   479 GLQAPPPTPQGCRLIPPPPLGLQAPLPTLRAVGSSHHHP*GRQGSSLSSFRSSLASKASA 300
             G  APPP PQG R   P P G + P   L             + + ++  R  +A   +
Sbjct:     7 GALAPPPAPQG-RWAFPRPTG-KRPATPLHGTARQDRQVRRSEAAKVTGCRGRVAPHVAP 64

Query:   299 SSPHNPSPLPS 267
                H P+P P+
Sbjct:    65 PLTHTPTPTPT 75

The alignment goes from 479 down to 267 in the query sequence on the reverse
strand, and from 7 to 75 in the subject sequence.

The alignment is made up of the following ungapped pieces:

query_seq start 447 , sbjct_seq hstart  7 , match length  33 , strand -1
query_seq start 417 , sbjct_seq hstart 18 , match length  27 , strand -1
query_seq start 267 , sbjct_seq hstart 27 , match length 147 , strand -1

When assembled into a DnaPepAlignFeature where:
  (seqname, start, end, strand) refer to the query sequence,
  (hseqname, hstart, hend, hstrand) refer to the subject sequence,
these ungapped pieces are represented by the cigar string:
  33M3I27M3I147M
with start 267, end 479, strand -1, and hstart 7, hend 75, hstrand 1.
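
The relationship between the cigar string and the ungapped pieces listed above can be checked with a small standalone walk over the string. This is a sketch under stated assumptions, not the module's implementation: the helper name ungapped_blocks and its argument names are hypothetical, the query is assumed to be on the reverse strand and the hit on the forward strand (as in this HSP), and D lengths are assumed to be converted to hit units in the same way as M lengths.

```perl
use strict;
use warnings;

# List the ungapped blocks encoded by an Ensembl-style cigar string.
# Hypothetical helper; assumes query strand -1 and hit strand +1.
sub ungapped_blocks {
    my (%arg) = @_;
    my $ratio = $arg{query_unit} / $arg{hit_unit};  # 3 bases per residue here
    my $q_pos = $arg{end};       # reverse strand: walk down from the end
    my $h_pos = $arg{hstart};    # forward strand: walk up from hstart
    my @blocks;

    while ( $arg{cigar} =~ /(\d*)([MDI])/g ) {
        my ( $len, $op ) = ( $1 eq '' ? 1 : $1, $2 );
        my $hit_len = $len / $ratio;
        if ( $op eq 'M' ) {
            push @blocks,
              { start => $q_pos - $len + 1, hstart => $h_pos, length => $len };
            $q_pos -= $len;
            $h_pos += $hit_len;
        }
        elsif ( $op eq 'I' ) { $q_pos -= $len }         # extra query bases only
        else                 { $h_pos += $hit_len }     # 'D': extra hit residues only
    }
    return @blocks;
}

printf "query_seq start %d , sbjct_seq hstart %d , match length %d\n",
  @{$_}{qw(start hstart length)}
  for ungapped_blocks(
    cigar      => '33M3I27M3I147M',
    end        => 479,
    hstart     => 7,
    query_unit => 3,
    hit_unit   => 1,
  );
```

Run on the values from this example, the walk yields query starts 447, 417 and 267 with hit starts 7, 18 and 27, matching the three pieces above.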

CAVEATS

AlignFeature cigar strings have the opposite 'sense'
('D' and 'I' swapped) compared with Exonerate cigar strings.

Exonerate modules in Bio::EnsEMBL::Analysis use this convention:

 The longer genomic sequence specified by:
    exonerate:    target
    AlignFeature: (sequence, start, end, strand)

 A shorter sequence (such as EST or protein) specified by:
    exonerate:    query
    AlignFeature: (hsequence, hstart, hend, hstrand)

The resulting AlignFeature cigar strings have 'D' and 'I'
swapped compared with the Exonerate output, i.e.:

  exonerate:    M 123 D 1 M 11 I 1 M 39
  AlignFeature: 123MI11MD39M
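
The conversion shown above amounts to swapping 'D' and 'I', writing the count before the operation, and dropping counts of 1. A minimal sketch follows; the function name exonerate_to_align_feature_cigar is hypothetical and is not a method of the Exonerate modules.

```perl
use strict;
use warnings;

# Convert an exonerate cigar such as "M 123 D 1 M 11 I 1 M 39" into the
# AlignFeature form. Hypothetical helper written for this illustration.
sub exonerate_to_align_feature_cigar {
    my ($exonerate_cigar) = @_;
    my %swap   = ( M => 'M', D => 'I', I => 'D' );
    my @tokens = split ' ', $exonerate_cigar;
    die "Expected operation/length pairs\n" if @tokens % 2;

    my $cigar = '';
    while ( my ( $op, $len ) = splice @tokens, 0, 2 ) {
        die "Unknown operation '$op'\n" unless exists $swap{$op};
        # Counts of 1 are omitted in the AlignFeature convention.
        $cigar .= ( $len == 1 ? '' : $len ) . $swap{$op};
    }
    return $cigar;
}

print exonerate_to_align_feature_cigar('M 123 D 1 M 11 I 1 M 39'), "\n";
# prints: 123MI11MD39M
```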

METHODS

new

Arg [..]   : List of named arguments. -cigar_string, -features and -align_type
             are defined in this constructor; the others are defined in the
             FeaturePair and Feature superclasses.  Either a cigar_string or
             a list of ungapped features should be provided - not both.
Example    : $baf = BaseAlignFeatureSubclass->new(-cigar_string => '3M3I12M', -align_type => 'ensembl');
Description: Creates a new BaseAlignFeature using either a cigar string or
             a list of ungapped features.  BaseAlignFeature is an abstract
             baseclass and should not actually be instantiated - rather its
             subclasses should be.
Returntype : Bio::EnsEMBL::BaseAlignFeature
Exceptions : thrown if both feature and cigar string args are provided
             thrown if neither feature nor cigar string args are provided
             warns if a cigar string is provided without an align_type
Caller     : general
Status     : Stable

cigar_string

Arg [1]    : string $cigar_string
Example    : $feature->cigar_string( "12MI3M" );
Description: get/set for attribute cigar_string.
             cigar_string describes the alignment:
               "xM" stands for x matches (or mismatches),
               "xI" for x inserts into the query sequence,
               "xD" for x deletions from the query sequence
               where the query sequence is specified by (seqname, start, ...)
               and the subject sequence by (hseqname, hstart, ...).
             An "x" that is 1 can be omitted.
             See the SYNOPSIS for an example.
Returntype : string
Exceptions : none
Caller     : general
Status     : Stable

align_type

Arg [1]    : string $align_type
Example    : $feature->align_type( "ensembl" );
Description: get/set for attribute align_type.
             align_type specifies which format ('ensembl' or 'mdtag')
             is used to describe the alignment.
             The default is 'ensembl'.
Returntype : string
Exceptions : none
Caller     : general
Status     : Stable

alignment_length

Arg [1]    : None
Description: returns the alignment length (including indels) based on the align_type ('ensembl', 'mdtag')
Returntype : int
Exceptions : 
Caller     : 
Status     : Stable

_ensembl_cigar_alignment_length

Arg [1]    : None
Description: return the alignment length (including indels) based on the cigar_string
Returntype : int
Exceptions :
Caller     :
Status     : Stable
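
A length "including indels" can be obtained by summing the counts of every piece of the cigar string, whatever its operation. The following standalone sketch illustrates that idea; cigar_alignment_length is a hypothetical name, and the code is not taken from the module.

```perl
use strict;
use warnings;

# Sum the lengths of all M, D and I pieces; a missing count means 1.
# Hypothetical helper for illustration only.
sub cigar_alignment_length {
    my ($cigar) = @_;
    my $total = 0;
    while ( $cigar =~ /(\d*)[MDI]/g ) {
        $total += $1 eq '' ? 1 : $1;
    }
    return $total;
}

print cigar_alignment_length('23M4I12M2DM'), "\n";   # 23 + 4 + 12 + 2 + 1 = 42
```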

ungapped_features

Args       : none
Example    : @ungapped_features = $align_feature->ungapped_features;
Description: converts the internal cigar_string into an array of
             ungapped feature pairs
Returntype : list of Bio::EnsEMBL::FeaturePair
Exceptions : thrown if cigar_string is not set internally
Caller     : general
Status     : Stable

strands_reversed

Arg [1]    : int $strands_reversed
Description: get/set for attribute strands_reversed
             0 means that strand and hstrand are the original strands obtained
               from the alignment program used
             1 means that strand and hstrand have been flipped as compared to
               the original result provided by the alignment program used.
               You may want to use the reverse_complement method to restore the
               original strandness.
Returntype : int
Exceptions : none
Caller     : general
Status     : Stable

reverse_complement

Args       : none
Description: reverse complements the FeaturePair based on the cigar type,
             modifying strand, hstrand and cigar_string accordingly
Returntype : none
Exceptions : none
Caller     : general
Status     : Stable

_ensembl_reverse_complement

Args       : none
Description: reverse complements the FeaturePair for an ensembl cigar string,
             modifying strand, hstrand and cigar_string accordingly
Returntype : none
Exceptions : none
Caller     : general
Status     : Stable
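
As an illustration of what such an operation involves, the sketch below works on a plain hash rather than a feature object. It assumes that reverse complementing negates both strands, writes the cigar pieces in the opposite order, and toggles strands_reversed; the function name reverse_complement_alignment and the hash layout are hypothetical.

```perl
use strict;
use warnings;

# Return a reverse-complemented copy of a minimal alignment record.
# Assumption: cigar pieces are simply emitted in reverse order.
sub reverse_complement_alignment {
    my ($aln) = @_;
    my @pieces = $aln->{cigar_string} =~ /(\d*[MDI])/g;
    my %flipped = (
        %$aln,
        strand           => -$aln->{strand},
        hstrand          => -$aln->{hstrand},
        cigar_string     => join( '', reverse @pieces ),
        strands_reversed => $aln->{strands_reversed} ? 0 : 1,
    );
    return \%flipped;
}

my $flipped = reverse_complement_alignment(
    { strand => -1, hstrand => 1, cigar_string => '33M3I27M3I147M' } );
print "$flipped->{strand} $flipped->{hstrand} $flipped->{cigar_string}\n";
# prints: 1 -1 147M3I27M3I33M
```

Applying the function twice returns the original strands and cigar string, which is why the strands_reversed flag is enough to tell whether a feature is in its original orientation.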

transform

Arg  1     : String $coordinate_system_name
Arg [2]    : String $coordinate_system_version
Example    : $feature = $feature->transform('contig');
             $feature = $feature->transform('chromosome', 'NCBI33');
Description: Moves this AlignFeature to the given coordinate system.
             If the feature cannot be transformed to the destination 
             coordinate system undef is returned instead.
Returntype : Bio::EnsEMBL::BaseAlignFeature
Exceptions : thrown if wrong parameters are given
Caller     : general
Status     : Medium Risk

_parse_ensembl_cigar

Args       : none
Description: PRIVATE (internal) method - creates ungapped features from 
             internally stored cigar line in ensembl format
Returntype : list of Bio::EnsEMBL::FeaturePair
Exceptions : none
Caller     : ungapped_features
Status     : Stable

_parse_features

Arg [1]    : listref Bio::EnsEMBL::FeaturePair $ungapped_features
Description: creates internal cigar_string and start,end hstart,hend
             entries.
Returntype : none, fills in values of self
Exceptions : argument list undergoes many sanity checks - throws under many
             invalid conditions
Caller     : new
Status     : Stable

_hit_unit

Args       : none
Description: abstract method, override with something that returns
             one or three
Returntype : int (1 or 3)
Exceptions : none
Caller     : internal
Status     : Stable

_query_unit

Args       : none
Description: abstract method, override with something that returns
             one or three
Returntype : int (1 or 3)
Exceptions : none
Caller     : internal
Status     : Stable
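
The pattern behind _hit_unit and _query_unit can be sketched with toy classes. All package names below are hypothetical stand-ins and not the real Ensembl classes; the values 3 and 1 mirror a nucleotide query aligned to a protein hit, as in the HSP example in the SYNOPSIS.

```perl
use strict;
use warnings;

# Toy base class: the unit methods are abstract and must be overridden.
package My::AlignFeatureBase;
sub new         { my ($class) = @_; return bless {}, $class }
sub _query_unit { die "Abstract method _query_unit not overridden\n" }
sub _hit_unit   { die "Abstract method _hit_unit not overridden\n" }

# Nucleotide query against protein hit: 3 bases per residue.
package My::DnaPepLike;
our @ISA = ('My::AlignFeatureBase');
sub _query_unit { return 3 }
sub _hit_unit   { return 1 }

# Nucleotide query against nucleotide hit: same unit on both sides.
package My::DnaDnaLike;
our @ISA = ('My::AlignFeatureBase');
sub _query_unit { return 1 }
sub _hit_unit   { return 1 }

package main;

for my $class (qw(My::DnaPepLike My::DnaDnaLike)) {
    my $feature = $class->new;
    printf "%s: query unit %d, hit unit %d\n",
      $class, $feature->_query_unit, $feature->_hit_unit;
}
```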

_mdtag_alignment_length

Arg [1]    : None
Description: return the alignment length (including indels) based on the mdtag (mdz) string
Returntype : int
Exceptions : none
Caller     : internal
Status     : Stable

_get_mdz_chunks

Arg [1]    : mdtag string - MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer:  SAM/BAM specification)
Description: parses the mdtag string and groups it by chunk type
             eg: MD:Z:35^VIVALE31^GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE10 returns ['35', '^', 'VIVALE', '31', '^', 'GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE', '10']
Returntype : array of strings
Exceptions : none
Caller     : internal
Status     : Stable
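
The grouping shown in the example above can be reproduced with a single global match. This is a standalone sketch rather than the module's code; the function name mdz_chunks is hypothetical, and stripping a leading "MD:Z:" prefix is an assumption made so that the example string can be passed in unchanged.

```perl
use strict;
use warnings;

# Split an MD:Z string into number, caret and letter chunks.
# Hypothetical helper; validates against the regex given above.
sub mdz_chunks {
    my ($mdz) = @_;
    $mdz =~ s/^MD:Z://;    # assumption: the SAM tag prefix is optional
    die "Malformed MD string '$mdz'\n"
      unless $mdz =~ /^[0-9]+(?:(?:[A-Z]|\^[A-Z]+)[0-9]+)*$/;
    my @chunks = $mdz =~ /([0-9]+|\^|[A-Z]+)/g;
    return \@chunks;
}

my $chunks = mdz_chunks('MD:Z:35^VIVALE31^GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE10');
print join( ' ', @$chunks ), "\n";
# prints: 35 ^ VIVALE 31 ^ GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE 10
```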

_get_mdz_alignment_length

Arg [1]    : array of strings
Description: calculate the alignment length from the given chunks
Returntype : int
Exceptions : none
Caller     : internal
Status     : Stable
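
One plausible way to turn such chunks into a length is sketched below. It assumes that a numeric chunk contributes its value (a run of matches), that a letter chunk contributes one position per letter (a mismatch, or deleted residues when it follows '^'), and that '^' itself contributes nothing. The function name mdz_alignment_length is hypothetical and the module may count differently.

```perl
use strict;
use warnings;

# Add up the positions described by a list of MD:Z chunks.
# Hypothetical helper; the counting rules are assumptions stated above.
sub mdz_alignment_length {
    my ($chunks) = @_;
    my $total = 0;
    for my $chunk (@$chunks) {
        if    ( $chunk =~ /^[0-9]+$/ ) { $total += $chunk }           # run of matches
        elsif ( $chunk =~ /^[A-Z]+$/ ) { $total += length($chunk) }   # mismatch or deletion
        # '^' only marks the start of a deletion and adds no length
    }
    return $total;
}

my @chunks = ( '35', '^', 'VIVALE', '31', '^', 'GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE', '10' );
print mdz_alignment_length( \@chunks ), "\n";
```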

_get_mdz_chunk_type

Arg [1]    : char
Description: get the chunk type
Returntype : string
Exceptions : none
Caller     : internal
Status     : Stable

_mdz_alignment_string

Arg [1]    : input sequence
Arg [2]    : MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer:  SAM/BAM specification)
             eg: MD:Z:96^RHKTDSFVGLMGKRALNS0V14
Example    : $pf->alignment_strings
Description: Rebuilds the alignment strings of both the seq and hseq sequences
Returntype : array reference containing 2 strings
             the first corresponds to seq
             the second corresponds to hseq
Exceptions : none
Caller     : general
Status     : Stable