NAME
Bio::MUST::Core::Ali - Multiple sequence alignment
VERSION
version 0.242020
SYNOPSIS
#!/usr/bin/env perl
use Modern::Perl '2011';
# same as:
# use strict;
# use warnings;
# use feature qw(say);
use Bio::MUST::Core;
use aliased 'Bio::MUST::Core::Ali';
# read Ali form disk
my $ali = Ali->load('example.ali');
# get some properties
say 'height: ' . $ali->height; # number of seqs
say 'width: ' . $ali->width; # number of sites
say '% miss: ' . $ali->perc_miss; # fraction of missing chars (%)
say 'seqs are ' . $ali->is_protein ? 'prot' : 'nucl';
# turn seqs to uppercase
$ali->uc_seqs;
# filter out seqs with no species associated
my @seqs = $ali->filter_seqs( sub { not $_->is_genus_only } );
use aliased 'Bio::MUST::Core::IdList';
my $list = IdList->new( ids => \@seqs );
$ali->apply_list($list);
# alternatively:
# $ali = Ali->new( seqs => \@seqs );
# filter out gap-rich sites
$ali->idealize(0.2); # min 20% non-gaps per site
# filter out non-informative sites
use aliased 'Bio::MUST::Core::SeqMask';
my $mask = SeqMask->parsimony_mask($ali);
$ali->apply_mask($mask);
# write down reduced Ali to disk
$ali->store('example-uc-genus-sp-20.ali');
$ali->store_fasta('example-uc-genus-sp-20.fasta');
# see below for additional methods
# ...
DESCRIPTION
This module implements the multiple sequence alignment (MSA) class and its methods. An Ali is modeled as an array of Bio::MUST::Core::Seq objects. Consequently, sequence ids do not absolutely need to be unique for it to work (though id uniqueness helps a lot for sequence access and filtering).
An Ali knows whether it contains nucleotide or protein sequences, whether those are aligned or not, as well as its Seq count and exact width. All these properties are introspected on the fly, which means that, while they can be expensive to compute, they are always accurate.
This is important because an Ali provides methods for inserting, deleting and editing its Seq objects. Further, it does its best to maintain a semantic distinction between true gap states (encoded as '*') and missing regions of undetermined length (encoded as pure whitespace, for example at the end of a short sequence).
It also has methods for mapping its sequence ids (for example, before export to the PHYLIP format), as well as methods to selectively retain sequences and sites based on Bio::MUST::Core::IdList and Bio::MUST::Core::SeqMask objects. For example, the idealize
method discards shared gaps and optionally removes gap-rich sites only due to a tiny number of sequences.
Finally, an Ali can be stored in MUST pseudo-FASTA format (which handles meaningful whitespace and allows comment lines in the header) or be imported/exported from/to several popular MSA formats, such as plain FASTA, STOCKHOLM and PHYLIP.
ATTRIBUTES
seqs
ArrayRef of Bio::MUST::Core::Seq objects (optional)
Most of the accessor methods described below are implemented by delegation to this public attribute using "Moose native delegation" in Moose::Meta::Attribute::Native. Their documentation thus heavily borrows from the corresponding help pages.
file
Path::Class::File object (optional)
This optional attribute is initialized by class methods that load
an Ali from disk. It is meant to improve the introspection capabilities of the Ali. For now, this attribute is not used by the store
methods, though it might provide them with a default value in the future.
guessing
Boolean (optional)
By default, an Ali object tries to guess whether it is aligned or not by looking for gap-like characters in any of its Seq objects (see Bio::MUST::Core::Seq for the exact test performed on each sequence).
When this smart behavior causes issues, one can disable it by unsetting this boolean attribute (see dont_guess
and guess
accessor methods).
comments
ArrayRef of strings (optional)
An Ali object is commentable, which means that it supports all the methods pertaining to comment lines described in Bio::MUST::Core::Roles::Commentable (such as header
).
CONSTRUCTORS
new
Default constructor (class method) returning a new Ali.
use aliased 'Bio::MUST::Core::Ali';
my $ali1 = Ali->new();
my @seqs = $ali->all_seqs;
my $ali2 = Ali->new( seqs => \@seqs );
This method accepts four optional arguments (see ATTRIBUTES above): seqs
, file
, guessing
and comments
.
clone
Creates a deep copy (a clone) of the Ali. Returns the copy.
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input.ali');
my $ali_copy = $ali->clone;
# you can now mess with $ali_copy without affecting $ali
This method does not accept any arguments.
ACCESSORS
add_seq
Adds one (or more) new sequence(s) at the end of the Ali. Returns the new number of sequences of the Ali.
use aliased 'Bio::MUST::Core::Seq';
my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' );
$ali->add_seq($new_seq);
This method accepts any number of arguments.
get_seq
Returns a sequence of the Ali by its index. You can also use negative index numbers, just as with Perl's core array handling. If the specified sequence does not exist, this method will return undef
.
my $seq = $ali->get_seq($index);
croak "Seq $index not found in Ali!" unless defined $seq;
This method accepts just one argument (and not an array slice).
get_seq_with_id
Returns a sequence of the Ali by its id. If the specified id is not unique, only the first matching sequence will be returned, whereas if no sequence exists for the specified id, this method will return undef
.
my $id = 'Pyrus malus_3750@658052655';
my $seq = $ali->get_seq_with_id($id);
croak "Seq $id not found in Ali!" unless defined $seq;
This method accepts just one argument.
set_seq
Given an index and a sequence, sets the specified Ali element to the sequence. This method returns the new sequence at $index
.
use aliased 'Bio::MUST::Core::Seq';
my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' );
$ali->set_seq($index, $new_seq);
This method requires two arguments.
delete_seq
Removes the sequence at the given index from the Ali. This method returns the deleted sequence. If the specified sequence does not exist, it will return undef
instead.
$ali->delete_seq($index);
This method requires one argument.
insert_seq
Inserts a new sequence into the Ali at the given index. This method returns the new sequence at $index
.
use aliased 'Bio::MUST::Core::Seq';
my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' );
$ali->insert_seq($index, $new_seq);
This method requires two arguments.
all_seqs
Returns all the sequences of the Ali (not an array reference).
my @seqs = $ali->all_seqs;
This method does not accept any arguments.
first_seq
Returns the first sequence of the Ali matching a given criterion, just like List::Util's first
function. This method requires a subroutine implementing the matching logic.
# emulate get_seq_with_id method
my $id2find = 'seq13';
my $seq = $ali->first_seq( sub { $_->full_id eq $id2find } );
This method requires a single argument.
filter_seqs
Returns every sequence of the Ali matching a given criterion, just like Perl's core grep
function. This method requires a subroutine implementing the matching logic.
# keep only long sequences (ignoring gaps and missing states)
my @long_seqs = $ali->filter_seqs( sub { $_->nomiss_seq_len > 500 } );
This method requires a single argument.
all_new_seqs
Returns all the sequences of the Ali tagged as #NEW# (not an array reference).
my @new_seqs = $ali->all_new_seqs;
This method does not accept any arguments.
all_but_new_seqs
Returns all the sequences of the Ali except those tagged as #NEW# (not an array reference).
my @preexisting_seqs = $ali->all_but_new_seqs;
This method does not accept any arguments.
all_seq_ids
Returns all the sequence ids (Bio::MUST::Core::SeqId objects) of the Ali (not an array reference). This is only a convenience method.
use Test::Deeply;
my @ids1 = $ali->all_seq_ids;
my @ids2 = map { $_->seq_id } $ali->all_seqs;
is_deeply \@ids1, \@ids2, 'should be true';
my @orgs = map { $_->org } @ids1;
This method does not accept any arguments.
filename
Returns the stringified filename of the Ali.
This method does not accept any arguments.
guess
Turn on the smart detection of gaps (see guessing
attribute above).
This method does not accept any arguments.
dont_guess
Turn off the smart detection of gaps (see guessing
attribute above).
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('ensembl.fasta');
$ali->dont_guess;
This method does not accept any arguments.
PROPERTIES
has_uniq_ids
Returns true if all the sequence ids are unique.
carp 'Warning: duplicate sequence ids!' unless $ali->has_uniq_ids;
This method does not accept any arguments.
is_protein
Returns true if any sequence of the Ali looks like a protein. See Bio::MUST::Core::Seq for the exact test performed on each sequence.
say 'Your file includes nucleotide sequences' unless $ali->is_protein;
This method does not accept any arguments.
is_aligned
Returns true if any sequence of the Ali appears to be aligned. See Bio::MUST::Core::Seq for the exact test performed on each sequence.
If the boolean attribute guessing is not set, always returns false.
carp 'Warning: file does not look aligned!' unless $ali->is_aligned;
This method does not accept any arguments.
count_seqs
Returns the number of sequences of the Ali. The alias method height
is provided for convenience.
my $height = $ali->count_seqs;
This method does not accept any arguments.
width
Returns the width of the Ali (in characters). If the Ali is not aligned, returns the length of the longest sequence instead.
To avoid potential bugs due to caching, this method dynamically computes the Ali width at each call. Moreover, the Ali is always uniformized (see below) beforehands to ensure accurate width value. Therefore, this method is expensive and should not be called repeatedly (e.g., in a loop condition).
# you'd better looping through sites like this...
my $width = $ali->width;
for my $site (0..$width-1) {
...
}
This method does not accept any arguments.
seq_len_stats
Returns a list of 5 values summarizing the Ali seq lengths (ignoring gaps). The values are the following: Q0 (min), Q1, Q2 (median), Q3, and Q4 (max).
This method does not accept any arguments.
perc_miss
Returns the percentage of missing (and gap-like) character states in the Ali.
As this method internally calls Ali::width
, the remarks above also apply.
my $miss_level = $ali->perc_miss;
This method does not accept any arguments.
MUTATORS
uc_seqs
Turn all the sequences of the Ali to uppercase and returns it.
This method does not accept any arguments.
recode_seqs
Recode all the sequences of the Ali and returns it.
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('biased.ali');
# set up RY recoding for suppressing codon bias
my %base_for = (
A => 'A', G => 'A', # purines
C => 'C', T => 'C', # pyrimidines
);
my $ali_rec = $ali->recode_seqs( \%base_for );
$ali_rec->store('biased_ry.ali');
This method requires one argument.
degap_seqs
Remove the gaps in all the sequences of the Ali and returns it.
This method does not accept any arguments.
spacify_seqs
Spacifies all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.
This method does not accept any arguments.
gapify_seqs
Gapifies all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.
This method accepts an optional argument.
trim_seqs
Trims all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.
This method does not accept any arguments.
pad_seqs
Pads all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.
This method does not accept any arguments.
uniformize
Performs the three gap-cleaning operations in turn on all the sequences of the Ali and returns it, which ensures that it is semantically clean and rectangular.
This is only a convenience method called internally by the Ali object before selected methods (such as storage-like methods). However, it might prove useful in some circumstances, hence it is not defined as private.
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input.ali');
$ali->add_seq(...);
# more editing of the Ali sequences
$ali->uniformize;
This method does not accept any arguments.
clear_new_tags
Clear the #NEW# tag (if any) from all the sequences of the Ali and returns it.
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input-42.ali');
$ali->clear_new_tags;
my @new_seqs = $ali->all_new_seqs;
# array should be empty
This method does not accept any arguments.
shorten_ids
Replaces all the sequence ids of the Ali by their abbreviated forms as specified by the passed Bio::MUST::Core::IdMapper and returns the Ali.
Note that this method will work only if the sequence ids have not been already shortened or modified in any way since the creation of the IdMapper. Long ids without abbreviated forms in the IdMapper are left untouched.
use aliased 'Bio::MUST::Core::Ali';
use aliased 'Bio::MUST::Core::IdMapper';
my $ali = Ali->load('input.ali');
my $mapper = IdMapper->std_mapper( $ali, { id_prefix => 'lcl|seq' } );
$ali->shorten_ids($mapper);
$ali->store_fasta('input.4blast.fasta');
# makeblastdb
# Note: the temp_fasta method does exactly that
This method requires one argument.
restore_ids
Replaces all the sequence ids of the Ali by their long forms as specified by the passed Bio::MUST::Core::IdMapper and returns the Ali.
Note that this method will work only if the sequence ids have been previously abbreviated (see above) and have not been modified in any way since then. Again, abbreviated ids without long forms in the IdMapper are left untouched.
use aliased 'Bio::MUST::Core::IdMapper';
my $mapper = IdMapper->gi_mapper($ali);
$ali->shorten_ids($mapper);
...
$ali->restore_ids($mapper);
This method requires one argument.
apply_list
Selectively retains or discards sequences from the Ali based on the content of the passed Bio::MUST::Core::IdList object and returns the Ali.
use aliased 'Bio::MUST::Core::IdList';
my $list = IdList->load('interesting_seqs.idl');
$ali->apply_list($list); # discard non-interesting seqs
This method requires one argument.
apply_mask
Selectively retains or discards sites from the Ali based on the content of the passed Bio::MUST::Core::SeqMask object and returns the Ali.
use aliased 'Bio::MUST::Core::SeqMask';
my $variable_sites = SeqMask->variable_mask($ali);
$ali->apply_mask($variable_sites); # discard constant sites
This method requires one argument.
idealize
Computes and applies an ideal sequence mask to the Ali and returns it. This is only a convenience method.
When invoked without arguments, it will discard the gaps that are universally shared by all the sequences. Otherwise, the provided argument corresponds to the threshold of the ideal_mask
method described in Bio::MUST::Core::SeqMask.
use aliased 'Bio::MUST::Core::IdList';
my $fast_seqs = IdList->load('fast_evolving_seqs.idl');
my $seqs2keep = $fast_seqs->negative_list($ali);
$ali->apply_list($seqs2keep); # discard fast-evolving seqs
$ali->idealize; # discard newly shared gaps caused by fast seqs
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('hmm_based.ali');
$ali->idealize(0.05); # discard insertions due to <5% of the seqs
This method accepts an optional argument.
MISC METHODS
gapmiss_regex
Returns a regular expression matching gaps and ambiguous or missing states. The exact regex returned depends on the type of sequences in the Ali (nucl. or proteins).
my $regex = $ali->gapmiss_regex;
my $first_seq = $ali->get_seq(0)->seq;
my $gapmiss_n = $first_seq =~ m/($regex)/xmsg;
say "The first sequence has $gapmiss_n gaps or ambiguous/missing sites";
This method does not accept any arguments.
map_coords
Converts a set of site positions from Ali coordinates to coordinates of the specified sequence (thereby ignoring positions due to gaps). Returns the converted sites in sequence coordinates as an array refrence.
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input.ali');
my $id = 'GIV-Norovirus Hum.GIV.1.POL_1338688@508124125';
my $ali_coords = [ 4, 25, 73, 89, 104, 116 ];
my $seq_coords = $ali->map_coords($id, $ali_coords);
# $seq_coords is [ 3, 23, 59, 71, 71, 74 ]
This method requires two arguments: the id of a sequence and an array reference of input sites in Ali coordinates.
I/O METHODS
load
Class method (constructor) returning a new Ali read from disk. This method will transparently import plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).
use Test::Deeply;
use aliased 'Bio::MUST::Core::Ali';
my $ali1 = Ali->load('example.ali');
my $ali2 = Ali->load('example.fasta');
my @seqs1 = $ali1->all_seqs;
my @seqs2 = $ali2->all_seqs;
is_deeply, \@seqs1, \@seqs2, 'should be true';
This method requires one argument.
store
Writes the Ali to disk in the MUST pseudo-FASTA format (ALI files).
Note that the ALI format is only used when the suffix of the outfile name is '.ali'. In all other cases (including lack of suffix), this method automatically forwards the call to store_fasta
.
$ali->store('output.ali');
# output.ali is written in ALI format
$ali->store('output.fasta');
# output.fasta is written in FASTA format
This method requires one argument (but see store_fasta
in case of automatic forwarding of the method call).
store_fasta
Writes the Ali to disk in the plain FASTA format.
For compatibility purposes, this method automatically fetches sequence ids using the foreign_id
method instead of the native full_id
method, both described in Bio::MUST::Core::SeqId.
$ali->store_fasta( 'output.fasta' );
$ali->store_fasta( 'output.fasta', {chunk => -1, degap => 1} );
This method requires one argument and accepts a second optional argument controlling the output format. It is a hash reference that may contain one or more of the following keys:
- clean: replace all ambiguous and missing states by C<X> (default: false)
- degap: boolean value controlling degapping (default: false)
- chunk: line width (default is 60 chars; negative values means no wrap)
Finally, it is possible to fine-tune the behavior of the clean
option by providing another character than X
through the gapify
key. This can be useful to replace all ambiguous and missing states by gaps, as shown below:
$ali->store_fasta( 'output.fasta, { clean => 1, gapify => '*' } );
temp_fasta
Writes a temporary copy of the Ali to disk in the plain FASTA format using numeric sequence ids and returns the name of the temporary file. This is only a convenience method.
In list context, returns the IdMapper object along with temporary filename.
my $infile = $ali->temp_fasta( { degap => 1 } );
my $output = `script.sh $infile`;
...
This method accepts the same optional argument hash as store_fasta
. However, an additional option (id_prefix
) is available to control the way abbreviated sequence ids are prefixed by the std_mapper
method (see Bio::MUST::Core::Listable).
my $infile1 = $ali1->temp_fasta( { id_prefix => 'file1-' } );
my $infile2 = $ali2->temp_fasta( { id_prefix => 'file2-' } );
Finally, it is possible to pass an existing Bio::MUST::Core::IdMapper object by providing the mapper
option.
load_phylip
Class method (constructor) returning a new Ali read from a file in the PHYLIP format. Both sequential and interleaved formats are supported.
The only constraint is that sequence ids MUST NOT contain whitespace and be followed by at least one whitespace character. This means that some old-school PHYLIP files not following this convention will not be processed correctly.
When using the interleaved format, sequence ids may or may not be repeated in each block.
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load_phylip('phylip.phy');
say $ali->count_seqs;
say $ali->width;
# outputs:
# 10
# 709
This method requires one argument.
store_phylip
Writes the Ali to disk in the interleaved (or sequential) PHYLIP format.
To ensure maximal flexibility, this method fetches sequence ids using the native full_id
method described in Bio::MUST::Core::SeqId, but truncates them to 10 characters, as expected by the original PHYLIP software package. No other tinkering is carried out on the ids. Thus, if the ids contain whitespace or are not unique in their 10 first characters, it is advised to first map them using one of the constructors in Bio::MUST::Core::IdMapper.
use aliased 'Bio::MUST::Core::Ali';
use aliased 'Bio::MUST::Core::IdMapper';
my $ali = Ali->load('input.ali');
my $mapper = IdMapper->std_mapper($ali);
$ali->shorten_ids($mapper);
$ali->store_phylip( 'input.phy', { chunk => 50 } );
This method requires one argument and accepts a second optional argument controlling the output format. It is a hash reference that may contain one or more of the following keys:
- short: truncate ids to 10 chars, as in original PHYLIP (defaut: yes)
- clean: replace all ambiguous and missing states by 'X' (default: false)
- chunk: line width (default: 60 chars; negative values means no wrap)
To store the Ali in PHYLIP sequential format, specify a negative chunk (-1).
load_stockholm
Class method (constructor) returning a new Ali read from a file in the STOCKHOLM format. =GF comments are retained (see above) but not the other comment classes (=GS, =GR and =GC).
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('upsk.stockholm');
say $ali->header;
# outputs:
# ID UPSK
# SE Predicted; Infernal
# SS Published; PMID 9223489
# RN [1]
# RM 9223489
# RT The role of the pseudoknot at the 3' end of turnip yellow mosaic
# RT virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
# RT polymerase.
# RA Deiman BA, Kortlever RM, Pleij CW;
# RL J Virol 1997;71:5990-5996.
This method requires one argument.
instant_store
Class method intended to transform a large sequence file read from disk without loading it in memory. This method will transparently process plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).
my $chunk = 200;
my $split = sub {
my $seq = shift;
my $base_id = ( split /\s+/xms, $seq->full_id )[0];
my $max_pos = $seq->seq_len - $chunk;
my $n = 0;
my $out_str;
for (my $pos = 0; $pos <= $max_pos; $pos += $chunk, $n++) {
$out_str .= ">$base_id.$n\n" . $seq->edit_seq($pos,
$pos + $chunk <= $max_pos ? $chunk : 2 * $chunk
) . "\n";
}
return $out_str;
};
use aliased 'Bio::MUST::Core::Ali';
Ali->instant_store(
'outfile.fasta', { infile => 'infile.fasta', coderef => $split }
);
This method requires two arguments. The sercond is a hash reference that must contain the following keys: - infile: input sequence file - coderef: subroutine implementing the transforming logic
instant_count
Class method returning the number of seqs in any sequence file read from disk without loading it in memory. This method will transparently process plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).
use aliased 'Bio::MUST::Core::Ali';
my $seq_n = Ali->instant_count('input.ali');
say $seq_n;
ALIASES
height
Alias for count_seqs
method. For API consistency.
AUTHOR
Denis BAURAIN <denis.baurain@uliege.be>
CONTRIBUTORS
Catherine COLSON <ccolson@doct.uliege.be>
Arnaud DI FRANCO <arnaud.difranco@gmail.com>
COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.