NAME

Bio::MUST::Core::Ali - Multiple sequence alignment

VERSION

version 0.242020

SYNOPSIS

#!/usr/bin/env perl

use Modern::Perl '2011';
# same as:
# use strict;
# use warnings;
# use feature qw(say);

use Bio::MUST::Core;
use aliased 'Bio::MUST::Core::Ali';

# read Ali form disk
my $ali = Ali->load('example.ali');

# get some properties
say 'height:  ' . $ali->height;         # number of seqs
say 'width:   ' . $ali->width;          # number of sites
say '% miss:  ' . $ali->perc_miss;      # fraction of missing chars (%)
say 'seqs are ' . $ali->is_protein ? 'prot' : 'nucl';

# turn seqs to uppercase
$ali->uc_seqs;

# filter out seqs with no species associated
my @seqs = $ali->filter_seqs( sub { not $_->is_genus_only } );
use aliased 'Bio::MUST::Core::IdList';
my $list = IdList->new( ids => \@seqs );
$ali->apply_list($list);

# alternatively:
# $ali = Ali->new( seqs => \@seqs );

# filter out gap-rich sites
$ali->idealize(0.2);                    # min 20% non-gaps per site

# filter out non-informative sites
use aliased 'Bio::MUST::Core::SeqMask';
my $mask = SeqMask->parsimony_mask($ali);
$ali->apply_mask($mask);

# write down reduced Ali to disk
$ali->store('example-uc-genus-sp-20.ali');
$ali->store_fasta('example-uc-genus-sp-20.fasta');

# see below for additional methods
# ...

DESCRIPTION

This module implements the multiple sequence alignment (MSA) class and its methods. An Ali is modeled as an array of Bio::MUST::Core::Seq objects. Consequently, sequence ids do not absolutely need to be unique for it to work (though id uniqueness helps a lot for sequence access and filtering).

An Ali knows whether it contains nucleotide or protein sequences, whether those are aligned or not, as well as its Seq count and exact width. All these properties are introspected on the fly, which means that, while they can be expensive to compute, they are always accurate.

This is important because an Ali provides methods for inserting, deleting and editing its Seq objects. Further, it does its best to maintain a semantic distinction between true gap states (encoded as '*') and missing regions of undetermined length (encoded as pure whitespace, for example at the end of a short sequence).

It also has methods for mapping its sequence ids (for example, before export to the PHYLIP format), as well as methods to selectively retain sequences and sites based on Bio::MUST::Core::IdList and Bio::MUST::Core::SeqMask objects. For example, the idealize method discards shared gaps and optionally removes gap-rich sites only due to a tiny number of sequences.

Finally, an Ali can be stored in MUST pseudo-FASTA format (which handles meaningful whitespace and allows comment lines in the header) or be imported/exported from/to several popular MSA formats, such as plain FASTA, STOCKHOLM and PHYLIP.

ATTRIBUTES

seqs

ArrayRef of Bio::MUST::Core::Seq objects (optional)

Most of the accessor methods described below are implemented by delegation to this public attribute using "Moose native delegation" in Moose::Meta::Attribute::Native. Their documentation thus heavily borrows from the corresponding help pages.

file

Path::Class::File object (optional)

This optional attribute is initialized by class methods that load an Ali from disk. It is meant to improve the introspection capabilities of the Ali. For now, this attribute is not used by the store methods, though it might provide them with a default value in the future.

guessing

Boolean (optional)

By default, an Ali object tries to guess whether it is aligned or not by looking for gap-like characters in any of its Seq objects (see Bio::MUST::Core::Seq for the exact test performed on each sequence).

When this smart behavior causes issues, one can disable it by unsetting this boolean attribute (see dont_guess and guess accessor methods).

comments

ArrayRef of strings (optional)

An Ali object is commentable, which means that it supports all the methods pertaining to comment lines described in Bio::MUST::Core::Roles::Commentable (such as header).

CONSTRUCTORS

new

Default constructor (class method) returning a new Ali.

use aliased 'Bio::MUST::Core::Ali';
my $ali1 = Ali->new();
my @seqs = $ali->all_seqs;
my $ali2 = Ali->new( seqs => \@seqs );

This method accepts four optional arguments (see ATTRIBUTES above): seqs, file, guessing and comments.

clone

Creates a deep copy (a clone) of the Ali. Returns the copy.

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input.ali');
my $ali_copy = $ali->clone;
# you can now mess with $ali_copy without affecting $ali

This method does not accept any arguments.

ACCESSORS

add_seq

Adds one (or more) new sequence(s) at the end of the Ali. Returns the new number of sequences of the Ali.

use aliased 'Bio::MUST::Core::Seq';
my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' );
$ali->add_seq($new_seq);

This method accepts any number of arguments.

get_seq

Returns a sequence of the Ali by its index. You can also use negative index numbers, just as with Perl's core array handling. If the specified sequence does not exist, this method will return undef.

my $seq = $ali->get_seq($index);
croak "Seq $index not found in Ali!" unless defined $seq;

This method accepts just one argument (and not an array slice).

get_seq_with_id

Returns a sequence of the Ali by its id. If the specified id is not unique, only the first matching sequence will be returned, whereas if no sequence exists for the specified id, this method will return undef.

my $id = 'Pyrus malus_3750@658052655';
my $seq = $ali->get_seq_with_id($id);
croak "Seq $id not found in Ali!" unless defined $seq;

This method accepts just one argument.

set_seq

Given an index and a sequence, sets the specified Ali element to the sequence. This method returns the new sequence at $index.

use aliased 'Bio::MUST::Core::Seq';
my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' );
$ali->set_seq($index, $new_seq);

This method requires two arguments.

delete_seq

Removes the sequence at the given index from the Ali. This method returns the deleted sequence. If the specified sequence does not exist, it will return undef instead.

$ali->delete_seq($index);

This method requires one argument.

insert_seq

Inserts a new sequence into the Ali at the given index. This method returns the new sequence at $index.

use aliased 'Bio::MUST::Core::Seq';
my $new_seq = Seq->new( seq_id => 'seq1', seq => 'ACGT' );
$ali->insert_seq($index, $new_seq);

This method requires two arguments.

all_seqs

Returns all the sequences of the Ali (not an array reference).

my @seqs = $ali->all_seqs;

This method does not accept any arguments.

first_seq

Returns the first sequence of the Ali matching a given criterion, just like List::Util's first function. This method requires a subroutine implementing the matching logic.

# emulate get_seq_with_id method
my $id2find = 'seq13';
my $seq = $ali->first_seq( sub { $_->full_id eq $id2find } );

This method requires a single argument.

filter_seqs

Returns every sequence of the Ali matching a given criterion, just like Perl's core grep function. This method requires a subroutine implementing the matching logic.

# keep only long sequences (ignoring gaps and missing states)
my @long_seqs = $ali->filter_seqs( sub { $_->nomiss_seq_len > 500 } );

This method requires a single argument.

all_new_seqs

Returns all the sequences of the Ali tagged as #NEW# (not an array reference).

my @new_seqs = $ali->all_new_seqs;

This method does not accept any arguments.

all_but_new_seqs

Returns all the sequences of the Ali except those tagged as #NEW# (not an array reference).

my @preexisting_seqs = $ali->all_but_new_seqs;

This method does not accept any arguments.

all_seq_ids

Returns all the sequence ids (Bio::MUST::Core::SeqId objects) of the Ali (not an array reference). This is only a convenience method.

use Test::Deeply;
my @ids1 = $ali->all_seq_ids;
my @ids2 = map { $_->seq_id } $ali->all_seqs;
is_deeply \@ids1, \@ids2, 'should be true';
my @orgs = map { $_->org } @ids1;

This method does not accept any arguments.

filename

Returns the stringified filename of the Ali.

This method does not accept any arguments.

guess

Turn on the smart detection of gaps (see guessing attribute above).

This method does not accept any arguments.

dont_guess

Turn off the smart detection of gaps (see guessing attribute above).

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('ensembl.fasta');
$ali->dont_guess;

This method does not accept any arguments.

PROPERTIES

has_uniq_ids

Returns true if all the sequence ids are unique.

carp 'Warning: duplicate sequence ids!' unless $ali->has_uniq_ids;

This method does not accept any arguments.

is_protein

Returns true if any sequence of the Ali looks like a protein. See Bio::MUST::Core::Seq for the exact test performed on each sequence.

say 'Your file includes nucleotide sequences' unless $ali->is_protein;

This method does not accept any arguments.

is_aligned

Returns true if any sequence of the Ali appears to be aligned. See Bio::MUST::Core::Seq for the exact test performed on each sequence.

If the boolean attribute guessing is not set, always returns false.

carp 'Warning: file does not look aligned!' unless $ali->is_aligned;

This method does not accept any arguments.

count_seqs

Returns the number of sequences of the Ali. The alias method height is provided for convenience.

my $height = $ali->count_seqs;

This method does not accept any arguments.

width

Returns the width of the Ali (in characters). If the Ali is not aligned, returns the length of the longest sequence instead.

To avoid potential bugs due to caching, this method dynamically computes the Ali width at each call. Moreover, the Ali is always uniformized (see below) beforehands to ensure accurate width value. Therefore, this method is expensive and should not be called repeatedly (e.g., in a loop condition).

# you'd better looping through sites like this...
my $width = $ali->width;
for my $site (0..$width-1) {
    ...
}

This method does not accept any arguments.

seq_len_stats

Returns a list of 5 values summarizing the Ali seq lengths (ignoring gaps). The values are the following: Q0 (min), Q1, Q2 (median), Q3, and Q4 (max).

This method does not accept any arguments.

perc_miss

Returns the percentage of missing (and gap-like) character states in the Ali.

As this method internally calls Ali::width, the remarks above also apply.

my $miss_level = $ali->perc_miss;

This method does not accept any arguments.

MUTATORS

uc_seqs

Turn all the sequences of the Ali to uppercase and returns it.

This method does not accept any arguments.

recode_seqs

Recode all the sequences of the Ali and returns it.

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('biased.ali');

# set up RY recoding for suppressing codon bias
my %base_for = (
    A => 'A',   G => 'A',       # purines
    C => 'C',   T => 'C',       # pyrimidines
);

my $ali_rec = $ali->recode_seqs( \%base_for );
$ali_rec->store('biased_ry.ali');

This method requires one argument.

degap_seqs

Remove the gaps in all the sequences of the Ali and returns it.

This method does not accept any arguments.

spacify_seqs

Spacifies all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.

This method does not accept any arguments.

gapify_seqs

Gapifies all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.

This method accepts an optional argument.

trim_seqs

Trims all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.

This method does not accept any arguments.

pad_seqs

Pads all the sequences of the Ali and returns it. See the corresponding method in Bio::MUST::Core::Seq for the exact effect of this gap-cleaning operation.

This method does not accept any arguments.

uniformize

Performs the three gap-cleaning operations in turn on all the sequences of the Ali and returns it, which ensures that it is semantically clean and rectangular.

This is only a convenience method called internally by the Ali object before selected methods (such as storage-like methods). However, it might prove useful in some circumstances, hence it is not defined as private.

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input.ali');
$ali->add_seq(...);
# more editing of the Ali sequences
$ali->uniformize;

This method does not accept any arguments.

clear_new_tags

Clear the #NEW# tag (if any) from all the sequences of the Ali and returns it.

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input-42.ali');
$ali->clear_new_tags;
my @new_seqs = $ali->all_new_seqs;
# array should be empty

This method does not accept any arguments.

shorten_ids

Replaces all the sequence ids of the Ali by their abbreviated forms as specified by the passed Bio::MUST::Core::IdMapper and returns the Ali.

Note that this method will work only if the sequence ids have not been already shortened or modified in any way since the creation of the IdMapper. Long ids without abbreviated forms in the IdMapper are left untouched.

use aliased 'Bio::MUST::Core::Ali';
use aliased 'Bio::MUST::Core::IdMapper';
my $ali = Ali->load('input.ali');
my $mapper = IdMapper->std_mapper( $ali, { id_prefix => 'lcl|seq' } );
$ali->shorten_ids($mapper);
$ali->store_fasta('input.4blast.fasta');
# makeblastdb

# Note: the temp_fasta method does exactly that

This method requires one argument.

restore_ids

Replaces all the sequence ids of the Ali by their long forms as specified by the passed Bio::MUST::Core::IdMapper and returns the Ali.

Note that this method will work only if the sequence ids have been previously abbreviated (see above) and have not been modified in any way since then. Again, abbreviated ids without long forms in the IdMapper are left untouched.

use aliased 'Bio::MUST::Core::IdMapper';
my $mapper = IdMapper->gi_mapper($ali);
$ali->shorten_ids($mapper);
...
$ali->restore_ids($mapper);

This method requires one argument.

apply_list

Selectively retains or discards sequences from the Ali based on the content of the passed Bio::MUST::Core::IdList object and returns the Ali.

use aliased 'Bio::MUST::Core::IdList';
my $list = IdList->load('interesting_seqs.idl');
$ali->apply_list($list);                # discard non-interesting seqs

This method requires one argument.

apply_mask

Selectively retains or discards sites from the Ali based on the content of the passed Bio::MUST::Core::SeqMask object and returns the Ali.

use aliased 'Bio::MUST::Core::SeqMask';
my $variable_sites = SeqMask->variable_mask($ali);
$ali->apply_mask($variable_sites);      # discard constant sites

This method requires one argument.

idealize

Computes and applies an ideal sequence mask to the Ali and returns it. This is only a convenience method.

When invoked without arguments, it will discard the gaps that are universally shared by all the sequences. Otherwise, the provided argument corresponds to the threshold of the ideal_mask method described in Bio::MUST::Core::SeqMask.

use aliased 'Bio::MUST::Core::IdList';
my $fast_seqs = IdList->load('fast_evolving_seqs.idl');
my $seqs2keep = $fast_seqs->negative_list($ali);
$ali->apply_list($seqs2keep);           # discard fast-evolving seqs
$ali->idealize;      # discard newly shared gaps caused by fast seqs

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('hmm_based.ali');
$ali->idealize(0.05);    # discard insertions due to <5% of the seqs

This method accepts an optional argument.

MISC METHODS

gapmiss_regex

Returns a regular expression matching gaps and ambiguous or missing states. The exact regex returned depends on the type of sequences in the Ali (nucl. or proteins).

my $regex = $ali->gapmiss_regex;
my $first_seq = $ali->get_seq(0)->seq;
my $gapmiss_n = $first_seq =~ m/($regex)/xmsg;
say "The first sequence has $gapmiss_n gaps or ambiguous/missing sites";

This method does not accept any arguments.

map_coords

Converts a set of site positions from Ali coordinates to coordinates of the specified sequence (thereby ignoring positions due to gaps). Returns the converted sites in sequence coordinates as an array refrence.

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input.ali');
my $id = 'GIV-Norovirus Hum.GIV.1.POL_1338688@508124125';
my $ali_coords = [ 4, 25, 73, 89, 104, 116 ];
my $seq_coords = $ali->map_coords($id, $ali_coords);
# $seq_coords is [ 3, 23, 59, 71,  71,  74 ]

This method requires two arguments: the id of a sequence and an array reference of input sites in Ali coordinates.

I/O METHODS

load

Class method (constructor) returning a new Ali read from disk. This method will transparently import plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).

use Test::Deeply;
use aliased 'Bio::MUST::Core::Ali';
my $ali1 = Ali->load('example.ali');
my $ali2 = Ali->load('example.fasta');
my @seqs1 = $ali1->all_seqs;
my @seqs2 = $ali2->all_seqs;
is_deeply, \@seqs1, \@seqs2, 'should be true';

This method requires one argument.

store

Writes the Ali to disk in the MUST pseudo-FASTA format (ALI files).

Note that the ALI format is only used when the suffix of the outfile name is '.ali'. In all other cases (including lack of suffix), this method automatically forwards the call to store_fasta.

$ali->store('output.ali');
# output.ali is written in ALI format
$ali->store('output.fasta');
# output.fasta is written in FASTA format

This method requires one argument (but see store_fasta in case of automatic forwarding of the method call).

store_fasta

Writes the Ali to disk in the plain FASTA format.

For compatibility purposes, this method automatically fetches sequence ids using the foreign_id method instead of the native full_id method, both described in Bio::MUST::Core::SeqId.

$ali->store_fasta( 'output.fasta' );
$ali->store_fasta( 'output.fasta', {chunk => -1, degap => 1} );

This method requires one argument and accepts a second optional argument controlling the output format. It is a hash reference that may contain one or more of the following keys:

- clean: replace all ambiguous and missing states by C<X> (default: false)
- degap: boolean value controlling degapping (default: false)
- chunk: line width (default is 60 chars; negative values means no wrap)

Finally, it is possible to fine-tune the behavior of the clean option by providing another character than X through the gapify key. This can be useful to replace all ambiguous and missing states by gaps, as shown below:

$ali->store_fasta( 'output.fasta, { clean => 1, gapify => '*' } );

temp_fasta

Writes a temporary copy of the Ali to disk in the plain FASTA format using numeric sequence ids and returns the name of the temporary file. This is only a convenience method.

In list context, returns the IdMapper object along with temporary filename.

my $infile = $ali->temp_fasta( { degap => 1 } );
my $output = `script.sh $infile`;
...

This method accepts the same optional argument hash as store_fasta. However, an additional option (id_prefix) is available to control the way abbreviated sequence ids are prefixed by the std_mapper method (see Bio::MUST::Core::Listable).

my $infile1 = $ali1->temp_fasta( { id_prefix => 'file1-' } );
my $infile2 = $ali2->temp_fasta( { id_prefix => 'file2-' } );

Finally, it is possible to pass an existing Bio::MUST::Core::IdMapper object by providing the mapper option.

load_phylip

Class method (constructor) returning a new Ali read from a file in the PHYLIP format. Both sequential and interleaved formats are supported.

The only constraint is that sequence ids MUST NOT contain whitespace and be followed by at least one whitespace character. This means that some old-school PHYLIP files not following this convention will not be processed correctly.

When using the interleaved format, sequence ids may or may not be repeated in each block.

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load_phylip('phylip.phy');
say $ali->count_seqs;
say $ali->width;

# outputs:
# 10
# 709

This method requires one argument.

store_phylip

Writes the Ali to disk in the interleaved (or sequential) PHYLIP format.

To ensure maximal flexibility, this method fetches sequence ids using the native full_id method described in Bio::MUST::Core::SeqId, but truncates them to 10 characters, as expected by the original PHYLIP software package. No other tinkering is carried out on the ids. Thus, if the ids contain whitespace or are not unique in their 10 first characters, it is advised to first map them using one of the constructors in Bio::MUST::Core::IdMapper.

use aliased 'Bio::MUST::Core::Ali';
use aliased 'Bio::MUST::Core::IdMapper';
my $ali = Ali->load('input.ali');
my $mapper = IdMapper->std_mapper($ali);
$ali->shorten_ids($mapper);
$ali->store_phylip( 'input.phy', { chunk => 50 } );

This method requires one argument and accepts a second optional argument controlling the output format. It is a hash reference that may contain one or more of the following keys:

- short: truncate ids to 10 chars, as in original PHYLIP (defaut: yes)
- clean: replace all ambiguous and missing states by 'X' (default: false)
- chunk: line width (default: 60 chars; negative values means no wrap)

To store the Ali in PHYLIP sequential format, specify a negative chunk (-1).

load_stockholm

Class method (constructor) returning a new Ali read from a file in the STOCKHOLM format. =GF comments are retained (see above) but not the other comment classes (=GS, =GR and =GC).

use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('upsk.stockholm');
say $ali->header;

# outputs:
# ID    UPSK
# SE    Predicted; Infernal
# SS    Published; PMID 9223489
# RN    [1]
# RM    9223489
# RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
# RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
# RT    polymerase.
# RA    Deiman BA, Kortlever RM, Pleij CW;
# RL    J Virol 1997;71:5990-5996.

This method requires one argument.

instant_store

Class method intended to transform a large sequence file read from disk without loading it in memory. This method will transparently process plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).

my $chunk = 200;

my $split = sub {
    my $seq = shift;
    my $base_id = ( split /\s+/xms, $seq->full_id )[0];
    my $max_pos = $seq->seq_len - $chunk;
    my $n = 0;
    my $out_str;
    for (my $pos = 0; $pos <= $max_pos; $pos += $chunk, $n++) {
        $out_str .= ">$base_id.$n\n" . $seq->edit_seq($pos,
            $pos + $chunk <= $max_pos ? $chunk : 2 * $chunk
        ) . "\n";
    }
    return $out_str;
};

use aliased 'Bio::MUST::Core::Ali';

Ali->instant_store(
    'outfile.fasta', { infile => 'infile.fasta', coderef => $split }
);

This method requires two arguments. The sercond is a hash reference that must contain the following keys: - infile: input sequence file - coderef: subroutine implementing the transforming logic

instant_count

Class method returning the number of seqs in any sequence file read from disk without loading it in memory. This method will transparently process plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).

use aliased 'Bio::MUST::Core::Ali';
my $seq_n = Ali->instant_count('input.ali');
say $seq_n;

ALIASES

height

Alias for count_seqs method. For API consistency.

AUTHOR

Denis BAURAIN <denis.baurain@uliege.be>

CONTRIBUTORS

  • Catherine COLSON <ccolson@doct.uliege.be>

  • Arnaud DI FRANCO <arnaud.difranco@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.