NAME
Bio::MUST::Core::Ali::Stash - Thin wrapper for an indexed Ali read from disk
VERSION
version 0.243430
SYNOPSIS
#!/usr/bin/env perl
use Modern::Perl '2011';
# same as:
# use strict;
# use warnings;
# use feature qw(say);
use Bio::MUST::Core;
use aliased 'Bio::MUST::Core::Ali::Stash';
use aliased 'Bio::MUST::Core::IdList';
# load database
my $db = Stash->load('database.fasta');
# process OrthoFinder-like output file
# where each line defines a cluster followed by its member sequences
# cluster1: seq3 seq7 seq2
# cluster2: seq1 seq4 seq6 seq5
# ...
open my $in, '<', 'clusters.txt' or die "Cannot open clusters.txt: $!";
while (my $line = <$in>) {
chomp $line;
# extract member id list for current cluster
my ($cluster, @ids) = split /\s+/xms, $line;
$cluster =~ s/:\z//xms; # remove trailing colon (:)
my $list = IdList->new( ids => \@ids );
# assemble Ali and store it as FASTA file
my $ali = $list->reordered_ali($db);
$ali->dont_guess;
$ali->store( $cluster . '.fasta' );
}
DESCRIPTION
This module implements a class representing a sequence database where ids are indexed for faster access. To this end, it combines an internal Bio::MUST::Core::Ali object and a Bio::MUST::Core::IdList object.
An Ali::Stash is meant to be built from an existing ALI (or FASTA) file residing on disk and cannot be altered once loaded. Its sequences are expected to be unaligned, but aligned FASTA files are also processed correctly. By default, the full-length sequence ids are indexed. If the first word of each id (the leading whitespace-free string, typically the accession) is unique across the database, it can be used instead via the option truncate_ids => 1 of the load method (see the load method below for an example).
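The effect of indexing full-length ids versus first words can be illustrated with a small pure-Perl sketch. This is not the module's actual code: the build_index helper and the sample ids are hypothetical, and the real index is handled by a Bio::MUST::Core::IdList object.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's actual code: map each id to its
# position in the list, optionally keeping only the first word of the id.
sub build_index {
    my ($ids, $args) = @_;
    $args //= {};

    my %index_for;
    for my $i (0 .. $#{$ids}) {
        my $key = $ids->[$i];
        if ( $args->{truncate_ids} && $key =~ m/(\S+)/xms ) {
            $key = $1;    # first whitespace-free word (accession)
        }
        # an index is only usable if every key is unique
        die "Id '$key' is not unique!\n" if exists $index_for{$key};
        $index_for{$key} = $i;
    }

    return \%index_for;
}

# Made-up ids for illustration
my @full_ids = (
    'XP_000001 hypothetical protein',
    'XP_000002 protein kinase',
);
my $index_for = build_index( \@full_ids, { truncate_ids => 1 } );
print "XP_000002 is at position $index_for->{'XP_000002'}\n";
```

Once built, such an index turns each lookup into a single hash access instead of a scan over all sequences, which is presumably why truncated ids must be unique before they can serve as keys.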
While this class is more efficient than the standard Ali, it is much slower at reading large sequence databases than specialized external programs such as NCBI blastdbcmd working on indexed binary files. Thus, if you need more performance, have a look at the Blast::Database class from the Bio::MUST::Drivers distribution.
ATTRIBUTES
seqs
Bio::MUST::Core::Ali object (required)
This required attribute contains the Bio::MUST::Core::Seq objects that populate the associated sequence database file. It should be initialized through the class method load (see the SYNOPSIS for an example).
For now, it provides the following methods: count_comments, all_comments, get_comment, guessing, all_seq_ids, has_uniq_ids, is_protein, is_aligned, get_seq, get_seq_with_id (see below), first_seq, all_seqs, filter_seqs and count_seqs (see Bio::MUST::Core::Ali).
lookup
Bio::MUST::Core::IdList object (auto)
This attribute is automatically initialized with the list indexing the sequence ids of the internal Ali object. Thus, it cannot be user-specified.
It provides the following method: index_for (see Bio::MUST::Core::IdList). However, this method is best treated as private. Instead, individual sequences should be accessed through the get_seq_with_id method (see below), while sequence batches should be recovered via user-specified IdList objects (see the SYNOPSIS for an example).
ACCESSORS
get_seq_with_id
Returns a sequence of the Ali::Stash by its id. Note that sequence ids are assumed to be unique in the corresponding database. If no sequence exists for the specified id, this method will return undef.
use Carp;
my $id = 'Pyrus malus_3750@658052655';
my $seq = $db->get_seq_with_id($id);
croak "Seq $id not found in Ali::Stash!" unless defined $seq;
This method accepts just one argument (and not an array slice).
It is a faster implementation of the same method from the Ali class.
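Because get_seq_with_id takes a single id and returns undef on a miss, looking up a handful of ids means calling it in a loop and checking each result. The sketch below shows one way to do so; fetch_seqs is a hypothetical helper (not part of Bio::MUST::Core) that only assumes its first argument responds to get_seq_with_id as documented above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::MUST::Core): look up several ids one
# at a time and separate the sequences found from the ids that are missing.
sub fetch_seqs {
    my ($db, @ids) = @_;

    my (@seqs, @missing);
    for my $id (@ids) {
        my $seq = $db->get_seq_with_id($id);    # one id per call
        if ( defined $seq ) {
            push @seqs, $seq;
        }
        else {
            push @missing, $id;
        }
    }

    return ( \@seqs, \@missing );
}
```

With a database loaded through Stash->load, it could be called as my ($seqs, $missing) = fetch_seqs($db, @ids). For larger batches, going through an IdList as in the SYNOPSIS remains the documented route.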
I/O METHODS
load
Class method (constructor) returning a new Ali::Stash read from disk. As in Ali, this method will transparently import plain FASTA files in addition to the MUST pseudo-FASTA format (ALI files).
# load database
my $db = Stash->load( 'database.fasta' );
# alternatively... (indexing only accessions)
my $db = Stash->load( 'database.fasta', { truncate_ids => 1 } );
This method requires one argument and accepts a second optional argument controlling the way sequence ids are processed. It is a hash reference that may only contain the following key:
- truncate_ids: consider only the first id word (accession)
AUTHOR
Denis BAURAIN <denis.baurain@uliege.be>
COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.