SimpleAlign - Multiple alignments held as a set of sequences


# use Bio::AlignIO to read in the alignment
$str = Bio::AlignIO->new('-file' => 't/data/testaln.pfam');
$aln = $str->next_aln();

# some descriptors
print $aln->length, "\n";
print $aln->no_residues, "\n";
print $aln->is_flush, "\n";
print $aln->no_sequences, "\n";
print $aln->percentage_identity, "\n";
print $aln->consensus_string(50), "\n";

# find the position in the alignment for a sequence location
$pos = $aln->column_from_residue_number('1433_LYCES', 14); # = 6; 

# extract sequences and check values for the alignment column $pos
foreach $seq ($aln->each_seq) {
    $res = $seq->subseq($pos, $pos);
    # tally how often each residue occurs in this column
    $count{$res}++;
}
foreach $res (keys %count) {
    printf "Res: %s  Count: %2d\n", $res, $count{$res};
}


SimpleAlign handles multiple alignments of sequences. It is very permissive about what it accepts (it won't insist on all sequences being the same length, etc.): really it is a SequenceSet explicitly held in memory, with a whole series of built-in manipulations and, especially, file format systems for reading and writing alignments.

SimpleAlign basically views an alignment as an immutable block of text. SimpleAlign *is not* the object to use if you want to perform complex alignment manipulations. These functions are much better done by UnivAln by Georg Fuellen.

However, for lightweight display/formatting and minimal manipulation (e.g. removing all-gap columns), this is the one to use.
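As an illustration of the kind of minimal manipulation meant here, the following sketch removes all-gap columns from a block of aligned strings in plain Perl. It does not use SimpleAlign; the helper name remove_gap_columns is hypothetical, and the sketch assumes that all rows have the same length and that "." and "-" are the gap characters.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every column that holds only gap characters.
# Assumes all rows have the same length and that "." and "-" mark gaps.
sub remove_gap_columns {
    my @rows = @_;
    my @keep;                                  # indices of columns to keep
    for my $i (0 .. length($rows[0]) - 1) {
        # count the rows that carry a real residue in column $i
        my $non_gaps = grep { substr($_, $i, 1) !~ /[.\-]/ } @rows;
        push @keep, $i if $non_gaps;
    }
    # rebuild each row from the kept columns only
    return map { my $row = $_; join '', map { substr($row, $_, 1) } @keep } @rows;
}

print "$_\n" for remove_gap_columns('AC..DEF.GH', 'ACG..RTY..');
```

Only the fourth column is a gap in both rows, so it is the only one removed.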

SimpleAlign uses Bio::LocatableSeq, a subclass of Bio::PrimarySeq, to store its sequences. These are subsequences with start and end positions in the parent reference sequence.

Tricky concepts. SimpleAlign expects name,start,end to be 'unique' in the alignment, and this is the key for the internal hashes (name,start,end is abbreviated nse in the code). However, in many cases people don't want the name/start-end to be displayed: either there are multiple names in an alignment, or the names are specific to the alignment (ROA1_HUMAN_1, ROA1_HUMAN_2 etc). These names are called 'displayname', and are generally what is used to print out the alignment. They default to name/start-end.
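The naming scheme can be illustrated with a small sketch in plain Perl. It does not touch SimpleAlign's internal hashes; the helper names nse_key and displaynames_count are hypothetical and only mimic the behaviour described above (a name/start-end key per sequence, and display names numbered by how often a name has been used).

```perl
use strict;
use warnings;

# Hypothetical helper: build the name/start-end ("nse") key for one sequence.
sub nse_key {
    my ($name, $start, $end) = @_;
    return "$name/$start-$end";
}

# Hypothetical helper: number repeated names as name_1, name_2, ...
# Takes a list of [name, start, end] triples and returns a hash that maps
# each nse key to its counted display name.
sub displaynames_count {
    my @seqs = @_;
    my (%seen, %display);
    foreach my $s (@seqs) {
        my ($name, $start, $end) = @$s;
        $seen{$name}++;
        $display{ nse_key($name, $start, $end) } = $name . "_" . $seen{$name};
    }
    return %display;
}

my @seqs = (
    [ 'ROA1_HUMAN', 10,  80  ],
    [ 'ROA1_HUMAN', 100, 170 ],
    [ 'ROA2_HUMAN', 5,   75  ],
);
my %display = displaynames_count(@seqs);
foreach my $s (@seqs) {
    my $key = nse_key(@$s);
    print "$key => $display{$key}\n";
}
```

The two ROA1_HUMAN segments share a name but have distinct nse keys, which is why name,start,end rather than the name alone has to be unique.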

The SimpleAlign module came from Ewan Birney's Align module.


SimpleAlign is being slowly converted to bioperl coding standards, mainly by Ewan.

Use Bio::Root::Object - done
Use proper exceptions - done
Use hashed constructor - not done!


The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _


Title     : addSeq
Usage     : $myalign->addSeq($newseq);
Function  : Adds another sequence to the alignment
          : *does not* align it - just adds it to the
          : hashes
Returns   : nothing
Argument  :


 Title   : column_from_residue_number
 Usage   : $col = $ali->column_from_residue_number( $seqname, $resnumber)
    This function gives the position in the alignment (i.e. column number) of
    the given residue number in the sequence with the given name. For example,
    for the alignment

    Seq1/91-97 AC..DEF.GH
    Seq2/24-30 ACGG.RTY..
    Seq3/43-51 AC.DDEFGHI

    column_from_residue_number( "Seq1", 94 ) returns 6.
    column_from_residue_number( "Seq2", 25 ) returns 2.
    column_from_residue_number( "Seq3", 50 ) returns 9.

    An exception is thrown if the residue number would lie outside the length
    of the alignment (e.g. column_from_residue_number( "Seq2", 22 )).

 Returns : A column number for the position in the alignment of the
	   given residue in the given sequence (1 = first column)

 Args    : 
    A sequence name (not a name/start-end)
    A residue number in the whole sequence (not just that segment of it 
					    in the alignment)
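The lookup can be pictured as walking along the aligned string and counting residues while skipping gaps. The following plain-Perl sketch does this for a single aligned string; column_for_residue is a hypothetical helper, not the module's implementation, and it assumes that "." and "-" are the gap characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lookup described above: walk along the
# aligned string, counting residues, until the requested residue is reached.
# $start is the residue number of the first residue in the aligned segment.
sub column_for_residue {
    my ($aligned, $start, $resnumber) = @_;
    my $current = $start - 1;                 # residue number seen so far
    my @chars   = split //, $aligned;
    for my $col (1 .. scalar @chars) {
        next if $chars[$col - 1] =~ /[.\-]/;  # gap characters do not count
        $current++;
        return $col if $current == $resnumber;
    }
    die "Residue $resnumber lies outside the aligned segment\n";
}

print column_for_residue('ACGG.RTY..', 24, 25), "\n";   # Seq2/24-30, prints 2
print column_for_residue('AC.DDEFGHI', 43, 50), "\n";   # Seq3/43-51, prints 9
```

Asking for residue 22 of Seq2 makes the sketch die, mirroring the exception described above.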


 Title     : consensus_string
 Usage     : $str = $ali->consensus_string($threshold_percent)
 Function  : Makes a consensus
 Returns   :
 Argument  : Optional threshold ranging from 0 to 100.  If the consensus residue appears in
		fewer than threshold % of the sequences at a given location
		consensus_string will return a "?" at that location rather than the consensus
		letter. (Default value = 0%)
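The effect of the threshold can be sketched in plain Perl as follows. This is not the module's code: consensus_sketch is a hypothetical helper working on equal-length strings, it counts every character (gaps included), and it breaks ties in favour of the character that sorts first. Both of those choices are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sketch of a thresholded consensus over equal-length rows.
# Every character, gaps included, is counted; ties go to the character
# that sorts first. Neither choice is taken from SimpleAlign.
sub consensus_sketch {
    my ($threshold, @rows) = @_;
    $threshold = 0 unless defined $threshold;
    my $consensus = '';
    for my $i (0 .. length($rows[0]) - 1) {
        my %count;
        $count{ substr($_, $i, 1) }++ for @rows;
        # most frequent character in this column
        my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $percent = 100 * $count{$best} / scalar(@rows);
        $consensus .= $percent < $threshold ? '?' : $best;
    }
    return $consensus;
}

my @rows = ('ACDE', 'ACDF', 'AGDG');
print consensus_sketch(0,  @rows), "\n";   # prints ACDE
print consensus_sketch(50, @rows), "\n";   # prints ACD?
```

In the last column no residue reaches 50% of the rows, so a threshold of 50 yields "?" there.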


 Title     : consensus_aa
 Usage     : $consensus_residue = $ali->consensus_aa($residue_number, $threshold_percent)
 Function  : Makes a consensus for the specified location
 Returns   :
 Argument  : Optional threshold ranging from 0 to 100.  If the consensus residue appears in
		fewer than threshold % of the sequences at the specified location
		consensus_aa will return a "?" rather than the consensus
		letter. (Default value = 0%)


Title     : each_alphabetically
Usage     : foreach $seq ( $ali->each_alphabetically() )
Function  : returns an array of sequence objects sorted
          : alphabetically by name and then by start point
          : Does not change the order of the alignment
Returns   :
Argument  :
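The ordering (by name, then by start point) can be sketched in plain Perl with a two-key sort. The helper sorted_by_name_and_start and the record layout with name, start and end keys are hypothetical; real alignments hold sequence objects rather than hashes.

```perl
use strict;
use warnings;

# Hypothetical helper: return a sorted copy of a list of
# { name, start, end } records, leaving the input order untouched.
sub sorted_by_name_and_start {
    my @seqs = @_;
    return sort {
        $a->{name} cmp $b->{name} || $a->{start} <=> $b->{start}
    } @seqs;
}

my @seqs = (
    { name => 'ROA2_HUMAN', start => 5,   end => 75  },
    { name => 'ROA1_HUMAN', start => 100, end => 170 },
    { name => 'ROA1_HUMAN', start => 10,  end => 80  },
);
foreach my $s (sorted_by_name_and_start(@seqs)) {
    print $s->{name}, "/", $s->{start}, "-", $s->{end}, "\n";
}
```

Names are compared as strings and start points as numbers, so a start of 10 sorts before 100.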


Title     : eachSeq
Usage     : foreach $seq ( $align->eachSeq() )
Function  : gets an array of Seq objects from the
          : alignment
Returns   : an array
Argument  : nothing


Title     : eachSeqWithId
Usage     : foreach $seq ( $align->eachSeqWithId($id) )
Function  : gets an array of Seq objects from the
          : alignment, the contents being those sequences
          : with the given name (there may be more than one)
Returns   : an array
Argument  : a sequence name


Title     : id
Usage     : $myalign->id("Ig")
Function  : Gets/sets the id field of the alignment
Returns   : An id string
Argument  : An id string (optional)


Title     : is_flush
Usage     : if( $ali->is_flush() )
Function  : Tells you whether the alignment
          : is flush, i.e. all of the same length
Returns   : 1 or 0
Argument  :


Title     : length_aln()
Usage     : $len = $ali->length_aln()
Function  : returns the maximum length of the alignment.
          : To be sure the alignment is a block, use is_flush
Returns   :
Argument  :
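The relationship between the maximum length and flushness can be sketched in plain Perl. The helper names max_length_sketch and is_flush_sketch are hypothetical and operate on bare strings rather than on a SimpleAlign object.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: the longest row decides the reported length.
sub max_length_sketch {
    my @rows = @_;
    return max(map { length($_) } @rows);
}

# Hypothetical helper: rows are flush when they all share one length.
sub is_flush_sketch {
    my @rows = @_;
    my %lengths;
    $lengths{ length($_) } = 1 for @rows;
    return scalar(keys %lengths) == 1 ? 1 : 0;
}

my @ragged = ('AC..DEF.GH', 'ACGG.RTY');
print max_length_sketch(@ragged), "\n";   # prints 10
print is_flush_sketch(@ragged), "\n";     # prints 0
```

A maximum length alone says nothing about shorter rows, which is why the flush check is needed before treating the alignment as a block.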


Title     : map_chars
Usage     : $ali->map_chars('\.','-')
Function  : does a s/$arg1/$arg2/ on
          : the sequences. Useful for
          : gap characters
          : Notice that the from (arg1) is interpreted
          : as a regex, so be careful about quoting meta
          : characters (e.g. $ali->map_chars('.','-') won't
          : do what you want)
Returns   :
Argument  :
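The quoting pitfall is easy to reproduce in plain Perl. In the sketch below, map_chars_sketch is a hypothetical stand-in that applies the substitution to a single string, and it assumes the substitution is applied globally; quotemeta is the Perl built-in that escapes regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for map_chars on one string: $from is used as a
# regex, $to as the replacement. The /g (global) flag is an assumption.
sub map_chars_sketch {
    my ($from, $to, $seq) = @_;
    $seq =~ s/$from/$to/g;
    return $seq;
}

my $row = 'AC..DEF.GH';
print map_chars_sketch('\.', '-', $row), "\n";            # AC--DEF-GH
print map_chars_sketch('.',  '-', $row), "\n";            # ----------
print map_chars_sketch(quotemeta('.'), '-', $row), "\n";  # AC--DEF-GH
```

An unescaped "." matches every character, so the second call replaces the residues as well as the gaps.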


Title     : no_residues
Usage     : $no = $ali->no_residues
Function  : number of residues in total
          : in the alignment
Returns   :
Argument  :


Title     : no_sequences
Usage     : $depth = $ali->no_sequences
Function  : number of sequences in the
          : sequence alignment
Returns   :
Argument  :


Title   : percentage_identity
Usage   : $id = $align->percentage_identity
Function: Uses a fast method to calculate the average percentage identity of the alignment
Returns : The average percentage identity of the alignment
Args    : None


Title   : purge
Usage   : $aln->purge(0.7);
Function: removes sequences above the given %id threshold
Example :
Returns : An array of the removed sequences

This function will grind on large alignments. Beware!

(perhaps not ideally implemented)


Title     : read_fasta
Usage     : $ali->read_fasta(\*INPUT)
Function  : reads in a fasta formatted
          : file for an alignment
Returns   :
Argument  :


Title     : read_mase
Usage     : $ali->read_mase(\*INPUT)
Function  : reads mase (seaview)
          : formatted alignments
Returns   :
Argument  :


Title   : read_MSF
Usage   : $al->read_MSF(\*STDIN);
Function: reads MSF formatted files. Tries to read *all* MSF.
         It reads all non-whitespace characters in the alignment
         area. For MSFs with weird gaps (e.g. ~~~) map them by using
         map_chars
Example :
Returns :
Args    : filehandle


Title     : read_Pfam
Usage     : $ali->read_Pfam(\*INPUT)
Function  : reads a Pfam formatted
          : Alignment (Mul format).
          : - this is the format used by Belvu
Returns   :
Argument  :


Title     : read_Pfam_file
Usage     : $ali->read_Pfam_file("thisfile");
Function  : opens a filename, reads
          : a Pfam (mul) formatted alignment
Returns   :
Argument  :


Title   : read_Prodom
Usage   : $ali->read_Prodom( $file )
Function: Reads in a Prodom format alignment
Returns :
Args    : A filehandle glob or ref. to a filehandle object


Title     : read_selex
Usage     : $ali->read_selex(\*INPUT)
Function  : reads selex (hmmer) format
          : alignments
Returns   :
Argument  :


Title     : read_stockholm
Usage     : $ali->read_stockholm(\*INPUT)
Function  : reads stockholm format alignments
Returns   :
Argument  :


Title     : removeSeq
Usage     : $aln->removeSeq($seq);
Function  : removes a single sequence from an alignment


Title     : set_displayname_count
Usage     : $ali->set_displayname_count
Function  : sets the names to be name_#
          : where # is the number of times this
          : name has been used.
Returns   :
Argument  :


Title     : set_displayname_flat
Usage     : $ali->set_displayname_flat()
Function  : Makes all the sequences be displayed
          : as just their name, not name/start-end
Returns   :
Argument  :


Title     : set_displayname_normal
Usage     : $ali->set_displayname_normal()
Function  : Makes all the sequences be displayed
          : as name/start-end
Returns   :
Argument  :


Title     : sort_alphabetically
Usage     : $ali->sort_alphabetically
Function  : changes the order of the alignment
          : to alphabetical on name followed by
          : numerical by number
Returns   :
Argument  :


Title     : uppercase()
Usage     : $ali->uppercase()
Function  : Sets all the sequences
          : to uppercase
Returns   :
Argument  :


Title     : write_clustalw
Usage     : $ali->write_clustalw
Function  : writes a clustalw formatted
          : (.aln) file
Returns   :
Argument  :


Title     : write_fasta
Usage     : $ali->write_fasta(\*OUTPUT)
Function  : writes a fasta formatted alignment
Returns   :
Argument  : reference-to-glob to file or filehandle object
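What it means to pass a reference-to-glob or a filehandle object can be shown with a plain-Perl sketch. write_fasta_sketch is a hypothetical helper, and the record layout it writes (a ">" line holding the display name, followed by the aligned sequence on one line) is an assumption made for the example, not a description of the module's exact output.

```perl
use strict;
use warnings;

# Hypothetical helper: write [display name, aligned string] pairs as
# FASTA-style records to any filehandle (glob reference or lexical handle).
sub write_fasta_sketch {
    my ($fh, @seqs) = @_;
    foreach my $s (@seqs) {
        my ($display, $aligned) = @$s;
        print {$fh} ">$display\n$aligned\n";
    }
    return;
}

# A reference to the STDOUT glob works just like \*OUTPUT in the usage above.
write_fasta_sketch(
    \*STDOUT,
    [ 'Seq1/91-97', 'AC..DEF.GH' ],
    [ 'Seq2/24-30', 'ACGG.RTY..' ],
);
```

The same call accepts a lexical handle from open(my $fh, '>', $filename), since print treats both kinds of handle alike.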


Title     : write_MSF
Usage     : $ali->write_MSF(\*FH)
Function  : writes MSF format output
Returns   :
Argument  :


Title     : write_Pfam
Usage     : $ali->write_Pfam(\*OUTPUT)
Function  : writes a Pfam/Mul formatted
          : file
Returns   :
Argument  :


Title     : write_selex
Usage     : $ali->write_selex(\*OUTPUT)
Function  : writes a selex (hmmer) formatted alignment
Returns   :
Argument  : reference-to-glob to file or filehandle object