NAME
Bio::NEXUS User Manual - Programming the Perl API for NEXUS files
DESCRIPTION
This User Manual explains the motivation for developing the Bio::NEXUS library, the principles underlying its organization, and its use in developing software for evolutionary informatics. This manual also provides information on how to use two demonstration tools that come with the package, nexplot.pl (for visualizing character data and trees) and nextool.pl (for manipulating character data and trees).
Quick Start
0.1 Install
If you have CPAN, the automatic install is easiest (just be sure you invoke the command with adequate permission to install at the default system or user level, e.g., on Mac OSX, use "sudo" before the following commands):
$ perl -MCPAN -e "install Bio::NEXUS"
or use the manual method of downloading the package, unpacking, then the following:
$ perl Makefile.PL
$ make
$ make test
$ make install
For further details and options, see the Installation document.
0.2 Quickies
**Under Construction**
nextool quickies
To reroot a tree using 'S_Pombe' as an outgroup (if your file contains one tree):
$ nextool.pl my_nexus_file.nex output.nex reroot 'S_Pombe'
To completely remove some taxa from a file, while maintaining the integrity of remaining phylogeny:
$ nextool.pl my_nexus_file.nex output.nex exclude otu human_gene1 chimp_gene1 mouse_gene2
To define a set of taxa (sets can be colored in plots: see nexplot quickie):
$ nextool.pl my_nexus_file.nex output.nex makesets byotus mammals = [human_gene1 mouse_gene2] plants = [maize_gene1 maize_gene2]
nexplot quickies
To produce a postscript plot of the tree and alignment from your file (first tree and first alignment are used by default, if files contain multiple trees or alignments):
$ nexplot.pl my_nexus_file.nex > pretty_plot.ps
To plot only the phylogeny in the postscript file:
$ nexplot.pl -t my_nexus_file.nex > pretty_tree.ps
To apply coloring to sets you have already defined (called "mammals" and "plants") in your NEXUS file (coloring affects both portions of the tree and rows in the alignment):
$ nexplot.pl -U "mammals red plants green" my_nexus_file > colored_plot.ps
quickie script using Bio::NEXUS
Here is an example of how Bio::NEXUS can be used to read in a NEXUS file, perhaps created by ClustalW, and add a tree to the file. Note that taxa need to be named the same way in the NEXUS file and in the tree string (although a translate() method exists to change names in the file).
#!/usr/bin/perl -w
use strict;
use Bio::NEXUS;
my $nexus = new Bio::NEXUS("./clustal_output/gene_family.nex");
my $newick = "(human_gene1,(chimp_gene,(yeast_gene,(human_gene2,mouse_gene))))";
my $outgroup = "yeast_gene";
my $trees_block = new Bio::NEXUS::TreesBlock();
$trees_block->add_tree_from_newick($newick, "simple_tree");
my $tree_object = $trees_block->get_tree("simple_tree");
$tree_object->reroot($outgroup);
$nexus->add_block($trees_block);
$nexus->write("gene_family_with_tree.nex");
Discussion Here we need to explain the steps. also I think this code can be simplified by removing the unneeded variables such as $outgroup (just use "yeast_gene" as the arg to the reroot method). If we call this add_tree.pl and run it (after first displaying the content of the input file), the output is as follows
> cat clustal_gen_family.nex
#NEXUS
BEGIN DATA;
dimensions ntax=5 nchar=25;
format missing=?
symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
interleave datatype=PROTEIN gap= -;
matrix
A IKKGANLFKTRCAQCHTVEKDGGNI
B LTKGAKLFTTRCAQCHTLEGDGGNI
C STKGAKLFETRCKQCHTVENGGGHV
D LKKGEKLFTTRCAQCHTLKEGEGNL
E LKKGEKLFTTRCAQCHTLKEGEGNL
;
end;
> perl add_tree.pl
Read in Data Block (deprecated), creating Characters Block instead . . .
> cat gene_family_with_tree.nex
#NEXUS
BEGIN TAXA;
DIMENSIONS ntax=5;
TAXLABELS human_gene1 human_gene2 yeast_gene mouse_gene chimp_gene;
END;
BEGIN CHARACTERS;
DIMENSIONS ntax=5 nchar=25;
FORMAT datatype=PROTEIN " gap=- abcdefghiklmnpqrstuvwxyz missing=? symbols=" interleave;
MATRIX
human_gene1 IKKGANLFKTRCAQCHTVEKDGGNI
human_gene2 LKKGEKLFTTRCAQCHTLKEGEGNL
yeast_gene STKGAKLFETRCKQCHTVENGGGHV
mouse_gene LKKGEKLFTTRCAQCHTLKEGEGNL
chimp_gene LTKGAKLFTTRCAQCHTLEGDGGNI
;
END;
BEGIN ;
TREE simple_tree = (human_gene1,(chimp_gene,(yeast_gene,(human_gene2,mouse_gene)inode4)inode3)inode2)root;
END;
0.3 Going further: should you be using nexplot.pl, nextool.pl and Bio::NEXUS?
**Under Construction**
You may find this package useful if you desire to:
use scripts or commands to manipulate gene or protein alignments with trees
use scripts or commands to manipulate classical character data with trees
generate custom views of your molecular or other character data and trees
develop wrappers for existing tools that don't use NEXUS but should
maintain integrity of complex data sets (e.g., multiple trees, constraints, notes)
write scripts to link together tools into an evolutionary analysis pipeline
develop new Perl tools that will use NEXUS as input or output
By contrast, this package may not help you directly if your main problem is to:
find methods to compute numeric results of an evolutionary model
integrate with BioPerl (we are working on this, via BioPerl, but its a separate branch)
find the best tree for your data
If you are uncertain about whether to use Bio::NEXUS, read further or contact the authors. See the Tutorial document for further code examples. For a more extensive explanation of Bio::NEXUS, read this User Manual.
Chapter 1 : Introduction
1.1. Overview
**Under Construction**
Genome analysis is increasingly dependent on comparative methods of analysis. Even at the earliest stages of genome annotation, bits of a new genome sequence are searched against known sequences to mine clues useful in "functional" inferences (where does the gene start? where are the introns? what does it do?). Often these inferences are based on BLAST search results.
Though it is not widely appreciated, there already exists a sophisticated methodological framework for comparative analysis, developed over the past 40 years by systematists and evolutionary biologists, in which differences are interpreted according to probabilistic models of evolutionary divergence on a branching tree. The basic methods and concepts of comparative evolutionary biology, originally developed for morphological characters, can be applied directly to any kind of character (discrete or continuous, so long as it fits the "2.1. The Character-State Data Model").
In the ongoing quest to improve the accuracy and reliability of functional inferences, it is inevitable that the bioinformatics/genomics community will come to rely on these more sophisticated methods. This transition will require automatable tools for phylogenetic analysis and character reconstruction (which already exist to a large degree), portable and flexible formats for data exchange, infrastructure to facilitate integration, and better education about how to integrate probabilistic evolutionary reasoning into genome interpretation.
The NEXUS file format of Maddison, Swofford & Maddison, 1997 (Systematic Biology 46:590-621) was developed to facilitate the communication and storage of data for comparative analysis.
Bio::NEXUS: an Object-oriented Perl API for the NEXUS file format
**Under Construction**
Developing Perl support for NEXUS is a natural way to facilitate evolutionary analysis. Because of its convenience, flexibility and power, Perl has become the glue language of bioinformatics. NEXUS is a powerful and flexible format, used in dozens of evolutionary analysis applications for which simplistic formats like FASTA are unacceptable. Combining the two makes it easy to develop wrappers for existing software and to glue together separate procedures into automated pipelines.
Several years of work in our research group led to the development of a NEXUS applications programming interface or "API" in Perl which we call "Bio::NEXUS" for "NEXUS Perl Library". Along with the library modules in Bio::NEXUS, the Bio::NEXUS package described here includes documentation as well as two demonstration applications, nexplot.pl and nextool.pl. The Bio::NEXUS library is the middle layer of the http://www.molevol.org/nexplorer Nexplorer server, which provides a graphical interface for browsing and manipulating sequence family data, showcasing the methods in Bio::NEXUS. The graphical rendering framework used in Nexplorer is the same as that used in nexplot.pl.
A beta version of the package was released in 2004. As of 2006, we are uploading the package to CPAN.
In terms of the big picture, what is currently missing from this project is data IO from other formats and streams. Our future plans include integrating more fully with BioPerl and with the CIPRES services architecture, to provide access to other file formats commonly used in bioinformatics, as well as to online services and databases.
Chapter 2. NEXUS, Bio::NEXUS, and the Character-State Data Model
2.1. The Character-State Data Model
**Under Construction**
The NEXUS format conveys data organized according to the character state data model, in which the features of operational taxonomic units (OTUs) (e.g., species, individuals, genes, genomes, etc.) are observable states of underlying homologous characters. For instance, in a protein sequence alignment, proteins are the OTUs, alignment columns are characters, and amino acids (or gaps) are states.
2.2 Evolutionary analysis
**Under Construction**
In evolutionary analysis, it is typical to consider differences as the result of state transitions by which common ancestral states diverge along the branches of a tree, according to some model of change. This is what makes evolutionary analysis so potent. Without the tree-based connection between differences and models of change, we can interpret similarities and differences in the search for patterns, but the results are often difficult to relate in a precise way to mechanistic hypotheses or to questions about causal factors. With evolutionary analysis, we can turn questions about the significance of similarities and differences into well posed questions about the rates or probabilities of different types of changes over time.
2.3. The NEXUS File Format Standard
**Under Construction**
Syntax
The NEXUS file is, in a sense, a text representation of the character-state data model. Thus it provides a means to represent a tree using the Newick standard (http://evolution.gs.washington.edu/phylip/newicktree.html). The syntactic structure of a NEXUS file is as follows:
Each of the pre-defined types of public blocks may appear only once. The TAXA block is the only necessary block. There are some restrictions on the ordering of blocks, and on the ordering of commands within a block.
Application-specific "private" blocks are also possible. NEXUS keywords are not case-sensitive. We put names of BLOCKS in upper case here for mnemonic purposes.
Blocks
Some important blocks
Name | Description |
---|---|
TAXA | specifies OTUs in data set |
CHARACTERS | specifies characters |
SETS | assigns names to sets of characters or OTUs |
ASSUMPTIONS | specifies assumptions for an analysis |
CODONS | specifies codons and their genetic codes |
Some important commands
Name | Block | Description |
---|---|---|
CharLabels | CHARACTERS | label for a character (column) |
StateLabels | CHARACTERS | label for a state (the type of an instance of a character) |
CharStateLabels | CHARACTERS | combined label for a character and its states |
CharSet | SETS | give a name to some set of chars |
TaxSet | SETS | give a name to some set of OTUs |
GeneticCode | CODONS | specify a genetic code |
CodeSet | CODONS | associate a code with a CharSet or TaxSet |
Tree | TREES | specify a Newick tree |
2.4. Examples of how NEXUS files are used in current software
**Under Construction**
mrbayes paup mesquite nexplorer
2.5. Bio::NEXUS Objects
**Under Construction**
The structure of a Bio::NEXUS object reflects the structure of a NEXUS file to the extent that it is logical. Like a NEXUS file, a Bio::NEXUS object is merely a container for blocks. Therefore, almost all manipulation or retrieval of data from a NEXUS object starts by getting some block:
$trees_block = $nexus->get_block('trees');
In some cases, blocks too are little more than containers (e.g. TREES, CHARACTERS, ASSUMPTIONS). In other cases, once you have the block, you have everything you need to access or manipulate data:
$dist = $distance_block->get_distance_between('Homo_sapiens','Pan_troglodytes');
or
@taxa = @{ $taxa_block->get_taxlabels() };
NEXUS
2.4. Manipulation objects
**Under Construction**
Chapter 3: Using Bio::NEXUS library
3.1. Selecting subset of the data
**Under Construction**
3.2. Manipulating data
**Under Construction**
3.3. Plotting data in different format.
**Under Construction**
Chapter 4: Demonstration applications: nexplot.pl, nextool.pl, and Nexplorer
4.1. Using nextool.pl
**Under Construction**
Nextool (nextool.pl) is self-documenting. To see the options, type one of these:
$ <path>/nextool.pl -h
$ perldoc <path>/nextool.pl
If nextool.pl is installed in your executable path, you do not need to specify the path.
4.2. Using nexplot.pl
**Under Construction**
Nexplot (nexplot.pl) is self-documenting. To see the options, type one of these:
$ <path>/nexplot.pl -h
$ perldoc <path>/nexplot.pl
If nexplot.pl is installed in your executable path, you do not need to specify the path. nexplot.pl, particularly when combined with nextool.pl, provides a flexible and convenient way to generate customized publication-quality views of character data in a phylogenetic context. In particular, nexplot.pl has a variety of settings, and can generate output in PostScript, which means that the graphic elements all have infinite resolution. To summarize, the advantages are:
output is publication-quality
output format can be converted to other formats for multiple uses
graphics are infinitely scalable
layout is customizable
data input format is standardized (NEXUS)
4.3. Using Nexplorer
**Under Construction**
Nexplorer is a web-based application providing a convenient graphical interface for phylogenetic browsing and manipulation of sequence family data sets (or other character data sets). Nexplorer provides only a subset of the capabilities of nexplot.pl and nextool.pl, but it is much easier to use, especially for beginners.
The http://www.molevol.org/nexplorerNexplorer web site (located at www.molevol.org/nexplorer) provides documentation, including tutorial exercises.
In order to use Nexplorer, you must either supply your own NEXUS file, or use one of the data files provided (several thousand, mostly sequence families).
AUTHOR
Tom Hladish Arlin Stoltzfus