NAME
Bio::CUA::Tutorial - A tutorial on using the programs of Bio-CUA
DESCRIPTION
This is a tutorial on how to use the accompanying programs in the distribution. The three programs are:
build_cai_param.pl
build_tai_param.pl
calculate_CUB.pl
Running any of these programs without arguments prints a brief usage message.
To learn how to write new programs using the modules in the distribution, read the documentation of the following modules:
L<Bio::CUA::CUB::Builder>
L<Bio::CUA::CUB::Calculator>
L<Bio::CUA::Summarizer>
One can also read the source code of the accompanying programs to learn how to use the provided modules.
DATA
All the data used in this tutorial are downloadable from https://github.com/fortune9/CUA/blob/master/examples.tar.gz.
After downloading the archive, uncompress it. On Linux-like systems, run
tar -xzf examples.tar.gz
This creates a folder examples containing two subfolders, data and output. The data folder holds the input data, and the output folder holds reference output that can be compared with newly generated output to make sure the programs work correctly.
EXAMPLE
In this tutorial, I will use the sequences and other data from the fruitfly Drosophila melanogaster as an example to show how to compute all the codon usage bias (CUB) metrics. For data availability, see "DATA".
Codon-level metrics
First, we will calculate the CAI, tAI, and Fop at the codon level, that is, which codons are preferred over others.
- CAI - Codon Adaptation Index
-
# calculate CAI using the top 200 highly expressed genes
build_cai_param.pl -i data/longest_cds.dmel_5_57.fa -e \
data/RPKM_S2_cells.tsv -s 200 -o codon_CAI.S2_cell
Here, data/longest_cds.dmel_5_57.fa contains the longest CDS of each gene in the fruitfly genome, in FASTA format, and data/RPKM_S2_cells.tsv contains the mRNA expression level of every gene as RPKM values. The option -s selects the genes from the expression file that form the reference set, here the 200 most highly expressed genes. The option -o directs the output to the file codon_CAI.S2_cell.
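To make the codon-level CAI concrete, the following minimal sketch computes the textbook relative adaptiveness of each codon (its count divided by the count of the most used synonymous codon in the reference set). It is an illustration only, not the code of build_cai_param.pl: the function names, the two-family toy codon table, and the toy reference sequences are all assumptions.

```python
from collections import Counter

# Toy synonymous families (assumption); a real analysis covers the full genetic code.
SYNONYMS = {
    "Lys": ["AAA", "AAG"],
    "Phe": ["TTT", "TTC"],
}

def count_codons(cds):
    """Count in-frame codons of a CDS string; a trailing partial codon is ignored."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))

def relative_adaptiveness(reference_seqs):
    """Return w = count / max count within each synonymous family."""
    totals = Counter()
    for seq in reference_seqs:
        totals.update(count_codons(seq))
    weights = {}
    for codons in SYNONYMS.values():
        top = max(totals[c] for c in codons)
        if top == 0:
            continue  # family absent from the reference set
        for c in codons:
            weights[c] = totals[c] / top
    return weights

# Hypothetical reference set standing in for the top 200 expressed genes.
reference = ["AAGAAGAAATTC", "AAGTTCTTCTTT"]
weights = relative_adaptiveness(reference)
for codon in sorted(weights):
    print(codon, round(weights[codon], 3))
```

The most used codon of each family gets a weight of 1, and the others are scaled relative to it. The actual program may differ, for example in how it treats codons with zero counts.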
- tAI - tRNA adaptation index
-
# calculate codons' tAI using tRNA copy numbers to approximate the
# tRNA abundance
build_tai_param.pl -t data/dmel_r5.tRNA_copy_number -o \
codon_tAI.dmel_r5
The file data/dmel_r5.tRNA_copy_number contains the tRNA copy number for each anticodon. Check the file for the format. One can download the tRNA information from the database GtRNAdb and then summarize the copy numbers for each anticodon.
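Summarizing copy numbers per anticodon amounts to counting tRNA genes grouped by anticodon. The sketch below shows that step under assumed inputs: the (gene_id, anticodon) records and the tab-separated output are hypothetical, so compare with data/dmel_r5.tRNA_copy_number for the format the program actually expects.

```python
from collections import Counter

def copy_numbers(trna_genes):
    """Count tRNA genes per anticodon.

    trna_genes is an iterable of (gene_id, anticodon) pairs (hypothetical layout).
    """
    return Counter(anticodon.upper() for _gene_id, anticodon in trna_genes)

# Made-up records standing in for a parsed tRNA gene list.
genes = [("tRNA1", "CTT"), ("tRNA2", "ctt"), ("tRNA3", "TTT"), ("tRNA4", "GAA")]
counts = copy_numbers(genes)
for anticodon, n in sorted(counts.items()):
    print(f"{anticodon}\t{n}")
```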
- Fop - frequency of optimal codons
-
The optimal codons can be defined using different criteria. Here I classify codons with tAI > 0.4 as optimal.
Under Linux, one can run the following command
gawk '$2 > 0.40' codon_tAI.dmel_r5 >optimal_codons.dmel_r5
to filter optimal codons from the above tAI output.
- ENC - effective number of codons
-
ENC is defined only for sequences, so there is no codon-level ENC to compute.
Sequence-level metrics
With the above codon-level metrics, one can compute sequence-level CUB values.
To calculate CAI, tAI, Fop, and ENC for all the CDS sequences, run
calculate_CUB.pl -s data/longest_cds.dmel_5_57.fa -t \
codon_tAI.dmel_r5 -c codon_CAI.S2_cell -f optimal_codons.dmel_r5 \
-e enc -o CUB_metrics.dmel_r5.tsv
Note that the options -t, -c, -f, and -e supply the parameters needed for calculating tAI, CAI, Fop, and ENC, respectively.
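As a rough illustration of how codon-level values turn into sequence-level ones, the sketch below uses the standard definitions: CAI and tAI are geometric means of per-codon weights, and Fop is the fraction of optimal codons among the codons considered. The function names, the toy weights, and the handling of unscored codons are assumptions, not the implementation in calculate_CUB.pl.

```python
import math

def codons_of(cds):
    """Split a CDS string into in-frame codons, ignoring a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def geometric_mean_score(cds, weights):
    """Geometric mean of codon weights; codons without a positive weight are skipped."""
    logs = [math.log(weights[c]) for c in codons_of(cds) if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

def fop(cds, optimal, scored):
    """Fraction of optimal codons among the codons listed in `scored`."""
    used = [c for c in codons_of(cds) if c in scored]
    return sum(c in optimal for c in used) / len(used) if used else float("nan")

# Toy codon-level values (assumptions), e.g. parsed from a codon-level output file.
weights = {"AAG": 1.0, "AAA": 0.25, "TTC": 1.0, "TTT": 0.5}
optimal = {"AAG", "TTC"}

cds = "AAGAAATTC"
score = geometric_mean_score(cds, weights)
fraction = fop(cds, optimal, set(weights))
print(round(score, 3), round(fraction, 3))
```

Working in log space avoids underflow when multiplying many small weights for long sequences.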
CAI and ENC variants
CAI variants
I devised two variants of CAI to address a shortcoming of the standard CAI. They are calculated as follows.
- mCAI
-
# calculate codons' CAI using the top 200 highly expressed genes with
# RSCUs normalized by the expected RSCUs under even usage, termed
# mCAI.
build_cai_param.pl -i data/longest_cds.dmel_5_57.fa -e \
data/RPKM_S2_cells.tsv -s 200 -o codon_CAI.by_mean.S2_cell -m mean
The only difference from the standard CAI command is the added option -m mean, which requests that all RSCUs be normalized by the expected RSCUs before the CAIs are derived. RSCU stands for Relative Synonymous Codon Usage.
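For readers unfamiliar with RSCU, here is a minimal sketch of the usual definition: the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used evenly. The function name and toy counts are assumptions, and how build_cai_param.pl converts normalized RSCUs into final values is not shown here.

```python
def rscu(counts, family):
    """RSCU = observed count / (family total / number of synonymous codons)."""
    total = sum(counts.get(c, 0) for c in family)
    if total == 0:
        return {}  # amino acid not observed; RSCU is undefined
    expected = total / len(family)
    return {c: counts.get(c, 0) / expected for c in family}

# Toy codon counts for lysine (assumption).
lys = rscu({"AAA": 1, "AAG": 3}, ["AAA", "AAG"])
print(lys)
```

Under perfectly even usage every RSCU equals 1, and the RSCUs of a family always sum to the number of codons in that family.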
- bCAI
-
# calculate codons' CAI using the top 200 highly expressed genes with
# RSCUs normalized by the RSCUs from the 1000 most lowly expressed
# genes
build_cai_param.pl -i data/longest_cds.dmel_5_57.fa -e \
data/RPKM_S2_cells.tsv -s 200 -o codon_CAI.by_background.S2_cell \
-b 1000
The option -b 1000 requests that the 1000 most lowly expressed genes be used to normalize the RSCUs before the CAIs are calculated.
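The idea of normalizing by a background can be sketched as a per-codon ratio of two RSCUs, one from the highly expressed set and one from the lowly expressed set. This is only an interpretation of the description above: the function names, the toy counts, and the choice to skip codons whose background RSCU is zero are assumptions, and the program's actual steps may differ.

```python
def rscu(counts, family):
    """RSCU = observed count / mean count over the synonymous family."""
    total = sum(counts.get(c, 0) for c in family)
    if total == 0:
        return {}
    mean = total / len(family)
    return {c: counts.get(c, 0) / mean for c in family}

def background_normalized(high, low, family):
    """Ratio of RSCU in the highly expressed set to RSCU in the background set."""
    hi_rscu, lo_rscu = rscu(high, family), rscu(low, family)
    return {
        c: hi_rscu[c] / lo_rscu[c]
        for c in family
        if c in hi_rscu and lo_rscu.get(c, 0) > 0  # skip undefined ratios (assumption)
    }

family = ["AAA", "AAG"]
ratios = background_normalized({"AAA": 10, "AAG": 30}, {"AAA": 20, "AAG": 20}, family)
print(ratios)
```

A ratio above 1 marks a codon that is used more in highly expressed genes than in the background.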
The sequence-level mCAI and bCAI can be computed by specifying the codon-level output. For example, for bCAI, we can run
calculate_CUB.pl -s data/longest_cds.dmel_5_57.fa -c \
codon_CAI.by_background.S2_cell -o CAI_by_background.dmel_r5.tsv \
--lite
Note that the option -c is given the codon-level bCAI file codon_CAI.by_background.S2_cell. The option --lite stops the program from computing other auxiliary information.
ENC variants
Including the standard ENC, there are four variants, defined by whether nucleotide composition is corrected for and by how missing F values are estimated, as shown below:
__________________________________________________________
Metric | Correct nucleotide | Estimate missing F values
       | composition?       | using the original method?
----------------------------------------------------------
ENC    | No                 | Yes
ENC_r  | No                 | No
ENCp   | Yes                | Yes
ENCp_r | Yes                | No
----------------------------------------------------------
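As background for the F values mentioned in the table, the sketch below follows Wright's original formulation of ENC: a homozygosity F is computed per amino acid from its synonymous codon counts, averaged within each degeneracy class, and combined as Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 for the standard genetic code. The function names are assumptions, and the sketch does not estimate missing F values or cap the result, so it is not a substitute for calculate_CUB.pl.

```python
def homozygosity(counts):
    """Wright's F for one amino acid, given the counts of its synonymous codons."""
    n = sum(counts)
    if n <= 1:
        return None  # F is undefined here; this is a "missing F value"
    return (n * sum((x / n) ** 2 for x in counts) - 1) / (n - 1)

def enc(mean_f):
    """Combine the mean F of each degeneracy class into Nc.

    mean_f maps degeneracy (2, 3, 4, 6) to the average F of that class.
    The 2 accounts for Met and Trp, which have a single codon each.
    """
    n_families = {2: 9, 3: 1, 4: 5, 6: 3}  # standard genetic code
    return 2 + sum(k / mean_f[d] for d, k in n_families.items())

even = enc({2: 1 / 2, 3: 1 / 3, 4: 1 / 4, 6: 1 / 6})  # perfectly even usage
extreme = enc({2: 1.0, 3: 1.0, 4: 1.0, 6: 1.0})       # one codon per amino acid
print(round(even, 6), extreme)
```

The two calls bracket the range of ENC: 61 when every synonymous codon is used equally and 20 when each amino acid uses a single codon.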
To calculate these ENC variants, specify the corresponding names in lower case to the option -e of calculate_CUB.pl. For example, the command
calculate_CUB.pl -s data/longest_cds.dmel_5_57.fa -e encp,encp_r \
-b data/intron_base_comp.dmel_r5.tsv \
-o ENC_corrected_GC.dmel_r5.tsv --lite
calculates ENCp and ENCp_r for all the sequences. The option -b specifies the file containing the base composition from which the expected codon frequencies of each analyzed sequence are computed.
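One common way to derive expected codon frequencies from base composition is to multiply the frequencies of a codon's three bases and rescale within each synonymous family. The sketch below illustrates that assumption only: the function name and the toy base frequencies are hypothetical, and neither the layout of the -b file nor the exact calculation used by calculate_CUB.pl is taken from the source.

```python
def expected_codon_freqs(base_freq, family):
    """Expected within-family codon frequencies from independent base frequencies."""
    raw = {}
    for codon in family:
        p = 1.0
        for base in codon:
            p *= base_freq[base]
        raw[codon] = p
    total = sum(raw.values())
    return {codon: p / total for codon, p in raw.items()}

# Toy base composition (assumption), e.g. measured from introns.
base_freq = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
lys_expected = expected_codon_freqs(base_freq, ["AAA", "AAG"])
print(lys_expected)
```

With an AT-rich background, the AT-ending codon is expected more often, so its frequent use alone is not evidence of selection on codon usage.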
AUTHOR
Zhenguo Zhang, <zhangz.sci at gmail.com>
BUGS
Please report any bugs or feature requests to bug-bio-cua at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-CUA. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
CITATION
Zhenguo Zhang and Daven C. Presgraves, CUA: a Flexible and Comprehensive Codon Usage Analyzer (In preparation)
LICENSE AND COPYRIGHT
Copyright 2015 Zhenguo Zhang.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.