NAME
cai_codon.pl - a program to calculate CAI for each codon
VERSION
VERSION = 0.02
SYNOPSIS
This is a program to compute CAI at codon level with different methods. It is part of distribution http://search.cpan.org/dist/Bio-CUA/
# calculate codon CAI by choosing the top 200 highly expressed genes
cai_codon.pl -i seqs.fasta -e gene_expression.tsv -s 200 -o CAI_top200
# the same as above but normalize RSCUs with expected RSCUs under even
# codon usage
cai_codon.pl -i seqs.fasta -e gene_expression.tsv -s 200 -o CAI_top200.by_mean -m mean
# normalize RSCUs by RSCUs derived from the bottom 1000 lowly expressed genes
cai_codon.pl -i seqs.fasta -e gene_expression.tsv -s 200 -o CAI_top200.b1000 -b 1000
OPTIONS
Every option has a short and a long form, e.g., -i and --seq-file for the first option.
In the following text, RSCU stands for relative synonymous codon usage.
Mandatory options
- -i/--seq-file
-
a file containing protein-coding sequences in fasta format or a list of codon counts. The latter lists the counts of codons in the format 'codon1<tab>#1', with one codon per line and <tab> as the field delimiter. The program distinguishes these two formats by the first non-empty, non-comment line: if it starts with '>', the format is regarded as 'fasta'; otherwise codon counts are expected.
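The format check described above can be sketched as follows. This is an illustrative Python sketch, not the script's own Perl code: the function names are hypothetical, and treating lines that start with '#' as comments is an assumption, since the comment marker is not stated here.

```python
def detect_input_format(lines, comment_prefix="#"):
    """Return 'fasta' or 'codon_counts' based on the first informative line."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' as comment marker is an assumption).
        if not stripped or stripped.startswith(comment_prefix):
            continue
        return "fasta" if stripped.startswith(">") else "codon_counts"
    raise ValueError("input contains no data lines")


def parse_codon_counts(lines, comment_prefix="#"):
    """Parse 'codon<tab>count' lines into a dict (hypothetical helper)."""
    counts = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue
        codon, count = stripped.split("\t")
        codon = codon.upper()
        counts[codon] = counts.get(codon, 0) + int(count)
    return counts
```

For example, `detect_input_format(open("seqs.fasta"))` would report 'fasta' for a file whose first data line begins with '>'.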
Auxiliary options
- -e/--exp-file
-
a file containing sequence IDs and their expression levels in the following format:
seq-id1<tab>0.67
seq-id2<tab>2.57
... ...
Each line contains one sequence ID and that sequence's gene expression level (RNA, protein, or other), separated by a tab. The sequence IDs must match the IDs in the sequence file specified above.
From this file, highly expressed genes are selected according to their gene expression rank. See the options below.
If this option is omitted, all the sequences in the above sequence file are used for calculating CAIs.
- -s/--select
-
determine how many sequences are chosen from the above expression file (by option --exp-file). Available formats are:
all, all IDs in the expression file are chosen.
0.##, a fraction, say 0.30, then the top 30% most highly expressed genes are chosen.
###, an integer, say 200, then the top 200 highly expressed genes are chosen.
Default is all. If the option --exp-file is omitted, this option has no effect.
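The three --select formats can be sketched as below. This is a hedged Python illustration, not the script's implementation: `parse_expression` and `select_top` are hypothetical names, and rounding a fractional selection down to a whole number of genes is an assumption.

```python
def parse_expression(lines):
    """Read 'seq-id<tab>expression' lines into a dict of floats."""
    expr = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        seq_id, level = line.split("\t")
        expr[seq_id] = float(level)
    return expr


def select_top(expr, spec="all"):
    """Return sequence IDs chosen by a --select style spec: 'all', '0.##', or '###'."""
    # Rank IDs from highest to lowest expression.
    ranked = sorted(expr, key=expr.get, reverse=True)
    if spec == "all":
        return ranked
    value = float(spec)
    if 0 < value < 1:
        # Fraction of genes; rounding down is an assumption.
        n = int(len(ranked) * value)
    else:
        n = int(value)
    return ranked[:n]
```

Selecting from the bottom of the ranking, as --background does for lowly expressed genes, would be the same sketch with `reverse=False`.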
- -b/--background
-
specify background data (e.g., lowly expressed genes) from which the background codon usage is derived. Then each codon's RSCU from highly expressed genes is divided by the codon's RSCU from the background data; these normalized RSCUs are used for CAI calculation. This method is termed 'background-normalization'.
How to specify background data: 0.##, ###, or a filename. The first two formats choose a fraction or a number of the most lowly expressed genes in the expression file given by --exp-file; see option --select for details of these two formats. The last format names a fasta-formatted file of protein-coding sequences, or a list of codon counts, which is analyzed for background codon usage; see option --seq-file for format details. When this option is given, bCAI is calculated.
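A minimal Python sketch of background-normalization follows. It assumes that RSCU here means each codon's fraction among its synonymous codons, and that the normalized ratios are then scaled by the largest ratio within each amino acid (the max method). The function names, the `families` mapping from amino acid to codons, and the choice to skip codons with zero background usage are all illustrative assumptions, not the script's actual behavior.

```python
def rscu(counts, families):
    """Fraction of each amino acid's codon usage contributed by each codon."""
    result = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        for c in codons:
            result[c] = counts.get(c, 0) / total if total else 0.0
    return result


def background_normalized_cai(high_counts, bg_counts, families):
    """Divide RSCUs from highly expressed genes by background RSCUs,
    then scale by the per-amino-acid maximum."""
    high = rscu(high_counts, families)
    bg = rscu(bg_counts, families)
    cai = {}
    for codons in families.values():
        # Codons unseen in the background are skipped (an assumption).
        ratios = {c: high[c] / bg[c] for c in codons if bg[c] > 0}
        if not ratios:
            continue
        top = max(ratios.values())
        for c, r in ratios.items():
            cai[c] = r / top if top else 0.0
    return cai
```

With counts TTT=20, TTC=80 in highly expressed genes and TTT=50, TTC=50 in the background, the ratios are 0.4 and 1.6, giving values of 0.25 for TTT and 1.0 for TTC.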
- -g/--gc-id
-
ID of genetic code table. See NCBI genetic code for valid IDs. Default is 1, i.e., standard genetic code.
- -m/--method
-
method to calculate CAI: max or mean. The former is used by <Sharp and Li, 1987, NAR>, in which each codon's RSCU is divided by the maximum RSCU among its synonymous codons to derive CAI. The 'mean' method divides each codon's RSCU by the expected RSCU under even codon usage to get CAI. For example, for an amino acid with four synonymous codons, the expected RSCU is 0.25 for each codon, so all observed RSCUs of this amino acid's codons are divided by 0.25. These two choices produce CAI and mCAI, respectively.
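The difference between the two methods can be sketched in Python as below. This is an illustration under stated assumptions rather than the script's code: `codon_weights` and the `families` mapping are hypothetical, and RSCU is taken as a codon's fraction among its synonymous codons, which is what the 0.25 expectation for a four-codon amino acid implies.

```python
def codon_weights(counts, families, method="max"):
    """Per-codon CAI values using either the 'max' or the 'mean' method."""
    weights = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid not observed; skipping is an assumption
        fractions = {c: counts.get(c, 0) / total for c in codons}
        if method == "max":
            # Divide by the largest RSCU among the synonymous codons.
            denom = max(fractions.values())
        elif method == "mean":
            # Divide by the expected RSCU under even usage, e.g. 0.25 for 4 codons.
            denom = 1.0 / len(codons)
        else:
            raise ValueError("method must be 'max' or 'mean'")
        for c, f in fractions.items():
            weights[c] = f / denom
    return weights
```

For counts GCT=40, GCC=30, GCA=20, GCG=10, the max method gives 1.0, 0.75, 0.5, 0.25, while the mean method gives 1.6, 1.2, 0.8, 0.4, so values from the mean method can exceed 1.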
If option --background is activated, the 'background-normalization' method always uses the max method to get final CAIs.
- -o/--out-file
-
file to store the result. Default is standard output, usually screen.
AUTHOR
Zhenguo Zhang, <zhangz.sci at gmail.com>
BUGS
Please report any bugs or feature requests to bug-bio-cua at rt.cpan.org
or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-CUA. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this class with the perldoc command.
perldoc Bio::CUA
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2015 Zhenguo Zhang.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.