NAME
Bio::CUA::CUB::Builder -- A module to calculate codon usage bias (CUB) metrics at codon level and other parameters
SYNOPSIS
use Bio::CUA::CUB::Builder;
# initialize the builder
my $builder = Bio::CUA::CUB::Builder->new(
codon_table => 1 ); # using standard genetic code
# calculate RSCU for each codon, and result is stored in "rscu.out" as
# well as returned as a hash reference
my $rscuHash = $builder->build_rscu("seqs.fa",undef, 0.5,"rscu.out");
# calculate CAI for each codon, normalizing RSCU values of codons
# for each amino acid by the expected RSCUs under even usage,
# rather than the maximal RSCU used by the traditional CAI method.
my $caiHash = $builder->build_cai($codonList,2,'mean',"cai.out");
# calculate tAI for each codon
my $taiHash = $builder->build_tai("tRNA_copy_number.txt","tai.out", undef, 1);
DESCRIPTION
Codon usage bias (CUB) can be represented at two levels, codon and sequence. The latter is often computed as the geometric mean of the values of the sequence's codons. This module calculates CUB metrics at codon level.
Supported CUB metrics include CAI (codon adaptation index), tAI (tRNA adaptation index), RSCU (relative synonymous codon usage), and their variants. See the methods below for details.
The output can be stored in a file which is then used by methods in Bio::CUA::CUB::Calculator to calculate CUB indices for each protein-coding sequence.
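As an illustration of how codon-level values relate to a sequence-level index, the following is a minimal sketch of a geometric mean over the codons of one CDS. The helper name `seq_level_cub` and the example weights are hypothetical; this is not the code used by Bio::CUA::CUB::Calculator, which may treat missing or excluded codons differently.

```perl
use strict;
use warnings;

# Hypothetical helper: geometric mean of per-codon weights over one CDS.
# $weights maps codon => positive weight; codons without a weight are skipped.
sub seq_level_cub {
    my ($seq, $weights) = @_;
    my ($log_sum, $n) = (0, 0);
    for my $codon (uc($seq) =~ /(...)/g) {    # split into in-frame triplets
        my $w = $weights->{$codon};
        next unless defined $w && $w > 0;
        $log_sum += log($w);                  # sum logs to avoid underflow
        $n++;
    }
    return $n ? exp($log_sum / $n) : undef;   # undef if no usable codon
}

# Made-up weights, for illustration only
my %example_weights = (GCT => 1.0, GCC => 0.25, AAA => 1.0, AAG => 0.5);
printf "%.4f\n", seq_level_cub("GCTGCCAAAAAG", \%example_weights);
```

Summing logarithms rather than multiplying weights keeps the computation stable for long sequences.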
METHODS
new
Title : new
Usage : $analyzer = Bio::CUA::CUB::Builder->new(-codon_table => 1)
Function: initiate the analyzer
Returns : an object
Args : accepted options are as follows
B<options needed for building parameters of all CUB indices>
-codon_table
-
the genetic code table applied for following sequence analyses. It can be specified by an integer (genetic code table id), an object of Bio::CUA::CodonTable, or a map-file. See the method "new" in Bio::CUA::Summarizer for details.
B<options needed for building tAI index's parameters>
-a_to_i
-
a switch option. If true (any nonzero value), all 'A' nucleotides at the 1st position of the anticodon will be regarded as I (inosine), which can pair with more nucleotides at the codon's wobble position (A, T, C at the 3rd position). The default is true.
-no_atg
-
a switch option to indicate whether ATG codons should be excluded in tAI calculation. Default is true, following I<dos Reis, et al., 2004, NAR>. To include ATG in tAI calculation, provide '0' here.
-wobble
-
reference to a hash containing anticodon-codon base pairs at the wobble position ('U' is equivalent to 'T'), such as
    %wobblePairs = (
        A => [qw/T/],
        C => [qw/G/],
        T => [qw/A G/],
        G => [qw/C T/],
        I => [qw/A C T/]
    ); # this is the default setting
Hash keys are the bases in anticodons and hash values are the paired bases at the codon's 3rd position. This option is optional; the default value is the one shown in the example above.
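To make the -wobble and -a_to_i options concrete, here is a minimal sketch that lists the codons an anticodon can read under the default pairing table. The helper `codons_for_anticodon` is hypothetical and assumes anticodons are written 5' to 3', so that the anticodon's 1st base pairs with the codon's 3rd base; it is not the module's internal routine.

```perl
use strict;
use warnings;

# Default pairing table as given in the text (anticodon base => codon bases)
my %wobblePairs = (
    A => [qw/T/], C => [qw/G/], T => [qw/A G/],
    G => [qw/C T/], I => [qw/A C T/],
);
my %comp = (A => 'T', T => 'A', G => 'C', C => 'G');

# Hypothetical helper: codons readable by one anticodon (written 5'->3').
sub codons_for_anticodon {
    my ($anticodon, $a_to_i) = @_;
    (my $ac = uc $anticodon) =~ tr/U/T/;      # 'U' is equivalent to 'T'
    my ($b1, $b2, $b3) = split //, $ac;
    $b1 = 'I' if $a_to_i && $b1 eq 'A';       # treat 5' A as inosine
    # codon positions 1 and 2 pair strictly with anticodon positions 3 and 2
    my $prefix = $comp{$b3} . $comp{$b2};
    return map { $prefix . $_ } @{ $wobblePairs{$b1} };
}

print join(',', codons_for_anticodon('AGC', 1)), "\n";   # with A-to-I
print join(',', codons_for_anticodon('AGC', 0)), "\n";   # without A-to-I
```

With -a_to_i enabled, a single tRNA species with 'A' at the wobble position contributes to three codons instead of one, which is why the switch matters for tAI.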
no_atg
Title : no_atg
Usage : $status = $self->no_atg([$newVal])
Function: get/set the status whether ATG should be excluded in tAI
calculation.
Returns : current status after updating
Args : optional. 1 for true, 0 for false
build_rscu
Title : build_rscu
Usage : $ok = $self->build_rscu($input,[$minTotal,$pseudoCnt,$output]);
Function: calculate RSCU values for all sense codons
Returns : reference of a hash using the format 'codon => RSCU value'.
return undef if failed.
Args : accepted arguments are as follows (note: not as hash):
input
-
name of a file containing CDS sequences of the genes of interest in fasta format, or a sequence object with a method I<seq> to extract the sequence string, or a plain sequence string, or a reference to a hash containing codon counts with a structure like I<{ AGC => 50, GTC => 124}>.
output
-
optional, name of the file to store the result. If omitted, no result will be written.
minTotal
-
optional, minimal count of an amino acid in sequences; if observed count is smaller than this minimum, all codons of this amino acid would be assigned equal RSCU values. This is to reduce sampling errors in rarely observed amino acids. Default value is 5.
pseudoCnt
-
optional. Pseudo-counts for unobserved codons. Default is 0.5.
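The roles of minTotal and pseudoCnt can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the helper `rscu_sketch` is hypothetical, the synonymous codon families are passed in by the caller instead of being derived from a codon table, the pseudo-count is assumed to replace zero counts only, and "equal RSCU values" is assumed to mean 1 for every codon of the amino acid.

```perl
use strict;
use warnings;

# Hypothetical helper: RSCU = codon count / mean count of its synonymous family.
# $families maps amino acid => [synonymous codons] (supplied by the caller).
sub rscu_sketch {
    my ($counts, $families, $minTotal, $pseudoCnt) = @_;
    $minTotal  = 5   unless defined $minTotal;
    $pseudoCnt = 0.5 unless defined $pseudoCnt;
    my %rscu;
    for my $aa (keys %$families) {
        my @codons   = @{ $families->{$aa} };
        my $observed = 0;
        $observed += ($counts->{$_} || 0) for @codons;
        if ($observed < $minTotal) {
            $rscu{$_} = 1 for @codons;     # rarely seen amino acid: even usage
            next;
        }
        # assumption: unobserved codons receive the pseudo-count
        my %adj   = map { $_ => ($counts->{$_} || $pseudoCnt) } @codons;
        my $total = 0;
        $total += $_ for values %adj;
        my $mean = $total / @codons;
        $rscu{$_} = $adj{$_} / $mean for @codons;
    }
    return \%rscu;
}

my $example_rscu = rscu_sketch(
    { AAA => 30, AAG => 10, GCT => 2 },
    { K => [qw/AAA AAG/], A => [qw/GCT GCC GCA GCG/] },
);
printf "%s\t%.3f\n", $_, $example_rscu->{$_} for sort keys %$example_rscu;
```

In this example alanine is observed only twice, below the default minTotal of 5, so all four of its codons receive the same value.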
build_cai
Title : build_cai
Usage : $ok = $self->build_cai($input,[$minTotal,$norm_method,$output]);
Function: calculate CAI values for all sense codons
Returns : reference of a hash in which codons are keys and CAI values
are values. return undef if failed.
Args : accepted arguments are as follows:
input
-
name of a file containing CDS sequences of the genes of interest in fasta format, or a sequence object with a method I<seq> to derive the sequence string, or a plain sequence string, or a reference to a hash containing codon counts with a structure like I<{ AGC => 50, GTC => 124}>.
minTotal
-
optional, minimal codon count for an amino acid; if observed count is smaller than this count, all codons of this amino acid would be assigned equal CAI values. This is to reduce sampling errors in rarely observed amino acids. Default value is 5.
norm_method
-
optional, indicating how to normalize RSCU to get CAI values. Valid values are 'max' and 'mean'; the former represents the original method used by I<Sharp and Li, 1987, NAR>, i.e., dividing all RSCUs by the maximum of an amino acid, while 'mean' indicates dividing RSCU by expected average fraction assuming even usage of all codons, i.e., 0.5 for amino acids encoded by 2 codons, 0.25 for amino acids encoded by 4 codons, etc. The CAI metric determined by the latter method is named I<mCAI>. mCAI can assign different CAI values for the most preferred codons of different amino acids, which otherwise would be the same by CAI (i.e., 1).
output
-
optional. If provided, result will be stored in the file specified by this argument.
Note: codons which are not observed will be assigned a count of
0.5, and codons which are not degenerate (such as AUG and UGG in
the standard genetic code table) are excluded. These are the defaults of
the paper I<Sharp and Li, 1987, NAR>. Here you can also reduce
sampling error by setting the parameter $minTotal.
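The 'max' normalization can be sketched as below. Because RSCU is a count divided by the family mean, dividing RSCU by the family's maximal RSCU is the same as dividing each count by the family's maximal count. The helper `cai_max_weights` and the caller-supplied codon families are hypothetical, the minTotal handling is left out, and the 'mean' variant is not shown; this is an illustration rather than the module's code.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: codon weights under the 'max' normalization.
# $families maps amino acid => [synonymous codons] (supplied by the caller).
sub cai_max_weights {
    my ($counts, $families) = @_;
    my %w;
    for my $aa (keys %$families) {
        my @codons = @{ $families->{$aa} };
        next if @codons < 2;                  # non-degenerate codons are excluded
        # unobserved codons are assigned a count of 0.5
        my %adj = map { $_ => ($counts->{$_} || 0.5) } @codons;
        my $max = max(values %adj);
        $w{$_} = $adj{$_} / $max for @codons; # most used codon gets weight 1
    }
    return \%w;
}

my $example_cai = cai_max_weights(
    { AAA => 30, AAG => 10, TTT => 8, ATG => 7 },
    { K => [qw/AAA AAG/], F => [qw/TTT TTC/], M => [qw/ATG/] },
);
printf "%s\t%.4f\n", $_, $example_cai->{$_} for sort keys %$example_cai;
```

Under this normalization the most used codon of every amino acid has weight 1, which is the property that the 'mean' method described above relaxes.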
build_b_cai
Title : build_b_cai
Usage : $caiHash =
$self->build_b_cai($input,$background,[$minTotal,$output]);
Function: calculate CAI values for all sense codons. Instead of
normalizing RSCUs by maximal RSCU or expected fractions, each RSCU value is
normalized by the corresponding background RSCU, then these
normalized RSCUs are used to calculate CAI values.
Returns : reference of a hash in which codons are keys and CAI values
are values. return undef if failed.
Args : accepted arguments are as follows:
input
-
name of a file containing CDS sequences of the genes of interest in fasta format, or a sequence object with a method I<seq> to derive the sequence string, or a plain sequence string, or a reference to a hash containing codon counts with a structure like I<{ AGC => 50, GTC => 124}>.
background
-
background data from which background codon usage (RSCUs) is computed. Acceptable formats are the same as the above argument 'input'.
minTotal
-
optional, minimal codon count for an amino acid; if observed count is smaller than this count, all codons of this amino acid would be assigned equal RSCU values. This is to reduce sampling errors in rarely observed amino acids. Default value is 5.
output
-
optional. If provided, result will be stored in the file specified by this argument.
Note: codons which are not observed will be assigned a count of
0.5, and codons which are not degenerate (such as AUG and UGG in
the standard genetic code table) are excluded.
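The background normalization step can be sketched as a per-codon ratio of two RSCU tables. The helper `background_normalize` is hypothetical and covers only the division described above; whether the module rescales the ratios further before reporting them as CAI values is not stated here, so none is applied in this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: divide each codon's RSCU by its background RSCU.
# Both arguments are hash references of the form codon => RSCU value.
sub background_normalize {
    my ($rscu, $bg_rscu) = @_;
    my %norm;
    for my $codon (keys %$rscu) {
        my $bg = $bg_rscu->{$codon};
        next unless defined $bg && $bg > 0;   # skip codons without background
        $norm{$codon} = $rscu->{$codon} / $bg;
    }
    return \%norm;
}

# Made-up RSCU values, for illustration only
my $example_norm = background_normalize(
    { AAA => 1.5,  AAG => 0.5 },     # genes of interest
    { AAA => 0.75, AAG => 1.25 },    # background
);
printf "%s\t%.2f\n", $_, $example_norm->{$_} for sort keys %$example_norm;
```

A ratio above 1 marks a codon that is used more often in the input genes than in the background, regardless of whether it is the most frequent codon of its amino acid.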
build_tai
Title : build_tai
Usage : $taiHash =
$self->build_tai($input,[$output,$selective_constraints, $kingdom]);
Function: build tAI values for all sense codons
Returns : reference of a hash in which codons are keys and tAI indices
are values. return undef if failed. See Formulas 1 and 2 in I<dos
Reis, et al., 2004, NAR> to see how they are computed.
Args : accepted arguments are as follows:
input
-
name of a file containing tRNA copies/abundance in the format 'anticodon<tab>count' per line, where 'anticodon' is anticodon in the tRNA and count can be the tRNA gene copy number or abundance.
output
-
optional. If provided, result will be stored in the file specified by this argument.
selective_constraints
-
optional, reference to a hash containing wobble base pairs and their selective constraints relative to Watson-Crick base pairs. The format is like this:
    $selective_constraints = {
        ... ... ...
        'C-G' => 0,
        'G-T' => 0.41,
        'I-C' => 0.28,
        ... ... ...
    };
The keys follow the 'anticodon-codon' order, and the values are the selective constraints. The smaller the constraint, the stronger the pairing, so all Watson-Crick pairings have value 0. If this option is omitted, the values are searched for in the 'input' file, in a section that follows the anticodon section and starts with a line '>SC'. If they are not in the input file, the values in Table 2 of I<dos Reis, et al., 2004, NAR> are used.
kingdom
-
kingdom = 1 for prokaryota and 0 or undef for eukaryota, which affects the calculation for the bacterial isoleucine ATA codon. Default is undef, for eukaryota.
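The input format and the weight calculation can be sketched together. For each codon, the absolute adaptiveness sums (1 - constraint) x tRNA count over the anticodons that read it, and the result is divided by the maximum over all codons. The helpers `parse_trna_counts` and `tai_weights` are hypothetical, and the sketch leaves out the -a_to_i, -no_atg and kingdom handling, the parsing of the '>SC' section, and the treatment of codons with zero weight. It illustrates the formulas, not the module's code.

```perl
use strict;
use warnings;
use List::Util qw(max);

my %wobble = (A => [qw/T/], C => [qw/G/], T => [qw/A G/],
              G => [qw/C T/], I => [qw/A C T/]);
my %comp   = (A => 'T', T => 'A', G => 'C', C => 'G');

# Hypothetical helper: read 'anticodon<tab>count' lines, stopping at '>SC'.
sub parse_trna_counts {
    my ($text) = @_;
    my %count;
    for my $line (split /\n/, $text) {
        last if $line =~ /^>SC/;
        my ($ac, $n) = $line =~ /^([ACGTU]{3})\t(\S+)/i or next;
        $ac = uc $ac;
        $ac =~ tr/U/T/;
        $count{$ac} += $n;
    }
    return \%count;
}

# Hypothetical helper: W = sum of (1 - s) * count per codon, then w = W / max W.
# $sc maps 'anticodonBase-codonBase' => selective constraint s.
sub tai_weights {
    my ($count, $sc) = @_;
    my %W;
    for my $ac (keys %$count) {
        my ($b1, $b2, $b3) = split //, $ac;
        for my $c3 (@{ $wobble{$b1} }) {
            my $s = $sc->{"$b1-$c3"};
            die "no selective constraint for $b1-$c3\n" unless defined $s;
            $W{ $comp{$b3} . $comp{$b2} . $c3 } += (1 - $s) * $count->{$ac};
        }
    }
    return {} unless %W;
    my $Wmax = max(values %W);
    return { map { $_ => $W{$_} / $Wmax } keys %W };
}

my $example_counts = parse_trna_counts("GAA\t10\nCAA\t4\n");   # made-up counts
my $example_tai    = tai_weights($example_counts,
    { 'G-C' => 0, 'G-T' => 0.41, 'C-G' => 0 });
printf "%s\t%.2f\n", $_, $example_tai->{$_} for sort keys %$example_tai;
```

The sketch dies on a pairing that has no constraint value instead of silently assuming 0, so a missing entry in the caller's table is noticed early.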
AUTHOR
Zhenguo Zhang, <zhangz.sci at gmail.com>
BUGS
Please report any bugs or feature requests to bug-bio-cua at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-CUA. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Bio::CUA::CUB::Builder
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2015 Zhenguo Zhang.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.