NAME

Bio::CUA::CUB::Builder -- A module to calculate codon usage bias (CUB) metrics at codon level and other parameters

SYNOPSIS

use Bio::CUA::CUB::Builder;

# initialize the builder
my $builder = Bio::CUA::CUB::Builder->new(
              codon_table => 1 ); # using standard genetic code

# calculate RSCU for each codon, and result is stored in "rscu.out" as
# well as returned as a hash reference
my $rscuHash = $builder->build_rscu("seqs.fa",undef, 0.5,"rscu.out");

# calculate CAI for each codon, normalizing RSCU values of codons
# for each amino acid by the expected RSCUs under even usage,
# rather than the maximal RSCU used by the traditional CAI method.
my $caiHash = $builder->build_cai($codonList,2,'mean',"cai.out");

# calculate tAI for each codon
my $taiHash = $builder->build_tai("tRNA_copy_number.txt","tai.out", undef, 1);

DESCRIPTION

Codon usage bias (CUB) can be represented at two levels, codon and sequence. The latter is often computed as the geometric mean of the values of the sequence's codons. This module calculates CUB metrics at the codon level.
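To illustrate how the two levels relate, here is a minimal sketch that turns codon-level values into a sequence-level score by taking their geometric mean. The weights in %codon_weight are made-up numbers and the helper sequence_score is hypothetical; Bio::CUA::CUB::Calculator may handle missing codons differently.

```perl
use strict;
use warnings;

# Hypothetical per-codon values (illustrative numbers only).
my %codon_weight = (GCT => 1.0, GCC => 0.5, AAA => 0.25, AAG => 1.0);

# Geometric mean of the codon-level values over a coding sequence.
# Codons without a value (here ATG) are skipped in this sketch.
sub sequence_score {
    my ($seq, $weights) = @_;
    my ($log_sum, $n) = (0, 0);
    for my $codon (uc($seq) =~ /(...)/g) {
        my $w = $weights->{$codon};
        next unless defined $w && $w > 0;
        $log_sum += log($w);
        $n++;
    }
    return $n ? exp($log_sum / $n) : undef;
}

my $score = sequence_score('ATGGCTGCCAAAAAG', \%codon_weight);
printf "%.4f\n", $score;
```

Summing logarithms instead of multiplying the values directly avoids numeric underflow on long sequences.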

Supported CUB metrics include CAI (codon adaptation index), tAI (tRNA adaptation index), RSCU (relative synonymous codon usage), and their variants. See the methods below for details.

The output can be stored in a file, which can then be used by methods in Bio::CUA::CUB::Calculator to calculate CUB indices for each protein-coding sequence.

METHODS

new

Title   : new
Usage   : $analyzer = Bio::CUA::CUB::Builder->new(-codon_table => 1)
Function: initiate the analyzer
Returns : an object
Args    : accepted options are as follows

B<options needed for building parameters of all CUB indices>
-codon_table

the genetic code table applied for following sequence analyses. It can be specified by an integer (genetic code table id), an object of Bio::CUA::CodonTable, or a map-file. See the method "new" in Bio::CUA::Summarizer for details.

B<options needed for building tAI index's parameters>
-a_to_i
a switch option. If true (any nonzero value), every
'A' nucleotide at the 1st position of an anticodon is regarded as I
(inosine), which can pair with more nucleotides at the codon's wobble
position (A, T, C at the 3rd position). The default is true.
-no_atg
a switch option to indicate whether ATG codons should be
excluded in tAI calculation. Default is true, following I<dos Reis,
et al., 2004, NAR>. To include ATG in tAI calculation, provide '0' here.
-wobble
 reference to a hash containing anticodon-codon base pairs at the
 wobble position, such as ('U' is equivalent to 'T')
 %wobblePairs = (
     A => [qw/T/],
     C => [qw/G/],
     T => [qw/A G/],
     G => [qw/C T/],
     I => [qw/A C T/]
     ); # this is the default setting
 Hash keys are the bases in anticodons and hash values are the paired
 bases at the codons' 3rd positions. This option is optional; the
 default value is the example shown above.
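
As an illustration of how the wobble table and the a_to_i switch interact, the sketch below lists the codons that an anticodon could read. The helper codons_for_anticodon is hypothetical and is not part of this module; it assumes anticodons are written 5' to 3', so that the anticodon's 1st base faces the codon's 3rd base.

```perl
use strict;
use warnings;

# Default wobble pairs as listed above: anticodon base => codon 3rd bases.
my %wobble = (
    A => [qw/T/],
    C => [qw/G/],
    T => [qw/A G/],
    G => [qw/C T/],
    I => [qw/A C T/],
);
my %complement = (A => 'T', T => 'A', G => 'C', C => 'G');

# List the codons an anticodon (written 5'->3') could read.
sub codons_for_anticodon {
    my ($anticodon, $a_to_i) = @_;
    (my $ac = uc $anticodon) =~ tr/U/T/;    # 'U' is equivalent to 'T'
    my ($b1, $b2, $b3) = split //, $ac;
    $b1 = 'I' if $a_to_i && $b1 eq 'A';     # treat a 5' A as inosine
    # Codon positions 1 and 2 pair by Watson-Crick rules with
    # anticodon positions 3 and 2.
    my $prefix = $complement{$b3} . $complement{$b2};
    return map { $prefix . $_ } @{ $wobble{$b1} };
}

my @gaa      = codons_for_anticodon('GAA', 1);
my @agc_on   = codons_for_anticodon('AGC', 1);
my @agc_off  = codons_for_anticodon('AGC', 0);
print "GAA: @gaa\n";
print "AGC (a_to_i on): @agc_on\n";
print "AGC (a_to_i off): @agc_off\n";
```

With a_to_i on, the anticodon AGC reads three codons through inosine; with it off, only the Watson-Crick partner remains.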

no_atg

Title   : no_atg
Usage   : $status = $self->no_atg([$newVal])
Function: get/set the status whether ATG should be excluded in tAI
calculation.
Returns : current status after updating
Args    : optional. 1 for true, 0 for false

build_rscu

Title   : build_rscu
Usage   : $rscuHash = $self->build_rscu($input,[$minTotal,$pseudoCnt,$output]);
Function: calculate RSCU values for all sense codons
Returns : reference of a hash using the format 'codon => RSCU value'.
return undef if failed.
Args    : accepted arguments are as follows (note: not as hash):
input
name of a file containing CDS sequences in FASTA format for the genes
of interest, or a sequence object with a method I<seq> to extract the sequence
string, or a plain sequence string, or a reference to a hash containing
codon counts with a structure like I<{ AGC => 50, GTC => 124}>.
minTotal
optional, minimal count of an amino acid in the sequences; if the observed
count is smaller than this minimum, all codons of this amino acid are
assigned equal RSCU values. This reduces sampling errors in
rarely observed amino acids. Default value is 5.
pseudoCnt
optional, pseudo-count assigned to unobserved codons. Default is 0.5.
output
optional, name of the file to store the result. If omitted,
no result is written.
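
The arithmetic behind these arguments can be sketched as follows. This is not the module's implementation: the helper rscu_sketch and the two-family table %family are hypothetical, and the sketch assumes that the minTotal check is applied to the observed counts before any pseudo-count is added.

```perl
use strict;
use warnings;

# Synonymous codon families for two amino acids (standard genetic code).
my %family = (
    Lys => [qw/AAA AAG/],
    Ala => [qw/GCT GCC GCA GCG/],
);

# RSCU = codon count / (family total / number of synonymous codons).
sub rscu_sketch {
    my ($counts, $min_total, $pseudo) = @_;
    $min_total = 5   unless defined $min_total;
    $pseudo    = 0.5 unless defined $pseudo;
    my %rscu;
    for my $aa (keys %family) {
        my @codons   = @{ $family{$aa} };
        my $observed = 0;
        $observed += ($counts->{$_} || 0) for @codons;
        if ($observed < $min_total) {
            # Too few observations: assign equal values (even usage).
            $rscu{$_} = 1 for @codons;
            next;
        }
        # Unobserved codons receive the pseudo-count.
        my %adj   = map { $_ => ($counts->{$_} || $pseudo) } @codons;
        my $total = 0;
        $total += $_ for values %adj;
        $rscu{$_} = $adj{$_} / ($total / scalar(@codons)) for @codons;
    }
    return \%rscu;
}

my $rscu = rscu_sketch({ AAA => 30, AAG => 10, GCT => 2, GCC => 1 });
printf "%s\t%.2f\n", $_, $rscu->{$_} for sort keys %$rscu;
```

Here lysine is observed 40 times, so AAA gets 30 / 20 = 1.5, while alanine is observed only 3 times and falls back to equal values.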

build_cai

Title   : build_cai
Usage   : $caiHash = $self->build_cai($input,[$minTotal,$norm_method,$output]);
Function: calculate CAI values for all sense codons
Returns : reference of a hash in which codons are keys and CAI values
are values. return undef if failed.
Args    : accepted arguments are as follows:
input
name of a file containing CDS sequences in FASTA format for the genes
of interest, or a sequence object with a method I<seq> to derive the sequence
string, or a plain sequence string, or a reference to a hash containing
codon counts with a structure like I<{ AGC => 50, GTC => 124}>.
minTotal
optional, minimal codon count for an amino acid; if observed
count is smaller than this count, all codons of this amino acid would 
be assigned equal CAI values. This is to reduce sampling errors in
rarely observed amino acids. Default value is 5.
norm_method
optional, indicating how to normalize RSCU to get CAI
values. Valid values are 'max' and 'mean'; the former represents the
original method used by I<Sharp and Li, 1987, NAR>, i.e., dividing
all RSCUs by the maximum of an amino acid, while 'mean' indicates
dividing RSCU by expected average fraction assuming even usage of
all codons, i.e., 0.5 for amino acids encoded by 2 codons, 0.25 for
amino acids encoded by 4 codons, etc. The latter method is able to
give different CAI values for the most preferred codons of different
amino acids, which otherwise would be the same (i.e., 1).
output
optional. If provided, result will be stored in the file
specified by this argument.
Note: codons which are not observed are assigned a count of
0.5, and codons which are not degenerate (such as AUG and UGG in
the standard genetic code table) are excluded. These are the defaults of
the paper I<Sharp and Li, 1987, NAR>. You can also reduce
sampling error by setting the parameter $minTotal.
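
The difference between the two normalization methods can be sketched with made-up counts. The helper cai_sketch is hypothetical. The 'max' branch follows the usual Sharp and Li definition; the 'mean' branch is one reading of the description above (observed fraction divided by the even-usage fraction 1/n) and is an assumption, so the module's actual arithmetic may differ.

```perl
use strict;
use warnings;

# Codon counts grouped by amino acid (made-up numbers).
my %count_by_aa = (
    Lys => { AAA => 30, AAG => 10 },
    Ala => { GCT => 40, GCC => 24, GCA => 12, GCG => 4 },
);

sub cai_sketch {
    my ($count_by_aa, $norm_method) = @_;
    my %cai;
    for my $aa (keys %$count_by_aa) {
        my $counts = $count_by_aa->{$aa};
        my $n      = scalar keys %$counts;
        my $total  = 0;
        $total += $_ for values %$counts;
        my ($max) = sort { $b <=> $a } values %$counts;
        for my $codon (keys %$counts) {
            if ($norm_method eq 'max') {
                # RSCU / max RSCU, which equals count / max count
                # within one amino acid.
                $cai{$codon} = $counts->{$codon} / $max;
            }
            else {
                # ASSUMED reading of 'mean': observed fraction divided
                # by the even-usage fraction 1/n (0.5, 0.25, ...).
                $cai{$codon} = ($counts->{$codon} / $total) / (1 / $n);
            }
        }
    }
    return \%cai;
}

my $by_max  = cai_sketch(\%count_by_aa, 'max');
my $by_mean = cai_sketch(\%count_by_aa, 'mean');
printf "%s\tmax=%.2f\tmean=%.2f\n", $_, $by_max->{$_}, $by_mean->{$_}
    for sort keys %$by_max;
```

Under 'max' the preferred codons AAA and GCT both score 1, whereas under this reading of 'mean' they score 1.5 and 2.0, which reflects the stronger preference within the four-codon family.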

build_b_cai

Title   : build_b_cai
Usage   : $caiHash =
$self->build_b_cai($input,$background,[$minTotal,$output]);
Function: calculate CAI values for all sense codons. Instead of
normalizing RSCUs by maximal RSCU or expected fractions, each RSCU value is
normalized by the corresponding background RSCU, then these
normalized RSCUs are used to calculate CAI values.
Returns : reference of a hash in which codons are keys and CAI values
are values. return undef if failed.
Args    : accepted arguments are as follows:
input
name of a file containing CDS sequences in FASTA format for the genes
of interest, or a sequence object with a method I<seq> to derive the sequence
string, or a plain sequence string, or a reference to a hash containing
codon counts with a structure like I<{ AGC => 50, GTC => 124}>.
background
background data from which background codon usage (RSCUs)
is computed. Acceptable formats are the same as the above argument
'input'.
minTotal
optional, minimal codon count for an amino acid; if observed
count is smaller than this count, all codons of this amino acid would 
be assigned equal RSCU values. This is to reduce sampling errors in
rarely observed amino acids. Default value is 5.
output
optional. If provided, result will be stored in the file
specified by this argument.
Note: codons which are not observed are assigned a count of
0.5, and codons which are not degenerate (such as AUG and UGG in
the standard genetic code table) are excluded.
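
A rough sketch of background normalization for a single amino acid follows. The RSCU values are made up and the helper b_cai_sketch is hypothetical. The final step, rescaling the ratios by the largest ratio within the amino acid, is an assumption made for illustration; the text above does not spell out how the normalized RSCUs are converted to CAI values.

```perl
use strict;
use warnings;

# RSCU values for the two lysine codons (made-up numbers).
my %input_rscu      = (AAA => 1.5, AAG => 0.5);
my %background_rscu = (AAA => 1.2, AAG => 0.8);

sub b_cai_sketch {
    my ($input, $background) = @_;
    # Step 1: normalize each RSCU by the corresponding background RSCU.
    my %ratio = map { $_ => $input->{$_} / $background->{$_} } keys %$input;
    # Step 2 (ASSUMPTION): rescale so the largest ratio becomes 1.
    my ($max) = sort { $b <=> $a } values %ratio;
    return { map { $_ => $ratio{$_} / $max } keys %ratio };
}

my $b_cai = b_cai_sketch(\%input_rscu, \%background_rscu);
printf "%s\t%.3f\n", $_, $b_cai->{$_} for sort keys %$b_cai;
```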

build_tai

Title   : build_tai
Usage   : $taiHash =
$self->build_tai($input,[$output,$selective_constraints, $kingdom]);
Function: build tAI values for all sense codons
Returns : reference of a hash in which codons are keys and tAI values
are values. return undef if failed. See Formulas 1 and 2 in I<dos
Reis, 2004, NAR> for how they are computed.
Args    : accepted arguments are as follows:
input
name of a file containing tRNA copies/abundance in the format
'anticodon<tab>count' per line, where 'anticodon' is anticodon in
the tRNA and count can be the tRNA gene copy number or abundance.
output
optional. If provided, result will be stored in the file
specified by this argument.
selective_constraints
 optional, reference to hash containing wobble base-pairing and its
 selective constraint compared to Watson-Crick base-pair, the format
 is like this:
 $selective_constraints = {
                  ...   ...   ...
                  'C-G'   => 0,
                  'G-T'   => 0.41,
                  'I-C'   => 0.28,
                  ...   ...   ...
                  };
 The key follows the 'anticodon-codon' order, and the values are codon
 selective constraints. The smaller the constraint, the stronger the
 pairing, so all Watson-Crick pairings have value 0.
 If this option is omitted, values will be searched for in the 'input' file,
 following the section of anticodons and started with a line '>SC'. If it is
 not in the input file, then the values in the Table 2 of 
 I<dos Reis, 2004, NAR> are used.
kingdom
kingdom = 1 for prokaryotes and 0 or undef for eukaryotes, which
affects the calculation for the bacterial isoleucine ATA codon. Default is
undef (eukaryotes).
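
The core of the calculation can be sketched as follows: each codon's absolute adaptiveness is the sum of (1 - constraint) x tRNA count over the anticodons that read it, and the result is divided by the largest such value. The tRNA counts are made up, and only the wobble pairs and constraints needed for two G-starting anticodons are listed. The treatment of codons without any matching tRNA, the ATG exclusion, and the kingdom-specific ATA rule are all omitted, so this is not a substitute for build_tai.

```perl
use strict;
use warnings;

# tRNA table in the 'anticodon<tab>count' format (made-up counts).
my $trna_table = "GAA\t10\nGGC\t4\n";

# Selective constraints keyed as 'anticodon-codon' bases; Watson-Crick
# pairs are 0 and 'G-T' uses the value shown in this document.
my %s          = ('G-C' => 0, 'G-T' => 0.41);
my %wobble     = (G => [qw/C T/]);
my %complement = (A => 'T', T => 'A', G => 'C', C => 'G');

my %W;    # absolute adaptiveness per codon
for my $line (split /\n/, $trna_table) {
    my ($anticodon, $count) = split /\t/, $line;
    my ($b1, $b2, $b3) = split //, $anticodon;
    for my $third (@{ $wobble{$b1} }) {
        my $codon = $complement{$b3} . $complement{$b2} . $third;
        $W{$codon} += (1 - $s{"$b1-$third"}) * $count;
    }
}

# Relative adaptiveness: divide by the largest absolute value.
my ($w_max) = sort { $b <=> $a } values %W;
my %w = map { $_ => $W{$_} / $w_max } keys %W;
printf "%s\t%.3f\n", $_, $w{$_} for sort keys %w;
```

In this toy input, TTC is read by its Watson-Crick anticodon with 10 copies and sets the maximum, while TTT is read through a G-T wobble and is scaled down by the 0.41 constraint.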

AUTHOR

Zhenguo Zhang, <zhangz.sci at gmail.com>

BUGS

Please report any bugs or feature requests to bug-bio-cua at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-CUA. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Bio::CUA::CUB::Builder

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2015 Zhenguo Zhang.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.