NAME
Bio::MaxQuant::Evidence::Statistics - Additional statistics on your SILAC evidence
VERSION
Version 0.01
SYNOPSIS
Read/convert your evidence file to a more rapidly processable format, and perform various operations and statistics across/between multiple experiments. Supports multidimensional experiments with replicate analyses.
use Bio::MaxQuant::Evidence::Statistics;
my $foo = Bio::MaxQuant::Evidence::Statistics->new();
# get the essential data from an evidence file
$foo->parseEssentials(filename=>$evidencePath);
# store the essentials for later
$foo->saveEssentials(filename=>$essentialsPath);
# load previously stored essentials
$foo->loadEssentials(filename=>$essentialsPath);
SUBROUTINES/METHODS
new
Create a new object:
my $foo = Bio::MaxQuant::Evidence::Statistics->new();
parseEssentials(%options)
Reads the essential data from an evidence file. Evidence files for large analyses can be very big and take a long time to process, so we read only what is necessary. The result can be saved for convenience and speed using saveEssentials().
The data are stored by Protein group IDs, i.e. one entry per protein group. Other data stored here are:
- id
- Protein group IDs
- Modified (this column name is unconfirmed)
- Leading Proteins
- Experiment
- PEP
- Ratio H/L
- Intensity H
- Intensity L
- Contaminant
- Reverse
The column names used for storage are defined in the default option essential_column_names, and can be changed when you call new, or when you call parseEssentials. The option is a hash of column names; a column is kept if its value is true and discarded otherwise, e.g.
$o->parseEssentials(essential_column_names=>{
'id' => 1, # kept
'PEP' => 0, # discarded
#foo => ?, # discarded
});
If a named column doesn't exist in the file, the method does not complain!
The method takes a hash of options.
options:
- filename - path of the file to process
- separator - passed to Text::CSV (default is tab)
- key_column_name - change the column keyed on (default is id)
- experiment_column_name - change the column the data are split on
- list_column_names - change the columns stored as lists
list_column_names
Some columns have the same value across all the evidence in a protein group, e.g. the id, Contaminant, Reverse, Protein IDs, and so on. The default, therefore, is to overwrite the column with the value seen in each evidence. Other columns, however, have a different value in each evidence, e.g. Ratio H/L or PEP. Any columns given in list_column_names with true values are appended as lists, so in the final data there is one row per protein, and any column bearing multiple evidences for that protein is a list.
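The overwrite-versus-append behaviour can be sketched as below. This is a minimal illustration and not the module's own code: the accumulate helper, the row layout and the shape of the resulting hash are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: fold evidence rows into one record per protein group.
# Columns flagged in %$list_cols are appended as lists; all others are
# overwritten by the most recent evidence.
sub accumulate {
    my ($rows, $key_col, $list_cols) = @_;
    my %data;
    for my $row (@$rows) {
        my $record = $data{ $row->{$key_col} } //= {};
        for my $col (keys %$row) {
            if ($list_cols->{$col}) {
                push @{ $record->{$col} }, $row->{$col};
            }
            else {
                $record->{$col} = $row->{$col};
            }
        }
    }
    return \%data;
}

my $rows = [
    { 'Protein group IDs' => 7, 'Reverse' => '', 'Ratio H/L' => 1.2 },
    { 'Protein group IDs' => 7, 'Reverse' => '', 'Ratio H/L' => 0.8 },
];
my $data = accumulate($rows, 'Protein group IDs', { 'Ratio H/L' => 1 });
print join(',', @{ $data->{7}{'Ratio H/L'} }), "\n";   # prints "1.2,0.8"
```

Here Ratio H/L ends up as a two-element list, while Reverse stays a plain scalar.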
If that makes no sense, write to me and I'll try to change it.
experiments
Returns a list of the experiments in the data.
replicated
Returns a list of the experiment names without the replicate portion.
The names are assumed to be Cell.Condition.Replicate, i.e. full-stop (period) separated.
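As an illustration of the naming convention, the following sketch strips the replicate field from a list of experiment names. The strip_replicates helper is hypothetical and only shows the idea; the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the final dot-separated field (the replicate)
# and return each remaining Cell.Condition name once, in first-seen order.
sub strip_replicates {
    my @experiments = @_;
    my (%seen, @names);
    for my $experiment (@experiments) {
        (my $name = $experiment) =~ s/\.[^.]*$//;
        push @names, $name unless $seen{$name}++;
    }
    return @names;
}

my @replicated = strip_replicates(qw(A.X.1 A.X.2 A.Y.1 B.X.1));
print "@replicated\n";   # prints "A.X A.Y B.X"
```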
orthogonals
Returns a list of sets of orthogonal experiments, that is, three experiments in which the first has one condition in common with each of the other two, but those two have nothing in common with each other.
e.g. A.X A.Y B.X
The rationale behind this is that quantitative differences across this set indicate mechanistic links between, for example, cell line and drug treatment. If a response is seen to a drug, and a different response is seen in a different cell type, this system will pick that up. The fourth member of the comparison (in the example that would be B.Y) could be anything, and the interpretation would still be that there is a differential response.
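One way to enumerate such sets is sketched below. The orthogonal_sets helper is hypothetical, it assumes names of the form Cell.Condition, and the order in which the module returns its sets may well be different.

```perl
use strict;
use warnings;

# Hypothetical helper: for each "pivot" experiment, combine every experiment
# sharing only its cell with every experiment sharing only its condition.
sub orthogonal_sets {
    my @names = @_;
    my @sets;
    for my $pivot (@names) {
        my ($cell, $condition) = split /\./, $pivot;
        my @same_cell = grep {
            my ($c, $k) = split /\./;
            $c eq $cell && $k ne $condition;
        } @names;
        my @same_condition = grep {
            my ($c, $k) = split /\./;
            $c ne $cell && $k eq $condition;
        } @names;
        for my $x (@same_cell) {
            for my $y (@same_condition) {
                push @sets, [ $pivot, $x, $y ];
            }
        }
    }
    return @sets;
}

my @sets = orthogonal_sets(qw(A.X A.Y B.X B.Y));
print join(' ', @$_), "\n" for @sets;   # first line is "A.X A.Y B.X"
```

With a full 2x2 design each of the four experiments acts as the pivot once, giving four sets.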
pairs
Returns a list of pairs of replicated experiments (e.g. A.X A.Y, A.X B.X ...) that represents all possible comparisons.
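Enumerating all possible comparisons amounts to listing every unordered pair. The sketch below uses a hypothetical all_pairs helper and is only meant to show the shape of the result, not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical helper: every unordered pair drawn from a list of names.
sub all_pairs {
    my @names = @_;
    my @pairs;
    for my $i (0 .. $#names - 1) {
        for my $j ($i + 1 .. $#names) {
            push @pairs, [ $names[$i], $names[$j] ];
        }
    }
    return @pairs;
}

my @pairs = all_pairs(qw(A.X A.Y B.X));
print join(' vs ', @$_), "\n" for @pairs;
```

For n replicated experiments this yields n(n-1)/2 pairs, so three experiments give three comparisons.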
ids
Returns a list of evidence ids in the data.
sharedIds
Returns a list containing the ids of those evidences shared between protein groups.
uniqueIds
Returns a list containing the ids of those evidences unique to one protein group.
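The distinction between shared and unique evidence ids can be sketched by counting how many protein groups each id occurs in. The data layout here (protein group id mapped to a list of evidence ids) is an assumption for the example, not a description of the module's storage.

```perl
use strict;
use warnings;

# Assumed layout: protein group id => list of evidence ids.
my %evidence_by_group = (
    1 => [ 10, 11, 12 ],
    2 => [ 12, 13 ],
);

# Count how many protein groups each evidence id appears in.
my %group_count;
for my $ids (values %evidence_by_group) {
    my %in_group = map { $_ => 1 } @$ids;   # ignore repeats within a group
    $group_count{$_}++ for keys %in_group;
}

my @shared = sort { $a <=> $b } grep { $group_count{$_} >  1 } keys %group_count;
my @unique = sort { $a <=> $b } grep { $group_count{$_} == 1 } keys %group_count;

print "shared: @shared\n";   # prints "shared: 12"
print "unique: @unique\n";   # prints "unique: 10 11 13"
```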
saveEssentials(%options)
Save the essential data (quicker to read again in future)
loadEssentials
Load up previously saved essentials
extractColumnValues
proteinCount
getProteinGroupIds
getLeadingProteins
logRatios
Logs ratios (base 2) throughout the dataset, and sets a flag so it can't get logged again.
Treatment of "special values": empty string, <= 0, NaN, and any other non-number are removed from the data!
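The treatment of special values can be sketched as a filter followed by a base-2 logarithm. The log2_ratios helper is hypothetical; it only mirrors the rules stated above (drop empty strings, values <= 0, NaN and other non-numbers).

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: keep only strictly positive numbers and return
# their base-2 logarithms.
sub log2_ratios {
    my @ratios = @_;
    my @kept = grep {
        defined $_
            && looks_like_number($_)   # rejects '' and other non-numbers
            && $_ == $_                # NaN is never equal to itself
            && $_ > 0                  # rejects zero and negative values
    } @ratios;
    return map { log($_) / log(2) } @kept;
}

my @logged = log2_ratios(2, 0.5, '', 0, -1, 'NaN', 'abc', 8);
printf "%.2f\n", $_ for @logged;
```

Only 2, 0.5 and 8 survive the filter in this example, so three values are printed.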
filter
returns a set of protein records based on filter parameters...
options
- experiment - regular expression to match experiment name
- proteinGroupId - regular expression to match protein group id
- leadingProteins - regular expression to match leading protein ids
- notLeadingProteins - regular expression to not match leading protein ids
Returns a filtered object of the same type, with relevant flags set (e.g. whether data has been logged, etc).
Warning, intentionally does not perform a deep clone!
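The practical consequence of not deep cloning is that a filtered result can share its nested data with the original. The sketch below demonstrates this with plain hashes; the record layout is an assumption and the code does not call the module.

```perl
use strict;
use warnings;

# Assumed layout: protein group id => record (a hashref).
my %original = (
    7 => { 'Leading Proteins' => 'P12345', 'Ratio H/L' => [ 1.2, 0.8 ] },
    9 => { 'Leading Proteins' => 'Q67890', 'Ratio H/L' => [ 2.0 ] },
);

# Shallow "filter": copies the references, not the records themselves.
my %filtered = map  { $_ => $original{$_} }
               grep { $original{$_}{'Leading Proteins'} =~ /^P/ }
               keys %original;

# Modifying the filtered copy also modifies the original record.
push @{ $filtered{7}{'Ratio H/L'} }, 1.0;
print scalar @{ $original{7}{'Ratio H/L'} }, "\n";   # prints 3
```

If independent data are needed, a deep copy (for example with Storable's dclone) would have to be made explicitly.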
replicateMedian
options are passed to filter.
deviations
returns a hashref with the following keys
- n - the number of items
- sd - the standard deviation (from the mean)
- mad - the median absolute deviation (from the median)
- sd_via_mad - the standard deviation estimated from the median absolute deviation
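The four quantities can be computed as in the sketch below. The helper names are hypothetical, the input is assumed to be a non-empty list of numbers, and the scale factor 1.4826 (the usual constant for normally distributed data) is an assumption about how sd_via_mad is derived.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub median_of {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Hypothetical helper returning the keys listed above.
sub deviations_of {
    my @values = @_;
    my $n      = @values;
    my $mean   = sum(@values) / $n;
    # unbiased sample variance (divides by n - 1)
    my $usv    = $n > 1 ? sum(map { ($_ - $mean)**2 } @values) / ($n - 1) : 0;
    my $median = median_of(@values);
    my $mad    = median_of(map { abs($_ - $median) } @values);
    return {
        n          => $n,
        sd         => sqrt($usv),
        mad        => $mad,
        sd_via_mad => 1.4826 * $mad,   # assumed scale factor
    };
}

my $d = deviations_of(1, 2, 3, 4, 100);
printf "n=%d sd=%.3f mad=%.3f sd_via_mad=%.3f\n",
    @{$d}{qw(n sd mad sd_via_mad)};
```

With the outlier 100 in the data, sd is about 43.6 while sd_via_mad stays below 1.5, which shows why the MAD-based estimate is the more robust of the two.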
mean
given a list of values, returns the mean
sd (unbiased standard deviation)
given a list of values, returns a hash with keys mean and sd (standard deviation).
sum
given a list of values, returns the sum
mad
given a list of values, returns the median absolute deviation
ttest
Given the options experiment1, experiment2 and optional filters, returns a hash of statistics:
stats1 and stats2 are hashes of deviations: sd, mad, sd_via_mad, usv, n, values
ttest is a hash of Welch's t-test results: t, df, p
ttest_mad is like ttest, but based on medians and median absolute deviations.
The p-values are derived using Welch's t-test and the t-distribution function from Statistics::Distributions.
MAD and medians are much more robust to outliers, which are significant in peptide ratios.
welchs_ttest
Performs Welch's t-test, given mean1, mean2, usv1, usv2, n1 and n2 in a hash.
e.g.
$o->welchs_ttest( mean1 => 4, mean2 => 3, # sample mean
usv1 => 1, usv2 => 1.1, # unbiased sample variance (returned as usv from $o->sd)
n1 => 4, n2 => 7 # number of observations
);
Also applies the Welch-Satterthwaite equation to calculate the degrees of freedom (to look up in a t-statistic table).
Returns hashref containing t and df.
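For reference, the standard Welch formulas are t = (mean1 - mean2) / sqrt(usv1/n1 + usv2/n2) and df = (usv1/n1 + usv2/n2)^2 / ((usv1/n1)^2/(n1-1) + (usv2/n2)^2/(n2-1)). The stand-alone sketch below applies them to the same arguments as the example; the welch function is hypothetical and is not the module's method.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version taking the same named arguments and
# returning a hashref with t and df.
sub welch {
    my %o   = @_;
    my $se1 = $o{usv1} / $o{n1};   # squared standard error of sample 1
    my $se2 = $o{usv2} / $o{n2};   # squared standard error of sample 2
    my $t   = ($o{mean1} - $o{mean2}) / sqrt($se1 + $se2);
    # Welch-Satterthwaite approximation of the degrees of freedom
    my $df  = ($se1 + $se2)**2
            / ( $se1**2 / ($o{n1} - 1) + $se2**2 / ($o{n2} - 1) );
    return { t => $t, df => $df };
}

my $r = welch(mean1 => 4, mean2 => 3, usv1 => 1, usv2 => 1.1, n1 => 4, n2 => 7);
printf "t=%.3f df=%.2f\n", $r->{t}, $r->{df};
```

Note that df is generally not an integer, so it may need rounding before being used with a table or a function that expects whole degrees of freedom.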
replicateMedianSubtractions
Logs data, if not already done, calculates median for each replicate, and subtracts median from each evidence in that replicate.
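The centring step can be sketched as follows, assuming the ratios have already been logged. The data layout (replicate name mapped to a list of log2 ratios) and the median_of helper are assumptions for illustration.

```perl
use strict;
use warnings;

sub median_of {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Assumed layout: replicate name => list of log2 ratios.
my %log_ratios = (
    'A.X.1' => [ 0.5, 1.0, 1.5 ],
    'A.X.2' => [ -1.0, 0.0, 2.0, 3.0 ],
);

# Subtract each replicate's own median from every value in that replicate.
my %centred;
for my $replicate (keys %log_ratios) {
    my $median = median_of(@{ $log_ratios{$replicate} });
    $centred{$replicate} = [ map { $_ - $median } @{ $log_ratios{$replicate} } ];
}

print "$_: @{ $centred{$_} }\n" for sort keys %centred;
```

After this step every replicate has a median of zero, which makes replicates comparable even if their overall mixing ratios differed.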
median
given a list of numbers, returns the median... assumes all items are numbers!
experimentMaximumPvalue
fullProteinComparison
Does a full comparison on a particular protein, i.e. compares all pairs of conditions, also does differential response analysis. Allows limitation of analysis to proteotypic peptides.
fullComparison
Does a full comparison for each protein. Returns hash of hashes.
direction
given two values, returns whether the difference between first and second is positive or negative
returns '+' or '-'
directionsDisagree
given two directions, which could be '+', '-' or '', returns true if one is '+' and the other is '-'
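The two helpers can be sketched as below. These are hypothetical re-implementations for illustration only; in particular, treating a zero difference as '+' is an assumption, since the behaviour for equal values is not stated.

```perl
use strict;
use warnings;

# Hypothetical: sign of (first - second) as '+' or '-'.
# Assumption: a zero difference is reported as '+'.
sub direction_of {
    my ($first, $second) = @_;
    return $first - $second >= 0 ? '+' : '-';
}

# Hypothetical: true only when one direction is '+' and the other is '-'.
# An empty direction ('') never disagrees with anything.
sub directions_disagree {
    my ($d1, $d2) = @_;
    return ( ($d1 eq '+' && $d2 eq '-') || ($d1 eq '-' && $d2 eq '+') ) ? 1 : 0;
}

print direction_of(3, 1), "\n";              # prints "+"
print direction_of(1, 3), "\n";              # prints "-"
print directions_disagree('+', '-'), "\n";   # prints 1
print directions_disagree('+', ''),  "\n";   # prints 0
```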
AUTHOR
jimi, <j at 0na.me>
BUGS
Please report any bugs or feature requests to bug-bio-maxquant-evidence-statistics at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-MaxQuant-Evidence-Statistics. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Bio::MaxQuant::Evidence::Statistics
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Bio-MaxQuant-Evidence-Statistics
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
http://cpanratings.perl.org/d/Bio-MaxQuant-Evidence-Statistics
Search CPAN
http://search.cpan.org/dist/Bio-MaxQuant-Evidence-Statistics/
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2014 jimi.
This program is released under the following license: artistic2