NAME

Bio::MaxQuant::Evidence::Statistics - Additional statistics on your SILAC evidence

VERSION

Version 0.01

SYNOPSIS

Read/convert your evidence file to a more rapidly processable format, and perform various operations and statistics across/between multiple experiments. Supports multidimensional experiments with replicate analyses.

    use Bio::MaxQuant::Evidence::Statistics;

    my $foo = Bio::MaxQuant::Evidence::Statistics->new();
    
    # get the essential data from an evidence file
    $foo->parseEssentials(filename=>$evidencePath);

    # store the essentials for later
    $foo->writeEssentials(filename=>$essentialsPath);

    # load previously stored essentials
    $foo->readEssentials(filename=>$essentialsPath);

SUBROUTINES/METHODS

new

Create a new object:

    my $foo = Bio::MaxQuant::Evidence::Statistics->new();

parseEssentials(%options)

Reads the essential data from an evidence file. Evidence files for large analyses can be very big and take a long time to process, so we read only what is necessary, and can save it for convenience and speed using writeEssentials().

The data are stored by Protein group IDs, i.e. one entry per protein group. Other data stored here are:

id
Protein group IDs
Modified -- is this actually the right name??
Leading Proteins
Experiment
PEP
Ratio H/L
Intensity H
Intensity L
Contaminant
Reverse

The column names used for storage are defined in the default option essential_column_names, and can be changed when you call new, or when you call parseEssentials. The option is a hash of column names; a column is kept when its value is true, e.g.

    $o->parseEssentials(essential_column_names => {
        'id'  => 1, # kept
        'PEP' => 0, # discarded
        #foo  => ?, # discarded
    });

If a named column does not exist, the method does not complain.
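
The selection rule can be pictured with a small standalone sketch. It is not the module's own code: the helper name keep_columns and the sample row are hypothetical, and it only shows the idea of keeping columns by truthiness while silently skipping names that are absent.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): keep only the columns whose
# flag is true. Columns with a false flag, and flagged columns that are
# missing from the row, are skipped without any complaint.
sub keep_columns {
    my ($row, $flags) = @_;
    my %kept;
    for my $name (keys %$flags) {
        next unless $flags->{$name};        # false flag: discard
        next unless exists $row->{$name};   # unknown column: skip silently
        $kept{$name} = $row->{$name};
    }
    return \%kept;
}

my $row   = { id => 7, PEP => 0.01, Experiment => 'A.X.1' };
my $flags = { id => 1, PEP => 0, 'Ratio H/L' => 1 };
my $kept  = keep_columns($row, $flags);
print join(',', sort keys %$kept), "\n";
```

Here only id survives: PEP is flagged false, Experiment is not flagged at all, and Ratio H/L is flagged but absent from the row.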

The method takes a hash of options.

options:

filename - path of the file to process
separator - passed to Text::CSV (default is tab)
key_column_name - change the column keyed on (default is id)
experiment_column_name - change the column the data are split on
list_column_names - change the columns stored as lists

list_column_names

Some columns are the same across all the evidence in a protein group, e.g. the id, Contaminant, Reverse, Protein IDs, and so on. The default, therefore, is to overwrite the column with the value seen in each evidence. But some columns have a different value in each evidence, e.g. Ratio H/L or PEP. Any column given in list_column_names with a true value is appended to a list instead, so in the final data there is one row per protein, and any column bearing multiple evidences for that protein is a list.

If that makes no sense, write to me and I'll try to change it.
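
As a rough illustration of overwrite versus append, here is a standalone sketch. The helper add_evidence, the storage layout and the sample rows are all hypothetical; the module's internal structure may differ.

```perl
use strict;
use warnings;

# Hypothetical accumulator (not the module's code): columns flagged as list
# columns collect one value per evidence, all other columns are overwritten.
sub add_evidence {
    my ($store, $row, $key, $list_columns) = @_;
    my $entry = $store->{ $row->{$key} } //= {};
    for my $name (keys %$row) {
        if ($list_columns->{$name}) {
            push @{ $entry->{$name} }, $row->{$name};   # differs per evidence
        }
        else {
            $entry->{$name} = $row->{$name};            # same for whole group
        }
    }
    return $store;
}

my %store;
my %lists = ('Ratio H/L' => 1, PEP => 1);
my $key   = 'Protein group IDs';
add_evidence(\%store, { $key => 12, Reverse => '', 'Ratio H/L' => 1.5, PEP => 0.01 }, $key, \%lists);
add_evidence(\%store, { $key => 12, Reverse => '', 'Ratio H/L' => 0.8, PEP => 0.02 }, $key, \%lists);
print scalar @{ $store{12}{'Ratio H/L'} }, "\n";
```

After the two calls, protein group 12 holds a single Reverse value and two-element lists for Ratio H/L and PEP.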

experiments

Returns a list of the experiments in the data.

replicated

Returns a list of the experiment names without the replicate portion.

The names are assumed to be Cell.Condition.Replicate, i.e. full-stop (period) separated.
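
A minimal sketch of stripping the replicate portion, assuming names of the form Cell.Condition.Replicate as described above. The helper name strip_replicates is hypothetical and the module may order or deduplicate differently.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the final dot-separated field (the replicate)
# and return the distinct Cell.Condition names in first-seen order.
sub strip_replicates {
    my @experiments = @_;
    my (%seen, @names);
    for my $experiment (@experiments) {
        (my $name = $experiment) =~ s/\.[^.]*$//;   # remove ".Replicate"
        push @names, $name unless $seen{$name}++;
    }
    return @names;
}

print join(' ', strip_replicates(qw(A.X.1 A.X.2 A.Y.1 B.X.1))), "\n";
```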

orthogonals

Returns a list of sets of orthogonal experiments, that is 3 experiments in which the first has one condition in common with the other two, but they have nothing in common with each other.

e.g. A.X A.Y B.X

The rationale behind this is that quantitative differences across this set indicate mechanistic links between, for example, cell line and drug treatment. If a response is seen to a drug, and a different response is seen in a different cell type, this system will pick that up. The fourth member of the comparison (in the example that would be B.Y) could be anything, and the interpretation would still be that there is a differential response.
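
The definition can be made concrete with a brute-force sketch. It assumes replicate-free names of the form Cell.Condition; the helper names shared_fields and orthogonal_sets are hypothetical, and the module may enumerate or order the sets differently.

```perl
use strict;
use warnings;

# Hypothetical helper: how many of the two fields (cell, condition)
# a pair of Cell.Condition names have in common.
sub shared_fields {
    my ($first, $second) = @_;
    my @left  = split /\./, $first;
    my @right = split /\./, $second;
    return scalar grep { $left[$_] eq $right[$_] } 0 .. 1;
}

# Hypothetical helper: every [pivot, x, y] in which the pivot shares exactly
# one field with x and with y, while x and y share nothing.
sub orthogonal_sets {
    my @names = @_;
    my @sets;
    for my $pivot (@names) {
        my @partners = grep { shared_fields($pivot, $_) == 1 } @names;
        for my $i (0 .. $#partners) {
            for my $j ($i + 1 .. $#partners) {
                next if shared_fields($partners[$i], $partners[$j]);
                push @sets, [ $pivot, $partners[$i], $partners[$j] ];
            }
        }
    }
    return @sets;
}

print join(' ', @$_), "\n" for orthogonal_sets(qw(A.X A.Y B.X));
```

With the three names from the example, the only set found is A.X A.Y B.X; adding B.Y gives four sets, one per pivot.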

pairs

Returns a list of pairs of replicated experiments (e.g. A.X A.Y, A.X B.X ...) that represents all possible comparisons.
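
A sketch of enumerating all possible comparisons, assuming each unordered pair is wanted once. The helper name all_pairs is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: every unordered pair of names, each pair listed once.
sub all_pairs {
    my @names = @_;
    my @pairs;
    for my $i (0 .. $#names) {
        for my $j ($i + 1 .. $#names) {
            push @pairs, [ $names[$i], $names[$j] ];
        }
    }
    return @pairs;
}

print join(' ', @$_), "\n" for all_pairs(qw(A.X A.Y B.X));
```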

ids

Returns a list of evidence ids in the data.

sharedIds

Returns a list containing the ids of those evidences shared between protein groups.

uniqueIds

Returns a list containing the ids of those evidences unique to one protein group.
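
The shared/unique distinction can be sketched as a simple count of how many protein groups each evidence id appears in. The input layout (protein group id mapped to a list of evidence ids) and the helper name split_shared_unique are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: given { group_id => [ evidence ids ] }, return the ids
# seen in more than one group (shared) and in exactly one group (unique).
sub split_shared_unique {
    my ($groups) = @_;
    my %count;
    for my $ids (values %$groups) {
        my %in_group = map { $_ => 1 } @$ids;   # count each id once per group
        $count{$_}++ for keys %in_group;
    }
    my @shared = sort { $a <=> $b } grep { $count{$_} > 1 }  keys %count;
    my @unique = sort { $a <=> $b } grep { $count{$_} == 1 } keys %count;
    return (\@shared, \@unique);
}

my ($shared, $unique) = split_shared_unique({ 1 => [10, 11], 2 => [11, 12] });
print "shared: @$shared\nunique: @$unique\n";
```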

saveEssentials(%options)

Saves the essential data, which is quicker to read again in future.

loadEssentials

Loads previously saved essentials.

extractColumnValues

proteinCount

getProteinGroupIds

getLeadingProteins

logRatios

Logs ratios (base 2) throughout the dataset, and sets a flag so it can't get logged again.

Special values are removed from the data: the empty string, values <= 0, NaN, and anything else that is not a number.
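
The filtering rule can be sketched on a plain list of ratios. This is an illustration under assumptions, not the module's implementation: the helper name log2_ratios is hypothetical, and looks_like_number from the core module Scalar::Util stands in for whatever number check the module uses.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: drop anything that is not a positive number, then
# take log base 2 of what remains. NaN fails the "> 0" comparison, so it
# is dropped together with zero and negative values.
sub log2_ratios {
    my @ratios = @_;
    return map  { log($_) / log(2) }
           grep { defined($_) && looks_like_number($_) && $_ > 0 }
           @ratios;
}

my @logged = log2_ratios(2, 0.5, '', 0, -1, 'NaN', 'abc', 8);
print scalar(@logged), " values kept\n";
```

Of the eight inputs, only 2, 0.5 and 8 are kept.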

filter

returns a set of protein records based on filter parameters...

options

experiment - regular expression to match experiment name
proteinGroupId - regular expression to match protein group id
leadingProteins - regular expression to match leading protein ids
notLeadingProteins - regular expression to not match leading protein ids

Returns a filtered object of the same type, with relevant flags set (e.g. whether data has been logged, etc).

Warning: this intentionally does not perform a deep clone, so nested data are shared with the original object.

replicateMedian

options are passed to filter.

deviations

returns a hashref with the following keys

n - the number of items
sd - the standard deviation (from the mean)
mad - the median absolute deviation (from the median)
sd_via_mad - the standard deviation estimated from the median absolute deviation
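
A standalone sketch of these four quantities follows. The helper names are hypothetical, and the factor 1.4826 is the usual consistency constant for normally distributed data; whether the module uses exactly that constant is an assumption. The list is assumed to be non-empty and numeric.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: median of a non-empty list of numbers.
sub median_of {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(@sorted / 2);
    return @sorted % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Hypothetical helper: n, unbiased sd, MAD, and an sd estimate from the MAD.
sub deviations_of {
    my @values = @_;
    my $n      = @values;
    my $mean   = sum(@values) / $n;
    my $usv    = $n > 1 ? sum(map { ($_ - $mean) ** 2 } @values) / ($n - 1) : 0;
    my $median = median_of(@values);
    my $mad    = median_of(map { abs($_ - $median) } @values);
    return {
        n          => $n,
        sd         => sqrt($usv),
        mad        => $mad,
        sd_via_mad => 1.4826 * $mad,   # assumed scaling constant
    };
}

my $dev = deviations_of(1, 2, 3, 4, 100);
printf "n=%d sd=%.2f mad=%s sd_via_mad=%.4f\n", @{$dev}{qw(n sd mad sd_via_mad)};
```

The single outlier (100) inflates sd considerably, while mad stays at 1, which is why the MAD-based estimate is attractive for peptide ratios.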

mean

given a list of values, returns the mean

sd (unbiased standard deviation)

given a list of values, returns a hash with keys mean and sd (standard deviation).

sum

given a list of values, returns the sum

mad

given a list of values, returns the median absolute deviation

ttest

Given options, experiment1, experiment2 and optional filters, returns a hash of statistics...

stats1 and stats2 are hashes of deviations: sd, mad, sd_via_mad, usv, n, values

ttest is a hash of Welch's t-test results: t, df, p

ttest_mad is like ttest but based on median and median absolute deviations.

The p-values are derived using Welch's t-test and the t-distribution function from Statistics::Distributions.

MAD and medians are much more robust to outliers, which are significant in peptide ratios.

welchs_ttest

performs Welch's ttest, given mean1, mean2, usv1, usv2, n1 and n2 in a hash.

e.g.

    $o->welchs_ttest( mean1 => 4, mean2 => 3,   # sample means
                      usv1  => 1, usv2  => 1.1, # unbiased sample variances (returned as usv from $o->sd)
                      n1    => 4, n2    => 7,   # numbers of observations
    );

also performs Welch-Satterthwaite to calculate degrees of freedom (to look up in t-statistic table)

Returns hashref containing t and df.
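
For reference, the textbook formulas can be sketched as a standalone function. This is not the module's code: the name welch_sketch is hypothetical, and only the argument names are borrowed from the example above.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the standard formulas:
#   t  = (mean1 - mean2) / sqrt(usv1/n1 + usv2/n2)
#   df = (usv1/n1 + usv2/n2)^2
#        / ( (usv1/n1)^2/(n1-1) + (usv2/n2)^2/(n2-1) )   (Welch-Satterthwaite)
sub welch_sketch {
    my %arg = @_;
    my $var1 = $arg{usv1} / $arg{n1};   # variance of the first mean
    my $var2 = $arg{usv2} / $arg{n2};   # variance of the second mean
    my $t    = ($arg{mean1} - $arg{mean2}) / sqrt($var1 + $var2);
    my $df   = ($var1 + $var2) ** 2
             / ( $var1 ** 2 / ($arg{n1} - 1) + $var2 ** 2 / ($arg{n2} - 1) );
    return { t => $t, df => $df };
}

my $result = welch_sketch(mean1 => 4, mean2 => 3,
                          usv1  => 1, usv2  => 1.1,
                          n1    => 4, n2    => 7);
printf "t = %.3f, df = %.2f\n", $result->{t}, $result->{df};
```

For the example values this gives t of about 1.567 with about 6.64 degrees of freedom. Both n1 and n2 must be at least 2, otherwise the degrees-of-freedom formula divides by zero.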

replicateMedianSubtractions

Logs data, if not already done, calculates median for each replicate, and subtracts median from each evidence in that replicate.
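
The centring step for a single replicate can be sketched as follows, assuming the values are already logged. The helper names median_of and subtract_median are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: median of a non-empty list of numbers.
sub median_of {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(@sorted / 2);
    return @sorted % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Hypothetical helper: centre one replicate's log ratios on their median.
sub subtract_median {
    my @log_ratios = @_;
    my $median = median_of(@log_ratios);
    return map { $_ - $median } @log_ratios;
}

my @centred = subtract_median(1, 2, 6);
print "@centred\n";
```

After subtraction the replicate's median is zero, which makes replicates with different overall mixing ratios comparable.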

median

given a list of numbers, returns the median... assumes all items are numbers!

experimentMaximumPvalue

fullProteinComparison

Does a full comparison on a particular protein, i.e. compares all pairs of conditions, also does differential response analysis. Allows limitation of analysis to proteotypic peptides.

fullComparison

Does a full comparison for each protein. Returns hash of hashes.

direction

given two values, returns whether the difference between first and second is positive or negative

returns '+' or '-'

directionsDisagree

given two directions, which could be '+', '-' or '', returns true if one is '+' and the other is '-'
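
A small sketch of both helpers follows. The function names are hypothetical, the difference is assumed to be first minus second, and returning an empty string for a zero difference is an assumption, since the documentation above does not say how ties are handled.

```perl
use strict;
use warnings;

# Hypothetical helper: sign of (first - second). Returning '' for a tie is
# an assumption made for this sketch.
sub direction_of {
    my ($first, $second) = @_;
    my $difference = $first - $second;
    return $difference > 0 ? '+'
         : $difference < 0 ? '-'
         :                   '';
}

# Hypothetical helper: true only when one direction is '+' and the other '-'.
sub directions_disagree {
    my ($first, $second) = @_;
    my $disagree = ($first eq '+' && $second eq '-')
                || ($first eq '-' && $second eq '+');
    return $disagree ? 1 : 0;
}

print directions_disagree(direction_of(2, 1), direction_of(1, 2)), "\n";
```

An empty direction never disagrees with anything, so only a clear '+' against a clear '-' counts.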

AUTHOR

jimi, <j at 0na.me>

BUGS

Please report any bugs or feature requests to bug-bio-maxquant-evidence-statistics at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-MaxQuant-Evidence-Statistics. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Bio::MaxQuant::Evidence::Statistics

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2014 jimi.

This program is released under the following license: artistic2