NAME
PDL::Stats::Basic -- basic statistics and related utilities such as standard deviation, Pearson correlation, and t-tests.
DESCRIPTION
The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are broadcastable and methods that are NOT broadcastable, respectively.
Does not have mean or median function here. see SEE ALSO.
SYNOPSIS
use PDL::LiteF;
use PDL::Stats::Basic;
my $stdv = $data->stdv;
or
my $stdv = stdv( $data );
FUNCTIONS
stdv
Signature: (a(n); [o]b())
Types: (float double)
$b = stdv($a);
stdv($a, $b); # all arguments given
$b = $a->stdv; # method call
$a->stdv($b);
Sample standard deviation.
Broadcasts over its inputs.
stdv
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
stdv_unbiased
Signature: (a(n); [o]b())
Types: (float double)
$b = stdv_unbiased($a);
stdv_unbiased($a, $b); # all arguments given
$b = $a->stdv_unbiased; # method call
$a->stdv_unbiased($b);
Unbiased estimate of population standard deviation.
Broadcasts over its inputs.
stdv_unbiased
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
var
Signature: (a(n); [o]b())
Types: (float double)
$b = var($a);
var($a, $b); # all arguments given
$b = $a->var; # method call
$a->var($b);
Sample variance.
Broadcasts over its inputs.
var
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
var_unbiased
Signature: (a(n); [o]b())
Types: (float double)
$b = var_unbiased($a);
var_unbiased($a, $b); # all arguments given
$b = $a->var_unbiased; # method call
$a->var_unbiased($b);
Unbiased estimate of population variance.
Broadcasts over its inputs.
var_unbiased
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
se
Signature: (a(n); [o]b())
Types: (float double)
$b = se($a);
se($a, $b); # all arguments given
$b = $a->se; # method call
$a->se($b);
Standard error of the mean. Useful for calculating confidence intervals.
# 95% confidence interval for samples with large N
$ci_95_upper = $data->average + 1.96 * $data->se;
$ci_95_lower = $data->average - 1.96 * $data->se;
Broadcasts over its inputs.
se
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
ss
Signature: (a(n); [o]b())
Types: (float double)
$b = ss($a);
ss($a, $b); # all arguments given
$b = $a->ss; # method call
$a->ss($b);
Sum of squared deviations from the mean.
Broadcasts over its inputs.
ss
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
skew
Signature: (a(n); [o]b())
Types: (float double)
$b = skew($a);
skew($a, $b); # all arguments given
$b = $a->skew; # method call
$a->skew($b);
Sample skewness, measure of asymmetry in data. skewness == 0 for normal distribution.
Broadcasts over its inputs.
skew
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
skew_unbiased
Signature: (a(n); [o]b())
Types: (float double)
$b = skew_unbiased($a);
skew_unbiased($a, $b); # all arguments given
$b = $a->skew_unbiased; # method call
$a->skew_unbiased($b);
Unbiased estimate of population skewness. This is the number in GNumeric Descriptive Statistics.
Broadcasts over its inputs.
skew_unbiased
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
kurt
Signature: (a(n); [o]b())
Types: (float double)
$b = kurt($a);
kurt($a, $b); # all arguments given
$b = $a->kurt; # method call
$a->kurt($b);
Sample kurtosis, measure of "peakedness" of data. kurtosis == 0 for normal distribution.
Broadcasts over its inputs.
kurt
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
kurt_unbiased
Signature: (a(n); [o]b())
Types: (float double)
$b = kurt_unbiased($a);
kurt_unbiased($a, $b); # all arguments given
$b = $a->kurt_unbiased; # method call
$a->kurt_unbiased($b);
Unbiased estimate of population kurtosis. This is the number in GNumeric Descriptive Statistics.
Broadcasts over its inputs.
kurt_unbiased
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
cov
Signature: (a(n); b(n); [o]c())
Types: (float double)
$c = cov($a, $b);
cov($a, $b, $c); # all arguments given
$c = $a->cov($b); # method call
$a->cov($b, $c);
Sample covariance. see corr for ways to call
Broadcasts over its inputs.
cov
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
cov_table
Signature: (a(n,m); [o]c(m,m))
Types: (sbyte byte short ushort long ulong indx ulonglong longlong
float double ldouble)
$c = cov_table($a);
cov_table($a, $c); # all arguments given
$c = $a->cov_table; # method call
$a->cov_table($c);
Square covariance table. Gives the same result as broadcasting using cov but it calculates only half the square, hence much faster. And it is easier to use with higher dimension pdls.
Usage:
# 5 obs x 3 var, 2 such data tables
pdl> $a = random 5, 3, 2
pdl> p $cov = $a->cov_table
[
[
[ 8.9636438 -1.8624472 -1.2416588]
[-1.8624472 14.341514 -1.4245366]
[-1.2416588 -1.4245366 9.8690655]
]
[
[ 10.32644 -0.31311789 -0.95643674]
[-0.31311789 15.051779 -7.2759577]
[-0.95643674 -7.2759577 5.4465141]
]
]
# diagonal elements of the cov table are the variances
pdl> p $a->var
[
[ 8.9636438 14.341514 9.8690655]
[ 10.32644 15.051779 5.4465141]
]
for the same cov matrix table using cov,
pdl> p $a->dummy(2)->cov($a->dummy(1))
Broadcasts over its inputs.
cov_table
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
corr
Signature: (a(n); b(n); [o]c())
Types: (float double)
$c = corr($a, $b);
corr($a, $b, $c); # all arguments given
$c = $a->corr($b); # method call
$a->corr($b, $c);
Pearson correlation coefficient. r = cov(X,Y) / (stdv(X) * stdv(Y)).
Usage:
pdl> $a = random 5, 3
pdl> $b = sequence 5,3
pdl> p $a->corr($b)
[0.20934208 0.30949881 0.26713007]
for square corr table
pdl> p $a->corr($a->dummy(1))
[
[ 1 -0.41995259 -0.029301192]
[ -0.41995259 1 -0.61927619]
[-0.029301192 -0.61927619 1]
]
but it is easier and faster to use corr_table.
Broadcasts over its inputs.
corr
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
corr_table
Signature: (a(n,m); [o]c(m,m))
Types: (sbyte byte short ushort long ulong indx ulonglong longlong
float double ldouble)
$c = corr_table($a);
corr_table($a, $c); # all arguments given
$c = $a->corr_table; # method call
$a->corr_table($c);
Square Pearson correlation table. Gives the same result as broadcasting using corr but it calculates only half the square, hence much faster. And it is easier to use with higher dimension pdls.
Usage:
# 5 obs x 3 var, 2 such data tables
pdl> $a = random 5, 3, 2
pdl> p $a->corr_table
[
[
[ 1 -0.69835951 -0.18549048]
[-0.69835951 1 0.72481605]
[-0.18549048 0.72481605 1]
]
[
[ 1 0.82722569 -0.71779883]
[ 0.82722569 1 -0.63938828]
[-0.71779883 -0.63938828 1]
]
]
for the same result using corr,
pdl> p $a->dummy(2)->corr($a->dummy(1))
This is also how to use t_corr and n_pair with such a table.
Broadcasts over its inputs.
corr_table
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
t_corr
Signature: (r(); n(); [o]t())
Types: (float double)
$t = t_corr($r, $n);
t_corr($r, $n, $t); # all arguments given
$t = $r->t_corr($n); # method call
$r->t_corr($n, $t);
t significance test for Pearson correlations.
$corr = $data->corr( $data->dummy(1) );
$n = $data->n_pair( $data->dummy(1) );
$t_corr = $corr->t_corr( $n );
use PDL::GSL::CDF;
$p_2tail = 2 * (1 - gsl_cdf_tdist_P( $t_corr->abs, $n-2 ));
Broadcasts over its inputs.
t_corr
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
n_pair
Signature: (a(n); b(n); indx [o]c())
Types: (long longlong)
$c = n_pair($a, $b);
n_pair($a, $b, $c); # all arguments given
$c = $a->n_pair($b); # method call
$a->n_pair($b, $c);
Returns the number of good pairs between 2 lists. Useful with corr (esp. when bad values are involved)
Broadcasts over its inputs.
n_pair
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
corr_dev
Signature: (a(n); b(n); [o]c())
Types: (float double)
$c = corr_dev($a, $b);
corr_dev($a, $b, $c); # all arguments given
$c = $a->corr_dev($b); # method call
$a->corr_dev($b, $c);
Calculates correlations from dev_m vals. Seems faster than doing corr from original vals when data pdl is big
Broadcasts over its inputs.
corr_dev
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
t_test
Signature: (a(n); b(m); [o]t(); [o]d())
Types: (float double)
($t, $d) = t_test($a, $b);
t_test($a, $b, $t, $d); # all arguments given
($t, $d) = $a->t_test($b); # method call
$a->t_test($b, $t, $d);
Independent sample t-test, assuming equal var.
my ($t, $df) = t_test( $pdl1, $pdl2 );
use PDL::GSL::CDF;
my $p_2tail = 2 * (1 - gsl_cdf_tdist_P( $t->abs, $df ));
Broadcasts over its inputs.
t_test
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
t_test_nev
Signature: (a(n); b(m); [o]t(); [o]d())
Types: (float double)
($t, $d) = t_test_nev($a, $b);
t_test_nev($a, $b, $t, $d); # all arguments given
($t, $d) = $a->t_test_nev($b); # method call
$a->t_test_nev($b, $t, $d);
Independent sample t-test, NOT assuming equal var. ie Welch two sample t test. Df follows Welch-Satterthwaite equation instead of Satterthwaite (1946, as cited by Hays, 1994, 5th ed.). It matches GNumeric, which matches R.
Broadcasts over its inputs.
t_test_nev
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
t_test_paired
Signature: (a(n); b(n); [o]t(); [o]d())
Types: (float double)
($t, $d) = t_test_paired($a, $b);
t_test_paired($a, $b, $t, $d); # all arguments given
($t, $d) = $a->t_test_paired($b); # method call
$a->t_test_paired($b, $t, $d);
Paired sample t-test.
Broadcasts over its inputs.
t_test_paired
processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
binomial_test
Signature: (x(); n(); p_expected(); [o]p())
Binomial test. One-tailed significance test for two-outcome distribution. Given the number of successes, the number of trials, and the expected probability of success, returns the probability of getting this many or more successes.
This function does NOT currently support bad value in the number of successes.
Usage:
# assume a fair coin, ie. 0.5 probablity of getting heads
# test whether getting 8 heads out of 10 coin flips is unusual
my $p = binomial_test( 8, 10, 0.5 ); # 0.0107421875. Yes it is unusual.
METHODS
rtable
Reads either file or file handle*. Returns observation x variable pdl and var and obs ids if specified. Ids in perl @ ref to allow for non-numeric ids. Other non-numeric entries are treated as missing, which are filled with $opt{MISSN} then set to BAD*. Can specify num of data rows to read from top but not arbitrary range.
*If passed handle, it will not be closed here.
Default options (case insensitive):
V => 1, # verbose. prints simple status
TYPE => double,
C_ID => 1, # boolean. file has col id.
R_ID => 1, # boolean. file has row id.
R_VAR => 0, # boolean. set to 1 if var in rows
SEP => "\t", # can take regex qr//
MISSN => -999, # this value treated as missing and set to BAD
NROW => '', # set to read specified num of data rows
Usage:
Sample file diet.txt:
uid height weight diet
akw 72 320 1
bcm 68 268 1
clq 67 180 2
dwm 70 200 2
($data, $idv, $ido) = rtable 'diet.txt';
# By default prints out data info and @$idv index and element
reading diet.txt for data and id... OK.
data table as PDL dim o x v: PDL: Double D [4,3]
0 height
1 weight
2 diet
Another way of using it,
$data = rtable( \*STDIN, {TYPE=>long} );
group_by
Returns pdl reshaped according to the specified factor variable. Most useful when used in conjunction with other broadcasting calculations such as average, stdv, etc. When the factor variable contains unequal number of cases in each level, the returned pdl is padded with bad values to fit the level with the most number of cases. This allows the subsequent calculation (average, stdv, etc) to return the correct results for each level.
Usage:
# simple case with 1d pdl and equal number of n in each level of the factor
pdl> p $a = sequence 10
[0 1 2 3 4 5 6 7 8 9]
pdl> p $factor = $a > 4
[0 0 0 0 0 1 1 1 1 1]
pdl> p $a->group_by( $factor )->average
[2 7]
# more complex case with broadcasting and unequal number of n across levels in the factor
pdl> p $a = sequence 10,2
[
[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
]
pdl> p $factor = qsort $a( ,0) % 3
[
[0 0 0 0 1 1 1 2 2 2]
]
pdl> p $a->group_by( $factor )
[
[
[ 0 1 2 3]
[10 11 12 13]
]
[
[ 4 5 6 BAD]
[ 14 15 16 BAD]
]
[
[ 7 8 9 BAD]
[ 17 18 19 BAD]
]
]
ARRAY(0xa2a4e40)
# group_by supports perl factors, multiple factors
# returns factor labels in addition to pdl in array context
pdl> p $a = sequence 12
[0 1 2 3 4 5 6 7 8 9 10 11]
pdl> $odd_even = [qw( e o e o e o e o e o e o )]
pdl> $magnitude = [qw( l l l l l l h h h h h h )]
pdl> ($a_grouped, $label) = $a->group_by( $odd_even, $magnitude )
pdl> p $a_grouped
[
[
[0 2 4]
[1 3 5]
]
[
[ 6 8 10]
[ 7 9 11]
]
]
pdl> p Dumper $label
$VAR1 = [
[
'e_l',
'o_l'
],
[
'e_h',
'o_h'
]
];
which_id
Lookup specified var (obs) ids in $idv ($ido) (see rtable) and return indices in $idv ($ido) as pdl if found. The indices are ordered by the specified subset. Useful for selecting data by var (obs) id.
my $ind = which_id $ido, ['smith', 'summers', 'tesla'];
my $data_subset = $data( $ind, );
# take advantage of perl pattern matching
# e.g. use data from people whose last name starts with s
my $i = which_id $ido, [ grep { /^s/ } @$ido ];
my $data_s = $data($i, );
SEE ALSO
PDL::Basic (hist for frequency counts)
PDL::Ufunc (sum, avg, median, min, max, etc.)
PDL::GSL::CDF (various cumulative distribution functions)
REFERENCES
Hays, W.L. (1994). Statistics (5th ed.). Fort Worth, TX: Harcourt Brace College Publishers.
AUTHOR
Copyright (C) 2009 Maggie J. Xiong <maggiexyz users.sourceforge.net>
All rights reserved. There is no warranty. You are allowed to redistribute this software / documentation as described in the file COPYING in the PDL distribution.