NAME
PDL::Stats::GLM -- general and generalized linear modelling methods such as ANOVA, linear regression, PCA, and logistic regression.
SYNOPSIS
use PDL::LiteF;
use PDL::Stats::GLM;
# do a multiple linear regression and plot the residuals
my $y = pdl( 8, 7, 7, 0, 2, 5, 0 );
my $x = pdl( [ 0, 1, 2, 3, 4, 5, 6 ], # linear component
[ 0, 1, 4, 9, 16, 25, 36 ] ); # quadratic component
my %m = $y->ols( $x, {plot=>1} );
print "$_\t$m{$_}\n" for sort keys %m;
DESCRIPTION
For more about general linear modelling, see Wikipedia. For an unbelievably thorough text on experimental design and analysis, including linear modelling, see A First Course in Design and Analysis of Experiments by Gary W. Oehlert.
The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are broadcastable and methods that are NOT broadcastable, respectively. FUNCTIONS support bad values.
P-values, where appropriate, are provided if PDL::GSL::CDF is installed.
FUNCTIONS
fill_m
Signature: (a(n); [o]b(n))
Types: (float double)
Replaces bad values with the sample mean. The mean is set to 0 if all observations are bad.
pdl> p $data
[
[ 5 BAD 2 BAD]
[ 7 3 7 BAD]
]
pdl> p $data->fill_m
[
[ 5 3.5 2 3.5]
[ 7 3 7 5.66667]
]
Broadcasts over its inputs.
The output pdl badflag is cleared.
fill_rand
Signature: (a(n); [o]b(n))
Types: (sbyte byte short ushort long ulong indx ulonglong longlong
float double ldouble)
Replaces bad values with a random sample (with replacement) of the good observations from the same variable.
pdl> p $data
[
[ 5 BAD 2 BAD]
[ 7 3 7 BAD]
]
pdl> p $data->fill_rand
[
[5 2 2 5]
[7 3 7 7]
]
Broadcasts over its inputs.
The output pdl badflag is cleared.
dev_m
Signature: (a(n); [o]b(n))
Types: (float double)
$b = dev_m($a);
dev_m($a, $b); # all arguments given
$b = $a->dev_m; # method call
$a->dev_m($b);
$a->inplace->dev_m; # can be used inplace
dev_m($a->inplace);
Replaces values with deviations from the mean.
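A small worked sketch (assuming PDL::LiteF and PDL::Stats::GLM are loaded as in the SYNOPSIS):
my $dev = pdl(1, 2, 3, 4)->dev_m;   # mean is 2.5, so expect [-1.5 -0.5 0.5 1.5]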
Broadcasts over its inputs.
dev_m processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
stddz
Signature: (a(n); [o]b(n))
Types: (float double)
$b = stddz($a);
stddz($a, $b); # all arguments given
$b = $a->stddz; # method call
$a->stddz($b);
$a->inplace->stddz; # can be used inplace
stddz($a->inplace);
Standardize, i.e. replace values with z-scores based on the sample standard deviation from the mean (replaced with 0s if stdv == 0).
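A quick sanity-check sketch (z-scores have mean 0 whatever the data):
my $z = pdl(1, 2, 3, 4)->stddz;
print $z->avg, "\n";   # ~0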
Broadcasts over its inputs.
stddz processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
sse
Signature: (a(n); b(n); [o]c())
Types: (float double)
$c = sse($a, $b);
sse($a, $b, $c); # all arguments given
$c = $a->sse($b); # method call
$a->sse($b, $c);
Sum of squared errors between actual and predicted values.
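A small worked sketch (assuming PDL::LiteF and PDL::Stats::GLM are loaded):
my $y      = pdl(1, 2, 3, 4);
my $y_pred = pdl(1.1, 1.9, 3.2, 3.8);
my $sse    = $y->sse($y_pred);   # sum of (y - y_pred)**2, here 0.1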
Broadcasts over its inputs.
sse processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
mse
Signature: (a(n); b(n); [o]c())
Types: (float double)
$c = mse($a, $b);
mse($a, $b, $c); # all arguments given
$c = $a->mse($b); # method call
$a->mse($b, $c);
Mean of squared errors between actual and predicted values, i.e. the variance around the predicted values.
Broadcasts over its inputs.
mse processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
rmse
Signature: (a(n); b(n); [o]c())
Types: (float double)
$c = rmse($a, $b);
rmse($a, $b, $c); # all arguments given
$c = $a->rmse($b); # method call
$a->rmse($b, $c);
Root mean squared error, i.e. the standard deviation around the predicted values.
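A quick sanity-check sketch relating the three error measures (assuming mse divides by the number of elements, as the name suggests):
my $y      = pdl(1, 2, 3, 4);
my $y_pred = pdl(1.1, 1.9, 3.2, 3.8);
print $y->rmse($y_pred), "\n";                      # root mean squared error
print sqrt( $y->sse($y_pred) / $y->nelem ), "\n";   # should match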
Broadcasts over its inputs.
rmse processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
pred_logistic
Signature: (a(n,m); b(m); [o]c(n))
Types: (float double)
Calculates the predicted probability values for logistic regression.
# glue constant then apply coeff returned by the logistic method
$pred = $x->glue(1,ones($x->dim(0)))->pred_logistic( $m{b} );
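The prediction is the logistic (inverse-logit) transform of the linear predictor; a minimal sketch of an equivalent manual computation (assuming, as above, that the constant column is glued LAST and $m{b} comes from the logistic method):
my $eta    = ( $x->glue(1, ones($x->dim(0))) * $m{b}->dummy(0) )->xchg(0,1)->sumover;
my $manual = 1 / (1 + exp(-$eta));   # should match $pred above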
Broadcasts over its inputs.
pred_logistic processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
d0
Signature: (a(n); [o]c())
Types: (float double)
$c = d0($a);
d0($a, $c); # all arguments given
$c = $a->d0; # method call
$a->d0($c);
Null deviance for logistic regression.
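The null deviance equals the model deviance computed against a constant predictor at the sample mean (see dm below); a minimal sketch, assuming $y holds the observed 0/1 outcomes:
my $d0 = $y->d0;   # same value as $y->dm( ones($y->nelem) * $y->avg )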
Broadcasts over its inputs.
d0 processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
dm
Signature: (a(n); b(n); [o]c())
Types: (float double)
Model deviance for logistic regression.
my $dm = $y->dm( $y_pred );
# null deviance
my $d0 = $y->dm( ones($y->nelem) * $y->avg );
Broadcasts over its inputs.
dm processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
dvrs
Signature: (a(); b(); [o]c())
Types: (float double)
$c = dvrs($a, $b);
dvrs($a, $b, $c); # all arguments given
$c = $a->dvrs($b); # method call
$a->dvrs($b, $c);
Deviance residual for logistic regression.
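By the usual definition of deviance residuals, their squares sum to the model deviance; a quick sanity-check sketch (assuming $y and $y_pred as in the dm example above):
my $dr    = $y->dvrs( $y_pred );
my $check = ( $dr ** 2 )->sum;   # should match $y->dm( $y_pred )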
Broadcasts over its inputs.
dvrs processes bad values. It will set the bad-value flag of all output ndarrays if the flag is set for any of the input ndarrays.
ols_t
Broadcastable version of ordinary least squares regression (ols). The price of broadcasting is the loss of significance tests for the coefficients (but see r2_change). The fitting function was shamelessly copied then modified from PDL::Fit::Linfit. The intercept is FIRST of the coefficients if CONST => 1.
ols_t does not handle bad values. Consider fill_m or fill_rand if there are bad values.
Default options (case insensitive):
CONST => 1,
Usage:
# DV, 2 person's ratings for top-10 box office movies
# ascending sorted by box office numbers
pdl> p $y = pdl '1 1 2 4 4 4 4 5 5 5; 1 2 2 2 3 3 3 3 5 5'
# model with 2 IVs, a linear and a quadratic trend component
pdl> $x = cat sequence(10), sequence(10)**2
# suppose our novice modeler thinks this creates 3 different models
# for predicting movie ratings
pdl> p $x = cat $x, $x * 2, $x * 3
[
[
[ 0 1 2 3 4 5 6 7 8 9]
[ 0 1 4 9 16 25 36 49 64 81]
]
[
[ 0 2 4 6 8 10 12 14 16 18]
[ 0 2 8 18 32 50 72 98 128 162]
]
[
[ 0 3 6 9 12 15 18 21 24 27]
[ 0 3 12 27 48 75 108 147 192 243]
]
]
pdl> p $x->info
PDL: Double D [10,2,3]
# insert a dummy dim between IV and the dim (model) to be broadcasted
pdl> %m = $y->ols_t( $x->dummy(2) )
pdl> p "$_\t@{[$m{$_} =~ /^\n*(.*?)\n*\z/s]}\n" for sort keys %m
# 2 persons' ratings, each fitted with 3 "different" models
F [
[ 38.314159 25.087209]
[ 38.314159 25.087209]
[ 38.314159 25.087209]
]
# df is the same across dv and iv models
F_df [2 7]
F_p [
[0.00016967051 0.00064215074]
[0.00016967051 0.00064215074]
[0.00016967051 0.00064215074]
]
R2 [
[ 0.9162963 0.87756762]
[ 0.9162963 0.87756762]
[ 0.9162963 0.87756762]
]
b [ # constant linear quadratic
[
[ 0.66363636 0.99015152 -0.056818182] # person 1
[ 1.4 0.18939394 0.022727273] # person 2
]
[
[ 0.66363636 0.49507576 -0.028409091]
[ 1.4 0.09469697 0.011363636]
]
[
[ 0.66363636 0.33005051 -0.018939394]
[ 1.4 0.063131313 0.0075757576]
]
]
# our novice modeler realizes at this point that
# the 3 models only differ in the scaling of the IV coefficients
ss_model [
[ 20.616667 13.075758]
[ 20.616667 13.075758]
[ 20.616667 13.075758]
]
ss_residual [
[ 1.8833333 1.8242424]
[ 1.8833333 1.8242424]
[ 1.8833333 1.8242424]
]
ss_total [22.5 14.9]
y_pred [
[
[0.66363636 1.5969697 2.4166667 3.1227273 ... 4.9727273]
...
r2_change
Significance test for the incremental change in R2 when new variable(s) are added to an ols regression model.
Returns the change statistics as well as statistics for both models. Based on "ols_t" (one way to make up for the lack of significance tests for coefficients in ols_t).
Default options (case insensitive):
CONST => 1,
Usage:
# suppose these are two persons' ratings for top 10 box office movies
# ascending sorted by box office
pdl> p $y = qsort ceil(random(10, 2) * 5)
[
[1 1 2 2 2 3 4 4 4 4]
[1 2 2 3 3 3 4 4 5 5]
]
# first IV is a simple linear trend
pdl> p $x1 = sequence 10
[0 1 2 3 4 5 6 7 8 9]
# the modeler wonders if adding a quadratic trend improves the fit
pdl> p $x2 = sequence(10) ** 2
[0 1 4 9 16 25 36 49 64 81]
# two different models are given in two pdls
# each as would be passed to ols_t
# the 1st model includes only linear trend
# the 2nd model includes linear and quadratic trends
# when necessary use dummy dim so both models have the same ndims
pdl> %c = $y->r2_change( $x1->dummy(1), cat($x1, $x2) )
pdl> p "$_\t$c{$_}\n" for sort keys %c
# person 1 person 2
F_change [0.72164948 0.071283096]
# df same for both persons
F_df [1 7]
F_p [0.42370145 0.79717232]
R2_change [0.0085966043 0.00048562549]
model0 HASH(0x8c10828)
model1 HASH(0x8c135c8)
# the answer here is no.
METHODS
anova
Analysis of variance. Uses type III sum of squares for unbalanced data.
The dependent variable should be a 1D pdl. Independent variables can be passed as 1D perl array refs or 1D pdls.
Will only calculate the p-value (F_p) if there are more samples than the product of the categories of all the IVs.
Supports bad values (missing or BAD values in the dependent and independent variables are ignored list-wise).
For more on ANOVA, see https://en.wikipedia.org/wiki/Analysis_of_variance.
Default options (case insensitive):
V => 1, # carps if bad value in variables
IVNM => [], # auto filled as ['IV_0', 'IV_1', ... ]
PLOT => 0, # plots highest order effect
# can set plot_means options here
WIN => undef, # for plotting
Usage:
# suppose this is ratings for 12 apples
pdl> p $y = qsort ceil( random(12)*5 )
[1 1 2 2 2 3 3 4 4 4 5 5]
# IV for types of apple
pdl> p $a = sequence(12) % 3 + 1
[1 2 3 1 2 3 1 2 3 1 2 3]
# IV for whether we baked the apple
pdl> @b = qw( y y y y y y n n n n n n )
pdl> %m = $y->anova( $a, \@b, { IVNM=>['apple', 'bake'] } )
pdl> p "$_\t@{[$m{$_} =~ /^\n*(.*?)\n*\z/s]}\n" for sort keys %m
F 2.46666666666667
F_df [5 6]
F_p 0.151168719948632
ms_model 3.08333333333333
ms_residual 1.25
ss_model 15.4166666666667
ss_residual 7.5
ss_total 22.9166666666667
| apple | F 0.466666666666667
| apple | F_p 0.648078345471096
| apple | df 2
| apple | m [2.75 3 3.5]
| apple | ms 0.583333333333334
| apple | se [0.85391256 0.81649658 0.64549722]
| apple | ss 1.16666666666667
| apple || err df 6
| apple ~ bake | F 0.0666666666666671
| apple ~ bake | F_p 0.936190104380701
| apple ~ bake | df 2
| apple ~ bake | m [
[1.5 2 2.5]
[ 4 4 4.5]
]
| apple ~ bake | ms 0.0833333333333339
| apple ~ bake | se [
[0.5 1 0.5]
[ 1 1 0.5]
]
| apple ~ bake | ss 0.166666666666668
| apple ~ bake || err df 6
| bake | F 11.2666666666667
| bake | F_p 0.015294126084452
| bake | df 1
| bake | m [2 4.1666667]
| bake | ms 14.0833333333333
| bake | se [0.36514837 0.40138649]
| bake | ss 14.0833333333333
| bake || err df 6
This is implemented as a call to "anova_rptd", with an undef subjects vector.
anova_rptd
Repeated measures and mixed model anova. Uses type III sum of squares.
The standard errors (se) for the means are based on the relevant mean squared error from the anova, i.e. pooled across levels of the effect. Will only calculate the p-value (F_p) if there are more samples than the product of the categories of all the IVs.
Uses "anova_design_matrix", so supports bad values.
For more on repeated measures ANOVA, see https://en.wikipedia.org/wiki/Repeated_measures_design, and for mixed models, see https://en.wikipedia.org/wiki/Mixed-design_analysis_of_variance.
Default options (case insensitive):
V => 1, # carps if bad value in dv
IVNM => [], # auto filled as ['IV_0', 'IV_1', ... ]
BTWN => [], # indices of between-subject IVs (matches IVNM indices)
PLOT => 0, # plots highest order effect
# see plot_means() for more options
WIN => undef, # for plotting
Usage:
Some fictional data: recall_w_beer_and_wings.txt
Subject Beer Wings Recall
Alex 1 1 8
Alex 1 2 9
Alex 1 3 12
Alex 2 1 7
Alex 2 2 9
Alex 2 3 12
Brian 1 1 12
Brian 1 2 13
Brian 1 3 14
Brian 2 1 9
Brian 2 2 8
Brian 2 3 14
...
# rtable allows text only in 1st row and col
my ($data, $idv, $subj) = rtable 'recall_w_beer_and_wings.txt';
my ($b, $w, $dv) = $data->dog;
# subj and IVs can be 1d pdl or @ ref
# subj must be the first argument
my %m = $dv->anova_rptd( $subj, $b, $w, {ivnm=>['Beer', 'Wings']} );
print "$_\t@{[$m{$_} =~ /^\n*(.*?)\n*\z/s]}\n" for sort keys %m
ss_residual 19.0833333333333
ss_subject 24.8333333333333
ss_total 133.833333333333
| Beer | F 9.39130434782609
| Beer | F_p 0.0547977008378944
| Beer | df 1
| Beer | m [10.916667 8.9166667]
| Beer | ms 24
| Beer | se [0.4614791 0.4614791]
| Beer | ss 24
| Beer || err df 3
| Beer || err ms 2.55555555555556
| Beer || err ss 7.66666666666667
| Beer ~ Wings | F 0.510917030567687
| Beer ~ Wings | F_p 0.623881438624431
| Beer ~ Wings | df 2
| Beer ~ Wings | m [
[ 10 7]
[ 10.5 9.25]
[12.25 10.5]
]
| Beer ~ Wings | ms 1.625
| Beer ~ Wings | se [
[0.89170561 0.89170561]
[0.89170561 0.89170561]
[0.89170561 0.89170561]
]
| Beer ~ Wings | ss 3.25000000000001
| Beer ~ Wings || err df 6
| Beer ~ Wings || err ms 3.18055555555555
| Beer ~ Wings || err ss 19.0833333333333
| Wings | F 4.52851711026616
| Wings | F_p 0.0632754786153548
| Wings | df 2
| Wings | m [8.5 9.875 11.375]
| Wings | ms 16.5416666666667
| Wings | se [0.67571978 0.67571978 0.67571978]
| Wings | ss 33.0833333333333
| Wings || err df 6
| Wings || err ms 3.65277777777778
| Wings || err ss 21.9166666666667
For mixed model anova, i.e. when between-subject IVs are involved, feed the IVs as above, but specify in BTWN which IVs are between-subject. For example, if we had added age as a between-subject IV in the above example, we would do:
my %m = $dv->anova_rptd( $subj, $age, $b, $w,
{ ivnm=>['Age', 'Beer', 'Wings'], btwn=>[0] });
anova_design_matrix
Effect-coded design matrix for anova, including repeated-measures and mixed-model designs. This is the X for use in linear regression, i.e. Y = Xβ + ε.
Added in 0.854. See https://en.wikipedia.org/wiki/Design_matrix for more.
Supports bad values in the dependent and independent variables. Bad data are automatically removed list-wise, i.e. a subject's data are removed if any cell is missing for that subject.
Default options (case insensitive):
V => 1, # carps if bad value in dv
IVNM => [], # auto filled as ['IV_0', 'IV_1', ... ]
BTWN => [], # indices of between-subject IVs (matches IVNM indices)
$matrix = $dv->anova_design_matrix(undef, $b, $w, {ivnm=>[qw(b w)]});
$matrix = $dv->anova_design_matrix(
$subj, $b, $w, {ivnm=>[qw(b w)]}); # repeated-measures
$matrix = $dv->anova_design_matrix(
$subj, $b, $w, {ivnm=>[qw(b w)], btwn=>['b']}); # mixed-model
($matrix, $ivnm_ref) = $dv->anova_design_matrix(
$subj, $b, $w, {ivnm=>[qw(b w)], btwn=>['b']}); # list context also names
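A minimal end-to-end sketch, reusing the beer/wings data read in the "anova_rptd" example above (the list-context return of effect names is assumed here to be an array ref):
my ($data, $idv, $subj) = rtable 'recall_w_beer_and_wings.txt';
my ($b, $w, $dv) = $data->dog;
my ($X, $ivnm) = $dv->anova_design_matrix( $subj, $b, $w, { ivnm=>[qw(Beer Wings)] } );
print $X->info, "\n";   # dims of the effect-coded design matrix
print "@$ivnm\n";       # names of the coded effect columns (assumed array ref)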
dummy_code
Dummy coding of nominal variable (perl @ ref or 1d pdl) for use in regression.
Supports BAD value (missing or 'BAD' values result in the corresponding pdl elements being marked as BAD).
pdl> @a = qw(a a a b b b c c c)
pdl> p $a = dummy_code(\@a)
[
[1 1 1 0 0 0 0 0 0]
[0 0 0 1 1 1 0 0 0]
]
effect_code
Unweighted effect coding of a nominal variable (perl @ ref or 1d pdl) for use in regression. In list context, returns the coded pdl and a hash ref mapping each level to its pdl->dim(1) index.
Supports BAD value (missing or 'BAD' values result in the corresponding pdl elements being marked as BAD).
my @var = qw( a a a b b b c c c );
my ($var_e, $map) = effect_code( \@var );
print $var_e . $var_e->info . "\n";
[
[ 1 1 1 0 0 0 -1 -1 -1]
[ 0 0 0 1 1 1 -1 -1 -1]
]
PDL: Double D [9,2]
print "$_\t$map->{$_}\n" for sort keys %$map
a 0
b 1
c 2
effect_code_w
Weighted effect coding of a nominal variable. In list context, returns the coded pdl and a hash ref mapping each level to its pdl->dim(1) index.
Supports BAD value (missing or 'BAD' values result in the corresponding pdl elements being marked as BAD).
pdl> @a = qw( a a b b b c c )
pdl> p $a = effect_code_w(\@a)
[
[ 1 1 0 0 0 -1 -1]
[ 0 0 1 1 1 -1.5 -1.5]
]
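A defining property of the weighted codes is that each coded column sums to zero even with unbalanced group sizes; a quick check on the pdl above:
pdl> p $a->sumover
[0 0]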
interaction_code
Returns the coded interaction term for effect-coded variables.
Supports BAD value (missing or 'BAD' values result in the corresponding pdl elements being marked as BAD).
pdl> $a = sequence(6) > 2
pdl> p $a = $a->effect_code
[
[ 1 1 1 -1 -1 -1]
]
pdl> $b = pdl( qw( 0 1 2 0 1 2 ) )
pdl> p $b = $b->effect_code
[
[ 1 0 -1 1 0 -1]
[ 0 1 -1 0 1 -1]
]
pdl> p $ab = interaction_code( $a, $b )
[
[ 1 0 -1 -1 -0 1]
[ 0 1 -1 -0 -1 1]
]
ols
Ordinary least squares regression, aka linear regression.
Unlike ols_t, ols is not broadcastable, but it can handle bad values (observations with bad values in the dependent or independent variables are ignored list-wise) and returns the full model in list context with various statistics.
IVs ($x) should be a pdl with dims $y->nelem or $y->nelem x n_iv. Do not supply the constant vector in $x. The intercept is automatically added and returned as FIRST of the coefficients if CONST=>1. Returns the full model in list context and the coefficients in scalar context.
For more on multiple linear regression see https://en.wikipedia.org/wiki/Multiple_linear_regression.
Default options (case insensitive):
CONST => 1,
PLOT => 0, # see plot_residuals() for plot options
WIN => undef, # for plotting
Usage:
# suppose this is a person's ratings for top 10 box office movies
# ascending sorted by box office
pdl> $y = pdl '[1 1 2 2 2 2 4 4 5 5]'
# construct IV with linear and quadratic component
pdl> p $x = cat sequence(10), sequence(10)**2
[
[ 0 1 2 3 4 5 6 7 8 9]
[ 0 1 4 9 16 25 36 49 64 81]
]
pdl> %m = $y->ols( $x )
pdl> p "$_\t@{[$m{$_} =~ /^\n*(.*?)\n*\z/s]}\n" for sort keys %m
F 40.4225352112676
F_df [2 7]
F_p 0.000142834216344756
R2 0.920314253647587
# coeff constant linear quadratic
b [0.981818 0.212121 0.030303]
b_p [0.039910 0.328001 0.203034]
b_se [0.389875 0.201746 0.021579]
b_t [2.518284 1.051422 1.404218]
ss_model 19.8787878787879
ss_residual 1.72121212121212
ss_total 21.6
y_pred [0.98181818 1.2242424 1.5272727 ... 4.6181818 5.3454545]
ols_rptd
Repeated measures linear regression (Lorch & Myers, 1990; Van den Noortgate & Onghena, 2006). Handles purely within-subject designs for now. See t/glm.t for an example using the Lorch and Myers data.
Usage:
# This is the example from Lorch and Myers (1990),
# a study on how characteristics of sentences affected reading time
# Three within-subject IVs:
# SP -- serial position of sentence
# WORDS -- number of words in sentence
# NEW -- number of new arguments in sentence
# $subj can be 1D pdl or @ ref and must be the first argument
# IV can be 1D @ ref or pdl
# 1D @ ref is effect coded internally into pdl
# pdl is left as is
my %r = $rt->ols_rptd( $subj, $sp, $words, $new );
print "$_\t$r{$_}\n" for sort keys %r;
ss_residual 58.3754646504336
ss_subject 51.8590337714286
ss_total 405.188241771429
# SP WORDS NEW
F [ 7.208473 61.354153 1.0243311]
F_p [0.025006181 2.619081e-05 0.33792837]
coeff [0.33337285 0.45858933 0.15162986]
df [1 1 1]
df_err [9 9 9]
ms [ 18.450705 73.813294 0.57026483]
ms_err [ 2.5595857 1.2030692 0.55671923]
ss [ 18.450705 73.813294 0.57026483]
ss_err [ 23.036272 10.827623 5.0104731]
logistic
Logistic regression with maximum likelihood estimation using PDL::Fit::LM.
IVs ($x) should be a pdl with dims $y->nelem or $y->nelem x n_iv. Do not supply the constant vector in $x. It is included in the model and returned as LAST of the coefficients. Returns the full model in list context and the coefficients in scalar context.
The significance tests are likelihood-ratio (-2LL deviance) tests. IV significance is tested by comparing the deviances of the reduced model (i.e. with the IV in question removed) and the full model.
***NOTE: the results here are qualitatively similar to but not identical with results from R, because different algorithms are used for the nonlinear parameter fit. Use with discretion***
Default options (case insensitive):
INITP => zeroes( $x->dim(1) + 1 ), # n_iv + 1
MAXIT => 1000,
EPS => 1e-7,
Usage:
# suppose this is whether a person had rented 10 movies
pdl> p $y = ushort( random(10)*2 )
[0 0 0 1 1 0 0 1 1 1]
# IV 1 is box office ranking
pdl> p $x1 = sequence(10)
[0 1 2 3 4 5 6 7 8 9]
# IV 2 is whether the movie is action- or chick-flick
pdl> p $x2 = sequence(10) % 2
[0 1 0 1 0 1 0 1 0 1]
# concatenate the IVs together
pdl> p $x = cat $x1, $x2
[
[0 1 2 3 4 5 6 7 8 9]
[0 1 0 1 0 1 0 1 0 1]
]
pdl> %m = $y->logistic( $x )
pdl> p "$_\t$m{$_}\n" for sort keys %m
D0 13.8629436111989
Dm 9.8627829791575
Dm_chisq 4.00016063204141
Dm_df 2
Dm_p 0.135324414081692
# ranking genre constant
b [0.41127706 0.53876358 -2.1201285]
b_chisq [ 3.5974504 0.16835559 2.8577151]
b_p [0.057868258 0.6815774 0.090936587]
iter 12
y_pred [0.10715577 0.23683909 ... 0.76316091 0.89284423]
# to get the covariance out, supply a true value for the COV option:
pdl> %m = $y->logistic( $x, {COV=>1} )
pdl> p $m{cov};
pca
Principal component analysis. Based on the correlation matrix (corr) rather than the covariance matrix by default.
Bad values are ignored pair-wise (OK when bad values are few, but otherwise consider fill_m etc. before pca). Uses "eigens_sym" in PDL::MatrixOps.
Default options (case insensitive):
CORR => 1, # boolean. use correlation or covariance
PLOT => 0, # calls plot_screes by default
# can set plot_screes options here
WIN => undef, # for plotting
Usage:
my $d = qsort random 10, 5; # 10 obs on 5 variables
my %r = $d->pca( \%opt );
print "$_\t$r{$_}\n" for (keys %r);
eigenvalue # variance accounted for by each component
[4.70192 0.199604 0.0471421 0.0372981 0.0140346]
eigenvector # dim var x comp. weights for mapping variables to component
[
[ -0.451251 -0.440696 -0.457628 -0.451491 -0.434618]
[ -0.274551 0.582455 0.131494 0.255261 -0.709168]
[ 0.43282 0.500662 -0.139209 -0.735144 -0.0467834]
[ 0.693634 -0.428171 0.125114 0.128145 -0.550879]
[ 0.229202 0.180393 -0.859217 0.4173 0.0503155]
]
loadings # dim var x comp. correlation between variable and component
[
[ -0.978489 -0.955601 -0.992316 -0.97901 -0.942421]
[ -0.122661 0.260224 0.0587476 0.114043 -0.316836]
[ 0.0939749 0.108705 -0.0302253 -0.159616 -0.0101577]
[ 0.13396 -0.0826915 0.0241629 0.0247483 -0.10639]
[ 0.027153 0.0213708 -0.101789 0.0494365 0.00596076]
]
pct_var # percent variance accounted for by each component
[0.940384 0.0399209 0.00942842 0.00745963 0.00280691]
Plot the scores along the first two components:
$d->plot_scores( $r{eigenvector} );
pca_sorti
Determines by which variables a component is best represented: sorts the variables in descending order by the size of their association with that component. Returns the sorted variable indices and the relevant component indices.
Default options (case insensitive):
NCOMP => 10, # maximum number of components to consider
Usage:
# let's see if we replicated the Osgood et al. (1957) study
pdl> ($data, $idv, $ido) = rtable 'osgood_exp.csv', {v=>0}
# select a subset of var to do pca
pdl> $ind = which_id $idv, [qw( ACTIVE BASS BRIGHT CALM FAST GOOD HAPPY HARD LARGE HEAVY )]
pdl> $data = $data( ,$ind)->sever
pdl> @$idv = @$idv[list $ind]
pdl> %m = $data->pca
pdl> ($iv, $ic) = $m{loadings}->pca_sorti()
pdl> p "$idv->[$_]\t" . $m{loadings}->($_,$ic)->flat . "\n" for (list $iv)
# COMP0 COMP1 COMP2 COMP3
HAPPY [0.860191 0.364911 0.174372 -0.10484]
GOOD [0.848694 0.303652 0.198378 -0.115177]
CALM [0.821177 -0.130542 0.396215 -0.125368]
BRIGHT [0.78303 0.232808 -0.0534081 -0.0528796]
HEAVY [-0.623036 0.454826 0.50447 0.073007]
HARD [-0.679179 0.0505568 0.384467 0.165608]
ACTIVE [-0.161098 0.760778 -0.44893 -0.0888592]
FAST [-0.196042 0.71479 -0.471355 0.00460276]
LARGE [-0.241994 0.594644 0.634703 -0.00618055]
BASS [-0.621213 -0.124918 0.0605367 -0.765184]
plot_means
Plots means ANOVA-style. Can handle up to 4-way interactions (i.e. a 4D pdl).
Default options (case insensitive):
IVNM => ['IV_0', 'IV_1', 'IV_2', 'IV_3'],
DVNM => 'DV',
AUTO => 1, # auto set dims to be on x-axis, line, panel
# if set 0, dim 0 goes on x-axis, dim 1 as lines
# dim 2+ as panels
# see PDL::Graphics::Simple for next option
WIN => undef, # pgswin object. not closed here if passed
# allows comparing multiple lines in same plot
SIZE => 640, # individual square panel size in pixels
Usage:
# see anova for mean / se pdl structure
$mean->plot_means( $se, {IVNM=>['apple', 'bake']} );
Or like this:
$m{'| apple ~ bake | m'}->plot_means;
plot_residuals
Plots residuals against predicted values.
Usage:
use PDL::Graphics::Simple;
$w = pgswin();
$y->plot_residuals( $y_pred, { win=>$w } );
Default options (case insensitive):
# see PDL::Graphics::Simple for more info
WIN => undef, # pgswin object. not closed here if passed
# allows comparing multiple lines in same plot
SIZE => 640, # plot size in pixels
COLOR => 1,
plot_scores
Plots standardized original and PCA transformed scores against two components. (Thank you, Bob MacCallum, for the documentation suggestion that led to this function.)
Default options (case insensitive):
CORR => 1, # boolean. PCA was based on correlation or covariance
COMP => [0,1], # indices to components to plot
# see PDL::Graphics::Simple for next options
WIN => undef, # pgswin object. not closed here if passed
# allows comparing multiple lines in same plot
SIZE => 640, # plot size in pixels
COLOR => [1,2], # color for original and rotated scores
Usage:
my %p = $data->pca();
$data->plot_scores( $p{eigenvector}, \%opt );
plot_screes
Scree plot. Plots proportion of variance accounted for by PCA components.
Default options (case insensitive):
NCOMP => 20, # max number of components to plot
CUT => 0, # set to plot cutoff line after this many components
# undef to plot suggested cutoff line for NCOMP comps
# see PDL::Graphics::Simple for next options
WIN => undef, # pgswin object. not closed here if passed
# allows comparing multiple lines in same plot
SIZE => 640, # plot size in pixels
Usage:
# variance should be in descending order
$d = qsort random 10, 5; # 10 obs on 5 variables
%pca = $d->pca( \%opt );
$pca{pct_var}->plot_screes( {ncomp=>16, win=>$win=PDL::Graphics::Simple::pgswin()} );
Or, because NCOMP is used so often, it is allowed a shortcut,
$pca{pct_var}->plot_screes( 16 );
plot_stripchart
Stripchart plot. Plots ANOVA-style data, categorised against given IVs.
Default options (case insensitive):
IVNM => [], # auto filled as ['IV_0', 'IV_1', ... ]
DVNM => 'DV',
# see PDL::Graphics::Simple for next options
WIN => undef, # pgswin object. not closed here if passed
Usage:
%m = $y->plot_stripchart( $a, \@b, { IVNM=>[qw(apple bake)] } );
SEE ALSO
REFERENCES
Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied Multiple Regression/correlation Analysis for the Behavioral Sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Hosmer, D.W., & Lemeshow, S. (2000). Applied Logistic Regression (2nd ed.). New York, NY: Wiley-Interscience.
Lorch, R.F., & Myers, J.L. (1990). Regression analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 149-157.
Osgood C.E., Suci, G.J., & Tannenbaum, P.H. (1957). The Measurement of Meaning. Champaign, IL: University of Illinois Press.
Rutherford, A. (2011). ANOVA and ANCOVA: A GLM Approach (2nd ed.). Wiley.
Shlens, J. (2009). A Tutorial on Principal Component Analysis. Retrieved April 10, 2011 from http://citeseerx.ist.psu.edu/
The GLM procedure: unbalanced ANOVA for two-way design with interaction. (2008). SAS/STAT(R) 9.2 User's Guide. Retrieved June 18, 2009 from http://support.sas.com/
Van den Noortgate, W., & Onghena, P. (2006). Analysing repeated measures data in cognitive research: A comment on regression coefficient analyses. European Journal of Cognitive Psychology, 18, 937-952.
AUTHOR
Copyright (C) 2009 Maggie J. Xiong <maggiexyz users.sourceforge.net>
All rights reserved. There is no warranty. You are allowed to redistribute this software / documentation as described in the file COPYING in the PDL distribution.