NAME
PDL::Stats::GLM -- general linear modeling methods and logistic regression
DESCRIPTION
The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are threadable and methods that are NOT threadable, respectively. FUNCTIONS except ols_t support bad value. PDL::Slatec strongly recommended for most METHODS, and it is required for logistic.
P-values, where appropriate, are provided if PDL::GSL::CDF is installed.
SYNOPSIS
use PDL::LiteF;
use PDL::NiceSlice;
use PDL::Stats::GLM;
# do a multiple linear regression and plot the residuals
my $y = random 10;
my $x1 = sequence 10;
my $x2 = $x1 ** 2;
my $iv = cat $x1, $x2;
my %m = $y->ols( $iv );
print "$_\t$m{$_}\n" for (sort keys %m);
use PDL::Graphics::PGPLOT::Window;
my $win = pgwin( 'xs' );
$win->points( $y - $m{y_pred} );
FUNCTIONS
fill_m
Signature: (a(n); float+ [o]b(n))
Replaces bad values with sample mean. Mean is set to 0 if all obs are bad. Can be done inplace.
perldl> p $data
[
[ 5 BAD 2 BAD]
[ 7 3 7 BAD]
]
perldl> p $data->fill_m
[
[ 5 3.5 2 3.5]
[ 7 3 7 5.66667]
]
The output pdl badflag is cleared.
fill_rand
Signature: (a(n); [o]b(n))
Replaces bad values with random sample (with replacement) of good observations from the same variable. Can be done inplace.
perldl> p $data
[
[ 5 BAD 2 BAD]
[ 7 3 7 BAD]
]
perldl> p $data->fill_rand
[
[5 2 2 5]
[7 3 7 7]
]
The output pdl badflag is cleared.
dev_m
Signature: (a(n); float+ [o]b(n))
Replaces values with deviations from the mean. Can be done inplace.
dev_m does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
stddz
Signature: (a(n); float+ [o]b(n))
Standardize ie replace values with z_scores based on sample standard deviation from the mean (replace with 0s if stdv==0). Can be done inplace.
stddz does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
sse
Signature: (a(n); b(n); float+ [o]c())
Sum of squared errors between actual and predicted values.
sse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
mse
Signature: (a(n); b(n); float+ [o]c())
Mean of squared errors between actual and predicted values, ie variance around predicted value.
mse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
rmse
Signature: (a(n); b(n); float+ [o]c())
Root mean squared error, ie stdv around predicted value.
rmse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
pred_logistic
Signature: (a(n,m); b(m); float+ [o]c(n))
Calculates predicted prob value for logistic regression.
# glue constant then apply coeff returned by the logistic method
$pred = $x->glue(1,ones($x->dim(0)))->pred_logistic( $m{b} );
pred_logistic does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
d0
Signature: (a(n); float+ [o]c())
my $d0 = $y->d0();
Null deviance for logistic regression.
d0 does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
dm
Signature: (a(n); b(n); float+ [o]c())
my $dm = $y->dm( $y_pred );
# null deviance
my $d0 = $y->dm( ones($y->nelem) * $y->avg );
Model deviance for logistic regression.
dm does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
dvrs
Signature: (a(); b(); float+ [o]c())
Deviance residual for logistic regression.
dvrs does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
ols_t
Threaded version of ordinary least squares regression (ols). The price of threading was losing significance tests for coefficients (but see r2_change). The fitting function was shamelessly copied then modified from PDL::Fit::Linfit. Uses PDL::Slatec when possible but otherwise uses PDL::MatrixOps. Intercept is LAST of coeff if CONST => 1.
ols_t does not handle bad values. consider fill_m or fill_rand if there are bad values.
Default options (case insensitive):
CONST => 1,
Usage:
# DV, 2 person's ratings for top-10 box office movies
# ascending sorted by box office numbers
perldl> p $y = qsort ceil( random(10, 2)*5 )
[
[1 1 2 4 4 4 4 5 5 5]
[1 2 2 2 3 3 3 3 5 5]
]
# model with 2 IVs, a linear and a quadratic trend component
perldl> $x = cat sequence(10), sequence(10)**2
# suppose our novice modeler thinks this creates 3 different models
# for predicting movie ratings
perldl> p $x = cat $x, $x * 2, $x * 3
[
[
[ 0 1 2 3 4 5 6 7 8 9]
[ 0 1 4 9 16 25 36 49 64 81]
]
[
[ 0 2 4 6 8 10 12 14 16 18]
[ 0 2 8 18 32 50 72 98 128 162]
]
[
[ 0 3 6 9 12 15 18 21 24 27]
[ 0 3 12 27 48 75 108 147 192 243]
]
]
perldl> p $x->info
PDL: Double D [10,2,3]
# insert a dummy dim between IV and the dim (model) to be threaded
perldl> %m = $y->ols_t( $x->dummy(2) )
perldl> p "$_\t$m{$_}\n" for (sort keys %m)
# 2 persons' ratings, eached fitted with 3 "different" models
F
[
[ 38.314159 25.087209]
[ 38.314159 25.087209]
[ 38.314159 25.087209]
]
# df is the same across dv and iv models
F_df [2 7]
F_p
[
[0.00016967051 0.00064215074]
[0.00016967051 0.00064215074]
[0.00016967051 0.00064215074]
]
R2
[
[ 0.9162963 0.87756762]
[ 0.9162963 0.87756762]
[ 0.9162963 0.87756762]
]
b
[ # linear quadratic constant
[
[ 0.99015152 -0.056818182 0.66363636] # person 1
[ 0.18939394 0.022727273 1.4] # person 2
]
[
[ 0.49507576 -0.028409091 0.66363636]
[ 0.09469697 0.011363636 1.4]
]
[
[ 0.33005051 -0.018939394 0.66363636]
[ 0.063131313 0.0075757576 1.4]
]
]
# our novice modeler realizes at this point that
# the 3 models only differ in the scaling of the IV coefficients
ss_model
[
[ 20.616667 13.075758]
[ 20.616667 13.075758]
[ 20.616667 13.075758]
]
ss_residual
[
[ 1.8833333 1.8242424]
[ 1.8833333 1.8242424]
[ 1.8833333 1.8242424]
]
ss_total [22.5 14.9]
y_pred
[
[
[0.66363636 1.5969697 2.4166667 3.1227273 ... 4.9727273]
...
r2_change
Significance test for the incremental change in R2 when new variable(s) are added to an ols regression model. Returns the change stats as well as stats for both models. Base on ols_t. (One way to make up for the lack of significance tests for coeffs in ols_t).
Default options (case insensitive):
CONST => 1,
Usage:
# suppose these are two persons' ratings for top 10 box office movies
# ascending sorted by box office
perldl> p $y = qsort ceil(random(10, 2) * 5)
[
[1 1 2 2 2 3 4 4 4 4]
[1 2 2 3 3 3 4 4 5 5]
]
# first IV is a simple linear trend
perldl> p $x1 = sequence 10
[0 1 2 3 4 5 6 7 8 9]
# the modeler wonders if adding a quadratic trend improves the fit
perldl> p $x2 = sequence(10) ** 2
[0 1 4 9 16 25 36 49 64 81]
# two difference models are given in two pdls
# each as would be pass on to ols_t
# the 1st model includes only linear trend
# the 2nd model includes linear and quadratic trends
# when necessary use dummy dim so both models have the same ndims
perldl> %c = $y->r2_change( $x1->dummy(1), cat($x1, $x2) )
perldl> p "$_\t$c{$_}\n" for (sort keys %c)
# person 1 person 2
F_change [0.72164948 0.071283096]
# df same for both persons
F_df [1 7]
F_p [0.42370145 0.79717232]
R2_change [0.0085966043 0.00048562549]
model0 HASH(0x8c10828)
model1 HASH(0x8c135c8)
# the answer here is no.
METHODS
anova
Analysis of variance. Uses type III sum of squares for unbalanced data.
anova supports bad value in the dependent variable.
Default options (case insensitive):
V => 1, # carps if bad value in dv
IVNM => undef, # auto filled as ['IV_0', 'IV_1', ... ]
PLOT => 1, # plots highest order effect
# can set plot_means options here
Usage:
# suppose this is ratings for 12 apples
perldl> p $y = qsort ceil( random(12)*5 )
[1 1 2 2 2 3 3 4 4 4 5 5]
# IV for types of apple
perldl> p $a = sequence(12) % 3 + 1
[1 2 3 1 2 3 1 2 3 1 2 3]
# IV for whether we baked the apple
perldl> @b = qw( y y y y y y n n n n n n )
perldl> %m = $y->anova( $a, \@b, { IVNM=>['apple', 'bake'] } )
perldl> p "$_\t$m{$_}\n" for (sort keys %m)
# apple # m
[
[2.5 3 3.5]
]
# apple # se
[
[0.64549722 0.91287093 0.64549722]
]
# apple ~ bake # m
[
[1.5 1.5 2.5]
[3.5 4.5 4.5]
]
# apple ~ bake # se
[
[0.5 0.5 0.5]
[0.5 0.5 0.5]
]
# bake # m
[
[ 1.8333333 4.1666667]
]
# bake # se
[
[0.30731815 0.30731815]
]
F 7.6
F_df [5 6]
F_p 0.0141586545851857
ms_model 3.8
ms_residual 0.5
ss_model 19
ss_residual 3
ss_total 22
| apple | F 2
| apple | F_df [2 6]
| apple | F_p 0.216
| apple | ms 1
| apple | ss 2
| apple ~ bake | F 0.666666666666667
| apple ~ bake | F_df [2 6]
| apple ~ bake | F_p 0.54770848985725
| apple ~ bake | ms 0.333333333333334
| apple ~ bake | ss 0.666666666666667
| bake | F 32.6666666666667
| bake | F_df [1 6]
| bake | F_p 0.00124263849516693
| bake | ms 16.3333333333333
| bake | ss 16.3333333333333
anova_rptd
Repeated measures anova. Uses type III sum of squares. The standard error (se) for the means are based on the relevant mean squared error from the anova, ie it is pooled across levels of the effect.
anova_rptd supports bad value in the dependent variable. It automatically removes bad data listwise, ie remove a subject's data if there is any cell missing for the subject.
Between-subject factor support upcoming. Stay tuned.
Default options (case insensitive):
V => 1, # carps if bad value in dv
IVNM => undef, # auto filled as ['IV_0', 'IV_1', ... ]
PLOT => 1, # plots highest order effect
# can set plot_means options here
Usage:
Some fictional data: recall_w_beer_and_wings.txt
Subject Beer Wings Recall
Alex 1 1 8
Alex 1 2 9
Alex 1 3 12
Alex 2 1 7
Alex 2 2 9
Alex 2 3 12
Brian 1 1 12
Brian 1 2 13
Brian 1 3 14
Brian 2 1 9
Brian 2 2 8
Brian 2 3 14
...
# get_data allows text only in 1st row and col
my ($data, $idv, $subj) = get_data 'recall_w_beer_and_wings.txt';
my ($b, $w, $dv) = $data->dog;
# subj and ivs can be 1d pdl or @ ref
my %m = $dv->anova_rptd( $subj, $b, $w, {ivnm=>['Beer', 'Wings']} );
print "$_\t$m{$_}\n" for (sort keys %m);
# Beer # m
[
[ 10.916667 8.9166667]
]
# Beer # se
[
[ 0.4614791 0.4614791]
]
# Beer ~ Wings # m
[
[ 10 7]
[ 10.5 9.25]
[12.25 10.5]
]
# Beer ~ Wings # se
[
[0.89170561 0.89170561]
[0.89170561 0.89170561]
[0.89170561 0.89170561]
]
# Wings # m
[
[ 8.5 9.875 11.375]
]
# Wings # se
[
[0.67571978 0.67571978 0.67571978]
]
ss_residual 19.0833333333333
ss_subject 24.8333333333333
ss_total 133.833333333333
| Beer | F 9.39130434782609
| Beer | F_p 0.0547977008378944
| Beer | df 1
| Beer | ms 24
| Beer | ss 24
| Beer || err df 3
| Beer || err ms 2.55555555555556
| Beer || err ss 7.66666666666667
| Beer ~ Wings | F 0.510917030567687
| Beer ~ Wings | F_p 0.623881438624431
| Beer ~ Wings | df 2
| Beer ~ Wings | ms 1.625
| Beer ~ Wings | ss 3.25000000000001
| Beer ~ Wings || err df 6
| Beer ~ Wings || err ms 3.18055555555555
| Beer ~ Wings || err ss 19.0833333333333
| Wings | F 4.52851711026616
| Wings | F_p 0.0632754786153548
| Wings | df 2
| Wings | ms 16.5416666666667
| Wings | ss 33.0833333333333
| Wings || err df 6
| Wings || err ms 3.65277777777778
| Wings || err ss 21.9166666666667
dummy_code
Dummy coding of nominal variable (perl @ ref or 1d pdl) for use in regression.
perldl> @a = qw(a a a b b b c c c)
perldl> p $a = dummy_code(\@a)
[
[1 1 1 0 0 0 0 0 0]
[0 0 0 1 1 1 0 0 0]
]
effect_code
Unweighted effect coding of nominal variable (perl @ ref or 1d pdl) for use in regression. returns in @ context coded pdl and % ref to level - pdl->dim(1) index.
my @var = qw( a a a b b b c c c );
my ($var_e, $map) = effect_code( \@var );
print $var_e . $var_e->info . "\n";
[
[ 1 1 1 0 0 0 -1 -1 -1]
[ 0 0 0 1 1 1 -1 -1 -1]
]
PDL: Double D [9,2]
print "$_\t$map->{$_}\n" for (sort keys %$map)
a 0
b 1
c 2
effect_code_w
Weighted effect code for nominal variable. returns in @ context coded pdl and % ref to level - pdl->dim(1) index.
perldl> @a = qw( a a b b b c c )
perldl> p $a = effect_code_w(\@a)
[
[ 1 1 0 0 0 -1 -1]
[ 0 0 1 1 1 -1.5 -1.5]
]
ols
Ordinary least squares regression, aka linear regression. Unlike ols_t, ols returns the full model in list context with various stats, but is not threadable.
IVs ($x) should be pdl dims $y->nelem or $y->nelem x n_iv. Do not supply the constant vector in $x. Intercept is automatically added and returned as LAST of the coeffs if CONST=>1. Returns full model in list context and coeff in scalar context.
Default options (case insensitive):
CONST => 1,
Usage:
# suppose this is a person's ratings for top 10 box office movies
# ascending sorted by box office
perldl> p $y = qsort ceil( random(10) * 5 )
[1 1 2 2 2 2 4 4 5 5]
# construct IV with linear and quadratic component
perldl> p $x = cat sequence(10), sequence(10)**2
[
[ 0 1 2 3 4 5 6 7 8 9]
[ 0 1 4 9 16 25 36 49 64 81]
]
perldl> %m = $y->ols( $x )
perldl> p "$_\t$m{$_}\n" for (sort keys %m)
F 40.4225352112676
F_df [2 7]
F_p 0.000142834216344756
R2 0.920314253647587
# coeff linear quadratic constant
b [0.21212121 0.03030303 0.98181818]
b_p [0.32800118 0.20303404 0.039910509]
b_se [0.20174693 0.021579989 0.38987581]
b_t [ 1.0514223 1.404219 2.5182844]
ss_model 19.8787878787879
ss_residual 1.72121212121212
ss_total 21.6
y_pred [0.98181818 1.2242424 1.5272727 ... 4.6181818 5.3454545]
logistic
Logistic regression with maximum likelihood estimation using PDL::Fit::LM (requires PDL::Slatec. Hence loaded with "require" in the sub instead of "use" at the beginning).
IVs ($x) should be pdl dims $y->nelem or $y->nelem x n_iv. Do not supply the constant vector in $x. It is included in the model and returned as LAST of coeff. Returns full model in list context and coeff in scalar context.
The significance tests are likelihood ratio tests (-2LL deviance) tests. IV significance is tested by comparing deviances between the reduced model (ie with the IV in question removed) and the full model.
***NOTE: the results here are qualitatively similar to but not identical with results from R, because different algorithms are used for the nonlinear parameter fit. Use with discretion***
Default options (case insensitive):
INITP => zeroes( $x->dim(1) + 1 ), # n_iv + 1
MAXIT => 1000,
EPS => 1e-7,
Usage:
# suppose this is whether a person had rented 10 movies
perldl> p $y = ushort( random(10)*2 )
[0 0 0 1 1 0 0 1 1 1]
# IV 1 is box office ranking
perldl> p $x1 = sequence(10)
[0 1 2 3 4 5 6 7 8 9]
# IV 2 is whether the movie is action- or chick-flick
perldl> p $x2 = sequence(10) % 2
[0 1 0 1 0 1 0 1 0 1]
# concatenate the IVs together
perldl> p $x = cat $x1, $x2
[
[0 1 2 3 4 5 6 7 8 9]
[0 1 0 1 0 1 0 1 0 1]
]
perldl> %m = $y->logistic( $x )
perldl> p "$_\t$m{$_}\n" for (sort keys %m)
D0 13.8629436111989
Dm 9.8627829791575
Dm_chisq 4.00016063204141
Dm_df 2
Dm_p 0.135324414081692
# ranking genre constant
b [0.41127706 0.53876358 -2.1201285]
b_chisq [ 3.5974504 0.16835559 2.8577151]
b_p [0.057868258 0.6815774 0.090936587]
iter 12
y_pred [0.10715577 0.23683909 ... 0.76316091 0.89284423]
pca
Principal component analysis. $data is pdl dim obs x var. Output loading (corr between var and component) and score are pdls dim var x component. value and variance are pdls dim component.
Based on corr instead of cov (bad values are ignored pair-wise. OK when bad values are few but otherwise probably should fill_m etc before pca). Use PDL::Slatec::eigsys() if installed, otherwise use PDL::MatrixOps::eigens_sym(). Added loadings and descending sorted component by $value (ie variance accouted for).
Default options (case insensitive):
PLOT => 1, # scree plot for var accounted for
# can set plot_scree options here
Usage:
my $data = random 100, 20; # 100 obs on 20 var
my %result = $data->pca;
print "$_\t$result{$_}\n" for (keys %result);
pca_sorti
Determine by which vars a component is best represented. Descending sort vars by size of association with that component. Returns sorted var and relevant component indices.
Default options (case insensitive):
NCOMP => 10, # maximum number of components to consider
Usage:
# let's see if we replicated the Osgood et al. (1957) study
perldl> ($data, $idv, $ido) = get_data 'osgood_exp.csv', {v=>0}
# select a subset of var to do pca
perldl> $ind = which_id $idv, [qw( ACTIVE BASS BRIGHT CALM FAST GOOD HAPPY HARD LARGE HEAVY )]
perldl> $data = $data( ,$ind)->sever
perldl> @$idv = @$idv[list $ind]
perldl> %m = $data->pca
perldl> ($iv, $ic) = $m{loading}->pca_sorti()
perldl> p "$idv->[$_]\t" . $m{loading}->($_,$ic)->flat . "\n" for (list $iv)
# COMP0 COMP1 COMP2 COMP3
HAPPY [0.860191 0.364911 0.174372 -0.10484]
GOOD [0.848694 0.303652 0.198378 -0.115177]
CALM [0.821177 -0.130542 0.396215 -0.125368]
BRIGHT [0.78303 0.232808 -0.0534081 -0.0528796]
HEAVY [-0.623036 0.454826 0.50447 0.073007]
HARD [-0.679179 0.0505568 0.384467 0.165608]
ACTIVE [-0.161098 0.760778 -0.44893 -0.0888592]
FAST [-0.196042 0.71479 -0.471355 0.00460276]
LARGE [-0.241994 0.594644 0.634703 -0.00618055]
BASS [-0.621213 -0.124918 0.0605367 -0.765184]
plot_means
Plots means anova style. Can handle up to 4-way interactions (ie 4D pdl).
Default options (case insensitive):
IVNM => ['IV_0', 'IV_1', 'IV_2', 'IV_3'],
DVNM => 'DV',
AUTO => 1, # auto set dims to be on x-axis, line, panel
# if set 0, dim 0 goes on x-axis, dim 1 as lines
# dim 2+ as panels
# see PDL::Graphics::PGPLOT::Window for next options
WIN => undef, # pgwin object. not closed here if passed
# allows comparing multiple lines in same plot
# set env before passing WIN
DEV => '/xs', # open and close dev for plotting if no WIN
SIZE => 480, # individual square panel size in pixels
SYMBL => [0, 4, 7, 11],
Usage:
# see anova for mean / se pdl structure
$mean->plot_means( $se, {IVNM=>['apple', 'bake']} );
Or like this:
$m{'# apple ~ bake # m'}->plot_means;
plot_scree
Scree plot. Plots proportion of variance accounted for by PCA components.
Default options (case insensitive):
NCOMP => 20, # max number of components to plot
CUT => 0, # set to plot cutoff line after this many components
# undef to plot suggested cutoff line for NCOMP comps
# see PDL::Graphics::PGPLOT::Window for next options
WIN => undef, # pgwin object. not closed here if passed
# allows comparing multiple lines in same plot
# set env before passing WIN
DEV => '/xs', # open and close dev for plotting if no WIN
SIZE => 480, # plot size in pixels
COLOR => 1,
Usage:
# variance should be in descending order
$pca{var}->plot_scree( {ncomp=>16} );
Or, because NCOMP is used so often, it is allowed a shortcut,
$pca{var}->plot_scree( 16 );
SEE ALSO
PDL::Fit::Linfit
PDL::Fit::LM
REFERENCES
Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied Multiple Regression/correlation Analysis for the Behavioral Sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Hosmer, D.W., & Lemeshow, S. (2000). Applied Logistic Regression (2nd ed.). New York, NY: Wiley-Interscience.
Osgood C.E., Suci, G.J., & Tannenbaum, P.H. (1957). The Measurement of Meaning. Champaign, IL: University of Illinois Press.
Rutherford, A. (2001). Introducing Anova and Ancova: A GLM Approach (1st ed.). Thousand Oaks, CA: Sage Publications.
The GLM procedure: unbalanced ANOVA for two-way design with interaction. (2008). SAS/STAT(R) 9.2 User's Guide. Retrieved June 18, 2009 from http://support.sas.com/
AUTHOR
Copyright (C) 2009 Maggie J. Xiong <maggiexyz users.sourceforge.net>
All rights reserved. There is no warranty. You are allowed to redistribute this software / documentation as described in the file COPYING in the PDL distribution.