NAME

PDL::Stats::GLM -- general linear modeling methods and logistic regression

DESCRIPTION

The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are threadable and methods that are NOT threadable, respectively. FUNCTIONS except ols_t support bad value. PDL::Slatec strongly recommended for most METHODS, and it is required for logistic.

P-values, where appropriate, are provided if PDL::GSL::CDF is installed.

SYNOPSIS

use PDL::LiteF;
use PDL::NiceSlice;
use PDL::Stats::GLM;

# do a multiple linear regression and plot the residuals

my $y  = random 10;

my $x1 = sequence 10;
my $x2 = $x1 ** 2;
my $iv = cat $x1, $x2;

my %m  = $y->ols( $iv );
print "$_\t$m{$_}\n" for (sort keys %m);

use PDL::Graphics::PGPLOT::Window;

my $win = pgwin( 'xs' );
$win->points( $y - $m{y_pred} );

FUNCTIONS

fill_m

Signature: (a(n); float+ [o]b(n))
perldl> p $data
[
 [  5 BAD   2 BAD]
 [  7   3   7 BAD]
]

perldl> p $data->fill_m
[
 [      5     3.5       2     3.5]
 [      7       3       7 5.66667]
] 

replaces bad values with sample mean. can be done inplace.

The output pdl badflag is cleared.

fill_rand

Signature: (a(n); [o]b(n))

Replaces bad values with random sample (with replacement) of good observations from the same variable. can be done inplace.

  perldl> p $data
  [
   [  5 BAD   2 BAD]
   [  7   3   7 BAD]
  ]
  
  perldl> p $data->fill_rand
  
  [
   [5 2 2 5]
   [7 3 7 7]
  ]

The output pdl badflag is cleared.

dev_m

Signature: (a(n); float+ [o]b(n))

replaces values with deviations from the mean. can be done inplace.

dev_m does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

stddz

Signature: (a(n); float+ [o]b(n))

standardize ie replace values with z_scores based on sample standard deviation from the mean. can be done inplace.

stddz does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

sse

Signature: (a(n); b(n); float+ [o]c())

sum of squared errors between actual and predicted values.

sse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

mse

Signature: (a(n); b(n); float+ [o]c())

mean of squared errors between actual and predicted values. ie variance around predicted value.

mse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

rmse

Signature: (a(n); b(n); float+ [o]c())

root mean squared error. stdv around predicted value.

rmse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

pred_logistic

Signature: (a(n,m); b(m); float+ [o]c(n))
# glue constant then apply coeff returned by the logistic method

$pred = $x->glue(1,ones($x->dim(0)))->pred_logistic( $m{b} );

calculates predicted prob value for logistic regression.

pred_logistic does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

d0

Signature: (a(n); float+ [o]c())
my $d0 = $y->d0();

Null deviance for logistic regression.

d0 does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

dm

Signature: (a(n); b(n); float+ [o]c())
my $dm = $y->dm( $y_pred );

# null deviance
my $d0 = $y->dm( ones($y->nelem) * $y->avg );

Model deviance for logistic regression.

dm does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

dvrs

Signature: (a(); b(); float+ [o]c())

Deviance residual for logistic regression.

dvrs does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

ols_t

Threaded version of ordinary least squares regression. The price of threading was losing significance tests for coefficients (but see r2_change). The fitting function was shamelessly copied then modified from PDL::Fit::Linfit. Uses PDL::Slatec when possible but otherwise uses PDL::MatrixOps. Intercept is LAST of coeff if CONST => 1.

ols_t does not handle bad values. consider fill_m or fill_rand if there are bad values. Default options (case insensitive):

CONST   => 1,

Usage:

   # DV, 2 person's ratings for top-10 box office movies
   # ascending sorted by box office numbers

   perldl> p $y = qsort ushort( ceil( random(10, 2)*5 ) )    
   [
    [1 1 2 4 4 4 4 5 5 5]
    [1 2 2 2 3 3 3 3 5 5]
   ]

   # model with 2 IVs, a linear and a quadratic trend component

   perldl> $x = cat sequence(10), sequence(10)**2

   # suppose our novice modeler thinks this creates 3 different models
   # for predicting movie ratings

   perldl> p $x = cat $x, $x * 2, $x * 3
   [
    [
     [ 0  1  2  3  4  5  6  7  8  9]
     [ 0  1  4  9 16 25 36 49 64 81]
    ]
    [
     [  0   2   4   6   8  10  12  14  16  18]
     [  0   2   8  18  32  50  72  98 128 162]
    ]
    [
     [  0   3   6   9  12  15  18  21  24  27]
     [  0   3  12  27  48  75 108 147 192 243]
    ]
   ]

   perldl> p $x->info
   PDL: Double D [10,2,3]

   # insert a dummy dim between IV and the dim (model) to be threaded

   perldl> %m = $y->ols_t( $x->dummy(2) )

   perldl> p "$_\t$m{$_}\n" for (sort keys %m)

   # 2 persons' ratings, eached fitted with 3 "different" models

   F
   [
    [ 38.314159  25.087209]
    [ 38.314159  25.087209]
    [ 38.314159  25.087209]
   ]

   # df is the same across dv and iv models

   F_df    [2 7]
   F_p
   [
    [0.00016967051 0.00064215074]
    [0.00016967051 0.00064215074]
    [0.00016967051 0.00064215074]
   ]
   
   R2
   [
    [ 0.9162963 0.87756762]
    [ 0.9162963 0.87756762]
    [ 0.9162963 0.87756762]
   ]

   b
   [  # linear      quadratic     constant
    [
     [  0.99015152 -0.056818182   0.66363636]    # person 1
     [  0.18939394  0.022727273          1.4]    # person 2
    ]
    [
     [  0.49507576 -0.028409091   0.66363636]
     [  0.09469697  0.011363636          1.4]
    ]
    [
     [  0.33005051 -0.018939394   0.66363636]
     [ 0.063131313 0.0075757576          1.4]
    ]
   ]

   # our novice modeler realizes at this point that
   # the 3 models only differ in the scaling of the IV coefficients 
   
   ss_model
   [
    [ 20.616667  13.075758]
    [ 20.616667  13.075758]
    [ 20.616667  13.075758]
   ]
   
   ss_residual
   [
    [ 1.8833333  1.8242424]
    [ 1.8833333  1.8242424]
    [ 1.8833333  1.8242424]
   ]
   
   ss_total        [22.5 14.9]
   y_pred
   [
    [
     [0.66363636  1.5969697  2.4166667  3.1227273  3.7151515  4.1939394  4.5590909  4.8106061  4.9484848  4.9727273]
   ...

r2_change

Significance test for the incremental change in R2 when new variable(s) are added to an ols regression model. Returns the change stats as well as stats for both models. Base on ols_t. (One way to make up for the lack of significance tests for coeffs in ols_t).

Default options (case insensitive):

CONST   => 1,

Usage:

 # suppose these are two persons' ratings for top 10 box office movies
 # ascending sorted by box office

 perldl> p $y = qsort ushort( ceil(random(10, 2) * 5) )
 [
  [1 1 2 2 2 3 4 4 4 4]
  [1 2 2 3 3 3 4 4 5 5]
 ]

 # first IV is a simple linear trend

 perldl> p $x1 = sequence 10
 [0 1 2 3 4 5 6 7 8 9]

 # the modeler wonders if adding a quadratic trend improves the fit

 perldl> p $x2 = sequence(10) ** 2
 [0 1 4 9 16 25 36 49 64 81]

 # two difference models are given in two pdls
 # each as would be pass on to ols_t
 # the 1st model includes only linear trend
 # the 2nd model includes linear and quadratic trends
 # when necessary use dummy dim so both models have the same ndims

 perldl> %c = $y->r2_change( $x1->dummy(1), cat($x1, $x2) )

 perldl> p "$_\t$c{$_}\n" for (sort keys %c)
   #              person 1   person 2
 F_change        [0.72164948 0.071283096]
   # df same for both persons
 F_df    [1 7]
 F_p     [0.42370145 0.79717232]
 R2_change       [0.0085966043 0.00048562549]
 model0  HASH(0x8c10828)
 model1  HASH(0x8c135c8)

 # the answer here is no.

METHODS

anova

Analysis of variance. Uses type III sum of squares for unbalanced data.

Usage:

   # suppose this is ratings for 12 apples

   perldl> p $y = qsort ushort( ceil( random(12)*5 ) )
   [1 1 2 2 2 3 3 4 4 4 5 5]
   
   # IV for types of apple

   perldl> p $a = sequence(12) % 3 + 1
   [1 2 3 1 2 3 1 2 3 1 2 3]

   # IV for whether we baked the apple
   
   perldl> @b = qw( y y y y y y n n n n n n )

   # 1st @ ref is ivs, 2nd @ ref is optional iv names

   perldl> %m = $d->anova( [$a, \@b], [apple, bake] )
   
   perldl> p "$_\t$m{$_}\n" for (sort keys %m)

   # apple # m
   [
    [2.75    3 3.25]
   ]
   
   # apple # se
   [
    [0.85391256 0.91287093 0.85391256]
   ]
   
   # apple ~ bake # m
   [
    [1.5 1.5   2]
    [  4 4.5 4.5]
   ]
   
   # apple ~ bake # se
   [
    [0.5 0.5   1]
    [  1 0.5 0.5]
   ]
   
   # bake # m
   [
    [ 1.6666667  4.3333333]
   ]
   
   # bake # se
   [
    [0.33333333 0.33333333]
   ]
   
   F       4.4
   F_df    [5 6]
   F_p     0.0496945790114411
   ms_model        4.4
   ms_residual     1
   ss_model        22
   ss_residual     6
   ss_total        28
   | apple | F     0.25
   | apple | F_df  [2 6]
   | apple | F_p   0.78652708238507
   | apple | ms    0.25
   | apple | ss    0.5
   | apple ~ bake | F      0.0833333333333335
   | apple ~ bake | F_df   [2 6]
   | apple ~ bake | F_p    0.921090557321383
   | apple ~ bake | ms     0.0833333333333335
   | apple ~ bake | ss     0.166666666666667
   | bake | F      21.3333333333333
   | bake | F_df   [1 6]
   | bake | F_p    0.00361989531398121
   | bake | ms     21.3333333333333
   | bake | ss     21.3333333333333

dummy_code

dummy coding of nominal variable (perl @ ref or 1d pdl) for use in regression.

perldl> @a = qw(a a a b b b c c c)
perldl> p $a = dummy_code(\@a)
[
 [1 1 1 0 0 0 0 0 0]
 [0 0 0 1 1 1 0 0 0]
]

effect_code

Unweighted effect coding of nominal variable (perl @ ref or 1d pdl) for use in regression. returns in @ context coded pdl and % ref to level - pdl->dim(1) index. note that the last level is not explicitly coded in pdl

my @var = qw( a a a b b b c c c );
my ($var_e, $map) = effect_code( \@var );

print $var_e . $var_e->info . "\n";

[
 [ 1  1  1  0  0  0 -1 -1 -1]
 [ 0  0  0  1  1  1 -1 -1 -1]
]    
PDL: Double D [9,2]

print "$_\t$map->{$_}\n" for (sort keys %$map)
a       0
b       1
c       2

effect_code_w

weighted effect code for nominal variable. returns in @ context coded pdl and % ref to level - pdl->dim(1) index. note that the last level is not explicitly coded in pdl

perldl> @a = qw( a a b b b c c )
perldl> p $a = effect_code_w(\@a)
[
 [   1    1    0    0    0   -1   -1]
 [   0    0    1    1    1 -1.5 -1.5]
]

ols

Ordinary least squares regression, aka linear regression. Unlike ols_t, ols returns the full model in list context with various stats, but is not threadable. $ivs should be pdl dims $y->nelem or $y->nelem x n_iv. Intercept is LAST of the coeffs if CONST=>1.

Default options (case insensitive):

CONST  => 1,

Usage:

   # suppose this is a person's ratings for top 10 box office movies
   # ascending sorted by box office

   perldl> p $y = qsort ushort( ceil(random(10) * 5) )
   [1 1 2 2 2 2 4 4 5 5]

   # construct IV with linear and quadratic component

   perldl> p $x = cat sequence(10), sequence(10)**2
   [
    [ 0  1  2  3  4  5  6  7  8  9]
    [ 0  1  4  9 16 25 36 49 64 81]
   ]

   perldl> %m = $y->ols( $x )

   perldl> p "$_\t$m{$_}\n" for (sort keys %m)

   F       40.4225352112676
   F_df    [2 7]
   F_p     0.000142834216344756
   R2      0.920314253647587

   # coeff  linear     quadratic  constant

   b       [0.21212121 0.03030303 0.98181818]
   b_p     [0.32800118 0.20303404 0.039910509]
   b_se    [0.20174693 0.021579989 0.38987581]
   b_t     [ 1.0514223   1.404219  2.5182844]
   ss_model        19.8787878787879
   ss_residual     1.72121212121212
   ss_total        21.6
   y_pred  [0.98181818  1.2242424  1.5272727  1.8909091  2.3151515 2.8  3.3454545  3.9515152  4.6181818  5.3454545]

logistic

Logistic regression with maximum likelihood estimation using PDL::Fit::LM (requires PDL::Slatec. Hence loaded with "require" in the sub instead of "use" at the beginning). Do not supply the constant vector in $x. It is included in the model and returned as last of coeff. Returns full model in list context and coeff in scalar context.

***NOTE: the results here are qualitatively similar to but not identical with results from R, because different algorithms are used for the nonlinear parameter fit. Use with discretion***

Default options (case insensitive):

INITP => zeroes( $x->dim(1) + 1 ),
MAXIT => 1000,
EPS   => 1e-7,

Usage:

   # suppose this is whether a person had rented 10 movies
   # ascending sorted by box office

   perldl> p $y = qsort( ushort( random(10)*2 ) )
   [0 0 0 0 0 0 1 1 1 1]
  
   # IV is box office ranking

   perldl> p $x = sequence(10)
   [0 1 2 3 4 5 6 7 8 9]

   perldl> %m = $y->logistic( $x )

   perldl> p "$_\t$m{$_}\n" for (sort keys %m)
   D0      13.4602333401851
   Dm      0
   Dm_chisq        13.4602333401851
   Dm_df   1
   Dm_p    0.00024367344985432
   #         ranking     constant
   b       [ 74.586318  -410.7816]
   b_chisq [ 13.460233  13.065797]
   b_p     [0.00024367345 0.00030073723]
   iter    154
   y_pred  [3.9794114e-179 9.8230271e-147 2.4247772e-114 5.9854711e-82 1.477491e-49 3.6471308e-17 1 1 1 1]

pca

Principal component analysis. $data is pdl dim obs x var. output loading (corr between var and axis) and score are pdls dim var x axis. value and var are pdls dim axis.

Based on corr instead of cov (bad values are ignored pair-wise. OK when bad values are few but otherwise probably should fill_m etc before pca). use PDL::Slatec::eigsys() if installed. otherwise use PDL::MatrixOps::eigens_sym(). added loadings and descending sorted axis by $value (ie variance accouted for).

Usage:

my $data   = random 100, 20;       # 100 obs on 20 var
my %result = $data->pca;
print "$_\t" . $result{$_} . "\n" for (keys %result);

# screeplot for % of var accounted for by each axis

use PDL::Graphics::PGPLOT::Window;

my $win = pgwin( Dev=>"/xs" );
$win->line( $result{var} );
$win->hold;
$win->points( $result{var}, {CHARSIZE=>2} );

TO DO

anova_repeated

SEE ALSO

PDL::Fit::Linfit

PDL::Fit::LM

REFERENCES

Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates Publishers.

Hosmer, D.W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York, NY: Wiley-Interscience.

The GLM procedure: unbalanced ANOVA for two-way design with interaction. (2008). SAS/STAT(R) 9.2 User's Guide. Retrieved June 18, 2009 from http://support.sas.com/

AUTHOR

Copyright (C) 2009 Maggie J. Xiong <maggiexyz users.sourceforge.net>

All rights reserved. There is no warranty. You are allowed to redistribute this software / documentation as described in the file COPYING in the PDL distribution.