NAME

Statistics::PCA - A simple Perl implementation of Principal Component Analysis.

VERSION

This document describes Statistics::PCA version 0.0.1

SYNOPSIS

  use Statistics::PCA;

  # Create new Statistics::PCA object.
  my $pca = Statistics::PCA->new;

  #                  Var1    Var2    Var3    Var4...
  my @Obs1 = (qw/    32      26      51      12    /);
  my @Obs2 = (qw/    17      13      34      35    /);
  my @Obs3 = (qw/    10      94      83      45    /);
  my @Obs4 = (qw/    3       72      72      67    /);
  my @Obs5 = (qw/    10      63      35      34    /);

  # Load data. Data is loaded as a LIST-of-LISTS (LoL) pointed to by a named argument 'data'. Requires argument for format (see METHODS).
  $pca->load_data ( { format => 'table', data => [ \@Obs1, \@Obs2, \@Obs3, \@Obs4, \@Obs5 ], } ) ;

  # Perform the PCA analysis. Takes optional argument 'eigen' (see METHODS). 
  #$pca->pca( { eigen => 'C' } );
  $pca->pca();

  # Access results. The return value of this method is context-dependent (see METHODS). To print a report to STDOUT call in VOID-context.
  $pca->results();

DESCRIPTION

Principal component analysis (PCA) transforms higher-dimensional data consisting of a number of possibly correlated variables into a smaller number of uncorrelated variables termed principal components (PCs). PCs are ranked by the amount of variability they account for: the higher a PC's ranking, the greater its share of the total variance. This module mean-centers the data, computes the covariance matrix of the centered data, and then calculates the eigenvalue decomposition of that matrix using either the Math::Cephes::Matrix or Math::MatrixReal module (see METHODS). See http://en.wikipedia.org/wiki/Principal_component_analysis for more details.
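
The first two steps described above (mean centering and the covariance matrix) can be sketched in plain Perl. This is an illustration only, not the module's internal code: the variable names are made up for the example, and the n - 1 divisor (sample covariance) is an assumption about the convention used.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Rows are observations, columns are variables (the SYNOPSIS data).
my @table = (
    [ 32, 26, 51, 12 ],
    [ 17, 13, 34, 35 ],
    [ 10, 94, 83, 45 ],
    [  3, 72, 72, 67 ],
    [ 10, 63, 35, 34 ],
);

my $n_obs = scalar @table;
my $n_var = scalar @{ $table[0] };

# Mean of each variable (column).
my @mean = map {
    my $j = $_;
    sum( map { $_->[$j] } @table ) / $n_obs;
} 0 .. $n_var - 1;

# Subtract the column mean from every value.
my @centered = map {
    my $row = $_;
    [ map { $row->[$_] - $mean[$_] } 0 .. $n_var - 1 ];
} @table;

# Covariance matrix of the centered data.
# ASSUMPTION: sample covariance, i.e. divisor n - 1.
my @cov;
for my $j ( 0 .. $n_var - 1 ) {
    for my $k ( 0 .. $n_var - 1 ) {
        $cov[$j][$k] = sum( map { $_->[$j] * $_->[$k] } @centered ) / ( $n_obs - 1 );
    }
}

printf( ( "%10.2f" x $n_var ) . "\n", @{$_} ) for @cov;
```

The eigenvalue decomposition of the resulting matrix is the step the module delegates to Math::MatrixReal or Math::Cephes::Matrix.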

METHODS

new

Create a new Statistics::PCA object.

my $pca = Statistics::PCA->new;

load_data

Loads data into the object. Data is passed as a reference to a LoL within an anonymous hash using the named argument 'data'. The layout of the LoL is specified by the obligatory named argument 'format', which takes one of two values: 'table' or 'variable'. In standard 'table' fashion, rows correspond to observations and columns correspond to variables. Thus to enter the following table of data:

        Var1    Var2    Var3    Var4
Obs1    32      26      51      12
Obs2    17      13      34      35
Obs3    10      94      83      45
Obs4    3       72      72      67
Obs5    10      63      35      34 ...

The data is passed as a LoL, with each nested ARRAY reference corresponding to a row of observations in the data table, and with the 'format' argument set to 'table' as follows:

#                       Var1    Var2    Var3    Var4 ...
my $data  =   [   
                [qw/    32      26      51      12    /],     # Obs1
                [qw/    17      13      34      35    /],     # Obs2
                [qw/    10      94      83      45    /],     # Obs3
                [qw/    3       72      72      67    /],     # Obs4
                [qw/    10      63      35      34    /],     # Obs5 ...
            ];

$pca->load_data ( { format => 'table', data => $data, } );

Alternatively you may enter the data in a variable-centric fashion where each nested ARRAY reference corresponds to a single variable within the data (i.e. the transpose of the above table-fashion). To pass the above data in this fashion use the 'format' argument with value 'variable' as follows:

#                           Obs1    Obs2    Obs3    Obs4    Obs5 ...
my $transpose = [
                    [qw/    32      17      10      3       10    /],   # Var1
                    [qw/    26      13      94      72      63    /],   # Var2
                    [qw/    51      34      83      72      35    /],   # Var3
                    [qw/    12      35      45      67      34    /],   # Var4 ...
                ];

$pca->load_data ( { format => 'variable', data => $transpose, } ) ;
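
If your data is already held in one layout and you want the other, the two forms are simply transposes of each other. A minimal sketch, assuming a rectangular LoL; transpose_lol is a hypothetical helper written for this example and is not part of Statistics::PCA:

```perl
use strict;
use warnings;

# Data in 'table' layout: one nested array per observation.
my $table = [
    [ 32, 26, 51, 12 ],    # Obs1
    [ 17, 13, 34, 35 ],    # Obs2
    [ 10, 94, 83, 45 ],    # Obs3
    [  3, 72, 72, 67 ],    # Obs4
    [ 10, 63, 35, 34 ],    # Obs5
];

# Hypothetical helper (not provided by the module): swaps rows and columns.
sub transpose_lol {
    my ($rows) = @_;
    my @cols;
    for my $i ( 0 .. $#{$rows} ) {
        for my $j ( 0 .. $#{ $rows->[$i] } ) {
            $cols[$j][$i] = $rows->[$i][$j];
        }
    }
    return \@cols;
}

# Data in 'variable' layout: one nested array per variable.
my $variable = transpose_lol($table);
print "@{$_}\n" for @{$variable};
```

Either LoL can then be passed to load_data with the matching 'format' value.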

pca

Performs the PCA. This method takes the optional named argument 'eigen', which selects the module used to calculate the eigenvalue decomposition: 'M' for Math::MatrixReal or 'C' for Math::Cephes::Matrix. Without the argument it defaults to 'M'.

$pca->pca();   
$pca->pca( { eigen => 'M' } );
$pca->pca( { eigen => 'C' } );

results

Used to access the results of the PCA. This method is context-dependent: its return value depends on whether it is called in VOID or LIST context and on the arguments it is passed. In VOID context it prints a formatted table of the computed results to STDOUT.

$pca->results;

In LIST context this method takes an obligatory argument that determines its return values. To return an ordered list (ordered by PC ranking) of the proportions of total variance of each PC pass 'proportion' to the method.

my @list = $pca->results('proportion');
print qq{\nOrdered list of individual proportions of variance: @list};

To return an ordered list of the cumulative variance of the PCs pass argument 'cumulative'.

@list = $pca->results('cumulative');
print qq{\nOrdered list of cumulative variance of the PCs: @list};

To return an ordered list of the individual standard deviations of the PCs pass argument 'stdev'.

@list = $pca->results('stdev');
print qq{\nOrdered list of individual standard deviations of the PCs: @list};

To return an ordered list of the individual eigenvalues of the PCs pass argument 'eigenvalue'.

@list = $pca->results('eigenvalue');
print qq{\nOrdered list of individual eigenvalues of the PCs: @list};
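
The quantities above are closely related. The sketch below shows the usual PCA conventions, under the assumption (not confirmed by this document) that the module follows them: the standard deviation of a PC is the square root of its eigenvalue, its proportion of variance is its eigenvalue divided by the sum of all eigenvalues, and the cumulative variance is the running total of those proportions. The eigenvalues are made-up numbers for illustration, not output from the module.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Made-up eigenvalues, already ordered by PC rank (NOT module output).
my @eigenvalue = ( 4, 3, 2, 1 );

my $total = sum(@eigenvalue);

# ASSUMPTION: the standard conventions relating these quantities.
my @stdev      = map { sqrt($_) } @eigenvalue;
my @proportion = map { $_ / $total } @eigenvalue;

# Running total of the proportions.
my $running    = 0;
my @cumulative = map { $running += $_ } @proportion;

printf "PC%d  stdev %.3f  proportion %.3f  cumulative %.3f\n",
    $_ + 1, $stdev[$_], $proportion[$_], $cumulative[$_]
    for 0 .. $#eigenvalue;
```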

To return an ordered list of ARRAY references containing the eigenvectors of the PCs pass argument 'eigenvector'.

# Returns an ordered list of array references containing the eigenvectors for the components
@list = $pca->results('eigenvector');
use Data::Dumper;
print Dumper \@list;

To return an ordered list of ARRAY references containing more detailed information about each PC use the 'full' argument. Each nested ARRAY reference consists of an ordered list of: PC rank, PC stdev, PC proportion of variance, PC cumulative variance, PC eigenvalue and a further nested ARRAY reference containing the PC eigenvector.

@list = $pca->results('full');
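for my $i (@list) {
    # Elements 0-4 are scalars; element 5 is an ARRAY reference (the eigenvector).
    print qq{\nPC rank: $i->[0]}
        . qq{\nPC stdev: $i->[1]}
        . qq{\nPC proportion of variance: $i->[2]}
        . qq{\nPC cumulative variance: $i->[3]}
        . qq{\nPC eigenvalue: $i->[4]}
        . qq{\nPC eigenvector: @{ $i->[5] }};
}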

To return an ordered LoL of the transformed data for each of the PCs pass 'transformed' to the method.

@list = $pca->results('transformed');
print qq{\nThe transformed data for 'the' principal component (first PC): @{$list[0]} };
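
In general PCA terms, the transformed data for a PC is the projection of the mean-centered observations onto that PC's eigenvector. The sketch below illustrates this on a toy data set whose eigenvectors are known by inspection; it is not the module's code, and the sign and scaling of the values the module returns may differ.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Toy data: two perfectly correlated variables, three observations.
my @table = ( [ 1, 1 ], [ 2, 2 ], [ 3, 3 ] );

my $n_obs = scalar @table;
my $n_var = scalar @{ $table[0] };

# Mean-center each column.
my @mean = map {
    my $j = $_;
    sum( map { $_->[$j] } @table ) / $n_obs;
} 0 .. $n_var - 1;

my @centered = map {
    my $row = $_;
    [ map { $row->[$_] - $mean[$_] } 0 .. $n_var - 1 ];
} @table;

# Unit eigenvectors of the covariance matrix [[1,1],[1,1]], ordered by rank.
# Written down by hand for this toy case rather than computed.
my $r           = 1 / sqrt(2);
my @eigenvector = ( [ $r, $r ], [ $r, -$r ] );

# One nested array per PC: the dot product of every centered
# observation with that PC's eigenvector.
my @transformed = map {
    my $vec = $_;
    [
        map {
            my $obs = $_;
            sum( map { $obs->[$_] * $vec->[$_] } 0 .. $n_var - 1 );
        } @centered
    ];
} @eigenvector;

printf "PC%d: %s\n", $_ + 1, join( ' ', map { sprintf '%.3f', $_ } @{ $transformed[$_] } )
    for 0 .. $#transformed;
```

Here all of the variation falls along the first PC, so the scores for the second PC are zero.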

DEPENDENCIES

  'version'              => '0',
  'Carp'                 => '1.08',
  'Math::Cephes::Matrix' => '0.47',
  'Math::Cephes'         => '0.47',
  'List::Util'           => '1.19',
  'Math::MatrixReal'     => '2.05',
  'Text::SimpleTable'    => '2.0',
  'Contextual::Return'   => '0.2.1',

AUTHOR

Daniel S. T. Hughes <dsth@cpan.org>

LICENCE AND COPYRIGHT

Copyright (c) 2009, Daniel S. T. Hughes <dsth@cantab.net>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

Because this software is licensed free of charge, there is no warranty for the software, to the extent permitted by applicable law. Except when otherwise stated in writing the copyright holders and/or other parties provide the software "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of the software is with you. Should the software prove defective, you assume the cost of all necessary servicing, repair, or correction.

In no event unless required by applicable law or agreed to in writing will any copyright holder, or any other party who may modify and/or redistribute the software as permitted by the above licence, be liable to you for damages, including any general, special, incidental, or consequential damages arising out of the use or inability to use the software (including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the software to operate with any other software), even if such holder or other party has been advised of the possibility of such damages.