NAME

Statistics::FactorAnalysis - A Perl implementation of Factor Analysis using the Principal Component Method.

VERSION

This document describes Statistics::FactorAnalysis version 0.0.2

SYNOPSIS

  use Statistics::FactorAnalysis;

  # Data is entered as a reference to a LoL. In this case each nested LIST corresponds to a separate variable - thus 'format' option is set to 'variable'. 
  my $data = [
                  [qw/ 1038 369  622  1731 1109 1274 517  2043 1106 201  665  593  1117 563  2448 2201 1036 2715 700  593  394  1097 212  /],
                  [qw/ 1348 1483 749  1658 401  952  1039 1488 791  1344 488  591  744  1472 1076 1475 784  1170 384  450  1035 938  1179 /],
                  [qw/ 4472 4388 2174 3527 5587 3454 2560 6247 2238 2778 4399 1750 4738 2918 6680 3141 3872 6634 2017 3458 1922 3374 2768 /],
                  [qw/ 2627 3407 2299 3094 2721 2705 2814 2804 2155 2500 2503 2701 3058 2914 2940 2596 2723 2710 3022 2557 2652 2920 2687 /],
                  [qw/ 6466 3596 153  3335 1921 3255 437  4486 2769 755  91   155  480  1954 5697 5327 1263 9577 52   268  68   2797 122  /],
                  [qw/ 2366 3984 300  837  1304 1909 3800 1994 2135 2089 5148 1956 1513 2160 1943 1918 2036 4800 1100 816  937  1327 918  /],
                  [qw/ 6862 5746 4220 5739 5646 4848 7089 5160 5514 6083 5187 4491 5154 6029 5870 4923 5287 5901 4055 4765 6213 3894 4694 /],
          ];

  # Create Statistics::FactorAnalysis object with checking of variable distributions.
  my $fac = Statistics::FactorAnalysis->new(dist_check => 1);

  # Set compulsory format option - can be set in constructor as with any Moose attribute.
  $fac->format('variable');

  # Set compulsory LoL option. Points to reference of LoL of the data.
  $fac->LoL($data); 

  # Load the data. 
  $fac->load_data;

  # Loading complained so log transform data.
  use Math::Cephes qw(:explog);
  for my $row (@{$data}) { for my $col (@{$row}) { $col = log10($col); }}

  # Re-load data.
  $fac->load_data;

  # Perform PCA first to have a look at the PC variances.
  $fac->pca;

  # Have a look at the variances.
  $fac->pca_print_variance;

  # If the first 2 PCs explain more than 75% of the variance we use 2 factors else we use 3.
  my @cumulative = $fac->return_pca_cumulative_variances;
  my $factors = $cumulative[1] > 0.75 ? 2 : 3; 

  # Set our choice of factor number.
  $fac->factors($factors);

  # We will compute the rotated matrix.
  $fac->rotate(1);

  # Perform the factor analysis.
  $fac->fac();

  # Have a look at the results - if you want to access data directly use the return methods (see DIRECT DATA ACCESS/RETURN METHODS).
  $fac->fac_print_summary;

  # Have a look at the results with the rotated loadings - can only call this method if the 'rotate' => 1.
  $fac->fac_print_rotated_summary;

  # Retrieve the loadings as a LoL.
  my @loadings = $fac->return_loadings;

DESCRIPTION

Factor analysis is a statistical method by which the variability of a large set of observed variables is described in terms of a smaller set of unobserved variables termed factors. Factor analysis uses the premise that data observed from such a large number of variables are in some way a function of these factors, which cannot be measured directly. The observed variables are modeled as linear combinations of the factors. Factor analysis is related to principal component analysis (PCA). However, unlike PCA, which takes into account all variability in the variables, factor analysis estimates how much of the variability is due to common factors ("communality"). See http://en.wikipedia.org/wiki/Factor_analysis.

METHODS

new

Object constructor. May pass arguments upon object construction - see OBJECT CONSTRUCTOR OPTIONS.

my $fac = Statistics::FactorAnalysis->new(dist_check => 1);

load_data

Used to load the data into the object. Requires the 'LoL' and 'format' options to be set (these can be set during object creation if you wish). 'LoL' is a reference to a LoL containing the data, while the 'format' option specifies the layout of the LoL. If your data is in the format of a table (i.e. each nested array corresponds to an observation) use 'table'. Thus, for 7 variables with 23 observations (of random data), we load as:

my $data = [
        # Variables: 1,   2,   3,   4,   5,   6,   7,
                [qw/ 1038 1348 4472 2627 6466 2366 6862 /],     # obs1
                [qw/ 369  1483 4388 3407 3596 3984 5746 /],     # obs2
                [qw/ 622  749  2174 2299 153  300  4220 /],     # obs3
                [qw/ 1731 1658 3527 3094 3335 837  5739 /],     # ...
                [qw/ 1109 401  5587 2721 1921 1304 5646 /],
                [qw/ 1274 952  3454 2705 3255 1909 4848 /],
                [qw/ 517  1039 2560 2814 437  3800 7089 /],
                [qw/ 2043 1488 6247 2804 4486 1994 5160 /],
                [qw/ 1106 791  2238 2155 2769 2135 5514 /],
                [qw/ 201  1344 2778 2500 755  2089 6083 /],
                [qw/ 665  488  4399 2503 91   5148 5187 /],
                [qw/ 593  591  1750 2701 155  1956 4491 /],
                [qw/ 1117 744  4738 3058 480  1513 5154 /],
                [qw/ 563  1472 2918 2914 1954 2160 6029 /],
                [qw/ 2448 1076 6680 2940 5697 1943 5870 /],
                [qw/ 2201 1475 3141 2596 5327 1918 4923 /],
                [qw/ 1036 784  3872 2723 1263 2036 5287 /],
                [qw/ 2715 1170 6634 2710 9577 4800 5901 /],
                [qw/ 700  384  2017 3022 52   1100 4055 /],
                [qw/ 593  450  3458 2557 268  816  4765 /],
                [qw/ 394  1035 1922 2652 68   937  6213 /],
                [qw/ 1097 938  3374 2920 2797 1327 3894 /],
                [qw/ 212  1179 2768 2687 122  918  4694 /],     # obs23
        ];

$fac->format(q{table});
$fac->LoL($data);
$fac->load_data;

For the same sample of 7 variables with 23 observations, if each nested array corresponds to a variable, as below, pass 'variable' to the format option:

my $data = [   # obs 1,   2,   3,   ...,                                                                                           23
                [qw/ 1038 369  622  1731 1109 1274 517  2043 1106 201  665  593  1117 563  2448 2201 1036 2715 700  593  394  1097 212  /],
                [qw/ 1348 1483 749  1658 401  952  1039 1488 791  1344 488  591  744  1472 1076 1475 784  1170 384  450  1035 938  1179 /],
                [qw/ 4472 4388 2174 3527 5587 3454 2560 6247 2238 2778 4399 1750 4738 2918 6680 3141 3872 6634 2017 3458 1922 3374 2768 /],
                [qw/ 2627 3407 2299 3094 2721 2705 2814 2804 2155 2500 2503 2701 3058 2914 2940 2596 2723 2710 3022 2557 2652 2920 2687 /],
                [qw/ 6466 3596 153  3335 1921 3255 437  4486 2769 755  91   155  480  1954 5697 5327 1263 9577 52   268  68   2797 122  /],
                [qw/ 2366 3984 300  837  1304 1909 3800 1994 2135 2089 5148 1956 1513 2160 1943 1918 2036 4800 1100 816  937  1327 918  /],
                [qw/ 6862 5746 4220 5739 5646 4848 7089 5160 5514 6083 5187 4491 5154 6029 5870 4923 5287 5901 4055 4765 6213 3894 4694 /],
        ];

$fac->format(q{variable});
$fac->LoL($data);
$fac->load_data;
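
The two layouts hold the same numbers and differ only by a transpose. The following is a minimal sketch in plain Perl, independent of the module, that converts a 'table' layout into a 'variable' layout. The helper transpose_lol is hypothetical and is not part of Statistics::FactorAnalysis.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): transpose a rectangular LoL.
sub transpose_lol {
    my ($lol) = @_;
    my @out;
    for my $i (0 .. $#{$lol}) {
        for my $j (0 .. $#{ $lol->[$i] }) {
            $out[$j][$i] = $lol->[$i][$j];
        }
    }
    return \@out;
}

# First three observations of the 'table' example above (7 variables each).
my $table = [
    [qw/ 1038 1348 4472 2627 6466 2366 6862 /],
    [qw/ 369  1483 4388 3407 3596 3984 5746 /],
    [qw/ 622  749  2174 2299 153  300  4220 /],
];

my $by_variable = transpose_lol($table);

# Each nested array now holds one variable across the three observations.
print join(q{ }, @{$_}), qq{\n} for @{$by_variable};
```

Either layout can then be passed to LoL, provided the format option matches it.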

PRINCIPAL COMPONENT ANALYSIS METHODS

This module performs PCA using the Statistics::PCA module. However, it introduces some additional options to give added flexibility, e.g. standardise and divisor - see OPTIONS. Performing PCA may be useful for making initial decisions about the number of factors to use.

pca

Performs optional PCA analysis.

pca_print_variance

Alias for original Statistics::PCA print_variance method. Prints a table of PC standard deviations, proportion of variance and cumulative variance to STDOUT.

pca_print_eigenvectors

Alias for original Statistics::PCA print_eigenvectors method. Prints a table of the individual eigenvectors to STDOUT.

pca_print_transform

Alias for original Statistics::PCA print_transform method. Prints a table of the PCA transformed data to STDOUT.

pca_summary

Alias for original Statistics::PCA results method. Prints summary of PCA analysis results to STDOUT.

FACTOR ANALYSIS METHODS

fac

Estimates parameters for factor model using the Principal Component Method.
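
In the usual textbook formulation of the principal component method, the loadings for a factor are the corresponding eigenvector of the correlation (or covariance) matrix scaled by the square root of its eigenvalue. The sketch below illustrates this for the leading factor only, using power iteration in plain Perl. It is an illustration under these assumptions and not the module's implementation, which delegates the eigen decomposition to the libraries listed under eigen_method. The matrix values and the helper leading_eigenpair are made up for the example.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Illustrative correlation matrix for three variables (made-up values).
my @R = (
    [ 1.0, 0.5, 0.5 ],
    [ 0.5, 1.0, 0.5 ],
    [ 0.5, 0.5, 1.0 ],
);

# Hypothetical helper: power iteration. Repeatedly multiplies a vector by the
# matrix and renormalises it, converging on the eigenvector with the largest
# eigenvalue.
sub leading_eigenpair {
    my ($m, $iterations) = @_;
    my $n      = scalar @{$m};
    my @v      = (1 .. $n);       # arbitrary non-zero start vector
    my $lambda = 0;
    for (1 .. $iterations) {
        my @w;
        for my $i (0 .. $n - 1) {
            $w[$i] = sum( map { $m->[$i][$_] * $v[$_] } 0 .. $n - 1 );
        }
        my $norm = sqrt( sum( map { $_ ** 2 } @w ) );
        $lambda = $norm;          # |M v| tends to the eigenvalue once v has unit length
        @v = map { $_ / $norm } @w;
    }
    return ($lambda, \@v);
}

my ($lambda, $vector) = leading_eigenpair(\@R, 50);

# Loadings of the first factor: eigenvector scaled by sqrt(eigenvalue).
my @loadings = map { $_ * sqrt($lambda) } @{$vector};

printf "eigenvalue: %.4f\n", $lambda;
printf "loadings:   %s\n", join q{ }, map { sprintf '%.4f', $_ } @loadings;
```

The squared loadings of a factor sum to its eigenvalue, which is why the eigenvalue is read as the variance explained by that factor.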

fac_print_loadings

Prints a table to STDOUT of the loadings generated by fac method.

fac_print_rotated_loadings

Prints a table to STDOUT of the rotated loadings generated by fac method with the 'rotate' option set to '1'.

fac_print_communalities

Prints a table to STDOUT of the communalities generated by fac method.

fac_print_variance_explained

Prints a table to STDOUT of the variances explained by the individual factors generated by fac method.

fac_print_summary

Prints a table to STDOUT summarising all data generated by fac method.

fac_print_rotated_summary

Prints a table to STDOUT summarising all data generated by fac method from rotated loadings.

DIRECT DATA ACCESS/RETURN METHODS

return_variable_number

Description:    Returns total number of variables.
Usage:          my $var_num = $fac->return_variable_number;
Return type:    Number.

return_variable_measurements

Description:    Returns the total number of observations. 
Usage:          my $obs_num = $fac->return_variable_measurements;
Return type:    Number.

return_total_variance

Description:    Returns sum of variances of analysed data.
Usage:          my $variance = $fac->return_total_variance;
Return type:    Number.

return_total_communality

Description:    Returns the sum of the communalities of the analysed data.
Usage:          my $communality = $fac->return_total_communality;
Return type:    Number.

return_total_percentage_explained_by_factors

Description:    Returns the total percentage of variance explained by the factors.
Usage:          my $percentage = $fac->return_total_percentage_explained_by_factors;
Return type:    Number.

return_variances

Description:    Returns the variances of the analysed variables.
Usage:          my @variances = $fac->return_variances;
Return type:    LIST.

return_communalities

Description:    Returns the individual communalities for the variables.
Usage:          my @communalities = $fac->return_communalities;
Return type:    LIST.

return_variances_explained_by_factors

Description:    Returns the variance explained by each of the factors for the loadings generated by the PC method.
Usage:          my @variances_explained = $fac->return_variances_explained_by_factors;
Return type:    LIST.

return_variances_explained_by_rotated_factors

Description:    Returns the variance explained by each of the factors for the rotated loadings generated by Varimax rotation of the original loadings.
Usage:          my @variances_explained = $fac->return_variances_explained_by_rotated_factors;
Return type:    LIST.

return_percentages_explained_by_factors

Description:    Returns the percentage of variance explained by the factors for each of the observed variables.
Usage:          my @percentage = $fac->return_percentages_explained_by_factors;
Return type:    LIST.

return_pca_cumulative_variances

Description:    Returns the cumulative variances for each successive Principal Component generated by a PCA analysis.
Usage:          my @cumulative_variance = $fac->return_pca_cumulative_variances;
Return type:    LIST.
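
As a rough illustration of what a cumulative variance is, the sketch below computes the running proportion of total variance from a list of PC variances and then applies the factor-number rule used in the SYNOPSIS. The variances are made up, the helper cumulative_proportions is hypothetical, and the sketch assumes proportions between 0 and 1, since the SYNOPSIS compares against 0.75.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: running share of the total variance.
sub cumulative_proportions {
    my @variances = @_;
    my $total     = sum(@variances);
    my $running   = 0;
    my @cumulative;
    for my $variance (@variances) {
        $running += $variance;
        push @cumulative, $running / $total;
    }
    return @cumulative;
}

# Made-up PC variances, largest first.
my @pc_variances = (4.0, 2.5, 0.75, 0.5, 0.25);
my @cumulative   = cumulative_proportions(@pc_variances);

# Same rule as the SYNOPSIS: 2 factors if the first 2 PCs explain more than 75%.
my $factors = $cumulative[1] > 0.75 ? 2 : 3;

printf "PC%d cumulative: %.4f\n", $_ + 1, $cumulative[$_] for 0 .. $#cumulative;
print "factors chosen: $factors\n";
```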

return_orthogonal_matrix

Description:    Returns a LoL of the orthogonal matrix generated by Varimax rotation.
Usage:          for ($fac->return_orthogonal_matrix) { print @{$_}, qq{\n} }
Return type:    LoL.

return_loadings

Description:    Returns a LoL of the loadings generated by PC method factor analysis - each nested array contains the loadings for a single factor.
Usage:          for ($fac->return_loadings) { print @{$_}, qq{\n} }
Return type:    LoL.

return_rotated_loadings

Description:    Returns a LoL of the rotated loadings generated by Varimax rotation - each nested array contains the rotated loadings for a single factor.
Usage:          for ($fac->return_rotated_loadings) { print @{$_}, qq{\n} }
Return type:    LoL.
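
To show how the quantities returned above relate to one another, here is a small sketch in plain Perl. It assumes the loadings layout described for return_loadings (one nested array per factor) and uses the standard definitions: the communality of a variable is the sum of its squared loadings across factors, and the variance explained by a factor is the sum of its squared loadings across variables. The loading values are made up and no module methods are called.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Made-up loadings: 2 factors x 3 variables, one nested array per factor.
my @loadings = (
    [ 0.8, 0.6, 0.1 ],
    [ 0.1, 0.3, 0.9 ],
);

my $n_vars = scalar @{ $loadings[0] };

# Communality of variable i: squared loadings summed over the factors.
my @communalities;
for my $i (0 .. $n_vars - 1) {
    push @communalities, sum( map { $_->[$i] ** 2 } @loadings );
}

# Variance explained by factor j: squared loadings summed over the variables.
my @variance_explained;
for my $factor (@loadings) {
    push @variance_explained, sum( map { $_ ** 2 } @{$factor} );
}

printf "communalities:      %s\n", join q{ }, map { sprintf '%.2f', $_ } @communalities;
printf "variance explained: %s\n", join q{ }, map { sprintf '%.2f', $_ } @variance_explained;
printf "total communality:  %.2f\n", sum(@communalities);
```

Both sums run over the same squared loadings, so the total communality equals the total variance explained by the factors.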

OPTIONS

COMPULSORY DATA INPUT OPTIONS

format

Purpose:        Defines format of LoL being passed to object. If the nested arrays contain the data of the different variables or of the different observations 
                use 'variable', or 'table' respectively. See METHODS.
Values:         'table', 'variable'. 
Default value: 

LoL

Purpose:        Used for passing the data to the object. Accepts a reference to a LoL containing the data. 
Values:         Reference to LoL. 
Default value:  

OPTIONAL DATA CHECKS

dist_check

Purpose:        Tells object whether to perform checks on the skewness and kurtosis of the data of the variables during the load_data method call. It prints 
                warnings to STDOUT if any variable deviates beyond acceptable cutoffs. 
Values:         '1', '0'. 
Default value:  '0'.

dist_croak

Purpose:        Causes Statistics::FactorAnalysis to croak on load_data method calls instead of printing to STDOUT if variables deviate beyond acceptable cutoffs.
Values:         '1', '0'. 
Default value:  '0'.

skewness

Purpose:        Sets the cutoff value for skewness. 
Values:         Numeric.
Default value:  0.8.

kurtosis

Purpose:        Sets the cutoff value for kurtosis.
Values:         Numeric.
Default value:  3.
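
The exact estimators used by the distribution checks are not stated here. As a sketch only, the code below computes moment-based skewness and excess kurtosis (divisor N, both assumptions) for the fifth variable of the example data, before and after a log10 transform, and compares their absolute values with the default cutoffs. The helper names are hypothetical and not part of the module.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# k-th central moment with divisor N (an assumption, see the text above).
sub central_moment {
    my ($k, @x) = @_;
    my $mean = sum(@x) / @x;
    return sum( map { ($_ - $mean) ** $k } @x ) / @x;
}

sub skewness {
    my @x = @_;
    return central_moment(3, @x) / central_moment(2, @x) ** 1.5;
}

sub excess_kurtosis {
    my @x = @_;
    return central_moment(4, @x) / central_moment(2, @x) ** 2 - 3;
}

# Hypothetical check mirroring the 'skewness' and 'kurtosis' cutoff options.
sub exceeds_cutoffs {
    my ($data, $skew_cut, $kurt_cut) = @_;
    return abs( skewness(@{$data}) ) > $skew_cut
        || abs( excess_kurtosis(@{$data}) ) > $kurt_cut;
}

# Fifth variable of the example data, raw and log10-transformed.
my @raw = qw/ 6466 3596 153 3335 1921 3255 437 4486 2769 755 91 155
              480 1954 5697 5327 1263 9577 52 268 68 2797 122 /;
my @log = map { log($_) / log(10) } @raw;

for my $set ( [ raw => \@raw ], [ log10 => \@log ] ) {
    my ($label, $data) = @{$set};
    printf "%-5s skewness %.3f, excess kurtosis %.3f, flagged: %s\n",
        $label, skewness(@{$data}), excess_kurtosis(@{$data}),
        exceeds_cutoffs($data, 0.8, 3) ? 'yes' : 'no';
}
```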

OPTIONAL DATA ANALYSIS OPTIONS

standardise

Purpose:        Used to tell the object whether to standardise the variables prior to subjecting them to the principal component method such that all have mean 
                zero and variance equal to one. 
Values:         'Y', 'N'. 
Default value:  'Y'.
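
Standardising a variable means subtracting its mean and dividing by its standard deviation. A minimal sketch in plain Perl follows. It uses the N-1 divisor for the standard deviation, which is an assumption for this example rather than a statement about the module, and the helper standardise_values is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: rescale values to mean zero and unit sample variance.
sub standardise_values {
    my @x    = @_;
    my $n    = scalar @x;
    my $mean = sum(@x) / $n;
    my $sd   = sqrt( sum( map { ($_ - $mean) ** 2 } @x ) / ($n - 1) );
    return map { ($_ - $mean) / $sd } @x;
}

# First five observations of variable 1 from the example data.
my @x = qw/ 1038 369 622 1731 1109 /;
my @z = standardise_values(@x);

printf "%8.4f\n", $_ for @z;
```

After standardisation every variable contributes equally to the total variance, so variables measured on large scales no longer dominate the analysis.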

factors

Purpose:        Sets the number of factors to be used for the factor model. 
Values:         Numeric. 
Default value:  3.

rotate

Purpose:        Tells object whether to perform Varimax rotation of the PC generated loadings using the Statistics::PCA::Varimax module.
Values:         '1', '0'. 
Default value:  '0'.

divisor

Purpose:        Used to set the divisor for covariance matrix generation. To use N pass '0'. To use N-1 pass '-1'.
Values:         '0', '-1'.
Default value:  '0'.
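
The effect of the divisor can be seen on a single covariance. The sketch below is plain Perl and does not call the module. The helper covariance is hypothetical, and its third argument simply mirrors the option values by being added to N.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: covariance of two equal-length variables.
# $divisor mirrors the option values: 0 divides by N, -1 divides by N-1.
sub covariance {
    my ($x, $y, $divisor) = @_;
    my $n      = scalar @{$x};
    my $mean_x = sum(@{$x}) / $n;
    my $mean_y = sum(@{$y}) / $n;
    my $cross  = sum( map { ($x->[$_] - $mean_x) * ($y->[$_] - $mean_y) } 0 .. $n - 1 );
    return $cross / ($n + $divisor);
}

# Made-up data for two variables.
my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 6, 8, 10);

printf "divisor N:   %.2f\n", covariance(\@x, \@y, 0);
printf "divisor N-1: %.2f\n", covariance(\@x, \@y, -1);
```

With many observations the two results are close; with few observations the N-1 divisor gives a noticeably larger estimate.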

eigen_method

Purpose:        Used to define which module will be used to perform the eigen decomposition. To use Math::Cephes pass 'C'. For Math::MatrixReal pass 'M'. For the 
                gsl C library procedure implemented by Math::GSL::Linalg::SVD pass 'G'.
Values:         'M', 'C', 'G'. 
Default value:  'M'.

TABLE PRINTING METHOD OPTIONS

cutoff

Purpose:        Turns on cutoffs for printing loading values - if the loading value is below the cutoff value cutoff_null will be printed instead.
Values:         '0', '-1'. 
Default value:  '0'.

cutoff_value

Purpose:        Sets the cutoff value for printing loadings.
Values:         Numeric. 
Default value:  0.1.

cutoff_null

Purpose:        Sets the string to print in place of the loading if the loading is below the cutoff value.
Values:         String.
Default value:  ''.
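
How the three cutoff options interact can be sketched as follows. This is plain Perl with a hypothetical helper, format_loading, and it makes two assumptions that the text above does not confirm: the comparison uses the absolute value of the loading, and any true value of cutoff switches the behaviour on.

```perl
use strict;
use warnings;

# Hypothetical helper: format one loading for printing.
# Assumes the absolute value is compared with cutoff_value.
sub format_loading {
    my ($value, %opt) = @_;
    return sprintf('%.3f', $value) unless $opt{cutoff};
    return abs($value) < $opt{cutoff_value}
        ? $opt{cutoff_null}
        : sprintf('%.3f', $value);
}

my %options = ( cutoff => 1, cutoff_value => 0.1, cutoff_null => q{} );

# Made-up loadings for one factor.
my @loadings = (0.812, -0.45, 0.05, -0.02);

printf "[%6s]\n", format_loading($_, %options) for @loadings;
```

Blanking small loadings makes it easier to see which variables load strongly on which factor in a printed table.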

DEPENDENCIES

'Carp' => '1.08', 'Moose' => '0.93', 'MooseX::NonMoose' => '0.07', 'Statistics::PCA' => '0.0.1', 'Statistics::PCA::Varimax' => '0.0.2', 'Math::GSL::Linalg::SVD' => '0.0.2', 'List::Util' => '1.22'.

AUTHOR

Daniel S. T. Hughes <dsth@cantab.net>

SEE ALSO

Statistics::PCA, Statistics::PCA::Varimax, Math::GSL::Linalg::SVD.

BUGS

This software is in an early stage of development. I'm sure there will be bugs.

LICENCE AND COPYRIGHT

Copyright (c) 2009, Daniel S. T. Hughes <dsth@cantab.net>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

Because this software is licensed free of charge, there is no warranty for the software, to the extent permitted by applicable law. Except when otherwise stated in writing the copyright holders and/or other parties provide the software "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of the software is with you. Should the software prove defective, you assume the cost of all necessary servicing, repair, or correction.

In no event unless required by applicable law or agreed to in writing will any copyright holder, or any other party who may modify and/or redistribute the software as permitted by the above licence, be liable to you for damages, including any general, special, incidental, or consequential damages arising out of the use or inability to use the software (including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of the software to operate with any other software), even if such holder or other party has been advised of the possibility of such damages.
