NAME

Statistics::Gap - Perl extension for the "Gap Statistics"

SYNOPSIS

use Statistics::Gap;
&gap("GapPrefix", "InputFile", "squared", "agglo", 5, 3, "unif");

OR

use Statistics::Gap;
&gap("GapPrefix", "InputFile", "squared", "agglo", 5, 3, "prop");

DESCRIPTION

Given a dataset how does one automatically find the optimal number 
of clusters that the dataset should be grouped into? - is one of the 
prevailing problems. Statisticians Robert Tibshirani, Guenther Walther 
and Trevor Hastie  propose a solution for this problem in a Techinal 
Report named - "Estimating the number of clusters in a dataset via 
the Gap Statistics". This perl module implements the approach proposed 
in the above paper.

NOTE: 
Gap Statistics uses reference distribution in the process of estimating
the number of clusters. The appropriate methodology for generation of this 
reference distribution is dependent on the data to be clustered. 
This module was implemented for data with following characteristics:
1. highly sparse - very few features occur in any given observation.
2. high multivariate dimensionality (i.e. large feature space)
3. binary feature frequency - feature either occurs or does not occur 
   in an observation.

EXPORT

"gap" function by default.

INPUT

Prefix

The string that should be used to as a prefix while naming the 
intermediate files and the .png files (graph files).

InputFile

The input dataset is expected in a plain text file where the first
line in the file gives the dimensions of the dataset and then the 
dataset in a matrix format should follow. The contexts / observations 
should be along the rows and the features should be along the column.

DistanceMeasure

The Distance Measure that should be used.
Currrently this module supports the following distance measure:
1. Squared Euclidean (string that should be used as an argument: "squared")
2. Manhattan (string that should be used as an argument: "manhattan")
3. Euclidean (string that should be used as an argument: "euclidean")

ClusteringAlgorithm

The Clustering Measures that can be used are:
1. rb - Repeated Bisections [Default]
2. rbr - Repeated Bisections for by k-way refinement
3. direct - Direct k-way clustering
4. agglo  - Agglomerative clustering
5. graph  - Graph partitioning-based clustering
6. bagglo - Partitional biased Agglomerative clustering

K value

This is an approximate upper bound for the number of clusters that may be
present in the dataset. Thus for a dataset that you expect to be seperated
into 3 clusters this value should be set some integer value greater than 3.

B value

Specifies the number of time the reference distribution should be generated.

ReferenceGenerationMethod

1. Uniform - While generating the reference distribution, all the features
in the feature set have equal probability of being selected for the observation
under consideration.

2. Proportional - Each feature is assigned a probability of being selected
depending upon its frequency of occurrence in the observed data. Thus feature
distribution is taken into consideration while selecting the features for the 
reference distribution generation.

OUTPUT

The output returned is a single integer number which indicates the optimal
number of clusters that the input dataset should be clustered into.

PRE-REQUISITES

    1. This module uses suite of C programs called CLUTO for clustering purposes. 
    Thus CLUTO needs to be installed for this module to be functional.
    CLUTO can be downloaded from http://www-users.cs.umn.edu/~karypis/cluto/

    2. Following Perl Modules
        1. GD	(http://search.cpan.org/~lds/GD-2.19/GD.pm)
	2. GD::Text	(http://search.cpan.org/~mverb/GDTextUtil-0.86/Text.pm)
	3. GD::Graph::lines	(http://search.cpan.org/~mverb/GDGraph-1.43/)
  	4. GD::Graph::colour	(http://search.cpan.org/~mverb/GDGraph-1.43/Graph/colour.pm)

AUTHOR

    Anagha Kulkarni, University of Minnesota Duluth
    kulka020 <at> d.umn.edu
	
    Guergana Savova, Mayo Clinic
    savova.guergana <at> mayo.edu

COPYRIGHT AND LICENSE

Copyright (C) 2005-2006, Guergana Savova and Anagha Kulkarni

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.

To install Statistics::Gap, copy and paste the appropriate command in to your terminal.

cpanm

cpanm Statistics::Gap

CPAN shell

perl -MCPAN -e shell
install Statistics::Gap

For more information on module installation, please visit the detailed CPAN module installation guide.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)