The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

clusterstopping.pl Predict the optimal number of clusters for your data

SYNOPSIS

Predicts the optimal number of clusters for the given data.

USAGE

clusterstopping.pl [OPTIONS] INPUTFILE

This script tries to find the optimal number of clusters for the given INPUTFILE.

INPUT

Required Arguments:

INPUTFILE

Matrix file containing either: * context vectors (in dense or sparse format) * similarity values between contexts

Optional Arguments:

--prefix PRE

Specify a prefix to be used for the output filenames.

If prefix is not specified then prefix is created by concatenating time stamp to the string "expr".

--measure MSR

Specify the cluster stopping measure to be used.

(Further details about the measures given below under the Section "DETAILS ABOUT THE CLUSTER-STOPPING MEASURES")

The possible option values: pk1 - PK1 measure pk2 - PK2 measure pk3 - PK3 measure [Default] gap - Adapted Gap Statistic pk - All the three PK measures all - All the four measures - PK1, PK2, PK3 and Gap

--space SP

Specify if the clustering should be performed in vector or similarity space.

The possible option values: vector [Default] similarity

--delta INT

NOTE: Delta value can only be a positive integer value.

Specify 0 to stop the iterating clustering process when two consecutive crfun values are exactly equal. This is the default setting when the crfun values are integer/whole numbers.

Specify non-zero positive integer to stop the iterating clustering process when the diffference between two consecutive crfun values is less than or equal to this value. However, note that the integer value specified is internally shifted to capture the difference in the least significant digit of the crfun values when these crfun values are fractional. For example: For crfun = 1.23e-02 & delta = 1 will be transformed to 0.0001 For crfun = 2.45e-01 & delta = 5 will be transformed to 0.005 The default delta value when the crfun values are fractional is 1.

However if the crfun values are integer/whole numbers (exponent >= 2) then the specified delta value is internally shifted only until the least significant digit in the scientific notation. For example: For crfun = 1.23e+04 & delta = 2 will be transformed to 200 For crfun = 2.45e+02 & delta = 5 will be transformed to 5 For crfun = 1.44e+03 & delta = 1 will be transformed to 10

--clmethod CL

Specifies the clustering method.

The possible option values: rb - Repeated Bisections [Default] rbr - Repeated Bisections for by k-way refinement direct - Direct k-way clustering agglo - Agglomerative clustering bagglo - Partitional biased Agglomerative clustering [Only in vector space]

For large amount of data, 'rb', 'rbr' or 'direct' are recommended.

--crfun CR

Selects the criteria function for Clustering. The meanings of these criteria functions are explained in Cluto's manual.

The possible option values: i1 - I1 Criterion function i2 - I2 Criterion function [Default] h1 - H1 Criterion function h2 - H2 Criterion function e1 - E1 Criterion function

--sim SIM

Specifies the similarity measure to be used

The possible option values: cos - Cosine [Default] corr - Correlation Coefficient

NOTE: This option can be used only in vector space.

--rowmodel RMOD

The option is used to specify the model to be used to scale every column of each row. (For further details please refer Cluto manual)

The possible option values: none - no scaling is performed [Default] maxtf - post scaling the values are between 0.5 and 1.0 sqrt - square-root of actual values log - log of actual values

--colmodel CMOD

The option is used to specify the model to be used to (globally) scale each column across all rows. (For further details please refer Cluto manual)

The possible option values: none - no scaling is performed [Default] idf - scaling according to inverse-document-frequency

--threspk1 NUM

The threshold value that should be used by the PK1 measure to predict the k value.

Default = -0.7

--precision NUM

Specifies the precision to be used.

Default: 4

--B NUM

The number of replicates/references to be generated.

Default: 1

--typeref TYP

Specifies whether to generate B replicates from a reference or to generate B references.

The possible option values: rep - replicates [Default] ref - references

--seed NUM

The seed to be used with the random number generator.

Default: No seed is set.

--percentage NUM

Specifies the percentage confidence to be reported in the log file. Since Statistics::Gap uses parametric bootstrap method for reference distribution generation, it is critical to understand the interval around the sample mean that could contain the population ("true") mean and with what certainty.

Default: 90

Other Options :

--help

Displays the quick summary of program options.

--version

Displays the version information.

--verbose

Displays to STDERR the current program status.

OUTPUT

  • prefix.pk1

     Contains the crfun values, PK1 values and the predicted k.
  • prefix.pk2

     Contains the crfun values, PK2 values and the predicted k.
  • prefix.pk3

     Contains the crfun values, PK3 values and the predicted k.
  • prefix.gap

     Contains the crfun values, delta values and the predicted k.
  • prefix.gap.log Contains a table of values for Gap(k), Obs(crfun(k)), Exp(crfun(k)), sd(k), s(k) and Confidence interval. Also contains individual crfun values for each of the B replicates/references for every value of K.

The following files (*.dat) files are created to facilitate creation of plots if required.

  • prefix.cr.dat

     Contains the crfun values.
  • prefix.pk1.dat

     Contains the PK1 values.
  • prefix.pk2.dat

     Contains the PK2 values.
  • prefix.pk3.dat

     Contains the PK3 values.
  • prefix.exp.dat

     Contains the Exp(crfun(k)) values generated for Gap Statistics.
  • prefix.gap.dat

     Contains the gap(k) values.

DETAILS ABOUT THE CLUSTER-STOPPING MEASURES

Each of the four cluster-stopping measure PK1, PK2, PK3 and Gap try to predict the optimal number of clusters for the provided data (represented in the form of vectors or similarity values).

Following is a description about each of the measures and how they select a k value.

Formulation for PK1:

           crfun[m] - mean(crfun[1...deltaM])
 PK1[m] = -----------------------------------
                std(crfun[1...deltaM])

 where m = 2,...deltaM
 deltaM is a m value beyond which the crfun value does not change much.

k selection strategy: If m is the first value for which the PK1 score is greater than the threshold value then m-1 value is selected as the k value when using one of the maximization criterion functions like I2, H2 etc. While using minimization criterion functions like E1, if m the first value for which the PK1 score is less than the threshold value then m-1 is selected as the k value.

Formulation for PK2:

            crfun[m]
 PK2[m] = ------------
           crfun[m-1]

where m = 2,...deltaM

k selection strategy: One standard deviation interval for the set of PK2[m] values id calculated and the m value for which the PK2 score is outside this interval but is still closest to the interval (from outside) is selected as the k value.

Formulation for PK3:

                2 * crfun[m]
 PK3[m] = -------------------------
           crfun[m-1] + crfun[m+1]
 

where m = 2,...deltaM

k selection strategy: Similar to that of PK2. One standard deviation of the PK3[m] scores is computed and the m value for which the PK3 score is outside this interval but is still closest to the interval (from outside) is selected as the k value.

Adapted Gap Statistic:

Adapted Gap tries to predict the optimal number of clusters by comparing the criterion function values that one gets for the observed/given data with the criterion function values that one gets for a random data. It uses hypothesis testing in this process where the null hypothesis is that the optimal number of clusters for the given data is 1 and the aim is to refute this null hypothesis if the given data exhibits enough pattern.

The plot of crfun values for given data is compared with the plot of crfun values for random or reference data. The m value corresponding to the first point of maximum difference between the two plots is chosen as the predicted k when using the maximization crfuns. While for the minimization crfuns, the m value corresponding to the first point of minimum difference between the two plots is choosen as the predicted k value.

We provide two methods for the generation of reference data:

1. One can choose to generate a random dataset over the observed distribution by holding the row and the column marginals fixed and then generating B replicates from this random dataset using Monte Carlo sampling.

2. Or to generate B random datasets over the observed distribution by holding the row and the column marginals fixed.

Please refer to http://search.cpan.org/dist/Algorithm-RandomMatrixGeneration/ to learn more about generating random dataset over the observed distribution by holding the row and the column marginals fixed.

DETAILS ABOUT THE INPUT MATRIX FORMAT

The input matrix can be in either dense or sparse format. The cell values can be integer or real. Depending upon the value specified for the space parameter the header in the input file (first line) changes.

Example of input matrix in dense format when space=vector: The first line specifies the dimensions - #rows #cols From the second line the actual matrix follows.

 6 5
 1.3       2       0       0       3
 2.1       0       4       2.7     0
 1.3       2       0       0       3
 2.1       0       4       2.7     0
 1.3       2       0       0       3
 2.1       0       4       2.7     0

Example of input matrix in dense format when space=similarity: The matrix, when in similarity space, is square and symmetric. The first line specifies the dimensions - #rows/#cols From the second line the actual matrix follows.

 5
 1.0000   0.3179   0.5544   0.2541   0.4431   
 0.3179   1.0000   0.1386   0.4599   0.5413
 0.5544   0.1386   1.0000   0.5143   0.2186
 0.2541   0.4599   0.5143   1.0000   0.5148
 0.4431   0.5413   0.2186   0.5148   1.0000

Example of input matrix in sparse format when space=vector: The first line specifies the dimensions & number of non-zero elements - #rows #cols #nonzeroElem

From the second line the matrix contents follow. Only non-zero elements are specified. Thus the elements are specified as pairs of - #col elem. The row number is implied by the (line number-1).

 8 10 41
 1 3 4 2 8 2 10 1
 1 1 2 5 3 1 5 2 7 1 9 2
 1 3 4 2 8 2 10 1
 1 1 2 5 3 1 5 2 7 1 9 2
 1 3 4 2 8 2 10 1
 1 1 2 5 3 1 5 2 7 1 9 2
 2 4 3 1 4 2 5 5 7 1 9 1
 2 4 3 1 4 2 5 5 7 1 9 1

Example of input matrix in sparse format when space=similarity: The first line specifies the dimensions & number of non-zero elements - #rows/#cols #nonzeroElem The matrix format is same as explained above.

 5 15
 1 1.0000 3 0.5544 5 0.4431   
 2 1.0000 3 0.1386 4 0.4599 5 0.5413
 1 0.5544 2 0.1386 3 1.0000
 2 0.4599 4 1.0000
 1 0.4431 2 0.5413 5 1.0000

AUTHOR

Anagha Kulkarni, University of Minnesota, Duluth.

Ted Pedersen, University of Minnesota, Duluth.

COPYRIGHT

Copyright (c) 2006-2008,

 Ted Pedersen, University of Minnesota, Duluth.
 tpederse@d.umn.edu

 Anagha Kulkarni, University of Minnesota, Duluth.
 kulka020@d.umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.