
NAME

Algorithm::KMeans - Clustering multi-dimensional data with a pure-Perl implementation

SYNOPSIS

use Algorithm::KMeans qw(kmeans visualize cluster_data_generator);

# Set the mask to tell the module which columns of a datafile to use
# for clustering and which column contains a symbolic ID for each
# data record.  For example, if the ID is in the first column, the
# second column is to be ignored, and the last three columns are to
# be used for 3D clustering:
my $mask = "I0111";

# If you want the module to figure out the optimum number of clusters
# from the data in the file supplied as $datafile:
kmeans( datafile => $datafile,
        mask     =>  $mask,
        terminal_output => 1,
        K => 0 );

# If you know how many clusters you want (in this case 3):
kmeans( datafile => $datafile,
        mask     =>  $mask,
        terminal_output => 1,
        K => 3 );

# To view the clusters formed:
visualize( datafile =>  $datafile,
           mask     =>  $mask );

# If you want to generate your own multivariate data for clustering,
# you can call
my $N = 60;              # If you want 60 data points per cluster
cluster_data_generator( input_parameter_file => $parameter_file,
                        output_datafile => $datafile,
                        number_data_points_per_cluster => $N );

DESCRIPTION

Algorithm::KMeans is a Perl 5 module for the clustering of numerical data in multidimensional spaces. Since the module is written entirely in Perl (in the sense that it is not a Perl wrapper around a C library that actually does the clustering), the code in the module can easily be modified to experiment with several aspects of automatic clustering. For example, one can change the criterion used to measure the "distance" between two data points, the stopping condition for accepting final clusters, the criterion used for measuring the quality of the clustering achieved, etc.

A K-Means clusterer is a poor man's implementation of the EM algorithm, where EM stands for Expectation Maximization. For the case of Gaussian data, the results obtained with a good K-Means implementation should match those obtained with the EM algorithm. Clustering with K-Means takes place iteratively and involves two steps: 1) assignment of data samples to clusters; and 2) recalculation of the cluster centers. The assignment step can be shown to be akin to the Expectation step of the EM algorithm, and the recalculation of the cluster centers akin to the Maximization step.

Of the two key steps of the K-Means algorithm, the assignment step consists of assigning each data point to the cluster whose center it is closest to. That is, during assignment, you compute the distance between the data point and each of the current cluster centers, and assign the data sample on the basis of the minimum such distance. The second step consists of re-computing the cluster centers for the newly modified clusters.
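The assignment step described above can be sketched in a few lines of Perl. This is an illustrative sketch, not the module's actual code; dist2() and nearest_center() are hypothetical helpers based on the squared Euclidean distance:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: squared Euclidean distance between two points
# given as array references.
sub dist2 {
    my ($p, $q) = @_;
    return sum map { ($p->[$_] - $q->[$_]) ** 2 } 0 .. $#$p;
}

# Hypothetical helper: index of the nearest cluster center for a point.
sub nearest_center {
    my ($point, $centers) = @_;
    my ($best_i, $best_d) = (0, dist2($point, $centers->[0]));
    for my $i (1 .. $#$centers) {
        my $d = dist2($point, $centers->[$i]);
        ($best_i, $best_d) = ($i, $d) if $d < $best_d;
    }
    return $best_i;
}

my @centers = ( [0, 0], [10, 10] );
my $cluster = nearest_center( [9, 8], \@centers );    # index 1
```

Note that for finding the nearest center the squared distance suffices, since the square root does not change which center wins.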

Obviously, before the two-step approach can proceed, we need to initialize both the cluster center values and the clusters that can then be iteratively modified by the two-step algorithm. How this initialization is carried out is very important. The implementation here uses a random number generator to find K random integers between 0 and N, where N is the total number of data samples that need to be clustered and K the number of clusters you wish to form. The K random integers are used as indices for the data samples in the overall data array --- the data samples thus selected are treated as seed cluster centers. This obviously requires a prior knowledge of K.
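The seeding of the cluster centers can be sketched as follows, under the assumption that the K indices are drawn without repetition; seed_indices() is a hypothetical helper, not part of the module's API:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: pick K distinct random indices into the data
# array; the data samples at those indices serve as seed centers.
sub seed_indices {
    my ($N, $K) = @_;
    my @shuffled = shuffle( 0 .. $N - 1 );
    return @shuffled[ 0 .. $K - 1 ];
}

my @seeds = seed_indices(100, 3);    # e.g. (17, 62, 4)
```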

How to specify K is one of the most vexing issues in any approach to clustering. In some cases, we can set K on the basis of prior knowledge. But, more often than not, no such prior knowledge is available. When the programmer does not explicitly specify a value for K, the approach taken in the current implementation is to try all possible values between 2 and some largest possible value that makes statistical sense. We then choose the value for K that yields the best value for the QoC (Quality of Clustering) metric. It is generally believed that the largest value for K should not exceed sqrt(N/2), where N is the number of data points to be clustered.
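The range of candidate values for K implied by the sqrt(N/2) rule of thumb can be sketched as follows; max_clusters() is a hypothetical helper, not a function exported by the module:

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: largest statistically sensible K for N data
# samples, per the sqrt(N/2) rule of thumb.
sub max_clusters {
    my ($N) = @_;
    return floor( sqrt( $N / 2 ) );
}

my $N = 200;
my @candidate_K = ( 2 .. max_clusters($N) );    # K = 2 .. 10
```

The search then runs the clusterer once for each candidate K and keeps the K with the best QoC value.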

How to set the QoC metric is obviously a critical issue unto itself. In the current implementation, the value of QoC is a ratio of the average radius of the clusters and the average distance between the cluster centers. But note that this is a good criterion only when the data exhibits the same variance in all directions. When the data variance is different in different directions, but still remains the same for all clusters, a more appropriate QoC can be formulated using other distance metrics such as the Mahalanobis distance.
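The QoC metric as described, the ratio of the average cluster radius to the average distance between cluster centers, can be sketched as follows. This is an illustration only, under the assumption that the average radius is taken over all data points and the average separation over all pairs of centers; the module's own computation may differ in detail:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Euclidean distance between two points given as array references.
sub euclid {
    my ($p, $q) = @_;
    return sqrt( sum map { ($p->[$_] - $q->[$_]) ** 2 } 0 .. $#$p );
}

# Hypothetical QoC: average point-to-center distance divided by the
# average pairwise distance between centers (smaller is better).
sub qoc {
    my ($clusters, $centers) = @_;    # parallel array refs
    my ($radius_sum, $n_points) = (0, 0);
    for my $i (0 .. $#$clusters) {
        for my $point (@{ $clusters->[$i] }) {
            $radius_sum += euclid($point, $centers->[$i]);
            $n_points++;
        }
    }
    my ($sep_sum, $n_pairs) = (0, 0);
    for my $i (0 .. $#$centers) {
        for my $j ($i + 1 .. $#$centers) {
            $sep_sum += euclid($centers->[$i], $centers->[$j]);
            $n_pairs++;
        }
    }
    return ($radius_sum / $n_points) / ($sep_sum / $n_pairs);
}
```

Tight clusters that are far apart give a small ratio, which is why the implementation prefers the K that minimizes this quantity.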

Every iterative algorithm requires a stopping criterion. The criterion implemented here is that we stop iterations when there is no re-assignment of the data points during the assignment step.

Ordinarily, the output produced by a K-Means clusterer will correspond to a local minimum for the QoC values, as opposed to a global minimum. The current implementation protects against that, but only in a very small way, by trying different randomly selected initial cluster centers and then selecting the one that gives the best overall QoC value.

METHODS

The module provides the following methods for clustering, for cluster visualization, and for the generation of data for testing a clustering algorithm:

Algorithm::KMeans::kmeans( datafile => $data_file, mask => $mask, terminal_output => 1, K => 0 );

where the keyword argument "K=>0" tells the module that you want it to figure out the optimum number of clusters to form. The datafile keyword names the file that contains the data that needs to be clustered. The data file is expected to contain entries in the following format

c20  0  10.7087017086940  9.63528386251712  10.9512155258108
c7   0  12.8025925026787  10.6126270065785  10.5228482095349
b9   0  7.60118206283120  5.05889245193079  5.82841781759102
....
....

where the first column contains the symbolic ID tag for each data record and the rest of the columns the numerical information. Which columns are actually used for clustering is decided by the string value of the mask. For example, if we wanted to cluster on the basis of the entries in just the last three columns, the mask value would be `I0111', where the character `I' indicates that the ID tag is in the first column, the character `0' that the second column is to be ignored, and the `1's that the last three columns are to be used for clustering.
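The way a mask selects columns can be illustrated with a short sketch; apply_mask() is a hypothetical helper, not part of the module's API:

```perl
use strict;
use warnings;

# Hypothetical helper: split a data record on whitespace and apply a
# mask such as "I0111" to extract the ID tag and the clustering
# columns.  'I' marks the ID column, '1' a column to cluster on, and
# '0' a column to ignore.
sub apply_mask {
    my ($line, $mask) = @_;
    my @fields = split ' ', $line;
    my @flags  = split //, $mask;
    my ($id, @coords);
    for my $i (0 .. $#flags) {
        if    ($flags[$i] eq 'I') { $id = $fields[$i] }
        elsif ($flags[$i] eq '1') { push @coords, $fields[$i] }
    }
    return ($id, \@coords);
}

my ($id, $coords) = apply_mask("c20  0  10.7  9.6  10.9", "I0111");
# $id is "c20"; @$coords is (10.7, 9.6, 10.9)
```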

The parameter `terminal_output' is boolean and determines what you will see on the terminal screen of the window in which you make these method calls. If you set it to 1, you will see the different clusters as lists of the symbolic IDs and you will also see the cluster centers, along with the QoC (Quality of Clustering) value for the clusters displayed. If this parameter is set to 0, you will see only minimal information. In either case, the clusters will be written out to files named Cluster0.dat, Cluster1.dat, Cluster2.dat, etc. Before the clusters are written to these files, the module destroys all files with such names in the directory in which you call the module.

Algorithm::KMeans::kmeans( datafile => $data_file, mask => $mask, terminal_output => 1, K => $K );

for a non-zero integer value for the keyword `K'. This has a profound effect on the behavior of the module, as it will now form exactly $K clusters.

Algorithm::KMeans::visualize( datafile => $datafile, mask => $mask );

for visualizing the clusters formed. The datafile is the same as used in the call to kmeans(). This datafile is used by the visualization script to extract the numerical coordinates associated with the symbolic ID tags in the cluster files. The mask here does not have to be identical to the one used for clustering, but must be a subset of that mask. This is convenient for visualizing the clusters formed in two- or three-dimensional subspaces of the original space, assuming that the clustering was carried out in a space whose dimensionality is greater than 3.

Algorithm::KMeans::cluster_data_generator( input_parameter_file => $parameter_file_name, output_datafile => $datafile, number_data_points_per_cluster => $N );

for generating multivariate data for clustering if you wish to play with synthetic data for clustering. The input parameter file contains the means and the variances for the different Gaussians you wish to use for the synthetic data. See the file param.txt provided in the examples directory. It will be easiest for you to just edit this file for your data generation needs. In addition to the format of the parameter file, the main constraint you need to observe in specifying the parameters is that the dimensionality of the covariance matrix must correspond to the dimensionality of the mean vectors. The multivariate random numbers are generated by calling the Math::Random module. As you would expect, this module requires that the covariance matrices you specify in your parameter file be symmetric and positive definite. Should the covariances in your parameter file not obey this condition, the Math::Random module will let you know.

HOW ARE THE CLUSTERS OUTPUT?

The module dumps the clusters into files named

Cluster0.dat
Cluster1.dat
Cluster2.dat
...
...

in the directory in which you execute the module. The number of such files will equal the number of clusters formed. On each run of the kmeans() function, all such existing files in the directory are destroyed before fresh ones are created. Each cluster file consists of the symbolic ID tags of the data points in that cluster.
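If you want to read the output files back into a Perl data structure, a sketch such as the following would do, assuming one symbolic ID tag per line in each cluster file:

```perl
use strict;
use warnings;

# Hedged sketch: read the ID tags out of one cluster file, assuming
# one tag per line.
sub read_cluster_file {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    chomp( my @tags = grep { /\S/ } <$fh> );
    return @tags;
}

# Map each ID tag to the index of the cluster it landed in.
my %cluster_of;
for my $file (glob "Cluster*.dat") {
    my ($index) = $file =~ /Cluster(\d+)\.dat/;
    $cluster_of{$_} = $index for read_cluster_file($file);
}
```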

It would be trivial to modify the module so that a call to kmeans() returns a list of hashes, each hash representing a different cluster, and with each hash holding the symbolic ID tags for the keys and their data point coordinates for the values. If there is demand for such a feature, it will be added to the next version of this module.

REQUIRED

This module requires the following two modules:

Math::Random
Graphics::GnuplotIF

the former for generating the multivariate random numbers and the latter for the visualization of the clusters.

EXAMPLES

See the examples directory in the distribution for how to make calls to the clustering and the visualization routines. The examples directory also includes a parameter file, param.txt, for generating synthetic data for clustering. Just edit this file if you would like to generate your own multivariate data for clustering.

CAVEATS

Please note that this clustering module is not meant for very large datafiles. Being an all-Perl implementation, the goal here is not speed of execution. On the contrary, the goal is to make it easy to experiment with the different facets of K-Means clustering. If you need to process a large data file, you'd be better off with a module like Algorithm::Cluster. But note that when you use a wrapper module in which a C library actually does the job of clustering for you, it is more daunting to experiment with the various aspects of clustering.

Clustering usually does not work well when the data is highly anisotropic, that is, when the data has very different variances along its different dimensions. This problem becomes particularly severe when the different clusters you expect to see in the data have non-uniform anisotropies. When the anisotropies are uniform, one could improve on the implementation shown in the current module by first normalizing the data with respect to the different variances along the different dimensions. One could also try to cluster the data in a low-dimensional space formed by a principal components analysis of the data, especially after the variances are normalized out. Depending on how the current module is received, its future versions may include those enhancements.

BUGS

Please send me email if you find any bugs. They will be dealt with promptly.

INSTALLATION

The usual

perl Makefile.PL
make
make test
make install

if you have root access. If not,

perl Makefile.PL PREFIX=/some/other/directory/
make
make test
make install

AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "KMeans" in your subject line to get past my spam filter.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Copyright 2010 Avinash Kak