NAME

PDL::Stats::Kmeans -- classic k-means cluster analysis

DESCRIPTION

Assumes that we have data pdl dim [observation, variable] and the goal is to put observations into clusters based on their values on the variables. The terms "observation" and "variable" are quite arbitrary but serve as a reminder for "that which is being clustered" and "that which is used to cluster".

The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are threadable and methods that are non-threadable, respectively.

SYNOPSIS

use PDL::LiteF;
use PDL::NiceSlice;
use PDL::Stats;

my ($data, $idv, $ido) = get_data( $file );

my ($cluster, $centroid, $ss_this, $ss_last, $ss_centroid);

  # start out with 8 random clusters
$cluster = random_cluster( $data->dim(0), 8 );

($centroid, $ss_centroid) = $data->centroid( $cluster );
$ss_this = $ss_centroid->sum;

  # iterate to minimize total ss
  # stop when change in ss is less than $crit amount 
do {
  $ss_last = $ss_this;
  $cluster = $data->assign( $centroid );

  ($centroid, $ss_centroid) = $data->centroid( $cluster );
  $ss_this = $ss_centroid->sum;
}
while ( $ss_last - $ss_this > $crit );

or just do

my %result = $data->kmeans( \%opt );
print "$_\t$result{$_}\n" for (sort keys %result);

plot the clusters if there are only 2 vars

use PDL::Graphics::PGPLOT::Window;

my ($win, $c);
$win = pgwin(Dev=>'/xs');
$win->env($data( ,0)->minmax, $data( ,1)->minmax);

$win->points( $data->dice_axis(0,which($m{cluster}->(,$_)))->dog,
              {COLOR=>++$c} )
  for (0..$m{cluster}->dim(1)-1);

FUNCTIONS

random_cluster

Signature: (byte [o]cluster(o,c); int obs=>o; int clu=>c)

Creates masks for random mutually exclusive clusters. Accepts two parameters, num_obs and num_cluster. Extra parameter turns into extra dim in mask. May loop a long time if num_cluster approaches num_obs because empty cluster is not allowed.

my $cluster = random_cluster( $num_obs, $num_cluster );

assign

Signature: (data(o,v); centroid(c,v); byte [o]cluster(o,c))

Takes data pdl dim [obs x var] and centroid pdl dim [cluster x var] and returns mask dim [obs x cluster] to cluster membership. An obs is assigned to the first cluster with the smallest distance (ie sum squared error) to cluster centroid. With bad value, obs is assigned by smallest mean squared error across variables.

  perldl> $centroid = ones 2, 3
  perldl> $centroid(0,) .= 0
  perldl> p $centroid
  [
   [0 1]
   [0 1]
   [0 1]
  ]

  perldl> $b = qsort( random 4, 3 )
  perldl> p $b
  [
   [0.022774068 0.032513883  0.13890034  0.30942479]
   [ 0.16943853  0.50262636  0.56251531   0.7152271]
   [ 0.23964483  0.59932745  0.60967495  0.78452117]
  ]
    # notice that 1st 3 obs in $b are on average closer to 0 
    # and last obs closer to 1
  perldl> p $b->assign( $centroid )
  [
   [1 1 1 0]    # cluster 0 membership
   [0 0 0 1]    # cluster 1 membership
  ]

assign does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

centroid

Signature: (data(o,v); cluster(o,c); float+ [o]m(c,v); float+ [o]ss(c,v))

Takes data dim [obs x var] and mask dim [obs x cluster], returns mean and ss (ms when data contains bad values) dim [cluster x var], using data where mask = 1. Multiple cluster membership for an obs is okay. If a cluster is empty all means and ss are set to zero for that cluster.

     # data is 10 obs x 3 var
   perldl> p $d = sequence 10, 3
   [
    [ 0  1  2  3  4  5  6  7  8  9]
    [10 11 12 13 14 15 16 17 18 19]
    [20 21 22 23 24 25 26 27 28 29]
   ]
     # create two clusters by value on 1st var
   perldl> p $a = $d( ,(0)) <= 5
   [1 1 1 1 1 1 0 0 0 0]

   perldl> p $b = $d( ,(0)) > 5
   [0 0 0 0 0 0 1 1 1 1]

   perldl> p $c = cat $a, $b
   [
    [1 1 1 1 1 1 0 0 0 0]
    [0 0 0 0 0 0 1 1 1 1]
   ]

   perldl> p $d->centroid($c)
     # mean for 2 cluster x 3 var
   [
    [ 2.5  7.5]
    [12.5 17.5]
    [22.5 27.5]
   ]
     # ss for 2 cluster x 3 var
   [
    [17.5    5]
    [17.5    5]
    [17.5    5]
   ]

 

centroid does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

kmeans

Implements classic kmeans cluster analysis. Tries several rounds of random-seeding and clustering, returns the best results in terms of R2. Stops when change in R2 is smaller than set criterion.

Alternatively, if a centroid is provided, clustering will proceed from the centroid and there is no random-seeding or multiple tries.

kmeans supports bad value*.

Default options (case insensitive):

V       => 1,          # prints simple status
FULL    => 0,          # returns results for all seeding rounds

CNTRD   => PDL->null,  # clu x var. optional. disables next 3 opts

NTRY    => 5,          # num of seeding rounds
NSEED   => 1000,       # num of initial seeds, use NSEED up to max obs
NCLUS   => 8,          # num of clusters

R2CRT   => .001,       # stop criterion for R2 change

Usage:

   # suppose we have 4 person's ratings on 5 movies

   perldl> p $rating = ceil( random(4, 5) * 5 ) 
   [
    [3 2 2 3]
    [2 4 5 4]
    [5 3 2 3]
    [3 3 1 5]
    [4 3 3 2]
   ]

   # we want to put the 4 persons into 2 groups

   perldl> %k = $rating->kmeans( {NCLUS=>2} )

   # by default prints back options used
   # as well as info for all tries and iterations

   CNTRD	=> Null
   FULL	=> 0
   NCLUS	=> 2
   NSEED	=> 4
   NTRY	=> 5
   R2CRT	=> 0.001
   V	        => 1
   ss total:	20.5
   iter 0 R2 [0.024390244 0.024390244 0.26829268  0.4796748  0.4796748]
   iter 1 R2 [0.46341463 0.46341463  0.4796748  0.4796748  0.4796748]

   perldl> p "$_\t$k{$_}\n" for (sort keys %k)

   R2      0.479674796747968
   centroid       # mean ratings for 2 group x 5 movies
   [
    [         3  2.3333333]
    [         2  4.3333333]
    [         5  2.6666667]
    [         3          3]
    [         4  2.6666667]
   ]

   cluster        # 4 persons' membership in two groups
   [
    [1 0 0 0]
    [0 1 1 1]
   ]
   
   n       [1 3]  # cluster size
   ss
   [
    [         0 0.66666667]
    [         0 0.66666667]
    [         0 0.66666667]
    [         0          8]
    [         0 0.66666667]
   ]

Now, for the valiant, kmeans is threadable. Say you gathered 10 persons' ratings on 5 movies from 2 countries, so the data is dim [10,5,2], and you want to put the 10 persons from each country into 3 clusters, just specify NCLUS => [3,1], and there you have it. The key is for NCLUS to include $data->ndims - 1 numbers. The 1 in [3,1] turns into a dummy dim, so the 3-cluster operation is repeated on both countries. Similarly, when seeding, CNTRD needs to have ndims that at least match the data ndims. Extra dims in CNTRD will lead to threading (convenient if you want to try out different centroid locations, for example, but you will have to hand pick the best result). See stats_kmeans.t for examples w 3D and 4D data.

*With bad value, R2 is based on average of variances instead of sum squared error. What's minimized is the average variance across clusters as compared to the original variance with all obs in one cluster. R2 in this case does not have the usual meaning of proportion of variance accounted for, but it does serve the purpose of minimizing variance. **With LOTS bad values, ie VERY sparse data, R2 may bounce around instead of monotonously decreasing. May be good idea to fill_m etc before kmeans instead.

METHODS

iv_cluster

Turns an independent variable into a cluster pdl. Returns cluster pdl and level-to-pdl_index mapping in list context and cluster pdl only in scalar context.

This is the method used for mean and var in anova. The difference between iv_cluster and dummy_code is that iv_cluster returns pdl dim (obs x level) whereas dummy_code returns pdl dim (obs x (level - 1)).

Usage:

 perldl> @bake = qw( y y y n n n )

 # accepts @ ref or 1d pdl

 perldl> p $bake = iv_cluster( \@bake )
 [
  [1 1 1 0 0 0]
  [0 0 0 1 1 1]
 ]
 
 perldl> p $rating = sequence 6
 [0 1 2 3 4 5]

 perldl> p $rating->centroid( $bake )
 # mean for each iv level
 [
  [1 4]
 ]
 # ss
 [
  [2 2]
 ]

pca_cluster

Assgin variables to components ie clusters based on pca loadings or scores. One way to seed kmeans (see Ding & He, 2004, and Su & Dy, 2004 for other ways of using pca with kmeans). Variables are assigned to their most associated component. Note that some components may not have any variable that is most associated with them, so the returned number of clusters may be smaller than NCOMP.

Default options (case insensitive):

V     => 1,
ABS   => 1,   # high pos and neg loadings on a comp in same cluster
NCOMP => 10,  # max number of components to consider

Usage:

  # say we need to cluster a group of documents
($data, $idd, $idw) = get_data 'doc_word_info.txt';

perldl> %p = $data->pca;
perldl> $cluster  = $p{loading}->pca_cluster;

  # pca clusters var while kmeans clusters obs. hence transpose
perldl> ($m, $ss) = $data->transpose->centroid( $cluster );
perldl> %k = $data->transpose->kmeans( { cntrd=>$m } );

  # take a look at cluster 0 doc ids
perldl> p join("\n", @$idd[ list which $k{cluster}->( ,0) ]);

REFERENCES

Ding, C., & He, X. (2004). K-means clustering via principal component analysis. Proceedings of the 21st International Conference on Machine Learning, 69, 29.

Su, T., & Dy, J. (2004). A deterministic method for initializing K-means clustering. 16th IEEE International Conference on Tools with Artificial Intelligence, 784-786.

Romesburg, H.C. (1984). Cluster Analysis for Researchers. NC: Lulu Press.

Wikipedia (retrieved June, 2009). K-means clustering. http://en.wikipedia.org/wiki/K-means_algorithm

AUTHOR

Copyright (C) 2009 Maggie J. Xiong <maggiexyz users.sourceforge.net>

All rights reserved. There is no warranty. You are allowed to redistribute this software / documentation as described in the file COPYING in the PDL distribution.