NAME

PDL::Stats::Kmeans -- classic k-means cluster analysis

DESCRIPTION

Assumes that the data is a pdl of dim [observation, variable] and that the goal is to put observations into clusters based on their values on the variables. The terms "observation" and "variable" are quite arbitrary, but serve as a reminder of "that which is being clustered" and "that which is used to cluster".
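
For example, ratings of 10 observations on 3 variables sit in a pdl of dim [10, 3] (random data here, purely to illustrate the layout):

  my $data = random 10, 3;    # dim [10 obs, 3 var]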

The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are threadable and methods that are non-threadable, respectively.

SYNOPSIS

use PDL::LiteF;
use PDL::NiceSlice;
use PDL::Stats;

my ($data, $idv, $ido) = get_data( $file );

my ($cluster, $centroid, $ss_this, $ss_last, $ss_centroid);

  # start out with 8 random clusters
$cluster = random_cluster( $data->dim(0), 8 );

($centroid, $ss_centroid) = $data->centroid( $cluster );
$ss_this = $ss_centroid->sum;

  # iterate to minimize total ss
  # stop when change in ss is less than $crit amount
my $crit = 1e-4;    # convergence threshold; pick one suited to your data
do {
  $ss_last = $ss_this;
  $cluster = $data->assign( $centroid );

  ($centroid, $ss_centroid) = $data->centroid( $cluster );
  $ss_this = $ss_centroid->sum;
}
while ( $ss_last - $ss_this > $crit );

or just do

  # more options are explained under kmeans below
my %result = $data->kmeans( \%opt );
print "$_\t$result{$_}\n" for (sort keys %result);

plot the clusters if there are only 2 vars

use PDL::Graphics::PGPLOT::Window;

my ($win, $c);
$win = pgwin(Dev=>'/xs');
$win->env($data( ,0)->minmax, $data( ,1)->minmax);

$win->points( $data->dice_axis(0,which($result{cluster}->(,$_)))->dog,
              {COLOR=>++$c} )
  for (0..$result{cluster}->dim(1)-1);

FUNCTIONS

random_cluster

Signature: (byte [o]cluster(o,c); int obs=>o; int clu=>c)

Creates masks for random mutually exclusive clusters. Accepts two parameters, num_obs and num_cluster. Extra parameters turn into extra dims in the mask. May loop for a long time if num_cluster approaches num_obs, because empty clusters are not allowed.

my $cluster = random_cluster( $num_obs, $num_cluster );
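
A quick sanity check (the particular assignment is random, but each observation lands in exactly one cluster):

  perldl> $c = random_cluster( 6, 2 )
  perldl> p $c->xchg(0,1)->sumover    # cluster membership count per obs
  [1 1 1 1 1 1]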

assign

Signature: (data(o,v); centroid(c,v); byte [o]cluster(o,c))

Takes a data pdl dim [obs x var] and a centroid pdl dim [cluster x var], and returns a mask dim [obs x cluster] of cluster membership. An obs is assigned to the first cluster with the smallest distance (i.e. sum squared error) to the cluster centroid. With bad values, an obs is assigned by the smallest mean squared error across variables.

  perldl> $centroid = ones 2, 3
  perldl> $centroid(0,) .= 0
  perldl> p $centroid
  [
   [0 1]
   [0 1]
   [0 1]
  ]

  perldl> $b = qsort( random 4, 3 )
  perldl> p $b
  [
   [0.022774068 0.032513883  0.13890034  0.30942479]
   [ 0.16943853  0.50262636  0.56251531   0.7152271]
   [ 0.23964483  0.59932745  0.60967495  0.78452117]
  ]
    # notice that 1st 3 obs in $b are on average closer to 0 
    # and last obs closer to 1
  perldl> p $b->assign( $centroid )
  [
   [1 1 1 0]    # cluster 0 membership
   [0 0 0 1]    # cluster 1 membership
  ]

assign does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
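
Continuing the example above, a sketch with a bad value marked in $b. Membership is then decided by the mean squared error over the remaining good values; obs 0 still lands in cluster 0 here, since its good values remain closer to 0:

  perldl> $b->setbadat(0,0)      # mark obs 0's value on var 0 as bad
  perldl> p $b->assign( $centroid )
  [
   [1 1 1 0]
   [0 0 0 1]
  ]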

centroid

Signature: (data(o,v); cluster(o,c); float+ [o]m(c,v); float+ [o]ss(c,v))

Takes data dim [obs x var] and mask dim [obs x cluster], and returns the mean and ss (ms when data contains bad values) dim [cluster x var], using data where mask = 1. Multiple cluster membership for an obs is okay. If a cluster is empty, all means and ss are set to zero for that cluster.

     # data is 10 obs x 3 var
   perldl> p $d = sequence 10, 3
   [
    [ 0  1  2  3  4  5  6  7  8  9]
    [10 11 12 13 14 15 16 17 18 19]
    [20 21 22 23 24 25 26 27 28 29]
   ]
     # create two clusters by value on 1st var
   perldl> p $a = $d( ,(0)) <= 5
   [1 1 1 1 1 1 0 0 0 0]

   perldl> p $b = $d( ,(0)) > 5
   [0 0 0 0 0 0 1 1 1 1]

   perldl> p $c = cat $a, $b
   [
    [1 1 1 1 1 1 0 0 0 0]
    [0 0 0 0 0 0 1 1 1 1]
   ]

   perldl> p $d->centroid($c)
     # mean for 2 cluster x 3 var
   [
    [ 2.5  7.5]
    [12.5 17.5]
    [22.5 27.5]
   ]
     # ss for 2 cluster x 3 var
   [
    [17.5    5]
    [17.5    5]
    [17.5    5]
   ]

 

centroid does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

kmeans

Implements classic k-means cluster analysis. Tries several rounds of random seeding and clustering, and returns the best results in terms of R2. Stops when the change in R2 is smaller than the set criterion.

Alternatively, if a centroid is provided, clustering proceeds from that centroid; there is no random seeding or multiple tries.
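
For example, a sketch of the seeded call ($centroid stands for any [cluster x var] pdl you supply, e.g. from an earlier centroid() call):

  my %k = $data->kmeans( { CNTRD => $centroid } );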

kmeans supports bad values*.

Default options (case insensitive):

V       => 1,          # prints simple status
FULL    => 0,          # returns results for all seeding rounds

CNTRD   => PDL->null,  # clu x var. optional. disables next 3 opts

NTRY    => 5,          # num of seeding rounds
NSEED   => 1000,       # num of starting seeds, use n obs up to NSEED
NCLUS   => 8,          # num of clusters

R2CRT   => .001,       # stop criterion for R2 change

Usage:

# suppose we have 4 persons' ratings on 5 movies

perldl> p $rating = ceil( random(4, 5) * 5 ) 
[
 [3 2 2 3]
 [2 4 5 4]
 [5 3 2 3]
 [3 3 1 5]
 [4 3 3 2]
]

# we want to put the 4 persons into 2 groups

perldl> %k = $rating->kmeans( {NCLUS=>2} )

# by default prints back options used
# as well as info for all tries and iterations

CNTRD   => Null
FULL    => 0
NCLUS   => 2
NSEED   => 1000
NTRY    => 5
R2CRT   => 0.001
V       => 1
ss/ms total:    20.5
iter 0 R2 [0.46341463 0.46341463 0.46341463 0.46341463 0.024390244]
iter 1 R2 [0.46341463 0.46341463 0.46341463 0.46341463 0.46341463]

perldl> p "$_\t$k{$_}\n" for (sort keys %k)

R2      0.463414634146341
centroid    # mean ratings for 2 group x 5 movies
[
 [  2   3]
 [4.5   3]
 [2.5   4]
 [  2   4]
 [  3   3]
]

cluster    # 4 persons' membership in two groups
[
 [0 1 1 0]
 [1 0 0 1]
]

n       [2 2]  # cluster size
ss
[
 [  0   0]
 [0.5   2]
 [0.5   2]
 [  2   2]
 [  0   2]
]

Now, for the valiant, kmeans is threadable. Say you gathered 10 persons' ratings on 5 movies from 2 countries, so the data is dim [10,5,2], and you want to put the 10 persons from each country into 3 clusters. Just specify NCLUS => [3,1], and there you have it. The key is for NCLUS to include $data->ndims - 1 numbers. The 1 in [3,1] turns into a dummy dim, so the 3-cluster operation is repeated on both countries. Similarly, when seeding, CNTRD needs to have dims that match the data dims. See stats_kmeans.t for examples with 4D data.
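
A sketch of that threaded call, with random ratings just to show the shapes:

  perldl> $data = random 10, 5, 2      # 10 persons x 5 movies x 2 countries
  perldl> %k = $data->kmeans( { NCLUS=>[3,1], V=>0 } )
    # $k{cluster} now holds an [obs x cluster] membership mask per country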

*With bad values, R2 is based on the average of variances instead of sum squared error. What's minimized is the average variance across clusters, as compared to the original variance with all obs in one cluster. R2 in this case does not have the usual meaning of proportion of variance accounted for, but it does serve the purpose of minimizing variance.

METHODS

iv_cluster

Turns an independent variable into a cluster pdl. Returns the cluster pdl and the level-to-pdl_index mapping in list context, and the cluster pdl only in scalar context.

This is the method used for mean and var in anova. The difference between iv_cluster and dummy_code is that iv_cluster returns pdl dim (obs x level) whereas dummy_code returns pdl dim (obs x (level - 1)).

Usage:

 perldl> @bake = qw( y y y n n n )

 # accepts @ ref or 1d pdl

 perldl> p $bake = iv_cluster( \@bake )
 [
  [1 1 1 0 0 0]
  [0 0 0 1 1 1]
 ]
 
 perldl> p $rating = sequence 6
 [0 1 2 3 4 5]

 perldl> p $rating->centroid( $bake )
 # mean for each iv level
 [
  [1 4]
 ]
 # ss
 [
  [2 2]
 ]

REFERENCES

Romesburg, H.C. (1984). Cluster Analysis for Researchers. NC: Lulu Press.

Wikipedia (retrieved June, 2009). K-means clustering. http://en.wikipedia.org/wiki/K-means_algorithm

AUTHOR

Copyright (C) 2009 Maggie J. Xiong <maggiexyz users.sourceforge.net>

All rights reserved. There is no warranty. You are allowed to redistribute this software / documentation as described in the file COPYING in the PDL distribution.