NAME
PDL::Stats::Kmeans -- classic k-means cluster analysis
DESCRIPTION
Assumes that we have data pdl dim [observation, variable] and the goal is to put observations into clusters based on their values on the variables. The terms "observation" and "variable" are quite arbitrary but serve as a reminder for "that which is being clustered" and "that which is used to cluster".
The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are threadable and methods that are non-threadable, respectively.
SYNOPSIS
use PDL::LiteF;
use PDL::NiceSlice;
use PDL::Stats;
my ($data, $idv, $ido) = get_data( $file );
my ($cluster, $centroid, $ss_this, $ss_last, $ss_centroid);
# start out with 8 random clusters
$cluster = random_cluster( $data->dim(0), 8 );
($centroid, $ss_centroid) = $data->centroid( $cluster );
$ss_this = $ss_centroid->sum;
# iterate to minimize total ss
# stop when change in ss is less than $crit amount
do {
$ss_last = $ss_this;
$cluster = $data->assign( $centroid );
($centroid, $ss_centroid) = $data->centroid( $cluster );
$ss_this = $ss_centroid->sum;
}
while ( $ss_last - $ss_this > $crit );
or just do
my %result = $data->kmeans( \%opt );
print "$_\t$result{$_}\n" for (sort keys %result);
plot the clusters if there are only 2 vars
use PDL::Graphics::PGPLOT::Window;
my ($win, $c);
$win = pgwin(Dev=>'/xs');
$win->env($data( ,0)->minmax, $data( ,1)->minmax);
$win->points( $data->dice_axis(0,which($m{cluster}->(,$_)))->dog,
{COLOR=>++$c} )
for (0..$m{cluster}->dim(1)-1);
FUNCTIONS
random_cluster
Signature: (byte [o]cluster(o,c); int obs=>o; int clu=>c)
creates masks for random mutually exclusive clusters. accepts two parameters, num_obs and num_cluster. extra parameter turns into extra dim in mask. may loop a long time if num_cluster approaches num_obs because empty cluster is not allowed.
my $cluster = random_cluster( $num_obs, $num_cluster );
assign
Signature: (data(o,v); centroid(c,v); byte [o]cluster(o,c))
takes data pdl dim [obs x var] and centroid pdl dim [cluster x var] and returns mask dim [obs x cluster] to cluster membership. an obs is assigned to the first cluster with the smallest distance (ie sum squared error) to cluster centroid. with bad value, obs is assigned by smallest mean squared error across variables.
perldl> $centroid = ones 2, 3
perldl> $centroid(0,) .= 0
perldl> p $centroid
[
[0 1]
[0 1]
[0 1]
]
perldl> $b = qsort( random 4, 3 )
perldl> p $b
[
[0.022774068 0.032513883 0.13890034 0.30942479]
[ 0.16943853 0.50262636 0.56251531 0.7152271]
[ 0.23964483 0.59932745 0.60967495 0.78452117]
]
# notice that 1st 3 obs in $b are on average closer to 0
# and last obs closer to 1
perldl> p $b->assign( $centroid )
[
[1 1 1 0] # cluster 0 membership
[0 0 0 1] # cluster 1 membership
]
assign does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
centroid
Signature: (data(o,v); cluster(o,c); float+ [o]m(c,v); float+ [o]ss(c,v))
takes data dim [obs x var] and mask dim [obs x cluster] and returns average value and ss (ms when data contains bad values) dim [cluster x var] for data where mask = 1. multiple cluster membership for an obs is fine but quits if run into empty cluster ie no qualified element in a mask.
# data is 10 obs x 3 var
perldl> p $d = sequence 10, 3
[
[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
]
# create two clusters by value on 1st var
perldl> p $a = $d( ,(0)) <= 5
[1 1 1 1 1 1 0 0 0 0]
perldl> p $b = $d( ,(0)) > 5
[0 0 0 0 0 0 1 1 1 1]
perldl> p $c = cat $a, $b
[
[1 1 1 1 1 1 0 0 0 0]
[0 0 0 0 0 0 1 1 1 1]
]
perldl> p $d->centroid($c)
# mean for 2 cluster x 3 var
[
[ 2.5 7.5]
[12.5 17.5]
[22.5 27.5]
]
# ss for 2 cluster x 3 var
[
[17.5 5]
[17.5 5]
[17.5 5]
]
centroid does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
kmeans
Implements classic kmeans cluster analysis. Tries several rounds of random-seeding and clustering, returns the best results in terms of R2. Stops when change in R2 is smaller than set criterion.
Alternatively, if a centroid is provided, clustering will proceed from the centroid and there is no random-seeding or multiple tries.
kmeans supports bad value*.
Default options (case insensitive):
V => 1, # prints simple status
FULL => 0, # returns results for all seeding rounds
CNTRD => PDL->null, # clu x var. optional. disables next 3 opts
NTRY => 5, # num of seeding rounds
NSEED => 1000, # num of starting seeds, use n obs up to NSEED
NCLUS => 8, # num of clusters
R2CRT => .001, # stop criterion for R2 change
Usage:
# suppose we have 4 person's ratings on 5 movies
perldl> p $rating = ushort( ceil( random(4, 5) * 5 ) )
[
[3 2 2 3]
[2 4 5 4]
[5 3 2 3]
[3 3 1 5]
[4 3 3 2]
]
# we want to put the 4 persons into 2 groups
perldl> %k = $rating->kmeans( {NCLUS=>2} )
# by default prints back options used
# as well as info for all tries and iterations
CNTRD => Null
FULL => 0
NCLUS => 2
NSEED => 1000
NTRY => 5
R2CRT => 0.001
V => 1
ss/ms total: 20.5
iter 0 R2 [0.46341463 0.46341463 0.46341463 0.46341463 0.024390244]
iter 1 R2 [0.46341463 0.46341463 0.46341463 0.46341463 0.46341463]
perldl> p "$_\t$k{$_}\n" for (sort keys %k)
R2 0.463414634146341
centroid # mean ratings for 2 group x 5 movies
[
[ 2 3]
[4.5 3]
[2.5 4]
[ 2 4]
[ 3 3]
]
cluster # 4 persons' membership in two groups
[
[0 1 1 0]
[1 0 0 1]
]
n [2 2] # cluster size
ss
[
[ 0 0]
[0.5 2]
[0.5 2]
[ 2 2]
[ 0 2]
]
Now, for the valiant, kmeans is threadable. Say you gathered 10 persons' ratings on 5 movies from 2 countries, so the data is dim [10,5,2], and you want to put the 10 persons from each country into 3 clusters, just specify NCLUS => [3,1], and there you have it. The key is for NCLUS to include $data->ndims - 1 numbers. The 1 in [3,1] turns into a dummy dim, so the 3-cluster operation is repeated on both countries. Similarly, when seeding, CNTRD needs to have dims that match the data dims. See stats_kmeans.t for examples w 4D data.
*With bad value, R2 is based on average of variances instead of sum squared error. What's minimized is the average variance across clusters as compared to the original variance with all obs in one cluster. R2 in this case does not have the usual meaning of proportion of variance accounted for, but it does serve the purpose of minimizing variance.
METHODS
iv_cluster
Turns an independent variable into a cluster pdl. Returns cluster pdl and level-to-pdl_index mapping in list context and cluster pdl only in scalar context.
This is the method used for mean and var in anova. The difference between iv_cluster and dummy_code is that iv_cluster returns pdl dim (obs x level) whereas dummy_code returns pdl dim (obs x (level - 1)).
Usage:
perldl> @bake = qw( y y y n n n )
# accepts @ ref or 1d pdl
perldl> p $bake = iv_cluster( \@bake )
[
[1 1 1 0 0 0]
[0 0 0 1 1 1]
]
perldl> p $rating = sequence 6
[0 1 2 3 4 5]
perldl> p $rating->centroid( $bake )
# mean for each iv level
[
[1 4]
]
# ss
[
[2 2]
]
REFERENCES
Romesburg, H.C. (1984). Cluster Analysis for Researchers. NC: Lulu Press.
Wikipedia (retrieved June, 2009). K-means clustering. http://en.wikipedia.org/wiki/K-means_algorithm
AUTHOR
Copyright (C) 2009 Maggie J. Xiong <maggiexyz users.sourceforge.net>
All rights reserved. There is no warranty. You are allowed to redistribute this software / documentation as described in the file COPYING in the PDL distribution.