NAME
Cluster::Similarity - compute the similarity of two classifications.
VERSION
Version 0.02
SYNOPSIS
Compute similarity of two classifications following various cluster similarity evaluation schemes based on contingency tables.
use Cluster::Similarity;
my $sim_calculator = Cluster::Similarity->new( $classification_1, $classification_2 );
my $pair_wise_recall = $sim_calculator->pair_wise_recall();
my $pair_wise_precision = $sim_calculator->pair_wise_precision();
my $pair_wise_f_score = $sim_calculator->pair_wise_fscore();
my $mutual_information = $sim_calculator->mutual_information();
my $rand_index = $sim_calculator->rand_index();
my $rand_adj = $sim_calculator->rand_adjusted($max_index);
my $matching = $sim_calculator->matching_index();
my $contingency_table = $sim_calculator->contingency();
my $pairs_matrix = $sim_calculator->pairs_matrix();
my $pair_of_cell_12 = $sim_calculator->pairs(1,2);
DESCRIPTION
Computes the similarity of two word clusterings using several clustering similarity measures.
Consider, for example, the following groupings:
clustering_1: { {a, b, c}, {d, e, f} }
clustering_2: { {a, b}, {c, d, e}, {f} }
Cluster similarity measures provide a numerical value helping to assess the alikeness of two such groupings.
All cluster similarity measures implemented in this module are based on the so-called contingency table of the two classifications (clusterings). The contingency table is a matrix with a cell for each pair of classes (one from each classification), containing the number of objects present in both classes.
The similarity measures (and also the examples and tests) are taken from Chapter 4 of Sabine Schulte im Walde's PhD thesis:
Sabine Schulte im Walde. Experiments on the Automatic Induction of German Semantic Verb Classes. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2003. Published as AIMS Report 9(2) http://www.schulteimwalde.de/phd-thesis.html
Please see the thesis for a more in-depth description of the similarity measures and further details.
INTERFACE
Constructor
FUNCTIONS
Providing the Data
- load_data(\@classification_1, \@classification_2)
- load_data(\%classification_1, \%classification_2)
- set_classification_1(\@classification_1), set_classification_2(\@classification_2)
- set_classification_1(\%classification_1), set_classification_2(\%classification_2)
When calling these methods, the contingency tables and all previously computed similarity values are reset.
objects, object_number
Return (number of) objects in either classification
contingency
Compute the contingency table for two classifications. The contingency table is a matrix with a cell for each pair of classes (one class from each classification). Each cell contains the number of objects present in both classes.
E.g., for the classifications
- { {a, b, c}, {d, e, f} }
- { {a, b}, {c, d, e}, {f} }
the returned contingency table is:
{
    'c_1' => {
        'c_1' => 2,
        'c_2' => 0
    },
    'c_2' => {
        'c_1' => 1,
        'c_2' => 2
    },
    'c_3' => {
        'c_1' => 0,
        'c_2' => 1
    }
}
This is a hash representation of the matrix:
2 0
1 2
0 1
with the columns indexed by the classes of the first classification and the rows by the classes of the second classification.
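The construction can be illustrated with a short standalone sketch that does not use the module itself. The array-of-classes layout of `@clustering_1` and `@clustering_2` and the `c_N` labels are assumptions made for this illustration; the input format expected by the module may differ.

```perl
use strict;
use warnings;

# The two example clusterings, written as arrays of classes.
# (Assumed layout for this sketch, not necessarily the module's input format.)
my @clustering_1 = ( [qw(a b c)], [qw(d e f)] );
my @clustering_2 = ( [qw(a b)], [qw(c d e)], [qw(f)] );

# Map every object to the label of its class in clustering 1.
my %class_of;
for my $i ( 0 .. $#clustering_1 ) {
    $class_of{$_} = 'c_' . ( $i + 1 ) for @{ $clustering_1[$i] };
}

# Outer keys: classes of clustering 2 (rows).
# Inner keys: classes of clustering 1 (columns).
my %contingency;
for my $j ( 0 .. $#clustering_2 ) {
    my $row = 'c_' . ( $j + 1 );

    # Start every cell at 0 so that empty intersections are present too.
    $contingency{$row}{ 'c_' . ( $_ + 1 ) } = 0 for 0 .. $#clustering_1;

    # Count each object of this class under its clustering-1 class.
    $contingency{$row}{ $class_of{$_} }++ for @{ $clustering_2[$j] };
}

# Print the matrix row by row.
for my $row ( sort keys %contingency ) {
    my @cells = map { $contingency{$row}{$_} } sort keys %{ $contingency{$row} };
    print "@cells\n";
}
```

For the example data this prints the three rows of the matrix shown above.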
pairs_contingency
Compute the contingency table for the number of common element pairs in the two classifications.
For the example above this would be:
1 0
0 1
0 0
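Each cell of the pairs table is the number of unordered pairs that can be formed from the objects counted in the corresponding contingency cell, i.e. "n choose 2". A minimal sketch, assuming the contingency matrix of the example is given as an array of rows (the helper `choose_2` is a name introduced here, not part of the module):

```perl
use strict;
use warnings;

# Contingency matrix of the example
# (rows: classes of classification 2, columns: classes of classification 1).
my @contingency = ( [ 2, 0 ], [ 1, 2 ], [ 0, 1 ] );

# Number of unordered pairs among $n objects (hypothetical helper).
sub choose_2 {
    my ($n) = @_;
    return $n < 2 ? 0 : $n * ( $n - 1 ) / 2;
}

# Apply choose_2 to every cell, keeping the matrix shape.
my @pairs = map { [ map { choose_2($_) } @$_ ] } @contingency;

print "@$_\n" for @pairs;
```

Only cells holding at least two objects contribute a pair, so a cell with count 1 maps to 0.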
true_positives
True positives are the number of object pairs which occur together in both classifications.
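To make the definition concrete, the following sketch enumerates all object pairs of the example clusterings and keeps those that share a class in both. The object-to-label hashes `%label_1` and `%label_2` are a layout chosen for this illustration only; the module may compute the value differently, for instance by summing the pairs contingency table.

```perl
use strict;
use warnings;

# Class label of each object in the two example clusterings
# (hypothetical object => label layout chosen for this sketch).
my %label_1 = ( a => 1, b => 1, c => 1, d => 2, e => 2, f => 2 );
my %label_2 = ( a => 1, b => 1, c => 2, d => 2, e => 2, f => 3 );

my @objects = sort keys %label_1;
my @together;    # pairs that share a class in both clusterings

for my $i ( 0 .. $#objects - 1 ) {
    for my $j ( $i + 1 .. $#objects ) {
        my ( $x, $y ) = @objects[ $i, $j ];
        push @together, "$x-$y"
            if $label_1{$x} == $label_1{$y} && $label_2{$x} == $label_2{$y};
    }
}

my $true_positives = @together;
print "$true_positives true positives: @together\n";
```

For the example, the pairs a-b and d-e are the only ones grouped together in both clusterings, giving 2 true positives.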
pairs_classification_1, pairs_classification_2
Number of pairs in classification.
pair_wise_precision, pair_wise_recall, pair_wise_fscore
Pair-wise recall is the number of true positives divided by the number of pairs in classification 1.
Pair-wise precision is the number of true positives divided by the number of pairs in classification 2.
Pair-wise F-score is the harmonic mean of precision and recall, i.e. 2*precision*recall / (precision + recall).
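As a worked example, the three values can be computed by hand for the clusterings { {a, b, c}, {d, e, f} } and { {a, b}, {c, d, e}, {f} }. This is a standalone sketch of the formulas above, not the module's code; `choose_2` and the variable names are introduced here for illustration.

```perl
use strict;
use warnings;

# Number of unordered pairs among $n objects (hypothetical helper).
sub choose_2 { my ($n) = @_; return $n * ( $n - 1 ) / 2 }

# Class sizes of the example clusterings and the contingency cells.
my @sizes_1 = ( 3, 3 );       # { {a, b, c}, {d, e, f} }
my @sizes_2 = ( 2, 3, 1 );    # { {a, b}, {c, d, e}, {f} }
my @cells   = ( 2, 0, 1, 2, 0, 1 );

my ( $pairs_1, $pairs_2, $true_positives ) = ( 0, 0, 0 );
$pairs_1        += choose_2($_) for @sizes_1;    # pairs in classification 1
$pairs_2        += choose_2($_) for @sizes_2;    # pairs in classification 2
$true_positives += choose_2($_) for @cells;      # pairs shared by both

my $recall    = $true_positives / $pairs_1;
my $precision = $true_positives / $pairs_2;
my $f_score   = 2 * $precision * $recall / ( $precision + $recall );

printf "recall=%.3f precision=%.3f f-score=%.3f\n",
    $recall, $precision, $f_score;
```

With 6 pairs in classification 1, 4 pairs in classification 2 and 2 true positives, recall is 2/6, precision is 2/4 and the F-score is 0.4.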
mutual_information
Mutual information is a symmetric measure of the degree of dependency between two classifications, used here as introduced by Strehl et al. (2000).
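The sketch below computes the plain, unnormalized mutual information (in nats) from the contingency matrix of the example clusterings { {a, b, c}, {d, e, f} } and { {a, b}, {c, d, e}, {f} }. It is meant only to show the kind of quantity involved: the variant of Strehl et al. applies a normalization that is not reproduced here, so the value returned by the module may well differ.

```perl
use strict;
use warnings;

# Contingency matrix of the example
# (rows: classes of classification 2, columns: classes of classification 1).
my @contingency = ( [ 2, 0 ], [ 1, 2 ], [ 0, 1 ] );

# Row sums, column sums and total number of objects.
my $n = 0;
my ( @row_sum, @col_sum );
for my $r ( 0 .. $#contingency ) {
    for my $c ( 0 .. $#{ $contingency[$r] } ) {
        my $count = $contingency[$r][$c];
        $row_sum[$r] += $count;
        $col_sum[$c] += $count;
        $n           += $count;
    }
}

# MI = sum over cells of p(r,c) * log( p(r,c) / ( p(r) * p(c) ) )
my $mi = 0;
for my $r ( 0 .. $#contingency ) {
    for my $c ( 0 .. $#{ $contingency[$r] } ) {
        my $count = $contingency[$r][$c] or next;    # empty cells contribute nothing
        $mi += ( $count / $n )
            * log( $count * $n / ( $row_sum[$r] * $col_sum[$c] ) );
    }
}

printf "unnormalized mutual information: %.4f nats\n", $mi;
```

Swapping rows and columns leaves the result unchanged, which is the symmetry mentioned above.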
rand_index
The Rand index (defined by Rand, 1971) is based on the agreement vs. disagreement between object pairs in clusterings.
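A minimal sketch of the usual Rand formula, assuming the pair counts of the example clusterings { {a, b, c}, {d, e, f} } and { {a, b}, {c, d, e}, {f} } are already known (6 objects, 6 and 4 within-class pairs, 2 shared pairs). The variable names are chosen for this illustration and are not taken from the module.

```perl
use strict;
use warnings;

# Pair counts of the example (6 objects).
my $objects        = 6;
my $pairs_1        = 6;    # pairs grouped together in classification 1
my $pairs_2        = 4;    # pairs grouped together in classification 2
my $true_positives = 2;    # pairs grouped together in both

my $total_pairs = $objects * ( $objects - 1 ) / 2;

# Pairs on which the classifications disagree.
my $only_in_1 = $pairs_1 - $true_positives;
my $only_in_2 = $pairs_2 - $true_positives;

# Pairs separated in both classifications (agreement as well).
my $separated_in_both = $total_pairs - $true_positives - $only_in_1 - $only_in_2;

# Rand index: share of pairs on which both classifications agree.
my $rand_index = ( $true_positives + $separated_in_both ) / $total_pairs;

printf "Rand index: %.3f\n", $rand_index;
```

Of the 15 object pairs, 2 are together in both clusterings and 7 are separated in both, so the index is 9/15 = 0.6.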
rand_adjusted
The Rand index adjusted for chance (Hubert and Arabie, 1985). The adopted model for randomness assumes that the two classifications are picked at random, given the original number of classes and objects, i.e. the contingency table is constructed from the hyper-geometric distribution. The general form of an index corrected for chance is:
Index_adj = (Index - Expected Index) / (Maximum Index - Expected Index)
As the maximum index, I use the minimum of the numbers of possible pairs in the two classifications.
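The correction can be sketched for the example clusterings { {a, b, c}, {d, e, f} } and { {a, b}, {c, d, e}, {f} }. The sketch assumes that the index is the number of pairs shared by both classifications and that the expected index under the hyper-geometric model is pairs_1 * pairs_2 / total_pairs, as in Hubert and Arabie's derivation; whether the module computes it exactly this way is an assumption. For comparison, the sketch also shows the maximum index of Hubert and Arabie's original formulation, the mean of the two pair counts.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Pair counts of the example (6 objects, hence 15 object pairs).
my $total_pairs = 15;
my $pairs_1     = 6;    # pairs in classification 1
my $pairs_2     = 4;    # pairs in classification 2
my $index       = 2;    # pairs shared by both (assumed to be the index)

# Expected index under the hyper-geometric model (assumption, see above).
my $expected = $pairs_1 * $pairs_2 / $total_pairs;

# Maximum index as described in this documentation.
my $max_min = min( $pairs_1, $pairs_2 );
my $adjusted_min = ( $index - $expected ) / ( $max_min - $expected );

# Maximum index in Hubert and Arabie's formulation, for comparison.
my $max_mean = ( $pairs_1 + $pairs_2 ) / 2;
my $adjusted_mean = ( $index - $expected ) / ( $max_mean - $expected );

printf "adjusted (max = minimum of pair counts): %.4f\n", $adjusted_min;
printf "adjusted (max = mean of pair counts):    %.4f\n", $adjusted_mean;
```

Because the minimum of the pair counts is never larger than their mean, the first variant yields a value at least as large as the second whenever the index exceeds its expectation.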
matching_index
Matching index (Fowlkes and Mallows, 1983).
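In its usual pair-counting form, the Fowlkes-Mallows index is the number of shared pairs divided by the geometric mean of the two pair counts. A minimal sketch for the example clusterings { {a, b, c}, {d, e, f} } and { {a, b}, {c, d, e}, {f} }, assuming the module follows this standard definition:

```perl
use strict;
use warnings;

# Pair counts of the example.
my $pairs_1        = 6;    # pairs in classification 1
my $pairs_2        = 4;    # pairs in classification 2
my $true_positives = 2;    # pairs shared by both

# Shared pairs over the geometric mean of the two pair counts.
my $matching_index = $true_positives / sqrt( $pairs_1 * $pairs_2 );

printf "matching index: %.4f\n", $matching_index;
```

The value equals the geometric mean of pair-wise precision and recall, here sqrt(0.5 * 1/3).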
DIAGNOSTICS
Need reference to classification
A "Providing the Data" method was called without enough arguments.
Classifications must be passed as array or hash references
An argument has the wrong type.
Please set/load classifications before calling ... method
The method was called without first providing classification data through one of the "Providing the Data" methods.
Need data for classification 1/2
Data for classification 1 (or 2, respectively) is missing.
CONFIGURATION AND ENVIRONMENT
Cluster::Similarity requires no configuration files or environment variables.
DEPENDENCIES
INCOMPATIBILITIES
None reported.
BUGS AND LIMITATIONS
No bugs have been reported.
Please report any bugs or feature requests to bug-cluster-similarity@rt.cpan.org, or through the web interface at http://rt.cpan.org.
TO DO
Find more suitable return values for cases where a given similarity measure is not applicable.
Make the maximum index configurable for the adjusted Rand measure.
AUTHOR
Ingrid Falk, <ingrid dot falk at loria dot fr>
BUGS
Please report any bugs or feature requests to bug-cluster-similarity at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Cluster-Similarity. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Cluster::Similarity
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
SEE ALSO
For the description of the implemented clustering similarity measures:
Sabine Schulte im Walde. Experiments on the Automatic Induction of German Semantic Verb Classes. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2003. Published as AIMS Report 9(2), http://www.schulteimwalde.de/phd-thesis.html
For building clusterings or classifications:
- Algorithm::Cluster: a Perl interface to the C Clustering Library.
- Text::SenseClusters: clusters similar contexts using co-occurrence matrices and Latent Semantic Analysis.
COPYRIGHT & LICENSE
Copyright 2008 Ingrid Falk, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.