NAME
Text::SenseClusters::LabelEvaluation::Driver - Module for evaluation of labels of the clusters.
SYNOPSIS
The following code snippet evaluates cluster labels by comparing
them with the text of gold-standard keys fetched from Wikipedia.
To run it, copy the 'TestData' folder into the current directory,
or adjust the paths of the label and gold-key files accordingly.
# Including the LabelEvaluation Module.
use Text::SenseClusters::LabelEvaluation::Driver;
my $labelFileName = 'TestData/TVS/TVS.label';
my $topicFileName = 'TestData/TVS/TVSTopic.txt';
# Options passed to the LabelEvaluation module.
my %inputOptions = (
senseClusterLabelFileName => $labelFileName,
labelComparisonMethod => 'automate',
goldKeyFileName => $topicFileName,
goldKeyDataSource => 'wikipedia',
weightRatio => 10,
isClean => 1,
);
# Creating the driver object from the options hash, which names the
# label and topic files.
my $driverObject = Text::SenseClusters::LabelEvaluation::Driver->
new (\%inputOptions);
if($driverObject->{"errorCode"}){
print "Please correct the error before proceeding.\n\n";
exit();
}
my $accuracyScore = $driverObject->evaluateLabels();
# Printing the score.
print "\n\nScore of label evaluation is :: $accuracyScore \n";
Note: For more usage examples, please refer to the test cases in the "t" folder of this package.
DESCRIPTION
This program compares the labels produced by SenseClusters with
gold standards. Gold standards can be obtained from:
1. Wikipedia
2. WordNet
3. User-provided data
Wikipedia data is fetched with the WWW::Wikipedia module from CPAN,
and the labels are compared with the gold standards using the Text::Similarity
module. The comparison results are then processed further to obtain the final
assignment and its score.
FILE FORMATS:
senseClusterLabelFileName:
The name of the file that contains the labels for the clusters generated by SenseClusters.
The file must be in the same format as the label file generated by SenseClusters.
For example:
Cluster 0 (Descriptive): George Bush, Russian President, British Prime, British Minister, India Pakistan, US George, Prime Minister,
Cluster 0 (Discriminating): Russian President, British Minister, India Pakistan, US George,
Cluster 1 (Descriptive): George Bush, British Prime, weapons mass, United Nations, September 11, mass destruction, United States,
Prime Minister, military action
Cluster 1 (Discriminating): United Nations, September 11, United States
Cluster 2 (Descriptive): George Bush, weapons destruction, prime minister, axis evil, Saddam Hussein, weapons mass, mass destruction,
Gulf War, military action, Iraqi leader
Cluster 2 (Discriminating): weapons destruction, prime minister, axis evil, Saddam Hussein, Gulf War, Iraqi leader
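A minimal sketch of how lines in this format could be read into a hash, shown only to make the layout concrete. The helper name parse_label_lines is hypothetical and is not part of this module, and the sketch assumes each cluster entry sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): parse SenseClusters-style
# label lines into { ClusterN => { Descriptive => [...], Discriminating => [...] } }.
sub parse_label_lines {
    my @lines = @_;
    my %labels;
    for my $line (@lines) {
        # Expected shape: "Cluster <n> (<Type>): label, label, ..."
        next unless $line =~ /^Cluster\s+(\d+)\s+\((Descriptive|Discriminating)\):\s*(.*)$/;
        my ($cluster, $type, $list) = ("Cluster$1", $2, $3);
        # Split on commas, trim whitespace, and drop empty entries
        # (sample lines may end with a trailing comma).
        my @phrases = grep { length }
                      map  { my $p = $_; $p =~ s/^\s+|\s+$//g; $p }
                      split /,/, $list;
        $labels{$cluster}{$type} = \@phrases;
    }
    return \%labels;
}

my $labels = parse_label_lines(
    'Cluster 0 (Descriptive): George Bush, Russian President, Prime Minister,',
    'Cluster 0 (Discriminating): Russian President, India Pakistan,',
);
print join('; ', @{ $labels->{Cluster0}{Discriminating} }), "\n";
```

The nested hash keeps descriptive and discriminating labels separate, so that they can be weighted differently later.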
goldKeyFileName:
The name of the file that contains the gold standard keys for the labels of the clusters generated by
SenseClusters.
The format of the gold-standard key file depends on the following
two parameters that the user passes to this module:
labelComparisonMethod
This parameter tells the module whether the user is supplying the mapping
between gold keys and clusters.
The two available options are:
1. 'direct' - the user provides the mapping information.
2. 'automate' - the module finds the best possible mapping between the
clusters' labels and the gold keys.
goldKeyDataSource
This parameter tells the module where to read more information about
the gold keys.
The options for this parameter are:
1. 'wikipedia' - fetch the data from Wikipedia.
2. 'wordnet' - fetch the data from WordNet.
3. 'userData' - the user supplies the data in the gold-key file.
Combinations of the values of the above two parameters give the following six cases.
(Please note that the separator between a cluster name and its gold key is ":::".
The separator between a gold key and its data is also ":::".)
Case 1. labelComparisonMethod => 'direct', goldKeyDataSource => 'userData'
a) In this case the user should provide the mapping between the clusters and the gold keys.
b) The user should also provide the data for these gold standard keys.
For example:
Cluster0:::Tony Blair
Cluster1:::Vladimir Putin
Cluster2:::Saddam Hussein
Tony Blair::: Anthony Charles Lynton Blair (born 6 May 1953)[1] is a British Labour Party politician who served
as the Prime Minister of the United Kingdom from 1997 to 2007. He was the Member of Parliament (MP) for Sedgefield
from 1983 to 2007 and Leader of the Labour Party from 1994 to 2007. He resigned from all of these positions in
June 2007.
Vladimir Putin::: Vladimir Vladimirovich Putin (Russian: ( listen); born 7 October 1952) is a Russian politician
who has been the President of Russia since 7 May 2012. Putin previously served as President from 2000 to 2008, and
as Prime Minister of Russia from 1999 to 2000 and again from 2008 to 2012. Putin was also previously the Chairman
of United Russia.
Saddam Hussein::: Saddam Hussein Abd al-Majid al-Tikriti 28 April 1937[2] – 30 December 2006)[3] was the fifth
President of Iraq, serving in this capacity from 16 July 1979 until 9 April 2003.[4][5] A leading member of the
revolutionary Arab Socialist Ba'ath Party.
Case 2. labelComparisonMethod => 'direct', goldKeyDataSource => 'wikipedia'
a) In this case the user only needs to provide the mapping between the clusters and the gold keys.
b) The user does not need to provide the data for these gold standard keys. If the user does provide
data for these topics, it is ignored.
For example:
Cluster0:::Tony Blair
Cluster1:::Vladimir Putin
Cluster2:::Saddam Hussein
Case 3. labelComparisonMethod => 'direct', goldKeyDataSource => 'wordnet'
a) In this case, too, the user only needs to provide the mapping between the clusters and the gold keys.
b) The user does not need to provide the data for these gold standard keys.
For example:
Cluster0:::Tony Blair
Cluster1:::Vladimir Putin
Cluster2:::Saddam Hussein
Case 4. labelComparisonMethod => 'automate', goldKeyDataSource => 'userData'
a) No mapping between the clusters and the gold keys is provided.
b) The user only needs to provide the data for these gold standard keys.
For example:
Tony Blair::: Anthony Charles Lynton Blair (born 6 May 1953)[1] is a British Labour Party politician who served
as the Prime Minister of the United Kingdom from 1997 to 2007. He was the Member of Parliament (MP) for Sedgefield
from 1983 to 2007 and Leader of the Labour Party from 1994 to 2007. He resigned from all of these positions in
June 2007.
Vladimir Putin::: Vladimir Vladimirovich Putin (Russian: ( listen); born 7 October 1952) is a Russian politician
who has been the President of Russia since 7 May 2012. Putin previously served as President from 2000 to 2008, and
as Prime Minister of Russia from 1999 to 2000 and again from 2008 to 2012. Putin was also previously the Chairman
of United Russia.
Saddam Hussein::: Saddam Hussein Abd al-Majid al-Tikriti 28 April 1937[2] – 30 December 2006)[3] was the fifth
President of Iraq, serving in this capacity from 16 July 1979 until 9 April 2003.[4][5] A leading member of the
revolutionary Arab Socialist Ba'ath Party.
Case 5. labelComparisonMethod => 'automate', goldKeyDataSource => 'wikipedia'
a) No mapping between the clusters and the gold keys is provided.
b) The user only needs to provide the comma-separated gold standard keys.
For example:
Tony Blair , Vladimir Putin, Saddam Hussein
Case 6. labelComparisonMethod => 'automate', goldKeyDataSource => 'wordnet'
a) No mapping between the clusters and the gold keys is provided.
b) The user only needs to provide the comma-separated gold standard keys.
For example:
Tony Blair , Vladimir Putin, Saddam Hussein
Sample files for all the cases are included in the 'TestData' folder of this module.
1. TestData/TVS/TVS.label - File containing the labels generated by SenseClusters.
2. TestData/TVS/TVSMappingUserData.txt - File containing the gold keys, their mapping to clusters, and detailed data about the gold keys.
3. TestData/TVS/TVSMapping.txt - File containing the gold keys and their mapping to clusters.
4. TestData/TVS/TVSTopic.txt - File containing only the gold keys, without mapping or data (as used in the synopsis with 'automate' and 'wikipedia').
5. TestData/TVS/TVSUserData.txt - File containing the gold keys and user-provided detailed data about them.
6. TestData/TVS/testTVS.pl - Perl test file showing how to use these files in the various scenarios.
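The ":::" separator convention described above can be illustrated with a short sketch. The helper parse_gold_key_lines is hypothetical (it is not the module's own reader), and it assumes that mapping entries have a left-hand side of the form "Cluster<n>" and that a line without ":::" continues the text of the previous gold key.

```perl
use strict;
use warnings;

# Hypothetical helper: split gold-key file lines into a cluster-to-key mapping
# and a key-to-text data hash. Assumes mapping keys look like "Cluster<n>".
sub parse_gold_key_lines {
    my @lines = @_;
    my (%mapping, %data, $current);
    for my $line (@lines) {
        if ($line =~ /^(.+?):::\s*(.*)$/) {
            my ($left, $right) = ($1, $2);
            if ($left =~ /^Cluster\d+$/) {
                $mapping{$left} = $right;    # e.g. Cluster0:::Tony Blair
                undef $current;
            }
            else {
                $data{$left} = $right;       # e.g. Tony Blair::: Anthony ...
                $current = $left;
            }
        }
        elsif (defined $current && $line =~ /\S/) {
            # A line without ":::" continues the previous key's text.
            $data{$current} .= " $line";
        }
    }
    return (\%mapping, \%data);
}

my ($mapping, $data) = parse_gold_key_lines(
    'Cluster0:::Tony Blair',
    'Cluster1:::Vladimir Putin',
    'Tony Blair::: Anthony Charles Lynton Blair is a British Labour Party politician',
    'who served as the Prime Minister of the United Kingdom from 1997 to 2007.',
);
print "$_ => $mapping->{$_}\n" for sort keys %$mapping;
```

With 'direct' and 'userData' (Case 1) both hashes are filled; with 'direct' and 'wikipedia' (Case 2) only the mapping would be.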
RESULT
a) Contingency Matrix: Based on the similarity comparison of the labels with the gold standards, a contingency matrix is generated. The following is an example of a contingency matrix for the example mentioned in the synopsis:
Original Contingency Matrix:
              Bill Clinton    Tony Blair
-------------------------------------------------
Cluster0           54             48
-------------------------------------------------
Cluster1           31             16
-------------------------------------------------
b) The Hungarian algorithm is used to produce a new contingency matrix whose diagonal elements are the similarity scores assigned between each cluster and a gold-standard key. In this arrangement the total of the diagonal is the maximum possible.
Example:
Contingency Matrix after Hungarian Algorithm:
              Tony Blair    Bill Clinton
-------------------------------------------------
Cluster0          48             54
-------------------------------------------------
Cluster1          16             31
-------------------------------------------------
c) Conclusion: Displays the conclusion of the Hungarian algorithm:
Example:
Final Conclusion using Hungarian Algorithm::
Cluster0 <--> Tony Blair
Cluster1 <--> Bill Clinton
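For a matrix this small, the assignment can be checked by hand or with a brute-force search over all column orderings. The sketch below is such an exhaustive search, not the Hungarian algorithm itself (the Hungarian algorithm reaches the same optimum far more efficiently on large matrices). It assumes a square matrix, and all names in it are illustrative.

```perl
use strict;
use warnings;

# Rows are clusters, columns are gold-standard keys (values from the example above).
my @rows   = ('Cluster0', 'Cluster1');
my @cols   = ('Bill Clinton', 'Tony Blair');
my @matrix = ([54, 48], [31, 16]);

# Generate every ordering of the given items (suitable for small matrices only).
sub permutations {
    my @items = @_;
    return ([]) unless @items;
    my @result;
    for my $i (0 .. $#items) {
        my @rest = @items;
        my ($picked) = splice(@rest, $i, 1);
        push @result, [$picked, @$_] for permutations(@rest);
    }
    return @result;
}

# Keep the column ordering whose row-by-row total is largest.
my ($bestTotal, $bestPerm) = (-1, undef);
for my $perm (permutations(0 .. $#cols)) {
    my $total = 0;
    $total += $matrix[$_][ $perm->[$_] ] for 0 .. $#rows;
    ($bestTotal, $bestPerm) = ($total, $perm) if $total > $bestTotal;
}

my %assignment = map { $rows[$_] => $cols[ $bestPerm->[$_] ] } 0 .. $#rows;
print "$_ <--> $assignment{$_}\n" for sort keys %assignment;
print "Diagonal total: $bestTotal\n";
```

Here the two candidate totals are 54 + 16 = 70 and 48 + 31 = 79, so the second ordering is chosen.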
d) Overall accuracy of the label assignment:
Accuracy = Sum(Diagonal Scores) / Sum(All Scores of the contingency table)
Example:
For the matrix above, (48 + 31) / (54 + 48 + 31 + 16) = 79 / 149, so:
Accuracy of labels is 53.02%
Help
The LabelEvaluation module expects an options hash as its required argument.
The options hash has the following elements:
senseClusterLabelFileName:
Name of the file containing the labels from SenseClusters. The syntax of the file must match that of a label file generated by SenseClusters. This parameter is mandatory.
labelComparisonMethod:
Name of the method for comparing the labels with the gold keys. It tells the program whether the key file provided by the user contains the mapping between the assigned labels and the expected topics of the clusters.
Possible options are:
A) 'DirectAssignment' and
B) 'AutomateAssignment'.
This parameter is mandatory.
goldKeyFileName:
Name of the file containing the actual topics (keys) and their data for the clusters. This parameter is mandatory.
goldKeyLength:
This parameter specifies the length of the data to be fetched from an external resource such as Wikipedia. The data is used as reference data. By default, the first section of the Wikipedia page is fetched.
goldKeyDataSource:
This parameter names the external source, or indicates user-supplied data, from which the keys' data is obtained.
Options are:
1. 'Wikipedia'
2. 'User'
3. 'Wordnet' (will be supported in the future).
This parameter is mandatory.
weightRatio:
This ratio specifies the weight given to discriminating labels over descriptive labels. The default value is 10.
stopListFileLocation:
Name of the file that contains the list of stop words. This parameter is optional, and the file's format must match the requirements of Text::Similarity, i.e. one stop word per line.
For example,
the content of stoplist.txt should look like:
the
of
in
:
:
to
isClean:
This variable decides whether to keep or delete temporary files. Default value is 'true'.
verbose:
This variable decides whether to show detailed results to the user. The default value is 0 (off); set it to 1 to turn it on.
help:
This variable decides whether to display help to the user. The default value is 0.
%inputOptions = (
senseClusterLabelFileName => '<filelocation>/<SenseClusterLabelFileName>',
labelComparisonMethod => 'DirectAssignmentOrAutomateAssignment',
goldKeyFileName => '<filelocation>/<ActualTopicName>',
goldKeyLength => '<LengthOfDataFetchedFromExternalResource>',
goldKeyDataSource => '<NameOfSourceFromWhichTopicDataIsFetched>',
weightRatio => '<WeightageRatioOfDiscriminatingToDescriptiveLabel>',
stopListFileLocation => '<filelocation>/<StopListFileLocation>',
isClean => 1,
verbose => 0,
help => 0
);
Examples
With minimum parameters:
%inputOptions = (
senseClusterLabelFileName => 'labelFile.txt',
labelComparisonMethod => 'DirectAssignment',
goldKeyFileName => 'goldKeyFile.txt',
goldKeyDataSource => 'UserData'
);
The four parameters shown above are the mandatory ones.
For Help:
%inputOptions = (
help => 1
);
With all parameters:
%inputOptions = (
senseClusterLabelFileName => 'labelFile.txt',
labelComparisonMethod => 'AutomateAssignment',
goldKeyFileName => 'goldKeyFile.txt',
goldKeyLength => 2000,
goldKeyDataSource => 'Wikipedia',
weightRatio => 10,
stopListFileLocation => 'stoplist.txt',
isClean => 1,
verbose => 1,
help => 0
);
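Since four of the options are mandatory, a caller may want to check the hash before constructing the driver. The following is only a sketch of such a pre-flight check: the helper missing_options is hypothetical, and the module's own validation (reported through its errorCode field) may behave differently.

```perl
use strict;
use warnings;

# Names of the mandatory options, as used in the hashes shown above.
my @mandatory = qw(
    senseClusterLabelFileName
    labelComparisonMethod
    goldKeyFileName
    goldKeyDataSource
);

# Hypothetical helper: return the mandatory keys that are absent or empty.
sub missing_options {
    my ($options) = @_;
    return grep { !defined $options->{$_} || $options->{$_} eq '' } @mandatory;
}

# goldKeyDataSource is deliberately left out to show the check firing.
my %inputOptions = (
    senseClusterLabelFileName => 'labelFile.txt',
    labelComparisonMethod     => 'DirectAssignment',
    goldKeyFileName           => 'goldKeyFile.txt',
);
my @missing = missing_options(\%inputOptions);
print "Missing mandatory options: @missing\n" if @missing;
```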
Constructor: new()
This is the constructor that creates an object of this class. Reference: http://perldoc.perl.org/perlobj.html
The constructor takes a reference to the options hash and initializes the object with it.
%inputOptions = (
senseClusterLabelFileName => 'value1',
labelComparisonMethod => 'value2',
goldKeyFileName => 'value3',
goldKeyLength => value4,
goldKeyDataSource => 'value5',
weightRatio => value6,
stopListFileLocation => 'value7',
isClean => value8,
verbose => value9,
help => value10
);
Please refer to the "Help" section for a detailed discussion of this hash.
Function: evaluateLabels
This function evaluates the labels of the clusters. It calls the other modules to complete the process.
@argument : $driverObject : Object of the current class.
@return : $accuracy : DataType(Float) Indicates the overall accuracy of the assignments.
@description :
The overall algorithm for calculating the accuracy of the label assignment with the help of gold standard keys is:
Step 1: Read the clusters and their labels information from the ClusterLabel file.
Case A: The user has provided the mapping between the clusters and the gold standard keys.
Step 2: Read the cluster-to-topic mapping information.
Subcase 1: The user provides data for the gold standard keys.
Step 3: Read the gold standard keys and their data from the file provided by the user.
Step 4: Nothing to fetch; continue to the next step.
Subcase 2: The user provides only the gold standard keys. The data is fetched from Wikipedia.
Step 3: Read the gold standard keys from the file provided by the user.
Step 4: Read the data about the gold standard keys from Wikipedia.
Subcase 3: The user provides only the gold standard keys. The data is fetched from WordNet.
Step 3: Read the gold standard keys from the file provided by the user.
Step 4: Read the data about the gold standard keys from WordNet.
Step 5: Create the contingency matrix with the similarity scores of each cluster's labels against each
gold standard key's data (obtained from steps 3 and 4).
Step 6: Use the mapping provided by the user (step 2) to calculate the diagonal score of the
contingency matrix.
Step 7: The overall accuracy of the current label assignment is calculated as:
Accuracy = Sum(Diagonal Scores) / Sum(All Scores of the contingency table)
Case B: The user has not provided the mapping between the clusters and the gold standard keys.
The Hungarian algorithm is used to compute the mapping.
Subcase 1: The user provides data for the gold standard keys.
Step 2: Read the gold standard keys and their data from the file provided by the user.
Step 3: Nothing to fetch; continue to the next step.
Subcase 2: The user provides only the gold standard keys, without mapping or data. The data is fetched from Wikipedia.
Step 2: Read the gold standard keys from the file provided by the user.
Step 3: Read the data about the gold standard keys from Wikipedia.
Subcase 3: The user provides only the gold standard keys. The data is fetched from WordNet.
Step 2: Read the gold standard keys from the file provided by the user.
Step 3: Read the data about the gold standard keys from WordNet.
Common steps for all three subcases:
Step 4: Create the contingency matrix with the similarity scores of each cluster's labels against each gold standard key's data (obtained from steps 2 and 3).
Step 5: Use the Hungarian algorithm to determine the mapping of clusters to gold standard keys.
Step 6: Use this mapping to calculate the total diagonal score of the new contingency matrix.
Step 7: The overall accuracy of the current label assignment is calculated as:
Accuracy = Sum(Diagonal Scores) / Sum(All Scores of the contingency table)
Function: makeContigencyMatrix
This method builds the contingency matrix containing the similarity scores of the labels against the data of the gold standard keys.
@argument : $labelSenseClustersHashRef (Reference to the hash containing the labels generated by SenseClusters)
@argument : $topicDataHashRef (Reference to the hash containing the data of the gold standard keys)
@argument : $weightageRatio (The weight given to discriminating labels over descriptive labels of SenseClusters)
@return : 1. @matrixScore - Contingency matrix containing the similarity-scores.
@return : 2. @colHeader - Array containing the column header for the contingency matrix.
@return : 3. @rowHeader - Array containing the row header for the contingency matrix.
@return : 4. $totalMatrixScore - Total similarity scores of the contingency matrix.
@description :
1). It iterates through the hash (%labelSenseClustersHash) and extracts the descriptive and discriminating labels of each cluster.
2). It reads the data about each gold standard key from the hash (%topicDataHash).
3). It then uses the module Text::SenseClusters::LabelEvaluation::SimilarityScore to get the various similarity scores.
4). Finally, it uses the raw-lesk scores to prepare the contingency matrix.
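The shape of this computation can be sketched as follows. This is not the module's implementation. The scorer phrase_hits is a simple stand-in for the raw-lesk score from Text::Similarity, the weighting formula (descriptive score plus ratio times discriminating score) is an assumption, and the helper names and sample data are illustrative.

```perl
use strict;
use warnings;

# Stand-in scorer (NOT Text::Similarity): number of label phrases found in the text.
sub phrase_hits {
    my ($phrases, $text) = @_;
    return scalar grep { index(lc($text), lc($_)) >= 0 } @$phrases;
}

# Hypothetical sketch of the matrix construction. The weighting formula
# (descriptive + ratio * discriminating) is an assumption.
sub make_matrix {
    my ($labelsRef, $topicDataRef, $weightRatio) = @_;
    my @rowHeader = sort keys %$labelsRef;
    my @colHeader = sort keys %$topicDataRef;
    my @matrix;
    my $total = 0;
    for my $r (0 .. $#rowHeader) {
        my $label = $labelsRef->{ $rowHeader[$r] };
        for my $c (0 .. $#colHeader) {
            my $text  = $topicDataRef->{ $colHeader[$c] };
            my $score = phrase_hits($label->{Descriptive}, $text)
                      + $weightRatio * phrase_hits($label->{Discriminating}, $text);
            $matrix[$r][$c] = $score;
            $total += $score;
        }
    }
    return (\@matrix, \@colHeader, \@rowHeader, $total);
}

my %labels = (
    Cluster0 => { Descriptive => ['Prime Minister', 'George Bush'], Discriminating => ['British Minister'] },
    Cluster1 => { Descriptive => ['Prime Minister'], Discriminating => ['President of Russia'] },
);
my %topics = (
    'Tony Blair'     => 'British Labour Party politician who served as the Prime Minister of the United Kingdom',
    'Vladimir Putin' => 'Russian politician who has been the President of Russia and served as Prime Minister of Russia',
);
my ($matrix, $cols, $rows, $total) = make_matrix(\%labels, \%topics, 10);
print join("\t", '', @$cols), "\n";
print join("\t", $rows->[$_], @{ $matrix->[$_] }), "\n" for 0 .. $#$rows;
```

The return order mirrors the four documented return values: the matrix, the column header, the row header, and the total score.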
Function: calculateAccuracy
This method calculates the accuracy score for the labels generated by SenseClusters or other sources.
@argument1 : $mappingHashRef (Reference to the hash that contains the mapping between clusters and gold standard keys)
@argument2 : $matrixScoreRef (Reference to the 2-D array/matrix that contains the similarity scores of the labels)
@argument3 : $colHeaderRef (Reference to the array that contains the column header)
@argument4 : $rowHeaderRef (Reference to the array that contains the row header)
@argument5 : $totalMatrixScore (Total similarity score of the labels with the gold standard)
@return : The overall accuracy of the labels assigned by SenseClusters.
@description :
1). Using $mappingHashRef, $matrixScoreRef, $colHeaderRef and $rowHeaderRef, this function calculates the sum of all diagonal elements.
2). It then calculates the accuracy of the assignment as:
Accuracy = Sum(Diagonal Scores) / Sum(All Scores)
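A minimal sketch of this calculation, using the two-cluster example from the RESULT section. The function accuracy_percent is a hypothetical re-implementation, not the module's code, and it assumes the accuracy is expressed as a percentage, as in the sample output.

```perl
use strict;
use warnings;

# Hypothetical sketch: sum the scores of the mapped (cluster, key) cells and
# divide by the total of all scores. The result is returned as a percentage.
sub accuracy_percent {
    my ($mappingRef, $matrixRef, $colHeaderRef, $rowHeaderRef, $totalScore) = @_;
    return 0 unless $totalScore;
    # Look up each gold key's column position by name.
    my %colIndex = map { $colHeaderRef->[$_] => $_ } 0 .. $#$colHeaderRef;
    my $diagonal = 0;
    for my $r (0 .. $#$rowHeaderRef) {
        my $key = $mappingRef->{ $rowHeaderRef->[$r] };
        next unless defined $key && exists $colIndex{$key};
        $diagonal += $matrixRef->[$r][ $colIndex{$key} ];
    }
    return 100 * $diagonal / $totalScore;
}

my %mapping = (Cluster0 => 'Tony Blair', Cluster1 => 'Bill Clinton');
my @matrix  = ([54, 48], [31, 16]);
my @cols    = ('Bill Clinton', 'Tony Blair');
my @rows    = ('Cluster0', 'Cluster1');
my $accuracy = accuracy_percent(\%mapping, \@matrix, \@cols, \@rows, 149);
printf "Accuracy of labels is %.2f%%\n", $accuracy;
```

Looking columns up by name means the matrix does not have to be physically reordered before the diagonal is summed.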
BUGS
Comparison against WordNet gold standards is not currently supported.
SEE ALSO
http://senseclusters.cvs.sourceforge.net/viewvc/senseclusters/LabelEvaluation/
Last modified by : $Id: Driver.pm,v 1.6 2013/03/18 02:59:42 jhaxx030 Exp $
AUTHORS
Anand Jha, University of Minnesota, Duluth
jhaxx030 at d.umn.edu
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
COPYRIGHT AND LICENSE
Copyright (C) 2012-2013 Ted Pedersen, Anand Jha
See http://dev.perl.org/licenses/ for more information.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc., 59 Temple Place, Suite 330,
Boston, MA 02111-1307 USA