NAME

Text::SenseClusters::LabelEvaluation::Driver - Module for evaluation of labels of the clusters.

SYNOPSIS

The following code snippet evaluates the cluster labels by comparing
them with text data for gold-standard keys fetched from Wikipedia.

To run this example, copy the 'TestData' folder into the current directory,
or adjust the paths to the label and GoldKeys files.

# Load the LabelEvaluation driver module.
use Text::SenseClusters::LabelEvaluation::Driver;

my $labelFileName = 'TestData/TVS/TVS.label';
my $topicFileName = 'TestData/TVS/TVSTopic.txt';

# Options passed to the driver.
my %inputOptions = (
	senseClusterLabelFileName => $labelFileName,
	labelComparisonMethod     => 'automate',
	goldKeyFileName           => $topicFileName,
	goldKeyDataSource         => 'wikipedia',
	weightRatio               => 10,
	isClean                   => 1,
);

# Create the driver object from a reference to the options hash.
my $driverObject = Text::SenseClusters::LabelEvaluation::Driver->new(\%inputOptions);

# Stop if the constructor reported a problem with the options.
if ($driverObject->{"errorCode"}) {
	print "Please correct the error before proceeding.\n\n";
	exit();
}

# Evaluate the labels and print the score.
my $accuracyScore = $driverObject->evaluateLabels();
print "\n\nScore of label evaluation is :: $accuracyScore \n";
	

Note: For more usage examples, please refer to the test cases in the "t" folder of this package.

DESCRIPTION

This program compares the labels produced by SenseClusters with gold
standards. The gold-standard data can be obtained from:
		1. Wikipedia
		2. WordNet
		3. User-provided data

Wikipedia data is fetched with the WWW::Wikipedia module from CPAN, and the
labels are compared with the gold standards using the Text::Similarity
module. The comparison results are then processed further to obtain the
final mapping and its score.

FILE FORMATS:

senseClusterLabelFileName:

This parameter names the file that contains the labels for the clusters generated by SenseClusters.
The file must be in the same format as the label file generated by SenseClusters.

For example:

Cluster 0 (Descriptive): George Bush, Russian President, British Prime, British Minister, India Pakistan, US George, Prime Minister, 
Cluster 0 (Discriminating): Russian President, British Minister, India Pakistan, US George, 
Cluster 1 (Descriptive): George Bush, British Prime, weapons mass, United Nations, September 11, mass destruction, United States, 
			Prime Minister, military action
Cluster 1 (Discriminating): United Nations, September 11, United States
Cluster 2 (Descriptive): George Bush, weapons destruction, prime minister, axis evil, Saddam Hussein, weapons mass, mass destruction, 
			Gulf War, military action, Iraqi leader
Cluster 2 (Discriminating): weapons destruction, prime minister, axis evil, Saddam Hussein, Gulf War, Iraqi leader
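
As an illustration of this format, the following is a minimal sketch of how such a label file could be parsed. The helper name parse_label_text and the returned data structure are assumptions made for this example and are not part of the module. The sketch also assumes that a line not starting with "Cluster" continues the previous label list.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): parse label-file text into
# $labels->{<cluster number>}{<Descriptive|Discriminating>} = [ label, ... ].
sub parse_label_text {
    my ($text) = @_;
    my (%labels, $current);
    for my $line (split /\n/, $text) {
        if ($line =~ /^Cluster\s+(\d+)\s+\((Descriptive|Discriminating)\):\s*(.*)$/) {
            $current = $labels{$1}{$2} = [];
            $line = $3;
        }
        next unless $current;
        # Labels are comma separated; blank fields (e.g. after a trailing
        # comma) are skipped.
        for my $label (split /,/, $line) {
            $label =~ s/^\s+|\s+$//g;
            push @$current, $label if length $label;
        }
    }
    return \%labels;
}

my $sample = <<'END';
Cluster 0 (Descriptive): George Bush, Russian President, Prime Minister,
Cluster 0 (Discriminating): Russian President, British Minister,
Cluster 1 (Descriptive): George Bush, United Nations,
			Prime Minister, military action
Cluster 1 (Discriminating): United Nations, September 11, United States
END

my $labels = parse_label_text($sample);
print join('; ', @{ $labels->{1}{Descriptive} }), "\n";
```

The module's own reader may treat wrapped lines or trailing commas differently; the sketch only shows the shape of the data.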
	

goldKeyFileName:

This parameter names the file that contains the gold standard keys for the labels of the clusters generated by
SenseClusters.

The format of the gold-standard key file depends on the following
two parameters that the user passes to this module:

labelComparisonMethod

This parameter tells the module whether the user is passing the mapping
between the gold keys and the clusters.

The two available options are:
		1. 'direct'   - the user provides the mapping.
		2. 'automate' - the module finds the best possible mapping between
		                the clusters' labels and the gold keys.
		 

goldKeyDataSource

This parameter tells the module where it can read more information about
the gold keys.

Options for this parameter are:
		1. 'wikipedia' - fetch the data from Wikipedia.
		2. 'wordnet'   - fetch the data from WordNet.
		3. 'userData'  - the user supplies the data (along with the
		                 mapping, if any).
								 				
	

Combinations of the values of the above two parameters give the following six cases.

	(Please note that the separator between a cluster name and its Goldkey is ":::".
	The separator between a Goldkey and its data is also ":::".)
	

Case 1. labelComparisonMethod => 'direct', goldKeyDataSource => 'userData'

a) In this case the user should provide the mapping between the clusters and the Goldkeys.
b) The user should also provide the data about these gold standard keys.

		For example:
				
		Cluster0:::Tony Blair  
		Cluster1:::Vladimir Putin 
		Cluster2:::Saddam Hussein

		Tony Blair::: Anthony Charles Lynton Blair (born 6 May 1953)[1] is a British Labour Party politician who served 
		as the Prime Minister of the United Kingdom from 1997 to 2007. He was the Member of Parliament (MP) for Sedgefield 
		from 1983 to 2007 and Leader of the Labour Party from 1994 to 2007. He resigned from all of these positions in 
		June 2007.
		
		Vladimir Putin::: Vladimir Vladimirovich Putin (born 7 October 1952) is a Russian politician
		who has been the President of Russia since 7 May 2012. Putin previously served as President from 2000 to 2008, and  
		as Prime Minister of Russia from 1999 to 2000 and again from 2008 to 2012. Putin was also previously the Chairman  
		of United Russia.
		
		Saddam Hussein::: Saddam Hussein Abd al-Majid al-Tikriti (28 April 1937[2] – 30 December 2006)[3] was the fifth
		President of Iraq, serving in this capacity from 16 July 1979 until 9 April 2003.[4][5] A leading member of the 
		revolutionary Arab Socialist Ba'ath Party.
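
To make the Case 1 layout concrete, here is a minimal sketch that splits such a file into a cluster-to-key mapping and a key-to-data hash. The helper name parse_gold_key_text is hypothetical. The sketch assumes that mapping lines start with "Cluster" followed by a number, and that lines without ":::" continue the data of the preceding key.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): split the text of a Case 1
# gold-key file into a cluster-to-key mapping and a key-to-data hash.
sub parse_gold_key_text {
    my ($text) = @_;
    my (%mapping, %data, $current_key);
    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '';
        if ($line =~ /^(Cluster\d+):::\s*(.+)$/) {
            $mapping{$1} = $2;                 # e.g. Cluster0 => 'Tony Blair'
            undef $current_key;
        }
        elsif ($line =~ /^(.+?):::\s*(.*)$/) {
            $current_key = $1;                 # start of a key's data block
            $data{$current_key} = $2;
        }
        elsif (defined $current_key) {
            $data{$current_key} .= " $line";   # wrapped continuation line
        }
    }
    return (\%mapping, \%data);
}

my $sample = <<'END';
Cluster0:::Tony Blair
Cluster1:::Vladimir Putin

Tony Blair::: British Labour Party politician who served
as the Prime Minister of the United Kingdom.

Vladimir Putin::: Russian politician who has been the President of Russia.
END

my ($mapping, $data) = parse_gold_key_text($sample);
print "$_ => $mapping->{$_}\n" for sort keys %$mapping;
print "Tony Blair: $data->{'Tony Blair'}\n";
```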

Case 2. labelComparisonMethod => 'direct', goldKeyDataSource => 'wikipedia'

a) In this case the user only needs to provide the mapping between the clusters and the Goldkeys.
b) The user does not need to provide data about these gold standard keys. If the user does provide
   data about these topics, it is ignored.

		For example:
			Cluster0:::Tony Blair  
			Cluster1:::Vladimir Putin 
			Cluster2:::Saddam Hussein

			

Case 3. labelComparisonMethod => 'direct', goldKeyDataSource => 'wordnet'

a) In this case too, the user only needs to provide the mapping between the clusters and the Goldkeys.
b) The user does not need to provide data about these gold standard keys.

		For example:
			Cluster0:::Tony Blair  
			Cluster1:::Vladimir Putin 
			Cluster2:::Saddam Hussein

Case 4. labelComparisonMethod => 'automate', goldKeyDataSource => 'userData'

a) No mapping between the clusters and the Goldkeys is provided.
b) The user only needs to provide the data about the gold standard keys.

		For example:
		Tony Blair::: Anthony Charles Lynton Blair (born 6 May 1953)[1] is a British Labour Party politician who served 
		as the Prime Minister of the United Kingdom from 1997 to 2007. He was the Member of Parliament (MP) for Sedgefield 
		from 1983 to 2007 and Leader of the Labour Party from 1994 to 2007. He resigned from all of these positions in 
		June 2007.
		
		Vladimir Putin::: Vladimir Vladimirovich Putin (born 7 October 1952) is a Russian politician
		who has been the President of Russia since 7 May 2012. Putin previously served as President from 2000 to 2008, and  
		as Prime Minister of Russia from 1999 to 2000 and again from 2008 to 2012. Putin was also previously the Chairman  
		of United Russia.
		
		Saddam Hussein::: Saddam Hussein Abd al-Majid al-Tikriti (28 April 1937[2] – 30 December 2006)[3] was the fifth
		President of Iraq, serving in this capacity from 16 July 1979 until 9 April 2003.[4][5] A leading member of the 
		revolutionary Arab Socialist Ba'ath Party.

Case 5. labelComparisonMethod => 'automate', goldKeyDataSource => 'wikipedia'

a) No mapping between the clusters and the Goldkeys is provided.
b) The user only needs to provide the comma-separated gold standard keys.

For example:
	Tony Blair , Vladimir Putin, Saddam Hussein

Case 6. labelComparisonMethod => 'automate', goldKeyDataSource => 'wordnet'

a) No mapping between the clusters and the Goldkeys is provided.
b) The user only needs to provide the comma-separated gold standard keys.

For example:
	Tony Blair , Vladimir Putin, Saddam Hussein
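
For Cases 5 and 6, the key file is just a comma-separated list. A minimal sketch of reading such a list follows; the helper name parse_key_list is hypothetical, and trimming the whitespace around each key is an assumption based on the example line above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): split a comma-separated
# line of gold standard keys and trim the whitespace around each key.
sub parse_key_list {
    my ($text) = @_;
    my @keys;
    for my $key (split /,/, $text) {
        $key =~ s/^\s+|\s+$//g;
        push @keys, $key if length $key;
    }
    return @keys;
}

my @keys = parse_key_list("Tony Blair , Vladimir Putin, Saddam Hussein\n");
print "$_\n" for @keys;
```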

Sample files for all the cases are included in the 'TestData' folder of the module.

1. TestData/TVS/TVS.label - File containing the labels generated by SenseClusters.

2. TestData/TVS/TVSMappingUserData.txt - File containing the GoldKeys, their mapping with clusters and detailed data about the GoldKeys.

3. TestData/TVS/TVSMapping.txt - File containing the GoldKeys and their mapping with clusters.

4. TestData/TVS/TVSTopic.txt - File containing only the GoldKeys, without any mapping (used with 'automate' in the SYNOPSIS).

5. TestData/TVS/TVSUserData.txt - File containing the GoldKeys and user provided detailed data about these gold keys.

6. TestData/TVS/testTVS.pl - Perl test file that shows how to use these files in various scenarios.

RESULT

a) Contingency Matrix: The contingency matrix is generated from the similarity comparison of the labels with the gold standards. The following is an example contingency matrix:

	Original Contingency Matrix:

	             Bill Clinton    Tony Blair
	-------------------------------------------------
	 Cluster0         54              48
	-------------------------------------------------
	 Cluster1         31              16
	-------------------------------------------------

b) Hungarian Algorithm: The Hungarian algorithm is used to reorder the contingency matrix so that each diagonal element is the similarity score assigned between a cluster and a gold-standard key. This arrangement has the maximum possible diagonal total.

	Example:

	Contingency Matrix after Hungarian Algorithm:

	             Tony Blair    Bill Clinton
	-------------------------------------------------
	 Cluster0         48              54
	-------------------------------------------------
	 Cluster1         16              31
	-------------------------------------------------

c) Conclusion: Displays the mapping found by the Hungarian algorithm:

		Example:

		Final Conclusion using Hungarian Algorithm::
			Cluster0	<-->	Tony Blair
			Cluster1	<-->	Bill Clinton
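
The assignment shown in this example can be checked with a small sketch. Note that it does not implement the Hungarian algorithm: it simply tries every ordering of the columns, which is only practical for tiny matrices but yields the same maximum diagonal total. The helper name permutations is hypothetical.

```perl
use strict;
use warnings;

# Scores from the example above: rows are clusters, columns are gold keys.
my @rows   = ('Cluster0', 'Cluster1');
my @cols   = ('Bill Clinton', 'Tony Blair');
my @matrix = ([54, 48], [31, 16]);

# Hypothetical helper: return every ordering of the given items.
sub permutations {
    my @items = @_;
    return ([]) unless @items;
    my @result;
    for my $i (0 .. $#items) {
        my @rest = @items;
        my ($picked) = splice @rest, $i, 1;
        push @result, [$picked, @$_] for permutations(@rest);
    }
    return @result;
}

# Try each assignment of columns to rows and keep the largest diagonal total.
my ($best_total, @best) = (-1);
for my $perm (permutations(0 .. $#cols)) {
    my $total = 0;
    $total += $matrix[$_][ $perm->[$_] ] for 0 .. $#rows;
    ($best_total, @best) = ($total, @$perm) if $total > $best_total;
}

printf "%s <--> %s\n", $rows[$_], $cols[ $best[$_] ] for 0 .. $#rows;
print "Diagonal total: $best_total\n";
```

For the example scores, the two candidate totals are 54 + 16 = 70 and 48 + 31 = 79, so the second assignment is chosen.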

d) Overall Accuracy: Displays the overall accuracy of the label assignment:

	                     Sum (Diagonal Scores)
	Accuracy = -------------------------------------------
	            Sum (All the Scores of contingency table)

	Example:
	Accuracy of labels is 53.02%
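
The example figure can be reproduced from the reordered matrix shown in (b): the diagonal sums to 48 + 31 = 79 and all scores sum to 149. The following sketch only illustrates the arithmetic; the variable names are assumptions.

```perl
use strict;
use warnings;

# Reordered matrix from the example: the diagonal holds the assigned scores.
my @matrix = ([48, 54], [16, 31]);

my ($diagonal, $total) = (0, 0);
for my $r (0 .. $#matrix) {
    $diagonal += $matrix[$r][$r];
    $total    += $_ for @{ $matrix[$r] };
}

my $accuracy = 100 * $diagonal / $total;
printf "Accuracy of labels is %.2f%%\n", $accuracy;   # 79 / 149, i.e. 53.02%
```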

Help

The LabelEvaluation module expects an options hash as its required argument.

The options hash has the following elements:

senseClusterLabelFileName:

Name of the file containing the labels from SenseClusters. The syntax of the file must match that of the label file produced by SenseClusters. This is a mandatory parameter.

labelComparisonMethod:

Name of the method for comparing the labels with the GoldKeys. This tells the program whether the gold key file provided by the user contains the mapping between the assigned labels and the expected topics of the clusters.

	Possible options are:
		A) 'DirectAssignment' and
		B) 'AutomateAssignment'.

This is a mandatory parameter.

goldKeyFileName:

Name of the file containing the actual topics (keys) and their data for the clusters. This is a mandatory parameter.

goldKeyLength:

This parameter sets the length of the data to be fetched from an external resource such as Wikipedia. The data is used as reference data. By default, the first section of the Wikipedia page is used.

goldKeyDataSource:

This parameter names the external application, or the user-supplied file, from which the keys' data is obtained.

	Options are:
		1. 'Wikipedia'
		2. 'User'
		3. 'Wordnet' (will be supported in the future).

This is a mandatory parameter.

weightRatio:

This ratio sets the weight given to the discriminating labels over the descriptive labels. The default value is 10.

stopListFileLocation:

Name of the file that contains the list of stop words. This parameter is optional. Its format should match the requirements of Text::Similarity, i.e. one stop word per line.

For example, the content of stoplist.txt should look like:
		the
		of
		in
		:
		:
		to
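
As an illustration of the one-word-per-line format, the sketch below loads such a list into a lookup hash and filters a few words with it. This is not how Text::Similarity itself processes the file; the helper name read_stop_list is hypothetical, and an in-memory filehandle stands in for stoplist.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: load one-word-per-line stop words into a lookup hash.
sub read_stop_list {
    my ($fh) = @_;
    my %stop;
    while (my $word = <$fh>) {
        $word =~ s/^\s+|\s+$//g;
        $stop{ lc $word } = 1 if length $word;
    }
    return \%stop;
}

# An in-memory filehandle stands in for stoplist.txt here.
my $content = "the\nof\nin\nto\n";
open my $fh, '<', \$content or die "Cannot open in-memory file: $!";
my $stop = read_stop_list($fh);
close $fh;

# Drop the stop words from a small word list.
my @kept = grep { !$stop->{ lc $_ } } qw(the Prime Minister of India);
print "@kept\n";
```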

isClean:

This variable decides whether to keep or delete temporary files. The default value is 'true'.

verbose:

This variable decides whether to show detailed results to the user. The default value is 0 (off); set it to 1 to turn it on.

help:

This variable decides whether to display help to the user. The default value is 0.

%inputOptions = (
	senseClusterLabelFileName => '<filelocation>/<SenseClusterLabelFileName>',
	labelComparisonMethod => 'DirectAssignmentOrAutomateAssignment',
	goldKeyFileName => '<filelocation>/<ActualTopicName>',
	goldKeyLength => '<LengthOfDataFetchedFromExternalResource>',
	goldKeyDataSource => '<NameOfSourceFromWhichTopicDataBeFetched>',
	weightRatio => '<WeightageRatioOfDiscriminatingToDescriptiveLabel>',
	stopListFileLocation => '<filelocation>/<StopListFileLocation>',
	isClean => 1,
	verbose => 0,
	help => 0
);

Examples

With minimum parameters:

	%inputOptions = (
		senseClusterLabelFileName => 'labelFile.txt', 
		labelComparisonMethod => 'DirectAssignment',
		goldKeyFileName => 'goldKeyFile.txt',
		goldKeyDataSource => 'UserData'
	);

The four parameters shown above are mandatory.
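
A caller may want to check for the four mandatory keys before constructing the driver. The following is a minimal sketch of such a check; the helper missing_options is hypothetical and is not provided by the module, which reports problems through its own "errorCode" field as shown in the SYNOPSIS.

```perl
use strict;
use warnings;

# The four mandatory keys named in this section.
my @mandatory = qw(
    senseClusterLabelFileName
    labelComparisonMethod
    goldKeyFileName
    goldKeyDataSource
);

# Hypothetical helper (not part of the module): list the mandatory keys
# that are absent or empty in an options hash.
sub missing_options {
    my ($options) = @_;
    return grep { !defined $options->{$_} || $options->{$_} eq '' } @mandatory;
}

# goldKeyDataSource is deliberately left out to show the check firing.
my %inputOptions = (
    senseClusterLabelFileName => 'labelFile.txt',
    labelComparisonMethod     => 'DirectAssignment',
    goldKeyFileName           => 'goldKeyFile.txt',
);

my @missing = missing_options(\%inputOptions);
print "Missing mandatory option(s): @missing\n" if @missing;
```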

For Help:

%inputOptions = (
	help => 1
);

With all parameters:

%inputOptions = (
	senseClusterLabelFileName => 'labelFile.txt', 
	labelComparisonMethod => 'AutomateAssignment',
	goldKeyFileName => 'goldKeyFile.txt',
	goldKeyLength => 2000,
	goldKeyDataSource => 'Wikipedia',
	weightRatio => 10,
	stopListFileLocation => 'stoplist.txt',
	isClean => 1,
	verbose => 1,
	help => 0
);

Constructor: new()

This is the constructor, which creates an object of this class. Reference: http://perldoc.perl.org/perlobj.html

The constructor takes a reference to the options hash and initializes the object with it.

%inputOptions = (
	senseClusterLabelFileName => 'value1', 
	labelComparisonMethod => 'value2',
	goldKeyFileName => 'value3',
	goldKeyLength => value4,
	goldKeyDataSource => 'value5',
	weightRatio => value6,
	stopListFileLocation => 'value7',
	isClean => value8,
	verbose => value9,
	help => value10
);

Please refer to the "Help" section for a detailed discussion of this hash.

Function: evaluateLabels

This function evaluates the labels of the clusters. It calls the other modules to complete the process.

@argument : $driverObject : Object of the current file.

@return : $accuracy : DataType(Float) Indicates the overall accuracy of the assignments.

@description :

The overall algorithm for calculating the accuracy of the label assignment with the help of gold standard keys is:

Step 1: Read the clusters and their labels from the ClusterLabel file.

Case A: User has provided the mapping between the clusters and the gold standard keys.

Step 2: Read the Clusters-Topics mapping information.

Subcase1: User provides data for gold standard keys.

Step 3: Read the gold standard keys and their data from the file provided by user.

Step 4: No data needs to be fetched; continue to the next step.

Subcase2: User provides the gold standard keys. We will fetch data from Wikipedia. User provides only the keys and their mapping, not the data.

Step 3: Read gold standard keys from the file provided by user.

Step 4: Read data about the gold standard keys from Wikipedia.

Subcase3: User provides the gold standard keys. We will fetch data from WordNet.

Step 3: Read gold standard keys from the file provided by user.

Step 4: Read data about the gold standard keys from WordNet.

Common steps for all three subcases.

Step 5: Create the contingency matrix with the similarity scores of each cluster's labels against each gold standard key's data (obtained from steps 3 and 4).

Step 6: Use the mapping provided by the user (step 2) to calculate the diagonal score of the contingency matrix.

Step 7: The overall accuracy of the current label assignment is calculated as:

	                  Sum (Diagonal Scores)
	Accuracy = --------------------------------------------------
	            Sum (All the Scores of contingency table)


Case B: User has not provided the mapping between the clusters and the gold standard keys.

We will use the Hungarian algorithm to compute the mapping.

Subcase1: User provides data for gold standard keys.

Step 2: Read the gold standard keys and their data from the file provided by user.

Step 3: No data needs to be fetched; continue to the next step.

Subcase2: User provides the gold standard keys. We will fetch data from Wikipedia. User provides only the keys, with no data and no mapping.

Step 2: Read gold standard keys from the file provided by user.

Step 3: Read data about the gold standard keys from Wikipedia.

Subcase3: User provides the gold standard keys. We will fetch data from WordNet.

Step 2: Read gold standard keys from the file provided by user.

Step 3: Read data about the gold standard keys from WordNet.

Common steps for all three subcases.

Step 4: Create the contingency matrix with the similarity scores of each cluster's labels against each gold standard key's data (obtained from steps 2 and 3).

Step 5: Use the Hungarian algorithm to determine the mapping of clusters to gold standard keys.

Step 6: Use the above mapping to calculate the total diagonal score of the new contingency matrix.

Step 7: The overall accuracy of the current label assignment is calculated as:

	                  Sum (Diagonal Scores)
	Accuracy = --------------------------------------------------
	            Sum (All the Scores of contingency table)

Function: makeContigencyMatrix

This method builds the contingency matrix containing the similarity scores of the labels against the data of the gold standard keys.

@argument : $labelSenseClustersHashRef (Hash containing the labels generated by the SenseClusters)

@argument : $topicDataHashRef (Hash containing the data of the gold standard keys)

@argument : $weightageRatio (Parameter which tells the weightage to be given to discriminating labels over descriptive labels of the SenseClusters)

@return : 1. @matrixScore - Contingency matrix containing the similarity-scores.

@return : 2. @colHeader - Array containing the column header for the contingency matrix.

@return : 3. @rowHeader - Array containing the row header for the contingency matrix.

@return : 4. $totalMatrixScore - Total similarity scores of the contingency matrix.

@description :

1). It iterates through the hash (%labelSenseClustersHash) and extracts the descriptive and discriminating labels for each cluster.

2). It reads the data about each gold standard key from the hash (%topicDataHash).

3). It then uses the module Text::SenseClusters::LabelEvaluation::SimilarityScore to get the various similarity scores.

4). Finally, it uses the raw-lesk scores to prepare the contingency matrix.
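
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: a simple shared-word count stands in for the raw-lesk score, the layout of the two input hashes is invented for the example, and the way the weight ratio is applied (descriptive score plus ratio times discriminating score) is an assumption. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: set of distinct lowercase words in a text.
sub word_set {
    my ($text) = @_;
    my %words;
    $words{ lc $_ } = 1 for $text =~ /\w+/g;
    return \%words;
}

# Stand-in similarity (NOT raw-lesk): number of distinct shared words.
sub overlap_score {
    my ($first, $second) = @_;
    my ($set_a, $set_b) = (word_set($first), word_set($second));
    return scalar grep { $set_b->{$_} } keys %$set_a;
}

# Hypothetical sketch; the weighting rule below is an assumption.
sub make_matrix_sketch {
    my ($labels, $topic_data, $weight_ratio) = @_;
    my @row_header = sort keys %$labels;
    my @col_header = sort keys %$topic_data;
    my (@matrix, $total);
    $total = 0;
    for my $r (0 .. $#row_header) {
        my $cluster = $labels->{ $row_header[$r] };
        for my $c (0 .. $#col_header) {
            my $data  = $topic_data->{ $col_header[$c] };
            my $score = overlap_score($cluster->{Descriptive}, $data)
                + $weight_ratio * overlap_score($cluster->{Discriminating}, $data);
            $matrix[$r][$c] = $score;
            $total += $score;
        }
    }
    return (\@matrix, \@col_header, \@row_header, $total);
}

my %labels = (
    Cluster0 => { Descriptive    => 'Prime Minister, British Minister',
                  Discriminating => 'British Minister' },
    Cluster1 => { Descriptive    => 'Prime Minister, Russian President',
                  Discriminating => 'Russian President' },
);
my %topic_data = (
    'Tony Blair'     => 'British politician who served as Prime Minister',
    'Vladimir Putin' => 'Russian politician who has been the President of Russia',
);

my ($matrix, $cols, $rows, $total)
    = make_matrix_sketch(\%labels, \%topic_data, 10);
printf "%-10s %s\n", $rows->[$_], join(' ', @{ $matrix->[$_] }) for 0 .. $#$rows;
print "Total: $total\n";
```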

Function: calculateAccuracy

This method calculates the accuracy score of the labels generated by SenseClusters or other sources.

@argument1 : $mappingHashRef (Reference to Hash which contains the mapping information about the cluster and gold standard)

@argument2 : $matrixScoreRef (2-D Array/Matrix which contains the similarity-scores of each labels)

@argument3 : $colHeaderRef (Reference of array which contains the column header)

@argument4 : $rowHeaderRef (Reference of array which contains the row header)

@argument5 : $totalMatrixScore (Total similarity score of the labels with gold standard)

@return : Return the overall accuracy of the labels assigned by the SenseClusters.

@description :

1). With the help of ($mappingHashRef, $matrixScoreRef, $colHeaderRef, $rowHeaderRef), this function calculates the sum of all diagonal elements.

2). It then calculates the accuracy of the assignment as

	             Sum (Diagonal Scores)
	Accuracy = -----------------------------------
	             Sum (All the Scores)
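
A minimal sketch of this computation follows. The argument order mirrors the list in this section, but the body and the name accuracy_sketch are assumptions made for illustration, not the module's code. The sample values reuse the Bill Clinton / Tony Blair example from the RESULT section.

```perl
use strict;
use warnings;

# Hypothetical sketch: sum the score at each (cluster, mapped key) position
# and divide by the total of all scores.
sub accuracy_sketch {
    my ($mapping, $matrix, $col_header, $row_header, $total) = @_;
    return 0 unless $total;

    # Column name -> column index lookup.
    my %col_index;
    @col_index{@$col_header} = (0 .. $#$col_header);

    my $diagonal = 0;
    for my $r (0 .. $#$row_header) {
        my $key = $mapping->{ $row_header->[$r] };
        next unless defined $key && exists $col_index{$key};
        $diagonal += $matrix->[$r][ $col_index{$key} ];
    }
    return $diagonal / $total;
}

my %mapping = (Cluster0 => 'Tony Blair', Cluster1 => 'Bill Clinton');
my @matrix  = ([54, 48], [31, 16]);

my $accuracy = accuracy_sketch(
    \%mapping, \@matrix,
    ['Bill Clinton', 'Tony Blair'],   # column header
    ['Cluster0', 'Cluster1'],         # row header
    149,                              # sum of all scores
);
printf "Accuracy: %.4f\n", $accuracy;   # (48 + 31) / 149
```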
		 						

BUGS

  • Comparison against WordNet gold standards is not currently supported.

SEE ALSO

http://senseclusters.cvs.sourceforge.net/viewvc/senseclusters/LabelEvaluation/

Last modified by : $Id: Driver.pm,v 1.6 2013/03/18 02:59:42 jhaxx030 Exp $

AUTHORS

Anand Jha, University of Minnesota, Duluth
jhaxx030 at d.umn.edu

Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu

COPYRIGHT AND LICENSE

Copyright (C) 2012-2013 Ted Pedersen, Anand Jha

See http://dev.perl.org/licenses/ for more information.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

The Free Software Foundation, Inc., 59 Temple Place, Suite 330, 
Boston, MA  02111-1307  USA
