NAME
Text::SenseClusters::LabelEvaluation::ReadingFilesData - Module for reading data from a file as a single string object.
SYNOPSIS
The following code snippets show how to use this module.
Example 1: Reading the label file generated by SenseClusters.
use Text::SenseClusters::LabelEvaluation::ReadingFilesData;
# Name of the cluster labels file generated by SenseClusters.
my $clusterFileName = "TVS.label";
# Alternatively, the file name can be taken from a driver object:
# my $clusterFileName = $driverObject->{$senseClusterLabelFileName};
# Creating the read file object and reading the label examples.
my $readClusterFileObject =
Text::SenseClusters::LabelEvaluation::ReadingFilesData->new ($clusterFileName);
my %labelSenseClustersHash = ();
my $labelSenseClustersHashRef =
$readClusterFileObject->readLinesFromClusterFile(\%labelSenseClustersHash);
%labelSenseClustersHash = %$labelSenseClustersHashRef;
# Iterating the Hash to print the value.
foreach my $key (sort keys %labelSenseClustersHash){
foreach my $innerkey (sort keys %{$labelSenseClustersHash{$key}}){
print "$key :: $innerkey :: $labelSenseClustersHash{$key}{$innerkey} \n";
}
}
Example 2: Reading the user provided Gold Standard keys and their data.
use Text::SenseClusters::LabelEvaluation::ReadingFilesData;
# Reading the topic file name.
my $topicsFileName = "TVS.txt";
# Creating the read object, which will read the gold-standard keys and data provided by user.
my $readFileObject =
Text::SenseClusters::LabelEvaluation::ReadingFilesData->new($topicsFileName);
# Reading the mapping and the list of topics.
my ( $hashRef, $topicArrayRef ) = $readFileObject->readMappingFromTopicFile();
# Dereferencing the hash and the array.
my %mappingHash = %$hashRef;
my @topicArray = @$topicArrayRef;
# Iterating the Hash to print the value.
foreach my $key ( sort keys %mappingHash ) {
print "$key=$mappingHash{$key}\n";
}
# Iterating over the array to print each topic.
foreach my $key (@topicArray) {
print "$key\n";
}
DESCRIPTION
This module provides functions to read the label and topic files.
The first function reads the labelled data generated by SenseClusters and
creates a hash from it. The data format of the input file must match the format
of the label file generated by SenseClusters.
The second function reads a file into a string variable, removing all the
newline characters from it.
The remaining functions read the user-provided file that contains the mapping
of cluster labels to gold standard keys, and/or data about the gold standard
keys or the list of topics.
Constructor: new()
This is the constructor, which creates an object of this class. Reference: http://perldoc.perl.org/perlobj.html
This constructor takes the following argument:
1. $fileNameArg : The name of the file whose data has to be read.
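As an illustration of the pattern described in perlobj, here is a minimal sketch of a constructor that stores a file name in a blessed hash. The package name My::FileReader, the fileName key, and the accessor are hypothetical and are not taken from this module's source; the real constructor may store its argument differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in class (an assumption for illustration only);
# the internals of the real module may differ.
package My::FileReader;

sub new {
    my ( $class, $fileNameArg ) = @_;
    # Keep the file name so that later read methods can open the file.
    my $self = { fileName => $fileNameArg };
    return bless $self, $class;
}

# Simple accessor for the stored file name.
sub fileName { return $_[0]{fileName} }

package main;

my $reader = My::FileReader->new("TVS.label");
print ref($reader), " reads ", $reader->fileName, "\n";
```

Storing the file name on the object is what lets the read functions be called without passing the name again, as in the SYNOPSIS.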
Function: readLinesFromClusterFile
This function reads lines from the file containing the labels of the clusters and builds a hash from them.
@argument1 : Name of the cluster file.
@argument2 : Reference of Hash ($labelSenseClustersHash) which will hold the information in the following format:
For example:
Cluster0{
Descriptive => George Bush, Al Gore, White House, New York
Discriminating => George Bush, York Times
}
Cluster1{
Descriptive => George Bush, BRITAIN London, Prime Minister
Discriminating => BRITAIN London, Prime Minister
}
@return : It will return the reference of the Hash mentioned above: $labelSenseClustersHashRef.
@description :
1. Read the file line by line.
2. Ignore the lines which do not follow one of the following formats:
Cluster 0 (Descriptive): George Bush, Al Gore, White House, New York
Cluster 0 (Discriminating): George Bush, BRITAIN London
3. Create the keys from "Cluster # (Descriptive)" or "Cluster # (Discriminating)" as "OuterKey: Cluster#" and "InnerKey: Descriptive" or "InnerKey: Discriminating".
4. Store the keywords as the value of the hash, as in the example above, e.g.:
$labelSenseClustersGlobalRef{Cluster0}{Discriminating} = "BRITAIN London, Prime Minister";
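The steps above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name parse_cluster_labels is hypothetical, the line format is assumed to be exactly the "Cluster N (Descriptive): ..." form shown above, and an in-memory string stands in for the label file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): parse label lines from an
# open filehandle into a nested hash keyed by cluster and label type.
sub parse_cluster_labels {
    my ($fh) = @_;
    my %labels;
    while ( my $line = <$fh> ) {
        chomp $line;
        # Accept only "Cluster N (Descriptive): ..." or
        # "Cluster N (Discriminating): ..."; skip every other line.
        next
          unless $line =~
          /^\s*Cluster\s+(\d+)\s+\((Descriptive|Discriminating)\):\s*(.*?)\s*$/;
        # Outer key "Cluster#", inner key is the label type.
        $labels{"Cluster$1"}{$2} = $3;
    }
    return \%labels;
}

# In-memory sample standing in for a SenseClusters label file.
my $sample = <<'END';
Some header line that is ignored
Cluster 0 (Descriptive): George Bush, Al Gore, White House, New York
Cluster 0 (Discriminating): George Bush, York Times
Cluster 1 (Descriptive): George Bush, BRITAIN London, Prime Minister
Cluster 1 (Discriminating): BRITAIN London, Prime Minister
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $labelsRef = parse_cluster_labels($fh);
close $fh;

foreach my $cluster ( sort keys %$labelsRef ) {
    foreach my $type ( sort keys %{ $labelsRef->{$cluster} } ) {
        print "$cluster :: $type :: $labelsRef->{$cluster}{$type}\n";
    }
}
```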
Function: readLinesFromTopicFile
This function reads lines from the topic file and returns the list of all the topics.
@argument1 : Name of the topic file.
@return : String containing the list of all the topics (labels) for the clusters.
@description :
1. Read the file line by line.
2. Remove the newline characters and build a string variable which contains the list of all the topics.
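A minimal sketch of this behaviour is shown below. The helper name slurp_without_newlines is hypothetical, and the sketch assumes that newlines are simply deleted (lines are concatenated as they are); the module itself may join lines differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): concatenate all lines from
# a filehandle into one string with the newline characters removed.
sub slurp_without_newlines {
    my ($fh) = @_;
    my $text = '';
    while ( my $line = <$fh> ) {
        $line =~ s/[\r\n]+//g;    # drop newline and carriage-return characters
        $text .= $line;
    }
    return $text;
}

# In-memory sample standing in for a topic file.
my $sample = "topic1, topic2,\n topic3\n";

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $topics = slurp_without_newlines($fh);
close $fh;

print "$topics\n";
```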
Function: readMappingFromTopicFile
This function reads the mapping provided by the user between the cluster's label (Cluster#) and the gold standard key (topic name).
Syntax of the file:
<Cluster><#><Separator(:::)><topic>
Example:
Cluster0:::topic1
Cluster1:::topic2
Cluster2:::topic0
@argument : $readFileObject : Object of the current file.
@return1 : \%clusterTopicMappingHash : DataType : (Reference to Hash) Reference of Hash containing the mapping between the Cluster's label and gold standard key.
@return2 : \@topicArray : DataType : (Reference to array) Reference of array containing the gold standard keys.
@description :
1. Read the file line by line.
2. Check whether the line contains "Cluster#:::".
3. Split such lines on the separator ":::".
4. If the resulting word array does not have 2 elements, ignore the line.
5. Ignore all remaining lines.
Reason for selecting the separator as ":::"
1. It is distinctive and has very little chance of occurring
in a document or text.
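The mapping steps can be sketched as follows. The helper name parse_cluster_topic_mapping is hypothetical, the match on "Cluster" is assumed to be case-insensitive (the examples in this document use both "Cluster0" and "cluster0"), and an in-memory string stands in for the user's file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): collect "Cluster#:::topic"
# mappings from a filehandle.
sub parse_cluster_topic_mapping {
    my ($fh) = @_;
    my ( %mapping, @topics );
    while ( my $line = <$fh> ) {
        chomp $line;
        # Only lines that start with "Cluster#:::" are mapping lines.
        next unless $line =~ /^\s*Cluster\d+:::/i;
        my @words = split /:::/, $line;
        # Ignore malformed lines that do not split into exactly 2 parts.
        next unless @words == 2;
        s/^\s+|\s+$//g for @words;    # trim surrounding whitespace
        my ( $cluster, $topic ) = @words;
        $mapping{$cluster} = $topic;
        push @topics, $topic;
    }
    return ( \%mapping, \@topics );
}

# In-memory sample standing in for the user-provided file.
my $sample = <<'END';
Cluster0:::topic1
Cluster1:::topic2
Cluster2:::topic0
not a mapping line
Cluster3:::
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my ( $mappingRef, $topicsRef ) = parse_cluster_topic_mapping($fh);
close $fh;

print "$_=$mappingRef->{$_}\n" for sort keys %$mappingRef;
print "$_\n" for @$topicsRef;
```

The line "Cluster3:::" is skipped because splitting it yields a single element, which matches step 4 above.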
Function: readTopicDataFromTopicFile
This function reads data about the gold standard key (topic name).
Syntax of the file:
<topicName><Separator(:::)><multi lines topic data>
Example:
topic1:::data1, data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1
data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1
topic2:::data2, data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2
data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2
@argument : $readFileObject : Object of the current file.
@return : \%topicDataHash : DataType : (Reference to Hash) Reference of Hash containing the topics and their corresponding data.
@description :
1. Read the file line by line.
2. Check whether the line contains ":::" and starts with one of the topics:
a. This indicates the start of the topic's data.
b. Read lines until another "topic-name:::" or "cluster#:::" is encountered.
3. Finally, make a hash containing the topic as the key and the topic's data as the value.
4. Return the reference of this hash.
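A sketch of this multi-line parsing is given below. The helper name parse_topic_data is hypothetical. As a simplifying assumption, any "name:::" prefix that is not a cluster label is treated as the start of a topic (the module checks against known topic names), topic names are assumed to contain no whitespace, and continuation lines are joined with a single space.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): build a hash of
# topic name => topic data from "topic:::data" blocks.
sub parse_topic_data {
    my ($fh) = @_;
    my %data;
    my $current;    # topic whose data is being accumulated, if any
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^\s*(\S+?):::(.*)$/ ) {
            my ( $name, $rest ) = ( $1, $2 );
            if ( $name =~ /^cluster\d+$/i ) {
                undef $current;    # a mapping line ends the current topic
                next;
            }
            # Start of a new topic's data.
            $current = $name;
            $data{$current} = $rest;
        }
        elsif ( defined $current && $line =~ /\S/ ) {
            # Continuation line: append it to the current topic's data.
            $line =~ s/^\s+|\s+$//g;
            $data{$current} .= " $line";
        }
    }
    return \%data;
}

# In-memory sample standing in for the user-provided file.
my $sample = <<'END';
topic1:::data1, data1 data1
data1 data1
topic2:::data2, data2
data2 data2
Cluster0:::topic1
stray line after a mapping
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $topicDataRef = parse_topic_data($fh);
close $fh;

print "$_ => $topicDataRef->{$_}\n" for sort keys %$topicDataRef;
```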
Function: readTopicNamesFromTopicFile
This function lists all the topics from the file provided by the user.
Syntax of the file:
<Cluster#><Separator(:::)><topicName>
<topicName><Separator(:::)><multi lines topic data>
<topicName><Separator(:::)><multi lines topic data>
<topicName><Separator(:::)><multi lines topic data>
<Cluster#><Separator(:::)><topicName>
<Cluster#><Separator(:::)><topicName>
Example:
topic1:::data1, data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1
data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1
topic2:::data2, data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2
data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2
cluster0:::topic1
cluster1:::topic2
cluster2:::topic0
@argument : $readFileObject : Object of the current file.
@return : \@topicNameArray : DataType : (Reference to array) Reference of array containing the list of topics.
@description :
1. Read the file line by line.
2. Check whether the line contains ":::".
a. If it starts with "cluster", ignore it.
b. Otherwise, split the line on the separator ":::" and store the results in an array.
c. The first element of the array is the topic name.
d. Push this topic name into the array of topic names.
3. Return the reference of this array.
Reason for selecting the separator as ":::"
1. It is distinctive and has very little chance of occurring in a document or text.
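The topic-name extraction can be sketched as follows. The helper name parse_topic_names is hypothetical, and the check for a leading "cluster" is assumed to be case-insensitive; an in-memory string stands in for the user's file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): list the topic names found
# in a file that mixes topic data blocks and cluster mapping lines.
sub parse_topic_names {
    my ($fh) = @_;
    my @names;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /:::/;           # only separator lines matter
        next if $line =~ /^\s*cluster/i;      # skip cluster mapping lines
        # The first element of the split is the topic name.
        my ($name) = split /:::/, $line;
        next unless defined $name;
        $name =~ s/^\s+|\s+$//g;
        push @names, $name if length $name;
    }
    return \@names;
}

# In-memory sample standing in for the user-provided file.
my $sample = <<'END';
topic1:::data1, data1 data1 data1
data1 data1 data1
topic2:::data2, data2 data2 data2
data2 data2 data2
cluster0:::topic1
cluster1:::topic2
cluster2:::topic0
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $topicNamesRef = parse_topic_names($fh);
close $fh;

print "$_\n" for @$topicNamesRef;
```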
SEE ALSO
http://senseclusters.cvs.sourceforge.net/viewvc/senseclusters/LabelEvaluation/
Last modified by : $Id: ReadingFilesData.pm,v 1.5 2013/03/07 23:15:49 jhaxx030 Exp $
AUTHORS
Anand Jha, University of Minnesota, Duluth
jhaxx030 at d.umn.edu
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
COPYRIGHT AND LICENSE
Copyright (C) 2012-2013 Ted Pedersen, Anand Jha
See http://dev.perl.org/licenses/ for more information.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc., 59 Temple Place, Suite 330,
Boston, MA 02111-1307 USA