NAME
format_clusters.pl - Map Cluto output to Senseval-2 format input file
SYNOPSIS
format_clusters.pl [OPTIONS] CLUTO_SOLUTION RLABEL
DESCRIPTION
This program maps the clustering solution file produced by Cluto back onto the Senseval2 input file, producing more legible forms of output.
INPUT
Required Arguments:
CLUTO_SOLUTION
This is an output file from Cluto that shows which cluster each context is assigned to. It is referred to as *.cluster_solution by the SenseClusters Web interface, or can be specified via the -clustfile option in Cluto. It consists of N lines, where N is the number of contexts, and each line contains an integer value indicating the cluster to which the context represented by that line is assigned.
Each line of this file shows the cluster id assigned to the instance id specified at the same line number in the *.rlabel file. The number of lines in the CLUTO_SOLUTION file should be the same as in the RLABEL file.
RLABEL
The row label file lists the instance ids. Each line shows the instance id to which the cluster id specified at the same line number in *.cluster_solution is assigned. The file name has the extension .rlabel.
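The line-by-line correspondence between the two required files can be sketched as follows. This is a minimal Python illustration, not the program's own Perl code; the function name pair_labels and the sample instance ids are hypothetical, and treating a line-count mismatch as an error is an assumption based on the requirement that both files have the same number of lines.

```python
def pair_labels(solution_lines, rlabel_lines):
    """Pair each instance id with the cluster id found on the same line number."""
    # Strip newlines and skip empty lines (e.g. a trailing blank line).
    clusters = [line.strip() for line in solution_lines if line.strip()]
    instances = [line.strip() for line in rlabel_lines if line.strip()]

    # Both files are expected to have one entry per context.
    if len(clusters) != len(instances):
        raise ValueError(
            f"line count mismatch: {len(clusters)} cluster ids "
            f"vs {len(instances)} instance ids"
        )

    # Cluster ids are integers; instance ids are kept as strings.
    return [(iid, int(cid)) for iid, cid in zip(instances, clusters)]


# Hypothetical sample data standing in for the two input files.
pairs = pair_labels(["0\n", "1\n", "0\n"], ["inst.1\n", "inst.2\n", "inst.3\n"])
print(pairs)
```

With real files, the two arguments could be open file handles, since iterating over a file yields its lines.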
Other Options:
--context SENSEVAL2
SENSEVAL2 should be a file of contexts formatted in the Senseval2 format. These are the contexts that have been clustered. The --context option causes the contexts to be reorganized such that those that occur in the same cluster are grouped together.
--senseval2 SENSEVAL2
SENSEVAL2 should be a file of contexts formatted in the Senseval2 format. These are the contexts that have been clustered. The --senseval2 option causes the contexts to be assigned (or tagged) with the cluster value assigned by Cluto. This cluster value will be put into the answer tag. The contexts are displayed in their original order.
--help
Displays the summary of command line options.
--version
Displays the version information.
OUTPUT
If neither --context nor --senseval2 is specified, the default behavior is that contexts are identified by instance id *only* and grouped together by cluster. Thus, the actual written contexts are not displayed in this case.
Each cluster is formatted as:
<cluster id="CID">
[<instance id="IID"/>]+
</cluster>
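The default grouping described above can be sketched in Python as follows. The function name render_default is hypothetical, and emitting clusters in ascending order of cluster id is an assumption; the text above does not state the order in which clusters are printed.

```python
from collections import defaultdict


def render_default(pairs):
    """Group instance ids by cluster id and render them in the default format."""
    groups = defaultdict(list)
    for instance_id, cluster_id in pairs:
        groups[cluster_id].append(instance_id)

    lines = []
    # Assumption: clusters are listed in ascending order of cluster id.
    for cluster_id in sorted(groups):
        lines.append(f'<cluster id="{cluster_id}">')
        # Instances keep the order in which they appeared in the input.
        for instance_id in groups[cluster_id]:
            lines.append(f'<instance id="{instance_id}"/>')
        lines.append("</cluster>")
    return "\n".join(lines)


# Hypothetical (instance id, cluster id) pairs.
print(render_default([("inst.1", 0), ("inst.2", 1), ("inst.3", 0)]))
```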
If the --context option is used, all the instances are displayed along with their actual context data, grouped by cluster. The output, sent to STDOUT, looks like:
<cluster id="CID">
[<instance id="IID"><context>DATA</context></instance>]+
</cluster>
If the --senseval2 option is used, the output is a copy of the input Senseval2 file, except that the answer tags now contain the cluster id assigned to each instance. The output is sent to STDOUT.
Note: --context and --senseval2 cannot be used together.
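The --senseval2 tagging step can be illustrated with the rough Python sketch below. It assumes that answer tags have the form <answer instance="IID" senseid="..."/>, which is not spelled out in this document, and that instances missing from the mapping are left unchanged; the function name tag_answers and the sample data are hypothetical.

```python
import re

# Assumed answer tag layout: <answer instance="IID" senseid="VALUE"/>
ANSWER_TAG = re.compile(r'<answer instance="([^"]+)" senseid="[^"]*"\s*/>')


def tag_answers(senseval2_text, cluster_of):
    """Replace the senseid of each answer tag with the instance's cluster id."""

    def replace(match):
        instance_id = match.group(1)
        if instance_id not in cluster_of:
            # Assumption: instances without a cluster id are left as they are.
            return match.group(0)
        return f'<answer instance="{instance_id}" senseid="{cluster_of[instance_id]}"/>'

    # Everything outside the answer tags is copied through untouched,
    # so the contexts stay in their original order.
    return ANSWER_TAG.sub(replace, senseval2_text)


# Hypothetical one-instance input.
sample = (
    '<instance id="inst.1">\n'
    '<answer instance="inst.1" senseid="NOTAG"/>\n'
    '<context>some text</context>\n'
    '</instance>'
)
print(tag_answers(sample, {"inst.1": 2}))
```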
BUGS
SYSTEM REQUIREMENTS
AUTHORS
Ted Pedersen, University of Minnesota, Duluth
Amruta Purandare, University of Pittsburgh
Anagha Kulkarni, Carnegie-Mellon University
COPYRIGHT
Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.