NAME
README.Toolkit - Description of SenseClusters Toolkit directory structure
Toolkit Organization
This document describes the structure of the Toolkit directory and summarizes what each program does. Directory names end with a / (for example, preprocess/), while program names end with the .pl suffix. Everything described here is contained in the Toolkit/ directory. The entries are listed roughly in the order in which SenseClusters uses them.
Please review the flowcharts found in doc/Flowcharts for additional information.
preprocess/ (text preprocessing programs)
plain/ (processes input in plain text format)
text2sval.pl - Convert simple plain text into Senseval-2 format
sval2/ (processes input in Senseval-2 format)
balance.pl - Balances sense distribution in a Senseval-2 input file by removing some instances
filter.pl - Removes instances associated with low frequency sense tags from Senseval-2 input
frequency.pl - Displays frequency distribution of senses
keyconvert.pl - Convert KEY file from Senseval-2 format to SenseClusters format
maketarget.pl - Create a Perl regex for the target word by spotting all <head> tags in the given file
prepare_sval2.pl - Prepare Senseval-2 data for experiments
preprocess.pl - Tokenize and optionally split Senseval-2 input into training and test portions
sval2plain.pl - Convert a Senseval-2 input file to plain text format
windower.pl - Cut a context window of W words around a target word in a given Senseval-2 input file
count/ (Modify count.pl output from Text-NSP)
reduce-count.pl - Reduce the size of Text-NSP output created from very large training data
matrix/ (Similarity matrix constructors)
bitsimat.pl - Create a similarity matrix for given bit vectors
simat.pl - Create a similarity matrix for given non-binary (integer or real) vectors
vector/ (Represent contexts as vectors to be clustered)
nsp2regex.pl - Creates regular expressions from Text-NSP output to represent features
order1vec.pl - Creates first order context vectors
order2vec.pl - Creates second order context vectors
wordvec.pl - Creates word vectors from Text-NSP output
svd/ (SVDPACKC interface)
mat2harbo.pl - Convert matrices from SenseClusters format to Harwell-Boeing format
svdpackout.pl - Reconstruct a matrix from its singular vectors as found by SVDPACKC
clusterstopping/ (Cluster Stopping program)
clusterstopping.pl - Predicts the number of clusters that a given data set should be divided into. Provides three cluster stopping measures for this purpose.
evaluate/ (Evaluate the results of SenseClusters by comparing to gold standard data)
cluto2label.pl - Convert clustering output of Cluto to a cluster-by-sense confusion matrix for evaluation
format_clusters.pl - Display the clustered contexts with their assigned sense id, or display Senseval-2 format with the assigned sense id
label.pl - Assign sense tags to the discovered clusters for evaluation
report.pl - Report performance in terms of precision, recall, and F-Measure, and show a confusion matrix
clusterlabel/ (Cluster Labeling programs)
clusterlabeling.pl - Selects significant word pairs from the contents/instances of the clusters and assigns them as labels to the clusters. Also creates a separate file for each cluster.
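To give a rough feel for what a similarity matrix constructor such as simat.pl (listed under matrix/ above) produces, the following is a minimal sketch in Python rather than the toolkit's own Perl. It assumes cosine similarity as the measure, which may not match the program's actual options, and the function name cosine_similarity_matrix is hypothetical and not part of SenseClusters.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return an n x n matrix of pairwise cosine similarities.

    `vectors` is a list of equal-length lists of numbers, one per context.
    Assumption: cosine is the similarity measure; this is an illustration,
    not the implementation used by simat.pl.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                # An all-zero vector has no direction; treat similarity as 0.
                value = 0.0
            else:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                value = dot / (norms[i] * norms[j])
            # The matrix is symmetric, so fill both halves at once.
            sim[i][j] = sim[j][i] = value
    return sim


# Three toy context vectors: the first two point in the same direction,
# the third shares no non-zero dimension with either of them.
contexts = [[1, 0, 2], [2, 0, 4], [0, 3, 0]]
matrix = cosine_similarity_matrix(contexts)
```

In this toy input the first two contexts are nearly identical in direction and the third is orthogonal to both, which is the kind of structure a clustering step would then exploit.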
Acknowledgements
This work has been partially supported by a National Science Foundation Faculty Early CAREER Development award (#0092784).
COPYRIGHT
Copyright 2003-2008, Ted Pedersen
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.