NAME

README.Toolkit - Description of the SenseClusters Toolkit directory structure

Toolkit Organization

This document describes the structure of the Toolkit directory and gives a brief idea of what each program does. Directory names end with a / (for example, preprocess/), while program names end with the .pl suffix. Everything listed here is contained in the Toolkit/ directory. The entries are organized roughly in the order in which SenseClusters uses them.

Please review the flowcharts found in doc/Flowcharts for additional information.

preprocess/ (text preprocessing programs)

  • plain/ (processes input in plain text format)

    • text2sval.pl - Convert simple plain text into Senseval-2 format

  • sval2/ (processes input in Senseval-2 format)

    • balance.pl - Balances sense distribution in a Senseval-2 input file by removing some instances

    • filter.pl - Removes instances associated with low frequency sense tags from Senseval-2 input

    • frequency.pl - Displays frequency distribution of senses

    • keyconvert.pl - Convert KEY file from Senseval-2 format to SenseClusters format

    • maketarget.pl - Create a Perl regex for the target word by spotting all <head> tags in the given file

    • prepare_sval2.pl - Prepare Senseval-2 data for experiments

    • preprocess.pl - Tokenize and optionally split Senseval-2 input into training and test portions

    • sval2plain.pl - Convert a Senseval-2 input file to plain text format

    • windower.pl - Cut a context window of W words around a target word in a given Senseval-2 input file
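
To make the idea of a context window concrete, here is a minimal Python sketch. It is not the windower.pl implementation: it works on a plain list of tokens rather than Senseval-2 input, the function name cut_window is hypothetical, and it assumes that W counts the tokens kept on each side of the target word. Consult the windower.pl documentation for the program's actual behavior.

```python
def cut_window(tokens, target_index, w):
    """Return the target token plus up to w tokens on each side of it.

    Assumption: w is a per-side count; the window is truncated at the
    start or end of the context when fewer than w tokens are available.
    """
    start = max(0, target_index - w)
    end = min(len(tokens), target_index + w + 1)
    return tokens[start:end]


# Illustrative context with "bank" as the target word.
tokens = "he sat on the bank of the river and fished".split()
window = cut_window(tokens, tokens.index("bank"), 2)
print(" ".join(window))
```

Truncating at the context boundaries, rather than padding, keeps the sketch short; a real tool may treat boundaries differently.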

count/ (Modify count.pl output from Text-NSP)

  • reduce-count.pl - Reduce the size of the Text-NSP output created with huge training data

matrix/ (Similarity matrix constructors)

  • bitsimat.pl - Create a similarity matrix for given bit vectors

  • simat.pl - Create a similarity matrix for given non-binary (integer or real) vectors
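
As an illustration of what a similarity matrix over context vectors looks like, the following Python sketch computes pairwise cosine similarity for real-valued vectors. Cosine is used here only as one common choice; the function names are hypothetical, and the measures and file formats that simat.pl and bitsimat.pl actually support are described in their own documentation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an all-zero vector is
        # treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return the square matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Three toy context vectors: the first two point in the same direction,
# the third shares no non-zero dimension with them.
vectors = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
matrix = similarity_matrix(vectors)
for row in matrix:
    print(" ".join(f"{value:.3f}" for value in row))
```

The resulting matrix is symmetric, so a real implementation could compute only one triangle; the sketch computes every cell for clarity.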

vector/ (Represent contexts as vectors to be clustered)

  • nsp2regex.pl - Creates regular expressions from Text-NSP output to represent features

  • order1vec.pl - Creates first order context vectors

  • order2vec.pl - Creates second order context vectors

  • wordvec.pl - Creates word vectors from Text-NSP output

svd/ (SVDPACKC interface)

  • mat2harbo.pl - Convert matrices from SenseClusters format to Harwell-Boeing format

  • svdpackout.pl - Reconstruct a matrix from its singular vectors as found by SVDPACKC

clusterstopping/ (Cluster Stopping program)

  • clusterstopping.pl - Predicts the number of clusters into which a given data set should be divided. Provides three cluster stopping measures.

evaluate/ (Evaluate the results of SenseClusters by comparing to gold standard data)

  • cluto2label.pl - Convert clustering output of Cluto to a cluster by sense confusion matrix for evaluation

  • format_clusters.pl - Display the clustered contexts with their assigned sense IDs, or display Senseval-2 format with the assigned sense IDs

  • label.pl - Assign sense tags to the discovered clusters for evaluation

  • report.pl - Report performance in terms of precision, recall, and F-measure, and show a confusion matrix
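
The three evaluation scores named above can be sketched in Python as follows. This is an illustration under stated assumptions, not the report.pl implementation: it assumes precision is the number of correctly labeled instances divided by the number of instances that were assigned to a cluster, recall is the number of correct instances divided by all instances, and F-measure is their harmonic mean. The function name and the counts in the example are hypothetical.

```python
def evaluate(correct, attempted, total):
    """Return (precision, recall, f_measure) from raw instance counts.

    correct   - instances whose assigned label matches the gold standard
    attempted - instances that were assigned to some cluster
    total     - all instances in the gold standard
    """
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts, chosen only to show the arithmetic.
precision, recall, f_measure = evaluate(correct=60, attempted=80, total=100)
print(f"precision={precision:.4f} recall={recall:.4f} F={f_measure:.4f}")
```

When every instance is clustered, attempted equals total and the three scores coincide.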

clusterlabel/ (Cluster Labeling programs)

  • clusterlabeling.pl - Selects significant word pairs from the contents (instances) of each cluster and assigns them as labels for that cluster. Also creates a separate file for each cluster.

Acknowledgements

This work has been partially supported by a National Science Foundation Faculty Early CAREER Development award (#0092784).

COPYRIGHT

Copyright 2003-2008, Ted Pedersen

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.