NAME
discriminate.pl
SYNOPSIS
Discriminates among the given text instances based on their contextual similarities.
USAGE
discriminate.pl [OPTIONS] TEST
INPUT
Required Arguments:
TEST
Senseval-2 formatted TEST instance file that contains the instances to be clustered.
Optional Arguments:
DATA OPTIONS :
--training TRAIN
Training file in plain text format that can be used to select features. If this is not specified, features are selected from the given TEST file.
--split N
Splits the given TEST file into two portions: N% to be used as the TRAIN data and (100-N)% as the TEST data. N is a percentage and should be an integer between 1 and 99 (inclusive). Instances from the original TEST file are assigned to the two portions at random, not in any particular order, while maintaining the ratio N/(100-N).
Note: This option cannot be used when --training option is also used.
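As a rough illustration of the random split described for --split (not the script's own code), the following Python sketch shuffles the instances and cuts the list at N%. The function name, the rounding of the cut point, and the optional seed are assumptions made for this example.

```python
import random

def split_instances(instances, n_percent, seed=None):
    """Randomly split instances into (train, test), about n_percent% in train."""
    if not 1 <= n_percent <= 99:
        raise ValueError("N must be an integer between 1 and 99")
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the cut point is rounded to the nearest whole instance.
    cut = round(len(shuffled) * n_percent / 100)
    return shuffled[:cut], shuffled[cut:]

all_instances = [f"inst{i}" for i in range(10)]
train, test = split_instances(all_instances, 70, seed=1)
print(len(train), len(test))
```

Every instance ends up in exactly one of the two portions, so the original data is fully covered.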
--token TOKEN
A file containing Perl regex/s that define the tokenization scheme in TRAIN and TEST files. If --token is not specified, the default token regex file, token.regex, is looked for in the current directory.
--target TARGET
A file containing Perl regex/s for identifying the target word. A sample target.regex file containing the regex:
/<head>\w+<\/head>/
is provided with this distribution. If --target is not specified, the default target regex file, target.regex, is looked for in the current directory. If this file doesn't exist, target.regex is created automatically by finding all instances of <head> tags in the TEST data. If there are no <head> tags in TEST, the given data is assumed to be global and the target word is not searched for in either TRAIN or TEST.
Note: --target cannot be specified with headless input data
i.e. test file without head/target word(s).
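To show what the sample target pattern matches, here is a small Python sketch. It uses Python's re module rather than Perl, and writes the pattern without Perl's / delimiters; for this simple pattern the two regex dialects are assumed to behave the same. The helper name has_target is hypothetical.

```python
import re

# The sample target pattern, without the surrounding Perl delimiters.
TARGET = re.compile(r"<head>\w+</head>")

def has_target(context):
    """Return True if the context contains a marked target word."""
    return TARGET.search(context) is not None

print(has_target("he sat on the <head>bank</head> of the river"))
print(has_target("a context with no marked target word"))
```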
--prefix PRE
Specifies a prefix to be used in all output file names, e.g. the context vector file will be named 'PRE.vectors', the features file 'PRE.features', and so on. By default, a random prefix is created using the time stamp.
--format f16.XX
The default format for floating point numbers is f16.06. This means that there is room for 6 digits to the right of the decimal, and 9 to the left. You may change XX to any value between 0 and 15, however, the format must remain 16 spaces long due to formatting requirements of SVDPACKC.
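The fixed-width layout can be pictured with the Python sketch below, which approximates the f16.XX notation with Python's own format specification. The function name and argument names are assumptions for illustration only.

```python
def format_value(x, decimals=6, width=16):
    """Render x right-aligned in a fixed-width field, like the f16.06 default."""
    if not 0 <= decimals <= 15:
        raise ValueError("XX must be between 0 and 15")
    return f"{x:{width}.{decimals}f}"

s = format_value(3.14159)
print(repr(s), len(s))
```

Whatever value is chosen for the number of decimals, the field stays 16 characters wide; only the split between integer and fractional digits changes.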
--wordclust
Discriminates and clusters each word based upon its direct and indirect co-occurrence with other words (when used without the --lsa switch) or clusters words or features based upon their occurrences in different contexts (when used with the --lsa switch).
Note: 1. Separate (--training) TRAIN data should not be used with word
clustering.
2. Starting with Version 0.93, word clustering is no longer
restricted to using only headless data. However, options
specific to headed data such as --scope_test and target
co-occurrence features (see below) cannot be used.
--lsa
Uses Latent Semantic Analysis (LSA) style representation for clustering features or contexts. LSA representation is the transpose of the context-by-feature matrix created using the native SenseClusters order1 context representation.
This option can be used only in the following two combinations of the --context and the --wordclust options:
1. --context o1 --wordclust --lsa
Performs feature clustering by grouping together features based on the contexts that they occur in. Features can be unigrams, bigrams or co-occurrences. Feature vectors are the rows of the transposed context-by-feature representation created by order1vec.pl.
2. --context o2 --lsa
Performs context clustering by creating context vectors by averaging the feature vectors from the transposed context-by-feature representation of order1vec.pl.
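The two LSA combinations can be sketched on a toy matrix as follows. This is only an illustration in Python with made-up data; the variable and function names are hypothetical and the real matrices are produced by order1vec.pl.

```python
# Toy order1 context-by-feature matrix: rows are contexts, columns are features.
features = ["river", "money", "loan"]
context_by_feature = [
    [1, 0, 0],  # context 0 contains "river"
    [0, 1, 1],  # context 1 contains "money" and "loan"
    [0, 1, 0],  # context 2 contains "money"
]

# LSA-style representation: transpose, so each row is a feature vector
# whose dimensions are the contexts the feature occurs in.
feature_by_context = [list(col) for col in zip(*context_by_feature)]
feature_vector = dict(zip(features, feature_by_context))

def context_vector(words):
    """Average the feature vectors of the features found in a context."""
    rows = [feature_vector[w] for w in words if w in feature_vector]
    if not rows:
        return [0.0] * len(context_by_feature)
    return [sum(col) / len(rows) for col in zip(*rows)]

print(feature_vector["money"])
print(context_vector(["money", "loan"]))
```

Clustering the rows of feature_by_context corresponds to combination 1 (feature clustering); clustering the averaged vectors corresponds to combination 2 (context clustering).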
FEATURE OPTIONS :
--feature TYPE
Specify the feature type to be used for representing contexts. Possible options for feature type with first order context representation:
bi - bigrams [default]
tco - target co-occurrences
co - co-occurrences
uni - unigrams
Possible options for feature type with second order context representation:
bi - bigrams [default]
co - co-occurrences
tco - target co-occurrences
Note: The tco feature type (target co-occurrences) cannot be used with
headless data i.e. test/train file without head/target word(s).
--scope_train S1
Limits the scope of the training contexts to S1 words around (on both sides of) the TARGET word. Thus, it allows selection of local features. If --scope_train is used, each training instance is expected to include the target word as specified by the --target option or default target.regex.
Note: --scope_train cannot be used with headless data i.e. train files
without head/target word(s).
--scope_test S2
Limits the scope of the test contexts to S2 words around (on both sides of) the TARGET word. Thus, it allows local features to be matched and used in the context vectors.
Note: --scope_test cannot be used with headless data i.e. test files
without head/target word(s).
--stop STOPFILE
A file of Perl regexes that define the stop list of words to be excluded from the features.
STOPFILE can be specified in one of two modes -
AND mode - declared by including '@stop.mode=AND' on the first line of the STOPFILE. - ignores word pairs in which both words are stop words.
OR mode - declared by including '@stop.mode=OR' on the first line of the STOPFILE. - ignores word pairs in which either word is a stop word.
Both modes exclude stop words from unigram features.
Default is OR mode.
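The difference between the two stop modes for word pairs can be sketched in Python as below. The sketch assumes each stop regex must match a whole word and uses Python's re module instead of Perl; the function names are hypothetical.

```python
import re

def is_stop(word, stop_patterns):
    """True if any stop pattern matches the whole word (an assumption here)."""
    return any(p.fullmatch(word) for p in stop_patterns)

def keep_pair(w1, w2, stop_patterns, mode="OR"):
    """Decide whether a word pair survives the stop list in the given mode."""
    s1, s2 = is_stop(w1, stop_patterns), is_stop(w2, stop_patterns)
    if mode == "AND":
        return not (s1 and s2)  # dropped only if both words are stop words
    return not (s1 or s2)       # OR: dropped if either word is a stop word

stops = [re.compile(r"the"), re.compile(r"of")]
print(keep_pair("the", "bank", stops, mode="AND"))
print(keep_pair("the", "bank", stops, mode="OR"))
```

AND mode is the more permissive of the two: a pair such as "the bank" is kept in AND mode but ignored in OR mode.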
--remove F
Removes features that occur fewer than F times in the training corpus.
--window W
Specifies the window size for bigram/co-occurrence features. Pairs of words that co-occur within the specified window from each other (window W allows at most W-2 intervening words) will form the bigram/co-occurrence features.
Default window size is 2 which allows only consecutive word pairs.
Not applicable to unigram features.
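The effect of the window size can be seen in the following Python sketch, which lists the word pairs that fall within a window of W tokens. It keeps word order, as for bigrams; the function name is an assumption and the actual feature extraction is not done this way in the toolkit.

```python
def window_pairs(tokens, window=2):
    """Ordered word pairs with at most window-2 intervening words."""
    if window < 2:
        raise ValueError("window must be at least 2")
    pairs = []
    for i, left in enumerate(tokens):
        # Words at positions i+1 .. i+window-1 lie inside the window.
        for right in tokens[i + 1 : i + window]:
            pairs.append((left, right))
    return pairs

tokens = ["a", "b", "c", "d"]
print(window_pairs(tokens, 2))
print(window_pairs(tokens, 3))
```

With the default window of 2 only adjacent words are paired; a window of 3 also pairs words separated by one intervening word.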
--stat STAT
Bigrams and co-occurrences can be selected based on their statistical scores of association as specified by this option. If --context is o2 and --stat is used, the word association matrix will use the scores computed by the specified statistical test instead of simple joint frequency counts of the word pairs.
Available tests of association are :
dice - Dice Coefficient
ll - Log Likelihood Ratio
odds - Odds Ratio
phi - Phi Coefficient
pmi - Point-Wise Mutual Information
tmi - True Mutual Information
x2 - Chi-Squared Test
tscore - T-Score
leftFisher - Left Fisher's Test
rightFisher - Right Fisher's Test
By default, features are selected and represented using their frequency counts.
--stat_rank N
Word pairs ranking below N when arranged in descending order of their test scores are ignored.
--stat_rank has no effect unless --stat is specified.
--stat_score S
Selects word pairs with scores greater than S after performing the selected test of association. S can be any real number that gives a reasonable number of features for the requested test.
--stat_score has no effect unless --stat is specified.
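A minimal Python sketch of how the two selection options narrow down a scored list of word pairs is given below. The scores are made up, the function name is hypothetical, and tied scores are simply ranked by position, which is an assumption rather than the documented behaviour.

```python
def select_pairs(scores, rank=None, score=None):
    """Filter a dict of {word_pair: association_score}.

    score: keep only pairs scoring strictly greater than this value.
    rank:  keep only the top `rank` pairs in descending score order.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if score is not None:
        ranked = [(pair, s) for pair, s in ranked if s > score]
    if rank is not None:
        ranked = ranked[:rank]  # assumption: ties are broken by position
    return [pair for pair, _ in ranked]

scores = {("new", "york"): 42.0, ("of", "the"): 3.5, ("stock", "market"): 17.2}
print(select_pairs(scores, rank=2))
print(select_pairs(scores, score=10.0))
```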
VECTOR OPTIONS :
--context ORD
Specifies the context representation to be used. Set ORD to 'o1' to use 1st order context vectors, and to 'o2' to select 2nd order context vectors. Default context representation is o2.
--binary
Creates binary feature and context vectors. By default, feature vectors show the joint frequency scores of the associated word pairs while the context vectors show the average of the feature vectors of words that occur in the context. With --binary turned ON, feature vectors show mere presence or absence of the particular word pair (co-occurrence/bigram) in TRAIN, while the context vectors will represent a binary 'OR' operation on the corresponding vectors of contextual features.
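The contrast between the default (averaged) and --binary (OR-ed) context vectors can be sketched in Python on two toy feature vectors. The data and function names are invented for this example.

```python
def average_vectors(vectors):
    """Default behaviour: element-wise average of the feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def or_vectors(vectors):
    """--binary behaviour: element-wise OR of binary feature vectors."""
    return [int(any(col)) for col in zip(*vectors)]

# Feature vectors of two words found in one context (joint frequencies).
freq = [[2, 0, 1], [0, 0, 3]]
# Binary versions record only presence or absence of each word pair.
binary = [[int(v > 0) for v in row] for row in freq]

print(average_vectors(freq))
print(or_vectors(binary))
```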
SVD OPTIONS :
--svd
Reduces the feature space dimensions by performing Singular Value Decomposition (SVD). By default, all feature dimensions are retained.
--k K
Reduces the dimensions of the feature space to K. Default K = 300
--rf RF
Specifies the scaling factor for reducing feature space dimensions such that feature space with N dimensions is reduced down to N/RF. Default RF = 4. RF should be an integer greater than 1.
If both --k and --rf are specified, dimensions are reduced to min(k,N/RF).
Note: If the reduced dimensions ( min(k,N/RF) ) turn out to be less than
or equal to 10 then svd is not performed.
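The rule for combining --k and --rf can be written as a short Python sketch. It covers only the case where both values are in play, assumes that N/RF is an integer division, and returns None to signal that SVD would be skipped; the function name is hypothetical.

```python
def reduced_dimensions(n, k=300, rf=4):
    """Target dimensionality min(k, N/RF), or None if SVD would be skipped."""
    if rf <= 1:
        raise ValueError("RF should be an integer greater than 1")
    target = min(k, n // rf)  # assumption: N/RF is rounded down
    # SVD is not performed when the result is 10 or fewer dimensions.
    return target if target > 10 else None

print(reduced_dimensions(2000))
print(reduced_dimensions(400))
print(reduced_dimensions(40))
```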
--iter I
Specifies the number of iterations of SVD. Recommended value is 3 times the desired K.
CLUSTER-STOPPING OPTIONS:
--cluststop CS
Specifies the cluster stopping measure to be used to predict the number of clusters.
The possible option values:
pk1 - Use PK1 measure [ PK1[m] = (crfun[m] - mean(crfun[1...deltaM]))/std(crfun[1...deltaM]) ]
pk2 - Use PK2 measure [ PK2[m] = (crfun[m]/crfun[m-1]) ]
pk3 - Use PK3 measure [ PK3[m] = ((2 * crfun[m])/(crfun[m-1] + crfun[m+1])) ]
gap - Use Adapted Gap Statistic.
pk - Use all the PK measures.
all - Use all the four cluster stopping measures.
More about these measures can be found in the documentation of Toolkit/clusterstop/clusterstopping.pl
NOTE: Options --cluststop and --clusters (described under Clustering options) cannot be used together.
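The three PK formulas can be evaluated on a toy series of criterion function values as in the Python sketch below. The crfun values are invented, deltaM is passed in as a given, and the use of the sample standard deviation for PK1 is an assumption; how each measure then picks k is described in the clusterstopping.pl documentation and is not reproduced here.

```python
from statistics import mean, stdev

def pk1(crfun, m, delta_m):
    """PK1[m] = (crfun[m] - mean(crfun[1..deltaM])) / std(crfun[1..deltaM])."""
    values = [crfun[i] for i in range(1, delta_m + 1)]
    return (crfun[m] - mean(values)) / stdev(values)  # assumption: sample std

def pk2(crfun, m):
    """PK2[m] = crfun[m] / crfun[m-1]."""
    return crfun[m] / crfun[m - 1]

def pk3(crfun, m):
    """PK3[m] = 2 * crfun[m] / (crfun[m-1] + crfun[m+1])."""
    return 2 * crfun[m] / (crfun[m - 1] + crfun[m + 1])

# Toy criterion function values indexed by the number of clusters m.
crfun = {1: 10.0, 2: 16.0, 3: 19.0, 4: 20.0}
print(pk1(crfun, 2, delta_m=4))
print(pk2(crfun, 2))
print(pk3(crfun, 3))
```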
--delta INT
NOTE: The delta value must be a non-negative integer.
Specify 0 to stop the iterative clustering process when two consecutive crfun values are exactly equal. This is the default setting when the crfun values are integer/whole numbers.
Specify a non-zero positive integer to stop the iterative clustering process when the difference between two consecutive crfun values is less than or equal to this value. Note, however, that the integer specified is internally shifted to capture the difference in the least significant digit of the crfun values when these values are fractional. For example:
crfun = 1.23e-02 and delta = 1: delta is transformed to 0.0001
crfun = 2.45e-01 and delta = 5: delta is transformed to 0.005
The default delta value when the crfun values are fractional is 1.
However, if the crfun values are integer/whole numbers (exponent >= 2), the specified delta value is internally shifted only as far as the least significant digit in the scientific notation. For example:
crfun = 1.23e+04 and delta = 2: delta is transformed to 200
crfun = 2.45e+02 and delta = 5: delta is transformed to 5
crfun = 1.44e+03 and delta = 1: delta is transformed to 10
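All of the delta examples given for this option are consistent with scaling delta by 10^(exponent - 2), that is, to the last digit of a three-significant-digit mantissa. The Python sketch below reproduces them under that assumption; the function name and the mantissa_decimals parameter are hypothetical, and the script's internal procedure may differ.

```python
from decimal import Decimal

def shifted_delta(crfun, delta, mantissa_decimals=2):
    """Scale an integer delta to the least significant digit of crfun.

    crfun is a string in scientific notation such as "1.23e-02".
    """
    if delta < 0:
        raise ValueError("delta must be a non-negative integer")
    exponent = Decimal(crfun).adjusted()  # power of ten of the leading digit
    return Decimal(delta).scaleb(exponent - mantissa_decimals)

examples = [("1.23e-02", 1), ("2.45e-01", 5), ("1.23e+04", 2),
            ("2.45e+02", 5), ("1.44e+03", 1)]
for crfun, delta in examples:
    print(crfun, delta, format(shifted_delta(crfun, delta), "f"))
```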
--threspk1 NUM
Specifies the threshold value that should be used by the PK1 measure to predict the k value. Default = -0.7
NOTE: This option should be used only when --cluststop option is also used with option value of "all" or "pk1".
CLUSTER-STOPPING: ADAPTED GAP STATISTIC OPTIONS:
--B NUM
The number of replicates/references to be generated. Default: 1
--typeref TYP
Specifies whether to generate B replicates from a reference or to generate B references.
The possible option values:
rep - replicates [Default]
ref - references
--percentage NUM
Specifies the percentage confidence to be reported in the log file. Since Gap Statistic uses parametric bootstrap method for reference distribution generation, it is critical to understand the interval around the sample mean that could contain the population ("true") mean and with what certainty. Default: 90
--seed NUM
The seed to be used with the random number generator. Default: No seed is set.
CLUSTERING OPTIONS :
--clusters N
Specifies number of clusters to be created. Default is set to 2.
--space SPACE
Specifies whether clustering is to be performed in vector or similarity space. Set the value of SPACE to 'vector' to perform clustering in vector space i.e. to cluster the context vectors directly. To cluster in similarity space by explicitly finding the pair-wise similarities among the contexts, set SPACE to 'similarity'.
By default, clustering is performed in vector space.
--clmethod CL
Specifies the clustering method.
Possible option values are :
rb - Repeated Bisections [Default]
rbr - Repeated Bisections followed by k-way refinement
direct - Direct k-way clustering
agglo - Agglomerative clustering
graph - Graph partitioning-based clustering
bagglo - Partitional biased Agglomerative clustering
For large amounts of data, 'rb', 'rbr' or 'direct' are recommended.
--crfun CR
Selects the criteria function for Clustering. The meanings of these criteria functions are explained in Cluto's manual.
The possible values are:
i1 - I1 Criterion function
i2 - I2 Criterion function [default for partitional]
e1 - E1 Criterion function
g1 - G1 Criterion function
g1p - G1' Criterion function
h1 - H1 Criterion function
h2 - H2 Criterion function
slink - Single link merging scheme
wslink - Single link merging scheme weighted w.r.t. cluster sim
clink - Complete link merging scheme
wclink - Complete link merging scheme weighted w.r.t. cluster sim
upgma - Group average merging scheme [default for agglomerative]
Note that for cluster stopping, only the i1, i2, e1, h1 and h2 criterion functions can be used. If any other crfun is selected, cluster stopping uses the default crfun (i2) while the final clustering of contexts is performed using the crfun specified.
--sim SIM
Specifies the similarity measure to be used for either vector or similarity space clustering.
When --space = vector (or default), possible values of SIM are :
cos - Cosine [default]
corr - Correlation Coefficient
dist - Euclidean distance
jacc - Extended Jaccard Coefficient
When --space = similarity and --binary is ON, possible values of SIM are -
cos - Cosine [default]
mat - Match
jac - Jaccard
ovr - Overlap
dic - Dice
Otherwise, only cosine measure is available and is default.
The following table summarizes the availability of similarity measures for the 2 clustering approaches, vector (vcl) and similarity (scl), and for the 2 types of context vectors, binary (bin) vs. frequency (freq):
        vcl+bin   vcl+freq   scl+bin   scl+freq
cos        Y         Y          Y         Y
mat        N         N          Y         N
jacc       Y         Y          Y         N
dice       N         N          Y         N
ovr        N         N          Y         N
dist       Y         Y          N         N
corr       Y         Y          N         N
These differences are purely due to implementation issues; in future, we plan to support a more consistent set of measures across these combinations.
--rowmodel RMOD
Specifies the model to be used to scale every column of each row. (For further details please refer to the Cluto manual.)
The possible values for RMOD are:
none - no scaling is performed [default]
maxtf - values are scaled to lie between 0.5 and 1.0
sqrt - square root of the actual values
log - log of the actual values
--colmodel CMOD
Specifies the model to be used to (globally) scale each column across all rows. (For further details please refer to the Cluto manual.)
The possible values for CMOD are:
none - no scaling is performed [default]
idf - scaling according to inverse document frequency
LABELING OPTIONS :
Note: Labeling options cannot be used with word-clustering (--wordclust).
--label_stop LABEL_STOPFILE
A file of Perl regexes that define the stop list of words to be excluded from the features.
LABEL_STOPFILE can be specified in one of two modes -
AND mode - declared by including '@stop.mode=AND' on the first line of the LABEL_STOPFILE - ignores word pairs in which both words are stop words.
OR mode - declared by including '@stop.mode=OR' on the first line of the LABEL_STOPFILE - ignores word pairs in which either word is a stop word.
Default is OR.
--label_remove LABEL_N
Removes bigrams that occur fewer than LABEL_N times.
--label_window LABEL_W
Specifies the window size for bigrams. Pairs of words that co-occur within the specified window from each other (window LABEL_W allows at most LABEL_W-2 intervening words) will form the bigram features. Default window size is 2 which allows only consecutive word pairs.
--label_stat LABEL_STAT
Specifies the statistical scores of association.
Available tests of association are :
dice - Dice Coefficient
ll - Log Likelihood Ratio
odds - Odds Ratio
phi - Phi Coefficient
pmi - Point-Wise Mutual Information
tmi - True Mutual Information
x2 - Chi-Squared Test
tscore - T-Score
leftFisher - Left Fisher's Test
rightFisher - Right Fisher's Test
--label_rank LABEL_R
Word pairs ranking below LABEL_R when arranged in descending order of their test scores are ignored.
OTHER OPTIONS :
--eval
Evaluates clustering performance by computing precision and recall for maximally accurate assignment of sense tags to clusters. Maximal Assignment is when clusters are given sense labels such that maximum number of instances will be attached with their true sense tags.
TEST instances tagged with multiple senses are automatically attached with the single sense-tag that is the most frequent among the attached tags.
Note: This option can be used only if the answer tags are provided in the TEST file.
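The idea of a maximal assignment can be sketched in Python on a small cluster-by-sense confusion table. Brute-force search over permutations is used here in place of a proper assignment solver, which is only practical for a handful of clusters; the table values and function name are invented, and the sketch assumes there are no more clusters than senses.

```python
from itertools import permutations

def maximal_assignment(confusion):
    """Find the cluster-to-sense labeling that tags the most instances correctly.

    confusion[c][s] is the number of instances in cluster c whose true sense is s.
    Returns (mapping, correct) where mapping[c] is the sense given to cluster c.
    """
    n_clusters, n_senses = len(confusion), len(confusion[0])
    best_map, best_score = None, -1
    # Try every way of giving each cluster a distinct sense label.
    for perm in permutations(range(n_senses), n_clusters):
        score = sum(confusion[c][s] for c, s in enumerate(perm))
        if score > best_score:
            best_map, best_score = perm, score
    return best_map, best_score

confusion = [[8, 2],
             [1, 9]]
mapping, correct = maximal_assignment(confusion)
print(mapping, correct)
```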
--rank_filter R
Removes low-frequency senses during evaluation. Senses that rank below R when the senses in TEST are arranged in descending order of their frequencies are removed; in other words, the R most frequent senses are kept. An instance is removed if all of its sense tags rank below R.
--percent_filter P
Removes low-frequency senses based on their percentage frequencies. Senses whose frequency is below P% in the TEST data are removed.
If rank or percent filters are specified, they are applied after removing the multiple sense tags.
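The two sense filters can be sketched in Python as follows, working on a list with one sense tag per instance (that is, after multiple tags have been reduced to one). The function names and the toy tags are assumptions, and ties in frequency are broken arbitrarily in this sketch.

```python
from collections import Counter

def rank_filter(tags, r):
    """Senses kept by a rank filter: the r most frequent ones."""
    return {sense for sense, _ in Counter(tags).most_common(r)}

def percent_filter(tags, p):
    """Senses kept by a percent filter: those covering at least p% of the tags."""
    counts = Counter(tags)
    total = len(tags)
    return {sense for sense, c in counts.items() if 100 * c / total >= p}

tags = ["a"] * 6 + ["b"] * 3 + ["c"]
print(sorted(rank_filter(tags, 2)))
print(sorted(percent_filter(tags, 20)))
```

In this example sense "c" accounts for 10% of the tags, so it is dropped both by a rank filter of 2 and by a percent filter of 20.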
--help
Displays the quick summary of program options.
--version
Displays the version information.
--verbose
Displays to STDERR the current program status.
--showargs
Displays to STDOUT values of compulsory and required parameters. [NOT SUPPORTED IN THIS VERSION]
OUTPUT
discriminate.pl creates several output files. The discrimination of contexts performed by discriminate.pl (i.e., the cluster assigned to each context) is given by the file $PREFIX.clusters if the number of clusters was set manually, otherwise by the file $PREFIX.clusters.$CLUSTSTOP, where $CLUSTSTOP specifies the cluster stopping measure that was used to predict the number of clusters.
In addition, discriminate.pl also creates following files:
NOTE: If a cluster stopping measure was used then it is indicated in the names of several output files by appending the cluster stopping measure name with the file name. Represented below as filename[.$CLUSTSTOP]
$PREFIX.clusters_context[.$CLUSTSTOP] - File containing all the input instances grouped by the cluster-id assigned to them.
$PREFIX[.$CLUSTSTOP].cluster.CLUSTERID - All the identified clusters and their instances are separated into different files. The filenames end with the cluster-id. e.g.: File containing instances of cluster 0 will be named as $PREFIX.cluster.0
$PREFIX.report[.$CLUSTSTOP] - Confusion table if --eval is ON
$PREFIX.cluster_labels[.$CLUSTSTOP] - List of labels (word-pairs) assigned to each cluster.
$PREFIX[.$CLUSTSTOP].dendogram.ps - Dendrograms plus some additional information.
$PREFIX.features - Features file
$PREFIX.regex - File containing regular expressions for identifying the features listed in $PREFIX.features file.
$PREFIX.testregex - File containing only those regular expressions from the $PREFIX.regex file above which match at least once in the test contexts. Created only in second order context clustering mode (SC native as well as LSA) and in LSA feature clustering mode.
$PREFIX.wordvec - Word Vectors if --context = o2
$PREFIX.vectors - Context Vectors
$PREFIX.rlabel - Row Labels of $PREFIX.vectors
$PREFIX.clabel - Column Labels of $PREFIX.vectors
$PREFIX.rclass - Class Ids of $PREFIX.vectors if --eval is ON
$PREFIX.cluster_solution[.$CLUSTSTOP] - Cluster ids of $PREFIX.vectors
$PREFIX.cluster_output[.$CLUSTSTOP] - Clustering program output
Cluster Stopping related output files:
$PREFIX.pk1 - crfun[k] values, delta values, PK1[k] values and predicted k value
$PREFIX.pk2 - crfun[k] values, delta values, PK2[k] values and predicted k value
$PREFIX.pk3 - crfun[k] values, delta values, PK3[k] values and predicted k value
$PREFIX.gap - crfun[k] values, delta values and predicted k value
$PREFIX.gap.log - Gap(k), Obs(crfun(k)), Exp(crfun(k)) values etc.
The following files are created to facilitate creation of plots, if needed:
$PREFIX.cr.dat - value-pairs :- k-value crfun-value
$PREFIX.pk1.dat - value-pairs :- k-value PK1[k] value
$PREFIX.pk2.dat - value-pairs :- k-value PK2[k] value
$PREFIX.pk3.dat - value-pairs :- k-value PK3[k] value
$PREFIX.gap.dat - value-pairs :- k-value Gap[k] value
$PREFIX.exp.dat - value-pairs :- k-value Exp(crfun[k]) value
SYSTEM REQUIREMENTS
Perl (version 5.8.5 or better) - http://www.perl.org
Text::NSP - http://search.cpan.org/dist/Text-NSP
Perl Data Language (PDL) - http://search.cpan.org/dist/PDL/
Bit::Vector - http://search.cpan.org/dist/Bit-Vector/ (when --binary is turned ON)
Math::SparseVector - http://search.cpan.org/dist/Math-SparseVector/
Math::SparseMatrix - http://search.cpan.org/dist/Math-SparseMatrix/
Math::BigInt - http://search.cpan.org/dist/Math-BigInt/
Set::Scalar - http://search.cpan.org/dist/Set-Scalar/
Algorithm::Munkres - http://search.cpan.org/dist/Algorithm-Munkres/
SVDPACK - http://www.netlib.org/svdpack/
Clustering Toolkit - Cluto http://www-users.cs.umn.edu/~karypis/cluto
AUTHOR
Ted Pedersen, University of Minnesota, Duluth
Amruta Purandare, University of Pittsburgh
Anagha Kulkarni, University of Minnesota, Duluth
Mahesh Joshi, University of Minnesota, Duluth
COPYRIGHT
Copyright (c) 2002-2006,
Ted Pedersen, University of Minnesota, Duluth
tpederse@d.umn.edu
Amruta Purandare, University of Pittsburgh
amruta@cs.pitt.edu
Anagha Kulkarni, University of Minnesota, Duluth
kulka020@d.umn.edu
Mahesh Joshi, University of Minnesota, Duluth
joshi031@d.umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.