NAME
order2vec.pl - Convert Senseval-2 contexts into second order context vectors in Cluto format
SYNOPSIS
order2vec.pl [OPTIONS] SVAL2 WORDVEC FEATURE_REGEX
Type order2vec.pl --help
for a quick summary of options.
DESCRIPTION
Creates second order context vectors by averaging word or feature vectors of the contextual features.
INPUT
Required Arguments:
SVAL2
A tokenized, preprocessed and well formatted Senseval-2 instance file showing instances whose context vectors are to be generated.
order2vec creates a context vector for each instance in the given SVAL2 file by averaging the word or feature vectors of the features that appear in the context.
WORDVEC
Should be one of the following types of files:
A file containing word vectors as created by program wordvec.pl
A file containing feature vectors as created by order1vec.pl, using its --transpose option.
Each line in WORDVEC should show a word or feature vector of the feature represented by the corresponding line in the FEATURE_REGEX file.
order2vec accepts WORDVEC in both sparse and dense formats. If WORDVEC is in dense format, switch --dense should be selected.
FEATURE_REGEX
Should be one of the following types of files:
The output file generated by running nsp2regex.pl on the FEATURE file as generated by program wordvec.pl while creating the WORDVEC file.
The TEST_REGEX file created by order1vec.pl using its --testregex option, while creating the feature-by-context output file using the --transpose option.
Each line in FEATURE_REGEX file should show a regular expression for a feature whose feature vector appears on the corresponding line in the WORDVEC file. FEATURE_REGEX should be formatted like the output of the nsp2regex.pl program.
Sample FEATURE_REGEX files:
A file output by nsp2regex.pl when it is run on the file produced by --feats option of wordvec.pl:
/\s(<[^>]*>)*details(<[^>]*>)*\s/ @name = details
/\s(<[^>]*>)*weather(<[^>]*>)*\s/ @name = weather
/\s(<[^>]*>)*test(<[^>]*>)*\s/ @name = test
/\s(<[^>]*>)*cloth(<[^>]*>)*\s/ @name = cloth
/\s(<[^>]*>)*health(<[^>]*>)*\s/ @name = health
/\s(<[^>]*>)*art(<[^>]*>)*\s/ @name = art
A TEST_REGEX file output by order1vec.pl using its --testregex option:
/\s(<[^>]*>)*polygonal(<[^>]*>)*\s/ @name = polygonal
/\s(<[^>]*>)*ectoderm(<[^>]*>)*\s/ @name = ectoderm
/\s(<[^>]*>)*fluid(<[^>]*>)*\s/ @name = fluid
/\s(<[^>]*>)*CEMx174(<[^>]*>)*\s/ @name = CEMx174
/\s(<[^>]*>)*adjacent(<[^>]*>)*\s/ @name = adjacent
/\s(<[^>]*>)*mutant(<[^>]*>)*\s/ @name = mutant
/\s(<[^>]*>)*progenitor(<[^>]*>)*\s/ @name = progenitor
/\s(<[^>]*>)*Ganglion(<[^>]*>)*\s/ @name = Ganglion
/\s(<[^>]*>)*MLS(<[^>]*>)*\s/ @name = MLS
/\s(<[^>]*>)*male(<[^>]*>)*\s/ @name = male
/\s(<[^>]*>)*mother(<[^>]*>)*\s/ @name = mother
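The line format shown in these samples can be read mechanically. The following Python sketch is illustrative only and is not the code order2vec.pl uses: parse_feature_line and count_matches are hypothetical helper names, and padding the context with one space on each side is an assumption made so that a feature at the very start or end of a context can still satisfy the leading and trailing \s.

```python
import re

def parse_feature_line(line):
    """Split one '/REGEX/ @name = NAME' line into (REGEX, NAME)."""
    m = re.fullmatch(r"/(.*)/\s*@name\s*=\s*(.*)", line.strip())
    if m is None:
        raise ValueError(f"not a FEATURE_REGEX line: {line!r}")
    return m.group(1), m.group(2).strip()

def count_matches(pattern, context):
    """Count non-overlapping matches of a feature regex in a context."""
    # Assumption: pad with spaces so the \s anchors can match at the edges.
    padded = " " + context.strip() + " "
    return sum(1 for _ in re.finditer(pattern, padded))

pattern, name = parse_feature_line(
    r"/\s(<[^>]*>)*weather(<[^>]*>)*\s/ @name = weather"
)
print(name, count_matches(pattern, "the weather today is cold weather"))
```

Because each match consumes the whitespace on both sides, this sketch counts two occurrences separated by a single space only once; whether order2vec.pl behaves the same way is not stated here.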
Optional Arguments:
--binary
Select this switch to create binary context vectors. Binary vectors are computed by taking the binary OR of the word vectors of features that are found in the context. By default, order2vec creates frequency score vectors that show the arithmetic average of the word vectors of contextual features.
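The difference between the two modes can be sketched in Python as follows. This is a minimal illustration under assumptions, not the program's own code: vectors are held as hypothetical {column: value} dictionaries, and binary_or_vector and average_vector are names invented for this example.

```python
def binary_or_vector(vectors):
    """Binary OR: a column is 1 if any input vector is non-zero there."""
    cols = set()
    for vec in vectors:
        cols.update(col for col, value in vec.items() if value != 0)
    return {col: 1 for col in sorted(cols)}

def average_vector(vectors):
    """Arithmetic mean of the input vectors, column by column."""
    total = {}
    for vec in vectors:
        for col, value in vec.items():
            total[col] = total.get(col, 0.0) + value
    return {col: total[col] / len(vectors) for col in sorted(total)}

# Two made-up binary word vectors for features found in one context.
matched = [{1: 1, 3: 1}, {3: 1, 4: 1}]
print(binary_or_vector(matched))
print(average_vector(matched))
```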
--dense
By default, word vectors in WORDVEC are assumed to be in sparse format. Also, the context vectors displayed by order2vec are in sparse format.
Select --dense if the word vectors are in dense format. This will automatically create output vectors in dense format as well.
**************** IMPORTANT NOTE ************
Dense word vectors (when --dense is ON) should be formatted, i.e. each entry
of WORDVEC should be represented using the same numeric format and should
occupy exactly the same number of spaces. Use the --format option to specify
the format of dense word vectors.
--rlabel RLABELFILE
Creates an RLABELFILE containing row labels for Cluto's --rlabelfile option. Each line in RLABELFILE shows the instance id of the instance whose context vector appears on the corresponding line on STDOUT.
Instance ids are extracted from the SVAL2 file by matching regex
/<instance id\s*=\s*"IID"/>/
where 'IID' is an instance id of the <context> that follows this <instance> tag.
--rclass RCLASSFILE
Creates RCLASSFILE for Cluto's --rclassfile option. Each line in the RCLASSFILE shows the true sense id of the instance whose context vector appears on the corresponding line on STDOUT.
Sense ids are extracted from the SVAL2 file by matching regex
/sense\s*id\s*=\s*"SID"\/>/
where SID shows the true sense tag of the instance whose IID was most recently extracted by matching
/<instance id\s*=\s*"IID"/>/
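As a rough illustration of how instance ids and sense ids pair up, the Python sketch below scans a Senseval-2 file line by line. It is not the program's own code: the two patterns are adapted from the regexes above to fit the sample data in this document, extract_labels is a hypothetical name, and the sketch assumes exactly one answer tag per instance.

```python
import re

# Assumed patterns, adapted from the regexes quoted above.
INSTANCE_RE = re.compile(r'<instance id\s*=\s*"([^"]*)"')
SENSE_RE = re.compile(r'sense\s*id\s*=\s*"([^"]*)"\s*/>')

def extract_labels(sval2_text):
    """Return (instance id, sense id) pairs in file order."""
    pairs = []
    current_id = None
    for line in sval2_text.splitlines():
        m = INSTANCE_RE.search(line)
        if m:
            current_id = m.group(1)  # remember the most recent IID
            continue
        m = SENSE_RE.search(line)
        if m and current_id is not None:
            pairs.append((current_id, m.group(1)))
            current_id = None  # assumption: one answer tag per instance
    return pairs

sample = '''<instance id="hard-a.sjm-098_3:">
<answer instance="hard-a.sjm-098_3:" senseid="HARD1"/>
<context>
someone has to kill him
</context>
</instance>'''
print(extract_labels(sample))
```

The first element of each pair is what would go on a line of RLABELFILE and the second on the corresponding line of RCLASSFILE.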
--showkey
Displays the name of the system-generated KEY file on the first line of STDOUT. The KEY file preserves the instance ids and sense tags of the instances in the given SVAL2 file. Some of the clustering and evaluation programs in SenseClusters that operate on purely numeric instance formats use this information automatically. Select this option if you plan to run SenseClusters' clustering code.
Other Options :
--format FORM
If --dense is ON, input WORD VECtors need to be formatted, i.e. every entry should be represented using the same numeric format and occupy the same number of digit spaces. If wordvec.pl was run using its --format option, then the value of --format given to order2vec.pl should be the same as that specified in wordvec.pl's --format option.
Format should be represented as
iN -> integer format where each entry occupies total N bytes/digits
fN.M -> floating point format where each entry occupies total N bytes/digits
of which last M digits show the fractional part
When --binary is ON, the default format is i2, which assumes 2 digit spaces for each entry. When --binary is OFF, the default format is f16.10, which assumes that each entry is a floating point number occupying a total of 16 digit spaces, of which the last 10 digits show the fractional part.
Output context vectors (sparse or dense) will be represented using the specified format value or the default described above.
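One way to read the iN and fN.M notation is as a fixed-width printf-style specification. The Python sketch below makes that reading concrete; the mapping is an assumption for illustration (the text above does not say how order2vec.pl applies the format internally), and printf_spec is a hypothetical helper name.

```python
import re

def printf_spec(form):
    """Translate 'iN' or 'fN.M' into a printf-style conversion (assumed mapping)."""
    m = re.fullmatch(r"i(\d+)", form)
    if m:
        return "%" + m.group(1) + "d"
    m = re.fullmatch(r"f(\d+)\.(\d+)", form)
    if m:
        return "%" + m.group(1) + "." + m.group(2) + "f"
    raise ValueError("unrecognised format: " + form)

# Every entry printed with one spec occupies the same number of characters.
print(repr(printf_spec("f9.4") % 3.8015))
print(repr(printf_spec("i2") % 1))
```

Under this reading, f9.4 yields entries with four fractional digits, which is consistent with the sample outputs in this document.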
--help
Displays this message.
--version
Displays the version information.
OUTPUT
Output shows a single context vector on each line. Context vectors represent instances in the same order as they appear in the given SVAL2 file, i.e. the ith vector on STDOUT is the context vector of the ith instance in the SVAL2 file.
Each context vector is an average of the WORD VECtors of the features that are found in the context using FEATURE_REGEX.
Sample Sparse Output
Input Sval2 file => test.sval2
<corpus lang="english">
<lexelt item="LEXELT">
<instance id="hard-a.sjm-098_3:">
<answer instance="hard-a.sjm-098_3:" senseid="HARD1"/>
<context>
someone has to kill him to defeat him and that s <head>HARD</head> to do
</context>
</instance>
<instance id="hard-a.w8_038:">
<answer instance="hard-a.w8_038:" senseid="HARD3"/>
<context>
I find it <head>HARD</head> to believe that you don't believe me
</context>
</instance>
<instance id="hard-a.sjm-255_13:">
<answer instance="hard-a.sjm-255_13:" senseid="HARD3"/>
<context>
when you get bad credit data or are confused with another person your life gets <head>HARD</head>
</context>
</instance>
<instance id="hard-a.sjm-231_3:">
<answer instance="hard-a.sjm-231_3:" senseid="HARD2"/>
<context>
Our life is <head>HARDER</head> now yes but it is better to live hungry and free life
</context>
</instance>
<instance id="hard-a.sjm-096_2:">
<answer instance="hard-a.sjm-096_2:" senseid="HARD1"/>
<context>
Ray who told his colleagues We have to face the <head>HARD</head> facts of life due to bad credit
</context>
</instance>
</lexelt>
</corpus>
Input FEATURE_REGEX file => test.regex
/\s(<[^>]*>)*<head>HARD<\/head>(<[^>]*>)*\s/ @name = <head>HARD</head>
/\s(<[^>]*>)*to(<[^>]*>)*\s/ @name = to
/\s(<[^>]*>)*defeat(<[^>]*>)*\s/ @name = defeat
/\s(<[^>]*>)*believe(<[^>]*>)*\s/ @name = believe
/\s(<[^>]*>)*credit(<[^>]*>)*\s/ @name = credit
/\s(<[^>]*>)*life(<[^>]*>)*\s/ @name = life
/\s(<[^>]*>)*facts(<[^>]*>)*\s/ @name = facts
/\s(<[^>]*>)*kill(<[^>]*>)*\s/ @name = kill
Input Sparse Word Vectors => test.sparse_wordvec
8 10 41
1 4.977 4 7.813 8 9.114 10 1.431
1 5.944 2 5.728 3 2.978 5 5.604 7 9.444 9 3.680
2 3.984 3 8.306 4 6.632 5 4.514 7 1.785 9 7.609
2 9.147 4 3.086 5 0.325 9 1.456
1 0.741 4 3.450 6 2.363
1 9.549 2 3.921 3 8.131 4 4.301 5 9.059 6 8.607 10 1.138
2 8.203 4 7.297 5 1.095 7 4.362 8 2.963 10 7.264
2 4.296 4 9.802 7 9.268 9 8.856 10 9.723
Command =>
order2vec.pl --format f9.4 test.sval2 test.sparse_wordvec test.regex
Output =>
5 10 45
1 3.8015 2 4.2440 3 2.8733 4 4.0412 5 3.5543 7 6.5642 8 1.5190 9 4.5842 10 1.8590
1 2.7302 2 6.0055 3 0.7445 4 3.4962 5 1.5635 7 2.3610 8 2.2785 9 1.6480 10 0.3578
1 5.0890 2 1.3070 3 2.7103 4 5.1880 5 3.0197 6 3.6567 8 3.0380 10 0.8563
1 8.3473 2 4.5233 3 6.4133 4 2.8673 5 7.9073 6 5.7380 7 3.1480 9 1.2267 10 0.7587
1 4.5258 2 3.9300 3 2.3478 4 3.8102 5 3.5603 6 1.8283 7 3.8750 8 2.0128 9 1.2267 10 1.6388
Explanation =>
First instance <hard-a.sjm-098_3:> contains features
'<head>HARD</head>' once,
'to' thrice,
'defeat' once,
'kill' once.
Hence, the context vector of instance <hard-a.sjm-098_3:> shown on Line 2 on STDOUT is an average of sparse word vectors ->
[1 4.977 4 7.813 8 9.114 10 1.431]
3 * [1 5.944 2 5.728 3 2.978 5 5.604 7 9.444 9 3.680]
[2 3.984 3 8.306 4 6.632 5 4.514 7 1.785 9 7.609]
[2 4.296 4 9.802 7 9.268 9 8.856 10 9.723]
OR
[1 4.977 4 7.813 8 9.114 10 1.431]
[1 17.832 2 17.184 3 8.934 5 16.812 7 28.332 9 11.04]
[2 3.984 3 8.306 4 6.632 5 4.514 7 1.785 9 7.609]
[2 4.296 4 9.802 7 9.268 9 8.856 10 9.723]
The Sum of above vectors is a sparse vector =>
[1 22.809 2 25.464 3 17.24 4 24.247 5 21.326 7 39.385 8 9.114 9 27.505 10 11.154]
And the average is =>
[1 3.8015 2 4.2440 3 2.8733 4 4.0412 5 3.5543 7 6.5642 8 1.5190 9 4.5842 10 1.8590]
Similarly, all context vectors are computed by averaging the word vectors of features that match in the context.
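The arithmetic for the first instance can be reproduced with a short Python sketch. It is only a check of the numbers above under stated assumptions, not the program's implementation: the feature counts are entered by hand rather than found by regex matching, and average_context_vector is a hypothetical name.

```python
# Sparse word vectors from test.sparse_wordvec for the four features that
# occur in the first instance, stored as {column index: value}.
WORD_VECTORS = {
    "<head>HARD</head>": {1: 4.977, 4: 7.813, 8: 9.114, 10: 1.431},
    "to": {1: 5.944, 2: 5.728, 3: 2.978, 5: 5.604, 7: 9.444, 9: 3.680},
    "defeat": {2: 3.984, 3: 8.306, 4: 6.632, 5: 4.514, 7: 1.785, 9: 7.609},
    "kill": {2: 4.296, 4: 9.802, 7: 9.268, 9: 8.856, 10: 9.723},
}

def average_context_vector(feature_counts, word_vectors):
    """Sum each word vector once per occurrence, then divide by the
    total number of feature occurrences."""
    total = {}
    occurrences = 0
    for feature, count in feature_counts.items():
        occurrences += count
        for col, value in word_vectors[feature].items():
            total[col] = total.get(col, 0.0) + count * value
    if occurrences == 0:
        return {}
    return {col: total[col] / occurrences for col in sorted(total)}

# Hand-entered counts: 'to' occurs three times, the others once each.
counts = {"<head>HARD</head>": 1, "to": 3, "defeat": 1, "kill": 1}
context = average_context_vector(counts, WORD_VECTORS)
print(" ".join(f"{col} {value:.4f}" for col, value in context.items()))
```

The divisor is 6, the total number of feature occurrences in the context, not 4, the number of distinct features.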
Sample Dense Output
In the above example, if WORDVEC is dense => test.dense_wordvec
8 10
4.9770 0.0000 0.0000 7.8130 0.0000 0.0000 0.0000 9.1140 0.0000 1.4310
5.9440 5.7280 2.9780 0.0000 5.6040 0.0000 9.4440 0.0000 3.6800 0.0000
0.0000 3.9840 8.3060 6.6320 4.5140 0.0000 1.7850 0.0000 7.6090 0.0000
0.0000 9.1470 0.0000 3.0860 0.3250 0.0000 0.0000 0.0000 1.4560 0.0000
0.7410 0.0000 0.0000 3.4500 0.0000 2.3630 0.0000 0.0000 0.0000 0.0000
9.5490 3.9210 8.1310 4.3010 9.0590 8.6070 0.0000 0.0000 0.0000 1.1380
0.0000 8.2030 0.0000 7.2970 1.0950 0.0000 4.3620 2.9630 0.0000 7.2640
0.0000 4.2960 0.0000 9.8020 0.0000 0.0000 9.2680 0.0000 8.8560 9.7230
Command =>
order2vec.pl --format f9.4 --dense test.sval2 test.dense_wordvec test.regex
Output =>
5 10
3.8015 4.2440 2.8733 4.0412 3.5543 0.0000 6.5642 1.5190 4.5842 1.8590
2.7302 6.0055 0.7445 3.4962 1.5635 0.0000 2.3610 2.2785 1.6480 0.3578
5.0890 1.3070 2.7103 5.1880 3.0197 3.6567 0.0000 3.0380 0.0000 0.8563
8.3473 4.5233 6.4133 2.8673 7.9073 5.7380 3.1480 0.0000 1.2267 0.7587
4.5258 3.9300 2.3478 3.8102 3.5603 1.8283 3.8750 2.0128 1.2267 1.6388
These are the same context vectors as in the Sample Sparse Output section, shown in dense format because --dense is ON.
Note that if --dense is ON, --format must be used to specify the format of the dense word vectors.
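The relationship between the sparse and dense samples can be checked with a small Python sketch. It assumes that sparse rows are "column value" pairs with 1-based column indices, which is what comparing the first rows of the two sample WORDVEC files suggests; sparse_to_dense is a hypothetical helper, not part of the package.

```python
def sparse_to_dense(sparse_line, num_columns):
    """Expand one sparse row ('col value col value ...') into a dense list."""
    fields = sparse_line.split()
    dense = [0.0] * num_columns
    for col, value in zip(fields[0::2], fields[1::2]):
        dense[int(col) - 1] = float(value)  # columns are assumed 1-based
    return dense

# First row of test.sparse_wordvec, expanded to 10 columns.
row = sparse_to_dense("1 4.977 4 7.813 8 9.114 10 1.431", 10)
print(" ".join(f"{v:.4f}" for v in row))
```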
SYSTEM REQUIREMENTS
AUTHORS
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Amruta Purandare, University of Pittsburgh
Mahesh Joshi, Carnegie-Mellon University
COPYRIGHT
Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Mahesh Joshi
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.