NAME
prepare_sval2.pl - Makes sure Senseval-2 data is cleaned and has sense tags prior to invocation of SenseClusters
SYNOPSIS
prepare_sval2.pl [Options] SOURCE
Here is a Senseval-2 file that is untagged
cat notags.txt
Output =>
<corpus lang="english">
<lexelt item="line">
<instance id="0">
<context>
he played on the offensive <head>line</head> in college
</context>
</instance>
<instance id="1">
<context>
i think the phone <head>line</head> is down
</context>
</instance>
</lexelt>
</corpus>
Here is a key file that contains sense tags for these instances:
cat key.txt
Output =>
<instance id="0"/> <sense id="formation"/>
<instance id="1"/> <sense id="cable"/>
Now we can apply the tags in the key file to the previously untagged instances:
prepare_sval2.pl notags.txt --key key.txt
Output =>
<corpus lang="english" tagged="NO">
<lexelt item="line">
<instance id="0">
<answer instance="0" senseid="formation"/>
<context>
he played on the offensive <head>line</head> in college
</context>
</instance>
<instance id="1">
<answer instance="1" senseid="cable"/>
<context>
i think the phone <head>line</head> is down
</context>
</instance>
</lexelt>
</corpus>
Type prepare_sval2.pl --help for a quick summary of options.
DESCRIPTION
This program prepares Senseval-2 data for SenseClusters experiments by making sure that all instances have sense tags. Sense tags can be applied from a separate key file, and any instance that remains without a tag is given a NOTAG.
This program also deals with P tags that may exist in some Senseval data. The P tag indicates that the target word is a proper noun. In many cases P-tagged instances are omitted from experiments since they are a different kind of sense. For example, if "bush" were the target word, some instances might refer to "George Bush", which may not be one of the senses we wish to evaluate.
Finally, this program can also deal with satellite tags that exist in some Senseval data. When the target word is a verb, it may in some cases have a satellite (particle) that we may or may not want to consider part of the target word. The satellite tags contain identifiers that may cause parsing trouble, so they are often removed.
INPUT
Required Arguments:
SOURCE
A Senseval-2 formatted data file that is to be prepared for SenseClusters experiments.
Optional Arguments:
--key KEY
Sense tagging mechanism in prepare_sval2.pl -
prepare_sval2.pl makes sure that every SOURCE instance carries an answer tag (a NOTAG at the very least).
If sense tags are found in the SOURCE file itself, they are retained. If the SOURCE instances are not tagged, each instance is given either the sense tags listed for it in a separate KEY file or a "NOTAG".
A KEY file holding the true answer keys of the SOURCE instances can be provided via the --key option.
The KEY file should be in SenseClusters format, with each line showing
<instance id="I"/> [<sense id="S"/>]+
that is, an instance id followed by one or more of its true sense ids, all on a single line.
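The KEY line format described above can be read with a short parser. The following is a minimal Python sketch, not the code used by prepare_sval2.pl; the function name parse_key_lines and the choice to raise an error on malformed lines are assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of prepare_sval2.pl. It reads KEY lines of the
# form: <instance id="I"/> <sense id="S"/> [<sense id="S"/> ...]
INSTANCE_RE = re.compile(r'<instance id="([^"]+)"/>')
SENSE_RE = re.compile(r'<sense id="([^"]+)"/>')

def parse_key_lines(lines):
    """Return a dict mapping each instance id to its list of sense ids."""
    key = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        inst = INSTANCE_RE.match(line)
        senses = SENSE_RE.findall(line)
        # Assumption: a line without an instance id or without any sense id
        # is treated as an error in this sketch.
        if inst is None or not senses:
            raise ValueError(f"line {lineno}: malformed KEY entry: {line!r}")
        key[inst.group(1)] = senses
    return key

key = parse_key_lines([
    '<instance id="0"/> <sense id="formation"/>',
    '<instance id="1"/> <sense id="cable"/>',
])
print(key)
```

Each instance id maps to a list because the format allows more than one sense id per line.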
prepare_sval2.pl handles the following anomalies in SOURCE/KEY -
If the first SOURCE instance is sense tagged, SOURCE is assumed to be sense tagged and the KEY file option is disabled. Any SOURCE instances that are not tagged are given "NOTAG"s, regardless of whether they have keys in the KEY file.
If the first SOURCE instance is not sense tagged, SOURCE is assumed to be untagged, and an error is reported if any instance in SOURCE is found to be sense tagged.
If the first SOURCE instance is not sense tagged and has an entry in the KEY file, the KEY file is enabled and instances are given their answer keys from the KEY file. Any instance without an answer key in the KEY file is given a "NOTAG".
If the first SOURCE instance is not sense tagged and has no entry in the KEY file, the KEY file is disabled and no instance is given a tag from it. All instances are given "NOTAG"s.
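The rules above amount to a decision made once, based on the first SOURCE instance. The following Python sketch models that decision under simplifying assumptions: instances are already parsed into (id, tags) pairs, and the function name assign_tags and the use of ValueError are hypothetical. It illustrates the documented behaviour and is not the program's actual code.

```python
def assign_tags(instances, key=None):
    """Model of the documented tagging rules (illustrative only).

    instances: list of (instance_id, source_tags) pairs in SOURCE order,
               where source_tags is a possibly empty list of sense ids.
    key:       dict of instance_id -> list of sense ids, or None when no
               --key option was given.
    Returns a list of (instance_id, tags) in which every instance is tagged.
    """
    if not instances:
        return []
    first_id, first_tags = instances[0]

    if first_tags:
        # First instance is tagged: SOURCE counts as tagged, KEY is ignored,
        # and untagged instances get NOTAG even if KEY lists them.
        return [(iid, list(tags) if tags else ["NOTAG"]) for iid, tags in instances]

    # First instance is untagged: SOURCE counts as untagged, so any tagged
    # instance found later is an error.
    for iid, tags in instances:
        if tags:
            raise ValueError(f"instance {iid!r} is tagged in an untagged SOURCE")

    # KEY is enabled only if it has an entry for the first instance.
    use_key = key is not None and first_id in key
    result = []
    for iid, _ in instances:
        if use_key and iid in key:
            result.append((iid, list(key[iid])))
        else:
            result.append((iid, ["NOTAG"]))
    return result

tagged = assign_tags([("0", []), ("1", [])],
                     key={"0": ["formation"], "1": ["cable"]})
print(tagged)
```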
--attachP
P tag handling mechanism in prepare_sval2.pl -
By default, prepare_sval2.pl removes sense tags that have the value P. According to the Senseval-2 standard, these are not true sense tags but indicate that the target word is a proper noun.
The --attachP option attaches a P tag to the immediately following sense tag for the same instance.
e.g. If --attachP is selected,
<instance id="art.40012" docsrc="bnc_A0E_130">
<answer instance="art.40012" senseid="P"/>
<answer instance="art.40012" senseid="arts%1:09:00::"/>
will be modified to
<instance id="art.40012" docsrc="bnc_A0E_130">
<answer instance="art.40012" senseid="P_arts%1:09:00::"/>
and if --attachP is not selected, the P tag is removed by default, giving
<instance id="art.40012" docsrc="bnc_A0E_130">
<answer instance="art.40012" senseid="arts%1:09:00::"/>
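The two behaviours shown above can be modelled on the list of senseid values of a single instance. This Python sketch is an illustration only; the function name handle_p_tags is hypothetical, and dropping a P that has no following sense tag is an assumption, since the text does not describe that case.

```python
def handle_p_tags(sense_ids, attach_p=False):
    """Model of the documented P tag handling for one instance.

    sense_ids: senseid values of the instance, in document order.
    attach_p:  True models --attachP, False models the default.
    """
    result = []
    pending_p = False
    for sid in sense_ids:
        if sid == "P":
            # Remember the P; it is either dropped or attached to the next tag.
            pending_p = True
            continue
        if pending_p and attach_p:
            result.append("P_" + sid)
        else:
            result.append(sid)
        pending_p = False
    # Assumption: a trailing P with no following sense tag is simply dropped.
    return result

ids = ["P", "arts%1:09:00::"]
print(handle_p_tags(ids, attach_p=True))   # --attachP selected
print(handle_p_tags(ids, attach_p=False))  # default
```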
--modifysat
This switch, if selected, removes the satellite tag ids from <head sats="ID"> and <sat id="ID"> tags, retaining the basic <head> and <sat> tags.
e.g. by selecting --modifysat,
Perhaps he 'd have <head sats="call_for.018:0">called</head> <sat
id="call_for.018:0">for</sat> a decentralized political and economic
system
will be transformed to
perhaps he 'd have <head> called </head> <sat> for </sat> a
decentralized political and economic system
If --modifysat is not selected, the satellite ids are retained.
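The id removal can be approximated with two regular-expression substitutions. This is a minimal Python sketch and not the program's own code; the function name strip_satellite_ids and the exact patterns are assumptions. It covers only the id removal and does not reproduce the lowercasing or the extra spacing around tags visible in the example output above.

```python
import re

# Hypothetical patterns: a <head> tag carrying a sats attribute, and a <sat>
# tag carrying an id attribute. \s+ also matches a line break between the
# tag name and the attribute, as in the example above.
HEAD_SATS_RE = re.compile(r'<head\s+sats="[^"]*"\s*>')
SAT_ID_RE = re.compile(r'<sat\s+id="[^"]*"\s*>')

def strip_satellite_ids(text):
    """Replace id-bearing <head> and <sat> tags with their bare forms."""
    text = HEAD_SATS_RE.sub("<head>", text)
    text = SAT_ID_RE.sub("<sat>", text)
    return text

sample = ('he \'d have <head sats="call_for.018:0">called</head> <sat\n'
          'id="call_for.018:0">for</sat> a decentralized system')
print(strip_satellite_ids(sample))
```

Plain <head> and <sat> tags without attributes pass through unchanged, so the function is safe to apply to data that has no satellite ids.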
--nolc
prepare_sval2.pl converts everything to lowercase by default. Select this switch to disable case conversion.
--help
Displays this message.
--version
Displays the version information.
OUTPUT
Output is a Senseval-2 file written to stdout.
AUTHORS
Amruta Purandare, University of Pittsburgh
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
COPYRIGHT
Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.