NAME
semcor-reformat.pl - Reformat SemCor sense tagged files for use by wsd.pl
SYNOPSIS
semcor-reformat.pl {--semcor-dir DIR | --file FILE [FILE ...]} [--key]
EXAMPLE
semcor-reformat.pl --semcor-dir ~/semcor2.0
DESCRIPTION
This script reads SemCor-formatted files and produces formatted text that can be used as input to wsd.pl. If the --key option is specified, the output also includes the sense number for each word, so that it can be used as a key file.
Several data sources are SemCor-formatted, including SemCor itself and the Senseval-2 and Senseval-3 all-words data sets. They have been made available for download by Rada Mihalcea:
http://www.cs.unt.edu/~rada/downloads.html
Only words that are assigned valid sense numbers are passed through this program; all other words are discarded. In practice, this means that only open-class words that appear in WordNet are kept. Closed-class words (pronouns, conjunctions, etc.) and other words not appearing in WordNet are dropped.
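The filtering rule described above can be illustrated with a minimal Python sketch. This is not the script's actual implementation (semcor-reformat.pl is a Perl program). It assumes that each SemCor token is marked up as a <wf ...>word</wf> element whose WordNet sense number, when present, appears in a wnsn attribute, and it treats only a positive integer as a valid sense number. The function name tagged_words and the layout of the returned tuples are hypothetical and chosen for illustration only.

```python
import re

# Assumed SemCor token markup (an assumption, not taken from this document):
#   <wf cmd=done pos=NN lemma=group wnsn=1 lexsn=1:03:00::>group</wf>
WF_TAG = re.compile(r"<wf\s+([^>]*)>([^<]*)</wf>")
ATTR = re.compile(r"(\w+)=(\S+)")


def tagged_words(lines):
    """Yield (surface, lemma, pos, sense) for tokens with a valid sense number.

    Hypothetical helper: a token is kept only if its wnsn attribute is a
    positive integer. Tokens without wnsn (closed-class words, punctuation)
    and tokens with any other wnsn value are discarded.
    """
    for line in lines:
        for attr_text, surface in WF_TAG.findall(line):
            attrs = dict(ATTR.findall(attr_text))
            sense = attrs.get("wnsn", "")
            if sense.isdigit() and int(sense) > 0:
                yield surface, attrs.get("lemma"), attrs.get("pos"), int(sense)


sample = [
    "<wf cmd=ignore pos=DT>The</wf>",
    "<wf cmd=done pos=NN lemma=group wnsn=1 lexsn=1:03:00::>group</wf>",
    "<punc>.</punc>",
]

# Only the sense-tagged noun survives; the determiner and punctuation are dropped.
kept = list(tagged_words(sample))
print(kept)
```

How the real script handles tokens that list several senses, or a sense number of zero, is not stated here; the sketch simply discards them.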
OPTIONS
- --semcor-dir=DIRECTORY
The location of the SemCor directory. This directory will contain several sub-directories, including 'brown1' and 'brown2'. Do not specify these sub-directories. Only specify the directory name that contains them. For example, if /home/user/semcor2.0 contains the brown1 and brown2 directories, you would only specify /home/user/semcor2.0 as the value of this option. Do not use this option at the same time as the --file option.
- --file=FILE
A SemCor-formatted file to process. This can be used instead of the --semcor-dir option to process only a few SemCor files or to process Senseval files. When this option is used, multiple files can be specified on the command line. For example:
semcor-reformat.pl --file br-a01 br-a02 br-k18 br-m02 br-r05
Do not use this option at the same time as the --semcor-dir option.
- --key
Generates a key file for use by the scorer2 program instead of a file that can be used for wsd.pl. The scorer2 program can be used to measure the performance of a word sense disambiguation program. See the documentation for scorer2-format.pl for more information.
AUTHORS
Jason Michelizzi
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
COPYRIGHT AND LICENSE
Copyright (C) 2005-2008 by Jason Michelizzi and Ted Pedersen
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.