NAME
filter.pl - Remove low frequency sense tags
SYNOPSIS
Filters given data by removing low frequency sense tags.
USAGE
filter.pl [OPTIONS] DATA FREQUENCY_OUTPUT
INPUT
Required Arguments:
filter.pl requires two compulsory arguments:
DATA
Senseval-2 formatted data file that is to be filtered.
FREQUENCY_OUTPUT
This should be the output of the frequency.pl program from this package, which shows the percentage frequency of each sense tag appearing in the given DATA. FREQUENCY_OUTPUT should be created by running frequency.pl on the same DATA file that is input to filter.pl.
It should contain tags of the form
<sense id="S" percent="P"/>
that specify the percentage P of each sense tag S in the DATA file.
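As an illustration of the tag format above, the following Python sketch reads sense percentages from text containing such tags. It is not the code used by filter.pl; the function name, the regular expression, and the sample sense ids are assumptions made for this example, and real frequency.pl output may contain additional lines.

```python
import re

# Assumed tag shape, taken from the description above:
#   <sense id="S" percent="P"/>
SENSE_TAG = re.compile(r'<sense\s+id="([^"]+)"\s+percent="([^"]+)"\s*/>')

def read_sense_percentages(text):
    """Return a dict mapping each sense id to its percentage as a float."""
    return {sense: float(pct) for sense, pct in SENSE_TAG.findall(text)}

# Hypothetical sample; the sense ids are invented for illustration.
sample = '''
<sense id="s1" percent="62.50"/>
<sense id="s2" percent="37.00"/>
<sense id="s3" percent="0.50"/>
'''
percentages = read_sense_percentages(sample)
```

Any line that does not match the assumed tag shape is ignored by this sketch.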
Optional Arguments:
Filter Options:
--percent P
With this option, the user can specify the percentage cutoff for filtering. When --percent is specified, filter.pl will remove all sense tags whose frequency in FREQUENCY_OUTPUT is below P%. A DATA instance whose attached sense tags are all below P% is removed. In other words, only those DATA instances are retained that have at least one sense tag with a frequency greater than or equal to P%.
--rank R
With this option, the user can specify the rank cutoff for filtering. When --rank is specified, filter.pl will remove those sense tags that are ranked below R when senses are ordered according to their percentages. A DATA instance whose attached sense tags are all ranked below R is removed. In other words, only those DATA instances are retained that have at least one sense tag at rank R or above.
filter.pl allows only one of the above filter conditions to be specified.
If neither filter option is specified, filter.pl uses the default filter condition P = 1 and filters DATA by removing sense tags with a frequency of less than 1%.
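The two filter conditions can be sketched in Python as follows. This is a minimal illustration, not filter.pl's implementation. It assumes that instances are given as (id, list of sense tags) pairs, that the R most frequent senses are the ones kept by --rank, and that ties in percentage are broken by sense id; none of these details are specified above. All names and sample values are hypothetical.

```python
def senses_kept_by_percent(percentages, cutoff=1.0):
    """Keep senses whose percentage is at least `cutoff` (default mirrors P = 1)."""
    return {s for s, p in percentages.items() if p >= cutoff}

def senses_kept_by_rank(percentages, rank):
    """Keep the `rank` most frequent senses.

    Assumption: ties are broken alphabetically by sense id.
    """
    ordered = sorted(percentages, key=lambda s: (-percentages[s], s))
    return set(ordered[:rank])

def filter_instances(instances, kept):
    """Drop filtered tags from each instance; drop instances left with no tags."""
    result = []
    for instance_id, tags in instances:
        remaining = [t for t in tags if t in kept]
        if remaining:
            result.append((instance_id, remaining))
    return result

# Hypothetical data for illustration.
percentages = {"s1": 62.5, "s2": 37.0, "s3": 0.5}
instances = [("inst1", ["s1", "s3"]), ("inst2", ["s3"]), ("inst3", ["s2"])]

by_percent = filter_instances(instances, senses_kept_by_percent(percentages))
by_rank = filter_instances(instances, senses_kept_by_rank(percentages, 1))
```

With the default cutoff, inst2 is dropped because its only tag, s3, falls below 1%. With a rank cutoff of 1, only inst1 survives, tagged with s1.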
--nomulti
Removes multiple sense tags attached to an instance such that each instance is tagged with the most frequent sense tag among the tags attached to it.
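The effect of --nomulti can be sketched in Python as below. This is an illustration under assumptions, not the actual implementation: instances are (id, list of sense tags) pairs, a tag missing from the percentage table counts as 0, and a tie goes to the tag listed first. All names and sample values are hypothetical.

```python
def most_frequent_tag(tags, percentages):
    """Return the tag with the highest percentage.

    Assumptions: unknown tags count as 0.0, and ties go to the
    first listed tag (max() returns the first maximal element).
    """
    return max(tags, key=lambda t: percentages.get(t, 0.0))

def apply_nomulti(instances, percentages):
    """Reduce every instance to its single most frequent sense tag."""
    return [(iid, [most_frequent_tag(tags, percentages)])
            for iid, tags in instances]

# Hypothetical data for illustration.
percentages = {"s1": 62.5, "s2": 37.0, "s3": 0.5}
instances = [("inst1", ["s2", "s1"]), ("inst2", ["s3"])]
single_tagged = apply_nomulti(instances, percentages)
```

Here inst1 keeps only s1, the more frequent of its two tags, and inst2 is unchanged because it already has a single tag.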
Other Options:
--count COUNT
Filters the corresponding COUNT file created by preprocess.pl along with the DATA file. COUNT file is filtered such that it stays consistent with the new filtered DATA file and contains only those instances left after filtering, in the same order as they appear in the output.
The filtered COUNT is written to the file COUNT.filtered, and the ith line in COUNT.filtered shows the instance data within <context> and </context> tags for the ith instance in the output of filter.pl.
--help
Displays this message.
--version
Displays the version information.
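The consistency requirement described for --count can be illustrated with a short Python sketch. It assumes that the original COUNT file has one line per DATA instance, in the same order as DATA; the function name and sample values are hypothetical, and this is not filter.pl's own code.

```python
def filter_count_lines(count_lines, instance_ids, kept_ids):
    """Keep the i-th COUNT line only if the i-th instance survived filtering.

    Assumption: count_lines and instance_ids are parallel, one entry
    per DATA instance, in the original DATA order.
    """
    kept = set(kept_ids)
    return [line for line, iid in zip(count_lines, instance_ids) if iid in kept]

# Hypothetical data for illustration.
count_lines = ["context of inst1", "context of inst2", "context of inst3"]
instance_ids = ["inst1", "inst2", "inst3"]
filtered_count = filter_count_lines(count_lines, instance_ids, ["inst1", "inst3"])
```

Because the lines are selected in their original order, the ith line of the result corresponds to the ith surviving instance.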
OUTPUT
Output is a sense-filtered Senseval-2 file containing only those DATA instances that have at least one sense tag left after filtering.
AUTHOR
Amruta Purandare, Ted Pedersen. University of Minnesota, Duluth.
COPYRIGHT
Copyright (c) 2002-2005,
Amruta Purandare, University of Pittsburgh. amruta@cs.pitt.edu
Ted Pedersen, University of Minnesota, Duluth. tpederse@umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.