NAME

filter.pl

SYNOPSIS

Filters the given data by removing low-frequency sense tags.

USAGE

filter.pl [OPTIONS] DATA FREQUENCY_OUTPUT

INPUT

Required Arguments:

filter.pl requires two arguments:

DATA

Senseval-2 formatted data file that is to be filtered.

FREQUENCY_OUTPUT

The output of frequency.pl (a program in this package), which shows the percentage frequency of each sense tag appearing in the given DATA. FREQUENCY_OUTPUT should be created by running frequency.pl on the same DATA file that is input to filter.pl.

It should contain tags of the form

       <sense id="S" percent="P"/>

that specify the percentage P of each sense tag S in the DATA file.
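As an illustration of the tag format above, here is a minimal Python sketch that reads such tags into a table of percentages. It is not how filter.pl itself parses its input; the function name parse_frequency_output, the regular expression, and the sample sense ids S1 to S3 with their percentages are all made up for this example.

```python
import re

# Matches tags of the form <sense id="S" percent="P"/>.
# The exact whitespace tolerated here is an assumption of this sketch.
SENSE_TAG = re.compile(r'<sense\s+id="([^"]+)"\s+percent="([^"]+)"\s*/>')


def parse_frequency_output(text):
    """Return a dict mapping each sense id to its percentage as a float."""
    return {sense: float(percent) for sense, percent in SENSE_TAG.findall(text)}


# Hypothetical sample; real FREQUENCY_OUTPUT files may contain other lines too.
sample = """
<sense id="S1" percent="62.50"/>
<sense id="S2" percent="37.00"/>
<sense id="S3" percent="0.50"/>
"""

frequencies = parse_frequency_output(sample)
print(frequencies)
```

Lines that do not contain a sense tag are simply ignored by this sketch.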

Optional Arguments:

Filter Options:

--percent P

With this option, the user can specify the percentage cutoff for filtering. When --percent is specified, filter.pl removes all sense tags whose frequency in FREQUENCY_OUTPUT is below P%. A DATA instance whose attached sense tags are all below P% is removed. In other words, only those DATA instances are retained that have at least one sense tag with frequency greater than or equal to P%.
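The percentage rule can be sketched in Python as follows. This is an illustrative model, not the filter.pl code: instances are represented as hypothetical (instance id, list of sense tags) pairs rather than Senseval-2 XML, and a tag missing from the frequency table is assumed to have frequency 0.

```python
def filter_by_percent(instances, frequencies, cutoff):
    """Drop tags below `cutoff` percent, then drop instances left with no tags.

    instances   -- list of (instance_id, [sense tags]) pairs (hypothetical model)
    frequencies -- dict mapping sense tag to its percentage
    cutoff      -- the P of --percent P
    """
    kept = []
    for instance_id, tags in instances:
        # Assumption: a tag absent from the frequency table counts as 0%.
        surviving = [t for t in tags if frequencies.get(t, 0.0) >= cutoff]
        if surviving:
            kept.append((instance_id, surviving))
    return kept


# Made-up data for the example.
frequencies = {"S1": 62.5, "S2": 37.0, "S3": 0.5}
instances = [("i1", ["S1"]), ("i2", ["S2", "S3"]), ("i3", ["S3"])]

for instance_id, tags in filter_by_percent(instances, frequencies, 1.0):
    print(instance_id, tags)
```

With a cutoff of 1, instance i2 loses its S3 tag but is retained, while i3 is removed because S3 was its only tag.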

--rank R

With this option, the user can specify the rank cutoff for filtering. When --rank is specified, filter.pl removes those sense tags that are ranked below R when senses are ordered according to their percentages. A DATA instance whose attached sense tags are all ranked below R is removed. In other words, only those DATA instances are retained that have at least one sense tag ranked R or higher.
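A possible reading of the rank rule is sketched below in Python. Two points are assumptions of this sketch rather than documented behaviour: the sense at rank R itself is kept, and ties in percentage are broken alphabetically by sense id. The (instance id, tag list) representation and all names are hypothetical.

```python
def top_ranked_senses(frequencies, rank):
    """Return the set of the `rank` most frequent senses."""
    # Sort by descending percentage; ties broken by sense id (an assumption).
    ordered = sorted(frequencies, key=lambda s: (-frequencies[s], s))
    return set(ordered[:rank])


def filter_by_rank(instances, frequencies, rank):
    """Drop tags ranked below `rank`, then drop instances left with no tags."""
    keep = top_ranked_senses(frequencies, rank)
    kept = []
    for instance_id, tags in instances:
        surviving = [t for t in tags if t in keep]
        if surviving:
            kept.append((instance_id, surviving))
    return kept


# Made-up data for the example.
frequencies = {"S1": 50.0, "S2": 30.0, "S3": 20.0}
instances = [("i1", ["S1", "S2"]), ("i2", ["S3"]), ("i3", ["S2", "S3"])]

for instance_id, tags in filter_by_rank(instances, frequencies, 2):
    print(instance_id, tags)
```

With R = 2, only S1 and S2 survive, so i2 is removed and i3 keeps only S2.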

filter.pl allows only one of the above filter conditions to be specified.

If neither filter option is specified, filter.pl uses the default filter condition P = 1 and filters DATA by removing sense tags with frequency less than 1%.

--nomulti

Removes multiple sense tags attached to an instance, so that each instance is tagged only with the most frequent sense tag among the tags attached to it.
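The effect of --nomulti can be pictured with the short Python sketch below. It is only a model: instances are hypothetical (instance id, tag list) pairs, unknown tags are assumed to have frequency 0, and on a tie the tag listed first is kept, which may differ from what filter.pl does.

```python
def keep_most_frequent(instances, frequencies):
    """Reduce each instance to its single most frequent sense tag."""
    result = []
    for instance_id, tags in instances:
        if not tags:
            continue  # nothing to choose from; skip untagged instances
        # max() returns the first maximal element, so ties go to the earlier tag.
        best = max(tags, key=lambda t: frequencies.get(t, 0.0))
        result.append((instance_id, [best]))
    return result


# Made-up data for the example.
frequencies = {"S1": 62.5, "S2": 37.0, "S3": 0.5}
instances = [("i1", ["S2", "S1"]), ("i2", ["S3", "S2"]), ("i3", ["S3"])]

for instance_id, tags in keep_most_frequent(instances, frequencies):
    print(instance_id, tags)
```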

Other Options:

--count COUNT

Filters the corresponding COUNT file created by preprocess.pl along with the DATA file. The COUNT file is filtered so that it stays consistent with the new filtered DATA file and contains only those instances left after filtering, in the same order as they appear in the output.

The filtered COUNT is written to the file COUNT.filtered. The ith line of COUNT.filtered shows the instance data between the <context> and </context> tags of the ith instance in the output of filter.pl.
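The line-by-line correspondence can be sketched in Python as follows. The sketch assumes that the original COUNT file has one line per DATA instance in the same order as DATA, and that filtering preserves that order; the function name, the instance ids, and the sample lines are invented for illustration.

```python
def filter_count_lines(count_lines, instance_ids, kept_ids):
    """Keep only the COUNT lines whose instance survived filtering.

    count_lines  -- lines of the original COUNT file, one per instance
    instance_ids -- instance ids in DATA order, parallel to count_lines
    kept_ids     -- ids of the instances present in the filtered output
    """
    kept = set(kept_ids)
    # zip() walks both lists in DATA order, so the original order is preserved.
    return [line for line, iid in zip(count_lines, instance_ids) if iid in kept]


# Made-up data for the example.
count_lines = ["context of first", "context of second", "context of third"]
instance_ids = ["i1", "i2", "i3"]

filtered = filter_count_lines(count_lines, instance_ids, ["i1", "i3"])
print("\n".join(filtered))
```

Under these assumptions, the ith element of the result lines up with the ith retained instance.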

--help

Displays this message.

--version

Displays the version information.

OUTPUT

The output is a sense-filtered Senseval-2 file that shows only those DATA instances that have at least one sense tag left after filtering.

AUTHOR

Amruta Purandare, Ted Pedersen. University of Minnesota, Duluth.

COPYRIGHT

Copyright (c) 2002-2005,

Amruta Purandare, University of Pittsburgh. amruta@cs.pitt.edu

Ted Pedersen, University of Minnesota, Duluth. tpederse@umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.