NAME
frequency.pl - Compute the distribution of senses in a Senseval-2 data file
SYNOPSIS
frequency.pl [OPTIONS] SOURCE
You can find begin.v-test.xml in samples/Data
frequency.pl begin.v-test.xml
Output =>
<sense id="begin%2:30:00::" percent="64.31"/>
<sense id="begin%2:30:01::" percent="14.51"/>
<sense id="begin%2:42:04::" percent="21.18"/>
Total Instances = 255
Total Distinct Senses=3
Distribution={64.31,21.18,14.51}
% of Majority Sense = 64.31
Type frequency.pl --help for a quick summary of options.
DESCRIPTION
Displays the distribution of senses in a given Senseval-2 file on STDOUT. This information helps in understanding the data and in deciding whether to filter low-frequency senses (using filter.pl) or balance the distribution of senses (using balance.pl).
INPUT
Required Arguments:
SOURCE
SOURCE should be a Senseval-2 formatted file. Sense ids are found by matching the regex /sense\s*id="S"/.
An instance with multiple sense ids should appear only once, with multiple <answer> tags. For example, if an instance IID has two sense ids, SID1 and SID2, then instance IID should be formatted in the SOURCE file as:
<instance id="IID">
<answer instance="IID" senseid="SID1"/>
<answer instance="IID" senseid="SID2"/>
<context>
Context Data comes here ....
</context>
</instance>
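The matching described above can be sketched in a few lines of Python. This is only an illustration of the two documented regexes applied to the example instance, not the code of frequency.pl itself; the names SAMPLE, INSTANCE_RE, SENSE_RE and scan are hypothetical.

```python
import re

# Hypothetical input following the instance layout shown above.
SAMPLE = '''<instance id="IID">
<answer instance="IID" senseid="SID1"/>
<answer instance="IID" senseid="SID2"/>
<context>
Context Data comes here ....
</context>
</instance>
'''

# Python equivalents of the regexes quoted in this document,
# with a capture group in place of the ID / S placeholders.
INSTANCE_RE = re.compile(r'instance id="([^"]+)"')
SENSE_RE = re.compile(r'sense\s*id="([^"]+)"')

def scan(text):
    """Return (set of unique instance ids, list of all sense ids found)."""
    instance_ids = set()
    sense_ids = []
    for line in text.splitlines():
        instance_ids.update(INSTANCE_RE.findall(line))
        sense_ids.extend(SENSE_RE.findall(line))
    return instance_ids, sense_ids

instance_ids, sense_ids = scan(SAMPLE)
print(sorted(instance_ids))  # the single instance id
print(sense_ids)             # both sense ids attached to it
```

Note that the instance regex requires a space between "instance" and "id", so the instance="IID" attribute inside an <answer> tag is not counted as a second instance.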
Optional Arguments:
--help
Displays this message.
--version
Displays the version information.
OUTPUT
The output displays:
1. Total number of instances in SOURCE
These are counted by matching the regex /instance id=\"ID\"/ and counting the unique instance ids.
2. Total number of distinct sense tags found in SOURCE
These are found by matching the regex /sense\s*id="S"/.
3. Sense Distribution
Output shows
<sense id="S" percent="P"/>
for each sense id found in SOURCE. P is the percentage frequency of the sense S.
4. % of Majority sense
This is the highest sense percentage found in SOURCE.
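The four output items listed above can be sketched as follows. This is a minimal illustration in Python, not the implementation of frequency.pl. It assumes that each percentage is the number of tags for a sense divided by the total number of sense tags; this document does not state the denominator, so treat that as an assumption. The function names and the toy data are hypothetical.

```python
from collections import Counter

def sense_distribution(sense_ids):
    """Map each sense id to its percentage frequency.

    Assumption: the denominator is the total number of sense tags.
    """
    counts = Counter(sense_ids)
    total = sum(counts.values())
    return {s: 100.0 * c / total for s, c in counts.items()}

def report(sense_ids, n_instances):
    """Build report lines in the layout shown in this document."""
    dist = sense_distribution(sense_ids)
    lines = ['<sense id="%s" percent="%.2f"/>' % (s, dist[s])
             for s in sorted(dist)]
    lines.append("Total Instances = %d" % n_instances)
    lines.append("Total Distinct Senses=%d" % len(dist))
    # Percentages in descending order, as in the Distribution line.
    ranked = sorted(dist.values(), reverse=True)
    lines.append("Distribution={%s}" % ",".join("%.2f" % p for p in ranked))
    majority = ranked[0] if ranked else 0.0
    lines.append("%% of Majority Sense = %.2f" % majority)
    return lines

# Toy data: four instances, one sense tag each.
toy_tags = ["a", "a", "a", "b"]
toy_report = report(toy_tags, n_instances=4)
print("\n".join(toy_report))
```

With the toy data, sense "a" accounts for three of four tags and "b" for one, so the sketch reports 75.00 and 25.00 with a majority sense percentage of 75.00.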
Sample Output
<sense id="begin%2:30:00::" percent="59.49"/>
<sense id="begin%2:30:01::" percent="13.38"/>
<sense id="begin%2:42:00::" percent="4.70"/>
<sense id="begin%2:42:03::" percent="3.44"/>
<sense id="begin%2:42:04::" percent="18.99"/>
Total Instances = 548
Total Distinct Senses=5
Distribution={59.49,18.99,13.38,4.70,3.44}
% of Majority Sense = 59.49
This shows that SOURCE contains 548 instances in total and 5 distinct senses.
The senses are distributed with percentage frequencies
{59.49,18.99,13.38,4.70,3.44}
listed in descending order, where the majority sense has frequency 59.49.
The <sense> tags show the frequency of each individual sense.
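Because the <sense> lines have a fixed shape, the report can be read back by a script, for example to spot low-frequency senses before deciding whether to run filter.pl. The sketch below parses the sample output shown above; the names REPORT, SENSE_LINE, parse_report and the 5.0 cutoff are hypothetical and are not options of any program in this package.

```python
import re

# The <sense> lines of the sample output shown above.
REPORT = '''<sense id="begin%2:30:00::" percent="59.49"/>
<sense id="begin%2:30:01::" percent="13.38"/>
<sense id="begin%2:42:00::" percent="4.70"/>
<sense id="begin%2:42:03::" percent="3.44"/>
<sense id="begin%2:42:04::" percent="18.99"/>
'''

SENSE_LINE = re.compile(r'<sense id="([^"]+)" percent="([0-9.]+)"/>')

def parse_report(text):
    """Return a dict mapping sense id to its reported percentage."""
    return {sid: float(pct) for sid, pct in SENSE_LINE.findall(text)}

dist = parse_report(REPORT)

# Majority sense: the sense id with the highest percentage.
majority = max(dist, key=dist.get)

# Hypothetical cutoff, chosen only for illustration.
CUTOFF = 5.0
low_frequency = sorted(s for s, p in dist.items() if p < CUTOFF)

print(majority, dist[majority])
print(low_frequency)
```

For this sample the majority sense is begin%2:30:00:: at 59.49, and the two senses below the illustrative cutoff are begin%2:42:00:: and begin%2:42:03::.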
AUTHORS
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Amruta Purandare, University of Pittsburgh
COPYRIGHT
Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.