NAME
rawtextFreq.pl - Compute Information Content from Raw / Plain Text
SYNOPSIS
rawtextFreq.pl --outfile OUTFILE [--stopfile=STOPFILE]
{--stdin | --infile FILE [--infile FILE ...]}
[--wnpath WNPATH] [--resnik] [--smooth=SCHEME]
| --help | --version
OPTIONS
--outfile=filename
The name of a file to which output should be written
--stopfile=filename
A file containing a list of stop words that will not be
considered in the frequency counts. A sample file can be
downloaded from
http://www.d.umn.edu/~tpederse/Group01/WordNet/words.txt
--wnpath=path
Location of the WordNet data files (e.g.,
/usr/local/WordNet-3.0/dict)
--resnik
Use Resnik (1995) frequency counting
--smooth=SCHEME
Use smoothing when computing the probabilities. SCHEME can
only be ADD1 at this time
--help
Show a help message
--version
Display version information
--stdin
Read the text to be used for counting word frequencies from
the standard input.
--infile=PATTERN
The name of a raw text file to be used for counting word frequencies.
The value can be a filename, a directory name, or a pattern (as
understood by Perl's glob() function). If the value is a directory
name, then all the files in that directory and its subdirectories will
be used.
If you are looking for some interesting files to use, check out
Project Gutenberg: <http://www.gutenberg.org>.
This option may be given more than once (if more than one file
should be used).
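For example, a typical invocation (with hypothetical file and
directory names) that counts word frequencies from a directory of
plain text files, uses a stoplist, and applies ADD1 smoothing might
look like this:

    rawtextFreq.pl --outfile ic-mycorpus.dat --stopfile stoplist.txt \
                   --infile /home/user/corpus --smooth=ADD1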
DESCRIPTION
This program reads a corpus of plain text, computes frequency counts from that corpus, and then uses those counts to determine the information content of each synset in WordNet. In brief, it first assigns counts to each synset for which it obtains a frequency count in the corpus, and then propagates those counts up the WordNet hierarchy. More details on this process can be found in the documentation of the lin, res, and jcn measures in WordNet::Similarity and in the publication by Patwardhan, et al. (2003) referred to below.
The utility programs BNCFreq.pl, SemCorRawFreq.pl, treebankFreq.pl, and brownFreq.pl all function in exactly the same way as this plain text program (rawtextFreq.pl), except that each includes the ability to deal with the format of the corpus with which it is used.
None of these programs requires sense-tagged text; instead, they simply distribute the count of the observed form of a word to all the synsets in WordNet with which it could be associated. The different forms of a word are found via the validForms and querySense methods of WordNet::QueryData.
For example, if the observed word is 'bank', then a count is given to the synsets associated with the financial institution, a river shore, the act of turning a plane, etc.
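The following is a minimal sketch of this idea in Perl, assuming WordNet::QueryData is installed and WordNet can be located; the variable names are hypothetical, and the real program's internals differ:

    use WordNet::QueryData;

    my $wn = WordNet::QueryData->new;

    my %count;
    my $word = 'bank';

    # validForms() maps the surface form to its valid WordNet forms,
    # one per applicable part of speech (e.g. 'bank#n', 'bank#v').
    foreach my $form ($wn->validForms($word)) {
        # querySense() lists the senses (synsets) of each form,
        # e.g. 'bank#n#1', 'bank#n#2', ...
        foreach my $sense ($wn->querySense($form)) {
            # default scheme: each synset receives the full count
            $count{$sense}++;
        }
    }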
Distributing Counts to Synsets
If the corpus is sense-tagged, then distributing the counts of sense-tagged words to synsets is trivial: you increment the count of each synset for which you have a sense-tagged instance. However, it is very hard to obtain large quantities of sense-tagged text, so in general it is not feasible to obtain information content values from large sense-tagged corpora.
As such, this program and the related *Freq.pl utilities all increment the counts of synsets based on the occurrence of raw untagged word forms. In this case it is less obvious how to proceed. This program supports two methods for distributing the count of an observed word form in untagged text to synsets.
One is our default method, and we refer to the other as Resnik counting. In our default counting scheme, each synset receives the total count of each word form associated with it.
Suppose the word 'bank' can be associated with six different synsets. In our default scheme, each of those synsets would receive a count for each occurrence of 'bank'. In Resnik counting, the count would be divided among the possible synsets, so in this case each synset would get one sixth (1/6) of the total count.
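The difference between the two schemes can be sketched as follows (a hypothetical fragment; @senses holds the synsets a single occurrence of a word form maps to, and $resnik reflects whether --resnik was given):

    # one occurrence of a word form that maps to the synsets in @senses
    my @senses = qw(bank#n#1 bank#n#2 bank#n#3 bank#n#4 bank#n#5 bank#n#6);
    my %count;

    if ($resnik) {
        # Resnik counting: divide the count evenly among the synsets
        my $share = 1 / scalar @senses;
        $count{$_} += $share foreach @senses;
    }
    else {
        # default counting: each synset receives the full count
        $count{$_} += 1 foreach @senses;
    }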
How are These Counts Used?
This program maps word forms to synsets. These synset counts are then propagated up the WordNet hierarchy to arrive at Information Content values for each synset, which are then used by the Lin (lin), Resnik (res), and Jiang & Conrath (jcn) measures of semantic similarity.
By default these measures use counts derived from the cntlist file provided by WordNet, which is based on frequency counts from the sense-tagged SemCor corpus. SemCor consists of approximately 200,000 sense-tagged tokens taken from the Brown Corpus and The Red Badge of Courage.
A file called ic-semcor.dat is created from cntlist during the installation of WordNet::Similarity. In fact, the utility program semCorFreq.pl is used to do this. It is the only one of the *Freq.pl utility programs that uses sense-tagged text, and in fact it only uses the counts from cntlist, not the actual sense-tagged text.
This program simply creates an alternative version of the ic-semcor.dat file based on counts obtained from raw untagged text.
Why Use This Program?
The default information content file (ic-semcor.dat) is based on SemCor, which includes sense-tagged portions of the Brown Corpus and The Red Badge of Courage. It has the advantage of being sense-tagged, but it is drawn from a rather limited domain and is somewhat small in size (200,000 sense-tagged tokens).
If you are working in a different domain or have access to larger corpora, you might find that this program provides information content values that better reflect your underlying domain or problem.
How can these counts be reliable if they aren't based on sense-tagged text?
Remember that once the counts are given to a synset, those counts are propagated upwards, so that each synset receives the counts of its children. These are then used in the calculation of the information content of each synset, which is simply:
information content (synset) = - log [probability (synset)]
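As a rough sketch of how the propagation and this calculation fit together in Perl (toy counts and hypothetical names; this simple walk ignores the rare case of a synset with more than one hypernym, and the real program's internals differ):

    use WordNet::QueryData;

    my $wn = WordNet::QueryData->new;

    # toy synset counts obtained from a corpus
    my %count = ('bank#n#1' => 10, 'dog#n#1' => 5);

    # propagate each synset's count to all of its hypernym ancestors
    my %total = %count;
    foreach my $synset (keys %count) {
        my @queue = $wn->querySense($synset, 'hype');
        while (my $parent = shift @queue) {
            $total{$parent} += $count{$synset};
            push @queue, $wn->querySense($parent, 'hype');
        }
    }

    # after propagation the root holds the total count, so
    # probability(synset) = count(synset) / count(root)
    my $root = 'entity#n#1';    # the unique noun root in WordNet 3.0
    foreach my $synset (keys %total) {
        my $p = $total{$synset} / $total{$root};
        printf "%-12s IC = %.4f\n", $synset, -log($p);
    }

When --smooth=ADD1 is given, ADD1 presumably refers to standard add-one (Laplace) smoothing, in which one is added to each synset count before the probabilities are computed, so that unobserved synsets do not receive a probability of zero.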
More details on this calculation, and on how these values are used in the res, lin, and jcn measures, can be found in the WordNet::Similarity module documentation, and in the following publication:
Using Measures of Semantic Relatedness for Word Sense Disambiguation
(Patwardhan, Banerjee and Pedersen) - Appears in the Proceedings of
the Fourth International Conference on Intelligent Text Processing and
Computational Linguistics, pp. 241-257, February 17-21, 2003, Mexico City.
http://www.d.umn.edu/~tpederse/Pubs/cicling2003-3.pdf
We believe that a propagation effect will result in concentrations or clusters of information content values in the WordNet hierarchy. For example, if you have a text about banking, then while the different counts of 'bank' will be dispersed around WordNet, there will also be other financial terms occurring with 'bank' whose synsets lie near the financial sense of 'bank' in WordNet, leading to a concentration of counts in that region of the hierarchy. It is best to view this as a conjecture or hypothesis at this time; evidence for or against would be most interesting.
You can use raw text of any kind with this program. We sometimes use text from Project Gutenberg, for example the Complete Works of Shakespeare, available from http://www.gutenberg.org/ebooks/100
BUGS
Report to the WordNet::Similarity mailing list : http://groups.yahoo.com/group/wn-similarity
SEE ALSO
WordNet home page : http://wordnet.princeton.edu
WordNet::Similarity home page : http://wn-similarity.sourceforge.net
AUTHORS
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Satanjeev Banerjee, Carnegie Mellon University, Pittsburgh
banerjee+ at cs.cmu.edu
Siddharth Patwardhan, University of Utah, Salt Lake City
sidd at cs.utah.edu
Jason Michelizzi
COPYRIGHT
Copyright (c) 2005-2008, Ted Pedersen, Satanjeev Banerjee, Siddharth Patwardhan and Jason Michelizzi
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
Free Software Foundation, Inc.
59 Temple Place - Suite 330
Boston, MA 02111-1307, USA