NAME
TODO - Possible changes for WordNet::Similarity
SYNOPSIS
A list of things to do for WordNet::Similarity.
DESCRIPTION
As these items are completed, move them down into Recently Completed Items, and be sure to date and initial them. When we have a version release, all of the recently completed items should be moved into changelog.pod.
FOR FUTURE VERSIONS
update hso to support new relations in WordNet 2.0
update hso to support the use of a hypothetical root node. Currently (as of versions 0.06 and 0.07) its hypernym paths are limited to a single taxonomy. This might be problematic for nouns, which are split into 9(?) separate taxonomies within WordNet, and even more so for verbs, which are split into hundreds of taxonomies. Right now when hso is on a hypernym path it isn't able to cross "up and over" from one taxonomy into another. It seems like it should be able to do so.
re-write hso to make it faster and more generic. check whether hso uses the hypothetical root node, and consider adding the ability to turn it on/off.
re-write the *Freq.pl programs to reduce redundancy and make them faster. At present there are bugs in all of these programs. Create test cases that are manually verified and include them in the testing directory.
support --trace option on info content programs to allow for wps format to be displayed in addition to (or instead of?) offset.
run profiles of rawtextFreq.pl and BNCFreq.pl to determine where time is being spent. Brown, SemCor and Treebank all seem to run reasonably quickly (20 minutes, 5 minutes, and 40 minutes, respectively). Run 1 million words worth of BNC in order to compare with Brown and Treebank.
rawtextFreq.pl runs really slowly. This may be because raw text has no markup to identify sentence boundaries or otherwise guide the programs. This might particularly slow down compound identification.
Makefile.PL and semCorFreq.pl seem to be somewhat alike. Can Makefile.PL simply call /utils/semCorFreq.pl, or can this duplication be avoided in some other way?
Update documentation to clarify that stoplists must also be all lowercase. Consider adjusting stoplists to use regular expressions.
speed up lesk, and make it more generic. String matching is the big offender with respect to speed, and WordNet-specific code is the problem with respect to generality.
update lesk/vector to support new relations in WordNet 2.0
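The "up and over" idea behind the hypothetical root node item above can be illustrated with a small sketch. This is not the hso algorithm and not code from the package; it only shows how a shared artificial root lets a hypernym path connect two otherwise separate taxonomies, and how an on/off switch might behave. The names HYPO_ROOT, ancestors, hypernym_path_length and the toy taxonomy are all assumptions made for the example, and each concept is assumed to have a single hypernym.

```python
# Hypothetical label for the artificial root; not a name used by the package.
HYPO_ROOT = "*ROOT*"


def ancestors(node, parent, use_root):
    """Return the chain from node up to the top of its taxonomy.

    parent maps each concept to its single hypernym (a simplifying assumption).
    When use_root is true, the hypothetical root is appended above the top node.
    """
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    if use_root:
        chain.append(HYPO_ROOT)
    return chain


def hypernym_path_length(a, b, parent, use_root=True):
    """Count edges on the path from a up to a shared ancestor and down to b.

    Returns None when the two concepts share no ancestor, which is what
    happens across taxonomies when the hypothetical root is turned off.
    """
    chain_a = ancestors(a, parent, use_root)
    depth_b = {n: i for i, n in enumerate(ancestors(b, parent, use_root))}
    for steps_up, n in enumerate(chain_a):
        if n in depth_b:
            return steps_up + depth_b[n]
    return None


# Toy data: two separate taxonomies, invented for illustration only.
parent = {"dog": "canine", "canine": "entity", "idea": "abstraction"}

same_taxonomy = hypernym_path_length("dog", "canine", parent)
without_root = hypernym_path_length("dog", "idea", parent, use_root=False)
with_root = hypernym_path_length("dog", "idea", parent, use_root=True)
```

With the root turned off, "dog" and "idea" have no connecting path; with it turned on, the path climbs to the hypothetical root and descends into the other taxonomy.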
GOOD IDEAS FOR FUTURE WORK, DO WHEN POSSIBLE
edge/path and jcn are both distance measures. To convert them to similarity measures, we currently use 1/distance. This changes the scale of the measures and the relative spacing between pairs. Alternatives are to use -dist or maxdist-dist. Computing maxdist for path is much like the computation for lch (with and without the hypothetical root node). For jcn it poses a new issue, in that we would need to find the pair of concepts that have the greatest individual information content and are subsumed by a root node (either hypothetical or "real").
check whether warnings are issued when the version of an info content file clashes with the WordNet version.
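The effect of the three distance-to-similarity conversions mentioned above can be seen in a small sketch. The distances and the value of maxdist are made up for illustration and are not taken from WordNet or from the package; the function names are likewise hypothetical.

```python
def inverse(dist):
    """Current approach: similarity = 1/distance."""
    return 1.0 / dist


def negated(dist):
    """Alternative: similarity = -dist."""
    return -dist


def from_max(dist, maxdist):
    """Alternative: similarity = maxdist - dist."""
    return maxdist - dist


def gaps(scores):
    """Differences between consecutive scores, to compare relative spacing."""
    return [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]


# Hypothetical distances for three concept pairs, closest pair first.
distances = [1, 2, 4]
maxdist = 10  # assumed maximum distance, chosen arbitrarily

inv = [inverse(d) for d in distances]            # 1.0, 0.5, 0.25
neg = [negated(d) for d in distances]            # -1, -2, -4
off = [from_max(d, maxdist) for d in distances]  # 9, 8, 6

inv_gaps = gaps(inv)  # 0.5, 0.25
neg_gaps = gaps(neg)  # 1, 2
off_gaps = gaps(off)  # 1, 2
```

All three conversions keep the ranking of the pairs. The distances differ by 1 and then by 2, and both -dist and maxdist-dist preserve those gaps, while 1/distance makes the first gap the larger one.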
RECENTLY COMPLETED ITEMS
AUTHORS
Ted Pedersen, University of Minnesota Duluth
tpederse at d.umn.edu
Siddharth Patwardhan, University of Utah, Salt Lake City
sidd at cs.utah.edu
Satanjeev Banerjee, Carnegie Mellon University, Pittsburgh
banerjee+ at cs.cmu.edu
Jason Michelizzi, University of Minnesota Duluth
mich0212 at d.umn.edu
BUGS
None.
SEE ALSO
changelog.pod
COPYRIGHT
Copyright (c) 2005, Ted Pedersen, Siddharth Patwardhan, Satanjeev Banerjee, and Jason Michelizzi.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.