NAME

DEVELOPERS - [documentation] Instructions on how to write a new measure for WordNet::Similarity

WRITING A NEW MEASURE

All the existing measures are written in an object-oriented style, and a new measure must be written the same way. If object-oriented Perl is new to you, see the perldoc pages on the subject: perlboot, perltoot, perltooc, and perlbot.

The existing measure modules are found in lib/WordNet when you unpack the source tarball. The following methods are defined in WordNet::Similarity and are available to you:

sub new
sub initialize
sub configure
sub traceOptions
sub getTraceString
sub getError
sub getRelatedness
sub printSet
sub fetchFromCache
sub storeToCache

If you are writing a measure based on information content, the module WordNet::Similarity::ICFinder defines some extra methods:

sub probability
sub IC
sub getFrequency
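
As background for these methods, information content is conventionally defined as the negative log of a concept's probability, where the probability is the concept's frequency relative to the frequency of the taxonomy root. The sketch below only illustrates that arithmetic on made-up counts. The names toy_probability and toy_ic, the %toy_freq table, and the decision to return undef for a zero probability are all hypothetical; they are not the ICFinder API, so check the ICFinder documentation for the real argument lists and return values.

```perl
use strict;
use warnings;

# Hypothetical frequency counts for a toy taxonomy (not real WordNet data).
my %toy_freq = (
    entity => 1000,    # root: every counted occurrence contributes here
    animal => 100,
    dog    => 10,
);

# Probability of a concept: its count relative to the root's count.
sub toy_probability {
    my ($concept, $root) = @_;
    my $root_freq = $toy_freq{$root} or return 0;
    return ($toy_freq{$concept} // 0) / $root_freq;
}

# Information content: -log(p), using the natural logarithm.
# Returns undef when p is 0, because log(0) is undefined.
sub toy_ic {
    my ($concept, $root) = @_;
    my $p = toy_probability($concept, $root);
    return $p > 0 ? -log($p) : undef;
}

printf "IC(%s) = %.4f\n", $_, toy_ic($_, 'entity') for qw/animal dog/;
```

Rarer concepts get a higher value: in this toy table, dog (10 of 1000) carries more information than animal (100 of 1000).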

If you are writing a measure that does some sort of path-finding, the WordNet::Similarity::PathFinder module supplies some extra methods as well:

sub getShortestPath
sub getAllPaths

If your measure needs to know the depth of a synset in the WordNet taxonomies, or the maximum depth of a particular taxonomy, the WordNet::Similarity::DepthFinder module has methods that will be useful:

sub getSynsetDepth
sub getTaxonomyDepth
sub getTaxonomyRoot

If you want to find LCSs (Least Common Subsumers), there are three different ways of doing so, depending upon whether you want to use path length, depth, or information content. The three methods for finding LCSs are:

sub getLCSbyPath
sub getLCSbyDepth
sub getLCSbyIC

They are found in WordNet::Similarity::PathFinder, WordNet::Similarity::DepthFinder, and WordNet::Similarity::ICFinder, respectively.
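
To make the idea of an LCS concrete, here is a minimal sketch that finds the deepest common subsumer of two nodes in a toy taxonomy. Everything in it is hypothetical: the %toy_parent table, the helper names, and the convention that the root has depth 1. It assumes each node has exactly one parent; WordNet synsets can have more than one hypernym, so the real getLCSby* methods have to handle cases this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical single-parent taxonomy: child => parent.
my %toy_parent = (
    animal => 'entity',
    plant  => 'entity',
    dog    => 'animal',
    cat    => 'animal',
    oak    => 'plant',
);

# Path from a node up to the root, starting with the node itself.
sub toy_path_to_root {
    my ($node) = @_;
    my @path = ($node);
    push @path, $node while defined($node = $toy_parent{$node});
    return @path;
}

# Depth of a node, counting the root as depth 1 (an assumption of this sketch).
sub toy_depth {
    my ($node) = @_;
    my @path = toy_path_to_root($node);
    return scalar @path;
}

# Deepest node that subsumes both inputs, or an empty return if none exists.
sub toy_lcs_by_depth {
    my ($x, $y) = @_;
    my %subsumes_x = map { $_ => 1 } toy_path_to_root($x);

    # Walking up from $y, the first node shared with $x is the deepest one.
    for my $node (toy_path_to_root($y)) {
        return $node if $subsumes_x{$node};
    }
    return;
}

for my $pair (['dog', 'cat'], ['dog', 'oak']) {
    my $lcs = toy_lcs_by_depth(@$pair);
    printf "LCS(%s, %s) = %s at depth %d\n", @$pair, $lcs, toy_depth($lcs);
}
```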

For writing a measure that uses glosses (like vector, vector_pairs, and lesk), the WordNet::Similarity::GlossFinder module may be useful.

The documentation for the respective modules has detailed descriptions of how each method works, what parameters each one expects, etc.

STEP BY STEP INSTRUCTIONS

The following steps should get you started.

  1. Create a file ending in .pm, such as newmeasure.pm.

  2. Declare the name of the package. This should be the same name as your filename (except for the .pm):

    package newmeasure;

  3. Load WordNet::Similarity, or a subclass of it, with 'use', and declare that your module is-a (inherits from) WordNet::Similarity by adding that class to the @ISA array in your module. If your measure uses information content, then you probably want to use WordNet::Similarity::ICFinder instead. If you are doing some type of path-finding, then you might want to use WordNet::Similarity::PathFinder. Both PathFinder and ICFinder are subclasses of WordNet::Similarity, so if you put one of them in your @ISA array, you don't need to list WordNet::Similarity as well.

    In our case, let's try making a new information content measure:

    use WordNet::Similarity::ICFinder;
    our @ISA = qw/WordNet::Similarity::ICFinder/;

  4. You do not need to write a constructor. The Similarity.pm module provides a 'new' method, which your module inherits and which does everything needed here.

  5. Write a getRelatedness method that actually computes the relatedness of two word senses. In our example here, we'll define relatedness as the average information content of the two input synsets.

    # a simple example
    sub getRelatedness {
      my $self = shift;
    
      # $wps1 and $wps2 need to be strings in
      # word#part_of_speech#sense format
      my $wps1 = shift;
      my $wps2 = shift;
    
      my $ref = $self->parseWps ($wps1, $wps2);
    
      # if ref is not a reference, that means an error has occurred;
      # parseWps will have already set the error level to non-zero
      # and generated an error string
      ref $ref or return $ref;
    
      # now from ref, get all the elements of the array
      my (undef, $pos1, undef, $offset1, undef, $pos2, undef, $offset2) = @$ref;
    
      my $score;
      # first we check to see if relatedness was already computed
      if ($self->{doCache}) { 
         $score = $self->fetchFromCache ($wps1, $wps2);
         defined $score and return $score;
      }
    
      my $wn = $self->{wn}; # get reference to WordNet::QueryData
    
      # here's where we do the real work of finding relatedness 
    
      my $ic1 = $self->IC ($offset1, $pos1);
      my $ic2 = $self->IC ($offset2, $pos2);
    
      $score = ($ic1 + $ic2) / 2;
    
      # if tracing is enabled, print some information to traceString
      if ($self->{trace}) {
          $self->{traceString} .= "IC(";
          $self->printSet ($pos1, 'offset', $offset1);
          $self->{traceString} .= ") = $ic1\n";
          $self->{traceString} .= "IC(";
          $self->printSet ($pos2, 'offset', $offset2);
          $self->{traceString} .= ") = $ic2\n";
      }
    
      $self->storeToCache ($wps1, $wps2, $score) if $self->{doCache};
    
      return $score;
    }
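
If you want to try the overall shape of steps 2 through 5 without a WordNet installation, the following self-contained sketch may help. It uses a hypothetical stand-in base class, Toy::Base, which is NOT WordNet::Similarity and only imitates the pieces the steps rely on: an inherited constructor, an IC lookup, and a cache. The class names, the constructor arguments, and the information content values are all made up for illustration.

```perl
use strict;
use warnings;

# --- Hypothetical stand-in for the base class (not WordNet::Similarity) ---
package Toy::Base;

# Constructor inherited by every measure; takes made-up IC values
# keyed by word#pos#sense strings.
sub new {
    my ($class, %ic) = @_;
    return bless { ic => {%ic}, doCache => 1, cache => {} }, $class;
}

# Look up the made-up information content of a sense string.
sub IC {
    my ($self, $sense) = @_;
    return $self->{ic}{$sense};
}

sub fetchFromCache {
    my ($self, $s1, $s2) = @_;
    return $self->{cache}{ join ' ', $s1, $s2 };
}

sub storeToCache {
    my ($self, $s1, $s2, $score) = @_;
    $self->{cache}{ join ' ', $s1, $s2 } = $score;
    return $score;
}

# --- The new measure: inherits new() and only adds getRelatedness() ---
package Toy::Average;

our @ISA = ('Toy::Base');

sub getRelatedness {
    my ($self, $s1, $s2) = @_;

    # Return the cached score if this pair was computed before.
    if ($self->{doCache}) {
        my $cached = $self->fetchFromCache($s1, $s2);
        return $cached if defined $cached;
    }

    # Relatedness here is the average of the two IC values.
    my $score = ($self->IC($s1) + $self->IC($s2)) / 2;

    $self->storeToCache($s1, $s2, $score) if $self->{doCache};
    return $score;
}

package main;

my $measure = Toy::Average->new('dog#n#1' => 4, 'cat#n#1' => 6);
print $measure->getRelatedness('dog#n#1', 'cat#n#1'), "\n";    # prints 5
```

The derived package never defines new; the method is found through @ISA, which is the same mechanism that lets a real measure rely on the constructor from Similarity.pm.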

SUMMARY

You should follow the same conventions for error handling and tracing as the other measure modules do. Be sure to support caching as well (as demonstrated above).

If you would like to contribute to the project, please see our SourceForge page: http://wn-similarity.sourceforge.net as well as our current todo list (in doc/todo.pod). We especially welcome contributions of new measures of relatedness!

SEE ALSO

intro.pod

Mailing list: http://groups.yahoo.com/group/wn-similarity

Project Home page: http://wn-similarity.sourceforge.net

AUTHORS

Ted Pedersen, University of Minnesota Duluth
tpederse at d.umn.edu

Siddharth Patwardhan, University of Utah, Salt Lake City
sidd at cs.utah.edu

Satanjeev Banerjee, Carnegie Mellon University, Pittsburgh
banerjee+ at cs.cmu.edu

Jason Michelizzi

COPYRIGHT

Copyright (c) 2005-2008, Ted Pedersen, Siddharth Patwardhan, Satanjeev Banerjee and Jason Michelizzi

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.