NAME

WordNet::Similarity::PathFinder - module to implement path finding methods (by node counting) for WordNet::Similarity measures of semantic relatedness

SYNOPSIS

use WordNet::QueryData;

my $wn = WordNet::QueryData->new;

use WordNet::Similarity::PathFinder;

my $obj = WordNet::Similarity::PathFinder->new ($wn);

my $result = $obj->parseWps($wps1, $wps2);

my @paths = $obj->getShortestPath("dog#n#1", "cat#n#1", "n", "wps");

my ($length, $path) = @{shift @paths};

defined $path or die "No path between synsets";

my @paths = $obj->getAllPaths("worship#v#1", "adore#v#1", "v", "wps");

my ($length, $path) = @{shift @paths};

defined $path or die "No path between synsets";

my @paths = $obj->getShortestPath("02895418", "02724985", "n", "offset");

my ($length, $path) = @{shift @paths};

defined $path or die "No path between synsets";

DESCRIPTION

Introduction

This class is derived from (i.e., is a sub-class of) WordNet::Similarity.

The methods in this module are useful for finding paths between concepts in WordNet's 'is-a' taxonomies. Concept A is-a concept B if, and only if, B is a hypernym of A or A is in the hypernym tree of B. N.B., only nouns and verbs have hypernyms.

The methods that find path lengths (such as getShortestPath() and getAllPaths() compute the lengths using node-counting not edge-counting. In general, the length of a path using node-counting will always be one more than the length using edge-counting. For example, if concept A is a hyponym of concept B, then the path length between A and B using node-counting is 2, but the length using edge-counting is 1. Likewise, the path between A and A is 1 using node-counting and 0 using edge-counting.

Methods

This module inherits all the methods of WordNet::Similarity. Additionally, the following methods are also defined.

Public methods

$measure->setPosList()

Specifies the parts of speech that measures derived from this module support (namely, nouns and verbs).

parameters: none

returns: true

$self->traceOptions()

Overrides method of same name in WordNet::Similarity. Prints module-specific configuration options to the trace string (if tracing is on). PathFinder supports one module specific option: rootNode.

Parameters: none

returns: nothing

$measure->parseWps($synset1, $synset2)

parameters: synset1, synset2 -- two synsets in wps format

returns: a reference to an array, WordNet::Similarity::UNRELATED, or undef

Overrides the parseWps() method in WordNet::Similarity in order to run additional checks, but calls WordNet::Similarity::parseWps() to get those checks accomplished as well. Thus, this method does everything that WordNet::Similarity::parseWps does.

quote from WordNet::Similarity::parseWps:

This method checks the format of the two input synsets by calling validateSynset() for each synset.

If the synsets are in wps format, a reference to an array will be returned. This array has the form [$word1, $pos1, $sense1, $offset1, $word2, $pos2, $sense2, $offset2] where $word1 is the word part of $wps1, $pos1, is the part of speech of $wps1, $sense1 is the sense from $wps. $offset1 is the offset for $wps1.

If an error occurs (such as a synset being poorly-formed), then undef is returned, the error level is set to non-zero, and an error message is appended to the error string.

In addition, if the two synsets are from different parts of speech, then WordNet::Similarity::UNRELATED is returned, the error level is set to 1, and a message is appended to the error string.

If either synset is not a noun or a verb, then the error level is set to 1, a message is appended to the error string, and undef is returned.

If the synsets are in wps format, a reference to an array will be returned. This array has the form [$word1, $pos1, $sense1, $offset1, $word2, $pos2, $sense2, $offset2].

$measure->getShortestPath($synset1, $synset2, $pos, $mode)

Given two input synsets, returns the shortest path between the two synsets.

Parameters: two synsets, a part-of-speech, and a mode indicator (i.e., the string 'offset' or 'wps'). If the mode is 'offset', then the synsets should be WordNet offsets. If the mode is 'wps', then the synsets should be in word#pos#sense format.

Returns: a list of references to arrays. Each array has the form ($path_length, $path_ref) where $path_ref is a reference to an array whose elements are the synsets along the shortest path between the two input synsets. There will be as many array references returned as there are shortest paths between the synsets. That is, there will be no arrays returned if there is no path between the synsets, and there will be at least one array returned if there is a path between the synsets. If there are multiple paths tied for being shortest in length, then all those paths are returned (hence, this is why multiple array references can be returned).

Upon error, returns undef, sets the error level to non-zero, and appends a message to the error string.

$measure->getAllPaths($synset1, $synset2, $pos, $mode)

Given two input synsets, returns all the paths between the two synsets.

Parameters: a reference to the object, two synsets, a part-of-speech, and a mode indicator (the string 'offset' or 'wps').

If the mode is 'offset', then the synsets should be WordNet offsets. If the mode is 'wps', then they should be strings in word#pos#sense format.

Returns: A list of all paths, sorted by path length in ascending order. The format for each item in the list is a reference to an array that has the format: [$top, $length, [@synsets_list]] where @synset_list is a list of synsets along the path (including the two input synsets)

Returns undef on error.

$measure->validateSynset($synset)

parameters: synset -- a string in word#pos#sense format

returns: a list or undef on error

This method overrides the method of the same name in WordNet::Similarity to provide additional behavior but calls WordNet::Similarity::validateSynset to accomplish that method's behavior. Thus, this method does everything that WordNet::Similarity::validateSynset does.

quote from WordNet::Similarity::validateSynset:

This method does the following:

  1. Verifies that the synset is well-formed (i.e., that it consists of three parts separated by #s, the pos is one of {n, v, a, r} and that sense is a natural number). A synset that matches the pattern '[^\#]+\#[nvar]\#\d+' is considered well-formed.

  2. Checks if the synset exists by trying to find the offset for the synset

This method, however, has a slightly different return value. Instead of merely breaking the synset into three parts, it returns the "safe" form of the synset. That is, if a synset has multiple word senses, this method returns the first word sense in that synset (this is so that other path-finding methods work properly). For example, if the input to this method is auto#n#1, the return value is ('car', 'n', 1, 2853224) since the sense 'car#n#1' is the first member of the synset to which 'auto#n#1' belongs.

If any of these tests fails, then the error level is set to non-zero, a message is appended to the error string, and undef is returned.

Private methods

$measure->_getHypernymTrees($synset, $pos, $mode)

This method takes as input a synset and returns a list of references to arrays where these arrays are paths from the input synset to the top of the taxonomy (*Root*#[nv]#1 if the root node is on).

Parameters: a synset, a part-of-speech, and a mode. The mode must be either the string 'wps' or 'offset'. If the mode is 'wps', then the synset must be in wps format; otherwise, it must be an offset.

Returns: a list of references to arrays. These arrays are paths (hypernym trees).

getLCSbyPath($synset1, $synset2, $pos, $mode)

Given two input synsets, finds the least common subsumer (LCS) of them. If there are multiple candidates for the LCS (due to multiple inheritance), the LCS that results in the shortest path between in input concepts is chosen.

Parameters: two synsets, a part of speech, and a mode.

Returns: a list of references to arrays where each array has the from ($lcs, $pathlength). $pathlength is the length of the path between the two input concepts. There can be multiple LCSs returned if there are ties for the shortest path between the two synsets. Returns undef on error.

$measure->_getSubsumerFromTrees($treeref1, $treeref2, $mode)

This subroutine returns takes two trees as produced by getHypernymTrees and returns the most specific subsumer from them.

Parameters: two references to arrays, and a string indicating mode ('wps' or 'offset').

Returns: the subsumer or undef

getDepth()

This method is non-functional and likely to be moved to a different module soon.

Usage

The semantic relatedness modules in this distribution are built as classes. The classes define four methods that are useful in finding relatedness values for pairs of synsets.

new()
getRelatedness()
getError()
getTraceString()

Typical Usage Examples

To create an object of the Resnik measure, we would have the following lines of code in the Perl program.

use WordNet::Similarity::path;
$object = WordNet::Similarity::path->new($wn, '~/path.conf');

The reference of the initialized object is stored in the scalar variable '$object'. '$wn' contains a WordNet::QueryData object that should have been created earlier in the program. The second parameter to the 'new' method is the path of the configuration file for the path measure. If the 'new' method is unable to create the object, '$object' would be undefined. This, as well as any other error/warning may be tested.

die "Unable to create path object.\n" unless defined $object;
($err, $errString) = $object->getError();
die $errString."\n" if($err);

To create a Leacock-Chodorow measure object, using default values, i.e. no configuration file, we would have the following:

use WordNet::Similarity::lch;
$measure = WordNet::Similarity::lch->new($wn);

To find the semantic relatedness of the first sense of the noun 'car' and the second sense of the noun 'bus' using the path measure, we would write the following piece of code:

$relatedness = $object->getRelatedness('car#n#1', 'bus#n#2');

To get traces for the above computation:

print $object->getTraceString();

However, traces must be enabled using configuration files. By default traces are turned off.

Discussion

Many of the methods in this module can work with either offsets or wps strings internally. There are several interesting consequences of each mode.

  1. An offset is not a unique identifier for a synset, but neither is a wps string. An offset only indicates a byte offset in one of the WordNet data files (data.noun, data.verb, etc. on Unix-like systems). An offset along with a part of speech, however, does uniquely identify a synset.

    A word#pos#sense string, on the other hand, is the opposite extreme. A word#pos#sense string is an identifier for a unique word sense. A synset can have several word senses in it (i.e., a synset is a set of word senses that are synonymous). The synset {beer_mug#n#1, stein#n#1} has two word senses. The wps strings 'beer_mug#n#1' and 'stein#n#1' can both be used to refer to the synset. For simplicity, we usually just use the first wps string when referring to the synset. N.B., the wps representation was developed by WordNet::QueryData.

  2. Early versions of WordNet::Similarity::* used offsets internally for finding paths, hypernym trees, subsumers, etc. The module WordNet::QueryData that is used by Similarity, however, accepts only wps strings as input to its querySense method, which is used to find hypernyms. We have found that it is more efficient (faster) to use wps strings internally.

AUTHORS

Ted Pedersen, University of Minnesota Duluth
tpederse at d.umn.edu

Jason Michelizzi, University of Minnesota Duluth
mich0212 at d.umn.edu

Siddharth Patwardhan, University of Utah, Salt Lake City
sidd at cs.utah.edu

BUGS

None.

SEE ALSO

WordNet::Similarity(3) WordNet::Similarity::path(3) WordNet::Similarity::lch(3) WordNet::Similarity::wup(3)

COPYRIGHT

Copyright (c) 2005, Ted Pedersen, Siddharth Patwardhan and Jason Michelizzi

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.

Note: a copy of the GNU General Public License is available on the web at http://www.gnu.org/licenses/gpl.txt and is included in this distribution as GPL.txt.