NAME

WordNet::BestStem -- get the best guess stem of a word.

VERSION

0.2.2

SYNOPSIS

use WordNet::BestStem qw( best_stem );

my $best = best_stem( 'roses', {V=>1} );

DESCRIPTION

This module assumes that the stem of a word is the form with the highest occurrence frequency in a text corpus. This is not always true, but for certain purposes it may be justifiable to treat the most frequent form as the stem.

best_stem finds a word's variant forms and returns the form (and part-of-speech) with the highest frequency according to ICFinder's "information content file", which comes by default with WordNet but can be customized.

ICFinder has frequency counts for the n and v parts-of-speech but not for a or r. When a or r is involved, the number of senses for each part-of-speech is used instead of the word/part-of-speech frequency to choose the form.
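The selection rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the module's actual code: the helper name pick_best_form and the candidate record layout (form, pos, fre, senses) are hypothetical. The counts for "rose" and "rise" are the ones shown in the usage examples of this document.

```perl
use strict;
use warnings;

# Hypothetical helper: choose one candidate from a list of
# { form, pos, fre, senses } records. Not part of WordNet::BestStem.
sub pick_best_form {
    my @cands = @_;
    return unless @cands;

    # If any candidate is a or r, no frequency count is available
    # for it, so rank every candidate by its number of senses instead.
    my $key = ( grep { $_->{pos} =~ /^[ar]$/ } @cands ) ? 'senses' : 'fre';

    my ($best) = sort { ( $b->{$key} // 0 ) <=> ( $a->{$key} // 0 ) } @cands;
    return $best;
}

# Counts from the usage examples in this document: rose n 5, rise v 17.
my $best = pick_best_form(
    { form => 'rose', pos => 'n', fre => 5 },
    { form => 'rise', pos => 'v', fre => 17 },
);
print "$best->{form} $best->{pos} $best->{fre}\n";    # rise v 17
```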

Alternatively, best_stem can use a custom word variant frequency table.
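As a rough sketch, a custom table could be built by counting word forms in your own text. The layout used here (lowercase word form mapped to its number of occurrences) is an assumption; this document does not specify the expected format, so check the module source before relying on it. The toy corpus is illustrative only.

```perl
use strict;
use warnings;

# Toy corpus; in practice this would be your own text collection.
my @corpus = (
    'he thinks that average salary rose in the last few years',
    'prices rise and wages rise',
);

# Assumed layout: lowercase word form => number of occurrences.
my %fre;
for my $line (@corpus) {
    $fre{ lc $_ }++ for $line =~ /\w+/g;
}

print "rise: $fre{rise}, rose: $fre{rose}\n";    # rise: 2, rose: 1

# With WordNet::BestStem installed, the table would then be passed
# through the FRE option:
#   my $stem = best_stem( 'rose', { FRE => \%fre } );
```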

METHODS

best_stem

Returns in list context the best guess stem form, part-of-speech, and frequency; returns in scalar context the stem form.
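The context-dependent return can be illustrated with Perl's wantarray. The function below is a hypothetical stand-in, not best_stem itself; its values are hard-coded from the "rose" usage example in this document.

```perl
use strict;
use warnings;

# Hypothetical stand-in for best_stem with a fixed result.
sub toy_best_stem {
    my @result = ( 'rise', 'v', 17 );    # stem, part-of-speech, frequency
    return wantarray ? @result : $result[0];
}

my ( $stem, $pos, $fre ) = toy_best_stem();    # list context
my $only_stem = toy_best_stem();               # scalar context

print "$stem $pos $fre\n";    # rise v 17
print "$only_stem\n";         # rise
```

Note that print supplies list context to its arguments, so wrap the call in scalar( ... ) when only the stem form is wanted.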

*Note: WordNet does not at present have variant forms for very high frequency words such as "what", "the", "would". best_stem returns an empty string in such cases.

Default options (case insensitive):

V     => 0,         # verbose. for debugging / checking
FRE   => undef,     # % ref to custom word variant frequency table

Usage:

use WordNet::BestStem qw( best_stem );

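# print supplies list context; join the returned list for readable output
print join( ' ', best_stem('misgivings') ), "\n";   # misgiving n 8
print join( ' ', best_stem('roses') ), "\n";        # rose n 5
print join( ' ', best_stem('rose') ), "\n";         # rise v 17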

Compared to WordNet::stem,

use WordNet::QueryData;
use WordNet::stem;

my $WN = WordNet::QueryData->new();
my $stemmer = WordNet::stem->new($WN);

print $stemmer->stemWord('misgivings'); # misgiving
print $stemmer->stemWord('roses');      # rose
print $stemmer->stemWord('rose');       # rose rise

Compared to Lingua::Stem::En,

use Lingua::Stem::En qw( stem );

$stems = stem( { -words => ['misgivings'] } );
print @$stems;                          # misgiv

$stems = stem( { -words => ['roses'] } );
print @$stems;                          # rose

$stems = stem( { -words => ['rose'] } );
print @$stems;                          # rose

deluxe_stems

Uses contextual information, i.e. the appearance of word forms in the paragraph/corpus, to help choose the stem form.

Default options (case insensitive):

V     => 0,
FRE   => undef,    # % ref to custom word variant frequency table
STEM  => undef,    # % ref to stem_of{string} table per best_stem

Usage:

use WordNet::BestStem qw( deluxe_stems );

my $stemmed_text = deluxe_stems \@text;

or in list context

  # ref to @, %, %, %
my ($stemmed, $stem_of, $stem_fre, $str_fre) = deluxe_stems \@paragraph;

For two paragraphs / sentences,

a) beautiful roses i would like a long stem rose
b) he thinks that average salary rose in the last few years

With deluxe_stems, applied to each paragraph in turn,

$a_ = deluxe_stems \@a;
print "@$a_\n";
  # beautiful rose i would like a long stem rose
$b_ = deluxe_stems \@b;
print "@$b_\n";
  # he think that average salary rise in the last few year

Compared to best_stem, applied word by word,

@a_ = map { scalar( best_stem $_ ) || $_ } @a;
print "@a_\n";
  # beautiful rose i would like a long stem rise
@b_ = map { scalar( best_stem $_ ) || $_ } @b;
print "@b_\n";
  # he think that average salary rise in the last few year
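One way such contextual disambiguation could work is sketched below. It is a toy illustration under assumptions, not the module's implementation: the helper toy_deluxe_stems and the %candidates table (surface form mapped to candidate stems, most frequent first) are hypothetical. An ambiguous word takes a candidate stem that is already supported by an unambiguous form in the same text, and otherwise falls back to the most frequent candidate.

```perl
use strict;
use warnings;

# Toy candidate table: surface form => candidate stems, ordered by
# frequency (most frequent first). Hypothetical data layout.
my %candidates = (
    roses  => ['rose'],
    rose   => [ 'rise', 'rose' ],
    thinks => ['think'],
    years  => ['year'],
);

# Hypothetical helper, not part of WordNet::BestStem.
sub toy_deluxe_stems {
    my ($words) = @_;

    # First pass: collect stems that are unambiguous in this text.
    my %seen;
    for my $w (@$words) {
        my $c = $candidates{$w} or next;
        $seen{ $c->[0] }++ if @$c == 1;
    }

    # Second pass: for ambiguous words, prefer a candidate supported by
    # the context; otherwise fall back to the most frequent candidate.
    my @out;
    for my $w (@$words) {
        my $c = $candidates{$w};
        if ( !$c ) { push @out, $w; next; }    # unknown word: keep as is
        my ($in_context) = grep { $seen{$_} } @$c;
        push @out, $in_context // $c->[0];
    }
    return \@out;
}

my @a = qw( beautiful roses i would like a long stem rose );
my @b = qw( he thinks that average salary rose in the last few years );

print "@{ toy_deluxe_stems( \@a ) }\n";
  # beautiful rose i would like a long stem rose
print "@{ toy_deluxe_stems( \@b ) }\n";
  # he think that average salary rise in the last few year
```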

DEPENDENCIES

WordNet  ( http://wordnet.princeton.edu )
WordNet::QueryData
WordNet::Similarity::ICFinder

AUTHOR

~~~~~~~~~~~~ ~~~~~ ~~~~~~~~ ~~~~~ ~~~ `` ><(((">

Copyright (C) 2009 Maggie J. Xiong < maggiexyz users.sourceforge.net >

All rights reserved. There is no warranty. You are allowed to redistribute this software / documentation under the same terms as Perl itself.