NAME

Lingua::Stem::Snowball - Perl interface to Snowball stemmers.

SYNOPSIS

use Lingua::Stem::Snowball;

my @lang = stemmers();

OO interface:

my $lang = 'en';
my $dict = Lingua::Stem::Snowball->new(lang => $lang);
# Test if $lang is correct
die $@ if ($@);
my $locale = 'C'; 

my $dict = Lingua::Stem::Snowball->new(lang => $lang, locale => $locale);
my $lemm = $dict->stem($word);
my $lemm = $dict->stem($word, \$is_stemmed);

my $dict = Lingua::Stem::Snowball->new();
$dict->lang($lang);
$dict->locale($locale);
my $lemm = $dict->stem($word);
my @lemm = $dict->stem(\@words);

Plain interface:

my $lemm = stem($lang, $word);
my $lemm = stem($lang, $word, $locale);
my $lemm = stem($lang, $word, $locale, \$is_stemmed);

DESCRIPTION

This module provides a unified Perl interface to the Snowball stemmers (http://snowball.tartarus.org) and supports a range of languages. It is written in C for performance and offers both an object-oriented and a plain functional interface.

The motivation for developing this module was to provide generic access to stemming algorithms for the OpenFTS project, a full-text search engine (http://openfts.sourceforge.net).

The module is very similar to Lingua::Stem. However, Lingua::Stem is written in pure Perl, whereas Lingua::Stem::Snowball is an XS binding to the Snowball stemmers.

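The following stemmers are available in each module (as of Lingua::Stem 0.70). In the tables below, L:S stands for Lingua::Stem and L:S:S for Lingua::Stem::Snowball: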

|----------------------------|
| Language   | L:S   | L:S:S |
|----------------------------|
| English    | y     | y     |
| French     | y     | y     |
| Spanish    | n     | y     |
| Portuguese | y     | y     |
| Italian    | y     | y     |
| German     | y     | y     |
| Dutch      | n     | y     |
| Swedish    | y     | y     |
| Norwegian  | y     | y     |
| Danish     | y     | y     |
| Russian    | n     | y     |
| Finnish    | n     | y     |
| Galician   | y     | n     |
|----------------------------|

Here is a small benchmark using the example files from the Snowball distribution (with no cache):

|---------------------------------------------------|
| Language | Unique |          Time (s)             | 
|          | words  | L:S:S | L:S:S | L:S   | L:S   | 
|          |        | @     | $     | @     | $     | 
|---------------------------------------------------|
| DA       | 23829  | 0.5   | 1.1   | 7.3   | 14.2  | 
| DE       | 35033  | 0.9   | 1.9   | 64.3  | 73.5  | 
| EN       | 30428  | 0.7   | 1.5   | 2.5   | 8.8   | 
| FR       | 20403  | 0.6   | 1.1   | 182.7 | 188.0 | 
| IT       | 35494  | 1.0   | 2.0   | 345.6 | 350.2 | 
| NO       | 20628  | 0.4   | 1.0   | 14.3  | 20.6  | 
| PT       | 32016  | 0.8   | 1.7   | 405.6 | 414.8 | 
| SV       | 30623  | 0.0   | 0.5   | 15.9  | 25.6  | 
|---------------------------------------------------|

Here is the same benchmark run on all unique words found in the Bible:

|---------------------------------------------------|
| EN       | 12718  | 0.3   | 0.7   | 1.0   | 3.6   | 
|---------------------------------------------------|

METHODS

$dict = Lingua::Stem::Snowball->new

Creates a new instance of the stemmer.

The constructor takes hash-style parameters. The following parameters are recognized:

lang: language (ISO code).

locale: locale.

my $stemmed = $dict->stem($word)

Returns the stemmed word for $word.

my @stemmed = $dict->stem(\@words)

Returns an array of the stemmed words contained in @words.
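As a minimal sketch of the two call styles above, assuming the module is installed and that 'en' is among the codes returned by stemmers(), the following stems a small list both ways. The word list is made up for illustration.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# Build an English stemmer; 'en' is the ISO code used in the SYNOPSIS.
my $dict = Lingua::Stem::Snowball->new(lang => 'en');

# Illustrative input only.
my @words = qw(running jumps easily);

# Array-reference form: one call stems the whole list.
my @stemmed = $dict->stem(\@words);

# Scalar form: one method call per word.
my @one_by_one = map { $dict->stem($_) } @words;

# Print each word next to its stem.
printf "%-10s => %s\n", $words[$_], $stemmed[$_] for 0 .. $#words;
```

Passing a reference lets a single call handle the whole list instead of one method call per word.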

$dict->lang([$lang])

Accessor for the lang parameter. If there is no stemmer for $lang, the language is not changed.
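The behaviour described above can be illustrated with a short sketch. The code 'zz' is a made-up value assumed to have no stemmer, and calling lang() without an argument is assumed to return the current language, as the accessor signature suggests.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $dict = Lingua::Stem::Snowball->new(lang => 'en');

# 'zz' is a hypothetical code with no stemmer behind it.
$dict->lang('zz');

# Per the description above, the language should still be 'en' here.
print "Current language: ", $dict->lang, "\n";
```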

$dict->locale([$locale])

Accessor for the locale parameter.

stemmers()

Returns a list of all languages for which a stemmer is available.
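One possible use of stemmers() is to check a language code before building a stemmer. The sketch below assumes the module is installed; $wanted stands in for a user-supplied value, and the function is called by its fully qualified name so that the example does not rely on what the module exports.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $wanted = 'en';   # hypothetical user-supplied language code

# Turn the list of available codes into a lookup table.
my %available = map { $_ => 1 } Lingua::Stem::Snowball::stemmers();

die "No Snowball stemmer for '$wanted'\n" unless $available{$wanted};

# Safe to construct now that the code is known to be supported.
my $dict = Lingua::Stem::Snowball->new(lang => $wanted);

print "Available: ", join(', ', sort keys %available), "\n";
```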

$dict->strip_apostrophes([1|0])

By default, the stemmer does not strip apostrophes for you. So, if you make the following call:

my @words = ('The', 'Ranger\'s', 'Digest');
my @stemmed = $dict->stem(\@words);

The result might not be what you expect, for example if the words come from split(' ') on a user's search query.

Stripping 's in Perl can be somewhat expensive, so you can let the stemmer do it in C:

my @words = ('The', 'Ranger\'s', 'Digest');
$dict->strip_apostrophes(1);
my @stemmed = $dict->stem(\@words);

This method strips 's (English) and l', d', ... (French).
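For comparison, here is a rough pure-Perl approximation of that stripping. The helper strip_apostrophes_pp is hypothetical and not part of this module, and the exact rules applied by the C code are not documented here, so the two regular expressions are only an assumption about the general idea: drop a trailing 's and a leading single-letter elision.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::Stem::Snowball.
sub strip_apostrophes_pp {
    my @out = @_;            # work on a copy, leave the caller's list alone
    for (@out) {
        s/'s\z//i;           # trailing possessive: Ranger's -> Ranger
        s/\A[a-z]'//i;       # leading elision: l'arbre -> arbre
    }
    return @out;
}

my @words   = ('The', "Ranger's", 'Digest', "l'arbre", "d'accord");
my @cleaned = strip_apostrophes_pp(@words);

print join(' ', @cleaned), "\n";   # The Ranger Digest arbre accord
```

Running two substitutions per word in Perl is the kind of cost that strip_apostrophes(1) is meant to avoid.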

REQUESTS & BUGS

Please report any requests, suggestions or bugs via the RT bug-tracking system at http://rt.cpan.org/ or by email to bug-Lingua-Stem-Snowball@rt.cpan.org.

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Stem-Snowball is the RT queue for Lingua::Stem::Snowball. Please check to see if your bug has already been reported.

COPYRIGHT

Copyright 2004-2005

Currently maintained by Fabien Potencier, fabpot@cpan.org

Original authors: Oleg Bartunov, oleg@sai.msu.su, Teodor Sigaev, teodor@stack.net

This software may be freely copied and distributed under the same terms and conditions as Perl.

Snowball files and stemmers are covered by the BSD license.

SEE ALSO

http://snowball.tartarus.org, Lingua::Stem