NAME
Lingua::Stem::Snowball - Perl interface to Snowball stemmers.
SYNOPSIS
use Lingua::Stem::Snowball;
my @lang = stemmers();
OO interface:
my $lang = 'en';
my $dict = Lingua::Stem::Snowball->new(lang => $lang);
# Test if $lang is correct
die $@ if ($@);
my $locale = 'C';
my $dict = Lingua::Stem::Snowball->new(lang => $lang, locale => $locale);
my $lemm = $dict->stem($word);
my $lemm = $dict->stem($word, \$is_stemmed);
my $dict = Lingua::Stem::Snowball->new();
$dict->lang($lang);
$dict->locale($locale);
my $lemm = $dict->stem($word);
my @lemm = $dict->stem(\@words);
Plain interface:
my $lemm = stem($lang, $word);
my $lemm = stem($lang, $word, $locale);
my $lemm = stem($lang, $word, $locale, \$is_stemmed);
DESCRIPTION
This module provides a unified Perl interface to the Snowball stemmers (http://snowball.tartarus.org) and supports several languages. It is written in C for high performance and provides both an OO interface and a plain (functional) interface.
This module was developed to give the OpenFTS project, a full-text search engine (http://openfts.sourceforge.net), generic access to stemming algorithms.
The module is very similar to Lingua::Stem. However, Lingua::Stem is written in pure Perl, whereas Lingua::Stem::Snowball is an XS version of the Snowball stemmers.
The following stemmers are available (as of Lingua::Stem 0.70). L:S stands for Lingua::Stem and L:S:S for Lingua::Stem::Snowball:
|--------------------------|
| Language   | L:S | L:S:S |
|--------------------------|
| English    |  y  |   y   |
| French     |  y  |   y   |
| Spanish    |  n  |   y   |
| Portuguese |  y  |   y   |
| Italian    |  y  |   y   |
| German     |  y  |   y   |
| Dutch      |  n  |   y   |
| Swedish    |  y  |   y   |
| Norwegian  |  y  |   y   |
| Danish     |  y  |   y   |
| Russian    |  n  |   y   |
| Finnish    |  n  |   y   |
| Galician   |  y  |   n   |
|--------------------------|
Here is a small benchmark using the example files from the Snowball distribution (with no cache):
|---------------------------------------------------|
| Language | Unique |           Time (s)            |
|          | words  | L:S:S | L:S:S |  L:S  |  L:S  |
|          |        |   @   |   $   |   @   |   $   |
|---------------------------------------------------|
| DA       |  23829 |   0.5 |   1.1 |   7.3 |  14.2 |
| DE       |  35033 |   0.9 |   1.9 |  64.3 |  73.5 |
| EN       |  30428 |   0.7 |   1.5 |   2.5 |   8.8 |
| FR       |  20403 |   0.6 |   1.1 | 182.7 | 188.0 |
| IT       |  35494 |   1.0 |   2.0 | 345.6 | 350.2 |
| NO       |  20628 |   0.4 |   1.0 |  14.3 |  20.6 |
| PT       |  32016 |   0.8 |   1.7 | 405.6 | 414.8 |
| SV       |  30623 |   0.0 |   0.5 |  15.9 |  25.6 |
|---------------------------------------------------|
Here is the same benchmark with all unique words found in the Bible:
|---------------------------------------------------|
| EN       |  12718 |   0.3 |   0.7 |   1.0 |   3.6 |
|---------------------------------------------------|
METHODS
- $dict = Lingua::Stem::Snowball->new
Creates a new instance of the stemmer.
The constructor takes hash-style parameters. The following parameters are recognized:
lang: the language, given as an ISO code (for example 'en').
locale: the locale (for example 'C').
- my $stemmed = $dict->stem($word)
Returns the stemmed word for $word.
- my @stemmed = $dict->stem(\@words)
Returns an array of the stemmed words contained in @words.
- $dict->lang([$lang])
Accessor for the lang parameter. If there is no stemmer for $lang, the language is not changed.
- $dict->locale([$locale])
Accessor for the locale parameter.
- stemmers()
Returns a list of all languages for which a stemmer is available.
- $dict->strip_apostrophes([1|0])
By default, the stemmer does not strip apostrophes for you. So, if you make the following call:
my @words = ('The', 'Ranger\'s', 'Digest');
my @stemmed = $dict->stem(\@words);
The result might not be what you expect (for example, if you split(' ') a user's search entry).
Stripping 's in Perl can be a little expensive, so you can let the stemmer do it in C:
my @words = ('The', 'Ranger\'s', 'Digest');
$dict->strip_apostrophes(1);
my @stemmed = $dict->stem(\@words);
This method strips 's (English) and l', d', ... (French).
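For comparison, here is a minimal pure-Perl sketch of the kind of preprocessing that strip_apostrophes is meant to replace. The helper strip_apostrophes_pp is hypothetical and is not part of Lingua::Stem::Snowball. Its two regular expressions only approximate the behavior described above and may not match what the C code does in every case.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of the module.
# Works on a copy of the input list and returns the cleaned words.
sub strip_apostrophes_pp {
    my @words = @_;
    for my $w (@words) {
        $w =~ s/'s\z//i;        # English possessive: Ranger's -> Ranger
        $w =~ s/\A[a-z]'//i;    # French elision:     l'arbre  -> arbre
    }
    return @words;
}

my @clean = strip_apostrophes_pp('The', "Ranger's", 'Digest', "l'arbre");
print join(' ', @clean), "\n";    # prints "The Ranger Digest arbre"
```

Running two substitutions per word in Perl is the cost that strip_apostrophes(1) avoids by doing the work inside the stemmer.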
REQUESTS & BUGS
Please report any requests, suggestions or bugs via the RT bug-tracking system at http://rt.cpan.org/ or by email to bug-Lingua-Stem-Snowball@rt.cpan.org.
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Stem-Snowball is the RT queue for Lingua::Stem::Snowball. Please check whether your bug has already been reported.
COPYRIGHT
Copyright 2004-2005
Currently maintained by Fabien Potencier, fabpot@cpan.org
Original authors: Oleg Bartunov, oleg@sai.msu.su, and Teodor Sigaev, teodor@stack.net
This software may be freely copied and distributed under the same terms and conditions as Perl.
Snowball files and stemmers are covered by the BSD license.
SEE ALSO
http://snowball.tartarus.org, Lingua::Stem