NAME

Search::Tokenizer - Decompose a string into tokens (words)

SYNOPSIS

# generic usage
use Search::Tokenizer;
my $tokenizer = Search::Tokenizer->new(
   regex     => qr/.../,
   filter    => sub { ... },
   stopwords => {word1 => 1, word2 => 1, ... },
   lower     => 1,
 );
my $iterator = $tokenizer->($string);
while (my ($term, $len, $start, $end, $index) = $iterator->()) {
  ...
}

# usage for DBD::SQLite (with builtin tokenizers: word, word_locale,
#   word_unicode, unaccent)
use Search::Tokenizer;
$dbh->do("CREATE VIRTUAL TABLE t "
        ."  USING fts3(tokenize=perl 'Search::Tokenizer::unaccent')");

DESCRIPTION

This module builds an iterator function that progressively extracts terms from a given input string. Terms are defined by a regular expression (for example \w+). Extraction of terms relies on Perl's builtin "global match" operator (the /g flag) and is therefore quite efficient.

Before being returned to the caller, terms may be filtered by an auxiliary function, for performing tasks such as stemming or stopword elimination.

A tokenizer returned from the new method is a code reference, not a regular Perl object. To use the tokenizer, just call it with a string to parse: this returns another code reference, which works as an iterator. Each call to the iterator returns the next term from the string, until the string is exhausted.
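For example, a minimal end-to-end sketch of this calling pattern; the values shown in comments follow the tuple layout documented below:

use Search::Tokenizer;

my $tokenizer = Search::Tokenizer->new(regex => qr/\w+/);
my $iterator  = $tokenizer->("Hello, world");
while (my ($term, $len, $start, $end, $index) = $iterator->()) {
  print "$term ($len chars) at [$start, $end], position $index\n";
  # prints: hello (5 chars) at [0, 5], position 0
  #         world (5 chars) at [7, 12], position 1
}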

This API was explicitly designed for integrating Perl with the FTS3 fulltext search engine in DBD::SQLite; however, the API is general enough to be useful for other purposes, which is why it is published in its own, separate distribution.

METHODS

Creating a tokenizer

my $tokenizer = Search::Tokenizer->new($regex);
my $tokenizer = Search::Tokenizer->new(%options);

Builds a new tokenizer, returned as a code reference. The first syntax, with a single Regexp argument, is a shorthand for ->new(regex => $regex). The second syntax, with named arguments, accepts the following options:

regex => $regex

$regex is a compiled regular expression that specifies how to match a term; that regular expression should not match the empty string (otherwise the tokenizer would enter an infinite loop). The default is qr/\p{Word}+/. Here are some examples of more advanced regexes:

# perl's basic notion of "word"
$regex = qr/\w+/;

# take 'locale' into account
$regex = do {use locale; qr/\w+/}; 

# words like "don't", "it's" are treated as a single term
$regex = qr/\w+(?:'\w+)?/;

# same thing but also with internal hyphens like "fox-trot"
$regex = qr/\w+(?:[-']\w+)?/;

lower => $bool

If true, the term returned by the $regex is converted to lowercase (or more precisely: is "case-folded" through "fc" in Unicode::CaseFold). This option is activated by default.
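For example, a minimal sketch that keeps the original case:

my $tokenizer = Search::Tokenizer->new(lower => 0);
my $iterator  = $tokenizer->("Perl API");
my $term      = $iterator->();  # "Perl", not "perl"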

filter => $filter

$filter is a reference to a function that may modify or cancel a term before it is returned to the caller. The filter takes a single argument (the term) and returns a scalar (the modified term). If the value returned from the filter is empty, the term is canceled.
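For example, a hypothetical filter that cancels terms shorter than three characters and naively strips a trailing 's' from the others:

my $tokenizer = Search::Tokenizer->new(
  filter => sub {
    my $term = shift;
    return undef if length($term) < 3;  # empty result: term is canceled
    $term =~ s/s$//;                    # very naive plural removal
    return $term;
  },
);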

filter_in_place => $filter

Like filter, except that the filtering function directly modifies the term in its $_[0] argument instead of returning a new term. This is useful for example when building a filter from Lingua::Stem::Snowball or from Text::Transliterator::Unaccent.
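For example, here is a sketch of a stemming tokenizer, assuming Lingua::Stem::Snowball's stem_in_place method, which modifies the words in the arrayref it receives:

use Lingua::Stem::Snowball;

my $stemmer   = Lingua::Stem::Snowball->new(lang => 'en');
my $tokenizer = Search::Tokenizer->new(
  # @_ holds the single term passed by the tokenizer; its elements
  # are aliases, so stemming them in place modifies the term itself
  filter_in_place => sub { $stemmer->stem_in_place(\@_) },
);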

stopwords => $hashref

The keys in $hashref are terms to cancel (usually: common terms for which indexing would consume lots of resources with little added value). Values in the hash should evaluate to true. Lists of stopwords for various languages may be found in the Lingua::StopWords module. Stopword filtering is applied after the filter or filter_in_place function (if any).

Whenever a term is canceled through the filter or stopwords options, the tokenizer does not return that term to the caller, but nevertheless remembers the canceled position: so for example when tokenizing "Once upon a time" with

$tokenizer = Search::Tokenizer->new(
   stopwords => Lingua::StopWords::getStopWords('en')
);

we get the term sequence

("upon", 4,  5,  9, 1)
("time", 4, 12, 16, 3)

where terms "once" and "a" in positions 0 and 2 have been canceled, so the only remaining terms are in positions 1 and 3.

Creating an iterator

my $iterator = $tokenizer->($text);

# loop over terms ..
while (my $term = $iterator->()) {
  work_with_term($term);
}

# .. or loop over terms with detailed information
while (my @term_details = $iterator->()) {
  work_with_details(@term_details); # ($term, $len, $start, $end, $index)
}

The tokenizer takes one string argument and returns an iterator. The iterator takes no arguments; each call returns the next term from the string, until the string is exhausted, at which point the iterator returns an empty result.

If called in scalar context, the iterator returns just a string; if called in list context, it returns a tuple composed of:

$term

the term (after filtering);

$len

the length of this term;

$start

the starting offset in the string where this term was found;

$end

the end offset. This is also the place where the search for the next term will start;

$index

the position of this term within the string, starting at 0.

Length and start/end offsets are computed in characters, not in bytes. Note for SQLite users: the C layer in SQLite needs byte values, but the conversion is automatically taken care of by the C implementation in DBD::SQLite.

Beware that ($end - $start) is the length of the original extracted term, while $len is the length of the final $term, after filtering; both lengths may differ, especially if stemming is being applied.
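For example, with a hypothetical filter that crudely strips a final "ing":

my $tokenizer = Search::Tokenizer->new(
  filter => sub { my $t = shift; $t =~ s/ing$//; $t },
);
my ($term, $len, $start, $end, $index) = $tokenizer->("walking")->();
# $term is "walk", $len is 4, but $end - $start is 7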

BUILTIN TOKENIZERS

For convenience, the following tokenizers are builtin:

Search::Tokenizer::word

Terms are "words" according to Perl's notion of \w+.

Search::Tokenizer::word_locale

Terms are "words" according to Perl's notion of \w+ under use locale.

Search::Tokenizer::word_unicode

Terms are "words" according to Unicode's notion of \p{Word}+.

Search::Tokenizer::unaccent

Like Search::Tokenizer::word_unicode, but filtered through Text::Transliterator::Unaccent to replace all accented characters by their base character.

These builtin tokenizers may take the same arguments as new(): for example

use Search::Tokenizer;
my $tokenizer = Search::Tokenizer::unaccent(lower => 0, stopwords => ...);
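As another illustration, here is a minimal sketch of the unaccent tokenizer at work (assuming Text::Transliterator::Unaccent is installed):

my $tokenizer = Search::Tokenizer::unaccent();
my $iterator  = $tokenizer->("déjà vu");
while (my $term = $iterator->()) {
  print "$term\n";   # prints "deja", then "vu"
}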

UNROLLING THE ITERATOR

unroll

my @tokens = Search::Tokenizer::unroll($iterator, $no_details);

This utility function returns the list of all tokens obtained from repeated calls to the $iterator. The $no_details argument is optional; if true, the results are just strings, instead of tuples with positional information.
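For example, a minimal sketch:

use Search::Tokenizer;

my $tokenizer = Search::Tokenizer::word();
my $iterator  = $tokenizer->("Once upon a time");
my @words     = Search::Tokenizer::unroll($iterator, 1);
# @words is ("once", "upon", "a", "time")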

SEE ALSO

DBD::SQLite, Lingua::StopWords, Lingua::Stem::Snowball,
Text::Transliterator::Unaccent, Unicode::CaseFold.

AUTHOR

Laurent Dami, <dami@cpan.org>

LICENSE AND COPYRIGHT

Copyright 2010, 2021 Laurent Dami.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.