NAME

Search::FreeText::LexicalAnalysis - basic lexical analyser for the open search system

DESCRIPTION

An open lexical analysis processor that you can customise either by subclassing it or by adding your own filters. Each filter's process method is called with a reference to an array of words and returns a reference to a new array of words. The base class Search::FreeText::LexicalAnalysisProcess defines this protocol for each step in the pipeline.
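
As a minimal sketch of the filter protocol described above, the following hypothetical filter lowercases every word. It relies only on what the description states: process receives a reference to an array of words and returns a reference to a new array. The package name MyLexicalAnalysis::Lowercase and its constructor are illustrative assumptions. A real filter would presumably subclass Search::FreeText::LexicalAnalysisProcess, which is left out here so that the sketch runs without the distribution installed.

```perl
use strict;
use warnings;

# Hypothetical filter: the package name and constructor are assumptions,
# not part of the Search::FreeText distribution.
package MyLexicalAnalysis::Lowercase;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Takes a reference to an array of words and returns a reference
# to a new array, leaving the input array untouched.
sub process {
    my ($self, $words) = @_;
    return [ map { lc $_ } @$words ];
}

package main;

my $filter = MyLexicalAnalysis::Lowercase->new();
my $input  = [ 'Open', 'SEARCH', 'system' ];
my $result = $filter->process($input);
print join(' ', @$result), "\n";
```

Returning a new array rather than modifying the input in place keeps each step independent of the others, which is what lets filters be reordered or replaced.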

SYNOPSIS

 # Selects default filters
 my $lexicalizer = Search::FreeText::LexicalAnalysis->new();
 # Selects named filters only
 $lexicalizer = Search::FreeText::LexicalAnalysis->new(
     -filters => [ qw(MyLexicalAnalysis::Heuristics
                      Search::FreeText::LexicalAnalysis::Tokenize
                      Search::FreeText::LexicalAnalysis::Stop
                      Search::FreeText::LexicalAnalysis::Stem) ]);

 my $words = $lexicalizer->process($text);

METHODS

new Search::FreeText::LexicalAnalysis( -search => searchmod [, -filters => FilterList] );

This is the main constructor for the lexical analyser. The -search parameter supplies the search object instance, which is passed in turn to each of the filters so that they can look inside it for any additional data they need.

You can use the -filters initialisation key to pass a list of filter classes. By default, the set of filters implements stemming, a reasonably complete stop list, and a few heuristics that tighten up the searching. The order of the filters is fairly important, and looks a bit like this:

Heuristics

Pattern-level heuristics that work on whole strings, implemented by default by Search::FreeText::LexicalAnalysis::Heuristics.

Tokenize

Splits a set of strings into an array of words. Implemented by default by Search::FreeText::LexicalAnalysis::Tokenize. Before this, strings represent documents; after this, they represent words, which is why its position in the list of filters is important.

Stop

Passes the array of words through a stop list filter, removing words that are likely to be irrelevant. Implemented by default by Search::FreeText::LexicalAnalysis::Stop.

Stem

Passes the array of words through a stemmer. Implemented by default by Search::FreeText::LexicalAnalysis::Stem, which in turn uses Lingua::Stem.

$self->initialize();

Initializes the lexical analyser, loading any modules that are needed for the list of filters.
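
One plausible way to load filter modules from a list of class names is sketched below. This is not taken from the module's source: the helper name load_filters is hypothetical, and two core modules stand in for filter classes so that the sketch runs without Search::FreeText installed.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the module's own code):
# loads each class named in the list, in order.
sub load_filters {
    my (@classes) = @_;
    for my $class (@classes) {
        # Convert a class name to a file path: Foo::Bar -> Foo/Bar.pm
        (my $file = "$class.pm") =~ s{::}{/}g;
        require $file;    # dies if the module cannot be found
    }
    return scalar @classes;
}

# Core modules stand in for filter classes in this sketch.
my $loaded = load_filters(qw(List::Util File::Spec));
print "loaded $loaded modules\n";
```

Loading at initialisation time means a misspelt filter class fails early, before any text is processed.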

$self->process(words...);

Passes the list of words to the filters as a pipeline. The array of words usually starts as a single string containing all the words, and one of the filters (Tokenize) turns this into an array of individual words. This allows some processing before words are split, as well as the usual stemming and stoplisting afterwards.
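
The pipeline behaviour described above can be sketched as follows. The three stages are simplified stand-ins for Tokenize, Stop, and Stem, written as plain code references. The stop list, the trailing-"s" rule, and the run_pipeline helper are all assumptions made for illustration and do not reflect the behaviour of the real filters.

```perl
use strict;
use warnings;

# Each stage takes an array reference and returns a new array reference.
my @pipeline = (
    # Stand-in for Tokenize: split every string into lowercase words.
    sub {
        my ($strings) = @_;
        return [ map { split ' ', lc $_ } @$strings ];
    },
    # Stand-in for Stop: drop words found in a tiny illustrative stop list.
    sub {
        my ($words) = @_;
        my %stop = map { $_ => 1 } qw(the a of);
        return [ grep { !$stop{$_} } @$words ];
    },
    # Stand-in for Stem: strip a trailing "s" (not a real stemmer).
    sub {
        my ($words) = @_;
        return [ map { (my $w = $_) =~ s/s\z//; $w } @$words ];
    },
);

# Feed the output of each stage into the next, in list order.
sub run_pipeline {
    my (@strings) = @_;
    my $current = [@strings];
    $current = $_->($current) for @pipeline;
    return $current;
}

my $words = run_pipeline('The filters of the lexical analyser');
print join(' ', @$words), "\n";
```

Before the first stage the array holds a single string; after it, the array holds individual words, so any stage that expects words has to come later in the list.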

AUTHOR

Stuart Watt <S.N.K.Watt@rgu.ac.uk>

Copyright (c) 2003 The Robert Gordon University. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
