NAME

Search::FreeText::LexicalAnalysis::Heuristics - lexical analysis heuristics

DESCRIPTION

A pseudo-filter that runs before the real lexical analysis system. It can perform full-text substitutions and corrections on the free text, and exists mainly to handle a few minor corrections and linguistic issues that could otherwise break the later stages. The main issue it handles is prefixes, which are sometimes attached with a "-" character and sometimes without. This filter normalises them.

SYNOPSIS

use Search::FreeText::LexicalAnalysis::Heuristics;

my $heuristics = Search::FreeText::LexicalAnalysis::Heuristics->new();
my $words = $heuristics->process($oldwords);

METHODS

$self->initialize();

Called when the lexicon system is initialised. This method currently does very little, although it could be extended to compile and cache whatever the heuristics need if that proved worthwhile.

$self->process($oldwords);

Called with a reference to an array of strings (in practice, a single string). The heuristics are applied to this text, which can then be tokenised for further lexical processing.

Heuristics applied include:

  • Convert words with a few common hyphenated prefixes (e.g. re-, pre-) into complete words. This is useful where the prefix affects the sense of the word (other prefixes don't to the same extent) and where we don't want the prefix treated as a separate word. For example, "re-cycled" is the same as "recycled", not "re cycled". In comparison, "case-based" should be treated as "case based", not as "casebased".
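The prefix heuristic described above might be sketched roughly as follows. This is an illustration only, not the module's actual code: the prefix list, the pattern, and the join_prefixes function name are all assumptions made for the example. It removes the hyphen only after a listed prefix and leaves every other hyphen untouched, on the assumption that later stages deal with those.

```perl
use strict;
use warnings;

# Hypothetical prefix list; the module's real list is not documented here.
my @PREFIXES = qw(re pre);

# Build one case-insensitive pattern: a listed prefix at a word boundary,
# followed by a hyphen and then a letter.
my $alternation = join '|', map { quotemeta($_) } @PREFIXES;
my $PREFIX_RE   = qr/\b($alternation)-(?=[[:alpha:]])/i;

# Hypothetical helper: takes a reference to an array of strings and returns
# a reference to a new array, with the hyphen removed after listed prefixes.
# The input array is not modified.
sub join_prefixes {
    my ($oldwords) = @_;
    my @words;
    for my $text (@$oldwords) {
        (my $fixed = $text) =~ s/$PREFIX_RE/$1/g;
        push @words, $fixed;
    }
    return \@words;
}

my $words = join_prefixes(["We re-cycled the case-based pre-processing design."]);
print "$words->[0]\n";
# prints: We recycled the case-based preprocessing design.
```

Anchoring the pattern at a word boundary keeps it from firing inside longer words, so the "se-" in "case-based" is never mistaken for a prefix.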

AUTHOR

Stuart Watt <S.N.K.Watt@rgu.ac.uk>

Copyright (c) 2003 The Robert Gordon University. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.