NAME

Lingua::BioYaTeA::PreProcessing - Perl extension for preprocessing BioYaTeA input.

SYNOPSIS

use Lingua::BioYaTeA::PreProcessing;

$preProc = Lingua::BioYaTeA::PreProcessing->new();
open($fh, ">t/example_output_preprocessing-new.ttg") or ($fh = *STDERR);
$preProc->process_file("t/example_input_preprocessing.ttg", $fh);
close($fh);

DESCRIPTION

The module implements an extension for the pre-processing of the TreeTagger output in order to improve the extraction of both terms containing prepositional phrases (with TO and AT prepositions) and terms containing participles (past participles -ED and gerunds -ING).

Context-based rules are applied to the POS tags either to trigger the extraction of relevant structures or to prevent the extraction of irrelevant ones. The modified file becomes a new input file for BioYaTeA.

The input and output files are in the TreeTagger format.
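As an illustration of that format: TreeTagger writes one token per line, as three tab-separated columns (word form, part-of-speech tag, lemma). The sketch below parses such lines. The helper name parse_ttg_line and the sample tokens are hypothetical and are not part of this module; the tags shown are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::BioYaTeA::PreProcessing:
# split one TreeTagger line into its form, POS tag and lemma.
sub parse_ttg_line {
    my ($line) = @_;
    chomp $line;
    my ($form, $pos, $lemma) = split /\t/, $line, 3;
    return { form => $form, pos => $pos, lemma => $lemma };
}

# Made-up sample input, one token per line.
my @lines = (
    "binding\tVVG\tbind\n",
    "to\tTO\tto\n",
    "DNA\tNN\tDNA\n",
    ".\tSENT\t.\n",
);

my @tokens = map { parse_ttg_line($_) } @lines;
printf "%s/%s\n", $_->{form}, $_->{pos} for @tokens;
```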

METHODS

new()

new();

The method creates a pre-processing component of BioYaTeA, loads the additional resources (stop verbs, stop participles, stop words) and the rewriting patterns (all are currently hardcoded), and returns the created object.

The pre-processing object is defined with 4 attributes: the list of stop verbs (stopVerbs), the list of stop participles (stopParticiples), the list of stop words (stoplist) and the list of rewriting patterns (patterns).

getStopVerbs()

getStopVerbs($form);

This method returns the attribute stopVerbs or, if $form is given, the specific value associated with the form $form.

existsInStopVerbs()

existsInStopVerbs($form);

This method indicates if the form $form exists in the list of stop verbs (stopVerbs attribute).

loadStopVerbs()

loadStopVerbs($form);

This method loads the list of stop verbs in the attribute stopVerbs and returns the attribute.
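The three stop-verb methods follow a load / get / exists pattern, and the stop-participle and stop-word methods below mirror it. The following is a minimal sketch of that pattern, assuming the list is stored as a hash keyed by word form. The package name StopListSketch, the storage layout and the sample words are all assumptions made for illustration; the module's actual internals may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in class; not the module's implementation.
package StopListSketch;

sub new {
    my ($class) = @_;
    my $self = bless { stopVerbs => {} }, $class;
    $self->loadStopVerbs();
    return $self;
}

# Fill the attribute and return it (the words here are made up).
sub loadStopVerbs {
    my ($self) = @_;
    $self->{stopVerbs}{$_} = 1 for qw(according following);
    return $self->{stopVerbs};
}

# Without argument: the whole attribute; with a form: its value.
sub getStopVerbs {
    my ($self, $form) = @_;
    return defined $form ? $self->{stopVerbs}{$form} : $self->{stopVerbs};
}

# True if the form is a key of the stop-verb hash.
sub existsInStopVerbs {
    my ($self, $form) = @_;
    return exists $self->{stopVerbs}{$form};
}

package main;

my $sketch = StopListSketch->new();
print "stop verb\n" if $sketch->existsInStopVerbs('following');
```

A hash gives constant-time membership tests, which suits a lookup performed for every token of the input.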

getStopParticiples()

getStopParticiples($form);

This method returns the attribute stopParticiples or, if $form is given, the specific value associated with the form $form.

existsInStopParticiples()

existsInStopParticiples($form);

This method indicates if the form $form exists in the list of stop participles (stopParticiples attribute).

loadStopParticiples()

loadStopParticiples($form);

This method loads the list of stop participles in the attribute stopParticiples and returns the attribute.

getStopList()

getStopList($form);

This method returns the attribute stopList or, if $form is given, the specific value associated with the form $form.

existsInStopList()

existsInStopList($form);

This method indicates if the form $form exists in the list of stop words (stopList attribute).

loadStopList()

loadStopList($form);

This method loads the list of stop words in the attribute stopList and returns the attribute.

compile1()

compile1($pattern, $result);

This method performs the first step of the compilation of the pattern $pattern by generating the related regular expression and creating the related pattern structure. This structure is composed of 4 fields: the pattern itself (root), the array of predicates (predicates), the array of named groups (namedgroup) and the regular expression (re). The predicates are functions used to check the part-of-speech tags or the forms of the words.

The second argument is not set at the first call. The method returns the resulting structure (an array reference).

compile2()

compile2($result, $child_pattern);

This method performs the second step of the compilation of the patterns. Patterns have been already processed by the method compile1 and represented in the structure $result. This step generates the regular expression (field re).

The second argument is not set at the first call.

compile()

compile($pattern);

This method compiles the pattern $pattern to obtain its internal representation and the corresponding regular expression in an array structure $result. This structure is returned.

translate()

translate($compiledpattern, $sequence);

This method uses the compiled pattern $compiledpattern to translate the sequence $sequence into a string, and returns that string. The string provides information associated with the various elements of the pattern (its content depends on the pattern).

match()

match($compiledpattern, $sequence);

This method applies the pattern $compiledpattern to the token sequence $sequence and merges the information in order to correct the part-of-speech tags associated with some words. Every rewriting operation is recorded in an array, which is returned.

pred()

pred($predicate, $quantifier);

The method returns the structure defining a predicate. The structure is composed of 3 fields: the type of structure (here "predicate"), the function associated with the predicate (field predicate, set with $predicate), and the quantifier associated with the predicate (field quantifier, set with $quantifier).

group()

group($children, $quantifier);

This method returns the structure defining a group of predicates. The structure is composed of 3 fields: the type of structure (here "group"), the list of predicates (field children, set with $children), and the quantifier associated with the child list (field quantifier, set with $quantifier).

named()

named($name, $children, $quantifier);

This method returns the structure defining a named group of predicates. The structure is composed of 4 fields: the type of structure (here "group"), the name associated with the group (field name, set with $name), the list of predicates (field children, set with $children), and the quantifier associated with the child list (field quantifier, set with $quantifier).
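To make the three structures concrete, here is a minimal sketch of how pred(), group() and named() could be laid out and combined. The field names follow the descriptions above; everything else is an assumption: the hash-reference layout, the _sketch function names, the token representation (hashes with form and pos keys) and the quantifier values '+' and '' are hypothetical and may not match the module.

```perl
use strict;
use warnings;

# Hypothetical re-implementations for illustration only.
sub pred_sketch {
    my ($predicate, $quantifier) = @_;
    return { type => 'predicate', predicate => $predicate,
             quantifier => $quantifier };
}

sub group_sketch {
    my ($children, $quantifier) = @_;
    return { type => 'group', children => $children,
             quantifier => $quantifier };
}

sub named_sketch {
    my ($name, $children, $quantifier) = @_;
    return { type => 'group', name => $name, children => $children,
             quantifier => $quantifier };
}

# A predicate is a function applied to one token; tokens are assumed
# here to be hashes with 'form' and 'pos' keys.
my $is_noun = sub { $_[0]{pos} =~ /^NN/ };
my $is_to   = sub { lc($_[0]{form}) eq 'to' };

# "One or more nouns, then 'to'", with the nouns in a group named "head".
my $pattern = group_sketch(
    [
        named_sketch('head', [ pred_sketch($is_noun, '+') ], ''),
        pred_sketch($is_to, ''),
    ],
    ''
);

print scalar @{ $pattern->{children} }, " children\n";
```

Keeping predicates as code references lets the same structure test either the tag or the word form, as described for compile1().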

is_ing()

is_ing($word);

This method indicates if the word ends in ing and is not in the list of stop words.

patterns()

patterns();

This method returns the list of patterns associated with the current object (field patterns).

setPattern1()

setPattern1();

This method sets the pattern 1.

getPattern1()

getPattern1();

This method returns the pattern 1.

setPattern2()

setPattern2();

This method sets the pattern 2.

getPattern2()

getPattern2();

This method returns the pattern 2.

setPattern3()

setPattern3();

This method sets the pattern 3.

getPattern3()

getPattern3();

This method returns the pattern 3.

setPattern4()

setPattern4();

This method sets the pattern 4.

getPattern4()

getPattern4();

This method returns the pattern 4.

setPattern5()

setPattern5();

This method sets the pattern 5.

getPattern5()

getPattern5();

This method returns the pattern 5.

setPattern6()

setPattern6();

This method sets the pattern 6.

getPattern6()

getPattern6();

This method returns the pattern 6.

setPattern7()

setPattern7();

This method sets the pattern 7.

getPattern7()

getPattern7();

This method returns the pattern 7.

setPattern8()

setPattern8();

This method sets the pattern 8.

getPattern8()

getPattern8();

This method returns the pattern 8.

setPattern9()

setPattern9();

This method sets the pattern 9.

getPattern9()

getPattern9();

This method returns the pattern 9.

setPattern10()

setPattern10();

This method sets the pattern 10.

getPattern10()

getPattern10();

This method returns the pattern 10.

not_sent()

not_sent($element);

This method indicates whether the part-of-speech of element $element is a mark of sentence end.

is_to()

is_to($element);

This method indicates whether the form of element $element is the preposition to.

process_sentence()

process_sentence($sentence, $fh);

This method processes the sentence $sentence in order to correct the part-of-speech tags if necessary, and prints the corrected sentence to the file handle $fh (the output respects the TreeTagger format).

process_file()

process_file($file, $fhout);

The method performs the correction process on the file $file, where $file is the name of the file to process. The output is printed to the file handle $fhout.
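A typical call can be wrapped as follows. This is a sketch that uses only new() and process_file() as documented here; the wrapper name preprocess_to_file is hypothetical, and stopping on I/O errors (instead of falling back to STDERR as in the SYNOPSIS) is a design choice of this example.

```perl
use strict;
use warnings;

# Hypothetical wrapper, not part of the module: run the pre-processor on
# the TreeTagger file $in and write the corrected output to $out.
sub preprocess_to_file {
    my ($in, $out) = @_;

    # Loaded on first use, so this file compiles even where the
    # module is not installed.
    require Lingua::BioYaTeA::PreProcessing;
    my $preProc = Lingua::BioYaTeA::PreProcessing->new();

    # Three-argument open with a lexical file handle.
    open(my $fh, '>', $out) or die "Cannot write $out: $!";
    $preProc->process_file($in, $fh);
    close($fh) or die "Cannot close $out: $!";

    return $out;
}

# Example call (paths taken from the SYNOPSIS):
# preprocess_to_file("t/example_input_preprocessing.ttg",
#                    "t/example_output_preprocessing-new.ttg");
```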

SEE ALSO

Documentation of Lingua::BioYaTeA and Lingua::YaTeA

AUTHORS

Wiktoria Golik <wiktoria.golik@jouy.inra.fr>, Zorana Ratkovic <Zorana.Ratkovic@jouy.inra.fr>, Robert Bossy <Robert.Bossy@jouy.inra.fr>, Claire Nédellec <claire.nedellec@jouy.inra.fr>, Thierry Hamon <thierry.hamon@univ-paris13.fr>

LICENSE

Copyright (C) 2012 Wiktoria Golik, Zorana Ratkovic, Robert Bossy, Claire Nédellec and Thierry Hamon

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.