NAME
Lingua::BioYaTeA::PreProcessing - Perl extension for preprocessing BioYaTeA input.
SYNOPSIS
use Lingua::BioYaTeA::PreProcessing;
my $preProc = Lingua::BioYaTeA::PreProcessing->new();
my $fh;
open($fh, '>', "t/example_output_preprocessing-new.ttg") or ($fh = *STDERR);
$preProc->process_file("t/example_input_preprocessing.ttg", $fh);
close($fh);
DESCRIPTION
The module implements an extension for the pre-processing of the TreeTagger output in order to improve the extraction of both terms containing prepositional phrases (with the prepositions TO and AT) and terms containing participles (past participles in -ED and gerunds in -ING).
Context-based rules are applied to the POS tags either to trigger the extraction of relevant structures or to prevent the extraction of irrelevant ones. The modified file becomes a new input file for BioYaTeA.
The input and output files are in the TreeTagger format.
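As an illustration of the format: TreeTagger output normally has one token per line, with three tab-separated columns (word form, part-of-speech tag, lemma). The sketch below parses such lines into hashes. The helper name parse_ttg_line and the sample tokens are assumptions made for this example; they are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: split one TreeTagger line into its three columns.
sub parse_ttg_line {
    my ($line) = @_;
    chomp $line;
    my ($form, $pos, $lemma) = split /\t/, $line;
    return { form => $form, pos => $pos, lemma => $lemma };
}

# Sample data, invented for the example.
my @lines = (
    "binding\tVVG\tbind\n",
    "to\tTO\tto\n",
    "DNA\tNN\tDNA\n",
    ".\tSENT\t.\n",
);
my @tokens = map { parse_ttg_line($_) } @lines;

printf "%s/%s\n", $_->{form}, $_->{pos} for @tokens;
```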
METHODS
new()
new();
The method creates a pre-processing component of BioYaTeA, loads the additional resources (stop verbs, stop participles, stop words) and the rewriting patterns (all are currently hardcoded), and returns the created object.
The pre-processing object is defined with 4 attributes: the list of stop verbs (stopVerbs), the list of stop participles (stopParticiples), the list of stop words (stoplist) and the list of rewriting patterns (patterns).
getStopVerbs()
getStopVerbs($form);
This method returns the attribute stopVerbs or the specific value associated with the form $form.
existsInStopVerbs()
existsInStopVerbs($form);
This method indicates whether the form $form exists in the list of stop verbs (stopVerbs attribute).
loadStopVerbs()
loadStopVerbs($form);
This method loads the list of stop verbs into the attribute stopVerbs and returns the attribute.
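The three stop-verb accessors can be pictured with the following minimal sketch. It assumes the attribute is a hash keyed by word form, which the documentation does not state; the package name StopVerbsSketch and the verb list are invented for the example and do not reproduce the module's hardcoded resources.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the pre-processing object: only the stopVerbs
# attribute is modelled, and it is assumed to be a hash keyed by form.
package StopVerbsSketch;

sub new {
    my ($class) = @_;
    my $self = bless { stopVerbs => {} }, $class;
    $self->loadStopVerbs();
    return $self;
}

# Fill the attribute (with an invented list) and return it.
sub loadStopVerbs {
    my ($self) = @_;
    $self->{stopVerbs}{$_} = 1 for qw(be have do);
    return $self->{stopVerbs};
}

# Without argument: the whole attribute; with a form: the value stored for it.
sub getStopVerbs {
    my ($self, $form) = @_;
    return defined $form ? $self->{stopVerbs}{$form} : $self->{stopVerbs};
}

# True when the form is a key of the attribute.
sub existsInStopVerbs {
    my ($self, $form) = @_;
    return exists $self->{stopVerbs}{$form};
}

package main;

my $sketch = StopVerbsSketch->new();
print "have: stop verb\n"     if     $sketch->existsInStopVerbs('have');
print "bind: not a stop verb\n" unless $sketch->existsInStopVerbs('bind');
```

The stop participle and stop word accessors follow the same get / exists / load naming, so the same shape would presumably apply to them.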
getStopParticiples()
getStopParticiples($form);
This method returns the attribute stopParticiples or the specific value associated with the form $form.
existsInStopParticiples()
existsInStopParticiples($form);
This method indicates whether the form $form exists in the list of stop participles (stopParticiples attribute).
loadStopParticiples()
loadStopParticiples($form);
This method loads the list of stop participles into the attribute stopParticiples and returns the attribute.
getStopList()
getStopList($form);
This method returns the attribute stopList or the specific value associated with the form $form.
existsInStopList()
existsInStopList($form);
This method indicates whether the form $form exists in the list of stop words (stopList attribute).
loadStopList()
loadStopList($form);
This method loads the list of stop words into the attribute stopList and returns the attribute.
compile1()
compile1($pattern, $result);
This method performs the first step of the compilation of the pattern $pattern by generating the related regular expression and creating the related pattern structure. This structure is composed of 4 fields: the pattern itself (root), the array of predicates (predicates), the array of named groups (namedgroup) and the regular expression. The predicates are functions that will be used to check the part-of-speech tags or the forms of the words.
The second argument is not set at the first call. The method returns the resulting structure (an array reference).
compile2()
compile2($result, $child_pattern);
This method performs the second step of the compilation of the patterns. The patterns have already been processed by the method compile1 and are represented in the structure $result. This step generates the regular expression (field re).
The second argument is not set at the first call.
compile()
compile($pattern);
This method compiles the pattern $pattern in order to obtain the relevant representation and the corresponding regular expression in an array structure $result. This structure is returned.
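One possible way to realise such a two-step compilation is sketched below. It is not the module's implementation: the node constructors, the use of a hash (rather than an array) for the result, and the token encoding (a ';' followed by one 0/1 flag per predicate) are all assumptions chosen to keep the example short. Step 1 numbers the predicates and records named groups; step 2 turns the tree into a regular expression over the encoded string.

```perl
use strict;
use warnings;

# Hypothetical node constructors (field names follow the descriptions of
# pred, group and named; the hash representation is an assumption).
sub pred {
    my ($predicate, $quantifier) = @_;
    return { type => 'predicate', predicate => $predicate, quantifier => $quantifier // '' };
}
sub group {
    my ($children, $quantifier) = @_;
    return { type => 'group', children => $children, quantifier => $quantifier // '' };
}
sub named {
    my ($name, $children, $quantifier) = @_;
    return { type => 'group', name => $name, children => $children, quantifier => $quantifier // '' };
}

# Step 1: walk the tree, number the predicates, record the named groups.
# $result is not set at the first call.
sub compile1 {
    my ($node, $result) = @_;
    $result //= { root => $node, predicates => [], namedgroup => [], re => undef };
    if ($node->{type} eq 'predicate') {
        $node->{index} = scalar @{ $result->{predicates} };
        push @{ $result->{predicates} }, $node->{predicate};
    }
    else {
        push @{ $result->{namedgroup} }, $node->{name} if defined $node->{name};
        compile1($_, $result) for @{ $node->{children} };
    }
    return $result;
}

# Step 2: build the regex source. $node is not set at the first call.
sub compile2 {
    my ($result, $node) = @_;
    $node //= $result->{root};
    my $n = scalar @{ $result->{predicates} };
    if ($node->{type} eq 'predicate') {
        my $i = $node->{index};
        # Matches one encoded token whose flag number $i is 1.
        my $body = sprintf ';[01]{%d}1[01]{%d}', $i, $n - $i - 1;
        return "(?:$body)$node->{quantifier}";
    }
    my $body = join '', map { compile2($result, $_) } @{ $node->{children} };
    my $open = defined $node->{name} ? "(?<$node->{name}>" : '(?:';
    return "$open$body)$node->{quantifier}";
}

sub compile {
    my ($pattern) = @_;
    my $result = compile1($pattern);
    my $source = compile2($result);
    $result->{re} = qr/$source/;
    return $result;
}

# Invented pattern: a noun, then the form "to" (named group), then a verb.
my $pattern = group([
    pred(sub { $_[0]{pos} =~ /^N/ }),
    named('to', [ pred(sub { lc($_[0]{form}) eq 'to' }) ]),
    pred(sub { $_[0]{pos} =~ /^V/ }),
]);
my $compiled = compile($pattern);
print "regex: $compiled->{re}\n";
```

Prefixing each encoded token with ';' keeps the regular expression aligned on token boundaries, since a match can only start where a token starts.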
translate()
translate($compiledpattern, $sequence);
This method applies the compiled pattern ($compiledpattern) to the sequence $sequence, translates the sequence into a string, and returns that string. The string provides information associated with the various elements of the pattern (its content depends on the pattern).
match()
match($compiledpattern, $sequence);
This method applies the pattern $compiledpattern to the token sequence $sequence and merges the information in order to correct the part-of-speech tags associated with some words. Every rewriting operation is recorded in an array, which is returned.
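The following sketch shows one way a translate / match pair could work; it is an illustration under stated assumptions, not the module's code. Tokens are assumed to be hashes with form and pos fields, each token is encoded as ';' plus one 0/1 flag per predicate, and the rewriting rule (a token tagged TO between two nouns is retagged IN) is invented for the example. The signatures are simplified as well: translate takes the predicate list directly and match takes a plain regular expression.

```perl
use strict;
use warnings;

# Predicates used for the encoding (invented for the example).
my @predicates = (
    sub { $_[0]{pos} eq 'TO' },     # flag 0: tagged TO
    sub { $_[0]{pos} =~ /^N/ },     # flag 1: noun
    sub { $_[0]{pos} =~ /^V/ },     # flag 2: verb
);
my $width = @predicates + 1;        # encoded length of one token

# Encode every token as ';' followed by one 0/1 flag per predicate.
sub translate {
    my ($predicates, $sequence) = @_;
    my $string = '';
    for my $token (@$sequence) {
        $string .= ';';
        $string .= $_->($token) ? '1' : '0' for @$predicates;
    }
    return $string;
}

# Apply the regex to the encoded sequence, retag the token captured by
# group 1, and record every rewriting operation.
sub match {
    my ($re, $sequence) = @_;
    my @changes;
    my $string = translate(\@predicates, $sequence);
    while ($string =~ /$re/g) {
        my $index = $-[1] / $width;             # offset -> token index
        my $token = $sequence->[$index];
        push @changes, { index => $index, from => $token->{pos}, to => 'IN' };
        $token->{pos} = 'IN';
    }
    return \@changes;
}

# Invented rule: noun, TO (captured), followed by a noun (lookahead).
my $re = qr/;[01]1[01](;1[01][01])(?=;[01]1[01])/;

my @tokens = (
    { form => 'binding', pos => 'NN' },
    { form => 'to',      pos => 'TO' },
    { form => 'DNA',     pos => 'NN' },
    { form => '.',       pos => 'SENT' },
);
my $changes = match($re, \@tokens);
printf "token %d: %s -> %s\n", @{$_}{qw(index from to)} for @$changes;
```

The trailing noun is checked with a lookahead so that it remains available as the leading noun of a following match.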
pred()
pred($predicate, $quantifier);
The method returns the structure defining a predicate. The structure is composed of 3 fields: the type of structure (here "predicate"), the function associated with the predicate (field predicate, set with $predicate), and the quantifier associated with the predicate (field quantifier, set with $quantifier).
group()
group($children, $quantifier);
This method returns the structure defining a group of predicates. The structure is composed of 3 fields: the type of structure (here "group"), the list of predicates (field children, set with $children), and the quantifier associated with the child list (field quantifier, set with $quantifier).
named()
named($name, $children, $quantifier);
This method returns the structure defining a named group of predicates. The structure is composed of 4 fields: the type of structure (here "group"), the name associated with the group (field name, set with $name), the list of predicates (field children, set with $children), and the quantifier associated with the child list (field quantifier, set with $quantifier).
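The three structures can be pictured as small records. The sketch below builds them as hash references; this representation, the key name type for the structure type, and the regex-style quantifier strings ('', '+', '?') are assumptions for illustration, since the documentation only gives the other field names.

```perl
use strict;
use warnings;

# Hypothetical constructors mirroring the fields described above.
sub pred {
    my ($predicate, $quantifier) = @_;
    return { type => 'predicate', predicate => $predicate, quantifier => $quantifier };
}
sub group {
    my ($children, $quantifier) = @_;
    return { type => 'group', children => $children, quantifier => $quantifier };
}
sub named {
    my ($name, $children, $quantifier) = @_;
    return { type => 'group', name => $name, children => $children, quantifier => $quantifier };
}

# Predicate functions (invented): they inspect a token given as a hash.
my $is_to   = sub { lc($_[0]{form}) eq 'to' };
my $is_verb = sub { $_[0]{pos} =~ /^V/ };

# An optional named group: "to" followed by one or more verbs.
my $pattern = named('inf', [ pred($is_to, ''), pred($is_verb, '+') ], '?');

printf "%s named %s with %d children\n",
    $pattern->{type}, $pattern->{name}, scalar @{ $pattern->{children} };
```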
is_ing()
is_ing($word);
This method indicates whether the word ends in ing and is not in the list of stop words.
patterns()
patterns();
This method returns the list of patterns associated with the current object (field patterns).
setPattern1()
setPattern1();
This method sets the pattern 1.
getPattern1()
getPattern1();
This method returns the pattern 1.
setPattern2()
setPattern2();
This method sets the pattern 2.
getPattern2()
getPattern2();
This method returns the pattern 2.
setPattern3()
setPattern3();
This method sets the pattern 3.
getPattern3()
getPattern3();
This method returns the pattern 3.
setPattern4()
setPattern4();
This method sets the pattern 4.
getPattern4()
getPattern4();
This method returns the pattern 4.
setPattern5()
setPattern5();
This method sets the pattern 5.
getPattern5()
getPattern5();
This method returns the pattern 5.
setPattern6()
setPattern6();
This method sets the pattern 6.
getPattern6()
getPattern6();
This method returns the pattern 6.
setPattern7()
setPattern7();
This method sets the pattern 7.
getPattern7()
getPattern7();
This method returns the pattern 7.
setPattern8()
setPattern8();
This method sets the pattern 8.
getPattern8()
getPattern8();
This method returns the pattern 8.
setPattern9()
setPattern9();
This method sets the pattern 9.
getPattern9()
getPattern9();
This method returns the pattern 9.
setPattern10()
setPattern10();
This method sets the pattern 10.
getPattern10()
getPattern10();
This method returns the pattern 10.
not_sent()
not_sent($element);
This method indicates whether the part-of-speech of the element $element is a mark of sentence end.
is_to()
is_to($element);
This method indicates whether the form of the element $element is the preposition to.
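Word-level checks such as is_ing and is_to can be sketched as small functions. The version below is only an illustration: the stop-word list is invented, the element is assumed to be a hash with a form field, and the comparison is made case-insensitive, none of which is confirmed by the documentation.

```perl
use strict;
use warnings;

# Invented stop words: forms ending in "ing" that should not count.
my %stoplist = map { $_ => 1 } qw(thing during);

# True when the word ends in "ing" and is not a stop word.
sub is_ing {
    my ($word) = @_;
    return ($word =~ /ing$/i && !exists $stoplist{ lc $word }) ? 1 : 0;
}

# True when the form of the element is "to".
sub is_to {
    my ($element) = @_;
    return lc($element->{form}) eq 'to' ? 1 : 0;
}

for my $word (qw(binding during bind)) {
    printf "%-8s is_ing=%d\n", $word, is_ing($word);
}
printf "is_to=%d\n", is_to({ form => 'To', pos => 'TO' });
```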
process_sentence()
process_sentence($sentence, $fh);
This method processes the sentence $sentence in order to correct the part-of-speech tags if necessary, and prints the corrected sentence to the file handle $fh (the output respects the TreeTagger format).
process_file()
process_file($file, $fhout);
The method performs the correction process on the file $file, where $file is the name of the file to process. The output is printed to the file handle $fhout.
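The overall loop can be sketched as follows, assuming that sentences end at a token tagged SENT and that each line has three tab-separated columns. To stay self-contained the sketch reads from a file handle (here an in-memory one) instead of a filename, and the function name process_stream is hypothetical; the correction step itself is left as a placeholder.

```perl
use strict;
use warnings;

# Print one sentence in TreeTagger format. A real correction step would
# adjust the pos field of some tokens before printing.
sub process_sentence {
    my ($sentence, $fh) = @_;
    print {$fh} join("\t", @{$_}{qw(form pos lemma)}), "\n" for @$sentence;
}

# Hypothetical driver: read tokens, cut sentences at SENT, process each one.
sub process_stream {
    my ($in, $out) = @_;
    my @sentence;
    my $count = 0;
    while (my $line = <$in>) {
        chomp $line;
        next if $line eq '';
        my %token;
        @token{qw(form pos lemma)} = split /\t/, $line;
        push @sentence, \%token;
        if ($token{pos} eq 'SENT') {
            process_sentence(\@sentence, $out);
            @sentence = ();
            $count++;
        }
    }
    # Flush a trailing sentence that has no end mark.
    if (@sentence) {
        process_sentence(\@sentence, $out);
        $count++;
    }
    return $count;
}

# In-memory input and output, with invented sample data.
my $input = "binding\tVVG\tbind\nto\tTO\tto\nDNA\tNN\tDNA\n.\tSENT\t.\n";
my $output = '';
open(my $in,  '<', \$input)  or die "cannot open input: $!";
open(my $out, '>', \$output) or die "cannot open output: $!";
my $n = process_stream($in, $out);
close $out;
close $in;
print "sentences processed: $n\n";
```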
SEE ALSO
Documentation of Lingua::BioYaTeA and Lingua::YaTeA
AUTHORS
Wiktoria Golik <wiktoria.golik@jouy.inra.fr>, Zorana Ratkovic <Zorana.Ratkovic@jouy.inra.fr>, Robert Bossy <Robert.Bossy@jouy.inra.fr>, Claire Nédellec <claire.nedellec@jouy.inra.fr>, Thierry Hamon <thierry.hamon@univ-paris13.fr>
LICENSE
Copyright (C) 2012 Wiktoria Golik, Zorana Ratkovic, Robert Bossy, Claire Nédellec and Thierry Hamon
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.