NAME
Lingua::BioYaTeA::PostProcessing - Perl extension for postprocessing BioYaTeA term extraction.
SYNOPSIS
use Lingua::BioYaTeA::PostProcessing;
use File::Basename;    # provides dirname()
my $postProc = Lingua::BioYaTeA::PostProcessing->new( {
    'input-file' => "sampleEN-output.xml",
    'output-file' => "sampleEN-bioyatea-out-pp.xml",
    'configuration' => "post-processing-filtering.conf",
} );
$postProc->logfile(dirname($postProc->output_file) . '/term-filtering.log');
$postProc->load_configuration;
$postProc->defineTwigParser;
$postProc->filtering;
$postProc->printResume;
DESCRIPTION
The module implements an extension for the post-processing of the BioYaTeA (Lingua::BioYaTeA) output. Currently, the BioYaTeA XML output is filtered according to rules in order to remove irrelevant extracted terms.
The input and output files are in the XML YaTeA format.
The configuration file provides patterns for two types of text, inflected forms (FORM) or lemmatized forms (LEMMA) of terms or term components, together with the action to perform. Currently only the CLEAN action (which removes terms) is implemented.
METHODS
new()
new(\%options);
The method creates a post-processing component of BioYaTeA, sets the options attribute with the hashtable %options, and returns the created object.
The hashtable %options contains several fields: the input file name input-file, the output file name output-file, the configuration file name configuration, and the temporary directory name tmp-dir.
Other attributes are: the XML::Twig parser twig_parser, the counter of term candidates tc_counter, the counter of rejected terms count_rejected, the list of regular expressions used to identify terms to reject reg_exps, the indication of whether each regular expression is applied case-insensitively case_insensitive, the log file handle logfh, the output file handle outfh, and the log file name logfile.
reg_exps is a hashtable where keys are pattern types (FORM or LEMMA) and values are arrays of regular expressions.
case_insensitive is a hashtable where keys are regular expressions.
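To make the two structures concrete, here is a minimal sketch of how they might look once populated and how a form could be tested against them. The patterns and the helper is_rejected are hypothetical, and the module itself may store or apply the expressions differently.

```perl
use strict;
use warnings;

# Hypothetical contents: pattern type => list of regular expression strings.
my %reg_exps = (
    FORM  => [ '[Vv]arious', '^other ' ],
    LEMMA => [ '^several$' ],
);

# Hypothetical flags: regular expression => true if applied case-insensitively.
my %case_insensitive = ( '^other ' => 1 );

# Return true if $text matches any pattern registered for $type.
sub is_rejected {
    my ($type, $text) = @_;
    for my $re (@{ $reg_exps{$type} || [] }) {
        my $compiled = $case_insensitive{$re} ? qr/$re/i : qr/$re/;
        return 1 if $text =~ $compiled;
    }
    return 0;
}

print is_rejected('FORM', 'Other proteins')     ? "rejected\n" : "kept\n";
print is_rejected('FORM', 'heat shock protein') ? "rejected\n" : "kept\n";
```

Keeping the case flag in a separate hashtable keyed by the expression means the pattern strings themselves stay unchanged, at the cost of one extra lookup per pattern.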
tc_counter()
tc_counter($tc_counter);
This method sets the attribute tc_counter with the value $tc_counter and returns it. When no argument is given, the value of the attribute tc_counter is returned.
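The accessors documented in this section all follow this combined getter/setter convention. The following is a minimal sketch of that convention, using a hypothetical package My::Counter rather than the module's actual source.

```perl
use strict;
use warnings;

package My::Counter;    # hypothetical class, for illustration only

sub new {
    my ($class) = @_;
    return bless { tc_counter => 0 }, $class;
}

# Combined getter/setter: store the argument if one is given,
# then return the current value in either case.
sub tc_counter {
    my $self = shift;
    $self->{tc_counter} = shift if @_;
    return $self->{tc_counter};
}

package main;

my $obj = My::Counter->new;
$obj->tc_counter(42);                        # set
print "count: ", $obj->tc_counter, "\n";     # get
```

Testing @_ rather than the definedness of the argument lets a caller reset an attribute to undef explicitly.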
logfh()
logfh($logfh);
This method sets the attribute logfh with the handle $logfh and returns it. When no argument is given, the value of the attribute logfh is returned.
outfh()
outfh($outfh);
This method sets the attribute outfh with the handle $outfh and returns it. When no argument is given, the value of the attribute outfh is returned.
count_rejected()
count_rejected($count_rejected);
This method sets the attribute count_rejected with the value $count_rejected and returns it. When no argument is given, the value of the attribute count_rejected is returned.
case_insensitive()
case_insensitive(\%case_insensitive);
This method sets the attribute case_insensitive with the hashtable %case_insensitive and returns it. When no argument is given, the hashtable reference of the attribute case_insensitive is returned.
case_insensitive_elt()
case_insensitive_elt($case_insensitive_name, $case_insensitive_value);
This method records whether the regular expression $case_insensitive_name is case insensitive or not (value $case_insensitive_value) in the hashtable referred to by the attribute case_insensitive and returns it. When only one argument is given, the value associated with the regular expression $case_insensitive_name is returned. When no argument is given, an undefined value is returned.
exists_case_insensitive_elt()
exists_case_insensitive_elt($case_insensitive_name);
The method indicates whether the regular expression $case_insensitive_name is applied case-insensitively or not.
options()
options(\%options);
This method sets the attribute options with the hashtable %options and returns it. When no argument is given, the hashtable reference of the attribute options is returned.
configuration()
configuration($configuration);
This method sets the attribute configuration with the value $configuration and returns it. When no argument is given, the value of the attribute configuration is returned.
input_file()
input_file($input_file);
This method sets the field input-file of the attribute options with the value $input_file (input file name) and returns it. When no argument is given, the value of the field input-file of the attribute options is returned.
logfile()
logfile($logfile);
This method sets the field log-file of the attribute options with the value $logfile (log file name) and returns it. When no argument is given, the value of the field log-file of the attribute options is returned.
tmp_dir()
tmp_dir($tmp_dir);
This method sets the field tmp-dir of the attribute options with the value $tmp_dir (temporary directory name) and returns it. When no argument is given, the value of the field tmp-dir of the attribute options is returned.
output_file()
output_file($output_file);
This method sets the field output-file of the attribute options with the value $output_file (output file name) and returns it. When no argument is given, the value of the field output-file of the attribute options is returned.
reg_exps()
reg_exps(\%reg_exps);
This method sets the attribute reg_exps with the hashtable %reg_exps and returns it. When no argument is given, the hashtable reference of the attribute reg_exps is returned.
reg_exp_elt()
reg_exp_elt($reg_exp_name, $reg_exp_value);
This method adds the regular expression $reg_exp_value to the array related to the type of patterns $reg_exp_name and returns it. When only one argument is given, the array referred to by $reg_exp_name is returned. When no argument is given, a reference to an empty array is returned.
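The three cases can be illustrated with a small standalone function working on a plain hashtable. This is only a sketch of the documented behaviour, not the module's source: returning the whole array in the two-argument case, and an empty array for an unknown pattern type, are both assumptions.

```perl
use strict;
use warnings;

my %reg_exps;    # pattern type => array reference of regular expressions

# Sketch of the three documented cases (two, one, or no arguments).
sub reg_exp_elt {
    my ($name, $value) = @_;
    return [] unless defined $name;                        # no argument
    push @{ $reg_exps{$name} }, $value if defined $value;  # two arguments
    return $reg_exps{$name} || [];                         # array for that type
}

reg_exp_elt('FORM', '[Vv]arious');
reg_exp_elt('FORM', '^other ');
print scalar(@{ reg_exp_elt('FORM') }), " pattern(s) for FORM\n";
```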
twig_parser()
twig_parser($twig_parser);
This method sets the attribute twig_parser with the XML::Twig parser $twig_parser and returns it. When no argument is given, the value of the attribute twig_parser is returned.
defineTwigParser()
defineTwigParser();
The method defines the XML::Twig parser and associates it with the object.
processTerms()
processTerms($twig_parser,$data);
The function processes terms which match regular expressions by applying the associated actions (as defined in the configuration file, for instance). The terms are in the XML tree $data.
Note: this function is used by the XML::Twig parser, which calls it through a function reference (as a callback).
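The callback mechanism can be sketched with XML::Twig alone. The example below is an illustration under stated assumptions, not the module's implementation: the simplified XML layout, the element names TERM_CANDIDATE and FORM, the handler name process_term, and the hard-coded pattern are all chosen for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumed, simplified input; a real YaTeA file carries more elements.
my $xml = <<'XML';
<LIST_TERM_CANDIDATES>
  <TERM_CANDIDATE><FORM>various proteins</FORM></TERM_CANDIDATE>
  <TERM_CANDIDATE><FORM>heat shock protein</FORM></TERM_CANDIDATE>
</LIST_TERM_CANDIDATES>
XML

my ($seen, $rejected) = (0, 0);

# Handler called by XML::Twig each time a TERM_CANDIDATE element is complete.
sub process_term {
    my ($twig, $term) = @_;
    $seen++;
    my $form = $term->first_child_text('FORM');
    if ($form =~ /[Vv]arious/) {    # CLEAN action: remove the term
        $term->delete;
        $rejected++;
    }
    return 1;
}

my $twig = XML::Twig->new(
    twig_handlers => { TERM_CANDIDATE => \&process_term },
    pretty_print  => 'indented',
);
$twig->parse($xml);
$twig->print;
print "$rejected of $seen term candidates rejected\n";
```

Because the handler receives the parser and the current element as its two arguments, it has the same shape as processTerms($twig_parser, $data).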
load_configuration()
load_configuration();
The method processes and loads the configuration file (set in the attribute configuration of the current object). The attributes reg_exps and case_insensitive are set by this method.
filtering()
filtering();
The method performs the full filtering of the terms:
- setting the temporary file if it is not defined
- opening the XML output file
- setting the XML::Twig parser
- processing the XML input file in order to apply the actions associated with the regular expressions
printResume()
printResume();
The method prints the number of rejected terms and the number of remaining candidate terms.
rmlog()
rmlog();
The method deletes the log file.
CONFIGURATION FILE FORMAT
The configuration file defines the action to perform when an associated regular expression matches a term form. For instance:
CLEAN=FORM::/[Vv]arious/
Each line defines an association between an action (only CLEAN for the moment) and a regular expression to apply to a form of a term (FORM for the inflected form, LEMMA for the lemmatised form).
The action and regular expression parts are separated by the character =. The two elements of the regular expression part (the form type and the pattern) are separated by two colons (::).
Comments are introduced by a # character at the beginning of the line.
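As an illustration of this line format, the following minimal sketch splits one rule into its three parts. The function parse_rule is hypothetical and is not the parser used by load_configuration. In particular, it does not handle any marker for case-insensitive matching, whose syntax is not described here.

```perl
use strict;
use warnings;

# Parse one configuration line into (action, type, pattern).
# Returns an empty list for comments, blank lines and unrecognised lines.
sub parse_rule {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ || $line =~ /^\s*$/;
    my ($action, $rest) = split /=/, $line, 2;     # action = remainder
    return unless defined $rest;
    my ($type, $pattern) = split /::/, $rest, 2;   # type :: /pattern/
    return unless defined $pattern;
    $pattern =~ s{\A/(.*)/\z}{$1}s;                # strip surrounding slashes
    return ($action, $type, $pattern);
}

my @rule = parse_rule("CLEAN=FORM::/[Vv]arious/\n");
print join(' | ', @rule), "\n";
```

Limiting both splits to two fields keeps any = or :: inside the pattern itself intact.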
SEE ALSO
Documentation of Lingua::YaTeA
AUTHORS
Wiktoria Golik <wiktoria.golik@jouy.inra.fr>, Zorana Ratkovic <Zorana.Ratkovic@jouy.inra.fr>, Robert Bossy <Robert.Bossy@jouy.inra.fr>, Claire Nédellec <claire.nedellec@jouy.inra.fr>, Thierry Hamon <thierry.hamon@univ-paris13.fr>
LICENSE
Copyright (C) 2012 Wiktoria Golik, Zorana Ratkovic, Robert Bossy, Claire Nédellec and Thierry Hamon
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.