NAME

Lingua::BioYaTeA::PostProcessing - Perl extension for postprocessing BioYaTeA term extraction.

SYNOPSIS

use File::Basename;    # provides dirname()
use Lingua::BioYaTeA::PostProcessing;

my $postProc = Lingua::BioYaTeA::PostProcessing->new(
    {
        'input-file'    => "sampleEN-output.xml",
        'output-file'   => "sampleEN-bioyatea-out-pp.xml",
        'configuration' => "post-processing-filtering.conf",
    }
);
$postProc->logfile(dirname($postProc->output_file) . '/term-filtering.log');
$postProc->load_configuration;
$postProc->defineTwigParser;
$postProc->filtering;
$postProc->printResume;

DESCRIPTION

The module implements an extension for post-processing the output of BioYaTeA (Lingua::BioYaTeA). Currently, the XML BioYaTeA output is filtered according to rules in order to remove irrelevant extracted terms.

The input and output files are in the XML YaTeA format.

The configuration file provides patterns of two types, applying either to inflected forms (FORM) or to lemmatized forms (LEMMA) of terms or term components, together with the action to perform when a pattern matches. Currently only the CLEAN action (which removes terms) is implemented.

METHODS

new()

new(\%options);

The method creates a post-processing component of BioYaTeA, sets the options attribute with the hashtable %options, and returns the created object.

The hashtable %options contains several fields: the input file name input-file, the output file name output-file, the configuration file name configuration and the temporary directory name tmp-dir.

Other attributes are: the XML::Twig parser twig_parser, the counter of term candidates tc_counter, the counter of rejected terms count_rejected, the list of regular expressions used to identify terms to reject reg_exps, the indication whether the application of each regular expression is case insensitive case_insensitive, the log file handler logfh, the output file handler logout, and the log file name logfile.

reg_exps is a hashtable whose keys are pattern types (such as FORM) and whose values are arrays of regular expressions.

case_insensitive is a hashtable where keys are regular expressions.
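As an illustration, the two hashtables might look as follows once a configuration has been loaded. This is only a sketch of the shape described above, not the module's internal code: the patterns, the LEMMA key and the 0/1 flag values are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical contents: pattern type => array of regular expressions.
my %reg_exps = (
    'FORM'  => [ '[Vv]arious', '^other\b' ],
    'LEMMA' => [ '^several$' ],
);

# Hypothetical contents: regular expression => case-insensitivity flag.
my %case_insensitive = (
    '^other\b'  => 1,
    '^several$' => 0,
);

# Count the regular expressions registered for each pattern type.
my %count = map { $_ => scalar @{ $reg_exps{$_} } } keys %reg_exps;
print "$_: $count{$_}\n" for sort keys %count;
```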

tc_counter()

tc_counter($tc_counter);

This method sets the attribute tc_counter with the value $tc_counter and returns it. When no argument is given, the value of the attribute tc_counter is returned.
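Most accessors in this documentation follow the same combined getter/setter convention. A minimal sketch of that convention is given here, assuming a hash-based object; the class Sketch::Counter is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

package Sketch::Counter;    # hypothetical class, for illustration only

sub new {
    my ($class) = @_;
    return bless { 'tc_counter' => 0 }, $class;
}

# Combined getter/setter: store the value if one is given,
# then return the current value of the attribute.
sub tc_counter {
    my $self = shift;
    $self->{'tc_counter'} = shift if @_;
    return $self->{'tc_counter'};
}

package main;

my $obj = Sketch::Counter->new;
my $set = $obj->tc_counter(12);    # setter form
my $got = $obj->tc_counter;        # getter form
print "$set $got\n";
```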

logfh()

logfh($logfh);

This method sets the attribute logfh with the handler $logfh and returns it. When no argument is given, the value of the attribute logfh is returned.

outfh()

outfh($outfh);

This method sets the attribute outfh with the handler $outfh and returns it. When no argument is given, the value of the attribute outfh is returned.

count_rejected()

count_rejected($count_rejected);

This method sets the attribute count_rejected with the value $count_rejected and returns it. When no argument is given, the value of the attribute count_rejected is returned.

case_insensitive()

case_insensitive(\%case_insensitive);

This method sets the attribute case_insensitive with the hashtable %case_insensitive and returns it. When no argument is given, the hashtable reference of the attribute case_insensitive is returned.

case_insensitive_elt()

case_insensitive_elt($case_insensitive_name, $case_insensitive_value);

This method records whether the regular expression $case_insensitive_name is case insensitive (value $case_insensitive_value) in the hashtable referred to by the attribute case_insensitive, and returns that value. When only one argument is given, the value associated with the regular expression $case_insensitive_name is returned. When no argument is given, an undefined value is returned.

exists_case_insensitive_elt()

exists_case_insensitive_elt($case_insensitive_name);

The method indicates if the application of the regular expression $case_insensitive_name is case insensitive or not.
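How such a per-expression flag might be used when matching can be sketched as follows. This is an assumption about usage rather than the module's own code: the function matches and the sample patterns are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical flags: regular expression => case-insensitivity indication.
my %case_insensitive = ( 'various' => 1, 'Other' => 0 );

# Compile the pattern with or without the /i modifier, depending on its flag.
sub matches {
    my ($pattern, $form) = @_;
    my $re = $case_insensitive{$pattern} ? qr/$pattern/i : qr/$pattern/;
    return $form =~ $re ? 1 : 0;
}

my @results = (
    matches('various', 'Various proteins'),    # flag set: case is ignored
    matches('Other',   'other genes'),         # flag unset: case must agree
);
print "@results\n";
```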

options()

options(\%options);

This method sets the attribute options with the hashtable %options and returns it. When no argument is given, the hashtable reference of the attribute options is returned.

configuration()

configuration($configuration);

This method sets the attribute configuration with the value $configuration and returns it. When no argument is given, the value of the attribute configuration is returned.

input_file()

input_file($input_file);

This method sets the field input-file of the attribute options with the value $input_file (input file name) and returns it. When no argument is given, the value of the field input-file of the attribute options is returned.

logfile()

logfile($logfile);

This method sets the field log-file of the attribute options with the value $logfile (log file name) and returns it. When no argument is given, the value of the field log-file of the attribute options is returned.

tmp_dir()

tmp_dir($tmp_dir);

This method sets the field tmp-dir of the attribute options with the value $tmp_dir (temporary directory name) and returns it. When no argument is given, the value of the field tmp-dir of the attribute options is returned.

output_file()

output_file($output_file);

This method sets the field output-file of the attribute options with the value $output_file (output file name) and returns it. When no argument is given, the value of the field output-file of the attribute options is returned.

reg_exps()

reg_exps(\%reg_exps);

This method sets the attribute reg_exps with the hashtable %reg_exps and returns it. When no argument is given, the hashtable reference of the attribute reg_exps is returned.

reg_exp_elt()

reg_exp_elt($reg_exp_name, $reg_exp_value);

This method adds the regular expression $reg_exp_value to the array associated with the pattern type $reg_exp_name and returns it. When only one argument is given, the array referred to by $reg_exp_name is returned. When no argument is given, a reference to an empty array is returned.
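The three calling forms can be sketched with a plain function over a hashtable. The name reg_exp_elt_sketch is hypothetical, and returning an empty array for an unknown pattern type is an assumption of this sketch, not documented behaviour.

```perl
use strict;
use warnings;

my %reg_exps;    # pattern type => array reference of regular expressions

# Hypothetical stand-in illustrating the described calling forms.
sub reg_exp_elt_sketch {
    my ($name, $value) = @_;
    return [] unless defined $name;                  # no argument: empty array
    push @{ $reg_exps{$name} }, $value if @_ > 1;    # two arguments: append
    return $reg_exps{$name} // [];                   # array for this pattern type
}

reg_exp_elt_sketch('FORM', '[Vv]arious');
reg_exp_elt_sketch('FORM', '^other\b');
my $form_patterns = reg_exp_elt_sketch('FORM');
print scalar(@$form_patterns), " FORM patterns\n";
```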

twig_parser()

twig_parser($twig_parser);

This method sets the attribute twig_parser with the XML::Twig parser $twig_parser and returns it. When no argument is given, the value of the attribute twig_parser is returned.

defineTwigParser()

defineTwigParser();

The method defines the XML::Twig parser and associates it with the object.

processTerms()

processTerms($twig_parser,$data);

The function processes the terms that match regular expressions by applying the associated actions (as defined in the configuration file, for instance). The terms are in the XML tree $data.

Note: this is a function, not a method. It is used by the XML::Twig parser, which calls it through a function reference.

load_configuration()

load_configuration();

The method processes and loads the configuration file (set in the attribute configuration of the current object). The attributes reg_exps and case_insensitive are set by this method.

filtering()

filtering();

The method performs the full filtering of the terms:

setting of the temporary file if not defined
opening the XML output file
setting the XML::Twig parser
processing of the XML input file in order to apply the actions associated with the regular expressions
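
The steps above describe the file-level workflow. The effect of the CLEAN action itself can be sketched on in-memory data, without any XML. The rules, the term candidates and the variable names below are all hypothetical, and the way the counters are updated is an assumption.

```perl
use strict;
use warnings;

# Hypothetical rules: pattern type => regular expressions (CLEAN action only).
my %reg_exps = ( 'FORM' => ['[Vv]arious'], 'LEMMA' => ['^other$'] );

# Hypothetical term candidates with an inflected and a lemmatized form.
my @terms = (
    { 'FORM' => 'various genes',     'LEMMA' => 'various gene' },
    { 'FORM' => 'Bacillus subtilis', 'LEMMA' => 'bacillus subtilis' },
    { 'FORM' => 'others',            'LEMMA' => 'other' },
);

my ($seen, $count_rejected) = (0, 0);
my @kept;
TERM: for my $term (@terms) {
    $seen++;
    for my $type (sort keys %reg_exps) {
        for my $pattern (@{ $reg_exps{$type} }) {
            if ($term->{$type} =~ /$pattern/) {
                $count_rejected++;    # CLEAN: the term is dropped
                next TERM;
            }
        }
    }
    push @kept, $term->{'FORM'};      # no rule matched: the term is kept
}
print "seen: $seen, rejected: $count_rejected, kept: ", scalar(@kept), "\n";
```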

printResume()

printResume();

The method prints the number of rejected terms and the number of remaining candidate terms.

rmlog()

rmlog();

The method deletes the log file.

CONFIGURATION FILE FORMAT

The configuration file defines the action to perform when an associated regular expression matches a term form. For instance:

CLEAN=FORM::/[Vv]arious/

Each line defines an association between an action (only CLEAN for the moment) and a regular expression to apply to a form of a term (FORM for the inflected form, LEMMA for the lemmatised form).

The action and regular expression parts are separated by the character =. The two elements of the regular expression part (the form type and the pattern) are separated by two colons (::).

Comments are introduced by a # character at the beginning of the line.
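
A minimal sketch of how lines in this format could be parsed is given below. It is not the code of load_configuration: the sample lines are hypothetical, and skipping blank lines and stripping the enclosing slashes are assumptions of the sketch.

```perl
use strict;
use warnings;

# Hypothetical configuration content in the format described above.
my @lines = (
    '# remove vague terms',
    'CLEAN=FORM::/[Vv]arious/',
    '',
    'CLEAN=LEMMA::/^other$/',
);

my %reg_exps;    # pattern type => array of regular expressions
for my $line (@lines) {
    next if $line =~ /^#/ or $line =~ /^\s*$/;    # comments and blank lines
    my ($action, $rule)    = split /=/,  $line, 2;
    my ($type,   $pattern) = split /::/, $rule, 2;
    next unless $action eq 'CLEAN';               # the only action described
    $pattern =~ s{\A/(.*)/\z}{$1};                # strip the enclosing slashes
    push @{ $reg_exps{$type} }, $pattern;
}
print "$_ => @{ $reg_exps{$_} }\n" for sort keys %reg_exps;
```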

SEE ALSO

Documentation of Lingua::YaTeA

AUTHORS

Wiktoria Golik <wiktoria.golik@jouy.inra.fr>, Zorana Ratkovic <Zorana.Ratkovic@jouy.inra.fr>, Robert Bossy <Robert.Bossy@jouy.inra.fr>, Claire Nédellec <claire.nedellec@jouy.inra.fr>, Thierry Hamon <thierry.hamon@univ-paris13.fr>

LICENSE

Copyright (C) 2012 Wiktoria Golik, Zorana Ratkovic, Robert Bossy, Claire Nédellec and Thierry Hamon

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.