NAME

Alvis::TermTagger - Perl extension for tagging terms in a text

SYNOPSIS

use Alvis::TermTagger;

Alvis::TermTagger::termtagging($text, $termlist, $outputfile);

or

Alvis::TermTagger::termtagging($text, $termlist, $outputfile, $lemmatised_text);

DESCRIPTION

This module tags a text with terms, matching either the inflected or the lemmatised form of their words. The text or text corpus ($text) is a file with one sentence per line. The term list ($termlist) is a file with one term per line. For each term, additional information (such as a canonical form, a semantic tag and the lemmatised words of the term) can be given after the first column. These fields can be separated by either a colon or a vertical bar. Each line of the output file ($outputfile) contains the sentence number, the term and the additional information, separated by tab characters. The lemmatised text ($lemmatised_text) is built as the concatenation of the lemmas of the words of the corpus.
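As an illustration of the term list format, the following minimal sketch splits one term list line into the term and its additional fields. The helper name parse_term_line, the sample line, the field order and the trimming of spaces around separators are assumptions made for the example; this is not the module's own parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): split one term list
# line into the term (first column) and its optional additional fields.
sub parse_term_line {
    my ($line) = @_;
    chomp $line;
    # Fields may be separated by a colon or by a vertical bar.
    # Trimming the spaces around the separator is an assumption.
    my ($term, @info) = split /\s*[:|]\s*/, $line;
    return ($term, @info);
}

# Assumed sample line: term, canonical form, semantic tag.
my ($term, @info) = parse_term_line("transcription factors:transcription factor|protein\n");

# One output line: sentence number (12 is made up), term, additional information.
print join("\t", 12, $term, @info), "\n";
```

The same split handles both separators, so a term list may mix colons and vertical bars within one line.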

This module is mainly used in the Alvis NLP Platform.

METHODS

termtagging()

termtagging($corpus_filename, $term_list_filename, $output_filename, $lemmatised_corpus_filename, $caseSensitive);

This is the main method of the module. It loads the term list ($term_list_filename) and tags the text corpus ($corpus_filename). It produces the list of matching terms, together with the offset of the sentence where each term is found and any additional information given in the input file. This output is written to the file $output_filename. To match terms in their lemmatised form (a concatenation of lemmatised words), the lemmatised corpus $lemmatised_corpus_filename must be given as the fourth argument of the method.

The parameter $caseSensitive indicates whether term matching is case sensitive (value greater than or equal to 0) or case insensitive (value strictly less than 0). If $caseSensitive is equal to 0, case-sensitive matching is applied to all terms. If $caseSensitive is strictly greater than 0, case-sensitive matching is applied only to terms whose number of characters is less than or equal to $caseSensitive.
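The three cases of $caseSensitive can be summarised in a small decision function. This is only a sketch of the rule as documented: the name match_is_case_sensitive is hypothetical, and treating terms longer than the threshold as case insensitive is an assumption drawn from the word "only" in the description.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): returns 1 when a term
# should be matched case-sensitively for a given $caseSensitive value.
sub match_is_case_sensitive {
    my ($term, $caseSensitive) = @_;
    return 0 if $caseSensitive < 0;     # negative: insensitive for every term
    return 1 if $caseSensitive == 0;    # zero: sensitive for every term
    # Positive: sensitive only for terms of at most $caseSensitive characters.
    # Assumption: longer terms fall back to case-insensitive matching.
    return length($term) <= $caseSensitive ? 1 : 0;
}

for my $setting (-1, 0, 4) {
    for my $term ('CAT', 'catalase') {
        printf "%2d  %-8s  %s\n", $setting, $term,
            match_is_case_sensitive($term, $setting)
                ? 'case sensitive'
                : 'case insensitive';
    }
}
```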

termtagging_brat()

termtagging_brat($corpus_filename, $term_list_filename, $output_filename, $lemmatised_corpus_filename, $caseSensitive);

This is the Brat-oriented variant of the main method. It loads the term list ($term_list_filename) and tags the text corpus ($corpus_filename). The output can be read by Brat (<http://brat.nlplab.org/>).

It produces the list of matching terms, together with the offset of the sentence where each term is found and any additional information given in the input file. This output is written to the file $output_filename. To match terms in their lemmatised form (a concatenation of lemmatised words), the lemmatised corpus $lemmatised_corpus_filename must be given as the fourth argument of the method.

The parameter $caseSensitive indicates whether term matching is case sensitive (value greater than or equal to 0) or case insensitive (value strictly less than 0). If $caseSensitive is equal to 0, case-sensitive matching is applied to all terms. If $caseSensitive is strictly greater than 0, case-sensitive matching is applied only to terms whose number of characters is less than or equal to $caseSensitive.

load_TermList()

load_TermList($term_list_filename,\@term_list);

This method loads the term list from the file $term_list_filename into the array given by reference (\@term_list). Each element of the term list is a reference to a two-element array (the term and its canonical form).

get_Regex_TermList()

get_Regex_TermList(\@term_list, \@regex_term_list, \@ref_regex_lemmaWordtermlist);

This method generates the regular expressions from the term list (\@term_list) and stores them in a dedicated array (\@regex_term_list). \@ref_regex_lemmaWordtermlist records the regular expressions for the term lemmas.
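To make the idea concrete, here is a minimal sketch of turning each term into a regular expression. The helper term_to_regex, the whitespace-tolerant pattern and the word-boundary checks are assumptions chosen for the example; the patterns actually built by get_Regex_TermList() may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): build a regex for one
# term, escaping metacharacters and accepting any whitespace between words.
sub term_to_regex {
    my ($term) = @_;
    my @words   = map { quotemeta($_) } split ' ', $term;
    my $pattern = join '\s+', @words;
    # Assumption: a term must not start or end in the middle of a word.
    return qr/(?<!\w)$pattern(?!\w)/;
}

# Each element holds the term and its canonical form, as for load_TermList().
my @term_list = (
    [ 'NF-kappa B', 'NF-kappaB' ],
    [ 'cell cycle', 'cell cycle' ],
);
my @regex_term_list = map { term_to_regex($_->[0]) } @term_list;

my $sentence = 'Activation of NF-kappa  B during the cell cycle.';
for my $i (0 .. $#term_list) {
    print "$term_list[$i][0]\n" if $sentence =~ $regex_term_list[$i];
}
```

Keeping the regular expressions in an array parallel to the term list lets the same index identify a term, its canonical form and its pattern.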

load_Corpus()

load_Corpus($corpus_filename, \%corpus, \%lc_corpus);

This method loads the corpus ($corpus_filename) into a hash table (\%corpus) and prepares a lower-case version of the corpus, recorded in a separate hash table (\%lc_corpus).

corpus_Indexing()

corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $caseSensitive);

This method indexes either the lower-case version of the corpus (\%lc_corpus) or the original-case version (\%corpus), depending on the value of the case-sensitivity parameter ($caseSensitive). The words are stored in the index \%corpus_index (a hash table given by reference).
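The indexing step can be pictured with the following sketch, which maps each word to the sentences containing it. The helper build_index, the index layout (word, then sentence id) and the choice of the lower-case corpus for negative values of $caseSensitive are assumptions for illustration, not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): record, for every word,
# the ids of the sentences in which it occurs.
sub build_index {
    my ($sentences, $index) = @_;
    for my $id (keys %$sentences) {
        for my $word (split /\W+/, $sentences->{$id}) {
            next if $word eq '';    # skip the empty leading field, if any
            $index->{$word}{$id} = 1;
        }
    }
    return;
}

# Sentence id => sentence, as loaded from a one-sentence-per-line file.
my %corpus    = (1 => 'The Cell cycle is regulated.', 2 => 'Cells divide.');
my %lc_corpus = map { $_ => lc $corpus{$_} } keys %corpus;

my $caseSensitive = -1;    # case-insensitive matching
my %corpus_index;
# Assumption: the lower-case corpus is indexed for case-insensitive matching.
build_index($caseSensitive < 0 ? \%lc_corpus : \%corpus, \%corpus_index);

print join(', ', sort keys %corpus_index), "\n";
```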

print_corpus_index()

print_corpus_index(\%corpus_index);

This method prints the corpus index \%corpus_index to STDERR.

term_Selection()

term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $caseSensitive);

This method selects, from the term list (\@term_list), the terms that potentially appear in the corpus (that is, in the indexed corpus \%corpus_index). The results are recorded in the hash table \%idtrm_select.
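A pre-selection of this kind can be sketched as follows. The helper select_terms and its criterion (a term is kept when every one of its words occurs in the index, compared in lower case) are assumptions for illustration; term_Selection() may use a different test.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): keep the ids of the
# terms whose words all occur in the corpus index.
sub select_terms {
    my ($corpus_index, $term_list, $idtrm_select) = @_;
    for my $id (0 .. $#$term_list) {
        # Assumption: the index was built from the lower-case corpus.
        my @words = grep { length $_ } split /\W+/, lc $term_list->[$id][0];
        next unless @words;
        my @missing = grep { !exists $corpus_index->{$_} } @words;
        $idtrm_select->{$id} = 1 unless @missing;
    }
    return;
}

# Word => sentence ids, as produced by an indexing step.
my %corpus_index = (cell => { 1 => 1 }, cycle => { 1 => 1 }, gene => { 2 => 1 });
my @term_list = (
    [ 'cell cycle',      'cell cycle' ],
    [ 'gene expression', 'gene expression' ],
);

my %idtrm_select;
select_terms(\%corpus_index, \@term_list, \%idtrm_select);
print "selected term ids: ", join(', ', sort keys %idtrm_select), "\n";
```

Only the selected terms then need to be matched against the sentences with their regular expressions, which keeps the costly matching step small.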

The parameter $caseSensitive indicates whether term matching is case sensitive (value greater than or equal to 0) or case insensitive (value strictly less than 0). If $caseSensitive is equal to 0, case-sensitive matching is applied to all terms. If $caseSensitive is strictly greater than 0, case-sensitive matching is applied only to terms whose number of characters is less than or equal to $caseSensitive.

term_tagging_offset()

term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);

This method tags the corpus \%corpus with the terms taken from the term list \@term_list (\@regex_term_list holds the corresponding regular expressions) and selected in a previous step (\%idtrm_select). The matching terms are written, with their offset and additional information, to the file $output_filename.

The parameter $caseSensitive indicates whether term matching is case sensitive (value greater than or equal to 0) or case insensitive (value strictly less than 0). If $caseSensitive is equal to 0, case-sensitive matching is applied to all terms. If $caseSensitive is strictly greater than 0, case-sensitive matching is applied only to terms whose number of characters is less than or equal to $caseSensitive.

term_tagging_offset_brat()

term_tagging_offset_brat(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);

This method tags the corpus \%corpus with the terms taken from the term list \@term_list (\@regex_term_list holds the corresponding regular expressions) and selected in a previous step (\%idtrm_select). The matching terms are written, with their offset and additional information, to the file $output_filename in the Brat input format (<http://brat.nlplab.org/>).

The parameter $caseSensitive indicates whether term matching is case sensitive (value greater than or equal to 0) or case insensitive (value strictly less than 0). If $caseSensitive is equal to 0, case-sensitive matching is applied to all terms. If $caseSensitive is strictly greater than 0, case-sensitive matching is applied only to terms whose number of characters is less than or equal to $caseSensitive.

term_tagging_offset_tab()

term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@tab_results, $caseSensitive);

or

term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%tabh_results, $caseSensitive);

This method tags the corpus \%corpus with the terms taken from the term list \@term_list (\@regex_term_list holds the corresponding regular expressions) and selected in a previous step (\%idtrm_select). The matching terms are recorded, with their offset and additional information, either in the array @tab_results (each value is the sentence id, the selected term and the additional information, separated by tab characters) or in the hash table %tabh_results (keys have the form "sentenceid_selectedterm"; each value is an array reference containing the sentence id, the selected term and the additional information).

The parameter $caseSensitive indicates whether term matching is case sensitive (value greater than or equal to 0) or case insensitive (value strictly less than 0). If $caseSensitive is equal to 0, case-sensitive matching is applied to all terms. If $caseSensitive is strictly greater than 0, case-sensitive matching is applied only to terms whose number of characters is less than or equal to $caseSensitive.

printMatchingTerm()

printMatchingTerm($descriptor, $ref_matching_term, $sentence_id);

This method prints the sentence id ($sentence_id) and the matching term (given by the reference $ref_matching_term) to the file descriptor $descriptor. Both values are written on one line, separated by a tab character.

print_brat_output()

print_brat_output($descriptor, $termId, $matching_term, $start_offset, $end_offset);

This method prints, to the file descriptor $descriptor and in the Brat input format, the term id ($termId), its semantic tag, the start and end offsets of the term ($start_offset and $end_offset) and the matching term (given by the reference $matching_term). The values are written on one line and separated by a tab character.
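For orientation, the sketch below formats one text-bound annotation in Brat's standoff layout, where the identifier, the "type start end" group and the covered text are separated by tabs. The helper brat_line and the sample values are hypothetical, and whether print_brat_output() emits exactly this layout is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): format one Brat
# text-bound annotation line (without the trailing newline).
sub brat_line {
    my ($term_id, $semantic_tag, $start_offset, $end_offset, $text) = @_;
    # Brat standoff layout: "T<id>" TAB "<type> <start> <end>" TAB "<text>"
    return sprintf "T%d\t%s %d %d\t%s",
        $term_id, $semantic_tag, $start_offset, $end_offset, $text;
}

# Made-up values: term 1, tagged "protein", covering characters 14 to 24.
print brat_line(1, 'protein', 14, 24, 'NF-kappa B'), "\n";
```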

printMatchingTerm_tab()

printMatchingTerm_tab($ref_matching_term, $sentence_id, $ref_tab_results);

This method stores the sentence id ($sentence_id) and the matching term (given by the reference $ref_matching_term) in $ref_tab_results, which can be a reference to an array or to a hash table. For an array, both values are concatenated into one line, separated by a tab character. For a hash table, both values are stored in an array, and the hash key is the concatenation of the sentence id and the matching term.
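The two storage modes can be sketched with a single function that inspects the type of the container it receives. The name store_match is hypothetical; the argument order mirrors printMatchingTerm_tab(), the key form follows the "sentenceid_selectedterm" pattern described for term_tagging_offset_tab(), and the rest is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): store a match either in
# an array (one tab-separated line) or in a hash (key => array reference).
sub store_match {
    my ($ref_matching_term, $sentence_id, $ref_tab_results) = @_;
    my ($term, @info) = @$ref_matching_term;
    if (ref $ref_tab_results eq 'ARRAY') {
        push @$ref_tab_results, join("\t", $sentence_id, $term, @info);
    }
    else {
        # Key form "sentenceid_selectedterm".
        $ref_tab_results->{"${sentence_id}_$term"} = [ $sentence_id, $term, @info ];
    }
    return;
}

my $match = [ 'cell cycle', 'cell cycle' ];    # term and canonical form

my @tab_results;
store_match($match, 3, \@tab_results);

my %tabh_results;
store_match($match, 3, \%tabh_results);

print "$tab_results[0]\n";
print join(', ', sort keys %tabh_results), "\n";
```

The array form is convenient for printing, while the hash form removes duplicate matches of the same term in the same sentence.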

SEE ALSO

Alvis web site: http://www.alvis.info

Brat: http://brat.nlplab.org/

AUTHORS

Thierry Hamon <thierry.hamon@limsi.fr>

LICENSE

Copyright (C) 2006 by Thierry Hamon

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.