NAME
Alvis::NLPPlatform::UserNLPWrappers - User interface for customizing the NLP wrappers used to linguistically annotate XML documents in Alvis
SYNOPSIS
use Alvis::NLPPlatform::UserNLPWrappers;
Alvis::NLPPlatform::UserNLPWrappers->tokenize($h_config,$doc_hash);
DESCRIPTION
This module is an interface for customising the NLP wrappers. Anyone who wants to integrate a new NLP tool has to overwrite the corresponding default wrapper. The aim of this module is to simplify the development of a specific wrapper, its integration and its use in the platform.
Before developing a new wrapper, copy this file to a local directory, modify the copy, and add that directory to the PERL5LIB environment variable.
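The effect of putting a local directory in PERL5LIB can be illustrated with a small self-contained sketch. It does not use the Alvis modules: Demo::Wrappers is a hypothetical stand-in module, written into two temporary directories that play the roles of the installed location and of the local copy. Directories listed in PERL5LIB are searched before the standard library directories, which the unshift on @INC imitates here.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Write a hypothetical stand-in module Demo::Wrappers into $dir.
# Its origin() method reports which copy was loaded.
sub write_module {
    my ($dir, $label) = @_;
    make_path("$dir/Demo");
    open(my $fh, '>', "$dir/Demo/Wrappers.pm") or die "Cannot write module: $!";
    print $fh "package Demo::Wrappers;\n";
    print $fh "sub origin { return '$label' }\n";
    print $fh "1;\n";
    close($fh) or die "Cannot close module: $!";
}

my $installed_dir = tempdir(CLEANUP => 1);   # plays the system-wide install
my $local_dir     = tempdir(CLEANUP => 1);   # plays the local, modified copy
write_module($installed_dir, 'installed');
write_module($local_dir,     'local');

# The installed copy is on the search path; the local directory is put
# in front of it, as an entry in PERL5LIB would be.
push    @INC, $installed_dir;
unshift @INC, $local_dir;

require Demo::Wrappers;
my $origin = Demo::Wrappers->origin();
print "Loaded copy: $origin\n";
```

Because the local directory comes first in the search path, the modified copy is the one that gets loaded, and the installed file is left untouched.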
METHODS
tokenize()
tokenize($h_config, $doc_hash);
This method carries out the tokenisation of the input document. $doc_hash is the hashtable containing all the annotations of the input document (see the documentation in Alvis::NLPPlatform::NLPWrappers). It is not recommended to overwrite this method.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
The method returns the number of tokens.
scan_ne()
scan_ne($h_config, $doc_hash);
This method wraps the named entity recognition and tagging step. $doc_hash is the hashtable containing all the annotations of the input document. The step aims at annotating semantic units with syntactic and semantic types. Each text sequence corresponding to a named entity is tagged with a unique tag corresponding to its semantic value (for example a "gene" type for gene names, a "species" type for species names, etc.). All these text sequences are also assumed to be equivalent to nouns: the tagger dynamically produces linguistic units equivalent to words or noun phrases.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
word_segmentation()
word_segmentation($h_config, $doc_hash);
This method wraps the default word segmentation step. $doc_hash is the hashtable containing all the annotations of the input document.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
sentence_segmentation()
sentence_segmentation($h_config, $doc_hash);
This method wraps the default sentence segmentation step. $doc_hash is the hashtable containing all the annotations of the input document.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
pos_tag()
pos_tag($h_config, $doc_hash);
This method wraps the Part-of-Speech (POS) tagging step. $doc_hash is the hashtable containing all the annotations of the input document. For every input word, the wrapped POS tagger outputs its tag.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
lemmatization()
lemmatization($h_config, $doc_hash);
This method wraps the lemmatizer. $doc_hash is the hashtable containing all the annotations of the input document. For every input word, the wrapped lemmatizer outputs its lemma, i.e. the canonical form of the word.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
term_tag()
term_tag($h_config, $doc_hash);
This method wraps the term tagging step of the ALVIS NLP Platform. $doc_hash is the hashtable containing all the annotations of the input document. This step aims at recognizing terms in the documents that differ from named entities, such as "gene expression" or "spore coat cell".
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
syntactic_parsing()
syntactic_parsing($h_config, $doc_hash);
This method wraps the sentence parsing step. It aims at exhibiting the graph of the syntactic dependency relations between the words of the sentence. $doc_hash is the hashtable containing all the annotations of the input document.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
Here is an example of how to tune the platform to a domain: we integrated and wrapped the BioLG parser, which is specialized for parsing biology texts.
bio_syntactic_parsing()
bio_syntactic_parsing($h_config, $doc_hash);
This method wraps the sentence parsing step tuned for biology texts. Like the default wrapper (syntactic_parsing), it aims at exhibiting the graph of the syntactic dependency relations between the words of the sentence. $doc_hash is the hashtable containing all the annotations of the input document.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
The parser integrated here is BioLG, a version of the Link Parser tuned for biology (Sampo Pyysalo, Tapio Salakoski, Sophie Aubin and Adeline Nazarenko. Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches. Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM 2006). Pages 60-67. Jena, Germany, 2006).
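The idea of replacing a default wrapper with a domain-specific one can be sketched as follows. This is a minimal illustration under stated assumptions, not the Alvis implementation: Demo::DefaultWrappers and Demo::UserWrappers are hypothetical stand-in packages, and the bio_parser configuration key and the parser_used annotation key are invented for the example. Only the calling convention (a class method receiving $h_config and $doc_hash) is taken from the signatures documented above.

```perl
use strict;
use warnings;

# Stand-in for the default wrappers (hypothetical, not the Alvis code).
package Demo::DefaultWrappers;

sub syntactic_parsing {
    my ($class, $h_config, $doc_hash) = @_;
    $doc_hash->{parser_used} = 'default';   # hypothetical annotation key
    return 0;
}

# Stand-in for the user wrappers: inherits the defaults, overrides one step.
package Demo::UserWrappers;
our @ISA = ('Demo::DefaultWrappers');

sub syntactic_parsing {
    my ($class, $h_config, $doc_hash) = @_;
    # Use the domain-specific tool only when the configuration names one
    # (the 'bio_parser' key is an assumption made for this sketch).
    if (defined $h_config->{bio_parser}) {
        $doc_hash->{parser_used} = $h_config->{bio_parser};
        return 1;
    }
    # Otherwise fall back to the inherited default wrapper.
    return $class->SUPER::syntactic_parsing($h_config, $doc_hash);
}

package main;

my %doc_bio     = ();
my %doc_generic = ();
my $r_bio     = Demo::UserWrappers->syntactic_parsing({ bio_parser => 'BioLG' }, \%doc_bio);
my $r_generic = Demo::UserWrappers->syntactic_parsing({}, \%doc_generic);

print "bio: $doc_bio{parser_used}, generic: $doc_generic{parser_used}\n";
```

Keeping a fallback to the inherited method means that a configuration without the domain-specific setting still goes through the default step, so one overridden wrapper can serve both kinds of documents.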
semantic_feature_tagging()
semantic_feature_tagging($h_config, $doc_hash);
This method wraps the semantic typing step, that is, the attachment of a semantic type to the words, terms and named entities (referred to as lexical items in the following) in documents, according to the conceptual hierarchies of the ontology of the domain.
$doc_hash is the hashtable containing all the annotations of the input document.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
semantic_relation_tagging()
semantic_relation_tagging($h_config, $doc_hash);
This method wraps the semantic relation identification step. The semantic relation annotations provide another level of semantic representation of the document, making explicit the roles that semantic units (usually named entities and/or terms) play with respect to each other, according to the ontology of the domain.
$doc_hash is the hashtable containing all the annotations of the input document.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
anaphora_resolution()
anaphora_resolution($h_config, $doc_hash);
This method wraps the anaphora solver. $doc_hash is the hashtable containing all the annotations of the input document. The step aims at identifying and resolving the anaphora present in a document.
$h_config is the reference to the hashtable containing the variables defined in the configuration file.
SEE ALSO
Alvis web site: http://www.alvis.info
AUTHORS
Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr> and Julien Deriviere <julien.deriviere@lipn.univ-paris13.fr>
LICENSE
Copyright (C) 2005 by Thierry Hamon and Julien Deriviere
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.