NAME

Alvis::NLPPlatform - Perl extension for linguistically annotating XML documents in Alvis

SYNOPSIS

  • Standalone mode:

    use Alvis::NLPPlatform;
    
    Alvis::NLPPlatform::standalone_main(\%config, $doc_xml, \*STDOUT);

  • Distributed mode:

    # Server process
    
    use Alvis::NLPPlatform;
    
    Alvis::NLPPlatform::server($rcfile);
    
    # Client process
    
    use Alvis::NLPPlatform;
    
    Alvis::NLPPlatform::client($rcfile);

DESCRIPTION

This module is the main part of the Alvis NLP platform. It provides overall methods for the linguistic annotation of web documents. Linguistic annotations depend on the configuration variables and dependencies between linguistic steps.

Input documents are assumed to be in the ALVIS XML format (standalone_main) or to be loaded in a hashtable (client_main). The annotated document is recorded in the given descriptor (standalone_main) or returned as a hashtable (client_main).

Linguistic annotation: requirements

  1. Tokenization: this step has no dependency. It is required by
     every subsequent annotation level.
  2. Named Entity Tagging: this step requires tokenization.
  3. Word segmentation: this step requires tokenization.
     Named Entity Tagging is recommended to improve the segmentation.
  4. Sentence segmentation: this step requires tokenization.
     Named Entity Tagging is recommended to improve the segmentation.
  5. Part-of-Speech Tagging: this step requires tokenization, and word and
     sentence segmentation.
  6. Lemmatization: this step requires tokenization,
     word and sentence segmentation, and Part-of-Speech tagging.
  7. Term Tagging: this step requires tokenization,
     word and sentence segmentation, and Part-of-Speech tagging.
     Lemmatization is recommended to improve term recognition.
  8. Parsing: this step requires tokenization, and word and sentence
     segmentation. Term tagging is recommended to improve the parsing of noun phrases.
  9. Semantic feature tagging: To be determined
  10. Semantic relation tagging: To be determined
  11. Anaphora resolution: To be determined

METHODS

compute_dependencies()

compute_dependencies($hashtable_config);

This method processes the configuration variables defining the linguistic annotation steps. $hashtable_config is a reference to the hashtable containing the variables defined in the configuration file. The dependencies between the linguistic annotations are then resolved. For instance, asking for POS annotation implies tokenization, and word and sentence segmentation.
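
The dependency resolution described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the %REQUIRES table and the resolve_dependencies() function are hypothetical names, the table is transcribed from the requirements list, and a flat hash of ENABLE_* flags is assumed for the configuration.

```perl
use strict;
use warnings;

# Hypothetical dependency table, transcribed from the requirements list.
# Lemmatization and term tagging reach tokenization and segmentation
# through their dependency on Part-of-Speech tagging.
my %REQUIRES = (
    ENABLE_TOKEN    => [],
    ENABLE_NER      => ['ENABLE_TOKEN'],
    ENABLE_WORD     => ['ENABLE_TOKEN'],
    ENABLE_SENTENCE => ['ENABLE_TOKEN'],
    ENABLE_POS      => ['ENABLE_TOKEN', 'ENABLE_WORD', 'ENABLE_SENTENCE'],
    ENABLE_LEMMA    => ['ENABLE_POS'],
    ENABLE_TERM_TAG => ['ENABLE_POS'],
    ENABLE_SYNTAX   => ['ENABLE_TOKEN', 'ENABLE_WORD', 'ENABLE_SENTENCE'],
);

# Switch on every step that an enabled step depends on,
# directly or transitively. The hash is modified in place.
sub resolve_dependencies {
    my ($steps) = @_;
    my @todo = grep { $steps->{$_} } sort keys %$steps;
    while (defined(my $step = shift @todo)) {
        for my $needed (@{ $REQUIRES{$step} || [] }) {
            next if $steps->{$needed};
            $steps->{$needed} = 1;
            push @todo, $needed;
        }
    }
    return $steps;
}

# Asking only for lemmatization switches on the steps it relies on.
my %steps = (ENABLE_LEMMA => 1);
resolve_dependencies(\%steps);
print join(' ', sort grep { $steps{$_} } keys %steps), "\n";
```

Recommended (but not required) steps, such as Named Entity Tagging before word segmentation, are deliberately left out of the table in this sketch.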

starttimer()

starttimer();

This method records the current date and time. It is used to compute the duration of a processing step.

endtimer()

endtimer();

This method ends the timer and returns the time of a processing step, according to the time recorded by starttimer().
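
The start/end timer pairing can be sketched with the core Time::HiRes module. The names start_timer() and end_timer() are hypothetical stand-ins, and the use of Time::HiRes is an assumption; the module itself may measure time differently.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $timer_start;    # set by start_timer(), read by end_timer()

# Record the current time.
sub start_timer {
    $timer_start = [gettimeofday];
    return;
}

# Return the number of seconds elapsed since the last start_timer() call.
sub end_timer {
    return tv_interval($timer_start, [gettimeofday]);
}

# Time an arbitrary piece of work standing in for a processing step.
start_timer();
my $sum = 0;
$sum += $_ for 1 .. 100_000;
printf "step took %.6f seconds\n", end_timer();
```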

linguistic_annotation()

linguistic_annotation($h_config,$doc_hash);

This method carries out the linguistic annotation according to the list of required annotations. Required annotations are defined by the configuration variables ($h_config is a reference to the hashtable containing the variables defined in the configuration file).

The document to annotate is passed as a hash table ($doc_hash). The method adds the annotations to this hash table.

standalone_main()

standalone_main($hash_config, $doc_xml, \*STDOUT);

This method is used to annotate a document in the standalone mode of the platform. The document ($doc_xml) is given in the ALVIS XML format.

The document is loaded into memory and then annotated according to the steps defined in the configuration variables ($hash_config is a reference to the hashtable containing the variables defined in the configuration file). The annotated document is printed to the file defined by the descriptor given as parameter (in the given example, the standard output). $printCollectionHeaderFooter indicates whether the documentCollection header and footer are printed.

client_main()

client_main($doc_hash, $r_config);

This method is used to annotate a document in the distributed mode of the NLP platform. The document, given in the ALVIS XML format, is already loaded into memory ($doc_hash).

The document is annotated according to the steps defined in the configuration variables. The annotated document is returned to the calling method.

load_config()

load_config($rcfile);

This method loads the configuration of the NLP Platform by reading the configuration file given as argument.

client()

sigint_handler()

sigint_handler($signal, $r_config);

This method is used to catch the INT signal and send an ABORTING message to the server.

server()

disp_log()

disp_log($hostname,$message);

This method prints the message ($message) on the standard error output, in a formatted way:

date: (client=hostname) message
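
As an illustration of this layout, the sketch below builds such a line and prints it on the standard error output. The function name format_log_line() is hypothetical, and the timestamp format is an assumption, since the exact date format used by disp_log() is not specified here.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build one log line in the "date: (client=hostname) message" layout.
# The timestamp format is an assumption made for this sketch.
sub format_log_line {
    my ($hostname, $message) = @_;
    my $date = strftime('%Y-%m-%d %H:%M:%S', localtime);
    return "$date: (client=$hostname) $message";
}

print STDERR format_log_line('localhost', 'document received'), "\n";
```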

split_to_docRecs()

split_to_docRecs($xml_docs);

This method splits a list of documents into a table and returns it. Each element of the table is a two-element table containing the document id and the document.

sub_dir_from_id()

sub_dir_from_id($doc_id)

This method returns the subdirectory where the annotated document will be stored. It computes the subdirectory from the first two characters of the document id ($doc_id).
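
A minimal sketch of this computation is given below. The function name sub_dir_sketch() and the example output directory are hypothetical; the only behaviour taken from the description is the use of the first two characters of the document id.

```perl
use strict;
use warnings;
use File::Spec;

# Return the subdirectory name: the first two characters of the document id.
sub sub_dir_sketch {
    my ($doc_id) = @_;
    return substr($doc_id, 0, 2);
}

# Hypothetical use: build the directory under an assumed output root.
my $outdir = 'annotated';    # stands in for the OUTDIR configuration variable
my $doc_id = 'A1B2C3D4';     # made-up document id
print File::Spec->catdir($outdir, sub_dir_sketch($doc_id)), "\n";
```

Spreading documents over subdirectories in this way keeps any single directory from holding every annotated document.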

record_id()

record_id($doc_id, $r_config);

This method records, in the file $ALVISTMP/.proc_id, the id of the document that has been sent to the client.

delete_id()

delete_id($doc_id,$r_config);

This method deletes the id of the document that has been sent to the client from the file $ALVISTMP/.proc_id.

init_server()

init_server($r_config);

This method initializes the server. It reads the document ids from the file $ALVISTMP/.proc_id and loads the corresponding documents, i.e. documents which have been annotated but not recorded due to a server crash.
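
The interplay between record_id(), delete_id() and init_server() amounts to keeping a small journal of in-flight document ids. The sketch below illustrates the idea under stated assumptions: the functions journal_add(), journal_remove() and journal_list() are hypothetical, the journal is assumed to hold one id per line, and a temporary directory stands in for $ALVISTMP.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Append a document id to the journal (one id per line, an assumption).
sub journal_add {
    my ($file, $doc_id) = @_;
    open(my $fh, '>>', $file) or die "cannot append to $file: $!";
    print {$fh} "$doc_id\n";
    close($fh) or die "cannot close $file: $!";
    return;
}

# Return the ids currently in the journal (empty list if there is no file).
sub journal_list {
    my ($file) = @_;
    return () unless -e $file;
    open(my $fh, '<', $file) or die "cannot read $file: $!";
    chomp(my @ids = <$fh>);
    close($fh);
    return @ids;
}

# Rewrite the journal without the given document id.
sub journal_remove {
    my ($file, $doc_id) = @_;
    my @kept = grep { $_ ne $doc_id } journal_list($file);
    open(my $fh, '>', $file) or die "cannot rewrite $file: $!";
    print {$fh} "$_\n" for @kept;
    close($fh) or die "cannot close $file: $!";
    return;
}

# A temporary directory stands in for $ALVISTMP in this sketch.
my $journal = File::Spec->catfile(tempdir(CLEANUP => 1), '.proc_id');
journal_add($journal, 'doc-001');       # document sent to a client
journal_add($journal, 'doc-002');
journal_remove($journal, 'doc-001');    # document handled, forget it
print "still pending: @{[ journal_list($journal) ]}\n";
```

After a crash, whatever ids remain in such a journal identify the documents that still need attention at start-up.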

PLATFORM CONFIGURATION

The configuration file of the NLP Platform is composed of global variables and divided into several sections:

  • Global variables.

    The two mandatory variables are ALVISTMP and PRESERVEWHITESPACE. ALVISTMP defines the temporary directory used during the annotation process. It must be writable by the user the process is running as. PRESERVEWHITESPACE is a boolean indicating whether white space is preserved during linguistic annotation, i.e. XML blank nodes and white space at the beginning and the end of any line.

    Additional variables and environment variables can be used if they are interpolated in the configuration file. For instance, in the default configuration file, we add PLATFORM_ROOT, NLP_tools_root, and AWK.

    • ALVISTMP : temporary directory where files are recorded (XML files and input/output of the NLP tools) during the annotation step.

  • Section alvis_connection

    • HARVESTER_PORT: the port of the harvester/crawler (combine) that the platform will read from to get the documents to annotate.

    • NEXTSTEP: indicates if there is a next step in the pipeline (for instance, the indexer IdZebra). The value is 0 or 1.

    • NEXTSTEP_HOST: the host name of the component that the platform will send the annotated document to.

    • NEXTSTEP_PORT: the port of the component that the platform will send the annotated document to.

    • SPOOLDIR: the directory where the documents coming from the harvester are stored.

      It must be writable by the user the process is running as.

    • OUTDIR: the directory where the annotated documents are stored if SAVE_IN_OUTDIR (in Section NLP_misc) is set.

      It must be writable by the user the process is running as.

  • Section NLP_connection

    • SERVER: The host name where the NLP server is running, for the connections with the NLP clients.

    • PORT: The listening port of the NLP server, for the connections with the NLP clients.

    • RETRY_CONNECTION: The number of times that a client attempts to connect to the server.

  • Section linguistic_annotation

    This section defines the NLP steps that will be used for annotating documents. The values are 0 or 1.

    • ENABLE_TOKEN: toggles the tokenization step.

    • ENABLE_NER: toggles the named entity recognition step.

    • ENABLE_WORD: toggles the word segmentation step.

    • ENABLE_SENTENCE: toggles the sentence segmentation step.

    • ENABLE_POS: toggles the Part-of-Speech tagging step.

    • ENABLE_LEMMA: toggles the lemmatization step.

    • ENABLE_TERM_TAG: toggles the term tagging step.

    • ENABLE_SYNTAX: toggles the parsing step.

  • Section NLP_misc

    This section defines miscellaneous variables for the NLP annotation steps.

    • NLP_resources: the root directory where NLP resources can be found.

    • SAVE_IN_OUTDIR: enables or disables saving the annotated documents in the OUTDIR directory.

    • TERM_LIST_EN: the path of the term list for English.

    • TERM_LIST_FR: the path of the term list for French.

  • Section NLP_tools

    This section defines the command line for the NLP tools integrated in the platform.

    Additional variables and environment variables can be used for interpolation.

    • NETAG_EN: command line for the Named Entity Recognizer for English.

    • NETAG_FR: command line for the Named Entity Recognizer for French.

    • WORDSEG_EN: command line for the word segmenter for English.

    • WORDSEG_FR: command line for the word segmenter for French.

    • POSTAG_EN: command line for the Part-of-Speech tagger for English.

    • POSTAG_FR: command line for the Part-of-Speech tagger for French.

    • SYNTACTIC_ANALYSIS_EN: command line for the parser for English.

    • SYNTACTIC_ANALYSIS_FR: command line for the parser for French.

    • TERM_TAG_EN: command line for the term tagger for English.

    • TERM_TAG_FR: command line for the term tagger for French.
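
The constraints stated in this section (two mandatory global variables, a writable ALVISTMP directory, step toggles restricted to 0 or 1) can be checked before starting an annotation run. The sketch below is only an illustration: check_config() is a hypothetical function, and the configuration is assumed to be available as a hash with the global variables at the top level and each section as a nested hash, which may not match the structure actually returned by load_config().

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Return a list of problems found in the configuration (empty if none).
# The hash layout is an assumption made for this sketch.
sub check_config {
    my ($config) = @_;
    my @problems;

    # The two mandatory global variables.
    for my $name (qw(ALVISTMP PRESERVEWHITESPACE)) {
        push @problems, "missing mandatory variable $name"
            unless defined $config->{$name};
    }

    # ALVISTMP must be a directory writable by the current user.
    if (defined $config->{ALVISTMP}) {
        push @problems, "ALVISTMP is not a writable directory"
            unless -d $config->{ALVISTMP} && -w _;
    }

    # Every toggle of the linguistic_annotation section must be 0 or 1.
    my $steps = $config->{linguistic_annotation} || {};
    for my $name (sort keys %$steps) {
        my $value = defined $steps->{$name} ? $steps->{$name} : '';
        push @problems, "$name must be 0 or 1" unless $value =~ /^[01]$/;
    }
    return @problems;
}

# Example with made-up values; a temporary directory stands in for ALVISTMP.
my %config = (
    ALVISTMP              => tempdir(CLEANUP => 1),
    PRESERVEWHITESPACE    => 1,
    linguistic_annotation => { ENABLE_TOKEN => 1, ENABLE_POS => 1 },
);
my @problems = check_config(\%config);
print @problems ? join("\n", @problems) . "\n" : "configuration looks usable\n";
```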

DEFAULT INTEGRATED/WRAPPED NLP TOOLS

Several NLP tools have been integrated in wrappers. In this section, we summarize how to download and install the NLP tools used by default in the Alvis::NLPPlatform::NLPWrappers.pm module. We also give additional information about the tools.

Named Entity Tagger

We integrated TagEN as the default named entity tagger.

  • Form:

    sources, binaries and Perl scripts

  • Obtain:

    http://www-lipn.univ-paris13.fr/~hamon/ALVIS/Tools/TagEN.tar.gz

  • Install:

    untar TagEN.tar.gz in a directory
    go to the src directory
    run the compile script

  • Licence:

    GPL

  • Version number required:

    any

  • Additional information:

    This named entity tagger can be run in various modes. A mode is defined by Unitex (http://www-igm.univ-mlv.fr/~unitex/) graphs. The tagger can be used for English and French texts.

Word and sentence segmenter

The word and sentence segmenter we use by default is an AWK script sent by Gregory Grefenstette on the Corpora mailing list. We modified it to segment French texts.

  • Form:

    AWK script

  • Obtain:

    http://www-lipn.univ-paris13.fr/~hamon/ALVIS/Tools/WordSeg.tar.gz

  • Install:

    untar WordSeg.tar.gz in a directory

  • Licence:

    GPL

  • Version number required:

    any (modifications for French by Paris 13)

Part-of-Speech Tagger

The default wrapper calls the TreeTagger. This tool is a Part-of-Speech tagger and lemmatizer.

  • Form:

    binary + resources

  • Obtain:

    links and instructions at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

  • Install:

    Instructions are given on the web site. To summarize, you need to:

    • make a directory named, for instance, TreeTagger

    • download the archives into tools/TreeTagger

    • go to the directory tools/TreeTagger

    • run install-tagger.sh

  • Licence:

    free for research only

  • Version number required:

    (by date) >= 09.04.1996

Term Tagger

We have integrated a tool developed specifically for the Alvis project. It will be available as a Perl script soon.

  • Form:

    Perl script

  • Obtain:

    http://www-lipn.univ-paris13.fr/~hamon/ALVIS/Tools/TermTagger.tar.gz

  • Install:

    untar TermTagger.tar.gz in a directory

  • Licence:

    GPL

  • Version number required:

    any

Part-of-Speech Tagger specialized for biological texts

GeniaTagger (POS and lemma tagger):

  • Form:

    source + resources

  • Obtain:

    links and instructions at
    http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/postagger/geniatagger-2.0.1.tar.gz

  • Install:

    untar geniatagger-2.0.1.tar.gz in a directory

    cd tools/geniatagger-2.0.1

    run make

  • Licence:

    free for research only (and WordNet licence for the dictionary)

  • Version number required:

    2.0.1

Parser

Link Grammar Parser:

  • Form:

    sources + resources

  • Obtain:

    http://www.link.cs.cmu.edu/link/ftp-site/link-grammar/link-4.1b/unix/link-4.1b.tar.gz

  • Install:

    untar link-4.1b.tar.gz

    See the Makefile for configuration

    run make

    Apply the additional patch for the Link Grammar parser (lib/Alvis/NLPPlatform/patches):

        cd link-4.1b
        patch -p0 < lib/Alvis/NLPPlatform/patches/link-4.1b-WithWhiteSpace.diff

    A similar patch exists for version 4.1a of the Link Grammar parser.

  • Licence:

    Compatible with GPL

  • Version number required:

    4.1a or 4.1b

Parser specialized for biological texts

BioLG:

  • Form:

    sources + resources

  • Obtain:

    http://www-lipn.univ-paris13.fr/~hamon/ALVIS/Tools/biolgForAlvis.tar.gz

  • Install:

    untar

    See the Makefile for configuration

    run make

  • Licence:

    Compatible with GPL

  • Version number required:

    1.1.7b

TUNING THE NLP PLATFORM

The main characteristic of the NLP platform is that it can be tuned to the domain (the language specificity of the documents to be annotated) and to the user requirements. The tuning can be done at two levels:

  • Resources adapted to the domain, or describing it more precisely,
    can be exploited.

    In that respect, tuning concerns the integration of these resources in the NLP tools used in the platform. The command line in the configuration file can be modified.

    An example of resource switching can be found at the named entity recognition step. The default Named Entity tagger can use either bio-medical resources or more general ones, according to the value of the parameter -t.

  • Other NLP tools can be integrated in the NLP platform.

    In that case, new wrappers must be written. To ease the integration of new NLP tools, we use polymorphism to override the default wrappers. The NLP platform is organized as a three-level package hierarchy: Alvis::NLPPlatform is at the top, Alvis::NLPPlatform::NLPWrappers is at the deepest level, and Alvis::NLPPlatform::UserNLPWrappers sits between the two. Integrating a new NLP tool therefore means writing a new wrapper by modifying the corresponding methods in Alvis::NLPPlatform::UserNLPWrappers, which may or may not call the default methods.

    NB: If the package Alvis::NLPPlatform::UserNLPWrappers is not writable by the user, the tuning can be done by copying Alvis::NLPPlatform::UserNLPWrappers into a local directory, and by adding this local directory to the PERL5LIB variable (before the path of Alvis::NLPPlatform).

    NB: A template for the package Alvis::NLPPlatform::UserNLPWrappers can be found in Alvis::NLPPlatform::UserNLPWrappers-template.

    An example of such tuning can be found at the parsing level. We integrate a parser designed for biological documents in Alvis::NLPPlatform::UserNLPWrappers.
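
The override mechanism can be illustrated with a self-contained sketch. The packages Demo::NLPWrappers and Demo::UserNLPWrappers and the methods parse() and pos_tag() are hypothetical stand-ins chosen for this example; they are not the actual method names of the Alvis wrappers. The sketch shows the two options mentioned above: replacing a default method outright, or calling the default method and post-processing its result.

```perl
use strict;
use warnings;

# Stands in for the default wrappers (hypothetical package and methods).
package Demo::NLPWrappers;

sub parse   { my ($class, $text) = @_; return "default-parse($text)"; }
sub pos_tag { my ($class, $text) = @_; return "default-pos($text)"; }

# Stands in for the user-level package that overrides the defaults.
package Demo::UserNLPWrappers;
our @ISA = ('Demo::NLPWrappers');

# Option 1: replace the default parser without calling it.
sub parse {
    my ($class, $text) = @_;
    return "bio-parse($text)";
}

# Option 2: call the default tagger, then post-process its output.
sub pos_tag {
    my ($class, $text) = @_;
    my $tagged = $class->SUPER::pos_tag($text);
    return uc $tagged;
}

package main;

# The caller always goes through the user-level package.
print Demo::UserNLPWrappers->parse('p53 binds DNA'), "\n";
print Demo::UserNLPWrappers->pos_tag('p53 binds DNA'), "\n";
```

Because the caller only ever addresses the user-level package, methods that are not overridden there fall through to the defaults by inheritance.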

SEE ALSO

Alvis web site: http://www.alvis.info

AUTHORS

Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr> and Julien Deriviere <julien.deriviere@lipn.univ-paris13.fr>

LICENSE

Copyright (C) 2005 by Thierry Hamon and Julien Deriviere

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.