NAME

Treex::Tutorial::FirstSteps - First steps after installing Treex

VERSION

version 0.07297

SHORT INTRODUCTION

The elementary unit of code in Treex is called a block. Each block should solve a well-defined and usually linguistically motivated task, e.g. tokenization, tagging or parsing.

A sequence of blocks is called a scenario, and it can describe an end-to-end NLP application, e.g. machine translation or preprocessing of a parallel treebank.

Treex applications can be executed from Perl. However, in this tutorial we'll start with the command line interface treex.

HELLO WORLD

We will start traditionally with the "Hello, world!" example :-).

echo 'Hello, world!' | treex Read::Text language=en Write::Text language=en

The desired output was printed to STDOUT, but it is surrounded by info messages printed to STDERR. To filter out these messages, you can either use the --quiet option (-q) or the standard redirection of STDERR.

echo 'Hello, world!' | treex -q Read::Text language=en Write::Text language=en
echo 'Hello, world!' | treex Read::Text language=en Write::Text language=en 2>/dev/null

What does the syntax mean?

Read::Text language=en Write::Text language=en is a scenario definition. The scenario consists of two blocks: Read::Text and Write::Text. Each block has one parameter set: the parameter name is language and its value is en (the ISO 639-1 code for English).
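
The structure of such a scenario string can be made visible with a few lines of shell. This is only a sketch of the "Block param=value" convention described above: describe_scenario is a hypothetical helper, it is not Treex's own scenario parser, and it assumes parameter values contain no spaces or quotes.

```shell
# Hypothetical helper (not part of Treex): list the blocks and parameters
# of a scenario string. Assumes tokens are separated by whitespace and that
# anything containing "=" is a parameter of the preceding block.
describe_scenario() {
    for token in $1; do
        case "$token" in
            *=*) printf '    parameter %s = %s\n' "${token%%=*}" "${token#*=}" ;;
            *)   printf 'block %s\n' "$token" ;;
        esac
    done
}

describe_scenario 'Read::Text language=en Write::Text language=en'
# block Read::Text
#     parameter language = en
# block Write::Text
#     parameter language = en
```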

Why is the language parameter needed?

One Treex document can contain sentences in more than one language (which is useful for tasks like word alignment or machine translation), so it is necessary to tell each block which language it should operate on.

Can I make the scenario description shorter?

It is not necessary to repeat the same parameter specification for every block. You can use a special block Util::SetGlobal:

echo 'Hello, world!' | treex -q Util::SetGlobal language=en Read::Text Write::Text

Can I make it even shorter?

Yes. (And I know the previous example was not actually shorter.) There is an option --language (-L) which is just a shortcut for Util::SetGlobal language=...

echo 'Hello, world!' | treex -q --language=en Read::Text Write::Text
echo 'Hello, world!' | treex -q -Len Read::Text Write::Text

The "Hello, world!" example is silly. The first block (a so-called reader) read the plain text input and converted it to the Treex in-memory document representation. This document was then passed to the second block (a so-called writer), which converted it back to plain text and printed it to STDOUT. No (linguistic) processing was done.

There are readers and writers for various formats other than plain text (e.g. HTML, CoNLL, PennTB MRG, PDT PML), so you can use Treex for format conversions (see Treex::Tutorial::ReadersAndWriters). You can also create your own readers and writers for new formats (see Treex::Tutorial::WritingNewReaders).

For simplicity, we'll continue to use plain text format in this tutorial chapter, but we'll try to do something slightly more interesting.

SEGMENTATION TO SENTENCES

To segment a text into sentences we can use the block W2A::Segment and the writer Write::Sentences, which prints each sentence on a separate line.

echo "Hello! Mr. Brown, how are you?" \
 | treex -Len Read::Text W2A::Segment Write::Sentences

You can see that the text was segmented into three sentences: "Hello!", "Mr.", and "Brown, how are you?". Block W2A::Segment is language independent (at least for languages using the Latin alphabet) and finds sentence boundaries based only on regex rules that detect sentence-final punctuation ([.?!]) followed by a capital letter. To get the correct segmentation we must use W2A::EN::Segment, which has a list of English words (tokens) that usually do not end a sentence even if they are followed by a full stop and a capital letter. By the way, Treex is object-oriented: blocks are classes, and W2A::EN::Segment is a descendant of the W2A::Segment base class.

echo "Hello! Mr. Brown, how are you?" \
 | treex -Len Read::Text W2A::EN::Segment Write::Sentences

Where can I find blocks' source code?

All Treex blocks are stored in $TMT_ROOT/treex/lib/Treex/Block/. The full name of the W2A::EN::Segment module is actually Treex::Block::W2A::EN::Segment, but since the prefix Treex::Block:: is common to all blocks, it is not written in the scenario description.
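
The mapping from a short block name to its module name and source file can be sketched in shell. The helpers block_to_module and block_to_path are hypothetical; the sketch assumes the usual Perl convention that package A::B::C lives in A/B/C.pm, and that TMT_ROOT points to your Treex installation.

```shell
# Hypothetical helpers (not part of Treex).

# Prepend the common prefix to get the full Perl module name.
block_to_module() {
    printf 'Treex::Block::%s\n' "$1"
}

# Turn "::" into "/" and append ".pm" to get the source file.
# If TMT_ROOT is unset, the literal text $TMT_ROOT is printed instead.
block_to_path() {
    printf '%s/treex/lib/Treex/Block/%s.pm\n' \
        "${TMT_ROOT:-\$TMT_ROOT}" \
        "$(printf '%s' "$1" | sed 's|::|/|g')"
}

block_to_module W2A::EN::Segment
# Treex::Block::W2A::EN::Segment

block_to_path W2A::EN::Segment
```

The printed module name is what you would pass to perldoc, and the printed path is the file to open if you want to read the block's source.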

What does the name W2A::EN::Segment mean?

All Treex blocks that do shallow linguistic analysis (segmentation, tokenization, lemmatization, PoS tagging, dependency parsing) are grouped in the directory W2A (W and A are the names of two layers of language description, but this will be explained later). Language-specific blocks are stored in a subdirectory named with the uppercase ISO code of the given language (EN in this case).

How to read already segmented input?

If you have a file sample.txt with one sentence per line, you can load it into Treex using

cat sample.txt | treex -Len Read::Sentences ...

There are many other options for segmentation; see (perldoc for) the modules Treex::Block::W2A::Segment, Treex::Block::W2A::SegmentOnNewlines, and Treex::Block::W2A::ResegmentSentences.

TOKENIZATION, LEMMATIZATION, TAGGING

echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences W2A::TokenizeOnWhitespace Write::CoNLLX

echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences W2A::Tokenize Write::CoNLLX

echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences W2A::EN::Tokenize Write::CoNLLX

echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences\
            W2A::EN::Tokenize\
            W2A::TagTreeTagger\
            W2A::EN::Lemmatize\
            Write::CoNLLX

echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences\
            W2A::EN::Tokenize\
            W2A::EN::TagMorce\
            W2A::EN::Lemmatize\
            Write::CoNLLX

echo "Es tut mir leid." |\
 treex -Lde Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX
echo "Lo siento" |\
 treex -Les Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX
echo "Mi dispiace" |\
 treex -Lit Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX
echo "Je suis désolée" |\
 treex -Lfr Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX
echo "Bohužel jsem tento tutorial nedokončil." |\
 treex -Lcs Read::Sentences W2A::CS::Tokenize W2A::CS::TagMorce Write::CoNLLX

TASKS

Task1:

You have an input plain text (TODO: add paragraphs_sample.txt) where each paragraph (including headlines) is on a separate line. Load this file into Treex and print one sentence per line. Note that headlines do not end with a full stop, but they should be treated as separate sentences.

HINT: See documentation of Treex::Block::W2A::Segment.

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
