NAME
Treex::Tutorial::Install - Installation Guide for the Treex NLP framework
VERSION
version 2.20150928
SYNOPSIS
This synopsis is just an overview of the six steps, which are described below in more detail.
- 1. Prepare your local Perl environment, so Perl modules will be installed to ~/perl5.
We expect no admin rights and no previous local Perl environment. If env | grep PERL prints something or directory ~/.cpan exists, there is a risk that your previously installed local Perl environment will be in conflict with the new one.
wget -O- http://cpanmin.us | perl - -l ~/perl5 App::cpanminus local::lib
eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`
echo '## Treex installation ##' >> ~/.bashrc
echo 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`' >> ~/.bashrc
grep bashrc ~/.bash_profile || echo 'source ~/.bashrc' >> ~/.bash_profile
- 2. Install Treex::Core and its dependencies from CPAN
# First, try to install XML::LibXML
cpanm XML::LibXML
# If it fails and the build.log contains "looking for -lxml2... no",
# you are probably missing libxml2 header files (or the whole libxml2).
# On Ubuntu/Debian you can install it with
# sudo apt-get install libxml2-dev zlib1g-dev
# Few more possibly problematic modules
cpanm -n PerlIO::Util
cpanm Moose
moose-outdated | cpanm
# and finally the Treex::Core and its dependencies
cpanm Treex::Core # this may take about 10 minutes
treex -h # just to check if it was installed correctly
- 3. Install Treex modules for processing English
cpanm Treex::EN cpanm Lingua::Interset URI::Find Cache::LRU
- 4. Install TrEd tree viewer and editor (optional)
See the TrEd home page for details. To install the Perl Tk module, you need several header files; on Ubuntu/Debian you can install them with sudo apt-get install libx11-dev libxft-dev libfontconfig1-dev libpng12-dev patch.
# Get a script which automatically downloads and builds everything else
wget http://ufal.mff.cuni.cz/tred/install_tred.bash
bash install_tred.bash --tred-dir ~/tred
# Instruct Treex where to find TrEd and its dependencies
echo "tred_dir: $HOME/tred" >> ~/.treex/config.yaml
echo "source ~/tred/bin/init_tred_environment" >> ~/.bashrc
source ~/tred/bin/init_tred_environment
ttred # run TrEd with Treex extension
- 5. Download the newest version of the whole Treex from GIT repository (optional)
git clone https://github.com/ufal/treex.git ~/treex
# Add the following lines to your ~/.bashrc
export PATH="$HOME/treex/bin:$PATH"
export PERL5LIB="$HOME/treex/lib:$PERL5LIB"
export TMT_ROOT=$HOME/.treex
- 6. Install MorphoDiTa tagger and NameTag NER (optional)
cpanm Ufal::MorphoDiTa Ufal::NameTag
Prerequisites
In this tutorial, we expect a Linux OS with the Bash shell and Perl 5.10 or higher. Basic development tools, such as make, patch, and a C compiler (gcc), are also required. You can easily use a different shell (e.g. csh); just modify the shell commands accordingly. It is also possible to install Treex on MacOS and Windows+StrawberryPerl, but these platforms are less tested so far. If you have a Perl version older than 5.10 (or if you just want to try the newest Perl), you can install your own Perl using perlbrew, which is really simple.
Note that if you have Windows and only want to browse *.treex files, you can install TrEd and (in menu Setup - Manage Extensions - Get New Extension) select the EasyTreex extension. However, to complete this tutorial you need to install Treex (and set up TrEd) as described below, so EasyTreex is superfluous.
1. Prepare local Perl environment
In order to install Treex, you must be able to install Perl modules from CPAN. This step is not specific to Treex; it is a basic Perl skill. There are several ways to achieve this, but I consider the one described here the easiest. There are two things you should be aware of:
If env | grep PERL prints something or the directory ~/.cpan exists (ls -l ~/.cpan), it is probable that you have already configured a local Perl environment. In such a case it is important to
- either reuse the environment and skip this step
If you used local::lib or perlbrew to set up the environment, it should be configured properly and you can continue with step 2 (if you want to use cpanm, install it by cpan App::cpanminus). If you used another method, such as modifying $PERL5LIB in your ~/.bashrc or setting the PREFIX or INSTALL_BASE options in the cpan configuration, there is a possibility that your previously installed local Perl environment is configured only partially and the procedure described here may fail. If you decide to reuse your previous local Perl environment, the modules will be installed to whatever path you had chosen (instead of ~/perl5) and you should skip this step 1 (otherwise the installation fails with "WHOA THERE! It looks like you've got ..." in ~/.cpanm/build.log).
- or completely deactivate the environment before doing this step.
If you do not need/want to use your previous local Perl environment, you should delete (rename) the ~/.cpan directory and edit your shell profile (~/.bashrc, ~/.profile etc.), so that no Perl-related variables (such as PERL5LIB, PERL_MB_OPT, PERL_MM_OPT) are exported. After starting a new shell (new ssh session), env | grep PERL should print nothing.
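The two checks described above can be bundled into one small helper. This is only a sketch: the function name check_perl_env and its optional directory argument are hypothetical and are not part of Treex, cpanm or local::lib.

```shell
# Hypothetical helper (not part of Treex): report signs of an existing
# local Perl environment. Exit status 0 means nothing was found, 1 otherwise.
check_perl_env() (
    home_dir=${1:-$HOME}
    found=0
    # Same test as "env | grep PERL" in the text above.
    if env | grep PERL >/dev/null; then
        echo "Perl-related environment variables are set:"
        env | grep PERL
        found=1
    fi
    # Same test as "ls -l ~/.cpan" in the text above.
    if [ -d "$home_dir/.cpan" ]; then
        echo "$home_dir/.cpan exists"
        found=1
    fi
    [ "$found" -eq 0 ] && echo "No previous local Perl environment detected"
    exit "$found"
)

check_perl_env || echo "Reuse or deactivate the old environment before continuing"
```

A non-zero exit status only means that one of the two signs was found; deciding whether to reuse or deactivate the old environment is still up to you.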
If you have root access and really want to install Treex and its dependencies for all users to system paths (/usr/lib etc.), just skip this step (you can install cpanm by sudo cpan App::cpanminus). However, in the course of this tutorial you will be advised to modify some of the modules (Treex::Block::Tutorial::*), so it may be a good compromise to install only the dependencies to system paths using sudo cpanm --installdeps Treex::Core, but otherwise follow this local Perl setup.
Download and install locally two useful tools (Perl modules), cpanm and local::lib:
wget -O- http://cpanmin.us | perl - -l ~/perl5 App::cpanminus local::lib
App::cpanminus provides cpanm script which is a fast, dependency free, zero-configuration substitute for the standard cpan. local::lib takes care of setting all the environment variables needed to install modules without administrative privileges.
Instead of wget -O-, you can use curl -L or simply download cpanm from http://cpanmin.us, save it as cpanm and run perl cpanm -l ~/perl5 App::cpanminus local::lib. Instead of ~/perl5, you can use any path you like, but ~/perl5 is a common standard used in this tutorial.
In the following steps, you can use cpan instead of cpanm. The advantage is that you can start an interactive cpan shell which provides more features (I recommend installing Bundle::CPAN and Term::ReadLine::Perl first, so you can browse the history using the up/down keys). The disadvantage is that you cannot use it for installing local::lib locally before local::lib is installed :-). Also, you will need to go through a configuration dialogue when cpan is executed for the first time.
eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`
echo '## Treex installation ##' >> ~/.bashrc
echo 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`' >> ~/.bashrc
grep bashrc ~/.bash_profile || echo 'source ~/.bashrc' >> ~/.bash_profile
The first line sets up the environment variables $PERL5LIB, $PATH, $PERL_MB_OPT etc. for the current shell session. It enables you to use the modules installed in ~/perl5 (without specifying this path using perl -I) and also it ensures that new modules will be installed (using cpanm or cpan) to ~/perl5 (not to the system paths). The third line ensures that this setting will be applied also in other (non-login) shell sessions. The fourth line ensures that this setting will be applied also in "login" shell sessions (e.g. when you log in via ssh). If you prefer to use ~/.profile instead of ~/.bash_profile, adapt the fourth line accordingly.
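If you expect to run the setup more than once, the echo ... >> lines above will add duplicate entries to your profile. A minimal sketch of one way to avoid that follows; the helper name append_once is hypothetical, and unlike the grep bashrc test above it looks for the exact line rather than for any mention of bashrc.

```shell
# Hypothetical helper: append LINE to FILE unless FILE already contains
# exactly that line, so re-running the setup does not duplicate entries.
append_once() (
    line=$1
    file=$2
    touch "$file"
    # -x: match whole lines only, -F: treat LINE as a fixed string
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
)

# Intended use, mirroring the commands above (left commented out so that
# pasting this block does not modify your profile):
# append_once '## Treex installation ##' ~/.bashrc
# append_once 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`' ~/.bashrc
# append_once 'source ~/.bashrc' ~/.bash_profile
```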
2. Install Treex::Core from CPAN
Treex is divided into several CPAN distributions. Treex::Core contains the main ("core") functionality and almost all other Treex modules depend on it. Treex::Core itself has many dependencies, most notably Moose and Treex::PML (which have many dependencies and so on), so the installation takes several minutes. One of the most frequent problems in installation is that the Perl module XML::LibXML, which is a binding for the libxml2 library, needs apart from the library also its header files (*.h). So let's check first whether you can install XML::LibXML:
cpanm XML::LibXML
If it fails and ~/.cpanm/build.log contains "Cannot write to /usr/lib/ ... XML/SAX.pm line 191", try to run it again and it should show that it was actually installed. If it fails and ~/.cpanm/build.log contains "looking for -lxml2... no", you are probably missing the header files or the whole library. On Ubuntu/Debian you can install it with:
sudo apt-get install libxml2-dev zlib1g-dev
If you know a simple way to do this without admin privileges, let me know. You can check for the packages with LANG=C dpkg-query -s libxml2-dev zlib1g-dev 2>&1 | grep Package. On other systems (e.g. RPM based), try to find similarly named packages (libxml2-devel), or look at http://xmlsoft.org.
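On systems without dpkg, a rough distribution-independent check is to ask pkg-config or xml2-config, two tools that are commonly shipped with the libxml2 development files. The sketch below assumes at least one of them is installed; if neither is present, the result says nothing definite about the headers. The function name have_libxml2_dev is hypothetical.

```shell
# Hypothetical helper: succeed if the libxml2 development files seem
# to be installed. Tries pkg-config first, then xml2-config.
have_libxml2_dev() {
    if command -v pkg-config >/dev/null 2>&1 &&
       pkg-config --exists libxml-2.0; then
        return 0
    fi
    command -v xml2-config >/dev/null 2>&1
}

if have_libxml2_dev; then
    echo "libxml2 development files found"
else
    echo "libxml2 development files not found (or not detectable this way)"
fi
```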
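There are a few other possibly problematic modules. PerlIO::Util has known (and reported) problems, so the first command below installs it with -n (cpanm's --notest option, which skips the tests):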
cpanm -n PerlIO::Util
cpanm Moose
moose-outdated | cpanm
Now, the installation of Treex::Core should be smooth (but it takes more than 8 minutes if no dependencies were installed before):
cpanm Treex::Core
Rarely, you may encounter problems with installing some modules. In that case, you should find the first module where something went wrong. You can read the documentation of the module, check its bug tracker, try to install it manually etc. If you cannot diagnose and fix the failure, you may try to install it with --prompt, --force or --notest options, but this may cause troubles later on.
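As a first diagnostic step, you can search the cpanm log for the error messages mentioned in this guide. The following is a sketch only: the function name diagnose_build_log is hypothetical, it knows just the three messages quoted in this tutorial, and it assumes the log is at ~/.cpanm/build.log unless another path is given.

```shell
# Hypothetical helper: scan a cpanm build log for the messages
# mentioned in this guide and print the corresponding hint.
diagnose_build_log() (
    log=${1:-$HOME/.cpanm/build.log}
    [ -r "$log" ] || { echo "Cannot read $log"; exit 2; }
    hits=0
    if grep -qF 'looking for -lxml2... no' "$log"; then
        echo "libxml2 or its header files seem to be missing"
        hits=1
    fi
    if grep -qF 'WHOA THERE!' "$log"; then
        echo "a previous local Perl environment seems to be in conflict"
        hits=1
    fi
    if grep -qF 'Cannot write to /usr/lib/' "$log"; then
        echo "a write to a system path failed; try running cpanm again"
        hits=1
    fi
    [ "$hits" -eq 1 ] || echo "none of the known messages found"
)
```

Calling diagnose_build_log without arguments checks the default log; any other failure still has to be traced by reading the log itself.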
treex -h
treex is the main Treex script. treex -h should just print the usage information and exit. Its actual usage will be described later on in this tutorial (Treex::Tutorial::FirstSteps); running the command serves here only as a check that treex was installed and can be found in the $PATH. The installation created a configuration file ~/.treex/config.yaml which will be described in Treex::Tutorial::Config.
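If treex -h reports "command not found", the script is most likely not in your $PATH. A minimal sketch of a check follows; the helper name check_on_path is hypothetical, and the hint assumes the local::lib setup from step 1, which places scripts in ~/perl5/bin.

```shell
# Hypothetical helper: report whether a command can be found in $PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found in PATH"
        return 1
    fi
}

check_on_path treex || echo "check that ~/perl5/bin is in your PATH"
```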
3. Install Treex::EN from CPAN
Treex Core itself has no modules for any particular NLP task. There is a separate distribution Treex-Unilang for such modules that are language independent. In this tutorial, we will mainly work with English, so you need to install a distribution Treex-EN, which contains only modules specific to English. It is dependent on Treex-Unilang, so both the distributions can be installed by:
cpanm Treex::EN
cpanm Lingua::Interset URI::Find Cache::LRU
4. Install TrEd
TrEd is a fully customizable and programmable graphical editor and viewer for tree-like structures. Although TrEd visualization of the linguistic trees produced by Treex can be very helpful, it is not required, i.e. Treex is fully functional even without installing TrEd.
To install the Perl Tk module, you may need to install some header files, and TrEd also needs the patch tool. On Ubuntu/Debian you can install these prerequisites using:
sudo apt-get install libx11-dev libxft-dev libfontconfig1-dev libpng12-dev patch
Now, download a small installation script
wget http://ufal.mff.cuni.cz/tred/install_tred.bash
You can type bash install_tred.bash -h to see the installation options. To automatically download and build the latest TrEd and its dependencies to ~/tred, use:
bash install_tred.bash --tred-dir ~/tred
You can run ~/tred/bin/start_tred to check the GUI. When a dialog box "Manage extensions" appears, you can ignore it (click on "Later").
Treex Core contains an extension for TrEd, which enables it to open *.treex, *.treex.gz and *.streex files and use the Treex stylesheet. Treex Core also contains a simple wrapper script ttred which runs TrEd with this extension enabled (pre-installed). We must instruct Treex where to find TrEd:
echo "tred_dir: $HOME/tred" >> ~/.treex/config.yaml
TrEd installed some of its dependencies to ~/tred/dependencies, but we want to make them permanently available for Treex (and all Perl modules):
echo "source ~/tred/bin/init_tred_environment" >> ~/.bashrc
source ~/tred/bin/init_tred_environment
Finally, you can run TrEd with the Treex extension enabled:
ttred
5. Download Treex from GIT repository
Some Treex modules are not mature enough to be released on CPAN. You may also want to test the newest Treex version or commit your own code to the repository. So let's create your local clone of Treex in ~/treex.
git clone https://github.com/ufal/treex.git ~/treex
You need to include the path to the downloaded modules in your $PERL5LIB. Add the following lines to the end of your ~/.bashrc:
export PATH="$HOME/treex/bin:$PATH"
export PERL5LIB="$HOME/treex/lib:$HOME/treex/oldlib:$PERL5LIB"
export TMT_ROOT=$HOME/.treex
It is important that these lines follow eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib` in your ~/.bashrc, so that a GIT module is preferred over a CPAN module of the same name. To apply the setting for the current bash session, type the three export commands or start a new session. You can check it with:
echo $PERL5LIB # ~/treex/lib should precede ~/perl5/...
treex -v # should print "Treex version: DEV from..."
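The ordering check on $PERL5LIB can also be automated instead of inspected by eye. This is a sketch under the assumption that the paths from this tutorial (~/treex/lib and ~/perl5) are used; the function name path_precedes is hypothetical.

```shell
# Hypothetical helper: succeed if, in the colon-separated LIST, an entry
# starting with FIRST appears before any entry starting with SECOND.
# The body runs in a subshell, so IFS and "set -f" do not leak out.
path_precedes() (
    list=$1
    first=$2
    second=$3
    IFS=':'
    set -f    # disable globbing while the list is split on colons
    for entry in $list; do
        case "$entry" in
            "$first"*)  exit 0 ;;
            "$second"*) exit 1 ;;
        esac
    done
    exit 1    # FIRST does not occur at all
)

if path_precedes "${PERL5LIB:-}" "$HOME/treex/lib" "$HOME/perl5"; then
    echo "OK: GIT modules take precedence over CPAN modules"
else
    echo "Warning: ~/treex/lib does not precede ~/perl5 in PERL5LIB"
fi
```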
Now you can use Perl modules that were not installed from CPAN (but were downloaded from GIT). Some of the modules may have dependencies that you do not have installed. When you load such a module (e.g. by running treex), it will fail with an error message like Can't locate Acme/Time/Baby.pm in @INC (@INC contains:...
You can install the missing dependencies (Acme::Time::Baby in this imaginary example) simply with:
cpanm Acme::Time::Baby
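Translating the file path in the error message into a module name is mechanical: drop the .pm suffix and replace each / with ::. A small sketch of a filter that does this for any "Can't locate ..." lines on its standard input follows; the function name missing_modules is hypothetical.

```shell
# Hypothetical helper: read Perl error output on stdin and print the
# names of modules reported as missing ("Can't locate Foo/Bar.pm in @INC").
missing_modules() {
    sed -n "s/.*Can't locate \([^ ]*\)\.pm in @INC.*/\1/p" |
        sed 's|/|::|g' |
        sort -u
}

# Imaginary example from the text above:
echo "Can't locate Acme/Time/Baby.pm in @INC (@INC contains: /x)" | missing_modules
# prints: Acme::Time::Baby
```

For example, treex -h 2>&1 | missing_modules would list the module to pass to cpanm if loading fails for this reason. Only the first missing module is reported by Perl at a time, so you may need to repeat the procedure.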
If you happen to need any of the modules CzechMorpho, Morce::Czech and Morce::English, you must install them manually: these modules were not released on CPAN, and because they are XS-based (they involve compiling C code), you cannot just download them.
svn --username public export $SVN_TRUNK/libs/packaged /tmp/packaged
cd /tmp/packaged/Morce-English
perl Build.PL
./Build
./Build test
./Build install --prefix $HOME/perl5/lib/perl5
In the same way, you can install CzechMorpho and Morce-Czech (in this order because the latter depends on the former).
6. Install MorphoDiTa tagger and NameTag NER (optional)
MorphoDiTa is an open-source tool for morphological analysis of natural language texts. Currently there is a Perl module, Ufal::MorphoDiTa, available on CPAN providing bindings to the MorphoDiTa library. This module is necessary for running Treex::Tool::Tagger::MorphoDiTa and consequently Treex::Block::W2A::EN::TagMorphoDiTa.
To compile the module, a C++11 compiler is needed: either g++ 4.7 or newer, or clang 3.2 or newer. You can check whether the required compiler is installed on your computer:
g++ --version
# Or alternatively ...
clang --version
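Comparing the reported version with the required minimum can be scripted. The sketch below compares only the major and minor numbers; the function names major_minor and version_at_least are hypothetical. It uses g++ -dumpversion, which prints just the version number, and covers g++ only.

```shell
# Print "MAJOR MINOR" for a version string such as 4.8.5 or 11.
major_minor() {
    v=$1
    major=${v%%.*}
    case "$v" in
        *.*) rest=${v#*.}; minor=${rest%%.*} ;;
        *)   minor=0 ;;
    esac
    echo "$major $minor"
}

# Succeed if version $1 is at least version $2 (major.minor only).
version_at_least() (
    set -- $(major_minor "$1") $(major_minor "$2")
    [ "$1" -gt "$3" ] || { [ "$1" -eq "$3" ] && [ "$2" -ge "$4" ]; }
)

if command -v g++ >/dev/null 2>&1; then
    if version_at_least "$(g++ -dumpversion)" 4.7; then
        echo "g++ is new enough for C++11"
    else
        echo "g++ is older than 4.7"
    fi
else
    echo "g++ is not installed"
fi
```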
When not installed, install it. On Ubuntu/Debian etc. use this command:
sudo apt-get install g++
When the installed compiler version is too old, upgrade it. On Ubuntu/Debian etc. use this command:
sudo apt-get upgrade g++
Finally, you can install the module:
cpanm Ufal::MorphoDiTa
Another useful tool is Ufal::NameTag, a tool for named entity recognition. It should have similar prerequisites to Ufal::MorphoDiTa, so if you followed the previous steps, just install the module.
cpanm Ufal::NameTag
Uninstall
Although there is no standardized way to uninstall Perl modules, in most cases it is enough to delete the respective files and directories. If you followed this installation guide and you want to remove all the installed stuff and if you had nothing in ~/perl5 before, you can delete the directories ~/perl5, ~/treex, ~/.treex, ~/.tred and ~/.cpanm. You can also delete the added lines from ~/.bashrc (starting with ## Treex installation ##) and ~/.bash_profile.
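Before deleting anything, it may help to see which of these directories actually exist. The following sketch only lists them and deletes nothing; the function name list_treex_dirs is hypothetical, and the directory names are exactly those given above.

```shell
# Hypothetical helper: list (not delete) the directories named above
# that exist under the given home directory.
list_treex_dirs() (
    home_dir=${1:-$HOME}
    for d in perl5 treex .treex .tred .cpanm; do
        [ -d "$home_dir/$d" ] && echo "$home_dir/$d"
    done
    exit 0
)

list_treex_dirs
```

Review the printed list, keeping in mind the caveat about anything you had in ~/perl5 before, and remove the directories manually.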
AUTHOR
Martin Popel <popel@ufal.mff.cuni.cz>
Dušan Variš <varis@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.