--watch option

SYNOPSIS:

 touch timestamp.file
 treex --watch=timestamp.file my.scen &  # or without & and open another terminal
 # After all documents are processed, treex is still running, watching timestamp.file.
 # You can modify any modules/blocks and then touch timestamp.file.
 # All modified modules will be reloaded (the number of reloaded modules is printed).
 # The document reader is restarted, so it starts reading the first file again.
 # To exit this "watching loop", either rm timestamp.file or press Ctrl+C.

BENEFITS:

* much faster development cycles (e.g. most of the time of en-cs translation is spent on loading)
* I currently have some non-deterministic problems with loading NER::Stanford; using --watch, I get it loaded on all jobs once and then don't have to reload it.

TODO:

* modules are just reloaded; no constructors are called yet

NAME

Treex::Core::Run + treex - applying Treex blocks and/or scenarios on data

VERSION

version 2.20151210

SYNOPSIS

In bash:

> treex myscenario.scen -- data/*.treex
> treex My::Block1 My::Block2 -- data/*.treex

In Perl:

use Treex::Core::Run q(treex);
treex([qw(myscenario.scen -- data/*.treex)]);
treex([qw(My::Block1 My::Block2 -- data/*.treex)]);

DESCRIPTION

Treex::Core::Run allows you to apply a block, a scenario, or a combination of both to a set of data files. It is designed to be used primarily from the bash command line, via a thin front-end script called treex. However, the same list of arguments can be passed as an array reference to the function treex() imported from Treex::Core::Run.

Note that this module supports distributed processing (Linux only): simply add the switch -p. The treex method then creates a Treex::Core::Parallel::Head object, which extends Treex::Core::Run with parallel processing functionality.

There are two ways to process the data in parallel. By default, the SGE cluster's qsub is expected to be available. If you have no cluster but want to parallelize the computation at least on a multi-core machine, add the --local switch.
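For example, a scenario could be distributed over an SGE cluster or run locally on several cores (a hypothetical invocation; the scenario name, file glob, and flag values are illustrative, the flags themselves are documented in USAGE below):

```shell
# On an SGE cluster: 20 jobs, 4G of memory each (illustrative values)
treex -p --jobs=20 --mem=4G myscenario.scen -- data/*.treex

# On a single multi-core machine, without a cluster
treex -p --local --jobs=8 myscenario.scen -- data/*.treex
```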

SUBROUTINES

treex

Creates a new runner and runs the scenario given in the parameters.

USAGE

usage: treex [-?dEehjLmpqSstv] [long options...] scenario [-- treex_files]
scenario is a sequence of blocks or *.scen files
options:
	-h -? --usage --help         Prints this usage information.
	-s --save                    save all documents
	-q --quiet                   Warning, info and debug messages are
	                             suppressed. Only fatal errors are
	                             reported.
	--cleanup                    Delete all temporary files.
	-e --error_level             Possible values: ALL, DEBUG, INFO, WARN,
	                             FATAL
	-L --language --lang         shortcut for adding "Util::SetGlobal
	                             language=xy" at the beginning of the
	                             scenario
	-S --selector                shortcut for adding "Util::SetGlobal
	                             selector=xy" at the beginning of the
	                             scenario
	-t --tokenize                shortcut for adding "Read::Sentences
	                             W2A::Tokenize" at the beginning of the
	                             scenario (or W2A::XY::Tokenize if used
	                             with --lang=xy)
	--watch                      re-run when the given file is changed
	                             TODO better doc
	-d --dump_scenario           Just dump (print to STDOUT) the given
	                             scenario and exit.
	--dump_required_files        Just dump (print to STDOUT) files
	                             required by the given scenario and exit.
	--cache                      Use cache. Required memory is specified
	                             in format memcached,loading. Numbers are
	                             in GB.
	-v --version                 Print treex and perl version
	-E --forward_error_level     messages with this level or higher will
	                             be forwarded from the distributed jobs
	                             to the main STDERR
	-p --parallel                Parallelize the task on SGE cluster
	                             (using qsub).
	-j --jobs                    Number of jobs for parallelization,
	                             default 10. Requires -p.
	--local                      Run jobs locally (might help with
	                             multi-core machines). Requires -p.
	--priority                   Priority for qsub, an integer in the
	                             range -1023 to 0 (or 1024 for admins),
	                             default=-100. Requires -p.
	--memory -m --mem            How much memory should be allocated for
	                             cluster jobs, default=2G. Requires -p.
	                             Translates to "qsub -hard -l
	                             mem_free=$mem -l h_vmem=2*$mem -l
	                             act_mem_free=$mem". Use --mem=0 and
	                             --qsub to set your own SGE settings
	                             (e.g. if act_mem_free is not available).
	--name                       Prefix of submitted jobs. Requires -p.
	                             Translates to "qsub -N $name-jobname".
	--qsub                       Additional parameters passed to qsub.
	                             Requires -p. See --priority and --mem.
	                             You can use e.g. --qsub="-q *@p*,*@s*"
	                             to use just machines p* and s*. Or e.g.
	                             --qsub="-q *@!(twi*|pan*)" to skip twi*
	                             and pan* machines.
	--workdir                    working directory for temporary files in
	                             parallelized processing; one can create
	                             automatic directories by using patterns:
	                             {NNN} is replaced by an ordinal number
	                             with so many leading zeros to have
	                             length of the number of Ns, {XXXX} is
	                             replaced by a random string, whose
	                             length is the same as the number of Xs
	                             (min. 4). If not specified, directories
	                             such as 001-cluster-run, 002-cluster-run
	                             etc. are created
	--survive                    Continue collecting jobs' outputs even
	                             if some of them crashed (risky, use with
	                             care!).
	--jobindex                   Not to be used manually. If the number
	                             of jobs is set to J and the modulo to
	                             M, only the I-th files fulfilling I
	                             mod J == M are processed.
	--outdir                     Not to be used manually. Directory for
	                             collecting standard and error outputs
	                             in parallelized processing.
	--server                     Not to be used manually. Used to point
	                             parallel jobs to the head.

AUTHORS

Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>

Martin Popel <popel@ufal.mff.cuni.cz>

Martin Majliš

Ondřej Dušek <odusek@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011-2014 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.