--watch option
SYNOPSIS:
  touch timestamp.file
  treex --watch=timestamp.file my.scen &   # or without & and open another terminal
  # After all documents are processed, treex is still running, watching timestamp.file.
  # You can modify any modules/blocks and then touch timestamp.file.
  # All modified modules will be reloaded (the number of reloaded modules is printed).
  # The document reader is restarted, so it starts reading the first file again.
  # To exit this "watching loop", either rm timestamp.file or press Ctrl+C.
BENEFITS:
  * much faster development cycles (e.g. most of the time of en-cs translation is spent on loading)
  * Now I have some non-deterministic problems with loading NER::Stanford - using --watch I get it loaded in all jobs once and then I don't have to reload it.
TODO:
  * modules are just reloaded; no constructors are called yet
NAME
Treex::Core::Run + treex - applying Treex blocks and/or scenarios on data
VERSION
version 2.20160629
SYNOPSIS
In bash:
> treex myscenario.scen -- data/*.treex
> treex My::Block1 My::Block2 -- data/*.treex
In Perl:
use Treex::Core::Run q(treex);
treex([qw(myscenario.scen -- data/*.treex)]);
treex([qw(My::Block1 My::Block2 -- data/*.treex)]);
DESCRIPTION
Treex::Core::Run allows you to apply a block, a scenario, or a mixture of both to a set of data files. It is designed primarily for use from the bash command line, via a thin front-end script called treex. However, the same list of arguments can be passed as an array reference to the function treex() imported from Treex::Core::Run.
Note that this module supports distributed processing (Linux-only!), simply by adding the switch -p. The treex method then creates a Treex::Core::Parallel::Head object, which extends Treex::Core::Run by providing parallel processing functionality.
There are two ways to process the data in parallel. By default, the SGE cluster's qsub is expected to be available. If you have no cluster but want to parallelize the computation at least on a multicore machine, add the --local switch.
SUBROUTINES
- treex
Creates a new runner and runs the scenario given in the parameters.
USAGE
usage: treex [-?dEehjLmpqSstv] [long options...] scenario [-- treex_files]
scenario is a sequence of blocks or *.scen files
options:
-h -? --usage --help Prints this usage information.
-s --save save all documents
-q --quiet Warning, info and debug messages
are suppressed. Only fatal errors
are reported.
--cleanup Delete all temporary files.
-e STR --error_level STR Possible values: ALL, DEBUG,
INFO, WARN, FATAL
-L STR --language STR --lang STR shortcut for adding
"Util::SetGlobal language=xy" at
the beginning of the scenario
-S STR --selector STR shortcut for adding
"Util::SetGlobal selector=xy" at
the beginning of the scenario
-t --tokenize shortcut for adding
"Read::Sentences W2A::Tokenize"
at the beginning of the scenario
(or W2A::XY::Tokenize if used
with --lang=xy)
--watch STR re-run when the given file is
changed (see the --watch section
above)
-d --dump_scenario Just dump (print to STDOUT) the
given scenario and exit.
--dump_required_files Just dump (print to STDOUT) files
required by the given scenario
and exit.
--cache STR Use cache. Required memory is
specified in format
memcached,loading. Numbers are in
GB.
-v --version Print treex and perl version
-E STR --forward_error_level STR messages with this level or
higher will be forwarded from the
distributed jobs to the main
STDERR
-p --parallel Parallelize the task on SGE
cluster (using qsub).
-j INT --jobs INT Number of jobs for
parallelization, default 10.
Requires -p.
--local Run jobs locally (might help with
multi-core machines). Requires -p.
--priority INT Priority for qsub, an integer in
the range -1023 to 0 (or 1024 for
admins), default=-100. Requires
-p.
--memory STR -m STR --mem STR How much memory should be
allocated for cluster jobs,
default=2G. Requires -p.
Translates to "qsub -hard -l
mem_free=$mem -l h_vmem=2*$mem -l
act_mem_free=$mem". Use --mem=0
and --qsub to set your own SGE
settings (e.g. if act_mem_free is
not available).
--name STR Prefix of submitted jobs.
Requires -p. Translates to "qsub
-N $name-jobname".
--queue STR SGE queue. Translates to "qsub -q
$queue".
--qsub STR Additional parameters passed to
qsub. Requires -p. See --priority
and --mem. You can use e.g.
--qsub="-q *@p*,*@s*" to use just
machines p* and s*. Or e.g.
--qsub="-q *@!(twi*|pan*)" to
skip twi* and pan* machines.
--workdir STR working directory for temporary
files in parallelized processing;
directory names can be generated
automatically using patterns:
{NNN} is replaced by an ordinal
number zero-padded to the number
of Ns, {XXXX} is replaced by a
random string whose length equals
the number of Xs (min. 4). If not
specified, directories such as
001-cluster-run, 002-cluster-run,
etc. are created.
--survive Continue collecting jobs' outputs
even if some of them crashed
(risky, use with care!).
--jobindex INT Not to be used manually. If the
number of jobs is J and the job
index (modulo) is M, only files
whose ordinal number I fulfills
I mod J == M are processed.
--outdir STR Not to be used manually. Directory
for collecting standard and error
outputs in parallelized
processing.
--server STR Not to be used manually. Used to
point parallel jobs to the head.
AUTHORS
Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>
Martin Popel <popel@ufal.mff.cuni.cz>
Martin Majliš
Ondřej Dušek <odusek@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011-2014 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.