--watch option
SYNOPSIS:
touch timestamp.file
treex --watch=timestamp.file my.scen &  # or without & and open another terminal
# After all documents are processed, treex keeps running and watches timestamp.file.
# You can now modify any modules/blocks and then touch timestamp.file.
# All modified modules will be reloaded (the number of reloaded modules is printed).
# The document reader is restarted, so it starts reading from the first file again.
# To exit this "watching loop", either rm timestamp.file or press Ctrl+C.
BENEFITS:
* Much faster development cycles (e.g. most of the time in en-cs translation is spent on loading).
* Modules that load unreliably need to be loaded only once. For example, NER::Stanford currently shows non-deterministic loading problems; with --watch it is loaded once on all jobs and does not have to be reloaded.
TODO:
* Modules are only reloaded; no constructors are called yet.
NAME
Treex::Core::Run + treex - applying Treex blocks and/or scenarios on data
VERSION
version 2.20150928
SYNOPSIS
In bash:
> treex myscenario.scen -- data/*.treex
> treex My::Block1 My::Block2 -- data/*.treex
In Perl:
use Treex::Core::Run q(treex);
treex([qw(myscenario.scen -- data/*.treex)]);
treex([qw(My::Block1 My::Block2 -- data/*.treex)]);
DESCRIPTION
Treex::Core::Run lets you apply a block, a scenario, or a mixture of both to a set of data files. It is designed to be used primarily from the bash command line, through a thin front-end script called treex. However, the same list of arguments can also be passed as an array reference to the function treex() imported from Treex::Core::Run.
Note that this module supports distributed processing (Linux only!), simply by adding the switch -p. The treex function then creates a Treex::Core::Parallel::Head object, which extends Treex::Core::Run with parallel processing functionality.
There are two ways to process the data in parallel. By default, an SGE cluster's qsub is expected to be available. If you have no cluster but want to parallelize the computation at least on a multi-core machine, add the --local switch.
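As a minimal sketch of the two parallel modes described above, the following shell snippet prints the corresponding command lines. The run helper is a hypothetical dry-run wrapper and not part of Treex; the scenario name, data glob, and job counts are placeholders. Only switches listed under USAGE (-p, --jobs, --local) are used.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of Treex): prints the command line
# instead of executing it, unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Distribute the work over an SGE cluster via qsub, with 20 jobs
# instead of the default 10.
run treex -p --jobs=20 myscenario.scen -- data/*.treex

# No cluster available: run 4 jobs on the local (multi-core) machine.
run treex -p --local --jobs=4 myscenario.scen -- data/*.treex
```

Setting RUN=1 executes the commands for real, which requires treex to be installed (and qsub to be available for the first command).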
SUBROUTINES
- treex
Creates a new runner and runs the scenario given in the parameters.
USAGE
usage: treex [-?dEehjLmpqSstv] [long options...] scenario [-- treex_files]
scenario is a sequence of blocks or *.scen files
options:
-h -? --usage --help       Prints this usage information.
-s --save                  save all documents
-q --quiet                 Warning, info and debug messages are
                           suppressed. Only fatal errors are
                           reported.
--cleanup                  Delete all temporary files.
-e --error_level           Possible values: ALL, DEBUG, INFO, WARN,
                           FATAL
-L --language --lang       shortcut for adding "Util::SetGlobal
                           language=xy" at the beginning of the
                           scenario
-S --selector              shortcut for adding "Util::SetGlobal
                           selector=xy" at the beginning of the
                           scenario
-t --tokenize              shortcut for adding "Read::Sentences
                           W2A::Tokenize" at the beginning of the
                           scenario (or W2A::XY::Tokenize if used
                           with --lang=xy)
--watch                    re-run when the given file is changed
                           TODO better doc
-d --dump_scenario         Just dump (print to STDOUT) the given
                           scenario and exit.
--dump_required_files      Just dump (print to STDOUT) files
                           required by the given scenario and exit.
--cache                    Use cache. Required memory is specified
                           in format memcached,loading. Numbers are
                           in GB.
-v --version               Print treex and perl version
-E --forward_error_level   messages with this level or higher will
                           be forwarded from the distributed jobs
                           to the main STDERR
-p --parallel              Parallelize the task on SGE cluster
                           (using qsub).
-j --jobs                  Number of jobs for parallelization,
                           default 10. Requires -p.
--local                    Run jobs locally (might help with
                           multi-core machines). Requires -p.
--priority                 Priority for qsub, an integer in the
                           range -1023 to 0 (or 1024 for admins),
                           default=-100. Requires -p.
--memory -m --mem          How much memory should be allocated for
                           cluster jobs, default=2G. Requires -p.
                           Translates to "qsub -hard -l
                           mem_free=$mem -l h_vmem=2*$mem -l
                           act_mem_free=$mem". Use --mem=0 and
                           --qsub to set your own SGE settings
                           (e.g. if act_mem_free is not available).
--name                     Prefix of submitted jobs. Requires -p.
                           Translates to "qsub -N $name-jobname".
--qsub                     Additional parameters passed to qsub.
                           Requires -p. See --priority and --mem.
                           You can use e.g. --qsub="-q *@p*,*@s*"
                           to use just machines p* and s*. Or e.g.
                           --qsub="-q *@!(twi*|pan*)" to skip twi*
                           and pan* machines.
--workdir                  working directory for temporary files in
                           parallelized processing; one can create
                           automatic directories by using patterns:
                           {NNN} is replaced by an ordinal number
                           with so many leading zeros to have
                           length of the number of Ns, {XXXX} is
                           replaced by a random string, whose
                           length is the same as the number of Xs
                           (min. 4). If not specified, directories
                           such as 001-cluster-run, 002-cluster-run
                           etc. are created
--survive                  Continue collecting jobs' outputs even
                           if some of them crashed (risky, use with
                           care!).
--jobindex                 Not to be used manually. If number of
                           jobs is set to J and modulo set to M,
                           only I-th files fulfilling I mod J == M
                           are processed.
--outdir                   Not to be used manually. Directory for
                           collecting standard and error outputs in
                           parallelized processing.
--server                   Not to be used manually. Used to point
                           parallel jobs to the head.
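The --mem description above involves a small piece of arithmetic: h_vmem is set to twice the requested amount. The following is a minimal sketch of that mapping, assuming the value is an integer followed by an optional one-letter unit such as G. The function name qsub_mem_args is hypothetical, and the way Treex itself parses the value may differ.

```shell
#!/bin/sh
# Hypothetical helper (not part of Treex): builds the qsub resource options
# that the --mem description lists for a given memory request.
# Assumption: the value is an integer with an optional one-letter unit (e.g. 2G).
qsub_mem_args() {
    mem=$1
    num=${mem%[A-Za-z]}    # numeric part, e.g. "2"
    unit=${mem#"$num"}     # unit suffix, e.g. "G" (may be empty)
    printf 'qsub -hard -l mem_free=%s -l h_vmem=%s%s -l act_mem_free=%s\n' \
        "$mem" "$((num * 2))" "$unit" "$mem"
}

qsub_mem_args 2G    # the documented default
# prints: qsub -hard -l mem_free=2G -l h_vmem=4G -l act_mem_free=2G
```

If act_mem_free is not available on your cluster, the usage text suggests passing --mem=0 and supplying your own resource options through --qsub instead.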
AUTHORS
Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>
Martin Popel <popel@ufal.mff.cuni.cz>
Martin Majliš
Ondřej Dušek <odusek@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011-2014 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.