NAME
Treex::Tool::Parser::MSTperl::Config
VERSION
version 0.08055
DESCRIPTION
Handles the configuration of the parser.
FIELDS
Data fields
Fields describing the data fields used with nodes, such as form, pos, lemma...
- field_names (ArrayRef[Str])
-
Field names (for conversion of field index to field name)
- field_names_hash (HashRef[Str])
-
1 for each field name to easily check if a field name exists
- field_indexes (HashRef[Str])
-
Index of each field name in field_names (for conversion of field name to field index)
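The relationship between these three fields can be sketched in plain Perl. This is only an illustration, not the module's actual code; the field names and variable names below are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical field names, as they might be listed in a config file.
my @field_names = qw(ord form lemma pos parent);

# field_names_hash: 1 for each existing field name.
my %field_names_hash = map { $_ => 1 } @field_names;

# field_indexes: position of each field name in @field_names.
my %field_indexes;
for my $i ( 0 .. $#field_names ) {
    $field_indexes{ $field_names[$i] } = $i;
}

# Name to index, and back again via the array.
my $index = $field_indexes{'lemma'};
my $name  = $field_names[$index];
print "lemma has index $index; index $index is named $name\n";

# Checking whether a field name exists.
print "afun is not a known field\n" if !$field_names_hash{'afun'};
```

Keeping the hash and the index map next to the array trades a little memory for constant-time lookups in both directions.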
Settings
Most of the settings are set by a config file in YAML format. However, you do not have to understand YAML to change the settings, provided that you keep the formatting of the file unchanged (some whitespace is significant, etc.). In fact, only a subset of what YAML provides is used.
Everything from the # character to the end of the line is a comment and is ignored (if you need to actually use the # sign, you can quote it, e.g. '#empty#' is interpreted as #empty#). Lines that are empty or contain only whitespace characters are ignored as well.
Some of the settings are ignored when in parsing mode (i.e. not training). These are use_edge_features_cache (turned off) and number_of_iterations (irrelevant).
The following settings are read from the configuration file (see also the file itself; the options are richly commented there):
Basic Settings
- field_names
-
Lowercase names of fields in the input file (the data fields are to be separated by tabs in the input file). Use [a-z0-9_] only, using always at least one letter. Use unique names, i.e. devise some names even for unused fields.
- root_field_values
-
Field values to set for the (technical) root node.
- parent_ord
-
Name of field containing ord of the parent of the node (also called "head" or "governing node").
- number_of_iterations, labeller_number_of_iterations
-
How many times the trainer (Tagger::MSTperl::Trainer) should go through all the training data (default is 10).
- use_edge_features_cache, labeller_use_edge_features_cache
-
Turns the edge_features_cache on and off. Default is 0. The cache should be turned on (1) when training with a lot of RAM or on small training data, as it uses a lot of memory but speeds up the training greatly (by approx. 30% to 50%). If you need to save RAM, turn it off (0).
Features Settings
- features, labeller_features
-
Feature codes to use in the unlabelled/labelled parser. See Treex::Tool::Parser::MSTperl::FeaturesControl for details.
Internal technical settings
These fields cannot be set by the config file. Their default values are hard-coded at the beginning of the source code, and they can be set when creating the Config object, e.g.:
    use Treex::Tool::Parser::MSTperl::Config;

    my $config = Treex::Tool::Parser::MSTperl::Config->new(
        config_file                  => 'file.config',
        VITERBI_STATES_NUM_THRESHOLD => '5',
        EM_EPSILON                   => '0.00000001',
    );
If there is a need, they might be changed to fully config-file-configurable settings in the future.
- labeller_algorithm
-
Algorithm used for Viterbi labelling as well as for training. Several possibilities are being tried out; if one of them finally significantly outscores the other variants, this will become obsolete and get deleted.
- DEBUG
-
An integer specifying how much debug information you will be getting while running the program, ranging from 0 (no debug info) through 1 (progress messages - this is the default setting) through 2, 3 and 4 to 5 (more and more debug info).
If you set this value to something higher than 1, you should always redirect the output to a file, as printing it to the console is very slow (and there is so much info that you wouldn't be able to read anything anyway).
This is the only read-write field in this section. It is therefore possible to change its value not only when creating a new Config instance but also while the program is running (e.g. if you want to debug only a particular part).
- SEQUENCE_BOUNDARY_LABEL
-
This is only a technical thing; a label must be assigned to the (basically virtual) boundary of a sequence, different from any label used in the data. The default value is '###', so if you use this exact label as a valid label in your data, change the setting to something else. If nothing goes wrong, you should never see this label in the output; however, it is contained in the model and used for "transition scores" to score the "transition" between the sequence boundary and the first/last node (i.e. it determines the scores of labels used as the first or last label in the sequence where no actual transition takes place and the transition scores would otherwise get ignored).
- VITERBI_STATES_NUM_THRESHOLD
-
Number of states to keep when pruning. The pruning takes place after each Viterbi step (i.e. after each computation of possible labels and their scores for one edge). For more details see the prune subroutine.
- EM_EPSILON
-
Stopping criterion of the EM algorithm, which is used to compute smoothing parameters for linear combination smoothing of transition probabilities in some variants of the Labeller (when the sum of changes of the smoothing parameters is lower than the epsilon, the algorithm stops).
- EM_heldout_data_at
-
A number between 0 and 1 specifying where in the training data the heldout data for the EM algorithm start (e.g. 0.75 means that the first 75% of sentences are training data and the last 25% are heldout data).
The training/heldout data division only affects the computation of transition probabilities by MLE; it does not affect MIRA training or MLE for emission probabilities.
If EM is not used for smoothing, all data are used as training data.
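The effect of EM_heldout_data_at can be illustrated with a short sketch. It assumes that the split point is obtained by truncating the product of the setting and the number of sentences to an integer; the rounding actually used by the module is not documented here, and all variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical training data: eight sentences, represented by their ids.
my @sentences = ( 1 .. 8 );

# Value of the EM_heldout_data_at setting.
my $em_heldout_data_at = 0.75;

# Index of the first heldout sentence (assumption: truncate to an integer).
my $split = int( $em_heldout_data_at * scalar @sentences );

# Sentences before the split point are training data, the rest are heldout.
my @training = @sentences[ 0 .. $split - 1 ];
my @heldout  = @sentences[ $split .. $#sentences ];

printf "%d training sentences, %d heldout sentences\n",
    scalar @training, scalar @heldout;
```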
Technical fields
Provide access to things needed in more than one of the other packages.
- unlabelledFeaturesControl
-
Provides access to unlabelled features, especially enabling their computation. Instance of Treex::Tool::Parser::MSTperl::FeaturesControl.
- labelledFeaturesControl
-
Provides access to labeller features, especially enabling their computation. Instance of Treex::Tool::Parser::MSTperl::FeaturesControl.
METHODS
Settings
The best source of information about all the possible settings is the configuration file itself (usually called config.txt), as it is richly commented and accompanied by real examples at the same time.
- my $config = Treex::Tool::Parser::MSTperl::Config->new(config_file => 'file.config')
-
Reads the configuration file (in YAML format) and applies the settings. See the file samples/sample.config.
- field_name2index ($field_name)
-
Fields are referred to by names in the config files but by indexes in the code. Therefore this conversion function is necessary; the other direction of the conversion is ensured by the field_names field.
AUTHORS
Rudolf Rosa <rosa@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.