NAME
Treex::Tool::Parser::MSTperl::FeaturesControl
VERSION
version 0.08055
DESCRIPTION
Controls the features used in the model.
Features
TODO: outdated, superseded by use of config file -> rewrite
Each feature has the form code:value. The code describes the information which is relevant for the feature, and the value is the information retained from the dependency edge (and possibly other parts of the sentence (Treex::Tool::Parser::MSTperl::Sentence) stored in the sentence field).
For example, the feature L|l:být|pes means that the lemma of the parent node (the governing word) is "být" and the lemma of its child node (the dependent node) is "pes".
Each (proper) feature is composed of several simple features. In the aforementioned example, the simple feature codes were L and l and their values "být" and "pes", respectively. Each simple feature code is a string (case sensitive) and its value is also a string. The simple feature codes are joined together by the | sign to form the code of the proper feature, and similarly, the simple feature values joined by | form the proper feature value. Then, the proper feature code and value are joined together by : . (Therefore, the codes and values of the simple features must not contain the | and the : signs.)
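The joining rules above can be illustrated with a minimal sketch. The helper name compose_feature is hypothetical and is not part of this module; it only shows how a code:value string arises from simple feature codes and values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): joins simple feature codes
# and values into a single "code:value" feature string.
sub compose_feature {
    my ( $codes, $values ) = @_;

    # Simple feature codes and values must not contain '|' or ':'.
    for my $part ( @$codes, @$values ) {
        die "Illegal character in '$part'\n" if $part =~ /[|:]/;
    }

    return join( '|', @$codes ) . ':' . join( '|', @$values );
}

# The example from the text: parent lemma "být", child lemma "pes".
print compose_feature( [ 'L', 'l' ], [ 'být', 'pes' ] ), "\n";
```

The check on | and : reflects the restriction stated above; how the module itself reacts to such characters is not described here.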
By a naming convention, if the same simple feature can be computed for both the parent node and its child node, their codes are the same but for the case, which is upper for the parent and lower for the child. If this is not applicable, an uppercase code is used.
For efficiency, the simple feature codes are translated to integers (see simple_feature_codes).
In reality the feature codes are translated to integers as well (see feature_codes), but this is only an internal matter. You can see these numbers in the model file if you use the default Data::Dumper format (see load and store). However, if you use the tsv format (see load_tsv, store_tsv), you will see the real string feature codes.
Currently the following simple features are available. Any subset of them can be used to form a proper feature, but their order should follow their order of appearance in this list (this is only a matter of cleanliness and readability; it does not affect the function of the parser in any way).
- Distance (D)
-
Distance of the two nodes in the sentence, computed as the order of the parent minus the order of the child. E.g. for the sentence "To je prima pes ." and the feature D computed on nodes "je" and "pes" (parent and child respectively), the order of "je" is 2 and the order of "pes" is 4, yielding the feature value of 2 - 4 = -2. This leads to the feature D:-2.
- Form (F, f)
-
The form of the node, i.e. the word exactly as it appears in the sentence text.
Currently not used as it has not led to any improvement in the parsing.
- Lemma (L, l)
-
The morphological lemma of the node.
- Preceding tag (S, s)
-
The morphological tag (or POS tag if you like) of the node preceding (ord-wise) the node.
- Tag (T, t)
-
The morphological tag of the node.
- Following tag (U, u)
-
The morphological tag of the node following (ord-wise) the node.
- Between tag (B)
-
The morphological tag of each node between (ord-wise) the parent node and the child node. This simple feature returns (a reference to) an array of values.
Some of the simple features can return an empty string when they are not applicable (e.g. U for the last node in the sentence); in that case the whole feature is not present for the edge.
Some of the simple features return an array of values (e.g. the B simple feature). This can result in several instances of the feature with the same code appearing in the result for one edge.
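The two special cases above (an empty value and an array value) can be sketched as follows. This is only an illustration: the toy_* subs and the one-letter tags are made up, and the sentence is stored as a plain array rather than as the module's Sentence and Node objects.

```perl
use strict;
use warnings;

# Toy data: the sentence from the Distance example. The one-letter tags are
# made up for illustration and are not taken from any real tagset.
my @tags = qw( P V A N Z );    # To je prima pes .

# Tag of the node following the given (1-based) ord, or '' if there is none.
sub toy_following_tag {
    my ($ord) = @_;
    return $ord < @tags ? $tags[$ord] : '';
}

# Tags of all nodes strictly between two (1-based) ords, as an array reference.
sub toy_between_tags {
    my ( $ord1, $ord2 ) = @_;
    ( $ord1, $ord2 ) = ( $ord2, $ord1 ) if $ord1 > $ord2;
    return [ @tags[ $ord1 .. $ord2 - 2 ] ];
}

# Collects "code:value" strings for one parent/child pair of ords.
sub toy_edge_features {
    my ( $parent_ord, $child_ord ) = @_;
    my @features = ( 'D:' . ( $parent_ord - $child_ord ) );

    # An empty value means "not applicable": the feature is left out.
    my $following = toy_following_tag($parent_ord);
    push @features, "U:$following" if $following ne '';

    # An array value yields one feature instance per element.
    push @features, map( "B:$_", @{ toy_between_tags( $parent_ord, $child_ord ) } );

    return \@features;
}

print join( ' ', @{ toy_edge_features( 5, 2 ) } ), "\n";
```

With the last node as parent, no U feature is produced, while the two nodes between ords 2 and 5 each contribute their own B feature.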
FIELDS
Features
TODO: slightly outdated
The examples used here are consistent throughout this part of the documentation, i.e. if several simple features are listed in simple_feature_codes and then the simple feature with index 9 is referred to in array_simple_features, it really means the B simple feature, which is at index 9 (counting from 0) in simple_feature_codes.
- feature_count (Int)
-
Alias of scalar @{feature_codes} (but the integer is really stored in the field for faster access).
- feature_codes (ArrayRef[Str])
-
Codes of all features to be computed. Their indexes in this array are used to refer to them in the code. Eg.:
feature_codes ( [( 'L|T', 'l|t', 'L|T|l|t', 'T|B|t')] )
- feature_codes_hash (HashRef[Str])
-
1 for each feature code to easily check if a feature exists
- feature_indexes (HashRef[Str])
-
Index of each feature code in feature_codes (for conversion of feature code to feature index)
- feature_simple_features_indexes (ArrayRef[ArrayRef[Int]])
-
For each feature contains (a reference to) an array which contains all its simple feature indexes (corresponding to positions in simple_feature_codes). E.g. for the 4 features (0 to 3) listed in feature_codes and the 10 simple features listed in simple_feature_codes (0 to 9):
feature_simple_features_indexes ( [( [ (1, 5) ], [ (2, 6) ], [ (1, 5, 2, 6) ], [ (5, 9, 6) ], )] )
- array_features (HashRef)
-
Indexes of features containing array simple features (see array_simple_features). E.g.:
array_features( { 3 => 1} )
as the feature with index 3 ('T|B|t') contains the B simple feature, which is an array simple feature.
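As a rough sketch of how the index tables in this section relate to each other, the following derives feature_simple_features_indexes and array_features from the example code lists used throughout this part of the documentation. The helper toy_build_feature_tables is hypothetical; the module builds these fields itself in set_feature and set_simple_feature.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own code: derives the index tables
# from the two code lists and the codes of the array simple features.
sub toy_build_feature_tables {
    my ( $simple_feature_codes, $feature_codes, $array_codes ) = @_;

    # Simple feature code => its index in the simple feature code list.
    my %simple_feature_indexes;
    @simple_feature_indexes{@$simple_feature_codes} = 0 .. $#$simple_feature_codes;

    # Indexes of simple features that return arrays of values.
    my %array_simple_features = map { $simple_feature_indexes{$_} => 1 } @$array_codes;

    my ( @feature_simple_features_indexes, %array_features );
    for my $feature_index ( 0 .. $#$feature_codes ) {

        # Split the feature code on '|' and look up each simple feature.
        my @indexes = map { $simple_feature_indexes{$_} }
            split /\|/, $feature_codes->[$feature_index];
        push @feature_simple_features_indexes, \@indexes;

        # A feature is an array feature if any of its parts is one.
        $array_features{$feature_index} = 1
            if grep { $array_simple_features{$_} } @indexes;
    }
    return ( \@feature_simple_features_indexes, \%array_features );
}

my ( $indexes, $array_features ) = toy_build_feature_tables(
    [qw( D L l S s T t U u B )],
    [ 'L|T', 'l|t', 'L|T|l|t', 'T|B|t' ],
    ['B'],
);
print join( ' ', map { '[' . join( ',', @$_ ) . ']' } @$indexes ), "\n";
print join( ',', sort keys %$array_features ), "\n";
```

For the example lists this reproduces the index arrays (1, 5), (2, 6), (1, 5, 2, 6) and (5, 9, 6) and marks only feature 3 as an array feature.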
Simple features
- simple_feature_count (Int)
-
Alias of scalar @{simple_feature_codes} (but the integer is really stored in the field for faster access).
- simple_feature_codes (ArrayRef[Str])
-
Codes of all simple features to be computed. Their order is important as their indexes in this array are used to refer to them in the code, especially in the get_simple_feature method. E.g.:
simple_feature_codes ( [('D', 'L', 'l', 'S', 's', 'T', 't', 'U', 'u', 'B')])
- simple_feature_codes_hash (HashRef[Str])
-
1 for each simple feature code to easily check if a simple feature exists
- simple_feature_indexes (HashRef[Str])
-
Index of each simple feature code in simple_feature_codes (for conversion of simple feature code to simple feature index)
- simple_feature_sub_arguments (ArrayRef)
-
For each simple feature (at the corresponding index) contains the index of the field (in field_names) which is used to compute the simple feature value (together with a subroutine from simple_feature_subs).
If the simple feature takes more than one argument (called a multiarg feature here), then instead of a single field index there is a reference to an array of field indexes.
If the simple feature takes arguments other than fields (especially integers), then these arguments are stored here instead of field indexes.
- simple_feature_subs (ArrayRef)
-
For faster execution, the simple features are internally not represented by their string codes, which would have to be parsed repeatedly. Instead their codes are parsed only once (in set_simple_feature) and they are represented as an integer index of the field which is used to compute the feature (it is the actual index of the field in the input file line, accessible through "fields" in Treex::Tool::Parser::MSTperl::Node) and a reference to a subroutine (one of the feature_* subs, see below) which computes the feature value based on the field index and the edge (Treex::Tool::Parser::MSTperl::Edge). The referenced subroutine is then invoked in get_simple_feature_values_array.
- array_simple_features (HashRef[Int])
-
Indexes of simple features that return an array of values instead of a single string value. E.g.:
array_simple_features( { 9 => 1} )
because in the aforementioned example the B simple feature returns an array of values and has the index 9.
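The pairing of simple_feature_subs with simple_feature_sub_arguments amounts to a dispatch table: one subroutine reference and one argument per simple feature index. The sketch below shows the idea only. All toy_* names are hypothetical, and the edge is a plain hash of field arrays (an assumed layout), not a Treex::Tool::Parser::MSTperl::Edge object.

```perl
use strict;
use warnings;

# Toy edge: parent and child are plain arrays of field values (form, lemma, tag).
# The real Edge and Node classes are richer; this layout is an assumption.
sub toy_feature_parent { my ( $edge, $field ) = @_; return $edge->{parent}[$field]; }
sub toy_feature_child  { my ( $edge, $field ) = @_; return $edge->{child}[$field]; }

# Parallel arrays indexed by simple feature index, filled once at set-up time.
my @toy_codes = ( 'L', 'l', 'T', 't' );
my @toy_subs  = (
    \&toy_feature_parent, \&toy_feature_child,
    \&toy_feature_parent, \&toy_feature_child,
);
my @toy_sub_arguments = ( 1, 1, 2, 2 );    # field indexes: 1 = lemma, 2 = tag

# Looks up the sub and its argument by index; no string parsing at run time.
sub toy_simple_feature_value {
    my ( $edge, $index ) = @_;
    my $value = $toy_subs[$index]->( $edge, $toy_sub_arguments[$index] );
    return defined $value ? $value : '';
}

# The tags 'V' and 'N' are placeholders chosen for this example.
my $edge = { parent => [ 'je', 'být', 'V' ], child => [ 'pes', 'pes', 'N' ] };
print "$toy_codes[$_]:", toy_simple_feature_value( $edge, $_ ), "\n" for 0 .. $#toy_codes;
```

Parsing the codes once and storing references keeps the per-edge work down to an array lookup and a subroutine call, which is the motivation given above.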
Other
- edge_features_cache (HashRef[ArrayRef[Str]])
-
If caching is turned on (see below), all features of any edge computed by the get_feature_simple_features_indexes method are computed only once, stored in this cache and then retrieved when needed.
The key of the hash is the edge signature (see "signature" in Treex::Tool::Parser::MSTperl::Edge), the value is (a reference to) an array of features and their values.
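The caching behaviour described here is plain memoization keyed by the edge signature. A minimal sketch, assuming the signature is available as a string; the toy_* subs are hypothetical stand-ins and do not reproduce the module's actual feature computation.

```perl
use strict;
use warnings;

my %toy_cache;                # edge signature => array reference of features
my $toy_computations = 0;     # counts how often features were really computed

# Stand-in for the real, expensive feature computation (hypothetical).
sub toy_compute_features {
    my ($signature) = @_;
    $toy_computations++;
    return ["feature-of-$signature"];
}

# Returns the features for a signature, consulting the cache if enabled.
sub toy_get_all_features {
    my ( $signature, $use_cache ) = @_;
    return toy_compute_features($signature) if !$use_cache;

    # Compute only on the first request, then reuse the stored reference.
    $toy_cache{$signature} //= toy_compute_features($signature);
    return $toy_cache{$signature};
}

sub toy_computation_count { return $toy_computations }
```

Note that callers receive the same array reference on every cache hit, so in this sketch they must not modify it.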
Feature functions
In the features field of the config file all features to be used by the model are set. Use the input file field names to use the field of the (child) node, uppercase them to use the field of the parent, or prefix them by 1. or 2. to use the field on the first or second node in the sentence (i.e. based on order in the sentence, regardless of which is parent and which is child).
You can also make use of several functions. Again, you can usually (i.e. when it makes sense) write their names in lowercase to invoke them on the child field, uppercase for parent, or prefixed by 1. or 2. for first or second node. The argument of a function must always be a (child) field name.
- distance(ord_field)
-
Bucketed ord-wise distance of child and parent (ORD minus ord)
- preceding(field)
-
Value of the specified field on the ord-wise preceding node
- following(field)
-
The same for ord-wise following node
- between(field)
-
Value of the specified field for each node which is ord-wise between the child node and the parent node
- equals(field1,field2), equalspc(field1,field2)
-
Returns 1 if the values of the fields are equal (if they have multiple values, returns 1 if at least one pair of their values is equal), 0 if they are not, and -1 if at least one of the values is undefined. equalspc uses field1 of the parent node and field2 of the child node.
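The three-valued result of equals can be sketched as below. This is one possible reading of the description, not the module's implementation: toy_equals is a hypothetical name, each field is passed as an array reference of values, and an undefined field, an empty list or an empty-string value is treated as "undefined".

```perl
use strict;
use warnings;

# Hypothetical sketch of the equals semantics: 1 if any pair of values
# matches, 0 if none does, -1 if either side has no usable value.
sub toy_equals {
    my ( $values1, $values2 ) = @_;

    for my $values ( $values1, $values2 ) {
        return -1 if !defined $values || !@$values;
        return -1 if grep { !defined $_ || $_ eq '' } @$values;
    }

    # "Exists" semantics: one shared value is enough.
    my %seen = map { $_ => 1 } @$values1;
    return ( grep { $seen{$_} } @$values2 ) ? 1 : 0;
}

print toy_equals( [ 1, 2 ], [ 2, 5 ] ), "\n";    # one pair matches
print toy_equals( [1],      [2] ),      "\n";    # no pair matches
print toy_equals( undef,    [2] ),      "\n";    # one field is undefined
```

Under this reading, equalspc would call the same comparison with the first list taken from the parent node and the second from the child node.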
METHODS
Settings
The best source of information about all the possible settings is the configuration file itself (usually called config.txt), as it is richly commented and accompanied by real examples at the same time.
- my $featuresControl = Treex::Tool::Parser::MSTperl::FeaturesControl->new( 'config' => $config, 'feature_codes_from_config' => $feature_codes_array_reference, 'use_edge_features_cache' => $use_edge_features_cache, )
-
Parses feature codes and creates their in-memory representations.
- set_feature ($feature_code)
-
Parses the feature code and (if no errors are encountered) creates its representation in the fields of this package (all feature_* fields and possibly also the array_features field).
- set_simple_feature ($simple_feature_code)
-
Parses the simple feature code and creates its representation in the fields of this package (all simple_feature_* fields and possibly also the array_simple_features field).
Computing (proper) features
- my $features_array_rf = $model->get_all_features($edge)
-
Returns (a reference to) an array which contains all features of the edge (according to settings).
If caching is turned on, tries to look the features up in the cache before computing them. If they are not cached yet, they are computed and stored into the cache.
The value of a feature is computed by get_feature_value. Values of simple features are precomputed (by calling get_simple_feature_values_array) and passed to the get_feature_value method.
- my $feature_value = get_feature_value(3, $simple_feature_values)
-
Returns the value of the feature with the given index.
If it is an array feature (see array_features), its value is (a reference to) an array of all (string) values of the feature (a reference to an empty array if there are no values).
If it is not an array feature, its value is composed from the simple feature values. If some of the simple features do not have a value defined, an empty string ('') is returned.
- my $feature_value = get_array_feature_value ($simple_features_indexes, $simple_feature_values, $start_from)
-
Recursively calls itself to compose an array of all values of the feature (composed of the simple features given in the $simple_features_indexes array reference), which is a Cartesian product of all values of the simple features. The $start_from variable should be 0 when this method is called and is incremented in the recursive calls.
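The recursion can be sketched as follows. This illustrates the Cartesian product only and is not the module's code: the toy_* names are hypothetical, and treating an undefined or empty simple feature value as "the feature has no values" is an assumption based on the descriptions above.

```perl
use strict;
use warnings;

# Hypothetical sketch: returns a reference to an array of combinations, each
# combination being an array reference with one value per simple feature.
sub toy_array_feature_values {
    my ( $indexes, $values, $start_from ) = @_;

    # Base case: no simple features left, a single empty combination remains.
    return [ [] ] if $start_from > $#$indexes;

    # A simple feature value is either a string or a reference to an array.
    my $value = $values->[ $indexes->[$start_from] ];
    my @heads = ref($value) eq 'ARRAY' ? @$value : ($value);

    # Assumption: a missing value makes the whole feature absent.
    return [] if grep { !defined $_ || $_ eq '' } @heads;

    # Combine every value at this position with every combination of the rest.
    my $tails = toy_array_feature_values( $indexes, $values, $start_from + 1 );
    my @combinations;
    for my $head (@heads) {
        push @combinations, map { [ $head, @$_ ] } @$tails;
    }
    return \@combinations;
}

# Joins each combination with '|' to obtain the feature value strings.
sub toy_feature_value_strings {
    my ( $indexes, $values ) = @_;
    my $combinations = toy_array_feature_values( $indexes, $values, 0 );
    return [ map { join( '|', @$_ ) } @$combinations ];
}

# Feature 'T|B|t' uses simple features 5, 9 and 6; the tags are placeholders.
my @simple_feature_values = ('') x 10;
@simple_feature_values[ 5, 9, 6 ] = ( 'V', [ 'A', 'D' ], 'N' );
print join( ' ', @{ toy_feature_value_strings( [ 5, 9, 6 ], \@simple_feature_values ) } ), "\n";
```

With two tags between the nodes, the single feature code yields two value strings, one per between tag.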
Computing simple features
- my $simple_feature_values = get_simple_feature_values_array($edge)
-
Returns (a reference to) an array of values of all simple features (see simple_feature_codes). For each simple feature, its value can be found at the position in the returned array corresponding to its position in simple_feature_codes.
- my $sub = get_simple_feature_sub_reference ('distance')
-
Translates the feature function string name (e.g. distance) to its reference (e.g. \&feature_distance).
- my $value = get_simple_feature_value ($edge, 9)
-
Returns the value of the simple feature with the given index by calling an appropriate feature_* method on the edge (see Treex::Tool::Parser::MSTperl::Edge). If the feature cannot be computed, an empty string ('') is returned (or a reference to an empty array for array simple features - see array_simple_features).
- feature_distance
- feature_child
- feature_parent
- feature_first
- feature_second
- feature_preceding_child
- feature_preceding_parent
- feature_following_child
- feature_following_parent
- feature_preceding_first
- feature_preceding_second
- feature_following_first
- feature_following_second
- feature_between
- feature_foreach
- feature_equals, feature_equals_pc, feature_equals_pc_at
-
# from config:
equals(field1,field2) - returns 1 if the value of field1 is the same as the value of field2. For fields with multiple values (e.g. with aligned nodes), it has the meaning of an "exists" operator: it returns 1 if there is at least one pair of values, one from each field, that are the same. It returns 0 if no values match and -1 if (at least) one of the fields is undef (which may also be represented by an empty string).
equalspc(field1,field2) - like equals, but the first field is taken from the parent and the second from the child.
In other words, this is a simple feature function equals(field_1,field_2) with XQuery-like "at least once" semantics for multiple values (there can be multiple alignments) and a special output value if one of the fields is unknown (maybe it suffices to emit an undef, as this would occur iff at least one of the arguments is undef; but maybe not, and e.g. "-1" should be given).
This makes it possible to have a simple feature which behaves like this:
- returns 1 if the edge between child and parent is also present in the English tree
- returns 0 if not
- returns -1 if it cannot decide (alignment info is missing for some of the nodes)
This works because if the parser has (the ord of the en child node and) the ord of the en child's parent and the ord of the en parent node (and the ord of the en parent's parent), the feature can check whether en_parent->ord = en_child->parentOrd:
equalspc(en->ord, en->parent->ord)
AUTHORS
Rudolf Rosa <rosa@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.