NAME
Lingua::Align::Features - Feature extraction for tree alignment
SYNOPSIS
  use Lingua::Align::Features;

  my $FeatString = 'catpos:treespansim:parent_catpos';
  my $extractor = new Lingua::Align::Features(
                          -features => $FeatString);

  my %features = $extractor->features(\%srctree, \%trgtree,
                                      $srcnode, $trgnode);

  my $FeatString2 = 'giza:gizae2f:gizaf2e:moses';
  my $extractor2 = new Lingua::Align::Features(
                          -features => $FeatString2,
                          -lexe2f => 'moses/model/lex.0-0.e2f',
                          -lexf2e => 'moses/model/lex.0-0.f2e',
                          -moses_align => 'moses/model/aligned.intersect');

  my %features = $extractor2->features(\%srctree, \%trgtree,
                                       $srcnode, $trgnode);
DESCRIPTION
Extract features from a pair of nodes from two given syntactic trees (source and target language). The trees should be complex hash structures as produced by Lingua::Align::Corpus::Treebank::TigerXML. The extracted features are returned as simple key-value pairs (%features).
Features to be used are specified in the feature string given to the constructor ($FeatString). The default is 'inside2:outside2', which refers to two features: the inside score and the outside score as defined by the Dublin Sub-Tree Aligner (see http://www2.sfs.uni-tuebingen.de/ventzi/Home/Software/Software.html, http://ventsislavzhechev.eu/Downloads/Zhechev%20MT%20Marathon%202009.pdf). For these you will need the probabilistic lexicons created by Moses (http://statmt.org/moses/); see the -lexe2f and -lexf2e parameters in the constructor of the second example.
Features in the feature string are separated by ':'. Here is an example of a feature string combining three feature types: tree-level similarity scores, tree-span similarity scores and category/POS label pairs (more information about supported feature types can be found below):
  treelevelsim:treespansim:catpos
You can also use contextual features, meaning that you can extract all possible feature types from connected nodes. This is done by specifying how the contextual node is connected to the current one. For example, you can refer to parent nodes on the source and/or target language side. A feature with the prefix 'parent_' makes the feature extractor take the corresponding values from the first parent nodes in the source and target language trees. The prefix 'srcparent_' takes the values from the source language parent (but the current target language node) and 'trgparent_' takes the values from the target language parent (but the current source language node). For example, 'parent_catpos' gets the labels of the parent nodes. These feature types can again be combined with others as described below (product, average, concatenation). There are also 'sister_' and 'children_' features, which refer to the feature with the maximum value among all sister/children nodes, respectively.
Finally, there is also a way to address neighbor nodes using the prefix 'neighborXY_', where X and Y specify the distance from the current node (single digits only!). X gives the distance of the source language neighbor and Y the distance of the target language neighbor. Negative values refer to left neighbors and positive values (do not use '+' to indicate positive values!) refer to neighbors to the right. For terminal nodes all surface words are considered when retrieving neighbors. For all other nodes, only neighbors that are connected via the same parent node will be retrieved. Here are some examples of contextual features (a usage sketch follows the examples):
  # category/POS label pairs from the left neighbor on the source side
  # and the current node on the target side
  neighbor-10_catpos

  # Moses word alignment feature from the left source language neighbor
  # and the neighbor 2 positions to the left on the target side
  neighbor-1-2_moses

  # tree level similarity between the source parent and the current target node
  srcparent_treelevelsim

  # average gizae2f score of all children of the current node pair
  children_gizae2f

  # category/POS label pair of the grandparent on the source side
  # and the parent on the target side
  parent_srcparent_catpos

  # lexical "inside2" score from the left source neighbor
  # and the right target neighbor
  neighbor-11_inside2

  # category/POS label pairs of ALL combinations of sister nodes
  # of the current node pair (from both sides)
  sister_catpos
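As a hedged usage sketch, contextual prefixes are simply part of the feature string passed to the constructor; the tree and node variables below are placeholders for structures read with Lingua::Align::Corpus::Treebank:

  use Lingua::Align::Features;

  # combine plain, parent and neighbor variants of the catpos feature
  my $extractor = new Lingua::Align::Features(
          -features => 'catpos:parent_catpos:neighbor-10_catpos');

  # %srctree, %trgtree, $srcnode and $trgnode as in the SYNOPSIS
  my %features = $extractor->features(\%srctree, \%trgtree,
                                      $srcnode, $trgnode);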
Feature types can also be combined to form complex features. Possible combinations are:
- product (*)
  Multiply the values of 2 or more feature types; 'inside2*outside2', for example, refers to the product of the inside2 and outside2 scores.
- average (+)
  Compute the average (arithmetic mean) of 2 or more features; 'inside2+outside2', for example, refers to the mean of the inside2 and outside2 scores.
- concatenation (.)
  Merge 2 or more feature keys and compute the average of their scores. This is especially useful for "nominal" feature types that have several instantiations. For example, 'catpos' refers to the labels of the nodes (category or POS label) and the value of this feature is always 1 (present). For 2 given nodes the feature might thus be 'catpos_NN_NP => 1' if the label of the source tree node is 'NN' and the label of the target tree node is 'NP'. Such nominal features can be combined with real-valued features such as inside2 scores: 'catpos.inside2' means to concatenate the keys of both feature types and to compute the arithmetic mean of both scores (a usage sketch follows the examples below). Here are some more examples of complex features:
  # product of inside and outside scores
  inside2*outside2

  # product of tree level similarity score between source parent and
  # current target and Moses score for parent nodes on both sides
  srcparent_treelevelsim*parent_moses
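As a hedged illustration of the concatenation operator (the exact key format below is an assumption, following the 'catpos_NN_NP' example above):

  my $extractor = new Lingua::Align::Features(
          -features => 'catpos.inside2',
          -lexe2f   => 'moses/model/lex.0-0.e2f',
          -lexf2e   => 'moses/model/lex.0-0.f2e');
  my %features = $extractor->features(\%srctree, \%trgtree,
                                      $srcnode, $trgnode);

  # the returned hash might then contain a key such as
  # 'catpos_NN_NP.inside2' whose value is the arithmetic mean of the
  # catpos value (1) and the inside2 score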
FEATURES
The following feature types are implemented in the Tree Aligner:
lexical equivalence features
Lexical equivalence features evaluate the relations between words dominated by the current subtree root nodes (alignment candidates). They all use lexical probabilities, usually derived from automatic word alignment (other types of probabilistic lexica could be used as well). Inside words are terminal nodes that are dominated by the current subtree root nodes; outside words are terminal nodes that are not dominated by them. Several variants of these scores are possible (a computational sketch follows the list below):
- inside1 (insideST1*insideTS1)
  This is the unnormalized score of the words inside of the current subtrees (see http://ventsislavzhechev.eu/Downloads/Zhechev%20MT%20Marathon%202009.pdf). Lexical probabilities are taken from automatic word alignment (lex files). NULL links are also taken into account. It is actually the product of insideST1 (probabilities from the source-to-target lexicon) and insideTS1 (probabilities from the target-to-source lexicon), which can also be used separately (as individual features).
- outside1 (outsideST1*outsideTS1)
  The same as inside1 but for word pairs outside of the current subtrees. NULL links are counted and scores are not normalized.
- inside2 (insideST2*insideTS2)
  This refers to the normalized inside scores as defined in the Dublin Subtree Aligner.
- outside2 (outsideST2*outsideTS2)
  The normalized scores of word pairs outside of the subtrees.
- inside3 (insideST3*insideTS3)
  The same as inside1 (unnormalized) but without considering NULL links (which makes feature extraction much faster).
- outside3 (outsideST3*outsideTS3)
  The same as outside1 but without NULL links.
- inside4 (insideST4*insideTS4)
  The same as inside2 but without NULL links.
- outside4 (outsideST4*outsideTS4)
  The same as outside2 but without NULL links.
- maxinside (maxinsideST*maxinsideTS)
  This is basically the same as inside4 but using max_y P(x|y) instead of 1/|y| * SUM_y P(x|y) as in the original definition. maxinsideST uses the source-to-target scores and maxinsideTS uses the target-to-source scores.
- maxoutside (maxoutsideST*maxoutsideTS)
  The same as maxinside but for outside word pairs.
- avgmaxinside (avgmaxinsideST*avgmaxinsideTS)
  This is the same as maxinside but computing the average 1/|x| * SUM_x max_y P(x|y) instead of the product PROD_x max_y P(x|y).
- avgmaxoutside (avgmaxoutsideST*avgmaxoutsideTS)
  The same as avgmaxinside but for outside word pairs.
- unioninside (unioninsideST*unioninsideTS)
  Adds up all lexical probabilities using the addition rule for independent but not mutually exclusive events, P(x1|y1) + P(x2|y2) - P(x1|y1)*P(x2|y2), applied iteratively.
- unionoutside (unionoutsideST*unionoutsideTS)
  The same as unioninside but for outside word pairs.
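The following is a minimal sketch, not the module's internal code, of how the 'max' and 'union' style combinations differ, assuming a nested hash $P->{$x}{$y} holding the lexical probabilities P(x|y):

  # PROD_x max_y P(x|y) over inside source words @$src and target words @$trg
  sub maxinside_sketch {
      my ($src, $trg, $P) = @_;
      my $score = 1;
      foreach my $x (@$src) {
          my $max = 0;
          foreach my $y (@$trg) {
              my $p = $P->{$x}{$y} || 0;
              $max = $p if ($p > $max);
          }
          $score *= $max;
      }
      return $score;
  }

  # iterated addition rule: P(a or b) = P(a) + P(b) - P(a)*P(b)
  sub unioninside_sketch {
      my ($src, $trg, $P) = @_;
      my $score = 0;
      foreach my $x (@$src) {
          foreach my $y (@$trg) {
              my $p = $P->{$x}{$y} || 0;
              $score = $score + $p - $score * $p;
          }
      }
      return $score;
  }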
word alignment features
Word alignment features use the automatic word alignment directly. Again we distinguish between words that are dominated by the current subtree root nodes (inside) and words that are outside. Alignment is binary (1 if two words are aligned and 0 if not) and as a score we usually compute the proportion of interlinked inside word pairs among all links involving either source or target inside words (a sketch of this proportion follows the list below). One exception is the moseslink feature, which is only defined for terminal nodes.
- moses
  The proportion of interlinked words (from the automatic word alignment) inside of the current subtrees among all links involving either source or target words inside of the subtrees.
- moseslink
  Only for terminal nodes: set to 1 if the two words are linked in the automatic word alignment derived from GIZA++/Moses.
- gizae2f
  Link proportion as for moses but using only the asymmetric GIZA++ alignments (source-to-target).
- gizaf2e
  Link proportion as for moses but using only the asymmetric GIZA++ alignments (target-to-source).
- giza
  The links from gizae2f and gizaf2e combined.
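A minimal sketch of this link proportion, assuming the word links are given as pairs of source/target positions and the inside words as position sets (none of this is the module's actual interface):

  sub link_proportion_sketch {
      my ($links, $src_inside, $trg_inside) = @_;
      my ($both, $either) = (0, 0);
      foreach my $link (@$links) {           # $link = [src_pos, trg_pos]
          my $s = exists $src_inside->{$$link[0]};
          my $t = exists $trg_inside->{$$link[1]};
          $either++ if ($s or $t);           # link involves an inside word
          $both++   if ($s and $t);          # link stays inside both subtrees
      }
      return $either ? $both / $either : 0;
  }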
sub-tree features
Sub-tree features relate to the structure and position of the current subtrees (a computational sketch follows the list below).
- treespansim
  A feature measuring the "horizontal" similarity of the subtrees under consideration. It is defined as 1 minus the relative position difference of the subtree spans. The relative position of a subtree is the middle of its span, (begin+end)/2, divided by the length of the sentence.
- treelevelsim
  A feature measuring the "vertical" similarity of two nodes. It is defined as 1 minus the relative tree level difference. The relative tree level is the distance to the sentence root node divided by the size of the tree (the maximum distance of any node in the tree to the sentence root).
- nrleafsratio
  The ratio of the numbers of leaf nodes dominated by the two candidate nodes, defined as min(nr_src_leafs/nr_trg_leafs, nr_trg_leafs/nr_src_leafs).
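A minimal sketch of these three scores under the definitions above (token spans, tree levels, sentence lengths and tree sizes are assumed inputs):

  sub treespansim_sketch {
      my ($sbegin, $send, $tbegin, $tend, $srclen, $trglen) = @_;
      my $relsrc = (($sbegin + $send) / 2) / $srclen;   # relative span position
      my $reltrg = (($tbegin + $tend) / 2) / $trglen;
      return 1 - abs($relsrc - $reltrg);
  }

  sub treelevelsim_sketch {
      my ($srclevel, $trglevel, $srcsize, $trgsize) = @_;  # levels & tree sizes
      return 1 - abs($srclevel/$srcsize - $trglevel/$trgsize);
  }

  sub nrleafsratio_sketch {
      my ($nsrc, $ntrg) = @_;
      return 0 unless ($nsrc and $ntrg);
      return $nsrc < $ntrg ? $nsrc/$ntrg : $ntrg/$nsrc;
  }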
annotation/label features
- catpos
  This feature type extracts node label pairs and gives them the value 1. It uses the "cat" attribute if it exists, otherwise the "pos" attribute if that one exists.
- edge
  This feature refers to the pair of edge labels (relations) of the current nodes to their immediate parents (only the first parent is considered if multiple parents exist). This is a binary feature, set to 1 for each observed label pair.
co-occurrence features
Measures of co-occurrence can also be used as features. Currently, Dice scores are supported; they are computed on the fly from co-occurrence frequency counts. Frequencies should be stored in plain text files using a simple format:
Source/target language frequencies: the first line starts with '#' and specifies the node features used (for example, '# word' means that the actual surface words are used). All other lines contain three TAB-separated items: the actual token, a unique token ID and the frequency. A file should look like this:
  # word
  learned 682 4
  stamp 722 3
  hat 1056 5
  what 399 20
  again 220 14
Co-occurrence frequencies are also stored in plain text files with the following format: the first two lines specify the files used to store the source and target language frequencies. All other lines contain a source token ID, a target token ID and the corresponding co-occurrence frequency. An example could look like this:
  # source frequencies: word.src
  # target frequencies: word.trg
  127 32 4
  127 898 3
  127 31 3
  798 64 3
  798 861 4
The easiest way to produce such frequency files is to use the script coocfreq in the bin directory of Lingua-Align. Look at the header of this script for possible options.
Features for the frequency counts can be quite complex. Any node attribute can be used. Special features are suffix=X and prefix=X, which refer to word suffixes and word prefixes of length X (number of characters), respectively. Another special feature is edge, which refers to the relation to the head of the current node. Features can also be combined (separate each feature with one ':' character). You may also use features of context nodes using 'parent_', 'children_' and 'sister_' as for the alignment features. Here is an example of a complex feature:
  word:pos:parent_suffix=3:parent_cat
This refers to the surface word at the current node, its POS label, the 3-letter suffix of the parent's word and the category label of the parent node. All these feature values are concatenated with ':' and frequencies refer to those concatenated strings.
- diceNAME=COOCFILE
  Using features that start with the prefix 'dice', you can use Dice scores as features; they are computed from the frequencies in COOCFILE (and the source/target frequencies in the files specified in COOCFILE). You have to give each dice feature a unique name (NAME) if you want to use several Dice score features. For example,

    dicePOS=pos.coocfreq

  enables Dice score features over POS-label co-occurrence frequencies stored in pos.coocfreq (if that's what you've stored in pos.coocfreq). You may again use context features in the same way as for all other features, for example 'sister_dicePOS=pos.coocfreq'. Note that co-occurrence features do not always exist for all nodes in the tree (for example, POS labels do not exist for non-terminal nodes). A computational sketch of the score follows below.
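A minimal sketch of the score itself, assuming the standard Dice coefficient over the frequencies from the files described above:

  # dice(s,t) = 2 * cooc(s,t) / (freq(s) + freq(t))
  sub dice_sketch {
      my ($cooc, $srcfreq, $trgfreq) = @_;
      return 0 unless ($srcfreq + $trgfreq);
      return 2 * $cooc / ($srcfreq + $trgfreq);
  }

  # hypothetical counts: a pair co-occurring 4 times with marginal
  # frequencies 20 and 14 gives 2*4/(20+14) = 8/34, roughly 0.24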
"orthographic" features
You can also use features that are based on the comparison and combination of strings. There are (sub)string features, string similarity features, string class features and length comparison features.
- lendiff
  The absolute character length difference between the source and target language strings dominated by the current nodes.
- lenratio
  The character length ratio of the source and target language strings dominated by the current nodes (length of the shorter string divided by length of the longer string).
- word
  The pair of words at the current nodes (leaf nodes only).
- suffix=X
  The pair of suffixes of length X from the source and target language words (leaf nodes only).
- prefix=X
  The pair of prefixes of length X from the source and target language words (leaf nodes only).
- isnumber
  Set to 1 if both strings match the pattern /^[\d\.\,]+\%?$/.
- hasdigit
  Set to 1 if both strings contain at least one digit.
- haspunct
  Set to 1 if both strings contain punctuation characters.
- ispunct
  Set to 1 if both strings are single punctuation characters.
- punct
  Set to the actual pair of strings if both strings are single punctuation characters.
- identical=minlength
  This feature is 1 if both strings are longer than minlength characters and are identical.

- lcsr=minlength
  The longest common subsequence ratio between the two strings if both are longer than minlength characters (a sketch follows the list below).

- lcsrlc=minlength
  The same as lcsr but using lowercased strings.

- lcsrascii=minlength
  The same as lcsr but using only the ASCII characters in both strings.

- lcsrcons=minlength
  The same as lcsr but using a simple regex to remove all vowels (matching a fixed set of characters).
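A minimal sketch of the longest common subsequence ratio, assuming the common definition of LCS length divided by the length of the longer string:

  sub lcsr_sketch {
      my ($s, $t) = @_;
      my @S = split(//, $s);
      my @T = split(//, $t);
      # dynamic programming table for the LCS length
      my @len = map { [ (0) x (@T + 1) ] } (0 .. @S);
      foreach my $i (1 .. @S) {
          foreach my $j (1 .. @T) {
              if ($S[$i-1] eq $T[$j-1]) {
                  $len[$i][$j] = $len[$i-1][$j-1] + 1;
              }
              else {
                  $len[$i][$j] = $len[$i-1][$j] > $len[$i][$j-1]
                               ? $len[$i-1][$j] : $len[$i][$j-1];
              }
          }
      }
      my $maxlen = @S > @T ? scalar(@S) : scalar(@T);
      return $maxlen ? $len[-1][-1] / $maxlen : 0;
  }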
SEE ALSO
For the tree structure, see Lingua::Align::Corpus::Treebank. For information on the tree aligner, see Lingua::Align::Trees.
AUTHOR
Joerg Tiedemann
COPYRIGHT AND LICENSE
Copyright (C) 2009 by Joerg Tiedemann
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.