NAME

Lingua::Align::Features - Feature extraction for tree alignment

SYNOPSIS

use Lingua::Align::Features;

my $FeatString = 'catpos:treespansim:parent_catpos';
my $extractor = new Lingua::Align::Features(
                        -features => $FeatString);

my %features = $extractor->features(\%srctree,\%trgtree,
                                    $srcnode,$trgnode);



my $FeatString2 = 'giza:gizae2f:gizaf2e:moses';
my $extractor2 = new Lingua::Align::Features(
                    -features => $FeatString2,
                    -lexe2f => 'moses/model/lex.0-0.e2f',
                    -lexf2e => 'moses/model/lex.0-0.f2e',
                    -moses_align => 'moses/model/aligned.intersect');

my %features = $extractor2->features(\%srctree,\%trgtree,
                                     $srcnode,$trgnode);

DESCRIPTION

Extract features for a pair of nodes from two given syntactic trees (source and target language). The trees should be complex hash structures as produced by Lingua::Align::Corpus::Treebank::TigerXML. The extracted features are returned as simple key-value pairs (%features).

Features to be used are specified in the feature string given to the constructor ($FeatString). The default is 'inside2:outside2', which refers to two features: the inside score and the outside score as defined by the Dublin Sub-Tree Aligner (see http://www2.sfs.uni-tuebingen.de/ventzi/Home/Software/Software.html, http://ventsislavzhechev.eu/Downloads/Zhechev%20MT%20Marathon%202009.pdf). For these you will need the probabilistic lexicons created by Moses (http://statmt.org/moses/); see the -lexe2f and -lexf2e parameters in the constructor of the second example above.

Features in the feature string are separated by ':'. Here is an example of a feature string combining tree-level similarity scores, tree-span similarity scores and category/POS label pairs (more information about the supported feature types can be found below):

treelevelsim:treespansim:catpos

You can also refer to contextual features, meaning that you can extract all possible feature types from connected nodes. This is done by specifying how the contextual node is connected to the current one. For example, you can refer to parent nodes on the source and/or target language side. A feature with the prefix 'parent_' makes the feature extractor take the corresponding values from the first parent nodes in the source and target language trees. The prefix 'srcparent_' takes the values from the source language parent (but the current target language node) and 'trgparent_' takes the target language parent (but the current source language node). For example, 'parent_catpos' gets the labels of the parent nodes. These feature types can again be combined with others as described below (product, average, concatenation). We can also use 'sister_' and 'children_' features, which refer to the feature with the maximum value among all sister/children nodes, respectively.

Finally, there is also a way to address neighbor nodes using the prefix 'neighborXY_', where X and Y refer to the distance from the current node (single digits only!). X gives the distance of the source language neighbor and Y the distance of the target language neighbor. Negative values refer to left neighbors and positive values refer to neighbors to the right (do not use '+' to indicate positive values!). For terminal nodes, all surface words are considered when retrieving neighbors. For all other nodes, only neighbors that are connected via the same parent node are retrieved. Here are some examples of contextual features:

# category/POS label pairs from the left neighbor on the source side
# and the current node on the target side
neighbor-10_catpos

# Moses word alignment feature from the left source language neighbor
# and the neighbor 2 positions to the left on the target side
neighbor-1-2_moses

# tree level similarity between the source parent and the current target node
srcparent_treelevelsim

# maximum gizae2f score among all children of the current node pair
children_gizae2f

# category/POS label pair of the grandparent on the source side
# and the parent on the target side
parent_srcparent_catpos

# lexical "inside2" score feature from the left source neighbor
# and the right target neighbor
neighbor-11_inside2

# category/POS label pairs of ALL combinations of sister nodes
# of the current node pair (from both sides)
sister_catpos

Feature types can also be combined to form complex features. Possible combinations are:

product (*)

multiply the values of two or more feature types, e.g. 'inside2*outside2' refers to the product of the inside2 and outside2 scores.

average (+)

compute the average (arithmetic mean) of two or more features, e.g. 'inside2+outside2' refers to the mean of the inside2 and outside2 scores.

concatenation (.)

merge two or more feature keys and compute the average of their scores. This can be especially useful for "nominal" feature types that have several instantiations. For example, 'catpos' refers to the labels of the nodes (category or POS label) and the value of this feature is always 1 (indicating presence). This means that for two given nodes the feature might be 'catpos_NN_NP => 1' if the label of the source tree node is 'NN' and the label of the target tree node is 'NP'. Such nominal features can be combined with real-valued features such as inside2 scores; e.g., 'catpos.inside2' means to concatenate the keys of both feature types and to compute the arithmetic mean of both scores. Here are some more examples of complex features:

# product of inside and outside scores
inside2*outside2

# product of tree level similarity score between source parent and
# current target and Moses score for parent nodes on both sides
srcparent_treelevelsim*parent_moses  
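
For illustration, here is how such a feature string plugs into the constructor shown in the SYNOPSIS (the particular combination below is made up for the example):

# contextual prefixes and combination operators in one feature string
my $FeatString3 = 'catpos.inside2:parent_catpos:srcparent_treelevelsim*parent_moses';
my $extractor3 = new Lingua::Align::Features(
                     -features => $FeatString3,
                     -lexe2f => 'moses/model/lex.0-0.e2f',
                     -lexf2e => 'moses/model/lex.0-0.f2e',
                     -moses_align => 'moses/model/aligned.intersect');

my %features = $extractor3->features(\%srctree,\%trgtree,
                                     $srcnode,$trgnode);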

FEATURES

The following feature types are implemented in the Tree Aligner:

lexical equivalence features

Lexical equivalence features evaluate the relations between words dominated by the current subtree root nodes (alignment candidates). They all use lexical probabilities, usually derived from automatic word alignment (other types of probabilistic lexica could be used as well). Inside words are terminal nodes that are dominated by the current subtree root nodes; outside words are terminal nodes that are not. Various variants of the scores are possible:

inside1 (insideST1*insideTS1)

This is the unnormalized score of words inside of the current subtrees (see http://ventsislavzhechev.eu/Downloads/Zhechev%20MT%20Marathon%202009.pdf). Lexical probabilities are taken from automatic word alignment (lex files). NULL links are also taken into account. It is actually the product of insideST1 (probabilities from the source-to-target lexicon) and insideTS1 (probabilities from the target-to-source lexicon), which can also be used separately (as individual features).
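
A rough sketch of one direction of this score (insideST1-style), assuming $lex holds P(trg|src) probabilities read from a Moses lex file; the module's actual implementation may differ in detail:

# unnormalized inside score in one direction, including NULL links
sub inside_st1 {
    my ($lex, $src_inside, $trg_inside) = @_;   # array refs of words
    my $score = 1;
    for my $s (@$src_inside) {
        my $sum = $lex->{$s}{'NULL'} || 0;      # NULL links count too
        $sum += ($lex->{$s}{$_} || 0) for @$trg_inside;
        $score *= $sum;
    }
    return $score;   # inside1 would be inside_st1 * inside_ts1
}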

outside1 (outsideST1*outsideTS1)

The same as inside1 but for word pairs outside of the current subtrees. NULL links are counted and scores are not normalized.

inside2 (insideST2*insideTS2)

This refers to the normalized inside scores as defined in the Dublin Subtree Aligner.

outside2 (outsideST2*outsideTS2)

The normalized scores of word pairs outside of the subtrees.

inside3 (insideST3*insideTS3)

The same as inside1 (unnormalized) but without considering NULL links (which makes feature extraction much faster).

outside3 (outsideST3*outsideTS3)

The same as outside1 but without NULL links.

inside4 (insideST4*insideTS4)

The same as inside2 but without NULL links.

outside4 (outsideST4*outsideTS4)

The same as outside2 but without NULL links.

maxinside (maxinsideST*maxinsideTS)

This is basically the same as inside4 but using the maximum lexical probability, max_y P(x|y), instead of the normalized sum (1/|y|) \SUM_y P(x|y) from the original definition. maxinsideST uses the source-to-target scores and maxinsideTS uses the target-to-source scores.
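
A sketch of the source-to-target half (maxinsideST), under the same assumptions about $lex as in the inside1 sketch above:

# product over source inside words of the best lexical probability
sub maxinside_st {
    my ($lex, $src_inside, $trg_inside) = @_;
    my $score = 1;
    for my $s (@$src_inside) {
        my $best = 0;
        for my $t (@$trg_inside) {
            my $p = $lex->{$s}{$t} || 0;
            $best = $p if $p > $best;
        }
        $score *= $best;
    }
    return $score;
}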

maxoutside (maxoutsideST*maxoutsideTS)

The same as maxinside but for outside word pairs.

avgmaxinside (avgmaxinsideST*avgmaxinsideTS)

This is the same as maxinside but computing the average, (1/|x|) \SUM_x max_y P(x|y), instead of the product, \PROD_x max_y P(x|y).

avgmaxoutside (avgmaxoutsideST*avgmaxoutsideTS)

The same as avgmaxinside but for outside word pairs.

unioninside (unioninsideST*unioninsideTS)

Add all lexical probabilities using the addition rule for independent but not mutually exclusive events, P(x1|y1) + P(x2|y2) - P(x1|y1)*P(x2|y2), applied cumulatively over all word pairs.
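
The addition rule can simply be folded over the whole list of probabilities; a minimal sketch:

# P(A or B) = P(A) + P(B) - P(A)*P(B), applied cumulatively
sub union_prob {
    my $p = 0;
    $p = $p + $_ - $p * $_ for @_;
    return $p;
}
# union_prob(0.5, 0.5) returns 0.75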

unionoutside (unionoutsideST*unionoutsideTS)

The same as unioninside but for outside word pairs.

word alignment features

Word alignment features use the automatic word alignment directly. Again we distinguish between words that are dominated by the current subtree root nodes (inside) and the ones that are outside. Alignment is binary (1 if two words are aligned and 0 if not) and as a score we usually compute the proportion of interlinked inside word pairs among all links involving either source or target inside words. One exception is the moseslink feature, which is only defined for terminal nodes.

moses

The proportion of interlinked words (from automatic word alignment) inside of the current subtree among all links involving either source or target words inside of the subtrees.
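
A sketch of this proportion with illustrative data structures (a list of [src_id, trg_id] link pairs and hash sets of inside leaf IDs; not the module's internal layout):

sub link_proportion {
    my ($links, $src_inside, $trg_inside) = @_;
    my ($both, $either) = (0, 0);
    for my $link (@$links) {
        my ($s, $t) = @$link;
        my $s_in = exists $src_inside->{$s};
        my $t_in = exists $trg_inside->{$t};
        $either++ if $s_in or $t_in;   # link involves an inside word
        $both++   if $s_in and $t_in;  # link stays inside the pair
    }
    return $either ? $both / $either : 0;
}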

moseslink

Only for terminal nodes: set to 1 if the two words are linked in the automatic word alignment derived from GIZA++/Moses.

gizae2f

Link proportion as for moses but using only the asymmetric GIZA++ alignments (source-to-target).

gizaf2e

Link proportion as for moses but using only the asymmetric GIZA++ alignments (target-to-source).

giza

Links from gizae2f and gizaf2e combined.

sub-tree features

Sub-tree features refer to features that are related to the structure and position of the current subtrees.

treespansim

This is a feature for measuring the "horizontal" similarity of the subtrees under consideration. It is defined as 1 minus the relative position difference of the subtree spans. The relative position of a subtree is defined as the middle of its span, (begin+end)/2, divided by the length of the sentence.

treelevelsim

This is a feature measuring the "vertical" similarity of two nodes. It is defined as 1 minus the relative tree level difference. The relative tree level is defined as the distance to the sentence root node divided by the size of the tree (the maximum distance of any node in the tree to the sentence root).

nrleafsratio

This is the ratio of the numbers of leaf nodes dominated by the two candidate nodes, defined as min(nr_src_leafs/nr_trg_leafs, nr_trg_leafs/nr_src_leafs).
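
The three scores above translate almost directly into code; a sketch with illustrative argument names:

sub treespansim {
    my ($sbeg, $send, $slen, $tbeg, $tend, $tlen) = @_;
    my $src_rel = (($sbeg + $send) / 2) / $slen;   # relative midpoint
    my $trg_rel = (($tbeg + $tend) / 2) / $tlen;
    return 1 - abs($src_rel - $trg_rel);
}

sub treelevelsim {
    my ($sdepth, $sheight, $tdepth, $theight) = @_;
    return 1 - abs($sdepth / $sheight - $tdepth / $theight);
}

sub nrleafsratio {
    my ($nsrc, $ntrg) = @_;
    return 0 unless $nsrc && $ntrg;
    return $nsrc < $ntrg ? $nsrc / $ntrg : $ntrg / $nsrc;
}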

annotation/label features

catpos

This feature type extracts node label pairs and gives them the value 1. It uses the "cat" attribute if it exists, otherwise it uses the "pos" attribute if that one exists.

edge

This feature refers to the pairs of edge labels (relations) of the current nodes to their immediate parent (only the first parent is considered if multiple exist). This is a binary feature and is set to 1 for each observed label pair.

co-occurrence features

Measures of co-occurrence can also be used as features. Currently, Dice scores are supported; they are computed on the fly from co-occurrence frequency counts. Frequencies should be stored in plain text files using a simple format:

Source/target language frequencies: The first line starts with '#' and specifies the node features used (for example, '# word' means that the actual surface words will be used). All other lines contain three items: the actual token, a unique token ID and the frequency, all separated by one TAB character. A file should look like this:

# word
learned 682     4
stamp   722     3
hat     1056    5
what    399     20
again   220     14
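
A minimal reader for this format could look as follows (the coocfreq script remains the recommended way to produce the files themselves):

# skip the '#' header, split the rest on TAB into token/ID/frequency
sub read_freq {
    my ($file) = @_;
    open my $fh, '<', $file or die "cannot open $file: $!";
    my (%id, %freq);
    while (my $line = <$fh>) {
        next if $line =~ /^#/;
        chomp $line;
        my ($token, $tid, $count) = split /\t/, $line;
        $id{$token} = $tid;
        $freq{$tid} = $count;
    }
    close $fh;
    return (\%id, \%freq);   # token => ID, ID => frequency
}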

Co-occurrence frequencies are also stored in plain text files with the following format: the first two lines specify the files which store the source and target language frequencies. All other lines contain a source token ID, a target token ID and the corresponding co-occurrence frequency. An example could look like this:

# source frequencies: word.src
# target frequencies: word.trg
127     32      4
127     898     3
127     31      3
798     64      3
798     861     4

The easiest way to produce such frequency files is to use the script coocfreq in the bin directory of Lingua-Align. Look at the header of this script for possible options.

Features for the frequency counts can be quite complex. Any node attribute can be used. Special features are suffix=X and prefix=X, which refer to the word suffix or word prefix, respectively, of length X (in characters). Another special feature is edge, which refers to the relation to the head of the current node. Features can also be combined (separate each feature with one ':' character). You may also use features of context nodes using the 'parent_', 'children_' and 'sister_' prefixes, as for the alignment features. Here is an example of a complex feature:

word:pos:parent_suffix=3:parent_cat

This refers to the surface word at the current node, its POS label, and the 3-character suffix and the category label of the parent node. All these feature values are concatenated with ':' and frequencies refer to those concatenated strings.

diceNAME=COOCFILE

Using features that start with the prefix 'dice', you can use Dice scores as features; they will be computed from the frequencies in COOCFILE (and the source/target frequencies in the files specified in COOCFILE). You have to give each Dice feature a unique name (NAME) if you want to use several Dice score features. For example, dicePOS=pos.coocfreq enables Dice score features over POS-label co-occurrence frequencies stored in pos.coocfreq (if that's what you've stored in pos.coocfreq).
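
Assuming the standard definition of the Dice coefficient over these counts, the score per token pair would be computed like this:

# dice(x,y) = 2 * cooc(x,y) / (freq(x) + freq(y))
sub dice {
    my ($cooc, $src_freq, $trg_freq) = @_;
    return 0 unless $src_freq + $trg_freq;
    return 2 * $cooc / ($src_freq + $trg_freq);
}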

You may again use context features in the same way as for all other features, for example 'sister_dicePOS=pos.coocfreq'. Note that co-occurrence features do not always exist for all nodes in the tree (for example, POS labels do not exist for non-terminal nodes).

"orthographic" features

You can also use features that are based on the comparison and combination of strings. There are (sub)string features, string similarity features, string class features and length comparison features.

lendiff

This is the absolute character length difference of the source and the target language strings dominated by the current nodes.

lenratio

This is the character-length ratio of the source and the target language strings dominated by the current nodes (the shorter string length divided by the longer one).

word

This is the pair of words at the current node (leaf nodes only).

suffix=X

This is the pair of suffixes of length X from both source and target language words (leaf nodes only).

prefix=X

This is the pair of prefixes of length X from both source and target language words (leaf nodes only).

isnumber

This feature is set to 1 if both strings match the pattern /^[\d\.\,]+\%?$/.

hasdigit

This feature is set to 1 if both strings contain at least one digit.

haspunct

This feature is set to 1 if both strings contain punctuation characters.

ispunct

This feature is set to 1 if both strings are single-character punctuation marks.

punct

This feature is set to the actual pair of strings if both strings are single-character punctuation marks.

identical=minlength

This feature is 1 if both strings are longer than minlength and are identical.

lcsr=minlength

This feature is the longest common subsequence ratio (LCSR) of the two strings if both are longer than minlength characters.
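
A self-contained sketch of the LCSR computation (longest common subsequence length divided by the length of the longer string):

sub lcsr {
    my ($s, $t) = @_;
    my @A = split //, $s;
    my @B = split //, $t;
    # dynamic programming table for the longest common subsequence
    my @len = map { [ (0) x (@B + 1) ] } 0 .. @A;
    for my $i (1 .. @A) {
        for my $j (1 .. @B) {
            $len[$i][$j] =
                $A[$i-1] eq $B[$j-1] ? $len[$i-1][$j-1] + 1
              : $len[$i-1][$j] > $len[$i][$j-1] ? $len[$i-1][$j]
              : $len[$i][$j-1];
        }
    }
    my $max = @A > @B ? scalar(@A) : scalar(@B);
    return $max ? $len[@A][@B] / $max : 0;
}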

lcsrlc=minlength

This is the same as lcsr but using lowercased strings.

lcsrascii=minlength

This is the same as lcsr but using only the ASCII characters in both strings.

lcsrcons=minlength

This is the same as lcsr but uses a simple regex to remove all vowels (using a fixed set of characters to match).

SEE ALSO

For the tree structure, see Lingua::Align::Corpus::Treebank. For information on the tree aligner, look at Lingua::Align::Trees.

AUTHOR

Joerg Tiedemann

COPYRIGHT AND LICENSE

Copyright (C) 2009 by Joerg Tiedemann

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.