NAME

coocfreq - count co-occurrence frequencies for arbitrary features of nodes in a parallel treebank

SYNOPSIS

coocfreq [OPTIONS]

# count co-occurrence frequencies between category labels
# in the parallel treebank of Sophie's World (Smultron)
# and print the results to plain text files

coocfreq -a sophie.xml -A sta -x cat -y cat -f cat.src -e cat.trg -c cat.cooc

# count co-occurrences of 3-letter-suffix + category label of the parent node
# of the source language tree with words from the target language tree
# results will be stored in src.freq, trg.freq and cooc.freq

coocfreq -a sophie.xml -A sta -x suffix=3:parent_cat -y word

DESCRIPTION

This script counts frequencies and co-occurrence frequencies of source and target language features. It runs through the sentence-aligned treebank and combines all node pairs. Note that the co-occurrence frequency of a feature pair within a sentence is min( srcfreq(srcfeature) , trgfreq(trgfeature) ), which ensures that Dice scores stay between 0 and 1.

OPTIONS

-f src.freq

Specify the name of the file for the source language frequencies. The file starts with a line naming the source language features used (marked by an initial '#'). All other lines contain three TAB-separated items: the feature string, a unique ID, and the frequency. Here is an example:

# word
learned 682     4
stamp   722     3
hat     1056    5
what    399     20
again   220     14
of      27      118
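
As an illustration of the format described above, here is a minimal Python sketch that reads such a frequency file. It is not part of coocfreq; the function name read_freq_file and the returned data layout are assumptions made for this example.

```python
import io


def read_freq_file(handle):
    """Parse a frequency file: one '# features' header line,
    then lines of feature<TAB>id<TAB>frequency."""
    features = None
    by_id = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Only the first line is treated as the header.
        if features is None and line.startswith("#"):
            features = line[1:].strip()
            continue
        # Split from the right so the feature string is kept whole.
        feature, feat_id, freq = line.rsplit("\t", 2)
        by_id[int(feat_id)] = (feature, int(freq))
    return features, by_id


sample = io.StringIO("# word\nlearned\t682\t4\nstamp\t722\t3\nof\t27\t118\n")
features, by_id = read_freq_file(sample)
print(features, by_id[27])  # word ('of', 118)
```
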

-e trg.freq

Specify the name for the target language frequencies. The format is the same as for the source language.

-c cooc.freq

Specify the name of the file for the co-occurrence frequencies. The first two lines name the files with the source and the target language frequencies. All other lines contain three TAB-separated items: the source feature ID, the target feature ID, and the co-occurrence frequency. Here is an example:

# source frequencies: word.src
# target frequencies: word.trg
127     32      4
127     898     3
127     31      3
127     11      5
127     138     6
798     9       4
1250    1367    3
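
The following Python sketch shows how such a file could be read and how a Dice score could be derived from the three counts. It is an illustration only, not code from coocfreq: the function names are hypothetical, the Dice formula is the standard 2*cooc / (srcfreq + trgfreq), and the source and target frequencies in the usage lines are invented.

```python
import io


def read_cooc_file(handle):
    """Parse a co-occurrence file: two '#' header lines naming the
    frequency files, then lines of srcID<TAB>trgID<TAB>frequency."""
    src_name = trg_name = None
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("# source frequencies:"):
            src_name = line.split(":", 1)[1].strip()
        elif line.startswith("# target frequencies:"):
            trg_name = line.split(":", 1)[1].strip()
        else:
            src_id, trg_id, freq = line.split("\t")
            counts[(int(src_id), int(trg_id))] = int(freq)
    return src_name, trg_name, counts


def dice(cooc, src_freq, trg_freq):
    """Standard Dice coefficient from a co-occurrence count and two totals."""
    return 2.0 * cooc / (src_freq + trg_freq)


sample = io.StringIO(
    "# source frequencies: word.src\n"
    "# target frequencies: word.trg\n"
    "127\t32\t4\n"
    "798\t9\t4\n"
)
src_name, trg_name, counts = read_cooc_file(sample)

# Invented totals for illustration: source ID 127 seen 10 times,
# target ID 32 seen 6 times.
score = dice(counts[(127, 32)], 10, 6)
print(src_name, trg_name, score)  # word.src word.trg 0.5
```
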

-a align-file

Name of the alignment file, which must include sentence alignment information. Parallel corpora without an explicit sentence alignment file can also be used. For example, you can omit this option if your parallel corpus is a plain text corpus with two separate files for source and target language whose corresponding lines are aligned.
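
To make the line-aligned plain text case concrete, here is a small Python sketch that pairs up corresponding lines of two files. It only illustrates the corpus layout and does not show how coocfreq reads its input; the function name, the file names and the whitespace tokenization are assumptions.

```python
import os
import tempfile


def read_line_aligned(src_path, trg_path, encoding="utf-8"):
    """Yield (source tokens, target tokens) for each pair of aligned lines.

    Line N of the source file is taken to be aligned with line N of the
    target file. Tokens are split on whitespace (an assumption).
    Reading stops at the end of the shorter file.
    """
    with open(src_path, encoding=encoding) as src, \
         open(trg_path, encoding=encoding) as trg:
        for src_line, trg_line in zip(src, trg):
            yield src_line.split(), trg_line.split()


# Usage with two tiny hypothetical corpus files.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "corpus.src")
    trg_path = os.path.join(tmp, "corpus.trg")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("what is this\nagain\n")
    with open(trg_path, "w", encoding="utf-8") as f:
        f.write("vad är detta\nigen\n")
    pairs = list(read_line_aligned(src_path, trg_path))

print(pairs[1])  # (['again'], ['igen'])
```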

-A align-file-format

Format of the sentence alignment file, for example OPUS (the XCES format used in OPUS) or STA (the Stockholm Tree Aligner format).

-s src-file

Source language file of your parallel corpus.

-S src-file-format

Format of the source language file. The default is "plain text".

-t trg-file

Target language file of your parallel corpus.

-T trg-file-format

Format of the target language file. The default is "plain text".

-x srcfeatures

Features of the source language trees. The default feature is 'word', i.e. the surface word at each terminal node. All kinds of node attributes, combinations of features, and contextual features can be used (see the SYNOPSIS for an example such as 'suffix=3:parent_cat').
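
The SYNOPSIS describes 'suffix=3:parent_cat' as a 3-letter suffix combined with the category label of the parent node. Based on that single example, the Python sketch below shows one way such a string might be decomposed. The assumptions that ':' separates combined features and '=' attaches a parameter are inferred from that example only; the actual feature syntax is defined by Lingua::Align::Features, and the function name is hypothetical.

```python
def parse_feature_spec(spec):
    """Split a feature string into (name, argument) pairs.

    Assumes ':' joins combined features and '=' attaches an optional
    argument, as suggested by 'suffix=3:parent_cat' in the SYNOPSIS.
    """
    parts = []
    for item in spec.split(":"):
        name, sep, arg = item.partition("=")
        parts.append((name, arg if sep else None))
    return parts


spec = parse_feature_spec("suffix=3:parent_cat")
print(spec)  # [('suffix', '3'), ('parent_cat', None)]
```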

-y trgfeatures

The same as -x but for the target language trees.

-m freq-threshold

The frequency threshold. Default is 2.

-D

A flag that enables storing the source and target language vocabulary in DB_File database files on disk to save memory when counting. This is mainly useful for complex (long) feature strings; otherwise the savings are small, because the co-occurrence matrix is what takes up most of the memory.
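
The trade-off can be illustrated with a rough Python analogue that keeps feature counts in an on-disk key-value store instead of an in-memory dictionary. This is only a sketch of the general idea: it uses Python's dbm.dumb module rather than Perl's DB_File, and the function names and file layout are assumptions unrelated to what coocfreq actually writes.

```python
import dbm.dumb
import os
import tempfile


def count_features(features, db_path):
    """Increment an on-disk counter for every feature string."""
    with dbm.dumb.open(db_path, "c") as db:
        for feature in features:
            key = feature.encode("utf-8")
            current = int(db[key]) if key in db else 0
            db[key] = str(current + 1).encode("ascii")


def read_count(db_path, feature):
    """Look up the stored count of one feature (0 if unseen)."""
    with dbm.dumb.open(db_path, "r") as db:
        key = feature.encode("utf-8")
        return int(db[key]) if key in db else 0


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "srcvoc")
    count_features(["ned:VP", "ing:VP", "ned:VP"], db_path)
    ned_count = read_count(db_path, "ned:VP")
    unseen_count = read_count(db_path, "xyz:NP")

print(ned_count, unseen_count)  # 2 0
```

Only the vocabulary moves to disk in this sketch, so long feature strings no longer occupy memory, while a large table of co-occurrence counts would still have to be held elsewhere.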

SEE ALSO

Lingua::treealign, Lingua::Align::Trees, Lingua::Align::Features

AUTHOR

Joerg Tiedemann, <jorg.tiedemann@lingfil.uu.se>

COPYRIGHT AND LICENSE

Copyright (C) 2009 by Joerg Tiedemann

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.