NAME
subalign - script for aligning the OpenSubtitlesXXXX corpora
SYNOPSIS
subalign [OPTIONS] <srcdir> <trgdir>
srcdir and trgdir are directories in the subtitle corpus for the source and the target language. The script creates a corresponding sub-directory for the aligned data. For example,
subalign en/2001/209475 et/2001/209475
aligns files in the English and Estonian collections of subtitles for a movie from 2001 with the ID 209475. The resulting files will be created in en-et/2001/209475.
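The naming of the output directory in the example above can be sketched in a few lines of shell. This is only an illustration: aligned_dir is a hypothetical helper, not part of subalign, and it assumes that the first path component is the language ID and that the rest of the path (year and movie ID) is the same for both inputs.

```shell
#!/bin/sh
# Hypothetical helper (not part of subalign): derive the directory name
# for the aligned data from a source and a target directory.
# Assumption: paths look like <lang>/<year>/<id>.
aligned_dir() {
    src=$1
    trg=$2
    src_lang=${src%%/*}    # first path component of srcdir, e.g. "en"
    trg_lang=${trg%%/*}    # first path component of trgdir, e.g. "et"
    rest=${src#*/}         # remaining components, e.g. "2001/209475"
    printf '%s-%s/%s\n' "$src_lang" "$trg_lang" "$rest"
}

aligned_dir en/2001/209475 et/2001/209475
```

For the example above this prints en-et/2001/209475.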
OPTIONS
Command line arguments for subalign:
-A ................... store alternative alignments in outdir/alt
-a accept-threshold .. accept alternative subtitle pairs > score (default=0.75)
-D duration-thr ...... min duration similarity (default=0.8)
-M max ............... max nr of subtitle file pairs to try
-L ................... skip symbolic links (when looking for files)
-x score-threshold ... threshold for overlap + metascore (before aligning, default = 0.2)
Command line arguments related to srt-alignment:
-S source-lang . source language ID
-T target-lang . target language ID
-c score ....... use cognates with LCSR>=score
-r score-range . use cognates in a certain range 1..score and take best
-l length ...... set minimal length of cognates (if used)
-i len ......... use identical strings with length>=len
-w size ........ set size for sliding window
-d dic ......... use dictionary in file 'dic'
-u ............. cognates/identicals that start with upper case only
-r char_set .... define a set of characters to be used for matching
-q ............. normalize length scores with (current) word frequencies
-b ............. use "best" alignment (least empty alignments)
-p nr .......... stop after <nr> candidates (when using -b)
-m MAX ......... in "best" alignment: use only MAX first & MAX last (default = 10; 0 = all)
-f uplug-conf .. use fallback aligner if necessary
-v ............. verbose output
DESCRIPTION
subalign looks at pairs of movie subtitle files in the OpenSubtitles corpora and, among all alternative subtitle files, tries to find the pair that aligns with the fewest empty translation units. It ranks candidate pairs by a score that combines the proportion of non-empty alignment units with a score based on metadata. The latter requires meta-information stored in the subtitle files (which is available from OpenSubtitles2016).
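The selection step described above can be sketched as follows. This is an illustration only: the text does not say how the two scores are combined, so the plain average used here is an assumption, and best_pair, the candidate names, and the input format are all hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration (not subalign's actual code or formula).
# Reads one candidate pair per line from stdin in the form
#   <name> <proportion of non-empty alignment units> <metadata score>
# and prints the candidate with the highest combined score.
best_pair() {
    awk '{
        score = ($2 + $3) / 2    # ASSUMPTION: combine by plain average
        if (NR == 1 || score > best) { best = score; name = $1 }
    } END { if (NR > 0) printf "%s %.2f\n", name, best }'
}

printf '%s\n' \
    'pairA 0.90 0.50' \
    'pairB 0.80 0.90' \
    'pairC 0.60 0.70' | best_pair
```

With these made-up numbers pairB is selected, because it has the highest average of the two scores even though pairA has fewer empty alignment units.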