NAME
cwb-align-export - Export existing sentence alignment from a CWB corpus
SYNOPSIS
cwb-align-export [options] <SOURCE> <TARGET> <grid> <keyspec>
  <SOURCE>   CWB name of source corpus
  <TARGET>   CWB name of target corpus
  <grid>     s-attribute containing the alignment grid (usually "s")
  <keyspec>  pattern used to construct unique IDs for grid regions
Options:
  -r <dir>, --registry=<dir>   use registry directory <dir>
  -o <file>, --output=<file>   write output to <file>
  -nh, --no-header             write alignment file without header
  -f, --force                  skip alignment beads with errors rather than stopping
  -v, --verbose                show progress messages during processing
  -h, --help                   display short help page
DESCRIPTION
This script exports an encoded sentence-level alignment between two CWB corpora (SOURCE and TARGET) as a text file compatible with cwb-align-import. In the output, alignment beads are specified by (sets of) unique sentence IDs in the source and target corpus. Unique IDs are computed from one or more s-attributes according to the pattern keyspec. Alignments at other granularities (such as paragraph or clause) are also supported; the corresponding s-attribute is specified by the command-line argument grid.
It is recommended to read the cwb-align-import manpage first, in order to get a better understanding of the export file format and its correspondence to a CWB alignment attribute. An illustrative example can be found in the CWB Corpus Encoding Tutorial.
ARGUMENTS
- SOURCE
CWB corpus ID of the source language corpus.
- TARGET
CWB corpus ID of the target language corpus.
- grid
CWB attribute representing the alignment grid, i.e. each alignment bead links n consecutive grid regions in the source language to m consecutive grid regions in the target language. It is an error if the start or end of an alignment bead region doesn't match a corresponding grid boundary.
For the most common case of sentence alignment, grid will usually be set to "s". Note that the same grid attribute is used for both source and target language corpus.
- keyspec
Pattern used to construct unique sentence IDs (both in the source and target corpus). If sentences are directly annotated with IDs, say in the s-attribute s_id, you can simply specify {s_id} or {id} for short (the name of the grid attribute is automatically prepended). Note the curly braces around the attribute name!
In more complex situations, keyspec is an arbitrary character string that interpolates s-attributes enclosed in curly braces. For example, if paragraphs and sentences are numbered (s-attributes p_num and s_num), you can construct IDs of the form id_p4_s2 (second sentence in fourth paragraph) with the pattern id_p{p_num}_s{s_num}.
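The keyspec interpolation described above can be illustrated with a short Python sketch. This is not the script's actual implementation (cwb-align-export is a Perl script); the function name make_key and the lookup order (try the name as written, then with the grid name prepended) are assumptions made for illustration only.

```python
import re


def make_key(keyspec, grid, values):
    """Build a region ID by filling {name} placeholders in keyspec.

    values maps s-attribute names (e.g. "s_id") to the annotation string
    of the current region. Assumed lookup order (hypothetical): the name
    as written first, then the name with "<grid>_" prepended.
    """
    def lookup(match):
        name = match.group(1)
        if name in values:
            return values[name]
        prefixed = f"{grid}_{name}"
        if prefixed in values:
            return values[prefixed]
        raise KeyError(f"no s-attribute value for {{{name}}}")

    # Replace every {...} placeholder; all other characters are kept as-is.
    return re.sub(r"\{([^{}]+)\}", lookup, keyspec)


# Short form: {id} resolves to s_id because the grid attribute is "s".
print(make_key("{id}", "s", {"s_id": "doc1.s7"}))
# Composite pattern from the example above.
print(make_key("id_p{p_num}_s{s_num}", "s", {"p_num": "4", "s_num": "2"}))
```

The sample value doc1.s7 is made up; the second call reproduces the id_p4_s2 example from the keyspec description.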
OPTIONS
- --registry=dir, -r dir
Locate corpora in CWB registry directory dir, overriding the default directory and the environment variable CORPUS_REGISTRY.
- --output=file, -o file
Write export data to file rather than standard output. Files with extension .gz or .bz2 are compressed automatically.
- --no-header, -nh
Write alignment file without the optional header line (see "EXPORT FILE FORMAT" below).
- --force, -f
Ignore errors and continue exporting. If the start or end point of an alignment bead doesn't match grid boundaries, the bead will be skipped with an error message.
- --verbose, -v
Verbose mode (shows progress messages during processing).
- --help, -h
Show usage and options summary.
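The grid boundary check and the effect of --force can be pictured with the following Python sketch. It only illustrates the behaviour described in this manpage and is not taken from the script itself; the function names, the representation of regions as (start, end) corpus positions with inclusive ends, and the wording of the error message are all assumptions.

```python
import sys


def regions_for_bead(grid, start, end):
    """Return indices of the consecutive grid regions covered by a bead.

    grid is a sorted list of (start, end) corpus positions (inclusive).
    Returns None if the bead's start or end is not a grid boundary.
    """
    starts = {s: i for i, (s, _) in enumerate(grid)}
    ends = {e: i for i, (_, e) in enumerate(grid)}
    if start not in starts or end not in ends:
        return None
    first, last = starts[start], ends[end]
    if first > last:
        return None
    return list(range(first, last + 1))


def check_beads(beads, grid, force=False):
    """Map each bead to grid regions; skip bad beads only if force is set."""
    result = []
    for start, end in beads:
        regions = regions_for_bead(grid, start, end)
        if regions is None:
            message = f"bead {start}..{end} does not match grid boundaries"
            if not force:
                raise ValueError(message)
            # With force, report the problem and carry on with the next bead.
            print(f"Error: {message} (skipped)", file=sys.stderr)
            continue
        result.append(regions)
    return result


grid = [(0, 4), (5, 9), (10, 14)]            # three hypothetical sentences
beads = [(0, 9), (3, 14), (10, 14)]          # the second bead is misaligned
print(check_beads(beads, grid, force=True))  # bad bead is skipped
```

Without force=True the same call raises an exception at the second bead, which mirrors the default behaviour of stopping at the first error.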
EXPORT FILE FORMAT
The exported alignment file starts with an optional header line specifying the CWB names of source and target corpus, the s-attribute containing sentence regions (or other regions used as an alignment grid), and the key pattern for constructing unique sentence IDs. The four items are separated by TAB characters. Specify --no-header to omit the header line from the export file.
Each of the remaining lines in the export file corresponds to a single alignment bead. It consists of the ID of a sentence in the source corpus (or multiple IDs separated by blanks), followed by a TAB character and the ID of the aligned sentence in the target corpus (or multiple IDs separated by blanks).
See the cwb-align-import manpage for a more detailed description of the file format and the specification of unique IDs.
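As a rough illustration of the format described above, the following Python sketch reads an export file into a header and a list of beads. It rests on one assumption that this manpage does not state: a first line with exactly four TAB-separated fields is treated as the header. The function name parse_alignment and the sample corpus names and IDs are hypothetical; consult the cwb-align-import manpage for the authoritative format.

```python
def parse_alignment(lines):
    """Parse export lines into (header, beads).

    header is a dict with keys source, target, grid, keyspec, or None if
    the file was written without a header line. Each bead is a pair
    (source_ids, target_ids) of lists of ID strings.
    """
    header = None
    beads = []
    first = True
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        # Assumption: only the first non-empty line can be a header.
        if first and len(fields) == 4:
            header = dict(zip(("source", "target", "grid", "keyspec"), fields))
            first = False
            continue
        first = False
        if len(fields) != 2:
            raise ValueError(f"malformed alignment line: {line!r}")
        source_ids, target_ids = fields
        # Multiple IDs on one side are separated by blanks.
        beads.append((source_ids.split(), target_ids.split()))
    return header, beads


sample = [
    "EN\tDE\ts\t{id}\n",   # header: source, target, grid, keyspec
    "e1\td1\n",            # 1:1 bead
    "e2 e3\td2\n",         # 2:1 bead
]
header, beads = parse_alignment(sample)
print(header)
print(beads)
```

To read a real export, the same function could be applied to an open file object, for example one returned by gzip.open(path, "rt") for a .gz export.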
COPYRIGHT
Copyright (C) 2007-2022 Stephanie Evert [https://purl.org/stephanie.evert]
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.