NAME

cwb-convert-to-utf8 - Convert existing CWB corpus to UTF-8 encoding

SYNOPSIS

cwb-convert-to-utf8 [options] <CORPUS> <new_datadir>

Options:

-r <dir>,              registry directory for old corpus [system default]
  --registry=<dir>
-or <dir>,             registry directory for new corpus [same as old corpus]
  --output-registry=<dir>
-n <id>, -on <id>,     CWB name of new corpus [<CORPUS>_UTF8]
  --name=<id>, --output-name=<id>
-M <n>, --memory=<n>   use ca. <n> MBytes of RAM for corpus indexing [1000]
-f, --force            overwrite existing registry entry and data directory
-v, --verbose          show progress message for each attribute
-h, --help             display short help page

DESCRIPTION

This script provides a convenient method to convert existing CWB-indexed corpora in legacy encodings (latin1 = ISO-8859-1 etc.) into UTF-8 encoding for use with recent versions of CWB (v3.4.2 and newer).

cwb-convert-to-utf8 requires two arguments: the CWB name of the old (non-UTF-8) corpus and a data directory for the binary index files of the new (UTF-8) corpus. It then automatically converts and re-indexes all corpus attributes and creates a new registry file in the same registry directory as the old corpus, appending _UTF8 to the corpus name (unless these defaults are overridden by command-line options). For example,

cwb-convert-to-utf8 TIGER /Corpora/CWB/TigerUTF

would locate the corpus TIGER (presumably a copy of the German Tiger Treebank, encoded in latin1) somewhere in the default registry path, create a new registry entry TIGER_UTF8 in the same directory, and store the re-encoded index files in the directory /Corpora/CWB/TigerUTF/.
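
The defaults described above can be overridden with the options listed in the synopsis. The following is only a sketch: the registry paths, the new corpus name TIGER_NEW, the memory limit and the helper function convert_tiger are hypothetical placeholders, not part of CWB. Any words passed to the helper are placed in front of the command, so passing echo gives a dry run that prints the command line instead of executing it.

```shell
# Sketch of a conversion with non-default settings.
# All paths, the name TIGER_NEW and the function name are hypothetical.
convert_tiger() {
    # "$@" is an optional command prefix, e.g. "echo" for a dry run.
    "$@" cwb-convert-to-utf8 \
        -r /Corpora/CWB/registry-latin1 \
        -or /Corpora/CWB/registry \
        -n TIGER_NEW \
        -M 2000 \
        -v \
        TIGER /Corpora/CWB/TigerUTF
}

# Dry run: print the command line without touching any corpus.
convert_tiger echo
```

Calling convert_tiger without arguments would run the actual conversion, reading TIGER from the first registry directory and writing the registry entry for TIGER_NEW to the second.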

OPTIONS

--registry=dir, -r dir

Search for the input corpus in registry directory dir rather than the default registry path.

--output-registry=dir, -or dir

Create registry entry for the new corpus in dir rather than the same directory as the old corpus.

--name=id, -n id
--output-name=id, -on id

Set the CWB name of the new corpus to id. By default, _UTF8 is appended to the CWB name of the input corpus.

--memory=n, -M n

Allow cwb-make to use approx. n MBytes of RAM for indexing (default: 1000).

--force, -f

Silently overwrite an existing registry entry and/or data directory. Use with caution, as this will remove all files from an existing data directory.

--verbose, -v

Show a progress message for each individual attribute (recommended for large corpora).

--help, -h

Display short help page.

PREREQUISITES

cwb-convert-to-utf8 requires a recent version of CWB with Unicode support, viz. CWB v3.4.2 or newer. If you have installed multiple CWB releases on your computer, make sure that the CWB/Perl modules are configured to use an appropriate CWB version.

For efficiency reasons, character encodings are converted with the external iconv utility, which must be installed somewhere in the system path. Your version of iconv must support command line options -f (source encoding), -t (target encoding) and -c (ignore conversion errors); it also needs to understand CWB style encoding names such as utf8 and latin1. Suitable versions of iconv are provided by Linux and Mac OS X, and by any POSIX-conformant system.
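
One way to check these iconv requirements before starting a long conversion is sketched below. This is not something the script itself is documented to do; the function name check_iconv and the test string are assumptions made for illustration. The check converts a short latin1 string with the three required options and the CWB-style encoding names, then compares the result with the expected UTF-8 bytes.

```shell
# Sketch: verify that iconv accepts -f, -t, -c and the names latin1 / utf8.
# check_iconv is a hypothetical helper, not part of CWB.
check_iconv() {
    # \351 is the latin1 byte for "e with acute accent".
    converted=$(printf 'caf\351' | iconv -c -f latin1 -t utf8 2>/dev/null) || return 1
    # \303\251 is the same character encoded in UTF-8.
    expected=$(printf 'caf\303\251')
    [ "$converted" = "$expected" ]
}

if check_iconv; then
    echo "iconv looks suitable"
else
    echo "iconv is missing or does not accept these options and encoding names"
fi
```

If the check fails, installing a different iconv implementation earlier in the system path is the most likely remedy.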

BUGS

Feature set attributes (|feat1|feat2|...|) containing non-ASCII characters may no longer be sorted correctly after the conversion. This will only affect queries involving the built-in unify() function, though, which is rarely used in practice.

COPYRIGHT

Copyright (C) 2007-2022 Stephanie Evert [https://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.