NAME

cwb-make-subcorpus - Materialize subcorpus as separately indexed corpus

SYNOPSIS

cwb-make-subcorpus [options] <CORPUS> <SUBCORPUS> <datadir> '<query>'

DESCRIPTION

This script creates a physical copy of a virtual subcorpus of a CWB-indexed corpus. It is often more convenient to access such a subcorpus as a separately indexed CWB corpus, and may be required for software packages that are not designed to operate on subsets of a corpus. For relatively small subcorpora, working with the physical copy will also be much more efficient.

The virtual subcorpus is a collection of textual units (any s-attribute, specified with option -S). It is defined by a CQP query and consists of all units that contain at least one match of the query. This approach ensures great flexibility, allowing subcorpora to be defined in terms of metadata, lexical items and even grammatical features.
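The selection rule can be illustrated with a minimal sketch. It is not taken from the script itself: the function name `select_units`, the inclusive `(start, end)` corpus positions and the sample data are all hypothetical, and the sketch assumes that a match counts only if it lies entirely inside one unit.

```python
from bisect import bisect_right

def select_units(units, matches):
    """Return the units that contain at least one query match.

    units   -- sorted, non-overlapping (start, end) ranges, inclusive
    matches -- (start, end) ranges of query matches, inclusive
    Assumption: a match must lie entirely inside a unit to count.
    """
    starts = [start for start, _ in units]
    selected = set()
    for m_start, m_end in matches:
        # Index of the last unit that starts at or before the match.
        i = bisect_right(starts, m_start) - 1
        if i >= 0 and units[i][1] >= m_end:
            selected.add(i)
    return [units[i] for i in sorted(selected)]

# Hypothetical data: three <text> regions and three query matches.
texts = [(0, 99), (100, 249), (250, 399)]
hits = [(12, 12), (30, 31), (300, 300)]
subcorpus = select_units(texts, hits)
print(subcorpus)  # [(0, 99), (250, 399)]
```

The second text contains no match and is therefore left out, while two matches in the first text still select it only once.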

cwb-make-subcorpus automatically copies all positional and structural attributes, adjusting s-attribute regions as needed. In particular, any regions outside the subcorpus are dropped, while regions that span one or more text units from the subcorpus as well as other material are narrowed down to the subcorpus. The script can also convert the physical copy to a different character encoding, but it is better to use cwb-convert-to-utf8 for upgrading corpora to UTF-8 format.
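The following sketch illustrates what dropping and narrowing regions means in terms of corpus positions. It is a conceptual model under stated assumptions, not the script's implementation: `remap_regions`, the inclusive `(start, end)` ranges and the sample "chapter" regions are hypothetical, and the sketch assumes that kept units are renumbered consecutively from position 0 in the physical copy.

```python
def remap_regions(kept_units, regions):
    """Map s-attribute regions from original to subcorpus positions.

    kept_units -- sorted, non-overlapping (start, end) ranges that are kept
    regions    -- (start, end) regions of some other s-attribute
    All ranges are inclusive corpus positions.
    """
    # offsets[i] = new position of the first token of kept unit i
    offsets, next_pos = [], 0
    for start, end in kept_units:
        offsets.append(next_pos)
        next_pos += end - start + 1

    result = []
    for r_start, r_end in regions:
        new_start = new_end = None
        for (u_start, u_end), off in zip(kept_units, offsets):
            lo, hi = max(r_start, u_start), min(r_end, u_end)
            if lo > hi:
                continue  # region does not overlap this unit
            if new_start is None:
                new_start = off + (lo - u_start)
            new_end = off + (hi - u_start)
        if new_start is not None:  # regions without any overlap are dropped
            result.append((new_start, new_end))
    return result

kept = [(0, 99), (250, 399)]
chapters = [(0, 149), (150, 249), (250, 399)]
remapped = remap_regions(kept, chapters)
print(remapped)  # [(0, 99), (100, 249)]
```

Here the first region is narrowed to the kept text it overlaps, the second lies entirely outside the subcorpus and is dropped, and the third is kept whole but shifted to its new position.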

There are some important limitations:

  • The script does not copy alignment attributes (because it relies on cwb-decode, which cannot handle a-attributes). Any alignments will be absent from the subcorpus.

  • Re-encoding to a different character set silently deletes invalid characters, so the content of the physical copy may no longer be identical to the virtual subcorpus.
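The effect of the second limitation can be demonstrated by analogy. The sketch below uses Python's `errors="ignore"` codec handler, which is assumed here to behave similarly to dropping unconvertible characters during re-encoding; it does not call the script or its conversion tool, and the function name and sample string are hypothetical.

```python
def reencode_dropping_invalid(text, target):
    """Encode text into target, silently dropping unencodable characters."""
    return text.encode(target, errors="ignore").decode(target)

original = "price: 5 € for a naïve café"
converted = reencode_dropping_invalid(original, "latin-1")

print(converted)
print(len(original) - len(converted), "character(s) lost")
```

The euro sign has no Latin-1 code point and disappears without warning, whereas the accented letters survive. Token counts stay the same, but string content can differ from the original corpus.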

ARGUMENTS

CORPUS

CWB ID of the original corpus

SUBCORPUS

New CWB ID for the physical copy to be created

datadir

New directory for the CWB index files of the physical copy. This directory must not exist yet, unless --force is specified, in which case an existing directory is overwritten.

query

A CQP query that identifies text units to be included in the virtual subcorpus. Usually enclosed in single quotation marks on the command line.
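As a rough illustration of how the four arguments fit together and why the query needs quoting, the sketch below assembles a command line and prints it in shell-safe form. The corpus IDs, the data directory and the `lemma` attribute in the query are hypothetical placeholders, and `build_command` is an illustrative helper rather than part of CWB. Only the -S option from the description above is used.

```python
import shlex

def build_command(corpus, subcorpus, datadir, query, by="text"):
    """Assemble an argument list for cwb-make-subcorpus."""
    return ["cwb-make-subcorpus", "-S", by,
            corpus, subcorpus, datadir, query]

# Hypothetical corpus IDs, path and query.
argv = build_command("EXAMPLE", "EXAMPLE_SUB", "/tmp/example_sub",
                     '[lemma = "corpus"]')

# shlex.join quotes each argument as a POSIX shell would need it.
command_line = shlex.join(argv)
print(command_line)
```

Passing the arguments as a list, for instance to `subprocess.run(argv)`, avoids shell quoting altogether. The printed form shows the single quotation marks that keep the shell from interpreting the brackets and double quotes in the query.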

OPTIONS

--registry=dir, -r dir

Look for the original corpus in registry directory dir rather than in the default registry path.

--output-registry=dir, -or dir

Create registry entry for the new corpus in dir. [default: same registry directory as the original corpus]

--by=att, -S att

S-attribute defining basic textual units for the virtual subcorpus, which consists of all such units that contain at least one match of the CQP query. [default: text]

--charset=enc, -C enc

Character encoding of the physical copy. Any of the character encodings supported by CWB 3.5 may be specified. If different from the character encoding of the original corpus, all attributes will automatically be converted, silently deleting invalid characters. [default: same as original corpus]

--memory=n, -M n

Allow cwb-make to use approx. n MBytes of RAM for indexing.

--force, -f

Silently overwrite an existing registry entry and/or data directory. Use with caution, as this will remove all files from an existing data directory.

--verbose, -v

Show progress message for each individual attribute (recommended for large corpora).

--help, -h

Display short help page.

PREREQUISITES

cwb-make-subcorpus requires a recent version of CWB with special support in the cwb-encode utility, viz. CWB v3.4.15 or newer. If you have installed multiple CWB releases on your computer, make sure that the CWB/Perl modules are configured to use an appropriate CWB version.

For efficiency reasons, character encodings are converted with the external iconv utility, which must be installed somewhere in the system path. Your version of iconv must support command line options -f (source encoding), -t (target encoding) and -c (ignore conversion errors); it also needs to understand CWB style encoding names such as utf8 and latin1. Suitable versions of iconv are provided e.g. by Linux and Mac OS X.
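Whether a local iconv meets these requirements can be checked before running the script. The sketch below is one possible probe, not part of CWB: the helper names are hypothetical, and it merely tests that an iconv on the system path exits successfully when given the -f, -t and -c options with CWB-style encoding names.

```python
import shutil
import subprocess

def iconv_command(source_enc, target_enc):
    """Argument list using the three options named above."""
    return ["iconv", "-f", source_enc, "-t", target_enc, "-c"]

def iconv_accepts(source_enc, target_enc):
    """True if an iconv on the PATH runs with these options and names."""
    if shutil.which("iconv") is None:
        return False  # iconv is not installed or not on the PATH
    proc = subprocess.run(iconv_command(source_enc, target_enc),
                          input=b"test\n", capture_output=True)
    return proc.returncode == 0

supported = iconv_accepts("latin1", "utf8")
print("iconv usable with CWB-style names:", supported)
```

A result of False suggests that either iconv is missing or it rejects the options or encoding names, in which case conversion with --charset is unlikely to work.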

COPYRIGHT

Copyright (C) 2018-2022 Stephanie Evert [https://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.