NAME
cwb-split-vrt - Split CWB input data (.vrt) into multiple parts
SYNOPSIS
cwb-split-vrt [options] <basename> [file1.vrt.gz file2.vrt.gz ...]
-n <size> maximum size (# tokens) of each part [default: -n CL_MAX_CORPUS_SIZE]
-by <tag> XML tag delimiting independent units for split [default: -by text]
-v show progress information
-h display this help page
DESCRIPTION
More and more corpora are becoming available that exceed the maximum CWB corpus size of 2.1 billion tokens. In order to index them with CWB, they have to be split into smaller parts. This script helps to automate the splitting procedure. It reads an arbitrary number of CWB input files in .vrt format and divides the complete data into blocks of less than 2.1 billion tokens each (or a lower limit specified by the user). The script also ensures that individual texts in the corpus (indicated by the XML tag <text> or another tag specified by the user) are not broken across multiple parts.
Input files with extension .gz, .bz2 or .xz are automatically decompressed. Output files are always GZip-compressed and are named basename-1.vrt.gz, basename-2.vrt.gz, etc.
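The input and output conventions above can be illustrated with a short Python sketch. It is not taken from the Perl script; the helper names open_vrt and part_name are hypothetical, and UTF-8 encoding is an assumption.

```python
import bz2
import gzip
import lzma

# Map recognised extensions to the matching standard-library opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_vrt(path):
    """Open a .vrt input file as text, decompressing based on its extension.

    Assumption: files are UTF-8 encoded.
    """
    for ext, opener in OPENERS.items():
        if path.endswith(ext):
            return opener(path, "rt", encoding="utf-8")
    # No known compression extension: treat as an uncompressed file.
    return open(path, "r", encoding="utf-8")


def part_name(basename, index):
    """Return the name of the index-th output part (counting from 1)."""
    return f"{basename}-{index}.vrt.gz"
```

For instance, part_name("web", 2) yields web-2.vrt.gz, matching the naming scheme described above.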
OPTIONS
- --size=limit, -n limit
-
Split corpus into parts of up to limit tokens each. The default CL_MAX_CORPUS_SIZE = 2^31-1 is guaranteed to work for 64-bit CWB 3.5, but older CWB releases may have a slightly lower limit. It is recommended to set this option to -n 2147000000 or lower for best compatibility.
- --by=tag, -S tag
-
cwb-split-vrt takes care not to break textual units in the corpus - indicated by XML elements named tag - across multiple parts. Following CQPweb conventions, the default setting is -by text, i.e. individual corpus texts are delimited by the XML tags <text> and </text>. See "DETAILS" below.
- --verbose, -v
-
Show progress information during splitting procedure (recommended since this will typically take a very long time).
- --help, -h
-
Display short help page.
DETAILS
cwb-split-vrt assumes that a corpus is a collection of individual texts (or other units) delimited by the XML tags specified with the -by option. It reads each text unit into memory, starts a new corpus part if the unit does not fit into the current one, and then writes the unit to the output file. Any extraneous material before the start tag (e.g. <text>) as well as trailing end tags (after e.g. </text>) are included in the text unit.
This implementation strategy has two important consequences:
Text units must be sufficiently small so that the Perl script can fit them comfortably into RAM. It is probably not a good idea to split e.g. a newspaper collection on yearly volumes.
There must not be any XML regions spanning multiple text units. cwb-split-vrt is not aware of such regions and thus cannot replicate the corresponding start and end tags if they are broken across multiple parts. In other words, the XML elements specified with -by must delimit completely independent chunks of the corpus.
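The strategy described in this section can be sketched in Python as follows. This is an illustration, not the actual Perl implementation. The function names are hypothetical, and two simplifications are assumed: every non-empty line that does not start with "<" counts as one token, and a unit larger than the limit is placed in a part of its own.

```python
import re


def read_units(lines, tag="text"):
    """Group VRT lines into units delimited by <tag> ... </tag>.

    Material before the start tag and end tags directly following
    the unit's end tag are kept with the unit.
    """
    end_tag = re.compile(r"</%s\s*>" % re.escape(tag))
    unit, closed = [], False
    for line in lines:
        # After </tag>, only further end tags still belong to this unit.
        if closed and not line.startswith("</"):
            yield unit
            unit, closed = [], False
        unit.append(line)
        if end_tag.match(line):
            closed = True
    if unit:
        yield unit


def count_tokens(unit):
    """Count token lines (assumption: non-empty and not starting with '<')."""
    return sum(1 for line in unit if line.strip() and not line.startswith("<"))


def split_parts(lines, limit, tag="text"):
    """Greedily pack whole units into parts of at most `limit` tokens."""
    parts, current, size = [], [], 0
    for unit in read_units(lines, tag):
        n = count_tokens(unit)
        # Start a new part if this unit does not fit into the current one.
        # Assumption: an oversized unit still gets a part of its own.
        if current and size + n > limit:
            parts.append(current)
            current, size = [], 0
        current.extend(unit)
        size += n
    if current:
        parts.append(current)
    return parts


lines = [
    "<corpus>",
    '<text id="a">', "one", "two", "</text>",
    '<text id="b">', "three", "</text>",
    '<text id="c">', "four", "five", "</text>",
    "</corpus>",
]
parts = split_parts(lines, limit=3)
```

With a limit of 3 tokens, the first two texts share one part and the third text starts a second part. The wrapping <corpus> element in this example ends up with its start tag in the first part and its end tag in the last, which shows why regions spanning multiple text units are a problem.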
COPYRIGHT
Copyright (C) 2007-2022 Stephanie Evert [https://purl.org/stephanie.evert]
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.