NAME
CWB::Encoder - Perl tools for encoding and indexing CWB corpora
SYNOPSIS
use CWB::Encoder;
$bnc = new CWB::Indexer "BNC";
$bnc = new CWB::Indexer "/path/to/registry:BNC";
$bnc->group("corpora"); # optional: group and access
$bnc->perm("640"); # permissions for newly created files
$bnc->memory(400); # use up to 400 MB of RAM (default: 75)
$bnc->validate(0); # disable validation for faster indexing
$bnc->verbose(1); # print some progress information
$bnc->debug(1); # enable debugging output
$bnc->make("word", "pos"); # build index & compress
$bnc->makeall; # process all p-attributes
$bnc = new CWB::Encoder "BNC";
$bnc->registry("/path/to/registry"); # will try to guess otherwise
$bnc->dir("/path/to/data/directory"); # directory for corpus data files
$bnc->overwrite(1); # may overwrite existing files / directories
$bnc->longname("British National Corpus"); # optional
$bnc->info("Line1.\nLine2.\n..."); # optional multi-line info text
$bnc->charset("latin1"); # defaults to latin1
$bnc->language("en"); # defaults to ??
$bnc->group("corpora"); # optional: group and access permissions
$bnc->perm("640"); # for newly created files & directories
$bnc->p_attributes("word"); # declare positional atts (no default!)
$bnc->p_attributes(qw<pos lemma>); # may be called repeatedly
$bnc->null_attributes("teiHeader"); # declare null atts (ignores XML tags)
$bnc->auto_null(1); # ignore all undeclared XML tags
$bnc->s_attributes("s"); # s-attributes in cwb-encode syntax
$bnc->s_attributes(qw<div0* div1*>); # * = store annotations (-V)
$bnc->s_attributes("bncDoc:0+id"); # recursion & XML attributes
$bnc->decode_entities(0); # don't decode XML entities (with -x flag)
$bnc->undef_symbol("__UNDEF__"); # mark missing values like cwb-encode
$bnc->memory(400); # use up to 400 MB of RAM (default: 75)
$bnc->validate(0); # disable validation for faster indexing
$bnc->encode_options("-C"); # pass arbitrary options to cwb-encode
$bnc->verbose(1); # print some progress information
$bnc->debug(1); # enable debugging output
$bnc->encode(@files); # encoding, indexing, and compression
$pipe = $bnc->encode_pipe; # can also feed input text from Perl script
while (...) {
print $pipe "$line\n";
}
$bnc->close_pipe;
DESCRIPTION
This package contains modules for the automatic encoding and indexing of CWB corpora.
CWB::Indexer builds indices for some or all positional attributes of an existing corpus (using the cwb-makeall tool). In addition, these attributes are automatically compressed (using the cwb-huffcode and cwb-compress-rdx tools). Compression and indexing are interleaved to minimise the required amount of temporary disk space, and a make-like system ensures that old index files are automatically updated.
CWB::Encoder automates all steps necessary to encode a CWB corpus (which includes cleaning up old files, running cwb-encode, editing the registry entry, indexing & compressing positional attributes, and setting access permissions). Both modules can be set up with a few simple method calls. Full descriptions are given separately in the following sections.
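The make-like updating mentioned above for CWB::Indexer can be pictured with a generic freshness check: a derived file needs rebuilding if it is missing or older than a file it depends on. The following is only a minimal sketch of that general idea, not the module's actual implementation; the helper name needs_rebuild and the file names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: true if $target is missing or older than
# any existing file in @sources (missing sources are skipped).
sub needs_rebuild {
    my ($target, @sources) = @_;
    return 1 unless -e $target;
    my $target_mtime = (stat $target)[9];
    for my $src (@sources) {
        return 1 if -e $src && (stat $src)[9] > $target_mtime;
    }
    return 0;
}

# Demonstration with throw-away files (names are made up for illustration).
my $dir    = tempdir(CLEANUP => 1);
my $source = File::Spec->catfile($dir, "data.src");
my $index  = File::Spec->catfile($dir, "data.idx");

for my $file ($source, $index) {
    open my $fh, '>', $file or die "Cannot create $file: $!";
    close $fh;
}

my $now = time;
utime $now - 100, $now - 100, $source;   # source is older ...
utime $now,       $now,       $index;    # ... than the index
print "rebuild needed: ", (needs_rebuild($index, $source) ? "yes" : "no"), "\n";

utime $now + 100, $now + 100, $source;   # source changed after indexing
print "rebuild needed: ", (needs_rebuild($index, $source) ? "yes" : "no"), "\n";
```

The first check reports that no rebuild is needed; after the source file's timestamp is moved past that of the index, the second check reports that one is.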
CWB::Indexer METHODS
- $idx = new CWB::Indexer $corpus;
- $idx = new CWB::Indexer "$registry_path:$corpus";
-
Create a new CWB::Indexer object for the specified corpus. If $corpus is not registered in the default registry path (the built-in default or the CORPUS_REGISTRY environment variable), the registry directory has to be specified explicitly, separated from the corpus name by a : character. $registry_path may contain multiple directories separated by : characters.
- $idx->group($group);
- $idx->perm($permission);
-
Optional group membership and access permissions for newly created files (otherwise, neither chgrp nor chmod will be called). Note that $permission must be a string such as "640" rather than an octal number (as required by Perl's built-in chmod function). Indexing will fail if the specified group and/or permissions cannot be set.
- $idx->memory($mbytes);
-
Set approximate memory limit for cwb-makeall command, in MBytes. The memory limit defaults to 75 MB, which is a reasonable value for systems with at least 128 MB of RAM.
- $idx->validate(0);
-
Turn off validation of index and compressed files, which may give substantial speed improvements for larger corpora.
- $idx->verbose(1);
-
Display some progress messages (on STDOUT).
- $idx->debug(1);
-
Activate debugging output (on STDERR).
- $idx->make($att1, $att2, ...);
-
Process one or more positional attributes. An index is built for each attribute and the data files are compressed. Missing files are re-created (if possible) and old files are updated automatically.
- $idx->makeall;
-
Process all positional attributes of the corpus.
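Because the perm method takes the permission as a string, it may help to see how such a string relates to the numeric mode that Perl's built-in chmod expects. The sketch below is plain Perl and does not call CWB::Indexer; the helper name perm_to_mode is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a permission string such as "640"
# into the numeric mode expected by Perl's built-in chmod.
sub perm_to_mode {
    my ($perm) = @_;
    die "Invalid permission string '$perm'\n" unless $perm =~ /^[0-7]{3,4}$/;
    return oct($perm);
}

my $mode = perm_to_mode("640");
printf "string \"640\" -> mode %04o (decimal %d)\n", $mode, $mode;

# Pitfall: the bare number 640 is decimal and denotes a different mode.
printf "number 640   -> mode %04o\n", 640;
```

The string "640" corresponds to mode 0640 (decimal 416), whereas the decimal number 640 would be mode 1200, which is why the string form is the safer way to write permissions.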
CWB::Encoder METHODS
- $enc = new CWB::Encoder $corpus;
-
Create a new CWB::Encoder object for the specified corpus. Note that the registry directory cannot be passed directly to the constructor (use the registry method instead).
- $enc->name($corpus);
-
Change the CWB name of a corpus after the encoder object $enc has been created. Has to be used if the constructor was called without arguments.
- $enc->longname($descriptive_name);
-
Optional long, descriptive name for a corpus (single line).
- $enc->info($multiline_text);
-
Multi-line text that will be written to the .info file of the corpus.
- $enc->charset($code);
-
Set corpus character set (as a corpus property in the registry entry). In CWB release 3.0, only latin1 is fully supported, but character sets latin2, ..., latin9 and utf8 can also be declared. In CWB release 3.5, the following character sets are supported: ascii, latin1, ..., latin9, arabic, greek, hebrew and utf8. Any other $code will raise a warning.
- $enc->language($code);
-
Set corpus language (as an informational corpus property in the registry entry). Use of a two-letter ISO code (de, en, fr, ...) is recommended, and any other format will raise a warning.
- $enc->registry($registry_dir);
-
Specify registry directory $registry_dir, which must be a single directory rather than a path. If the registry directory is not set explicitly, CWB::Encoder attempts to determine the standard registry directory, and will fail if there is no unique match (e.g. when the CORPUS_REGISTRY environment variable specifies multiple directories).
- $enc->dir($data_dir);
-
Specify directory $data_dir for corpus data files. The directory is automatically created if it does not exist.
- $enc->p_attributes($att1, $att2, ...);
-
Declare one or more positional attributes. This method can be called repeatedly with additional attributes. Note that all positional attributes, including word, have to be declared explicitly.
- $enc->s_attributes($att1, $att2, ...);
-
Declare one or more structural attributes. $att1 etc. are either simple attribute names or complex declarations using the syntax of the -S and -V flags in cwb-encode. See the CWB Corpus Encoding Tutorial for details on the attribute declaration syntax for nesting depth and XML tag attributes. By default, structural attributes are encoded without annotation strings (-S flag). In order to store annotations (-V flag), append an asterisk (*) to the attribute name or declaration. The CWB Corpus Encoding Tutorial explains when to use -S and when to use -V. The s_attributes method can be called repeatedly to add further attributes.
- $enc->null_attributes($att1, $att2, ...);
-
Declare one or more null attributes. XML start and end tags with these names will be ignored (and not inserted as word tokens). This method can be called repeatedly.
- $enc->auto_null(1);
-
Ignore XML tags that haven't been declared as s-attributes rather than inserting them as ordinary tokens. Such XML tags are automatically declared as null attributes. Requires CWB v3.4.21 or newer.
- $enc->group($group);
- $enc->perm($permission);
-
Optional group membership and access permissions for newly created files (otherwise, neither chgrp nor chmod will be called). Note that $permission must be a string such as "640" rather than an octal number (as required by Perl's built-in chmod function). Encoding will fail if the specified group and/or permissions cannot be set. If the data directory has to be created, its access permissions and group membership are set accordingly.
- $enc->overwrite(1);
-
Allow CWB::Encoder to overwrite existing files. This is required when either the registry entry or the data directory exists already. When overwriting is enabled, the registry entry and all files in the data directory are deleted before encoding starts.
- $enc->memory($mbytes);
-
Set approximate memory limit for cwb-makeall command, in MBytes. The memory limit defaults to 75 MB, which is a reasonable value for systems with at least 128 MB of RAM. The memory setting is only used when building indices for positional attributes, not during the initial encoding process.
- $enc->validate(0);
-
Turn off validation of index and compressed files, which may give substantial speed improvements for larger corpora.
- $enc->decode_entities(0);
-
Whether cwb-encode is allowed to decode XML entities and skip XML comments (with the -x option). Set this option to false if you want an HTML-compatible encoding of the CWB corpus that does not need to be converted before display in a Web browser.
- $enc->undef_symbol("__UNDEF__");
-
Symbol inserted for missing values of positional attributes (either because there are too few columns in the input or because attribute values are explicit empty strings). By default, no special symbol is inserted (i.e. missing values are encoded as empty strings ""). Use the command shown above to mimic the standard behaviour of cwb-encode.
- $enc->encode_options($string, ...);
-
This option allows users to pass arbitrary further command-line options to the cwb-encode program. Use with caution!
Note that each option (and option argument) must be passed as a separate argument to encode_options because they will not be parsed by the shell (and so additional quotes are not needed).
- $enc->verbose(1);
-
Print some progress information (on STDOUT).
- $enc->debug(1);
-
Activate debugging output (on STDERR).
- $enc->encode(@files_or_directories);
-
Encode one or more input files as a CWB corpus, using the parameter settings of the $enc object. The encode method performs the full encoding cycle, including indexing, compression, and setting access permissions. All input files must be specified at once as subsequent encode calls would overwrite the new corpus. Input files may be compressed with GZip (.gz), as supported by cwb-encode.
The argument list may also contain directories. In this case, all files with extensions .vrt or .vrt.gz in those directories will automatically be added to the corpus. Note that no recursive search of subdirectories is performed: only files located in the specified directories will be included.
- $pipe = $enc->encode_pipe;
-
Open a pipe to the cwb-encode command and return its file handle. This allows some pre-processing of the input by the Perl script (perhaps reading from another pipe), which should print to $pipe in one-word-per-line format. Note that the file handle $pipe must not be closed by the Perl script (see the close_pipe method below).
- $enc->close_pipe;
-
After opening an encode pipe with the encode_pipe method and printing the input text to this pipe, the close_pipe method has to be called to close the pipe and trigger the post-encoding steps (indexing, compression, and access permissions). When the close_pipe method returns, the corpus has been encoded successfully.
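As a rough illustration of what a script might print to the handle returned by encode_pipe, the sketch below writes tokens in one-word-per-line format, with positional attributes in TAB-separated columns and XML tags on separate lines (the usual cwb-encode input layout). It writes to an in-memory file handle so that it runs without a CWB installation. The helper name write_sentence, the <s> tag and the sample tokens are assumptions made for illustration; in a real script the handle would come from $enc->encode_pipe and be finished with $enc->close_pipe.

```perl
use strict;
use warnings;

# Hypothetical helper: print one sentence in one-word-per-line format.
# Each token is an array ref of attribute values (e.g. word, pos, lemma).
sub write_sentence {
    my ($fh, @tokens) = @_;
    print $fh "<s>\n";
    print $fh join("\t", @$_), "\n" for @tokens;
    print $fh "</s>\n";
}

# Stand-in for the encode pipe: an in-memory file handle.
my $buffer = "";
open my $pipe, '>', \$buffer or die "Cannot open in-memory handle: $!";

write_sentence($pipe,
    [ "The",    "AT0", "the"   ],
    [ "cat",    "NN1", "cat"   ],
    [ "sleeps", "VVZ", "sleep" ],
);
close $pipe;   # with a real encode pipe, call $enc->close_pipe instead

print $buffer;
```

The number of columns per token should match the positional attributes declared with p_attributes, and the sentence tag has to be declared with s_attributes (or as a null attribute) to be treated as markup.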
COPYRIGHT
Copyright (C) 2002-2022 Stephanie Evert [https://purl.org/stephanie.evert]
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.