NAME
DiaColloDB::Corpus::Compiled - collocation db, source corpus (pre-compiled)
SYNOPSIS
##========================================================================
## PRELIMINARIES
use DiaColloDB::Corpus::Compiled;
##========================================================================
## Constructors etc.
$corpus = $CLASS_OR_OBJECT->new(%args);
##========================================================================
## Persistent API
@keys = $obj->headerKeys();
@files = $obj->diskFiles();
$bool = $obj->unlink(%opts);
##========================================================================
## Corpus API
##-- Corpus API: open/close
$bool = $corpus->open([$dbdir], %opts); ##-- compat;
$bool = $corpus->close();
##-- Corpus API: iteration
$nfiles = $corpus->size();
$bool = $corpus->iok();
$label = $corpus->ifile();
$doc_or_undef = $corpus->idocument();
##========================================================================
## Compiled API
$ccorpus = $CLASS_OR_OBJECT->create($src_corpus, %opts);
$ccorpus = $CLASS_OR_OBJECT->union(\@sources, %opts);
##========================================================================
## Convenience Methods
$bool = $corpus->opened();
$bool = $corpus->flush();
$corpus = $corpus->reopen(%opts);
$dirname = $corpus->datadir();
$bool = $corpus->truncate();
$filters = $ccorpus->filters();
DESCRIPTION
DiaColloDB::Corpus::Compiled is an intermediate abstraction layer for storing pre-filtered corpus data in a format suitable for fast I/O. It should not be necessary for end users to use this class directly, since the DiaColloDB::create() method should implicitly create a (temporary) DiaColloDB::Corpus::Compiled object whenever required.
Globals & Constants
- Variable: @ISA
-
DiaColloDB::Corpus::Compiled inherits from DiaColloDB::Corpus and supports all DiaColloDB::Corpus methods.
Constructors etc.
- new
-
$corpus = $CLASS_OR_OBJECT->new(%args);
%args, object structure:
(
  ##-- NEW in DiaColloDB::Corpus::Compiled
  dbdir      => $dbdir,     ##-- data directory for compiled corpus
  flags      => $flags,     ##-- open mode flags (fcntl flags or perl-style; default='r')
  filters    => \%filters,  ##-- corpus filters ( DiaColloDB::Corpus::Filters object or HASH-ref )
  njobs      => $njobs,     ##-- number of parallel worker jobs for create(); default=-1 (= nCores)
  temp       => $bool,      ##-- implicitly unlink() on exit?
  logThreads => $level,     ##-- log-level for thread stuff (default='off')
  ##
  ##-- INHERITED from DiaColloDB::Corpus
  #files     => \@files,    ##-- source files (OVERRIDE: unused)
  #dclass    => $dclass,    ##-- DiaColloDB::Document subclass for loading (OVERRIDE forces 'DiaColloDB::Document::JSON')
  dopts      => \%opts,     ##-- options for $dclass->fromFile() (override default={})
  cur        => $i,         ##-- index of current file
  logOpen    => $level,     ##-- log-level for open(); default='info'
)
Implicitly calls the open() method if the dbdir property is defined.
- DESTROY
-
Destructor implicitly calls the close() method, and may also implicitly call unlink() if the temp property is true.
Persistent API
- headerKeys
-
@keys = $obj->headerKeys();
Override filters out more object-specific keys.
- diskFiles
-
@files = $obj->diskFiles();
Returns disk storage files; override returns the singleton list ($obj->{dbdir}).
- unlink
-
$bool = $obj->unlink(%opts);
Removes all disk file(s) associated with the object. Override accepts additional %opts:
close => $bool, ##-- call $obj->close() before unlinking? (default=1)
Corpus API: open/close
- open
-
$bool = $corpus->open([$dbdir], %opts); ##-- compat
$bool = $corpus->open($dbdir, %opts); ##-- new
Opens compiled corpus directory $dbdir, which must be specified as either a simple scalar or a singleton ARRAY-ref, or must already be defined as $corpus->{dbdir} or $opts{dbdir}.
Superclass %opts accepted by DiaColloDB::Corpus:
compiled => $bool, ##-- implicitly true here
glob => $bool, ##-- (ignored here) whether to glob arguments
list => $bool, ##-- (ignored here) whether arguments are file-lists
- close
-
$bool = $corpus->close();
Closes the currently opened corpus, if any. Override implicitly calls $corpus->flush() if $corpus is opened in write-mode.
Corpus API: iteration
- size
-
$nfiles = $corpus->size();
Returns total number of file(s) in the corpus (constant time).
- iok
-
$bool = $corpus->iok();
True if corpus file-iterator is valid.
- ifile
-
$label = $corpus->ifile();
$label = $corpus->ifile($pos);
Get current iterator filename (first form), or filename at index $pos (second form). Override always returns filenames of the form "$corpus->{dbdir}/$pos.json".
- idocument
-
$doc_or_undef = $corpus->idocument();
$doc_or_undef = $corpus->idocument($pos);
Gets current document (first form) or document at index $pos (second form).
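The positional forms of ifile() and idocument() can be combined with size() to walk a compiled corpus. The following is a minimal sketch, not taken from the module itself: the helper name corpus_labels() is hypothetical, and it assumes that positions are 0-based and run up to size()-1, which this page does not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the labels of all loadable documents,
# using only size(), ifile($pos) and idocument($pos).
# ASSUMPTION: positions are 0-based and run from 0 to size()-1.
sub corpus_labels {
  my ($corpus) = @_;
  my @labels;
  for my $pos (0 .. $corpus->size() - 1) {
    my $doc = $corpus->idocument($pos);
    next unless defined $doc;            # idocument() may return undef
    push @labels, $corpus->ifile($pos);
  }
  return @labels;
}

# Usage: pass the path of an existing compiled corpus directory.
# Requires DiaColloDB to be installed; does nothing without an argument.
if (@ARGV) {
  require DiaColloDB::Corpus::Compiled;
  my $corpus = DiaColloDB::Corpus::Compiled->new(dbdir => $ARGV[0])
    or die "could not open compiled corpus '$ARGV[0]'";
  print "$_\n" for corpus_labels($corpus);
  $corpus->close();
}
```

Because the constructor implicitly calls open() when dbdir is defined, no separate open() call is made in the usage part.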
Corpus::Compiled API
- create
-
$ccorpus = $CLASS->create($src_corpus, %opts);
$ccorpus = $ccorpus->create($src_corpus, %opts);
Compile or append a single $src_corpus to the compiled corpus directory $opts{dbdir}. If specified, %opts overrides %$ccorpus properties. Returns a (possibly new) DiaColloDB::Corpus::Compiled object $ccorpus. Honors perl- or fcntl-style $opts{flags} for append and truncate.
Parses all document file(s) from $src_corpus, applies the corpus content filters given by $ccorpus->{filters} (a HASH-ref or DiaColloDB::Corpus::Filters object), and saves the compiled data to the compiled corpus directory $ccorpus->{dbdir}. If the threads module is available, compilation may use multiple parallel threads as specified by the $DiaColloDB::NJOBS variable; see DiaColloDB::Utils::nJobs() for details.
- union
-
$ccorpus = $CLASS->union(\@sources, %opts);
$ccorpus = $ccorpus->union(\@sources, %opts);
Merges pre-compiled corpora \@sources to the output directory $opts{dbdir}. If specified, %opts overrides %$ccorpus properties. Returns a (possibly new) DiaColloDB::Corpus::Compiled object $ccorpus representing the union over @sources. Honors $ccorpus->{flags} for append and truncate.
Each $src in \@sources is either a DiaColloDB::Corpus::Compiled object or a simple scalar (which is interpreted as the dbdir of a DiaColloDB::Corpus::Compiled object). No content filters are applied, and output data files are created as links to the input data files from @sources (hard-links if possible, otherwise symbolic links).
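The "hard-links if possible, otherwise symbolic links" strategy described for union() can be illustrated with Perl's built-in link() and symlink() functions. This is only a sketch of the general technique, not the module's actual code: the helper link_or_symlink() is hypothetical, and the use of an absolute symlink target is an assumption made here so that the link does not dangle.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of DiaColloDB): make $dst refer to the same
# data as $src, preferring a hard link and falling back to a symbolic link.
# Returns 'hard' or 'symbolic' to report which kind of link was created.
sub link_or_symlink {
  my ($src, $dst) = @_;
  return 'hard' if link($src, $dst);
  # Hard links can fail, e.g. across filesystems. A symlink target is
  # resolved relative to the link's own directory, so store an absolute path.
  return 'symbolic' if symlink(File::Spec->rel2abs($src), $dst);
  die "cannot link '$src' to '$dst': $!";
}
```

A hard link keeps the data reachable even if the source directory is later removed, whereas a symbolic link depends on the source file staying in place, which is one reason to try the hard link first.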
Convenience Methods: disk files etc.
- datadir
-
$dirname = $corpus->datadir();
$dirname = $corpus->datadir($dir);
Wrapper for $corpus->{dbdir}.
- truncate
-
$bool = $corpus->truncate();
Removes all disk data (including header) and resets $corpus->{size} to 0 (zero).
- filters
-
$filters = $ccorpus->filters();
Return corpus content filters as a DiaColloDB::Corpus::Filters object.
Convenience Methods: open/close
- opened
-
$bool = $corpus->opened();
Returns true iff $corpus is currently opened.
- flush
-
$bool = $corpus->flush();
Writes any pending corpus data (e.g. header) to disk.
- reopen
-
$corpus = $corpus->reopen(%opts);
Closes and re-opens the corpus, e.g. with different flags.
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
SEE ALSO
dcdb-corpus-compile.perl(1), dcdb-create.perl(1), DiaColloDB::Corpus::Filters(3pm), DiaColloDB::Corpus(3pm), DiaColloDB(3pm), perl(1), ...