NAME
DiaColloDB::Document - diachronic collocation db, source document (base class)
SYNOPSIS
##========================================================================
## PRELIMINARIES
use DiaColloDB::Document;
##========================================================================
## Constructors etc.
$doc = CLASS_OR_OBJECT->new(%args);
##========================================================================
## API: I/O
$bool = $doc->fromFile($filename_or_fh);
$label = $doc->label();
DESCRIPTION
DiaColloDB::Document provides an abstract base class for corpus documents from which a DiaColloDB database can be created. Support for alternative corpus formats can be added by implementing a DiaColloDB::Document subclass for each required format.
Globals & Constants
- Variable: @ISA
-
DiaColloDB::Document inherits from DiaColloDB::Logger.
Constructors etc.
- new
-
$doc = CLASS_OR_OBJECT->new(%args);
%args, object structure:
label  => $label,    ##-- document label (e.g. filename; optional)
date   => $date,     ##-- year
tokens => \@tokens,  ##-- tokens, including undef for eos
meta   => \%meta,    ##-- document metadata (e.g. author, title, collection, ...)
Each token in @tokens is one of the following:
undef             : EOS (default, for collocation profiling)
a HASH-ref        : normal token: {w=>$word,p=>$pos,l=>$lemma,...}
a string "#BREAK" : block boundary / "break" of type BREAK, e.g. "#s": sentence-break, "#p": paragraph-break, ...
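The three token kinds listed above can be told apart with a few simple checks. The following is a minimal sketch on plain Perl data only; it does not call DiaColloDB, and the helper name classify_token() as well as the sample values are assumptions made for illustration.

```perl
use strict;
use warnings;

## Plain-data sketch of the structure described under new(); sample values are invented.
my %doc = (
  label  => 'example.txt',
  date   => 1900,
  meta   => { author => 'Anonymous', title => 'Example' },
  tokens => [
    '#p',                                     ## paragraph break
    { w => 'Dogs', p => 'NN', l => 'dog'  },  ## normal token
    { w => 'bark', p => 'VV', l => 'bark' },
    undef,                                    ## EOS
    { w => 'Cats', p => 'NN', l => 'cat'  },
    { w => 'purr', p => 'VV', l => 'purr' },
    undef,                                    ## EOS
  ],
);

## classify_token($tok): hypothetical helper, not part of DiaColloDB.
## Returns 'eos', 'token', 'break', or 'unknown'.
sub classify_token {
  my $tok = shift;
  return 'eos'   if !defined $tok;
  return 'token' if ref($tok) eq 'HASH';
  return 'break' if !ref($tok) && $tok =~ /^#/;
  return 'unknown';
}

## Count how many entries of each kind the document contains.
my %count;
$count{ classify_token($_) }++ for @{ $doc{tokens} };
printf "%s: %d\n", $_, $count{$_} for sort keys %count;
```

Checking for undef first matters here, since ref() and pattern matches on an undefined value would otherwise trigger warnings under "use warnings".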
API: I/O
- fromFile
-
$bool = $doc->fromFile($filename_or_fh);
parse tokens from $filename_or_fh (a filename or an open filehandle)
- label
-
$label = $doc->label();
return a string label for $doc; default just returns "$doc".
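A format-specific subclass would presumably do its parsing work in fromFile(), producing a token list of the shape described under new(). The sketch below shows only that parsing step as a standalone function, so it runs without DiaColloDB installed. The function name parse_simple_tokens() and the line-based input format (word, part of speech, and lemma separated by tabs, with an empty line marking EOS) are assumptions for illustration; this is not the DDC tab-dump format.

```perl
use strict;
use warnings;

## parse_simple_tokens($fh): hypothetical helper, not part of DiaColloDB.
## Assumed input: one token per line as "word<TAB>pos<TAB>lemma";
## an empty line marks end-of-sentence and is stored as undef.
sub parse_simple_tokens {
  my $fh = shift;
  my @tokens;
  while (defined(my $line = <$fh>)) {
    chomp $line;
    if ($line eq '') {
      push @tokens, undef;    ## EOS marker
      next;
    }
    my ($w, $p, $l) = split /\t/, $line;
    push @tokens, { w => $w, p => $p, l => $l };
  }
  return \@tokens;
}

## Usage with an in-memory filehandle, so no file on disk is needed.
my $input = "Dogs\tNN\tdog\nbark\tVV\tbark\n\nCats\tNN\tcat\n";
open(my $fh, '<', \$input) or die "cannot open in-memory handle: $!";
my $tokens = parse_simple_tokens($fh);
close $fh;
printf "parsed %d entries\n", scalar @$tokens;
```

In a real subclass, the resulting array-ref would be stored under the tokens key of the document object; how the other keys (date, meta, label) are filled depends on the input format.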
SUBCLASSES
The DiaColloDB distribution provides the following built-in DiaColloDB::Document subclasses:
- DiaColloDB::Document::DDCTabs
-
Full support for DDC tab-dump files as produced by ddc_dump --full --tabs; see http://odo.dwds.de/~moocow/software/ddc/ddc_tabs.html.
- DiaColloDB::Document::JSON
-
Supports input files in JSON format, assuming the stored JSON data maps 1:1 onto the required DiaColloDB::Document structure described above under new().
- DiaColloDB::Document::TCF
-
Basic handling for input files in CLARIN-D TCF format as used by WebLicht.
- DiaColloDB::Document::TEI
-
Rudimentary handling for TEI-like XML input files, which must at least include token boundaries encoded as <w> elements.
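To illustrate what a 1:1 mapping between JSON data and the structure described under new() could look like for DiaColloDB::Document::JSON, the following sketch round-trips a small document through the core JSON::PP module. It does not call DiaColloDB itself; the sample values are invented, and the exact file layout that the subclass expects is an assumption here.

```perl
use strict;
use warnings;
use JSON::PP;

## Sample document in the structure described under new(); values are invented.
my $doc = {
  label  => 'example.json',
  date   => 1900,
  meta   => { author => 'Anonymous' },
  tokens => [
    { w => 'Dogs', p => 'NN', l => 'dog' },  ## normal token
    undef,                                   ## EOS, serialized as JSON null
    '#p',                                    ## paragraph break
  ],
};

## canonical() sorts object keys so the output is stable across runs.
my $json = JSON::PP->new->canonical(1)->pretty;
my $text = $json->encode($doc);
print $text;

## Decoding restores the same Perl structure: JSON null becomes undef again.
my $back = $json->decode($text);
```

Because JSON null and Perl undef map onto each other, the EOS markers in the token list survive the round trip without any special handling.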
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
SEE ALSO
DiaColloDB::Document::DDCTabs(3pm), DiaColloDB::Document::JSON(3pm), DiaColloDB::Document::TCF(3pm), DiaColloDB::Document::TEI(3pm), DiaColloDB::Corpus(3pm), DiaColloDB(3pm), perl(1), ...