NAME

DTA::TokWrap::Document - DTA tokenizer wrappers: document wrapper

SYNOPSIS

use DTA::TokWrap::Document;

##========================================================================
## Constructors etc.

$doc = $CLASS_OR_OBJECT->new(%args);
%defaults = $CLASS->defaults();
$doc = $doc->init();
$doc->DESTROY();

##========================================================================
## Methods: Pseudo-I/O

$newdoc = $CLASS_OR_OBJECT->open($xmlfile,%docNewOptions);
$bool = $doc->close();
@notempkeys = $doc->notempkeys();
@tempfiles = $doc->tempfiles();

##========================================================================
## Methods: pseudo-pseudo-make

$bool = $doc->genKey($key);
$keyval_or_undef = $doc->makeKey($key);

##========================================================================
## Methods: Low-Level: generator-subclass wrappers

$doc_or_undef = $doc->mkindex();
$doc_or_undef = $doc->mkbx0();
$doc_or_undef = $doc->mkbx();
$doc_or_undef = $doc->tokenize();
$doc_or_undef = $doc->tok2xml();
$doc_or_undef = $doc->txmlanno();

##========================================================================
## Methods: Member I/O

$bx0doc_or_undef    = $doc->loadBx0File();
$bxdata_or_undef    = $doc->loadBxFile();
$cxdata_or_undef    = $doc->loadCxFile();
\$tokdata_or_undef  = $doc->loadTokFile();
\$xtokdata_or_undef = $doc->loadXtokFile();
$xtokDoc            = $doc->xtokDoc();
\$xmlbuf_or_undef   = $doc->loadXmlData();
\$txtbuf_or_undef   = $doc->loadTxtData();

$file_or_undef = $doc->saveBx0File();
$file_or_undef = $doc->saveBxFile();
$file_or_undef = $doc->saveTxtFile();
$file_or_undef = $doc->saveTokFile();
$file_or_undef = $doc->saveXtokFile();
$file_or_undef = $doc->saveTcfFile();
 
##========================================================================
## Methods: Profiling

$ntoks_or_undef = $doc->nTokens();
$nxbytes_or_undef = $doc->nXmlBytes();

DESCRIPTION

DTA::TokWrap::Document provides a perl class for representing a single DTA base-format XML file and associated indices. Together with the DTA::TokWrap module, this class comprises the top-level API of the DTA::TokWrap distribution.

Globals

@ISA

DTA::TokWrap::Document inherits from DTA::TokWrap::Base.

$TOKENIZE_CLASS

Default tokenizer sub-processor class (default='DTA::TokWrap::Processor::tokenize').

Variables: ($CX_ID,$CX_XOFF,$CX_XLEN,$CX_TOFF,$CX_TLEN,$CX_TEXT,$CX_ATTRS)

Field indices in .cx files generated by the mkindex() method.

Constructors etc.

new
$doc = $CLASS_OR_OBJECT->new(%args);

Low-level constructor for document wrapper object. You should probably use either DTA::TokWrap->open() or DTA::TokWrap::Document->open() instead of calling this constructor directly.

%args, %$doc:

##-- Document class
class => $class,      ##-- delegate call to $class->new(%args)
##
##-- Source data
xmlfile => $xmlfile,  ##-- source filename
xmlbase => $xmlbase,  ##-- xml:base for generated files (default=basename($xmlfile))
xmldata => $xmldata,  ##-- source buffer (for addws, tcfencode)
##
##-- pseudo-make options
traceMake => $level,  ##-- log-level for makeKey() trace (e.g. 'debug'; default=undef (none))
traceGen  => $level,  ##-- log-level for genKey() trace (e.g. 'trace'; default=undef (none))
traceProc => $level,  ##-- log-level for document-called processor calls (default=none)
traceLoad => $level,  ##-- log-level for load* trace (default=none)
traceSave => $level,  ##-- log-level for save* trace (default=none)
genDummy  => $bool,   ##-- if true, generator will not actually run (a la `make -n`)
##
##-- generator data (optional)
tw => $tw,              ##-- a DTA::TokWrap object storing individual generators
traceOpen  => $level,   ##-- log-level for open() trace (e.g. 'info'; default=undef (none))
traceClose => $level,   ##-- log-level for close() trace (e.g. 'trace'; default=undef (none))
##
##-- generated data (common)
outdir => $outdir,    ##-- output directory for generated data (default=.)
tmpdir => $tmpdir,    ##-- temporary directory for generated data (default=$ENV{DTATW_TMP}||$outdir)
keeptmp => $bool,     ##-- if true, temporary document-local files will be kept on $doc->close()
notmpre => $regex,    ##-- non-temporary filename regex
notmpkeys => $keys,   ##-- non-temporary keys, space-separated list
outbase => $filebase, ##-- output basename (default=`basename $xmlbase .xml`)
format => $level,     ##-- default formatting level for XML output
##
##-- mkindex data (see DTA::TokWrap::Processor::mkindex)
cxfile => $cxfile,    ##-- character index file (default="$tmpdir/$outbase.cx")
cxdata => $cxdata,    ##-- character index data (see loadCxFile() method)
sxfile => $sxfile,    ##-- structure index file (default="$tmpdir/$outbase.sx")
txfile => $txfile,    ##-- raw text index file (default="$tmpdir/$outbase.tx")
##
##-- mkbx0 data (see DTA::TokWrap::Processor::mkbx0)
bx0doc  => $bx0doc,   ##-- pre-serialized block-index XML::LibXML::Document
bx0file => $bx0file,  ##-- pre-serialized block-index XML file (default="$outbase.bx0"; optional)
##
##-- mkbx data (see DTA::TokWrap::Processor::mkbx)
bxdata  => \@bxdata,  ##-- block-list, see DTA::TokWrap::mkbx::mkbx() for details
bxfile  => $bxfile,   ##-- serialized block-index CSV file (default="$tmpdir/$outbase.bx"; optional)
txtfile => $txtfile,  ##-- serialized & hinted text file (default="$tmpdir/$outbase.txt"; optional)
txtdata => $txtdata,  ##-- serialized & hinted text file (used by tcfencode, must be loaded explicitly with loadTxtData())
##
##-- tokenize data (see DTA::TokWrap::Processor::tokenize, DTA::TokWrap::Processor::tokenize::dummy)
tokdata0 => $tokdata0,  ##-- tokenizer output data (slurped string)
tokfile0 => $tokfile0,  ##-- tokenizer output file (default="$tmpdir/$outbase.t0"; optional)
##
##-- post-tokenize data (see DTA::TokWrap::Processor::tokenize1)
tokdata1 => $tokdata1,  ##-- post-tokenizer output data (slurped string)
tokfile1 => $tokfile1,  ##-- post-tokenizer output file (default="$tmpdir/$outbase.t1"; optional)
##
##-- tokenizer xml data (see DTA::TokWrap::Processor::tok2xml)
xtokdata => $xtokdata,  ##-- XML-ified tokenizer output data
xtokfile => $xtokfile,  ##-- XML-ified tokenizer output file (default="$outdir/$outbase.t.xml")
xtokdoc  => $xtokdoc,   ##-- XML::LibXML::Document for $xtokdata (parsed from string)
##
##-- tokenizer xml annotations (see DTA::TokWrap::Processor::txmlanno)
axtokdata => $axtokdata,  ##-- optional external XML annotation data (for splicing into $xtokdata)
axtokfile => $axtokfile,  ##-- optional external XML annotation file (for splicing into $xtokfile; default="$outdir/$outbase.ta.xml")
xtokfile0 => $xtokfile0,  ##-- XML-ified tokenizer output file (default=none or "$outdir/$outbase.t0.xml" if {keeptmp} is true)
##
##-- ws-splice (see DTA::TokWrap::Processor::addws)
#cwsdata => $cwsdata,    ##-- ws-spliced output data (xmlfile with <s> and <w> elements)
cwsfile => $cwsfile,    ##-- ws-spliced output file (default="$outdir/$outbase.cws.xml")
##
##-- property-splice (see DTA::TokWrap::Processor::idsplice)
## cwstbasebufr => \$bdata,  ##-- base data-ref for idsplice (xml with //*/@id) [default=\$cwsdata if defined]
## cwstbasefile => $bfile,   ##-- source file for $bdata [default=$cwsfile]
## cwstsobufr   => \$sodata, ##-- standoff data-ref for idsplice (xml with //*/@id, additional attributes and content) [default=\$xtokdata]
## cwstsofile   => $sofile,  ##-- source file for $sodata [default=$xtokfile]
## cwstbufr     => $wstbufr, ##-- idsplice output buffer (base + id-spliced attributes, content) -- available for override, not used by default
## cwstfile     => $wstfile, ##-- idsplice output file [default="$outdir/$outbase.cwst.xml"]
##
##-- tcfencode data (see DTA::TokWrap::Processor::tcfencode)
tcfdoc   => $tcfdoc,     ##-- XML::LibXML::Document representing TCF-encoded data
tcffile  => $tcffile,    ##-- TCF file
tcflang  => $lang,       ##-- TCF language attribute (default: 'de')
##
##-- tcftokenize data (see DTA::TokWrap::Processor::tcftokenize)
tcftokdoc => $tcftokdoc,    ##-- XML::LibXML::Document representing tokenized TCF data (== $tcfdoc)
tcftokfile => $tcftokfile,  ##-- tcf-tokenized file
##
##-- tcfdecode0 data (see DTA::TokWrap::Processor::tcfdecode0)
tcfxfile => $tcfxfile,   ##-- tcf-decoded base xml file [default="$tmpdir/$outbase.tcfx"]
tcfxdata => $tcfxdata,   ##-- tcf-decoded base xml data
tcftfile => $tcftfile,   ##-- tcf-decoded serial text file [default="$tmpdir/$outbase.tcft"]
tcftdata => $tcftdata,   ##-- tcf-decoded serial txt data
tcfwdata => $tcfwdata,   ##-- tcf-decoded token data, tt-format: "TEXT\tSID/WID\n"
tcfwfile => $tcfwfile,   ##-- tcf-decoded token file, tt-format [default="$tmpdir/$outbase.tcfw"]
tcfadata => $tcfadata,   ##-- tcf-decoded token attributes for idsplice, data
tcfafile => $tcfafile,   ##-- tcf-decoded token attributes for idsplice, file [default="$tmpdir/$outbase.tcfa"]
##
##-- tcfalign data (PROXIED, see DTA::TokWrap::Processor::tcfalign : uses tokdata1,tokfile1)
##-- tcf2txml data (PROXIED, see DTA::TokWrap::Processor::tok2xml : uses tokfile1,cxfile,bxfile,xtokdata)
##-- tcfdecode data
tcfcwsfile => $tcfcwsfile, ##-- tcf-decoded+aligned+ws-spliced output file (default="$outdir/$outbase.tcfws.xml")

defaults
%defaults = $CLASS->defaults();

Static object defaults.

init
$doc = $doc->init();

Set computed object defaults.
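
The following is a minimal sketch of how a few of the computed defaults documented under new() fit together. It is not the module's init() implementation: the helper name computed_defaults() and the sample paths are hypothetical, and only the default rules stated in the option list above (xmlbase, outdir, tmpdir, outbase, cxfile, xtokfile) are reproduced.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper mirroring a few documented defaults.
# Assumption: a default applies only when the caller left the key undefined.
sub computed_defaults {
    my (%doc) = @_;
    $doc{xmlbase}  //= basename($doc{xmlfile});          # default=basename($xmlfile)
    $doc{outdir}   //= '.';                              # default=.
    $doc{tmpdir}   //= $ENV{DTATW_TMP} || $doc{outdir};  # default=$ENV{DTATW_TMP}||$outdir
    $doc{outbase}  //= basename($doc{xmlbase}, '.xml');  # default=`basename $xmlbase .xml`
    $doc{cxfile}   //= "$doc{tmpdir}/$doc{outbase}.cx";
    $doc{xtokfile} //= "$doc{outdir}/$doc{outbase}.t.xml";
    return \%doc;
}

# Sample (hypothetical) source file and output directory.
my $doc = computed_defaults(xmlfile => '/data/doc1.xml', outdir => 'out');
print "$_ = $doc->{$_}\n" for qw(xmlbase outbase tmpdir cxfile xtokfile);
```

Note that temporary files such as the .cx index follow tmpdir, while final outputs such as the .t.xml file follow outdir, so setting DTATW_TMP in the environment moves only the former.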

DESTROY
$doc->DESTROY();

Destructor. Implicitly calls close().

Methods: Pseudo-I/O

open
$newdoc = $CLASS_OR_OBJECT->open($xmlfile,%docNewOptions);

Wrapper for $CLASS_OR_OBJECT->new(), with some additional sanity checks.

close
$bool = $doc->close();
$bool = $doc->close($is_destructor);

"Closes" document $doc, adding profiling information to $doc->{tw} if present.

Unlinks any temporary files in $doc unless $doc->{keeptmp} is true. All %$doc keys ending in 'file' are considered "temporary" files, except: xmlfile, xtokfile, sosfile, sowfile, soafile.

If $is_destructor is false (default), resets all keys in %$doc to default values (thus making $doc essentially unusable).

notempkeys
@notempkeys = $doc->notempkeys();

Returns the list of document keys ending in 'file' which are not considered "temporary". Used by $doc->tempfiles().

tempfiles
@tempfiles = $doc->tempfiles();

Returns list of temporary filenames which have been generated by $doc, or an empty list if $doc->{keeptmp} is true. Used by $doc->close().

Checks $doc->{"${filekey}_stamp"} to determine whether this document generated the file named by $doc->{"$filekey"}.

Implementation: returns values of all %$doc keys ending with 'file' except for those returned by $doc->notempkeys().
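
The selection rule described above can be sketched on a plain hash. This is an illustration only, not the module's code: the helper temp_files(), the sample hash, and the explicit list of non-temporary keys are all hypothetical stand-ins for %$doc and $doc->notempkeys().

```perl
use strict;
use warnings;

# Hypothetical helper: list the temporary files recorded in a hash shaped
# like %$doc. $notemp is an array-ref of keys that are never temporary.
sub temp_files {
    my ($doc, $notemp) = @_;
    return () if $doc->{keeptmp};            # keeptmp: nothing is temporary
    my %keep = map { $_ => 1 } @$notemp;
    my @keys = grep {
        /file$/                              # key ends in 'file'
            && !$keep{$_}                    # not a non-temporary key
            && defined $doc->{$_}
            && $doc->{"${_}_stamp"}          # this document generated the file
    } sort keys %$doc;
    return map { $doc->{$_} } @keys;
}

# Sample (hypothetical) document state.
my %doc = (
    xmlfile        => 'in.xml',
    cxfile         => 'tmp/in.cx',
    cxfile_stamp   => 1,
    bxfile         => 'tmp/in.bx',           # no stamp: not generated here
    xtokfile       => 'out/in.t.xml',
    xtokfile_stamp => 1,
);
my @tmp = temp_files(\%doc, [qw(xmlfile xtokfile)]);
print "$_\n" for @tmp;
```

The stamp check matters when a caller passes in pre-existing index files: without it, close() would unlink files the document did not create.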

Methods: pseudo-pseudo-make

%KEYGEN
%KEYGEN = ($dataKey => $generatorSpec, ...)

Low-level hash mapping data keys to the generating processes (subroutines, classes, ...).

$generatorSpec is one of:

$key      : calls $doc->can($key)->($doc)
\&coderef : calls &coderef($doc)
\@array   : array of atomic $generatorSpecs (keys or CODE-refs)

genKey
$bool = $doc->genKey($key);
$bool = $doc->genKey($key,\%KEYGEN);

(Re-)generate a data key (single step only, ignoring dependencies). An argument $key without a value $KEYGEN{$key} triggers an error.
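
The three kinds of $generatorSpec can be illustrated with a small stand-alone dispatcher. This is a sketch under stated assumptions, not the module's genKey(): the package My::Doc, the methods run_spec() and gen_key(), and the keys in the sample table are hypothetical, and the choice to stop at the first atomic spec that returns false is an assumption.

```perl
use strict;
use warnings;

package My::Doc;    # hypothetical stand-in, not DTA::TokWrap::Document

sub new     { my $class = shift; return bless { log => [] }, $class }
sub mkindex { my $doc = shift; push @{ $doc->{log} }, 'mkindex'; return $doc }
sub mkbx0   { my $doc = shift; push @{ $doc->{log} }, 'mkbx0';   return $doc }

# Run one atomic spec: a CODE-ref is called with $doc, a plain key is
# resolved as a method name via can().
sub run_spec {
    my ($doc, $spec) = @_;
    return $spec->($doc) if ref($spec) eq 'CODE';
    my $method = $doc->can($spec)
        or die "no method '$spec' for generator spec";
    return $method->($doc);
}

# Single-step generation: look up $key and run its spec(s) in order.
sub gen_key {
    my ($doc, $key, $keygen) = @_;
    my $spec = $keygen->{$key};
    die "no generator defined for key '$key'" unless defined $spec;
    for my $atom (ref($spec) eq 'ARRAY' ? @$spec : ($spec)) {
        run_spec($doc, $atom) or return 0;   # assumption: stop on failure
    }
    return 1;
}

package main;

# Sample table with one spec of each kind (keys are hypothetical).
my %keygen = (
    index => 'mkindex',
    both  => [ 'mkindex', 'mkbx0' ],
    mark  => sub { $_[0]{marked} = 1; $_[0] },
);
my $doc = My::Doc->new;
$doc->gen_key($_, \%keygen) for qw(both mark);
print join(' ', @{ $doc->{log} }), "\n";
```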

makeKey
$keyval_or_undef = $doc->makeKey($key);

Just an alias for $doc->genKey($key) here, but see DTA::TokWrap::Document::Maker for a more sophisticated implementation.

Methods: Low-Level: generator-subclass wrappers

mkindex
$doc_or_undef = $doc->mkindex($mkindex);
$doc_or_undef = $doc->mkindex();

see DTA::TokWrap::Processor::mkindex::mkindex().

mkbx0
$doc_or_undef = $doc->mkbx0($mkbx0);
$doc_or_undef = $doc->mkbx0();

see DTA::TokWrap::Processor::mkbx0::mkbx0().

mkbx
$doc_or_undef = $doc->mkbx($mkbx);
$doc_or_undef = $doc->mkbx();

see DTA::TokWrap::Processor::mkbx::mkbx().

tokenize
$doc_or_undef = $doc->tokenize($tokenize);
$doc_or_undef = $doc->tokenize();

see DTA::TokWrap::Processor::tokenize::tokenize(), DTA::TokWrap::Processor::tokenize::http::tokenize(), DTA::TokWrap::Processor::tokenize::tomasotath::tokenize(), DTA::TokWrap::Processor::tokenize::dummy::tokenize().

Default tokenizer subclass is given by package-global $TOKENIZE_CLASS.

tokenize1
$doc_or_undef = $doc->tokenize1($tokenize1);
$doc_or_undef = $doc->tokenize1();

see DTA::TokWrap::Processor::tokenize1::tokenize1().

tok2xml
$doc_or_undef = $doc->tok2xml($tok2xml);
$doc_or_undef = $doc->tok2xml();

see DTA::TokWrap::Processor::tok2xml::tok2xml().

txmlanno
$doc_or_undef = $doc->txmlanno($txmlanno);
$doc_or_undef = $doc->txmlanno();

see DTA::TokWrap::Processor::txmlanno::txmlanno().

addws
$doc_or_undef = $doc->addws($addws);
$doc_or_undef = $doc->addws();

see DTA::TokWrap::Processor::addws::addws().

idsplice
$doc_or_undef = $doc->idsplice($idsplice);
$doc_or_undef = $doc->idsplice();

see DTA::TokWrap::Processor::idsplice::idsplice().

tcfencode
$doc_or_undef = $doc->tcfencode($tcfencode);
$doc_or_undef = $doc->tcfencode();

see DTA::TokWrap::Processor::tcfencode::tcfencode().

Methods: Member I/O

loadBx0File
$bx0doc_or_undef = $doc->loadBx0File($filename_or_fh);
$bx0doc_or_undef = $doc->loadBx0File();

loads $doc->{bx0doc} from $filename_or_fh (default=$doc->{bx0file})

loadBxFile
$bxdata_or_undef = $doc->loadBxFile($bxfile_or_fh,$txtfile_or_fh);
$bxdata_or_undef = $doc->loadBxFile();

loads $doc->{bxdata} from @$doc{qw(bxfile txtfile)}

requires $doc->{txfile}

loadCxFile
$cxdata_or_undef = $doc->loadCxFile($filename_or_fh);
$cxdata_or_undef = $doc->loadCxFile();

loads $doc->{cxdata} from $filename_or_fh (default=$doc->{cxfile}).

$doc->{cxdata} = [ $cx0, ... ], where:

  • each $cx = [ $id, $xoff,$xlen, $toff,$tlen, $text, @attrs ]

  • package globals $CX_ID, $CX_XOFF, etc. are indices for $cx arrays
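
As an illustration of the record layout, the sketch below reads fields out of a few hand-made records by index. The index values 0..6 are an assumption derived from the field order listed above (the real values live in the package globals), and the sample records and helper names are hypothetical.

```perl
use strict;
use warnings;

# Assumed index values, mirroring the documented field order
# [ $id, $xoff,$xlen, $toff,$tlen, $text, @attrs ].
my ($CX_ID, $CX_XOFF, $CX_XLEN, $CX_TOFF, $CX_TLEN, $CX_TEXT, $CX_ATTRS) = (0 .. 6);

# Hypothetical records, shaped like the elements of $doc->{cxdata}.
my $cxdata = [
    [ 'c1', 100, 1, 0, 1, 'a' ],
    [ 'c2', 101, 1, 1, 1, 'b' ],
    [ 'c3', 110, 3, 2, 1, 'c', 'attr1' ],
];

# Sum of the $tlen fields of all records.
sub total_tlen {
    my ($cx) = @_;
    my $len = 0;
    $len += $_->[$CX_TLEN] for @$cx;
    return $len;
}

# Concatenation of the $text fields of all records.
sub all_text {
    my ($cx) = @_;
    return join '', map { $_->[$CX_TEXT] } @$cx;
}

# Trailing attributes (if any) of one record.
sub attrs_of {
    my ($rec) = @_;
    return @{$rec}[ $CX_ATTRS .. $#$rec ];
}

printf "%s: tlen=%d text=%s\n", $_->[$CX_ID], $_->[$CX_TLEN], $_->[$CX_TEXT]
    for @$cxdata;
```

Using the named indices rather than literal numbers keeps calling code readable and insulates it from changes to the record layout.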

loadTokFileN
\$tokdata_or_undef = $doc->loadTokFileN($n,$filename_or_fh);
\$tokdata_or_undef = $doc->loadTokFileN($n);

loads $doc->{"tokdata${n}"} from $filename_or_fh (default=$doc->{"tokfile${n}"})

loadTokFile0
\$tokdata0_or_undef = $doc->loadTokFile0(@args)

Wrapper for $doc->loadTokFileN(0,@args)

loadTokFile1
\$tokdata1_or_undef = $doc->loadTokFile1(@args)

Wrapper for $doc->loadTokFileN(1,@args)

loadXtokFile
\$xtokdata_or_undef = $doc->loadXtokFile($filename_or_fh);
\$xtokdata_or_undef = $doc->loadXtokFile();

loads $doc->{xtokdata} from $filename_or_fh (default=$doc->{xtokfile})

see also $doc->xtokDoc().

xtokDoc
$xtokDoc = $doc->xtokDoc(\$xtokdata);
$xtokDoc = $doc->xtokDoc();

parse \$xtokdata (default: \$doc->{xtokdata}) string into $doc->{xtokdoc}

warning: may call $doc->tok2xml()

loadXmlData
$xmlbuf_or_undef = $doc->loadXmlData($filename_or_fh);
$xmlbuf_or_undef = $doc->loadXmlData();

loads $doc->{xmldata} from $filename_or_fh (default=$doc->{xmlfile}).

loadCwsData
\$xmlbuf_or_undef = $doc->loadCwsData($filename_or_fh);
\$xmlbuf_or_undef = $doc->loadCwsData();

DEPRECATED

loads $doc->{cwsdata} from $filename_or_fh (default=$doc->{cwsfile}).

loadTxtData
\$txtbuf_or_undef = $doc->loadTxtData($filename_or_fh)
\$txtbuf_or_undef = $doc->loadTxtData()

loads $doc->{txtdata} from $filename_or_fh (default=$doc->{txtfile})

saveBx0File
$file_or_undef = $doc->saveBx0File($filename_or_fh,$bx0doc,%opts);
$file_or_undef = $doc->saveBx0File($filename_or_fh);
$file_or_undef = $doc->saveBx0File();

Saves $bx0doc (default=$doc->{bx0doc}) to $filename_or_fh (default=$doc->{bx0file}="$doc->{outdir}/$doc->{outbase}.bx0"), and sets both $doc->{bx0file} and $doc->{bx0file_stamp}.

%opts:

format => $level,  ##-- output format (default=$doc->{format})

saveBxFile
$file_or_undef = $doc->saveBxFile($filename_or_fh,\@blocks);
$file_or_undef = $doc->saveBxFile($filename_or_fh);
$file_or_undef = $doc->saveBxFile();

Saves text-block data \@blocks (default=$doc->{bxdata}) to $filename_or_fh (default=$doc->{bxfile}), and sets both $doc->{bxfile} and $doc->{bxfile_stamp}.

saveTxtFile
$file_or_undef = $doc->saveTxtFile($filename_or_fh,\@blocks,%opts);
$file_or_undef = $doc->saveTxtFile($filename_or_fh);
$file_or_undef = $doc->saveTxtFile();

Saves serialized text extracted from \@blocks (default=$doc->{bxdata}) to $filename_or_fh (default=$doc->{txtfile}="$doc->{outdir}/$doc->{outbase}.txt"), and sets both $doc->{txtfile} and $doc->{txtfile_stamp}.

%opts:

debug=>$bool,  ##-- if true, debugging text will be printed (and saveBxFile() offsets will be wrong)

saveTokFileN
$file_or_undef = $doc->saveTokFileN($n,$filename_or_fh,\$tokdata);
$file_or_undef = $doc->saveTokFileN($n,$filename_or_fh);
$file_or_undef = $doc->saveTokFileN($n);

Saves tokenizer output data string $tokdata (default=$doc->{"tokdata${n}"}) to $filename_or_fh (default=$doc->{"tokfile${n}"}="$doc->{outdir}/$doc->{outbase}.t${n}"), and sets both $doc->{"tokfile${n}"} and $doc->{"tokfile${n}_stamp"}.

saveTokFile0
$file_or_undef = $doc->saveTokFile0(@args)

Wrapper for $doc->saveTokFileN(0,@args)

saveTokFile1
$file_or_undef = $doc->saveTokFile1(@args)

Wrapper for $doc->saveTokFileN(1,@args)

saveXtokFile
$file_or_undef = $doc->saveXtokFile($filename_or_fh,\$xtokdata,%opts);
$file_or_undef = $doc->saveXtokFile($filename_or_fh);
$file_or_undef = $doc->saveXtokFile();

Saves XML-ified master tokenizer data string $xtokdata (default=$doc->{xtokdata}) to $filename_or_fh (default=$doc->{xtokfile}="$doc->{outdir}/$doc->{outbase}.t.xml"), and sets both $doc->{xtokfile} and $doc->{xtokfile_stamp}.

saveTcfFile
$file_or_undef = $doc->saveTcfFile($filename_or_fh,$tcfdoc,%opts);
$file_or_undef = $doc->saveTcfFile($filename_or_fh);
$file_or_undef = $doc->saveTcfFile();

known %opts:

format => $level, ##-- formatting level (default=1)

Saves TCF-encoded document $tcfdoc (default=$doc->{tcfdoc}) to $filename_or_fh (default=$doc->{tcffile}="$doc->{outdir}/$doc->{outbase}.t.xml"), and sets $doc->{tcffile_stamp}.

Methods: Profiling

nTokens
$ntoks_or_undef = $doc->nTokens();

Returns number of tokens in the currently opened document, if known.

nXmlBytes
$nxbytes_or_undef = $doc->nXmlBytes();

Returns the number of bytes in the base-format XML file, if known (and it should always be known!).

SEE ALSO

DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2018 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.