
NAME

DTA::CAB::Format::TCF - Datum parser|formatter: CLARIN-D TCF (selected features only)

SYNOPSIS

##========================================================================
## PRELIMINARIES

use DTA::CAB::Format::TCF;

##========================================================================
## Constructors etc.

$fmt = CLASS_OR_OBJ->new(%args);

##========================================================================
## Methods: Input: Generic API

$doc = $fmt->parseDocument();

##========================================================================
## Methods: Output: MIME & HTTP stuff

$short = $fmt->shortName();
$type = $fmt->mimeType();
$ext = $fmt->defaultExtension();

##========================================================================
## Methods: Output: output selection

$fmt = $fmt->flush();

##========================================================================
## Methods: Output: Generic API

$fmt = $fmt->putDocument($doc);

DESCRIPTION

Globals

Variable: @ISA

DTA::CAB::Format::TCF inherits from DTA::CAB::Format::XmlCommon.

Constructors etc.

new
$fmt = CLASS_OR_OBJ->new(%args);

object structure: HASH ref

{
 ##-- new in TCF
 tcfbufr => \$buf,                       ##-- raw TCF buffer, for spliceback mode
 textbufr => \$text,                     ##-- raw text buffer, for spliceback mode
 tcflog  => $level,                      ##-- debugging log-level (default: 'off')
 spliceback => $bool,                    ##-- (output) if true (default), splice data back into 'tcfbufr' if available; otherwise create new TCF doc
 tcflayers => $tcf_layer_names,          ##-- layer names to include, space-separated list; known='tei text tokens sentences postags lemmas orthography'
 tcftagset => $tagset,                   ##-- tagset name for POStags element (default='stts')
 logsplice => $level,                    ##-- log level for spliceback messages (default: 'none')
 trimtext => $bool,                      ##-- if true (default), waste tokenizer hints will be trimmed from 'text' layer
 ##-- input: inherited from XmlCommon
 xdoc => $xdoc,                          ##-- XML::LibXML::Document
 xprs => $xprs,                          ##-- XML::LibXML parser
 ##-- output: inherited from XmlCommon
 level => $level,                        ##-- output formatting level (OVERRIDE: default=1)
 output => [$how,$arg]                   ##-- either ['fh',$fh], ['file',$filename], or ['str',\$buf]
}
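The 'tcflayers' option is a space-separated list of layer names. As a rough illustration of that convention, the sketch below splits such a string and checks each name against the known layers listed above. The helper parse_tcf_layers() is hypothetical and not part of DTA::CAB; how the module itself treats unknown or differently-cased names is not documented here.

```perl
use strict;
use warnings;

# Layer names listed as "known" in the 'tcflayers' documentation above.
my %KNOWN_LAYER = map { $_ => 1 }
    qw(tei text tokens sentences postags lemmas orthography);

# Hypothetical helper (not part of DTA::CAB): split a 'tcflayers' string
# into known and unknown layer names, preserving their order.
sub parse_tcf_layers {
    my ($spec) = @_;
    my (@known, @unknown);
    for my $name (split ' ', $spec // '') {
        if ($KNOWN_LAYER{$name}) { push @known, $name }
        else                     { push @unknown, $name }
    }
    return (\@known, \@unknown);
}

my ($known, $unknown) = parse_tcf_layers('tokens sentences  lemmas morphology');
print "known: @$known\n";      # known: tokens sentences lemmas
print "unknown: @$unknown\n";  # unknown: morphology
```

Splitting on ' ' (rather than a literal single space) tolerates repeated or leading whitespace in the option value.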

Methods: Input: Generic API

parseDocument
$doc = $fmt->parseDocument();

Parses the buffered XML::LibXML::Document in $fmt->{xdoc} and returns the parsed document.

Methods: Output: MIME & HTTP stuff

shortName
$short = $fmt->shortName();

Returns the "official" short name for this format; this override returns "tcf".

mimeType
$type = $fmt->mimeType();

Returns the MIME type for this format; this override returns "text/xml".

defaultExtension
$ext = $fmt->defaultExtension();

Returns the default filename extension for this format; this override returns ".tcf.xml".
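As a hedged illustration of how a caller might use these three accessors together, the sketch below builds a Content-Type header line and an output filename. My::StubTCF is a hypothetical stand-in that only returns the values documented above, so the example runs without DTA::CAB installed; describe_output() is likewise invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in returning the values documented above;
# a real caller would use a DTA::CAB::Format::TCF object instead.
package My::StubTCF {
    sub new              { return bless {}, shift }
    sub shortName        { return 'tcf' }
    sub mimeType         { return 'text/xml' }
    sub defaultExtension { return '.tcf.xml' }
}

# Hypothetical helper: derive response metadata from a format object.
sub describe_output {
    my ($fmt, $basename) = @_;
    return {
        label    => $fmt->shortName,
        header   => 'Content-Type: ' . $fmt->mimeType,
        filename => $basename . $fmt->defaultExtension,
    };
}

my $info = describe_output(My::StubTCF->new, 'mydoc');
print "$info->{label}\n";     # tcf
print "$info->{header}\n";    # Content-Type: text/xml
print "$info->{filename}\n";  # mydoc.tcf.xml
```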

Methods: Output: output selection

flush
$fmt = $fmt->flush();

Flushes any buffered output to the selected output destination.
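The object structure above describes the output selection as a pair [$how,$arg], with $how being 'fh', 'file', or 'str'. The following is a minimal sketch of how such a pair could be mapped to a writable filehandle in plain Perl. The helper open_output() is hypothetical and is not the module's implementation; the UTF-8 layer on files is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DTA::CAB): turn an output spec of the
# form [$how, $arg] into a writable filehandle.
sub open_output {
    my ($spec) = @_;
    my ($how, $arg) = @$spec;
    if ($how eq 'fh') {
        return $arg;    # caller already opened the handle
    }
    elsif ($how eq 'file') {
        open(my $fh, '>:encoding(UTF-8)', $arg)
            or die "open failed for '$arg': $!";
        return $fh;
    }
    elsif ($how eq 'str') {
        $$arg = '' unless defined $$arg;
        open(my $fh, '>', $arg)    # in-memory filehandle on the scalar ref
            or die "open failed for in-memory buffer: $!";
        return $fh;
    }
    die "unknown output mode '$how'";
}

my $buf = '';
my $fh  = open_output(['str', \$buf]);
print {$fh} "<D-Spin/>\n";
close($fh) or die "close failed: $!";   # ensures everything written is in $buf
print $buf;
```

With the 'str' mode the written data ends up in the caller's scalar, which is convenient for tests and for building HTTP response bodies in memory.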

Methods: Output: Generic API

putDocument
$fmt = $fmt->putDocument($doc);

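This override respects the local 'spliceback' and 'tcflayers' options: if 'spliceback' is true and a 'tcfbufr' buffer is available, data is spliced back into that buffer; otherwise a new TCF document is created. Only the layers named in 'tcflayers' are included.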

EXAMPLE

An example file in the format accepted/generated by this module is:

<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
 <MetaData xmlns="http://www.dspin.de/data/metadata"/>
 <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
   <text>wie oede!</text>
   <tokens>
     <token ID="w1">wie</token>
     <token ID="w2">oede</token>
     <token ID="w3">!</token>
   </tokens>
   <sentences>
     <sentence ID="s1" tokenIDs="w1 w2 w3"/>
   </sentences>
   <lemmas>
     <lemma tokenIDs="w1">wie</lemma>
     <lemma tokenIDs="w2">öde</lemma>
     <lemma tokenIDs="w3">!</lemma>
   </lemmas>
   <POStags tagset="stts">
     <tag tokenIDs="w1">PWAV</tag>
     <tag tokenIDs="w2">ADJD</tag>
     <tag tokenIDs="w3">$.</tag>
   </POStags>
   <orthography>
     <correction tokenIDs="w2" operation="replace">öde</correction>
   </orthography>
 </TextCorpus>
</D-Spin>

If the input contains a 'text' layer but no 'tokens' or 'sentences' layers, the 'text' layer will be tokenized using the DTA::CAB::Format::Raw class.
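To make the layer structure of the example file concrete, here is a small sketch that reads an abbreviated copy of it with XML::LibXML directly (not through DTA::CAB) and joins each token with its POS tag via the 'tokenIDs' attribute. The namespace prefix 'tc' and the variable names are arbitrary choices for this illustration.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Abbreviated copy of the example document above (tokens and POStags only).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
 <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
   <tokens>
     <token ID="w1">wie</token>
     <token ID="w2">oede</token>
     <token ID="w3">!</token>
   </tokens>
   <POStags tagset="stts">
     <tag tokenIDs="w1">PWAV</tag>
     <tag tokenIDs="w2">ADJD</tag>
     <tag tokenIDs="w3">$.</tag>
   </POStags>
 </TextCorpus>
</D-Spin>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($doc);
# The layers live in the textcorpus namespace, so XPath needs a prefix for it.
$xpc->registerNs(tc => 'http://www.dspin.de/data/textcorpus');

# Map each token ID to its tag; 'tokenIDs' may list several IDs.
my %tag_of;
for my $tag ($xpc->findnodes('//tc:POStags/tc:tag')) {
    $tag_of{$_} = $tag->textContent for split ' ', $tag->getAttribute('tokenIDs');
}

# One tab-separated row per token: ID, surface form, POS tag.
my @rows;
for my $tok ($xpc->findnodes('//tc:tokens/tc:token')) {
    my $id = $tok->getAttribute('ID');
    push @rows, join("\t", $id, $tok->textContent, $tag_of{$id} // '-');
}
print "$_\n" for @rows;
```

The same ID-based join applies to the 'lemmas' and 'orthography' layers, which also reference tokens through 'tokenIDs'.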

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2019 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

dta-cab-analyze.perl(1), dta-cab-convert.perl(1), dta-cab-http-server.perl(1), dta-cab-http-client.perl(1), dta-cab-xmlrpc-server.perl(1), dta-cab-xmlrpc-client.perl(1), DTA::CAB::Server(3pm), DTA::CAB::Client(3pm), DTA::CAB::Format(3pm), DTA::CAB(3pm), perl(1), ...