NAME
DTA::CAB - "Cascaded Analysis Broker" for robust linguistic analysis
SYNOPSIS
use DTA::CAB;
DESCRIPTION
The DTA::CAB suite provides an object-oriented API for error-tolerant linguistic analysis of tokenized text. The DTA::CAB package itself just loads the common API from DTA::CAB::Common and attempts to load the common analysis modules from DTA::CAB::Analyzer::Common if present.
Earlier versions of the DTA::CAB suite used the DTA::CAB package to represent a default analyzer class. The corresponding class now lives in DTA::CAB::Chain::DTA.
Package Constants
- $VERSION
Module version, imported from DTA::CAB::Version.
- $SVNVERSION
SVN version from which this module was built, imported from DTA::CAB::Version.
Data Model
DTA::CAB is designed for processing natural language data which are represented internally by objects descended from the class DTA::CAB::Datum. Currently, the DTA::CAB data model explicitly supports the following datum classes:
- DTA::CAB::Token
Represents a single word token as a HASH-ref with at least a 'text' key, whose value should be a string representing the literal word text. Additional keys may be defined by IO formats and/or analyzers.
- DTA::CAB::Sentence
Represents a single sentence as a HASH-ref with at least a 'tokens' key, whose value should be an ARRAY-ref of DTA::CAB::Token structures. Additional keys may be defined by IO formats and/or analyzers.
- DTA::CAB::Document
Represents a text document as a HASH-ref with at least a 'body' key, whose value should be an ARRAY-ref of DTA::CAB::Sentence structures. Additional keys may be defined by IO formats and/or analyzers.
See the subclass documentation for details.
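The nesting described above can be illustrated with plain Perl data structures. This is only a sketch: it uses unblessed hash-refs rather than the real DTA::CAB::Token, DTA::CAB::Sentence, and DTA::CAB::Document classes, and the sample words are invented for illustration.

```perl
use strict;
use warnings;

# Unblessed stand-ins for the documented datum classes (assumption:
# the real classes are not loaded here, only their minimum keys are mirrored).
my $tok1     = { text => 'Eyn' };                # token: at least a 'text' key
my $tok2     = { text => 'Beyspiel' };
my $sentence = { tokens => [ $tok1, $tok2 ] };   # sentence: ARRAY-ref under 'tokens'
my $document = { body => [ $sentence ] };        # document: ARRAY-ref under 'body'

# Walk the structure: document -> sentences -> tokens -> text.
my @words;
for my $s (@{ $document->{body} }) {
    for my $w (@{ $s->{tokens} }) {
        push @words, $w->{text};
    }
}
print join(' ', @words), "\n";   # prints "Eyn Beyspiel"
```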
I/O Formats
DTA::CAB supports a number of different I/O formats for document data, including "CSV", "JSON", "Raw", "Text", "TT", "YAML", and "XML". See DTA::CAB::Format for details on the I/O format API, and see "SUBCLASSES" in DTA::CAB::Format for a list of currently implemented format subclasses.
The command-line utility dta-cab-convert.perl(1) is provided for converting between supported I/O formats.
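Because the data model consists of nested hashes and arrays, it maps naturally onto structured formats such as JSON. The following sketch round-trips a toy document with the core JSON::PP module. It does not use DTA::CAB::Format, and the layout it produces is an assumption for illustration, not necessarily the layout written by the suite's own JSON format class.

```perl
use strict;
use warnings;
use JSON::PP;

# Toy document using only the documented minimum keys.
my $doc = {
    body => [
        { tokens => [ { text => 'Eyn' }, { text => 'Beyspiel' } ] },
    ],
};

# canonical(1) sorts hash keys so the output is stable across runs.
my $json    = JSON::PP->new->canonical(1);
my $encoded = $json->encode($doc);      # Perl structure -> JSON text
my $decoded = $json->decode($encoded);  # JSON text -> Perl structure

print $encoded, "\n";
print $decoded->{body}[0]{tokens}[1]{text}, "\n";
```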
Processing Model
Input documents are processed by one or more DTA::CAB::Analyzer objects, each of which may insert, modify, and/or remove arbitrary properties of the analyzed data. For example, a morphological analyzer (DTA::CAB::Analyzer::Morph) might insert a token property 'morph', which a part-of-speech tagger (DTA::CAB::Analyzer::Moot) could then read.
See DTA::CAB::Analyzer for a specification of the basic analysis API, see DTA::CAB::Analyzer::Common for some common analyzers, see DTA::CAB::Chain and/or DTA::CAB::Chain::Multi for abstract encapsulations of serial analysis "pipelines", and see DTA::CAB::Chain::DTA for the analysis chains used in the Deutsches Textarchiv project.
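The idea of a serial pipeline in which later analyzers read properties written by earlier ones can be sketched in plain Perl. Everything here is hypothetical: toy_morph, toy_tagger, run_chain, the 'tag' property, and the toy rules are stand-ins invented for illustration and are not part of the DTA::CAB::Analyzer or DTA::CAB::Chain API.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a morphological analyzer: writes a 'morph'
# property on every token (toy rule: capitalized words count as nouns).
sub toy_morph {
    my ($doc) = @_;
    for my $s (@{ $doc->{body} }) {
        for my $w (@{ $s->{tokens} }) {
            $w->{morph} = ($w->{text} =~ /^[A-Z]/) ? ['N'] : ['X'];
        }
    }
    return $doc;
}

# Hypothetical stand-in for a tagger: reads 'morph' and writes 'tag'.
sub toy_tagger {
    my ($doc) = @_;
    for my $s (@{ $doc->{body} }) {
        for my $w (@{ $s->{tokens} }) {
            $w->{tag} = defined $w->{morph} ? $w->{morph}[0] : 'UNKNOWN';
        }
    }
    return $doc;
}

# Apply each analyzer in order, passing the document along the chain.
sub run_chain {
    my ($doc, @analyzers) = @_;
    $doc = $_->($doc) for @analyzers;
    return $doc;
}

my $doc = { body => [ { tokens => [ { text => 'Haus' }, { text => 'und' } ] } ] };
run_chain($doc, \&toy_morph, \&toy_tagger);
print "$_->{text}\t$_->{tag}\n" for @{ $doc->{body}[0]{tokens} };
```

Order matters in such a chain: running toy_tagger before toy_morph would leave every token tagged 'UNKNOWN', since the property it depends on would not yet exist.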
dta-cab-analyze.perl(1) is a command-line utility for invoking a local persistent analyzer on a document in some supported format.
Server/Client Architectures
The DTA::CAB suite implements two different server/client architectures in order to facilitate shared use of common processing pipelines, and to avoid repeated startup overhead for analyzers with long initialization times. DTA::CAB::Server and DTA::CAB::Client define the abstract server/client API.
XML-RPC Server/Client Protocol
DEPRECATED in favor of raw HTTP.
DTA::CAB::Server::XmlRpc implements a simple XML-RPC HTTP server which handles analysis requests, formulated as XML-RPC procedure calls, for one of a user-specified set of DTA::CAB::Analyzer objects. DTA::CAB::Client::XmlRpc provides a wrapper class for querying such a server. See DTA::CAB::XmlRpcProtocol for a brief overview of the procedures available and an XML-RPCish rehash of the DTA::CAB data model.
The command-line scripts dta-cab-xmlrpc-server.perl(1) and dta-cab-xmlrpc-client.perl(1) implement the (deprecated) XML-RPC server/client protocol.
HTTP Server/Client Protocol
DTA::CAB::Server::HTTP implements a simple HTTP server which can be used to handle analysis requests for one of a user-specified set of DTA::CAB::Analyzer objects. The analysis requests themselves are handled by the DTA::CAB::Server::HTTP::Handler::Query handler class, which interprets incoming GET and/or POST requests as conventional HTTP form data, invokes the specified analyzer on the query document, and returns a formatted document in the HTTP response. DTA::CAB::Client::HTTP provides a wrapper class for querying such a server. Additionally, both HTTP servers and clients support a backwards-compatible XML-RPC mode.
The command-line scripts dta-cab-http-server.perl(1) and dta-cab-http-client.perl(1) implement the HTTP server/client protocol.
CLARIN-D WebLicht Protocol
A running DTA::CAB::Server::HTTP server can be used directly as a CLARIN-D WebLicht web-service by using the "tcf" or "tcf-orth" formats. The "CAB historical text analysis" and "CAB orthographic canonicalizer" WebLicht chain components are implemented in this fashion; see http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/ for details.
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2008-2019 by Bryan Jurish
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.