NAME
DTA::CAB::Analyzer - generic analyzer API
SYNOPSIS
use DTA::CAB::Analyzer;
##========================================================================
## Constructors etc.
$obj = $CLASS_OR_OBJ->new(%args);
undef = $anl->initialize();
undef = $anl->dropClosures();
$label = $anl->defaultLabel();
$class = $anl->analysisClass();
@keys = $anl->typeKeys(\%opts);
##========================================================================
## Methods: I/O
$bool = $anl->ensureLoaded();
$bool = $anl->prepare();
##========================================================================
## Methods: Persistence: Perl
@keys = $class_or_obj->noSaveKeys();
$loadedObj = $CLASS_OR_OBJ->loadPerlRef($ref);
@keys = $class_or_obj->noSaveBinKeys();
$loadedObj = $CLASS_OR_OBJ->loadBinRef($ref);
##========================================================================
## Methods: Analysis: Utils
$bool = $anl->canAnalyze();
$bool = $anl->doAnalyze(\%opts, $name);
$bool = $anl->enabled(\%opts);
$bool = $anl->autoEnable();
undef = $anl->initInfo();
\@analyzers = $anl->subAnalyzers();
##========================================================================
## Methods: Analysis: API
$doc = $anl->analyzeDocument($doc,\%opts);
$doc = $anl->analyzeTypes($doc,\%types,\%opts);
$doc = $anl->analyzeTokens($doc,\%opts);
$doc = $anl->analyzeSentences($doc,\%opts);
$doc = $anl->analyzeLocal($doc,\%opts);
$doc = $anl->analyzeClean($doc,\%opts);
##========================================================================
## Methods: Analysis: Type-wise
\%types = $anl->getTypes($doc);
$doc = $anl->expandTypes($doc,\%types,\%opts);
$doc = $anl->clearTypes($doc);
##========================================================================
## Methods: Analysis: Wrappers
$tok = $anl->analyzeToken($tok_or_string,\%opts);
$sent = $anl->analyzeSentence($sent_or_array,\%opts);
$rpc_xml_base64 = $anl->analyzeData($data_str,\%opts);
##========================================================================
## Methods: Analysis: Closure Utilities
\&closure = $anl->analyzeClosure($which);
\&closure = $anl->getAnalyzeClosure($which);
$closure = $anl->accessClosure( $methodName);
PACKAGE::_am_xlit($tokvar);
PACKAGE::_am_lts($tokvar);
PACKAGE::_am_tt_list($ttvar);
PACKAGE::_am_tt_fst($ttvar);
PACKAGE::_am_id_fst($tokvar, $wvar);
PACKAGE::_am_tt_fst_list($ttvar);
PACKAGE::_am_fst_sort($listvar);
PACKAGE::_am_fst_clean($hashvar);
##========================================================================
## Methods: XML-RPC
\%opts = $anl->mergeOptions(\%defaultOptions,\%userOptions);
@procedures = $anl->xmlRpcMethods();
DESCRIPTION
DTA::CAB::Analyzer is an abstract class and API specification for representing arbitrary semi-independent document analysis algorithms. Each analyzer sub-class should define at least one of the analyzeXYZ() methods (analyzeTypes(), analyzeTokens(), etc.), and each analyzer instance should set a 'name' key. Analyzer objects are assumed to be HASH refs, and should define at least a 'label' key to identify the analyzer object e.g. in a multi-analyzer processing chain.
DTA::CAB::Analyzer inherits from DTA::CAB::Persistent (and thus indirectly from DTA::CAB::Logger), and provides some basic hooks for extending the DTA::CAB::Persistent functionality. These routines are especially useful e.g. for defining analyzer parameters in a configuration file which can be passed to the dta-cab-analyze.perl command-line script via the "-config" option.
See DTA::CAB::Analyzer::Common for a list of common analyzer sub-classes.
See DTA::CAB::Chain for an abstract analyzer class representing simple linear analysis chains (aka "pipelines"), and see DTA::CAB::Chain::Multi for an abstract analyzer class representing a set of named analysis pipelines. Since analysis chains are themselves implemented as subclasses of DTA::CAB::Analyzer, analysis chains may be nested to arbitrary depth (at least in theory).
Constructors etc.
- new
-
$obj = $CLASS_OR_OBJ->new(%args);
%$obj, %args:
label => $label, ##-- analyzer label (default: from class name)
aclass => $class, ##-- analysis class (optional; see $anl->analysisClass() method; default=undef)
typeKeys => \@keys, ##-- analyzer type keys for $anl->typeKeys()
enabled => $bool, ##-- set to false, non-undef value to disable this analyzer
initQuiet => $bool, ##-- if true, initInfo() will not print any output
- initialize
-
undef = $anl->initialize();
Initialize the analyzer. Default implementation does nothing.
- dropClosures
-
undef = $anl->dropClosures();
OBSOLETE: drops '_analyze*' closures. This method is a relic of an obsolete API, and should go away. The method name is still used with (basically) its original semantics by the (unmaintained) subclass DTA::CAB::Analyzer::Dyn.
Currently does nothing.
- defaultLabel
-
$label = $anl->defaultLabel();
Returns default label for this class. Default implementation returns the final segment of the Perl class-name.
- analysisClass
-
$class = $anl->analysisClass();
DEPRECATED: Gets cached $anl->{aclass} if exists, otherwise returns undef. Really just an ugly wrapper for $anl->{aclass}.
This method is an (unused) relic of an abandoned attempt to force all analysis outputs to be bless()ed Perl objects. Try to avoid it.
- typeKeys
-
@keys = $anl->typeKeys(\%opts);
Returns list of type-wise keys to be expanded for this analyzer by expandTypes(). Default returns @{$anl->{typeKeys}} if defined, otherwise ($anl->{label}).
The default is really annoying and potentially dangerous if you're not writing a type-wise analyzer, but most of the current analyzers do operate type-wise, so it was convenient. Override if necessary.
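The fallback rule can be sketched without the library. The snippet below is only an illustration of the documented default; the function name default_type_keys and the labels 'morph', 'xlit' and 'tagger' are made up here and are not part of DTA::CAB.

```perl
use strict;
use warnings;

## stand-in for the documented default: explicit {typeKeys} wins, else fall back to {label}
sub default_type_keys {
  my $anl = shift;
  return defined($anl->{typeKeys}) ? @{$anl->{typeKeys}} : ($anl->{label});
}

my @explicit = default_type_keys({label=>'morph', typeKeys=>['morph','lemma']});
my @fallback = default_type_keys({label=>'xlit'});
## under this rule, an empty (but defined) list suppresses the label fallback
my @none     = default_type_keys({label=>'tagger', typeKeys=>[]});
print join(',', @explicit), " | ", join(',', @fallback), " | ", scalar(@none), "\n";
```

Under this rule, an analyzer that is not type-wise could set typeKeys to an empty list rather than overriding the method; whether that is preferable to an override depends on the subclass.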
Methods: I/O
- ensureLoaded
-
$bool = $anl->ensureLoaded(); $bool = $anl->ensureLoaded(\%opts);
Ensures analysis data is loaded from default files, or that no data is available to be loaded. Should return false only if user has requested data to be loaded and that data cannot be loaded. "Empty" analyzers should return true here.
Default implementation always returns true.
This method is poorly named, and almost entirely useless, since some analyzers require it to be called very early, before other potentially relevant options have been evaluated. Returning false here may cause a host application (e.g. dta-cab-analyze.perl) to die(). Such behavior may be undesirable, however, if no analysis source data (e.g. dictionary files) was found (perhaps because none was defined); see the canAnalyze() and autoDisable() methods for workarounds.
- prepare
-
$bool = $anl->prepare(); $bool = $anl->prepare(\%opts);
Wrapper for ensureLoaded(), autoEnable(), initInfo(). Should probably replace top-level calls to ensureLoaded() in host applications.
Methods: Persistence
- noSaveKeys
-
@keys = $class_or_obj->noSaveKeys();
Returns list of keys not to be saved. Default implementation just greps for CODE-refs.
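As a rough illustration of "greps for CODE-refs", the following sketch filters a plain hash. The hash contents are invented for the example, and the real method operates on the analyzer object itself.

```perl
use strict;
use warnings;

## toy analyzer hash; the keys and values are made up for illustration
my $anl = {label=>'toy', dict=>{Haus=>'house'}, _analyzeToken=>sub { return $_[0]; }};

## keys whose values are CODE refs are reported as "do not save"
my @noSave = grep { ref($anl->{$_}) eq 'CODE' } keys %$anl;

## everything else would remain eligible for saving
my %saved = map { ($_ => $anl->{$_}) } grep { ref($anl->{$_}) ne 'CODE' } keys %$anl;
print "skip: @noSave; keep: ", join(' ', sort keys %saved), "\n";
```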
- loadPerlRef
-
$loadedObj = $CLASS_OR_OBJ->loadPerlRef($ref);
Default implementation just clobbers $CLASS_OR_OBJ with $ref and blesses.
- noSaveBinKeys
-
@keys = $class_or_obj->noSaveBinKeys();
Returns list of keys not to be saved for binary mode. Default just greps for CODE-refs.
- loadBinRef
-
$loadedObj = $CLASS_OR_OBJ->loadBinRef($ref);
Implicitly calls $OBJ->dropClosures().
Methods: Analysis: Utils
- canAnalyze
-
$bool = $anl->canAnalyze(); $bool = $anl->canAnalyze(\%opts);
Returns true iff analyzer can perform its function (e.g. data is loaded & non-empty). Default implementation always returns true.
- doAnalyze
-
$bool = $anl->doAnalyze(\%opts, $name);
Alias for $anl->can("analyze${name}") && (!exists($opts->{"doAnalyze${name}"}) || $opts->{"doAnalyze${name}"}).
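The predicate is easiest to read with a toy class in hand. In the sketch below, Toy::Analyzer is a hypothetical stand-in (it does not inherit from DTA::CAB::Analyzer) that defines only analyzeTokens(), and %opts is assumed to arrive as a HASH ref.

```perl
use strict;
use warnings;

package Toy::Analyzer;   ## hypothetical class, not part of DTA::CAB
sub new { return bless({}, shift); }
sub analyzeTokens { return $_[1]; }   ## the only analyzeXYZ() method defined here
sub doAnalyze {
  my ($anl, $opts, $name) = @_;
  return $anl->can("analyze${name}")
    && (!exists($opts->{"doAnalyze${name}"}) || $opts->{"doAnalyze${name}"});
}

package main;
my $anl = Toy::Analyzer->new;
my $tokens_default  = $anl->doAnalyze({}, 'Tokens') ? 1 : 0;                    ## method exists, no override
my $tokens_disabled = $anl->doAnalyze({doAnalyzeTokens=>0}, 'Tokens') ? 1 : 0;  ## switched off per call
my $sentences       = $anl->doAnalyze({}, 'Sentences') ? 1 : 0;                 ## no analyzeSentences() method
print "$tokens_default $tokens_disabled $sentences\n";
```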
- enabled
-
$bool = $anl->enabled(\%opts);
Returns true if analyzer SHOULD operate, according to %opts. Default returns:
(!defined($anl->{enabled}) || $anl->{enabled}) ##-- globally enabled
&& (!$opts || !defined($opts->{"${lab}_enabled"}) || $opts->{"${lab}_enabled"}) ##-- ... and locally enabled
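A stand-alone sketch of this two-level check follows. It assumes that ${lab} stands for the analyzer's label and that %opts is passed as an optional HASH ref; the function name is_enabled and the label 'morph' are invented for the example.

```perl
use strict;
use warnings;

## stand-alone version of the documented default; $opts is an optional HASH ref
sub is_enabled {
  my ($anl, $opts) = @_;
  my $key = "$anl->{label}_enabled";   ## assumption: ${lab} is the analyzer label
  return (!defined($anl->{enabled}) || $anl->{enabled})        ##-- globally enabled
    && (!$opts || !defined($opts->{$key}) || $opts->{$key});   ##-- ... and locally enabled
}

my $anl = {label=>'morph'};   ## 'morph' is an arbitrary example label
my $default    = is_enabled($anl) ? 1 : 0;                                               ## nothing set anywhere
my $local_off  = is_enabled($anl, {morph_enabled=>0}) ? 1 : 0;                           ## disabled for one request
my $global_off = is_enabled({label=>'morph', enabled=>0}, {morph_enabled=>1}) ? 1 : 0;  ## global flag cannot be overridden
print "$default $local_off $global_off\n";
```

Note that in this reading a per-request option can only switch an analyzer off; it cannot re-enable one whose global flag is false.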
- autoEnable
-
$bool = $anl->autoEnable(); $bool = $anl->autoEnable(\%opts);
Sets $anl->{enabled} flag if not already defined. Calls $anl->canAnalyze(\%opts). Returns new value of $anl->{enabled}. Implicitly calls autoEnable() on all sub-analyzers.
- autoDisable
-
Alias for autoEnable().
- initInfo
-
undef = $anl->initInfo();
Logs initialization info. Default method reports values of {label}, enabled(). Sets $anl->{initQuiet}=1 (don't report multiple times).
- subAnalyzers
-
\@analyzers = $anl->subAnalyzers(); \@analyzers = $anl->subAnalyzers(\%opts)
Returns an ARRAY ref of all sub-analyzers for this object. Default returns all DTA::CAB::Analyzer subclass instances in values(%$anl).
Methods: Analysis: API
- analyzeDocument
-
$doc = $anl->analyzeDocument($doc,\%opts);
Top-level API routine: analyze a DTA::CAB::Document $doc. Default implementation just calls:
$doc = toDocument($doc);
if ($anl->doAnalyze(\%opts,'Types')) {
  $types = $anl->getTypes($doc);
  $anl->analyzeTypes($doc,$types,\%opts);
  $anl->expandTypes($doc,$types,\%opts);
  $anl->clearTypes($doc);
}
$anl->analyzeTokens($doc,\%opts) if ($anl->doAnalyze(\%opts,'Tokens'));
$anl->analyzeSentences($doc,\%opts) if ($anl->doAnalyze(\%opts,'Sentences'));
$anl->analyzeLocal($doc,\%opts) if ($anl->doAnalyze(\%opts,'Local'));
$anl->analyzeClean($doc,\%opts) if ($anl->doAnalyze(\%opts,'Clean'));
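To make the phase ordering concrete, here is a self-contained sketch that mimics the dispatch with a toy class and records which phases run. Toy::Pipeline is hypothetical, skips the toDocument() conversion, and its phase methods do nothing but log their names.

```perl
use strict;
use warnings;

package Toy::Pipeline;   ## hypothetical stand-in, not the real DTA::CAB::Analyzer
my @calls;               ## records the order in which phases run
sub new { return bless({}, shift); }
sub doAnalyze {
  my ($anl, $opts, $name) = @_;
  return $anl->can("analyze${name}")
    && (!exists($opts->{"doAnalyze${name}"}) || $opts->{"doAnalyze${name}"});
}
sub getTypes         { push(@calls, 'getTypes'); return {}; }
sub expandTypes      { push(@calls, 'expandTypes'); return $_[1]; }
sub clearTypes       { push(@calls, 'clearTypes'); return $_[1]; }
sub analyzeTypes     { push(@calls, 'Types'); return $_[1]; }
sub analyzeTokens    { push(@calls, 'Tokens'); return $_[1]; }
sub analyzeSentences { push(@calls, 'Sentences'); return $_[1]; }
sub analyzeDocument {
  my ($anl, $doc, $opts) = @_;
  if ($anl->doAnalyze($opts, 'Types')) {
    my $types = $anl->getTypes($doc);
    $anl->analyzeTypes($doc, $types, $opts);
    $anl->expandTypes($doc, $types, $opts);
    $anl->clearTypes($doc);
  }
  ## remaining phases run in fixed order, each only if defined and not switched off
  foreach my $phase (qw(Tokens Sentences Local Clean)) {
    my $method = "analyze${phase}";
    $anl->$method($doc, $opts) if ($anl->doAnalyze($opts, $phase));
  }
  return $doc;
}

package main;
Toy::Pipeline->new->analyzeDocument({body=>[]}, {doAnalyzeSentences=>0});
my $order = join(' ', @calls);
print "$order\n";
```

In this run the sentence phase is switched off by option, and the Local and Clean phases are skipped because the toy class does not define them.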
- analyzeTypes
-
$doc = $anl->analyzeTypes($doc,\%types,\%opts);
Perform type-wise analysis of all (text) types in \%types (default is $doc->{types}). Default implementation does nothing.
- analyzeTokens
-
$doc = $anl->analyzeTokens($doc,\%opts);
Perform token-wise analysis of all tokens $doc->{body}[$si]{tokens}[$wi]. Default implementation does nothing.
- analyzeSentences
-
$doc = $anl->analyzeSentences($doc,\%opts);
Perform sentence-wise analysis of all sentences $doc->{body}[$si]. Default implementation does nothing.
- analyzeLocal
-
$doc = $anl->analyzeLocal($doc,\%opts);
Perform analyzer-local document-level analysis of $doc. Default implementation does nothing.
- analyzeClean
-
$doc = $anl->analyzeClean($doc,\%opts);
Cleanup any temporary data associated with $doc. Default implementation does nothing.
Methods: Analysis: Type-wise
- getTypes
-
\%types = $anl->getTypes($doc);
Returns a hash
\%types = ($typeText => $typeToken, ...)
mapping token text to basic token objects (with only 'text' key defined). Default implementation just calls $doc->types().
- expandTypes
-
$doc = $anl->expandTypes($doc,\%types,\%opts);
Expands \%types into $doc->{body} tokens. Default implementation just calls $doc->expandTypeKeys(\@typeKeys,\%types), where \@typeKeys is derived from $anl->typeKeys().
- clearTypes
-
$doc = $anl->clearTypes($doc);
Clears cached type->object map in $doc->{types}. Default just calls $doc->clearTypes().
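The type-wise round trip (collect types, analyze each type once, copy results back to tokens) can be sketched on plain hashes. This is an approximation under stated assumptions: the document layout follows $doc->{body}[$si]{tokens}[$wi] as described above, the analysis key 'upper' is invented, and the copy-back step is only a guess at what $doc->expandTypeKeys() does.

```perl
use strict;
use warnings;

## toy document in the $doc->{body}[$si]{tokens}[$wi] layout; plain hashes, not DTA::CAB objects
my $doc = {body=>[
  {tokens=>[{text=>'the'}, {text=>'cat'}]},
  {tokens=>[{text=>'the'}, {text=>'dog'}]},
]};

## getTypes(): one basic token object per distinct text
my %types;
foreach my $sent (@{$doc->{body}}) {
  foreach my $tok (@{$sent->{tokens}}) {
    $types{$tok->{text}} ||= {text=>$tok->{text}};
  }
}

## analyzeTypes(): each type is analyzed exactly once; 'upper' is a made-up analysis key
$_->{upper} = uc($_->{text}) foreach (values %types);

## expandTypes(): copy the type-wise keys back onto every matching token
my @typeKeys = ('upper');
foreach my $sent (@{$doc->{body}}) {
  foreach my $tok (@{$sent->{tokens}}) {
    @$tok{@typeKeys} = @{$types{$tok->{text}}}{@typeKeys};
  }
}

my $ntypes = scalar(keys %types);
print "$ntypes types; last token upper=$doc->{body}[1]{tokens}[1]{upper}\n";
```

The point of the detour is that a repeated word such as 'the' is analyzed once per document rather than once per occurrence.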
Methods: Analysis: Wrappers
- analyzeToken
-
$tok = $anl->analyzeToken($tok_or_string,\%opts);
Compatibility wrapper: perform type- and token-analyses on $tok_or_string. Really just a wrapper for $anl->analyzeDocument().
- analyzeSentence
-
$sent = $anl->analyzeSentence($sent_or_array,\%opts);
Compatibility wrapper: perform type-, token-, and sentence-analyses on $sent_or_array. Really just a wrapper for $anl->analyzeDocument().
- analyzeData
-
$rpc_xml_base64 = $anl->analyzeData($data_str,\%opts);
Analyze a raw (formatted) data string $data_str with internal parsing & formatting. Really just a wrapper for $anl->analyzeDocument().
Methods: Analysis: Closure Utilities (optional)
- analyzeClosure
-
\&closure = $anl->analyzeClosure($which);
Optional utility for closure-based analysis. Returns cached $anl->{"_analyze${which}"} if present; otherwise calls $anl->getAnalyzeClosure($which) & caches result.
- getAnalyzeClosure
-
\&closure = $anl->getAnalyzeClosure($which);
Returns closure \&closure for analyzing data of type "$which" (e.g. Word, Type, Token, Sentence, Document, ...). Default implementation calls the method named "getAnalyze${which}Closure" on $anl if available, otherwise croak()s.
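The lookup-and-cache behavior of these two methods might look roughly like the sketch below. Toy::ClosureAnalyzer and its 'len' analysis are hypothetical, and plain die() is used in place of croak() to keep the example free of extra modules.

```perl
use strict;
use warnings;

package Toy::ClosureAnalyzer;   ## hypothetical class illustrating the caching pattern
my $built = 0;                  ## counts how often a closure is actually constructed
sub new { return bless({}, shift); }
sub getAnalyzeTokenClosure {
  $built++;
  return sub { my $tok = shift; $tok->{len} = length($tok->{text}); return $tok; };
}
sub getAnalyzeClosure {
  my ($anl, $which) = @_;
  my $getsub = $anl->can("getAnalyze${which}Closure")
    or die(ref($anl), ": no closure generator for '$which'");   ## the real method croak()s
  return $getsub->($anl);
}
sub analyzeClosure {
  my ($anl, $which) = @_;
  ## return the cached closure if present, else build and cache it
  return $anl->{"_analyze${which}"} ||= $anl->getAnalyzeClosure($which);
}

package main;
my $anl = Toy::ClosureAnalyzer->new;
my $tok = $anl->analyzeClosure('Token')->({text=>'Haus'});
$anl->analyzeClosure('Token');   ## second call is served from $anl->{_analyzeToken}
print "len=$tok->{len} built=$built\n";
```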
- accessClosure
-
$closure = $anl->accessClosure(\&codeRef, %opts); $closure = $anl->accessClosure( $methodName, %opts); $closure = $anl->accessClosure( $codeString, %opts);
Returns accessor-closure $closure for $anl. Passed argument can be one of the following:
- $codeRef
-
a CODE ref resolves to itself
- $methodName
-
a method name resolves to $anl->can($methodName)
- $codeString
-
any other string resolves to 'sub { $codeString }'; which may reference the closure variable $anl
Additional options for $codeString pseudo-accessors can be passed in %opts:
pre => $prefix, ##-- compiles as "${prefix}; sub {$code}"
vars => \@vars, ##-- compiles as 'my ('.join(',',@vars).'); '."sub {$code}"
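The three resolution rules can be imitated in a few lines. This is a sketch, not the library's code: Toy::Accessor is hypothetical, the test for "looks like a method name" is a guess, and the order in which 'pre' and 'vars' are combined when both are given is an assumption.

```perl
use strict;
use warnings;

package Toy::Accessor;   ## hypothetical class; only mirrors the resolution rules listed above
sub new   { return bless({label=>'toy'}, shift); }
sub label { return $_[0]{label}; }
sub accessClosure {
  my ($anl, $code, %opts) = @_;
  return $code if (ref($code) && ref($code) eq 'CODE');    ##-- CODE ref: resolves to itself
  my $method = ($code =~ /^\w+$/) ? $anl->can($code) : undef;
  return $method if ($method);                              ##-- method name: $anl->can($name)
  ## assumption: prefix code comes first, then variable declarations
  my $pre  = defined($opts{pre}) ? "$opts{pre}; " : '';
  my $vars = $opts{vars} ? 'my ('.join(',', @{$opts{vars}}).'); ' : '';
  my $sub  = eval "${pre}${vars}sub { $code }";             ##-- other string: compile; may reference $anl
  die("could not compile accessor '$code': $@") if (!$sub);
  return $sub;
}

package main;
my $anl = Toy::Accessor->new;
my $by_ref    = $anl->accessClosure(sub { 'ref' })->();
my $by_method = $anl->accessClosure('label')->($anl);
my $by_string = $anl->accessClosure('uc($anl->{label}) . ":" . ++$n', vars=>['$n'])->();
print "$by_ref $by_method $by_string\n";
```

Because code strings are compiled with eval, they should only ever come from trusted configuration.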
Methods: Analysis: Closure Utilities: Macros
In order to facilitate development of analyzer-local accessor code in string form, the following "macros" are defined as exportable functions. Their arguments and return values are strings suitable for inclusion in accessor macros. These macros are exported by the tags ':access', ':child', and ':all'.
- _am_xlit
-
PACKAGE::_am_xlit($tokvar='$_');
access-closure macro: get xlit or text for token $$tokvar; evaluates to a string: ($$tokvar->{xlit} ? $$tokvar->{xlit}{latin1Text} : $$tokvar->{text})
- _am_lts
-
PACKAGE::_am_lts($tokvar='$_');
access-closure macro for first LTS analysis of token $$tokvar; evaluates to string: ($$tokvar->{lts} && @{$$tokvar->{lts}} ? $$tokvar->{lts}[0]{hi} : $$tokvar->{text})
- _am_tt_list
-
PACKAGE::_am_tt_list($ttvar='$_');
access-closure macro for a TT-style list of strings $$ttvar; evaluates to a list: split(/\t/,$$ttvar)
- _am_tt_fst
-
PACKAGE::_am_tt_fst($ttvar='$_');
(formerly multiply defined in sub-packages as SUBPACKAGE::parseFstString())
access-closure macro for a single TT-style FST analysis $$ttvar; evaluates to a FST-analysis hash {hi=>$hi,w=>$w,lo=>$lo,lemma=>$lemma}:
( $$ttvar =~ /^(?:(.*?) \: )?(?:(.*?) \@ )?(.*?)(?: \<([\d\.\+\-eE]+)\>)?$/
  ? {(defined($1) ? (lo=>$1) : qw()), (defined($2) ? (lemma=>$2) : qw()), hi=>$3, w=>($4||0)}
  : {hi=>$$ttvar} )
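To see what the expression yields, the sketch below wraps it in an ordinary function and applies it to two sample strings. The real macro returns this expression as a code string rather than executing it; the function name parse_tt_fst and the sample analyses are invented for the example.

```perl
use strict;
use warnings;

## the documented expression wrapped in an ordinary function (the real macro returns a code string)
sub parse_tt_fst {
  my $tt = shift;
  return ( $tt =~ /^(?:(.*?) \: )?(?:(.*?) \@ )?(.*?)(?: \<([\d\.\+\-eE]+)\>)?$/
    ? {(defined($1) ? (lo=>$1) : qw()), (defined($2) ? (lemma=>$2) : qw()), hi=>$3, w=>($4||0)}
    : {hi=>$tt} );
}

## made-up input with all four parts: "LO : LEMMA @ HI <WEIGHT>"
my $full  = parse_tt_fst('Hauses : Haus @ Haus[_NN][gen][sg] <2.5>');
## made-up input with only the upper side; 'lo' and 'lemma' are omitted, weight defaults to 0
my $plain = parse_tt_fst('Haus[_NN]');

print join(' ', map { "$_=$full->{$_}" } sort keys %$full), "\n";
print join(' ', map { "$_=$plain->{$_}" } sort keys %$plain), "\n";
```

The spaces around ':' and '@' and before '<' are part of the pattern, so input without them is treated as a bare upper-side string.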
- _am_id_fst
-
PACKAGE::_am_id_fst($tokvar='$_', $wvar='0');
access-closure macro for an identity FST analysis; evaluates to a single fst analysis hash: {hi=>_am_xlit($tokvar), w=>$$wvar}
- _am_tt_fst_list
-
PACKAGE::_am_tt_fst_list($ttvar='$_');
access-closure macro for a list of TT-style FST analyses $$ttvar; evaluates to a list of fst analysis hashes: (map {_am_tt_fst('$_')} split(/\t/,$$ttvar))
- _am_tt_fst_eqlist
-
PACKAGE::_am_tt_fst_eqlist($ttvar='$tt', $tokvar='$_', $wvar='0');
access-closure macro for a list of TT-style FST analyses $$ttvar; evaluates to a list of fst analysis hashes: (_am_id_fst($tokvar,$wvar), _am_tt_fst_list($ttvar))
- _am_fst_sort
-
PACKAGE::_am_fst_sort($listvar='@_');
access-closure macro to sort a list of FST analyses $$listvar by weight; evaluates to a sorted list of fst analysis hashes: (sort {($a->{w}||0) <=> ($b->{w}||0) || ($a->{hi}||"") cmp ($b->{hi}||"")} $$listvar)
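The comparator sorts ascending by weight and breaks ties on the 'hi' string. A small run on invented analyses, with the expression applied directly instead of being generated as a code string:

```perl
use strict;
use warnings;

## made-up analyses; a missing weight counts as 0
my @analyses = ({hi=>'b', w=>1.5}, {hi=>'c'}, {hi=>'a', w=>1.5}, {hi=>'d', w=>0.25});

## same comparator as in the macro: weight first, then 'hi' as tie-breaker
my @sorted = sort {($a->{w}||0) <=> ($b->{w}||0) || ($a->{hi}||"") cmp ($b->{hi}||"")} @analyses;

my $order = join(' ', map {$_->{hi}} @sorted);
print "$order\n";
```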
- _am_fst_clean
-
PACKAGE::_am_fst_clean($hashvar='$_->{$lab}');
access-closure macro to delete undefined hash entries; evaluates to: delete($$hashvar) if (!defined($$hashvar));
Methods: XML-RPC
- mergeOptions
-
\%opts = $anl->mergeOptions(\%defaultOptions,\%userOptions);
Returns options hash like (%defaultOptions,%userOptions) [user clobbers default].
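The clobbering semantics amount to a plain hash merge in which later keys win. A minimal sketch, assuming both arguments are optional HASH refs; the function name merge_options and the option 'verbose' are made up for the example.

```perl
use strict;
use warnings;

## hypothetical stand-in for the documented semantics: user keys override defaults
sub merge_options {
  my ($defaults, $user) = @_;
  return +{ %{$defaults||{}}, %{$user||{}} };
}

my $opts = merge_options({doAnalyzeTokens=>1, verbose=>0}, {verbose=>1});
print join(' ', map { "$_=$opts->{$_}" } sort keys %$opts), "\n";
```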
- xmlRpcMethods
-
@procedures = $anl->xmlRpcMethods(); @procedures = $anl->xmlRpcMethods($prefix,\%opts);
Returns a list of procedures suitable for passing to RPC::XML::Server::add_proc().
Additional keys recognized in procedure specs: see DTA::CAB::Server::XmlRpc::prepareLocal().
"${prefix}." is prepended to the procedure 'name' key if $prefix is specified.
\%opts are passed to analyze methods if defined.
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2009-2019 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.
SEE ALSO
dta-cab-analyze.perl(1), DTA::CAB::Analyzer::Common(3pm), DTA::CAB::Chain(3pm), DTA::CAB::Chain::Multi(3pm), DTA::CAB::Format(3pm), DTA::CAB(3pm), perl(1), ...