NAME

DiaColloDB::Relation::TDF - collocation db, profiling relation: (term x document) raw-frequency matrix

SYNOPSIS

##========================================================================
## PRELIMINARIES

use DiaColloDB::Relation::TDF;

##========================================================================
## Constructors etc.

$rel = CLASS_OR_OBJECT->new(%args);

##========================================================================
## TDF API: Utils

$vtype = $rel->vtype();
$itype = $rel->itype();
$packas = $rel->vpack();
$packas = $rel->ipack();

##========================================================================
## Persistent API: disk usage

@files = $rel->diskFiles();

##========================================================================
## Persistent API: header

@keys = $rel->headerKeys();
$hdr = $rel->headerData();

##========================================================================
## Relation API: open/close

$rel_or_undef = $rel->open($base);
$rel_or_undef = $rel->close();
$bool = $rel->opened();

##========================================================================
## Relation API: creation

$rel = CLASS_OR_OBJECT->create($coldb,$tokdat_file,%opts);
$rel = CLASS_OR_OBJECT->union($coldb, \@dbargs, %opts);

##========================================================================
## Relation API: info

\%info = $rel->dbinfo($coldb);

##========================================================================
## Relation API: profiling

$mprf   = $rel->profile($coldb, %opts);
$mprf   = $rel->extend($coldb, %opts);
$mpdiff = $rel->compare($coldb, %opts);

##========================================================================
## Profile: Utils: PDL-based profiling

$mprf = $rel->vprofile($coldb, \%opts);

##========================================================================
## Profile: Utils: domain sizes

$NT = $rel->nTerms();
$ND = $rel->nDocs();
$NC = $rel->nFiles();
$NA = $rel->nAttrs();
$NM = $rel->nMeta();

##========================================================================
## Profile: Utils: attribute positioning

\%tpos = $rel->tpos();
\%mpos = $rel->mpos();

##========================================================================
## Profile: Utils: query parsing & evaluation

$idPdl    = $rel->idpdl($idPdl);
$tupleIds = $rel->tupleIds($attrType, $attrName, $valIdsPdl);
$ti       = $rel->termIds($tattrName, $valIdsPDL);
$ci       = $rel->catIds($mattrName, $valIdsPDL);

$bool          = $rel->hasMeta($mattr);
$enum_or_undef = $rel->metaEnum($mattr);

$cats          = $rel->catSubset($terms);

\%groupby      = $rel->groupby($coldb, $groupby_request, %opts);

##========================================================================
## Relation API: default: query info

\%qinfo = $rel->qinfo($coldb, %opts);

DESCRIPTION

DiaColloDB::Relation::TDF is a DiaColloDB::Relation subclass for document-level co-occurrence frequencies using PDL to efficiently store and query a sparse underlying (term x document) frequency matrix via the PDL::CCS package.

Supports Boolean expressions over both term- and document-level conditions (the latter via DDC #has[ATTRIBUTE,VALUE] or #has[ATTRIBUTE,/REGEX/] syntax) as well as grouping via literal indexed term- and/or document-level attributes.

An earlier version of this module was implemented as DiaColloDB::Relation::Vsem ("vector-space distributional semantic index").

Globals & Constants

Variable: @ISA

DiaColloDB::Relation::TDF inherits from DiaColloDB::Relation.

Constructors etc.

new
$rel = CLASS_OR_OBJECT->new(%args);

%args, object structure:

##-- user options
base   => $basename,   ##-- relation basename
flags  => $flags,      ##-- i/o flags (default: 'r')
mgood  => $regex,      ##-- positive filter regex for metadata attributes
mbad   => $regex,      ##-- negative filter regex for metadata attributes
submax => $submax,     ##-- choke on requested tdm cross-subsets if dense subset size ($NT_sub * $ND_sub) > $submax; default=2**29 (512M)
mquery => \%mquery,    ##-- qinfo templates for meta-fields (default: textClass hack for genre): ($mattr=>$TEMPLATE, ...)
##
##-- logging options
logvprofile => $level, ##-- log-level for vprofile() (default=undef:none)
logio => $level,       ##-- log-level for low-level I/O operations (default=undef:none)
##
##-- modelling options (formerly via DocClassify)
minFreq    => $fmin,   ##-- minimum total term-frequency for model inclusion (default=undef:use $coldb->{tfmin})
minDocFreq => $dfmin,  ##-- minimum "doc-frequency" (#/docs per term) for model inclusion (default=4)
minDocSize => $dnmin,  ##-- minimum doc size (#/tokens per doc) for model inclusion (default=4; formerly $coldb->{vbnmin})
maxDocSize => $dnmax,  ##-- maximum doc size (#/tokens per doc) for model inclusion (default=inf; formerly $coldb->{vbnmax})
vtype      => $vtype,  ##-- PDL::Type for storing compiled values (default=float; auto-promoted if required)
itype      => $itype,  ##-- PDL::Type for storing compiled integers (default=long)
##
##-- guts: aux: info
N => $tdm0Total,       ##-- total number of (doc,term) frequencies counted
dbreak => $dbreak,     ##-- inherited from $coldb on create()
##
##-- guts: aux: term-tuples ($NA:number of term-attributes, $NT:number of term-tuples)
attrs  => \@attrs,       ##-- known term attributes
tvals  => $tvals,        ##-- pdl($NA,$NT) : [$apos,$ti] => $avali_at_term_ti
tsorti => $tsorti,       ##-- pdl($NT,$NA) : [,($apos)]  => $tvals->slice("($apos),")->qsorti
tpos   => \%a2pos,       ##-- term-attribute positions: $apos=$a2pos{$aname}
##
##-- guts: aux: metadata ($NM:number of meta-attributes, $NC:number of cats (source files))
meta => \@mattrs,        ##-- known metadata attributes
meta_e_${ATTR} => $enum, ##-- metadata-attribute enum
mvals => $mvals,         ##-- pdl($NM,$NC) : [$mpos,$ci] => $mvali_at_ci
msorti => $msorti,       ##-- pdl($NC,$NM) : [,($mpos)]  => $mvals->slice("($mpos),")->qsorti
mpos  => \%m2pos,        ##-- meta-attribute positions: $mpos=$m2pos{$mattr}
##
##-- guts: model (formerly via DocClassify dcmap=>$dcmap)
tdm => $tdm,             ##-- term-doc matrix : PDL::CCS::Nd ($NT,$ND): [$ti,$di] -> f($ti,$di)
tym => $tym,             ##-- term-year matrix: PDL::CCS::Nd ($NT,$NY): [$ti,$yi] -> f($ti,$yi)
cf  => $cf_pdl,          ##-- cat-freq pdl:     dense:       ($NC)    : [$ci]     -> f($ci)
c2date => $c2date,       ##-- cat-dates   : dense ($NC)   : [$ci]   -> $date
c2d    => $c2d,          ##-- cat->doc map: dense (2,$NC) : [*,$ci] -> [$di_off,$di_len]
d2c    => $d2c,          ##-- doc->cat map: dense ($ND)   : [$di]   -> $ci
#...
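
The submax size guard and the layout of the c2d and d2c maps can be illustrated in plain Perl without PDL. This is only a sketch under assumptions: the helper names subset_too_big() and d2c_from_c2d() are hypothetical and not part of the module, plain arrays stand in for the dense piddles, and the sample data are made up.

```perl
use strict;
use warnings;

## Hypothetical helper: true if a dense ($nt_sub x $nd_sub) subset exceeds $submax cells.
sub subset_too_big {
  my ($nt_sub, $nd_sub, $submax) = @_;
  $submax //= 2**29;                  ##-- documented default (512M)
  return ($nt_sub * $nd_sub) > $submax;
}

## Hypothetical helper: derive a doc->cat map from a cat->doc map.
## $c2d->[$ci] = [$di_off, $di_len], following the c2d layout listed above.
sub d2c_from_c2d {
  my ($c2d) = @_;
  my @d2c;
  for my $ci (0 .. $#$c2d) {
    my ($off, $len) = @{ $c2d->[$ci] };
    $d2c[$_] = $ci for $off .. $off + $len - 1;
  }
  return \@d2c;
}

my $c2d = [ [0,2], [2,1], [3,3] ];    ##-- made-up data: 3 cats covering docs 0..5
my $d2c = d2c_from_c2d($c2d);
print "d2c: @$d2c\n";                 ##-- prints "d2c: 0 0 1 2 2 2"
print "subset too big\n" if subset_too_big(100_000, 10_000);
```

Each category owns one contiguous run of documents, so the doc->cat direction is fully determined by the (offset,length) pairs.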

TDF API: Utils

vtype
$vtype = $rel->vtype();

get PDL::Type value type for storing compiled values.

itype
$itype = $rel->itype();

get PDL::Type integer type for storing compiled indices.

vpack
$packas = $rel->vpack();

pack-template for $rel->vtype(), e.g. "f*"

ipack
$packas = $rel->ipack();

pack-template for $rel->itype(), e.g. "l*"
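
The templates returned by vpack() and ipack() are ordinary templates for Perl's builtin pack() and unpack(). The following minimal round trip uses the two example templates quoted above. The sample values are made up, and it is an assumption that the relation is using its default float and long types.

```perl
use strict;
use warnings;

my @ids  = (0, 42, -7);          ##-- integers, as for the example template "l*"
my @vals = (0.5, 1.5, 2);        ##-- chosen to be exactly representable as single-precision floats

my $ibuf = pack('l*', @ids);     ##-- signed 32-bit integers, 4 bytes each
my $vbuf = pack('f*', @vals);    ##-- native single-precision floats

my @ids2  = unpack('l*', $ibuf);
my @vals2 = unpack('f*', $vbuf);

print length($ibuf), " bytes: @ids2\n";
print length($vbuf), " bytes: @vals2\n";
```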

Persistent API: disk usage

diskFiles
@files = $rel->diskFiles();

returns disk storage files, used by du() and timestamp()

Persistent API: header

headerKeys
@keys = $rel->headerKeys();

keys to save as header; override includes qw(meta attrs vtype itype) and excludes logging and i/o keys.

headerData
$hdr = $rel->headerData();

returns reference to object header data; override stringifies {itype} and {vtype} keys.

Relation API: open/close

open
$rel_or_undef = $rel->open($base);
$rel_or_undef = $rel->open($base,$flags);
$rel_or_undef = $rel->open();

Opens underlying index files.

close
$rel_or_undef = $rel->close();

Closes underlying index files.

opened
$bool = $rel->opened();

Returns true iff index is opened. Really just checks for $rel->{tdm}.

Relation API: creation

create
$rel = CLASS_OR_OBJECT->create($coldb,$tokdat_file,%opts);

Populates relation index for $coldb. Requires:

  • (temporary, tied) doc-arrays @$coldb{qw(docmeta docoff)}

  • temp file "$coldb->{dbdir}/vtokens.bin": pack($coldb->{pack_w}, @wattrs)

    OR

    wdmfile=>$wdmfile option

%opts: clobber %$rel, also:

docmeta =>\@docmeta, ##-- for union(): override $coldb->{docmeta}
                     ##   $docmeta[$ci] = {id=>$id, nsigs=>$nsigs, file=>$rawfile, date=>$date, label=>$label, meta=>\%meta}
wdmfile =>$wdmfile,  ##-- for union(): txt ~ "$ai0 $ai1 ... $aiN $doci $f"; default is generated from 'vtokens.bin'
ivalmax =>$imax,     ##-- for union(): maximum integer value (for auto-promotion)
reusedir=>$bool,     ##-- for union(): set to true if we're running in a "clean" directory
logas   =>$logas,    ##-- log label (default: 'create()')

union
$rel = CLASS_OR_OBJECT->union($coldb, \@dbargs, %opts);

Merges multiple TDF indices into a new object. \@dbargs is an ARRAY-ref of DiaColloDB sub-objects ($coldb,...) containing {tdf} relations to be merged.

%opts: clobber %$rel

Current implementation just creates temp-files utdm0.dat and udocmeta.tmp and then calls create().

Relation API: info

dbinfo
\%info = $rel->dbinfo($coldb);

embedded info-hash for $coldb->dbinfo()

Relation API: profiling

profile
$mprf = $rel->profile($coldb, %opts);

Get a relation profile for selected items as a DiaColloDB::Profile::Multi object. %opts are as for DiaColloDB::Relation::profile(). Really just a wrapper for the vprofile() method.

extend
$mprf = $rel->extend($coldb, %opts);

Get independent f2 frequencies for $opts{slice2keys} as a DiaColloDB::Profile::Multi object.

compare
$mpdiff = $rel->compare($coldb, %opts);

Get a relation comparison profile for selected items as a DiaColloDB::Profile::MultiDiff object. %opts are as for DiaColloDB::Relation::compare(), which this method calls after parsing the groupby option via $rel->groupby($coldb, $opts{groupby}, relax=>0).

Profile: Utils: PDL-based profiling

vprofile
\@pprfs = $rel->vprofile($coldb, \%opts);

Guts for the profile() method. User options in %opts are as for DiaColloDB::Relation::profile(). Additional keys are populated and used in the course of the computation (so don't set them):

vq      => $vq,        ##-- parsed query, DiaColloDB::Relation::TDF::Query object
groupby => \%groupby,  ##-- as returned by $rel->groupby($coldb, \%opts)
dlo     => $dlo,       ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
dhi     => $dhi,       ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
dslo    => $dslo,      ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
dshi    => $dshi,      ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);

Profile: Utils: domain sizes

nTerms
$NT = $rel->nTerms();

returns number of indexed terms.

nDocs
$ND = $rel->nDocs();

returns number of indexed documents (breaks).

nFiles
$NC = $rel->nFiles();

returns number of indexed categories (original source files).

nAttrs
$NA = $rel->nAttrs();

returns number of indexed term-attributes.

nMeta
$NM = $rel->nMeta();

returns number of indexed meta-attributes.

Profile: Utils: attribute positioning

tpos
\%tpos = $rel->tpos();
$tpos  = $rel->tpos($tattr);

In the first form, get or build the term-attribute position lookup hash. In the second form, get the index position along dimension $NA of the term-attribute named $tattr, or undef if $tattr is not a known term attribute.

mpos
\%mpos = $rel->mpos();
$mpos  = $rel->mpos($mattr);

In the first form, get or build the meta-attribute position lookup hash. In the second form, get the index position along dimension $NM of the meta-attribute named $mattr, or undef if $mattr is not a known metadata attribute.
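
The position lookups behave like a name-to-index hash over the known attribute list. A minimal plain-Perl sketch follows, assuming a made-up attribute list; the sub apos() is a hypothetical stand-in for the second calling form and is not the module's code.

```perl
use strict;
use warnings;

my @attrs = qw(l p);                                   ##-- made-up term-attribute list
my %a2pos = map { ($attrs[$_] => $_) } 0 .. $#attrs;   ##-- $apos = $a2pos{$aname}

## Hypothetical analogue of $rel->tpos($tattr): position, or undef if unknown.
sub apos {
  my ($name) = @_;
  return $a2pos{$name};
}

for my $name (qw(l p nosuch)) {
  my $pos = apos($name);
  print "$name => ", (defined $pos ? $pos : 'undef'), "\n";
}
```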

Profile: Utils: query parsing & evaluation

idpdl
$idPdl = $rel->idpdl($idPdl);
$idPdl = $rel->idpdl(\@ids);
$idPdl = $rel->idpdl($id);

Ensure PDL-ness of a set of integer IDs.

tupleIds
$tupleIds = $rel->tupleIds($attrType, $attrName, $valIds);

Returns a PDL representing the set of index items of type $attrType whose value for the $attrName attribute is contained in the ID-set $valIds, which may be specified in any of the forms accepted by the idpdl() method.

$attrType is either 't' for a term-attribute (in which case the returned $tupleIds are term indices), or 'm' for a metadata attribute (in which case the returned $tupleIds are "category" indices). The returned $tupleIds are always sorted in ascending order.

Could use some optimization.
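
The selection semantics described above can be sketched in plain Perl. This is an illustration only: tuple_ids() is a hypothetical helper, a plain array stands in for one attribute's row of the value matrix, and the linear scan does not use the tsorti or msorti sort indices.

```perl
use strict;
use warnings;

## Hypothetical analogue of tupleIds(): indices whose attribute value lies in the ID-set.
## $avals->[$i] is the value-ID of the attribute for index item $i.
sub tuple_ids {
  my ($avals, $val_ids) = @_;
  my %want = map { ($_ => 1) } @$val_ids;
  return [ grep { $want{ $avals->[$_] } } 0 .. $#$avals ];   ##-- ascending by construction
}

my @lemma_ids = (3, 1, 3, 2, 1);          ##-- made-up value-IDs for 5 term-tuples
my $ti = tuple_ids(\@lemma_ids, [1, 3]);
print "@$ti\n";                           ##-- prints "0 1 2 4"
```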

termIds
$ti = $rel->termIds($tattrName, $valIds);

wraps $rel->tupleIds('t',$tattrName,$valIds).

catIds
$ci = $rel->catIds($mattrName, $valIds);

wraps $rel->tupleIds('m',$mattrName,$valIds).

hasMeta
$bool = $rel->hasMeta($mattr);

returns true iff $rel supports metadata attribute $mattr.

metaEnum
$enum_or_undef = $rel->metaEnum($mattr);

returns metadata attribute enum for $mattr, or undef if $mattr is not supported.

catSubset
$cats = $rel->catSubset($termIds);
$cats = $rel->catSubset($termIds,$catIds);

Get a (sorted) cat-subset for the (sorted) term-set $termIds: the set of all "categories" (original source files) which contain at least one instance of any of the terms in $termIds, optionally restricted to the (sorted and unique) set $catIds. The returned category-IDs are sorted and unique.
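
As an illustration of the set being computed, the following plain-Perl sketch derives a category subset from a toy sparse matrix. It rests on assumptions: cat_subset() is a hypothetical helper, the matrix is a list of [$ti,$di,$f] triples rather than a PDL::CCS::Nd object, and the data are made up.

```perl
use strict;
use warnings;

## Hypothetical analogue of catSubset().
##  $tdm : sparse (term x doc) matrix as a list of [$ti, $di, $f] triples
##  $d2c : doc->cat map, $d2c->[$di] = $ci
sub cat_subset {
  my ($tdm, $d2c, $term_ids, $cat_ids) = @_;
  my %terms = map { ($_ => 1) } @$term_ids;
  my %allow = $cat_ids ? (map { ($_ => 1) } @$cat_ids) : ();
  my %cats;
  for my $nz (@$tdm) {
    my ($ti, $di, $f) = @$nz;
    next unless $terms{$ti} && $f > 0;        ##-- only requested terms with nonzero frequency
    my $ci = $d2c->[$di];
    next if $cat_ids && !$allow{$ci};         ##-- optional restriction to $cat_ids
    $cats{$ci} = 1;
  }
  return [ sort { $a <=> $b } keys %cats ];   ##-- sorted and unique
}

my $tdm = [ [0,0,2], [1,1,1], [0,3,5], [2,4,1] ];
my $d2c = [ 0, 0, 1, 2, 2 ];
print "@{ cat_subset($tdm, $d2c, [0]) }\n";          ##-- prints "0 2"
print "@{ cat_subset($tdm, $d2c, [0,1], [2]) }\n";   ##-- prints "2"
```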

groupby
\%groupby = $rel->groupby($coldb, $groupby_request, %opts);
\%groupby = $rel->groupby($coldb, \%groupby,        %opts);

Modified version of DiaColloDB::groupby() suitable for the PDL-ized TDF relation. $groupby_request is as for DiaColloDB::parseRequest(). Returns a HASH-ref:

##-- COMPAT: equivalent to DiaColloDB::groupby() return values
req => $request,    ##-- save request
areqs => \@areqs,   ##-- parsed attribute requests ([$attr,$ahaving, \%ainfo],...)
                    ##   + new: %ainfo = ( aname=>$enum_name, atype=>$t_or_m, apos=>$apos )
attrs => \@attrs,   ##-- like $coldb->attrs($groupby_request), modulo "having" parts
titles => \@titles, ##-- like map {$coldb->attrTitle($_)} @attrs
##
##-- NEW: for DiaColloDB::Relation::TDF
how      => $ghow,     ##-- one of  't':groupby terms-only, 'c':groupby cats-only, 'tc':groupby terms+cats
gatype   => $gatype,   ##-- pdl ($NG)         : attribute types $ai : 0 if $areqs->[$ai] is a term attribute, 1 if meta-attribute
gapos    => $gapos,    ##-- pdl ($NG)         : term- or meta-attribute position indices $ai : $rel->mpos($attrs[$ai]) or $rel->tpos($attrs[$ai])
ghavingt => $ghavingt, ##-- pdl ($NHavingTOk) : term indices $ti s.t. $ti matches groupby "having" requests, or undef
ghavingc => $ghavingc, ##-- pdl ($NHavingCOk) : cat  indices $ci s.t. $ci matches groupby "having" requests, or undef
g2s      => \&g2s,     ##-- stringification object suitable for DiaColloDB::Profile::stringify() [CODE,enum, or undef]
gpack    => $packas,   ##-- pack template for groupby-keys

%opts:

warn  => $level,    ##-- log-level for unknown attributes (default: 'warn')
relax => $bool,     ##-- allow unsupported attributes (default=0)
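
The relationship between the how, gatype and gapos keys can be sketched in plain Perl. This is a simplified illustration, not the module's code: group_positions() is a hypothetical helper, plain arrays stand in for the piddles, the attribute names are made up, and dying on an unknown attribute is only an assumption about strict (relax=>0) behaviour.

```perl
use strict;
use warnings;

## Hypothetical helper: classify groupby attributes as term (0) or meta (1) attributes.
sub group_positions {
  my ($attrs, $tpos, $mpos) = @_;
  my (@gatype, @gapos);
  for my $attr (@$attrs) {
    if    (exists $tpos->{$attr}) { push @gatype, 0; push @gapos, $tpos->{$attr}; }
    elsif (exists $mpos->{$attr}) { push @gatype, 1; push @gapos, $mpos->{$attr}; }
    else                          { die "unsupported groupby attribute '$attr'\n"; }
  }
  my $has_t = grep { $_ == 0 } @gatype;
  my $has_c = grep { $_ == 1 } @gatype;
  my $how   = ($has_t ? 't' : '') . ($has_c ? 'c' : '');
  return { how => $how, gatype => \@gatype, gapos => \@gapos };
}

my %tpos = (l => 0, p => 1);          ##-- made-up term-attribute positions
my %mpos = (genre => 0);              ##-- made-up meta-attribute positions
my $gb   = group_positions([qw(l genre)], \%tpos, \%mpos);
print "how=$gb->{how} gatype=@{$gb->{gatype}} gapos=@{$gb->{gapos}}\n";
```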

Relation API: default: query info

qinfo
\%qinfo = $rel->qinfo($coldb, %opts);

Get query-info hash for profile administrivia (DDC hit links). %opts are as for the profile() method. The returned hash \%qinfo should have keys:

fcoef     => $fcoef,     ##-- frequency coefficient (constant 1 here)
qtemplate => $qtemplate, ##-- query template with __W1.I1__ resp. __W2.I2__ replacing groupby fields
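
A client consuming qtemplate has to replace the placeholders with concrete values. The following generic placeholder-filling sketch is an assumption-laden illustration: fill_template() is hypothetical, the template string and values are made up rather than produced by the module, and any escaping a real client would need is omitted.

```perl
use strict;
use warnings;

## Hypothetical helper: replace __W1.In__ / __W2.In__ placeholders by values from %$vals.
## Placeholders without a value are left untouched.
sub fill_template {
  my ($template, $vals) = @_;
  (my $q = $template) =~ s/__(W[12]\.I\d+)__/ exists $vals->{$1} ? $vals->{$1} : "__${1}__" /ge;
  return $q;
}

my $qtemplate = '$l=__W1.I1__ && $l=__W2.I1__';   ##-- made-up template for illustration
print fill_template($qtemplate, { 'W1.I1' => 'Haus', 'W2.I1' => 'alt' }), "\n";
```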

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2020 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

DiaColloDB::Relation(3pm), DiaColloDB::Relation::TDF::Query(3pm), DiaColloDB::Relation::Cofreqs(3pm), DiaColloDB::Relation::Unigrams(3pm), DiaColloDB::Relation::DDC(3pm), DiaColloDB(3pm), perl(1), ...