NAME
DiaColloDB::Client::list - diachronic collocation db: client: distributed
DESCRIPTION
DiaColloDB::Client::list is a subclass of DiaColloDB::Client for accessing a set of distributed DiaColloDB databases via a list:// URL whose path part is a whitespace- or "+"-separated list of sub-URLs supported by DiaColloDB::Client. It implements the DiaColloDB::Client API by calling the relevant methods on each of its sub-clients.
new() options and object structure:
##-- DiaColloDB::Client: options
url => $url, ##-- list url (sub-urls, separated by whitespace, "+SCHEME://", or "+://")
##
##-- DiaColloDB::Client::list
urls => \@urls, ##-- sub-urls
opts => \%opts, ##-- sub-client options
fudge => $fudge, ##-- get ($fudge*$kbest) items from sub-clients (-1:all; 0|1:none; default=10)
fork => $bool, ##-- run each subclient query in its own fork? (default=if available)
lazy => $bool, ##-- use temporary on-demand sub-clients (true,default) or persistent sub-clients (false)
extend => $bool, ##-- use extend() queries to acquire correct f2 counts? (default=true)
logFudge => $level, ##-- log-level for fudge-coefficient debugging (default='debug')
logThread => $level, ##-- log-level for thread operations (default='none')
##
##-- guts
#clis => \@clis, ##-- per-url sub-clients for "busy" (non-"lazy") mode
The most important client parameter is the fudge-coefficient option fudge=>$fudge, which requests that up to $fudge*$kbest items be retrieved from each sub-client for every profile() call. If $fudge < 0, all collocates are retrieved from each sub-client, and trimming is performed exclusively by the superordinate DiaColloDB::Client::list object. If $fudge == 0, only the $kbest collocates are retrieved from each sub-client. The default value of 10 should return reasonable results without too great a performance penalty in most cases, but be aware that the results for $fudge > 0 may not be strictly correct due to local pruning by the sub-clients; see "KNOWN BUGS" for details.
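The effect of the fudge coefficient on the per-sub-client request size can be sketched as follows. This is only an illustration of the rules described above: sub_kbest() is a hypothetical helper, not part of the DiaColloDB API, and the use of -1 to mean "all collocates" is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DiaColloDB): number of items to request
# from each sub-client for a given fudge coefficient and $kbest.
# A return value of -1 stands for "all collocates" (assumption of this sketch).
sub sub_kbest {
  my ($fudge, $kbest) = @_;
  return -1     if $fudge < 0;   # negative: retrieve everything, trim in the list client
  return $kbest if $fudge <= 1;  # 0 or 1: no fudging, request just $kbest items
  return $fudge * $kbest;        # otherwise: request up to $fudge*$kbest items
}

# Show the request size for a few fudge values with $kbest == 10
printf "fudge=%2d -> request %d items\n", $_, sub_kbest($_, 10) for (-1, 0, 1, 10);
```

Larger fudge values make it less likely that a globally relevant collocate is pruned locally, at the cost of transferring and merging more items per sub-client.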
This module supports parallel processing of sub-client queries using whatever threading implementation (if any) is provided by the DiaColloDB::threads module. Parallel sub-client processing is enabled by default if a working threads or forks module was found by DiaColloDB::threads, but can be disabled by passing the fork=>0 option to the list-client.
List URLs
List URLs passed as the url option to the constructor can be either ARRAY-refs of sub-URLs or simple strings with an optional list:// scheme. In the latter case, sub-URLs in the argument string are separated by whitespace or by a plus character ("+") followed by the sub-URL scheme, e.g.:
["file://a","file://b"] ##-- ARRAY-ref of explicit file URLs
["a" , "b" ] ##-- ARRAY-ref of implicit file URLs
"list://file://a file://b" ##-- string with space-separated explicit file URLs
"list://a b" ##-- string with space-separated implicit file URLs
"list://file://a+file://b" ##-- list with "+"-separated explicit file URLs
"list://a+://b" ##-- list with "+"-separated implicit file URLs
Options can be passed to the appropriate sub-URLs via those URLs' query strings, as described in "open" in DiaColloDB::Client. Options to the DiaColloDB::Client::list object itself can be passed in by using a sub-URL consisting of a HASH-ref or only a query string, e.g.:
["a","b",{fudge=>0}] ##-- ARRAY-ref with local options as HASH-ref
["a","b","?fudge=0"] ##-- ARRAY-ref with local options as query-string
"list://a b ?fudge=0" ##-- space-separated string with local options
"list://a+://b+://?fudge=0" ##-- "+"-separated string with local options
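The list URL forms above can be illustrated with a small stand-alone sketch. It is not the module's own parser: split_list_url() and partition_items() are hypothetical helpers, the splitting regex is an approximation of the separator rules described above, and query strings are parsed without URL-decoding.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's own parser): split a list URL into items.
sub split_list_url {
  my ($url) = @_;
  return @$url if ref($url) eq 'ARRAY';   # ARRAY-refs are used as-is
  (my $path = $url) =~ s{^list://}{};     # drop the optional list:// scheme
  # split on whitespace, or on "+" when followed by "SCHEME://" or "://"
  my @items = split m{\s+|\+(?=[A-Za-z0-9.\-]*://)}, $path;
  s{^://}{} for @items;                   # "+://b" denotes the implicit URL "b"
  return grep { $_ ne '' } @items;
}

# Hypothetical helper: separate sub-URLs from options for the list client itself
# (HASH-refs, or items consisting only of a query string).
sub partition_items {
  my (@urls, %opts);
  for my $item (@_) {
    if (ref($item) eq 'HASH') {
      @opts{keys %$item} = values %$item;
    } elsif ($item =~ m{^\?(.*)$}s) {
      my $query = $1;
      for my $pair (split /[&;]/, $query) {
        my ($key, $val) = split /=/, $pair, 2;
        $opts{$key} = $val;
      }
    } else {
      push @urls, $item;                  # ordinary sub-URL, passed on unchanged
    }
  }
  return (\@urls, \%opts);
}

for my $url ("list://file://a+file://b", "list://a+://b+://?fudge=0", ["a", "b", {fudge => 0}]) {
  my ($urls, $opts) = partition_items(split_list_url($url));
  print join(' ', @$urls), " | ", join(',', map { "$_=$opts->{$_}" } sort keys %$opts), "\n";
}
```

Query strings attached to a sub-URL (rather than standing alone) are left in place, since those options are meant for the sub-client.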
KNOWN BUGS
Incorrect Independent Collocate Frequencies
Prior to the introduction of extend() queries in DiaColloDB v0.11.000, list-clients were apt to return incorrect independent collocate frequencies f2 whenever the queried subcorpora were not explicitly partitioned by date, even with $fudge=-1. Although the reported joint frequencies f12 ought to have been correct in this case, the independent collocate frequencies f2 could easily be misreported, leading to incorrect values for f2-sensitive association scores such as milf (pointwise mutual information * log-frequency product), ll (log likelihood), or the default ld (log Dice). Such errors occurred whenever the pre-v0.11.000 list client accessed multiple sub-clients (e.g. $a and $b) and some candidate collocate $v occurred in both subcorpora, but occurred together with the target term $w in only one of the sub-clients' indices.
Suppose $v occurs in subcorpus $a with frequency f_a($v) and in subcorpus $b with frequency f_b($v), but occurs together with $w only in subcorpus $a, with frequency f_a($w,$v); i.e. f_b($w,$v)==0. Since only collocates with nonzero co-occurrence frequencies are collected in subcorpus profiles, the sub-profile for $w over subcorpus $b will not contain an entry for $v at all. This is fine if we are only interested in the total co-occurrence frequency f($w,$v) = f_a($w,$v) + f_b($w,$v). If we are using an "interesting" association score, however, we also need the total independent collocate frequency f($v) = f_a($v) + f_b($v). Since f_b($v) will not have been reported by the sub-profile for subcorpus $b, its value will be treated as 0 (zero), leading to an incorrect estimate of the association score.
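The scenario can be made concrete with invented numbers. The counts below are purely hypothetical, and log_dice() uses the commonly cited log Dice formula 14 + log2(2*f12/(f1+f2)), which is assumed here for illustration rather than taken from the DiaColloDB sources.

```perl
use strict;
use warnings;

# Hypothetical per-subcorpus counts for target $w and collocate $v
my %f_v  = (a => 100, b => 900);  # independent frequencies f_a($v), f_b($v)
my %f_wv = (a => 10,  b => 0);    # joint frequencies f_a($w,$v), f_b($w,$v)
my $f_w  = 50;                    # total frequency of $w (assumed)

# Naive merge: a subcorpus contributes f2 only if its sub-profile contains $v,
# i.e. only if its joint frequency is nonzero.
my ($f12, $f2_naive, $f2_true) = (0, 0, 0);
for my $sub (sort keys %f_v) {
  $f12      += $f_wv{$sub};
  $f2_naive += $f_v{$sub} if $f_wv{$sub} > 0;
  $f2_true  += $f_v{$sub};
}

# Commonly cited log Dice formula (assumed for this sketch)
sub log_dice {
  my ($f1, $f2, $f12) = @_;
  return 14 + log(2 * $f12 / ($f1 + $f2)) / log(2);
}

printf "f12=%d f2(naive)=%d f2(true)=%d\n", $f12, $f2_naive, $f2_true;
printf "ld(naive)=%.3f ld(true)=%.3f\n",
  log_dice($f_w, $f2_naive, $f12), log_dice($f_w, $f2_true, $f12);
```

With these counts the naive merge underestimates f2, so the association score is overestimated.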
As of v0.11.000, each sub-client is queried a second time using its extend() method to acquire independent collocate frequencies for "missing" keys such as $v in the example above. This introduces additional processing overhead; the second query can be disabled by passing the extend=>0 option to the list-client, which restores the old, incorrect, pre-v0.11 behavior.
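The two-pass idea can be sketched over plain hashes as follows. This is a simplified model, not the module's implementation: the %corpus data and merge_profiles() are hypothetical, and the second pass stands in for what an extend() query would provide.

```perl
use strict;
use warnings;

# Hypothetical sub-corpus data: independent frequencies (f2) for all terms,
# and joint frequencies (f12) with the target $w where nonzero.
my %corpus = (
  a => { f2 => { v => 100, u => 40 }, f12 => { v => 10, u => 4 } },
  b => { f2 => { v => 900, u => 60 }, f12 => { u => 6 } },  # v never co-occurs with $w in b
);

sub merge_profiles {
  my ($corpus, $extend) = @_;
  my (%f12, %f2);
  # First pass: merge whatever each sub-profile reports
  for my $sub (values %$corpus) {
    for my $key (keys %{ $sub->{f12} }) {
      $f12{$key} += $sub->{f12}{$key};
      $f2{$key}  += $sub->{f2}{$key};
    }
  }
  return (\%f12, \%f2) if !$extend;
  # Second pass: for merged keys missing from a sub-profile,
  # add that subcorpus' independent frequency
  for my $sub (values %$corpus) {
    for my $key (grep { !exists $sub->{f12}{$_} } keys %f12) {
      $f2{$key} += $sub->{f2}{$key} // 0;
    }
  }
  return (\%f12, \%f2);
}

my (undef, $f2_old) = merge_profiles(\%corpus, 0);
my (undef, $f2_new) = merge_profiles(\%corpus, 1);
print "f2(v) without second pass: $f2_old->{v}; with second pass: $f2_new->{v}\n";
```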
Incorrect Joint Frequencies
Similar to the case for independent collocate frequencies, the joint frequencies f12 reported by this module prior to v0.12.016 were incorrect whenever the list client accessed multiple sub-clients (e.g. $a and $b) and some candidate collocate $v occurred together with the target term $w in both sub-clients' indices, but was among the $fudge*$kbest best items per epoch for only one of the sub-clients.
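A small simulation shows how local pruning loses joint frequency mass. The frequencies and the helpers top_n() and merged_f12() are hypothetical, and ranking by raw f12 is a simplification chosen only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical joint frequencies f12 with the target $w, per sub-client
my %f12 = (
  a => { v => 9, x => 8, y => 1 },
  b => { v => 2, x => 7, y => 6 },
);

# Keys of the $n highest-frequency items (ties broken alphabetically)
sub top_n {
  my ($prof, $n) = @_;
  my @keys = sort { $prof->{$b} <=> $prof->{$a} || $a cmp $b } keys %$prof;
  $n = @keys if $n > @keys;
  return @keys[0 .. $n - 1];
}

# Merged f12 when each sub-client reports only its local top-$n items
sub merged_f12 {
  my ($f12, $n) = @_;
  my %merged;
  for my $prof (values %$f12) {
    $merged{$_} += $prof->{$_} for top_n($prof, $n);
  }
  return \%merged;
}

my $pruned = merged_f12(\%f12, 2);  # stands in for $fudge*$kbest == 2
my $full   = merged_f12(\%f12, 3);  # no pruning
print "f12(v) pruned: $pruned->{v}; unpruned: $full->{v}\n";
```

Here $v co-occurs with $w in both sub-clients but survives pruning only in $a, so the merged joint frequency omits the contribution from $b.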
The extend() method was re-implemented in v0.12.016 to perform a full profile() on the "missing" candidate collocates, ensuring correct acquisition of both joint and independent collocate frequencies.
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.