NAME

Statistics::Sequences::Vnomes - The Serial Test (psi-square) and Generalized Serial Test (delta psi-square) for equiprobability of v-nomes (or v-plets/bits) (Good's and Kendall-Babington Smith's tests)

SYNOPSIS

my $vnomes = Statistics::Sequences::Vnomes->new();
$vnomes->load(qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/); # data treated categorically - could also be, e.g., 'a' and 'b'
my $freq_href = $vnomes->observed(length => 3); # returns hashref of frequency distribution for trinomes in the sequence
my @freq = $vnomes->observed(length => 3); # returns only the observed trinome frequencies (not keyed by the trinomes themselves)
my $val = $vnomes->observed_mean(length => 3); # mean observed frequencies (2.5); also get their SD by 'wanting' an array
$val = $vnomes->expected(length => 3); # mean chance expectation for the frequencies (2.5)
$val = $vnomes->psisq(length => 3); # Good's "second backward differences" psi-square (3.4); option 'delta' gives alternative estimates
$val = $vnomes->p_value(length => 3, tails => 1); # 1-tailed p-value for psi-square (0.0913)
$val = $vnomes->z_value(length => 3, tails => 1, ccorr => 1); # inverse-phi of the 1-tailed p-value (1.333)
my $href = $vnomes->stats_hash(length => 3, values => {observed => 1, p_value => 1}, tails => 1); # include any stat-method (& their options)
$vnomes->dump(length => 3,
 values => {observed => 1, observed_mean => 1, expected => 1, psisq => 1, p_value => 1}, # what stats to show (or not if => 0)
 format => 'table', flag => 1, precision_s => 3, precision_p => 7, verbose => 1, tails => 1);
 # prints:
# Vnomes (3) statistics
#.-----------+-----------+-----------+-----------+-----------.
#| observed  | expected  | p_value   | psisq     | observed- |
#|           |           |           |           | _mean     |
#+-----------+-----------+-----------+-----------+-----------+
#| '011' = 4 | 2.500     | 0.0913418 | 3.400     | 2.500     |
#| '010' = 1 |           |           |           |           |
#| '111' = 2 |           |           |           |           |
#| '000' = 1 |           |           |           |           |
#| '101' = 2 |           |           |           |           |
#| '001' = 3 |           |           |           |           |
#| '100' = 3 |           |           |           |           |
#| '110' = 4 |           |           |           |           |
#'-----------+-----------+-----------+-----------+-----------'
# these and other methods inherited from Statistics::Sequences and its parent, Statistics::Data:
$vnomes->dump_data(delim => ','); # a comma-separated single line of the loaded data
$vnomes->save(path => 'seq.csv'); # serialized for later retrieval by open() method

DESCRIPTION

Implements tests of the independence of successive elements of a sequence of data: serial tests for v-nomes (a.k.a. v-plets or, for binary data, v-bits) - singlets/monobits, dinomes/doublets, trinomes/triplets, etc. The tests are of whether all variations of sub-sequences of length v are equally likely in the sampled sequence. For example, a sequence sampled from a "heads'n'tails" (H and T) distribution can be tested for its equal representation of the trinomes HTH, HTT, TTT, THT, and so on. Counting up these v-nomes at all points in the sequence, permitting overlaps, yields a statistic - psi-square, the Kendall-Babington Smith statistic - that was originally taken to be approximately distributed as chi-square.

However, because these counts are not independent (given the overlaps), Good's Generalized Serial Test is the default test-statistic returned by this module's test routine. It computes psi-square by differencing, i.e., in relation to not only the specified length, or value of v, but also the two preceding lengths (v - 1 and v - 2), yielding a statistic, delta-square-psi-square (the "second backward difference" measure), that is asymptotically distributed as chi-square.

The test is suitable for multi-state data, not only the binary, dichotomous sequence suitable for the runs or joins tests. It can also be used to test that the individual elements in the list are uniformly distributed, that the states are equally represented, i.e., as a chi-square-based frequency test (a.k.a. test of uniformity, equiprobability, equidistribution). This is done by giving a length of 1, i.e., testing for mononomes.

Note that this is not the so-called serial test described by Knuth (1998, Sec. 3.3.2), which involves non-overlapping pairs of successive elements.

METHODS

Methods include those described in Statistics::Sequences, which can be used directly from this module, as follows.

new

$vnomes = Statistics::Sequences::Vnomes->new();

Returns a new Vnomes object. Expects/accepts no arguments but the classname.

load

$vnomes->load(@data); # anonymously
$vnomes->load(\@data);
$vnomes->load('sample1' => \@data); # labelled whatever

Loads data anonymously or by name - see load in Statistics::Data for ways data can be loaded and retrieved (more than shown here). Every load unloads all previous loads and any additions to them.

Data for this test of sequences can be categorical or numerical, all being treated categorically. Also, the data do not have to be dichotomous (unlike in tests of runs and joins).

add, read, unload

See Statistics::Data for these and other operations on loaded data.

observed, vnomes_observed, vno

$href = $vnomes->observed(length => n, circularize => 1|0); # returns keyed distribution; assumes data have already been loaded
@ari = $vnomes->observed(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => n, circularize => 1|0); # returns frequencies only

Returns the frequency distribution for the observed number of vnomes of the given length in a sequence. These are counted up as overlapping. Called in array context, returns an array of the counts for each possible pattern; otherwise, returns a hash reference of these counts keyed by the pattern. A value for length greater than zero is required, and must be no more than the sample-size. So for the sequence

1 0 0 0 1 0 0 1 0 1 1 0 

there are 12 vnomes of length 1 (mononomes, the number of elements) from the two (0 and 1) that are possible; 11 dinomes from the four (00, 01, 10, 11) that are possible; and 10 trinomes from eight (100, 000, 001, 010, etc.) that are possible.

If the option circularize equals 1 (default), the count continues by looping back to the beginning of the sequence until all elements from the end are included. So instead of ending the count for trinomes at (110), the count will also include (101) and (010), increasing the observed number of trinomes to 12.
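The overlapping count described above can be sketched in plain Perl. This is only an illustration under stated assumptions, not the module's own code: count_vnomes is a hypothetical helper, and patterns are keyed by simply joining the elements, which assumes single-character states.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): count overlapping v-nomes.
sub count_vnomes {
    my ($data, $v, $circularize) = @_;
    my @seq = @{$data};
    my $n   = scalar @seq;
    die "length must be between 1 and the sample-size\n" if $v < 1 || $v > $n;
    # Circularizing appends the first v - 1 elements so the count wraps around.
    push @seq, @seq[0 .. $v - 2] if $circularize && $v > 1;
    my %freq;
    for my $i (0 .. $#seq - $v + 1) {
        # Key each window by its joined elements (assumes one-character states).
        $freq{ join '', @seq[$i .. $i + $v - 1] }++;
    }
    return \%freq;
}

my @data   = qw/1 0 0 0 1 0 0 1 0 1 1 0/;
my $open   = count_vnomes(\@data, 3, 0);
my $closed = count_vnomes(\@data, 3, 1);
my ($n_open, $n_closed) = (0, 0);
$n_open   += $_ for values %{$open};
$n_closed += $_ for values %{$closed};
print "trinomes counted: $n_open (not circularized), $n_closed (circularized)\n";
# prints: trinomes counted: 10 (not circularized), 12 (circularized)
```

The two extra trinomes in the circularized count are the wrapped-around (101) and (010).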

The data to test can already have been loaded, or you send it directly keyed as data.

observed_mean

$mean = $vnomes->observed_mean(length => n, data => \@data); # options as for observed()
($mean, $stdev) = $vnomes->observed_mean(length => n, data => \@data); # options as for observed()

Returns the mean of the observed frequencies for the v-nomes of the given length. Called in array context, the mean and also the standard deviation of the frequencies are returned. Options as for the observed method. For other descriptives of the observed frequencies, get the array returned from calling observed and use Statistics::Lite methods, which this method depends on.

expected, vnomes_expected, vne

$count = $vnomes->expected(length => n, circularize => 1|0); # assumes data have already been loaded
$count = $vnomes->expected(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => n, circularize => 1|0, states => [0, 1]);

Returns the expected number of observations for each vnome of the given length in a sequence; i.e., the mean chance expectation, assuming that each event is generated by a random uniform process. Options are as for observed. This expected frequency is given by:

  E[V] = N / k^v

where k is the number of possible states (alternatives; as read from the data or as explicitly given as an array in states), v is the vnome length, and N is the number of vnomes counted: the length of the sequence if the count is circularized, otherwise the length of the sequence less v, plus 1.

Another way to think of this is as the number of mononome observations in the sequence divided by the number of possible permutations of its states for the given length. So, for a sequence made up of 0s and 1s, there are four possible variations of length 2 (00, 10, 01 and 11), so that the expected frequency for each of these variations in a sequence of 20 values is 20 / 4, i.e., 5.
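As a worked illustration of this arithmetic, the following minimal sketch computes the expected frequency per v-nome. The helper name expected_freq is hypothetical and not part of the module; the number of counted v-nomes is assumed to be the sequence length when circularized, and the sequence length less v, plus 1, otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper: expected frequency per v-nome, i.e., N / k^v.
sub expected_freq {
    my ($n, $k, $v, $circularize) = @_;
    # Number of v-nomes counted depends on whether the sequence wraps around.
    my $count = $circularize ? $n : $n - $v + 1;
    return $count / $k**$v;
}

printf "%.2f\n", expected_freq(20, 2, 2, 1);    # dinomes in 20 binary values: 5.00
printf "%.2f\n", expected_freq(20, 2, 3, 1);    # trinomes, as in the SYNOPSIS: 2.50
printf "%.2f\n", expected_freq(12, 2, 3, 0);    # 10 trinomes over 8 patterns: 1.25
```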

variance

(Not defined for this package.)

obsdev, observed_deviation

(Not defined for this package.)

stdev, standard_deviation

(Not defined for this package.)

psisq

$vnomes->psisq(length => n, delta => '2|1|0', circularize => '1|0', states => [qw/A C G T/]);

Performs Good's Generalized Serial Test (by default) of v-nomes on the given or named distribution, yielding a psi-square statistic. Strictly, this is not the raw psi-square value for sub-sequences of length v (the Kendall-Babington Smith statistic), because, unless length (v) = 1, that value is not asymptotically distributed as chi-square. Instead, this method returns by default (without specifying a value for delta, or if delta is not 1 or 0) the "second backward differences" psi-square (delta^2-psi^2). This incorporates psi-square values for backwardly adjacent values of length (v), i.e., for sub-sequences of length v, v - 1, and v - 2. It is not only asymptotically chi-square distributed, but uses statistically independent counts of all the possible variations of sequences of the given length (Good, 1953).

To get the "first backward differences" of psi-square, which is the difference between the psi-square values for sub-sequences of length v and length v - 1, specify delta => 1. While it is chi-square distributed, counts of first-differences are not statistically independent (Good, 1953; Good & Gover, 1967), and "the sequence of second differences forms a much better set of statistics for testing the hypothesis of flat-randomness" (Good & Gover, 1967, p. 104). To incorporate no backward differences in the calculation, specify delta => 0. Leave delta undefined to use the second backward differences.
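The relation between the raw, first-difference and second-difference statistics can be sketched as follows. This is a minimal illustration, not the module's implementation: the helpers psi2 and delta_psi2 are hypothetical, the sequence is assumed to be circularized, states are assumed to be single characters, and psi-square is taken as zero for lengths below 1.

```perl
use strict;
use warnings;

# Raw psi-square for v-nomes of length $v over a circularized sequence:
# psi^2_v = (k^v / N) * sum over all k^v patterns of (n_i - N / k^v)^2
sub psi2 {
    my ($data, $v, $k) = @_;
    return 0 if $v < 1;    # assumption: psi-square is zero for lengths below 1
    my $n   = scalar @{$data};
    my @seq = (@{$data}, @{$data}[0 .. $v - 2]);    # wrap the first v - 1 elements
    my %freq;
    $freq{ join '', @seq[$_ .. $_ + $v - 1] }++ for 0 .. $n - 1;
    my $expected = $n / $k**$v;
    my $sum      = 0;
    $sum += ($_ - $expected)**2 for values %freq;
    # Patterns that were never observed each contribute (0 - expected)^2.
    $sum += ($k**$v - scalar keys %freq) * $expected**2;
    return $sum / $expected;
}

# delta = 0: raw psi-square; 1: first backward difference; otherwise: second.
sub delta_psi2 {
    my ($data, $v, $k, $delta) = @_;
    return psi2($data, $v, $k) if $delta == 0;
    return psi2($data, $v, $k) - psi2($data, $v - 1, $k) if $delta == 1;
    return psi2($data, $v, $k) - 2 * psi2($data, $v - 1, $k) + psi2($data, $v - 2, $k);
}

my @data = qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/;
printf "psi2 = %.1f, first difference = %.1f, second difference = %.1f\n",
    map { delta_psi2(\@data, 3, 2, $_) } 0, 1, 2;
# prints: psi2 = 4.0, first difference = 3.6, second difference = 3.4
```

For the SYNOPSIS data, the second difference (3.4) agrees with the psisq value reported there.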

The implemented algorithm is that given by Good (1953, Eq. 1); benchmarking shows no reliable speed difference to alternative forms of the equation, as given by Good (1957, Eq. 2) and in the NIST test suite (Rukhin et al., 2001). Good's original algorithm can also be found in individual papers describing the application of the Serial Test (e.g., Davis & Akers, 1974).

By default, the p-value associated with the test-statistic is 2-tailed. See the Statistics::Sequences manpage for generic options other than the following Vnome test-specific ones.

z_value, vnomes_zscore, vzs, zscore

$val = $vnomes->z_value(); # data already loaded, use default options
$val = $vnomes->z_value(data => $aref);
($zvalue, $pvalue) =  $vnomes->z_value(data => $aref, tails => 2); # same but wanting an array, get the p-value too

Returns the z-value, obtained not from a direct test of deviation but as the inverse-phi of the p_value, using Math::Cephes ndtri. Called in array context, also returns the p-value itself. Same options and conditions as above. Other options are precision_s (for the z_value) and precision_p (for the p_value).
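To illustrate what taking the inverse-phi of a p-value means, here is a minimal sketch that does not use Math::Cephes. It is a stand-in under stated assumptions: z_from_p and upper_tail are hypothetical helpers, the inversion is done by simple bisection rather than by ndtri, and erfc is assumed to be available from POSIX (Perl 5.22 or later).

```perl
use strict;
use warnings;
use POSIX qw(erfc);    # assumption: Perl 5.22+, where POSIX provides erfc()

# Upper-tail probability of the standard normal distribution at z.
sub upper_tail {
    my ($z) = @_;
    return 0.5 * erfc($z / sqrt(2));
}

# Hypothetical stand-in for an inverse-phi routine:
# bisect for the z whose upper-tail probability equals $p.
sub z_from_p {
    my ($p) = @_;
    my ($lo, $hi) = (-10, 10);
    for (1 .. 100) {
        my $mid = ($lo + $hi) / 2;
        # upper_tail() decreases as z grows, so move the matching bound.
        if (upper_tail($mid) > $p) { $lo = $mid }
        else                       { $hi = $mid }
    }
    return ($lo + $hi) / 2;
}

printf "z = %.2f\n", z_from_p(0.0913418);    # the 1-tailed p-value from the SYNOPSIS
```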

p_value, test, vnomes_test, vnt

$p = $vnomes->p_value(); # using loaded data and default args
$p = $vnomes->p_value(data => [1, 0, 1, 1, 0], exact => 1); #  using given data (by-passing load and read)
$p = $vnomes->p_value(trials => 20, observed => 10); # without using data

Returns probability of obtaining the psisq value for data already loaded, or directly keyed as data. The p-value is read off the complemented chi-square distribution (incomplete gamma integral) using Math::Cephes igamc.
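For even degrees of freedom, the complemented chi-square probability has a closed form, which allows a small check without Math::Cephes. The sketch below rests on assumptions that are not stated in this documentation: the helper chisq_upper_even is hypothetical, 2 degrees of freedom are assumed for binary trinomes by second differences, and the halving for tails => 1 is inferred from the values shown in the SYNOPSIS.

```perl
use strict;
use warnings;

# Upper-tail chi-square probability for EVEN degrees of freedom only:
# Q(x; 2m) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
sub chisq_upper_even {
    my ($x, $df) = @_;
    die "this sketch handles even df only\n" if $df < 2 || $df % 2;
    my ($term, $sum) = (1, 1);
    for my $j (1 .. $df / 2 - 1) {
        $term *= ($x / 2) / $j;    # builds (x/2)^j / j! incrementally
        $sum  += $term;
    }
    return exp(-$x / 2) * $sum;
}

my $p2 = chisq_upper_even(3.4, 2);    # psisq from the SYNOPSIS, assumed df = 2
printf "upper-tail p = %.7f, halved = %.7f\n", $p2, $p2 / 2;
# prints: upper-tail p = 0.1826835, halved = 0.0913418
```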

dump

$vnomes->dump(length => 3, values => {psisq => 1, p_value => 1}, format => 'table|labline|csv', flag => 1, precision_s => 3, precision_p => 7, verbose => 1, tails => 1);

Prints Vnome-test results to STDOUT. See dump in the Statistics::Sequences manpage for details. For comparability with other modules in the Statistics::Sequences package, the Z-value associated (by Math::Cephes ndtri) with the obtained p-value is reported. If text => 2, then you get a verbose dump, including (1) the actual test-statistic depending on the value of delta tested (delta2_psi2 for the second difference measure (default), delta_psi2 for the first difference measure, and psi2 for the raw measure), followed by degrees-of-freedom in parentheses; and (2) a warning, if relevant, that your length value might be too large with respect to the sample size (see the NIST reference in the discussion of length under OPTIONS). If text => 1, you just get the average observed and expected frequencies for each v-nome, the Z-value, and its associated p-value.

OPTIONS

Options common to the above stats methods.

length

The length of the v-nome, i.e., the value of v. Must be an integer greater than or equal to 1, and smaller than the sample-size.

What is a meaningful maximal value of length? As a chi-square test, it could be held that there should be an expected frequency of at least 5 for each v-nome. This is "conventional wisdom" recommended by Knuth (1998) but can be judged to be too conservative (Delucchi, 1993). The NIST documentation on the serial test (Rukhin et al., 2001) recommends that length should be less than the floored value of log2 of the sample-size, minus 2. This module does not test whether either recommendation is met.
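Since the module does not check these recommendations, they can be checked by hand. The following is a minimal sketch of the two rules as described above; both helper names are hypothetical, and the results may be zero or negative for very short sequences, meaning that no length satisfies the rule.

```perl
use strict;
use warnings;

# Largest v whose expected frequency N / k^v is at least $min (5 by convention).
sub max_length_by_expectation {
    my ($n, $k, $min) = @_;
    die "need at least two states\n" if $k < 2;
    my $v = 0;
    $v++ while $n / $k**($v + 1) >= $min;
    return $v;
}

# Largest v satisfying v < floor(log2(N)) - 2.
sub max_length_nist {
    my ($n) = @_;
    my $log2 = 0;
    # Integer floor of log2(N), avoiding floating-point rounding at powers of 2.
    $log2++ while 2**($log2 + 1) <= $n;
    return $log2 - 3;
}

printf "N = 20: expectation rule allows v <= %d, NIST rule allows v <= %d\n",
    max_length_by_expectation(20, 2, 5), max_length_nist(20);
# prints: N = 20: expectation rule allows v <= 2, NIST rule allows v <= 1
```

By either rule, the trinome test of the 20-element SYNOPSIS sequence uses a larger length than recommended; it serves there only as a usage illustration.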

circularize

By default, circularizes the data sequence; i.e., the datum after the last element is the first element. This affects (and slightly simplifies) the calculation of the expected frequency of each v-nome, and so the value of each psi-square. Circularizing ensures that the expected frequencies are accurate; otherwise, they might only be approximate. As Good and Gover (1967) offer, "It is convenient to circularize in order to get exact checks of the arithmetic and also in order to simplify some of the theoretical formulae" (p. 103).

states

A referenced array listing the unique states (or 'events' or 'letters') in the population from which the sequence was sampled. This is useful to specify if the sequence itself is likely not to include all the possible states; it might even include only one of them. If this array is not specified, the unique states are identified from the sequence itself. If giving a list of states, a check in each test is made to ensure that the sequence contains only those elements in the list.
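The kind of check described here can be sketched independently of the module. The helper unknown_states below is hypothetical; it returns each distinct element of the sequence that is not among the declared states.

```perl
use strict;
use warnings;

# Hypothetical check: list the distinct elements not among the declared states.
sub unknown_states {
    my ($data, $states) = @_;
    my %ok = map { $_ => 1 } @{$states};
    my %seen;
    return grep { !$ok{$_} && !$seen{$_}++ } @{$data};
}

my @dna = qw/A C C G A A/;    # no 'T' observed, yet 4 states are possible
my @bad = unknown_states(\@dna, [qw/A C G T/]);
print @bad
    ? "unexpected elements: @bad\n"
    : "all elements are declared states; k = 4\n";
# prints: all elements are declared states; k = 4
```

Without the declared states, k would be read from the data as 3 here, which would change the expected frequencies.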

EXAMPLE

Seating at the diner

These are the data from Swed and Eisenhart (1943), also given as an example for the Runs test and Turns test. They list the occupied (O) and empty (E) seats in a row at a lunch counter. Have people taken up their seats on a random basis - or do they show some social phobia, or are they trying to pick up? What does the test of Vnomes reveal?

use Statistics::Sequences::Vnomes;
my $vnomes = Statistics::Sequences::Vnomes->new();
my @seating = (qw/E O E E O E E E O E E E O E O E/);
$vnomes->load(\@seating);
$vnomes->dump(length => 3, values => {z_value => 1, p_value => 1}, format => 'labline', flag => 1, precision_s => 3, precision_p => 3, tails => 1);

This prints:

z_value = 2.015, p_value = 0.022*

That is, the observed frequency of each possible trio of seating arrangements (the trinomes OOO, OOE, OEE, EEE, etc.) differed significantly from that expected. Look up the observed frequencies for each possible trinome to see if this is because there are more empty or occupied neighbouring seats (phobia or philia):

$vnomes->dump(length => 3, values => {observed => 1}, format => 'labline');

This prints:

observed = ('OEE' = 4,'EEO' = 4,'EEE' = 2,'OEO' = 1,'EOE' = 5,'OOO' = 0,'OOE' = 0,'EOO' = 0)

As the chance-expected frequency is 2 (16 circularized trinomes over 8 possible patterns, as given by the expected method), trinomes involving empty seats clearly occur more often than expected, while those with adjacent occupied seats never occur - suggesting a non-random factor like social phobia or body odour is at work in sequencing people's seating here. Noting that the sequencing isn't significant for dinomes (with length => 2) might also tell us something about what's going on.

REFERENCES

Davis, J. W., & Akers, C. (1974). Randomization and tests for randomness. Journal of Parapsychology, 38, 393-407.

Delucchi, K. L. (1993). The use and misuse of chi-square: Lewis and Burke revisited. Psychological Bulletin, 94, 166-176.

Gatlin, L. L. (1979). A new measure of bias in finite sequences with applications to ESP data. Journal of the American Society for Psychical Research, 73, 29-43. (Used for one of the reference tests in the CPAN distribution.)

Good, I. J. (1953). The serial test for sampling numbers and other tests for randomness. Proceedings of the Cambridge Philosophical Society, 49, 276-284.

Good, I. J. (1957). On the serial test for random sequences. Annals of Mathematical Statistics, 28, 262-264.

Good, I. J., & Gover, T. N. (1967). The generalized serial test and the binary expansion of √2. Journal of the Royal Statistical Society A, 130, 102-107.

Kendall, M. G., & Babington Smith, B. (1938). Randomness and random sampling numbers. Journal of the Royal Statistical Society, 101, 147-166.

Knuth, D. E. (1998). The art of computer programming (3rd ed., Vol. 2 Seminumerical algorithms). Reading, MA, US: Addison-Wesley.

Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., et al. (2001). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Retrieved September 4 2010, from http://csrc.nist.gov/groups/ST/toolkit/rng/documents/SP800-22b.pdf.

SEE ALSO

Statistics::Sequences sub-modules for other tests of sequences, and for sharing data between these tests.

TO DO/BUGS

Handle non-overlapping v-nomes.

AUTHOR/LICENSE

rgarton AT cpan DOT org

This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).

DISCLAIMER

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.

END

This ends documentation of the Perl implementation of the psi-square statistic, Kendall-Babington Smith test, and Good's Generalized Serial Test, for randomness in a sequence.