NAME

Statistics::Sequences::Vnomes - The Serial Test (psi-square) and Generalized Serial Test (delta psi-square) for equiprobability of v-nomes (or v-plets/bits) (Good's and Kendall-Babington Smith's tests)

SYNOPSIS

use Statistics::Sequences::Vnomes;
$vnomes = Statistics::Sequences::Vnomes->new();
$vnomes->load(qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/);
$vnomes->test(length => 2)->dump();

DESCRIPTION

This module implements tests of the independence of successive elements of a sequence/series of data (list, vector, etc.) - specifically, "serial tests" for v-nomes (a.k.a. v-plets or, for binary data, v-bits) - what are called singlets/monobits, dinomes/doublets, trinomes/triplets, etc.

Serial tests tell us whether all the variations of the states of a given sub-sequence length, v, that would be possible in the population from which the series has been sampled are equally represented in the sample. For example, a series sampled from a "heads'n'tails" (H and T) population can be tested for its equal representation of the trinomes HTH, HTT, TTT, THT, and so on. Counting up these v-nomes at all points in the series, permitting overlaps, yields a statistic - psi-square, the Kendall-Babington Smith statistic - that is only approximately distributed as chi-square. Because these counts are not independent (given the overlaps), Good's Generalized Serial Test is more appropriate, and it provides the default test-statistic returned by this module's test routine. It computes psi-square by differencing, i.e., in relation not only to the specified length, or value of v, but also to the two preceding lengths (v - 1 and v - 2), yielding a statistic, delta-square-psi-square (the "second backward difference" measure), that is asymptotically distributed as chi-square.

The test is suitable for multi-state data, not only the binary, dichotomous series suitable for the Runs and Joins tests in this package. Note that this is not the serial test described by Knuth (1998), which concerns non-overlapping pairs of successive elements. (Given this variety of definitions of what is a "serial test," this module - like that for Runs, Pot, etc. - is named after the basic construct tested - i.e., v-nomes - rather than the property of v-nomes (seriality, successive independence, etc.) being tested.)
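In symbols, the statistics described above are commonly written as follows. This is offered as a notational sketch rather than a quotation from the sources: k (the number of states), N (the number of elements in the series) and n_i (the observed overlapping count of the i-th of the k^v possible v-nomes, in a circularized series) are symbols introduced here, and the nabla sign stands for what this documentation writes as "delta".

```latex
\begin{align*}
\psi^2_v &= \frac{k^v}{N} \sum_{i=1}^{k^v} \left( n_i - \frac{N}{k^v} \right)^2,
  \qquad \psi^2_0 = \psi^2_{-1} = 0 \\
\nabla\psi^2_v &= \psi^2_v - \psi^2_{v-1} \\
\nabla^2\psi^2_v &= \psi^2_v - 2\,\psi^2_{v-1} + \psi^2_{v-2}
\end{align*}
```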

METHODS

new

$vnomes = Statistics::Sequences::Vnomes->new();

Returns a new Vnomes object. Expects/accepts no arguments but the classname.

load

$vnomes->load(@data);
$vnomes->load(\@data);
$vnomes->load('dist1' => \@data1, 'dist2' => \@data2);
$vnomes->load({'dist1' => \@data1, 'dist2' => \@data2});

Loads data anonymously or by name. See load in the Statistics::Sequences manpage.

test

$vnomes->test(length => ?integer?, delta => '1|0', circularize => '1|0', states => [qw/A C G T/]);

Performs the serial test of v-nomes on the given or named distribution.

To test for the significance of the psi-square statistic, the raw psi-square value for sub-sequences of length v is, by default, not used, because, unless length (v) = 1, psi-square is not asymptotically distributed as chi-square. However, the differences between psi-square values for backwardly adjacent values of length (v) are asymptotically distributed as chi-square. By default, then, a "second backward differences" psi-square value is calculated, named (as per Good, 1953) delta^2psi^2, which makes use of the psi-square values for sub-sequences of length v, v - 1, and v - 2. This statistic has been shown, logically and empirically, not only to be chi-square distributed but also to offer statistically independent counts of all the possible variations of sequences of length v for the series in question. A practical upshot of this is that the square-root of delta^2psi^2 gives us a Z-value that, per the normal distribution, yields a p-value that is equivalent to that calculated via the chi-square distribution on the basis of delta^2psi^2. The Z-value that is returned when using the simple direct psi-square (as per Kendall & Babington Smith), however, is only approximate. [A future version might yield this Z-value, inversely, from the p-value itself.]
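The calculation described above can be sketched in plain Perl as follows. This is a minimal illustration, not the module's own code: the helper names count_vnomes, psi_square and delta2_psi_square are hypothetical, the series is assumed to be circularized, states are assumed to be single characters, and psi-square is taken to be 0 for lengths below 1. The data are the seating series from the EXAMPLE section.

```perl
use strict;
use warnings;

# Hypothetical helper: count overlapping v-nomes in a circularized series.
sub count_vnomes {
    my ($data, $v) = @_;
    my $n = scalar @$data;
    my %counts;
    for my $i (0 .. $n - 1) {
        # Wrap around the end of the series to circularize it.
        my $key = join '', map { $data->[($i + $_) % $n] } 0 .. $v - 1;
        $counts{$key}++;
    }
    return \%counts;
}

# Raw psi-square for length v with k states; defined as 0 when v < 1.
sub psi_square {
    my ($data, $v, $k) = @_;
    return 0 if $v < 1;
    my $n        = scalar @$data;
    my $n_types  = $k**$v;           # number of possible v-nomes
    my $expected = $n / $n_types;    # expected count of each v-nome
    my $counts   = count_vnomes($data, $v);
    my $sum      = 0;
    $sum += ($_ - $expected)**2 for values %$counts;
    # Each v-nome that never occurs contributes (0 - expected)**2.
    $sum += ($n_types - scalar keys %$counts) * $expected**2;
    return $sum / $expected;         # equals (k**v / n) * sum
}

# Second backward difference of psi-square.
sub delta2_psi_square {
    my ($data, $v, $k) = @_;
    return psi_square($data, $v, $k)
         - 2 * psi_square($data, $v - 1, $k)
         + psi_square($data, $v - 2, $k);
}

my @seating = qw/E O E E O E E E O E E E O E O E/;
printf "psi^2 (length 2)        = %.2f\n", psi_square(\@seating, 2, 2);         # 5.50
printf "delta^2psi^2 (length 2) = %.2f\n", delta2_psi_square(\@seating, 2, 2);  # 1.00
printf "delta^2psi^2 (length 3) = %.2f\n", delta2_psi_square(\@seating, 3, 2);  # 6.25
```

The two second-difference values agree with the test-statistics printed in the EXAMPLE section, which suggests, without proving, that the sketch follows the same definition as the module.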

Note that the "first backward difference" of psi-square, which is the difference between the psi-square values for sub-sequences of length v and length v - 1, is also not returned by default. While it is chi-square distributed, such first differences are not statistically independent (Good, 1953; Good & Gover, 1967). This value can, however, be returned in place of the default by specifying delta => 1. But note: "the sequence of second differences forms a much better set of statistics for testing the hypothesis of flat-randomness" (Good & Gover, 1967, p. 104) [compared to the first differences].
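For orientation, the degrees of freedom usually attached to the two difference statistics can be written as below. These formulas are supplied here as background (k, the number of states, is a symbol introduced for this note) and are not taken from the module's code; they do agree with the degrees of freedom shown in parentheses in the EXAMPLE output (1 for length 2, and 2 for length 3, with two states).

```latex
\begin{align*}
\mathrm{df}\left(\nabla\psi^2_v\right) &= k^v - k^{v-1} \\
\mathrm{df}\left(\nabla^2\psi^2_v\right) &= k^{v-2}\,(k-1)^2 \quad (v \ge 2),
  \qquad \mathrm{df}\left(\nabla^2\psi^2_1\right) = k - 1
\end{align*}
```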

The algorithm implemented for psi-square is that given by Good (1953, Eq. 1); benchmarking shows no appreciable difference from the form of Good (1957, Eq. 2). This is also the algorithm used in the NIST test suite, although written differently (Rukhin et al., 2001). Good's original algorithm can also be found in individual papers describing the application of the Serial Test (e.g., Davis & Akers, 1974).

By default, the p-value associated with the test-statistic is 2-tailed. See the Statistics::Sequences manpage for generic options other than the following Vnome test-specific ones. At the end of the test, the class object is loaded with the usual statistics; this time, however, the value of observed is the average of the observed frequencies of each v-nome, and an additional statistic, observed_stdev, the standard deviation of the observed frequencies, is also stored.

Options

length

The length of the v-nome, i.e., the value of v. Must be an integer greater than or equal to 1, and smaller than the sample-size.

What is a meaningful maximal value for length? As a chi-square test, it could be held that there should be an expected frequency of at least 5 for each v-nome. This is "conventional wisdom" recommended by Knuth (1998) but can be judged to be too conservative (Delucchi, 1993). The NIST documentation on the serial test (Rukhin et al., 2001) recommends that length should be less than the rounded-down (floor) value of log2 of the sample-size, minus 2. No tests are here made of these recommendations, but if you choose to "dump" your results with verbosity (see dump), you will get a note if the NIST warning would apply.
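The NIST guideline can be checked by hand with a few lines of Perl. The sketch below is an illustration only: floor_log2 and nist_length_ok are hypothetical helpers, not functions provided by this module, and the guideline is read as length < floor(log2(n)) - 2.

```perl
use strict;
use warnings;

# Hypothetical helper: largest integer e such that 2**e <= n (n >= 1).
sub floor_log2 {
    my ($n) = @_;
    my $e = 0;
    $e++ while 2**($e + 1) <= $n;
    return $e;
}

# True (1) if the length satisfies length < floor(log2(n)) - 2, else 0.
sub nist_length_ok {
    my ($length, $n) = @_;
    return $length < floor_log2($n) - 2 ? 1 : 0;
}

for my $n (16, 1000, 1_000_000) {
    printf "n = %7d: guideline allows length < %d\n", $n, floor_log2($n) - 2;
}
```

Under this reading, a series of 16 elements satisfies the guideline only for a length of 1, so longer v-nomes tested on such a short series fall outside it.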

circularize

By default, circularizes the data series; i.e., the datum after the last element is the first element. This affects (and slightly simplifies) the calculation of the expected frequency of each v-nome, and so the value of each psi-square. Circularizing ensures that the expected frequencies are accurate; otherwise, they might only be approximate. As Good and Gover (1967) offer, "It is convenient to circularize in order to get exact checks of the arithmetic and also in order to simplify some of the theoretical formulae" (p. 103).
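The effect of circularizing on the counts can be seen in a small stand-alone sketch. The helper count_vnomes and its third argument are hypothetical and only mimic the idea of the circularize option; this is not the module's own counting routine, and states are assumed to be single characters.

```perl
use strict;
use warnings;

# Hypothetical helper: count overlapping v-nomes, optionally wrapping
# from the last element back to the first.
sub count_vnomes {
    my ($data, $v, $circularize) = @_;
    my $n    = scalar @$data;
    my $last = $circularize ? $n - 1 : $n - $v;    # last start index
    my %counts;
    for my $i (0 .. $last) {
        my $key = join '', map { $data->[($i + $_) % $n] } 0 .. $v - 1;
        $counts{$key}++;
    }
    return \%counts;
}

my @series = qw/H T T H H T H H/;
for my $circ (1, 0) {
    my $counts = count_vnomes(\@series, 2, $circ);
    my $total  = 0;
    $total += $_ for values %$counts;
    printf "circularize => %d: %d dinomes in total\n", $circ, $total;
    printf "\t%s\t%d\n", $_, $counts->{$_} for sort keys %$counts;
}
```

With wrapping, the 8 elements give 8 dinomes, and the dinome counts sum back to the singlet counts (HH + HT equals the 5 occurrences of H); without wrapping only 7 dinomes are counted and that check no longer holds exactly.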

delta

By default, the statistics are based on the second backward difference of psi-squares, i.e., as the Generalized Serial Test described by Good (1953; see REFERENCES). If delta => 1, the first backward difference is used instead (see test, above). If delta => 0, the original Kendall-Babington Smith statistic is used.

states

A reference to an array listing the unique states (or 'events' or 'letters') in the population from which the series was sampled. This is useful to specify if the series itself is likely not to include all the possible states; it might even include only one of them. If this array is not specified, the unique states are identified from the series itself - in which case there ought to be at least two states in the series. Specifying only one state is not permissible. If a list of states is given, a check is made in each test to ensure that the data series contains only those elements in the list.
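The rules just described can be sketched as below. This is a hypothetical helper, resolve_states, written for illustration; the module's own checks and error messages may well differ, and treating a single-state series as a fatal error when no list is given is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: return the list of states to test against,
# either as given by the caller or as found in the data.
sub resolve_states {
    my ($data, $states) = @_;
    if (defined $states) {
        my %allowed = map { $_ => 1 } @$states;
        die "At least two states are required\n"
            if scalar(keys %allowed) < 2;
        # Every datum must be one of the listed states.
        my @unknown = grep { !$allowed{$_} } @$data;
        die "Data contain unlisted states: @unknown\n" if @unknown;
        return [sort keys %allowed];
    }
    # No list given: take the unique states from the series itself.
    my %seen = map { $_ => 1 } @$data;
    die "At least two states are required\n" if scalar(keys %seen) < 2;
    return [sort keys %seen];
}

my @dna = qw/A C A A C A/;
my $k_listed   = scalar @{ resolve_states(\@dna, [qw/A C G T/]) };
my $k_inferred = scalar @{ resolve_states(\@dna) };
print "nstates: listed = $k_listed, inferred = $k_inferred\n";
```

The distinction matters because the number of states sets the number of possible v-nomes: with 4 listed states there are 16 possible dinomes, against only 4 when the 2 states are inferred from this short series.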

dump

$vnomes->dump(flag => '1|0', text => '0|1|2');

Print Vnome-test results to STDOUT. See dump in the Statistics::Sequences manpage for details. After naming the test-statistic (delta^2psi^2 for the second difference measure, delta_psi^2 for the first difference measure, and psi^2 for the raw measure), the degrees-of-freedom follow in parentheses, and then the value of the test-statistic. If text => 2, then you get a verbose telling of the inputs and results, including, if relevant, a warning if your length value might be too large with respect to the sample size. Otherwise, you just get the average observed and expected frequencies for each v-nome, the requested test-statistic (delta^2psi^2 by default), and its associated p-value.

After testing, parameters named 'nstates' (the number of states), 'samplings' (the size of the sample), 'length' (what you requested) can be retrieved from the class object. You can retrieve the counts for each of the Vnomes in the series as a hash-reference named 'counts' in the class object, e.g.:

print "No. of $vnomes->{'length'}-nome variations of $vnomes->{'nstates'} states among $vnomes->{'samplings'} samplings:\n";
foreach (sort keys %{$vnomes->{'counts'}}) {
    printf("\t%s\t%d\n", $_, $vnomes->{'counts'}->{$_});
}

EXAMPLE

Seating at the diner

This is the data from Swed and Eisenhart (1943) also given as an example for the Runs test. It lists the occupied (O) and empty (E) seats in a row at a lunch counter. Have people taken up their seats on a random basis? The Runs test suggested some non-random basis for people to take their seats, outputting (as per dump):

Runs: observed = 11.00, expected = 7.88, z = 1.60, 1p = 0.054834

That means there was more serial discontinuity than expected. What does the test of Vnomes tell us?

use Statistics::Sequences::Vnomes;
my $vnomes = Statistics::Sequences::Vnomes->new();
my @seating = (qw/E O E E O E E E O E E E O E O E/);
$vnomes->load(\@seating);
$vnomes->test(length => 2)->dump();

This outputs, as returned by string:

delta^2psi^2 (1) = 1, 2p = 0.317310507862914

That is, the observed frequency of each possible pair of seating arrangements (OO, OE, EE, EO) did not differ significantly from that expected. Taking a bigger picture, though, and changing the value of length to 3, yields:

delta^2psi^2 (2) = 6.25, 2p = 0.04139369336234074
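These two test-statistics can be reproduced by hand. The working below is a sketch that assumes the usual definition of psi-square, (k^v / N) times the sum of squared deviations of the v-nome counts from N / k^v, on the circularized series, with N = 16 seats and k = 2 states; the nabla sign stands for "delta". The singlet counts are E = 11 and O = 5; the dinome counts are EE = 6, EO = 5, OE = 5 and OO = 0; the trinome counts are EOE = 5, OEE = 4, EEO = 4, EEE = 2, OEO = 1, and 0 for OOO, OOE and EOO.

```latex
\begin{align*}
\psi^2_1 &= \tfrac{2}{16}\left[(11-8)^2 + (5-8)^2\right] = 2.25 \\
\psi^2_2 &= \tfrac{4}{16}\left[(6-4)^2 + (5-4)^2 + (5-4)^2 + (0-4)^2\right] = 5.5 \\
\psi^2_3 &= \tfrac{8}{16}\left[(5-2)^2 + 2\,(4-2)^2 + (2-2)^2 + (1-2)^2 + 3\,(0-2)^2\right] = 15 \\
\nabla^2\psi^2_2 &= 5.5 - 2(2.25) + 0 = 1 \\
\nabla^2\psi^2_3 &= 15 - 2(5.5) + 2.25 = 6.25
\end{align*}
```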

REFERENCES

Davis, J. W., & Akers, C. (1974). Randomization and tests for randomness. Journal of Parapsychology, 38, 393-407.

Delucchi, K. L. (1993). The use and misuse of chi-square: Lewis and Burke revisited. Psychological Bulletin, 94, 166-176.

Good, I. J. (1953). The serial test for sampling numbers and other tests for randomness. Proceedings of the Cambridge Philosophical Society, 49, 276-284.

Good, I. J. (1957). On the serial test for random sequences. Annals of Mathematical Statistics, 28, 262-264.

Good, I. J., & Gover, T. N. (1967). The generalized serial test and the binary expansion of √2. Journal of the Royal Statistical Society A, 130, 102-107.

Kendall, M. G., & Babington Smith, B. (1938). Randomness and random sampling numbers. Journal of the Royal Statistical Society, 101, 147-166.

Knuth, D. E. (1998). The art of computer programming (3rd ed., Vol. 2 Seminumerical algorithms). Reading, MA, US: Addison-Wesley.

Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., et al. (2001). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Retrieved September 4 2010, from http://csrc.nist.gov/groups/ST/toolkit/rng/documents/SP800-22b.pdf.

SEE ALSO

Statistics::Sequences for other tests of sequences, and for sharing data between these tests.

TO DO/BUGS

Implementation of the serial test for non-overlapping v-nomes.

REVISION HISTORY

See CHANGES in installation dist for revisions.

AUTHOR/LICENSE

rgarton AT cpan DOT org

This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).

DISCLAIMER

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.

End

This ends documentation of the Perl implementation of the chi-square, Kendall-Babington Smith, and Good's Generalized Serial Test for randomness.