NAME
Statistics::Data - Manage loading, accessing, updating one or more sequences of data for statistical analysis
VERSION
This is documentation for Version 0.06 of Statistics/Data.pm, released August 2013.
SYNOPSIS
use Statistics::Data 0.06;
my $dat = Statistics::Data->new();
# With labelled sequences:
$dat->load({'aname' => \@data1, 'anothername' => \@data2}); # labels are arbitrary
$aref = $dat->access(label => 'aname'); # gets back a copy of @data1
$dat->add(aname => [2, 3]); # pushes new values onto loaded copy of @data1
$dat->dump_list(); # print to check if both arrays are loaded and their number of elements
$dat->unload(label => 'anothername'); # only 'aname' data remains loaded
$aref = $dat->access(label => 'aname'); # $aref is a reference to a copy of @data1
$dat->dump_vals(label => 'aname', delim => ','); # proof in print it's back
# With multiple anonymous sequences:
$dat->load(\@data1, \@data2); # any number of anonymous arrays
$dat->add([2], [6]); # pushes a single value apiece onto copies of @data1 and @data2
$aref = $dat->access(index => 1); # returns reference to copy of @data2, with its new values
$dat->unload(index => 0); # only @data2 remains loaded, and its index is now 0
# With a single anonymous data sequence:
$dat->load(1, 2, 2);
$dat->add(1); # loaded sequence is now 1, 2, 2, 1
$dat->dump_vals(); # same as: print @{$dat->access()}, "\n";
$dat->unload(); # all gone
DESCRIPTION
Handles data for some other statistics modules, as in loading, updating and retrieving data for analysis. Performs no actual statistical analysis itself.
The rationale is to avoid writing the same or similar load, add, etc. methods for every statistics module, not to provide an omnibus API for Perl stats modules. It does, however, encompass much of the variety in how Perl stats modules handle their data. Used for Statistics::Sequences (and its sub-tests).
SUBROUTINES/METHODS
The basic aims, rules and behaviors of the methods are as described in the RATIONALE section, below. The possibilities are many but, to sum up: any loaded/added sequence of data ends up cached within the class object's '_DATA' aref as an aref itself. Optionally (but preferably), this sequence is associated with a 'label', i.e., a stringy name, if it has been loaded/added as such. The sequences can be updated or retrieved according to the order in which they were loaded/added (by 'index') or (preferably) by their 'label'. In this way, any particular statistical method (e.g., to calculate the number of runs in the sequence, as in Statistics::Sequences::Runs) can refer to the 'index' or 'label' of the sequence to do its analysis upon; or it can still use its own rules to select the appropriate sequence, or provide the appropriate sequence within the call to itself. The particular data structures supported here to load, update, retrieve and unload data are specified under load.
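As a rough illustration of retrieval by 'index' or by 'label', here is a minimal sketch in plain Perl. It is not the module's own code: the key names 'seq' and 'lab', and the helpers index_of_label and fetch, are assumptions made for the example, standing in for the '_DATA' cache and the access method.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the object's '_DATA' aref: one anonymous hash per
# loaded sequence, in load order. The key names 'seq' and 'lab' are assumptions.
my @cache = (
    { lab => 'blues', seq => [ 1, 4, 7 ] },
    { lab => 'reds',  seq => [ 2, 5, 9 ] },
    { lab => undef,   seq => [ 3, 3 ] },       # anonymous load: index only
);

# Return the position of the sequence carrying the given label, or undef.
sub index_of_label {
    my ( $cache, $label ) = @_;
    for my $i ( 0 .. $#{$cache} ) {
        my $lab = $cache->[$i]{lab};
        return $i if defined $lab && $lab eq $label;
    }
    return;
}

# Fetch a copy of one sequence by label => NAME or index => I (default: 0).
sub fetch {
    my ( $cache, %args ) = @_;
    my $i = defined $args{label} ? index_of_label( $cache, $args{label} )
          : defined $args{index} ? $args{index}
          :                        0;
    return unless defined $i && $i >= 0 && $i <= $#{$cache};
    return [ @{ $cache->[$i]{seq} } ];    # a copy, so callers cannot alter the cache
}

my $reds  = fetch( \@cache, label => 'reds' );
my $third = fetch( \@cache, index => 2 );
print "@{$reds}\n";     # 2 5 9
print "@{$third}\n";    # 3 3
```

Storing the label beside the data, rather than keying the data by label, is what lets the same cache serve both labelled and anonymous loads in this sketch.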
new
$dat = Statistics::Data->new();
Returns a new Statistics::Data object.
copy
$seq2 = $dat->copy();
Alias: clone
Returns a copy of the class object with its data loaded (if any). Note this is not a copy of any particular data but the whole blessed hash. Alternatively, use pass to get all the data added to a new object, or use access to load/add particular sequences into another object. Nothing modified in this new object affects the original.
load
$dat->load(@data); # CASE 1 - can be updated/retrieved anonymously, or as index => i (load order)
$dat->load(\@data); # CASE 2 - same, as aref
$dat->load(data => \@data); # CASE 3 - updated/retrieved as label => 'data' (arbitrary name, not just 'data'); or by index (order)
$dat->load({ data => \@data }); # CASE 4 - same as CASE 3, as hashref
$dat->load(blues => \@blue_data, reds => \@red_data); # CASE 5 - same as CASE 3 but with multiple named loads
$dat->load({ blues => \@blue_data, reds => \@red_data }); # CASE 6 - same as CASE 5 but as hashref
$dat->load(\@blue_data, \@red_data); # CASE 7 - same as CASE 2 but with multiple aref loads
# Not supported:
#$dat->load(data => @data); # not OK - use CASE 3 instead
#$dat->load([\@blue_data, \@red_data]); # not OK - use CASE 7 instead
#$dat->load([ [blues => \@blue_data], [reds => \@red_data] ]); # not OK - use CASE 5 or CASE 6 instead
#$dat->load(blues => \@blue_data, reds => [\@red_data1, \@red_data2]); # not OK - too mixed to make sense
Alias: load_data
Cache a list of data as an array-reference. Each call removes previous loads, as does sending nothing. If data need to be cached without unloading previous loads, try add. Arguments with the following structures are acceptable as data, and will be accessible by either index or label as expected:
- load ARRAY
Load an anonymous array that has no named values. For example:
$dat->load(1, 4, 7); $dat->load(@ari);
This is loaded as a single sequence, with an undefined label, and indexed as 0. Note that trying to load a labelled dataset with an unreferenced array is wrong: it is treated like this case, with the label "folded" into the sequence itself:
$dat->load('dist' => 3); # no croak but not ok!
- load AREF
Load a reference to a single anonymous array that has no named values, e.g.:
$dat->load([1, 4, 7]); $dat->load(\@ari);
This is loaded as a single sequence, with an undefined label, and indexed as 0.
- load ARRAY of AREF(s)
Same as above, but note that more than one unlabelled array-reference can also be loaded at once, e.g.:
$dat->load([1, 4, 7], [2, 5, 9]); $dat->load(\@ari1, \@ari2);
Each sequence can be accessed, using access, by specifying index => index, the latter value representing the order in which these arrays were loaded.
- load HASH of AREF(s)
Load one or more labelled references to arrays, e.g.:
$dat->load('dist1' => [1, 4, 7]); $dat->load('dist1' => [1, 4, 7], 'dist2' => [2, 5, 9]);
This loads the sequence(s) with a label attribute, so that when calling access, they can be retrieved by name, e.g., passing label => 'dist1'. The load method involves a check that there is an even number of arguments, and that, if this really is a hash, all the keys are defined and not empty, and all the values are in fact array-references.
- load HASHREF of AREF(s)
As above, but where the hash is referenced, e.g.:
$dat->load({'dist1' => [1, 4, 7], 'dist2' => [2, 5, 9]});
This means that using the following forms will produce unexpected results, if they do not actually croak, and so should not be used:
$dat->load(data => @data); # no croak but wrong - puts "data" in @data - use \@data
$dat->load([\@blue_data, \@red_data]); # use unreferenced ARRAY of AREFs instead
$dat->load([ [blues => \@blue_data], [reds => \@red_data] ]); # treated as single AREF; use HASH of AREFs instead
$dat->load(blues => \@blue_data, reds => [\@red_data1, \@red_data2]); # mixed structures not supported
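The accepted and rejected argument shapes listed above can be summarised in a small classifier. This is a minimal sketch of the described rules in plain Perl, not the module's own code: classify_args and the strings it returns are hypothetical, and the handling of an empty hashref is an assumption.

```perl
use strict;
use warnings;

# Classify a load()-style argument list according to the documented rules.
sub classify_args {
    my @args = @_;

    # A lone hashref is unwrapped and then treated like a hash of arefs.
    @args = %{ $args[0] } if @args == 1 && ref( $args[0] ) eq 'HASH';
    return 'nothing' if !@args;

    # One or more array references and nothing else: anonymous sequence(s).
    return 'anonymous arefs' if !grep { ref($_) ne 'ARRAY' } @args;

    # Label/aref pairs: even count, non-empty plain-string keys, aref values.
    if ( @args % 2 == 0 ) {
        my $ok = 1;
        for my $i ( 0 .. $#args ) {
            my $arg = $args[$i];
            if ( $i % 2 == 0 ) {    # key position
                $ok = 0 if !defined $arg || ref($arg) || $arg !~ /\S/;
            }
            else {                  # value position
                $ok = 0 if ref($arg) ne 'ARRAY';
            }
        }
        return 'labelled arefs' if $ok;
    }

    # Anything else led by a reference is unsupported; a plain list becomes
    # one anonymous sequence, with any intended label folded into it.
    return ref( $args[0] ) ? 'unsupported' : 'flat list';
}

print classify_args( 1, 4, 7 ), "\n";                   # flat list
print classify_args( [ 1, 4, 7 ], [ 2, 5, 9 ] ), "\n";  # anonymous arefs
print classify_args( { dist1 => [ 1, 4, 7 ] } ), "\n";  # labelled arefs
print classify_args( data => 1, 2, 3 ), "\n";           # flat list
```

The last call shows why load(data => @data) does not croak: with no array reference in value position, the whole list, label included, is taken as one anonymous sequence.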
add
Alias: add_data, append_data, update
Same usage as shown above for load. Pushes any value(s) onto sequences already loaded, or loads an entire labelled sequence, without clobbering what's already in there (as load would). If data have not been loaded with a label, then data are appended to them according to the order of the array-refs given here; see EXAMPLES. It is even possible to skip adding something to one previously loaded sequence by, e.g., calling $dat->add([], \@new_data): this adds nothing to the first loaded sequence, and either initialises a second sequence, if there is none already, or appends these data to it.
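Appending by order to unlabelled sequences can be pictured with a minimal sketch in plain Perl. The array @cache and the helper add_by_order are hypothetical stand-ins for the object's cache and its add method, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the cache of anonymous sequences, in load order.
my @cache = ( [ 1, 2, 2 ] );

# Append each given aref to the cached sequence at the same position,
# creating the sequence if nothing has been loaded at that position yet.
sub add_by_order {
    my ( $cache, @arefs ) = @_;
    for my $i ( 0 .. $#arefs ) {
        push @{ $cache->[$i] }, @{ $arefs[$i] };
    }
    return scalar @{$cache};
}

add_by_order( \@cache, [], [ 7, 8 ] );    # first untouched, second initialised
add_by_order( \@cache, [1], [9] );        # one value appended to each

print join( ',', @{$_} ), "\n" for @cache;    # prints 1,2,2,1 then 7,8,9
```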
access
$aref = $dat->access(); #returns the first and/or only sequence anonymously loaded, if any
$aref = $dat->access(index => integer); #returns the ith sequence anonymously loaded
$aref = $dat->access(label => 'a_name'); # returns a particular named cache of data
Alias: get_data
Returns the data that have been loaded/added to. Only a single sequence can be accessed per call; if no 'label' is given, or the given 'label' does not exist, it just tries to get 'data'. If this fails, a croak is given.
unload
$dat->unload(); # deletes all cached data, named or not
$dat->unload(index => integer); # deletes the sequence at this index (load order)
$dat->unload(label => 'a name'); # deletes the sequence loaded with this label
Empty, clear, clobber what's in there. Croaks if given index or label does not refer to any loaded data. This should be used whenever any already loaded or added data are no longer required ahead of another add, including via copy or share.
share
$dat_new->share($dat_old);
Aliases: pass, import
Adds all the data from one Statistics::Data object to another. Changes in the new copies do not affect the originals.
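The independence of the copies can be illustrated with a minimal sketch in plain Perl. How the module itself copies the data is not stated here; the sketch uses dclone from the core Storable module as one assumed way to get that behaviour, and contrasts it with a shallow copy.

```perl
use strict;
use warnings;
use Storable qw(dclone);

my %old = ( blues => [ 1, 4, 7 ], reds => [ 2, 5, 9 ] );

# Deep copy: the new hash gets its own arrays, not references to the old ones.
my %new = %{ dclone( \%old ) };
push @{ $new{blues} }, 10;

# Shallow copy, for contrast: both hashes point at the same arrays.
my %shallow = %old;
push @{ $shallow{reds} }, 11;

print scalar( @{ $old{blues} } ), "\n";    # 3: untouched by the deep copy
print scalar( @{ $old{reds} } ),  "\n";    # 4: changed through the shallow copy
```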
ndata
$n = $dat->ndata();
Returns the number of loaded data sequences.
all_full
$bool = $dat->all_full(\@data); # test data are valid before loading them
$bool = $dat->all_full(label => 'mydata'); # checking after loading/adding the data (or key in 'index')
Checks not only that the data sequence, as named or indexed, exists, but also that it is non-empty, i.e., has no empty elements, with each element checked with hascontent from String::Util.
all_numeric
$bool = $dat->all_numeric(\@data); # test data are valid before loading them
$bool = $dat->all_numeric(label => 'mydata'); # checking after loading/adding the data (or key in 'index')
Ensures data are all numerical, using looks_like_number in Scalar::Util.
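A minimal sketch of such a check, using only the core Scalar::Util module. The helper seq_all_numeric is hypothetical, and treating an empty sequence or an undefined element as a failure is an assumption, not documented behaviour.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# True only if the sequence has elements and every one looks like a number.
sub seq_all_numeric {
    my ($seq) = @_;
    return 0 if ref($seq) ne 'ARRAY' || !@{$seq};
    for my $val ( @{$seq} ) {
        return 0 if !defined $val || !looks_like_number($val);
    }
    return 1;
}

print seq_all_numeric( [ 1, '2.5', '-3e2' ] ) ? "ok\n" : "not ok\n";    # ok
print seq_all_numeric( [ 1, 'two', 3 ] )      ? "ok\n" : "not ok\n";    # not ok
```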
all_proportions
$bool = $dat->all_proportions(\@data); # test data are valid before loading them
$bool = $dat->all_proportions(label => 'mydata'); # checking after loading/adding the data (or key in 'index')
Ensure data are all proportions. Sometimes, the data a module needs are all proportions, ranging from 0 to 1 inclusive. A dataset might have to be cleaned
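A minimal sketch of a range check of this kind, in plain Perl with the core Scalar::Util module. The helper seq_all_proportions is hypothetical, and failing on an empty sequence is an assumption; the test values are taken from the frog data under EXAMPLES.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# True only if every element is a number from 0 to 1 inclusive.
sub seq_all_proportions {
    my ($seq) = @_;
    return 0 if ref($seq) ne 'ARRAY' || !@{$seq};
    for my $val ( @{$seq} ) {
        return 0 if !defined $val || !looks_like_number($val);
        return 0 if $val < 0 || $val > 1;
    }
    return 1;
}

my @heartrate = ( .70, .50, .44, .67, .66 );
my @pupil     = ( 59.2, 77.7, 56.1 );

print seq_all_proportions( \@heartrate ) ? "ok\n" : "not ok\n";    # ok
print seq_all_proportions( \@pupil )     ? "ok\n" : "not ok\n";    # not ok
```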
dump_vals
$seq->dump_vals(delim => ", "); # assumes the first (only?) loaded sequence should be dumped
$seq->dump_vals(index => integer, delim => ", "); # dump the i'th loaded sequence
$seq->dump_vals(label => 'mysequence', delim => ", "); # dump the sequence loaded/added with the given "label"
Prints to STDOUT a space-separated line (ending with "\n") of a loaded/added data's elements. Optionally, give a value for delim to specify how the elements in each sequence should be separated; default is a single space.
dump_list
Dumps a list (using Text::SimpleTable) of the data currently loaded, without showing their actual elements. List is firstly by index, then by label (if any), then gives the number of elements in the associated sequence.
save_to_file
$dat->save_to_file(path => 'mysequences.csv');
$dat->save_to_file(path => 'mysequences.csv', serializer => 'XML::Simple', compress => 1, secret => '123'); # serialization options
Saves the data presently loaded in the Statistics::Data object to a file, with the given path. This can be retrieved, with all the data added to the Statistics::Data object, via load_from_file. Basically a wrapper to the store method in Data::Serializer; cf. that module for options.
load_from_file
$dat->load_from_file(path => 'medata.csv', format => 'xml|csv');
$dat->load_from_file(path => 'mysequences.csv', serializer => 'XML::Simple', compress => 1, secret => '123'); # serialization options
Loads data from a file, assuming there are data in the given path that have been saved in the format used in save_to_file. Basically a wrapper to the retrieve method in Data::Serializer (cf. that module for options), and then to load. If the data retrieved are actually to be added to any data already cached via a previous load or add, define the optional parameter keep => 1.
EXAMPLES
1. Multivariate data (a tale of horny frogs)
In a study of how doing mental arithmetic affects arousal in self and others (i.e., how mind, body and world interact), three male frogs were maths-trained and then, as they did their calculations, were measured for pupillary dilation and perceived attractiveness. After four runs, average measures per frog can be loaded:
$frogs->load(Names => [qw/Freddo Kermit Larry/], Pupil => [59.2, 77.7, 56.1], Attract => [3.11, 8.79, 6.99]);
But one more frog still had to graduate from training, and data are now ready for loading:
$frogs->add(Names => ['Sleepy'], Pupil => [83.4], Attract => [5.30]);
$frogs->dump_data(label => 'Pupil'); # prints "59.2 77.7 56.1 83.4" : all 4 frogs' pupil data for analysis by some module
Say we're finished testing for now, so:
$frogs->save_to_file(path => 'frogs.csv');
$frogs->unload();
But another frog has been trained, measures taken:
$frogs->load_from_file(path => 'frogs.csv');
$frogs->add(Pupil => [93], Attract => [6.47], Names => ['Jack']); # add yet another frog's data
$frogs->dump_data(label => 'Pupil'); # prints "59.2 77.7 56.1 83.4 93": all 5 frogs' pupil data
Now we run another experiment, taking measures of heart-rate, and can add them to the current load of data for analysis:
$frogs->add(Heartrate => [.70, .50, .44, .67, .66]); # add entire new sequence for all frogs
print "heartrate data are bung" if ! $frogs->all_proportions(label => 'Heartrate'); # validity check (could do before add)
$frogs->dump_list(); # see all four data-sequences now loaded, each with 5 observations (1 per frog), i.e.:
.-------+-----------+----.
| index | label     | N  |
+-------+-----------+----+
| 0     | Names     | 5  |
| 1     | Attract   | 5  |
| 2     | Pupil     | 5  |
| 3     | Heartrate | 5  |
'-------+-----------+----'
2. Using as a base module
As Statistics::Sequences, and so its sub-modules, use this module as their base, it doesn't have to do much data-managing itself:
use feature 'say'; # needed for say() below
use Statistics::Sequences;
my $seq = Statistics::Sequences->new();
$seq->load(qw/f b f b b/); # using Statistics::Data method
say $seq->p_value(stat => 'runs', exact => 1); # using Statistics::Sequences::Runs method
Or if these data were loaded directly within Statistics::Data, the data can be shared around modules that use it as a base:
use feature 'say'; # needed for say() below
use Statistics::Data;
use Statistics::Sequences::Runs;
my $dat = Statistics::Data->new();
my $runs = Statistics::Sequences::Runs->new();
$dat->load(qw/f b f b b/);
$runs->pass($dat);
say $runs->p_value(exact => 1);
DIAGNOSTICS
- Don't know how to load/add data
Croaked when attempting to load or add data with an unsupported data structure where the first argument is a reference. See the examples under load for valid (and invalid) ways of sending data to them.
- Data for accessing need to be loaded
Croaked when calling access, or any methods that use it internally -- viz., dump_vals and the validity checks all_numeric -- when it is called with a label for data that have not been loaded, or did not load successfully.
- Data for unloading need to be loaded
Croaked when calling unload with an index or a label attribute and the data these refer to have not been loaded, or did not load successfully.
- There is no path for saving (or loading) data
Croaked when calling save_to_file or load_from_file without a value for the required path argument, or if (when loading from it) it does not exist.
DEPENDENCIES
List::AllUtils - used for its all method when testing loads
Number::Misc - used for its is_even method when testing loads
String::Util - used for its hascontent and nocontent methods
Data::Serializer - required for save_to_file and load_from_file
Scalar::Util - required for all_numeric
Text::SimpleTable - required for dump_list
RATIONALE
The basic aims/rules/behaviors of all the methods have been/are to:
- lump data as arefs into the class object
That's sequences in general, without discriminating at the outset between continuous/numeric or categorical/nominal/stringy data - because the stats methods themselves don't matter here. The point is to make these things available for modular statistical analysis without having to worry about all the loading, adding, accessing, etc. within the same package. No other information is cached, not whether they've been analysed, updated ... - just the arefs themselves. Maybe later versions could distinguish data from other info, but for now, that's all left up to the stat analysis modules themselves.
- handle multiple arefs
That's much of the crux of having a Statistics::Data object, or making any stats object - otherwise, they'd just use a module from the Data or List families to handle the data. Also because many Perl stats modules have found it useful to have this functionality - rather than managing multiple objects.
- distinguish between handling whole sequences or just their elements
To add_data in most stats modules is to append (push) values to an existing aref, and same thing for deleting data. Some do this just for the single sequence they cache, others by naming particular sequences to append values to, delete values from. But sometimes it's useful to add a whole new sequence without clobbering what was already in there as data, or delete one or more (but not all) sequences already loaded; e.g., some stats modules find it useful to load/add to/delete sequences in multiple separate calls: Statistics::DependantTTest, Statistics::KruskalWallis, Statistics::LogRank. That's taken here as a matter of loading and unloading, not adding and deleting.
- handle named arefs
That's both hashes of arefs, and hashrefs of arefs. This is already useful for Statistics::ANOVA and Statistics::FisherPitman - loaded/added to in single calls. There's also the case of having one or more named sequences (arefs) to have multiple sequences attached to them - e.g., when testing for a match of one "target" sequence to one or more "response" sequences; not implemented here, but the existing methods should be able to readily serve up such things.
- handle anonymous data
If there's only ever a single sequence of data to analyse by a stats module (such as in Statistics::Autocorrelation and Statistics::Sequences), then naming them, and getting at them by names, might be inconvenient. There should also be support for multiple anonymous loads, which would be accessed by index (order of load) (modules Statistics::ChisqIndep and Statistics::TTest have found this useful). Still, providing this functionality has meant (so far) not keying data by any label, only storing the label within an anonymous hash, alongside the data.
- ample aliases
Perl stats modules use a wide variety of names for performing the same or similar data-handling operations within them; e.g., a load in one is an add in another which is really an update in yet another. So the methods here have several aliases representing method names used in other modules.
- easy, obvious adoption by other modules of the methods
The modules that use this one simply make themselves "based" on it, and they're always free to define their own load, access, etc. methods.
BUGS AND LIMITATIONS
Please report any bugs or feature requests to bug-statistics-data-0.01 at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-Data-0.01. This will notify the author, and then you'll automatically be notified of progress on your bug as any changes are made.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Statistics::Data
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Statistics-Data-0.06
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
AUTHOR
Roderick Garton, <rgarton at cpan.org>
LICENSE AND COPYRIGHT
Copyright 2009-2013 Roderick Garton
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License. See perl.org/licenses for more information.