NAME

Math::SimpleHisto::XS - Simple histogramming, but kinda fast

SYNOPSIS

use Math::SimpleHisto::XS;
my $hist = Math::SimpleHisto::XS->new(
  min => 10, max => 20, nbins => 1000,
);

$hist->fill($x);
$hist->fill($x, $weight);
$hist->fill(\@xs);
$hist->fill(\@xs, \@ws);

my $data_bins = $hist->all_bin_contents; # get bin contents as array ref
my $bin_centers = $hist->bin_centers; # ditto for the bin centers

DESCRIPTION

This module implements simple 1D histograms with fixed or variable bin size. The implementation is mostly in C with a thin Perl layer on top.

If this module isn't powerful enough for your histogramming needs, have a look at the powerful-but-experimental SOOT module or submit a patch.

The lower bin boundary is considered part of the bin. The upper bin boundary is considered part of the next bin or overflow.

Bin numbering starts at 0.

EXPORT

Nothing is exported by this module into the calling namespace by default. You can choose to export the following constants:

INTEGRAL_CONSTANT

Or you can use the import tag ':all' to import all of them.

FIXED- VS. VARIABLE-SIZE BINS

This module implements histograms with both fixed and variable bin sizes. Fixed bin size means that all bins in the histogram have the same size. Implementation-wise, this means that finding a bin in the histogram, for example for filling, takes constant time (O(1)).

For variable-width histograms, each bin can have a different size. Finding a bin is implemented with a binary search, which has logarithmic run-time complexity in the number of bins (O(log n)).
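The two lookup strategies can be sketched in plain Perl. This is only an illustration of the idea, not the module's C implementation; the helper names find_bin_fixed and find_bin_variable are made up for this example.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Fixed-width lookup: one subtraction and one division, so O(1).
# Returns undef for coordinates outside [min, max).
sub find_bin_fixed {
    my ($x, $min, $max, $nbins) = @_;
    return undef if $x < $min or $x >= $max;
    my $ibin = floor(($x - $min) / (($max - $min) / $nbins));
    return $ibin > $nbins - 1 ? $nbins - 1 : $ibin;   # guard against rounding
}

# Variable-width lookup: binary search over the nbins + 1 boundaries, O(log n).
sub find_bin_variable {
    my ($x, $bounds) = @_;
    return undef if $x < $bounds->[0] or $x >= $bounds->[-1];
    my ($lo, $hi) = (0, $#$bounds);   # invariant: bounds[lo] <= x < bounds[hi]
    while ($hi - $lo > 1) {
        my $mid = int(($lo + $hi) / 2);
        if ($x >= $bounds->[$mid]) { $lo = $mid } else { $hi = $mid }
    }
    return $lo;
}

print find_bin_fixed(2.5, 0, 10, 10), "\n";                       # prints 2
print find_bin_variable(4.0, [1.5, 2.5, 4.0, 6.0, 8.5]), "\n";    # prints 2
```

Both helpers treat the lower boundary as part of the bin and the upper boundary as part of the next one, matching the convention stated in the DESCRIPTION.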

BASIC METHODS

new

Constructor, takes named arguments. In order to create a fixed bin size histogram, the following parameters are mandatory:

min

The lower boundary of the histogram.

max

The upper boundary of the histogram.

nbins

The number of bins in the histogram.

To create a histogram with variable bin sizes, provide only the bins parameter: a reference to an array of nbins + 1 bin boundaries. For example,

my $hist = Math::SimpleHisto::XS->new(
  bins => [1.5, 2.5, 4.0, 6.0, 8.5]
);

creates a histogram with four bins:

[1.5, 2.5)
[2.5, 4.0)
[4.0, 6.0)
[6.0, 8.5)

fill

Fill data into the histogram. Takes one or two arguments. The first must be the coordinate that determines where data is to be added to the histogram. The second is optional and can be a weight for the data to be added. It defaults to 1.

If the coordinate is a reference to an array, it is assumed to contain many data points that are to be filled into the histogram. In this case, if the weight is used, it must also be a reference to an array of weights.

fill_by_bin

Fills data into the histogram and works like fill(), but the first argument (the value(s)) must be bin numbers instead of coordinates.

min, max, nbins, width, highest_bin

Return static histogram attributes: minimum coordinate, maximum coordinate, number of bins, total width of the histogram, and the index of the highest bin in the histogram (which is just nbins - 1).

underflow, overflow

Return the accumulated contents of the under- and overflow bins, which cover the ranges (-inf, min) and [max, inf), respectively.

total

The total sum of weights that have been filled into the histogram, excluding under- and overflow.

nfills

The total number of fill operations (currently including fills that fill into under- and overflow, but this is subject to change).
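The bookkeeping described for fill, underflow, overflow, total and nfills can be modelled with a toy histogram held in a plain hash. This is a sketch of the documented semantics only: the hash layout and the name fill_sketch are invented for this example, and it assumes that each data point of an array fill counts as one fill operation.

```perl
use strict;
use warnings;

# Toy model (plain hash, fixed-width bins); not the module's C structure.
my %h = (min => 0, max => 4, nbins => 4, data => [0, 0, 0, 0],
         underflow => 0, overflow => 0, total => 0, nfills => 0);

sub fill_sketch {
    my ($h, $x, $w) = @_;
    # Accept scalars or array references, as fill() is documented to do.
    my @xs = ref $x ? @$x : ($x);
    my @ws = ref $w ? @$w : map { defined $w ? $w : 1 } @xs;
    for my $i (0 .. $#xs) {
        $h->{nfills}++;                       # assumed: counts every data point
        if    ($xs[$i] <  $h->{min}) { $h->{underflow} += $ws[$i] }
        elsif ($xs[$i] >= $h->{max}) { $h->{overflow}  += $ws[$i] }
        else {
            my $width = ($h->{max} - $h->{min}) / $h->{nbins};
            $h->{data}[ int(($xs[$i] - $h->{min}) / $width) ] += $ws[$i];
            $h->{total} += $ws[$i];           # excludes under- and overflow
        }
    }
}

fill_sketch(\%h, 1.5);                                   # weight defaults to 1
fill_sketch(\%h, [-1, 0, 4, 3.9], [2, 2, 2, 2]);         # coordinates and weights
print "@{ $h{data} } | under=$h{underflow} over=$h{overflow} ",
      "total=$h{total} nfills=$h{nfills}\n";
# prints "2 1 0 2 | under=2 over=2 total=5 nfills=5"
```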

BIN ACCESS METHODS

binsize

Returns the size of a bin. For histograms with variable width bin sizes, the size of the bin with the provided index is returned (defaults to the first bin). Example:

$hist->binsize(12);

Returns the size of the 13th bin.

all_bin_contents, bin_content

$hist->all_bin_contents() returns the contents of all histogram bins as a reference to an array. This is not the internal storage but a copy.

$hist->bin_content($ibin) returns the content of a single bin.

bin_centers, bin_center

$hist->bin_centers() returns a reference to an array containing the coordinates of all bin centers.

$hist->bin_center($ibin) returns the coordinate of the center of a single bin.

bin_lower_boundaries, bin_lower_boundary

Same as bin_centers and bin_center respectively, but for the lower boundary coordinate(s) of the bin(s). Note that this lower boundary is considered part of the bin.

bin_upper_boundaries, bin_upper_boundary

Same as bin_centers and bin_center respectively, but for the upper boundary coordinate(s) of the bin(s). Note that this upper boundary is not considered part of the bin.
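For the variable-width example from the constructor section, the three kinds of coordinates relate as in the following plain-Perl sketch. It assumes that a bin center is the midpoint between the lower and upper boundary; the variable names are illustrative.

```perl
use strict;
use warnings;

# The nbins + 1 boundaries of the four-bin example histogram.
my @bounds = (1.5, 2.5, 4.0, 6.0, 8.5);

my @lower   = @bounds[0 .. $#bounds - 1];     # part of the bin
my @upper   = @bounds[1 .. $#bounds];         # part of the next bin or overflow
my @centers = map { ($lower[$_] + $upper[$_]) / 2 } 0 .. $#lower;

print "lower:   @lower\n";     # prints "lower:   1.5 2.5 4 6"
print "upper:   @upper\n";     # prints "upper:   2.5 4 6 8.5"
print "centers: @centers\n";   # prints "centers: 2 3.25 5 7.25"
```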

find_bin

$hist->find_bin($x) returns the bin number of the bin in which the given coordinate falls. Returns undef if the coordinate is outside the histogram range.

SETTERS

set_bin_content

$hist->set_bin_content($ibin, $content) sets the content of a single bin.

set_underflow, set_overflow

$hist->set_underflow($content) sets the content of the underflow bin. set_overflow does the same for the overflow bin.

set_nfills

$hist->set_nfills($n) sets the number of fills.

set_all_bin_contents

Given a reference to an array containing numbers, sets the contents of each bin in the histogram to the number in the respective array element. The number of elements must match the number of bins in the histogram.

CLONING

clone, new_alike

$hist->clone() clones the object entirely.

$hist->new_alike() clones the parameters of the object, but resets the contents of the clone.

new_from_bin_range, new_alike_from_bin_range

$hist->new_from_bin_range($first_bin, $last_bin) creates a copy of the histogram including all bins from $first_bin to $last_bin. For example, $hist->new_from_bin_range(50, 199) would create a new histogram with 150 bins (the range is inclusive!) and copy the respective data from the original histogram. All bin contents outside the range will be added to the under- or overflow respectively. Specifying a last bin above the highest bin number of the source histogram yields a new histogram running up to the highest bin of the source.

$hist->new_alike_from_bin_range($first_bin, $last_bin) does the same, but resets all contents (like new_alike).
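The range semantics can be illustrated on plain data. The helper bin_range_sketch below is hypothetical and only models the behaviour described above; it assumes that the existing under- and overflow of the source are carried over before the cut-off bins are added.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Hypothetical helper modelling the documented semantics on plain data.
sub bin_range_sketch {
    my ($data, $under, $over, $first, $last) = @_;
    $last = $#$data if $last > $#$data;        # clamp to the highest bin
    return {
        data      => [ @{$data}[$first .. $last] ],          # inclusive range
        underflow => $under + sum0(@{$data}[0 .. $first - 1]),
        overflow  => $over  + sum0(@{$data}[$last + 1 .. $#$data]),
    };
}

my $new = bin_range_sketch([1, 2, 3, 4, 5], 10, 20, 1, 3);
print "@{ $new->{data} } under=$new->{underflow} over=$new->{overflow}\n";
# prints "2 3 4 under=11 over=25"
```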

CALCULATIONS

rebin

Given a rebinning factor, clones the current histogram and modifies it to have $rebin_factor times fewer bins. You can only rebin by factors that divide the number of bins of the input histogram.

For example, you can rebin a histogram with 200 bins by a factor of 10. This results in a histogram with 20 bins. You cannot rebin the same histogram by a factor of 7 because 7 does not divide 200 without remainder.
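On the level of bin contents, rebinning amounts to summing groups of adjacent bins. A minimal plain-Perl sketch follows; rebin_sketch is an illustrative name, and the snippet works on an array reference rather than a histogram object.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub rebin_sketch {
    my ($data, $factor) = @_;
    die "factor $factor does not divide " . scalar(@$data) . " bins\n"
        if @$data % $factor;
    # Each new bin is the sum of $factor adjacent old bins.
    return [ map { sum(@{$data}[$_ * $factor .. ($_ + 1) * $factor - 1]) }
             0 .. @$data / $factor - 1 ];
}

print "@{ rebin_sketch([1, 2, 3, 4, 5, 6], 2) }\n";   # prints "3 7 11"
```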

add_histogram

Given another histogram object, this method will add the content of that object to the invocant's content. This works only if the binning of the histograms is exactly the same. Throws an exception if that is not the case.

subtract_histogram

Given another histogram object, this method will subtract the content of that object from the invocant's content. This works only if the binning of the histograms is exactly the same. Throws an exception if that is not the case.
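The binning check and the bin-wise arithmetic behind add_histogram and subtract_histogram can be sketched with toy histograms stored as hashes. The hash layout and the name add_histogram_sketch are made up for this illustration, and only fixed-width binning is compared.

```perl
use strict;
use warnings;

# Toy histograms as hashes: { min, max, data => [...] }.
sub add_histogram_sketch {
    my ($self, $other, $sign) = @_;
    $sign = 1 unless defined $sign;            # pass -1 to subtract
    die "binning mismatch\n"
        unless $self->{min} == $other->{min}
           and $self->{max} == $other->{max}
           and @{ $self->{data} } == @{ $other->{data} };
    $self->{data}[$_] += $sign * $other->{data}[$_] for 0 .. $#{ $self->{data} };
    return $self;
}

my $h1 = { min => 0, max => 3, data => [1, 2, 3] };
my $h2 = { min => 0, max => 3, data => [1, 1, 1] };
add_histogram_sketch($h1, $h2);
print "@{ $h1->{data} }\n";                    # prints "2 3 4"
```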

integral

Returns the integral over the histogram. Very limited at this point. Usage:

my $integral = $hist->integral($from, $to, TYPE);

Where $from and $to are the integration limits and the optional TYPE is a constant indicating the method to use for integration. Currently, only INTEGRAL_CONSTANT is implemented (and assumed as the default). This means that the bins will be treated as rectangles, but fractional bins are treated correctly.

If the integration limits are outside the histogram boundaries, no warning is issued; the integration is silently performed within the range of the histogram.
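The treatment of fractional bins and out-of-range limits might look like the following plain-Perl sketch for fixed-width bins. It assumes that the integral sums bin contents and counts a partially covered bin in proportion to the covered fraction; whether the module additionally scales by the bin width is not stated here, so treat this purely as an illustration. The name integral_sketch is hypothetical.

```perl
use strict;
use warnings;

# Rectangle integration over fixed-width bins with fractional edge bins.
sub integral_sketch {
    my ($data, $min, $max, $from, $to) = @_;
    $from = $min if $from < $min;              # silently clamp to the range
    $to   = $max if $to   > $max;
    my $width = ($max - $min) / @$data;
    my $sum   = 0;
    for my $i (0 .. $#$data) {
        my ($lo, $hi) = ($min + $i * $width, $min + ($i + 1) * $width);
        my $overlap = ($to < $hi ? $to : $hi) - ($from > $lo ? $from : $lo);
        # Assumption: a bin contributes its content times the covered fraction.
        $sum += $data->[$i] * $overlap / $width if $overlap > 0;
    }
    return $sum;
}

print integral_sketch([4, 8, 2], 0, 3, 0.5, 2), "\n";   # prints 10
```

Here half of the first bin (2) and all of the second bin (8) fall inside the limits, which gives 10.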

mean

Calculates the mean of the histogram contents.

Note that the result is not usually the same as if you calculated the mean of the input data directly due to the effect of the binning.

standard_deviation

Calculates the standard deviation of the histogram contents.

Note that the result is not usually the same as if you calculated the std. dev. of the input data directly due to the effect of the binning.

The first parameter may be the previously calculated mean to avoid recalculating it. If not provided, it will be calculated on the fly.
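The binning effect mentioned for mean and standard_deviation follows from the fact that only bin centers and bin contents are available. A plain-Perl sketch, assuming the content-weighted formulas over bin centers with the population form of the variance (the module may use a different estimator); mean_sketch and std_dev_sketch are illustrative names:

```perl
use strict;
use warnings;

# Weighted mean computed from bin centers and bin contents.
sub mean_sketch {
    my ($centers, $data) = @_;
    my ($sum_w, $sum_wx) = (0, 0);
    for my $i (0 .. $#$data) {
        $sum_w  += $data->[$i];
        $sum_wx += $data->[$i] * $centers->[$i];
    }
    return $sum_wx / $sum_w;
}

# Weighted standard deviation; reuses a precomputed mean if one is passed in.
sub std_dev_sketch {
    my ($centers, $data, $mean) = @_;
    $mean = mean_sketch($centers, $data) unless defined $mean;
    my ($sum_w, $sum_sq) = (0, 0);
    for my $i (0 .. $#$data) {
        $sum_w  += $data->[$i];
        $sum_sq += $data->[$i] * ($centers->[$i] - $mean) ** 2;
    }
    return sqrt($sum_sq / $sum_w);
}

my @centers = (0.5, 1.5, 2.5);
my @data    = (1, 2, 1);
my $mean    = mean_sketch(\@centers, \@data);            # 1.5
my $sigma   = std_dev_sketch(\@centers, \@data, $mean);  # sqrt(0.5)
print "mean=$mean sigma=$sigma\n";
```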

median

Calculates and returns the estimated median of the data in the histogram. Achieves sub-bin-size resolution by estimating the median position within the bin from the sum of data below and above the median bin.

The estimation is necessary since the true median requires the original data.
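One plausible way to obtain sub-bin resolution is linear interpolation inside the bin that contains the half-total. The sketch below does this for fixed-width bins; it is not necessarily the estimator the module uses, and median_sketch is a made-up name.

```perl
use strict;
use warnings;

# Linear interpolation inside the median bin (fixed-width bins).
sub median_sketch {
    my ($data, $min, $width) = @_;
    my $total = 0;
    $total += $_ for @$data;
    my $below = 0;                       # sum of contents below bin $i
    for my $i (0 .. $#$data) {
        if ($data->[$i] > 0 and $below + $data->[$i] >= $total / 2) {
            # Fraction of the bin needed to reach half of the total.
            my $fraction = ($total / 2 - $below) / $data->[$i];
            return $min + ($i + $fraction) * $width;
        }
        $below += $data->[$i];
    }
    return undef;                        # empty histogram
}

print median_sketch([1, 2, 1], 0, 1), "\n";   # prints 1.5
```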

median_absolute_deviation

WARNING: this is apparently still crashy when facing weird data!

Calculates and returns an estimate of the median absolute deviation (MAD) of the histogram. This is a fairly expensive operation.

Optionally, as an optimization, you can pass in the previously calculated median estimate of the histogram to prevent it from having to be recalculated. Make sure you pass in the correct value or the behaviour of this method is undefined and might even crash your perl!

normalize

$hist->normalize($total) normalizes the histogram to the given $total. The normalization defaults to 1.

cumulative

Calculates the cumulative histogram of the invocant histogram and returns it as a new histogram object.

The cumulative (if done in Perl) is:

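use List::Util qw(sum);
# @original_content holds the bin contents of the source histogram
for my $i (0..$#original_content) {
  $content[$i] = sum(map $original_content[$_], 0..$i);
}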

As a convenience, if a numeric argument is passed to the method, the OUTPUT histogram will be normalized using that number BEFORE calculating the cumulation. This means that

my $cumu = $histo->cumulative(1.);

gives a cumulative histogram where the last bin contains exactly 1.
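The effect of normalizing before accumulating can be checked with a running-sum sketch in plain Perl. cumulative_sketch is an illustrative helper that works on an array reference; it does not guard against an all-zero histogram.

```perl
use strict;
use warnings;

sub cumulative_sketch {
    my ($data, $norm) = @_;
    my $total = 0;
    $total += $_ for @$data;
    # Normalize first (if requested), then accumulate in a single pass.
    my $scale = defined $norm ? $norm / $total : 1;
    my ($running, @cumulative) = (0);
    for my $content (@$data) {
        $running += $content * $scale;
        push @cumulative, $running;
    }
    return \@cumulative;
}

print "@{ cumulative_sketch([1, 1, 2]) }\n";      # prints "1 2 4"
print "@{ cumulative_sketch([1, 1, 2], 1) }\n";   # prints "0.25 0.5 1"
```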

multiply_constant

Scales all bin contents, as well as over- and underflow by the given constant.

RANDOM NUMBERS

This module comes with a Mersenne twister-based Random Number Generator that follows that in the Math::Random::MT module. It is available in the Math::SimpleHisto::XS::RNG class. You can create a new RNG by passing one or more integers to the Math::SimpleHisto::XS::RNG->new(...) method. The object's rand() method works like the normal Perl rand($x) function.

You can use a histogram as a source for random numbers that follow the distribution of the histogram.

push @random_like_hist, $hist->rand() for 1..100000;

If you pass a Math::SimpleHisto::XS::RNG object to the call to rand(), that random number generator will be used.

rand

Optionally given a Math::SimpleHisto::XS::RNG object (a random number generator), this returns a random number that is drawn from the distribution of the histogram.
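Drawing numbers that follow a histogram is commonly done with inverse-CDF sampling. The sketch below uses Perl's built-in rand() and fixed-width bins; it is an assumption about the general technique, not a description of the module's Mersenne twister-based implementation, and rand_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Inverse-CDF sampling from fixed-width bins using Perl's built-in rand().
sub rand_sketch {
    my ($data, $min, $width) = @_;
    my $total = 0;
    $total += $_ for @$data;
    my $target = rand($total);            # uniform in [0, total)
    for my $i (0 .. $#$data) {
        if ($target < $data->[$i]) {
            # Place the draw uniformly within the chosen bin.
            return $min + ($i + $target / $data->[$i]) * $width;
        }
        $target -= $data->[$i];
    }
    return $min + @$data * $width;        # only reachable through rounding
}

# Bins [0,1), [1,2), [2,3) with contents 0, 3, 1: no draw can fall below 1.
my @draws = map { rand_sketch([0, 3, 1], 0, 1) } 1 .. 1000;
print scalar(grep { $_ < 2 } @draws), " of 1000 draws fell into [1, 2)\n";
```

About three quarters of the draws should land in the middle bin, since it holds three quarters of the total content.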

SERIALIZATION

This class defines serialization hooks for the Storable module. Therefore, you can simply serialize objects using the usual

use Storable;
my $string = Storable::nfreeze($histogram);
# ... later ...
my $histo_object = Storable::thaw($string);

Currently, this mechanism hardcodes the use of the simple dump format. This is subject to change!

Serialization and Compatibility

If at all possible, the de-serialization routine new_from_dump will be maintained in such a way that it can deserialize dumps of histograms that were made with earlier versions of this module. If a new version of this module cannot achieve this, that will be mentioned prominently in the change log.

The other way around, serialized histograms are not generally backwards-compatible across major versions. That means you cannot deserialize a dump made with version 1.01 of this module using version 0.05. Such backwards-incompatible changes will always be accompanied with major version number changes (0.X => 1.X, 1.X => 2.X...).

Serialization Formats

The serialization formats that this module supports (see the dump documentation below) each have pros and cons. For example, the native_pack format is by far the fastest, but is not portable. The simple format is a very simple-minded text format, but it is portable and performs well (comparable to the JSON format when using JSON::XS; other JSON modules will be MUCH slower). Of all formats, the YAML format is the slowest. See xt/bench_dumping.pl for a simple benchmark script.

None of the serialization formats currently supports compression, but the native_pack format produces the smallest output at about half the size of the JSON output. The simple format is close to JSON for all but the smallest histograms, where it produces slightly smaller dumps. The YAML produced is a bit bigger than the JSON.

dump

This module has fairly simple serialization methods. Just call the dump method on an object of this class and provide the type of serialization desired. Currently valid serializations are simple, JSON, YAML, and native_pack. Case doesn't matter.

For YAML support, you need to have the YAML::Tiny module available. For JSON support, you need any of JSON::XS, JSON::PP, or JSON. The three modules are tried in order at compile time. The chosen implementation can be polled by looking at the $Math::SimpleHisto::XS::JSON_Implementation variable. It contains the module name. Setting this variable has no effect.

The simple serialization format is a home-grown text format that is subject to change, but in all likelihood, there will be some form of version migration code in the deserializer for backwards compatibility.

All of the serialization formats except for native_pack are text-based and thus portable and endianness-neutral.

native_pack should not be used when the serialized data is transferred to another machine.

new_from_dump

Given the type of the dump (simple, JSON, YAML, native_pack) and the actual dump string, creates a new histogram object from the contained data and returns it.

Deserializing JSON and YAML dumps requires the respective support modules to be available. See above.

SEE ALSO

SOOT is a dynamic wrapper around the ROOT C++ library which does histogramming and much more. Beware, it is experimental software.

Serialization can make use of the JSON::XS, JSON::PP, JSON or YAML::Tiny modules. You may want to use the convenient Storable module for transparent serialization of nested data structures containing objects of this class.

ACKNOWLEDGMENTS

This module contains some code written by Abhijit Menon-Sen, who wrote Math::Random::MT.

AUTHOR

Steffen Mueller, <smueller@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2011, 2012, 2013, 2014 by Steffen Mueller

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.1 or, at your option, any later version of Perl 5 you may have available.