NAME

Catmandu::Exporter::Stat - a statistical export

SYNOPSIS

# Calculate statistics on the availability of the ISBN fields in the dataset
cat data.json | catmandu convert -v JSON to Stat --fields isbn

# Export the statistics as YAML
cat data.json | catmandu convert -v JSON to Stat --fields isbn --as YAML

DESCRIPTION

The Catmandu::Stat package can be used to calculate statistics on the availability of fields in a data file. Use this exporter to count how often fields occur or to count the number of duplicate values. For each field, the exporter calculates the following statistics:

* name    : the name of a field
* count   : the number of occurrences of a field in all records
* zeros   : the number of records without a field
* zeros%  : the percentage of records without a field
* min     : the minimum number of occurrences of a field in any record
* max     : the maximum number of occurrences of a field in any record
* mean    : the mean number of occurrences of a field in all records
* variance : the variance of the number of occurrences of a field per record
* stdev   : the standard deviation of the number of occurrences of a field per record
* uniq~   : the estimated number of unique values
* uniq%   : the estimated percentage of unique values
* entropy : the minimum and maximum entropy in the field values (estimated value)

Details:

* entropy is an indication of the variation in field values (do some values occur more often than others)
* entropy values are displayed as : minimum/maximum entropy
* when the minimum entropy = 0, then all the field values are equal
* when the minimum and maximum entropy are equal, then all the field values are different
* the 'uniq%' and 'entropy' fields are estimated and are normally within 1% of the
  correct value (this is done to keep the memory requirements of this module low)
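The relationship between the two entropy numbers can be illustrated with a small Python sketch. It assumes that the first number is the Shannon entropy (in bits) of the observed value distribution and that the maximum is log2 of the number of values; both are assumptions made for illustration, and the function names are hypothetical. The exporter itself works with estimates, so its output may differ slightly from these exact values.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Exact Shannon entropy (in bits) of the observed value distribution."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    # p * log2(1/p) summed over all distinct values
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def max_entropy(values):
    """Assumed upper bound: the entropy when every value is different."""
    return math.log2(len(values)) if values else 0.0

same = ["a", "a", "a", "a"]       # all values equal
distinct = ["a", "b", "c", "d"]   # all values different

print(shannon_entropy(same), max_entropy(same))
print(shannon_entropy(distinct), max_entropy(distinct))
```

With all values equal the entropy is 0; with all values different the entropy reaches the maximum, which matches the two rules listed above.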

Each statistical report contains one row named '#' (hash), which contains the total number of records.
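As a rough illustration of how the per-field numbers relate to each other, the following Python sketch derives them from a list holding the number of occurrences of one field in each record. The function name `summarize` is hypothetical, and the use of the sample variance (divisor n - 1) is an assumption; the exporter's actual implementation may differ.

```python
import math
import statistics

def summarize(name, occurrences):
    """Summarize per-record occurrence counts of one field.

    `occurrences` holds one integer per record: how often the field
    occurs in that record (0 when the field is absent).
    """
    records = len(occurrences)
    zeros = sum(1 for n in occurrences if n == 0)
    # Assumption: sample variance; undefined for fewer than two records.
    variance = statistics.variance(occurrences) if records > 1 else 0.0
    return {
        "name": name,
        "count": sum(occurrences),
        "zeros": zeros,
        "zeros%": 100.0 * zeros / records if records else 0.0,
        "min": min(occurrences) if records else 0,
        "max": max(occurrences) if records else 0,
        "mean": statistics.mean(occurrences) if records else 0.0,
        "variance": variance,
        "stdev": math.sqrt(variance),
    }

# Four records: 'author' occurs 3, 0, 0 and 1 times.
stats = summarize("author", [3, 0, 0, 1])
for key, value in stats.items():
    print(f"{key:9s}: {value}")
```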

CONFIGURATION

v

Verbose output. Show the processing speed.

fix FIX

A fix or a fix file containing one or more fixes applied to the input data before the statistics are calculated.

fields KEY[,KEY,...]

One or more fields in the data for which statistics need to be calculated. Deeply nested fields are not allowed. The exporter will collect statistics on the availability of a field in all records. For instance, the following record contains one 'title' field, zero 'isbn' fields and three 'author' fields:

---
title: ABCDEF
author:
    - Davis, Miles
    - Parker, Charly
    - Mingus, Charles
year: 1950
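The counting rule for this record can be sketched in Python as follows. The helper `occurrences` is hypothetical and only mirrors the rule described above: an absent field counts as zero, a list counts once per element, and any other value counts as one.

```python
# The example record from above, written as a Python dictionary.
record = {
    "title": "ABCDEF",
    "author": ["Davis, Miles", "Parker, Charly", "Mingus, Charles"],
    "year": 1950,
}

def occurrences(record, field):
    """Number of times a top-level field occurs in one record."""
    if field not in record:
        return 0
    value = record[field]
    # A list holds one occurrence per element; a scalar is one occurrence.
    return len(value) if isinstance(value, list) else 1

counts = {field: occurrences(record, field) for field in ("title", "isbn", "author")}
print(counts)  # {'title': 1, 'isbn': 0, 'author': 3}
```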

Examples of operation:

# Calculate statistics on the number of records that contain a 'title'
cat data.json | catmandu convert JSON to Stat --fields title

# Calculate statistics on the number of records that contain a 'title', 'isbn' or 'subject' field
cat data.json | catmandu convert JSON to Stat --fields title,isbn,subject

# The next example will not work: no deeply nested fields allowed
cat data.json | catmandu convert JSON to Stat --fields foo.bar.x.y

When no fields parameter is given, the field names are taken from the first input record.

as Table | CSV | YAML | JSON | ...

By default the statistics are exported in a Table format. Use the 'as' option to change the export format.

topk NUMBER

To calculate the entropy, an estimate of the probability distribution of the data set needs to be calculated. Topk is the expected lower bound on the number of field values that have repeated entries. By default it is set to 100. If more field values have duplicates, then this number needs to be increased.
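One way to judge whether the default of 100 is large enough is to count, on a sample of the data, how many distinct values occur more than once. The Python sketch below does this exactly with a counter; the function name and the sample values are hypothetical, and this is not how the exporter itself tracks repeated values.

```python
from collections import Counter

def repeated_value_count(values):
    """Number of distinct values that occur more than once."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c > 1)

# Hypothetical sample of 'isbn' values: "111" and "222" are repeated.
isbns = ["111", "222", "111", "333", "222", "444"]
print(repeated_value_count(isbns))  # 2
```

If the count for a representative sample approaches or exceeds the configured topk, a higher topk value is probably needed.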

hll NUMBER

This is the Algorithm::HyperLogLog register parameter used to estimate the cardinality (number of unique values) of a data set. The value, which should be between 4 and 16, determines the precision of the estimate. The bigger the number, the better the precision, but also the more memory will be used. Default: 14.
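The trade-off between precision and memory can be estimated with the standard HyperLogLog rule of thumb: with p register bits the sketch uses 2^p registers and has a relative standard error of about 1.04 / sqrt(2^p). The Python snippet below only evaluates this formula; the function name is hypothetical, and the actual error of Algorithm::HyperLogLog on a given data set may differ.

```python
import math

def hll_profile(register_bits):
    """Return (number of registers, approximate relative standard error)."""
    if not 4 <= register_bits <= 16:
        raise ValueError("register_bits should be between 4 and 16")
    registers = 2 ** register_bits
    relative_error = 1.04 / math.sqrt(registers)
    return registers, relative_error

for bits in (4, 10, 14, 16):
    registers, error = hll_profile(bits)
    print(f"hll={bits:2d}  registers={registers:6d}  expected error ~{error:.2%}")
```

For the default of 14 this gives 16384 registers and an expected error of roughly 0.8%, which is in line with the "within 1%" figure mentioned in the description.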

SEE ALSO

Catmandu::Exporter, Statistics::Descriptive, Statistics::TopK, Algorithm::HyperLogLog