NAME
Data::Freq - Collects data, counts frequency, and makes up a multi-level counting report
VERSION
Version 0.05
SYNOPSIS
use Data::Freq;
my $data = Data::Freq->new('date');
while (my $line = <STDIN>) {
$data->add($line);
}
$data->output();
DESCRIPTION
Data::Freq
is an object-oriented module to collect data from log files or any kind of data sources, count frequency of particular patterns, and generate a counting report.
See also the command-line tool data-freq.
The simplest usage is to count lines of a log files in terms of a particular category such as date, username, remote address, and so on.
For more advanced usage, Data::Freq
is capable of aggregating counting results at multiple levels. For example, lines of a log file can be grouped into months first, and then under each of the months, they can be further grouped into individual days, where all the frequency of both months and days is summed up consistently.
Analyzing an Apache access log
The example below is a copy from the "SYNOPSIS" section.
my $data = Data::Freq->new('date');
while (my $line = <STDIN>) {
$data->add($line);
}
$data->output();
It will generate a report that looks something like this:
123: 2012-01-01
456: 2012-01-02
789: 2012-01-03
...
where the left column shows the number of occurrences of each date.
The date/time value is automatically extracted from the log line, where the first field enclosed by a pair of brackets [...]
is parsed as a date/time text by the Date::Parse::str2time()
function. (See Date::Parse.)
See also "logsplit" in Data::Freq::Record.
Multi-level counting
The initialization parameters for the new() method can be customized for a multi-level analysis.
If the field specifications are given, e.g.
Data::Freq->new(
{type => 'date'}, # field spec for level 1
{type => 'text', pos => 2}, # field spec for level 2
);
# assuming the position 2 (third portion, 0-based)
# is the remote username.
then the output will look like this:
123: 2012-01-01
100: user1
20: user2
3: user3
456: 2012-01-02
400: user1
50: user2
6: user3
...
Below is another example along this line:
Data::Freq->new('month', 'day');
# Level 1: 'month'
# Level 2: 'day'
with the output:
12300: 2012-01
123: 2012-01-01
456: 2012-01-02
789: 2012-01-03
...
45600: 2012-02
456: 2012-02-01
789: 2012-02-02
...
See "field specification" for more details about the initialization parameters.
Custom input
The data source is not restricted to log files. For example, a CSV file can be analyzed as below:
my $data = Data::Freq->new({pos => 0}, {pos => 1});
# or more simply, Data::Freq->new(0, 1);
open(my $csv, 'source.csv');
while (<$csv>) {
$data->add([split /,/]);
}
Note: the add() method accepts an array ref, so that the input does not have to be split by the default "logsplit" in Data::Freq::Record function.
For more generic input data, a hash ref can also be given to the add() method.
E.g.
my $data = Data::Freq->new({key => 'x'}, {key => 'y'});
# Note: keys *cannot* be abbrebiated like Data::Freq->new('x', 'y')
$data->add({x => 'foo', y => 'abc'});
$data->add({x => 'bar', y => 'def'});
$data->add({x => 'foo', y => 'ghi'});
$data->add({x => 'bar', y => 'jkl'});
...
In the field specifications, the value of pos
or key
can also be an array ref, where the multiple elements selected by the pos
or key
will be join
'ed by a space (or the value of $"
).
This is useful when a log format contains a date that is not enclosed by a pair of brackets [...]
.
E.g.
my $data = Data::Freq->new({type => 'date', pos => [0..3]});
# Log4x with %d{dd MMM yyyy HH:mm:ss,SSS}
$data->add("01 Jan 2012 01:02:03,456 INFO - test log\n");
# pos 0: "01"
# pos 1: "Jan"
# pos 2: "2012"
# pos 3: "01:02:03,456"
As a result, "01 Jan 2012 01:02:03,456" will be parsed as a date string.
Custom output
The output() method accepts different types of parameters as below:
A file handle or an instance of
IO::*
By default, the result is printed out to
STDOUT
. With this parameter given, it can be any other output destination.A callback subroutine ref
If a callback is specified, it will be invoked with a node object (Data::Freq::Node) passed as an argument. See "frequency tree" for more details about the tree structure.
Roughly, each node represents a counting result for each line in the default output format, in the depth-first order (i.e. the same order as the default output lines).
$data->output(sub { my $node = shift; print "Count: ", $node->count, "\n"; print "Value: ", $node->value, "\n"; print "Depth: ", $node->depth, "\n"; print "\n"; });
A hash ref of options to control output format
$data->output({ with_root => 0 , # also prints total (root node) transpose => 0 , # prints values before counts indent => ' ', # repeats (depth - 1) times separator => ': ' , # separates the count and the value prefix => '' , # prepended before the count no_padding => 0 , # disables padding for the count });
The format option can be specified together with a file handle.
$data->output(\*STDERR, {indent => "\t"});
The output does not include the grand total by default. If the with_root
option is set to a true value, the total count will be printed as the first line (level 0), and all the subsequent levels will be shifted to the right.
The transpose
option flips the order of the count and the value in each line. E.g.
2012-01: 12300
2012-01-01: 123
2012-01-02: 456
2012-01-03: 789
...
2012-02: 45600
2012-02-01: 456
2012-02-02: 789
...
The indent unit (repeated appropriate times) and the separator (between the count and the value) can be customized with the respective options, indent
and separator
.
The default output format has apparent ambiguity between the indent and the padding for alignment.
For example, consider the output below:
1200000: Level 1
900000: Level 2
900000: Level 3
5: Level 2
...
where the second "Level 2" appears to have a deeper indent than the "Level 3."
Although the positions of colons (:
) are consistently aligned, it may seem to be slightly inconsistent.
The indent depth will be clearer if a prefix
is added:
$data->output({prefix => '* '});
* 1200000: Level 1
* 900000: Level 2
* 900000: Level 3
* 5: Level 2
...
Alternatively, the no_padding
option can be set to a true value to disable the left padding.
$data->output({no_padding => 1});
1200000: Level 1
900000: Level 2
900000: Level 3
5: Level 2
...
Field specification
Each argument passed to the new() method is passed to the "new" in Data::Freq::Field method.
For example,
Data::Freq->new(
'month',
'day',
);
is equivalent to
Data::Freq->new(
Data::Freq::Field->new('month'),
Data::Freq::Field->new('day'),
);
and because of the way the argument is interpreted by the Data::Freq::Field class, it is also equivalent to
Data::Freq->new(
Data::Freq::Field->new({type => 'month'}),
Data::Freq::Field->new({type => 'day'}),
);
type => { 'text' | 'number' | 'date' }
The basic data types are
'text'
,'number'
, and'date'
, which determine how each input data is normalized for the frequency counting, and how the results are sorted.The
'date'
type can also be written as the format string forPOSIX::strftime()
function. (See POSIX.)Data::Freq->new('%Y-%m'); Data::Freq->new({type => '%H'});
If the type is simply specified as
'date'
, the format defaults to'%Y-%m-%d'
.In addition, the keywords below can be used as synonims:
'year' : equivalent to '%Y' 'month' : equivalent to '%Y-%m' 'day' : equivalent to '%Y-%m-%d' 'hour' : equivalent to '%Y-%m-%d %H' 'minute': equivalent to '%Y-%m-%d %H:%M' 'second': equivalent to '%Y-%m-%d %H:%M:%S'
aggregate => { 'unique' | 'max' | 'min' | 'average' }
The
aggregate
parameter alters how eachcount
is calculated, where the defaultcount
is equal to the sum of all thecount
's for its child nodes.'unique' : the number of distinct child values 'max' : the maximum count of the child nodes 'min' : the minimum count of the child nodes 'average': the average count of the child nodes
sort => { 'value' | 'count' | 'first' | 'last' }
The
sort
parameter is used as the key by which the group of records will be sorted for the output.'value': sort by the normalized value 'count': sort by the frequency count 'first': sort by the first occurrence in the input 'last' : sort by the last occurrence in the input
order => { 'asc' | 'desc' }
The
order
parameter controls the sorting in the either ascending or descending order.pos => { 0, 1, 2, -1, -2, ... }
If the
pos
parameter is given or an integer value (or a list of integers) is given without a parameter name, the value whose frequency is counted will be selected at the indices from an array ref input or a text split by the logsplit() function.key => { any key(s) for input hash refs }
If the
pos
parameter is given, it is assumed that the input is a hash ref, where the value whose frequency is counted will be selected by the specified key(s).convert => sub {...}
If the
convert
parameter is set to a subroutine ref, it is invoked to convert the value to a normalized form for frequency counting.The subroutine is expected to take one string argument and return a converted string.
If the type
parameter is either text
or number
, the results are sorted by count
in the descending order by default (i.e. the most frequent value first).
For the date
type, the sort
parameter defaults to value
, and the order
parameter defaults to asc
(i.e. the time-line order).
Frequency tree
Once all the data have been collected with the add() method, a frequency tree
has been constructed internally.
Suppose the Data::Freq
instance is initialized with the two fields as below:
my $field1 = Data::Freq::Field->new({type => 'month'});
my $field2 = Data::Freq::Field->new({type => 'text', pos => 2});
my $data = Data::Freq->new($field1, $field2);
...
a result tree that looks like below will be constructed as each data record is added:
Depth 0 Depth 1 Depth 2
$field1 $field2
{432: root}--+--{123: "2012-01"}--+--{10: "user1"}
| +--{ 8: "user2"}
| +--{ 7: "user3"}
| ...
+--{135: "2012-02"}--+--{11: "user3"}
| +--{ 9: "user2"}
| ...
...
In the diagram, a node is represented by a pair of braces {...}
, and each integer value is the total number of occurrences of the node value, under its parent category.
The root node maintains the grand total of records that have been added.
The tree structure can be recursively visited by the traverse() method.
Below is an example to generate a HTML:
print qq(<ul>\n);
$data->traverse(sub {
my ($node, $children, $recurse) = @_;
my ($count, $value) = ($node->count, $node->value);
# HTML-escape $value if necessary
print qq(<li>$count: $value);
if (@$children > 0) {
print qq(\n<ul>\n);
for my $child (@$children) {
$recurse->($child); # invoke recursion
}
print qq(</ul>\n);
}
print qq(</li>\n);
});
print qq(</ul>\n);
METHODS
new
Usage:
Data::Freq->new($field1, $field2, ...);
Constructs a Data::Freq
object.
The arguments $field1
, $field2
, etc. are instances of Data::Freq::Field, or any valid arguments that can be passed to "new" in Data::Freq::Field.
The actual data to be analyzed need to be added by the add() method one by one.
The Data::Freq
object maintains the counting results, based on the specified fields. The first field ($field1
) is used to group the added data into the major category. The next subsequent field ($field2
) is for the sub-category under each major group. Any more subsequent fields are interpreted recursively as sub-sub-category, etc.
If no fields are given to the new() method, one field of the text
type will be assumed.
add
Usage:
$data->add("A record");
$data->add("A log line text\n");
$data->add(['Already', 'split', 'data']);
$data->add({key1 => 'data1', key2 => 'data2', ...});
Adds a record that increments the counting by 1.
The interpretation of the input depends on the type of fields specified in the new() method. See "evaluate_record" in Data::Freq::Field.
output
Usage:
# I/O
$data->output(); # print results (default format)
$data->output(\*OUT); # print results to open handle
$data->output($io); # print results to IO::* object
# Callback
$data->output(sub {
my $node = shift;
# $node is a Data::Freq::Node instance
});
# Options
$data->output({
with_root => 0 , # if true, prints total at root
transpose => 0 , # if true, prints values before counts
indent => ' ', # repeats (depth - 1) times
separator => ': ', # separates the count and the value
prefix => '' , # prepended before the count
no_padding => 0 , # if true, disables padding for the count
});
# Combination
$data->output(\*STDERR, {opt => ...});
$data->output($open_fh, {opt => ...});
Generates a report of the counting results.
If no arguments are given, default format results are printed out to STDOUT
. Any open handle or an instance of IO::*
can be passed as the output destination.
If the argument is a subroutine ref, it is regarded as a callback that will be called for each node of the frequency tree in the depth-first order. (See "frequency tree" for details.)
The following arguments are passed to the callback:
$node: Data::Freq::Node
The current node (Data::Freq::Node)
$children: [$child_node1, $child_node2, ...]
An array ref to the list of child nodes, sorted based on the field
Note:
$node->children
is a hash ref (unsorted) of a raw counting data.
traverse
Usage:
$data->traverse(sub {
my ($node, $children, $recurse) = @_;
# Do something with $node before its child nodes
# $children is a sorted list of child nodes,
# based on the field specification
for my $child (@$children) {
$recurse->($child); # invoke recursion
}
# Do something with $node after its child nodes
});
Provides a way to traverse the result tree with more control than the output() method.
A callback must be passed as an argument, and will ba called with the following arguments:
$node: Data::Freq::Node
The current node (Data::Freq::Node)
$children: [$child_node1, $child_node2, ...]
An array ref to the list of child nodes, sorted based on the field
Note:
$node->children
is a hash ref (unsorted) of a raw counting data.$recurse: sub ($a_child_node)
A subroutine ref, with which the resursion is invoked at a desired time
When the traverse() method is called, the root node is passed as the $node
parameter first. Until the $recurse
subroutine is explicitly invoked for the child nodes, no recursion will be invoked automatically.
root
Returns the root node of the frequency tree. (See "frequency tree" for details.)
The root node is created during the new() method call, and maintains the total number of added records and a reference to its direct child nodes for the first field.
fields
Returns the array ref to the list of fields (Data::Freq::Field).
The returned array should not be modified.
AUTHOR
Mahiro Ando, <mahiro at cpan.org>
BUGS
Please report any bugs or feature requests to bug-data-freq at rt.cpan.org
, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Data-Freq. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Data::Freq
You can also look for information at:
GitHub repository (report bugs here)
RT: CPAN's request tracker (report bugs here, alternatively)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2012 Mahiro Ando.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.