NAME
data-freq - a text frequency analysis tool
SYNOPSIS
data-freq [options] [--] [files..]
OPTIONS
Field Type
-t | --text
-y | --year
-u | --number
-m | --month
-d | --date | --day
--hour
--minute
--second
+%m-%d, +%H, +%H:%M, etc. | --strftime=FMT
For multiple fields, each Field Type option begins a new field specification.
Field Selector
-p NUM | --pos=NUM
NUM can be zero, positive, or negative. Multiple values can be separated by commas (,) and/or given as a range with the .. operator.
Field Output
-n NUM | --limit=NUM
-z | --zero
-o NUM | --offset=NUM
NUM can be zero, positive, or negative.
Field Aggregation
-U | --unique
-M | --max
-N | --min
-Y | --average
Field Sorting
-V | --value
-F | --first
-A | --asc
-S | --score
-L | --last
-D | --desc
Input Format
-b STR | --split=STR
Output Format
-I STR | --indent=STR
-R | --root
-P STR | --prefix=STR
-T | --transpose
-B STR | --separator=STR
-O | --nopadding
Help
-v | --version
-h | --help
-a | --man
-c | --check
EXAMPLES
Monthly view counts
Long:  data-freq --month < access_log
Short: data-freq -m < access_log
Monthly + Daily
Long:  data-freq --month --day < access_log
Short: data-freq -md < access_log
Monthly + Top 3 users per month
Long:  data-freq --month \
           --text --pos=2 --limit=3 \
           access_log
Short: data-freq -m -tp2 -n3 access_log
Top 5 days in the number of distinct users
Long:  data-freq --day --score --limit=5 \
           --text --pos=2 --unique --zero \
           access_log
Short: data-freq -dS -n5 -tp2 -Uz access_log
Hourly aggregation
Long:  data-freq --strftime %H
Short: data-freq +%H
DESCRIPTION
Overview
data-freq is a command-line tool for analyzing the frequency of particular types of text data. It is based on the corresponding Perl module Data::Freq.
For example, consider an input file:
Abc Def
Def Ghi
Ghi Jkl
Abc Def
Def Ghi
Abc Def
The command can be executed as below:
data-freq filename
(or)
data-freq < filename
Then the output will be
3: Abc Def
2: Def Ghi
1: Ghi Jkl
where the number on the left indicates how many times each line of text appears in the input.
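The counting described above can be pictured with a minimal Python sketch. It is only an illustration of the idea, not how data-freq is implemented; the function name line_frequencies is hypothetical.

```python
from collections import Counter

def line_frequencies(lines):
    """Count identical lines and return (count, text) pairs, most frequent first."""
    counts = Counter(line.rstrip("\n") for line in lines)
    # most_common() orders by descending count; ties keep first-seen order.
    return [(n, text) for text, n in counts.most_common()]

sample = ["Abc Def", "Def Ghi", "Ghi Jkl", "Abc Def", "Def Ghi", "Abc Def"]
for n, text in line_frequencies(sample):
    print(f"{n}: {text}")
```

With the sample input from the overview, this prints the same three lines as the output shown above.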
Log file analysis
This tool is designed especially for log file analysis.
A typical log file for the Apache web server consists of lines like this:
1.2.3.4 - user1 [01/Jan/2012:01:02:03 +0000] "GET / HTTP/1.1" 200 12
One of the simplest examples for such a log file is
data-freq --month /var/log/httpd/access_log
which will yield something like this:
12300: 2012-01
23400: 2012-02
34500: 2012-03
Note that the date/time information is automatically extracted from the first chunk of text that is enclosed by a pair of brackets [...].
If the access log file is very large, it is recommended to experiment on a part of the log until satisfactory options are found. E.g.
tail -1000 /var/log/httpd/access_log | \
data-freq --[several different options]
In order to select a specific field from the log line, use the --pos option:
# Count IP addresses
data-freq --pos=0 < access_log
(or)
data-freq -p0 < access_log
# Count remote usernames
data-freq --pos=2 < access_log
(or)
data-freq -p2 < access_log
The value given to the --pos option is a 0-based index into the array of words in each input line.
Multi-level analysis
data-freq is capable of aggregating frequency data at multiple levels.
E.g.
data-freq --month --day < access_log
(or)
data-freq -md < access_log
where --month is for the first level, and --day is for the second level.
The output will look something like this:
12300: 2012-01
    210: 2012-01-01
    321: 2012-01-02
    432: 2012-01-03
    ...
23400: 2012-02
    321: 2012-02-01
    432: 2012-02-02
    543: 2012-02-03
    ...
34500: 2012-03
    543: 2012-03-01
    654: 2012-03-02
    765: 2012-03-03
    ...
Below is another example to list top 3 users per month:
data-freq --month --text --pos=2 --limit=3 < access_log
(or)
data-freq -m -tp2 -n3 < access_log
Output:
12300: 2012-01
    1200: user1
     230: user2
     135: user3
23400: 2012-02
    2400: user1
    1122: user4
     765: user3
34500: 2012-03
    3600: user2
    2100: user3
    1350: user1
Note: the dates are sorted in chronological order, while the users are sorted by count.
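As a rough illustration of the two-level aggregation above, the Python sketch below groups already-extracted (month, user) pairs. It is a sketch under assumptions, not the tool's implementation: the function name two_level_counts and the sample records are hypothetical.

```python
from collections import Counter, defaultdict

def two_level_counts(records, limit=None):
    """records: iterable of (month, user) pairs already extracted from log lines."""
    per_month = defaultdict(Counter)
    for month, user in records:
        per_month[month][user] += 1
    report = []
    for month in sorted(per_month):          # first level: chronological order
        users = per_month[month]
        total = sum(users.values())          # parent score = sum of child counts
        top = users.most_common(limit)       # second level: most frequent first
        report.append((total, month, [(n, u) for u, n in top]))
    return report

records = [
    ("2012-01", "user1"), ("2012-01", "user1"), ("2012-01", "user2"),
    ("2012-02", "user3"), ("2012-02", "user1"), ("2012-02", "user3"),
]
for total, month, users in two_level_counts(records, limit=3):
    print(f"{total}: {month}")
    for n, user in users:
        print(f"    {n}: {user}")
```

The parent total is computed before the limit is applied, so limiting the users shown does not change the monthly count.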
Field types
There are three basic field types as below:
--text
Each line in the input is added as a text entry so that its frequency is counted.
If the --pos option is given, each line is split into chunks, and only the selected chunk(s) at the given position(s) are counted.
--number
The input is interpreted as numbers, which affects the sorting order in the output.
The --pos option should usually be given, but if it is omitted, the first chunk is used as the input number.
--date
The input is parsed as date/time and formatted based on the POSIX::strftime() format. (See POSIX.) The default format is %Y-%m-%d, which looks like 2001-02-03.
Unless the --pos option is explicitly given, the first field enclosed by a pair of brackets [...] in the input line is automatically parsed.
The date/time format can be specified with the --strftime option. Alternatively, a plus sign + followed by the format is interpreted as the --strftime option. E.g.
--strftime=%m-%d
(or)
+%m-%d
The options below can be used as shortcuts for the date/time format:
--year  : '%Y'
--month : '%Y-%m'
--day   : '%Y-%m-%d'
--hour  : '%Y-%m-%d %H'
--minute: '%Y-%m-%d %H:%M'
--second: '%Y-%m-%d %H:%M:%S'
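To make the shortcut formats concrete, here is a small Python sketch that formats the bracketed timestamp of an Apache-style log line. It assumes the timestamp layout of the sample log line shown earlier; the helper name bracketed_date is hypothetical, and data-freq's own date parser may accept other layouts.

```python
import re
from datetime import datetime

# Shortcut options and their strftime formats, as listed above.
SHORTCUTS = {
    "--year": "%Y",
    "--month": "%Y-%m",
    "--day": "%Y-%m-%d",
    "--hour": "%Y-%m-%d %H",
    "--minute": "%Y-%m-%d %H:%M",
    "--second": "%Y-%m-%d %H:%M:%S",
}

def bracketed_date(line, fmt):
    """Format the first [...] chunk of a log line with the given strftime format."""
    match = re.search(r"\[([^\]]*)\]", line)
    if match is None:
        return None
    # Assumption: the timestamp looks like 01/Jan/2012:01:02:03 +0000.
    stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    return stamp.strftime(fmt)

line = '1.2.3.4 - user1 [01/Jan/2012:01:02:03 +0000] "GET / HTTP/1.1" 200 12'
print(bracketed_date(line, SHORTCUTS["--month"]))
print(bracketed_date(line, "%m-%d"))
```

Counting the returned strings then gives the per-month (or per-day, per-hour) frequencies.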
To specify multiple fields, each field type option marks the beginning of a group of options that belong to the same field.
The default type is --text, and it can be omitted for the first field, but not for the second field onward.
data-freq --text --pos=2 # correct
data-freq --pos=2 # ok
data-freq --text --pos=2 --text --pos=0 # correct
data-freq --pos=2 --text --pos=0 # ok
data-freq --pos=2 --pos=0 # incorrect
Selecting fields
--pos
Selects a field at the given position in each input line. The position is a 0-based index (i.e. the first chunk is at position 0).
Multiple positions can be specified with comma-separated numbers or a range described by the .. operator.
data-freq --pos=2
data-freq --pos=1,2,5
data-freq --pos=0..3
For a field with the --pos option, the input line is split into chunks by whitespace (unless the --split option is explicitly given), while any chunk enclosed by a pair of parentheses (...), brackets [...], braces {...}, double quotes "...", or single quotes '...' is grouped as one field, even if it contains whitespace.
Nested parentheses, brackets, and braces are not supported.
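The chunking rule above can be approximated with a single regular expression. The Python sketch below is an illustration only: the name split_chunks is hypothetical, and it keeps the enclosing delimiters in each chunk, which is an assumption since the text does not say whether data-freq strips them.

```python
import re

# A chunk is a (...), [...], {...}, "..." or '...' group, or a run of non-whitespace.
# Nested groups are not handled, matching the limitation stated above.
CHUNK = re.compile(r"""\([^)]*\)|\[[^\]]*\]|\{[^}]*\}|"[^"]*"|'[^']*'|\S+""")

def split_chunks(line):
    """Split a line into chunks, keeping enclosed groups together."""
    return CHUNK.findall(line)

line = '1.2.3.4 - user1 [01/Jan/2012:01:02:03 +0000] "GET / HTTP/1.1" 200 12'
chunks = split_chunks(line)
print(chunks[0])  # position 0: the IP address
print(chunks[2])  # position 2: the remote username
print(chunks[3])  # position 3: the bracketed date/time, kept as one chunk
```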
For a field of the --date type, even if the --pos option is not set, the first chunk enclosed by a pair of brackets [...] is automatically selected.
Some log formats do not enclose the date/time in brackets. In that case, the --pos option with a range operator is useful.
For example, if the log line looks like this:
01 Jan 2012 01:02:03,456 INFO - test log
then the --pos option can be used as below:
data-freq --pos=0..3
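The position syntax can be sketched in Python as follows. The range is treated as inclusive, which the example above implies (positions 0 through 3 cover the whole timestamp). The function name parse_positions is hypothetical, and joining the selected chunks with a space is an assumption made for display only.

```python
def parse_positions(spec):
    """Expand a --pos style spec such as '1,2,5' or '0..3' into a list of indices."""
    positions = []
    for part in spec.split(","):
        if ".." in part:
            start, end = part.split("..")
            positions.extend(range(int(start), int(end) + 1))  # inclusive range
        else:
            positions.append(int(part))
    return positions

line = "01 Jan 2012 01:02:03,456 INFO - test log"
chunks = line.split()  # plain whitespace split is enough for this line
selected = [chunks[i] for i in parse_positions("0..3")]
print(" ".join(selected))
```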
Limiting output
In the output, the number of records to display under each category can be limited by the options below:
--limit
Limits the records to the given number. If a negative number is specified, the number is counted from the end.
--offset
Skips as many records as the given number. If a negative number is specified, the number is counted from the end.
--zero
Short for --limit=0.
Sorting results
The output can be sorted on a per-field basis by the attributes below:
--score
Sorts by the score (left-hand side numbers).
--value
Sorts by the value (right-hand side texts).
--first
Sorts by the first occurrence in the input.
--last
Sorts by the last occurrence in the input.
The direction of the order can be controlled by these respective options:
--asc
Sorts in ascending order.
--desc
Sorts in descending order.
If the sorting and/or ordering options above are omitted, the default sorting method will be determined as follows:
1. If the field type is --text, the output will be sorted by --score by default (i.e. the most frequent text first). Otherwise (if the field type is either --number or any kind of --date), the output will be sorted by --value by default (i.e. in number-line or time-line order).
2. If the sorting type is either --score or --last, the output will be sorted in descending order by default. Otherwise, the default is ascending order.
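The two default rules above can be restated as a small Python function. This is only a restatement of the documented rules for illustration; the function name default_sort and the string labels are hypothetical.

```python
def default_sort(field_type, sort_by=None, direction=None):
    """Fill in the sort key and direction implied by the two rules above."""
    if sort_by is None:
        # Rule 1: text sorts by score; number and date types sort by value.
        sort_by = "score" if field_type == "text" else "value"
    if direction is None:
        # Rule 2: score and last default to descending; value and first to ascending.
        direction = "desc" if sort_by in ("score", "last") else "asc"
    return sort_by, direction

print(default_sort("text"))                   # ('score', 'desc')
print(default_sort("date"))                   # ('value', 'asc')
print(default_sort("date", sort_by="score"))  # ('score', 'desc')
```

The last call corresponds to --day --score: an explicit sort key still picks up its default direction from rule 2.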
Aggregating subcategory
If one of the aggregation options below is given to a field, it alters the meaning of what is displayed as the score of its parent field.
Without aggregation, the frequency of each field is counted independently, and the parent field count is usually equal to the sum of the child field counts. The aggregation options replace that sum with an alternative score.
--unique
Scores the number of distinct values.
--max
Scores the maximum count.
--min
Scores the minimum count.
--average
Scores the average count.
Below is an example to show top 5 days in the number of distinct users:
data-freq --day --score --limit=5 \
--text --pos=2 --unique --zero \
access_log
(or)
data-freq -dS -n5 -tp2 -Uz access_log
where --day is the daily aggregate for the first level, and --text --pos=2 is for the usernames per day.
The --score option sorts the first field by the score (unique usernames) rather than by the date itself, and then the top 5 days are printed with --limit=5.
The --unique option makes the first field count the number of unique usernames instead of the total number of occurrences, while the --zero option for the second field hides all the individual usernames, since the only purpose here is to list the dates.
As a result, the output will look like this:
1100: 2012-03-05
860: 2012-02-20
789: 2012-02-13
641: 2012-03-12
580: 2012-02-27
where each number on the left is the number of unique users on each day, and the listed dates are the top 5 among others.
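The effect of the aggregation options on the parent score can be sketched in Python. This is an illustration under assumptions: the name parent_score and the mode labels are hypothetical, and whether data-freq rounds the average is not stated here, so the sketch leaves it as a float.

```python
from collections import Counter

def parent_score(child_counts, mode="sum"):
    """Derive a parent score from the counts of its child values."""
    counts = list(child_counts.values())
    if mode == "unique":
        return len(counts)                # number of distinct child values
    if mode == "max":
        return max(counts)
    if mode == "min":
        return min(counts)
    if mode == "average":
        return sum(counts) / len(counts)  # left unrounded in this sketch
    return sum(counts)                    # default: total number of occurrences

# Usernames seen on one day (sample data for illustration).
users = Counter(["user1", "user1", "user2", "user3", "user1"])
for mode in ("sum", "unique", "max", "min"):
    print(mode, parent_score(users, mode))
```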
Input format
--split
Specifies the field separator for each of the input lines.
For example, in order to analyze a CSV file,
data-freq --split=, --pos=2 < input.csv
will count the third field in each line.
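For comparison, the same selection can be written in a few lines of Python. The sketch treats the separator as a literal string and does not handle quoted CSV fields; both are simplifying assumptions, and the function name count_field and the sample rows are hypothetical.

```python
from collections import Counter

def count_field(lines, separator, pos):
    """Split each line on a literal separator and count the field at index pos."""
    return Counter(line.rstrip("\n").split(separator)[pos] for line in lines)

rows = ["1,a,red", "2,b,blue", "3,c,red"]
for text, n in count_field(rows, ",", 2).most_common():
    print(f"{n}: {text}")
```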
Output format
There are a number of ways to control the output format.
--indent
Alters the indent spaces (or any other characters) that repeat as many times as the depth (minus 1) at each field level. E.g.
data-freq --indent=++
will output something like this:
21: AAA
++12: BBB
++++10: CCC
++++ 2: DDD
++ 9: EEE
++++ 6: FFF
++++ 3: GGG
--prefix
Prepends a prefix between the indent and the score value.
Example:
data-freq --prefix='* '
Output:
* 21: AAA
    * 12: BBB
        * 10: CCC
        *  2: DDD
    *  9: EEE
        *  6: FFF
        *  3: GGG
--separator
Sets the separator between the score and the counted text.
Example:
data-freq --separator=' => '
Output:
21 => AAA
    12 => BBB
        10 => CCC
         2 => DDD
     9 => EEE
         6 => FFF
         3 => GGG
--root
Also displays the grand total at the level 0. All the subsequent levels are shifted to the right.
34: Total
    21: AAA
        12: BBB
            10: CCC
             2: DDD
         9: EEE
             6: FFF
             3: GGG
    13: HHH
        13: III
            12: JJJ
             1: KKK
--transpose
Swaps the position of the score and the counted text.
AAA: 21
    BBB: 12
        CCC: 10
        DDD: 2
    EEE: 9
        FFF: 6
        GGG: 3
--nopadding
Suppresses the space padding on the left, which is applied by default to align the counted texts.
21: AAA
    12: BBB
        10: CCC
        2: DDD
    9: EEE
        6: FFF
        3: GGG
Note: the indentation above is strictly a multiple of 4 spaces, so the texts at the same level may not be aligned.
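The way the indent, prefix, separator, and padding options combine can be sketched with a small Python renderer. It is an illustration only, not data-freq's code: the function name render and the tuple layout are hypothetical, and right-aligning every score to the widest score in the whole tree is an assumption that happens to match the --indent example above.

```python
def render(tree, indent="    ", prefix="", separator=": ", padding=True):
    """tree: list of (score, text, children) tuples; returns the output lines."""
    def scores(nodes):
        for score, _text, children in nodes:
            yield score
            yield from scores(children)

    # Assumption: scores are right-aligned to the widest score in the whole tree.
    width = max((len(str(s)) for s in scores(tree)), default=0) if padding else 0

    lines = []
    def walk(nodes, depth):
        for score, text, children in nodes:
            number = str(score).rjust(width)
            # The indent repeats once per level below the top level.
            lines.append(indent * depth + prefix + number + separator + text)
            walk(children, depth + 1)
    walk(tree, 0)
    return lines

tree = [(21, "AAA", [
    (12, "BBB", [(10, "CCC", []), (2, "DDD", [])]),
    (9, "EEE", [(6, "FFF", []), (3, "GGG", [])]),
])]
print("\n".join(render(tree, indent="++")))
```

Passing padding=False drops the alignment spaces while leaving the indentation unchanged, which is the behavior described for --nopadding.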
AUTHOR
Mahiro Ando, <mahiro at cpan.org>
LICENSE AND COPYRIGHT
Copyright 2012 Mahiro Ando.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.