NAME

uniq_wc - efficiently count unique tokens from a file

VERSION

version 0.014

SYNOPSIS

uniq_wc [options] FILE

DESCRIPTION

The Linear Probabilistic Counter is space-efficient and lets the implementer specify the desired level of accuracy. It is useful when space efficiency matters but you still need to control the error in your results. The algorithm works in two steps. The first step allocates a bitmap in memory, initialized to all zeros. A hash function is then applied to each entry in the input data; the result maps the entry to a bit in the bitmap, and that bit is set to 1. In the second step, the algorithm counts the number of empty bits and uses that number as input to the following equation to get the estimate.

n = -m ln Vn

In the equation, n is the estimated number of distinct entries, m is the size of the bitmap in bits, and Vn is the ratio of empty bits to the size of the bitmap. The important thing to note is that the bitmap can be much smaller than the expected maximum cardinality. How much smaller depends on how much error you can tolerate in the result. Because the bitmap size, m, is smaller than the total number of distinct elements, there will be collisions. These collisions are the price of space efficiency and are also the source of the estimation error. So by controlling the size of the bitmap we can estimate the number of collisions and therefore the amount of error we will see in the end result.

(source)
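The two steps described above can be sketched in a few lines of Python. This is an illustration of the general technique only, not the code uniq_wc runs: the function name linear_count, the choice of BLAKE2b as the hash, and the default bitmap size are all assumptions made for the example.

```python
import hashlib
import math


def linear_count(tokens, m=1 << 16, seed=0):
    """Estimate the number of distinct tokens using an m-bit bitmap.

    Hypothetical helper for illustration; the hash function and
    parameter names are assumptions, not taken from uniq_wc.
    """
    # Step 1: a bitmap of m bits, initialized to all zeros.
    bitmap = bytearray((m + 7) // 8)
    salt = seed.to_bytes(8, "little")
    for token in tokens:
        # Hash the entry and map the result to one bit position in [0, m).
        digest = hashlib.blake2b(
            token.encode("utf-8"), digest_size=8, salt=salt
        ).digest()
        pos = int.from_bytes(digest, "little") % m
        bitmap[pos >> 3] |= 1 << (pos & 7)

    # Step 2: count the empty bits and apply n = -m ln Vn.
    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    empty = m - set_bits
    if empty == 0:
        raise ValueError("bitmap is saturated; use a larger m")
    v_n = empty / m
    return -m * math.log(v_n)


if __name__ == "__main__":
    words = [f"w{i}" for i in range(10000)] * 3  # 10000 distinct, repeated
    print(round(linear_count(words)))
```

Repeated entries always hit the same bit, so duplicates do not change the estimate. Only collisions between different entries introduce error, and a larger m makes them less frequent.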

OPTIONS

--help

Show this help message.

--length

Feature vector length (in MB, default: 1).

--seed

Custom seed (integer).

--bits

How many bits to use to represent one character. The default value, 8, sacrifices Unicode handling but is fast and has a low memory footprint. A value of 18 covers the Basic Multilingual, Supplementary Multilingual and Supplementary Ideographic planes.

EXAMPLES

Use:

$ time uniq_wc -l 8 enwik8
361181

real    0m1.262s
user    0m1.220s
sys     0m0.036s

Instead of:

$ time perl -mbytes -lne '++$u{lc$1}while/(\w+)/g}{print~~keys%u' enwik8
361990

real    0m6.798s
user    0m6.744s
sys     0m0.028s
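To put the two runs side by side, the short calculation below uses only the numbers printed above. It treats the Perl one-liner's count as the reference value, which assumes both commands tokenize the file the same way; the variable names are illustrative.

```python
estimate = 361181    # uniq_wc -l 8 enwik8
reference = 361990   # perl one-liner, treated here as the reference count

# Relative difference between the estimate and the reference count.
rel_diff = abs(estimate - reference) / reference

# Wall-clock ratio taken from the "real" times above.
speedup = 6.798 / 1.262

print(f"relative difference: {rel_diff:.2%}")
print(f"speed-up: {speedup:.1f}x")
```

On this input the estimate differs from the reference by a fraction of a percent while running several times faster.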

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2021 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.