NAME

Data::BitStream::XS - A bit stream class including integer coding methods

VERSION

version 0.08

SYNOPSIS

use Data::BitStream::XS;
my $stream = Data::BitStream::XS->new;
$stream->put_gamma($_) for (1 .. 20);
$stream->rewind_for_read;
my @values = $stream->get_gamma(-1);

See Data::BitStream for more examples.

DESCRIPTION

An XS implementation providing read/write access to bit streams. This includes many integer coding methods as well as straightforward ways to implement new codes.

Bit streams are often used in data compression and in embedded products where memory is at a premium.

This code provides a nearly drop-in XS replacement for the Data::BitStream module. If you do not need the flexibility of the Moose/Mouse/Moo system, you can use this directly.

Versions 0.03 and later of the Data::BitStream class will attempt to use this XS class if it is available. Most operations will be 50-100 times faster, while not sacrificing any of its flexibility, so it is highly recommended. In other words, if this module is installed, any code using Data::BitStream will automatically speed up.

While direct use of the XS class is a bit faster than going through Moose/Mouse/Moo, the vast majority of the benefit is internal. Hence, for maximum portability and flexibility just install this module for the speed, and continue using the Data::BitStream class as usual.

METHODS

CLASS METHODS

new

Creates a new object. By default it has no associated file, is mode RW, and has maxlen 0 (no space allocated). An optional hash of arguments may be supplied. Examples:

$stream = Data::BitStream::XS->new( size => 10_000 );

Indicates an initial start size of 10,000 bits. The normal behavior is to expand the data area as needed, but this can be used to make an initial allocation. A small amount of time will be saved, and it will be more space efficient if the number of bits is known in advance. This often will be a premature optimization.

$stream = Data::BitStream::XS->new( mode => 'ro' );

The stream is opened as a read-only stream. Attempts to open it for write will fail, hence all write / put methods will also fail. This is most useful for opening a file for read, which will ensure no changes are made.

Possible modes include 'read' / 'r', 'readonly' / 'ro', 'write' / 'w', 'writeonly' / 'wo', 'append' / 'a', and 'readwrite' / 'rw' / 'rdwr'.

$stream = Data::BitStream::XS->new( file    => "c.bsc",
                                    fheader => "HEADER $foo $bar",
                                    mode    => 'w' );

A file is associated with the stream. Upon closing the file, going out of scope, or otherwise being destroyed, the stream will be written to the file, with the given header string written first. While the current implementation writes at close time, later implementations may write as the stream is written to.

$stream = Data::BitStream::XS->new( file => "c.bsc",
                                    fheaderlines => 1,
                                    mode => 'ro' );

A file is associated with the stream. The contents of the file will be slurped into the stream. The given number of header lines will be skipped at the start (with their contents put into fheader so they can be retrieved). While the current implementation slurps the contents, later implementations may read from the file as the stream is read.

maxbits

Returns the number of bits in a word, which is the largest allowed size of the bits argument to read and write. This will be either 32 or 64.

It is theoretically possible that the maximum bits for this class and Perl do not match. So this class may report 32-bit maxbits while Perl is 64-bit, and vice-versa. This would usually happen only if it was loaded into a different Perl than it was compiled for.

code_is_supported

Returns a hash of information about a code if it is known, and undef otherwise.

The argument is a text name, such as 'Gamma', 'Rice(2)', etc.

code_is_universal

Returns undef if the code is not known, 0 if the code is non-universal, and a non-zero integer if it is universal.

The argument is a text name, such as 'Gamma', 'Rice(2)', etc.

A code is universal if there exists a constant C such that C plus the length of the code is less than the optimal code length, for all values. What this typically means for us in practical terms is that non-universal codes are fine for small numbers, but their size increases rapidly, making them inappropriate when large values are possible (no matter how rare). A classic non-universal code is Unary coding, which takes k+1 bits to store value k. This is very good if most values are 0 or near zero. If we have rare values in the tens of thousands, it's not so great. It is likely to be fatal if we ever come across a value of 2 billion.

add_code

Used for the dispatch table methods code_put and code_get as well as other helper methods like code_is_universal and code_is_supported. This is typically handled internally, but can be used to register a new code or variant. An example of an Omega-Golomb code:

Data::BitStream::XS::add_code(
   { package   => __PACKAGE__,
     name      => 'OmegaGolomb',
     universal => 1,
     params    => 1,
     encodesub => sub {shift->put_golomb( sub {shift->put_omega(@_)}, @_ )},
     decodesub => sub {shift->get_golomb( sub {shift->get_omega(@_)}, @_ )},
   }
);

which registers the name OmegaGolomb as a new universal code that takes one parameter. Given a stream $stream, this is now allowed:

$stream->erase_for_write;
$stream->code_put("OmegaGolomb(5)", 477);
$stream->rewind_for_read;
my $value = $stream->code_get("OmegaGolomb(5)");
die unless $value == 477;

OBJECT METHODS (reading)

These methods are only valid while the stream is in reading state.

rewind

Moves the position to the stream beginning.

exhausted

Returns true is the stream is at the end. Rarely used.

read($bits [, 'readahead'])

Reads $bits from the stream and returns the value. $bits must be between 1 and maxbits.

The position is advanced unless the second argument is the string 'readahead'.

Attempting to read past the end of the stream is a fatal error. However, readahead is allowed as it is speculative. All positions past the end of the stream will always be filled with zero bits.

readahead($bits) >

Identical to calling read with 'readahead' as the second argument. Returns the value of the next $bits bits (between 1 and maxbits). Returns undef if the current position is at the end. Allows reading past the end of the stream (fills with zeros as necessary). Does not advance the position.

skip($bits)

Advances the position $bits bits. Used in conjunction with readahead.

Attempting to skip past the end of the stream is a fatal error.

read_string($bits)

Reads $bits bits from the stream and returns them as a binary string, such as '0011011'. Attempting to read past the end of the stream is a fatal error.

OBJECT METHODS (writing)

These methods are only valid while the stream is in writing state.

write($bits, $value)

Writes $value to the stream using $bits bits. $bits must be between 1 and maxbits, unless value is 0 or 1, in which case bits may be larger than maxbits.

The stream length will be increased by $bits bits. Regardless of the contents of $value, exactly $bits bits will be used. If $value has more non-zero bits than $bits, the lower bits are written. In other words, $value will be effectively masked before writing.

put_string(@strings)

Takes one or more binary strings (e.g. '1001101', '001100') and writes them to the stream. The number of bits used for each value is equal to the string length.

put_stream($source_stream)

Writes the contents of $source_stream to the stream. This is a helper method that might be more efficient than doing it in one of the many other possible ways. The default implementation uses:

$self->put_string( $source_stream->to_string );
put_raw($packed, [, $bits])

Writes the packed big-endian vector $packed which has $bits bits of data. If $bits is not present, then length($packed) will be used as the byte-length. It is recommended that you include $bits.

OBJECT METHODS (conversion)

These methods may be called at any time, and will adjust the state of the stream.

to_string

Returns the stream as a binary string, e.g. '00110101'.

to_raw

Returns the stream as packed big-endian data. This form is portable to any other implementation on any architecture.

to_store

Returns the stream as some scalar holding the data in some implementation specific way. This may be portable or not, but it can always be read by the same implementation. It might be more efficient than the raw format.

from_string($string)

The stream will be set to the binary string $string.

from_raw($packed [, $bits])

The stream is set to the packed big-endian vector $packed which has $bits bits of data. If $bits is not present, then length($packed) will be used as the byte-length. It is recommended that you include $bits.

from_store($blob [, $bits])

Similar to from_raw, but using the value returned by to_store.

OBJECT METHODS (other)

pos

A read-only non-negative integer indicating the current position in a read stream. It is advanced by read, get, and skip methods, as well as changed by to, from, rewind, and erase methods.

len

A read-only non-negative integer indicating the current length of the stream in bits. It is advanced by write and put methods, as well as changed by from and erase methods.

maxlen

A read-only non-negative integer indicating the current storage size of the stream in bits. This will always be greater than or equal to the stream len. Applications will not normally need to know this.

trim

Resizes the data to the stream len, releasing all expansion space to the system. Not normally needed.

writing

A read-only boolean indicating whether the stream is open for writing or reading. Methods for read such as read, get, skip, rewind, skip, and exhausted are not allowed while writing. Methods for write such as write and put are not allowed while reading.

The write_open and erase_for_write methods will set writing to true. The write_close and rewind_for_read methods will set writing to false.

The read/write distinction allows implementations more freedom in internal caching of data. For instance, they can gather writes into blocks. It also can be helpful in catching mistakes such as reading from a target stream.

erase

Erases all the data, while the writing state is left unchanged. The position and length will both be 0 after this is finished.

read_open

Reads the current input file, if one exists.

write_open

Changes the state to writing with no other API-visible changes.

write_close

Changes the state to reading, and the position is set to the end of the stream. No other API-visible changes happen.

erase_for_write

A helper function that performs erase followed by write_open.

rewind_for_read

A helper function that performs write_close followed by rewind.

fheader

Returns the contents of the header lines read if the fheaderlines option was given to new.A This allows one to read the header of an image format, with the stream pointing to the data, and the header contents easily obtainable. Unfortunately it isn't completely generic, as it assumes a fixed number of lines. An alternative API would be to have a user supplied sub.

OBJECT METHODS (coding)

All coding methods are biased to 0. This means values from 0 to 2^maxbits-1 (for universal codes) may be encoded, even if the original code as published starts with 1.

All get_ methods take an optional count as the last argument. If $count is 1 or not supplied, a single value will be read. If $count is positive, that many values will be read. If $count is negative, values are read until the end of the stream.

get_ methods called in list context will return a list of all values read. Called in scalar context they return the last value read.

put_ methods take one or more values as input after any optional parameters and write them to the stream. All values must be non-negative integers that do not exceed the maximum encodable value (typically ~0, but may be lower for some codes depending on parameter, and non-universal codes will be practically limited to smaller values).

get_unary([$count])
put_unary(@values)

Reads/writes one or more values from the stream in 0000...1 unary coding. Unary coding is only appropriate for relatively small numbers, as it uses $value + 1 bits per value.

get_unary1([$count])
put_unary1(@values)

Reads/writes one or more values from the stream in 1111...0 unary coding.

get_binword($bits, [$count])
put_binword($bits, @values)

Reads/writes one or more values from the stream as fixed-length binary numbers, each using $bits bits.

get_gamma([$count])
put_gamma(@values)

Reads/writes one or more values from the stream in Elias Gamma coding.

get_delta([$count])
put_delta(@values)

Reads/writes one or more values from the stream in Elias Delta coding.

get_omega([$count])
put_omega(@values)

Reads/writes one or more values from the stream in Elias Omega coding.

get_levenstein([$count])
put_levenstein(@values)

Reads/writes one or more values from the stream in Levenstein coding (sometimes called Levenshtein or Левенште́йн coding).

get_evenrodeh([$count])
put_evenrodeh(@values)

Reads/writes one or more values from the stream in Even-Rodeh coding.

get_goldbach_g1([$count])
put_goldbach_g1(@values)

Reads/writes one or more values from the stream in Goldbach G1 coding.

get_goldbach_g2([$count])
put_goldbach_g2(@values)

Reads/writes one or more values from the stream in Goldbach G2 coding.

get_fib([$count])
put_fib(@values)

Reads/writes one or more values from the stream in Fibonacci coding. Specifically, the order m=2 C1 codes of Fraenkel and Klein.

get_fibgen($m [, $count])
put_fibgen($m, @values)

Reads/writes one or more values from the stream in generalized Fibonacci coding. The order m should be between 2 and 16. These codes are described in Klein and Ben-Nissan (2004). For m=2 the results are identical to the standard C1 form.

get_comma($bits [, $count])
put_comma($bits, @values)

Reads/writes one or more values from the stream in Comma coding. The number of bits bits should be between 1 and 16. bits=1 implies Unary coding. bits=2 is the ternary comma code. No leading zeros are used.

get_blocktaboo($taboo [, $count])
put_blocktaboo($taboo, @values)

Reads/writes one or more values from the stream in block-based Taboo coding. The parameter taboo is the binary string of the taboo code to use, such as '00'. taboo='1' implies Unary coding. taboo='0' implies Unary1 coding. No more than 16 bits of taboo code may be given. These codes are a more efficient version of comma codes, as they allow leading zeros.

get_golomb($m [, $count])
put_golomb($m, @values)

Reads/writes one or more values from the stream in Golomb coding.

get_golomb(sub { ... }, $m [, $count])
put_golomb(sub { ... }, $m, @values)

Reads/writes one or more values from the stream in Golomb coding using the supplied subroutine instead of unary coding, which can make them work with large outliers. For example to use Fibonacci coding for the base:

$stream->put_golomb( sub {shift->put_fib(@_)}, $m, $value);

$value = $stream->get_golomb( sub {shift->get_fib(@_)}, $m);
get_rice($k [, $count])
put_rice($k, @values)

Reads/writes one or more values from the stream in Rice coding, which is the time efficient case where m = 2^k.

get_rice(sub { ... }, $k [, $count])
put_rice(sub { ... }, $k, @values)

Reads/writes one or more values from the stream in Rice coding using the supplied subroutine instead of unary coding, which can make them work with large outliers. For example to use Omega coding for the base:

$stream->put_rice( sub {shift->put_omega(@_)}, $k, $value);

$value = $stream->get_rice( sub {shift->get_omega(@_)}, $k);
get_gammagolomb($m [, $count])
put_gammagolomb($m, @values)

Reads/writes one or more values from the stream in Golomb coding using Elias Gamma codes for the base. This is a convenience since they are common.

get_gamma_golomb($m [, $count])
put_gamma_golomb($m, @values)

Aliases for get_gammagolomb and put_gammagolomb.

get_expgolomb($k [, $count])
put_expgolomb($k, @values)

Reads/writes one or more values from the stream in Rice coding using Elias Gamma codes for the base. This is a convenience since they are common.

get_gamma_rice($k [, $count])
put_gamma_rice($k, @values)

Aliases for get_expgolomb and put_expgolomb. This name better describes the algorithm, but is not in common use.

get_baer($k [, $count])
put_baer($k, @values)

Reads/writes one or more values from the stream in Baer c_k coding. The parameter k must be between -32 and 32.

get_boldivigna($k [, $count])
put_boldivigna($k, @values)

Reads/writes one or more values from the stream in the Zeta coding of Paolo Boldi and Sebastiano Vigna. The parameter k must be between 1 and maxbits (32 or 64). Typical values for k are between 2 and 6.

get_arice($k [, $count])
put_arice($k, @values)

Reads/writes one or more values from the stream in Adaptive Rice coding. Technically this is ExpGolomb coding since the default method for encoding the base is using the Elias Gamma code. The value of $k will adapt to better fit the values. This interface will likely change to make $k a reference.

get_arice(sub { ... }, $k [, $count])
put_arice(sub { ... }, $k, @values)

Reads/writes one or more values from the stream in Adaptive Rice coding using the supplied subroutine instead of Elias Gamma coding to encode the base. The value of $k will adapt to better fit the values. This interface will likely change to make $k a reference.

get_startstop(\@m [, $count])
put_startstop(\@m, @values)

Reads/writes one or more values using Start/Stop codes. The parameter is an array reference which can be an anonymous array, for example:

$stream->put_startstop( [0,3,2,0], @array );
my @array2 = $stream->get_startstop( [0,3,2,0], -1);
get_startstepstop(\@m [, $count])
put_startstepstop(\@m, @values)

Reads/writes one or more values using Start-Step-Stop codes. The parameter is an array reference which can be an anonymous array, for example:

$stream->put_startstepstop( [3,2,9], @array );
my @array3 = $stream->get_startstepstop( [3,2,9], -1);
code_get($code, [, $count])
code_put($code, @values )

These methods wrap up all the previous encoding and decoding methods in an internal dispatch table. code is a text name of the code, such as 'Gamma', 'Fibonacci', etc. Codes with parameters are called as 'Rice(2)', 'StartStop(0-0-2-4-14)', etc.

# $use_rice and $k obtained from options, parameters, or wherever.
my $code = $use_rice ? "Rice($k)" : "Delta";
my $nvalues = scalar @values;
$stream->code_put($code, @values);
# ....
my @vals = $stream->code_get($code, $nvalues);
print "Read $nvalues values with code '$code':  ", join(',', @vals), "\n";

SEQUENCE METHODS (class methods)

These methods are exported to allow testing, and because they may be of some use for callers. They are not directly related.

is_prime($n)

Given an unsigned integer n, returns 0 if the number is not prime, 1 if it is prime. The algorithm currently used is trial division. Speed is approximately equal to the code used by Math::Prime::XS version 0.26 (the algorithms are identical). The algorithm may be changed.

prime_count($n)

Returns the Prime Count function Pi(n). The current implementation relies on sieving to find the primes within the interval, so will take some time and memory.

nth_prime($n)

Returns the value of the nth prime, for n >= 1. Note that:

prime_count(nth_prime(n)) = n

nth_prime(prime_count(n)+1) = next_prime(n)

for all n >= 1.

prime_init($n)

Precalculates anything necessary to do fast calls for operations within the range up to n. Not necessary, but helpful with performance when doing repeated calls with increasing n.

SEE ALSO

Data::BitStream
Data::BitStream::Base
Data::BitStream::WordVec
Data::BitStream::Code::Gamma
Data::BitStream::Code::Delta
Data::BitStream::Code::Omega
Data::BitStream::Code::Levenstein
Data::BitStream::Code::EvenRodeh
Data::BitStream::Code::Additive
Data::BitStream::Code::Fibonacci
Data::BitStream::Code::Golomb
Data::BitStream::Code::Rice
Data::BitStream::Code::GammaGolomb
Data::BitStream::Code::ExponentialGolomb
Data::BitStream::Code::StartStop
Data::BitStream::Code::Baer
Data::BitStream::Code::BoldiVigna
Data::BitStream::Code::Comma
Data::BitStream::Code::Taboo
Data::BitStream::Code::ARice
Math::Prime::Util

AUTHORS

Dana Jacobsen <dana@acm.org>

ACKNOWLEDGEMENTS

Peter Elias, Peter Fenwick, and David Solomon have excellent resources on variable length coding, and Solomon especially has done a lot of work in tracking down and explaining many of the more obscure codes.

For prime number work, Eratosthenes of Cyrene provided the world with his wonderfully elegant and simple algorithm for finding the primes. Terje Mathisen, A.R. Quesada, and B. Van Pelt all had useful ideas which I used in my wheel sieve.

COPYRIGHT

Copyright 2011-2014 by Dana Jacobsen <dana@acm.org>

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.