NAME

Hadoop::IO::TypedBytes - Hadoop/Hive compatible TypedBytes serializer/deserializer.

VERSION

version 0.002

SYNOPSIS

query.hql

add jar /usr/lib/hive/lib/hive-contrib.jar;
set hive.script.recordwriter=org.apache.hadoop.hive.contrib.util.typedbytes.TypedBytesRecordWriter;
set hivevar:serde=org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe;

select transform (foo, bar, baz) row format serde '${serde}' using 'script.pl' as ...

script.pl

use Hadoop::IO::TypedBytes qw/make_reader decode_hive/;

my $reader = make_reader(\*STDIN);
while (defined (my $row = decode_hive($reader))) {
    my ($foo, $bar, $baz) = @$row;
    ...
}

NB: these examples only use TypedBytes to pass data from Hive to the transform script. To use TypedBytes for output as well, please refer to the Hive documentation.

DESCRIPTION

This package is useful if you want to pass multiline strings or binary data to a transform script in a Hive query.

Encoding routines "encode" and "encode_hive" take Perl values and return strings with their TypedBytes representations.

Decoding routines "decode" and "decode_hive" take a reader callback (see "READING") instead of a binary string, because TypedBytes streams as implemented by Hadoop and Hive are unframed: it is impossible to know in advance how long an object will be, so it cannot be read in one piece and handed to the decoder. For this reason the decoder consumes the binary stream directly.

FUNCTIONS

Nothing is exported by default.

encode
encode($val) -> $binary_string

Encode a scalar, array or hash reference as TypedBytes binary representation.

If you are interfacing with Hive, use "encode_hive" instead.

Containers are encoded recursively. Blessed references are not supported and will cause the function to die.

Scalars are always encoded as TypedBytes strings, array references are encoded as TypedBytes arrays, and hash references are encoded as TypedBytes structs. Note that, as of this writing, the Hive TypedBytes decoder does not support arrays and structs; the Hive documentation suggests using JSON in that case.
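To illustrate the scalar case, here is a minimal sketch of the standard Hadoop TypedBytes layout for a string: a one-byte type code (7 for string), a 32-bit big-endian byte length, then the raw bytes. The helper name is hypothetical and not part of this module's API.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this module) sketching the wire
# layout of a TypedBytes string: type code 7, a 32-bit big-endian
# byte length, then the raw bytes.
sub encode_string_sketch {
    my ($s) = @_;
    return pack("C N a*", 7, length($s), $s);
}

# "hi" becomes: 0x07, then 0x00 0x00 0x00 0x02, then "hi"
my $bytes = encode_string_sketch("hi");
```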

decode
decode($reader) -> $obj

Decode a binary TypedBytes stream into the corresponding Perl object.

See "READING" for a description of the $reader parameter.

If you are interfacing with Hive, use "decode_hive" instead.

TypedBytes lists and arrays are decoded as array references.

TypedBytes structs are decoded as hash references.

TypedBytes standard empty marker and special marker are decoded as empty lists, but you shouldn't encounter these under normal circumstances.
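The following sketch (not this module's actual "decode") shows why the decoder needs a reader callback rather than a string: the one-byte type code must be examined before the decoder knows how many more bytes to request. It handles only the string case (type code 7) for brevity.

```perl
use strict;
use warnings;

# Hypothetical sketch of decoding a single TypedBytes string via a
# reader callback. The real decode() handles every TypedBytes type;
# this only handles type code 7 (string) to show the mechanics.
sub decode_string_sketch {
    my ($reader) = @_;
    my $code = $reader->(1);
    return undef unless defined $code;               # clean end of stream
    die "expected a string (type 7)" unless ord($code) == 7;
    my $len = unpack("N", $reader->(4));             # 32-bit big-endian length
    return $len ? $reader->($len) : "";
}
```

Fed a reader positioned over the bytes 0x07 0x00 0x00 0x00 0x02 "hi", this returns "hi" and leaves the reader at the next record.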

encode_hive
encode_hive(@row) -> $binary_string

Encode a row of values in the format expected by Hive: a concatenation of the encoded values, terminated by a Hive-specific "end-of-row" marker.

See the notes for "encode" for a description of the encoding process.

decode_hive
decode_hive($reader) -> $row

Decode a binary TypedBytes stream into an array reference representing a single row received from Hive.

See "READING" for a description of the $reader parameter.

See the notes for "decode" for a general description of the decoding process.

make_reader
make_reader($fh) -> $callback

Construct a reader callback from a filehandle. The returned callback follows the contract described in "READING".
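A sketch of what a "make_reader"-compatible callback might look like (the module's actual implementation may differ): return exactly the requested number of bytes, return undef on a clean end-of-file, and die on a truncated stream.

```perl
use strict;
use warnings;

# Hypothetical make_reader-style constructor implementing the
# contract from the READING section.
sub make_reader_sketch {
    my ($fh) = @_;
    binmode $fh;    # TypedBytes is a binary format
    return sub {
        my ($n) = @_;
        return "" unless $n;                       # nothing requested
        my $got = read($fh, my $buf, $n);
        die "read failed: $!" unless defined $got;
        return undef if $got == 0;                 # clean end-of-file
        die "unexpected EOF: wanted $n bytes, got $got" if $got < $n;
        return $buf;
    };
}
```

Note that a stream ending mid-record dies rather than returning undef, so a truncated input is never silently mistaken for a normal end-of-stream.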

READING

Both Hadoop and Hive streams that use TypedBytes provide no way to know the size of the next record without decoding it. For this reason "decode" and "decode_hive" take a callback instead of a string. The reader callback is passed a single parameter, the number of bytes to read, and should return a string of exactly that length, or undef to indicate end-of-file. It should die, and NOT return undef, if the stream contained fewer bytes than requested.

Use "make_reader" to make a compatible reader callback from a filehandle.

AUTHORS

  • Philippe Bruhat

  • Sabbir Ahmed

  • Somesh Malviya

  • Vikentiy Fesunov

COPYRIGHT AND LICENSE

This software is copyright (c) 2023 by Booking.com.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.