NAME
Hadoop::IO::TypedBytes - Hadoop/Hive compatible TypedBytes serializer/deserializer.
VERSION
version 0.002
SYNOPSIS
query.hql
add jar /usr/lib/hive/lib/hive-contrib.jar;
set hive.script.recordwriter=org.apache.hadoop.hive.contrib.util.typedbytes.TypedBytesRecordWriter;
set hivevar:serde=org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe;
select transform (foo, bar, baz) row format serde '${serde}' using 'script.pl' as ...
script.pl
use Hadoop::IO::TypedBytes qw/make_reader decode_hive/;
my $reader = make_reader(\*STDIN);
while (defined (my $row = decode_hive($reader))) {
    my ($foo, $bar, $baz) = @$row;
    ...
}
NB: these examples use TypedBytes only to pass data from Hive to the transform script. To use TypedBytes for the script's output as well, please refer to the Hive documentation.
DESCRIPTION
This package is useful if you want to pass multiline strings or binary data to a transform script in a Hive query.
Encoding routines "encode" and "encode_hive" take Perl values and return strings with their TypedBytes representations.
Decoding routines "decode" and "decode_hive" take a reader callback (see "READING") instead of binary strings, because TypedBytes streams as implemented by Hadoop and Hive are unframed: there is no way to tell in advance how long an object will be, so it cannot be read in one go and handed to the decoder as a string. For this reason the decoder consumes the binary stream directly.
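For example, a value can be round-tripped through its TypedBytes form by decoding from a filehandle. This is a minimal sketch; the in-memory handle simply stands in for a real stream such as STDIN.
use Hadoop::IO::TypedBytes qw/encode decode make_reader/;

# Encode a nested structure into its TypedBytes representation.
my $bytes = encode({ name => "abc", tags => [ "x", "y" ] });

# Any filehandle works; here an in-memory one wraps the encoded string.
open my $fh, '<', \$bytes or die "cannot open in-memory handle: $!";
binmode $fh;

my $reader = make_reader($fh);
my $copy   = decode($reader);   # hash reference equivalent to the input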
FUNCTIONS
Nothing is exported by default.
- encode
encode($val) -> $binary_string
Encode a scalar, array reference, or hash reference into its TypedBytes binary representation.
If you are interfacing with Hive, use "encode_hive" instead.
Containers are encoded recursively. Blessed references are not supported and will cause the function to die.
Scalars are always encoded as TypedBytes strings, array references as TypedBytes arrays, and hash references as TypedBytes structs. Note that, at the time of writing, the Hive TypedBytes decoder does not support arrays and structs; the Hive documentation suggests using JSON in that case.
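For instance (a sketch; the exact byte layout is omitted):
use Hadoop::IO::TypedBytes qw/encode/;

my $scalar_tb = encode("hello\nworld");           # a TypedBytes string
my $array_tb  = encode([ "a", "b", "c" ]);        # a TypedBytes array
my $struct_tb = encode({ id => 42, ok => "y" });  # a TypedBytes struct

# Blessed references are not supported:
# encode( bless {}, 'Some::Class' );              # dies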
- decode
decode($reader) -> $obj
Decode a binary TypedBytes stream into the corresponding Perl object.
See "READING" for a description of the $reader parameter. If you are interfacing with Hive, use "decode_hive" instead.
TypedBytes lists and arrays are decoded as array references.
TypedBytes structs are decoded as hash references.
The standard TypedBytes empty marker and special marker are decoded as empty lists, but you shouldn't encounter these under normal circumstances.
- encode_hive
encode_hive(@row) -> $binary_string
Encode a row of values in the format expected by Hive: a concatenation of the encoded values, terminated by a Hive-specific "end-of-row" marker.
See the notes for "encode" for a description of the encoding process.
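A sketch of the output side of a transform script, assuming the Hive query has been configured to read TypedBytes back from the script (see the NB in the SYNOPSIS):
use Hadoop::IO::TypedBytes qw/encode_hive/;

binmode STDOUT;
for my $row ( [ "a", 1 ], [ "b", 2 ] ) {
    # Each call returns one encoded row, end-of-row marker included.
    print encode_hive(@$row);
}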
- decode_hive
decode_hive($reader) -> $row
Decode a binary TypedBytes stream into an array reference representing a single row received from Hive.
See "READING" for a description of the $reader parameter. See the notes for "decode" for a general description of the decoding process.
- make_reader
make_reader($fh) -> $callback
Construct a reader callback from a filehandle.
READING
Both Hadoop and Hive streams that use TypedBytes provide no way to know the size of the next record without decoding the object itself. For this reason "decode" and "decode_hive" take a callback instead of a string. The reader callback is passed a single parameter, the number of bytes to read, and should return a string of exactly that length, or undef to indicate end-of-file. It should die, and NOT return undef, if the number of bytes left in the stream is less than requested.
Use "make_reader" to build a compatible reader callback from a filehandle.
AUTHORS
Philippe Bruhat
Sabbir Ahmed
Somesh Malviya
Vikentiy Fesunov
COPYRIGHT AND LICENSE
This software is copyright (c) 2023 by Booking.com.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.