NAME

JSON::SL::Tuba - High performance SAX-like interface for JSON

SYNOPSIS

Create a very naive JSON encoder using JSON::SL::Tuba

my $JSON ||= <<'EOJ';
{
"a" : "b",
"c" : { "d" : "e" },
"f" : [ "g", "h", "i", "j" ],
"a number" : 0.4444444444,
"a (false) boolean": false,
"another (true) boolean" : true,
"a null value" : null,
"exponential" : 1.3413400E4,
"an\tescaped key" : "a u-\u0065\u0073caped value", "שלום":"להראות"
}
EOJ

# Split the 'stream' into multiple chunks to demonstrate the streaming
# feature:
my @Chunks = unpack("a(8)*", $JSON);

# Make a subclass and set up the methods..

package A::Giant::Tuba;
use base qw(JSON::SL::Tuba);

sub on_any {
    my ($tuba,$info,$data) = @_;

    #use constant comparisons
    if ($info->{Type} == TUBA_TYPE_JSON) {
        printf STDERR ("JSON DOCUMENT: %c\n\n", $info->{Mode});
        return;
    }

    # or use the mnemonic ones
    if ($info->{Key} && $info->{Mode} =~ m,[>\+],) {
        printf ('"%s" : ', $info->{Key});
    }

    if ($info->{Type} == TUBA_TYPE_STRING) {
        printf('"%s",' . "\n", $data || "<NO DATA>");

    } elsif ($info->{Type} =~ m,[\[\{],) {
        if ($info->{Mode} eq '+') {
            print $info->{Type} . "\n";
        } else {

            print $JSON::SL::Tuba::CloseTokens{$info->{Type}} . ",\n";
        }
    } else {
        if (defined $data) {
            print $data . ",\n"
        } else {
            die ("hrrm.. what have we here?")
                unless $info->{Type} == TUBA_TYPE_NULL;
            print "null,\n";
        }
    }
}

my $o = My::Giant::Tuba->new();
$o->parse($_) for @Chunks;

Output:

JSON DOCUMENT: +

{
"a" : b,
"c" : {
"d" : e,
},
"f" : [
g,
h,
i,
j,
],
"a number" : 0.4444444444,
"a (false) boolean" : 0,
"another (true) boolean" : 1,
"a null value" : null
"exponential" : 13413.4,
"an	escaped key" : a u-escaped value,
"שלום" : להראות,
},
JSON DOCUMENT: -

DESCRIPTION

JSON::SL::Tuba provides an event-based and high performance SAX-like interface for parsing streaming JSON.

Emphasis when designing JSON::SL::Tuba was the reduction of boilerplate (the author does not have favorable experiences with SAX APIs) and high performance.

This uses the same core JSON functionality and speed as JSON::SL.

To use JSON::SL::Tuba, simply inherit from it and define one or more methods to be called when a parse event occurs.

In normal cases (and this is the default), only a single method (see below) needs to be implemented to be able to receive events.

Of course, if your application requirements are more complex, Tuba is able to deliver you events to the resolution of a single character.

CALLBACK ARGUMENTS AND TERMINOLOGY

These are the list of methods which to implement. All methods follow a single unified calling convention in the form of

callback($tuba, $info, $data);

where $tuba is the JSON::SL::Tuba instance, $info is a hash reference containing metadata about the item for which the event was received, and $data contains the actual 'data' (if applicable)

Info Hash

This hash contains metadata for determining relevant information about the current item.

The hash and all its contents are read-only. Their contents are not valid after the callback returns (see "CAVEATS"). This is for both performance and sanity reasons.

Its keys and values are as follows

Type

This is the type of JSON object for which an event was received.

The following table represents a table of type constants, and their mnemonic symbols. The value to this key itself is a double-typed scalar which yields either the character or the numeric value depending on the context.

Constant            Mnemonic Symbol     Description

=== Scalar Types ===
TUBA_TYPE_STRING        "               "string" value
TUBA_TYPE_KEY           #               hash key
TUBA_TYPE_BOOLEAN       ?               JSON boolean atom ('true','false')
TUBA_TYPE_NUMBER        =               number
TUBA_TYPE_NULL          ~               JSON 'null' atom

=== Container Types ===
TUBA_TYPE_OBJECT        {               hash (JSON 'object')
TUBA_TYPE_LIST          [               array (JSON 'list')

=== Pseudo Types ===
TUBA_TYPE_JSON          D               the entire stream
TUBA_TYPE_SPECIAL       ^               non-string scalar
TUBA_TYPE_DATA          c               any scalar data
Mode

This is the 'mode' of the callback. The mode is also a magical mnemomic constant similar to the type.

I use the term element to mean any kind of JSON variable/object - i.e. anything listed in the above type table.

Constant            Mnemonic Symbol     Description

TUBA_MODE_START         +               the start of an element
TUBA_MODE_END           -               completion of an element
TUBA_MODE_ON            >               data (contents) of an element

By default, the behavior is as follows:

Complex type events (new hash, new list) are delivered as START events. When they complete, END events are provided

For Scalar types, the START and END callbacks are not delivered, but their contents internally accumulated and delivered in whole via a single ON callback.

Almost every aspect of this is entirely configurable, and these are just (what I hope) sane defaults.

Key

By default, keys are not delivered as their own events, but rather attached to this field for the values which succeed them.

This field, if present, will contain the JSON key.

Only valid if the parent object is a hash.

See the accum_kv option below for a way to make keys be delivered as their own events.

Index

Like key, but instead of a string key, this is a numeric index. Indexes are never delivered as explicit events (since they are inherently implicit entities).

Only valid if parent object is a list.

Escaped

This is a boolean flag. Set to true if the current string needs escaping. This is never set unless string events are delivered incrementally.

Data

Nothing much to say here. This is the pure 'data' associated with the callback.

For ON-style callbacks, this will contain a complete string/number/key (the default), or fragment thereof.

By default, strings are unescaped and numeric formats converted to their Perl equivalent when they cannot be easily stringified.

Complex (non-scalar) objects will never receive an ON-style callback.

START and STOP callbacks never have any data, either.

CALLBACKS

If you've read the above section, then the names of the callbacks to be delivered are relatively consistent.

on_any

This is the default and catch-all callback for all events. The subsequent callbacks in the list do not offer any more capability than this method, but are merely present for performance and convenience (the dispatching for those methods is done in pure C, rather than several layers of Perl).

Therefore, the semantics and behavior of on_any depends on the functionality of the method for which on_any has been made a surrogate.

Determining this can be quite easy. Simply combine the Type and Mode fields to yield the equivalent function name:

if ($info->{Type} == TUBA_TYPE_LIST and $info->{Mode} eq '+') {
    my $callback_name = "start_list";
}
# etc.
start_json
end_json

Delivered on the beginning or end of a stream.

start_object
end_object

Delivered on the beginning and end of a hash

start_list
stop_list

Delivered on the beginning or end of an array.

start_string
stop_string

Delivered when a string has started or stopped. More specifically, this means when the lexer has seen an opening or closing "

on_string

This is where string-specific data gets delivered. This can be either an entire string, or a fragment thereof. In the case of the former, the string is unescaped.

on_data

This is an optional (and default) generic callback for incremental mode - fragments of numbers, booleans, strings, and keys will be delivered here, with the START and STOP callbacks signalling their beginning and end.

start_number
stop_number
on_number

These three methods follow the same semantics as their *_string equivalents, except of course, there is no unescaping

start_boolean
stop_boolean
on_boolean

Same behavior as strings and numbers, except that the object (in the default accumulator mode) is converted to a JSON::SL::Boolean

start_null
stop_null
on_null

Delivered for JSON null atoms. In accumulator mode, these get converted into undef values.

OPTIONS

Accumulators

By default JSON::SL::Tuba uses internal accumulators to buffer your data. This makes for high level events being delivered efficiently without having to call into perl with multiple callbacks for very small units of data. This also makes it easier for you the user, as state handling mechanisms do not need to be as complex.

In addition, Tuba has a special kv (key-value) accumulator which buffers hash keys internally and only ever delivers them as the Key field within the informational hash passed to callbacks.

Accumulator settings control whether incremental 'data' callbacks will be invoked for a specific scalar type or not.

$tuba->accum(tuba, type => boolean, another_type => boolean, ...)

Set accumulator parameters. Each type argument is one of the TUBA_TYPE_ constants (or a mnemonic character), and each boolean argument is whether data for that type should be accumulated.

$tuba->accum_kv(boolean)

Gets or sets the status of the key-value accumulator. Note that enabling the key-value accumulator will also enable the generic key (i.e. #) but disabling the key-value accumulator will not reverse this effect.

$tuba->accum_all(boolean)

This enables or disables the accumulator settings for all scalar types (but not the key-value accumulator)

Generic Options

$tuba->cb_unified(boolean)

If only a single callback is being used, set this option to have Tuba call the on_any callback initially instead of using this as a fallback.

This is not enabled by default as it prevents any other methods from being called, but should be turned on if you don't care about that fact.

$tuba->utf8(boolean)

Tell Tuba to set the SvUTF8 flag on strings.

$tuba->allow_unhandled(boolean)

By default, Tuba will croak if it cannot find a handler method for a given event (this effectively means the on_any method has not been implemented). This is usually what you want. To disable this behavior, set allow_unhandled to a true value.

Parsing Data

There is one method:

$tuba->parse($json_chunk)

And that's all there is to it. Tuba will parse all data fed to it.

If accumulator mode is not being used, then you will be guaranteed to rhave processed every bit of data in $json_chunk, leaving nothing buffered.

This method will croak on error (and I have not yet implemented error handling).

Storing Data in Tuba

The tuba object is a simple hash references. Feel free to use it and abuse it. One exception is the _TUBA key which contains the pointer to the internal C structure. You will probably have perl croak for trying to modify this read-only variable - but if perl doesn't croak, your program will crash - so don't modify it.

BUGS AND CAVEATS

It would be nice to provide an error handler.

Info Hash

The info hash passed to callbacks is read only and volatile. This means the following:

Trying to access a non-existent key in the hash (i.e. any key not listed in the section describing this hash) will throw an error about accessing a disallowed key.

Trying to modify any value in the hash will throw and error.

Keeping references to values within the hash, e.g.

my $ref = \$hash->{Type};

will not work as the value will not be consistent after the callback has returned.

It is safe to take a reference to the Key field, though.

Speed

Considering what Tuba does and the convenience it provides, it's blazingly fast. Nevertheless, JSON::SL is still at least twice the speed.

SEE ALSO

JSON::SL

AUTHOR AND COPYRIGHT

Copyright (C) 2012 M. Nunberg

You may use and distribute this software under the same terms and conditions as Perl itself.

1 POD Error

The following errors were encountered while parsing the POD:

Around line 141:

Non-ASCII character seen before =encoding in '"שלום":"להראות"'. Assuming UTF-8