NAME

threads::tbb - interface to the Threading Building Blocks (TBB) API

SYNOPSIS

# this synopsis is available as examples/incredible-threadable.pl
package Incredible::Threadable;
use threads::tbb;

sub new {
    my $class = shift;
    # make containers which are efficient and thread-safe
    tie my @input, "threads::tbb::concurrent::array";
    push @input, @_;  # coming soon: @input = @_
    tie my @output, "threads::tbb::concurrent::array";
    bless { input => \@input,
            output => \@output, }, $class;
}

sub parallel_transmogrify {
    my $self = shift;

    # Initialize the TBB library, and set a specification of required
    # modules and/or library paths for worker threads.
    my $tbb = threads::tbb->new( requires => [ $0 ] );

    my $min = 0;
    my $max = scalar @{ $self->{input} };
    my $range = threads::tbb::blocked_int->new( $min, $max, 5 );

    my $body = $tbb->for_int_method( $self, "my_callback" );

    $body->parallel_for( $range );
}

sub my_callback {
    my $self = shift;
    my $int_range = shift;

    for my $idx ($int_range->begin .. $int_range->end-1) {
        my $item = $self->{input}->[$idx];

        my $transmuted = $item->transmogrify;

        $self->{output}->[$idx] = $transmuted;
    }
}

package Item;
sub transmogrify {
    my $self = shift;
    "Ex-$self->{id}";
}

package main;
use feature 'say';

unless ($threads::tbb::worker) {  # single-script programs can use this guard to skip the main code in workers
    my $parallel_transmogrificator = Incredible::Threadable->new(
        map { chomp; bless { id => $_ }, "Item" } <>
    );

    $parallel_transmogrificator->parallel_transmogrify();
    say "Turned to $_" for @{ $parallel_transmogrificator->{output} };
}

DESCRIPTION

This module provides access to a few core TBB API functions to Perl programs.

The algorithms employed by TBB are quite different to threads as provided by "use threads;" - instead of directly starting threads and managing their activity, communication and synchronisation, you use an API that expresses data parallelism.

Not a thread-centric API

With threads::tbb, you don't write your algorithms from the perspective of a thread and what the thread should do next. Instead, a selection of parallelism primitives which have been found to be workable and scalable are provided.

Just as when writing "co-operative multi-threading" programs with Event or POE, the challenge is to break heavy work into small but substantial, generally non-blocking chunks. "Substantial" is yet to be quantified; it is likely to be in the ballpark of thousands of Perl runloop iterations. Unlike event-based programming, you can freely recurse into the library, start new parallel sections, and expect all runnable tasks to be processed, up to the number of threads that you started.

As your program runs, the API allows the TBB library to keep queues (trees actually) of runnable tasks. These are identified and kept in thread-affinitive task lists. Other threads can come along and "steal" work from these lists, to keep cores busy.

What this means is that it is relatively easy to make programs which can make best use of processing power available on newer multi-core CPUs.

Worker Interpreters

When the first threads::tbb::init object is made, one worker thread is created for each processor core or virtual core. This is performed by the TBB library before the Perl interpreter can "use strict;".

Subsequent constructor calls do not create new worker pthreads; they re-use the existing threads.

Each worker thread is, for the most part, completely isolated from the other threads - just like use threads. Unlike use threads, the perl_clone() function is never used. Instead, each interpreter must load all of the modules required to do useful work on its own. This is largely automatic; however, it isn't foolproof, and you will benefit from using the constructor thoughtfully.

Shared Data

Worker threads do not share any Perl variables with the main process. A system of "lazy deep cloning" is used to transport Perl data structures between threads; you must pass data through the concurrent container objects (such as threads::tbb::concurrent::array), as they are the only objects which are the same between threads. See threads::tbb::concurrent for more information.

You cannot share information between threads using threads::shared, nor use threads::lite's receive or receive_table; see threads::tbb::concurrent::queue#TODO

Unlocking malloc

Perl core does not yet ship with a thread-scalable malloc function (see "Allocate OPs from arenas" in perltodo). Memory allocation by Perl core will suffer both from contention (as all threads must use the memory allocator in turn) and from false sharing on SMP systems due to insufficient alignment of allocated blocks. That is, blocks smaller than the smallest unit of cache the processor can "own" are allocated, and this can cause cache contention.

So for the greatest scalability you will also need to use an arena-based memory allocator; a simple way to do this is by setting LD_PRELOAD=libtbbmalloc_proxy.so.2. See http://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat/

CLASS METHODS

The only threads::tbb class method is the constructor for a new TBB context. This context specifies the minimum set of modules that worker threads must have loaded. By default, workers should end up with the same module set as the main program has loaded at the time of the call.

use threads::tbb;
my $tbb = threads::tbb->new();

To make this happen, the library takes a copy of the %INC global variable (see "%INC" in perlvar) at compile time. It also saves and places a special callback onto the @INC global (see "require" in perlfunc) which records all of the modules later loaded by code.
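The recording callback described above can be approximated with a standard @INC hook. This is a minimal sketch, not the module's actual implementation; the @requested array and the choice of Text::Wrap as the module to load are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical recorder: collects every file name that require/use
# has to search for (files already in %INC are never searched for).
my @requested;

# A code reference on @INC is called as ($coderef, $filename).
# Returning an empty list tells Perl to carry on with the next @INC entry.
unshift @INC, sub {
    my ($self, $filename) = @_;
    push @requested, $filename;
    return;
};

require Text::Wrap;    # any module that is not yet loaded will do

print "recorded: $_\n" for @requested;
```

Because the hook sits at the front of @INC, it also sees the dependencies that Text::Wrap pulls in, in the order they are requested.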

It builds these into two lists which are passed to the worker threads for driving thread initialization before any work is done. They can be specified manually (as in threads::lite):

my $tbb = threads::tbb->new(
    lib => \@INC,   # default: @INC at module BEGIN time
    modules => [ qw(Math::BigRat) ],
);

lib

This is an ordered list of paths to prepend to @INC of the worker threads before any modules are loaded. If any paths already exist on @INC of the worker thread, they are not duplicated.
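The "prepend without duplicating" rule could look roughly like the following. This is only a sketch of the described behaviour; prepend_paths and the example paths are hypothetical and not part of threads::tbb.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend @paths to the array referenced by $inc,
# keeping their order and skipping any path that is already present.
sub prepend_paths {
    my ($inc, @paths) = @_;
    my %seen = map { $_ => 1 } @$inc;
    my @new  = grep { !$seen{$_}++ } @paths;
    unshift @$inc, @new;
    return $inc;
}

# Stand-in for a worker's @INC.
my @worker_inc = ('/usr/lib/perl5', '/usr/share/perl5');
prepend_paths(\@worker_inc, '/opt/app/lib', '/usr/lib/perl5', '/opt/app/lib');

print "$_\n" for @worker_inc;
```

Only /opt/app/lib is added, and only once; /usr/lib/perl5 keeps its original position.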

modules

This is an ordered list of modules to 'require' in the worker thread. The modules in this list are specified in module-form (eg "Math::BigRat"). If you would rather specify the list in require-form (eg "Math/BigRat.pm"), this is also possible:

my $tbb = threads::tbb->new( requires => [ "Math/BigRat.pm" ] );
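The two forms are related by a simple textual mapping. The helper names below are hypothetical and only illustrate the conversion; they are not functions exported by threads::tbb.

```perl
use strict;
use warnings;

# Module-form ("Math::BigRat") to require-form ("Math/BigRat.pm").
sub module_to_require {
    my ($module) = @_;
    (my $file = $module) =~ s{::}{/}g;
    return "$file.pm";
}

# Require-form ("Math/BigRat.pm") back to module-form ("Math::BigRat").
sub require_to_module {
    my ($file) = @_;
    (my $module = $file) =~ s{\.pm\z}{};
    $module =~ s{/}{::}g;
    return $module;
}

print module_to_require("Math::BigRat"), "\n";
print require_to_module("Math/BigRat.pm"), "\n";
```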

As the list of modules is processed, if any module encountered is already in %INC - for instance, if it was loaded as a dependency of another module - then it is not re-loaded.

The default is to take the %INC saved from the module load, and sort it such that, eg Moose/Object.pm sorts after Moose.pm, and then after that alphabetically. After this sorted list, any modules which were seen by require or use are added to the list in the order they were included in the main program.
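One plausible reading of that sort order is a component-wise comparison of the paths, with parents sorting before their children. The sketch below assumes that reading; module_order and the sample file list are hypothetical, and the module's real ordering may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical comparator: compare two require-form names component by
# component, so that "Moose.pm" sorts before "Moose/Object.pm".
sub module_order {
    my ($x, $y) = @_;
    (my $xs = $x) =~ s/\.pm\z//;
    (my $ys = $y) =~ s/\.pm\z//;
    my @x = split m{/}, $xs;
    my @y = split m{/}, $ys;
    while (@x && @y) {
        my $cmp = shift(@x) cmp shift(@y);
        return $cmp if $cmp;
    }
    return @x <=> @y;    # the shorter (parent) path comes first
}

# Stand-in for the keys of a saved %INC.
my @loaded = ("Moose/Object.pm", "strict.pm", "Moose.pm", "Carp.pm");
my @sorted = sort { module_order($a, $b) } @loaded;

print "$_\n" for @sorted;
```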

Note that if you add paths to the beginning of @INC yourself, after use threads::tbb but before threads::tbb->new(), then threads::tbb will not see them. So, put your use lib "path" statements before the first use threads::tbb;, or specify required modules yourself.

METHODS

These methods are available on body objects which must first be obtained by methods on the threads::tbb object.

parallel_for

parallel_for can be used to process a set of data. It is called on a body object and passed a range object. The body object encapsulates state, and the range selects a part of that state.
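To get a feel for how a range is carved into sub-ranges, here is a sequential sketch. It assumes the range behaves like TBB's blocked_range, which is divisible while its size exceeds the grain size and splits at its midpoint; whether threads::tbb::blocked_int matches this exactly is an assumption, and split_range is a hypothetical helper. At run time the scheduler decides which pieces are actually split and where they run.

```perl
use strict;
use warnings;

# Recursively halve [begin, end) until a piece is no larger than $grain.
# Returns a list of [begin, end) pairs as array references.
sub split_range {
    my ($begin, $end, $grain) = @_;
    $grain = 1 if $grain < 1;    # guard against endless splitting
    return [ $begin, $end ] if $end - $begin <= $grain;
    my $mid = $begin + int( ($end - $begin) / 2 );
    return ( split_range($begin, $mid, $grain),
             split_range($mid,   $end, $grain) );
}

my @chunks = split_range(0, 20, 5);
print "[$_->[0], $_->[1])\n" for @chunks;
```

With 20 items and a grain size of 5, this yields four sub-ranges of five items each; each sub-range would be handed to the body as one call.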

You can declare the body object using either of the following methods:

$tbb->for_int_array_func( \@array, "Some::Func" )

This returns a body object, suitable for use with a threads::tbb::blocked_int range, and allows a single threads::tbb::concurrent::array for shared state. The Some::Func subroutine will be called as:

&{"Some::Func"}( $range, $array_ref );

If it wants to communicate state, it should do so via the $array_ref.

$tbb->for_int_method( $object, "method" )

This will create a body object which calls the "method" method of $object on sub-divided ranges, as:

$object->method( $range );

$object will be cloned once for each worker, so can be modified and the results expected to stay consistent within the lifetime of the parallel_for; the calling $object will see none of them.

As more sophisticated body object types are implemented, they will have functions made for them, depending on what state they support, etc.

It's a good idea not to assume that the concurrent containers deep copy values passed through them unless you do it yourself; the only safe access is to assign an item from the container, and to assign an item back to the container. These operations will do deep copies where required, and pass references where the values came from the same interpreter. There is more discussion of this in threads::tbb::concurrent.
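The "assign out, modify, assign back" pattern can be demonstrated without TBB by using a tied array that always deep copies. CopyingArray below is a hypothetical stand-in built on Tie::StdArray and Storable; it is stricter than the real containers, which may pass references within one interpreter, so it shows the worst case you should code for.

```perl
package CopyingArray;    # hypothetical stand-in, NOT part of threads::tbb
use strict;
use warnings;
use Tie::Array;
use Storable qw(dclone);
our @ISA = ('Tie::StdArray');

# Hand out a deep copy of any reference on every read ...
sub FETCH {
    my ($self, $idx) = @_;
    my $value = $self->[$idx];
    return ref $value ? dclone($value) : $value;
}

# ... and store a deep copy on every write.
sub STORE {
    my ($self, $idx, $value) = @_;
    $self->[$idx] = ref $value ? dclone($value) : $value;
}

package main;

tie my @items, 'CopyingArray';
$items[0] = { count => 1 };

# Modifying through the container only changes a temporary copy.
$items[0]{count}++;
my $after_in_place = $items[0]{count};

# Safe pattern: assign out, modify, assign back.
my $item = $items[0];
$item->{count}++;
$items[0] = $item;
my $after_round_trip = $items[0]{count};

print "in place: $after_in_place, round trip: $after_round_trip\n";
```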

parallel_map#TODO

The calling convention of parallel_for allows you to manually specify the "grain size" - the number of iterations required to do useful work. It also reduces time spent in function call overhead. However, a simpler API could be possible:

use threads::tbb;

my $tbb = threads::tbb->new;

my @output = $tbb->parallel_map(sub {
    my $val = shift;
    $val->frobnicate();
    return $val;
}, @input);

This would use tbb::auto_partitioner(), which presumably times how long it takes to process a block for given input sizes, and adjusts the size of the blocks accordingly.

parallel_reduce#TODO

parallel_reduce is just parallel_for, but with another function to combine results from the array at the end.

To use parallel_reduce, you need to create a body object which has two functions: one to process a range, and one to combine two results.

# map/reduce
use threads::tbb;

tie my @array, "threads::tbb::concurrent::array";
push @array, @data;

# get a range for that array.  up to 5 at a time.
my $array_range = threads::tbb::blocked_int->new(0, $#array+1, 5);

my $tbb = threads::tbb->new;

# make a body object
my $body = $tbb->reduce_int_array_sub(
    \@array,
    sub {
        my $range = shift;

        # code may be executed in a thread.
        # we now have exclusive use of
        #   @array[$range->begin .. $range->end-1]

        my $price = 0;
        for my $i ($range->begin .. $range->end-1) {
            my $val = $array[$i];  # another lazy deep copy

            $price += $val->compute_cost();
        }
        return $price;
    },
    sub {
        # fold a value from a and b to a new value.
        # lazy deep copies in and out.  Check b is not undef.
        my ($a, $b) = @_;
        return defined $b ? $a + $b : $a;
    },
);

$tbb->parallel_reduce($array_range, $body);
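Since parallel_reduce is not implemented yet, the following sequential sketch only illustrates the shape of the computation: one partial result per block, then a fold over the partial results. It uses fixed-size blocks and List::Util rather than any threads::tbb API, and the data and grain size are made up for the example.

```perl
use strict;
use warnings;
use List::Util qw(reduce sum);

my @data  = (1 .. 12);
my $grain = 5;

# "Map" step: one partial sum per block of at most $grain items.
my @partials;
for (my $begin = 0; $begin < @data; $begin += $grain) {
    my $end = $begin + $grain;
    $end = @data if $end > @data;
    push @partials, sum( @data[ $begin .. $end - 1 ] );
}

# "Reduce" step: fold the partial results pairwise into one value.
my $total = reduce { $a + $b } @partials;

print "partials: @partials\n";
print "total: $total\n";
```

In a parallel version the blocks could be processed in any order, which is why the combining function has to give the same answer regardless of how the partial results are paired up.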

pipeline / filter #TODO

This extremely useful API allows you to structure code that performs multiple discrete steps on a continuous stream of data, with worker threads picking up whatever needs doing.

parallel_while#TODO

This one could potentially be used to implement a generic multi-processor event loop.

my $i = 20;
my $iterator = sub {
    $i-- || undef;
};
# deep copied to the sub in this block
parallel_while($iterator, sub {

    # you can add another iterator to the while block
    parallel_while_add($iterator2);

});

Each of the iterators added run in the thread context of the interpreter that added them.

parallel_sort#TODO

Sorting with some scalability. No plan for this yet; it probably also would not scale beyond one processor without naughty cross-thread peeking (see threads::tbb::concurrent).

parallel_scan #TODO #LATER

... an obscure one; see http://en.wikipedia.org/wiki/Prefix_sum ...

Not implemented.
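For reference, this is what a scan computes, shown as a plain sequential sketch. prefix_sum is a hypothetical helper; nothing here is parallel or part of threads::tbb.

```perl
use strict;
use warnings;

# Inclusive prefix sum: element i of the result is the sum of
# input elements 0 .. i.
sub prefix_sum {
    my @in = @_;
    my $running = 0;
    my @out;
    for my $x (@in) {
        $running += $x;
        push @out, $running;
    }
    return @out;
}

my @sums = prefix_sum(1, 2, 3, 4);
print "@sums\n";
```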

SEE ALSO

threads::tbb::blocked_int, threads::tbb::concurrent, threads::tbb::concurrent::item, threads::tbb::concurrent::array, threads::tbb::concurrent::hash

threads, threads::lite

http://threadingbuildingblocks.org

Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism, By James Reinders. Publisher: O'Reilly Media. Released: July 2007. isbn://978-0-596-51480-8 (print) isbn://978-0-596-15959-7 (ebook).

AUTHOR AND LICENSE

threads::tbb was written by Sam Vilain <sam.vilain@openparallel.com>

Copyright (c) 2011, OpenParallel. threads::tbb is Free Software; you may use it and/or modify it under the same terms as Perl itself.

The TBB library itself is GPL-2, with a special exception that you may use it as a part of a free software library without restriction. Whether that implies that use of this library imparts the freedoms granted by the GPL on users receiving copies of software built using this library, or whether using with, say, a GPL-3 library revokes the right to copy the software is left as an exercise for the OSS licensing geek reader.

CHANGES

version 0.02, May 10 2011

This version principally adds the corresponding white paper and a couple of minor documentation changes. Hopefully the next version will actually implement some more of the TBB API!