NAME
threads::tbb - interface to the Threading Building Blocks (TBB) API
SYNOPSIS
    # this synopsis is available as examples/incredible-threadable.pl
    package Incredible::Threadable;
    use threads::tbb;

    sub new {
        my $class = shift;
        # make containers which are efficient and thread-safe
        tie my @input, "threads::tbb::concurrent::array";
        push @input, @_;    # coming soon: @input = @_
        tie my @output, "threads::tbb::concurrent::array";
        bless { input  => \@input,
                output => \@output, }, $class;
    }

    sub parallel_transmogrify {
        my $self = shift;
        # Initialize the TBB library, and set a specification of required
        # modules and/or library paths for worker threads.
        my $tbb = threads::tbb->new( requires => [ $0 ] );
        my $min = 0;
        my $max = scalar @{ $self->{input} };
        my $range = threads::tbb::blocked_int->new( $min, $max, 5 );
        my $body = $tbb->for_int_method( $self, "my_callback" );
        $body->parallel_for( $range );
    }

    sub my_callback {
        my $self = shift;
        my $int_range = shift;
        for my $idx ($int_range->begin .. $int_range->end-1) {
            my $item = $self->{input}->[$idx];
            my $transmuted = $item->transmogrify;
            $self->{output}->[$idx] = $transmuted;
        }
    }

    # accessor used by the main program below to read the output container
    sub results {
        my $self = shift;
        return @{ $self->{output} };
    }

    package Item;

    sub transmogrify {
        my $self = shift;
        "Ex-$self->{id}";
    }

    package main;
    use feature 'say';

    unless ($threads::tbb::worker) {    # single script uses can use this
        my $parallel_transmogrificator = Incredible::Threadable->new(
            map { chomp; bless { id => $_ }, "Item" } <>
        );
        $parallel_transmogrificator->parallel_transmogrify();
        say "Turned to $_" for $parallel_transmogrificator->results();
    }
DESCRIPTION
This module provides Perl programs with access to a selection of Intel's Threading Building Blocks (TBB) library. TBB is a C++ library that provides pre-tested, scalable algorithms for solving a number of common problems that benefit from parallelism.
The API provided by TBB and this module is quite different from threads as provided by POSIX threads and "use threads;". Instead of directly starting threads and managing their activity and communication/synchronisation, you use an API built around data parallelism:
divide and conquer processing with task stealing via the "parallel_for" and "parallel_reduce#TODO" APIs
(coming soon) data flow processing via the "pipeline" API
(later, maybe) task-oriented programming via the "task" API
Not a thread-centric API
With threads::tbb, you don't write your algorithms from the perspective of a thread and what the thread should do next. Instead, a selection of parallelism primitives which have been found to be workable and scalable are provided.
Just as when writing "co-operative multi-threading" programs with Event or POE, the challenge is to break heavy work into small but substantial, generally non-blocking chunks of work. "Substantial" is yet to be quantified; it is likely to be in the ballpark of thousands of Perl runloop iterations. Unlike event-based programming, you can freely recurse into the library, start new parallel sections, and expect all runnable tasks to process, up to the number of threads that you started.
As your program runs, the API allows the TBB library to keep queues (trees actually) of runnable tasks. These are identified and kept in thread-affinitive task lists. Other threads can come along and "steal" work from these lists, to keep cores busy.
What this means is that it is relatively easy to write programs which make good use of the processing power available on newer multi-core CPUs. It also avoids per-thread overheads by starting only as many threads as are required to use all of the parallelism in hardware. Each thread requires its own C stack and a complete Perl interpreter, so it is generally not desirable to create more threads than the hardware has available.
Worker Interpreters
When the first threads::tbb::init object is made, one worker thread is created for each processor core or virtual core. This is performed by the TBB library before the Perl interpreter can "use strict;". Subsequent calls to it will not create new worker pthreads; instead, they will re-use the existing threads.
Each worker thread is, for the most part, completely isolated from the other threads, just like "use threads". Unlike "use threads", the perl_clone() function is never used. Instead, each interpreter must load all of the modules required to get it to do useful work on its own. This is largely automatic; however, it isn't foolproof and you will benefit from using the constructor thoughtfully.
Shared Data
Worker threads do not share any Perl variables with the main process. A system of "lazy deep cloning" is used to transport Perl data structures between threads. You must pass data through the concurrent container objects, as they are the only objects which are the same between threads. See threads::tbb::concurrent for more information.
You cannot share information between threads using threads::shared, nor use threads::lite's receive or receive_table; see threads::tbb::concurrent::queue#TODO.
Unlocking malloc
Perl core does not yet ship with a thread-scalable malloc function (see "Allocate OPs from arenas" in perltodo). Memory allocation by Perl core will suffer both from contention (as all threads must use the memory allocator in turn) and from false sharing on SMP systems due to insufficient alignment of allocated blocks. That is, blocks smaller than the smallest unit of cache the processor can "own" are allocated, and this can cause cache contention.
So for the greatest scalability you will also need to use an arena-based memory allocator; a simple way to do this is by setting LD_PRELOAD=libtbbmalloc_proxy.so.2. See http://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat/
CLASS METHODS
The only threads::tbb class method is the constructor for a new TBB context. This context is a demand that worker threads have at least the specified set of modules loaded. By default, workers should end up with the same module set as "now".
    use threads::tbb;
    my $tbb = threads::tbb->new();
To make this happen, the library takes a copy of the %INC global variable (see "%INC" in perlvar) at compile time. It also saves and places a special callback onto the @INC global (see "require" in perlfunc) which records all of the modules later loaded by code.
It builds these into two lists which are passed to the worker threads for driving thread initialization before any work is done. They can be specified manually (as in threads::lite):
    my $tbb = threads::tbb->new(
        lib     => \@INC,    # default: @INC at module BEGIN time
        modules => [ qw(Math::BigRat) ],
    );
lib
    This is an ordered list of paths to prepend to @INC of the worker threads before any modules are loaded. If any paths already exist on @INC of the worker thread, they are not duplicated.
modules
    This is an ordered list of modules to 'require' in the worker thread. The modules in this list are specified in module-form (eg "Math::BigRat"). If you would rather specify a list in require-form (eg "Math/BigRat.pm"), this is also possible:
        my $tbb = threads::tbb->new( requires => [ "Math/BigRat.pm" ] );
    As the list of modules is processed, if any module encountered is already in %INC (for instance, if it was loaded as a dependency of another module), then it is not re-loaded.
    The default is to take the %INC saved from the module load, and sort it such that, eg, Moose/Object.pm sorts after Moose.pm, and then after that alphabetically. After this sorted list, any modules which were seen by require or use are added to the list in the order they were included in the main program.
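The default behaviour described above can be approximated in plain Perl. The following is a minimal sketch, not the module's actual implementation: it snapshots %INC, installs an @INC hook that records later loads, and builds a require list in which a parent file sorts before its children. The names %inc_snapshot, @seen_later and sorted_requires are hypothetical, and Text::Wrap merely stands in for any module loaded after the snapshot.

```perl
use strict;
use warnings;

# Snapshot of everything loaded so far (an assumption about what
# "a copy of %INC" amounts to).
my %inc_snapshot = %INC;

# An @INC hook is called as ($hook, $filename) for each require that
# reaches it. Returning an empty list tells Perl to keep searching the
# remaining @INC entries, so the hook only observes.
my @seen_later;
unshift @INC, sub {
    my ( $self, $file ) = @_;
    push @seen_later, $file;
    return;
};

require Text::Wrap;    # stands in for any module loaded after the snapshot

# One ordering that puts Moose.pm before Moose/Object.pm: compare the
# file names with the ".pm" suffix stripped, so a parent is a strict
# prefix of its children and therefore sorts first.
sub sorted_requires {
    my @files = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { my $key = $_; $key =~ s/\.pm\z//; [ $key, $_ ] } @files;
}

# Sorted snapshot first, then later loads in the order they were seen.
my @requires = (
    sorted_requires( keys %inc_snapshot ),
    grep { !exists $inc_snapshot{$_} } @seen_later,
);

print "$_\n" for @requires;
```

A list built this way is in require-form, so it has the same shape as the value passed to the "requires" option shown earlier.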
Note if you add paths to the beginning of @INC yourself, after "use threads::tbb" but before threads::tbb->new(), then threads::tbb will not see them. So, put your use lib "path" statements before the first "use threads::tbb;", or specify required modules yourself.
METHODS
These methods are usually available on body objects which must first be obtained by methods on the threads::tbb object; some convenience wrappers incorporate this stage for you.
parallel_for
parallel_for can be used to process a set of data. It is passed a range object, and a body object. The body object encapsulates state, and the range selects a part of that state.
You can declare the body object using either of the following methods:
$tbb->for_int_array_func( \@array, "Some::Func" )
    This returns a body object, suitable for use with a threads::tbb::blocked_int range, and allows a single threads::tbb::concurrent::array for shared state. The Some::Func subroutine will be called as:
        &{"Some::Func"}( $range, $array_ref );
    If it wants to communicate state, it should do so via the $array_ref.
$tbb->for_int_method( $object, "method" )
    This will create a body object which calls the "method" method of $object on sub-divided ranges, as:
        $object->method( $range );
    $object will be cloned once for each worker, so can be modified and the results expected to stay consistent within the lifetime of the parallel_for; the calling $object will see none of them.
As more sophisticated body object types are implemented, they will have functions made for them, depending on what state they support.
It's a good idea not to assume that the concurrent containers are deep copying values passed through them unless you do it yourself; the only safe access is to assign an item from the container, and to assign an item back to the container. These operations will do deep copies where required, and pass references where the values came from the same interpreter. There is more discussion of this in threads::tbb::concurrent.
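The fetch, modify, store-back pattern recommended above can be illustrated without the library. In this sketch, Storable's dclone stands in for the container's copying (the real containers clone lazily and may pass references within one interpreter); fetch_item and store_item are hypothetical helpers, not part of threads::tbb.

```perl
use strict;
use warnings;
use Storable qw(dclone);

# Stand-in for a concurrent container: every read and write deep copies.
my @container = ( { id => 1, tags => ['raw'] } );

sub fetch_item {
    my ($idx) = @_;
    return dclone( $container[$idx] );
}

sub store_item {
    my ( $idx, $value ) = @_;
    $container[$idx] = dclone($value);
    return;
}

my $item = fetch_item(0);             # assign an item from the container
push @{ $item->{tags} }, 'cooked';    # changes the private copy only
my $before = scalar @{ $container[0]{tags} };

store_item( 0, $item );               # assign the item back
my $after = scalar @{ $container[0]{tags} };

print "tags before store: $before, after store: $after\n";
```

Modifying a fetched value in place and expecting the container to notice is the habit this pattern is meant to rule out; the change only becomes visible once the item is assigned back.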
map_list_func
This planned API wraps the for_int_array_func body object for a map-like API. The function must be static.
    use threads::tbb;
    my $tbb = threads::tbb->new;

    sub Some::Static::Func {
        my $val = shift;
        $val->frobnicate();
        return $val;
    }

    my @result = $tbb->map_list_func('Some::Static::Func', @array);
The grain size for this operation defaults to 1. See threads::tbb::blocked_int for a discussion on what that refers to. It can be overridden by defining a scalar with the same name as the function, eg in the above example:
    $Some::Static::Func = 3;    # process 3 items at a time
It does not affect the calling convention; each call to the function named is passed a single item only.
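To make the effect of the grain size concrete, here is a serial sketch of recursive range splitting in the style of TBB's blocked ranges, where a range is divided at its midpoint while it is larger than the grain size. The helpers split_range and double_it are hypothetical, and the real scheduler may stop splitting earlier depending on the partitioner. The point is only that the function is still called once per item, whatever the chunk size.

```perl
use strict;
use warnings;

# Hypothetical helper: halve [begin, end) until no piece exceeds $grain.
sub split_range {
    my ( $begin, $end, $grain ) = @_;
    return [ $begin, $end ] if $end - $begin <= $grain;
    my $mid = $begin + int( ( $end - $begin ) / 2 );
    return ( split_range( $begin, $mid, $grain ),
             split_range( $mid,   $end, $grain ) );
}

# Hypothetical per-item function, playing the role of Some::Static::Func.
sub double_it { return 2 * $_[0] }

my @items  = map { $_ * 10 } 1 .. 10;
my @chunks = split_range( 0, scalar @items, 3 );

my @result;
for my $chunk (@chunks) {
    my ( $begin, $end ) = @$chunk;
    # Each chunk would be one unit of schedulable work; the mapped
    # function still receives a single item per call.
    $result[$_] = double_it( $items[$_] ) for $begin .. $end - 1;
}

printf "[%d, %d)\n", @$_ for @chunks;
```

A larger grain size yields fewer, larger chunks, which trades scheduling overhead against the amount of parallelism available.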
reduce_int_array_func#TODO
parallel_reduce is just parallel_for, but with another function to combine results from the array at the end.
To use parallel_reduce via this API, you first create a body object from two functions: one which maps a sub-range to a partial result, and one which combines two partial results.
    # map/reduce
    use threads::tbb;
    use feature 'say';

    tie my @array, "threads::tbb::concurrent::array";
    push @array, @data;    # @data: your own list of items (not shown)

    # get a range for that array. up to 5 at a time.
    my $array_range = threads::tbb::blocked_int->new(0, $#array+1, 5);
    my $tbb = threads::tbb->new;

    # make a body object
    my $body = $tbb->reduce_int_array_func(
        \@array,
        __PACKAGE__.'::map_func',
        __PACKAGE__.'::reduce_func',
    );

    sub map_func {
        my $array = shift;
        my $range = shift;
        # code may be executed in another thread.
        # we now have exclusive use of
        # $array->[$range->begin .. $range->end-1]
        my $price = 0;
        for my $i ($range->begin .. $range->end-1) {
            my $val = $array->[$i];    # another lazy deep copy
            $price += $val->compute_cost();
        }
        return $price;
    }

    sub reduce_func {
        my ($a, $b) = @_;
        return defined $b ? $a + $b : $a;
    }

    $tbb->parallel_reduce($array_range, $body);
    say "total is ", $body->result;
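The data flow of a parallel reduce can be emulated serially, which is a convenient way to check the two functions before involving threads. This sketch is not the planned reduce_int_array_func API: it uses plain numbers in place of objects with a compute_cost method, fixed-size chunks in place of scheduler-chosen ranges, and hypothetical variable names.

```perl
use strict;
use warnings;

my @prices = ( 3, 1, 4, 1, 5, 9, 2, 6 );
my $grain  = 3;    # at most 3 items per chunk

# Map step: reduce one sub-range [begin, end) to a partial sum.
sub map_func {
    my ( $array, $begin, $end ) = @_;
    my $sum = 0;
    $sum += $array->[$_] for $begin .. $end - 1;
    return $sum;
}

# Reduce step: combine two partial results; tolerate a missing second one.
sub reduce_func {
    my ( $left, $right ) = @_;
    return defined $right ? $left + $right : $left;
}

# Each chunk could run in a different worker; here they run in turn.
my @partials;
for ( my $begin = 0 ; $begin < @prices ; $begin += $grain ) {
    my $end = $begin + $grain;
    $end = @prices if $end > @prices;
    push @partials, map_func( \@prices, $begin, $end );
}

# Fold the partial results together pairwise.
my $total = shift @partials;
$total = reduce_func( $total, $_ ) for @partials;

print "total is $total\n";
```

Because addition is associative, the order in which partial results are combined does not change the total; a reduce function without that property would give results that depend on how the range happened to be split.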
map_reduce_list_func#TODO
Just like map_list_func, this is a convenience wrapper for reduce_int_array_func.
    use threads::tbb;
    my $tbb = threads::tbb->new;

    sub Some::Static::Func {
        my $val = shift;
        $val->frobnicate();
        return $val;
    }

    sub Some::Static::reduction {
        my ($a, $b) = @_;
        return defined $b ? $a + $b : $a;
    }

    my $result = $tbb->map_reduce_list_func(
        'Some::Static::Func', 'Some::Static::reduction',
        @array
    );
pipeline / filter #TODO
This extremely useful API allows you to structure code that performs multiple discrete steps on a continuous stream of data, with worker threads picking up whatever needs doing.
parallel_while#TODO
This one could potentially be used to implement a generic multi-processor event loop.
    my $i = 20;
    my $iterator = sub {
        $i-- || undef;
    };
    # deep copied to the sub in this block
    parallel_while($iterator, sub {
        # you can add another iterator to the while block
        parallel_while_add($iterator2);
    });
Each of the iterators added run in the thread context of the interpreter that added them.
parallel_sort#TODO
Sorting with some scalability. No plan for this yet; it probably also would not scale beyond one processor without naughty cross-thread peeking (see threads::tbb::concurrent).
parallel_scan #TODO #LATER
... an obscure one; see http://en.wikipedia.org/wiki/Prefix_sum ...
Not implemented.
SEE ALSO
threads::tbb::blocked_int, threads::tbb::concurrent, threads::tbb::concurrent::item, threads::tbb::concurrent::array, threads::tbb::concurrent::hash
http://threadingbuildingblocks.org
Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism, By James Reinders. Publisher: O'Reilly Media. Released: July 2007. isbn://978-0-596-51480-8 (print) isbn://978-0-596-15959-7 (ebook).
AUTHOR AND LICENSE
threads::tbb was written by Sam Vilain, sam.vilain@openparallel.com
Copyright (c) 2011, OpenParallel. threads::tbb is Free Software; you may use it and/or modify it under the same terms as Perl itself.
The TBB library itself is GPL-2, with a special exception that you may use it as a part of a free software library without restriction. Whether that implies that use of this library imparts the freedoms granted by the GPL on users receiving copies of software built using this library, or whether using with, say, a GPL-3 library revokes the right to copy the software is left as an exercise for the OSS licensing geek reader.
CHANGES
version 0.03, July 7 2011
    New convenience wrapper for marking foreign XS object types as safely sharable between threads; see threads::tbb::refcounter.
    New map_list_func convenience API (wrapper for for_int_array_func); see "map_list_func" in threads::tbb.
    Documentation enhancements.
    Memory leak fixes with containers: they now free their contents when they are freed.
version 0.02, May 10 2011
    This version principally adds the corresponding white paper and a couple of minor documentation changes. Hopefully the next version will actually implement some more of the TBB API!