NAME

PDL::Parallel::threads - sharing PDL data between Perl threads

SYNOPSIS

use PDL;
use PDL::Parallel::threads qw(retrieve_pdls share_pdls);

# Technically, this is pulled in for you by PDL::Parallel::threads,
# but using it in your code pulls in the named functions like async.
use threads;

# Also, technically, you can use PDL::Parallel::threads with
# single-threaded programs, and even with perls that were not
# compiled with thread support.

# Create some shared PDL data
zeroes(1_000_000)->share_as('My::shared::data');

# Create an ndarray and share its data
my $test_data = sequence(100);
share_pdls(some_name => $test_data);  # allows multiple at a time
# ... or, equivalently, use the PDL method (a name can only be
# shared once, so use one form or the other):
# $test_data->share_as('some_name');

# Kick off some processing in the background
async {
    my ($shallow_copy)
        = retrieve_pdls('some_name');

    # thread-local memory
    my $other_ndarray = sequence(20);

    # Modify the shared data:
    $shallow_copy++;
};

# ... do some other stuff ...

# Rejoin all threads
for my $thr (threads->list) {
    $thr->join;
}

use PDL::NiceSlice;
print "First ten elements of test_data are ",
    $test_data(0:9), "\n";

DESCRIPTION

This module provides a means to share PDL data between different Perl threads. In contrast to PDL's posix thread support (see PDL::ParallelCPU), this module lets you work with Perl's built-in threading model. In contrast to Perl's threads::shared, this module focuses on sharing data, not variables.

Because this module focuses on sharing data, not variables, it does not use attributes to mark shared variables. Instead, you must explicitly share your data by using the "share_pdls" function or the "share_as" PDL method that this module introduces. Both associate a name with your data, which you use in other threads to retrieve the data with the "retrieve_pdls" function. Once your thread has access to the ndarray data, any modifications will operate directly on the shared memory, which is exactly what shared data is supposed to do. When you are completely done using a piece of data, you need to explicitly remove the data from the shared pool with the "free_pdls" function. Otherwise your data will continue to consume memory until the originating thread terminates, or put differently, you will have a memory leak.
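The share, retrieve, and free steps described above can be sketched as follows. This is a minimal illustration rather than a recommended structure; the name counts and the variable names are made up for the example.

```perl
use strict;
use warnings;
use threads;
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls free_pdls);

# Share a freshly allocated ndarray under a simple name.
my $counts = zeroes(10);
$counts->share_as('counts');

# A worker looks the data up by name and modifies it in place.
my $worker = threads->create(sub {
    my $shared = retrieve_pdls('counts');
    $shared += 5;
    return;
});
$worker->join;

# The change is visible through the original ndarray.
my $total = $counts->sum;
print "sum after worker: $total\n";

# Release the name once nobody needs the data any more.
free_pdls('counts');
```

The worker is joined before the data is read or freed, so the main thread never races with it.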

This module lets you share two sorts of ndarray data. You can share data for an ndarray that is based on actual physical memory, such as the result of "zeroes" in PDL::Core. You can also share data using memory mapped files. (Note: PDL v2.4.11 and higher support memory mapped ndarrays on all major platforms, including Windows.) There are other sorts of ndarrays whose data you cannot share. You cannot directly share ndarrays that have not been physicalised, though a simple "make_physical" in PDL::Core, "sever" in PDL::Core, or "copy" in PDL::Core will give you an ndarray based on physical memory that you can share. Also, certain functions wrap external data into ndarrays so you can manipulate them with PDL methods. For example, see "plmap" in PDL::Graphics::PLplot and "plmeridians" in PDL::Graphics::PLplot. These you cannot share directly, but making a physical copy with "copy" in PDL::Core will give you something that you can safely share.
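As a small sketch of the advice above, the following takes a slice, which is a view with data flow rather than its own physical memory, and makes a physical copy before sharing it. The name head and the variables are hypothetical, and the example retrieves the data in the same thread only to keep it short.

```perl
use strict;
use warnings;
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls free_pdls);

my $parent = sequence(10);

# copy() yields an independent, physical ndarray that can be shared.
my $head = $parent->slice('0:4')->copy;
$head->share_as('head');

# Writing through the retrieved ndarray touches the shared memory.
my $alias = retrieve_pdls('head');
$alias .= -1;

# The shared copy changed; the parent is no longer connected to it.
my $head_sum   = $head->sum;
my $parent_sum = $parent->sum;
print "head: $head_sum, parent: $parent_sum\n";

free_pdls('head');
```

Because the copy is disconnected from its parent, changes made by other threads do not flow back into $parent. If you need them there, assign the result back yourself after the threads finish.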

Physical Memory

The mechanism by which this module achieves data sharing of physical memory is remarkably cheap. It's even cheaper than a simple affine transformation. The sharing works by creating a new shell of an ndarray for each call to "retrieve_pdls" and setting that ndarray's memory structure to point back to the same locations of the original (shared) ndarray. This means that you can share ndarrays that are created with standard constructors like "zeroes" in PDL::Core, "pdl" in PDL::Core, and "sequence" in PDL::Basic, or which are the result of operations and function evaluations for which there is no data flow, such as "cat" in PDL::Core (but not "dog" in PDL::Core), arithmetic, "copy" in PDL::Core, and "sever" in PDL::Core. When in doubt, sever your ndarray before sharing and everything should work.

There is an important nuance to sharing physical memory: The memory will always be freed when the originating thread terminates, even if it terminated cleanly. This can lead to segmentation faults when one thread exits and frees its memory before another thread has had a chance to finish calculations on the shared data. It is best to use barrier synchronization to avoid this (via PDL::Parallel::threads::SIMD), or to share data solely from your main thread.
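One way to follow the second suggestion is to share from the main thread and join every worker before the data is read or freed. The sketch below assumes four workers that each fill their own, non-overlapping range; the worker count, chunk size, and the name data are arbitrary choices for the example.

```perl
use strict;
use warnings;
use threads;
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls free_pdls);

my $n_workers = 4;
my $chunk     = 5;

# Share from the main thread, which outlives every worker.
my $data = zeroes($n_workers * $chunk);
$data->share_as('data');

my @workers = map {
    my $id = $_;
    threads->create(sub {
        my $shared = retrieve_pdls('data');
        my $lo = $id * $chunk;
        my $hi = $lo + $chunk - 1;
        # Each worker writes only to its own range.
        $shared->slice("$lo:$hi") .= $id + 1;
        return;
    });
} 0 .. $n_workers - 1;

# Join every worker before the data is used or freed.
$_->join for @workers;

my $sum = $data->sum;
print "sum over all chunks: $sum\n";

free_pdls('data');
```

Keeping the ranges disjoint avoids the need for locking in this particular sketch; overlapping writes would need some form of synchronization.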

Memory Mapped Data

As of 0.07, sharing the data of a memory-mapped ndarray works exactly the same way as sharing any other ndarray. It has not been tested with PDL::IO::FlexRaw-mapped ndarrays.

Package and Name Munging

PDL::Parallel::threads lets you associate your data with a specific text name. Put differently, it provides a global namespace for data. Users of the C programming language will immediately notice that this means there is plenty of room for developers using this module to choose the same name for their data. Without some combination of discipline and help, it would be easy for shared memory names to clash. One solution to this would be to require users (i.e. you) to choose names that include their current package, such as My-Module-workspace or, following perlpragma, My::Module/workspace instead of just workspace. This is sometimes called name mangling. Well, I decided that this is such a good idea that PDL::Parallel::threads does the second form of name mangling for you automatically! Of course, you can opt out, if you wish.

The basic rules are that the package name is prepended to the name of the shared memory as long as the name is only composed of word characters, i.e. names matching /^\w+$/. Here's an example demonstrating how this works:

package Some::Package;
use PDL;
use PDL::Parallel::threads 'retrieve_pdls';

# Stored under '??foo'
sequence(20)->share_as('??foo');

# Shared as 'Some::Package/foo'
zeroes(100)->share_as('foo');

sub do_something {
  # Retrieve 'Some::Package/foo'
  my $copy_of_foo = retrieve_pdls('foo');

  # Retrieve '??foo':
  my $copy_of_weird_foo = retrieve_pdls('??foo');

  # ...
}

# Move to a different package:
package Other::Package;
use PDL::Parallel::threads 'retrieve_pdls';

sub something_else {
  # Retrieve 'Some::Package/foo'
  my $copy_of_foo = retrieve_pdls('Some::Package/foo');

  # Retrieve '??foo':
  my $copy_of_weird_foo = retrieve_pdls('??foo');

  # ...
}

The upshot of all of this is that if you use some module that also uses PDL::Parallel::threads, namespace clashes are highly unlikely to occur as long as you (and the author of that other module) use simple names, like the sort of thing that works for variable names.
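The naming rule can be written down in a few lines of plain Perl. The helper below, sketch_munged_name, is hypothetical and not part of this module's API; it only mirrors the rule as documented here, taking the package as an explicit argument.

```perl
use strict;
use warnings;

# Hypothetical helper: reproduces the documented rule for combining
# a package name with a shared-data name.
sub sketch_munged_name {
    my ($package, $name) = @_;
    # Names made only of word characters get the package prepended.
    return "$package/$name" if $name =~ /^\w+$/;
    # Anything else is used verbatim.
    return $name;
}

my @examples = (
    sketch_munged_name('Some::Package',  'foo'),
    sketch_munged_name('Some::Package',  '??foo'),
    sketch_munged_name('Other::Package', 'Some::Package/foo'),
);
print "$_\n" for @examples;
```

The first and third calls give the same string, which is why code in Other::Package can reach data that Some::Package shared under the simple name foo.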

FUNCTIONS

This module provides three stand-alone functions and adds one new PDL method.

share_pdls

Shares ndarray data across threads using the given names.

share_pdls (name => ndarray, name => ndarray, ...)

This function takes key/value pairs where the value is the ndarray to store, and the key is the name under which to store the ndarray. You can later retrieve the memory with the "retrieve_pdls" function.

Sharing an ndarray with physical memory (or that is memory-mapped) increments the data's reference count; you can decrement the reference count by calling "free_pdls" on the given name. In general this ends up doing what you mean, and freeing memory only when you are really done using it.

my $data1 = zeroes(20);
my $data2 = ones(30);
share_pdls(foo => $data1, bar => $data2);

This can be combined with constructors and fat commas to allocate a collection of shared memory that you may need to use for your algorithm:

share_pdls(
    main_data => zeroes(1000, 1000),
    workspace => zeroes(1000),
    reduction => zeroes(100),
);

share_pdls preserves the badflag and badvalue on ndarrays.

share_as

Method to share an ndarray's data across threads under the given name.

$pdl->share_as(name)

This PDL method lets you directly share an ndarray. It does the exact same thing as "share_pdls", but its invocation is a little different:

# Directly share some constructed memory
sequence(20)->share_as('baz');

# Share individual ndarrays:
my $data1 = zeroes(20);
my $data2 = ones(30);
$data1->share_as('foo');
$data2->share_as('bar');

Like many other PDL methods, this method returns the just-shared ndarray. This can lead to some amusing ways of storing partial calculations partway through a long chain:

my $results = $input->sumover->share_as('pre_offset') + $offset;

# Now you can get the result of the sumover operation
# before that offset was added, by calling:
my $pre_offset = retrieve_pdls('pre_offset');

This method achieves the same end as "share_pdls". There's More Than One Way To Do It, and this way can make for easier-to-read code. In general I recommend using the share_as method when you only need to share a single ndarray.

share_as preserves the badflag and badvalue on ndarrays.

retrieve_pdls

Obtain ndarrays providing access to the data shared under the given names.

my ($copy1, $copy2, ...) = retrieve_pdls (name, name, ...)

This function takes a list of names and returns a list of ndarrays that provide access to the data shared under those names. In scalar context the function returns the ndarray corresponding with the first named data set, which is usually what you mean when you use a single name. If you specify multiple names but call it in scalar context, you will get a warning indicating that you probably meant to say something differently.

my $local_copy = retrieve_pdls('foo');
my @both_ndarrays = retrieve_pdls('foo', 'bar');
my ($foo, $bar) = retrieve_pdls('foo', 'bar');

retrieve_pdls preserves the badflag and badvalue on ndarrays.
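A quick way to see this for yourself is to mark an element as bad, share the ndarray, and inspect what comes back. This is only a sketch: the name measurements is made up, and the retrieval happens in the same thread to keep it short.

```perl
use strict;
use warnings;
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls free_pdls);

# Mark one element as bad, then share the ndarray.
my $measurements = sequence(5);
$measurements->setbadat(2);
$measurements->share_as('measurements');

# The retrieved ndarray should report the same bad-value state.
my $view    = retrieve_pdls('measurements');
my $has_bad = $view->badflag;
my $n_bad   = $view->nbad;
print "badflag: $has_bad, bad elements: $n_bad\n";

free_pdls('measurements');
```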

free_pdls

Frees the shared memory (if any) associated with the named shared data.

free_pdls(name, name, ...)

This function marks the memory associated with the given names as no longer being shared, handling all reference counting and other low-level stuff. You generally won't need to worry about the return value. But if you care, you get a list of values---one for each name---where a successful removal gets the name and an unsuccessful removal gets an empty string.

So, if you say free_pdls('name1', 'name2') and both removals were successful, you will get ('name1', 'name2') as the return values. If there was trouble removing name1 (because there is no memory associated with that name), you will get ('', 'name2') instead. This means you can handle trouble with Perl's grep and other conditionals:

my @to_remove = qw(name1 name2 name3 name4);
my @results = free_pdls(@to_remove);
if (not grep {$_ eq 'name2'} @results) {
    print "That's weird; did you remove name2 already?\n";
}
if (not $results[2]) {
    print "Couldn't remove name3 for some reason\n";
}

This function simply removes an ndarray's memory from the shared pool. It does not interact with bad values in any way. But then again, it does not interfere with or screw up bad values, either.

DIAGNOSTICS

share_pdls: expected key/value pairs

You called share_pdls with an odd number of arguments, which means that you could not have supplied key/value pairs. Double-check that every ndarray (or filename) that you supply is preceded by its shared name.

share_pdls: you already have data associated with '$name'

You tried to share some data under $name, but some data is already associated with that name. Typo? You can avoid namespace clashes with other modules by using simple names and letting PDL::Parallel::threads mangle the name internally for you.

share_pdls: Could not share an ndarray under name '$name' because ...

... the ndarray does not have any allocated memory.

You tried to share an ndarray that does not have any memory associated with it.

... the ndarray's data does not come from the datasv.

You tried to share an ndarray that has a funny internal structure, in which the data does not point to the buffer portion of the datasv. I'm not sure how that could happen without triggering a more specific error, so I hope you know what's going on if you get this. :-)

share_pdls passed data under '$name' that it does not know how to store

share_pdls only knows how to store raw data ndarrays. It'll croak if you try to share other kinds of ndarrays, and it'll throw this error if you try to share anything else, like a hashref.

retrieve_pdls: '$name' was created in a thread that has ended or is detached

In some other thread, you added some data to the shared pool. If that thread ended without you freeing that data (or the thread has become a detached thread), then we cannot know if the data is available. You should always free your data from the data pool when you're done with it, to avoid this error.

retrieve_pdls could not find data associated with '$name'

Pretty simple: either data has never been added under this name, or data under this name has been removed.
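If a missing name is a condition your code can recover from, you can trap the error with eval. The sketch below frees a name and then tries to look it up again; the name scratch is hypothetical, and the pattern match on the message simply assumes the wording given in this section.

```perl
use strict;
use warnings;
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls free_pdls);

zeroes(3)->share_as('scratch');
free_pdls('scratch');

# After freeing, the name no longer resolves, so the lookup croaks.
my $found = eval { retrieve_pdls('scratch') };
my $error = $@;

if (not defined $found and $error =~ /could not find data/) {
    print "lookup failed as expected: $error";
}
```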

retrieve_pdls: requested many ndarrays... in scalar context?

This is just a warning. You requested multiple ndarrays (sent multiple names) but you called the function in scalar context. Why do such a thing?

LIMITATIONS

You cannot share memory mapped files that require features of PDL::IO::FlexRaw. That is a cool module that lets you pack multiple ndarrays into a single file, but simple cross-thread sharing is not trivial and is not (yet) supported.

If you are dealing with a physical ndarray, you have to be a bit careful about how the memory gets freed. If you don't call free_pdls on the data, it will persist in memory until the end of the originating thread, which means you have a classic memory leak. If another thread creates a thread-local copy of the data before the originating thread ends, but then tries to access the data after the originating thread ends, this will be fine as the reference count of the datasv will have been increased.

BUGS

None known at this point.

SEE ALSO

PDL::ParallelCPU, MPI, PDL::Parallel::MPI, OpenCL, threads, threads::shared

AUTHOR, COPYRIGHT, LICENSE

This module was written by David Mertens. The documentation is copyright (C) David Mertens, 2012. The source code is copyright (C) Northwestern University, 2012. All rights reserved.

This module is distributed under the same terms as Perl itself.

DISCLAIMER OF WARRANTY

Parallel computing is hard to get right, and it can be exacerbated by errors in the underlying software. Please do not use this software in anything that is mission-critical unless you have tested and verified it yourself. I cannot guarantee that it will perform perfectly under all loads. I hope this is useful and I wish you well in your usage thereof, but BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.