NAME

PDL::Parallel::threads - sharing PDL data between Perl threads

VERSION

This documentation describes version 0.02 of PDL::Parallel::threads.

SYNOPSIS

use PDL;
use PDL::Parallel::threads qw(retrieve_pdls share_pdls);

# Technically, this is pulled in for you by PDL::Parallel::threads,
# but using it in your code pulls in the named functions like async.
use threads;

# Also, technically, you can use PDL::Parallel::threads with
# single-threaded programs.

# Create some shared PDL data
zeroes(1_000_000)->share_as('My::shared::data');

# Create a piddle and share its data
my $test_data = sequence(100);
share_pdls(some_name => $test_data);  # allows multiple at a time
$test_data->share_as('some_name');    # or use the PDL method

# Or work with memory mapped files:
share_pdls(other_name => 'mapped_file.dat');

# Kick off some processing in the background
async {
    my ($shallow_copy, $mapped_piddle)
        = retrieve_pdls('some_name', 'other_name');
    
    # thread-local memory
    my $other_piddle = sequence(20);
    
    # Modify the shared data:
    $shallow_copy++;
};

# ... do some other stuff ...

# Rejoin all threads
for my $thr (threads->list) {
    $thr->join;
}

use PDL::NiceSlice;
print "First ten elements of test_data are ",
    $test_data(0:9), "\n";

DESCRIPTION

This module provides a means to share PDL data between different Perl threads. In contrast to PDL's POSIX thread support (see PDL::Parallel::CPU or, for older versions of PDL, PDL::ParallelCPU), this module lets you work with Perl's built-in threading model. In contrast to Perl's threads::shared, this module focuses on sharing data, not variables.

Because this module focuses on sharing data, not variables, it does not use attributes to mark shared variables. Instead, you must explicitly share your data by using the "share_pdls" function or the "share_as" PDL method that this module introduces. Both associate a name with your data, which you use in other threads to retrieve the data with the "retrieve_pdls" function. Once your thread has access to the piddle data, any modifications will operate directly on the shared memory, which is exactly what shared data is supposed to do. When you are completely done using a piece of data, you need to explicitly remove the data from the shared pool with the "free_pdls" function. Otherwise your data will continue to consume memory until the originating thread terminates, or put differently, you will have a memory leak.

This module lets you share two sorts of piddle data. You can share data for a piddle that is based on actual physical memory, such as the result of "zeroes" in PDL::Core. You can also share data using memory mapped files. (Note: PDL v2.4.11 and higher support memory mapped piddles on all major platforms, including Windows.) There are other sorts of piddles whose data you cannot share. You cannot directly share slices (though a simple "sever" in PDL::Core or "copy" in PDL::Core will give you a piddle based on physical memory that you can share). Also, certain functions wrap external data into piddles so you can manipulate them with PDL methods. For example, see "plmap" in PDL::Graphics::PLplot and "plmeridians" in PDL::Graphics::PLplot. These you cannot share directly, but making a physical copy with "copy" in PDL::Core will give you something that you can safely share.
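As a minimal sketch of the slice workaround described above (the variable names and the shared name 'middle' are made up for illustration, and the behavior shown is what this documentation describes rather than something independently guaranteed):

```perl
use strict;
use warnings;
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls free_pdls);

my $full  = sequence(10);
my $slice = $full->slice('2:5');   # a slice: cannot be shared directly

# copy gives a physical piddle that no longer flows back to $full
my $shareable = $slice->copy;
$shareable->share_as('middle');

{
    # Any thread (including this one) can now reach the shared data
    my $view = retrieve_pdls('middle');
    $view += 100;                  # modifies the shared memory, not $full
}

# Remove the data from the shared pool once you are done with it
free_pdls('middle');
```

Because the copy is independent of the parent piddle, changes made through the shared name do not propagate back to $full. If you need them there, you have to copy them back yourself.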

Physical Memory

The mechanism by which this module achieves data sharing of physical memory is remarkably cheap. It's even cheaper than a simple affine transformation. The sharing works by creating a new shell of a piddle for each call to "retrieve_pdls" and setting that piddle's memory structure to point back to the same locations of the original (shared) piddle. This means that you can share piddles that are created with standard constructors like "zeroes" in PDL::Core, "pdl" in PDL::Core, and "sequence" in PDL::Basic, or which are the result of operations and function evaluations for which there is no data flow, such as "cat" in PDL::Core (but not "dog" in PDL::Core), arithmetic, "copy" in PDL::Core, and "sever" in PDL::Core. When in doubt, sever your piddle before sharing and everything should work.

There is an important nuance to sharing physical memory: The memory will always be freed when the originating thread terminates, even if it terminated cleanly. This can lead to segmentation faults when one thread exits and frees its memory before another thread has had a chance to finish calculations on the shared data. It is best to use barrier synchronization to avoid this (via PDL::Parallel::threads::SIMD), or to share data solely from your main thread.
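One way to follow the advice above is to share from the main thread and join every worker before freeing the data. The following is only a sketch under that assumption; the shared name 'workspace', the worker count, and the per-element writes are illustrative choices, not something the module requires:

```perl
use strict;
use warnings;
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls free_pdls);
use threads;

# Share from the main thread, which outlives every worker
zeroes(4)->share_as('workspace');

my @workers = map {
    my $id = $_;
    async {
        my $ws = retrieve_pdls('workspace');
        # Each worker writes only to its own element
        $ws->slice("$id:$id") .= $id + 1;
    };
} 0 .. 3;

# Wait for every worker before releasing the shared memory
$_->join for @workers;

my @values = retrieve_pdls('workspace')->list;
print "workspace holds: @values\n";

free_pdls('workspace');
```

Since the originating thread here is the main thread, the memory cannot disappear while a worker is still computing, and the explicit free_pdls call after the joins avoids the leak.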

Memory Mapped Data

The mechanism by which this module achieves data sharing of memory mapped files is exactly how you would share data across threads or processes using PDL::IO::FastRaw. However, there are a couple of important caveats to using memory mapped piddles with PDL::Parallel::threads. First, you must load PDL::Parallel::threads before loading PDL::IO::FastRaw:

# Good
use PDL::Parallel::threads qw(retrieve_pdls);
use PDL::IO::FastRaw;

# BAD
use PDL::IO::FastRaw;
use PDL::Parallel::threads qw(retrieve_pdls);

This is necessary because PDL::Parallel::threads has to perform a few internal tweaks to PDL::IO::FastRaw before you load its functions into your local package.

Furthermore, any memory mapped files must have header files associated with the data file. That is, if the data file is foo.dat, you must have a header file called foo.dat.hdr. This is overly restrictive and in the future the module may perform more internal tweaks to PDL::IO::FastRaw to store whatever options were used to create the original piddle. But for the meantime, be sure that you have a header file for your raw data file.
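A simple way to end up with a matching header file is to create the data with writefraw from PDL::IO::FastRaw, which writes both the raw file and its .hdr companion. The sketch below assumes a writable current directory and uses the made-up file name foo.dat and shared name 'mapped':

```perl
use strict;
use warnings;
use PDL;
# Load order matters: PDL::Parallel::threads first, then FastRaw
use PDL::Parallel::threads qw(share_pdls retrieve_pdls);
use PDL::IO::FastRaw;

# writefraw writes the raw data to foo.dat and the header to foo.dat.hdr
writefraw(sequence(100), 'foo.dat');

# Share by file name; retrieving gives a piddle mapped to that file
share_pdls(mapped => 'foo.dat');
my $mapped = retrieve_pdls('mapped');
print "Element 42 is ", $mapped->at(42), "\n";
```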

There is much less nuance to sharing memory mapped data across threads compared to directly sharing physical memory as discussed above. When you ask for a thread-local copy of that file, you get your very own fully baked memory-mapped piddle that gets freed when the piddle goes out of scope. This means you cannot get memory leaks. Furthermore, the data underlying the piddle come from a file and not from a shared space in RAM. That means there is no "originating thread", and you cannot trigger a segmentation fault by trying to access memory that has disappeared, because... there's nothing that can disappear.

    You may ask yourself why loading this module must come before loading the FastRaw module. The reason is that PDL::IO::FastRaw exports a few functions to your namespace, and PDL::Parallel::threads modifies one of those exported functions. If you pull in FastRaw before this module, this module won't have been able to work its magic on FastRaw first, and the functions in your package won't be the ones needed for proper sharing of memory mapped data. Put differently, the earlier you can manage to use PDL::Parallel::threads, the better.

Package and Name Munging

PDL::Parallel::threads lets you associate your data with a specific text name. Put differently, it provides a global namespace for data. Users of the C programming language will immediately notice that this means there is plenty of room for developers using this module to choose the same name for their data. Without some combination of discipline and help, it would be easy for shared memory names to clash. One solution to this would be to require users (i.e. you) to choose names that include their current package, such as My-Module-workspace or, following perlpragma, My::Module/workspace instead of just workspace. This is sometimes called name mangling. Well, I decided that this is such a good idea that PDL::Parallel::threads does the second form of name mangling for you automatically! Of course, you can opt out, if you wish.

The basic rules are that the package name is prepended to the name of the shared memory as long as the name is only composed of word characters, i.e. names matching /^\w+$/. Here's an example demonstrating how this works:

package Some::Package;
use PDL;
use PDL::Parallel::threads 'retrieve_pdls';

# Stored under '??foo'
sequence(20)->share_as('??foo');

# Shared as 'Some::Package/foo'
zeroes(100)->share_as('foo');

sub do_something {
  # Retrieve 'Some::Package/foo'
  my $copy_of_foo = retrieve_pdls('foo');
  
  # Retrieve '??foo':
  my $copy_of_weird_foo = retrieve_pdls('??foo');
  
  # ...
}

# Move to a different package:
package Other::Package;
use PDL::Parallel::threads 'retrieve_pdls';

sub something_else {
  # Retrieve 'Some::Package/foo'
  my $copy_of_foo = retrieve_pdls('Some::Package/foo');
  
  # Retrieve '??foo':
  my $copy_of_weird_foo = retrieve_pdls('??foo');
  
  # ...
}

The upshot of all of this is that if you use some module that also uses PDL::Parallel::threads, namespace clashes are highly unlikely to occur as long as you (and the author of that other module) use simple names, like the sort of thing that works for variable names.
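The mangling rule itself is small enough to sketch in plain Perl. The helper below is hypothetical (it is not part of this module's API, and the module's actual implementation may differ); it only mimics the rule as stated above:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDL::Parallel::threads: mimics the
# documented rule for turning a user-supplied name into a stored name
sub mangled_name {
    my ($name, $package) = @_;
    # Simple names (word characters only) get the package prepended
    return "$package/$name" if $name =~ /^\w+$/;
    # Anything else is used verbatim
    return $name;
}

print mangled_name('foo',               'Some::Package'),  "\n";  # Some::Package/foo
print mangled_name('??foo',             'Some::Package'),  "\n";  # ??foo
print mangled_name('Some::Package/foo', 'Other::Package'), "\n";  # Some::Package/foo
```

The last call shows why a fully qualified name works from any package: the colons and slash keep it from matching /^\w+$/, so it is left alone.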

FUNCTIONS

This module provides three stand-alone functions and adds one new PDL method.

share_pdls

Shares piddle data across threads using the given names.

share_pdls (name => piddle|filename, name => piddle|filename, ...)

This function takes key/value pairs where the value is the piddle to store or the file name to memory map, and the key is the name under which to store the piddle or file name. You can later retrieve the memory (or a piddle mapped to the given file name) with the "retrieve_pdls" function.

Sharing a piddle with physical memory increments the data's reference count; you can decrement the reference count by calling "free_pdls" on the given name. In general this ends up doing what you mean, and freeing memory only when you are really done using it. Memory mapped data does not need to worry about reference counting as there is always a persistent copy on disk.

my $data1 = zeroes(20);
my $data2 = ones(30);
share_pdls(foo => $data1, bar => $data2);

This can be combined with constructors and fat commas to allocate a collection of shared memory that you may need to use for your algorithm:

share_pdls(
    main_data => zeroes(1000, 1000),
    workspace => zeroes(1000),
    reduction => zeroes(100),
);

share_pdls does not pay attention to bad values. There is no technical reason for this: it simply hadn't occurred to me until I had to write the bad-data documentation. Expect it to happen in a forthcoming release. :-)

share_as

Method to share a piddle's data across threads under the given name.

piddle->share_as(name)

This PDL method lets you directly share a piddle. It does the exact same thing as "share_pdls", but its invocation is a little different:

# Directly share some constructed memory
sequence(20)->share_as('baz');

# Share individual piddles:
my $data1 = zeroes(20);
my $data2 = ones(30);
$data1->share_as('foo');
$data2->share_as('bar');

Like many other PDL methods, this method returns the just-shared piddle. This can lead to some amusing ways of storing partial calculations partway through a long chain:

my $results = $input->sumover->share_as('pre_offset') + $offset;

# Now you can get the result of the sumover operation
# before that offset was added, by calling:
my $pre_offset = retrieve_pdls('pre_offset');

This method achieves the same end as "share_pdls". There's More Than One Way To Do It, and this way can make for easier-to-read code. In general I recommend using the share_as method when you only need to share a single piddle memory space.

share_as does not pay attention to bad values. There is no technical reason for this: it simply hadn't occurred to me until I had to write the bad-data documentation. Expect it to happen in a forthcoming release. :-)

retrieve_pdls

Obtain piddles providing access to the data shared under the given names.

my ($copy1, $copy2, ...) = retrieve_pdls (name, name, ...)

This function takes a list of names and returns a list of piddles that provide access to the data shared under those names. In scalar context the function returns the piddle corresponding to the first named data set, which is usually what you mean when you use a single name. If you specify multiple names but call it in scalar context, you will get a warning indicating that you probably meant to say something different.

my $local_copy = retrieve_pdls('foo');
my @both_piddles = retrieve_pdls('foo', 'bar');
my ($foo, $bar) = retrieve_pdls('foo', 'bar');

retrieve_pdls does not pay attention to bad values. There is no technical reason for this: it simply hadn't occurred to me until I had to write the bad-data documentation. Expect it to happen in a forthcoming release. :-)

free_pdls

Frees the shared memory (if any) associated with the named shared data.

free_pdls(name, name, ...)

This function marks the memory associated with the given names as no longer being shared, handling all reference counting and other low-level stuff. You generally won't need to worry about the return value. But if you care, you get a list of values---one for each name---where a successful removal gets the name and an unsuccessful removal gets an empty string.

So, if you say free_pdls('name1', 'name2') and both removals were successful, you will get ('name1', 'name2') as the return values. If there was trouble removing name1 (because there is no memory associated with that name), you will get ('', 'name2') instead. This means you can handle trouble with perl greps and other conditionals:

my @to_remove = qw(name1 name2 name3 name4);
my @results = free_pdls(@to_remove);
if (not grep {$_ eq 'name2'} @results) {
    print "That's weird; did you remove name2 already?\n";
}
if (not $results[2]) {
    print "Couldn't remove name3 for some reason\n";
}
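Because the results come back in the same order as the names you passed, you can also recover the failed names by position. The helper below is hypothetical, and the hard-coded @results list merely stands in for what a real call to free_pdls might return:

```perl
use strict;
use warnings;

# Hypothetical helper: given the names passed to free_pdls and the list
# it returned, report the names whose removal failed
sub failed_removals {
    my ($names, $results) = @_;
    return map  { $names->[$_] }
           grep { $results->[$_] eq '' } 0 .. $#$names;
}

my @to_remove = qw(name1 name2 name3 name4);
# Stand-in for: my @results = free_pdls(@to_remove);
my @results   = ('name1', '', 'name3', '');
my @failed    = failed_removals(\@to_remove, \@results);
print "Could not remove: @failed\n";   # Could not remove: name2 name4
```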

This function simply removes a piddle's memory from the shared pool. It does not interact with bad values in any way. But then again, it does not interfere with or screw up bad values, either.

DIAGNOSTICS

share_pdls: expected key/value pairs

You called share_pdls with an odd number of arguments, which means that you could not have supplied key/value pairs. Double-check that every piddle (or filename) that you supply is preceded by its shared name.

share_pdls: you already have data associated with '$name'

You tried to share some data under $name, but some data is already associated with that name. Typo? You can avoid namespace clashes with other modules by using simple names and letting PDL::Parallel::threads mangle the name internally for you.

share_pdls: Could not share a piddle under name '$name' because ...
... the piddle is a slice.

You tried to share a slice, which is not allowed. Try severing or copying your slice, then share it.

... the piddle does not have any allocated memory (but is not a slice?).

You tried to share a piddle that does not have any memory associated with it. I'm actually not sure how you can do this, so if you managed to create such a piddle, you probably already know what's going on. :-)

... the piddle has no datasv, which means it's probably a special piddle.

You tried to share a piddle that has no datasv. This usually happens when you try to wrap a piddle around some externally provided data. It may also happen when you've managed to get data from PDL::IO::FastRaw and you've used the wrong loading order (see "Memory Mapped Data"), or perhaps when you try to share data that you've mapped using PDL::IO::FlexRaw.

... the piddle's data does not come from the datasv.

You tried to share a piddle that has a funny internal structure, in which the data does not point to the buffer portion of the datasv. I'm not sure how that could happen without triggering a more specific error, so I hope you know what's going on if you get this. :-)

When share_pdls gets a scalar, it expects that to be a file to share as memory mapped data. For key '$name', '$to_store' was given, but ...
... there is no associated header file

The header file must have the name "$to_store.hdr". If it doesn't, this module won't be able to map the file.

... you do not have permissions to read the associated header file

There seems to be a permissions issue and this module cannot open the header file associated with your mapped data. Check the permissions?

... you do not have write permissions for that file

Yes, ostensibly you can work with a memory mapped file that is read only, but that's complicated and I didn't want to have to figure out how to mark your shared piddle as read-only. Patches welcome!

... the file does not exist

The file to memory map doesn't exist. Typo, perhaps?

share_pdls passed data under '$name' that it doesn't know how to store

share_pdls only knows how to store memory mapped files and raw data piddles. It'll croak if you try to share other kinds of piddles, and it'll throw this error if you try to share anything else, like a hashref.

retrieve_pdls: '$name' was created in a thread that has ended or is detached

In some other thread, you added some data to the shared pool. If that thread ended without you freeing that data (or the thread has become a detached thread), then we cannot know if the data is available. You should always free your data from the data pool when you're done with it, to avoid this error.

retrieve_pdls could not find data associated with '$name'

Pretty simple: either data has never been added under this name, or data under this name has been removed.

retrieve_pdls: requested many piddles... in scalar context?

This is just a warning. You requested multiple piddles (sent multiple names) but you called the function in scalar context. Why do such a thing?

LIMITATIONS

I have tried to make it clear, but in case you missed it, this module does not let you share slices or specially marked piddles. If you need to share a slice, you should sever or copy the slice first.

Another limitation is that you cannot share memory mapped files that require features of PDL::IO::FlexRaw. That is a cool module that lets you pack multiple piddles into a single file, but simple cross-thread sharing is not trivial and is not (yet) supported.

If you are dealing with a physical piddle (i.e. not memory mapped), you have to be a bit careful about how the memory gets freed. If you don't call free_pdls on the data, it will persist in memory until the end of the originating thread, which means you have a classic memory leak. On the other hand, if another thread creates a thread-local copy of the data before the originating thread ends, but then tries to access the data after the originating thread ends, you will get a segmentation fault.

Finally, you must load PDL::Parallel::threads before loading PDL::IO::FastRaw if you wish to share your memory mapped piddles. Also, you must have a .hdr file for your data file, which is not strictly necessary when using mapfraw. Hopefully that limitation will be lifted in forthcoming releases of this module.

BUGS

None known at this point.

SEE ALSO

PDL::Parallel::CPU, MPI, PDL::Parallel::MPI, OpenCL, threads, threads::shared

AUTHOR, COPYRIGHT, LICENSE

This module was written by David Mertens. The documentation is copyright (C) David Mertens, 2012. The source code is copyright (C) Northwestern University, 2012. All rights reserved.

This module is distributed under the same terms as Perl itself.

DISCLAIMER OF WARRANTY

Parallel computing is hard to get right, and it can be exacerbated by errors in the underlying software. Please do not use this software in anything that is mission-critical unless you have tested and verified it yourself. I cannot guarantee that it will perform perfectly under all loads. I hope this is useful and I wish you well in your usage thereof, but BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.