NAME

PDL::Graphics::Prima::DataSet - the way we think about data

SYNOPSIS

-distribution => ds::Set(
    $data, plotType => pset::CDF
),
-lines => ds::Pair(
    $x, $y, plotType => [ppair::Lines, ppair::Diamonds]
),
-contour => ds::Grid(
    $matrix, bounds => [$left, $bottom, $right, $top],
             y_edges => $ys, x_bounds => [$left, $right],
             x_edges => $xs, y_bounds => [$bottom, $top],
             plotType => pgrid::Matrix(palette => $palette),
),
-image => ds::Image(
    $image, format => 'string',
            ... ds::Grid bounder options ...
            plotType => pimage::Basic,
),
-function => ds::Func(
    $func_ref, xmin => $left, xmax => $right, N_points => 200,
),
-func_grid => ds::FGrid(
    $matrix, ... same as for ds::Grid ...
             N_points => 200,
             N_points => [200, 300],
),

DESCRIPTION

PDL::Graphics::Prima differentiates between a few kinds of data: Sets, Pair collections, and Grids. A Set is an unordered collection of data, such as the heights of a class of students. A Pair collection is a collection of x/y pairs, such as a time series. A Grid is, well, a matrix, such as the pixel colors of a photograph.

working here - this needs to be cleaned up!

In addition, there are two derived kinds of datasets for when you wish to specify a function instead of a raw set of data. For example, to plot an analytic function, you could use a Function instead of Pairs. This has the advantage that if you zoom in on the function, the curve is recalculated and looks smooth instead of jagged. Similarly, if you can describe a surface by a function, you can plot that function using a function grid, i.e. FGrid.

Once upon a time, this made sense, but it needs to be revised:

At the moment there are two kinds of datasets. The piddle-based datasets have
piddles for their x- and y-data. The function-based datasets create their
x-values on the fly and evaluate their y-values using the supplied function
reference. The x-values are generated using the C<sample_evenly> function which
belongs to the x-axis's scaling object/class. As such, any scaling class needs
to implement a C<sample_evenly> function to support function-based datasets.

Base Class

The Dataset base class provides a few methods that work for all datasets. These include accessing the associated widget and drawing the data.

widget

The widget associated with the dataset.

draw

Calls all of the drawing functions for each plotType of the dataset. This also applies all the global drawing options (like color, for example) that were supplied to the dataset.

compute_collated_min_max_for

This function is part of the collated min/max chain of function calls that leads to reasonable autoscaling. The Plot widget asks each dataSet to determine its collated min and max by calling this function, and this function simply aggregates the collated min and max results from each of its plotTypes.

In general, you needn't worry about collated min and max calculations unless you are trying to grapple with the autoscaling system or are creating a new plotType.

working here - link to more complete documentation of the collation and autoscaling systems.

new

This is the universal constructor that is called by the short-name constructors introduced below. It handles the uniform packaging of plotTypes (for example, allowing the user to say plotType => ppair::Diamonds instead of the more verbose plotTypes => [ppair::Diamonds]). In general, you (the user) will not need to invoke this constructor directly.
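
The packaging step can be illustrated with a minimal plain-Perl sketch. The helper name normalize_plot_types is hypothetical and is not part of PDL::Graphics::Prima, strings stand in for real plot type objects, and rejecting both keys at once is a choice made for this sketch rather than documented behavior:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the library: folds a singular plotType
# key into the plotTypes key and guarantees an array reference.
sub normalize_plot_types {
    my %args = @_;

    # Accept the singular spelling as an alias for the plural one.
    # (Refusing both keys together is an assumption of this sketch.)
    if (exists $args{plotType}) {
        die "Specify plotType or plotTypes, not both\n"
            if exists $args{plotTypes};
        $args{plotTypes} = delete $args{plotType};
    }

    # Wrap a lone plot type in an array reference
    if (exists $args{plotTypes} and ref($args{plotTypes}) ne 'ARRAY') {
        $args{plotTypes} = [ $args{plotTypes} ];
    }

    return %args;
}

# Strings stand in for real plot type objects such as ppair::Diamonds
my %single = normalize_plot_types(plotType  => 'diamonds');
my %many   = normalize_plot_types(plotTypes => ['lines', 'diamonds']);
```

After the call, both hashes carry their plot types as an array reference under the plotTypes key, which is the form that later initialization code can rely on.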

check_plot_types

Checks that the plotType(s) passed to the constructor or added at runtime are built on the data type that we expect. Derived classes must specify their plotType_base_class key before calling this function.

init

Called by new to initialize the dataset. This function is called on the new dataset just before it is returned from new.

If you create a new dataSet, you should provide an init function that performs the following:

supply a default plotType

If the user supplied something to either the plotType or plotTypes keys, then new guarantees that you will already have that something in an array reference in $self->{plotTypes}. However, if they did not supply either key, you should supply a default. You should have something that looks like this:

$self->{plotTypes} = [pset::CDF] unless exists $self->{plotTypes};

check the plot types

After supplying a default plot type, you should check that the provided plot types are derived from the acceptable base plot type class. You would do this with code like this:

$self->check_plot_types(@{$self->{plotTypes}});

This is your last step to validate or pre-calculate anything. For example, you must provide functions to return your data, and you should probably make guarantees about the kinds of data that such accessors return, such as the data always being a piddle. If that is the case, then it might not be a bad idea to say in your init function something like this:

$self->{data} = PDL::Core::topdl($self->{data});
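
To show how these steps fit together, here is a minimal sketch in plain Perl. Every package name in it (My::PlotType::Set, My::PlotType::Set::CDF, My::DataSet::Set) is a hypothetical stand-in rather than a real library class, and the data check uses an array reference where the real module would convert the data to a piddle:

```perl
use strict;
use warnings;
use Scalar::Util ();

# Hypothetical stand-ins for a plot type base class and one plot type
package My::PlotType::Set;
sub new { my $class = shift; return bless {@_}, $class }

package My::PlotType::Set::CDF;
our @ISA = ('My::PlotType::Set');

# Hypothetical dataset class whose init follows the steps described above
package My::DataSet::Set;

sub new {
    my ($class, $data, %args) = @_;
    my $self = bless { data => $data, %args }, $class;
    $self->init;    # called just before the object is returned
    return $self;
}

sub check_plot_types {
    my ($self, @plot_types) = @_;
    for my $pt (@plot_types) {
        die "Plot type must be derived from My::PlotType::Set\n"
            unless Scalar::Util::blessed($pt)
               and $pt->isa('My::PlotType::Set');
    }
}

sub init {
    my $self = shift;

    # Supply a default plot type
    $self->{plotTypes} = [ My::PlotType::Set::CDF->new ]
        unless exists $self->{plotTypes};

    # Check the plot types
    $self->check_plot_types(@{ $self->{plotTypes} });

    # Validate the data (a stand-in for converting it to a piddle)
    die "data must be an array reference\n"
        unless ref($self->{data}) eq 'ARRAY';
}

package main;

my $ds = My::DataSet::Set->new([1, 2, 3]);
my $ok = eval { My::DataSet::Set->new([1, 2, 3], plotTypes => ['bogus']); 1 };
```

The first construction succeeds and picks up the default plot type; the second fails inside init because the string 'bogus' is not derived from the expected base class.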

Sets

Sets are unordered collections of sample data. The typical use case of set data is that you have a population of things and you want to analyze their aggregate properties. For example, you might be interested in the distribution of tree heights at your Christmas Tree Farm, or the distribution of your students' (or your classmates') test scores from the mid-term. Those collections of data are called Sets and PDL::Graphics::Prima provides a number of ways of visualizing sets, as discussed under "Sets" in PDL::Graphics::Prima::PlotType. Here, I discuss how to create and manipulate Set dataSet objects.

Note that the shape of pluralized properties (i.e. colors) should thread-match the shape of the data excluding the data's first dimension. That is, if I want to plot the cumulative distributions for three different batches using three different line colors, my data would have shape (N, 3) and my colors piddle would have shape (3).

ds::Set - short-name constructor
ds::Set($data, option => value, ...)

The short-name constructor to create sets. The data can be either a piddle of values or an array reference of values (which will be converted to a piddle during initialization).

expected_plot_class

Sets expect plot type objects that are derived from PDL::Graphics::Prima::PlotType::Set.

get_data

Returns the piddle containing the data. This is used mostly by the plotTypes to retrieve the data in order to display it. You can also use it to retrieve the data piddle if it makes your code more legible.

my $heights = load_height_data();
...
my $plot = $wDisplay->insert('Plot',
    -heights => ds::Set($heights),
    ...
);

# Retrieve and print the data:
print "heights are ", $plot->dataSets->{heights}->get_data, "\n";

A subtle point: notice that you can change the data within the piddle, and you can even change the piddle's shape, but you cannot use this to replace the piddle itself.

Pair

Pairwise datasets are collections of paired x/y data. A typical Pair dataset is the sort of thing you would visualize with an x/y plot: a time series such as the series of high temperatures for each day in a month or the x- and y-coordinates of a bug walking across your desk. PDL::Graphics::Prima provides many ways of visualizing Pair datasets, as discussed under "Pair" in PDL::Graphics::Prima::PlotType.

The dimensions of pluralized properties (i.e. colors) should thread-match the dimensions of the data. An important exception to this is ppair::Lines, in which case you must specify how you want properties to thread.

The default plot type is ppair::Diamonds.

ds::Pair - short-name constructor
ds::Pair($x_data, $y_data, option => value, ...)

The short-name constructor to create pairwise datasets. The x- and y-data can be either piddles or array references (which will be converted to a piddle during initialization).

expected_plot_class

Pair datasets expect plot type objects that are derived from PDL::Graphics::Prima::PlotType::Pair.

get_xs, get_ys, get_data

Returns piddles with the x, y, or x-y data. The last function returns two piddles in a list.

get_data_as_pixels

Uses the reals_to_pixels functions for the x- and y-axes to convert the values of the x- and y-data to actual pixel positions in the widget.

Grids

Grids are collections of data that is regularly ordered in two dimensions. Put differently, it is a structure in which the data is described by two indices. The analogous mathematical structure is a matrix and the analogous visual is an image. PDL::Graphics::Prima provides a few ways to visualize grids, as discussed under "Grids" in PDL::Graphics::Prima::PlotType. The default plot type is pgrid::Color.

This is the least well thought-out dataSet. As such, it may change in the future. All such changes will, hopefully, be backwards compatible.

At the moment, there is only one way to visualize grid data: pgrid::Matrix. Although I can conceive of a contour plot, it has yet to be implemented. As such, it is hard to specify the dimension requirements for dataset-wide properties. There are a few dataset-wide properties discussed in the constructor, however, so see them for some examples.

ds::Grid - short-name constructor
ds::Grid($matrix, option => value, ...)

The short-name constructor to create grids. The data should be a piddle of values or something which topdl can convert to a piddle (an array reference of array references).

The current cross-plot-type options include the bounds settings. You can either specify a bounds key or one key from each column:

x_bounds   y_bounds
x_centers  y_centers
x_edges    y_edges
bounds

The value associated with the bounds key is a four-element anonymous array:

bounds => [$left, $bottom, $right, $top]

The values can either be scalars or piddles that indicate the corners of the grid plotting area. If the latter, it is possible to thread over the bounds by having the shape of (say) $left thread-match the shape of your grid's data, excluding the first two dimensions. That is, if your $matrix has a shape of (20, 30, 4, 5), the piddle for $left can have shapes of (1), (4), (4, 1), (1, 5), or (4, 5).

At the moment, if you specify bounds, linear spacing from the min to the max is used. In the future, a new key may be introduced to allow you to specify the spacing as something besides linear.

x_bounds, y_bounds

The values associated with x_bounds and y_bounds are anonymous arrays with two elements containing the same sorts of data as the bounds array.

x_centers, y_centers

The value associated with x_centers (or y_centers) should be a piddle with increasing values of x (or y) that give the mid-points of the data. For example, if we have a matrix with shape (3, 4), x_centers would have 3 elements and y_centers would have 4 elements:

   -------------------
y3 | d03 | d13 | d23 |
   -------------------
y2 | d02 | d12 | d22 |
   -------------------
y1 | d01 | d11 | d21 |
   -------------------
y0 | d00 | d10 | d20 |
   -------------------
     x0    x1    x2

Some plot types may require the edges. In that case, if there is more than one point, the plot guesses the scaling of the spacing between points (choosing between logarithmic or linear) and appropriate bounds for the given scaling are calculated using interpolation and extrapolation. The plot will croak if there is only one point (in which case interpolation is not possible). If the spacing for your grid is neither linear nor logarithmic, you should explicitly specify the edges, as discussed next.

At the moment, the guess work assumes that all the scalings for a given Grid dataset are either linear or logarithmic, even though it's possible to mix the scaling using threading. (It's hard to do that by accident, so if that last bit seems confusing, then you probably don't need to worry about tripping on it.) Also, I would like for the plot to croak if the scaling does not appear to be either linear or logarithmic, but that is not yet implemented.
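
As an illustration of the interpolation and extrapolation involved, the following plain-Perl sketch derives edges from linearly spaced centers. The function linear_edges_from_centers is hypothetical and works on ordinary Perl lists rather than piddles; it is not the library's implementation:

```perl
use strict;
use warnings;

# Hypothetical helper: estimate cell edges from linearly spaced centers.
# Interior edges are midpoints between neighboring centers; the two outer
# edges are extrapolated by half a step.
sub linear_edges_from_centers {
    my @centers = @_;
    die "Need at least two centers to interpolate\n" if @centers < 2;

    # Extrapolate the lower outer edge
    my @edges = ($centers[0] - ($centers[1] - $centers[0]) / 2);

    # Interpolate the interior edges
    for my $i (1 .. $#centers) {
        push @edges, ($centers[$i - 1] + $centers[$i]) / 2;
    }

    # Extrapolate the upper outer edge
    push @edges, $centers[-1] + ($centers[-1] - $centers[-2]) / 2;
    return @edges;
}

my @edges = linear_edges_from_centers(1, 2, 3);
print "@edges\n";    # prints "0.5 1.5 2.5 3.5"
```

Three centers yield four edges, matching the N + 1 relationship between centers and edges. A single center gives the function nothing to interpolate from, which is why the sketch dies in that case.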

x_edges, y_edges

The value associated with x_edges (or y_edges) should be a piddle with increasing values of x (or y) that give the boundary edges of data. For example, if we have a matrix with shape (3, 4), x_edges would have 3 + 1 = 4 elements and y_edges would have 4 + 1 = 5 elements:

y4 -------------------
   | d03 | d13 | d23 |
y3 -------------------
   | d02 | d12 | d22 |
y2 -------------------
   | d01 | d11 | d21 |
y1 -------------------
   | d00 | d10 | d20 |
y0 -------------------
   x0    x1    x2    x3

Some plot types may require the data centers. In that case, if there are only two edges, a linear interpolation is used. If there are more than two points, the plot will try to guess the spacing, choosing between linear and logarithmic, and use the appropriate interpolation.

The note above regarding guesswork for x_centers and y_centers applies here, also.
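
A minimal sketch of going from edges to centers under the two supported spacings, again in plain Perl on ordinary lists. The function centers_from_edges is hypothetical; it assumes the arithmetic mean for linear spacing and the geometric mean for logarithmic spacing, which may differ in detail from what the library does:

```perl
use strict;
use warnings;

# Hypothetical helper: compute cell centers from cell edges.
# Linear spacing uses the arithmetic mean of neighboring edges;
# logarithmic spacing uses their geometric mean.
sub centers_from_edges {
    my ($scaling, @edges) = @_;
    die "Need at least two edges\n" if @edges < 2;

    my @centers;
    for my $i (0 .. $#edges - 1) {
        my ($lo, $hi) = @edges[$i, $i + 1];
        push @centers, $scaling eq 'log'
            ? sqrt($lo * $hi)
            : ($lo + $hi) / 2;
    }
    return @centers;
}

my @linear = centers_from_edges(linear => 0, 2, 4);        # (1, 3)
my @log    = centers_from_edges(log    => 1, 100, 10000);  # (10, 1000)
```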

expected_plot_class

Grids expect plot type objects that are derived from PDL::Graphics::Prima::PlotType::Grid.

get_data

Returns the piddle containing the data.

guess_scaling_for

Takes a piddle and tries to guess the scaling from the spacing. Returns a string indicating the scaling, either "linear" or "log", as well as the spacing term.
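
One way such a guess could work is sketched below in plain Perl. This is not the library's algorithm: the names guess_scaling and relative_spread are hypothetical, the input is an ordinary list rather than a piddle, and the sketch assumes that the "spacing term" is the common step for linear data and the common ratio for logarithmic data:

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Relative spread of a list: zero when all entries are identical
sub relative_spread {
    my @v = @_;
    my $mean = sum(@v) / @v;
    return (max(@v) - min(@v)) / abs($mean);
}

# Hypothetical guesser: compares how constant the differences are with
# how constant the ratios are, and reports the more regular of the two.
sub guess_scaling {
    my @values = @_;
    die "Need at least two values\n" if @values < 2;

    my @steps = map { $values[$_] - $values[$_ - 1] } 1 .. $#values;
    die "Values must be strictly increasing\n" if min(@steps) <= 0;

    # Logarithmic spacing only makes sense for positive values
    if ($values[0] > 0) {
        my @ratios = map { $values[$_] / $values[$_ - 1] } 1 .. $#values;
        return ('log', $ratios[0])
            if relative_spread(@ratios) < relative_spread(@steps);
    }
    return ('linear', $steps[0]);
}

my ($lin_type, $lin_step)  = guess_scaling(0, 5, 10, 15);      # 'linear', 5
my ($log_type, $log_ratio) = guess_scaling(1, 10, 100, 1000);  # 'log', 10
```

With only two values both spreads are zero, so the sketch falls back to linear.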

working here - clarify that last bit with an example

Image

Color formats are case insensitive; default is rgb

Func

PDL::Graphics::Prima provides a special pair dataset that takes a function reference instead of a set of data. The function should take a piddle of x-values as input and compute and return the y-values. You can specify the number of data points by supplying

N_points => value

in the list of key-value pairs that initialize the dataset. Most of the functionality is inherited from PDL::Graphics::Prima::DataSet::Pair, but there are a few exceptions.

ds::Func - short-name constructor
ds::Func($subroutine, option => value, ...)

The short-name constructor to create function datasets. The subroutine must be a reference to a subroutine, or an anonymous sub. For example,

# Reference to a subroutine,
# PDL's exponential function:
ds::Func (\&PDL::exp)

# Using an anonymous subroutine:
ds::Func ( sub {
    my $xs = shift;
    return $xs->exp;
})

get_xs, get_ys

These functions override the default Pair behavior by generating the x-data and using that to compute the y-data. The x-data is uniformly sampled according to the x-axis scaling.
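
The idea can be sketched in plain Perl for the linear case. The helper sample_evenly_linear is a hypothetical stand-in for what a linear scaling's sampling might do, and the function here receives one scalar at a time, whereas a real ds::Func subroutine receives a piddle of x-values:

```perl
use strict;
use warnings;

# Hypothetical stand-in for evenly sampling a linear axis: returns
# $n values from $min to $max inclusive.
sub sample_evenly_linear {
    my ($min, $max, $n) = @_;
    die "Need at least two points\n" if $n < 2;
    my $step = ($max - $min) / ($n - 1);
    return map { $min + $_ * $step } 0 .. $n - 1;
}

# The user-supplied function; scalar-valued here for simplicity
my $func = sub { my $x = shift; return $x ** 2 };

# Generate the x-values from the current axis limits, then the y-values
my @xs = sample_evenly_linear(0, 4, 5);    # (0, 1, 2, 3, 4)
my @ys = map { $func->($_) } @xs;          # (0, 1, 4, 9, 16)
```

Because the x-values are regenerated from the axis limits each time, zooming in produces a fresh set of samples over the narrower range, which is what keeps the curve smooth.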

compute_collated_min_max_for

This function is supposed to provide information for autoscaling. This is a sensible thing to do for the y-values of functions, but it makes no sense for the x-values since these are taken from the x-axis min and max already.

This could be smarter, methinks, so please give me your ideas if you have them. :-)

DataSet::Collection

The dataset collection is the thing that actually holds the datasets in the plot widget object. The Collection is a tied hash, so you access all of its data members as if they were hash elements. However, it does some double-checking for you behind the scenes to make sure that whenever you add a dataset to the Collection, that you added a real DataSet object and not some arbitrary thing.
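
The general technique of validating values through a tied hash can be sketched with Perl's core Tie::StdHash. The packages My::DataSet and My::Collection are hypothetical stand-ins and do not reproduce the actual Collection class:

```perl
use strict;
use warnings;
use Scalar::Util ();

# Hypothetical stand-in for the DataSet base class
package My::DataSet;
sub new { return bless {}, shift }

# Hypothetical tied-hash class that only accepts My::DataSet objects
package My::Collection;
require Tie::Hash;
our @ISA = ('Tie::StdHash');

sub STORE {
    my ($self, $name, $dataset) = @_;
    die "Value for '$name' is not a DataSet object\n"
        unless Scalar::Util::blessed($dataset)
           and $dataset->isa('My::DataSet');
    $self->{$name} = $dataset;
}

package main;

tie my %datasets, 'My::Collection';

# Looks like an ordinary hash assignment, but goes through STORE
$datasets{heights} = My::DataSet->new;

# An arbitrary value is rejected
my $accepted = eval { $datasets{bogus} = [1, 2, 3]; 1 };
```

Code that uses the hash sees normal hash syntax, while the check in STORE ensures that only dataset objects ever end up inside it.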

working here - this needs to be clarified

RESPONSIBILITIES

The datasets and the dataset collection have a number of responsibilities, and a number of things for which they are not responsible.

The dataset container is responsible for:

knowing the plot widget

The container always maintains knowledge of the plot widget to which it belongs. Put a bit differently, a dataset container cannot belong to multiple plot widgets (at least, not at the moment).

informing datasets of their container and plot widget

When a dataset is added to a dataset collection, the collection is responsible for informing the dataset of the plot object and the dataset collection to which the dataset belongs.

Datasets themselves are responsible for:

knowing and managing the plotTypes

The datasets are responsible for maintaining the list of plotTypes that are to be applied to their data.

knowing per-dataset properties

Drawing properties can be specified on a per-dataset scope. The dataset is responsible for maintaining a list of these properties and providing them to the plot types when they perform drawing operations.

knowing the dataset container and the plot widget

All datasets know the dataset container and the plot widget to which they belong. Although they could retrieve the widget through a method on the container, the

informing plotTypes of their plot widget

The plot types all know the widget (and dataset) to which they belong, and it is the

managing the drawing operations of plotTypes

Although datasets themselves do not need to draw anything, they do call the drawing operations of the different plot types that they contain.

knowing and supplying the data

A key responsibility for the dataSets is holding the data that are drawn by the plot types. Although the plot types may hold specialized data, the dataset holds the actual data that underlies the plot types and provides a specific interface for the plot types to access that data.

On the other hand, datasets are not responsible for knowing or doing any of the following:

knowing axes

The plot object is responsible for knowing the x- and y-axis objects. However, if the axis system is changed to allow for multiple x- and y-axes, then this burden will shift to the dataset as it will need to know which axis to use when performing data <-> pixel conversions.

TODO

Add optional bounds to function-based DataSets.

Capitalization for plotType, etc.

Use PDL documentation conventions for signatures, ref, etc.

Additional dataset, a two-tone grid. Imagine that you want to overlay the population density of a country and the average rainfall (at the granularity of counties, let's say). You could use the intensity of the red channel to indicate population and the intensity of blue to indicate rainfall. Highly populated areas with low rainfall would be bright red, while highly populated areas with high rainfall would be purple, and sparsely populated areas with high rainfall would be blue. The color scale would be indicated with a square with a color gradient (rather than a horizontal or vertical bar with a color gradient, as in a normal ColorGrid). Anyway, this differs from a normal grid dataset because it would require two datasets, one for each tone.

AUTHOR

David Mertens (dcmertens.perl@gmail.com)

SEE ALSO

This is a component of PDL::Graphics::Prima. This library is composed of many modules, including:

PDL::Graphics::Prima

Defines the Plot widget for use in Prima applications

PDL::Graphics::Prima::Axis

Specifies the behavior of axes (but not the scaling)

PDL::Graphics::Prima::DataSet

Specifies the behavior of DataSets

PDL::Graphics::Prima::Internals

A dumping ground for my partial documentation of some of the more complicated stuff. It's not organized, so you probably shouldn't read it.

PDL::Graphics::Prima::Limits

Defines the lm:: namespace

PDL::Graphics::Prima::Palette

Specifies a collection of different color palettes

PDL::Graphics::Prima::PlotType

Defines the different ways to visualize your data

PDL::Graphics::Prima::Scaling

Specifies different kinds of scaling, including linear and logarithmic

PDL::Graphics::Prima::Simple

Defines a number of useful functions for generating simple and not-so-simple plots

LICENSE AND COPYRIGHT

Portions of this module's code are copyright (c) 2011 The Board of Trustees at the University of Illinois.

Portions of this module's code are copyright (c) 2011-2012 Northwestern University.

This module's documentation are copyright (c) 2011-2012 David Mertens.

All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.