NAME
PDL::Graphics::Prima::DataSet - the way we think about data
SYNOPSIS
 -distribution => ds::Set(
     $data, plotType => pset::CDF
 ),
 -lines => ds::Pair(
     $x, $y, plotType => [ppair::Lines, ppair::Diamonds]
 ),
 -contour => ds::Grid(
     $matrix,
     # Specify the bounds with the bounds key, or with one
     # x-key and one y-key:
     bounds => [$left, $bottom, $right, $top],
     y_edges => $ys, x_bounds => [$left, $right],
     x_edges => $xs, y_bounds => [$bottom, $top],
     plotType => pgrid::Matrix(palette => $palette),
 ),
 -image => ds::Image(
     $image, format => 'string',
     ... ds::Grid bounder options ...
     plotType => pimage::Basic,
 ),
 -function => ds::Func(
     $func_ref, xmin => $left, xmax => $right, N_points => 200,
 ),
 -func_grid => ds::FGrid(
     $matrix, ... same as for ds::Grid ...
     N_points => 200,
     # or
     N_points => [200, 300],
 ),
DESCRIPTION
PDL::Graphics::Prima differentiates between a few kinds of data: Sets, Pair collections, and Grids. A Set is an unordered collection of data, such as the heights of a class of students. A Pair collection is a collection of x/y pairs, such as a time series. A Grid is, well, a matrix, such as the pixel colors of a photograph.
working here - this needs to be cleaned up!
In addition, there are two derived kinds of datasets for when you wish to specify a function instead of a raw set of data. For example, to plot an analytic function, you could use a Function instead of Pairs. This has the advantage that if you zoom in on the function, the curve is recalculated and looks smooth instead of jagged. Similarly, if you can describe a surface by a function, you can plot that function using a function grid, i.e. FGrid.
Once upon a time, this made sense, but it needs to be revised:
At the moment there are two kinds of datasets. The piddle-based datasets have
piddles for their x- and y-data. The function-based datasets create their
x-values on the fly and evaluate their y-values using the supplied function
reference. The x-values are generated using the sample_evenly function which
belongs to the x-axis's scaling object/class. As such, any scaling class needs
to implement a sample_evenly function to support function-based datasets.
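To make the idea concrete, here is a minimal pure-Perl sketch of what evenly sampling for a linear and a logarithmic scaling could look like. It is not the library's implementation: the names sample_evenly_linear and sample_evenly_log and the (min, max, N) argument order are assumptions made for this example, and the real method works with piddles rather than Perl lists.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a linear scaling's sample_evenly:
# returns $n values evenly spaced between $min and $max, inclusive.
sub sample_evenly_linear {
    my ($min, $max, $n) = @_;
    die "need at least two points\n" if $n < 2;
    my $step = ($max - $min) / ($n - 1);
    return map { $min + $_ * $step } 0 .. $n - 1;
}

# Hypothetical stand-in for a logarithmic scaling: the samples are evenly
# spaced in log(x), so they look evenly spaced on a logarithmic axis.
sub sample_evenly_log {
    my ($min, $max, $n) = @_;
    die "log scaling needs positive bounds\n" if $min <= 0 or $max <= 0;
    return map { exp($_) } sample_evenly_linear(log($min), log($max), $n);
}

# A function-based dataset would then evaluate its function on those xs.
my $func = sub { map { $_ ** 2 } @_ };
my @xs = sample_evenly_linear(0, 1, 5);
my @ys = $func->(@xs);
print "xs: @xs\n";
print "ys: @ys\n";
```

Because the x-values are regenerated from the current axis limits, zooming in simply yields a denser sampling of the visible region.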
Base Class
The Dataset base class provides a few methods that work for all datasets. These include accessing the associated widget and drawing the data.
- widget
-
The widget associated with the dataset.
- draw
-
Calls all of the drawing functions for each plotType of the dataset. This also applies all the global drawing options (like color, for example) that were supplied to the dataset.
- compute_collated_min_max_for
-
This function is part of the collated min/max chain of function calls that leads to reasonable autoscaling. The Plot widget asks each dataSet to determine its collated min and max by calling this function, and this function simply aggregates the collated min and max results from each of its plotTypes.
In general, you needn't worry about collated min and max calculations unless you are trying to grapple with the autoscaling system or are creating a new plotType.
working here - link to more complete documentation of the collation and autoscaling systems.
- new
-
This is the universal constructor that is called by the short-name constructors introduced below. This handles the uniform packaging of plotTypes (for example, allowing the user to say plotType => ppair::Diamonds instead of the more verbose plotTypes => [ppair::Diamonds]). In general, you (the user) will not need to invoke this constructor directly.
- check_plot_types
-
Checks that the plotType(s) passed to the constructor or added at runtime are built on the data type that we expect. Derived classes must specify their plotType_base_class key before calling this function.
- init
-
Called by new to initialize the dataset. This function is called on the new dataset just before it is returned from new.
If you create a new dataSet, you should provide an init function that performs the following:
- supply a default plotType
-
If the user supplied something to either the plotType or plotTypes keys, then new will ensure that you already have that something in an array reference in $self->{plotTypes}. However, if they did not supply either key, you should supply a default. You should have something that looks like this:
    $self->{plotTypes} = [pset::CDF] unless exists $self->{plotTypes};
- check the plot types
-
After supplying a default plot type, you should check that the provided plot types are derived from the acceptable base plot type class. You would do this with code like this:
    $self->check_plot_types(@{$self->{plotTypes}});
This is your last step to validate or pre-calculate anything. For example, you must provide functions to return your data, and you should probably make guarantees about the kinds of data that such accessors return, such as the data always being a piddle. If that is the case, then it might not be a bad idea to say in your init function something like this:
    $self->{data} = PDL::Core::topdl($self->{data});
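The init steps above can be sketched end to end with a toy class. Everything in the My:: namespace below is hypothetical and exists only so the example runs without PDL or Prima; wrapping the data in an array reference stands in for the call to PDL::Core::topdl, and the new shown here only imitates the packaging that the real constructor performs.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for a plot-type hierarchy; the real classes live
# under PDL::Graphics::Prima::PlotType.
package My::PlotType::Set;  sub new { return bless {}, shift }
package My::PlotType::CDF;  our @ISA = ('My::PlotType::Set');
package My::PlotType::Pair; sub new { return bless {}, shift }

# A toy dataset class whose init follows the steps described above.
package My::DataSet::Set;
use Scalar::Util qw(blessed);

sub new {
    my ($class, $data, %args) = @_;
    my $self = bless { data => $data, %args }, $class;
    # Uniform packaging: accept plotType or plotTypes, object or array ref.
    $self->{plotTypes} = delete $self->{plotType} if exists $self->{plotType};
    if (exists $self->{plotTypes} and ref($self->{plotTypes}) ne 'ARRAY') {
        $self->{plotTypes} = [ $self->{plotTypes} ];
    }
    $self->init;
    return $self;
}

sub check_plot_types {
    my ($self, @plot_types) = @_;
    my $base = $self->{plotType_base_class};
    for my $plot_type (@plot_types) {
        die "Plot type must be derived from $base\n"
            unless blessed($plot_type) and $plot_type->isa($base);
    }
}

sub init {
    my $self = shift;
    # 1. Supply a default plot type if the user gave none.
    $self->{plotTypes} = [ My::PlotType::CDF->new ]
        unless exists $self->{plotTypes};
    # 2. Check the plot types against the expected base class.
    $self->{plotType_base_class} = 'My::PlotType::Set';
    $self->check_plot_types(@{ $self->{plotTypes} });
    # 3. Normalize the data so that accessors can make guarantees about it.
    $self->{data} = [ $self->{data} ] unless ref($self->{data}) eq 'ARRAY';
}

package main;

my $good = My::DataSet::Set->new([1, 2, 3]);
print "default plot type: ", ref($good->{plotTypes}[0]), "\n";

my $bad = eval {
    My::DataSet::Set->new([1, 2, 3], plotType => My::PlotType::Pair->new);
};
print "rejected: $@" unless $bad;
```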
Sets
Sets are unordered collections of sample data. The typical use case of set data is that you have a population of things and you want to analyze their aggregate properties. For example, you might be interested in the distribution of tree heights at your Christmas Tree Farm, or the distribution of your students' (or your classmates') test scores from the mid-term. Those collections of data are called Sets and PDL::Graphics::Prima provides a number of ways of visualizing sets, as discussed under "Sets" in PDL::Graphics::Prima::PlotType. Here, I discuss how to create and manipulate Set dataSet objects.
Note that the shape of pluralized properties (e.g. colors) should thread-match the shape of the data excluding the data's first dimension. That is, if I want to plot the cumulative distributions for three different batches using three different line colors, my data would have shape (N, 3) and my colors piddle would have shape (3).
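The thread-matching rule can be checked by hand with a few lines of plain Perl. The helper below, thread_matches, is a hypothetical name for this illustration and is not part of the library; it assumes the usual convention that dimensions are compared pairwise, that a property dimension of size 1 matches anything, and that missing trailing dimensions are treated as size 1.

```perl
use strict;
use warnings;

# Returns true if a property with shape @$prop_dims can thread over data
# with shape @$data_dims once the first $n_skip data dimensions are ignored.
sub thread_matches {
    my ($data_dims, $prop_dims, $n_skip) = @_;
    my @thread_dims = @{$data_dims}[$n_skip .. $#{$data_dims}];
    for my $i (0 .. $#{$prop_dims}) {
        my $p = $prop_dims->[$i];
        # Data dimensions that do not exist are treated as having size 1.
        my $d = $i < @thread_dims ? $thread_dims[$i] : 1;
        return 0 unless $p == 1 or $p == $d;
    }
    return 1;
}

# Three batches of 100 samples each; Sets ignore the first dimension.
print "colors of shape (3): ",
    (thread_matches([100, 3], [3], 1) ? "ok" : "mismatch"), "\n";
print "colors of shape (2): ",
    (thread_matches([100, 3], [2], 1) ? "ok" : "mismatch"), "\n";
```

The same helper with two ignored dimensions describes data that is indexed by two leading dimensions, such as a grid.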
- ds::Set - short-name constructor
-
ds::Set($data, option => value, ...)
The short-name constructor to create sets. The data can be either a piddle of values or an array reference of values (which will be converted to a piddle during initialization).
- expected_plot_class
-
Sets expect plot type objects that are derived from PDL::Graphics::Prima::PlotType::Set.
- get_data
-
Returns the piddle containing the data. This is used mostly by the plotTypes to retrieve the data in order to display it. You can also use it to retrieve the data piddle if it makes your code more legible.
    my $heights = load_height_data();
    ...
    my $plot = $wDisplay->insert('Plot',
        -heights => ds::Set($heights),
        ...
    );

    # Retrieve and print the data:
    print "heights are ", $plot->dataSets->{heights}->get_data, "\n";
A subtle point: notice that you can change the data within the piddle, and you can even change the piddle's shape, but you cannot use this to replace the piddle itself.
Pair
Pairwise datasets are collections of paired x/y data. A typical Pair dataset is the sort of thing you would visualize with an x/y plot: a time series such as the series of high temperatures for each day in a month or the x- and y-coordinates of a bug walking across your desk. PDL::Graphics::Prima provides many ways of visualizing Pair datasets, as discussed under "Pair" in PDL::Graphics::Prima::PlotType.
The dimensions of pluralized properties (e.g. colors) should thread-match the dimensions of the data. An important exception to this is ppair::Lines, in which case you must specify how you want properties to thread.
The default plot type is ppair::Diamonds.
- ds::Pair - short-name constructor
-
ds::Pair($x_data, $y_data, option => value, ...)
The short-name constructor to create pairwise datasets. The x- and y-data can be either piddles or array references (which will be converted to piddles during initialization).
- expected_plot_class
-
Pair datasets expect plot type objects that are derived from PDL::Graphics::Prima::PlotType::Pair.
- get_xs, get_ys, get_data
-
Returns piddles with the x, y, or x-y data. The last function returns two piddles in a list.
- get_data_as_pixels
-
Uses the reals_to_pixels functions for the x- and y-axes to convert the values of the x- and y-data to actual pixel positions in the widget.
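The conversion can be pictured with a plain linear mapping. This is only a sketch: linear_real_to_pixel, data_as_pixels, and the [min, max, n_pixels] axis description are names invented for this example. The real reals_to_pixels method belongs to the axis objects and honors the axis's scaling, which need not be linear.

```perl
use strict;
use warnings;

# Hypothetical linear conversion: maps a value in [$min, $max] onto a
# pixel offset in [0, $n_pixels].
sub linear_real_to_pixel {
    my ($value, $min, $max, $n_pixels) = @_;
    return ($value - $min) / ($max - $min) * $n_pixels;
}

# Converts paired x/y data using one [min, max, n_pixels] triple per axis
# and returns two array references, mirroring the two-piddle return of
# get_data.
sub data_as_pixels {
    my ($xs, $ys, $x_axis, $y_axis) = @_;
    my @x_pixels = map { linear_real_to_pixel($_, @$x_axis) } @$xs;
    my @y_pixels = map { linear_real_to_pixel($_, @$y_axis) } @$ys;
    return (\@x_pixels, \@y_pixels);
}

my ($x_pix, $y_pix) = data_as_pixels(
    [0, 5, 10], [-1, 0, 1],      # x- and y-data
    [0, 10, 400], [-1, 1, 300],  # axes: min, max, size in pixels
);
print "x pixels: @$x_pix\n";
print "y pixels: @$y_pix\n";
```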
Grids
Grids are collections of data that are regularly ordered in two dimensions. Put differently, a grid is a structure in which the data is described by two indices. The analogous mathematical structure is a matrix and the analogous visual is an image. PDL::Graphics::Prima provides a few ways to visualize grids, as discussed under "Grids" in PDL::Graphics::Prima::PlotType. The default plot type is pgrid::Color.
This is the least well thought-out dataSet. As such, it may change in the future. All such changes will, hopefully, be backwards compatible.
At the moment, there is only one way to visualize grid data: pgrid::Matrix. Although I can conceive of a contour plot, it has yet to be implemented. As such, it is hard to specify the dimension requirements for dataset-wide properties. There are a few dataset-wide properties discussed in the constructor, however, so see them for some examples.
- ds::Grid - short-name constructor
-
ds::Grid($matrix, option => value, ...)
The short-name constructor to create grids. The data should be a piddle of values or something which topdl can convert to a piddle (an array reference of array references).
The current cross-plot-type options include the bounds settings. You can either specify a bounds key or one key from each column:
    x_bounds    y_bounds
    x_centers   y_centers
    x_edges     y_edges
- bounds
-
The value associated with the bounds key is a four-element anonymous array:
    bounds => [$left, $bottom, $right, $top]
The values can either be scalars or piddles that indicate the corners of the grid plotting area. If the latter, it is possible to thread over the bounds by having the shape of (say) $left thread-match the shape of your grid's data, excluding the first two dimensions. That is, if your $matrix has a shape of (20, 30, 4, 5), the piddle for $left can have shapes of (1), (4), (4, 1), (1, 5), or (4, 5).
At the moment, if you specify bounds, linear spacing from the min to the max is used. In the future, a new key may be introduced to allow you to specify the spacing as something besides linear.
- x_bounds, y_bounds
-
The values associated with x_bounds and y_bounds are anonymous arrays with two elements containing the same sorts of data as the bounds array.
- x_centers, y_centers
-
The value associated with x_centers (or y_centers) should be a piddle with increasing values of x (or y) that give the mid-points of the data. For example, if we have a matrix with shape (3, 4), x_centers would have 3 elements and y_centers would have 4 elements:
        -------------------
    y3  | d03 | d13 | d23 |
        -------------------
    y2  | d02 | d12 | d22 |
        -------------------
    y1  | d01 | d11 | d21 |
        -------------------
    y0  | d00 | d10 | d20 |
        -------------------
          x0    x1    x2
Some plot types may require the edges. In that case, if there is more than one point, the plot guesses the scaling of the spacing between points (choosing between logarithmic or linear) and appropriate bounds for the given scaling are calculated using interpolation and extrapolation. The plot will croak if there is only one point (in which case interpolation is not possible). If the spacing for your grid is neither linear nor logarithmic, you should explicitly specify the edges, as discussed next.
At the moment, the guesswork assumes that all the scalings for a given Grid dataset are either linear or logarithmic, even though it's possible to mix the scaling using threading. (It's hard to do that by accident, so if that last bit seems confusing, then you probably don't need to worry about tripping on it.) Also, I would like for the plot to croak if the scaling does not appear to be either linear or logarithmic, but that is not yet implemented.
- x_edges, y_edges
-
The value associated with x_edges (or y_edges) should be a piddle with increasing values of x (or y) that give the boundary edges of data. For example, if we have a matrix with shape (3, 4), x_edges would have 3 + 1 = 4 elements and y_edges would have 4 + 1 = 5 elements:
    y4  -------------------
        | d03 | d13 | d23 |
    y3  -------------------
        | d02 | d12 | d22 |
    y2  -------------------
        | d01 | d11 | d21 |
    y1  -------------------
        | d00 | d10 | d20 |
    y0  -------------------
        x0    x1    x2    x3
Some plot types may require the data centers. In that case, if there are only two edges, a linear interpolation is used. If there are more than two points, the plot will try to guess the spacing, choosing between linear and logarithmic, and use the appropriate interpolation.
The note above regarding guesswork for x_centers and y_centers applies here, also.
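For the linear case, the interpolation and extrapolation between centers and edges can be sketched in plain Perl. The function names below are made up for this illustration, and the code handles only linear spacing; it is not the library's implementation, which works on piddles and also handles logarithmic spacing.

```perl
use strict;
use warnings;

# Edges from centers, assuming linear spacing: interior edges are the
# midpoints between neighboring centers, and the two outer edges are
# extrapolated by half of the adjacent spacing.
sub edges_from_centers {
    my @centers = @_;
    die "cannot infer edges from a single center\n" if @centers < 2;
    my @edges = ($centers[0] - ($centers[1] - $centers[0]) / 2);
    for my $i (0 .. $#centers - 1) {
        push @edges, ($centers[$i] + $centers[$i + 1]) / 2;
    }
    push @edges, $centers[-1] + ($centers[-1] - $centers[-2]) / 2;
    return @edges;
}

# Centers from edges, again assuming linear spacing: each center is the
# midpoint of the two edges that bound its cell.
sub centers_from_edges {
    my @edges = @_;
    die "need at least two edges\n" if @edges < 2;
    my @centers;
    for my $i (0 .. $#edges - 1) {
        push @centers, ($edges[$i] + $edges[$i + 1]) / 2;
    }
    return @centers;
}

my @x_centers = (1, 2, 3);
my @x_edges   = edges_from_centers(@x_centers);
my @recovered = centers_from_edges(@x_edges);
print "centers @x_centers give edges @x_edges\n";
print "and those edges give centers @recovered\n";
```

Note that three centers produce four edges, in agreement with the element counts given above.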
- expected_plot_class
-
Grids expect plot type objects that are derived from PDL::Graphics::Prima::PlotType::Grid.
- get_data
-
Returns the piddle containing the data.
- guess_scaling_for
-
Takes a piddle and tries to guess the scaling from the spacing. Returns a string indicating the scaling, either "linear" or "log", as well as the spacing term.
working here - clarify that last bit with an example
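As one possible reading of that return value, here is a pure-Perl sketch of guessing the scaling from the spacing. It is an assumption-laden illustration rather than the library's code: the name guess_scaling, the tolerance, and the interpretation of the "spacing term" as the common difference (linear) or the common ratio (log) are all mine.

```perl
use strict;
use warnings;

# Returns ('linear', $step) when successive differences are constant,
# ('log', $ratio) when successive ratios are constant, and an empty list
# when neither holds.
sub guess_scaling {
    my @values = @_;
    die "need at least two points\n" if @values < 2;
    my $tolerance = 1e-9;

    # Linear check: every difference equals the first one.
    my $step = $values[1] - $values[0];
    my $is_linear = 1;
    for my $i (1 .. $#values) {
        my $difference = $values[$i] - $values[$i - 1];
        $is_linear = 0 if abs($difference - $step) > $tolerance * abs($step);
    }
    return ('linear', $step) if $is_linear;

    # Logarithmic check: every ratio equals the first one. Ratios only
    # make sense for strictly positive values.
    return if grep { $_ <= 0 } @values;
    my $ratio = $values[1] / $values[0];
    for my $i (1 .. $#values) {
        return if abs($values[$i] / $values[$i - 1] - $ratio) > $tolerance * $ratio;
    }
    return ('log', $ratio);
}

for my $points ([1, 2, 3, 4], [1, 10, 100, 1000], [1, 2, 4, 7]) {
    my ($type, $spacing) = guess_scaling(@$points);
    my $verdict = defined($type) ? "$type with spacing $spacing" : "unknown";
    print "@$points: $verdict\n";
}
```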
Image
Color formats are case insensitive; default is rgb
Func
PDL::Graphics::Prima provides a special pair dataset that takes a function reference instead of a set of data. The function should take a piddle of x-values as input and compute and return the y-values. You can specify the number of data points by supplying N_points => value in the list of key-value pairs that initialize the dataset. Most of the functionality is inherited from PDL::Graphics::Prima::DataSet::Pair, but there are a few exceptions.
- ds::Func - short-name constructor
-
ds::Func($subroutine, option => value, ...)
The short-name constructor to create function datasets. The subroutine must be a reference to a subroutine, or an anonymous sub. For example,
    # Reference to a subroutine,
    # PDL's exponential function:
    ds::Func (\&PDL::exp)

    # Using an anonymous subroutine:
    ds::Func ( sub {
        my $xs = shift;
        return $xs->exp;
    })
- get_xs, get_ys
-
These functions override the default Pair behavior by generating the x-data and using that to compute the y-data. The x-data is uniformly sampled according to the x-axis scaling.
- compute_collated_min_max_for
-
This function is supposed to provide information for autoscaling. This is a sensible thing to do for the y-values of functions, but it makes no sense for the x-values since these are taken from the x-axis min and max already.
This could be smarter, methinks, so please give me your ideas if you have them. :-)
DataSet::Collection
The dataset collection is the thing that actually holds the datasets in the plot widget object. The Collection is a tied hash, so you access all of its data members as if they were hash elements. However, it does some double-checking for you behind the scenes to make sure that whenever you add a dataset to the Collection, you add a real DataSet object and not some arbitrary thing.
working here - this needs to be clarified
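The tied-hash behavior can be illustrated with a small stand-alone class. The My::DataSet and My::DataSet::Collection packages below are hypothetical and only mimic the validation step; the real Collection has further duties, such as telling each dataset about its plot widget, which are not modelled here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the dataset base class.
package My::DataSet;
sub new { my $class = shift; return bless {@_}, $class }

# A tied-hash class that only accepts My::DataSet objects as values.
# Tie::StdHash ships with Perl as part of the Tie::Hash module.
package My::DataSet::Collection;
use Tie::Hash;
use Scalar::Util qw(blessed);
our @ISA = ('Tie::StdHash');

# Called for every assignment of the form $hash{$name} = $dataset.
sub STORE {
    my ($self, $name, $dataset) = @_;
    die "Value for '$name' is not a dataset object\n"
        unless blessed($dataset) and $dataset->isa('My::DataSet');
    $self->{$name} = $dataset;
}

package main;

tie my %datasets, 'My::DataSet::Collection';

# Ordinary hash syntax works for real dataset objects...
$datasets{heights} = My::DataSet->new(data => [150, 160, 170]);
print "stored: ", join(', ', sort keys %datasets), "\n";

# ...but storing some arbitrary thing croaks.
eval { $datasets{oops} = [1, 2, 3] };
print "rejected: $@" if $@;
```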
RESPONSIBILITIES
The datasets and the dataset collection have a number of responsibilities, and a number of things for which they are not responsible.
The dataset container is responsible for:
- knowing the plot widget
-
The container always maintains knowledge of the plot widget to which it belongs. Put a bit differently, a dataset container cannot belong to multiple plot widgets (at least, not at the moment).
- informing datasets of their container and plot widget
-
When a dataset is added to a dataset collection, the collection is responsible for informing the dataset of the plot object and the dataset collection to which the dataset belongs.
Datasets themselves are responsible for:
- knowing and managing the plotTypes
-
The datasets are responsible for maintaining the list of plotTypes that are to be applied to their data.
- knowing per-dataset properties
-
Drawing properties can be specified on a per-dataset scope. The dataset is responsible for maintaining a list of these properties and providing them to the plot types when they perform drawing operations.
- knowing the dataset container and the plot widget
-
All datasets know the dataset container and the plot widget to which they belong. Although they could retrieve the widget through a method on the container, the
- informing plotTypes of their plot widget
-
The plot types all know the widget (and dataset) to which they belong, and it is the
- managing the drawing operations of plotTypes
-
Although datasets themselves do not need to draw anything, they do call the drawing operations of the different plot types that they contain.
- knowing and supplying the data
-
A key responsibility for the dataSets is holding the data that are drawn by the plot types. Although the plot types may hold specialized data, the dataset holds the actual data that underlies the plot types and provides a specific interface for the plot types to access that data.
On the other hand, datasets are not responsible for knowing or doing any of the following:
- knowing axes
-
The plot object is responsible for knowing the x- and y-axis objects. However, if the axis system is changed to allow for multiple x- and y-axes, then this burden will shift to the dataset as it will need to know which axis to use when performing data <-> pixel conversions.
TODO
Add optional bounds to function-based DataSets.
Capitalization for plotType, etc.
Use PDL documentation conventions for signatures, ref, etc.
Additional dataset, a two-tone grid. Imagine that you want to overlay the population density of a country and the average rainfall (at the granularity of counties, let's say). You could use the intensity of the red channel to indicate population and the intensity of blue to indicate rainfall. Highly populated areas with low rainfall would be bright red, while highly populated areas with high rainfall would be purple, and sparsely populated areas with high rainfall would be blue. The color scale would be indicated with a square with a color gradient (rather than a horizontal or vertical bar with a color gradient, as in a normal ColorGrid). Anyway, this differs from a normal grid dataset because it would require two datasets, one for each tone.
AUTHOR
David Mertens (dcmertens.perl@gmail.com)
SEE ALSO
This is a component of PDL::Graphics::Prima. This library is composed of many modules, including:
- PDL::Graphics::Prima
-
Defines the Plot widget for use in Prima applications
- PDL::Graphics::Prima::Axis
-
Specifies the behavior of axes (but not the scaling)
- PDL::Graphics::Prima::DataSet
-
Specifies the behavior of DataSets
- PDL::Graphics::Prima::Internals
-
A dumping ground for my partial documentation of some of the more complicated stuff. It's not organized, so you probably shouldn't read it.
- PDL::Graphics::Prima::Limits
-
Defines the lm:: namespace
- PDL::Graphics::Prima::Palette
-
Specifies a collection of different color palettes
- PDL::Graphics::Prima::PlotType
-
Defines the different ways to visualize your data
- PDL::Graphics::Prima::Scaling
-
Specifies different kinds of scaling, including linear and logarithmic
- PDL::Graphics::Prima::Simple
-
Defines a number of useful functions for generating simple and not-so-simple plots
LICENSE AND COPYRIGHT
Portions of this module's code are copyright (c) 2011 The Board of Trustees at the University of Illinois.
Portions of this module's code are copyright (c) 2011-2012 Northwestern University.
This module's documentation are copyright (c) 2011-2012 David Mertens.
All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.