NAME
DBIx::SQLCrosstab - creates a server-side cross tabulation from a database
SYNOPSIS
use DBIx::SQLCrosstab;
my $dbh=DBI->connect("dbi:driver:database"
"user","password", {RaiseError=>1})
or die "error in connection $DBI::errstr\n";
my $params = {
dbh => $dbh,
op => [ ['SUM', 'salary'] ],
from => 'person INNER JOIN departments USING (dept_id)',
rows => [
{ col => 'country'},
],
cols => [
{
id => 'dept',
value =>'department',
from =>'departments'
},
{
id => 'gender', from => 'person'
}
]
};
my $xtab = DBIx::SQLCrosstab->new($params)
or die "error in creation ($DBIx::SQLCrosstab::errstr)\n";
my $query = $xtab->get_query("#")
or die "error in query building $DBIx::SQLCrosstab::errstr\n";
# use the query or let the module do the dirty job for you
my $recs = $xtab->get_recs
or die "error in execution $DBIx::SQLCrosstab::errstr\n";
# do something with records, or use the child class
# DBIx::SQLCrosstab::Format to produce well
# formatted HTML or XML output
#
my $xtab = DBIx::SQLCrosstab::Format->new($params)
or die "error in creation ($DBIx::SQLCrosstab::errstr)\n";
if ($xtab->get_query and $xtab->get_recs) {
print $xtab->as_html;
my $xml_data = $xtab->as_xml;
}
DESCRIPTION
DBIx::SQLCrosstab produces a SQL query to interrogate a database and generate a cross-tabulation report. The amount of parameters needed to achieve the result is kept to a minimum. You need to indicate which columns and rows to cross and from which table(s) they should be taken. Acting on your info, DBIx::SQLCrosstab creates an appropriate query to get the desired result. Compared to spreadsheet based cross-tabulations, DBIx::SQLCrosstab has two distinct advantages, i.e. it keeps the query in the database work space, fully exploiting the engine capabilities, and does not limit the data extraction to one table.
See http://gmax.oltrelinux.com/cgi-bin/xtab.cgi for an interactive example.
Cross tabulation basics
Cross tabulations are statistical reports where the values from one or more given columns are used as column headers, and GROUP functions are applied to retrieve totals that apply to such values.
SELECT
id, name, gender, dept
FROM
person
INNER JOIN depts ON (depts.dept_id=perspn.dept_id)
+----+--------+--------+-------+
| id | name | gender | dept |
+----+--------+--------+-------+
| 1 | John | m | pers |
| 2 | Mario | m | pers |
| 7 | Mary | f | pers |
| 8 | Bill | m | pers |
| 3 | Frank | m | sales |
| 5 | Susan | f | sales |
| 6 | Martin | m | sales |
| 4 | Otto | m | dev |
| 9 | June | f | dev |
+----+--------+--------+-------+
A simple example will clarify the concept. Given the above raw data, a count of employees by dept and gender would look something like this:
+-------+----+----+-------+
| dept | m | f | total |
+-------+----+----+-------+
| dev | 1 | 1 | 2 |
| pers | 3 | 1 | 4 |
| sales | 2 | 1 | 3 |
+-------+----+----+-------+
The query to create this result is
SELECT
dept,
COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) as m,
COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) as f,
COUNT(*) as total
FROM
person
INNER JOIN depts ON (person.dept_id = depts.dept_id)
GROUP BY
dept
Although this query doesn't look easy, it is actually quite easy to create and the resulting data is straightforward. Creating the query requires advance knowledge of the values for the "gender" column, which can be as easy as m/f or as complex as male/female/unknown/undeclared/ former male/former female/pending (don't blame me. This is a "real" case!). Give the uncertainity, the method to get the column values id to issue a preparatory query
SELECT DISTINCT gender FROM person
Then we can use the resulting values to build the final query
my $query = "SELECT dept \n";
$query .=
",COUNT(CASE WHEN gender = '$_' THEN id ELSE NULL END) AS $_ \n"
for @$columns;
$query .= ",COUNT(*) as total \n"
. "FROM person INNER JOIN depts \n"
. "ON (person.dept_id=depts.dept_id) \n"
. "GROUP BY dept\n";
If you have to do it once, you can just use the above idiom and you are done. But if you have several cases, and your cross-tab has more than one level, then you could put this module to good use. Notice that, to create this query, you needed also the knowledge of which column to test (gender) and to which column apply the GROUP function (id)
Multi-level cross tabulations
If single-level cross tables haven't worried you, multiple level tables should give you something to think. In addition to everything said before, multiple level crosstabs have:
- query composition complexity. Each column is the combination
of several conditions, one for each level;
- column subtotals, to be inserted after the appropriate section;
- row subtotals, to be inserted after the relevant rows;
- explosive increase of column number. For a three-level crosstab
where each level has three values you get 27 columns. If you
include sub-totals, your number rises to 36. If you have just a few
levels with five or six values , you may be counting rows by the
hundreds;
- visualization problems. While the result set from the DBMS
is a simple matrix, the conceptual table has a visualization tree
at the top (for columns) and a visualization tree at the left
side (for rows).
+----+----+--------------------+--------------------+--+
| A | B | C1 | C2 | | 1
| | +--------------------+--------------------+ |
| | | D1 | D2 | | D1 | D2 | | | 2
| | +--------+--------| +--------+--------+ | |
| | |E1 E2 T |E1 E2 T |T |E1 E2 T |E1 E2 T |T |T | 3
+----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| A1 | B1 | | | | | | | | | | | | | | | | 4
| |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | B2 | | | | | | | | | | | | | | | | 5
| |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | T | | | | | | | | | | | | | | | | 6
+----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| A2 | B1 | | | | | | | | | | | | | | | | 7
| |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | B2 | | | | | | | | | | | | | | | | 8
| |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | T | | | | | | | | | | | | | | | | 9
+----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| T | -- | | | | | | | | | | | | | | | | 10
+----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
a b c d e f g h i g k l m n o p q
Some definitions
columns headers : 1-3
column headers 1st level : 1
column headers 2nd level : 2
column headers 3rd level : 3
row headers : a, b
row header 1st level : a
row header 2nd level : b
row sub totals : 6, 9
row total : 10
column sub totals : e, h, i, l, o, p
column total : q
Column headers choice strategies
The easiest way of choosing columns is to tell DBIx::SQLCrosstab to get the values from a given column and let it extract them and combine the values from the various levels to make the appropriate conditions. Sometimes this is not desirable. For instance, if your column values come from the main dataset, the one that will be used for the crosstab, you are querying the database twice with two possibly expensive queries. Sometimes you can't help it, but there are a few workarounds.
If uou know the values in advance, you can pass them to the SQLCrosstab object, saving one query. This is when the values come from a constraint, for example. You may have a "grade" column and you know that the values can only be "A" to "F", or a "game result" column with values "1", "2", "x", "suspended", "camncelled". Or you can run the full query once, and when you are satisfied that the column values are the ones you know they should be, you pass the values for the subsequent calls.
The list option is also useful when you only want to make a crosstab for a given set of values, rather than having a mammoth result table with values you don't need.
Hidden JOINs
The normal case, in a well normalized database, should be to get the column values from a lookup table. If such table was created dynamically, so that it only contains values referred from the main table, then there is no problem. If, on the contrary, the lookup table was pre-loaded with all possible values for a given column, you may come out with a huge number of values, of which only a few were used. In this case, you need to run the query with a JOIN to the main table, to exclude the unused values.
When yoo use a lookup table, though, you can optimize the main query, by removing a JOIN that you have already used in the preparatory query. Let's see an example.
You have the table "person", which includes "dept_id", a foreign key referencing the table "depts". If you pass SQLCrosstab a column description including a {value} key, it will get from the lookup table both {id} and {value}, so that, instead of creating a main query with columns like
,COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END) AS 'pers'
It will create this:
,COUNT(CASE WHEN dept_id = '1' THEN id ELSE NULL END) AS 'pers'
or (depending on the parameters you passed) this:
,COUNT(CASE WHEN dept_id = '1' THEN id ELSE NULL END) AS fld001 -- pers
The difference is that in the first case your final query needs a JOIN to depts, while in the second case it won't need it. Therefore the final query, the expensive one, will be much faster. The reasoning is, once you went through a lookup table to get the distinct values, you should not use that table again in the main query.
Class methods
- new
-
Create a new DBIx::SQLCrosstab object.
my $xtab = DBIx::SQLCrosstab->new($params) or die "creation error $DBIx::SQLCrosstab::errstr\n"
$params is a hash reference containing at least the following parameters:
dbh either a valid database handler (DBI::db) or a hash reference with the appropriate parameters to create one dbh => { dsn => "dbi:driver:database", user => "username", password => "secretpwd", params => {RaiseError => 1} } op the operation to perform (SUM, COUNT, AVG, MIN, MAX, STD, VAR, provided that your DBMS supports them) and the column to summarize. It must be an array reference, with each item a pair of operation/column. E.g.: op => [ [ 'COUNT', 'id'], ['SUM', 'salary'] ], *** WARNING *** Use of this parameter as a scalar is still supported but it is DEPRECATED. op_col The column on which to perform the operation *** DEPRECATED *** Use {op} as an array reference instead. from Where to get the data. It could be as short as a table name or as long as a comlex FROM statement including INNER and OUTER JOINs. The syntax is not checked. It must be accepted by the DBMS you are using underneath. rows a reference to an array of hashes, each defining one header for a crosstab row level. The only mandatory key is {col} identifying the column name Optionally, you can add an {alias} key, to be used with the AS keyword. cols a reference to an array of hashes, each defining one header for a cross tab column level. Two keys are mandatory {id} the column name {from} where to get it from. If the {group} option is used, the other columns can have a value of "1" instead. Optionally, the following keys can be added {group} If this option is set, then all the columns are queried at once, with the {from} statement of the first column definition. {alias} an alias for the column name. Useful for calculated fields. {value} an additional column, related to {id} whose values you want to use instead of {id} as column headers. See below "The hidden join" for more explanation. {col_list} Is a referenece to an array of values that will be used as column headers, instead of querying the database. If you know the values in advance, or if you want to use only a few known ones, then you can specify them in this list. Each element in col_list must be a hash reference with at least an {id} key. A {value} key is optional. {exclude_value} Is a reference to an array of values to exclude from the headers. Unlike the general option "col_exclude", this option will remove all the combinations containing the given values. {where} to limit the column to get {orderby} to order the columns in a different way. The following parameters are optional. where a WHERE clause that will be added to the resulting query to limit the amount of data to fetch. having Same as WHERE, but applies to the grouped values add_op either a scalar or an array reference containing one or more functions to be used in addition to the main 'op'. For example, if 'op' is 'COUNT', you may set add_op to ['SUM', 'AVG'] and the crosstab engine will produce a table having the count, sum and average of the value in 'op_col'. *** DEPRECATED *** Use {op} as an array reference instead. title A title for the crosstab. Will be used for HTML and XML creation remove_if_null remove_if_zero Remove from the record set all the columns where all values are respectively NULL or zero. Notice that there is a difference between a column with all zeroes and a column with all NULLs. All zeroes with a SUM means that all the values were 0, while all NULLs means that no records matching the given criteria were met. However, it also depends on the DBMS. According to ANSI specifications, SUM/AVG(NULL) should return NULL, while COUNT(NULL) should return 0. Rather than assuming strict compliance, I leave you the choice. col_exclude Is a reference to an array of columns to be excluded from the query. The values must be complete column names. To know the column names, you can use the "add_real_names" option and then the get_query method. add_real_names Add the real column names as comments to the query text. In order to avoid conflicts with the database, the default behavior is to create fake column names (fld001, fld002, etc) that will be later replaced. This feature may cause problems with the database engines that don't react well to embedded comments. use_real_names use the real column values as column names. This may be a problem if the column value contains characters that are not allowed in column names. Even though the names are properly quoted, it id not 100% safe. row_total row_sub_total If activated, adds a total row at the end of the result set or the total rows at the end of each row level. Your DBMS must support the SQL UNION keyword for this option to work. ******** CAVEAT! ******** Be aware that these two options will double the server load for each row level beyond 1, plus one additional query for the grand total. The meaning of this warning is that the query generated by DBIx::SQLCrosstab will contain one UNION query with a different GROUP BY clause for each row level. The grand total is a UNION query without GROUP BY clause. If your dataset is several million records large, you may consider skipping these options and perform subtotals and grand total in the client. For less than one million records, any decent database engine should be able to execute the query in an acceptable timeframe. col_total col_sub_total If activated, add a total column at the end of the result set or the total columns at the end of each column level. RaiseError If actviated, makes all errors fatal. Normally, errors are trapped and recorded in $DBIx::SQLCrosstab. RaiseError will raise an exception instead. PrintError If activated, will issue a warning with the message that caused the exception, but won't die. ************************************************ The following options only apply when creating a DBIx::SQLCrosstab::Format object. ************************************************ commify Used for HTML and XML output. If true, will insert commas as thousand separators in all recordset numbers. complete_html_page Returns HTML header and end tags, so that the resulting text is a complete HTML page. only_html_header Returns only the header part of the table, without records. Useful to create templates. add_colors If true, default colors are applied to the resulting table. text => "#009900", # green number => "#FF0000", # red header => "#0000FF", # blue footer => "#000099", # darkblue text_color number_color header_color footer_color Change the default colors to custom ones table_border table_cellspacing table_cellpadding Change the settings for HTML table borders. Defaults are: border => 1 cellspacing => 0 cellpadding => 2
- set_param
-
Allows to set one or more parameters that you couldn't pass with the constructor.
$xtab->set_param( cols => [ { id => 'dept', from => 'departments' } ] ) or die "error setting parameter: DBIx::SQLCrosstab::errstr\n"; $xtab->set_param( remove_if_null => 1, remove_if_zero => 1, title => 'Some nice number crunching' ) or die "error setting parameter: DBIx::SQLCrosstab::errstr\n";
You can use this method together with a dummy constructor call:
my $xtab = DBIx::SQLCrosstab->new ('STUB') or die "can't create ($DBIx::SQLCrosstab::errstr)\n"; $xtab->set_param( dbh => $dbh, op => 'SUM', op_col => 'amount', cols => $mycolumns, rows => $myrows, from => 'mytable' ) or die "error setting parameter: DBIx::SQLCrosstab::errstr\n";
- get_params
-
Returns a string containing te parameters to replicate the current DBIx::SQLCrosstab object. The data is represented as Perl code, and it can be evaluated as such. The variable's name is 'params'. It does not include the 'dbh' parameter.
my $params = $xtab->get_params or warn "can't get params ($DBIx::SQLCrosstab::errstr)";
- save_params
-
Saves the parameters necessary to rebuild the current object to a given file. This function stores what is returned by get_params into a text file. Notice that the 'dbh' option is not saved.
unless ($xtab->save_params('myparams.pl') die "can't save current params ($DBIx::SQLCrosstab::errstr)";
- load_params
-
Loads previously saved parameters into the current object. Remember that 'dbh' is not restored, and must be set separately with set_param().
my $xtab = DBIx::SQLCrosstab->new('stub') or die "$DBIx::SQLCrosstab::errstr"; $xtab->load_params('myparams.pl') or die "$DBIx::SQLCrosstab::errstr"; $xtab->set_param( dbh => $dbh ) or die "$DBIx::SQLCrosstab::errstr";
- get_query
-
Returns the query to get the final cross-tabulation, or undef in case of errors. Check $DBIx::SQLCrosstab::errstr for the reason. You may optionally pass a parameter for the character to be used as separator between column names. The default is a pound sign ('#'). If the separator character is present in any of the column values (i.e. the values from a candidate column header), the engine will try in sequence '#', '/', '-', '=', doubling them if necessary, and eventually giving up only if all these characters are present in any column values. If this happens, then you need to pass an appropriate character, or group of charecters that you are reasonably sure doesn't recur in column values.
- get_recs
-
Executes the query and returns the recordset as an arrayref of array references, or undef on failure. After this method is called, several attributes become available: - recs the whole recordset - NAME an arrayref with the list of column names - LENGTH an arrayref with the maximum size of each column - NUM_OF_FIELDS an integer with the number of felds in the result set
Class attributes
There are attributes that are available for external consumption. Like the DBI, these attributes become available after a given event.
- {NAME}
-
This attribute returns the raw column names for the recordset as an array reference. Even if {use_real_names} was not defined, this attribute returns the real names rather than fld001, fld002, and so on. It is available after get_recs() was called.
my $fnames = $xtab->{NAMES}; print "$_\n" for @{$fnames};
- {recs}
-
This attribute contains the raw recordset as returned from the database. Available after get_recs().
- {NUM_OF_FIELDS}
-
The number of fields in the recordset. Available after get_recs().
- {LENGTH}
-
Contains an array reference to the maximum lengths of each column. The length is calculated taking into account the length of the column name and the length of all values in that column. Available after get_recs().
DEPENDENCIES
DBI
a DBD driver
Tree::DAG_Node
EXPORT
None by default.
AUTHOR
Giuseppe Maxia (<gmax_at_cpan.org>)
SEE ALSO
DBI
An article at OnLamp, "Generating Database Server-Side Cross Tabulations" (http://www.onlamp.com/pub/a/onlamp/2003/12/04/crosstabs.html) and one at PerlMonks, "SQL Crosstab, a hell of a DBI idiom" (http://www.perlmonks.org/index.pl?node_id=313934).
COPYRIGHT
Copyright 2003 by Giuseppe Maxia (<gmax_at_cpan.org>)
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.