NAME
DataWarehouse - Perl Data Warehouse Toolkit
DESCRIPTION
Perl is a language optimized for scanning arbitrary text files, extracting information from those text files, and generating reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).
A Data Warehouse (DW) is a repository of an organization's data, designed to facilitate reporting and analysis at different levels of aggregation.
The main source of the data is extracted, cleaned, transformed, and made available to managers and other business professionals for data mining, online analytical processing, market research and decision support. This process is called ETL (Extract/Transform/Load).
The Perl Data Warehouse Toolkit is a generic set of tools created to manage Data Warehouses meta-information, making simple ETL and DW tasks easy, and complex tasks possible.
ETL
The ETL ("Extract/Transform/Load") process consists of 3 steps:
Extract
Download information from Source Systems, and save locally. It is important that the information is saved locally to avoid multiple accesses to the Source Systems, if the ETL process needs to be restarted for any reason.
Transform
The data from the Source Systems may be inconsistent. The good news is that, in many cases, that can be fixed! Data should be verified and cleaned. Again, the output should be saved locally, to avoid duplicate work. It also helps to document any transformations that may have been performed.
Load
That's the final step, and the goal of the ETL process: publish the data! We start loading the Dimension Tables, paying special attention to the Slowly Changing Dimensions -- the changes for which we want keep history. After the dimensions are loaded, we can proceed with the Fact Tables.
Once the data is published, users should be able to generate dynamic reports using the Data Warehouse Navigator.
DATA WAREHOUSE NAVIGATOR
The Data Warehouse Navigator is a simple web interface to the Data Warehouse.
It uses the Data Warehouse Metadata (information about facts and dimensions) to present a simple user interface, which can be used to generate dynamic reports.
Ideally, it should present a kind of pivot-table functionality, and the capacity to produce tables and charts.
But, before we can really put the Data Warehouse in production, we need to address the performance problem.
If your Data Warehouse grow past several millions of rows, you'll start to notice some performance degradation. Simple queries can take minutes to complete. That makes the Data Warehouse harder to use.
People expect instant responses.
Fortunatelly, there is a simple solution for that!
AGGREGATES
Aggregate tables can improve the performance of a data warehouse by hundreds or thousands of times. A query that took 10 minutes to complete, could finish in less than one second.
The trick consists in storing the information in pre-calculated, summarized tables, so you have to access much less information.
We're trading disk space for speed: the typical Data Warehouse will store almost its size in aggregate tables.
Here's a simple example.
Imagine that you need a report of "sales by month in 2010". This is the original pseudo-query:
SELECT
sales,
SUM(sales)
FROM Sales
WHERE year=2010;
If your company makes 100,000 transactions a month, that simple query will have to query over 1,200,000 rows to produce a simple sum.
Now, imagine that we have a pre-calculated Sales_by_month table:
SELECT
sales,
SUM(sales)
FROM Sales_by_month
WHERE year=2010;
Now, you're going to get the exact same response by querying 12 rows.
AGGREGATE NAVIGATOR
WARNING
THIS IS ALPHA SOFTWARE. It could drink your beer and eat your hamster!
AUTHOR
Nelson Ferraz, <nferraz at gmail.com>
BUGS
Please report any bugs or feature requests to bug-dw at rt.cpan.org
, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=DataWarehouse. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc DataWarehouse::Fact
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2010 Nelson Ferraz.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.