NAME
ETL::Pipeline::Input - Role for ETL::Pipeline input sources
SYNOPSIS
use Moose;
with 'ETL::Pipeline::Input';
sub run {
    # Add code to read your data here
    ...
}
DESCRIPTION
An input source feeds the extract part of ETL. This is where data comes from. These are your data sources.
A data source may be anything - a file, a database, or maybe a socket. Each format is an ETL::Pipeline input source. For example, Excel files represent one input source. Perl reads every Excel file the same way. With a few judicious attributes, we can re-use the same input source for just about any type of Excel file.
ETL::Pipeline defines an input source as a Moose object with at least one method - run. This role defines the requirement for the run method. It should be consumed by all input source classes. ETL::Pipeline relies on the input source having this role.
How do I create an input source?
- 1. Start a new Perl module. I recommend putting it in the ETL::Pipeline::Input namespace. ETL::Pipeline will pick it up automatically.
- 2. Make your module a Moose class - use Moose;.
- 3. Consume this role - with 'ETL::Pipeline::Input';.
- 4. Write the "run" method (a skeleton example follows this list). "run" follows this basic algorithm...
  - a. Open the source.
  - b. Loop reading the records. Each iteration should call "record" in ETL::Pipeline to trigger the transform step.
  - c. Close the source.
- 5. Add any attributes for your class.
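Putting those steps together, a minimal file-based skeleton might look something like this. The module name, the file attribute, and the single raw field are placeholders for illustration - adapt them to your own data. It also assumes that "record" in ETL::Pipeline accepts a hash reference of field names and values.

package ETL::Pipeline::Input::YourNewSource;

use Carp qw(croak);
use Moose;

with 'ETL::Pipeline::Input';

# Hypothetical attribute - the file to read, set by your script.
has 'file' => (is => 'ro', isa => 'Str', required => 1);

sub run {
    my ($self, $etl) = @_;

    # a. Open the source.
    open( my $handle, '<', $self->file )
        or croak 'Cannot open ' . $self->file . ": $!";

    # b. Loop reading the records, triggering the transform step for each one.
    while (my $line = <$handle>) {
        chomp $line;
        $etl->record( { raw => $line } );
    }

    # c. Close the source.
    close $handle;
}

no Moose;
__PACKAGE__->meta->make_immutable;

1;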
The new source is ready to use, like this...
$etl->input( 'YourNewSource' );
You can leave off the leading ETL::Pipeline::Input::.
When ETL::Pipeline calls "run", it passes the ETL::Pipeline object as the only parameter.
Why this way?
Input sources mostly follow the basic algorithm of open, read, process, and close. I originally had the role define methods for each of these steps. That was a lot of work, and kind of confusing. This way, the input source only needs one code block that does all of these steps - in one place. So it's easier to troubleshoot and write new sources.
In the work that I do, we have one output destination that rarely changes. It's far more common to write new input sources - especially customized sources. Making new sources easier saves time. Making it simpler means that more developers can pick up those tasks.
Does ETL::Pipeline only work with files?
No. ETL::Pipeline::Input works for any source of data, such as SQL queries, CSV files, or network sockets. Tailor the run method for whatever suits your needs.
Because files are the most common case, ETL::Pipeline comes with a helpful role - ETL::Pipeline::Input::File. Consume ETL::Pipeline::Input::File in your input source to access some standardized attributes.
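For example, a database-backed source only needs a run method that issues a query and feeds each row to the pipeline. This sketch uses DBI and assumes hypothetical dsn and query attributes on the consuming class - it is an illustration, not part of ETL::Pipeline itself.

# Inside a class that consumes ETL::Pipeline::Input and defines
# its own "dsn" and "query" attributes.
use DBI;

sub run {
    my ($self, $etl) = @_;

    my $dbh = DBI->connect( $self->dsn, undef, undef, { RaiseError => 1 } );
    my $sth = $dbh->prepare( $self->query );
    $sth->execute;

    # Each row becomes one record for the transform step.
    while (my $row = $sth->fetchrow_hashref) {
        $etl->record( $row );
    }

    $dbh->disconnect;
}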
Upgrading from older versions
ETL::Pipeline version 3 is not compatible with input sources from older versions. You will need to rewrite your custom input sources.
- Merge the setup, finish, and next_record methods into "run".
- Have "run" call $etl->record in place of next_record.
- Adjust attributes as necessary.
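Schematically, the change looks like this. The old method bodies are elided - the point is only that their logic moves inside "run", which now pushes each record to the pipeline itself.

# ETL::Pipeline 2.x style - three separate methods (shown schematically).
sub setup       { ... }   # open the source
sub next_record { ... }   # return one record at a time
sub finish      { ... }   # close the source

# ETL::Pipeline 3 style - one method that does all of the above.
sub run {
    my ($self, $etl) = @_;

    # ...open the source...
    # for each record read:
    #     $etl->record( $record );
    # ...close the source...
}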
METHODS & ATTRIBUTES
path (optional)
If you define this attribute, the standard logging will include it. The attribute is named for file inputs, but it can return any value that is meaningful to your users.
position (optional)
If you define this, the standard logging includes it with error or informational messages. It can be any value that helps users locate the correct place to troubleshoot.
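If your source is file based, the pair might be defined like this. The types are only suggestions - use whatever makes sense for your data.

# Optional attributes on the consuming class. Standard logging picks
# them up if they exist.
has 'path' => (
    is  => 'rw',
    isa => 'Str',
);

has 'position' => (
    is  => 'rw',
    isa => 'Str',
);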
run (required)
You define this method in the consuming class. It should open the file, read each record, call "record" in ETL::Pipeline after each record, and close the file. This method is the workhorse. It defines the main ETL loop. "record" in ETL::Pipeline acts as a callback.
I say file. It really means input source - whatever that might be.
Some important things to remember about run...
- run receives one parameter - the ETL::Pipeline object.
- run should include all the code to open, read, and close the input source.
- After reading a record, call "record" in ETL::Pipeline.
If your code encounters an error, run can call "status" in ETL::Pipeline with the error message. "status" in ETL::Pipeline should automatically include the record count with the error message. You should add any other troubleshooting information such as file names or key fields.
$etl->status( "ERROR", "Error message here for id $id" );
For fatal errors, I recommend using the croak command from Carp.
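A sketch of both styles inside run - the file attribute and the ID field here are hypothetical:

use Carp qw(croak);

# Fatal - without the source there is nothing to read, so stop the pipeline.
open( my $handle, '<', $self->file )
    or croak 'Unable to open ' . $self->file . ": $!";

# Recoverable - report it and keep processing.
$etl->status( 'ERROR', "Missing required ID field" )
    unless defined $record->{ID};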
source
The location in the input source of the current record. For example, for files this would be the file name and character position. The consuming class can set this value in its run method.
Logging uses this when displaying errors or informational messages. The value should be something that helps the user troubleshoot issues. It can be whatever is appropriate for the input source.
NOTE: Don't capitalize the first letter unless it's supposed to be capitalized. Logging will uppercase the first letter when appropriate.
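Inside run, the consuming class might keep source up to date like this. It assumes the attribute is writable; the file attribute and variable names are placeholders.

# Update "source" as each record is read so that log messages point at
# the right place. The format is entirely up to the input source.
my $count = 0;
while (my $line = <$handle>) {
    $count++;
    $self->source( sprintf( '%s, record %d', $self->file, $count ) );
    $etl->record( { raw => $line } );
}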
SEE ALSO
ETL::Pipeline, ETL::Pipeline::Input::File, ETL::Pipeline::Output
AUTHOR
Robert Wohlfarth <robert.j.wohlfarth@vumc.org>
LICENSE
Copyright 2021 (c) Vanderbilt University Medical Center
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.