ABOUT ETLp
ETLp is a configuration-driven framework. Processing is defined within the context of a pipeline job, and a job may in turn call another job. All processing is logged and is viewable from the Runtime Audit Browser. Any errors or warnings are automatically emailed, so developers do not need to perform basic housekeeping tasks.
There are two kinds of ETLp jobs:
A serial job performs all of its defined tasks in order just once
An iterative job performs all of its tasks once for each file that is processed
For example, a serial job might:
FTP a tarball from a remote site.
Untar the retrieved file.
Call an iterative job to process the files.
The called iterative job might then loop through each file and:
Validate its content.
Load the file into a database.
Compress the processed file.
DEFINING ETLp JOBS
A job is defined in a section within an ''application'' configuration file. Jobs are specified using Apache-style configuration syntax.
Each job consists of three phases: pre_process, process and post_process, although all are optional. Each phase consists of one or more task items that perform the work. The phases are conceptual, and they make no physical difference to the way that the pipeline functions - they simply help organise or structure the processing.
An item is the unit that defines the actual work. All items have a type which informs ETLp of the plugin that should be used to perform the task.
ETLp comes with a number of plugins that perform common tasks. Additional functionality can be added through custom plugins if required.
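Putting these pieces together, the overall shape of a job definition is as follows (a minimal sketch; the job name, item names and commands are placeholders, and the os plugin is used purely for illustration):

<my_job>
type = serial
<pre_process>
    <item>
        name = announce start
        type = os
        command = echo "starting"
    </item>
</pre_process>
<process>
    <item>
        name = do the work
        type = os
        command = echo "working"
    </item>
</process>
<post_process>
    <item>
        name = announce finish
        type = os
        command = echo "finished"
    </item>
</post_process>
</my_job>

Each phase may contain any number of items, and any phase may be omitted entirely.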
Control Files
Control files are used to define the format of data files. These definitions are used to validate files and also by the csv loader to map file fields to database columns.
Configuring the environment
ETLp expects a standard application directory setup. All environment variables are defined in the env.conf file. Further information can be found in the section "Creating Your Application Structure" in ETLp::Manual::Install.
Executing Jobs
A job is started by calling etlp <configuration file name> <section>. The configuration file should have an extension of "conf", but the extension does not need to be specified when invoking etlp, as it is assumed. For example:
etlp sales load_files
Defining Jobs
All jobs are defined using the Apache configuration syntax.
Iterative Jobs
All iterative jobs must contain some or all of the following settings within the config section:
filename_format. A regular expression that will identify the data files to be processed
incoming_dir. The directory where the job will find files to process
archive_dir. Where successfully processed files will be written to
fail_dir. Where files that fail processing will be moved to
table_name. The name of the table that the data will be loaded into
controlfile_dir. The directory that hosts the control file
controlfile. The name of the file that defines the data file format
on_error. What to do when an error occurs
Note: The on_error value can be one of the following:
die. Raise an error and then exit.
skip. Raise an error, stop processing the current file, and move to the next file.
ignore. Report the error, but carry on processing.
If no value for on_error is specified, the default is die. In addition, the on_error parameter can be set within an item, and the item-level setting will override the job-level setting for that item.
These config parameters may be used by the plugins, but any number of configuration settings can be defined and then referenced in the item sections.
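For example, a job whose config section sets on_error = die could still tolerate failures in a non-critical step by overriding the setting at the item level (the item shown is illustrative):

<item>
    name = compress processed file
    type = gzip
    on_error = ignore
</item>

With this setting, a compression failure is reported but the file still carries on through the rest of the pipeline.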
An example iterative job might be:
<process_customers>
type = iterative
<config>
    filename_format = (customer\d.csv)(?:\.gz)?$
    incoming_dir = data/incoming
    archive_dir = data/archive
    fail_dir = data/fail
    table_name = stg_customer
    controlfile_dir = conf/control
    controlfile = customer.ctl
    on_error = die
</config>
<pre_process>
    <item>
        name = decompress customer file
        type = gunzip
    </item>
    <item>
        name = validate customer file
        type = validate
        file_type = csv
        skip = 1
        <csv_options>
            allow_whitespace = 1
            sep_char = |
        </csv_options>
    </item>
</pre_process>
<process>
    <item>
        name = load customer file
        type = csv_loader
        skip = 1
        <csv_options>
            allow_whitespace = 1
            sep_char = |
        </csv_options>
    </item>
</process>
<post_process>
    <item>
        name = compress customer file
        type = gzip
    </item>
</post_process>
</process_customers>
Iterative Plugins
The standard plugins for iterative processing are:
csv_loader. Loads delimited files.
gunzip. Gunzips files.
gzip. Gzips files.
sql_loader. Load files using Oracle SQL*Loader.
os. Run an Operating System Command.
perl. Call a perl subroutine.
plsql. Invoke an Oracle stored procedure or packaged procedure.
steady_state_check. Wait for a file's size to stop changing.
validate. Validate a delimited file against the control file definition.
Additional plugins can be added. See Writing Custom Plugins
Serial Jobs
Typically, serial jobs don't require a config section, but may include one if there is a need to share settings or parameter values across multiple items. Below is an example serial job:
<get_customer_file>
type = serial
<config>
    next = customer load_files
</config>
<process>
    <item>
        name = get customer files
        type = os
        command = %app_root%/bin/ftp_customer_file.pl
    </item>
    <item>
        name = Untar customer files
        type = os
        command = tar zxf %app_root%/data/incoming/customer.tar.gz
    </item>
</process>
</get_customer_file>
Note: The on_error value can be set in the config section:
die. Raise an error and then exit.
ignore. Report the error, but carry on processing.
Unlike iterative jobs, serial jobs have no concept of "skip," since it does not make sense within this context. The default is ''die''.
Serial Plugins
The standard plugins for serial processing are:
os. Run an Operating System Command
perl. Call a perl subroutine
plsql. Invoke an Oracle stored procedure or packaged procedure
watch. File watcher. Watch a directory, waiting for the appearance of one or more files that match the specified pattern
Additional plugins can be added. See Writing Custom Plugins
USING CONFIGURATION PARAMETERS
The value of any parameter specified in the config section can be referenced by enclosing the parameter name within percent signs. For example:
mv %basename(filename)% %archive_dir%
Default parameters
The following parameters are maintained by the ETLp framework itself:
app_root. The application root directory.
filename. The current file being processed. Applicable to iterative jobs only.
basename(filename). The current file being processed, stripped of the path information. Applicable to iterative jobs only.
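For example, an os item in an iterative job could copy the file currently being processed to the archive directory using these parameters (the item name is illustrative):

<item>
    name = archive current file
    type = os
    command = cp %filename% %archive_dir%/%basename(filename)%
</item>

Here %filename% expands to the full path of the current file, and %basename(filename)% to its name with the path stripped.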
Environment Configuration Settings
If ''allow_env_vars'' is set in the environment configuration file (env.conf), then values in env.conf can be referred to by their parameter names. This can be useful if there is sensitive information that should not be placed in the application configuration files - for instance, the application configuration files are committed to a source control system, but you need passwords to connect to other systems as part of the ETL processing.
However, allowing applications to access environment parameters means that all settings are accessible to all plugins, including custom ones. This may be considered a security issue, which is why these parameters are not accessible by default.
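For example, if env.conf contains entries like the following (a hypothetical sketch, assuming the same key = value syntax as the application configuration; the ftp_password parameter is illustrative):

allow_env_vars = 1
ftp_password = s3cret

then a job item could reference the password without it ever appearing in the application configuration file (the --password option is assumed to be supported by the hypothetical ftp script):

<item>
    name = get customer files
    type = os
    command = %app_root%/bin/ftp_customer_file.pl --password %ftp_password%
</item>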
CHAINING JOBS
Use the ''next'' config parameter to call another job upon completion of the current job:
<my_current_job>
    <config>
        next = <<config name>> <<section name>>
    </config>
</my_current_job>
Note that the next job will run even if the current job dies with errors.