NAME
ETL::Pipeline::Input::File - Role for file-based input sources
SYNOPSIS
  # In the input source...
  use Moose;
  with 'ETL::Pipeline::Input';
  with 'ETL::Pipeline::Input::File';
  ...

  # In the ETL::Pipeline script...
  ETL::Pipeline->new( {
    work_in   => {root => 'C:\Data', iname => qr/Ficticious/},
    input     => ['Excel', iname => qr/\.xlsx?$/ ],
    mapping   => {Name => 'A', Address => 'B', ID => 'C' },
    constants => {Type => 1, Information => 'Demographic' },
    output    => ['SQL', table => 'NewData' ],
  } )->process;

  # Or with a specific file...
  ETL::Pipeline->new( {
    work_in   => {root => 'C:\Data', iname => qr/Ficticious/},
    input     => ['Excel', iname => 'ExportedData.xlsx' ],
    mapping   => {Name => 'A', Address => 'B', ID => 'C' },
    constants => {Type => 1, Information => 'Demographic' },
    output    => ['SQL', table => 'NewData' ],
  } )->process;
DESCRIPTION
This role adds functionality and attributes common to all file-based input sources. It is a quick and easy way to create new sources that can search directories for their input, which is useful when the file name changes.
ETL::Pipeline::Input::File works with a single source file. To process an entire directory of files, use ETL::Pipeline::Input::File::List instead.
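A typical consumer of this role is an input source class that also consumes ETL::Pipeline::Input and reads from the matched file. The sketch below is illustrative only: the class name ETL::Pipeline::Input::PlainText is hypothetical, and it assumes the ETL::Pipeline::Input contract is satisfied by a "run" method that hands records back through the pipeline's "record" method. Check ETL::Pipeline::Input for the exact interface required by your version.
  package ETL::Pipeline::Input::PlainText;   # hypothetical example class
  use Moose;

  with 'ETL::Pipeline::Input';
  with 'ETL::Pipeline::Input::File';

  # Assumed contract: ETL::Pipeline calls "run" and receives records back
  # through $etl->record. Verify both names against ETL::Pipeline::Input.
  sub run {
    my ($self, $etl) = @_;

    # "path" comes from ETL::Pipeline::Input::File and holds the matched file.
    open my $io, '<', $self->path or die 'Cannot open ' . $self->path . ": $!";
    while (my $line = <$io>) {
      chomp $line;
      $etl->record( {text => $line} );
    }
    close $io;
  }

  no Moose;
  1;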
METHODS & ATTRIBUTES
Arguments for "input" in ETL::Pipeline
ETL::Pipeline::Input::File accepts any of the tests provided by Path::Iterator::Rule. The value of the argument is passed directly into the test. For boolean tests (e.g. readable, exists, etc.), pass an undef value.
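For example, a name test can be combined with a boolean test by giving the boolean test an undef value. The readable test comes from Path::Iterator::Rule; the combination below is only a sketch.
  # Match spreadsheets by name, but only if the file is readable.
  # Boolean tests such as "readable" take undef as their value.
  $etl->input( 'Excel', iname => qr/\.xlsx$/, readable => undef );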
ETL::Pipeline::Input::File automatically applies the file filter. Do not pass file through "input" in ETL::Pipeline.
iname
iname is the most common test that I use. It matches the file name, supports wildcards and regular expressions, and is case insensitive.
  # Search using a regular expression...
  $etl->input( 'Excel', iname => qr/\.xlsx$/ );

  # Search using a file glob...
  $etl->input( 'Excel', iname => '*.xlsx' );
The code throws an error if no files match the criteria. Only the first match is used. If you want to match more than one file, use ETL::Pipeline::Input::File::List instead.
path
Optional. When passed to "input" in ETL::Pipeline, this file becomes the input source. No search or matching is performed. If you specify a relative path, it is relative to "data_in".
Once the object has been created, this attribute holds the file that matched the search criteria. Your input source class should use it as the file name.
  # File inside of "data_in"...
  $etl->input( 'Excel', path => 'Data.xlsx' );

  # Absolute path name...
  $etl->input( 'Excel', path => 'C:\Data.xlsx' );

  # Inside the input source class...
  open my $io, '<', $self->path;
skipping
Optional. skipping jumps over a certain number of rows/lines at the beginning of the file. Report formats often contain extra headers - even before the column names. skipping ignores those and starts processing at the data.
Note: skipping is applied before reading column names.
skipping accepts either an integer or code reference. An integer represents the number of rows/records to ignore. For a code reference, the code discards records until the subroutine returns a true value.
  # Bypass the first three rows.
  $etl->input( 'Excel', skipping => 3 );

  # Bypass until we find something in column 'C'.
  $etl->input( 'Excel', skipping => sub { hascontent( $_->get( 'C' ) ) } );
The exact nature of the record depends on the input source. For example, Excel files send each data row as a hash, but a CSV file would send a single line of plain text with no parsing. See the input source to find out exactly what it sends.
If your input source implements skipping, you can pass whatever parameters you want to the code reference. For consistency, I recommend passing the raw data: if you are jumping over report headers, those rows may not be formatted like the data.
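For example, a skipping code reference for a plain text source might look at the raw line itself. This is only a sketch: it assumes the source passes the unparsed line in $_, and the DelimitedText source name is used purely for illustration.
  # Skip unformatted report headers until the real column headings appear
  # (assumes $_ holds the raw, unparsed line).
  $etl->input( 'DelimitedText', skipping => sub { $_ =~ m/^Name,Address,ID/ } );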
SEE ALSO
ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::File::List, Path::Iterator::Rule
AUTHOR
Robert Wohlfarth <robert.j.wohlfarth@vumc.org>
LICENSE
Copyright 2021 (c) Vanderbilt University Medical Center
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.