NAME
ETL::Pipeline::Input::File - Role for file based input sources
SYNOPSIS
# In the input source...
use Moose;
with 'ETL::Pipeline::Input::File';
...
# In the ETL::Pipeline script...
ETL::Pipeline->new( {
  work_in => {search => 'C:\Data', find => qr/Ficticious/},
  input => ['Excel', matching => qr/\.xlsx?$/ ],
  mapping => {Name => 'A', Address => 'B', ID => 'C' },
  constants => {Type => 1, Information => 'Demographic' },
  output => ['SQL', table => 'NewData' ],
} )->process;
# Or with a specific file...
ETL::Pipeline->new( {
  work_in => {search => 'C:\Data', find => qr/Ficticious/},
  input => ['Excel', file => 'ExportedData.xlsx' ],
  mapping => {Name => 'A', Address => 'B', ID => 'C' },
  constants => {Type => 1, Information => 'Demographic' },
  output => ['SQL', table => 'NewData' ],
} )->process;
DESCRIPTION
ETL::Pipeline::Input::File provides methods and attributes common to file based input sources. It makes file searches available for any file format. With ETL::Pipeline::Input::File, you can either set an exact path or search for a file by pattern.
For setting an exact path, see the "file" attribute. For searches, see the "matching" attribute.
File vs. DataFile
ETL::Pipeline::Input::DataFile extends ETL::Pipeline::Input::File. This role, ETL::Pipeline::Input::File, makes no assumptions about the file format. It works with CSV text files, MS Access databases, spreadsheets, XML, or any other format found on disk.
ETL::Pipeline::Input::DataFile assumes that each record is stored on one row and that the data is divided into fields (columns).
METHODS & ATTRIBUTES
Arguments for "input" in ETL::Pipeline
matching
matching locates the first file that matches the given pattern and sets "file" to that file. The pattern can be a glob or a regular expression. Search patterns are case insensitive.
# Search using a regular expression...
$etl->input( 'Excel', matching => qr/\.xlsx$/i );
# Search using a file glob...
$etl->input( 'Excel', matching => '*.xlsx' );
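The two pattern forms can be pictured with a small, self-contained sketch. This is not the module's implementation: pattern_to_regex and first_match are hypothetical helpers, the glob handling covers only "*" and "?", and plain strings stand in for the files found on disk.

```perl
use strict;
use warnings;
use re qw(regexp_pattern);

# Hypothetical helper: turn a glob string or a compiled regular expression
# into a case insensitive regular expression.
sub pattern_to_regex {
    my ($pattern) = @_;

    if (ref( $pattern ) eq 'Regexp') {
        # A compiled regex keeps its own flags, so rebuild it with "i" added.
        my ($source, $flags) = regexp_pattern( $pattern );
        $flags .= 'i' unless $flags =~ /i/;
        my $combined = "(?${flags}:${source})";
        return qr/$combined/;
    }

    # Simplified glob translation: "*" is any run of characters, "?" is any
    # single character, and everything else matches literally.
    my $regex = join '', map {
          $_ eq '*' ? '.*'
        : $_ eq '?' ? '.'
        :             quotemeta( $_ )
    } split //, $pattern;
    return qr/^$regex$/i;
}

# Hypothetical helper: return the first name that matches, or nothing.
sub first_match {
    my ($pattern, @names) = @_;
    my $regex = pattern_to_regex( $pattern );
    foreach my $name (@names) {
        return $name if $name =~ $regex;
    }
    return;
}

my @names = ('Notes.txt', 'REPORT.XLSX', 'data.xlsx');
print scalar( first_match( qr/\.xlsx$/, @names ) ), "\n";  # prints REPORT.XLSX
print scalar( first_match( '*.xlsx', @names ) ), "\n";     # prints REPORT.XLSX
```

Both calls stop at the first candidate that matches, and the upper case extension still matches because the comparison ignores case.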
For very weird cases, matching also accepts a code reference. matching executes the subroutine against the file names. matching sets "file" to the first file where the subroutine returns a true value.
matching passes two parameters into the subroutine...
- The ETL::Pipeline object
- The Path::Class::File object
# File larger than 2K...
$etl->input( 'Excel', matching => sub {
  my ($etl, $file) = @_;
  # stat returns a File::stat object; size is in bytes.
  return (!$file->is_dir && $file->stat->size > 2048 ? 1 : 0);
} );
matching searches inside the directory named by "data_in" in ETL::Pipeline.
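The code reference form amounts to a "first accepted candidate wins" loop. The sketch below imitates that behaviour with core modules only. It is an illustration under assumptions, not the module's code: first_accepted is a hypothetical helper, an empty hash stands in for the ETL::Pipeline object, and plain path strings stand in for Path::Class::File objects.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-in for the search: call $code with a placeholder
# pipeline object and each path, and return the first path it accepts.
sub first_accepted {
    my ($etl, $code, @paths) = @_;
    foreach my $path (@paths) {
        return $path if $code->( $etl, $path );
    }
    return;
}

# Create two sample files of different sizes in a temporary directory.
my $dir = tempdir( CLEANUP => 1 );
my @paths;
foreach my $sample (['small.dat', 10], ['large.dat', 4096]) {
    my ($name, $size) = @$sample;
    my $path = File::Spec->catfile( $dir, $name );
    open my $handle, '>', $path or die "Cannot write $path: $!";
    binmode $handle;
    print {$handle} 'x' x $size;
    close $handle or die "Cannot close $path: $!";
    push @paths, $path;
}

# Same test as the 2K example, written with file test operators because
# the candidates here are plain strings.
my $found = first_accepted( {}, sub {
    my ($etl, $file) = @_;
    return (-f $file && (-s $file) > 2048) ? 1 : 0;
}, @paths );
print "$found\n";
```

The smaller file is offered first and rejected, so the search moves on and returns the larger one.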
file
file holds a Path::Class::File object pointing to the input file. If "input" in ETL::Pipeline does not set file, then the "matching" attribute searches the file system for a match. If "input" in ETL::Pipeline sets file, then "matching" is ignored.
file is relative to "data_in" in ETL::Pipeline, unless you set it to an absolute path name. With "matching", the search is always limited to "data_in" in ETL::Pipeline.
# File inside of "data_in"...
$etl->input( 'Excel', file => 'Data.xlsx' );
# Absolute path name...
$etl->input( 'Excel', file => 'C:\Data.xlsx' );
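The relative versus absolute rule can be sketched with File::Spec. This is a minimal illustration, assuming the directory and file name are plain strings. resolve_input is a hypothetical helper and the directory names are made up for the example. The module itself works with Path::Class objects.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: anchor a relative file name at $data_in and leave an
# absolute path name untouched.
sub resolve_input {
    my ($data_in, $file) = @_;
    return File::Spec->file_name_is_absolute( $file )
        ? $file
        : File::Spec->catfile( $data_in, $file );
}

# Build the sample paths with File::Spec so they suit the current platform.
my $data_in  = File::Spec->catdir( File::Spec->rootdir, 'Data' );
my $absolute = File::Spec->catfile( File::Spec->rootdir, 'Other', 'Data.xlsx' );

print resolve_input( $data_in, 'Data.xlsx' ), "\n";  # lands inside $data_in
print resolve_input( $data_in, $absolute ), "\n";    # returned unchanged
```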
SEE ALSO
ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::TabularFile
AUTHOR
Robert Wohlfarth <robert.j.wohlfarth@vanderbilt.edu>
LICENSE
Copyright 2016 (c) Vanderbilt University Medical Center
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.