NAME

ETL::Pipeline::Input::File - Role for file based input sources

SYNOPSIS

# In the input source...
use Moose;
with 'ETL::Pipeline::Input::File';
...

# In the ETL::Pipeline script...
ETL::Pipeline->new( {
  work_in   => {search => 'C:\Data', find => qr/Ficticious/},
  input     => ['Excel', matching => qr/\.xlsx?$/          ],
  mapping   => {Name => 'A', Address => 'B', ID => 'C'     },
  constants => {Type => 1, Information => 'Demographic'    },
  output    => ['SQL', table => 'NewData'                  ],
} )->process;

# Or with a specific file...
ETL::Pipeline->new( {
  work_in   => {search => 'C:\Data', find => qr/Ficticious/},
  input     => ['Excel', file => 'ExportedData.xlsx'       ],
  mapping   => {Name => 'A', Address => 'B', ID => 'C'     },
  constants => {Type => 1, Information => 'Demographic'    },
  output    => ['SQL', table => 'NewData'                  ],
} )->process;

DESCRIPTION

ETL::Pipeline::Input::File provides methods and attributes common to file-based input sources. It makes file searches available for any file format. With ETL::Pipeline::Input::File, you can...

Specify the exact path to the file.
Or search the file system for a matching name.

For setting an exact path, see the "file" attribute. For searches, see the "matching" attribute.

File vs. DataFile

ETL::Pipeline::Input::DataFile extends ETL::Pipeline::Input::File. This role, ETL::Pipeline::Input::File, makes no assumptions about the file format. It works with CSV text files, MS Access databases, spreadsheets, XML, or any other format found on disk.

ETL::Pipeline::Input::DataFile assumes that each record is stored on one row and that the data is divided into fields (columns).

METHODS & ATTRIBUTES

Arguments for "input" in ETL::Pipeline

matching

matching locates the first file that matches the given pattern and stores it in "file". The pattern can be a glob or a regular expression. Search patterns are case-insensitive.

# Search using a regular expression...
$etl->input( 'Excel', matching => qr/\.xlsx$/i );

# Search using a file glob...
$etl->input( 'Excel', matching => '*.xlsx' );
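
As a rough illustration of how a glob and a regular expression can select the same file without regard to case, here is a minimal sketch. It is not the module's actual implementation. The helpers glob_to_regex and first_match are hypothetical names, and the sketch assumes that only "*" and "?" act as wildcards.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a simple file glob into a case-insensitive regex.
# Assumption: only "*" and "?" are wildcards; everything else is literal.
sub glob_to_regex {
    my ($glob) = @_;
    my $pattern = quotemeta $glob;   # escape every special character
    $pattern =~ s/\\\*/.*/g;         # escaped "*" becomes "any characters"
    $pattern =~ s/\\\?/./g;          # escaped "?" becomes "any one character"
    return qr/\A$pattern\z/i;
}

# Hypothetical helper: return the first name that matches, or nothing.
sub first_match {
    my ($pattern, @names) = @_;
    my $regex = ref($pattern) eq 'Regexp' ? $pattern : glob_to_regex($pattern);
    foreach my $name (@names) {
        return $name if $name =~ $regex;
    }
    return;
}

my @names = ('notes.txt', 'Export.XLSX', 'older.xlsx');
print first_match( '*.xlsx',     @names ), "\n";   # glob form
print first_match( qr/\.xlsx$/i, @names ), "\n";   # regular expression form
```

Both calls print Export.XLSX, because it is the first name in the list that matches either pattern once case is ignored.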

For unusual cases, matching also accepts a code reference. matching executes the subroutine against each file name and sets "file" to the first file for which the subroutine returns a true value.

matching passes two parameters into the subroutine...

The ETL::Pipeline object
The Path::Class::File object

# File larger than 2K...
$etl->input( 'Excel', matching => sub {
  my ($etl, $file) = @_;
  return (!$file->is_dir && $file->stat->size > 2048 ? 1 : 0);
} );

matching searches inside the directory named by "data_in" in ETL::Pipeline.
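
The "first true value wins" behavior of a code reference can be sketched on its own, using only core modules. This is an illustration under stated assumptions, not the role's real search code. The name first_accepted is hypothetical, plain path strings stand in for Path::Class::File objects, and an empty hash stands in for the ETL::Pipeline object.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-in for the search: call the subroutine with the pipeline
# and each candidate, and return the first candidate that it accepts.
sub first_accepted {
    my ($callback, $pipeline, @candidates) = @_;
    foreach my $candidate (@candidates) {
        return $candidate if $callback->( $pipeline, $candidate );
    }
    return;
}

# Write a file of the given size so the example has something to search.
sub write_bytes {
    my ($path, $count) = @_;
    open my $handle, '>', $path or die "Cannot write $path: $!";
    binmode $handle;
    print {$handle} 'x' x $count;
    close $handle or die "Cannot close $path: $!";
    return;
}

# Build a throwaway directory with one small and one large file.
my $dir   = tempdir( CLEANUP => 1 );
my $small = File::Spec->catfile( $dir, 'small.dat' );
my $large = File::Spec->catfile( $dir, 'large.dat' );
write_bytes( $small, 100 );
write_bytes( $large, 4096 );

# Assumption: candidates are plain strings here, so the size check uses the
# -s file test instead of a method call on a Path::Class::File object.
my $larger_than_2k = sub {
    my ($pipeline, $path) = @_;
    return ( -f $path && ( -s $path ) > 2048 ) ? 1 : 0;
};

my $found = first_accepted( $larger_than_2k, {}, $small, $large );
print "$found\n";
```

The small file is rejected and the large file is returned, which mirrors how "file" would end up pointing at the first accepted candidate.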

file

file holds a Path::Class::File object pointing to the input file. If "input" in ETL::Pipeline does not set file, then the "matching" attribute searches the file system for a match. If "input" in ETL::Pipeline sets file, then "matching" is ignored.

file is relative to "data_in" in ETL::Pipeline, unless you set it to an absolute path name. With "matching", the search is always limited to "data_in" in ETL::Pipeline.

# File inside of "data_in"...
$etl->input( 'Excel', file => 'Data.xlsx' );

# Absolute path name...
$etl->input( 'Excel', file => 'C:\Data.xlsx' );
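
The relative versus absolute rule can be sketched with the core File::Spec module. This is only an illustration of the rule described above, not the role's actual code. The helper name resolve_input_file and the example directory are hypothetical.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: an absolute name is used as is; anything else is
# resolved inside the data_in directory.
sub resolve_input_file {
    my ($data_in, $file) = @_;
    return $file if File::Spec->file_name_is_absolute($file);
    return File::Spec->catfile( $data_in, $file );
}

# Example directory, built portably from the file system root.
my $data_in  = File::Spec->catdir( File::Spec->rootdir, 'Data' );
my $absolute = File::Spec->catfile( File::Spec->rootdir, 'Elsewhere', 'Data.xlsx' );

print resolve_input_file( $data_in, 'Data.xlsx' ), "\n";   # lands inside data_in
print resolve_input_file( $data_in, $absolute ),   "\n";   # returned unchanged
```

The exact strings printed depend on the platform's path separator, so none are shown here.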

SEE ALSO

ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::TabularFile

AUTHOR

Robert Wohlfarth <robert.j.wohlfarth@vanderbilt.edu>

LICENSE

Copyright 2016 (c) Vanderbilt University Medical Center

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.