NAME
ETL::Pipeline::Input::Tabular - Sequential input in rows and columns
SYNOPSIS
# In the input source...
use Moose;
with 'ETL::Pipeline::Input::Tabular';
...
DESCRIPTION
ETL::Pipeline::Input::Tabular provides a common interface for input sources whose data is arranged in rows and columns. Spreadsheets and CSV files are considered tabular.
While ETL::Pipeline::Input::Tabular works with any sequential input source, sources built on ETL::Pipeline::Input::File are the most common.
METHODS & ATTRIBUTES
Arguments for "input" in ETL::Pipeline
no_column_names
By default, ETL::Pipeline::Input::Tabular assumes that the first data row holds column names (headers), not real data. If your data does not have column names, set this boolean flag to true.
$etl->input( 'Excel', no_column_names => 1 );
skipping
skipping jumps over a certain number of records at the beginning of the file. Report formats often contain extra headers - even before the column names. skipping ignores those and starts processing at the data.
skipping accepts either an integer or a code reference. An integer represents the number of rows/records to ignore. With a code reference, records are discarded until the subroutine returns a true value.
# Bypass the first three rows.
$etl->input( 'Excel', skipping => 3 );
# Bypass until we find something in column 'C'.
$etl->input( 'Excel', skipping => sub { hascontent( $_->get( 'C' ) ) } );
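The two forms of skipping can be pictured with a small stand-alone sketch. It does not use ETL::Pipeline at all: apply_skipping is a hypothetical helper, the rows are plain array references, and $_ is set to the row rather than to the input object. It also assumes that the record for which the subroutine returns true is kept as the first record processed.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of ETL::Pipeline: returns the records that
# remain after applying a "skipping" rule to an in-memory list of rows.
sub apply_skipping {
    my ($skipping, @records) = @_;

    if (ref( $skipping ) eq 'CODE') {
        # Discard records until the subroutine returns a true value.
        # Assumption: the record that passes the test is kept.
        while (@records) {
            local $_ = $records[0];
            last if $skipping->( $_ );
            shift @records;
        }
    } else {
        # Integer: drop that many records from the front.
        my $count = $skipping // 0;
        $count = @records if $count > @records;
        splice( @records, 0, $count );
    }
    return @records;
}

my @rows = (
    ['Monthly report', ''     ],
    ['',               ''     ],
    ['Name',           'Total'],
    ['Widget',         '12'   ],
);

# Skip a fixed number of rows.
my @by_count = apply_skipping( 2, @rows );

# Skip until the second field has something in it.
my @by_code = apply_skipping( sub { $_->[1] ne '' }, @rows );

printf "%s / %s\n", $by_count[0][0], $by_code[0][0];
```

Both calls leave the 'Name' row at the front. The code reference form is the more robust choice when the number of leading report lines varies from file to file.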
Other Methods & Attributes
get_column_names
This method reads the column name row, parses it, and sets "column_names". ETL::Pipeline::Input::Tabular knows nothing about the internal storage of individual records. It relies on the implementing class for that ability. That's where get_column_names comes into play.
get_column_names should call "add_column" for each column name.
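sub get_column_names {
    my ($self) = @_;

    # Read the row that holds the column names.
    $self->next_record;

    # Loop through all of the fields, calling add_column once for each.
    # $value is the column name and $field is the field name...
    $self->add_column( $value, $field );
}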
column_names
column_names holds a list of the column names as read from the file. The list is kept in file order. Duplicate names are allowed. column_names is filled when "get_column_names" calls the "add_column" method.
When "mapping" in ETL::Pipeline calls "get" in ETL::Pipeline::Input, this role intercepts the call. The role translates column names or regular expressions into actual field names. "get" in ETL::Pipeline::Input returns a list of values from all fields that match.
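The translation from column names to field names can be sketched in plain Perl. This is only an illustration of the idea, not the role's actual code: lookup, @columns, and %record are hypothetical names, plain strings are compared exactly (the real matching rules may differ), and a record is modelled as a hash keyed by field name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the role's lookup. $columns is a list of
# [column name, field name] pairs kept in file order.
sub lookup {
    my ($columns, $record, $wanted) = @_;

    my @values;
    foreach my $column (@$columns) {
        my ($name, $field) = @$column;

        # A compiled regular expression is matched against the column name;
        # anything else is compared as an exact string (an assumption).
        my $matched = ref( $wanted ) eq 'Regexp'
            ? $name =~ $wanted
            : $name eq $wanted;

        push @values, $record->{$field} if $matched;
    }
    return @values;
}

# Duplicate column names are allowed, so 'Phone' maps to two fields.
my @columns = (['Name', 'A'], ['Phone', 'B'], ['Phone', 'C']);
my %record  = (A => 'Ada', B => '555-0100', C => '555-0199');

my @phones   = lookup( \@columns, \%record, 'Phone' );
my @by_regex = lookup( \@columns, \%record, qr/^na/i );

print join( ', ', @phones ), "\n";
print join( ', ', @by_regex ), "\n";
```

Because duplicate names are allowed, a single column name can yield several values. That is why the lookup returns a list instead of a scalar.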
add_column
"get_column_names" calls this method once for every column name. add_column puts the column name into "column_names".
"get_column_names" passes in the column name as the first parameter and the field name as the second. The field name is optional. "get" in ETL::Pipeline::Input will use the "column_names" index as the field name by default.
# Add column names for fields 0 and 1. No field name means that "get" uses
# the index numbers - 0 and 1.
$self->add_column( 'First' );
$self->add_column( 'Second' );
# Add column names for fields 'A' and 'B'. Always pass the field name if
# it's a string.
$self->add_column( 'First', 'A' );
$self->add_column( 'Second', 'B' );
Note: add_column trims leading and trailing whitespace from column names.
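As a rough model of the behaviour described above, the sketch below stores columns in a plain array. It is a stand-alone function, not the role's method: the storage layout of [name, field] pairs is an assumption made for illustration, and the function takes no $self.

```perl
use strict;
use warnings;

# Hypothetical storage: [column name, field name] pairs in file order.
my @column_names;

# Stand-alone model of add_column. The real method is called on $self.
sub add_column {
    my ($name, $field) = @_;

    # Trim leading and trailing whitespace from the column name.
    $name =~ s/^\s+|\s+$//g;

    # With no field name, fall back to the position in the list.
    $field = scalar( @column_names ) unless defined $field;

    push @column_names, [$name, $field];
    return;
}

add_column( '  First ' );       # stored as 'First', field 0
add_column( 'Second', 'B' );    # explicit field name 'B'
add_column( 'First' );          # duplicate name, field 2

print join( ', ', map { "$_->[0]=$_->[1]" } @column_names ), "\n";
```

In this model a column added without a field name takes its index from the number of columns already stored, which keeps index-based fields aligned with file order.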
reset_column_names
This method wipes out the existing column names. It can be used from "get_column_names".
$self->reset_column_names;
SEE ALSO
ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::File
AUTHOR
Robert Wohlfarth <robert.j.wohlfarth@vanderbilt.edu>
LICENSE
Copyright 2016 (c) Vanderbilt University Medical Center
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.