NAME

ETL::Pipeline::Input::Tabular - Sequential input in rows and columns

SYNOPSIS

# In the input source...
use Moose;
with 'ETL::Pipeline::Input::Tabular';
...

DESCRIPTION

ETL::Pipeline::Input::Tabular provides a common interface for input sources whose data is arranged in rows and columns. Spreadsheets and CSV files are considered tabular.

While ETL::Pipeline::Input::Tabular works with any sequential input source, it is most commonly combined with file based sources such as ETL::Pipeline::Input::File.

METHODS & ATTRIBUTES

Arguments for "input" in ETL::Pipeline

no_column_names

By default, ETL::Pipeline::Input::Tabular assumes that the first data row has column names (headers) and not real data. If your data does not have column names, set this boolean flag to true.

$etl->input( 'Excel', no_column_names => 1 );

skipping

skipping jumps over a certain number of records at the beginning of the file. Report formats often contain extra header lines, sometimes even before the column names. skipping ignores those and starts processing at the data.

skipping accepts either an integer or a code reference. An integer is the number of rows/records to ignore. With a code reference, records are discarded until the subroutine returns a true value.

# Bypass the first three rows.
$etl->input( 'Excel', skipping => 3 );

# Bypass until we find something in column 'C'.
$etl->input( 'Excel', skipping => sub { hascontent( $_->get( 'C' ) ) } );
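The two forms of skipping can be illustrated with a minimal, standalone sketch. skip_records and the sample rows are hypothetical and not part of ETL::Pipeline. Here $_ holds a plain array reference rather than the input source object used in the example above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of ETL::Pipeline: drops leading records
# from a list according to a "skipping" style argument.
sub skip_records {
  my ($skipping, @records) = @_;

  if (ref( $skipping ) eq 'CODE') {
    # Code reference: discard records until the subroutine returns true.
    # The current record is made available in $_.
    while (@records) {
      local $_ = $records[0];
      last if $skipping->();
      shift @records;
    }
  } else {
    # Integer: discard that many records from the front.
    my $count = $skipping < @records ? $skipping : scalar( @records );
    splice @records, 0, $count;
  }
  return @records;
}

my @rows = (
  ['Quarterly report', ''],
  ['',                 ''],
  ['Name',             'Total'],
  ['Widgets',          '12'],
);

# Bypass the first two rows.
my @by_count = skip_records( 2, @rows );

# Bypass until the second field has content.
my @by_code = skip_records( sub { length $_->[1] }, @rows );

print scalar( @by_count ), " rows left, starting at '$by_count[0][0]'\n";
print scalar( @by_code ), " rows left, starting at '$by_code[0][0]'\n";
```

Both calls leave the same two rows here. The code reference form is useful when the number of extra header lines varies from file to file.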

Other Methods & Attributes

get_column_names

This method reads the column name row, parses it, and sets "column_names". ETL::Pipeline::Input::Tabular knows nothing about the internal storage of individual records, so it relies on the implementing class for that ability. That's where get_column_names comes into play.

get_column_names should call "add_column" for each column name.

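sub get_column_names {
  my ($self) = @_;
  # Load the row that holds the column names.
  $self->next_record;
  # Loop through all of the fields...
    # $value is the column name, $field is the field it came from.
    $self->add_column( $value, $field );
}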
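As a fuller illustration, here is a minimal sketch of that pattern using a mock class. My::ArrayInput, its constructor, and its hash keys are hypothetical. The class only imitates next_record and add_column so that the example runs without ETL::Pipeline installed. It assumes each record is stored as an array reference, so the field name is the array index.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an input source. It mocks next_record and
# add_column so the example runs without ETL::Pipeline installed.
package My::ArrayInput;

sub new {
  my ($class, @rows) = @_;
  return bless { rows => [@rows], current => undef, columns => [] }, $class;
}

# Load the next row into memory. Returns false when the data runs out.
sub next_record {
  my ($self) = @_;
  $self->{current} = shift @{$self->{rows}};
  return defined $self->{current};
}

# Record a column name along with the field that holds its data.
sub add_column {
  my ($self, $name, $field) = @_;
  push @{$self->{columns}}, [$name, $field];
  return;
}

# Read the header row and register every field in it.
sub get_column_names {
  my ($self) = @_;
  $self->next_record or return;
  my $row = $self->{current};
  foreach my $field (0 .. $#$row) {
    $self->add_column( $row->[$field], $field );
  }
  return;
}

package main;

my $input = My::ArrayInput->new( ['Name', 'Total'], ['Widgets', 12] );
$input->get_column_names;
print join( ', ', map { "$_->[0] => field $_->[1]" } @{$input->{columns}} ), "\n";
```

After the call, the header row has been consumed and the next record read is the first row of real data.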

column_names

column_names holds a list of the column names as read from the file. The list is kept in file order. Duplicate names are allowed. column_names is filled when "get_column_names" calls the "add_column" method.

When "mapping" in ETL::Pipeline calls "get" in ETL::Pipeline::Input, this role intercepts the call. The role translates column names or regular expressions into actual field names. "get" in ETL::Pipeline::Input returns a list of values from all fields that match.
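The translation step might look something like the following standalone sketch. The data layout and the fields_for helper are hypothetical. Matching plain strings by exact comparison is also an assumption, since the text above does not say how strings are compared.

```perl
use strict;
use warnings;

# Hypothetical stand-in for "column_names": one entry per column, in file
# order. Duplicate names are allowed.
my @column_names = (
  { name => 'Name',  field => 'A' },
  { name => 'Total', field => 'B' },
  { name => 'Total', field => 'C' },
);

# Hypothetical helper: translate a column name or regular expression into
# the field names of every matching column.
sub fields_for {
  my ($lookup) = @_;
  my @matches;
  if (ref( $lookup ) eq 'Regexp') {
    @matches = grep { $_->{name} =~ $lookup } @column_names;
  } else {
    # Assumption: plain strings must match the column name exactly.
    @matches = grep { $_->{name} eq $lookup } @column_names;
  }
  return map { $_->{field} } @matches;
}

# One record, keyed by field name.
my %record = (A => 'Widgets', B => 12, C => 15);

# A duplicated column name yields one value per matching field.
my @totals = map { $record{$_} } fields_for( 'Total' );
my @names  = map { $record{$_} } fields_for( qr/^na/i );

print "Total: @totals\n";
print "Name: @names\n";
```

Because duplicate names are allowed, a single lookup can return several values, which is why the result is a list.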

add_column

"get_column_names" calls this method once for every column name. add_column puts the column name into "column_names".

"get_column_names" passes in the column name as the first parameter and the field name as the second. The field name is optional. "get" in ETL::Pipeline::Input will use the "column_names" index as the field name by default.

# Add column names for fields 0 and 1. No field name means that "get" uses
# the index numbers - 0 and 1.
$self->add_column( 'First' );
$self->add_column( 'Second' );

# Add column names for fields 'A' and 'B'. Always pass the field name if
# it's a string.
$self->add_column( 'First', 'A' );
$self->add_column( 'Second', 'B' );

Note: add_column trims leading and trailing whitespace from column names.
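The behavior described in this section (whitespace trimming and the index as the default field name) can be sketched in plain Perl. This is an illustration only, not the role's actual code. The @column_names array and this function style add_column are hypothetical stand-ins for the attribute and the method.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the "column_names" attribute.
my @column_names;

# Hypothetical plain-Perl version of add_column. The real method lives in
# the role and is called on $self.
sub add_column {
  my ($name, $field) = @_;

  # Trim leading and trailing whitespace from the column name.
  $name =~ s/^\s+|\s+$//g;

  # No field name: fall back to the index this entry will occupy.
  $field = scalar( @column_names ) unless defined $field;

  push @column_names, { name => $name, field => $field };
  return;
}

add_column( '  First ' );
add_column( 'Second' );

printf "'%s' => field %s\n", $_->{name}, $_->{field} foreach @column_names;
```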

reset_column_names

This method wipes out the existing column names. It can be used from "get_column_names".

$self->reset_column_names;

SEE ALSO

ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::File

AUTHOR

Robert Wohlfarth <robert.j.wohlfarth@vanderbilt.edu>

LICENSE

Copyright 2016 (c) Vanderbilt University Medical Center

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.