NAME

ETL::Pipeline::Input::Excel - Input source for Microsoft Excel spreadsheets

SYNOPSIS

use ETL::Pipeline;
ETL::Pipeline->new( {
  input   => ['Excel', matching => qr/\.xlsx$/i],
  mapping => {First => 'A', Second => qr/ID\s*Num/i},
  output  => ['UnitTest']
} )->process;

DESCRIPTION

ETL::Pipeline::Input::Excel defines an input source for reading MS Excel spreadsheets. It uses Spreadsheet::XLSX or Spreadsheet::ParseExcel, depending on the file type (XLSX or XLS).

METHODS & ATTRIBUTES

Arguments for "input" in ETL::Pipeline

ETL::Pipeline::Input::DelimitedText implements ETL::Pipeline::Input::File and ETL::Pipeline::Input::TabularFile. It supports all of the attributes from these roles.

return_blank_rows

Spreadsheet::XLSX or Spreadsheet::ParseExcel can't identify merged rows. Merged rows simply appear as blanks. So by default, ETL::Pipeline::Input::Excel skips over blank rows. Merged rows look like one record. When counting headers, do not count empty rows.

This boolean attribute overrides the default behaviour. ETL::Pipeline::Input::Excel returns blank rows as an empty record.

worksheet

worksheet reads data from a specific worksheet. By default, ETL::Pipeline::Input::Excel uses the first worksheet.

worksheet accepts a string or regular expression. As a string, worksheet looks for an exact match. As a regular expression, worksheet finds the first worksheet whose name matches the regular expression. Note that worksheet stops looking once it finds the first mach.

ETL::Pipeline::Input::Excel throws an error if it cannot find a worksheet with a matching name.

password

password works with encrypted files. ETL::Pipeline::Input::Excel decrypts the file automatically.

Warning: password only works with Excel 2003 file (XLS). Encrypted XLSX files always fail. Spreadsheet::XLSX does not support encryption.

Called from "process" in ETL::Pipeline

get

get retrieves one field from the current record. get accepts one parameter. That parameter can be an index number, a column name, or a regular expression to match against column names.

$etl->get( 0 );
$etl->get( 'A' );
$etl->get( 'First' );
$etl->get( qr/\bfirst\b/i );

next_record

Read one record from the file for processing. next_record returns a boolean. True means success. False means it reached the end of the file.

while ($input->next_record) {
  ...
}

get_column_names

get_column_names reads the field names from the first row in the file. "get" can match field names using regular expressions.

configure

configure opens the MS Excel spread sheet for reading. It creates the correct worksheet object for XLS versus XLSX. XLS and XLSX files are different formats. ETL::Pipeline::Input::Excel uses the correct module for this specific file.

Both Excel parsers use coulmn numbers, starting with zero. configure automatically creates aliases for the column letters.

finish

finish closes the file.

Other Methods & Attributes

row

row is the next row in the spreadsheet for reading. Because ETL::Pipeline::Input::Excel skips blank rows, the "record_number" may not match the row number.

tab

tab holds the current worksheet object. The Excel parsers return an object for the tab (worksheet) with the data. It is set by "find_worksheet".

SEE ALSO

ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::File, ETL::Pipeline::Input::Tabular

AUTHOR

Robert Wohlfarth <robert.j.wohlfarth@vanderbilt.edu>

LICENSE

Copyright 2016 (c) Vanderbilt University Medical Center

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.