NAME
ETL::Pipeline::Input::Excel - Input source for Microsoft Excel spreadsheets
SYNOPSIS
use ETL::Pipeline;
ETL::Pipeline->new( {
input => ['Excel', matching => qr/\.xlsx$/i],
mapping => {First => 'A', Second => qr/ID\s*Num/i},
output => ['UnitTest']
} )->process;
DESCRIPTION
ETL::Pipeline::Input::Excel defines an input source for reading MS Excel spreadsheets. It uses Spreadsheet::XLSX or Spreadsheet::ParseExcel, depending on the file type (XLSX or XLS).
METHODS & ATTRIBUTES
Arguments for "input" in ETL::Pipeline
ETL::Pipeline::Input::DelimitedText implements ETL::Pipeline::Input::File and ETL::Pipeline::Input::TabularFile. It supports all of the attributes from these roles.
return_blank_rows
Spreadsheet::XLSX or Spreadsheet::ParseExcel can't identify merged rows. Merged rows simply appear as blanks. So by default, ETL::Pipeline::Input::Excel skips over blank rows. Merged rows look like one record. When counting headers, do not count empty rows.
This boolean attribute overrides the default behaviour. ETL::Pipeline::Input::Excel returns blank rows as an empty record.
worksheet
worksheet reads data from a specific worksheet. By default, ETL::Pipeline::Input::Excel uses the first worksheet.
worksheet accepts a string or regular expression. As a string, worksheet looks for an exact match. As a regular expression, worksheet finds the first worksheet whose name matches the regular expression. Note that worksheet stops looking once it finds the first mach.
ETL::Pipeline::Input::Excel throws an error if it cannot find a worksheet with a matching name.
password
password works with encrypted files. ETL::Pipeline::Input::Excel decrypts the file automatically.
Warning: password only works with Excel 2003 file (XLS). Encrypted XLSX files always fail. Spreadsheet::XLSX does not support encryption.
Called from "process" in ETL::Pipeline
get
get retrieves one field from the current record. get accepts one parameter. That parameter can be an index number, a column name, or a regular expression to match against column names.
$etl->get( 0 );
$etl->get( 'A' );
$etl->get( 'First' );
$etl->get( qr/\bfirst\b/i );
next_record
Read one record from the file for processing. next_record returns a boolean. True means success. False means it reached the end of the file.
while ($input->next_record) {
...
}
get_column_names
get_column_names reads the field names from the first row in the file. "get" can match field names using regular expressions.
configure
configure opens the MS Excel spread sheet for reading. It creates the correct worksheet object for XLS versus XLSX. XLS and XLSX files are different formats. ETL::Pipeline::Input::Excel uses the correct module for this specific file.
Both Excel parsers use coulmn numbers, starting with zero. configure automatically creates aliases for the column letters.
finish
finish closes the file.
Other Methods & Attributes
row
row is the next row in the spreadsheet for reading. Because ETL::Pipeline::Input::Excel skips blank rows, the "record_number" may not match the row number.
tab
tab holds the current worksheet object. The Excel parsers return an object for the tab (worksheet) with the data. It is set by "find_worksheet".
SEE ALSO
ETL::Pipeline, ETL::Pipeline::Input, ETL::Pipeline::Input::File, ETL::Pipeline::Input::Tabular
AUTHOR
Robert Wohlfarth <robert.j.wohlfarth@vanderbilt.edu>
LICENSE
Copyright 2016 (c) Vanderbilt University Medical Center
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.