NAME
Spreadsheet::Reader::ExcelXML - Read xlsx/xlsm/xml extention Excel files
SYNOPSIS
The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder of the package
#!/usr/bin/env perl
use strict;
use warnings;
use Spreadsheet::Reader::ExcelXML;
my $parser = Spreadsheet::Reader::ExcelXML->new();
my $workbook = $parser->parse( 'TestBook.xlsx' );
if ( !defined $workbook ) {
die $parser->error(), "\n";
}
for my $worksheet ( $workbook->worksheets() ) {
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();
for my $row ( $row_min .. $row_max ) {
for my $col ( $col_min .. $col_max ) {
my $cell = $worksheet->get_cell( $row, $col );
next unless $cell;
print "Row, Col = ($row, $col)\n";
print "Value = ", $cell->value(), "\n";
print "Unformatted = ", $cell->unformatted(), "\n";
print "\n";
}
}
last;# In order not to read all sheets
}
###########################
# SYNOPSIS Screen Output
# 01: Row, Col = (0, 0)
# 02: Value = Category
# 03: Unformatted = Category
# 04:
# 05: Row, Col = (0, 1)
# 06: Value = Total
# 07: Unformatted = Total
# 08:
# 09: Row, Col = (0, 2)
# 10: Value = Date
# 11: Unformatted = Date
# 12:
# 13: Row, Col = (1, 0)
# 14: Value = Red
# 16: Unformatted = Red
# 17:
# 18: Row, Col = (1, 1)
# 19: Value = 5
# 20: Unformatted = 5
# 21:
# 22: Row, Col = (1, 2)
# 23: Value = 2017-2-14 #(shows as 2/14/2017 in the sheet)
# 24: Unformatted = 41318
# 25:
# More intermediate rows ...
# 82:
# 83: Row, Col = (6, 2)
# 84: Value = 2016-2-6 #(shows as 2/6/2016 in the sheet)
# 85: Unformatted = 40944
###########################
DESCRIPTION
This is an Excel spreadsheet reading package that should parse all excel files with the extentions .xlsx, .xlsm, .xml Excel 2003 xml (SpreadsheetML ) that can be opened in the Excel 2007+ applications. The quick-start example provided in the SYNOPSIS attempts to follow the example from Spreadsheet::ParseExcel (.xls binary file reader) as close as possible. There are additional methods and other approaches that can be used by this package for spreadsheet reading but the basic access to data from newer xml based Excel files can be as simple as above.
This is not the only perl package able to parse .xlsx files on METACPAN. For now it does appear to be the only package that will parse .xlsm and Excel 2003 .xml workbooks.
There is some documentation throughout this package for users who intend to extend the package but the primary documentation is intended for the person who uses the package as is. Parsing through an Excel workbook is done with three levels of classes;
Workbook level (This doc)
General attribute settings that affect parsing of the file in general
The place to set workbook level output formatting
Object methods to retreive document level metadata and worksheets
Worksheet level
Object methods to return specific cell instances/data
Access to some worksheet level format information (more access pending)
The place to customize data output formats targeting specific cell ranges
Cell level
Access to the cell contents
Access to the cell formats (more access pending)
There are some differences from the Spreadsheet::ParseExcel package. For instance in the SYNOPSIS the '$parser' and the '$workbook' are actually the same class for this package. You could therefore combine both steps by calling ->new with the 'file' attribute called out. The test for load success would then rely on the method file_opened. Afterward it is still possible to call ->error on the instance. Another difference is the data formatter and specifically date handling. This package leverages Spreadsheet::Reader::Format to allows for a simple pluggable custom output format that is very flexible as well as handling dates in the Excel file older than 1-January-1900. I leveraged coercions from Type::Tiny to do this but anything that follows that general format will work here.
The why and nitty gritty of design choices I made are in the Architecture Choices section. Some pitfalls are outlined in the Warnings section. Read the full documentation for all opportunities!
Primary Methods
These are the primary ways to use this class. They can be used to open a workbook, investigate information at the workbook level, and provide ways to access sheets in the workbook.
All methods are object methods and should be implemented on the object instance.
Example:
my @worksheet_array = $workbook_instance->worksheets;
parse( $file_name|$file_handle, $formatter )
Definition: This is a convenience method to match "parse($filename, $formatter)" in Spreadsheet::ParseExcel. It is one way to set the file attribute [and the formatter_inst attribute].
Accepts:
$file = see the L<file|/file> attribute for valid options (required) (required)
[$formatter] = see the L<formatter_inst|/formatter_inst> attribute for valid options (optional)
Returns: an instance of the package (not cloned) when passing with the xlsx file successfully opened or undef for failure.
worksheets
Definition: This method will return an array (not an array reference) containing a list of references to all worksheets in the workbook as objects. This is not a reccomended method because it builds all worksheet instance and returns an array of objects. It is provided for compatibility to Spreadsheet::ParseExcel. For alternatives see the get_worksheet_names method and the worksheet methods. It also only returns the tabular worksheets in the workbook. All chart sheets are ignored!
Accepts: nothing
Returns: an array ref of Worksheet objects for all worksheets in the workbook.
worksheet( $name )
Definition: This method will return an object to read values in the identified worksheet. If no value is passed to $name then the 'next' worksheet in physical order is returned. 'next' will NOT wrap It also only iterates through the 'worksheets' in the workbook (not the 'chartsheets').
Accepts: the $name string representing the name of the worksheet object you want to open. This name is the word visible on the tab when opening the spreadsheet in Excel. (not the underlying zip member file name - which can be different. It will not accept chart tab names.)
Returns: a Worksheet object with the ability to read the worksheet of that name. It returns undef and sets the error attribute if a 'chartsheet' is requested. Or in 'next' mode it returns undef if past the last sheet.
Example: using the implied 'next' worksheet;
while( my $worksheet = $workbook->worksheet ){
print "Reading: " . $worksheet->name . "\n";
# get the data needed from this worksheet
}
file_name
Definition: If you pass a file $location/$name string to the attribute file then before the package converts it to a file handle it will store the string. You can retreive that string with this method. This is true if you pass a string to the parse method as well.
Accepts: nothing
Returns: the $location/$name file string if available.
file_opened
Definition: This method is the test for success that should be used when opening a workbook using the ->new method. This allows for the object to store the error without dying entirely.
Accepts: nothing
Returns: 1 if the workbook file was successfully opened
Example:
use Spreadsheet::Reader::ExcelXML qw( :just_the_data );
my $workbook = Spreadsheet::Reader::ExcelXML->new( file => 'TestBook.xlsx' );
if ( !$workbook->file_opened ) {
die $workbook->error(), "\n";
}
for my $worksheet ( $workbook->worksheets ) {
print "Reading worksheet named: " . $worksheet->get_name . "\n";
while( 1 ){
my $cell = $worksheet->get_next_value;
print "Cell is: $cell\n";
last if $cell eq 'EOF';
}
}
get_sheet_names
Definition: This method returns an array ref of all the sheet names (tabs) in the workbook in order. (It includes chartsheets.)
Accepts: nothing
Returns: an array ref of strings
worksheet_name( $position )
Definition: This returns the name of the worksheet in that $position. (counting from zero) interspersed chartsheets in the file are not considered to hold a position by this accounting.
Accepts: $position (an integer)
Returns: the worksheet name
Example: To return only worksheet positions 2 through 4 without building them all at once
for $x (2..4){
my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
# Read the worksheet here
}
get_worksheet_names
Definition: This method returns an array ref of all the worksheet names (tabs) in the workbook in order. (No chartsheets.)
Accepts: nothing
Returns: an array ref of strings
Example: Another way to parse a workbook without building all the sheets at once is;
for $sheet_name ( @{$workbook->worksheet_names} ){
my $worksheet = $workbook->worksheet( $sheet_name );
# Read the worksheet here
}
worksheet_count
Definition: This returns the total number of recorded worksheets
Accepts: nothing
Returns: $total - a count of all worksheets (only)
Attributes
Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For general information on attributes see Moose::Manual::Attributes. For additional lesser used workbook options see Secondary Methods. There are several grouped default values for these attributes documented in the Flags section.
Example
$workbook_instance = Spreadsheet::Reader::ExcelXML->new( %attributes )
note: if the file information is not included in the initial %attributes then it must be set by one of the attribute setter methods below or the parse method before the rest of the package can be used.
file
Definition: This attribute holds the file handle for the top level workbook. If a file name is passed it is coerced into an IO::File handle and stored that way. The originaly file name can be retrieved with the method file_name.
Default no default
Required: yes
Range any unencrypted xlsx|xlsm|xml file that can be opened in Microsoft Excel 2007+.
attribute methods Methods provided to adjust this attribute
set_file( $file|$file_handle )
Definition: change the file value in the attribute (this will reboot the workbook instance)
error_inst
Definition: This attribute holds an 'error' object instance. It should have several methods for managing errors. Currently no error codes or error language translation options are available but this should make implementation of that easier.
Default: a Spreadsheet::Reader::ExcelXML::Error instance with the attributes set as;
( should_warn => 0 )
Range: See the 'Exported methods' section below for methods required by the workbook. The error instance must also be able to extract the error string from a passed error object as well. For now the current implementation will attempt ->as_string first and then ->message if an object is passed.
attribute methods Methods provided to manage this attribute
get_error_inst
Definition: returns this instance
has_error_inst
Definition: indicates in the error instance has been set
Exported methods:
The following methods are exported (delegated) to the workbook level from the stored instance of this class. Links are provided to the default implemenation;
"error" in Spreadsheet::Reader::ExcelXML::Error
"set_error" in Spreadsheet::Reader::ExcelXML::Error
"clear_error" in Spreadsheet::Reader::ExcelXML::Error
"set_warnings" in Spreadsheet::Reader::ExcelXML::Error
"if_warn" in Spreadsheet::Reader::ExcelXML::Error
"should_spew_longmess" in Spreadsheet::Reader::ExcelXML::Error
formatter_inst
Definition: This attribute holds a 'formatter' object instance. This instance does all the heavy lifting to transform raw text into desired output. It does include a role that interprets the excel format string into a Type::Tiny coercion. The default case is actually built from a number of different elements using MooseX::ShortCut::BuildInstance on the fly so you can just call out the replacement base class or role rather than fully building the formatter prior to calling new on the workbook. However the naming of the interface |http://www.cs.utah.edu/~germain/PPS/Topics/interfaces.html> is locked and should not be tampered with since it manages the methods to be imported into the workbook;
Default An instance built with MooseX::ShortCut::BuildInstance from the following arguments (note the instance itself is not built here) { superclasses => ['Spreadsheet::Reader::ExcelXML::FmtDefault'], # base class add_roles_in_sequence =>[qw( Spreadsheet::Reader::ExcelXML::ParseExcelFormatStrings # role containing the heavy lifting methods Spreadsheet::Reader::ExcelXML::FormatInterface # the interface )], package => 'FormatInstance', # a formality more than anything }
Range: A replacement formatter instance or a set of arguments that will lead to building an acceptable formatter instance. See the 'Exported methods'section below for all methods required methods for the workbook. The FormatInterface is required by name so a replacement of that role requires the same name.
attribute methods Methods provided to manage this attribute
get_formatter_inst
Definition: returns the stored formatter instance
set_formatter_inst
Definition: sets the formatter instance
Exported methods:
Additionally the following methods are exported (delegated) to the workbook level from the stored instance of this class. Links are provided to the default implemenation;
Example: name_the_workbook_uses_to_access_the_method => Link to the default source of the method
get_formatter_region => "get_excel_region" in Spreadsheet::Reader::Format::FmtDefault
has_target_encoding => "has_target_encoding" in Spreadsheet::Reader::Format::FmtDefault
get_target_encoding => "get_target_encoding" in Spreadsheet::Reader::Format::FmtDefault
set_target_encoding => "set_target_encoding( $encoding )" in Spreadsheet::Reader::Format::FmtDefault
change_output_encoding => "change_output_encoding( $string )" in Spreadsheet::Reader::Format::FmtDefault
set_defined_excel_formats => "set_defined_excel_formats( %args )" in Spreadsheet::Reader::Format::FmtDefault
get_defined_conversion => "get_defined_conversion( $position )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings
parse_excel_format_string => "parse_excel_format_string( $string, $name )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings
set_date_behavior => "set_date_behavior( $bool )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings
set_european_first => "set_european_first( $bool )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings
set_formatter_cache_behavior => "set_cache_behavior( $bool )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings
set_workbook_for_formatter => "set_workbook_inst( $instance )" in Spreadsheet::Reader::Format::ParseExcelFormatStrings
count_from_zero
Definition: Excel spreadsheets count from 1. Spreadsheet::ParseExcel counts from zero. This allows you to choose either way.
Default 1
Range 1 = counting from zero like Spreadsheet::ParseExcel, 0 = Counting from 1 like Excel
attribute methods Methods provided to adjust this attribute
counting_from_zero
Definition: a way to check the current attribute setting
file_boundary_flags
Definition: When you request data to the right of the last column or below the last row of the data this package can return 'EOR' or 'EOF' to indicate that state. This is especially helpful in 'while' loops. The other option is to return 'undef'. This is problematic if some cells in your table are empty which also returns undef. The determination for what constitues the last column and row is selected with the attributes empty_is_end, values_only , and spaces_are_empty.
Default 1
Range 1 = return 'EOR' or 'EOF' flags as appropriate, 0 = return undef when requesting a position that is out of bounds
attribute methods Methods provided to adjust this attribute
boundary_flag_setting
Definition: a way to check the current attribute setting
empty_is_end
Definition: The excel convention is to read the table left to right and top to bottom. Some tables have an uneven number of columns with real data from row to row. This allows the several methods that excersize a 'next' function to wrap after the last element with data rather than going to the max column. This also can combine with the attribute file_boundary_flags to trigger 'EOR' flags after the last data element and before the sheet max column when not implementing 'next' functionality. It will also return 'EOF' if the remaining rows are empty even if the max row is farther on.
Default 0
Range 0 = treat all columns short of the max column for the sheet as being in the table, 1 = treat all cells after the last cell with data as past the end of the row. This will be most visible when boundary flags are turned on or next functionality is used in the context of the attribute values_only.
attribute methods Methods provided to adjust this attribute
is_empty_the_end
Definition: a way to check the current attribute setting
values_only
Definition: Excel will store information about a cell even if it only contains formatting data. In many cases you only want to see cells that actually have values. This attribute will change the package behaviour regarding cells that have formatting stored against that cell but no actual value. If values in the cells exist as zero length strings or spaces only you can also set those to empty with the attribute spaces_are_empty.
Default 0
Range 1 = return 'undef' for cells with formatting only, 0 = return the result of empty_return_type (or cell objects) for cells that only contain formatting.
attribute methods Methods provided to adjust this attribute
get_values_only
Definition: a way to check the current attribute setting
from_the_edge
Definition: Some data tables start in the top left corner. Others do not. I don't reccomend that practice but when aquiring data in the wild it is often good to adapt. This attribute sets whether the file percieves the min_col and min_row as the top left edge of the sheeto or from the top row with data and starting from the leftmost column with data.
Default 1
Range 1 = treat the top left corner of the sheet as the beginning of rows and columns even if there is no data in the top row or leftmost column, 0 = Set the minimum row and minimum columns to be the first row and first column with data
attribute methods Methods provided to adjust this attribute
starts_at_the_edge
Definition: returns the attribute state
cache_positions
Definition: Using the standard architecture this parser would go back and read the sharedStrings and styles files sequentially from the beginning each time it had to access a sub elelement. This trade-off is generally not desired for these two files since the data is generally stored in a less than sequential fasion. The solution is to cache these files as they are read the first time so that a second pass through is not necessary to retreive an earlier element. The only time this doesn't make sence is if either of the files would overwhelm RAM if cached. The package has file size break points below which the files will cache. The thinking is that above these points the RAM is at risk of being overwhelmed and that not crashing and slow is better than a possible out-of-memory state. This attribute allows you to change those break points based on the target machine you are running on. The breaks are set on the byte size of the sub file not on the cached expansion of the sub file. In general the styles file is cached into a hash and the shared strings file is cached into an array ref. The attribute group_return_type also affects the size of the cache for the sharedStrings file since it will not cache the string formats unless the attribute is set to 'instance'. There is also a setting for caching worksheet data. Some worksheet row position settings will always be cached in order to speed up multiple reads over the same sheet or to query meta data about the rows. However, this cache level is set lower since the row caching creates much deeper data structures.
Default
{
shared_strings_interface => 5242880,# 5 MB
styles_interface => 5242880,# 5 MB
worksheet_interface => 2097152,# 2 MB
}
attribute methods Methods provided to adjust this attribute
cache_positions
Definition: returns the full attribute settings as a hashref
get_cache_size( (shared_strings_interface|styles_interface|worksheet_interface) )
Definition: return the max file size allowed to cache for the indicated interface
set_cache_size( $target_interface = $max_file_size )>
Definition: set the $max_file_size in bytes to be cached for the indicated $target_interface
has_cache_size( $target_interface )
Definition: returns true if the $target_interface has a cache size set
show_sub_file_size
Definition: Especially for zip (xlsx and xlsm) files you may not know how big the file is and want to the package to tell you what size it thinks the file is. This attribute turns on a warning statment that prints to STDERR with information on the size of potientially cached files.
Default 0
Range 0 = don't warn the file size, 1 = send the potentially cached file sizes to STDERR for review
group_return_type
Definition: Traditionally ParseExcel returns a cell object with lots of methods to reveal information about the cell. In reality the extra information is not used very much (witness the popularity of Spreadsheet::XLSX). Because many users don't need or want the extra cell formatting information it is possible to get either the raw xml value, the raw visible cell value (seen in the Excel format bar), or the formatted cell value returned either the way the Excel file specified or the way you specify instead of a Cell instance with all the data. All empty cells return undef no matter what.
Default instance
Range instance = returns a populated Spreadsheet::Reader::ExcelXML::Cell instance, xml_value = returns the string stored in the xml file - for xml based sheets this can sometimes be different thant the visible value in the cell or formula bar. unformatted = returns just the raw visible value of the cell shown in the Excel formula bar, value = returns just the formatted value stored in the excel cell
attribute methods Methods provided to adjust this attribute
get_group_return_type
Definition: a way to check the current attribute setting
empty_return_type
Definition: Traditionally Spreadsheet::ParseExcel returns an empty string for cells with unique formatting but no stored value. It may be that the more accurate way of returning undef works better for you. This will turn that behaviour on.
Default empty_string
Range empty_string = populates the unformatted value with '' even if it is set to undef undef_string = if excel stores undef for an unformatted value it will return undef
attribute methods Methods provided to adjust this attribute
get_empty_return_type
Definition: a way to check the current attribute setting
spread_merged_values
Definition: In Excel you visibly see the value of the primary cell in a merged range displayed in all the cells. This attribute lets the code see the primary value show in each of the merged cells. There is some mandatory caching to pull this off so it will consume more memory.
Default 0 (To match the Excel formula bar, VBscript, and Spreadsheet::ParseExcel)
Range 0 = don't spread the primary value, 1 = spread the primary value
attribute methods Methods provided to adjust this attribute
spreading_merged_values
Definition: a way to check the current attribute setting
skip_hidden
Definition: Like the previous attribute this attempts to match a visual effect in Excel. Even though hidden cells still contain values you can't see them visibly. This allows you to skip hidden rows and columns (not hidden sheets). The one gotcha is Excel will place the primary value in the new primary merged cell (formula bar) if a merge range is only partially obscured to include the original primary cell. This package can't do that. Either spread the primary to all cells or none.
Default 0 (To match VBscript and Spreadsheet::ParseExcel)
Range 0 = don't skip hidden rows and columns, 1 = skip hidden rows and columns
attribute methods Methods provided to adjust this attribute
should_skip_hidden
Definition: a way to check the current attribute setting
spaces_are_empty
Definition: Some auto file generators tend to add empty strings or strings with spaces to fill empty cells. There may be some visual value in this but they can slow down parsing scripts. this attribute allows the sheet to treat spaces as empty or undef instead of cells with values.
Default 0 (To match Excel and Spreadsheet::ParseExcel)
Range 0 = cells with zero length strings and spaces are considered to have 'values", 1 = There must be something other than spaces or a zero length string for the cell to have value.
attribute methods Methods provided to adjust this attribute
are_spaces_empty
Definition: a way to check the current attribute setting
merge_data
Definition: For zip based worksheets the merge data is stored at the end of the file. In order for the parser to arrive at that point it has to read through the whole sheet first. For big worksheet files this is very slow. If you are willing to not know or implement cell merge information then turn this off and the sheet should load much faster.
Default 1 (collect merge data)
Range 1 = The merge data is parsed from the worksheet file when it is opened, 0 = No merge data is parsed. The effect is equal to the cell merges dissapearing.
attribute methods Methods provided to adjust this attribute
collecting_merge_data
Definition: a way to check the current attribute setting
FLAGS
The parameter list (Attributes) that are possible to pass to ->new is somewhat long. Therefore you may want a shortcut that aggregates some set of attribute settings that are not the defaults but wind up being boilerplate. I have provided possible alternate sets like this and am open to providing others that are suggested. The flags will have a : in front of the identifier and will be passed to the class in the 'use' statement for consumption by the import method. The flags can be stacked and where there is conflict between the flag settings the rightmost passed flag setting is used. If everything in the flag but one or two settings are desirable still use the flag and then overwrite those settings when calling new.
Example;
use Spreadsheet::Reader::ExcelXML v0.2 qw( :alt_default :debug );
:alt_default
This is intended for a deep look at data and skip formatting cells.
Default attribute differences
values_only => 1
count_from_zero => 0
empty_is_end => 1
:just_the_data
This is intended for a shallow look at data with value formatting implemented
Default attribute differences
count_from_zero => 0
values_only => 1
empty_is_end => 1
group_return_type => 'value'
from_the_edge => 0
empty_return_type => 'undef_string'
spaces_are_empty => 1
merge_data => 0
column_formats => 0
:just_raw_data
This is intended for a shallow look at raw text and skips all formatting including number formats.
Default attribute differences
count_from_zero => 0
values_only => 1
empty_is_end => 1
group_return_type => 'xml_value'
from_the_edge => 0,
empty_return_type => 'undef_string'
spaces_are_empty => 1
merge_data => 0
column_formats => 0
:like_ParseExcel
This is a way to force some of the other groups back to instance and count from zero
Default attribute differences
count_from_zero => 1
group_return_type => 'instance'
:debug
This is a way to turn on as much reporting as possible
Default attribute differences
error_inst ->
error_inst =>{
superclasses => ['Spreadsheet::Reader::ExcelXML::Error'],
package => 'ErrorInstance',
should_warn => 1,
}
show_sub_file_size => 1
:lots_of_ram
This opens the caching size allowances way up
Default attribute differences
cache_positions =>{
shared_strings_interface => 209715200,# 200 MB
styles_interface => 209715200,# 200 MB
worksheet_interface => 209715200,# 200 MB
},
:less_ram
This tightens caching size allowances way down
Default attribute differences
cache_positions =>{
shared_strings_interface => 10240,# 10 KB
styles_interface => 10240,# 10 KB
worksheet_interface => 1024,# 1 KB
},
Secondary Methods
These are additional ways to use this class. They can be used to open an .xlsx workbook. They are also ways to investigate information at the workbook level. For information on how to retrieve data from the worksheets see the Worksheet and Cell documentation. For additional workbook options see the Secondary Methods and the Attributes sections. The attributes section specifically contains all the methods used to adjust the attributes of this class.
All methods are object methods and should be implemented on the object instance.
Example:
my @worksheet_array = $workbook_instance->worksheets;
get_epoch_year
Definition: This returns the epoch year defined by the Excel workbook. The epoch year affects the way dates are processed in the formatter Spreadsheet::Reader::Format
Accepts: nothing
Returns: 1900 = Windows Excel or 1904 = Apple Excel
has_epoch_year
Definition: This indicates if an epoch_year has been determined for the workbook (yet)
Accepts: nothing
Returns: 1 = yes there is one, 0 = nothing (yet)
get_sheet_name( $Int )
Definition: This method returns the sheet name for a given physical position in the workbook from left to right. It counts from zero even if the workbook is in 'count_from_one' mode. B(It will return chart names but chart tab names cannot currently be converted to worksheets). You may actually want worksheet_name instead of this function.
Accepts: integers
Returns: the sheet name (both worksheet and chartsheet )
sheet_count
Definition: This returns the total number of recorded sheets
Accepts: nothing
Returns: $total - a count of all sheets (including chartsheets and worksheets)
get_sheet_info( $name )
Definition: This returns any stored metadata about the sheet in a hashref
Accepts: $name
Returns: a hashref of sheet metadata ( a pretty thin list still )
get_rel_info( $relId )
Definition: This returns the sheet name for the $relId
Accepts: $relId ex; 'rId5'
Returns: The sheet $name associated with that relId
get_id_info( $Id )
Definition: This returns the sheet name for the $Id
Accepts: $Id (an integer) ex; '2'
Returns: The sheet $name associated with that Id
get_chartsheet_names
Definition: This method returns an array ref of all the chartsheet names (tabs) in the workbook in order. (No worksheets.)
Accepts: nothing
Returns: an array ref of strings
chartsheet_name( $position )
Definition: This returns the name of the chartsheet in that $position. (counting from zero) interspersed worksheets in the file are not considered to hold a position by this accounting.
Accepts: $position (an integer)
Returns: the chartsheet name
chartsheet_count
Definition: This returns the total number of recorded chartsheets
Accepts: nothing
Returns: $total - a count of all chartsheets (only)
creator
Definition: Returns the recorded creator of the file from the parsed metadata
Accepts: nothing
Returns: a string
modified_by
Definition: Returns the recorded last entity to modify the file from the parsed metadata
Accepts: nothing
Returns: a string
date_created
Definition: Returns the date that Excel recorded for the file creation
Accepts: nothing
Returns: a string (YYYY-MM-DD)
date_modified
Definition: Returns the date that Excel recorded for the last file modification
Accepts: nothing
Returns: a string (YYYY-MM-DD)
in_the_list
Definition: This is a predicate method that indicates if the 'next' worksheet function has been implemented at least once.
Accepts:nothing
Returns: true = 1, false = 0
start_at_the_beginning
Definition: This restarts the 'next' worksheet at the first worksheet. This method is only useful in the context of the worksheet function.
Accepts: nothing
Returns: nothing
Architecture Choices
This is yet another package for parsing Excel xml or 2007+ (and 2003+ xml) workbooks. There are two other options for 2007+ XLSX parsing (but not 2003 xml parsing) on CPAN. (Spreadsheet::ParseXLSX and Spreadsheet::XLSX) In general if either of them already work for you without issue then there is probably no compelling reason to switch to this package. However, the goals of this package which may provide differentiation are five fold. First, as close as possible produce the same output as is visible in an excel spreadsheet with exposure to underlying settings from Excel. Second, adhere as close as is reasonable to the Spreadsheet::ParseExcel API (where it doesn't conflict with the first objective) so that less work would be needed to integrate ParseExcel and this package. An addendum to the second goal is this package will not expose elements of the object hash for use by the consuming program. This package will either return an unblessed hash with the equivalent elements to the Spreadsheet::ParseExcel output (instead of a class instance) or it will provide methods to provide these sets of data. The third goal is to read the excel files in a 'just in time' manner without storing all the data in memory. The intent is to minimize the footprint of large file reads. Initially I did this using XML::LibXML but it eventually proved to not play well with Moose ( or perl? ) garbage collection so this package uses a pure perl xml parser. In general this means that design decisions will generally sacrifice speed to keep RAM consumption low. Since the data in the sheet is parsed just in time the information that is not contained in the primary meta-data headers will not be available for review until the sheet parses to that point. In cases where the parser has made choices that prioritize speed over RAM savings there will generally be an attribute available to turn that decision off. Fourth, Excel files get abused in the wild. In general the Microsoft (TM) Excel application handles these mangled files gracefully. The goal is to be able to read any xml based spreadsheet Excel can read from the supported extention list. Finally, this parser supports the Excel 2003 xml format. All in all this package solves many of the issues I found parsing Excel in the wild. I hope it solves some of yours as well.
Warnings
1.This package uses Archive::Zip. Not all versions of Archive::Zip work for everyone. I have tested this with Archive::Zip 1.30. Please let me know if this does not work with a sucessfully installed (read passed the full test suit) version of Archive::Zip newer than that.
2. Not all workbook sheets (tabs) are created equal! Some Excel sheet tabs are only a chart. These tabs are 'chartsheets'. The methods with 'worksheet' in the name only act on the sub set of tabs that are worksheets. Future methods with 'chartsheet' in the name will focus on the subset of sheets that are chartsheets. Methods with just 'sheet' in the name have the potential to act on both. The documentation for the chartsheet level class is found in Spreadsheet::Reader::ExcelXML::Chartsheet (still under construction). All chartsheet classes do not provide access to cells.
3. This package supports reading xlsm files (Macro enabled Excel 2007+ workbooks). xlsm files allow for binaries to be embedded that may contain malicious code. However, other than unzipping the excel file no work is done by this package with the sub-file 'vbaProject.bin' containing the binaries. This package does not provide an API to that sub-file and I have no intention of doing so. Therefore my research indicates there should be no risk of virus activation while parsing even an infected xlsm file with this package but I encourage you to use your own judgement in this area. caveat utilitor!
4. This package will read some files with 'broken' xml. In general this should be transparent but in the case of the maximum row value and the maximum column value for a worksheet it can cause some surprising problems. This includes the possibility that the maximum values are initially stored as 'undef' if the sheet does not provide them in the metadata as expected. These values are generally never available in Excel 2003 xml files. The answer to the methods "row_range" in Spreadsheet::Reader::ExcelXML::Worksheet and "col_range" in Spreadsheet::Reader::ExcelXML::Worksheet will then change as more of the sheet is parsed. You can use the attribute file_boundary_flags or the methods "get_next_value" in Spreadsheet::Reader::ExcelXML::Worksheet or "fetchrow_arrayref" in Spreadsheet::Reader::ExcelXML::Worksheet as alternates to pre-testing for boundaries when iterating. The primary cause of these broken XML elements in Excel 2007+ files are non-XML applications writing to or editing the underlying xml. If you have an example of other broken xml files readable by the Excel application that are not parsable by this package please submit them to my github repo so I can work to improve this package. If you don't want your test case included with the distribution I will use it to improve the package without publishing it.
5. I reserve the right to tweak the sub file caching breakpoints over the next few releases. The goal is to have a default that appears to be the best compromise by 2017-1-1.
6. This package provides support for SpreadsheetML (Excel 2003) .xml extention documents. These files should include the header;
<?mso-application progid="Excel.Sheet"?>
to indicate their intended format. Please submit any cases that appear to behave differently than expected for .xml extention files that are readable by the Excel application. I am also interested in cases where an out of memory error occurs with an .xml extension file. This warning will stay till 2017-1-1.
7. This package uses two classes at the top to handle cleanup for some self referential object organization that I use. As a result the action taken on this package is (mostly) implemented in Spreadsheet::Reader::ExcelXML::Workbook code. I documented most of that code API here. If you want to look at the raw code go there.
BUILD / INSTALL from Source
0. Using cpanm is much easier than a source build!
cpanm Spreadsheet::Reader::ExcelXML
And then if you feel kindly App::cpanminus::reporter
cpanm-reporter
1. Download a compressed file with this package code from your favorite source
2. Extract the code from the compressed file.
If you are using tar on a .tar.gz file this should work:
tar -zxvf Spreadsheet-Reader-ExcelXML-v0.xx.tar.gz
3. Change (cd) into the extracted directory
4. Run the following
(for Windows find what version of make was used to compile your perl)
perl -V:make
(then for Windows substitute the correct make function (s/make/dmake/g)? below)
perl Makefile.PL
make
make test
make install # As sudo/root
make clean
SUPPORT
TODO
1. Write a chartsheet parser and functions
2. Add a pivot table reader (Not just read the values from the sheet)
3. Add calc chain methods
4. Add more exposure to workbook/worksheet formatting values
AUTHOR
Jed Lund
jandrew@cpan.org
CONTRIBUTORS
This is the (likely incomplete) list of people who have helped make this distribution what it is, either via code contributions, patches, bug reports, help with troubleshooting, etc. A huge 'thank you' to all of them. Most were contributors to Spreadsheet::XLSX::Reader::LibXML but the contributions have (hopefully) not been lost.
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
This software is copyrighted (c) 2016, 2017 by Jed Lund
DEPENDENCIES
Moose - 2.1213
MooseX::ShortCut::BuildInstance - 1.032
Type::Tiny - 1.000
version - 0.077
SEE ALSO
Spreadsheet::Read - generic Spreadsheet reader
Spreadsheet::ParseExcel - Excel binary files from 2003 and earlier
Spreadsheet::ParseXLSX - Excel version 2007 and later
Spreadsheet::XLSX - Excel version 2007 and later (Very rough)
All lines in this package that use Log::Shiras are commented out