NAME
Spreadsheet::XLSX::Reader::LibXML - Read xlsx spreadsheet files with LibXML
SYNOPSIS
The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder
#!/usr/bin/env perl
use strict;
use warnings;
use Spreadsheet::XLSX::Reader::LibXML;
my $parser = Spreadsheet::XLSX::Reader::LibXML->new();
my $workbook = $parser->parse( 'TestBook.xlsx' );
if ( !defined $workbook ) {
die $parser->error(), "\n";
}
for my $worksheet ( $workbook->worksheets() ) {
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();
for my $row ( $row_min .. $row_max ) {
for my $col ( $col_min .. $col_max ) {
my $cell = $worksheet->get_cell( $row, $col );
next unless $cell;
print "Row, Col = ($row, $col)\n";
print "Value = ", $cell->value(), "\n";
print "Unformatted = ", $cell->unformatted(), "\n";
print "\n";
}
}
last;# In order not to read all sheets
}
###########################
# SYNOPSIS Screen Output
# 01: Row, Col = (0, 0)
# 02: Value = Category
# 03: Unformatted = Category
# 04:
# 05: Row, Col = (0, 1)
# 06: Value = Total
# 07: Unformatted = Total
# 08:
# 09: Row, Col = (0, 2)
# 10: Value = Date
# 11: Unformatted = Date
# 12:
# 13: Row, Col = (1, 0)
# 14: Value = Red
# 16: Unformatted = Red
# 17:
# 18: Row, Col = (1, 1)
# 19: Value = 5
# 20: Unformatted = 5
# 21:
# 22: Row, Col = (1, 2)
# 23: Value = 2017-2-14 #(shows as 2/14/2017 in the sheet)
# 24: Unformatted = 41318
# 25:
# More intermediate rows ...
# 82:
# 83: Row, Col = (6, 2)
# 84: Value = 2016-2-6 #(shows as 2/6/2016 in the sheet)
# 85: Unformatted = 40944
###########################
DESCRIPTION
This is another package for parsing Excel 2007+ workbooks. The goals of this package are three fold. First, as close as possible produce the same output as is visible in an excel spreadsheet with exposure to underlying settings from Excel. Second, adhere as close as is reasonable to the Spreadsheet::ParseExcel API (where it doesn't conflict with the first objective) so that less work would be needed to integrate ParseExcel and this package. Third, to provide an XLSX sheet parser that is built on XML::LibXML. The other two primary options for XLSX parsing on CPAN use either a one-off XML parser (Spreadsheet::XLSX) or XML::Twig (Spreadsheet::ParseXLSX). In general if either of them already work for you without issue then there is no reason to change to this package. I personally found some bugs and functionality boundaries in both that I wanted to improve, and by the time I had educated myself enough to make improvement suggestions including root causing the bugs to either the XML parser or the reader logic I had written this.
In the process of learning and building I also wrote some additional features for this parser that are not found in the Spreadsheet::ParseExcel package. For instance in the SYNOPSIS the '$parser' and the '$workbook' are actually the same class. You could combine both steps by calling ->new with the 'file_name' (or 'file_handle') attribute called out. Afterward it is still possible to call ->error on the instance. The test in that case for load success would be $instance->has_file_name(handle) Another improvement (from my perspective) is date handling. This package allows for a simple pluggable custom output format that is very flexible as well as handling dates older than 1-January-1900. I leveraged coercions from Type::Tiny to do this but anything that follows that general format will work here. Additionally, this is a Moose based package. As such it is designed to be (fairly) extensible by writing roles and adding them to this package rather than requiring that you extend the package to some new branch. Read the full documentation for all opportunities!
In the realm of extensibility, XML::LibXML has multiple ways to read an XML file but this release only has an XML::LibXML::Reader parser option. Future iterations could include a DOM parser option. Additionally this package does not (yet) provide the same access to the formatting elements provided in Spreadsheet::ParseExcel. That is on the longish and incomplete TODO list.
The package operates on the workbook with three primary tiers of classes. All other classes in this package are for architectual extensibility.
---> Workbook level
---> Worksheet level*
---> Cell level* - optional
Warnings
1. Archive-Zip versions greater than 1.30 appear to be broken. This package requires Archive::Zip so I reccomend Archive-Zip-1.30.
2. This package requires that you can load XML::LibXML which requires the libxml2 and 'libxml2-devel' libraries. I have included Alien::LibXML in the build profile in an attempt to resolve any library issues but being new to usage of Alien libraries in general I'm certain I got it quite right. Many PC's have these already but if thise fails to load please log an issue in my repo on github>.
3. Earlier versions of this package would extract the .xlsx file to a temp directory and release the file lock on the original file while still retaining the information for access by the parser. In order to resolve some some temp dir cleanup issues this package no longer releases the lock on the file during reading. (It will release the file lock and clean up any temp directories when the class is closed) (warning will be removed after 2015-June-25)
*4. Not all workbook sheets (tabs) are created equal! Some Excel sheet tabs are only a chart. These tabs are 'chartsheets'. The methods with 'worksheet' in the name only act on the sub set of tabs that are worksheets. Future methods with 'chartsheet' in the name will focus on the subset of sheets that are chartsheets. Methods with just 'sheet' in the name have the potential to act on both. The documentation for the chartsheet level class is found in Spreadsheet::XLSX::Reader::LibXML::Chartsheet (still under construction). All chartsheet classes do not provide access to cells.
Attributes
Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For general information on attributes see Moose::Manual::Attributes. For ways to manage the workbook when opened see the Primary Methods. For additional lesser used workbook options see Secondary Methods.
Example
$workbook_instance = Spreadsheet::XLSX::Reader::LibXML->new( %attributes )
note: if a file name or file handle for an .xlsx file are not included in the initial %attributes then one of them must be set by method before the rest of the package can be used.
file_name
Definition: This attribute holds the full file name and path for the xlsx file to be parsed.
Default no default - either this or a file handle must be provided to read a file
Range any unencrypted xlsx file that can be opened in Microsoft Excel
attribute methods Methods provided to adjust this attribute
set_file_name
Definition: change the file name value in the attribute (this will reboot the workbook instance)
has_file_name
Definition: this is used to see if the workbook loaded correctly using the file_name option to open an Excel .xlsx file.
file_handle
Definition: This attribute holds a copy of the passed file handle reference.
Default no default - either this or a file name must be provided to read a file
Range any unencrypted xlsx file handle that can be opened in Microsoft Excel
attribute methods Methods provided to adjust this attribute
set_file_handle
Definition: change the set file handle (this will reboot the workbook instance)
has_file_handle
Definition: this is used to see if the workbook loaded correctly when using the file_handle option to open an Excel .xlsx file.
error_inst
Definition: This attribute holds an 'error' object instance. It should have several methods for managing errors. Currently no error codes or error language translation options are available but this should make implementation of that easier.
Default: a Spreadsheet::XLSX::Reader::LibXML::Error instance with the attributes set as;
( should_warn => 0 )
Range: The minimum list of methods to implement for your own instance is;
error set_error clear_error set_warnings if_warn
The error instance must be able to extract the error string from a passed error object as well. For now the current implementation will attempt ->as_string first and then ->message if an object is passed.
attribute methods Methods provided to adjust this attribute
get_error_inst
Definition: returns this instance
error
Definition: delegated method from the class used to get the most recently logged error string
set_error
Definition: delegated method from the class used to set a new error string (or pass an error object for extraction of the error string)
clear_error
Definition: delegated method from the class used to clear the current error string
set_warnings
Definition: delegated method from the class used to turn on or off real time warnings when errors are set
if_warn
Definition: delegated method from the class used to extend this package and see if warnings should be emitted.
sheet_parser
Definition: This sets the way the .xlsx file is parsed. For now the only choice is 'reader'.
Default 'reader'
Range 'reader'
attribute methods Methods provided to adjust this attribute
set_parser_type
Definition: the way to change the parser type
get_parser_type
Definition: returns the currently set parser type
count_from_zero
Definition: Excel spreadsheets count from 1. Spreadsheet::ParseExcel counts from zero. This allows you to choose either way.
Default 1
Range 1 = counting from zero like Spreadsheet::ParseExcel, 0 = Counting from 1 like Excel
attribute methods Methods provided to adjust this attribute
counting_from_zero
Definition: a way to check the current attribute setting
set_count_from_zero
Definition: a way to change the current attribute setting
file_boundary_flags
Definition: When you request data to the right of the last column or below the last row of the data this package can return 'EOR' or 'EOF' to indicate that state. This is especially helpful in 'while' loops. The other option is to return 'undef'. This is problematic if some cells in your table are empty which also returns undef. What is determined to be the last column and row is determined by the attribute empty_is_end.
Default 1
Range 1 = return 'EOR' or 'EOF' flags as appropriate, 0 = return undef when requesting a position that is out of bounds
attribute methods Methods provided to adjust this attribute
boundary_flag_setting
Definition: a way to check the current attribute setting
change_boundary_flag
Definition: a way to change the current attribute setting
empty_is_end
Definition: The excel convention is to read the table left to right and top to bottom. Some tables have an uneven number of columns with real data from row to row. This allows the several methods that excersize a 'next' function to wrap after the last element with data rather than going to the max column. This also triggers 'EOR' flags after the last data element and before the sheet max column when not implementing 'next' functionality.
Default 0
Range 1 = treat all columns short of the max column for the sheet as being in the table, 0 = end each row after the last cell with data rather than going to the max sheet column
attribute methods Methods provided to adjust this attribute
is_empty_the_end
Definition: a way to check the current attribute setting
set_empty_is_end
Definition: a way to set the current attribute setting
values_only
Definition: Excel will store information about a cell even if it only contains formatting data. In many cases you only want to see cells that actually have values. This attribute will change the package behaviour regarding cells that have formatting stored against that cell but no actual value.
Default 0
Range 1 = skip cells with formatting only and treat them as completely empty, 0 = return informat about cells that only contain formatting
attribute methods Methods provided to adjust this attribute
get_values_only
Definition: a way to check the current attribute setting
set_values_only
Definition: a way to set the current attribute setting
from_the_edge
Definition: Some data tables start in the top left corner. Others do not. I don't reccomend that practice but when aquiring data in the wild it is often good to adapt. This attribute sets whether the file reads from the top left edge or from the top row with data and starting from the leftmost column with data.
Default 1
Range 1 = treat the top left corner of the sheet as the beginning of rows and columns even if there is no data in the top row or leftmost column, 0 = Set the minimum row and minimum columns to be the first row and first column with data
attribute methods Methods provided to adjust this attribute
set_from_the_edge
Definition: a way to set the current attribute setting
default_format_list
Definition: This is the attribute containing the role that manages the way number formats and encoding conversions are implemented for this sheet. This is a departure from Spreadsheet::ParseExcel for two reasons. First, it doesn't use the same modules. Second, this requires a Moose role (not a class) with two methods where ParseExcel uses an object instance.
Default Spreadsheet::XLSX::Reader::LibXML::FmtDefault
Range a Moose role with the methods 'get_defined_excel_format' and 'change_output_encoding' it should be noted that libxml2 which is the underlying code for XML::LibXML always attempts to get the data into perl friendly strings based on the xml file encoding setting. That means this role should only tweak the data on the way out of memory and does not operate on the data on its way into memory.
attribute methods Methods provided to adjust this attribute
get_default_format_list
Definition: a way to check the current attribute setting
set_default_format_list
Definition: a way to set the current attribute setting
format_string_parser
Definition: This is a Moose role that interprets the excel format string into a Type::Tiny coercion. If you don't like the output or the method you can write your own Moose Role and add it here.
Default Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings
Range a Moose role with the method 'parse_excel_format_string'
attribute methods Methods provided to adjust this attribute
get_format_string_parser
Definition: a way to check the current attribute setting
set_format_string_parser
Definition: a way to set the current attribute setting
group_return_type
Definition: Traditionally ParseExcel returns a cell object with lots of methods to reveal information about the cell. In reality the extra information is not used very much (witness the popularity of Spreadsheet::XLSX). Because many users don't need or want the extra cell formatting information it is possible to get either the raw cell value or the formatted cell value returned either the way the Excel file specified or the way you specify instead of a Cell instance with all the data. . See "custom_formats" in Spreadsheet::XLSX::Reader::LibXML::Worksheet to insert custom targeted formats for use with the parser. All empty cells return undef no matter what.
Default instance
Range instance = returns a populated Spreadsheet::XLSX::Reader::LibXML::Cell instance, unformatted = returns just the raw value of the cell with no modifications, value = returns just the formatted value stored in the excel cell
attribute methods Methods provided to adjust this attribute
get_group_return_type
Definition: a way to check the current attribute setting
set_group_return_type
Definition: a way to set the current attribute setting
empty_return_type
Definition: Traditionally Spreadsheet::ParseExcel returns an empty string for cells with unique formatting but no stored value. It may be that the more accurate way of returning undef works better for you. This will turn that behaviour on. If Excel stores an empty string having this attribute set to 'undef_string' will still return the empty string!
Default empty_string
Range empty_string = populates the unformatted value with '' even if it is set to undef undef_string = if excel stores undef for an unformatted value it will return undef
attribute methods Methods provided to adjust this attribute
get_empty_return_type
Definition: a way to check the current attribute setting
set_empty_return_type
Definition: a way to set the current attribute setting
Primary Methods
These are the primary ways to use this class. They can be used to open an .xlsx workbook. They are also ways to investigate information at the workbook level. For information on how to retrieve data from the worksheets see the Worksheet and Cell documentation. For additional workbook options see the Secondary Methods and the Attributes sections. The attributes section specifically contains all the methods used to adjust the attributes of this class.
All methods are object methods and should be implemented on the object instance.
Example:
my @worksheet_array = $workbook_instance->worksheets;
parse( $file_name|$file_handle, $formatter )
Definition: This is a convenience method to match "parse($filename, $formatter)" in Spreadsheet::ParseExcel. It only works if the file_name or file_handle attribute was not set with ->new. It is one way to set the 'file_name' or 'file_handle' attribute [and the default_format_list attribute]. You cannot pass both a file name and a file handle simultaneously to this method.
Accepts:
$file = a valid xlsx file [or a valid xlsx file handle] (required)
[$formatter] = see the default_format_list attribute for valid options (optional)
Returns: itself when passing with the xlsx file loaded to the workbook level or undef for failure.
worksheets
Definition: This method will return an array (not an array reference) containing a list of references to all worksheets in the workbook. This is not a reccomended method. It is provided for compatibility to Spreadsheet::ParseExcel. For alternatives see the get_worksheet_names method and the worksheet methods. For now it also only returns the tabular worksheets in the workbook. All chart worksheets are ignored! (future inclusion will included a backwards compatibility policy)
Accepts: nothing
Returns: an array ref of Worksheet objects for all worksheets in the workbook.
worksheet( $name )
Definition: This method will return an object to read values in the worksheet. If no value is passed to $name then the 'next' worksheet in physical order is returned. 'next' will NOT wrap It also only iterates through the 'worksheets' in the workbook (but not the 'chartsheets').
Accepts: the $name string representing the name of the worksheet object you want to open. This name is the word visible on the tab when opening the spreadsheet in Excel. (not the underlying zip member file name - which can be different. It will not accept chart tab names.)
Returns: a Worksheet object with the ability to read the worksheet of that name. It returns undef and sets the error attribute if a 'chartsheet' is requested. Or in 'next' mode it returns undef if past the last sheet.
Example: using the implied 'next' worksheet;
while( my $worksheet = $workbook->worksheet ){
print "Reading: " . $worksheet->name . "\n";
# get the data needed from this worksheet
}
in_the_list
Definition: This is a predicate method that indicates if the 'next' worksheet function has been implemented at least once.
Accepts:nothing
Returns: true = 1, false = 0 once
start_at_the_beginning
Definition: This restarts the 'next' worksheet at the first worksheet. This method is only useful in the context of the worksheet function.
Accepts: nothing
Returns: nothing
worksheet_count
Definition: This method returns the count of worksheets (excluding charts) in the workbook.
Accepts:nothing
Returns: an integer
get_worksheet_names
Definition: This method returns an array ref of all the worksheet names in the workbook. (It excludes chartsheets.)
Accepts: nothing
Returns: an array ref
Example: Another way to parse a workbook without building all the sheets at once is;
for $sheet_name ( @{$workbook->worksheet_names} ){
my $worksheet = $workbook->worksheet( $sheet_name );
# Read the worksheet here
}
get_sheet_names
Definition: This method returns an array ref of all the sheet names (tabs) in the workbook. (It includes chartsheets.)
Accepts: nothing
Returns: an array ref
get_chartheet_names
Definition: This method returns an array ref of all the chartsheet names in the workbook. (It excludes worksheets.)
Accepts: nothing
Returns: an array ref
sheet_name( $Int )
Definition: This method returns the sheet name for a given physical position in the workbook from left to right. It counts from zero even if the workbook is in 'count_from_one' mode. B(It will return chart names but chart tab names cannot currently be converted to worksheets). You may actually want worksheet_name instead of this function.
Accepts: integers
Returns: the sheet name (both workbook and worksheet)
Example: To return only worksheet positions 2 through 4
for $x (2..4){
my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
# Read the worksheet here
}
sheet_count
Definition: This method returns the count of all sheets in the workbook (worksheets and chartsheets).
Accepts: nothing
Returns: a count of all sheets
worksheet_name( $Int )
Definition: This method returns the worksheet name for a given order in the workbook from left to right. It does not count any 'chartsheet' positions as valid. It counts from zero even if the workbook is in 'count_from_one' mode.
Accepts: integers
Returns: the worksheet name
Example: To return only worksheet positions 2 through 4 and then parse them
for $x (2..4){
my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
# Read the worksheet here
}
worksheet_count
Definition: This method returns the count of all worksheets in the workbook (not including chartsheets).
Accepts: nothing
Returns: a count of all worksheets
chartsheet_name( $Int )
Definition: This method returns the chartsheet name for a given order in the workbook from left to right. It does not count any 'worksheet' positions as valid. It counts from zero even if the workbook is in 'count_from_one' mode.
Accepts: integers
Returns: the chartsheet name
chartsheet_count
Definition: This method returns the count of all chartsheets in the workbook (not including worksheets).
Accepts: nothing
Returns: a count of all chartsheets
error
Definition: This returns the most recent error message logged by the package. This method is mostly relevant when an unexpected result is returned by some other method.
Accepts:nothing
Returns: an error string.
Secondary Methods
These are the additional methods that include ways to extract additional information about the .xlsx file and ways to modify workbook and worksheet parsing that are less common. Note that all methods specifically used to adjust workbook level attributes are listed in the Attribute section. This section primarily contains methods for or delegated from private attributes set up during the workbook load process.
parse_excel_format_string( $format_string )
Roundabout delegation from "parse_excel_format_string( $string )" in Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings
creator
Definition: Retrieve the stored creator string from the Excel file.
Accepts nothing
Returns A string
date_created
Definition: returns the date the file was created
Accepts nothing
Returns A string
modified_by
Definition: returns the user name of the person who last modified the file
Accepts nothing
Returns A string
date_modified
Definition: returns the date when the file was last modified
Accepts nothing
Returns A string
get_epoch_year
Definition: This returns the epoch year defined by the Excel workbook.
Accepts: nothing
Returns: 1900 = Windows Excel or 1904 = Apple Excel
get_shared_string_position
Roundabout delegation from "get_shared_string_position( $position )" in Spreadsheet::XLSX::Reader::LibXML::SharedStrings
get_format_position
Roundabout delegation from "get_format_position( $position, [$header] )" in Spreadsheet::XLSX::Reader::LibXML::Styles
set_defined_excel_format_list
Roundabout delegation from "set_defined_excel_format_list" in Spreadsheet::XLSX::Reader::LibXML::FmtDefault
change_output_encoding
Roundabout delegation from "change_output_encoding( $string )" in Spreadsheet::XLSX::Reader::LibXML::FmtDefault
set_cache_behavior
Roundabout delegation from "cache_formats" in Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings
get_date_behavior
Roundabout delegation from "datetime_dates" in Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings
set_date_behavior
Roundabout delegation from "datetime_dates" in Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings
FLAGS
The parameter list (attributes) that are possible to pass to ->new is somewhat long. Therefore you may want a shortcut that aggregates some set of attribute settings that are not the defaults but wind up being boilerplate. I have provided possible alternate sets like this and am open to providing others that are suggested. The flags will have a : in front of the identifier and will be passed to the class in the 'use' statement for consumption by the import method. Only the first flag is currently accepted.
Example;
use Spreadsheet::XLSX::Reader::LibXML v0.34.4 qw( :alt_default );
:alt_default
This is intended for a deep look at data and skip formatting cells.
Default attribute differences
values_only => 1
count_from_zero => 0
empty_is_end => 1
:just_the_data
This is intended for a shallow look at data and skip formatting.
Default attribute differences
values_only => 1
count_from_zero => 0
empty_is_end => 1
group_return_type => 'value'
BUILD / INSTALL from Source
0. Please note that using cpanm is much easier than a source build! (but it will not always give the latest github version)
cpanm Spreadsheet-XLSX-Reader-LibXML
1. This package uses Alien::LibXML to try and ensure that the mandatory prerequisite XML::LibXML will load. The biggest gotcha here is that older (<5.20.0.2) versions of Strawberry Perl and some other Win32 perls may not support the script 'pkg-config' which is required. You can resolve this by installation PkgConfig as 'pkg-config'. I have included the short version of that process below but download the full PkgConfig distribution and read README.win32 file for other options and much more explanation.
this will conflict with any existing pkg-config installed
C:\> cpanm PkgConfig --configure-args=--script=pkg-config
It may be that you still need to use a system package manager to load the 'libxml2-devel' library. If this is the case or you experience any other installation issues please submit them to github especially if they occur prior to starting the test suit as these failures will not auto push from CPAN Testers so I won't know to fix them!
2. Download a compressed file with this package code from your favorite source
3. Extract the code from the compressed file.
If you are using tar on a .tar.gz file this should work:
tar -zxvf Spreadsheet-XLSX-Reader-LibXML-v0.xx.tar.gz
4. Change (cd) into the extracted directory
5. Run the following
(for Windows find what version of make was used to compile your perl)
perl -V:make
(then for Windows substitute the correct make function (s/make/dmake/g)? below)
perl Makefile.PL
make
make test
make install # As sudo/root
make clean
SUPPORT
TODO
1. Add POD for all the new chart methods!
1. Build an 'Alien::LibXML::Devel' package to load the libxml2-devel libraries from source and require that and Alien::LibXML in the build file. So all needed requirements for XML::LibXML are met
Both libxml2 and libxml2-devel libraries are required for XML::LibXML
2. Add a pivot table reader (Not just read the values from the sheet)
3. Add calc chain methods
4. Add more exposure to workbook formatting methods
5. Build a DOM parser alternative for the sheets
(Theoretically faster than the reader but uses more memory)
AUTHOR
Jed Lund
jandrew@cpan.org
CONTRIBUTORS
This is the (likely incomplete) list of people who have helped make this distribution what it is, either via code contributions, patches, bug reports, help with troubleshooting, etc. A huge 'thank you' to all of them.
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
This software is copyrighted (c) 2014, 2015 by Jed Lund
DEPENDENCIES
List::Util - 1.33
Moose - 2.1213
MooseX::ShortCut::BuildInstance - 1.032
Type::Tiny - 1.000
version - 0.077
SEE ALSO
Spreadsheet::ParseExcel - Excel version 2003 and earlier
Spreadsheet::XLSX - Excel version 2007 and later
Spreadsheet::ParseXLSX - Excel version 2007 and later
All lines in this package that use Log::Shiras are commented out