NAME
Spreadsheet::Reader::ExcelXML::Worksheet - Top level Worksheet interface
SYNOPSIS
use strict;
use warnings;
use Data::Dumper;
use Spreadsheet::Reader::ExcelXML;
my $workbook = Spreadsheet::Reader::ExcelXML->new( #similar style to Spreadsheet::XLSX
file => 't/test_files/TestBook.xlsx',# in the test folder of this package
group_return_type => 'value',
);
if ( !$workbook->file_opened ) {
die $workbook->error(), ".\n";
}
my $worksheet = $workbook->worksheet( 'Sheet5' );
$worksheet->set_custom_formats( {
2 =>'yyyy-mm-dd',
} );
my $value;
while( !$value or $value ne 'EOF' ){
$value = $worksheet->fetchrow_arrayref;
print Dumper( $value );
}
###########################
# SYNOPSIS Output
# $VAR1 = [ 'Superbowl Audibles', 'Column Labels' ];
# $VAR1 = [ 'Row Labels', 2016-02-06', '2017-02-14', '2018-02-03', 'Grand Total' ];
# $VAR1 = [ 'Blue', '10', '7', '', '17' ];
# $VAR1 = [ 'Omaha', '', '', '2', '2' ];
# $VAR1 = [ 'Red', '30', '5', '3', '38' ];
# $VAR1 = [ 'Grand Total', '40', '12', '5', '57' ];
# $VAR1 = 'EOF';
###########################
The best example for use of this module alone is the test file in this package t/Spreadsheet/Reader/ExcelXML/10-worksheet.t
DESCRIPTION
This module is the worksheet interface. Even thought it will not provide all worksheet access itself it's role is to manage a consistent way of parsing Excel worksheets. If the sheet/tab is a 'chartsheet' then please review the documentation for Chartsheets . The documentation in this file will include all publicly accesable elements. This include elements provided through other roles and the base class. The purpose of this instance is to extract worksheet level formats and cell level content for parsing. The workbook also has several Settings that affect the outcome of methods from this interface. For example the workbook level attribute setting "count_from_zero" in Spreadsheet::Reader::ExcelXML will directly affect how this data is returned when retreiving cell data. Please review all workbook level attribute documentation for information covering those possibilities. It is best to generate a worksheet instance from the workbook class using one of the various worksheet methods.
Modification of this Interface
The final worksheet instance provided by this package is an amalgam of a base class and a few roles aggregated at run time based on attribute settings from the workbook level class. The easiest way to modify behaviour in this instance is to build an additional or replacement element and have it built into the interface by the workbook. The settings for this type of injection/adjustement are stored in the raw code of Spreadsheet::Reader::ExcelXML::Workbook within the 'worksheet_interface' key of the $parser_modules variable. The file t/Spreadsheet/Reader/ExcelXML/10-worksheet.t in the distribution represents a good 'under the hood' example of the way all the elements are integrated into the larger worksheet class as a whole.
Methods
These are the various functions that are available to extract cell (values) to read or to extract worksheet level formats. There are several different methods provided to extract the same content in different ways. All the methods are object methods performed on the worksheet.
Example:
my $cell_data = $worksheet->get_cell( $row, $column );
min_row
Definition: This is the minimum row determined when the sheet is opened. This value is affected by the workbook attributes from_the_edge, and count_from_zero
Accepts: nothing
Returns: an integer $row
has_min_row
Definition: The predicate of min_row
max_row
Definition: This is the maximum row with data listed in the sheet. This value is affected by the workbook attribute count_from_zero. Warning: This value is extracted from the sheet metadata, however if your sheet has been damaged or 'adjusted' by non-microsoft code (This is more common than you would think in the data processing world) then this value may be wrong or missing when the sheet is first opened. The goal of this package is to minimize memory consumption so it will learn what the correct value is over the first pass through the sheet as you collect data but it does not attempt to validate this value in detail initially. If you have an idea of the range for a damaged sheet before you open it you can use EOF flags. Otherwise the methods get_next_value or fetchrow_arrayref are recomended.
Accepts: nothing
Returns: an integer
has_max_row
Definition: The predicate of max_row
row_range
Definition: This returns a list containing the minimum row number followed by the maximum row number. This list is affected by the workbook attributes from_the_edge, and count_from_zero Warning: This result is extracted from the sheet metadata, however if your sheet has been damaged or 'adjusted' by non-microsoft code (This is more common than you would think in the data processing world) then the return list may be wrong or missing when the sheet is first opened. The goal of this package is to minimize memory consumption so it will learn what the correct list is over the first pass through the sheet as you collect data but it does not attempt to validate this list in detail initially. If you have an idea of the range for a damaged sheet before you open it you can use EOR-EOF flags. Otherwise the methods get_next_value or fetchrow_arrayref are recomended. For missing values the minimum is set to the first row and the maximum is set to undef.
Accepts: nothing
Returns: ( $minimum_row, $maximum_row )
min_col
Definition: This is the minimum column with data listed in the sheet. This value is affected by the workbook attributes from_the_edge, and count_from_zero
Accepts: nothing
Returns: an integer
has_min_col
Definition: The predicate of min_col
max_col
Definition: This is the maximum row with data listed in the sheet. This value is affected by the workbook attribute count_from_zero Warning: This value is extracted from the sheet metadata, however if your sheet has been damaged or 'adjusted' by non-microsoft code (This is more common than you would think in the data processing world) then this value may be wrong or missing when the sheet is first opened. The goal of this package is to minimize memory consumption so it will learn what the correct value is over the first pass through the sheet as you collect data but it does not attempt to validate this value in detail initially. If you have an idea of the range for a damaged sheet before you open it you can use EOR flags. Otherwise the methods get_next_value or fetchrow_arrayref are recomended.
Accepts: nothing
Returns: an integer
has_max_col
Definition: The predicate of max_col
col_range
Definition: This returns a list containing the minimum column number followed by the maximum column number. This list is affected by the workbook attributes from_the_edge, and count_from_zero
Accepts: nothing
Returns: ( $minimum_column, $maximum_column )
get_merged_areas
Definition: This method returns an array ref of cells that are merged. This method does respond to the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML Warning: This result is extracted from the sheet metadata for 2007+ Excel files, however if you are parsing an Excel 2003 xml file this data is stored at the cell level. Since this parser reads the data 'Just In Time' it will not know about a set of merged cells until the upper left cell of the group has been read.
Accepts: nothing
Returns: An arrayref of arrayrefs of merged areas or undef if no merged areas
[ [ $start_row_1, $start_col_1, $end_row_1, $end_col_1], etc.. ]
is_column_hidden
Definition: Method indicates if the excel program would hide the identified column(s) or show it|them if the file were opened in the Microsoft Excel application. If more than one column is passed then it returns true if any of the columns are hidden in scalar context and a list of 1 and 0 values for each of the requested positions in array (list) context. This method (input) does respond to the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML. For 2003 xml files this data is stored at the cell level so the parser will not 'know' until you have read (past) the cell.
Accepts: integer values or column letter values selecting the columns in question
Returns: in scalar context it returns a boolean value indicating if any of the requested columns would be hidden by Excel. In array/list context it returns a list of boolean values for each requested column indicating it's hidden state for Excel. (1 = hidden)
Examples: (Indicating the 3rd and 4th of 6 columns are hidden)
$worksheet_instance->is_column_hidden( 0 .. 5 ),
###########################
# Example Output
# [ 0, 0, 1, 1, 0, 0 ]
###########################
$worksheet_instance->is_column_hidden( 'A', 'B', 'C', 'D', 'E', 'F' )
###########################
# Example Output
# [ 0, 0, 1, 1, 0, 0 ]
###########################
is_row_hidden
Definition: Method indicates if the excel program would hide the identified row(s) or show it|them if the file were opened in the Microsoft Excel application. If more than one row is passed then it returns true if any of the rows are hidden in scalar context and a list of 1 and 0 values for each of the requested positions in array (list) context. This method (input) does respond to the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML. Warning: This method will only be accurate after the user has read at least one cell from or past the row inspected for it's hidden state. This allows the sheet to avoid reading all the way through once before starting the cell parsing.
Accepts: integer values selecting the rows in question
Returns: in scalar context it returns a boolean value indicating if any of the requested rows would be hidden by Excel. In array/list context it returns a list of boolean values for each requested row indicating it's hidden state for Excel. (1 = hidden)
get_cell( $row, $column )
Definition: Indicate both the requested row and requested column and the information for that position will be returned. Both $row and $column are required
Accepts: the list ( $row, $column ) both required See the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML to understand which row and column are returned for $row and $colum.
Returns: see the attribute "group_return_type" in Spreadsheet::Reader::ExcelXML for details on what is returned
get_next_value
Definition: Reading left to right and top to bottom this will return the next cell with a value. This can includes cells with no value but some unique formatting such as cells that have been merged with other cells. See the attributes "values_only" in Spreadsheet::Reader::ExcelXML and "spaces_are_empty" in Spreadsheet::Reader::ExcelXML for more information.
Accepts: nothing
Returns: see the attribute "group_return_type" in Spreadsheet::Reader::ExcelXML for details on what is returned
fetchrow_arrayref( $row )
Definition: In an homage to DBI I included this function to return an array ref of the cells or values in the requested $row. If no row is requested this returns the 'next' row. In the array ref any empty cell will show as 'undef'.
Accepts: undef = next|$row = a row integer indicating the desired row See the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML to understand which row is returned for $row.
Returns: an array ref of all possible column positions in that row with data filled in per the attribute "group_return_type" in Spreadsheet::Reader::ExcelXML.
fetchrow_array( $row )
Definition: This function is just like fetchrow_arrayref except it returns an array instead of an array ref
Accepts: undef = next|$row = a row integer indicating the desired row. See the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML to understand which row is returned for $row.
Returns: an array ref of all possible column positions in that row with data filled in per the attribute "group_return_type" in Spreadsheet::Reader::ExcelXML.
set_headers( @header_row_list, [ \&header_scrubber ] )
Definition: This function is used to set headers used in the function fetchrow_hashref. It accepts a list of row numbers that will be collated into a set of headers used to build the hashref for each row. The header rows are coallated in sequence with the first position taking precedence. The list is also used to set the lowest row of the headers in the table. All rows at that level and higher will be considered out of the table and will return undef while setting the error instance. If some of the header columns do not have values then the instance will auto generate unique headers for each empty header column to fill out the header ref. [ optionally: it is possible to pass a coderef at the end of the list to scrub the headers so they make some sense.] When the headers are loaded the top level information row of any pull for 'fetchrow_hashref' will be considered the first row after the header row. Any header build will not be tested again during fetchrow_hashref the content columns will be matched to the header columns by position. for example;
my $scrubber = sub{
my $input = $_[0];
$input =~ s/\n//g if $input;
$input =~ s/\s/_/g if $input;
return $input;
};
$self->set_headers( 2, 1, $scrubber ); # Called internally as $new_value = $scrubber->( $old_value );
# Returns/stores the headers set at row 2 and 1 with values from row 2 taking precedence
# Then it scrubs the values by removing newlines and replacing spaces with underscores.
Accepts: a list of row numbers (modified as needed by the attribute state of "count_from_zero" in Spreadsheet::Reader::ExcelXML) and an optional closure . See the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML to understand which rows are used when the @header_row_list is called.
Returns: an array ref of the built headers for review.
get_last_header_row
Definition: This returns the final row of headers in the sheet. The return value is in the context of the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML. It is determined by the function 'set_headers'.
Accepts: nothing
Returns: the last (largest) row number used in the coallated header.
header_row_set
Definition: This predicate will indicate if any header build has occured from the method 'set_headers'.
Accepts: nothing
Returns: Boolean on off for set headers (1 = headers are set)
fetchrow_hashref( $row )
Definition: This function is used to return a hashref representing the data in the specified row. If no $row value is passed it will return the 'next' row of data. A call to this function without setting the headers first will return 'undef' and set the error instance.
Accepts: a target $row number for return values or undef meaning 'next' See the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML to understand which rows are targeted by $row.
Returns: a hash ref of the values for that row. This function ignores the attribute group_return_type when it is set to 'instance' and returns 'value's instead. See also the attributes min_header_col and max_header_col to pare the start and end columns of the returned hash ref.
set_custom_formats( $format_ref )
Definition: This package will generate value conversions that generally match the numerical conversions set in the Excel spreadsheet. However, it may be that you want to convert the unformatted values for certain cells, rows, or columns in some user defined way. The simplest way to do this is by storing an Excel custom number format string in this instance using 'set_custom_formats' against either a CellID, a Row Number, or a Column letter. As an example you could say;
$worksheet->set_custom_formats( {
A => '# ?/?',
} );
And any subsequent call for a $cell->value from column 'A' will attempt to convert the raw xml contents of that cell to an integer and fraction combination with one position in the denominator or less (an integer only). If the cell is text then it will act as a pass-through. Where there is overlap between the formats the instance will select one based on the following priority; cellID, column letter, row number.
For the truly adventurous you can build an object instance that has the two following methods; 'assert_coerce' and 'display_name'. Then add it to the attribute as above.
A Complicated Example: Building a converter on the fly from Type::Tiny (or use MooseX::Types)
use DateTimeX::Format::Excel;
use DateTime::Format::Flexible;
use Type::Coercion;
use Type::Tiny;
my @args_list = ( system_type => 'apple_excel' );
my $num_converter = DateTimeX::Format::Excel->new( @args_list );
# build conversion subroutines (number and strings to DateTime objects)
my $string_via = sub{
my $str = $_[0];
return DateTime::Format::Flexible->parse_datetime( $str );
};
my $num_via = sub{
my $num = $_[0];
return $num_converter->parse_datetime( $num );
};
# Combine conversion subroutines into a coercion object!
# (Note numbers are attempted first)
my $date_time_from_value = Type::Coercion->new(
type_coercion_map => [ Num, $num_via, Str, $string_via, ],
);
# Install the coercion in a type that ensures it passes through a DateTime check
$date_time_type = Type::Tiny->new(
name => 'Custom_date_type',
constraint => sub{ ref($_) eq 'DateTime' },
coercion => $date_time_from_value,
);
# Chained coercions! to handle first the $date_time_from_value coercion
# and then build a specific date string output
$string_type = Type::Tiny->new(
name => 'YYYYMMDD',
constraint => sub{
!$_ or (
$_ =~ /^\d{4}\-(\d{2})-(\d{2})$/ and
$1 > 0 and $1 < 13 and $2 > 0 and $2 < 32
)
},
coercion => Type::Coercion->new(
type_coercion_map =>[
$date_time_type->coercibles, sub{
my $tmp = $date_time_type->coerce( $_ );
$tmp->format_cldr( 'yyyy-MM-dd' )
},
],
), );
# Then set the coercions in the worksheet for targeted positions
$worksheet->set_custom_formats( {
E10 => $date_time_type,
10 => $string_type,
D14 => $string_type,
} );
Accepts: a set of $key => $value pairs where the $key can either be a row number, a column letter, or a cell ID. The $value must either be a string that complies with excel custom number formatting conventions or an object instance with two methods 'display_name' and 'assert_coerce'.
Returns: nothing
has_custom_format( $key )
Definition: This is a test to see if a custom format is stored for $key.
Accepts: $key
Returns: a boolean where 1 = formats exist
get_custom_format( $key )
Definition: This will retrieve the $value_ref of the custom format stored against the given $key. For formats saved as Excel custom format strings this will return a package built object instance that performs the conversion indicated by the custom format string.
Accepts: $key
Returns: a $value_ref for the stored format object
get_custom_formats
Definition: This returns a hashref of all stored custom formats
Accepts: nothing
Returns: a hashref of all custom formats
Attributes
These are attributes of the built instance. Most of these are provided as metadata for the specific parsed sheet and should be provided by the object that builds the worksheet instance. (The workbook) However, a few can be set after worksheet creation to affect the data retreival results. These attributes are presented first. The end of the list is the attribute that stores a reference to the workbook. Follow the link in thats attribute documentation to learn a tricksy way to adjust several workbook level settings with the worksheet instance.
min_header_col
Definition: This attribute affects the hashref that is returned in the method fetchrow_hashref. This attribute tells fetchrow_hashref what column to use to start the hash ref build. This attribute (input) does respond to the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML.
Default: undef (which is equivalent to the minimum column of the sheet)
Range: The minimum column of the sheet to or less than the max_header_col
attribute methods Methods provided to adjust this attribute
get_min_header_col
Definition: returns the value stored in the attribute
set_min_header_col
Definition: Sets a new value for the attribute
has_min_header_col
Definition: Indicates if the attribute has a stored value
max_header_col
Definition: This attribute affects the hashref that is returned in the method fetchrow_hashref. This attribute tells fetchrow_hashref what column to use to end the hash ref build. This attribute (input) does respond to the attribute "count_from_zero" in Spreadsheet::Reader::ExcelXML.
Default: undef (equal to the maximum column of the sheet)
Range: The maximum column of the sheet to or less than the min_header_col
attribute methods Methods provided to adjust this attribute
get_max_header_col
Definition: returns the value stored in the attribute
set_max_header_col
Definition: Sets a new value for the attribute
has_max_header_col
Definition: Indicates if the attribute has a stored value
file
Definition: This attribute holds the file handle for the file being read. If the full file name and path is passed to the attribute the class will coerce that into an IO::File file handle.
Default: no default - this must be provided to read a file
Required: yes
Range: any unencrypted xml file name and path or IO::File file handle set to read.
attribute methods Methods provided to adjust this attribute
set_file
Definition: change the file value in the attribute (this will reboot the file instance and should lock the file)
get_file
Definition: Returns the file handle of the file even if a file name was passed
has_file
Definition: this is used to see if the file loaded correctly.
clear_file
Definition: this clears (and unlocks) the file handle
Delegated Methods
closes the file handle
allows seek commands to be passed to the file handle
returns the next line of the file handle with '<' set as the input_record_separator ($/)
sheet_type
Definition: This is the sheet 'type' which will always be 'worksheet' (as opposed to chartsheet)
attribute methods Methods provided to adjust this attribute
get_sheet_type
Definition: returns the value stored in the attribute (worsheet)
sheet_rel_id
Definition: This is the relId of the sheet listed in the XML of the .xlsx file. You probably don't care and you should never set this value.
attribute methods Methods provided to adjust this attribute
rel_id
Definition: returns the value stored in the attribute
sheet_id
Definition: This is the Id of the sheet listed in the XML of the .xlsx file. I beleive this to be the number used in vbscript to reference the sheet. You should never set this value.
attribute methods Methods provided to adjust this attribute
sheet_id
Definition: returns the value stored in the attribute
sheet_position
Definition: This is the visual sheet position in the .xlsx file. You should never set this value.
attribute methods Methods provided to adjust this attribute
position
Definition: returns the value stored in the attribute
sheet_name
Definition: This is the visual sheet name in the .xlsx file on the tab. You should never set this value.
attribute methods Methods provided to adjust this attribute
get_name
Definition: returns the value stored in the attribute
workbook_instance
Definition: This attribute holds a reference back to the workbook instance so that the worksheet has access to the global settings managed there. As a consequence many of the workbook methods are be exposed here. This includes some setter methods for workbook attributes. Beware that setting or adjusting the workbook level attributes with methods here will be universal and affect other worksheets. So don't forget to return the old value if you want the old behavour after you are done. If that doesn't make sense then don't use these methods. (Nothing to see here! Move along.)
Default: a Spreadsheet::Reader::ExcelXML::Workbook instance
attribute methods Methods used to change the workbook instance
set_workbook_inst
Definition: Sets the attribute with a new workbook instance
Delegated attribute methods Since this list can float please follow the link to the documented delegations in that file.
SUPPORT
TODO
1. Nothing Yet!
AUTHOR
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
This software is copyrighted (c) 2016 by Jed Lund
DEPENDENCIES
Spreadsheet::Reader::ExcelXML - the package
SEE ALSO
Spreadsheet::Read - generic Spreadsheet reader
Spreadsheet::ParseExcel - Excel binary version 2003 and earlier (.xls files)
Spreadsheet::XLSX - Excel version 2007 and later
Spreadsheet::ParseXLSX - Excel version 2007 and later
All lines in this package that use Log::Shiras are commented out