NAME
Spreadsheet::Reader::ExcelXML::XMLReader - A minimal pure-perl xml reader class
SYNOPSIS
package MyPackage;
use MooseX::StrictConstructor;
use MooseX::HasDefaults::RO;
# You have to 'use' or build the Workbook here or the XMLReader won't load
# -> because the reader uses a regex to scrape imported methods
use Spreadsheet::Reader::ExcelXML::Workbook;
extends 'Spreadsheet::Reader::ExcelXML::XMLReader';
DESCRIPTION
This documentation is written to explain ways to use this module when writing your own Excel spreadsheet parser. I suppose the class could be used more generally but that's not why I wrote it and for now I have no intention of providing a full xml toolbox. For Excel spreadsheet parsing generally, please start at the top level documentation for Workbooks, Worksheets, and Cells.
This class is meant to be used as the base reading class for specific types of xml files. The reader for those specific files will include roles that are useful for that file's content. When the file first loads it will store some available information from the header (?) nodes and move to the first file node. At that point it will check if any of the consuming roles have a method '_load_unique_bits'. If so, it will call that method for additional meta data collection by that role.
This class will process the xml file in a just in time fashion, holding enough information to know the level and the open nodes not yet closed but nothing else. The intent is to use as little RAM as possible and process the file in the most (pure perl) computationally efficient way possible. I welcome all suggestions for improvement.
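The just in time idea can be sketched in plain Perl. This is only an illustration and not the code used by this class: the helper name max_open_depth is hypothetical, it reads from an in-memory handle with '<' as the input record separator, and it assumes no '>' or '<' characters appear inside attribute values. It keeps nothing but the names of the nodes that are open and not yet closed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this module): walk an xml string one tag at
# a time, remembering only the names of nodes that are open but not yet closed.
sub max_open_depth {
    my ($xml) = @_;
    open( my $fh, '<', \$xml ) or die "Cannot open in-memory handle: $!";
    local $/ = '<';    # each read ends where the next tag starts
    my @open_nodes;
    my $max_depth = 0;
    while ( defined( my $chunk = <$fh> ) ) {
        chomp $chunk;                         # drop the trailing '<'
        next unless $chunk =~ /^([^>]*)>/;    # no '>' means no tag in this chunk
        my $tag = $1;
        next if $tag =~ /^[?!]/;              # skip header and DOCTYPE style nodes
        if ( $tag =~ m{^/} ) {                # closing tag: leave the node
            pop @open_nodes;
        }
        elsif ( $tag =~ m{/\s*$} ) {          # self closing tag: nothing to keep
            my $depth = @open_nodes + 1;
            $max_depth = $depth if $depth > $max_depth;
        }
        else {                                # opening tag: keep the name only
            my ($name) = split /\s+/, $tag;
            push @open_nodes, $name;
            $max_depth = @open_nodes if @open_nodes > $max_depth;
        }
    }
    return $max_depth;
}

print max_open_depth(
    '<?xml version="1.0"?><sst><si><t>text</t></si><si><r val="2"/></si></sst>'
), "\n";
```

Memory use in this sketch grows with the nesting depth of the file rather than with its size, which is the trade the paragraph above describes.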
Attributes
Data passed to new when creating an instance. For modification of these attributes see the listed 'attribute methods'. For general information on attributes see Moose::Manual::Attributes. For ways to manage the instance after it is opened see the Methods.
file
Definition: This attribute holds the file handle for the file being read. If the full file name and path is passed to the attribute the class will coerce that into an IO::File file handle.
Default: no default - this must be provided to read a file
Required: yes
Range: any unencrypted xml file name and path or IO::File file handle set to read.
attribute methods Methods provided to adjust this attribute
set_file
Definition: change the file value in the attribute (this will reboot the file instance and lock the file)
get_file
Definition: Returns the file handle of the file even if a file name was passed
has_file
Definition: this is used to see if the file loaded correctly.
clear_file
Definition: this clears (and unlocks) the file handle
Delegated Methods
closes the file handle
allows seek commands to be passed to the file handle
returns the next line of the file handle with '<' set as the input_record_separator ($/)
workbook_inst
Definition: This attribute holds a reference to the top level workbook (parser). The purpose is to use some of the methods provided there.
Default: no default
Required: not strictly for this class, but the attribute is provided to give self referential access to general workbook settings and methods for composed classes that inherit this as a base class.
Range: isa => 'Spreadsheet::Reader::ExcelXML::Workbook'
attribute methods Methods provided to adjust this attribute
set_workbook_inst
set the attribute with a workbook instance
Delegated Methods (required) Methods delegated to this module by the attribute. All methods are delegated with the method name unchanged. Follow the link to review documentation of the provider for each method. As you can see several are delegated through the Workbook level and don't originate there.
"get_group_return_type" in Spreadsheet::Reader::ExcelXML
"counting_from_zero" in Spreadsheet::Reader::ExcelXML
"are_spaces_empty" in Spreadsheet::Reader::ExcelXML
"has_shared_strings_interface" in Spreadsheet::Reader::ExcelXML
"should_skip_hidden" in Spreadsheet::Reader::ExcelXML
"spreading_merged_values" in Spreadsheet::Reader::ExcelXML
"starts_at_the_edge" in Spreadsheet::Reader::ExcelXML
"get_empty_return_type" in Spreadsheet::Reader::ExcelXML
"get_values_only" in Spreadsheet::Reader::ExcelXML
"get_epoch_year" in Spreadsheet::Reader::ExcelXML
"get_error_inst" in Spreadsheet::Reader::ExcelXML
"has_styles_interface" in Spreadsheet::Reader::ExcelXML
"boundary_flag_setting" in Spreadsheet::Reader::ExcelXML
"is_empty_the_end" in Spreadsheet::Reader::ExcelXML
"get_rel_info" in Spreadsheet::Reader::ExcelXML
"get_sheet_info" in Spreadsheet::Reader::ExcelXML
"get_sheet_names" in Spreadsheet::Reader::ExcelXML
"collecting_merge_data" in Spreadsheet::Reader::ExcelXML
"collecting_column_formats" in Spreadsheet::Reader::ExcelXML
"set_error( $error_string )" in Spreadsheet::Reader::ExcelXML::Error
"get_defined_conversion( $position )" in Spreadsheet::Reader::Format
"set_defined_excel_formats( %args )" in Spreadsheet::Reader::Format
"parse_excel_format_string( $string, $name )" in Spreadsheet::Reader::Format
"change_output_encoding( $string )" in Spreadsheet::Reader::Format
"get_shared_string( $positive_int|$name )" in Spreadsheet::Reader::ExcelXML::SharedStrings
xml_version
Definition: This stores the xml version read from the xml header. It is read when the file handle is first set in this sheet.
Default: no default - this is auto read from the header
Required: no
Range: xml versions
attribute methods Methods provided to adjust this attribute
version
get the stored xml version
xml_encoding
Definition: This stores the data encoding of the xml file from the xml header. It is read when the file handle is first set in this sheet.
Default: no default - this is auto read from the header
Required: no
Range: valid xml file encoding
attribute methods Methods provided to adjust this attribute
encoding
get the attribute value
has_encoding
predicate for the attribute value
xml_progid
Definition: This is an attribute found in a secondary xml header that is associated with Excel 2003 xml based files. The value can be tested to see if the file was intended to be compliant with that format.
Default: no default - this is auto read from the header
Required: no
Range: a string
attribute methods Methods provided to adjust this attribute
progid
get the attribute value
has_progid
predicate for the attribute value
xml_header
Definition: This stores the primary xml header string from the xml file. It is read when the file handle is first set in this sheet. It contains both the version and the encoding where available and is used when building subsets of the file as standalone xml.
Default: no default - this is auto read from the header
Required: no
Range: valid xml file header
attribute methods Methods provided to adjust this attribute
get_header
get the attribute value
_set_xml_header
set the attribute value
xml_doctype
Definition: This stores the DOCTYPE indicated in the XML header !DOCTYPE
Default: no default - this is auto read from the header
Required: no
Range: whatever it finds
attribute methods Methods provided to adjust this attribute
doctype
get the attribute value
has_doctype
predicate for the attribute
position_index
Definition: This attribute is available to facilitate other consuming roles and classes. Of this attribute's methods only the 'clear_location' method is used in this class, during the start_the_file_over method. It can be used for tracking positions with the same node name.
Default: no default - this is mostly managed by the role or child class
Required: no
Range: Integer
attribute methods Methods provided to adjust this attribute
where_am_i
get the attribute value
i_am_here
set the attribute value
clear_location
clear the attribute value
has_position
predicate for the attribute value
file_type
Definition: This is a static attribute that shows the file type
Default: xml
attribute methods Methods provided to adjust this attribute
get_file_type
get the attribute value
stacking
Definition: A pure perl xml parser will in general be slower than the C equivalent. To provide some acceleration on the way to a target destination you can turn off the stack trace, which skips building and storing the trace elements. This breaks things, so don't do it without a solid understanding of what is happening. For instance, if you turn this off and then call the method parse_element, the parse_element method will have to turn the stack trace back on on its own to build the element tree. The issue is that the most recent element at the base of the tree won't be available to build from. You will need to manually build it and push it to the stack. See the methods initial_node_build and add_node_to_stack to implement this.
Default: 1 = the stack trace is on
attribute methods Methods provided to adjust this attribute
should_be_stacking
get the attribute value
change_stack_storage_to( $Bool )
Turn the stack trace(r) state to $Bool (1 = on)
Methods
These are the methods provided by this class.
start_the_file_over
Definition: Clears the position_index, the old stack trace, and kick starts stack trace tracking again. It then uses seek(0, 0) to reset the file handle to the beginning. Finally, it reads the file until it gets to the first non-xml header node.
Accepts: nothing
Returns: nothing
good_load( $state )
Definition: a setter method to indicate if the file loaded correctly. This generally should be set by consuming roles in the '_load_unique_bits' phase.
Accepts: (1|0)
Returns: nothing
loaded_correctly
Definition: a getter method to understand if the file loaded correctly. This is generally used by consumers of the instance to see if there was any trouble during the initial build.
Accepts: nothing
Returns: 1 = good build, 0 = bad build
parse_element( [$depth] )
Definition: This will read and store the full node from the current position down to an optional $depth. When the parse is complete the parser will be positioned at the beginning of the next node. The node does not include the top name but will include attributes.
Accepts: $depth = optional
Returns: A perl hash reference where all nodes at a level are listed using three hashref keys: list_keys, list, and attributes. The 'attributes' key points to a hash reference containing that node's attributes. The 'list_keys' key points to an array reference with all the node names for each node at the next level down. The 'list' key points to an array reference of nodes or node values matching the position of the list_keys. There are two special case exceptions to this. First, for text values the node is listed as { raw_text => 'text node content' }. Second, if the attributes only include a 'val' key the node stores this under the 'val' key rather than under the 'attributes' key with a sub key 'val'.
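As an illustration of the description above, the following sketch writes out by hand what the returned structure might look like for a small node. The sample xml and the exact nesting are assumptions made from the wording of this section, not output captured from the module.

```perl
use strict;
use warnings;

# Assumed shape for: <si count="1"><t>Hello</t><r val="2"/></si>
# (hand written from the description, not captured from parse_element)
my $node = {
    attributes => { count => '1' },    # attributes of the parsed node itself
    list_keys  => [ 't', 'r' ],        # names of the nodes one level down
    list       => [
        { raw_text => 'Hello' },       # text value special case
        { val      => '2' },           # lone 'val' attribute special case
    ],
};

# list_keys and list are parallel arrays, so they are walked by index
my @pairs;
for my $i ( 0 .. $#{ $node->{list_keys} } ) {
    my $child = $node->{list}[$i];
    my $value = $child->{raw_text} // $child->{val};
    push @pairs, "$node->{list_keys}[$i]=$value";
}
print join( ', ', @pairs ), "\n";
```

Keeping names and values in two parallel arrays lets repeated node names at the same level keep their order, which a plain hash keyed on node name could not do.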
advance_element_position( $element, [$iterations] )
Definition: This will move the xml file reader forward until it finds the identified named $element. If the reader is already at an element of that name it will index forward until it finds the next $element of that name. If the optional positive $iterations integer is passed it will index forward to the named $element that many times.
Accepts: $element = a case sensitive xml node name found forward of the current position in the file. [$iterations] = an optional positive integer indicating how many times to index forward to the named $element.
Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )
$success = a boolean value indicating whether the desired goal was met
$node_name = the actual node name for the final position (should match $element if $success)
$node_level = the level of the final named node in the stack (not the sub text node)
$return_node_ref = When the stacking attribute is on this returns the last elements displaced from the stack by the traverse of the xml tree. When stacking is off this returns an array ref of values used as the second argument in initial_node_build.
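A minimal sketch of consuming the four return values follows. The helper levels_of and the stand-in class Local::FakeReader are hypothetical and exist only so the example runs without this module installed; with a real instance only the helper would be needed. The values the stand-in returns on failure are an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the level of every $element found forward of
# the current position. $reader is any object offering
# advance_element_position with the four value return described above.
sub levels_of {
    my ( $reader, $element ) = @_;
    my @levels;
    while (1) {
        my ( $success, $node_name, $node_level, $return_node_ref ) =
            $reader->advance_element_position($element);
        last unless $success;
        push @levels, $node_level;
    }
    return @levels;
}

# Stand-in reader used only so the sketch runs without the real module
package Local::FakeReader {
    sub new {
        my ( $class, @nodes ) = @_;
        return bless { nodes => [@nodes] }, $class;
    }

    sub advance_element_position {
        my ( $self, $element ) = @_;
        while ( my $node = shift @{ $self->{nodes} } ) {
            return ( 1, $node->[0], $node->[1], [] ) if $node->[0] eq $element;
        }
        return ( 0, undef, undef, undef );    # assumed failure values
    }
}

my $reader = Local::FakeReader->new( [ sst => 0 ], [ si => 1 ], [ t => 2 ], [ si => 1 ] );
print join( ',', levels_of( $reader, 'si' ) ), "\n";
```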
next_sibling
Definition: This will move the xml file reader forward until it finds the next node at the same level as the current node within the same supernode. If this method finds a higher node prior to finding a node at the same level it will return failure and stop reading.
Accepts: nothing
Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )
$success = a boolean value indicating whether the desired goal was met
$node_name = the actual node name for the final position
$node_level = the level of the final named node in the stack (not the sub text node)
$return_node_ref = When the stacking attribute is on this returns the last elements displaced from the stack by the traverse of the xml tree. When stacking is off this returns an array ref of values used as the second argument in initial_node_build.
skip_siblings
Definition: This will move the xml file reader forward until it finds the next node at a higher level. It will not stop on end nodes, so it will continue past all closed nodes until it comes to the first open or self contained node above the current node.
Accepts: nothing
Returns: a list of 4 positions ( $success, $node_name, $node_level, $return_node_ref )
$success = a boolean value indicating whether the desired goal was met
$node_name = the actual node name for the final position
$node_level = the level of the final named node in the stack (not the sub text node)
$return_node_ref = When the stacking attribute is on this returns the last elements displaced from the stack by the traverse of the xml tree. When stacking is off this returns an array ref of values used as the second argument in initial_node_build.
current_named_node
Definition: When processing xml files in a just in time fashion there will be some ambiguity surrounding text nodes:
<t>sometext</t>
<s>
<r val="2"/>
In the 't' node example the content between the '>' character and the '<' character is intentional and valuable to the data set. In the 's' and 'r' node example the space between those characters is only intended for human readability. This parser will not be able to tell the value of the content after the 's' node '>' character until the 'r' node is read. At that point the 's' node will no longer be the 'current' position. To resolve this, all content other than '' between '>' and '<' is treated as a node until the next node is read. Because these nodes are ambiguous, the idea of a 'named node' is valuable, and knowing what the most recent named node is can be useful. This method returns either the last read node or, if the last node is a raw text node, the second to last node. In the first example it would return the 't' node and in the second example it would return the 's' node.
Accepts: nothing
Returns: a hash ref of information about the node containing the following keys;
level => counting from 0 at the start of the file and moving up
type => regular = xml named node|#text = node built from the contents between the > and < characters
name => the xml node name (for #text nodes this is 'raw_text')
closed => (closed|open) depending on the current tag state
initial_string => The string inside the < > quotes prior to parsing
[attributes] => all attributes and values will be stored under the attribute name
[val] => special case storage of one attribute
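The selection rule described above can be sketched as follows. The helper last_named_node is hypothetical, and the list of nodes and their values are illustrative assumptions about how nodes read so far might be held; only the key names come from the list above.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the rule above: prefer the last node read,
# unless it is a '#text' node, in which case fall back one position.
sub last_named_node {
    my (@nodes_read) = @_;
    return unless @nodes_read;
    my $last = $nodes_read[-1];
    return $last if $last->{type} ne '#text';
    return @nodes_read > 1 ? $nodes_read[-2] : ();
}

# Nodes as they might look after reading '<t>sometext' (values are illustrative)
my @nodes_read = (
    { level => 2, type => 'regular', name => 't', closed => 'open' },
    { type  => '#text', name => 'raw_text' },
);
my $named = last_named_node(@nodes_read);
print "$named->{name} at level $named->{level}\n";
```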
squash_node( $node )
Definition: This takes a $node from the parse_element output and turns it into a more perl like reference. It checks the list_keys and if there are any duplicates it takes the list values and uses them as elements of an array ref assigned to a hash key called list. If there are no duplicates in the list_keys it turns the list_keys into hash keys with the list elements assigned as values. It then takes the attributes and mingles them in the hashref with the prior results. There are two special cases for node reorganization. For nodes with a 'val' in the 'list_keys', the element in the same position of the 'list' is returned as the whole ref. If there is a raw_text node it is returned as a hashref with one key 'raw_text' with the text itself as the value. This is all done recursively so lower layers are assigned to upper layers using the rules above.
Accepts: the output of a parse_element call
Returns: a perl data structure with the xml organization removed
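The two main rules above (duplicate list_keys versus unique list_keys) and the mingling of attributes can be sketched in standalone Perl. The function squash_sketch is a hypothetical reimplementation for illustration only, not the code behind squash_node. It passes raw_text and 'val' nodes through unchanged and does not attempt the 'val' in 'list_keys' special case. The input structure is hand written.

```perl
use strict;
use warnings;

# Hypothetical sketch of the rules above, not the module's squash_node
sub squash_sketch {
    my ($node) = @_;
    # Plain values and nodes outside the three key layout (for example
    # raw_text nodes) are passed through unchanged
    return $node
        if ref $node ne 'HASH'
        or !grep { exists $node->{$_} } qw( list_keys list attributes );
    my @keys   = @{ $node->{list_keys} // [] };
    my @values = map { squash_sketch($_) } @{ $node->{list} // [] };
    my %seen;
    my $has_duplicates = grep { $seen{$_}++ } @keys;
    my %result;
    if ($has_duplicates) {
        $result{list} = \@values;      # repeated names: keep an ordered list
    }
    else {
        @result{@keys} = @values;      # unique names: promote to hash keys
    }
    # attributes share the same hash as the children
    %result = ( %result, %{ $node->{attributes} // {} } );
    return \%result;
}

my $parsed = {
    attributes => { count => '2' },
    list_keys  => [ 'si', 'si' ],
    list       => [
        { list_keys => ['t'], list => [ { raw_text => 'Hello' } ] },
        { list_keys => ['t'], list => [ { raw_text => 'World' } ] },
    ],
};
my $squashed = squash_sketch($parsed);
print "$squashed->{count} strings, second is $squashed->{list}[1]{t}{raw_text}\n";
```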
extract_file( @node_list )
Definition: This will build an xml file and load it to an IO::Handle->new_tmpfile object. The xml is built from whole extracted xml strings defined by @node_list. If none of the node list elements is found in the parsed file then the first listed element from the node list will be used to create an empty self closing node.
Accepts: @node_list = Node list items can either be xml node name strings or array refs composed of two elements, first the node name and second the iterated position. Ex.
@node_list_example = ( 'r', [ 'si', 3 ] );
In this example the extracted file would contain the first 'r' node and the 3rd 'si' node. There is an exception case where you just want the whole file passed. To do this, pass 'ALL_FILE' as the first element of the @node_list and a complete copy of the file_handle in read mode will be passed.
Returns: a File::Temp file handle loaded with an xml header and the listed nodes.
current_node_parsed
Definition: When nodes are read they are not completely processed to save cycles. If you want a fully processed result from the current node position including any embedded text then this is the method for you.
Accepts: Nothing
Returns: a perl ref equivalent to the squash_node call. This only returns the fully processed current_named_node and any sub text nodes.
close_the_file
Definition: The file(handle) may not be needed during the whole workbook parse. If so, you can use this method to close (and clear / release) an open file handle as appropriate.
Accepts: Nothing
Returns: Nothing (the file handle is closed and cleared)
not_end_of_file
Definition: This is a poor man's End Of File test (EOF). The reader builds a node stack to keep track of where it is in the xml parse and when it runs out of nodes it means you are back at the top of the stack.
Accepts: Nothing
Returns: a count of the nodes in the node stack (header nodes are processed early on and are read and removed as part of startup)
initial_node_build( $node_name, $attribute_list_ref )
Definition: Generally this is an internal method and should not be used. However, in order to provide a faster forward ability the node stack trace(ing) can be turned off. When you want to turn it back on you have to manually build the top node using this method and store it to the node stack using add_node_to_stack. This method will build the essentials for adding to the node stack. Please note that it will not necessarily get the node level right. If you need that to be correct then don't turn off the stack trace. It will not build raw_text nodes correctly.
Accepts: $node_name = a string without spaces for the name of the node, $attribute_list_ref = This is basically everything else in the xml tag except the name split on /\s+/. Any self closing '/' should be removed prior to the split.
Returns: a node ref that can be added to the node stack to kickstart stack tracing
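The argument preparation described under Accepts can be sketched as below. The helper split_tag is hypothetical and assumes the tag text does not begin with whitespace. Note that splitting on whitespace, as described, would also split any attribute value that itself contains spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the text found between '<' and '>' into the two
# arguments described above.
sub split_tag {
    my ($tag) = @_;
    $tag =~ s{\s*/\s*$}{};    # remove any self closing '/' before the split
    my ( $node_name, @attribute_list ) = split /\s+/, $tag;
    return ( $node_name, \@attribute_list );
}

my ( $name, $attribute_list_ref ) = split_tag('r val="2" t="s"/');
print "$name: @$attribute_list_ref\n";

# With a real instance in $reader these would then be passed on as
# $reader->initial_node_build( $name, $attribute_list_ref );
```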
add_node_to_stack( $node_ref )
Definition: Generally this is an internal method and should not be used. However, in order to provide a faster forward ability the node stack trace(ing) can be turned off. When you want to turn it back on you have to manually build the top node and store it to the node stack using this method. Adding a node after the stack trace has been turned off will create a discontinuity where the new node is added. Stack trace operations above this node will generally fail and stop the script.
Accepts: $node_ref = a top node to push on the node stack for traceability
Returns: nothing
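Putting the stacking attribute methods together with initial_node_build and add_node_to_stack, a fast forward might look like the sketch below. The helper fast_forward_to and the stand-in class Local::RecordingReader are hypothetical; the stand-in only records calls so the example runs without this module. The point at which stack tracing is switched back on relative to the rebuild is an assumption, not something this documentation confirms.

```perl
use strict;
use warnings;

# Hypothetical helper: jump to $element with stack tracing off, then rebuild
# the top node so tracing can resume.
sub fast_forward_to {
    my ( $reader, $element ) = @_;
    $reader->change_stack_storage_to(0);
    my ( $success, $node_name, $node_level, $return_node_ref ) =
        $reader->advance_element_position($element);
    if ($success) {
        # With stacking off the fourth value is documented as the second
        # argument for initial_node_build
        my $node_ref = $reader->initial_node_build( $node_name, $return_node_ref );
        $reader->add_node_to_stack($node_ref);
    }
    $reader->change_stack_storage_to(1);    # assumed ordering
    return $success;
}

# Stand-in reader that only records calls (not the real class)
package Local::RecordingReader {
    sub new {
        my ($class) = @_;
        return bless { calls => [] }, $class;
    }
    sub change_stack_storage_to { push @{ $_[0]{calls} }, "stacking=$_[1]"; return }
    sub advance_element_position {
        push @{ $_[0]{calls} }, "advance=$_[1]";
        return ( 1, $_[1], 2, ['val="2"'] );
    }
    sub initial_node_build { push @{ $_[0]{calls} }, "build=$_[1]"; return { name => $_[1] } }
    sub add_node_to_stack  { push @{ $_[0]{calls} }, "add=$_[1]{name}"; return }
}

my $reader = Local::RecordingReader->new;
fast_forward_to( $reader, 'r' );
print join( ' ', @{ $reader->{calls} } ), "\n";
```

As the definitions above warn, the rebuilt node may carry the wrong level and stack operations above it will generally fail, so this pattern only suits cases where nothing above the target node is needed afterwards.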
SUPPORT
TODO
1. Nothing currently
AUTHOR
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
This software is copyrighted (c) 2016 by Jed Lund
DEPENDENCIES
Spreadsheet::Reader::ExcelXML - the package
SEE ALSO
Spreadsheet::Read - generic Spreadsheet reader
Spreadsheet::ParseExcel - Excel binary version 2003 and earlier (.xls files)
Spreadsheet::XLSX - Excel version 2007 and later
Spreadsheet::ParseXLSX - Excel version 2007 and later
All lines in this package that use Log::Shiras are commented out