NAME
Spreadsheet::ReadGnumeric - read a Gnumeric file, return Spreadsheet::Read
VERSION
Version 0.3
SYNOPSIS
use Spreadsheet::ReadGnumeric;
my $reader = Spreadsheet::ReadGnumeric->new
(rc => 1, cells => 1); # these are the defaults
my $book = $reader->parse('spreadsheet.gnumeric');
my $n_sheets = $book->[0]{sheets};
my $sheet_1 = $book->[1];
my $b12 = $sheet_1->{B12};
# or equivalently:
my $b12 = $sheet_1->{cell}[2][12];
say $b12; # => "not a vitamin", e.g.
DESCRIPTION
Given a source of saved Gnumeric data, Spreadsheet::ReadGnumeric
parses it and returns the result in the same format as Spreadsheet::Read
. In fact, Spreadsheet::Read
has a few more bells and whistles, such as transposing the resulting sheets and providing an object-oriented interface to the resulting data, so it may be easier to access this module through Spreadsheet::Read
, even if you only want to parse Gnumeric data.
Note that Gnumeric only saves raw cell values, and not their formatted versions. In particular, Gnumeric saves expressions, and not expression values, never mind formatted expression values. Even the expressions are sometimes encoded in a way that Spreadsheet::ReadGnumeric
cannot not completely unparse, so the returned data structure will look less well populated than the spreadsheet does in Gnumeric. See the "ExprID" TODO item below.
METHODS
After creating an instance with "new", call "parse" with a file name, a stream, or a string containing the spreadsheet data. The "attr", "cells", "convert_colors" "gzipped_p", "merge", "minimal_attributes", and "rc" slots control how Spreadsheet::ReadGnumeric
parses its input. They may also be used as initializer keywords to the "new" method.
The rest of these deal with the XML parsing mechanism and are probably not of interest (unless you need to write a subclass in order to extract more information).
attr
Slot that, if true, directs the parsing methods to retain style information in the returned $sheet->{attr}
array of each sheet. If the value of this slot is "keep", an arrayref of each sheet's Spreadsheet::Gnumeric::StyleRegion
instances is saved in $sheet->{style_regions}
.
The attr
value is normally supplied to the new
method as an initializer. See the Spreadsheet::Read
Cell Attributes documentation for details of the resulting attributes. See Spreadsheet::Gnumeric::StyleRegion for details of how attribute conversion is performed.
See also the "merge" slot, which requests additional processing of cell merge information.
cells
Slot that contains the value of the Spreadsheet::Read
"cells" option. When true, this populates the $sheet->{$cell_name}
with the corresponding cell data, where $cell_name
is an alphanumeric cell name such as "B17" or "AA3". Defaults to a true value.
chars
Slot that contains the text content of an XML element being parsed.
convert_colors
Slot that contains a boolean which, if defined and true, causes Gnumeric colors to be converted to the more common HTML-style format. Gnumeric specifies colors as RGB triples using hex numbers of one to four digits for each primary color, separated by colons, so that white is "FFFF:FFFF:FFFF", and cyan is "0:FC00:F800". Conversion pads each primary to four digits, takes the first two, and prepends "#", producing "#FFFFFF" for white and "#00FCF8" for cyan.
For Spreadsheet::Read
compatibility, the default is true. This value is ignored unless "attr" is true.
current_attrs
Slot that contains the attribute of an element being parsed.
current_elt
Slot that contains the name of an element being parsed, including any namespace prefix (see namespaces
).
element_stack
Slot that contains an arrayref that is a stack of containing element contexts. The last entry in this is the parent of the current element being parsed, and is itself an arrayref of the values of
[ current_elt, current_attrs, chars, namespaces ]
while that containing element was being parsed.
find_style_for_cell
Given zero-based column and row indices, look in "style_regions" for a suitable Spreadsheet::Gnumeric::StyleRegion
instance, and return its attribute hash.
gzipped_p
Slot that determines whether parse
will test whether its argument is gzipped. There are three possibilities: Undefined, other false, and true. If undefined, then parse
will open the file, make the test, and then reset the stream to the beginning (see "stream_gzipped_p"). Otherwise, parse
will take the caller's word for it.
merge
Slot that requests additional processing of cell merging information. If true, (a) the value of the merged
attribute for merged cells is the name of the upper left corner of the rectangle of merged cells, and (b) the contents and attributes of that upper-left cell are copied into all of the others. The default is false.
This flag is ignored unless the "attr" slot is also true.
minimal_attributes
Slot that, if true, causes attributes to be filled in only for cells that have content, i.e. actually appear as a <Cell>
in the file with a nonempty string string. Otherwise, attributes are filled in for the full $sheet->{maxrow}
vs $sheet->{maxcol}
array. The default is false.
This flag is ignored unless the "attr" slot is also true.
namespaces
Slot that contains a hash of namespace prefix to defining URL, for all prefixes in scope. These are never modified by side effect, so that they can be copied into inner scopes. We use this to decide whether any given name is really in the schema we care about.
new
Creates a new Spreadsheet::ReadGnumeric
instance. Keywords are any of the method names listed as slots.
parse
Given an input source, reads it and returns the Spreadsheet::Read
-compatible data structure, constructed according to the options set up at instantiation time. If the "gzipped_p" slot is defined and true or the content starts with gzip magic (see "stream_gzipped_p"), which is the normal state of saved Gnumeric spreadsheets, it is uncompressed while being read. If gzipped_p
is not defined, the stream is read to see how it starts and then rewound.
The input can be a reference (which is taken to be an open stream), a file name, or a literal string that contains the data. If given an non-seekable stream, the stream_gzipped_p
slot must be defined in order to skip the test for compressed input.
rc
Slot that contains the value of the Spreadsheet::Read
"rc" option. When true, this populates the $sheet->{cell}[$col][$row]
arrays. Defaults to a true value.
sheet
Slot that contains the sheet currently under construction.
sheets
Read-only slot that contains the resulting Spreadsheet::Read
-compatible data structure. This slot is initialized when its value is first requested; it is expected that the value is then updated by side-effect.
stream_gzipped_p
Given a stream open to the start of the data and an optional file name, determine whether we need to uncompress the data and then reset the stream back to the beginning. If the gzipped_p
slot is defined, then we just return that and leave the stream untouched. The file name is only used for error messages.
style_attributes
Internal slot used to store a hashref of spreadsheet attributes during parsing.
style_regions
Slot that contains an arrayref of Spreadsheet::Gnumeric::StyleRegion
instances, built by the style parsing code for each sheet and cleared when the sheet is completed.
INTERNALS
Spreadsheet::ReadGnumeric
uses XML::Parser::Lite
to decode the XML into its component elements, calling the _process_Foo_elt
method at each </Foo>
closing tag (explicit or implicit), where "Foo" is the case-sensitive element name. If no such method exists, the element is ignored, though its content may already have been processed. For instance, we don't care about <Sheets>
because it exists only as a container for <Sheet>
elements, which we do process.
Elements with XMLNS prefixes (using the ":" separator) must be associated with the Gnumeric schema in order to be visible to this parsing mechanism. Elements without prefixes are assumed to be part of the Gnumeric schema (but the only Gnumeric files I've seen so far use prefixes on all elements). Note that element names stored in current_elt
and element_stack
have already been stripped of any Gnumeric-specific prefixes.
When the _process_Foo_elt
method is called, it can examine the "current_attrs" hash for its attributes and the "chars" string for its character content; all element content will already have been processed. If necessary, it can examine the "element_stack" slot to check context. For example, _process_Name_elt
needs to do this because it only processes Sheet element names. The element_stack
also makes the attributes of enclosing elements available.
Consequently, extending Spreadsheet::ReadGnumeric
to extract more information out of Gnumeric files should be a simple matter of defining a subclass with suitable _process_*_elt
methods that examine their context, extract their content, and stuff the result into the sheet
and/or sheets
slot values accordingly. At least that's my theory, and I'm sticking to it.
TODO
Note that the "clip", "strip", and "pivot" options of Spreadsheet::Read
are handled by its _clipsheets
sub, so I do not intend to do anything about them here.
These are not in any necessary order of importance.
Handle "ExprID". These are Cell attributes, e.g.:
<gnm:Cell Row="10" Col="7" ExprID="1"/>
Looking at the test data, they seem to indicate that the missing content is the expression in the earlier cell with the same ID after shifting by the location difference. This requires parsing equations, though.
BUGS
Please report any bugs or feature requests to bug-spreadsheet-readgnumeric at rt.cpan.org
, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Spreadsheet-ReadGnumeric.
SEE ALSO
- The Gnumeric Spreadsheet
- Spreadsheet::Read
-
Frontend to multiple spreadsheet formats, with additional options for manipulating the result, plus the option to give them object capabilities. This module also describes the output data structure in depth.
- The Gnumeric File Format
-
This is a PDF file by David Gilbert dated 2001-11-05, available at https://www.jfree.org/jworkbook/download/gnumeric-xml.pdf (GFDL). I am not aware of any more recent version.
AUTHOR
Bob Rogers <rogers at rgrjr.com>
, based heavily on the Spreadsheet::Read
architecture.
LICENSE AND COPYRIGHT
This software is Copyright (c) 2023 by Bob Rogers.
This is free software, licensed under:
The Artistic License 2.0 (GPL Compatible)