PackRow2
PackRow2 takes a list of items and packs them (non-destructively) into a string of at most maxsize bytes. If no offset is specified, it builds the string starting with the last item in the list, prepending each preceding item until it runs out of space or the list is fully consumed. If the packer runs out of space, it returns the offset into the list where it stopped. That offset may be supplied as an argument on a later call, and the packer will pack the remainder of the list starting at the offset, working back to the beginning of the list.
The final argument to the packer is a "next pointer", a string that identifies the location of the next part of a row that has been split into multiple pieces. Since the packer processes a list from back to front, the address of the "next" piece can be obtained before constructing the preceding piece.
If the packer can process a complete list, it returns an array containing a single packed string: a byte string consisting of a count of the number of packed items, followed by length/value pairs for each item. If the packer runs out of space, it returns an array containing the packed string and the offset of the remaining items.
For example, given the list @a = qw(alpha bravo charlie delta) and maxsize=15, PackRow2 returns a packed string (something like \x01\x05delta) and the offset 3, indicating that the last item in the list was processed and that the packer ran out of space at the third item. The packed string could be stored in a pushhash, which would return an index, e.g. "5/2", suitable for use as a next pointer. Packing the remainder of the list generates another packed string (e.g. \x02\x07charlie\x035/2) and the offset 2. The packing and storage process continues until the entire list is consumed.
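The back-to-front loop described above can be sketched in a few lines of Perl. This is only an illustration, not the Genezzo implementation: pack_piece_sketch, its argument order, and its byte layout (one count byte, then a one-byte length plus value per item, no null vector) are assumptions made for the example. Because this simplified layout is smaller than the real one, the sketch uses maxsize=13 to reproduce the same split as the example above.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT Genezzo's PackRow2. Assumed layout: one count
# byte, then a one-byte length followed by the value for each item.
# Items longer than 255 bytes are not handled by this sketch.
sub pack_piece_sketch {
    my ($items, $maxsize, $offset, $next_ptr) = @_;

    # With no offset, start just past the last item in the list.
    $offset = scalar @$items unless defined $offset;

    # The next pointer, if any, is the trailing item and is never split.
    my @piece = defined($next_ptr) ? ($next_ptr) : ();
    my $size  = 1;                        # the count byte
    $size += 1 + length($_) for @piece;

    while ($offset > 0) {
        my $item = $items->[$offset - 1];
        last if $size + 1 + length($item) > $maxsize;   # out of space
        unshift @piece, $item;            # prepend the preceding item
        $size += 1 + length($item);
        $offset--;
    }

    my $packed = pack('C', scalar @piece)
               . join('', map { pack('C/a*', $_) } @piece);

    # A complete list yields only the packed string; otherwise the
    # offset of the remaining items is returned as well.
    return $offset > 0 ? ($packed, $offset) : ($packed);
}

my @a = qw(alpha bravo charlie delta);
my ($piece1, $offset1) = pack_piece_sketch(\@a, 13);
my ($piece2, $offset2) = pack_piece_sketch(\@a, 13, $offset1, '5/2');
```

Here the first call packs only delta and returns offset 3; the second call packs charlie in front of the next pointer "5/2" and returns offset 2.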
advanced topics
- null vector
The packed string always contains a bitstring that identifies null columns. UnPackRow uses it to distinguish correctly between nulls and zero-length strings.
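To see why a null bitstring is needed, consider that both undef and the empty string flatten to a zero-length value. The sketch below is one possible way to build and apply such a vector with Perl's vec; the helper names and the bit layout are assumptions for illustration and are not taken from the PackRow source.

```perl
use strict;
use warnings;

# Hypothetical helper: one bit per column, set to 1 where the column is null.
sub null_vector_sketch {
    my @cols = @_;
    my $bits = '';
    for my $i (0 .. $#cols) {
        vec($bits, $i, 1) = defined($cols[$i]) ? 0 : 1;
    }
    return $bits;
}

# Hypothetical helper: turn flattened values back into nulls where flagged.
sub restore_nulls_sketch {
    my ($bits, @values) = @_;
    return map { vec($bits, $_, 1) ? undef : $values[$_] } 0 .. $#values;
}

my @row      = ('alpha', undef, '', 'delta');
my $bits     = null_vector_sketch(@row);
my @flat     = map { defined($_) ? $_ : '' } @row;   # nulls become ''
my @restored = restore_nulls_sketch($bits, @flat);
```

After the round trip, the second column is undef again while the third remains an empty string.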
- next pointer
Since the next pointer is used to find the next part of a split row, it must always remain whole -- if it were split, how could you find the next piece? The next pointer is a convention supported by PackRow/UnPackRow to facilitate the construction of methods that manipulate split rows. The packing function only flattens an array into a byte string or series of strings; it does not provide any intrinsic support for traversing these strings. Functions that manipulate packed rows may use additional structures to support multi-part rows, such as external metadata in the block row directory, or specialized metadata columns embedded in the row itself.
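Since traversal is left to the caller, a function that reassembles a split row has to follow the next pointers itself. The following sketch assumes the row pieces are already unpacked and kept in a plain hash, with the next pointer held as external metadata beside the columns; the store layout, the keys other than "5/2", and fetch_row_sketch are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical store: key => { cols => [...], next => key of next piece }.
# A piece whose next pointer is undef is the tail of the row.
sub fetch_row_sketch {
    my ($store, $key) = @_;
    my @row;
    while (defined $key) {
        my $piece = $store->{$key} or die "missing row piece $key";
        push @row, @{ $piece->{cols} };
        $key = $piece->{next};            # follow the chain
    }
    return @row;
}

# The tail piece was stored first, so its key was known when the
# preceding piece was built.
my %store = (
    '5/2' => { cols => ['delta'],   next => undef },
    '5/3' => { cols => ['charlie'], next => '5/2' },
    '5/4' => { cols => ['bravo'],   next => '5/3' },
    '5/5' => { cols => ['alpha'],   next => '5/4' },
);

my @row = fetch_row_sketch(\%store, '5/5');
```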
- column splitting (fragmentation)
The packer can support rows with individual columns that exceed the maxsize. The offset can simultaneously maintain the current column position, as well as the current character offset in that column. It's wicked complicated. Generally, we say that a row is split into row pieces, and the row pieces are chained (via the next pointers), which lets us reconstruct a complete row. Individual columns that are split are said to be fragmented.
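The character-offset half of that bookkeeping can be shown in isolation. The sketch below fragments a single oversized column value from back to front, which is the direction the packer works in; it ignores the column position, the length bytes, and the next pointers entirely, and fragment_column_sketch is a hypothetical name, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: split one column value into fragments of at most
# $maxsize characters, working from the end of the value to the start.
sub fragment_column_sketch {
    my ($value, $maxsize) = @_;
    die "maxsize must be positive" unless $maxsize > 0;

    my @fragments;
    my $pos = length $value;              # current character offset
    while ($pos > 0) {
        my $start = $pos > $maxsize ? $pos - $maxsize : 0;
        unshift @fragments, substr($value, $start, $pos - $start);
        $pos = $start;
    }
    return @fragments;
}

my @fragments = fragment_column_sketch('abcdefghij', 4);
```

Joining the fragments in order reconstructs the original column, just as chaining row pieces reconstructs the row.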
future work
The packer could be extended to support more complex structures than arrays of scalars. Until then, such structures can be flattened into large strings using Data::Dumper or YAML.
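As a rough sketch of the Data::Dumper route, a nested structure can be turned into a single scalar, which the packer could then treat like any other item, and restored later with a string eval. The variable names are illustrative, and the eval step assumes the stored string is trusted.

```perl
use strict;
use warnings;
use Data::Dumper;

my $nested = { name => 'alpha', tags => ['x', 'y'] };

# Flatten to one compact string; Sortkeys makes the output repeatable.
my $flat = Data::Dumper->new([$nested])->Indent(0)->Sortkeys(1)->Dump;

# Data::Dumper emits an assignment to $VAR1 by default, so declare it
# before evaluating the string to rebuild the structure.
my $VAR1;
eval $flat;
die "could not restore structure: $@" if $@;
my $copy = $VAR1;
```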
NAME
Genezzo::Util - Utility functions
TODO
- FileGetHeaderInfo: need to handle the case of a header that exceeds a single block. Probably should keep increasing the buffer size until we find the null terminator (within reason).
- packrow: store metadata in col0 vs trailing col with next ptr
- packrow: check pack format for a zero len row of zero cols. Does it need a nullvec?
- unpackrow: extend to support a prebuilt template when unpacking many rows with the same number of columns. Could probably store in an array. if (defined($a[$numcols])...
- packrow/unpackrow: in Perl 5.8 could use the nifty repeating templates to our advantage.
- packrow: could generate skiplists as col zero metadata tracking byte position and column numbers to speed lookups
AUTHOR
Jeffrey I. Cohen, jcohen@genezzo.com
SEE ALSO
Copyright (c) 2003, 2004 Jeffrey I Cohen. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
Address bug reports and comments to: jcohen@genezzo.com
For more information, please visit the Genezzo homepage at http://www.genezzo.com