NAME
ODG::Record - Perl extension for manipulating row based records.
SYNOPSIS
use ODG::Record;
my $record = ODG::Record->new(
{
_data_ => [1..25] ,
_layout_ => ODG::Layout->new( .. )
}
);
# Data can then be accessed with the _data_ l-value accessor
$record->_data_ = [ 26..50 ] ;
DESCRIPTION
ODG::Record is an extensible class for efficiently dealing with row based records. Data and layout information are separate concerns existing in seperate slots (_data_ and _layout_ ) within the ODG::Record object.
Since the _layout_ (i.e. metadata) does not change between records, separating the _data_ from _layout_ allows for greater efficient when processing row based records via object recycling. Rather than creating a new object for each record, the new data is placed in the _data_ slot of an existing ODG::Record object. Since data is stored as an ArrayRef, this is a huge performance win.
ODG::Records require an ODG::Layout object ( a _data_ ArrayRef is optional ) for instantiation. During object construction, name-based accessors are built for each recod field. By default, the accessors permit lvalue assignment.
Object Model
ODG::Record
|
|- slot: _data_
|
|- slot: _layout_ (container for ODG::Metadata objects )
| |
| |- slot: _metadata_
|
|
|- slot: _metadata_ ( ref to ODG::Record::_layout_::_metadata_
DISCUSSION
Object Recycling
This module is designed for efficient streaming, i.e. accessing only one record at a time. By repopulating the data slot, i.e. recycling the record object, we do not incur the expensive of object instantiation for each record. A huge win.
The downside is that the checking / validation in new object creation might be lost. This may or not be acceptable depending on the situation. Generally, when the records are well-defined and well-maintained, i.e. from a database, this is not an issue. The data from one record to the next is fairly consistent.
Encapsulation may also be lost. Again, whether this is acceptable depends on the situation. There are several methods of object recycling ( as opposed to creating a new object). They are:
* using a standard accessor, * using an lvalue accessor (not officially supported by Moose -and- may break encapsulation. ( Since this is Moose, other Moose techniques can be used for validation, e.g. after method) * direct access (breaks encapsulation).
_data_ assignment performance
Here is a comparison of methods for placing new data in the record object on an Intel(R) Xeon(TM) CPU 3.06GHz processor:
Data has 5 elements: 1..5
Rate new object moose accessor lvalue moose direct access
new object 1268/s -- -99% -100% -100%
moose accessor 222222/s 17428% -- -56% -78%
lvalue moose 500000/s 39337% 125% -- -50%
direct access 1000000/s 78775% 350% 100% --
Data has 25 elements: 1..25
Rate new object moose accessor lvalue moose direct access
new object 1243/s -- -99% -100% -100%
moose accessor 166667/s 13308% -- -37% -54%
lvalue moose 266667/s 21353% 60% -- -27%
direct access 363636/s 29155% 118% 36% --
new object : ODG::Record->new( { _data_ => qw( [ 1..5] ) } );
moose accessor : $record->data( [ 1..5 ] )
lvalue moose : $record->data = [ 1..5 ]
direct access : $record->_data_ = [ 1.5 ]
In real situations, there is probably not much of a difference between the last three techniques, other bottleneck are likely to occur in the the code such as I/O ability to return > 100k/s.
Note to self: What is the comparison to fully encapsulate inside-out objects. Since inside out objects uses references, we would expect them to behave similar to the moose accessors.
RecordSet
This idea can be extended to a RecordSet where multiple records are placed in the data slot. This may be advantageous when batching is more appropriate. Some reasons for this might be related to:
* I/O constraints, especially latency provide for more
efficient batch processing
* Processing requires batch methods, look-backs e.g.
* Processing benefits from batch methods. i.e. records
are sorted in order and one event needs to be triggered
for all records of one type.
An example of a RecordSet is demonstrated at:
http://code2.0beta.co.uk/moose/svn/Moose/trunk/t/200_examples/008_record_set_iterator.t
This demonstration is not efficient since it seems that each record object requires instantiation (costly).
Mixing of attributes, data and metadata
It seems really bad design to store object attributes, data and metadata in the same construct, ie all as attributes. A conflict arises when the field names of data conflict with the field names of the metadata. These should be seperated.
The interface should be designed such that the user has access principally to the data, via object->field syntax. And it seems sloppy to save the metadata with a alternate naming convention. Some alternative might be:
A special slot for attributes and metadata. These can be called _data_ and _layout_, e.g.
$record->field_1
sub field1 :lvalue {
$_[0]->{_data_}->[ $_[0]->{_layout_}->{field1}->{pos} ]
Use $record->meta already exists, so this seems like a bad place to store information.
Absent a better methodology we stick to _name_.
METHODS
new
Object constructor. Creates and returns a ODG::Record object. It takes the following options.
_layout_ A ODG::Layout object containing the metadata for record. By convention, the first position has an index of 0.
_data_
L-value object accessor to the record data. Data is stroed internally as an array reference, so data This is the very fast accessor for the _data_,
# Getter
$record->_data_ # Retrieve entire array ref
$record->_data_->[ $index ] # Get a specific field
# Setter
$record->_data_( [ .. ] )
$record->_data_ = [ .. ]
$record->_data_->[ $index ] = $value
_layout_
READ-ONLY accessor to the layout object.
_metadata_
Convenience method for accessing the _layout_->_metadata_ object.
EXPORT
None by default.
TODO
Can object methods be installed to work with the fields such as CREDIT_CARD_NUMBER encrypt? Should objects or attributes be created for each of the fields? How can this be done in a way as to not sacrifice performance. Can we recycle each of those objects, too.
$record->CREDIT_CARD_NUMBER->encrypt;
What about:
$record->encrypt_CREDIT_CARD_NUMBER?
Indexed based access. Allow for $record->_1_, i.e. access to record by _data_ slot postion.
RecordIterator class. Subclass that itererates over a record set. This will likely be ODG::ETL::Extractor ( ODG::ETL::E, for short )
MooseX::AttirbuteHelpers::Collection::Array for the _data_ slot (?) providing list based methods
Some checking when _data_ is set or changed. Minimally that _data_ has the same number of elements as _layout_->_metadata_.
SEE ALSO
ODG::Metadata, Moose
THANKS
Steven Little, author of Moose
Paul Driver for suggesting to place the accessor methods in the instance rather than the class.
Members of moose@perl.org.
AUTHOR
Christopher Brown, <http://www.opendatagroup,com>
COPYRIGHT AND LICENSE
Copyright (C) 2008 by Open Data
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.
2 POD Errors
The following errors were encountered while parsing the POD:
- Around line 327:
'=item' outside of any '=over'
- Around line 361:
You forgot a '=back' before '=head1'