NAME

ODG::Record - Perl extension for manipulating row based records.

SYNOPSIS

  use ODG::Record;
  my $record = ODG::Record->new( 
    { 
        _data_      => [1..25] ,
        _layout_  => ODG::Layout->new( .. )
    } 
  );

  # Data can then be accessed with the _data_ l-value accessor
  $record->_data_ = [ 26..50 ] ;   

DESCRIPTION

ODG::Record is an extensible class for efficiently dealing with row based records. Data and layout information are separate concerns, held in separate slots (_data_ and _layout_) within the ODG::Record object.

Since the _layout_ (i.e. metadata) does not change between records, separating the _data_ from the _layout_ allows for greater efficiency when processing row based records via object recycling. Rather than creating a new object for each record, the new data is placed in the _data_ slot of an existing ODG::Record object. Since data is stored as an ArrayRef, this is a huge performance win.

ODG::Record objects require an ODG::Layout object for instantiation (a _data_ ArrayRef is optional). During object construction, name-based accessors are built for each record field. By default, the accessors permit lvalue assignment.
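The accessor generation described above can be sketched in core Perl, without Moose. This is a minimal illustration, not ODG::Record's actual implementation: the package name Sketch::Record, the plain hash used as a layout (field name mapped to position), and the field names are all assumptions made for the example.

```perl
use strict;
use warnings;

package Sketch::Record;    # hypothetical stand-in, not ODG::Record itself

sub new {
    my ( $class, $args ) = @_;
    my $self = bless {
        _data_   => $args->{_data_} || [],
        _layout_ => $args->{_layout_},    # assumed: { field name => position }
    }, $class;

    # Build one lvalue accessor per field named in the layout.
    # Installed in the class here for simplicity.
    for my $field ( keys %{ $self->{_layout_} } ) {
        my $pos = $self->{_layout_}{$field};
        no strict 'refs';
        no warnings 'redefine';
        *{"${class}::$field"} = sub : lvalue { $_[0]{_data_}[$pos] };
    }
    return $self;
}

package main;

my $record = Sketch::Record->new(
    { _data_ => [ 'Ann', 42 ], _layout_ => { name => 0, age => 1 } } );

$record->age = 43;    # lvalue assignment through the generated accessor
print $record->name, ' ', $record->age, "\n";
```

Each generated accessor closes over its field position, so a call costs one array lookup rather than a layout lookup by name.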

Object Model

  ODG::Record
    |
    |- slot: _data_
    |
    |- slot: _layout_ (container for ODG::Metadata objects )
    |    |
    |    |- slot: _metadata_
    |              
    |               
    |- slot: _metadata_ ( ref to ODG::Record::_layout_::_metadata_ )

DISCUSSION

Object Recycling

This module is designed for efficient streaming, i.e. accessing only one record at a time. By repopulating the data slot, i.e. recycling the record object, we do not incur the expense of object instantiation for each record. A huge win.
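A minimal sketch of this streaming pattern follows. It assumes a plain blessed hash standing in for an ODG::Record object, a hash layout mapping field names to positions, and a small in-memory list standing in for the input stream; none of these names come from the module itself.

```perl
use strict;
use warnings;

# One record object is created up front; only its _data_ slot changes per row.
# A plain blessed hash stands in for an ODG::Record object (assumption).
my $record = bless { _data_ => [], _layout_ => { id => 0, amount => 1 } },
    'Sketch::Record';

my @lines = ( "1,10.50", "2,4.25", "3,7.00" );    # stand-in for an input stream
my $total = 0;

for my $line (@lines) {
    # Recycle: repopulate the slot instead of constructing a new object.
    $record->{_data_} = [ split /,/, $line ];
    $total += $record->{_data_}[ $record->{_layout_}{amount} ];
}

printf "total: %.2f\n", $total;
```

The constructor runs once regardless of how many rows are read, which is where the saving comes from.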

The downside is that the checking / validation performed during new object creation might be lost. This may or may not be acceptable depending on the situation. Generally, when the records are well-defined and well-maintained, e.g. from a database, this is not an issue. The data from one record to the next is fairly consistent.

Encapsulation may also be lost. Again, whether this is acceptable depends on the situation. There are several methods of object recycling (as opposed to creating a new object). They are:

    * using a standard accessor

    * using an lvalue accessor (not officially supported by Moose, and
      may break encapsulation; since this is Moose, other Moose
      techniques can be used for validation, e.g. an after method
      modifier)

    * direct access (breaks encapsulation)
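The three recycling methods can be compared side by side in a small sketch. The class Sketch::Rec is hypothetical and written in plain Perl rather than Moose, so it only illustrates the calling styles, not ODG::Record's own accessors.

```perl
use strict;
use warnings;

package Sketch::Rec;    # hypothetical class, plain Perl rather than Moose

sub new { my ( $class, $args ) = @_; return bless { %$args }, $class }

# Standard accessor: getter with no argument, setter with one.
sub data {
    my $self = shift;
    $self->{_data_} = shift if @_;
    return $self->{_data_};
}

# Lvalue accessor: the method call itself can be assigned to.
sub _data_ : lvalue { $_[0]{_data_} }

package main;

my $rec = Sketch::Rec->new( { _data_ => [ 1 .. 5 ] } );

$rec->data( [ 6 .. 10 ] );        # 1. standard accessor
$rec->_data_ = [ 11 .. 15 ];      # 2. lvalue accessor
$rec->{_data_} = [ 16 .. 20 ];    # 3. direct access, bypasses the class

print "@{ $rec->data }\n";
```

Only the standard accessor gives the class a place to inspect the new value before it is stored; the other two write to the slot without any hook.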

_data_ assignment performance

Here is a comparison of methods for placing new data in the record object on an Intel(R) Xeon(TM) CPU 3.06GHz processor:

  Data has 5 elements: 1..5
                      Rate    new object moose accessor lvalue moose direct access
  new object        1268/s            --           -99%        -100%         -100%
  moose accessor  222222/s        17428%             --         -56%          -78%
  lvalue moose    500000/s        39337%           125%           --          -50%
  direct access  1000000/s        78775%           350%         100%            --


  Data has 25 elements: 1..25
                     Rate    new object moose accessor  lvalue moose direct access
  new object       1243/s            --           -99%         -100%         -100%
  moose accessor 166667/s        13308%             --          -37%          -54%
  lvalue moose   266667/s        21353%            60%            --          -27%
  direct access  363636/s        29155%           118%           36%            --
  
  
  new object      : ODG::Record->new( { _data_ => [ 1..5 ] } );
  moose accessor  : $record->data( [ 1..5 ] )
  lvalue moose    : $record->data = [ 1..5 ]
  direct access   : $record->_data_ = [ 1..5 ]
 

In real situations, there is probably not much difference between the last three techniques. Other bottlenecks are likely to dominate elsewhere in the code, such as whether I/O can deliver more than 100k records per second.
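A comparison of this kind can be reproduced with the core Benchmark module. The sketch below is not the script that produced the tables above: Sketch::Bench is a hypothetical plain-Perl class, so absolute rates will differ from the Moose figures, and the iteration count is an arbitrary choice.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

package Sketch::Bench;    # hypothetical class used only for this comparison

sub new  { my ( $class, $args ) = @_; return bless { %$args }, $class }
sub data { my $self = shift; $self->{_data_} = shift if @_; $self->{_data_} }
sub _data_ : lvalue { $_[0]{_data_} }

package main;

my $rec = Sketch::Bench->new( { _data_ => [ 1 .. 5 ] } );

# cmpthese runs each case the given number of times and prints a rate table.
# A negative count would instead run each case for that many CPU seconds.
my $rows = cmpthese(
    500_000,
    {
        'new object'      => sub { Sketch::Bench->new( { _data_ => [ 1 .. 5 ] } ) },
        'accessor'        => sub { $rec->data( [ 1 .. 5 ] ) },
        'lvalue accessor' => sub { $rec->_data_ = [ 1 .. 5 ] },
        'direct access'   => sub { $rec->{_data_} = [ 1 .. 5 ] },
    }
);
```

Running the same cases against real input is the more useful test, since it shows whether the accessor choice is visible at all next to I/O cost.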

Note to self: How does this compare to fully encapsulated inside-out objects? Since inside-out objects use references, we would expect them to behave similarly to the moose accessors.

RecordSet

This idea can be extended to a RecordSet where multiple records are placed in the data slot. This may be advantageous when batching is more appropriate. Some reasons for this might be related to:

    * I/O constraints, especially latency, make batch processing
      more efficient

    * Processing requires batch methods, e.g. look-backs

    * Processing benefits from batch methods, e.g. records
      are sorted in order and one event needs to be triggered
      for all records of one type.
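One way to picture the batching idea is a single recycled container whose _data_ slot holds several rows at once. The sketch below is an assumption about how such a RecordSet could look, not an existing ODG class; the layout hash, the batch size and the sample rows are invented for illustration.

```perl
use strict;
use warnings;

# Assumption: a "record set" is one recycled container whose _data_ slot
# holds an array of rows rather than a single row.
my $set    = { _data_ => [], _layout_ => { type => 0, value => 1 } };
my @stream = ( [ a => 1 ], [ a => 2 ], [ b => 3 ], [ b => 4 ], [ b => 5 ] );
my $batch_size = 2;
my @batch_sums;

# Take rows off the stream in batches until it is empty.
while ( my @rows = splice @stream, 0, $batch_size ) {
    $set->{_data_} = \@rows;    # recycle the set for each batch
    my $pos = $set->{_layout_}{value};
    my $sum = 0;
    $sum += $_->[$pos] for @{ $set->{_data_} };
    push @batch_sums, $sum;
}

print "@batch_sums\n";
```

As with single records, the container and its layout are created once and only the data slot is replaced per batch.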
 

An example of a RecordSet is demonstrated at:

    http://code2.0beta.co.uk/moose/svn/Moose/trunk/t/200_examples/008_record_set_iterator.t

This demonstration is not efficient since it seems that each record object requires instantiation (costly).

Mixing of attributes, data and metadata

It seems like really bad design to store object attributes, data and metadata in the same construct, i.e. all as attributes. A conflict arises when the field names of the data clash with the field names of the metadata. These should be separated.

The interface should be designed such that the user has access principally to the data, via object->field syntax. And it seems sloppy to save the metadata with an alternate naming convention. Some alternatives might be:

A special slot for attributes and metadata. These can be called _data_ and _layout_, e.g.

    $record->field1

    sub field1 :lvalue {
        $_[0]->{_data_}->[ $_[0]->{_layout_}->{field1}->{pos} ]
    }

Use $record->meta. This method already exists (Moose provides it), so it seems like a bad place to store information.

Absent a better methodology we stick to _name_.

METHODS

new

Object constructor. Creates and returns an ODG::Record object. It takes the following options.

    _layout_ An ODG::Layout object containing the metadata for the record. By convention, the first position has an index of 0.

_data_

L-value object accessor to the record data. Data is stored internally as an array reference. This is a very fast accessor for the _data_ slot.

  #  Getter
    $record->_data_               # Retrieve entire array ref
    $record->_data_->[ $index ]   # Get a specific field
  
  # Setter
    $record->_data_( [ .. ] )
    $record->_data_ = [ .. ]

    $record->_data_->[ $index ] = $value    

_layout_

READ-ONLY accessor to the layout object.

_metadata_

Convenience method for accessing the _layout_->_metadata_ object.

EXPORT

None by default.

TODO

  • Can object methods be installed to work with the fields, such as CREDIT_CARD_NUMBER encrypt? Should objects or attributes be created for each of the fields? How can this be done without sacrificing performance? Can we recycle each of those objects, too?

      $record->CREDIT_CARD_NUMBER->encrypt;

    What about:

      $record->encrypt_CREDIT_CARD_NUMBER?

  • Index-based access. Allow for $record->_1_, i.e. access to the record by _data_ slot position.

  • RecordIterator class. Subclass that iterates over a record set. This will likely be ODG::ETL::Extractor ( ODG::ETL::E, for short )

  • MooseX::AttributeHelpers::Collection::Array for the _data_ slot (?) providing list based methods

  • Some checking when _data_ is set or changed. Minimally that _data_ has the same number of elements as _layout_->_metadata_.
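The element-count check in the last item could look something like the following. This is only a sketch of the idea: Sketch::Checked and its set_data method are hypothetical, and the layout is assumed to be a plain hash with one key per field rather than an ODG::Layout object.

```perl
use strict;
use warnings;

package Sketch::Checked;    # hypothetical class, not part of ODG::Record

sub new { my ( $class, $args ) = @_; return bless { %$args }, $class }

# Setter that rejects data whose length differs from the number of fields.
sub set_data {
    my ( $self, $data ) = @_;
    my $expected = keys %{ $self->{_layout_} };    # field count
    die sprintf "expected %d fields, got %d\n", $expected, scalar @$data
        unless @$data == $expected;
    return $self->{_data_} = $data;
}

package main;

my $rec = Sketch::Checked->new( { _layout_ => { id => 0, name => 1 } } );
$rec->set_data( [ 7, 'Ann' ] );                  # accepted
my $ok = eval { $rec->set_data( [7] ); 1 };      # rejected: wrong length
print $ok ? "accepted\n" : "rejected: $@";
```

A check like this costs one comparison per record, so it may be cheap enough to keep even when recycling.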

SEE ALSO

ODG::Metadata, Moose

THANKS

Stevan Little, author of Moose

Paul Driver for suggesting to place the accessor methods in the instance rather than the class.

Members of moose@perl.org.

AUTHOR

Christopher Brown, <http://www.opendatagroup.com>

COPYRIGHT AND LICENSE

Copyright (C) 2008 by Open Data

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.
