NAME

Data::AnyXfer::Elastic::Import::DataFile - Prepare and record data for bulk import into Elasticsearch clusters

SYNOPSIS

# Writing...

my $datafile =
    Data::AnyXfer::Elastic::Import::DataFile->new(
    file => $file, # optional
    index_info => 'Interiors::IndexInfo',
);

while ( my $row = $rs->next ) {
    $datafile->add_document( $row );
}

my $pathclass_file = $datafile->write;


# Reading...

my $datafile =
    Data::AnyXfer::Elastic::Import::DataFile->read(
    file => $pathclass_file );

# can then get the index info back out
my $index_info = $datafile->index_info;

# someone needing to stream the data out of the data file
# would use the ->fetch_data interface
$datafile->fetch_data(\&import_data);

DESCRIPTION

This module allows us to record a dataset for import into Elasticsearch, which can then be 'played' into an Elasticsearch cluster via Data::AnyXfer::Elastic::Importer.

This allows us to ensure that the same data is imported to multiple environments, and to later replay imports or load them into other environments.

ATTRIBUTES

file

Optional. The datafile file destination. This must be a string or an instance of Path::Class::File.

If not supplied, but "dir" IS supplied, this instance will write to a file within dir, under a name matching the following pattern:

import.<NAME>.<TIMESTAMP>-<HOSTNAME>.datafile

Where <NAME> is the index name, falling back to the alias name, and then to 'default'.

The "compress" option also changes the destination path (see "compress").
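
For example, given only "dir" and index information, the generated filename follows the pattern above (the path below is illustrative):

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        dir        => '/var/imports',
        index_info => 'Interiors::IndexInfo',
    );

    # writes to something like:
    # /var/imports/import.interiors.20190101120000-myhost.datafile
    my $file = $datafile->write;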

dir

Optional. The destination directory for the datafile. If not supplied, we will switch to an underlying storage backend of Data::AnyXfer::Elastic::Import::Storage::TempDirectory, meaning no data will be persisted.

The datafile instance can still be passed around and used within the current process, until it goes out of scope.
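
For example, a datafile created with neither "file" nor "dir" can still be used to hand data between components in the same process:

    # backed by a temporary directory - nothing is persisted
    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        index_info => 'Interiors::IndexInfo',
    );

    $datafile->add_document( { some => 'data' } );
    $datafile->fetch_data( \&import_data );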

storage

Optional. Manually override the storage backend used to persist the dataset.

Should be an object implementing Data::AnyXfer::Elastic::Import::Storage.
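
For example (My::Custom::Storage below is a hypothetical class implementing the storage interface):

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        storage    => My::Custom::Storage->new,
        index_info => 'Interiors::IndexInfo',
    );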

index_info

Optional. A ClassName or object instance implementing Data::AnyXfer::Elastic::Role::IndexInfo.

This will be recorded along with the data. It is simply a convenience versus setting the information individually using the methods under "GETTERS AND SETTERS".
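
For example, both of the following are accepted (Interiors::IndexInfo is an illustrative class implementing the role):

    # a ClassName...
    index_info => 'Interiors::IndexInfo',

    # ...or an object instance
    index_info => Interiors::IndexInfo->new,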

connect_hint

A connection hint for use with Data::AnyXfer::Elastic.

Currently supports undef (unspecified), 'readonly', or 'readwrite'.

part_size

Optional. This is used for the datafile body containing the data to be imported. It will determine the number of data elements to store within a single storage entry, and will be the maximum number of data structures held in memory at any one time.

You will need to reduce this number when storing large nested data structures, and may increase it for smaller structures or as memory limits allow.

Defaults to: 1000
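
For example, a sketch of tuning down the part size for large nested documents:

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        index_info => 'Interiors::IndexInfo',
        part_size  => 250,    # fewer large documents per part
    );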

data_buffer_size

Optional. This determines how many documents may be added to the datafile before the data is flushed to the underlying storage entry. It is closely related to "part_size", as every time an item is added to an entry, all of the data held in memory for that part must be re-serialised and persisted.

You will need to increase this number when ingesting large numbers of documents.

The fastest and most efficient value for this will be the same as your maximum "part_size".

Defaults to: 25% of "part_size"
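
For example, a sketch of matching the buffer size to the part size for the fastest writes:

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        index_info       => 'Interiors::IndexInfo',
        part_size        => 1000,
        data_buffer_size => 1000,    # flush once per part
    );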

timestamp

Optional. This is the creation timestamp, and may be used in the resulting datafile name.

compress

Optional boolean. Defaults to 0. Turns on LZMA compression when saving the datafile, appending '.lzma' to the filename. If "file" is provided, it will be moved to the compressed version.
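
For example (the resulting filename is illustrative):

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        file       => 'interiors.datafile',
        index_info => 'Interiors::IndexInfo',
        compress   => 1,
    );

    # writes to 'interiors.datafile.lzma'
    my $file = $datafile->write;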

GETTERS AND SETTERS

Please see Data::AnyXfer::Elastic::Role::IndexInfo for the interface definition and information.

author_comment

$datafile->author_comment(
    'Imported from database "mystuff" on `mysql-db-1.myfqdn.net` @ 2015-10-04T12:56:21');

Use this to store useful information about where this data came from.

STATISTICS AND CONSISTENCY

get_document_count

$datafile->get_document_count;

Try to find the document count for the complete datafile. This is fast on non-legacy datafiles, as the count is pre-calculated when the datafile is authored.
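
For example, a sketch of a consistency check against the source resultset from the SYNOPSIS:

    die 'datafile is incomplete'
        unless $datafile->get_document_count == $rs->count;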

READING AND WRITING

read

Synonym for "new".

fetch_data

$datafile->fetch_data(\&import_data);

sub import_data {

    # called once per batch of documents
    my @data = @_;

    print join( "\n", @data );
}

Retrieves the import data in batches, passing each batch to the supplied callback for processing until the data is exhausted.

add_document

$datafile->add_document( { some => 'data' } );

Add another Elasticsearch document to the datafile for import.

write

my $file = $datafile->write;

Packages and writes the data out to the destination datafile.
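
The return value is a Path::Class::File instance (as shown in the SYNOPSIS), so it can be used directly in string context:

    my $file = $datafile->write;
    printf "datafile written to %s\n", $file;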

UTILITY METHODS

export_index_info

my $index_info = $datafile->export_index_info;

Convenience method which creates an ad-hoc Data::AnyXfer::Elastic::IndexInfo instance representing the datafile target info (if this datafile were to be played by an importer as-is).

COPYRIGHT

This software is copyright (c) 2019, Anthony Lucas.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.