NAME
Data::AnyXfer::Elastic::Import::DataFile - Prepare and record data for bulk import into Elasticsearch clusters
SYNOPSIS
    # Writing...
    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        file  => $file,                  # optional
        index => 'Interiors::IndexInfo',
    );

    while ( my $row = $rs->next ) {
        $datafile->add_document($row);
    }

    my $pathclass_file = $datafile->write;

    # Reading...
    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->read(
        file => $pathclass_file );

    # can then get the index info back out
    my $index_info = $datafile->index_info;

    # someone needing to stream the data out of the data file
    # would use the ->fetch_data interface
    $datafile->fetch_data(\&import_data);
DESCRIPTION
This module allows us to record a dataset for import into Elasticsearch, which can then be 'played' into an Elasticsearch cluster via Data::AnyXfer::Elastic::Importer.
This allows us to ensure that the same data is imported to multiple environments, and to later replay imports or load them into other environments.
ATTRIBUTES
- file

Optional. The destination file for the datafile. This must be a string or an instance of Path::Class::File.

If not supplied, but "dir" IS supplied, this instance will write to a file within dir, under a name matching the following pattern:

    import.<NAME>.<TIMESTAMP>-<HOSTNAME>.datafile

Where <NAME> is INDEX || ALIAS || 'default' (the first of these which is defined).

The "compress" option also changes the destination path (see "compress").
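For example, a datafile for an index named 'interiors', created on host 'app-1', might be written as (the exact timestamp format shown here is illustrative):

    import.interiors.20151004125621-app-1.datafile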
- dir

Optional. The destination directory for the datafile. If not supplied, we will switch to an underlying storage backend of Data::AnyXfer::Elastic::Import::Storage::TempDirectory, meaning no data will be persisted. The datafile instance can still be passed around and used within the current process (until the instance goes out of scope).
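A minimal writing sketch using "dir" instead of "file" (the directory path is illustrative, and Interiors::IndexInfo is the example class from the SYNOPSIS):

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        dir        => '/var/lib/es-imports',
        index_info => 'Interiors::IndexInfo',
    );

    # the generated filename follows the pattern described under "file"
    my $file = $datafile->write;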
- storage

Optional. Manually override the storage backend used to persist the dataset. Should be an object implementing Data::AnyXfer::Elastic::Import::Storage.
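A hedged sketch of overriding the backend (My::Custom::Storage is hypothetical, standing in for any class implementing the Data::AnyXfer::Elastic::Import::Storage interface):

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        storage    => My::Custom::Storage->new,    # hypothetical backend
        index_info => 'Interiors::IndexInfo',
    );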
- index_info

Optional. A class name or object instance implementing Data::AnyXfer::Elastic::Role::IndexInfo. This will be recorded along with the data. It is simply a convenience versus setting the information individually using the methods under "GETTERS AND SETTERS".
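For example, passing the class name used in the SYNOPSIS:

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        file       => $file,
        index_info => 'Interiors::IndexInfo',
    );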
- connect_hint

A connection hint for use with Data::AnyXfer::Elastic. Currently supports undef (unspecified), 'readonly', or 'readwrite'.
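For example, to hint that read-only connections are sufficient:

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        file         => $file,
        connect_hint => 'readonly',
    );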
- part_size

Optional. This is used for the datafile body containing the data to be imported. It determines the number of data elements stored within a single storage entry, and is therefore the maximum number of data structures held in memory at any one time.

You will need to reduce this number when storing large nested data structures, and can increase it when the data structures are small or memory limits allow.

Defaults to: 1000
- data_buffer_size

Optional. This determines how many documents must be added to the datafile before it contacts the underlying storage entry. It is closely related to "part_size", as every time an item is added to a storage entry, all of the data held in memory for that part must be re-serialised and persisted.

You will need to increase this number when ingesting large numbers of documents.

The fastest and most efficient value for this will be the same as your maximum "part_size".

Defaults to: 25% of "part_size"
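A tuning sketch for ingesting many small documents (the values are illustrative, not recommendations); setting "data_buffer_size" equal to "part_size" follows the efficiency advice above:

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        file             => $file,
        part_size        => 5000,    # more small documents per storage entry
        data_buffer_size => 5000,    # flush only when a full part is buffered
    );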
- timestamp

Optional. This is the creation timestamp, and may be used in the resulting datafile name.
- compress

Optional boolean. Defaults to 0. Turns on LZMA compression when saving the datafile, and appends '.lzma' to the filename. If "file" is provided, it will be moved to the compressed path.
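For example (the filename is illustrative):

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->new(
        file     => 'import.interiors.datafile',
        compress => 1,
    );

    # the datafile is saved with an '.lzma' suffix, i.e.
    # import.interiors.datafile.lzma
    my $file = $datafile->write;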
GETTERS AND SETTERS
Please see Data::AnyXfer::Elastic::Role::IndexInfo for the interface definition and information.
author_comment
    $datafile->author_comment(
        'Imported from database "mystuff" on `mysql-db-1.myfqdn.net` @ 2015-10-04T12:56:21');
Use this to store useful information about where this data came from.
STATISTICS AND CONSISTENCY
get_document_count
    $datafile->get_document_count;
Attempts to find the document count for the complete datafile. This is fast on non-legacy datafiles, as the count is pre-calculated when the datafile is authored.
READING AND WRITING
read
Synonym for "new".
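As in the SYNOPSIS:

    my $datafile = Data::AnyXfer::Elastic::Import::DataFile->read(
        file => $pathclass_file );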
fetch_data
    $datafile->fetch_data(\&import_data);

    sub import_data {
        my @data = @_;
        print join( "\n", @data );
    }
Retrieves the import data in batches, and passes each batch to the supplied callback for processing until the data is exhausted.
add_document
    $datafile->add_document( { some => 'data' } );
Adds another Elasticsearch document to the datafile for import.
write
    my $file = $datafile->write;
Packages and writes the data out to the destination datafile.
UTILITY METHODS
export_index_info
    my $index_info = $datafile->export_index_info;
Convenience method which creates an ad-hoc Data::AnyXfer::Elastic::IndexInfo instance representing the datafile's target info (where the data would go if this datafile were played as-is by an importer).
COPYRIGHT
This software is copyright (c) 2019, Anthony Lucas.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.