NAME

Datahub::Factory - A CLI tool that transforms and transports data from a data source to a data sink.

SYNOPSIS

# From the command line
$ dhconveyor <command> OPTIONS

# Transport data via a pipeline configuration file, without output
$ dhconveyor transport -p pipeline.ini

# Pretty output
$ dhconveyor transport -p pipeline.ini -v

# Show logging output
# Log levels: 1 - 3
$ dhconveyor transport -p pipeline.ini -L 3

# Only process the first 5 records
$ dhconveyor transport -p pipeline.ini -n 5

# Break up the pipeline configuration file into separate files
$ dhconveyor transport -g general.ini -i importer.ini -f fixer.ini -e exporter.ini

# Pushing a JSON file to a search index (Solr)
$ dhconveyor index -p solr.ini

# Pretty output
$ dhconveyor index -p solr.ini -v

# Show logging output
# Log levels: 1 - 3
$ dhconveyor index -p solr.ini -L 3

DESCRIPTION

This package implements a command line ETL (Extract, Transform, Load) toolkit written as a wrapper around Catmandu and its ecosystem of Perl modules.

Features:

Configuration files as ETL pipelines. The Catmandu command takes input via CLI options and arguments that are defined in its modules. Depending on the invocation, commands can end up with a long list of parameters containing potentially sensitive information. Sequestering that information in separate files allows users to approach pipelines as a configuration management concern.
Conditional transforming. Pipelines can define multiple Catmandu Fixes. A check on a context-dependent value (e.g. a repository field) allows the toolkit to dynamically apply the correct fix at runtime.
Loose coupling with the Catmandu ecosystem. Wrapping Catmandu modules brings dependency inversion. This makes it easier to swap out Catmandu modules for something else without touching the infrastructure configuration.
Robust processing with increased fault tolerance. Invalid records or input are logged rather than halting the entire process.
Extensibility. Leveraging a modular approach, this toolkit can be expanded by custom modules for specific use cases.

Datahub::Factory fetches data from several sources as specified by the Importer settings, executes a Catmandu::Fix on it and sends the result to a data sink, set via an Exporter. Several importer and exporter modules are provided out of the box, but developers can extend the functionality with their own modules.

Datahub::Factory supports Log4perl.

USE

Command line options

All commands share the following switches:

--log_level -L [int]

Set the log_level. Takes a numeric parameter. Supported levels are: 1 (WARN), 2 (INFO), 3 (DEBUG). WARN (1) is the default.

--log_output

Selects an output for the log messages. By default, messages are sent to STDERR (pass STDERR as parameter); STDOUT (pass STDOUT) and a log file are also supported.

--verbose -v

Set verbosity. Invoking the command with the --verbose, -v flag will render verbose output to the terminal.

--number -n [int]

Set number of records to process. Invoking the transport command with the --number, -n flag will process the first [int] records instead of all records available at the data source.

Available Commands

help COMMAND

Documentation about command line options.

transport OPTIONS

Fetch data from a local or remote source, convert the data to a target format and structure and export the data to a local or remote data sink.

index OPTIONS

Fetch data from a local source and push it to an enterprise search engine in bulk. Currently, only Apache Solr (https://lucene.apache.org/solr/) is supported.

CONFIGURATION

Pipelines are defined in configuration files which are formatted according to the INI structure as expected by the Config::Simple library. Any pipeline consists of 4 parts: a General block, an Importer block, a Fixer block and an Exporter block.

Examples can be found in https://github.com/thedatahub/Datahub-Factory-Pipelines.
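
Because every pipeline needs the four blocks described above, a quick check before a run can catch an incomplete file early. The following is a minimal sketch, not part of Datahub::Factory: the function name `check_pipeline_blocks` is hypothetical, and it only looks for the four top-level section headers in a single file passed with -p. It does not validate the plugin sections or any values.

```shell
# Hypothetical helper, not part of Datahub::Factory.
# Prints "ok" when the file holds all four top-level blocks; otherwise
# lists the missing ones and returns a non-zero status.
check_pipeline_blocks() {
    file=$1
    missing=""
    for block in General Importer Fixer Exporter; do
        # Look for a section header such as "[General]" at the start of a line.
        if ! grep -q "^\[$block\]" "$file"; then
            missing="$missing $block"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing blocks:$missing"
        return 1
    fi
    echo "ok"
}
```

It could be chained in front of a run, for instance `check_pipeline_blocks pipeline.ini && dhconveyor transport -p pipeline.ini`.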

A simple example that pushes OAI data to a YAML output on STDOUT:

[General]
id_path = administrativeMetadata.recordWrap.recordID.0._

[Importer]
plugin = OAI

[plugin_importer_OAI]
endpoint = https://datahub.vlaamsekunstcollectie.be/oai
handler = +Catmandu::Importer::OAI::Parser::lido
metadata_prefix = oai_lido

[Fixer]
plugin = Fix

[plugin_fixer_Fix]
file_name = '/home/foobar/datahub.fix'

[Exporter]
plugin = YAML

[plugin_exporter_YAML]

Note: The datahub.fix file is required, but can be left empty.
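
Since the fix file must exist even when no transformations are needed, it can be created as an empty file before the first run. A minimal sketch, assuming a POSIX shell; the function name `ensure_fix_file` is hypothetical and not part of Datahub::Factory:

```shell
# Hypothetical helper, not part of Datahub::Factory.
# Creates an empty fix file at the given path unless one already exists,
# so an existing set of fixes is never overwritten.
ensure_fix_file() {
    if [ ! -e "$1" ]; then
        : > "$1"
    fi
}
```

For the example above, this would be called as `ensure_fix_file /home/foobar/datahub.fix`.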

An example defining multiple fix transforms based on a context dependent value:

[General]
id_path = 'administrativeMetadata.recordWrap.recordID.0._'

[Importer]
plugin = OAI

[plugin_importer_OAI]
# endpoint = 'http://collections.britishart.yale.edu/oaicatmuseum/OAIHandler'
endpoint = https://datahub.vlaamsekunstcollectie.be/oai
handler = +Catmandu::Importer::OAI::Parser::lido
metadata_prefix = oai_lido

[Fixer]
plugin = Fix

[plugin_fixer_Fix]
condition_path = '_metadata.administrativeMetadata.0.recordWrap.recordSource.0.legalBodyName.0.appellationValue.0._'
fixers = MSK, GRO

[plugin_fixer_GRO]
condition = 'Musea Brugge - Groeningemuseum'
file_name = '/Users/foobar/groeninge.fix'

[plugin_fixer_MSK]
condition = 'Museum voor Schone Kunsten Gent'
file_name = '/Users/foobar/msk.fix'

[Exporter]
plugin = YAML

[plugin_exporter_YAML]

Note: condition_path contains the Fix path to the node that contains the context-dependent value. The condition parameter in each fixer contains the value against which the conditional check is performed.
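
The selection this configuration describes can be pictured as a lookup from the value found at condition_path to a fix file. The sketch below only illustrates that idea using the values from the example above. It is not how Datahub::Factory implements it: the function name `select_fix_file` is hypothetical, and both the exact string comparison and the non-zero return for an unmatched value are assumptions.

```shell
# Hypothetical illustration, not part of Datahub::Factory.
# $1 is the value read from the record at condition_path.
# Prints the file_name of the fixer whose condition matches it.
select_fix_file() {
    case $1 in
        'Musea Brugge - Groeningemuseum')
            echo '/Users/foobar/groeninge.fix' ;;
        'Museum voor Schone Kunsten Gent')
            echo '/Users/foobar/msk.fix' ;;
        *)
            # Assumption: no fixer matches, so signal failure to the caller.
            return 1 ;;
    esac
}
```

Calling `select_fix_file 'Museum voor Schone Kunsten Gent'` prints `/Users/foobar/msk.fix`.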

API

Datahub::Factory leverages a plugin-based architecture. This makes extending the toolkit with new functionality fairly trivial.

New commands can be added by creating a new, separate Perl module that contains a `command_name.pm` file in the `lib/Datahub/Factory/Command` path. Datahub::Factory uses the Datahub::Factory::Command namespace and leverages App::Cmd internally.

New Datahub::Factory::Importer, Datahub::Factory::Exporter, Datahub::Factory::Fixer, Datahub::Factory::Indexer plugins can be added in the same way.

AUTHORS

Matthias Vandermaesen matthias.vandermaesen@vlaamsekunstcollectie.be
Pieter De Praetere pieter@packed.be

COPYRIGHT AND LICENSE

This software is copyright (c) 2016, 2019 by PACKED, vzw, Vlaamse Kunstcollectie, vzw.

This is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License, Version 3, June 2007.