NAME

Datahub::Factory::Importer::MSK - Import data from the MSK OAI-PMH endpoint

SYNOPSIS

use Datahub::Factory::Importer::MSK;
use Data::Dumper qw(Dumper);

my $oai = Datahub::Factory::Importer::MSK->new(
    endpoint               => 'https://endpoint.msk.be/oai',
    metadata_prefix        => 'oai_lido',
    set                    => '2011',
    pid_module             => 'rcf',
    pid_username           => 'datahub',
    pid_password           => 'datahub',
    pid_rcf_container_name => 'datahub',
);

$oai->importer->each(sub {
    my $item = shift;
    print Dumper($item);
});

DESCRIPTION

Datahub::Factory::Importer::MSK imports data from the MSK OAI-PMH endpoint. By default it uses the ListRecords verb to return all records in the oai_lido format. The import can be restricted to a single set, or to records created, modified or deleted between two dates (from and until).

It automatically deals with resumptionTokens, so client code does not have to implement paging.
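As a minimal sketch of a selective harvest, the options below restrict the import to one set and a date window, using the parameter names listed under PARAMETERS. The set name, dates and credentials are placeholders, and the harvest helper is hypothetical. It is only defined here, not called, because running it needs the module installed and network access to the endpoint.

```perl
use strict;
use warnings;

# Options for a selective harvest. The set name, dates and credentials are
# placeholders; from and until use the OAI-PMH date format (YYYY-MM-DD).
my %options = (
    endpoint               => 'https://endpoint.msk.be/oai',
    metadata_prefix        => 'oai_lido',
    set                    => '2011',
    from                   => '2017-01-01',
    until                  => '2017-06-30',
    pid_module             => 'rcf',
    pid_username           => 'datahub',
    pid_password           => 'datahub',
    pid_rcf_container_name => 'datahub',
);

# Hypothetical helper: harvests every matching record and returns the count.
# The module is loaded lazily so this file compiles without it installed.
sub harvest {
    my (%args) = @_;
    require Datahub::Factory::Importer::MSK;
    my $oai   = Datahub::Factory::Importer::MSK->new(%args);
    my $count = 0;
    # each() visits every record; paging via resumptionTokens is handled
    # by the importer, so no paging logic is needed here.
    $oai->importer->each(sub { $count++ });
    return $count;
}
```

Calling harvest(%options) would then return the number of records that fall inside the window.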

To support PIDs, it fetches PID CSVs from Rackspace Cloud Files and converts them to temporary SQLite tables.

For this, provide pid_username, pid_password and pid_rcf_container_name.

PARAMETERS

The endpoint parameter and some PID module parameters are required.

To link PIDs (Persistent Identifiers) to MSK records, the PID module must fetch a CSV from either a Rackspace Cloud Files instance (protected by username and password) or a public web site. Different options must be set depending on which of the two you choose. If an option does not apply to your selected module, you can skip the parameter or set it to undef.

The CSV files are converted to SQLite tables inside /tmp and can be used in your fixes. See msk.fix for an example.

endpoint

URL of the OAI endpoint.

handler( sub {} | $object | 'NAME' | '+NAME' )

Handler to transform each record from XML DOM (XML::LibXML::Element) into Perl hash.

Handlers can be provided as a function reference, an instance of a Perl package that implements 'parse', or a package NAME. A package name prepended by + is used as given; any other name is looked up in the Catmandu::Importer::OAI::Parser namespace. E.g. foobar will create a Catmandu::Importer::OAI::Parser::foobar instance.

By default the handler Catmandu::Importer::OAI::Parser::oai_dc is used for metadataPrefix oai_dc, Catmandu::Importer::OAI::Parser::marcxml for marcxml, Catmandu::Importer::OAI::Parser::mods for mods, Catmandu::Importer::OAI::Parser::Lido for Lido and Catmandu::Importer::OAI::Parser::struct for other formats. In addition there is Catmandu::Importer::OAI::Parser::raw to return the XML as it is.
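The naming rule for package handlers can be illustrated with a short sketch. The resolve_parser_name helper below is hypothetical and not part of Catmandu or Datahub::Factory; it only mirrors the rule as described here, under the assumption that a leading + means the name is taken literally.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the handler naming rule; it is not
# part of Catmandu or Datahub::Factory.
sub resolve_parser_name {
    my ($name) = @_;
    my $namespace = 'Catmandu::Importer::OAI::Parser';

    # A leading + marks a full package name: strip the + and use the rest.
    return $name if $name =~ s/^\+//;

    # Any other name is looked up inside the parser namespace.
    return "${namespace}::${name}";
}

print resolve_parser_name('foobar'), "\n";
print resolve_parser_name('+My::Parser'), "\n";
```

Here My::Parser stands for any package of your own that implements parse.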

metadata_prefix

Any metadata prefix the endpoint supports. Defaults to oai_lido.

set

Optionally, a set to get records from.

from

Optionally, a lower-bound date: only records created, modified or deleted on or after this date are returned.

until

Optionally, an upper-bound date: only records created, modified or deleted on or before this date are returned.

username
password

PID options

pid_module

Choose the PID module you want to use. Set to rcf to use Rackspace Cloud Files, or to lwp to use a public web site.

pid_username

Provide your Rackspace Cloud Files username. If you selected lwp, provide an optional username (for HTTP Basic Authentication).

pid_password

Provide your Rackspace Cloud Files API key. For lwp, an optional password.

pid_rcf_container_name

For rcf, provide the name of the container that holds the PID CSVs.

pid_lwp_realm

For lwp, optionally provide the HTTP Basic Authentication realm.

pid_lwp_base_url

For lwp, provide the URL where the CSVs are stored. This URL is combined with the name of the CSV file to create the URL the file is fetched from (i.e. my $url = $pid_lwp_base_url . $csv_file_name).
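As a rough sketch of the lwp variant, the options below select the lwp module and omit the rcf-only container name. The host, credentials, realm and CSV file name are placeholders, and the plain string concatenation is an assumption based on the description of pid_lwp_base_url, which is why the base URL here ends in a slash.

```perl
use strict;
use warnings;

# Options for the lwp PID module. The host, credentials, realm and file
# name are placeholders, not values from the MSK setup.
my %pid_options = (
    pid_module       => 'lwp',
    pid_username     => 'datahub',
    pid_password     => 'datahub',
    pid_lwp_realm    => 'PIDs',
    pid_lwp_base_url => 'https://example.org/pids/',
);

# Assumption: base URL and CSV file name are joined by plain string
# concatenation, so the base URL carries its own trailing slash.
my $csv_file_name = 'pids.csv';
my $url = $pid_options{pid_lwp_base_url} . $csv_file_name;

print "$url\n";
```

These pairs would be passed to the constructor alongside endpoint and the other OAI parameters.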

ATTRIBUTES

importer

An importer that can be used in your script.

AUTHOR

Pieter De Praetere <pieter at packed.be>

COPYRIGHT

Copyright 2017- PACKED vzw

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Datahub::Factory, Catmandu