NAME

Net::Amazon::EMR - API for Amazon's Elastic Map-Reduce service

SYNOPSIS

use Net::Amazon::EMR;

my $emr = Net::Amazon::EMR->new(
  AWSAccessKeyId  => $AWS_ACCESS_KEY_ID,
  SecretAccessKey => $SECRET_ACCESS_KEY,
  ssl             => 1,
  );

# start a job flow
my $id = $emr->run_job_flow(Name => "Example Job",
                            Instances => {
                                Ec2KeyName => 'myKeyId',
                                InstanceCount => 10,
                                KeepJobFlowAliveWhenNoSteps => 1,
                                MasterInstanceType => 'm1.small',
                                Placement => { AvailabilityZone => 'us-east-1a' },
                                SlaveInstanceType => 'm1.small',
                            },
                            BootstrapActions => [{
                              Name => 'Bootstrap-configure',
                              ScriptBootstrapAction => {
                                Path => 's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
                                Args => [ '-m', 'mapred.compress.map.output=true' ],
                              },
                            }],
                            Steps => [{
                              ActionOnFailure => 'TERMINATE_JOB_FLOWS',
                              Name => "Set up debugging",
                              HadoopJarStep => { 
                                         Jar => 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
                                         Args => [ 's3://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch' ],
                                         },
                            }],
                          );

print "Job flow id = " . $id->JobFlowId . "\n";

# Get details of just-launched job
my $result = $emr->describe_job_flows(JobFlowIds => [ $id->JobFlowId ]);

# or get details of all jobs created after a given time
$result = $emr->describe_job_flows(CreatedAfter => '2012-12-17T07:19:57Z');

# or use DateTime
$result = $emr->describe_job_flows(CreatedAfter => DateTime->new(year => 2012, month => 12, day => 17));

# See the details of the typed result
use Data::Dumper; print Dumper($result);

# or dispense with types and see the details as a perl hash
use Data::Dumper; print Dumper($result->as_hash);

# Flexible Booleans - 1, 0, undef, 'true', 'false'
$emr->set_visible_to_all_users(JobFlowIds => [ $id->JobFlowId ], VisibleToAllUsers => 1);
$emr->set_termination_protection(JobFlowIds => [ $id->JobFlowId ], TerminationProtected => 'false');

# Add map-reduce steps and execute
$emr->add_job_flow_steps(JobFlowId => $id->JobFlowId,
                         Steps => [{
          ActionOnFailure => 'CANCEL_AND_WAIT',
          Name => "Example",
          HadoopJarStep => { 
            Jar => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
            Args => [ '-input', 's3://my-bucket/my-input',
                      '-output', 's3://my-bucket/my-output',
                      '-mapper', '/path/to/mapper-script',
                      '-reducer', '/path/to/reducer-script',
                    ],
            Properties => [ { Key => 'reduce_tasks_speculative_execution', Value => 'false' } ],
            },
      }, ... ]);

DESCRIPTION

This is an implementation of the Amazon Elastic Map-Reduce API.

CONSTRUCTOR

new(%options)

This is the constructor. Options are as follows:

  • AWSAccessKeyId (required)

    Your AWS access key.

  • SecretAccessKey (required)

    Your secret key.

  • base_url (optional)

    The base URL for your chosen Amazon region; see http://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region. If not specified, the default URL is used (which implies region us-east-1).

    my $emr = Net::Amazon::EMR->new(
        AWSAccessKeyId  => $AWS_ACCESS_KEY_ID,
        SecretAccessKey => $SECRET_ACCESS_KEY,
        base_url => 'https://elasticmapreduce.us-west-2.amazonaws.com',
    );
  • ssl (optional)

    If set to a true value, the default base_url will use https:// instead of http://. Defaults to true.

    The ssl flag is not used if base_url is set explicitly.

  • max_failures (optional)

    Number of times to retry if a communications failure occurs, before raising an exception. Defaults to 5.

METHODS

Detailed information on each of the methods can be found in the Amazon EMR API documentation. Each method takes a hash of parameters using the names given in the documentation. Parameter passing uses the following rules:

  • Array inputs such as InstanceGroups.member.N use their primary name and a Perl ArrayRef, i.e. InstanceGroups => [ ... ] in this example.

  • Either hashes or object instances may be passed in; e.g. both of the following forms are acceptable:

    $emr->run_job_flow(Name => "API Test Job",
                                Instances => {
                                    Ec2KeyName => 'xxx',
                                    InstanceCount => 1,
                                },
        );
    
    $emr->run_job_flow(Name => "API Test Job",
                                Instances => Net::Amazon::EMR::JobFlowInstancesConfig->new(
                                    Ec2KeyName => 'xxx',
                                    InstanceCount => 1,
                                ),
        );
  • Otherwise, the names of parameters are exactly as found in the Amazon documentation for API version 2009-03-31.

add_instance_groups(%params)

AddInstanceGroups adds an instance group to a running cluster. Returns a Net::Amazon::EMR::AddInstanceGroupsResult object.

add_job_flow_steps(%params)

AddJobFlowSteps adds new steps to a running job flow. Returns 1 on success.

describe_job_flows(%params)

Returns a Net::Amazon::EMR::DescribeJobFlowsResult that describes the job flows that match all of the supplied parameters.

modify_instance_groups(%params)

Modifies the number of nodes and configuration settings of an instance group. Returns 1 on success.
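For example, to resize a hypothetical instance group to 20 nodes (the InstanceGroupId shown is a placeholder, as would be returned by add_instance_groups or describe_job_flows):

$emr->modify_instance_groups(InstanceGroups => [{
    InstanceGroupId => 'ig-EXAMPLE',
    InstanceCount   => 20,
}]);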

run_job_flow(%params)

Creates and starts running a new job flow. Returns a Net::Amazon::EMR::RunJobFlowResult object that contains the job flow ID.

set_termination_protection(%params)

Locks a job flow so the Amazon EC2 instances in the cluster cannot be terminated by user intervention, an API call, or in the event of a job-flow error. Returns 1 on success.

set_visible_to_all_users(%params)

Sets whether all AWS Identity and Access Management (IAM) users under your account can access the specified job flows. Returns 1 on success.

terminate_job_flows(%params)

Terminates a list of job flows. Returns 1 on success.

ERROR HANDLING

If an error occurs in any of the methods, the error will be logged and an Exception::Class exception of type Net::Amazon::EMR::Exception will be thrown.
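Since the exceptions are Exception::Class based, a call can be wrapped in eval and the exception inspected afterwards. A minimal sketch (the job flow id is a placeholder):

my $result = eval {
    $emr->describe_job_flows(JobFlowIds => [ 'j-EXAMPLE' ]);
};
if (my $e = Net::Amazon::EMR::Exception->caught()) {
    # Handle or log the EMR failure; Exception::Class objects
    # carry the error text in their message accessor.
    warn "EMR call failed: " . $e->message;
}
elsif ($@) {
    die $@;   # rethrow anything unexpected
}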

ERROR LOGGING

Quick Start

Logging uses Log::Log4perl. You should initialise Log::Log4perl at the beginning of your program to suit your needs. The simplest way to enable debugging output to STDERR is to call

use Log::Log4perl qw/:easy/;
Log::Log4perl->easy_init($DEBUG);

Advanced Logging Configuration

Log::Log4perl provides great flexibility and there are many ways to set it up. A favourite of my own is to use Config::General format to specify all configuration parameters including logging, and to initialise in the following manner:

use Config::General qw/ParseConfig/;

my %opts = ParseConfig(-ConfigFile => 'my.conf',
                       -SplitPolicy => 'equalsign',
                       -UTF8 => 1);
... 
 
unless (Log::Log4perl->initialized) {
    if ($opts{log4perl}) {
          Log::Log4perl::init($opts{log4perl});
    }
    else {
       Log::Log4perl->easy_init();
    }
}
 

And a typical configuration in Config::General format might look like this:

<log4perl>
  log4perl.rootLogger = DEBUG, Screen, Logfile
  log4perl.appender.Logfile = Log::Log4perl::Appender::File
  log4perl.appender.Logfile.filename = debug.log
  log4perl.appender.Logfile.layout = Log::Log4perl::Layout::PatternLayout
  log4perl.appender.Logfile.layout.ConversionPattern = "%d %-5p %c - %m%n"
  log4perl.appender.Screen = Log::Log4perl::Appender::ScreenColoredLevels
  log4perl.appender.Screen.stderr = 1
  log4perl.appender.Screen.layout = Log::Log4perl::Layout::PatternLayout
  log4perl.appender.Screen.layout.ConversionPattern = "[%d] [%p] %c %m%n"
</log4perl>

Logging Verbosity

At DEBUG level, the output can be very lengthy. To see only important messages for Net::Amazon::EMR whilst debugging other parts of your code, you could raise the threshold just for Net::Amazon::EMR by adding the following to your Log4perl configuration:

log4perl.logger.Net.Amazon.EMR = WARN

Map-Reduce Notes

This is somewhat beyond the scope of the documentation for using Net::Amazon::EMR. Nevertheless, here are a few notes about using EMR with Perl.

Installing Perl Libraries

Undoubtedly, to run any serious processing, you will need to install additional libraries on the map-reduce servers. A practical way to do this is to pre-configure all of the libraries using local::lib and use a bootstrap task to install them when the servers boot, using steps similar to the following:

  • Start an interactive EMR job on a single instance using the same machine architecture (e.g. m1.large) that you plan to use for running your jobs.

  • ssh to the instance

  • Set up CPAN, then download and install local::lib

  • Set up .bashrc to contain the environment variables required to use local::lib

  • Install all the other modules you need via cpan

  • Clean up files from .cpan that you don't need, such as build and source directories

  • Create a tar file, e.g. tar cfz local-perl5.tar.gz perl5 .cpan .bashrc

  • Copy the tar file to your bucket on S3.

  • Set up a bootstrap script to copy back the tar file from S3 and untar it into the hadoop home directory, e.g.

    #!/bin/bash
    set -e
    bucket=mybucketname
    tarfile=local-perl5.tar.gz
    arch=large
    cd $HOME
    hadoop fs -get s3://$bucket/$arch/$tarfile .
    tar xfz $tarfile
  • Put the bootstrap script on S3 and use it when creating a new job flow.
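The uploaded script can then be referenced in the BootstrapActions parameter when creating the job flow. A sketch, in which the bucket and script names are placeholders:

my $id = $emr->run_job_flow(
    Name => 'Job with local Perl libraries',
    Instances => {
        InstanceCount      => 2,
        MasterInstanceType => 'm1.large',
        SlaveInstanceType  => 'm1.large',
    },
    BootstrapActions => [{
        Name => 'Install Perl libraries',
        ScriptBootstrapAction => {
            Path => 's3://mybucketname/install-perl5.sh',
        },
    }],
);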

Mappers and Reducers

Assuming the reader is familiar with the basic principles of map-reduce: when implemented in Perl with hadoop-streaming.jar, a mapper or reducer is simply a script that reads from STDIN and writes to STDOUT, typically line by line, with a tab-separated key and value pair on each line. The main loop of any mapper/reducer script is therefore usually of the form:

while (my $line = <>) {
  chomp $line;
  my ($key, $value) = split(/\t/, $line, 2);
  # ... do something with $key and $value ...
  print "$newkey\t$newvalue\n";
}
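As a concrete illustration, a minimal word-count reducer in this style might look like the following. Hadoop streaming presents the mapper output to the reducer sorted by key, so counts can be accumulated until the key changes:

#!/usr/bin/env perl
use strict;
use warnings;

# Input lines are "word\t1" pairs, sorted by word.
my ($current, $count) = (undef, 0);
while (my $line = <STDIN>) {
    chomp $line;
    my ($key, $value) = split(/\t/, $line, 2);
    if (defined $current && $key ne $current) {
        print "$current\t$count\n";   # emit the finished key
        $count = 0;
    }
    $current = $key;
    $count += $value;
}
print "$current\t$count\n" if defined $current;   # flush the last key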

Scripts can be uploaded to S3 using the web interface, or placed in the bootstrap bundle described above, or uploaded to the master instance using scp and distributed using the hadoop-streaming.jar -file option, or no doubt by many other mechanisms. If due care is taken with quoting, a script can even be specified using the -mapper and -reducer options directly; for example:

Args => [ '-mapper', '"perl -e MyClass->new->mapper"', ... ]

AUTHOR

Jon Schutz

http://notes.jschutz.net

BUGS

Please report any bugs or feature requests to bug-net-amazon-emr at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Net-Amazon-EMR. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Net::Amazon::EMR


ACKNOWLEDGEMENTS

The core interface code was adapted from Net::Amazon::EC2.

LICENSE AND COPYRIGHT

Copyright 2012 Jon Schutz.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://dev.perl.org/licenses/ for more information.

SEE ALSO

Amazon EMR API: http://docs.amazonwebservices.com/ElasticMapReduce/latest/APIReference/Welcome.html