NAME
Net::Amazon::EMR - API for Amazon's Elastic Map-Reduce service
SYNOPSIS
use Net::Amazon::EMR;
my $emr = Net::Amazon::EMR->new(
AWSAccessKeyId => $AWS_ACCESS_KEY_ID,
SecretAccessKey => $SECRET_ACCESS_KEY,
ssl => 1,
);
# start a job flow
my $id = $emr->run_job_flow(Name => "Example Job",
Instances => {
Ec2KeyName => 'myKeyId',
InstanceCount => 10,
KeepJobFlowAliveWhenNoSteps => 1,
MasterInstanceType => 'm1.small',
Placement => { AvailabilityZone => 'us-east-1a' },
SlaveInstanceType => 'm1.small',
},
BootstrapActions => [{
Name => 'Bootstrap-configure',
ScriptBootstrapAction => {
Path => 's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
Args => [ '-m', 'mapred.compress.map.output=true' ],
},
}],
Steps => [{
ActionOnFailure => 'TERMINATE_JOB_FLOWS',
Name => "Set up debugging",
HadoopJarStep => {
Jar => 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
Args => [ 's3://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch' ],
},
}],
);
print "Job flow id = " . $id->JobFlowId . "\n";
# Get details of just-launched job
my $result = $emr->describe_job_flows(JobFlowIds => [ $id->JobFlowId ]);
# or get details of all jobs created after a given time
$result = $emr->describe_job_flows(CreatedAfter => '2012-12-17T07:19:57Z');
# or use a DateTime object (needs "use DateTime;" at the top of your script)
$result = $emr->describe_job_flows(CreatedAfter => DateTime->new(year => 2012, month => 12, day => 17));
# See the details of the typed result
use Data::Dumper; print Dumper($result);
# or dispense with types and see the details as a perl hash
use Data::Dumper; print Dumper($result->as_hash);
# Flexible Booleans - 1, 0, undef, 'true', 'false'
$emr->set_visible_to_all_users(JobFlowIds => [ $id->JobFlowId ], VisibleToAllUsers => 1);
$emr->set_termination_protection(JobFlowIds => [ $id->JobFlowId ], TerminationProtected => 'false');
# Add map-reduce steps and execute
$emr->add_job_flow_steps(JobFlowId => $id->JobFlowId,
Steps => [{
ActionOnFailure => 'CANCEL_AND_WAIT',
Name => "Example",
HadoopJarStep => {
Jar => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
Args => [ '-input', 's3://my-bucket/my-input',
'-output', 's3://my-bucket/my-output',
'-mapper', '/path/to/mapper-script',
'-reducer', '/path/to/reducer-script',
],
Properties => [ { Key => 'reduce_tasks_speculative_execution', Value => 'false' } ],
},
}, ... ]);
DESCRIPTION
This is an implementation of the Amazon Elastic Map-Reduce API.
CONSTRUCTOR
new(%options)
This is the constructor. Options are as follows:
AWSAccessKeyId (required)
Your AWS access key.
SecretAccessKey (required)
Your secret key.
base_url (optional)
The base URL for your chosen Amazon region; see http://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region. If not specified, the default URL is used (which implies region us-east-1).
my $emr = Net::Amazon::EMR->new(
    AWSAccessKeyId => $AWS_ACCESS_KEY_ID,
    SecretAccessKey => $SECRET_ACCESS_KEY,
    base_url => 'https://elasticmapreduce.us-west-2.amazonaws.com',
);
ssl (optional)
If set to a true value, the default base_url will use https:// instead of http://. Defaults to true.
The ssl flag is not used if base_url is set explicitly.
max_failures (optional)
Number of times to retry if a communications failure occurs, before raising an exception. Defaults to 5.
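As a rough illustration of how ssl and base_url relate, the sketch below builds a regional endpoint that follows the pattern of the us-west-2 URL shown above. The helper name emr_base_url is hypothetical and not part of Net::Amazon::EMR, and it assumes your region uses the same host naming scheme; check the AWS endpoint list before relying on it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Net::Amazon::EMR): build a regional
# endpoint URL following the pattern of the us-west-2 example above.
# ASSUMPTION: the region you pass uses the same host naming scheme.
sub emr_base_url {
    my (%args) = @_;
    my $region = $args{region} or die "region is required\n";
    # Mirror the documented default: SSL is on unless explicitly disabled.
    my $ssl    = exists $args{ssl} ? $args{ssl} : 1;
    my $scheme = $ssl ? 'https' : 'http';
    return "$scheme://elasticmapreduce.$region.amazonaws.com";
}

print emr_base_url(region => 'us-west-2'), "\n";
print emr_base_url(region => 'eu-west-1', ssl => 0), "\n";
```

The returned string can be passed as base_url to new(). Because the ssl flag is ignored once base_url is set, the scheme has to be part of the URL itself.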
METHODS
Detailed information on each of the methods can be found in the Amazon EMR API documentation. Each method takes a hash of parameters using the names given in the documentation. Parameter passing uses the following rules:
Array inputs such as InstanceGroups.member.N use their primary name and a Perl ArrayRef, i.e. InstanceGroups => [ ... ] in this example.
Either hashes or object instances may be passed in; e.g. both of the following forms are acceptable:
$emr->run_job_flow(Name => "API Test Job",
    Instances => {
        Ec2KeyName => 'xxx',
        InstanceCount => 1,
    },
);
$emr->run_job_flow(Name => "API Test Job",
    Instances => Net::Amazon::EMR::JobFlowInstancesConfig->new(
        Ec2KeyName => 'xxx',
        InstanceCount => 1,
    ),
);
Otherwise, the names of parameters are exactly as found in the Amazon documentation for API version 2009-03-31.
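To make the correspondence with names such as InstanceGroups.member.N concrete, the sketch below flattens a nested Perl structure into the dotted names used in the Amazon documentation. The function flatten_params is a hypothetical illustration and is not how the module is known to work internally; the job flow ID is a placeholder.

```perl
use strict;
use warnings;

# Hypothetical illustration only: flatten nested Perl data into the
# dotted "Name.member.N" keys used in the Amazon documentation.
sub flatten_params {
    my ($data, $prefix) = @_;
    my %flat;
    if (ref $data eq 'HASH') {
        for my $key (sort keys %$data) {
            my $name = defined $prefix ? "$prefix.$key" : $key;
            %flat = (%flat, flatten_params($data->{$key}, $name));
        }
    }
    elsif (ref $data eq 'ARRAY') {
        # List members are numbered from 1, as in InstanceGroups.member.1
        for my $i (0 .. $#$data) {
            my $name = "$prefix.member." . ($i + 1);
            %flat = (%flat, flatten_params($data->[$i], $name));
        }
    }
    else {
        $flat{$prefix} = $data;
    }
    return %flat;
}

my %flat = flatten_params({
    JobFlowId      => 'j-EXAMPLE',    # placeholder ID
    InstanceGroups => [
        { InstanceRole => 'TASK', InstanceType => 'm1.small', InstanceCount => 2 },
    ],
});
print "$_ = $flat{$_}\n" for sort keys %flat;
```

In other words, the ArrayRef you pass for InstanceGroups stands in for the numbered member.N entries, so you never write the numbering yourself.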
add_instance_groups(%params)
AddInstanceGroups adds an instance group to a running cluster. Returns a Net::Amazon::EMR::AddInstanceGroupsResult object.
add_job_flow_steps(%params)
AddJobFlowSteps adds new steps to a running job flow. Returns 1 on success.
describe_job_flows(%params)
Returns a Net::Amazon::EMR::DescribeJobFlowsResult that describes the job flows that match all of the supplied parameters.
modify_instance_groups(%params)
Modifies the number of nodes and configuration settings of an instance group. Returns 1 on success.
run_job_flow(%params)
Creates and starts running a new job flow. Returns a Net::Amazon::EMR::RunJobFlowResult object that contains the job flow ID.
set_termination_protection(%params)
Locks a job flow so the Amazon EC2 instances in the cluster cannot be terminated by user intervention, an API call, or in the event of a job-flow error. Returns 1 on success.
set_visible_to_all_users(%params)
Sets whether all AWS Identity and Access Management (IAM) users under your account can access the specified job flows. Returns 1 on success.
terminate_job_flows(%params)
Terminates a list of job flows. Returns 1 on success.
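A common use of describe_job_flows is to check whether job flows have finished. The sketch below summarises job flow states from a plain hash. It assumes that the output of as_hash mirrors the element names in the Amazon API documentation (JobFlows, JobFlowId, ExecutionStatusDetail, State); that layout is an assumption to verify against your own Data::Dumper output. The helper names and sample IDs are hypothetical.

```perl
use strict;
use warnings;

# States after which a job flow does no further work, per the Amazon docs.
my %TERMINAL = map { $_ => 1 } qw(COMPLETED FAILED TERMINATED);

# Hypothetical helper: map each job flow ID to its state.
# ASSUMPTION: $hash has the shape of the API response, i.e.
# { JobFlows => [ { JobFlowId => ..., ExecutionStatusDetail => { State => ... } } ] }
sub job_flow_states {
    my ($hash) = @_;
    my %state;
    for my $flow (@{ $hash->{JobFlows} || [] }) {
        $state{ $flow->{JobFlowId} } = $flow->{ExecutionStatusDetail}{State};
    }
    return \%state;
}

# True when no job flow is in a non-terminal state.
sub all_finished {
    my ($states) = @_;
    my $pending = grep { !$TERMINAL{$_} } values %$states;
    return $pending == 0;
}

# Hand-built sample data standing in for $result->as_hash.
my $sample = {
    JobFlows => [
        { JobFlowId => 'j-EXAMPLE1', ExecutionStatusDetail => { State => 'RUNNING' } },
        { JobFlowId => 'j-EXAMPLE2', ExecutionStatusDetail => { State => 'COMPLETED' } },
    ],
};
my $states = job_flow_states($sample);
print "$_: $states->{$_}\n" for sort keys %$states;
print all_finished($states) ? "all finished\n" : "still running\n";
```

With a live connection you would pass the hash obtained from describe_job_flows(...)->as_hash instead of $sample, and sleep between polls.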
ERROR HANDLING
If an error occurs in any of the methods, the error will be logged and an Exception::Class exception of type Net::Amazon::EMR::Exception will be thrown.
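One way to handle these exceptions is to wrap calls in eval and test the class of whatever was thrown. The sketch below uses only core Perl; the wrapper name try_emr is hypothetical, and message() is the accessor that Exception::Class provides on its exception objects.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical wrapper: run an EMR call and return (result, error message)
# instead of letting a Net::Amazon::EMR::Exception propagate.
sub try_emr {
    my ($call) = @_;
    my $result;
    my $ok = eval { $result = $call->(); 1 };
    return ($result, undef) if $ok;

    my $err = $@;
    if (blessed($err) && $err->isa('Net::Amazon::EMR::Exception')) {
        # message() is inherited from Exception::Class::Base.
        return (undef, $err->message);
    }
    die $err;    # not an EMR error: rethrow unchanged
}

# Usage sketch; $emr is a Net::Amazon::EMR object as in the SYNOPSIS:
#   my ($result, $error) = try_emr(sub {
#       $emr->describe_job_flows(CreatedAfter => '2012-12-17T07:19:57Z');
#   });
#   warn "EMR call failed: $error\n" if defined $error;
```

Errors of any other kind are rethrown, so unrelated bugs in your own code are not silently swallowed.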
ERROR LOGGING
Quick Start
Logging uses Log::Log4perl. You should initialise Log::Log4perl at the beginning of your program to suit your needs. The simplest way to enable debugging output to STDERR is to call
use Log::Log4perl qw/:easy/;
Log::Log4perl->easy_init($DEBUG);
Advanced Logging Configuration
Log::Log4perl provides great flexibility and there are many ways to set it up. A favourite of my own is to use Config::General format to specify all configuration parameters including logging, and to initialise in the following manner:
use Config::General qw/ParseConfig/;
use Log::Log4perl;
my %opts = ParseConfig(-ConfigFile => 'my.conf',
    -SplitPolicy => 'equalsign',
    -UTF8 => 1);
...
# Initialise logging only once, preferring the settings from my.conf
unless (Log::Log4perl->initialized) {
    if ($opts{log4perl}) {
        Log::Log4perl->init($opts{log4perl});
    }
    else {
        Log::Log4perl->easy_init();
    }
}
And a typical configuration in Config::General format might look like this:
<log4perl>
log4perl.rootLogger = DEBUG, Screen, Logfile
log4perl.appender.Logfile = Log::Log4perl::Appender::File
log4perl.appender.Logfile.filename = debug.log
log4perl.appender.Logfile.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.Logfile.layout.ConversionPattern = "%d %-5p %c - %m%n"
log4perl.appender.Screen = Log::Log4perl::Appender::ScreenColoredLevels
log4perl.appender.Screen.stderr = 1
log4perl.appender.Screen.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.Screen.layout.ConversionPattern = "[%d] [%p] %c %m%n"
</log4perl>
Logging Verbosity
At DEBUG level, the output can be very lengthy. To see only important messages for Net::Amazon::EMR whilst debugging other parts of your code, you could raise the threshold just for Net::Amazon::EMR by adding the following to your Log4perl configuration:
log4perl.logger.Net.Amazon.EMR = WARN
Map-Reduce Notes
This is somewhat beyond the scope of the documentation for using Net::Amazon::EMR. Nevertheless, here are a few notes about using EMR with Perl.
Installing Perl Libraries
Undoubtedly, to run any serious processing, you will need to install additional libraries on the map-reduce servers. A practical way to do this is to pre-configure all of the libraries using local::lib and use a bootstrap task to install them when the servers boot, using steps similar to the following:
Start an interactive EMR job on a single instance using the same machine architecture (e.g. m1.large) that you plan to use for running your jobs.
Connect to the instance using ssh.
Set up CPAN, then fetch and install local::lib.
Edit .bashrc so that it sets the environment variables required to use local::lib.
Install all the other modules you need via cpan.
Clean up files from .cpan that you don't need, such as build and source directories.
Create a tar file, e.g. tar cfz local-perl5.tar.gz perl5 .cpan .bashrc
Copy the tar file to your bucket on S3.
Set up a bootstrap script to copy back the tar file from S3 and untar it into the hadoop home directory, e.g.
#!/bin/bash
set -e
bucket=mybucketname
tarfile=local-perl5.tar.gz
arch=large
cd $HOME
hadoop fs -get s3://$bucket/$arch/$tarfile .
tar xfz $tarfile
Put the bootstrap script on S3 and use it when creating a new job flow.
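The last step above can be sketched as follows: a small helper that builds one BootstrapActions entry in the same shape as the SYNOPSIS. The helper name bootstrap_action, the action name, and the script path are all hypothetical placeholders; only the bucket name is taken from the bootstrap script above.

```perl
use strict;
use warnings;

# Hypothetical helper: build one BootstrapActions entry pointing at a
# script stored on S3. Bucket and script names are placeholders.
sub bootstrap_action {
    my (%args) = @_;
    my %script = (Path => "s3://$args{bucket}/$args{script}");
    # Only include Args when the caller supplied some.
    $script{Args} = $args{args} if $args{args};
    return {
        Name                  => $args{name},
        ScriptBootstrapAction => \%script,
    };
}

my $action = bootstrap_action(
    name   => 'Install local Perl libraries',
    bucket => 'mybucketname',
    script => 'bootstrap/install-perl5.sh',    # placeholder path
);
print "$action->{Name}: $action->{ScriptBootstrapAction}{Path}\n";
```

The resulting hash can then be passed to run_job_flow as BootstrapActions => [ $action ], alongside the other parameters shown in the SYNOPSIS.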
Mappers and Reducers
This section assumes the reader is familiar with the basic principles of map-reduce. When implemented in Perl with hadoop-streaming.jar, a mapper or reducer is simply a script that reads from STDIN and writes to STDOUT, typically line by line, with a tab-separated key and value pair on each line. The main loop of any mapper/reducer script is therefore usually of the form:
while (my $line = <>) {
    chomp $line;
    my ($key, $value) = split(/\t/, $line);
    # ... do something with key and value; this placeholder passes them through
    my ($newkey, $newvalue) = ($key, $value);
    print "$newkey\t$newvalue\n";
}
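As a fuller illustration of the loop above, here is a minimal word-count mapper and reducer that can be tried locally without Hadoop. It is a sketch under stated assumptions: the function names are hypothetical, the functions take filehandles so they can be tested on in-memory data, and a plain sort stands in for Hadoop's shuffle phase. In a real streaming job each function would live in its own script reading STDIN and writing STDOUT.

```perl
use strict;
use warnings;

# Mapper: emit "word<TAB>1" for every whitespace-separated word.
sub map_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        print {$out} lc($_), "\t1\n" for grep { length } split /\s+/, $line;
    }
}

# Reducer: sum the values for each key. Relies on the input being sorted
# by key, so that equal keys arrive on adjacent lines.
sub reduce_stream {
    my ($in, $out) = @_;
    my ($current, $total) = (undef, 0);
    while (my $line = <$in>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        if (defined $current && $key ne $current) {
            print {$out} "$current\t$total\n";
            $total = 0;
        }
        $current = $key;
        $total  += $value;
    }
    print {$out} "$current\t$total\n" if defined $current;
}

# Local dry run on in-memory data.
my $input = "the cat\nthe hat\n";
open my $in, '<', \$input or die $!;
my $mapped = '';
open my $map_out, '>', \$mapped or die $!;
map_stream($in, $map_out);
close $map_out;

# Sorting the mapper output stands in for Hadoop's shuffle phase.
my $sorted = join '', sort { $a cmp $b } split /^/, $mapped;
open my $red_in, '<', \$sorted or die $!;
reduce_stream($red_in, \*STDOUT);
```

Keeping the logic in functions that accept filehandles makes it easy to test a mapper or reducer on small samples before paying for cluster time.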
Scripts can be uploaded to S3 using the web interface, or placed in the bootstrap bundle described above, or uploaded to the master instance using scp and distributed using the hadoop-streaming.jar -file option, or no doubt by many other mechanisms. If due care is taken with quoting, a script can even be specified using the -mapper and -reducer options directly; for example:
Args => [ '-mapper', '"perl -e MyClass->new->mapper"', ... ]
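To tie the pieces together, the sketch below assembles a hadoop-streaming step in the same shape as the add_job_flow_steps call in the SYNOPSIS. The helper name streaming_step is hypothetical, and all S3 and script paths are the placeholders used earlier in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a hadoop-streaming step in the shape used
# by add_job_flow_steps in the SYNOPSIS. All paths are placeholders.
sub streaming_step {
    my (%args) = @_;
    return {
        Name            => $args{name},
        ActionOnFailure => $args{on_failure} || 'CANCEL_AND_WAIT',
        HadoopJarStep   => {
            Jar  => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
            Args => [
                '-input',   $args{input},
                '-output',  $args{output},
                '-mapper',  $args{mapper},
                '-reducer', $args{reducer},
            ],
        },
    };
}

my $step = streaming_step(
    name    => 'Word count',
    input   => 's3://my-bucket/my-input',
    output  => 's3://my-bucket/my-output',
    mapper  => '/path/to/mapper-script',
    reducer => '/path/to/reducer-script',
);
print join(' ', @{ $step->{HadoopJarStep}{Args} }), "\n";
```

The step can then be submitted with $emr->add_job_flow_steps(JobFlowId => $id->JobFlowId, Steps => [ $step ]), where $id is the result of run_job_flow.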
AUTHOR
Jon Schutz
BUGS
Please report any bugs or feature requests to bug-net-amazon-emr at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Net-Amazon-EMR. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Net::Amazon::EMR
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
The core interface code was adapted from Net::Amazon::EC2.
LICENSE AND COPYRIGHT
Copyright 2012 Jon Schutz.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://dev.perl.org/licenses/ for more information.
SEE ALSO
Amazon EMR API: http://docs.amazonwebservices.com/ElasticMapReduce/latest/APIReference/Welcome.html