NAME
Apache::Hadoop::Config - Perl extension for Hadoop node configuration
SYNOPSIS
use Apache::Hadoop::Config;
Hadoop configuration setup
Configuring Apache Hadoop to build a cluster with the default settings is easy, but those defaults are not suitable for the wide variety of hardware in use. This Perl package proposes optimal values for some of the configuration parameters based on the hardware configuration and user requirements.
It is primarily designed to read the hardware configuration from the /proc file system to determine CPU cores, system memory and disk layout. Alternatively, these parameters can be supplied manually as arguments to generate the recommended settings.
The package can also create namenode and datanode repositories, set appropriate permissions and generate the configuration XML files with the recommended settings.
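A minimal usage sketch, using only the methods documented in the DESCRIPTION below and writing the XML files to the current directory:

    use Apache::Hadoop::Config;

    # construct from the local /proc information
    my $h = Apache::Hadoop::Config->new;

    # propose settings and write the XML configuration files
    $h->basic_config;
    $h->memory_config;
    $h->write_config (confdir=>'.');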
DESCRIPTION
The Perl extension Apache::Hadoop::Config addresses Hadoop deployment and configuration practice, enabling rapid provisioning of a Hadoop cluster with customization. It has two distinct capabilities: (1) generating configuration files, and (2) creating namenode and datanode repositories.
Ideally, this package is installed on at least one node in the cluster, assuming all nodes have identical hardware. Alternatively, it can be installed on any other machine, the required hardware information supplied as arguments, and the generated configuration files copied to the actual cluster nodes, as sketched below.
To create the namenode and datanode repositories, however, the package must be installed on ALL Hadoop cluster nodes.
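For the remote-generation case described above, a brief sketch; the parameter names meminfo, cpuinfo and diskinfo are the ones used in the EXAMPLES section, and the values here are illustrative (memory in GB, as in the sample run shown later):

    use Apache::Hadoop::Config;

    # describe the target node's hardware instead of reading the local /proc
    my $h = Apache::Hadoop::Config->new (
        meminfo  => 64,   # system memory in GB (illustrative)
        cpuinfo  => 16,   # number of cpu cores
        diskinfo => 6,    # number of data disks
    );
    $h->basic_config;
    $h->memory_config;
    $h->write_config (confdir=>'.');
    # copy the generated XML files to the actual cluster nodes afterwards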
Create a new Apache::Hadoop::Config object, either using the local system configuration or by supplying the hardware details as arguments.
my $h = Apache::Hadoop::Config->new;
Basic configuration and memory settings are generated by two functions. Calling the basic configuration function is required, while the memory configuration function is recommended.
$h->basic_config;
$h->memory_config;
The package can print or write the XML configuration files independently, using the print and write functions. A writable conf directory must be supplied in order to write the configuration XML files.
$h->print_config;
$h->write_config (confdir=>'etc/hadoop');
Additional configuration parameters can be supplied at the time of creating the object.
my $h = Apache::Hadoop::Config->new (
    config => {
        'mapred-site.xml' => {
            'mapreduce.task.io.sort.mb' => 256,
        },
        'core-site.xml' => {
            'hadoop.tmp.dir' => '/tmp/hadoop',
        },
    },
);
These parameters override any automatically generated values built into the package.
The package creates the namenode and datanode volumes and sets the permissions on hadoop.tmp.dir and the log directories. The disk information can be supplied at object construction time.
my $h = Apache::Hadoop::Config->new (
    hdfs_name_disks => [ '/hdfs/namedisk1', '/hdfs/namedisk2' ],
    hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
    hdfs_tmp        => '/hdfs/tmp',
    hdfs_logdir     => [ '/logs', '/logs/userlog' ],
);
Note that hdfs_name_disks and hdfs_data_disks each take a reference to an array. The package creates all of the namenode and datanode volumes as well as the log and tmp directories.
$h->create_hdfs_name_disks;
$h->create_hdfs_data_disks;
$h->create_hdfs_tmpdir;
$h->create_hadoop_logdir;
The permissions will be set as appropriate. It is strongly recommended that this package and the associated script are executed as the Hadoop admin user (hduser).
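A small guard like the following, placed before the create_* calls, can enforce that recommendation; the check itself is not part of the module, and the user name is site-specific:

    # abort unless the script runs as the Hadoop admin user
    my $admin   = 'hduser';                    # adjust to your site
    my $current = getpwuid($>) // 'unknown';   # effective user name
    die "please run as $admin (currently $current)\n"
        unless $current eq $admin;

    $h->create_hdfs_name_disks;
    $h->create_hdfs_data_disks;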
Some of the basic configuration can be customized externally through object arguments. The namenode, secondary namenode and proxy node can each be specified; the default for each of them is localhost.
my $h = Apache::Hadoop::Config->new (
    namenode  => 'nn.myorg.com',
    secondary => 'nn2.myorg.com',
    proxynode => 'pr.myorg.com',
    proxyport => '8888',   # default, optional
);
These are optional and required only when the secondary namenode or the proxy node is different from the primary namenode.
EXAMPLES
Below are a few examples of different uses. The first example creates the recommended configuration from the localhost or from data provided on the command line:
#!/usr/bin/perl -w
use strict;
use warnings;
use Apache::Hadoop::Config;
use Getopt::Long;
my %opts;
GetOptions (\%opts, 'disks=s','memory=s','cores=s');
my $h = Apache::Hadoop::Config->new (
    meminfo  => $opts{'memory'} || undef,
    cpuinfo  => $opts{'cores'}  || undef,
    diskinfo => $opts{'disks'}  || undef,
);
# setup configs
$h->basic_config;
$h->memory_config;
# print and save
$h->print_config;
$h->write_config (confdir=>'.');
exit(0);
The above gives output like the following when no arguments are supplied:
min cont size (mb) : 256
num of containers : 7
mem per container (mb): 368
disk : 4
cpu : 4
mem : 3.52075958251953
---------------
hdfs-site.xml
dfs.namenode.secondary.http-address: 0.0.0.0:50090
dfs.replication: 1
dfs.datanode.data.dir: file:///hdfs/data1,file:///hdfs/data2,file:///hdfs/data3,file:///hdfs/data4
dfs.namenode.secondary.https-address: 0.0.0.0:50091
dfs.namenode.name.dir: file:///hdfs/name1,file:///hdfs/name2
yarn-site.xml
yarn.web-proxy.address: localhost:8888
yarn.nodemanager.aux-services: mapreduce_shuffle
yarn.scheduler.minimum-allocation-mb: 368
yarn.scheduler.maximum-allocation-mb: 2576
yarn.nodemanager.aux-services.mapreduce.shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
yarn.nodemanager.resource.memory-mb: 2576
core-site.xml
hadoop.tmp.dir: /hdfs/tmp
fs.defaultFS: http://localhost:9000
mapred-site.xml
mapreduce.reduce.java.opts: -Xmx588m
mapreduce.map.memory.mb: 368
mapreduce.map.java.opts: -Xmx294m
mapreduce.framework.name: yarn
mapreduce.reduce.memory.mb: 736
---------------
-> writing to ./hdfs-site.xml ...
-> writing to ./yarn-site.xml ...
-> writing to ./core-site.xml ...
-> writing to ./mapred-site.xml ...
If arguments are supplied, typically to describe a different cluster, the configuration files can still be generated:
$ perl hadoop_config.pl --cores 16 --memory 64 --disks 6
min cont size (mb) : 2048
num of containers : 10
mem per container (mb): 5734
disk : 6
cpu : 16
mem : 64
---------------
hdfs-site.xml
dfs.namenode.secondary.http-address: 0.0.0.0:50090
dfs.replication: 1
dfs.datanode.data.dir: file:///hdfs/data1,file:///hdfs/data2,file:///hdfs/data3,file:///hdfs/data4
dfs.namenode.secondary.https-address: 0.0.0.0:50091
dfs.namenode.name.dir: file:///hdfs/name1,file:///hdfs/name2
yarn-site.xml
yarn.web-proxy.address: localhost:8888
yarn.nodemanager.aux-services: mapreduce_shuffle
yarn.scheduler.minimum-allocation-mb: 5734
yarn.scheduler.maximum-allocation-mb: 57340
yarn.nodemanager.aux-services.mapreduce.shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
yarn.nodemanager.resource.memory-mb: 57340
core-site.xml
hadoop.tmp.dir: /hdfs/tmp
fs.defaultFS: http://localhost:9000
mapred-site.xml
mapreduce.reduce.java.opts: -Xmx9174m
mapreduce.map.memory.mb: 5734
mapreduce.map.java.opts: -Xmx4587m
mapreduce.framework.name: yarn
mapreduce.reduce.memory.mb: 11468
---------------
-> writing to ./hdfs-site.xml ...
-> writing to ./yarn-site.xml ...
-> writing to ./core-site.xml ...
-> writing to ./mapred-site.xml ...
Further customization can be done using the object's constructor arguments, as in the sketch below.
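For instance, a combined sketch that merges the hardware, host, disk and override arguments shown above into a single constructor call; the hostnames and paths are placeholders, and passing all of these arguments together is assumed to work as the individual examples suggest:

    use Apache::Hadoop::Config;

    my $h = Apache::Hadoop::Config->new (
        namenode        => 'nn.myorg.com',
        secondary       => 'nn2.myorg.com',
        proxynode       => 'pr.myorg.com',
        hdfs_name_disks => [ '/hdfs/namedisk1', '/hdfs/namedisk2' ],
        hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
        config          => {
            'mapred-site.xml' => {
                'mapreduce.task.io.sort.mb' => 256,
            },
        },
    );
    $h->basic_config;
    $h->memory_config;
    $h->write_config (confdir=>'etc/hadoop');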
SEE ALSO
hadoop.apache.org - The Hadoop documentation and authoritative source for Apache Hadoop and its components.
AUTHOR
Snehasis Sinha, <snehasis@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015 by Snehasis Sinha
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.