NAME

Apache::Hadoop::Config - Perl extension for Hadoop node configuration

SYNOPSIS

use Apache::Hadoop::Config;
Hadoop configuration setup

Apache Hadoop is easy to configure when building a cluster with default settings, but those defaults are not suitable for a wide variety of hardware configurations. This Perl package proposes optimal values for a number of configuration parameters based on the hardware configuration and user requirements.

It is primarily designed to extract hardware information from the /proc file system to determine CPU cores, system memory, and disk layout. These parameters can also be supplied manually as arguments to generate recommended settings.

This package can create namenode and datanode repositories, set appropriate permissions, and generate configuration XML files with recommended settings.

DESCRIPTION

The Perl extension Apache::Hadoop::Config is designed to address Hadoop deployment and configuration practices, enabling rapid provisioning of a Hadoop cluster with customization. It has two distinct capabilities: (1) generating configuration files, and (2) creating namenode and datanode repositories.

Ideally this package is installed on at least one of the nodes in the cluster, assuming that all nodes have identical hardware. Alternatively, it can be installed on any other node: the required hardware information is then supplied as arguments, and the generated configuration files are copied to the actual cluster nodes.

To create the namenode and datanode repositories, however, the package must be installed on ALL Hadoop cluster nodes.

Create a new Apache::Hadoop::Config object, either using the system configuration or by supplying values as constructor arguments (for example, taken from the command line).

my $h = Apache::Hadoop::Config->new; 

Basic configuration and memory settings are generated by two methods. Calling basic_config is required, while memory_config is recommended.

$h->basic_config;
$h->memory_config;

The package can print the configuration or write the XML configuration files independently, using the print and write methods. A writable configuration directory must be provided when writing the XML files.

$h->print_config;
$h->write_config (confdir=>'etc/hadoop');

Additional configuration parameters can be supplied at the time of creating the object.

my $h = Apache::Hadoop::Config->new (
    config=> {
        'mapred-site.xml' => {
            'mapreduce.task.io.sort.mb' => 256,
        },
        'core-site.xml'   => {
            'hadoop.tmp.dir' => '/tmp/hadoop',
        },
    },
);

These parameters override any automatically generated parameters built into this package.
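
For example, a minimal sketch (assuming the current directory is a writable configuration directory) shows an override carried through to the generated files alongside the recommended values:

my $h = Apache::Hadoop::Config->new (
    config=> {
        'mapred-site.xml' => {
            'mapreduce.task.io.sort.mb' => 256,
        },
    },
);

# generate recommended settings, then write the files;
# mapreduce.task.io.sort.mb is written with the overridden value 256
$h->basic_config;
$h->memory_config;
$h->write_config (confdir=>'.');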

The package creates the namenode and datanode volumes and sets permissions on hadoop.tmp.dir and the log directories. The disk information can be supplied at object construction time.

my $h = Apache::Hadoop::Config->new (
    hdfs_name_disks => [ '/hdfs/namedisk1', '/hdfs/namedisk2' ],
    hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
    hdfs_tmp        => '/hdfs/tmp',
    hdfs_logdir     => [ '/logs', '/logs/userlog' ],
    );

Note that hdfs_name_disks and hdfs_data_disks take array references. The package creates all the namenode and datanode volumes as well as the log and tmp directories:

$h->create_hdfs_name_disks;
$h->create_hdfs_data_disks;
$h->create_hdfs_tmpdir;
$h->create_hadoop_logdir;
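
Put together, a minimal sketch of repository creation might look like the following (the /hdfs and /logs paths are illustrative; this step is run on every node in the cluster):

my $h = Apache::Hadoop::Config->new (
    hdfs_name_disks => [ '/hdfs/namedisk1', '/hdfs/namedisk2' ],
    hdfs_data_disks => [ '/hdfs/datadisk1', '/hdfs/datadisk2' ],
    hdfs_tmp        => '/hdfs/tmp',
    hdfs_logdir     => [ '/logs', '/logs/userlog' ],
    );

# create the volumes and directories on this node
$h->create_hdfs_name_disks;
$h->create_hdfs_data_disks;
$h->create_hdfs_tmpdir;
$h->create_hadoop_logdir;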

Permissions are set as appropriate. It is strongly recommended that this package and the associated script be executed as the Hadoop admin user (hduser).

Some of the basic configuration can be customized using constructor arguments. The namenode, secondary namenode, and proxy node can each be set; the default for each is localhost.

my $h = Apache::Hadoop::Config->new (
    namenode => 'nn.myorg.com',
    secondary=> 'nn2.myorg.com',
    proxynode=> 'pr.myorg.com',
    proxyport=> '8888', # default, optional
    );

These are optional and required only when the secondary namenode or proxy node differs from the primary namenode.
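
A minimal sketch combining these with configuration generation (the hostnames are illustrative; they are expected to surface in properties such as fs.defaultFS, the secondary namenode addresses, and yarn.web-proxy.address, as in the sample output below):

my $h = Apache::Hadoop::Config->new (
    namenode => 'nn.myorg.com',
    secondary=> 'nn2.myorg.com',
    proxynode=> 'pr.myorg.com',
    );

$h->basic_config;
$h->memory_config;
$h->write_config (confdir=>'etc/hadoop');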

EXAMPLES

Below are a few examples of different uses. The first example creates the recommended configuration for the localhost, or from data provided on the command line:

#!/usr/bin/perl -w
use strict;
use warnings;
use Apache::Hadoop::Config;
use Getopt::Long;

my %opts;
GetOptions (\%opts, 'disks=s','memory=s','cores=s');

my $h = Apache::Hadoop::Config->new (
        meminfo=>$opts{'memory'} || undef,
        cpuinfo=>$opts{'cores'} || undef,
        diskinfo=>$opts{'disks'} || undef,
        );

# setup configs
$h->basic_config;
$h->memory_config;

# print and save
$h->print_config;
$h->write_config (confdir=>'.');

exit(0);

The above produces output like the following when no arguments are supplied:

min cont size (mb)    : 256
num of containers     : 7
mem per container (mb): 368
 disk : 4
  cpu : 4
  mem : 3.52075958251953
---------------
hdfs-site.xml
  dfs.namenode.secondary.http-address: 0.0.0.0:50090
  dfs.replication: 1
  dfs.datanode.data.dir: file:///hdfs/data1,file:///hdfs/data2,file:///hdfs/data3,file:///hdfs/data4
  dfs.namenode.secondary.https-address: 0.0.0.0:50091
  dfs.namenode.name.dir: file:///hdfs/name1,file:///hdfs/name2
yarn-site.xml
  yarn.web-proxy.address: localhost:8888
  yarn.nodemanager.aux-services: mapreduce_shuffle
  yarn.scheduler.minimum-allocation-mb: 368
  yarn.scheduler.maximum-allocation-mb: 2576
  yarn.nodemanager.aux-services.mapreduce.shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
  yarn.nodemanager.resource.memory-mb: 2576
core-site.xml
  hadoop.tmp.dir: /hdfs/tmp
  fs.defaultFS: http://localhost:9000
mapred-site.xml
  mapreduce.reduce.java.opts: -Xmx588m
  mapreduce.map.memory.mb: 368
  mapreduce.map.java.opts: -Xmx294m
  mapreduce.framework.name: yarn
  mapreduce.reduce.memory.mb: 736
---------------
-> writing to ./hdfs-site.xml ...
-> writing to ./yarn-site.xml ...
-> writing to ./core-site.xml ...
-> writing to ./mapred-site.xml ...

When arguments are supplied, for example to describe a different cluster, the configuration files can still be generated:

$ perl hadoop_config.pl --cores 16 --memory 64 --disks 6
min cont size (mb)    : 2048
num of containers     : 10
mem per container (mb): 5734
 disk : 6
  cpu : 16
  mem : 64
---------------
hdfs-site.xml
  dfs.namenode.secondary.http-address: 0.0.0.0:50090
  dfs.replication: 1
  dfs.datanode.data.dir: file:///hdfs/data1,file:///hdfs/data2,file:///hdfs/data3,file:///hdfs/data4
  dfs.namenode.secondary.https-address: 0.0.0.0:50091
  dfs.namenode.name.dir: file:///hdfs/name1,file:///hdfs/name2
yarn-site.xml
  yarn.web-proxy.address: localhost:8888
  yarn.nodemanager.aux-services: mapreduce_shuffle
  yarn.scheduler.minimum-allocation-mb: 5734
  yarn.scheduler.maximum-allocation-mb: 57340
  yarn.nodemanager.aux-services.mapreduce.shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
  yarn.nodemanager.resource.memory-mb: 57340
core-site.xml
  hadoop.tmp.dir: /hdfs/tmp
  fs.defaultFS: http://localhost:9000
mapred-site.xml
  mapreduce.reduce.java.opts: -Xmx9174m
  mapreduce.map.memory.mb: 5734
  mapreduce.map.java.opts: -Xmx4587m
  mapreduce.framework.name: yarn
  mapreduce.reduce.memory.mb: 11468
---------------
-> writing to ./hdfs-site.xml ...
-> writing to ./yarn-site.xml ...
-> writing to ./core-site.xml ...
-> writing to ./mapred-site.xml ...

Further customization can be done using the object's constructor arguments.

SEE ALSO

hadoop.apache.org - The Hadoop documentation and authoritative source for Apache Hadoop and its components.

AUTHOR

Snehasis Sinha, <snehasis@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015 by Snehasis Sinha

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.