NAME
Net::Hadoop::Oozie - Interface to various Oozie REST endpoints and utility methods.
VERSION
version 0.116
DESCRIPTION
This module is a Perl interface to Oozie REST service endpoints. It also includes utility methods for bulk requests and some admin functionality.
SYNOPSIS
use Net::Hadoop::Oozie;
my $oozie = Net::Hadoop::Oozie->new( %options );
ACCESSORS
action
api_version
doas
filter
The submission format is filter_key1=filter_value1;filter_key2=...; but the filters are defined as a hash.
filter => {
status => ...,
}
The valid filters are listed below.
- name
The application name from the workflow/coordinator/bundle definition.
- user
The user that submitted the job.
- group
The group for the job.
- status
The status of the job.
You need to consider a certain behavior when using filters:
The query will do an AND among all the filter names.
The query will do an OR among all the filter values for the same name.
Multiple values must be specified as different name value pairs.
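As a rough illustration of the submission format described above, the sketch below serializes a filter hash into name=value pairs. The helper name filter_to_string is hypothetical and not part of this module, and treating an array reference as "several values for one name" is an assumption made only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Net::Hadoop::Oozie: serialize a filter
# hash into the "key1=value1;key2=value2" submission format.
sub filter_to_string {
    my ($filter) = @_;
    my @pairs;
    for my $name (sort keys %{$filter}) {
        my $value = $filter->{$name};
        # Assumption: an array reference stands for several values of one
        # name, each emitted as its own name=value pair (OR-ed by Oozie).
        my @values = ref($value) eq 'ARRAY' ? @{$value} : ($value);
        push @pairs, map { "$name=$_" } @values;
    }
    # Different names are AND-ed together by the query.
    return join ';', @pairs;
}

print filter_to_string({
    user   => 'alice',
    status => [qw(RUNNING SUSPENDED)],
}), "\n";
# prints "status=RUNNING;status=SUSPENDED;user=alice"
```

With this input the query would match jobs of user alice whose status is either RUNNING or SUSPENDED.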
jobtype
The Oozie documentation lists workflow, coordinator and bundle, but in CDH 4.4 the valid values are '', 'coordinators' and 'bundles'. The workflows and coordinators methods are helper functions that set these values behind the scenes.
len
Defaults to 50.
offset
Defaults to 1.
order
Defaults to asc; can be asc or desc. For instance, when used on a coordinator in a job call, desc puts the len most recent actions in the actions key, most recent first; the offset is then applied from the end of the list.
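To make the interplay of len, offset and order easier to picture, here is a small local model of the behavior described above. It is only a sketch: select_actions is a hypothetical function that runs on a plain array and does not talk to Oozie, and the 1-based offset is an assumption taken from the documented default of 1.

```perl
use strict;
use warnings;

# Hypothetical local model of len/offset/order; the real selection is
# performed by the Oozie server, not by this code.
sub select_actions {
    my ($actions, %opt) = @_;
    my $len    = $opt{len}    // 50;     # documented default
    my $offset = $opt{offset} // 1;      # documented default, assumed 1-based
    my $order  = $opt{order}  // 'asc';  # documented default

    # With desc, the list is walked from its end (most recent first).
    my @list = $order eq 'desc' ? reverse(@{$actions}) : @{$actions};

    my $start = $offset - 1;
    return [] if $start > $#list;
    my $end = $start + $len - 1;
    $end = $#list if $end > $#list;
    return [ @list[ $start .. $end ] ];
}

my @actions = (1 .. 10);    # action numbers, oldest first
print "@{ select_actions(\@actions, order => 'desc', len => 3) }\n";
print "@{ select_actions(\@actions, order => 'desc', len => 3, offset => 4) }\n";
```

Under these assumptions the first call selects the three most recent actions and the second call selects the three before them.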
show
METHODS
END POINTS
admin
build_version
coord_rerun
coordinators
job
jobs
kill
resume
submit_job
For details about job submission through REST, see https://oozie.apache.org/docs/4.2.0/WebServicesAPI.html#Job_Submission.
Required parameters are listed below.
oozie.wf.application.path
A path like /oozie_workflows/myworkflow; the workflow must already be deployed there.
appName
The name given to this specific instance; it can be anything you want.
Optional parameters are listed below.
- Auto variables
If you want a variable interpolated in your script (such as a date or an integer), pass it in the options you call the method with. If you pass
foo => 'bar',
inside the workflow you will be able to use it as ${foo}.
- Configuration properties
Parameters for Oozie itself (such as the queue name) need, as far as the author can tell, an extra level of handling. They can be set dynamically, but need a tweak in the workflow definition itself, in the top configuration section. For instance, to specify mapreduce.job.queuename in order to assign the tasks to a specific fair scheduler queue, declare it in the global configuration section, like this:
<property>
    <name>mapreduce.job.queuename</name>
    <value>${queueName}</value>
</property>
Then call "submit_job" adding this to the options hash:
queueName => "root.<queue name>"
This method returns a job ID which you can use directly to query the job status with the "job" method above. You can therefore launch a job from a script and have a loop query the job status at regular intervals (be nice, please) to check when it is done (untested code):
use strict;
use warnings;
use Net::Hadoop::Oozie;

my $oozie = Net::Hadoop::Oozie->new;
my $job_params = [
    { appName => 'job1', myParam => 'foo' },
    { appName => 'job2', myParam => 'bar' },
    # ...
];
my @ids;
for my $job (@$job_params) {
    my $jobid = $oozie->submit_job({
        myParam => $job->{myParam},
        debug   => 0, # set to 1 to print the job config and response
        appName => $job->{appName},
        'oozie.wf.application.path' => "/wf_base_path/<workflow name>/",
    });
    push @ids, $jobid;
}
while (my $jobid = shift @ids) {
    my $status = $oozie->job($jobid)->{status};
    if ($status =~ /(WAITING|READY|SUBMITTED|RUNNING)/) {
        push @ids, $jobid; # still in progress: put back in the queue
        sleep 10;          # or more, how about 60?
        next;              # do not fall through to the failure check
    }
    # what do you want to do if not succeeded?
    if ($status !~ /SUCCEEDED/) {
        die "job $jobid died";
    }
}
workflows
UTILITY METHODS
active_coordinators
active_job_paths
coordinators_on_the_same_path
coordinators_with_the_same_appname_on_the_same_path
Returns a hash consisting of duplicated application names for multiple coordinators. Having coordinators like this is usually a user error when submitting jobs.
my %offenders = $oozie->coordinators_with_the_same_appname_on_the_same_path;
failed_workflows_last_n_hours
my %options = ( # all keys are optional
parent_info => Bool, # default: 1
);
my $failed_arrayref = $oozie->failed_workflows_last_n_hours( $hours, $pattern, \%options );
failed_workflows_last_n_hours_pretty
my $string = $oozie->failed_workflows_last_n_hours_pretty( $hours );
failed_workflows_last_n_hours_paged
my %options = ( # all keys are optional
parent_info => Bool, # default: 1
page_size => Int, # default: 50
page_nr => Int, # default: 1
);
my(
$total_page_nr,
$page_nr,
$failed_arrayref,
) = $oozie->failed_workflows_last_n_hours_paged(
$hours,
$pattern,
\%options,
);
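The paged variant lends itself to a simple loop over all pages. The sketch below is one possible way to do that, relying only on the call signature and return order shown above; collect_failed_workflows is a hypothetical helper name, and stopping when the requested page number reaches the reported total is an assumption about how the totals are meant to be used.

```perl
use strict;
use warnings;

# Hypothetical helper: walk every page and gather all failed workflows
# into a single array reference.
sub collect_failed_workflows {
    my ($oozie, $hours, $pattern, $page_size) = @_;
    my @all;
    my $page_nr = 1;
    while (1) {
        my ($total_page_nr, $current_page_nr, $failed)
            = $oozie->failed_workflows_last_n_hours_paged(
                $hours,
                $pattern,
                { page_size => $page_size, page_nr => $page_nr },
            );
        push @all, @{ $failed || [] };
        # Assumption: pages are numbered from 1 up to $total_page_nr.
        last if !$total_page_nr || $page_nr >= $total_page_nr;
        $page_nr++;
    }
    return \@all;
}
```

A call would then look like collect_failed_workflows( $oozie, $hours, $pattern, 50 ), with $oozie being a Net::Hadoop::Oozie object.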
job_exists
This is a sugar interface on top of the "job" method. Normally the REST interface just dies with an HTTP 400 message on missing jobs. This method won't die and will return the data set if there is a proper response from the service. It will return false otherwise.
if ( my $job = $oozie->job_exists( $id ) ) {
# do something
}
else {
warn "No such job: $id";
}
kerberos_enabled
Returns true if Kerberos is enabled.
max_node_name_len
Returns the value of the MAX_NODE_NAME_LEN constant (hardcoded in the Oozie Java code) by probing the Oozie server version. This is the maximum length of an Oozie action name allowed in your workflow definitions. If longer action names are deployed and scheduled, the Oozie server will happily schedule a coordinator, but the individual workflow runs will throw exceptions and no part of the job will get executed. Also note that (if you didn't guess already) the Oozie validation function will validate and pass such names (unless you have a recent Oozie version which pushes the validation to the server side).
The relevant part in the Oozie source:
core/src/main/java/org/apache/oozie/util/ParamChecker.java
private static final int MAX_NODE_NAME_LEN = {Integer};
Currently there is no way to probe the value of this constant through the APIs, but it is possible to map a limit to certain Oozie versions.
Oozie version 4.3.0 and later sets the limit to 128, while anything older than that will have the value 50 (for the time being).
This method checks the Oozie server version and returns the relevant limit for that version.
See these Oozie Jira tickets for more information:
Checking the limit is especially important if you are deploying the Oozie jobs with custom code generators (instead of hand writing all of the XML) and this helper method will give you the ability to display meaningful exceptions to the users, instead of the obscure Oozie ones in the Oozie console.
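The version-to-limit mapping described above, and the kind of check a code generator could run before deployment, can be sketched as follows. All three function names are hypothetical and are not part of this module; the sketch also assumes a purely numeric dotted version string such as the one returned by oozie_version. In real code the limit would simply come from $oozie->max_node_name_len.

```perl
use strict;
use warnings;

# Hypothetical helper: compare dotted version strings field by field.
sub version_at_least {
    my ($have, $want) = @_;
    my @h = split /\./, $have;
    my @w = split /\./, $want;
    for my $i (0 .. 2) {
        my $cmp = ($h[$i] // 0) <=> ($w[$i] // 0);
        return $cmp > 0 if $cmp;
    }
    return 1;    # versions are equal
}

# Mapping taken from the text: 128 from Oozie 4.3.0 on, 50 before that.
sub node_name_limit {
    my ($oozie_version) = @_;
    return version_at_least($oozie_version, '4.3.0') ? 128 : 50;
}

# Return the action names that exceed the given limit.
sub overlong_action_names {
    my ($limit, @names) = @_;
    return grep { length($_) > $limit } @names;
}

my $limit = node_name_limit('4.1.0');
my @bad   = overlong_action_names($limit, 'load_data', 'x' x 60);
warn sprintf("Action name too long (%d > %d): %s\n", length($_), $limit, $_)
    for @bad;
```

Failing early with a message like this is usually easier for users to act on than the exception raised later by the workflow run.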
oozie_version
Just a sugar interface on top of build_version, trying to return the actual numerical Oozie version without the build string.
my $oozie_version = $oozie->oozie_version;
# Something like "4.1.0"
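For illustration only, one plausible way to separate a numeric version from a build string is a leading-digits match, as sketched below. This is not necessarily how the module does it: numeric_version is a hypothetical helper and the sample build string is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the leading dotted numeric part of a
# build string.
sub numeric_version {
    my ($build_version) = @_;
    my ($numeric) = $build_version =~ /^(\d+(?:\.\d+)*)/;
    return $numeric;    # undef when the string does not start with a number
}

# The sample input below is an assumption, not output captured from a server.
print numeric_version('4.1.0-cdh5.13.0'), "\n";
```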
standalone_active_workflows
Returns an arrayref of standalone workflows (as in jobs not attached to a coordinator):
my $wfs_without_a_coordinator = $oozie->standalone_active_workflows;
foreach my $wf ( @{ $wfs_without_a_coordinator } ) {
# do something
}
suspended_coordinators
Returns an arrayref of suspended coordinators:
my $suspended = $oozie->suspended_coordinators;
foreach my $coord ( @{ $suspended } ) {
# do something
}
suspended_workflows
Returns an arrayref of suspended workflows:
my $suspended = $oozie->suspended_workflows;
foreach my $wf ( @{ $suspended } ) {
# do something
}