Algorithm::DecisionTree - A pure-Perl implementation for constructing a decision tree from multidimensional training data and for using the decision tree thus induced for classifying data.
SYNOPSIS
# FOR CONSTRUCTING A DECISION TREE AND FOR CLASSIFYING A SAMPLE:
my $training_datafile = "training.dat";
my $dt = Algorithm::DecisionTree->new(
training_datafile => $training_datafile,
debug1 => 1,
);
$dt->get_training_data();
$dt->show_training_data();
my $root_node = $dt->construct_decision_tree_classifier();
$root_node->display_decision_tree(" ");
my @test_sample = qw /exercising=>never
smoking=>heavy
fatIntake=>heavy
videoAddiction=>heavy /;
$dt->classify($root_node, @test_sample);
# For the above calls to work, the format in which the training data is made
# available to the decision-tree constructor new() must meet certain
# assumptions. (See the training.dat file in the examples directory.) The
# training datafile must declare the class names, the feature names and the
# names of the different possible values for the features. The rest of the
# training datafile is expected to contain the training samples in the form of
# a multi-column table.
# HIGHLY RECOMMENDED: Always use the "debug1" option in the constructor call
# above when working with a training dataset for the first time.
# If your training file specifies a large number of features or a large
# number of values for the features, the above constructor call could result
# in a decision tree that is simply much too large (and much too slow to
# construct). For such cases, consider using following additional options in
# the constructor call shown above:
my $dt = Algorithm::DecisionTree->new(
training_datafile => $training_datafile,
max_depth_desired => some_number,
entropy_threshold => some_value,
debug1 => 1,
);
# where for max_depth_desired you should choose a number that is less than the
# number of features in your training file. This will set the depth of your
# decision tree to max_depth_desired. The algorithm will automatically use the
# BEST max_depth_desired features --- best in the sense of being the most
# discriminative --- for constructing the decision tree. The parameter
# entropy_threshold sets the granularity with which the entropies will be
# sampled. Its default value is 0.001. The larger the value you choose for
# entropy_threshold, the smaller the tree.
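# For example (the numbers below are purely illustrative --- choose values
# that suit your own training data):
my $dt = Algorithm::DecisionTree->new(
training_datafile => $training_datafile,
max_depth_desired => 2,
entropy_threshold => 0.01,
debug1 => 1,
);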
# FOR GENERATING TRAINING DATA:
use Algorithm::DecisionTree;
my $parameter_file = "param.txt";
my $output_data_file = "training.dat";
my $training_data_gen = Algorithm::DecisionTree->training_data_generator(
output_datafile => $output_data_file,
parameter_file => $parameter_file,
number_of_training_samples => 35,
);
$training_data_gen->read_parameter_file();
$training_data_gen->gen_training_data();
$training_data_gen->write_training_data_to_file();
# For the above calls to work, the parameter file must obey certain
# assumptions. (See the param.txt file in the examples directory.) The
# parameter file must declare the class names, the class priors, the feature
# names and the different possible values for the features. The parameter
# file is also expected to present information on how you want the data
# vectors to be biased for the different classes.
# FOR GENERATING TEST DATA:
use Algorithm::DecisionTree;
my $parameter_file = "param.txt";
my $output_test_datafile = "testdata.dat";
my $output_class_label_file = "test_data_class_labels.dat";
my $test_data_gen = Algorithm::DecisionTree->test_data_generator(
output_test_datafile => $output_test_datafile,
output_class_label_file => $output_class_label_file,
parameter_file => $parameter_file,
write_to_file => 1,
number_of_test_samples => 10,
debug1 => 1,
);
$test_data_gen->read_parameter_file();
$test_data_gen->gen_test_data();
$test_data_gen->write_test_data_to_file();
# The test data is deposited without the class labels in the file named for
# the parameter output_test_datafile. The class labels are deposited in a
# separate file named for the parameter output_class_label_file. The class
# names, the feature names, the feature values, and the probabilistic bias
# used for the test data are according to the information placed in the
# parameter file.
CHANGES
Some of the key elements of the documentation were cleaned up and made more readable in Version 1.41. The implementation code remains unchanged from Version 1.4.
Version 1.4 should make things faster (and easier) for folks who want to use this module with training data that creates very large decision trees (that is, trees with tens of thousands or more decision nodes). The speedup in Version 1.4 has been achieved by eliminating duplicate calculation of probabilities as the tree grows. In addition, this version provides an additional constructor parameter, max_depth_desired, for controlling the size of the decision tree. This is in addition to the tree size control achieved by the parameter entropy_threshold that was introduced in Version 1.3. Since large decision trees can take a long time to create, you may find yourself wishing you could store the tree you just created in a disk file and that, subsequently, you could use the stored tree for classification work. The examples directory contains two scripts, store_dt_on_disk.pl and classify_from_disk_stored_dt.pl, that show how you can do exactly that with the help of Perl's Storable module.
Version 1.3 addresses the issue that arises when the header of a training datafile declares a certain possible value for a feature but that (feature,value) pair does NOT show up anywhere in the training data. Version 1.3 also makes it possible for a user to control the size of the decision tree by changing the value of the parameter entropy_threshold.
Additionally, Version 1.3 includes a method called determine_data_condition() that displays useful information regarding the size and some other attributes of the training data. It also warns the user if the training data might result in a decision tree that would simply be much too large --- unless the user controls the size with the entropy_threshold parameter.
In addition to the removal of a couple of serious bugs, Version 1.2 incorporates a number of enhancements: (1) Version 1.2 includes checks on the names of the features and values used in test data --- this is the data you want to classify with the decision tree classifier constructed by this module. (2) Version 1.2 includes a separate constructor for generating test data. To make it easier to generate test data whose probabilistic parameters may not be identical to those used for the training data, I have used separate routines for generating the test data. (3) Version 1.2 also includes in its examples directory a script that classifies the test data in a file and outputs the class labels into another file. This is for folks who do not wish to write their own scripts using this module. (4) Version 1.2 also includes additions to the documentation regarding the issue of numeric values for features.
DESCRIPTION
Algorithm::DecisionTree is a perl5 module for constructing a decision tree from a training datafile containing multidimensional data. In one form or another, decision trees have been around for about fifty years. But their popularity during the last decade is owing to the entropy-based method proposed by Ross Quinlan for their construction. Fundamental to Quinlan's approach is the notion that a decision node in a tree should be split only if the entropy at the ensuing child nodes will be less than the entropy at the node in question. The implementation presented here is based on the same idea.
For those not familiar with decision tree ideas, the traditional way to classify multidimensional data is to start with a feature space whose dimensionality is the same as that of the data. Each feature in this space would correspond to the attribute that each dimension of the data measures. You then use the training data to carve up the feature space into different regions, each corresponding to a different class. Subsequently, when you are trying to classify a new data sample, you locate it in the feature space and find the class label of the region to which it belongs. One can also give the data point the same class label as that of the nearest training sample. (This is referred to as the nearest neighbor classification.)
A decision tree classifier works differently. When you construct a decision tree, you select for the root node a feature test that can be expected to maximally disambiguate the class labels that could be associated with the data you are trying to classify. You then attach to the root node a set of child nodes, one for each value of the feature you chose at the root node. Now at each child node you pose the same question that you posed when you found the best feature to use at the root node: What feature at the child node in question would maximally disambiguate the class labels to be associated with a given data vector, assuming that the data vector passed the root node on the branch that corresponds to the child node in question? The feature that is best at each node is the one that causes the maximal reduction in class entropy at that node.
As the reader would expect, the two key steps in any approach to decision-tree based classification are the construction of the decision tree itself from a file containing the training data, and then using the decision tree thus obtained for classifying the data.
In addition to the above two key steps, the implementation presented here also allows you to generate your own training data. Generating your own training data, using it for constructing a decision-tree classifier and subsequently testing the classifier on a test set of data is a good way to develop greater proficiency with decision trees.
What is cool about decision tree classification is that it gives you soft classification, meaning it may associate more than one class label with a given data vector. When this happens, it may mean that your classes are indeed overlapping in the underlying feature space. It could also mean that you simply have not supplied sufficient training data to the decision tree classifier. For a tutorial introduction to how a decision tree is constructed and used, visit http://cobweb.ecn.purdue.edu/~kak/DecisionTreeClassifiers.pdf
WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE
Consider the following scenario: Let's say you are running a small investment company that employs a team of stockbrokers who make buy/sell decisions for the customers of your company. Assume that your company has asked the traders to make each investment decision on the basis of the following four criteria:
price_to_earnings_ratio (P_to_E)
price_to_sales_ratio (P_to_S)
return_on_equity (R_on_E)
market_share (MS)
Since you are the boss, you keep track of the buy/sell decisions made by the individual traders. But one unfortunate day, all of your traders decide to quit because you did not pay them enough. So what do you do? If you had a module like the one here, you could still run your company and do so in such a way that, on the average, would do better than any of the individual traders who worked for your company. This is what you do: You pool together the individual trader buy/sell decisions you have accumulated during the last one year. This pooled information is likely to look like:
example      buy/sell     P_to_E     P_to_S     R_on_E     MS
==============================================================
example_1    buy          high       low        medium     low
example_2    buy          medium     medium     low        low
example_3    sell         low        medium     low        high
....
....
This data would constitute your training file. You could feed this file into the module by calling:
my $dt = Algorithm::DecisionTree->new(
training_datafile => $training_datafile,
);
$dt->get_training_data();
and then construct a decision tree by calling:
my $root_node = $dt->construct_decision_tree_classifier();
Now you and your company (with practically no employees) are ready to service the customers again. Suppose your computer needs to make a buy/sell decision about an investment prospect that is best described by:
price_to_earnings_ratio => low
price_to_sales_ratio => very_low
return_on_equity => none
market_share => medium
All that your computer would need to do would be to construct a data vector like
my @data = qw / P_to_E=>low
P_to_S=>very_low
R_on_E=>none
MS=>medium /;
and call the decision tree classifier you just constructed by
$dt->classify($root_node, @data);
The answer returned will be 'buy' and 'sell', along with the associated probabilities. So if the probability of 'buy' is considerably greater than the probability of 'sell', that's what you should instruct your computer to do.
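For instance, your script could compare the two returned probabilities along the following lines (a minimal sketch; it assumes that the class names declared in your training file are exactly 'buy' and 'sell'):
my $classification = $dt->classify($root_node, @data);
# classify() returns a reference to a hash that maps each class label
# to its probability:
my $decision = $classification->{'buy'} >= $classification->{'sell'}
               ? 'buy' : 'sell';
print "Recommended action: $decision\n";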
The chances are that, on the average, this approach would beat the performance of any of your individual traders who worked for you previously since the buy/sell decisions made by the computer would be based on the collective wisdom of all your previous traders. DISCLAIMER: There is obviously a lot more to good investing than what is captured by the silly little example here. However, it does convey the sense in which the current module could be used.
WHAT HAPPENS IF THE NUMBER OF FEATURES AND/OR VALUES IS LARGE
If n is the number of features and m the largest number of possible values for any of the features, then, in the worst case, the algorithm would want to construct m**n nodes. In other words, in the worst case, the size of the decision tree grows exponentially as you increase either the number of features or the number of possible values for the features. That is the bad news. The good news is that you have two constructor parameters at your disposal for controlling the size of the decision tree: The parameter max_depth_desired controls the depth of the constructed tree from its root node, and the parameter entropy_threshold controls the granularity with which the entropy space will be sampled. The smaller the max_depth_desired and the larger the entropy_threshold, the smaller the size of the decision tree. The default value for max_depth_desired is the number of features specified in the training datafile, and the default value for entropy_threshold is 0.001.
The users of this module with a need to create very large decision trees should also consider storing the tree once constructed in a diskfile and then using the stored tree for classification work. The scripts store_dt_on_disk.pl and classify_from_disk_stored_dt.pl in the examples directory show you how you can do that with the help of Perl's Storable module.
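As a rough sketch of the idea (the file name below and the exact details of what gets stored are only illustrative --- the two example scripts are the definitive reference), Perl's Storable module lets you freeze the tree to disk and thaw it later:
use Storable qw(store retrieve);
# After constructing the tree in one script ...
store($root_node, 'stored_decision_tree.dat');
# ... and then, possibly in another script, retrieve it and classify with it.
# Note that classify() is still called on a DecisionTree instance that has
# digested the same training data.
my $stored_root_node = retrieve('stored_decision_tree.dat');
my $classification = $dt->classify($stored_root_node, @test_sample);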
Also note that it is always a good idea to keep the debug1 option set anytime you are experimenting with a new datafile --- especially if your training datafile is likely to create an inordinately large decision tree.
WHAT HAPPENS WHEN THE FEATURE VALUES ARE NUMERIC
The current module will treat a numeric value for a feature as just a string. In that sense, there is no difference between a string value for a feature and a numeric value. This would obviously make the module unsuitable for applications in which a feature may take on a numeric value from a very large set of such values and you want feature values to be compared using numeric comparison predicates as opposed to string comparison predicates. (Consider, for example, using color as an object feature in a computer vision application.) The decision trees for applications in which the feature values are primarily numerical in nature are constructed differently, as explained in the tutorial at http://cobweb.ecn.purdue.edu/~kak/DecisionTreeClassifiers.pdf
METHODS
The module provides the following methods for decision-tree induction from training data in a diskfile, for data classification with the decision tree, and for generating your own training data:
- new():
my $dt = Algorithm::DecisionTree->new(
training_datafile => $training_datafile,
);
A call to new() constructs a new instance of the Algorithm::DecisionTree class. For this call to make sense, the training data in the training datafile must be according to a certain format that is shown below. (Also see the file training.dat in the examples directory.)
- get_training_data():
After you have constructed a new instance of the Algorithm::DecisionTree class, you must now read in the training data that is contained in the file named above. This you do by:
$dt->get_training_data();
IMPORTANT: The training data file must be in a format that makes sense to the decision tree constructor. The information in this file should look like
Class names: malignant benign
Feature names and their values:
videoAddiction => none low medium heavy
exercising => never occasionally regularly
smoking => heavy medium light never
fatIntake => low medium heavy
Training Data:
sample       class        videoAddiction   exercising     smoking    fatIntake
==========================================================================
sample_0     benign       medium           occasionally   heavy      low
sample_1     malignant    none             occasionally   heavy      medium
sample_2     benign       low              occasionally   light      heavy
sample_3     malignant    medium           occasionally   heavy      heavy
....
....
IMPORTANT: Note that the class names, the number of classes, the feature names, and the possible values for the features can be anything that your data requires them to be. The training data file that is generated by the data generation part of the module will be in the format shown above. More on that later.
- show_training_data():
If you wish to see the training data that was just digested by the module, call
$dt->show_training_data();
- construct_decision_tree_classifier():
After the training data is digested, it is time to construct a decision tree classifier. This you do by
my $root_node = $dt->construct_decision_tree_classifier();
This call returns an instance of type Node. The Node class is defined within the main package file, at its end. So don't forget that $root_node in the above example call will be instantiated to an instance of type Node.
- $root_node->display_decision_tree(" "):
$root_node->display_decision_tree(" ");
This will display the decision tree in your terminal window by using a recursively determined offset for each node as the display routine descends down the tree.
I have intentionally left the syntax fragment $root_node in the above call to remind the reader that display_decision_tree() is NOT called on the instance of the DecisionTree we constructed earlier, but on the Node instance returned by the call to construct_decision_tree_classifier().
- classify($root_node, @test_sample):
my @test_sample = qw /exercising=>never
smoking=>heavy
fatIntake=>heavy
videoAddiction=>heavy /;
my $classification = $dt->classify($root_node, @test_sample);
where, again, $root_node is an instance of type Node returned by the call to construct_decision_tree_classifier(). The variable $classification holds a reference to a hash whose keys are the class labels and whose values are the associated probabilities.
- determine_data_condition():
This method, automatically invoked when the debug1 option is set in the call to the decision-tree constructor, displays useful information regarding your training data file. This method also warns you if you are trying to construct a decision tree that may simply be much too large.
$dt->determine_data_condition();
- training_data_generator():
The training data generator is created by using its own constructor:
my $parameter_file = "param2.txt";
my $output_data_file = "training.dat";
my $training_data_gen = Algorithm::DecisionTree->training_data_generator(
output_datafile => $output_data_file,
parameter_file => $parameter_file,
number_of_training_samples => 35,
);
- $training_data_gen->read_parameter_file():
After you have constructed an instance of the training data generator, you need to ask it to read the parameter file:
$training_data_gen->read_parameter_file();
The parameter file is expected to be in the following format:
# comment lines begin with the hash mark
class names: malignant benign
class priors: 0.4 0.6
feature: smoking
values: heavy medium light never
feature: exercising
values: never occasionally regularly
feature: fatIntake
values: low medium heavy
feature: videoAddiction
values: none low medium heavy
bias:  class: malignant
       smoking: heavy=0.7
       exercising: never=0.7
       fatIntake: heavy=0.5
       videoAddiction:
bias:  class: benign
       smoking: heavy=0.1
       exercising: regularly=0.7
       fatIntake: low=0.6
       videoAddiction:
See the parameter file param.txt in the examples directory. Initially, it might be best to modify that file to suit your needs.
IMPORTANT: You can use any names for the classes, can have any number of classes, can use any names for the features and their values.
Also note the important role played by the biasing information. Without the biasing, your training data will be uniformly distributed with respect to all of the feature values and you will only get ambiguous classifications from the resulting decision tree classifier. The biasing allows you to express a higher or lower probability that a particular feature value should have for a given class. The probability weight that is unused for each feature is distributed uniformly amongst the remaining feature values. I did experiment with the idea of assigning probability weights to multiple (or even all) of the values for a given feature --- it does not add to the educational value you derive from the resulting training data.
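As a worked illustration of that uniform-distribution rule, applied to the bias values for the class malignant in the parameter file shown above:
# smoking:    heavy=0.7  -> the remaining 0.3 is split equally among
#                           medium, light and never (0.1 each)
# exercising: never=0.7  -> occasionally and regularly get 0.15 each
# fatIntake:  heavy=0.5  -> low and medium get 0.25 each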
NOTE: if you do NOT express a bias for a feature (as is the case with the feature videoAddiction above), equal weight is given to all its values.
- $training_data_gen->gen_training_data():
This call generates the training data from your parameter file:
$training_data_gen->gen_training_data();
- $training_data_gen->write_training_data_to_file():
To write out the training data to a disk file:
$training_data_gen->write_training_data_to_file();
This call will also display the training data in your terminal window if the debug1 option is set in the training data generator constructor.
- test_data_generator():
The test data generator is created by using its own constructor:
my $parameter_file = "param.txt"; my $output_test_datafile = "testdata1.dat"; my $output_class_label_file = "test_data_class_labels.dat"; my $test_data_gen = Algorithm::DecisionTree->test_data_generator( output_test_datafile => $output_test_datafile, output_class_label_file => $output_class_label_file, parameter_file => $parameter_file, write_to_file => 1, number_of_test_samples => 10, );
- $test_data_gen->read_parameter_file():
After you have constructed an instance of the test data generator, you need to ask it to read the parameter file.
$test_data_gen->read_parameter_file();
The parameter file named in the call to the test-data generator constructor must possess the same structure as the one used for generating the training data. In most cases, you would want to use the same parameter file both for generating the training data and the test data.
- $test_data_gen->gen_test_data():
This call generates the test data from your parameter file:
$test_data_gen->gen_test_data();
- $test_data_gen->write_test_data_to_file():
To write out the test data to a disk file:
$test_data_gen->write_test_data_to_file();
This call will also display the test data in your terminal window if the debug1 option is set in the test data generator constructor.
HOW THE CLASSIFICATION RESULTS ARE DISPLAYED
It depends on whether you apply the classifier at once to all the data samples in a file, or whether you feed one data sample at a time into the classifier.
For large test datasets, you would obviously want to process an entire file of test data at a time. The best way to do this is to follow my script classify_test_data_in_a_file.pl in the examples directory. This script requires three command-line arguments: the first argument names the training datafile, the second the test datafile, and the third the file in which the classification results will be deposited.
You can also invoke the classifier on one data sample at a time. A call such as
my @test_sample = qw /exercising=>never
smoking=>heavy
fatIntake=>heavy
videoAddiction=>heavy /;
my $classification = $dt->classify($root_node, @test_sample);
print "The classification:\n";
foreach my $class ($dt->get_class_names()) {
print " $class with probability $classification->{$class}\n";
}
will print out the classification results in the following form:
The classification:
malignant with probability 0.744186046511628
benign with probability 0.255813953488372
Note again the soft classification returned. That is, if the probability distributions for the different classes overlap in the underlying feature space, the classifier will return all of the applicable class labels for a data vector along with the corresponding class probabilities. Another reason why the decision tree classifier may associate significant probabilities with multiple class labels is that you used an inadequate number of training samples to induce the decision tree. The good thing is that the classifier does not lie to you (unlike, say, a hard classification rule that would correspond to a partitioning of the underlying feature space). The decision tree classifier gives you the best classification that can be made given the training data you fed into it.
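If you need to reduce this soft classification to a single hard decision, you can rank the returned labels by their probabilities yourself. A minimal sketch (it simply post-processes the $classification hash reference obtained above and is not a separate facility of the module):
my @ranked = sort { $classification->{$b} <=> $classification->{$a} }
             keys %{$classification};
my $best = $ranked[0];
print "Most probable class: $best (probability $classification->{$best})\n";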
EXAMPLES
See the examples directory in the distribution for how to generate the training data, how to induce a decision tree, and how to then classify new data using the decision tree.
To become more familiar with the module, run the script training_data_generator.pl to generate a training datafile according to the information placed in the file param.txt. Then run the script construct_dt_and_classify_one_sample.pl to classify a new data sample that is in the script. Next generate a test dataset by calling generate_test_data.pl. This will deposit the test data in a file. You can invoke the classifier on this file by an invocation like
classify_test_data_in_a_file.pl training.dat testdata.dat out.txt
This call will first construct a decision tree using the training data in the file training.dat. It will then calculate the class label for each data vector in the file testdata.dat. The estimated class labels will be written out to the file out.txt.
The examples directory also contains the script store_dt_on_disk.pl that shows how you can use Perl's Storable module to store a decision tree in a disk file. The file classify_from_disk_stored_dt.pl in the same directory shows how you can classify new data vectors with the stored decision tree. This is expected to be extremely useful for situations that involve tens of thousands or millions of decision nodes.
EXPORT
None by design.
BUGS
Please notify the author if you encounter any bugs. When sending email, please place the string 'DecisionTree' in the subject line.
INSTALLATION
The usual
perl Makefile.PL
make
make test
make install
if you have root access. If not,
perl Makefile.PL prefix=/some/other/directory/
make
make test
make install
AUTHOR
Avinash Kak, kak@purdue.edu
If you send email, please place the string "DecisionTree" in your subject line to get past my spam filter.
COPYRIGHT
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Copyright 2011 Avinash Kak