NAME

Algorithm::DecisionTree - A pure-Perl implementation for constructing a decision tree from multidimensional training data and for using the decision tree thus induced for classifying data.

SYNOPSIS

  # FOR CONSTRUCTING A DECISION TREE AND FOR CLASSIFYING A SAMPLE:

      my $training_datafile = "training.dat";
      my $dt = Algorithm::DecisionTree->new( 
                               training_datafile => $training_datafile,
      );
      $dt->get_training_data();
      $dt->show_training_data();
      my $root_node = $dt->construct_decision_tree_classifier();
      $root_node->display_decision_tree("   ");
      my @test_sample = qw /exercising=>never 
                            smoking=>heavy 
                            fatIntake=>heavy 
                            videoAddiction=>heavy /;
      $dt->classify($root_node, @test_sample);

  # For the above calls to work, the format in which the training data is made
  # available to the decision-tree constructor new() must meet certain 
  # assumptions.  (See the training.dat file in the examples directory.) The
  # training datafile must declare the class names, the feature names and 
  # the names of the different possible values for the features.  The rest of
  # the training datafile is expected to contain the training samples in the 
  # form of a multi-column table.

  # FOR GENERATING TRAINING DATA:

      use Algorithm::DecisionTree;
      my $parameter_file = "param.txt";
      my $output_data_file = "training.dat";
      my $data_gen = Algorithm::DecisionTree->training_data_generator(
                                  output_datafile            => $output_data_file,
                                  parameter_file             => $parameter_file,
                                  number_of_training_samples => 35,
      );
      $data_gen->read_parameter_file();
      $data_gen->gen_training_data();
      $data_gen->write_training_data_to_file(); 

  # For the above calls to work, the parameter file must obey certain 
  # assumptions.  (See the param.txt file in the examples directory.) The
  # parameter file must declare the class names, the class priors, the 
  # feature names and the different possible values for the features.
  # The parameter file is also expected to present information on how
  # you want the data vectors to be biased for the different classes.

DESCRIPTION

Algorithm::DecisionTree is a perl5 module for constructing a decision tree from a training datafile containing multidimensional data. In one form or another, decision trees have been around for about fifty years. But their popularity during the last decade owes much to the entropy-based method proposed by Ross Quinlan for their construction. Fundamental to Quinlan's approach is the notion that a decision node in a tree should be split only if the entropy at the ensuing child nodes will be less than the entropy at the node in question. The implementation presented here is based on the same idea.
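
To make the entropy criterion concrete, here is a minimal sketch, not the module's own code, of how the Shannon entropy of the class distribution at a node can be computed from the counts of the training samples of each class that reach the node. A node is then worth splitting only if the weighted average of this quantity over the prospective child nodes is lower than the entropy at the node itself.

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Shannon entropy (in bits) of a class distribution, given a hash
    # ref mapping class name => count of training samples at the node.
    sub class_entropy {
        my ($counts) = @_;
        my $total = sum(values %$counts) or return 0;
        my $entropy = 0;
        for my $n (grep { $_ > 0 } values %$counts) {
            my $p = $n / $total;
            $entropy -= $p * log($p) / log(2);
        }
        return $entropy;
    }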

For those not familiar with decision tree ideas, the traditional way to classify multidimensional data is to start with a feature space whose dimensionality is the same as that of the data. Each feature in this space would correspond to the attribute that each dimension of the data measures. You then use the training data to carve up the feature space into different regions, each corresponding to a different class. Subsequently, when you are trying to classify a new data sample, you locate it in the feature space and find the class label of the region to which it belongs. One can also give the data point the same class label as that of the nearest training sample. (This is referred to as the nearest neighbor classification.)
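
For the categorical features used in this module, a natural distance for such a nearest-neighbor rule is the Hamming distance, i.e. the number of features on which two samples disagree. A hypothetical sketch, since the module itself does no nearest-neighbor classification:

    use strict;
    use warnings;

    # Hypothetical nearest-neighbor classifier over categorical features,
    # using Hamming distance (the count of mismatched feature values).
    # Each training sample is a hash ref with a 'class' key plus one
    # key per feature; the test sample has only the feature keys.
    sub nearest_neighbor_label {
        my ($sample, @training) = @_;
        my ($best_label, $best_dist);
        for my $t (@training) {
            my $d = grep { $sample->{$_} ne $t->{$_} } keys %$sample;
            if (!defined $best_dist or $d < $best_dist) {
                ($best_label, $best_dist) = ($t->{class}, $d);
            }
        }
        return $best_label;
    }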

A decision tree classifier works differently. When you construct a decision tree, you select for the root node a feature test that can be expected to maximally disambiguate the class labels that could be associated with the data you are trying to classify. You then attach to the root node a set of child nodes, one for each value of the feature you chose at the root node. Now at each child node you pose the same question that you posed when you found the best feature to use at the root node: What feature at the child node in question would maximally disambiguate the class labels to be associated with a given data vector, assuming that the data vector passed the root node on the branch that corresponds to the child node in question? The feature that is best at each node is the one that causes the maximal reduction in class entropy at that node.
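
The selection of the best feature at a node can be sketched as follows. This illustration reuses the hypothetical class_entropy() helper shown earlier and assumes the training samples are hash refs keyed by feature name; it is not the module's actual implementation.

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Pick the feature whose split yields the lowest weighted average
    # entropy over the resulting child nodes, i.e. the maximal
    # reduction in class entropy.  Each sample is a hash ref with a
    # 'class' key plus one key per feature (hypothetical layout).
    sub best_feature {
        my ($samples, @features) = @_;
        my ($best, $lowest);
        for my $feature (@features) {
            my %partition;              # feature value => class counts
            $partition{ $_->{$feature} }{ $_->{class} }++ for @$samples;
            my $avg = 0;
            for my $counts (values %partition) {
                my $n = sum values %$counts;
                $avg += ($n / @$samples) * class_entropy($counts);
            }
            ($best, $lowest) = ($feature, $avg)
                if !defined $lowest or $avg < $lowest;
        }
        return $best;
    }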

As the reader would expect, the two key steps in any approach to decision-tree based classification are the construction of the decision tree itself from a file containing the training data, and then using the decision tree thus obtained for classifying the data.

In addition to the above two key steps, the implementation presented here also allows you to generate your own training data. Generating your own training data, using it for constructing a decision-tree classifier and subsequently testing the classifier on a test set of data is a good way to develop greater proficiency with decision trees.

What is cool about decision tree classification is that it gives you soft classification, meaning it may associate more than one class label with a given data vector. When this happens, it may mean that your classes are indeed overlapping in the underlying feature space. It could also mean that you simply have not supplied sufficient training data to the decision tree classifier.

METHODS

The module provides the following methods for decision-tree induction from training data in a diskfile, for data classification with the decision tree, and for generating your own training data:

new()
    my $dt = Algorithm::DecisionTree->new( 
                                          training_datafile => $training_datafile,
                                         );

A call to new() constructs a new instance of the Algorithm::DecisionTree class. For this call to make sense, the training data in the training datafile must conform to the format shown below. (Also see the file training.dat in the examples directory.)

get_training_data()

After you have constructed a new instance of the Algorithm::DecisionTree class, you must now read in the training data that is contained in the file named above. This you do by:

    $dt->get_training_data(); 

IMPORTANT: The training data file must be in a format that makes sense to the decision tree constructor. The information in this file should look like:

    Class names: malignant benign

    Feature names and their values:
        videoAddiction => none low medium heavy
        exercising => never occasionally regularly
        smoking => heavy medium light never
        fatIntake => low medium heavy


    Training Data:

    sample     class      videoAddiction   exercising    smoking   fatIntake
    ==========================================================================

    sample_0   benign     medium           occasionally  heavy     low
    sample_1   malignant  none             occasionally  heavy     medium
    sample_2   benign     low              occasionally  light     heavy
    sample_3   malignant  medium           occasionally  heavy     heavy
    ....
    ....

IMPORTANT: Note that the class names, the number of classes, the feature names, and the possible values for the features can be anything that your data requires them to be. The training data file that is generated by the data generation part of the module will be in the format shown above. More on that later.
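
For illustration only, the header portion of such a file could be parsed along the following lines. The module does its own parsing inside get_training_data(); the file name and variable names here are assumptions.

    use strict;
    use warnings;

    # Read class names and feature values from the header of a training
    # datafile laid out as shown above.  Illustration only.
    open my $fh, '<', 'training.dat' or die "Cannot open training.dat: $!";
    my (@class_names, %values_for_feature);
    while (my $line = <$fh>) {
        chomp $line;
        last if $line =~ /^\s*Training Data:/;   # the sample table follows
        if ($line =~ /^\s*Class names:\s*(.+)/) {
            @class_names = split ' ', $1;
        } elsif ($line =~ /^\s*(\w+)\s*=>\s*(.+)/) {
            $values_for_feature{$1} = [ split ' ', $2 ];
        }
    }
    close $fh;
    print "classes: @class_names\n";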

show_training_data()

If you wish to see the training data that was just digested by the module, call

    $dt->show_training_data();

construct_decision_tree_classifier()

After the training data is digested, it is time to construct a decision tree classifier. This you do by

    my $root_node = $dt->construct_decision_tree_classifier();

This call returns an instance of type Node. The Node class is defined within the main package file, at its end. So don't forget that $root_node in the above example will hold an instance of type Node.
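
A quick sanity check on what you got back, assuming the class is literally named Node as stated above:

    # ref() reports the package name the object is blessed into.
    print ref($root_node), "\n";     # expected to print "Node"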

display_decision_tree(" ")
    $root_node->display_decision_tree("   ");

This will display the decision tree in your terminal window by using a recursively determined offset for each node as the display routine descends down the tree.

I have intentionally left the syntax fragment $root_node in the above call to remind the reader that display_decision_tree() is NOT called on the instance of the DecisionTree we constructed earlier, but on the Node instance returned by the call to construct_decision_tree_classifier().
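
Schematically, the recursion looks like the following sketch. The accessor names used here are hypothetical; the module's actual Node interface may differ.

    # Schematic recursive display: each level of the tree gets one more
    # copy of the offset string prepended to its line.
    sub display_tree {
        my ($node, $offset) = @_;
        print $offset, $node->get_feature(), "\n";
        display_tree($_, $offset . "   ") for $node->get_children();
    }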

classify($root_node, @test_sample)
    my @test_sample = qw /exercising=>never 
                          smoking=>heavy 
                          fatIntake=>heavy 
                          videoAddiction=>heavy /;
    $dt->classify($root_node, @test_sample);

where, again, $root_node is an instance of type Node returned by the call to construct_decision_tree_classifier().

training_data_generator()

The training data generator is created by using its own constructor:

    my $parameter_file = "param.txt";
    my $output_data_file = "training.dat";
    my $data_generator = Algorithm::DecisionTree->training_data_generator(
                              output_datafile            => $output_data_file,
                              parameter_file             => $parameter_file,
                              number_of_training_samples => 35,
    );

read_parameter_file()

After you have constructed an instance of the data generator, you need to ask it to read the parameter file:

    $data_generator->read_parameter_file();

The parameter file is expected to be in the following format:

    # comment lines begin with the hash mark

    class names:  malignant benign
    class priors: 0.4 0.6

    feature: smoking
    values: heavy medium light never

    feature: exercising
    values: never occasionally regularly

    feature: fatIntake
    values: low medium heavy

    feature: videoAddiction
    values:  none low medium heavy


    bias:  class: malignant 

          smoking:    heavy=0.7
          exercising: never=0.7 
          fatIntake:  heavy=0.5
          videoAddiction: 

    bias:  class: benign

          smoking:     heavy=0.1
          exercising:  regularly=0.7
          fatIntake:   low=0.6
          videoAddiction: 

See the parameter file param.txt in the examples directory. Initially, it might be best to modify that file to suit your needs.

IMPORTANT: You can use any names for the classes, can have any number of classes, can use any names for the features and their values.

Also note the important role played by the biasing information. Without the biasing, your training data will be uniformly distributed with respect to all of the feature values and you will only get ambiguous classifications from the resulting decision tree classifier. The biasing allows you to express a higher or lower probability that a particular feature value should have for a given class. The probability weight that is left unused for each feature is distributed uniformly amongst the remaining feature values. I did experiment with the idea of assigning probability weights to multiple (or even all) of the values of a given feature, but it does not add to the educational value you derive from the resulting training data.

NOTE: If you do NOT give a bias for a feature (as is the case with the feature 'videoAddiction' above), equal weight is given to all its values.
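
As a concrete example of this arithmetic, take the 'smoking: heavy=0.7' bias for the malignant class shown above. The leftover probability mass of 0.3 is split evenly over the three remaining values of that feature:

    use strict;
    use warnings;

    my @values = qw(heavy medium light never);   # feature: smoking
    my %bias   = (heavy => 0.7);                 # from "smoking: heavy=0.7"
    my $rest   = (1 - 0.7) / (@values - 1);      # 0.1 for each other value
    my %prob   = map { $_ => exists $bias{$_} ? $bias{$_} : $rest } @values;
    # %prob is (heavy => 0.7, medium => 0.1, light => 0.1, never => 0.1)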

gen_training_data()

This call generates the training data from your parameter file:

    $data_generator->gen_training_data();

write_training_data_to_file()

To write out the training data to a disk file:

    $data_generator->write_training_data_to_file();

This call will also display the training data in your terminal window if the debug variable $debug1 is set.

HOW THE CLASSIFICATION RESULTS ARE DISPLAYED

A call such as

    my @test_sample = qw /exercising=>never 
                          smoking=>heavy 
                          fatIntake=>heavy 
                          videoAddiction=>heavy /;
    $dt->classify($root_node, @test_sample);

will return an answer like:

    Classification=> Probability of class malignant: 0.461538461538462
    Classification=> Probability of class benign: 0.538461538461538

Note again the soft classification returned. That is, if the probability distributions for the different classes overlap in the underlying feature space, the classifier will return all of the applicable class labels for a data vector along with the corresponding class probabilities. Another reason why the decision tree classifier may associate significant probabilities with multiple class labels is that you used an inadequate number of training samples to induce the decision tree. The good thing is that the classifier does not lie to you (unlike, say, a hard classification rule that would correspond to a partitioning of the underlying feature space). The decision tree classifier gives you the best classification that can be made given the training data you fed into it.
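
If your application nonetheless needs a single hard label, you can take the class with the largest probability. A trivial sketch, with the probabilities from the example output above entered by hand (hypothetical; adapt to however you capture the classification result):

    # Hypothetical: probabilities transcribed from the output above.
    my %prob = ( malignant => 0.461538461538462,
                 benign    => 0.538461538461538 );
    my ($label) = sort { $prob{$b} <=> $prob{$a} } keys %prob;
    print "hard label: $label\n";    # prints "hard label: benign"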

EXAMPLES

See the examples directory in the distribution for how to generate the training data, how to induce a decision tree, and how to then classify the data using the decision tree.

EXPORT

None by design.

BUGS

Please notify the author if you encounter any bugs. When sending email, please place the string 'DecisionTree' in the subject line.

INSTALLATION

The usual

    perl Makefile.PL
    make
    make test
    make install

if you have root access. If not,

    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install

AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "DecisionTree" in your subject line to get past my spam filter.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

 Copyright 2010 Avinash Kak