The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Algorithm::DecisionTree - A Perl module for constructing a decision tree from multidimensional training data and for using the decision tree thus constructed for classifying new data.

SYNOPSIS

  # FOR CONSTRUCTING A DECISION TREE AND FOR CLASSIFYING A SAMPLE:

  # If your training data includes numeric features (a feature is numeric if it can
  # take any floating point value over an interval), you are expected to supply your
  # training data through a CSV file and your call for constructing an instance of
  # the DecisionTree class will look like:

      my $training_datafile = "stage3cancer.csv";

      my $dt = Algorithm::DecisionTree->new( 
                              training_datafile => $training_datafile,
                              csv_class_column_index => 2,
                              csv_columns_for_features => [3,4,5,6,7,8],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
               );

  # The constructor option `csv_class_column_index' informs the module as to which
  # column of your CSV file contains the class label.  THE COLUMN INDEXING IS ZERO
  # BASED.  The constructor option `csv_columns_for_features' specifies which columns
  # are to be used for feature values.  The first row of the CSV file must specify
  # the names of the features.  See examples of CSV files in the `examples'
  # subdirectory.

  # The option `symbolic_to_numeric_cardinality_threshold' is also important.  For
  # the example shown above, if an ostensibly numeric feature takes on only 10 or
  # fewer different values in your training datafile, it will be treated like a
  # symbolic features.  The option `entropy_threshold' determines the granularity
  # with which the entropies are sampled for the purpose of calculating entropy gain
  # with a particular choice of decision threshold for a numeric feature or a feature
  # value for a symbolic feature.

  # After you have constructed an instance of the DecisionTree class as shown above,
  # you read in the training data file and initialize the probability cache by
  # calling:

      $dt->get_training_data();
      $dt->calculate_first_order_probabilities();
      $dt->calculate_class_priors();

  # Next you construct a decision tree for your training data by calling:

      $root_node = $dt->construct_decision_tree_classifier();

  # where $root_node is an instance of the DTNode class that is also defined in the
  # module file.  Now you are ready to classify a new data record.  Let's say that
  # your data record looks like:

      my @test_sample  = qw /  g2=4.2
                               grade=2.3
                               gleason=4
                               eet=1.7
                               age=55.0
                               ploidy=diploid /;

  # You can classify it by calling:

      my $classification = $dt->classify($root_node, \@test_sample);

  # The call to `classify()' returns a reference to a hash whose keys are the class
  # names and the values the associated classification probabilities.  This hash also
  # includes another key-value pair for the solution path from the root node to the
  # leaf node at which the final classification was carried out.


  # FOR THE CASE OF PURELY SYMBOLIC FEATURES:

  # If your features are purely symbolic, you can continue to use the same
  # constructor syntax that was used in the older versions of this module.  However,
  # your old `.dat' training files will not work with the new version.  The good news
  # is that with just a small fix, you can continue to use them.  The fix and why it
  # was needed is described in the file README_for_dat_files in the `examples'
  # directory.

CHANGES

Version 2.25 further downshifts the required version of Perl for this module. This was a result of testing the module with Version 5.10.1 of Perl. Only one statement in the module code needed to be changed for the module to work with the older version of Perl.

Version 2.24 fixes the Makefile.PL restriction on the required Perl version. This version should work with Perl versions 5.14.0 and higher.

Version 2.23 changes the required version of Perl from 5.18.0 to 5.14.0. Everything else remains the same.

Version 2.22 should prove more robust when the probability distribution for the values of a feature is expected to be heavy-tailed; that is, when the supposedly rare observations can occur with significant probabilities. A new option in the DecisionTree constructor lets the user specify the precision with which the probability distributions are estimated for such features.

Version 2.21 fixes a bug that was caused by the explicitly set zero values for numerical features being misconstrued as "false" in the conditional statements in some of the method definitions.

Version 2.2 makes it easier to write code for classifying in one go all of your test data samples in a CSV file. The bulk classifications obtained can be written out to either a CSV file or to a regular text file. See the script classify_test_data_in_a_file_numeric.pl in the examples directory for how to classify all of your test data records in a CSV file. This version also includes improved code for generating synthetic numeric/symbolic training and test data records for experimenting with the decision tree classifier.

Version 2.1 allows you to test the quality of your training data by running a 10-fold cross-validation test on the data. This test divides all of the training data into ten parts, with nine parts used for training a decision tree and one part used for testing its ability to classify correctly. This selection of nine parts for training and one part for testing is carried out in all of the ten different ways that are possible. This testing functionality in Version 2.1 can also be used to find the best values to use for the constructor parameters entropy_threshold, max_depth_desired, and symbolic_to_numeric_cardinality_threshold.

Version 2.0 is a major rewrite of this module. Now you can use both numeric and symbolic features for constructing a decision tree. A feature is numeric if it can take any floating-point value over an interval.

Version 1.71 fixes a bug in the code that was triggered by 0 being declared as one of the features values in the training datafile. Version 1.71 also include an additional safety feature that is useful for training datafiles that contain a very large number of features. The new version makes sure that the number of values you declare for each sample record matches the number of features declared at the beginning of the training datafile.

Version 1.7 includes safety checks on the consistency of the data you place in your training datafile. When a training file contains thousands of samples, it is difficult to manually check that you used the same class names in your sample records that you declared at top of your training file or that the values you have for your features are legal vis-a-vis the earlier declarations of the values in the training file. Another safety feature incorporated in this version is the non-consideration of classes that are declared at the top of the training file but that have no sample records in the file.

Version 1.6 uses probability caching much more extensively compared to the previous versions. This should result in faster construction of large decision trees. Another new feature in Version 1.6 is the use of a decision tree for interactive classification. In this mode, after you have constructed a decision tree from the training data, the user is prompted for answers to the questions pertaining to the feature tests at the nodes of the tree.

Some key elements of the documentation were cleaned up and made more readable in Version 1.41. The implementation code remains unchanged from Version 1.4.

Version 1.4 should make things faster (and easier) for folks who want to use this module with training data that creates very large decision trees (that is, trees with tens of thousands or more decision nodes). The speedup in Version 1.4 has been achieved by eliminating duplicate calculation of probabilities as the tree grows. In addition, this version provides an additional constructor parameter, max_depth_desired for controlling the size of the decision tree. This is in addition to the tree size control achieved by the parameter entropy_threshold that was introduced in Version 1.3. Since large decision trees can take a long time to create, you may find yourself wishing you could store the tree you just created in a disk file and that, subsequently, you could use the stored tree for classification work. The examples directory contains two scripts, store_dt_on_disk.pl and classify_from_disk_stored_dt.pl, that show how you can do exactly that with the help of Perl's Storable module.

Version 1.3 addresses the issue that arises when the header of a training datafile declares a certain possible value for a feature but that (feature,value) pair does NOT show up anywhere in the training data. Version 1.3 also makes it possible for a user to control the size of the decision tree by changing the value of the parameter entropy_threshold. Additionally, Version 1.3 includes a method called determine_data_condition() that displays useful information regarding the size and some other attributes of the training data. It also warns the user if the training data might result in a decision tree that would simply be much too large --- unless the user controls the size with the entropy_threshold parameter.

In addition to the removal of a couple of serious bugs, version 1.2 incorporates a number of enhancements: (1) Version 1.2 includes checks on the names of the features and values used in test data --- this is the data you want to classify with the decision tree classifier constructed by this module. (2) Version 1.2 includes a separate constructor for generating test data. To make it easier to generate test data whose probabilistic parameters may not be identical to that used for the training data, I have used separate routines for generating the test data. (3) Version 1.2 also includes in its examples directory a script that classifies the test data in a file and outputs the class labels into another file. This is for folks who do not wish to write their own scripts using this module. (4) Version 1.2 also includes addition to the documentation regarding the issue of numeric values for features.

SPECIAL USAGE NOTE

For those transitioning from versions older than 2.0 of this module, if your training data consists of numeric features, or has a combination of numeric and symbolic features, you MUST use a CSV file to supply your data to the module. Additionally, this CSV file must satisfy certain formatting constraints. See README_for_CSV_files in the examples directory for what these formatting restrictions are. And, even if your training data is purely symbolic, your old-style `.dat' training files will not work with the new module. See README_for_dat_files in the examples directory for the formatting related to the new `.dat' files.

DESCRIPTION

Algorithm::DecisionTree is a perl5 module for constructing a decision tree from a training datafile containing multidimensional data. In one form or another, decision trees have been around for about fifty years. From a statistical perspective, they are closely related to classification and regression by recursive partitioning of multidimensional data. Early work that demonstrated the usefulness of such partitioning of data for classification and regression can be traced to the work of Terry Therneau in the early 1980's in the statistics community, and to the work of Ross Quinlan in the mid 1990's in the machine learning community,

For those not familiar with decision tree ideas, the traditional way to classify multidimensional data is to start with a feature space whose dimensionality is the same as that of the data. Each feature in this space corresponds to the attribute that each dimension of the data measures. You then use the training data to carve up the feature space into different regions, each corresponding to a different class. Subsequently, when you try to classify a new data sample, you locate it in the feature space and find the class label of the region to which it belongs. One can also give the new data point the same class label as that of the nearest training sample. This is referred to as the nearest neighbor classification. There exist hundreds of variations of varying power on these two basic approaches to the classification of multidimensional data.

A decision tree classifier works differently. When you construct a decision tree, you select for the root node a feature test that partitions the training data in a way that causes maximal disambiguation of the class labels associated with the data. In terms of information content as measured by entropy, such a feature test would cause maximum reduction in class entropy in going from all of the training data taken together to the data as partitioned by the feature test. You then drop from the root node a set of child nodes, one for each partition of the training data created by the feature test at the root node. When your features are purely symbolic, you'll have one child node for each value of the feature chosen for the feature test at the root. When the test at the root involves a numeric feature, you find the decision threshold for the feature that best bipartitions the data and you drop from the root node two child nodes, one for each partition. Now at each child node you pose the same question that you posed when you found the best feature to use at the root: Which feature at the child node in question would maximally disambiguate the class labels associated with the training data corresponding to that child node?

As the reader would expect, the two key steps in any approach to decision-tree based classification are the construction of the decision tree itself from a file containing the training data, and then using the decision tree thus obtained for classifying new data.

What is cool about decision tree classification is that it gives you soft classification, meaning it may associate more than one class label with a given data vector. When this happens, it may mean that your classes are indeed overlapping in the underlying feature space. It could also mean that you simply have not supplied sufficient training data to the decision tree classifier. For a tutorial introduction to how a decision tree is constructed and used, visit https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf

This module also allows you to generate your own synthetic training and test data. Generating your own training data, using it for constructing a decision-tree classifier, and subsequently testing the classifier on a synthetically generated test set of data is a good way to develop greater proficiency with decision trees.

WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE

If you are new to the concept of a decision tree, their practical utility is best understood with an example that only involves symbolic features. However, as mentioned earlier, versions of the module higher than 2.0 allow you to use both symbolic and numeric features.

Consider the following scenario: Let's say you are running a small investment company that employs a team of stockbrokers who make buy/sell decisions for the customers of your company. Assume that your company has asked the traders to make each investment decision on the basis of the following four criteria:

  price_to_earnings_ratio   (P_to_E)

  price_to_sales_ratio      (P_to_S)

  return_on_equity          (R_on_E)

  market_share              (MS)

Since you are the boss, you keep track of the buy/sell decisions made by the individual traders. But one unfortunate day, all of your traders decide to quit because you did not pay them enough. So what do you do? If you had a module like the one here, you could still run your company and do so in such a way that, on the average, would do better than any of the individual traders who worked for your company. This is what you do: You pool together the individual trader buy/sell decisions you have accumulated during the last one year. This pooled information is likely to look like:

  example      buy/sell     P_to_E     P_to_S     R_on_E      MS
  ============================================================+=

  example_1     buy          high       low        medium    low
  example_2     buy          medium     medium     low       low
  example_3     sell         low        medium     low       high
  ....
  ....

This data would constitute your training file. You could feed this file into the module by calling:

    my $dt = Algorithm::DecisionTree->new( 
                                          training_datafile => $training_datafile,
                                         );
    $dt->get_training_data(); 
    $dt->calculate_first_order_probabilities();
    $dt->calculate_class_priors();

Subsequently, you would construct a decision tree by calling:

    my $root_node = $dt->construct_decision_tree_classifier();

Now you and your company (with practically no employees) are ready to service the customers again. Suppose your computer needs to make a buy/sell decision about an investment prospect that is best described by:

    price_to_earnings_ratio  =  low
    price_to_sales_ratio     =  very_low
    return_on_equity         =  none
    market_share             =  medium    

All that your computer would need to do would be to construct a data vector like

   my @data =   qw / P_to_E=low
                     P_to_S=very_low
                     R_on_E=none
                     MS=medium /;

and call the decision tree classifier you just constructed by

    $dt->classify($root_node, \@data); 

The answer returned will be 'buy' and 'sell', along with the associated probabilities. So if the probability of 'buy' is considerably greater than the probability of 'sell', that's what you should instruct your computer to do.

The chances are that, on the average, this approach would beat the performance of any of your individual traders who worked for you previously since the buy/sell decisions made by the computer would be based on the collective wisdom of all your previous traders. DISCLAIMER: There is obviously a lot more to good investing than what is captured by the silly little example here. However, it does nicely the convey the sense in which the current module could be used.

SYMBOLIC FEATURES VERSUS NUMERIC FEATURES

A feature is symbolic when its values are compared using string comparison operators. By the same token, a feature is numeric when its values are compared using numeric comparison operators. Having said that, features that take only a small number of numeric values in the training data can be treated symbolically provided you are careful about handling their values in the test data. At the least, you have to set the test data value for such a feature to its closest value in the training data. The module does that automatically for you for those numeric features for which the number different numeric values is less than a user-specified threshold. For those numeric features that the module is allowed to treat symbolically, this snapping of the values of the features in the test data to the small set of values in the training data is carried out automatically by the module. That is, after a user has told the module which numeric features to treat symbolically, the user need not worry about how the feature values appear in the test data.

The constructor parameter symbolic_to_numeric_cardinality_threshold let's you tell the module when to consider an otherwise numeric feature symbolically. Suppose you set this parameter to 10, that means that all numeric looking features that take 10 or fewer different values in the training datafile will be considered to be symbolic features by the module. See the tutorial at https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf for further information on the implementation issues related to the symbolic and numeric features.

FEATURES WITH NOT SO "NICE" STATISTICAL PROPERTIES

For the purpose of estimating the probabilities, it is necessary to sample the range of values taken on by a numerical feature. For features with "nice" statistical properties, this sampling interval is set to the median of the differences between the successive feature values in the training data. (Obviously, as you would expect, you first sort all the values for a feature before computing the successive differences.) This logic will not work for the sort of a feature described below.

Consider a feature whose values are heavy-tailed, and, at the same time, the values span a million to one range. What I mean by heavy-tailed is that rare values can occur with significant probabilities. It could happen that most of the values for such a feature are clustered at one of the two ends of the range. At the same time, there may exist a significant number of values near the end of the range that is less populated. (Typically, features related to human economic activities --- such as wealth, incomes, etc. --- are of this type.) With the logic described in the previous paragraph, you could end up with a sampling interval that is much too small, which could result in millions of sampling points for the feature if you are not careful.

Beginning with Version 2.22, you have two options in dealing with such features. You can choose to go with the default behavior of the module, which is to sample the value range for such a feature over a maximum of 500 points. Or, you can supply an additional option to the constructor that sets a user-defined value for the number of points to use. The name of the option is number_of_histogram_bins. The following script

    construct_dt_for_heavytailed.pl 

in the examples directory shows an example of how to call the constructor of the module with the number_of_histogram_bins option.

TESTING THE QUALITY OF YOUR TRAINING DATA

Versions 2.1 and higher include a new class named EvalTrainingData, derived from the main class DecisionTree, that runs a 10-fold cross-validation test on your training data to test its ability to discriminate between the classes mentioned in the training file.

The 10-fold cross-validation test divides all of the training data into ten parts, with nine parts used for training a decision tree and one part used for testing its ability to classify correctly. This selection of nine parts for training and one part for testing is carried out in all of the ten different possible ways.

The following code fragment illustrates how you invoke the testing function of the EvalTrainingData class:

    my $training_datafile = "training.csv";                                         
    my $eval_data = EvalTrainingData->new(
                                  training_datafile => $training_datafile,
                                  csv_class_column_index => 1,
                                  csv_columns_for_features => [2,3],
                                  entropy_threshold => 0.01,
                                  max_depth_desired => 3,
                                  symbolic_to_numeric_cardinality_threshold => 10,
                    );
    $eval_data->get_training_data();
    $eval_data->evaluate_training_data()

The last statement above prints out a Confusion Matrix and the value of Training Data Quality Index on a scale of 0 to 100, with 100 designating perfect training data. The Confusion Matrix shows how the different classes were mis-labeled in the 10-fold cross-validation test.

This testing functionality can also be used to find the best values to use for the constructor parameters entropy_threshold, max_depth_desired, and symbolic_to_numeric_cardinality_threshold.

The following two scripts in the examples directory illustrate the use of the EvalTrainingData class for testing the quality of your data:

    evaluate_training_data1.pl
    evaluate_training_data2.pl

METHODS

The module provides the following methods for constructing a decision tree from training data in a disk file and for classifying new data records with the decision tree thus constructed:

new():
    my $dt = Algorithm::DecisionTree->new( 
                              training_datafile => $training_datafile,
                              csv_class_column_index => 2,
                              csv_columns_for_features => [3,4,5,6,7,8],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
            );

A call to new() constructs a new instance of the Algorithm::DecisionTree class. For this call to make sense, the training data in the training datafile must be according to a certain format. For the format, see the files training.csv training.dat in the examples directory. If your training data includes numeric features, you must use a CSV file for supplying the data to the module. The previous versions of this module used `.dat' files for the training data. You can still use your old `.dat' files provided you modify them a little bit. See README_for_dat_files in the examples directory for how to modify your old .dat files. Note that column indexing is zero-based. With the index set to 2 as shown above, the class labels are in the third column for the above case.

Constructor Parameters:

training_datafile:

This parameter supplies the name of the file that contains the training data. This must be a CSV file if your training data includes both numeric and symbolic features. If your data is purely symbolic, you can use the old-style `.dat' file.

csv_class_column_index:

When using a CSV file for your training data, this parameter supplies the zero-based column index for the column that contains the class label for each data record in the training file.

csv_columns_for_features:

When using a CSV file for your training data, this parameter supplies a list of columns corresponding to the features you wish to use for decision tree construction. Each column is specified by its zero-based index.

entropy_threshold:

This parameter sets the granularity with which the entropies are sampled by the module. For example, a feature test at a node in the decision tree is acceptable if the entropy gain achieved by the test exceeds this threshold. The larger the value you choose for this parameter, the smaller the tree. Its default value is 0.001.

max_depth_desired:

This parameter sets the maximum depth of the decision tree. For obvious reasons, the smaller the value you choose for this parameter, the smaller the tree.

symbolic_to_numeric_cardinality_threshold:

This parameter allows the module to treat an otherwise numeric feature symbolically if the number of different values the feature takes in the training data file does not exceed the value of this parameter.

number_of_histogram_bins:

This parameter gives the user the option to set the number of points at which the value range for a feature should be sampled for estimating the probabilities. This parameter is effective only for those features that occupy a large value range and whose probability distributions are heavy tailed.

You can choose the best values to use for the last three constructor parameters by running a 10-fold cross-validation test on your training data through the class EvalTrainingData that comes with Versions 2.1 and higher of this module. See the section "TESTING THE QUALITY OF YOUR TRAINING DATA" of this document page.

get_training_data():

After you have constructed a new instance of the Algorithm::DecisionTree class, you must now read in the training data that is the file named in the call to the constructor. This you do by:

    $dt->get_training_data(); 

IMPORTANT: The training datafile must be in a format that makes sense to the decision tree constructor. See the files README_for_CSV_files and README_for_dat_files in the examples directory for these formats. Also see the files training.csv and training.dat for examples of such files.

show_training_data():

If you wish to see the training data that was just digested by the module, call

    $dt->show_training_data(); 
calculate_first_order_probabilities():
calculate_class_priors():

After the module has read the training data file, it needs to initialize the probability cache. This you do by invoking:

    $dt->calculate_first_order_probabilities()
    $dt->calculate_class_priors() 
construct_decision_tree_classifier():

With the probability cache initialized, it is time to construct a decision tree classifier. This you do by

    my $root_node = $dt->construct_decision_tree_classifier();

This call returns an instance of type DTNode. The DTNode class is defined within the main package file. So, don't forget, that $root_node in the above example call will be instantiated to an object of type DTNode.

$root_node->display_decision_tree(" "):
    $root_node->display_decision_tree("   ");

This will display the decision tree in your terminal window by using a recursively determined offset for each node as the display routine descends down the tree.

I have intentionally left the syntax fragment $root_node in the above call to remind the reader that display_decision_tree() is NOT called on the instance of the DecisionTree we constructed earlier, but on the DTNode instance returned by the call to construct_decision_tree_classifier().

classify($root_node, \@test_sample):

Let's say you want to classify the following data record:

    my @test_sample  = qw /  g2=4.2
                             grade=2.3
                             gleason=4
                             eet=1.7
                             age=55.0
                             ploidy=diploid /;

you'd make the following call:

    my $classification = $dt->classify($root_node, \@test_sample);

where, again, $root_node is an instance of type DTNode returned by the call to construct_decision_tree_classifier(). The variable $classification holds a reference to a hash whose keys are the class names and whose values the associated probabilities. The hash that is returned by the above call also includes a special key-value pair for a key named solution_path. The value associated with this key is an anonymous array that holds the path, in the form of a list of nodes, from the root node to the leaf node in the decision tree where the final classification was made.

classify_by_asking_questions($root_node):

This method allows you to use a decision-tree based classifier in an interactive mode. In this mode, a user is prompted for answers to the questions pertaining to the feature tests at the nodes of the tree. The syntax for invoking this method is:

    my $classification = $dt->classify_by_asking_questions($root_node);

where $dt is an instance of the Algorithm::DecisionTree class returned by a call to new() and $root_node the root node of the decision tree returned by a call to construct_decision_tree_classifier().

GENERATING SYNTHETIC TRAINING AND TEST DATA

The module file contains the following additional classes: (1) TrainingAndTestDataGeneratorNumeric, and (2) TrainingAndTestDataGeneratorSymbolic for generating synthetic training and test data.

The class TrainingAndTestDataGeneratorNumeric outputs one CSV file for the training data and another one for the test data for experimenting with numeric features. The numeric values are generated using a multivariate Gaussian distribution whose mean and covariance are specified in a parameter file. See the file param_numeric.txt in the examples directory for an example of such a parameter file. Note that the dimensionality of the data is inferred from the information you place in the parameter file.

The class TrainingAndTestDataGeneratorSymbolic generates synthetic training and test data for the purely symbolic case. The relative frequencies of the different possible values for the features is controlled by the biasing information you place in a parameter file. See param_symbolic.txt for an example of such a file.

HOW THE CLASSIFICATION RESULTS ARE DISPLAYED

It depends on whether you apply the classifier at once to all the data samples in a file, or whether you feed one data sample at a time into the classifier.

In general, the classifier returns soft classification for a test data vector. What that means is that, in general, the classifier will list all the classes to which a given data vector could belong and the probability of each such class label for the data vector. Run the examples scripts in the Examples directory to see how the output of classification can be displayed.

For large test datasets, you would obviously want to process an entire file of test data samples in one go. As to how you can do that, see the following two scripts in the examples directory:

    classify_test_data_in_a_file_numeric.pl
    classify_test_data_in_a_file_symbolic.pl

the first for the case of numeric/symbolic data placed in a CSV file and the second case of purely symbolic data placed in a `.dat' file. These scripts require three command-line arguments, the first argument names the training datafile, the second the test datafile, and the third the name of the file in which the classification results will be deposited. The test data files for both the numeric/symbolic and the purely symbolic cases must look like the training data files. If the test data files do not mention the class labels --- as will be the case for real-world test data --- you must still have a column for the class labels with the entries set to the empty string "". The test datafile for the purely symbolic case must mention the order in which the features values are presented. For an example, see the file testdata.dat in the examples directory.

A special feature of the script classify_test_data_in_a_file_numeric.pl is that its last command-line argument (meaning the 3rd argument) can be either a .csv filename or a .txt filename. As you'd expect, the output classifications are displayed in a CSV format for the former case. For the latter case, the results are shown with white-space separation between the different class names and their probabilities. Additionally, for the latter case, you can the tell the script either to show you only the most probable class for each data sample or all the classes. You control this choice by setting the variable $show_hard_classifications in the script. When this variable is set, you'll only be shown the most probable class for each sample when the output is to a .txt file.

With regard to the soft classifications returned by this classifier, if the probability distributions for the different classes overlap in the underlying feature space, you would want the classifier to return all of the applicable class labels for a data vector along with the corresponding class probabilities. Another reason for why the decision tree classifier may associate significant probabilities with multiple class labels is that you used inadequate number of training samples to induce the decision tree. The good thing is that the classifier does not lie to you (unlike, say, a hard classification rule that would return a single class label corresponding to the partitioning of the underlying feature space). The decision tree classifier give you the best classification that can be made given the training data you fed into it.

THE EXAMPLES DIRECTORY

See the examples directory in the distribution for how to construct a decision tree, and how to then classify new data using the decision tree. To become more familiar with the module, run the scripts

    construct_dt_and_classify_one_sample_case1.pl
    construct_dt_and_classify_one_sample_case2.pl
    construct_dt_and_classify_one_sample_case3.pl
    construct_dt_and_classify_one_sample_case4.pl

The first script is for the purely symbolic case, the second for the case that involves both numeric and symbolic features, the third for the case of purely numeric features, and the last for the case when the training data is synthetically generated by the script generate_training_data_numeric.pl.

Next run the following scripts just as you find them:

    classify_test_data_in_a_file_numeric.pl   training4.csv   test4.csv   out4.csv

    classify_test_data_in_a_file_symbolic.pl   training4.dat   test4.dat   out4.dat

and examine the contents of the output files out4.csv and out4.dat. Each of above two scripts first constructs a decision tree using the training data in the training file supplied by the first command-line argument. The script then calculates the class label for each data record in the test data file supplied through the second command-line argument. The estimated class labels are written out to the output file named by the third argument.

In general, for the two calls shown above, the test data files should look identical to the training data files. Of course, for real-world test data, you will not have the class labels for the test samples. For real-world test data, you are still required to reserve a column for the class label, which now must be just the empty string "" for each data record. For example, the test data records supplied in the following two calls through the files test4_no_class_labels.csv and test4_no_class_labels.dat do not mention class labels:

    classify_test_data_in_a_file_numeric.pl   training4.csv   test4_no_class_labels.csv   out4.csv

    classify_test_data_in_a_file_symbolic.pl   training4.dat   test4_no_class_labels.dat   out4.dat 

For bulk classification, the output file can also be a .txt file. In that case, you will see white-space separate results in the output file. When you mention a .txt file for the output, you can control the extent of information placed in the output file by setting the variable $show_hard_classifications in the scripts. If this variable is set, the output will show only the most probable class for each test data record.

The following script in the examples directory

    classify_by_asking_questions.pl

shows how you can use a decision-tree classifier interactively. In this mode, you first construct the decision tree from the training data and then the user is prompted for answers to the feature tests at the nodes of the tree.

If your training data has a feature whose values span a large range and, at the same time, are characterized by a heavy-tail distribution, you should look at the script

    construct_dt_for_heavytailed.pl                                                     

to see how to use the option number_of_histogram_bins in the call to the constructor. This option was introduced in Version 2.22 for dealing with such features. If you do not set this option, the module will use the default value of 500 for the number of points at which to sample the value range for such a feature.

The examples directory also contains the following scripts:

    generate_training_and_test_data_numeric.pl
    generate_training_and_test_data_symbolic.pl

that show how you can use the module to generate synthetic training and test data. Synthetic training and test data are generated according to the specifications laid out in a parameter file. There are constraints on how the information is laid out in a parameter file. See the files param_numeric.txt and param_symbolic.txt in the examples directory for how to structure these files.

The examples directory of Versions 2.1 and higher of the DecisionTree module also contains the following two scripts:

    evaluate_training_data1.pl
    evaluate_training_data2.pl

that illustrate how the Perl class EvalTrainingData can be used to evaluate the quality of your training data (as long as it resides in a `.csv' file.) This new class is a subclass of the DecisionTree class in the module file. See the README in the examples directory for further information regarding these two scripts.

EXPORT

None by design.

BUGS

Please notify the author if you encounter any bugs. When sending email, please place the string 'DecisionTree' in the subject line.

INSTALLATION

Download the archive from CPAN in any directory of your choice. Unpack the archive with a command that on a Linux machine would look like:

    tar zxvf Algorithm-DecisionTree-2.25.tar.gz

This will create an installation directory for you whose name will be Algorithm-DecisionTree-2.25. Enter this directory and execute the following commands for a standard install of the module if you have root privileges:

    perl Makefile.PL
    make
    make test
    sudo make install

If you do not have root privileges, you can carry out a non-standard install the module in any directory of your choice by:

    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install

With a non-standard install, you may also have to set your PERL5LIB environment variable so that this module can find the required other modules. How you do that would depend on what platform you are working on. In order to install this module in a Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment variable by

    setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

If I used bash, I'd need to declare:

    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

THANKS

I wish to thank many users of this module for their feedback. Many of the improvements I have made to the module over the years are a result of the feedback received.

I thank Slaven for pointing out that the module worked with Perl 5.14.x. For Version 2.22, I had set the required version of Perl to 5.18.0 since that's what I used for testing the module. Slaven's feedback in the form of the Bug report #96547 resulted in Version 2.23 of the module. Version 2.25 further downshifts the required version of Perl to 5.10.

AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "DecisionTree" in your subject line to get past my spam filter.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

 Copyright 2014 Avinash Kak