NAME

Algorithm::DecisionTree - A Perl module for decision-tree based classification of multidimensional data.

SYNOPSIS

# FOR CONSTRUCTING A DECISION TREE AND FOR CLASSIFYING A SAMPLE:

# In general, your call for constructing an instance of the DecisionTree class 
# will look like:

    my $training_datafile = "stage3cancer.csv";
    my $dt = Algorithm::DecisionTree->new( 
                            training_datafile => $training_datafile,
                            csv_class_column_index => 2,
                            csv_columns_for_features => [3,4,5,6,7,8],
                            entropy_threshold => 0.01,
                            max_depth_desired => 8,
                            symbolic_to_numeric_cardinality_threshold => 10,
                            csv_cleanup_needed => 1,
             );

# The constructor option `csv_class_column_index' informs the module as to which
# column of your CSV file contains the class label.  THE COLUMN INDEXING IS ZERO
# BASED.  The constructor option `csv_columns_for_features' specifies which columns
# are to be used for feature values.  The first row of the CSV file must specify
# the names of the features.  See examples of CSV files in the `Examples'
# subdirectory.

# The option `symbolic_to_numeric_cardinality_threshold' is also important.  For
# the example shown above, if an ostensibly numeric feature takes on only 10 or
# fewer different values in your training datafile, it will be treated like a
# symbolic features.  The option `entropy_threshold' determines the granularity
# with which the entropies are sampled for the purpose of calculating entropy gain
# with a particular choice of decision threshold for a numeric feature or a feature
# value for a symbolic feature.

# The option 'csv_cleanup_needed' is by default set to 0.  If you set it
# to 1, that would cause all line records in your CSV file to be "sanitized" before
# they are used for constructing a decision tree.  You need this option if your CSV
# file uses double-quoted field names and field values in the line records and if
# such double-quoted strings are allowed to include commas for, presumably, better
# readability.

# After you have constructed an instance of the DecisionTree class as shown above,
# you read in the training data file and initialize the probability cache by
# calling:

    $dt->get_training_data();
    $dt->calculate_first_order_probabilities();
    $dt->calculate_class_priors();

# Next you construct a decision tree for your training data by calling:

    $root_node = $dt->construct_decision_tree_classifier();

# where $root_node is an instance of the DTNode class that is also defined in the
# module file.  Now you are ready to classify a new data record.  Let's say that
# your data record looks like:

    my @test_sample  = qw /  g2=4.2
                             grade=2.3
                             gleason=4
                             eet=1.7
                             age=55.0
                             ploidy=diploid /;

# You can classify it by calling:

    my $classification = $dt->classify($root_node, \@test_sample);

# The call to `classify()' returns a reference to a hash whose keys are the class
# names and the values the associated classification probabilities.  This hash also
# includes another key-value pair for the solution path from the root node to the
# leaf node at which the final classification was carried out.

CHANGES

Version 3.43: This version fixes a bug in the csv_cleanup_needed() function. The source of the bug was a typo in a regex component meant for matching with white space. I have also made one additional change to this function to increase its versatility. With this change, you are now allowed to have empty strings as values for features.

Version 3.42: This version reintroduces csv_cleanup_needed as an optional parameter in the module constructor. This was done in response to several requests received from the user community. (Previously, all line records from a CSV file were processed by the cleanup_csv() function no matter what.) The main point made by the users was that invoking cleanup_csv() when there was no need for CSV clean-up extracted a performance penalty when ingesting large database files with tens of thousands of line records. In addition to making csv_cleanup_needed optional, I have also tweaked up the code in the cleanup_csv() function in order to extract data from a larger range of messy CSV files.

Version 3.41: All the changes made in this version relate to the construction of regression trees. I have fixed a couple of bugs in the calculation of the regression coefficients. Additionally, the RegressionTree class now comes with a new constructor parameter named jacobian_choice. For most cases, you'd set this parameter to 0, which causes the regression coefficients to be estimated through linear least-squares minimization.

Version 3.40: In addition to constructing decision trees, this version of the module also allows you to construct regression trees. The regression tree capability has been packed into a separate subclass, named RegressionTree, of the main DecisionTree class. The subdirectory ExamplesRegression in the main installation directory illustrates how you can use this new functionality of the module.

Version 3.30: This version incorporates four very significant upgrades/changes to the DecisionTree module: (1) The CSV cleanup is now the default. So you do not have to set any special parameters in the constructor calls to initiate CSV cleanup. (2) In the form of a new Perl class named RandomizedTreesForBigData, this module provides you with an easy-to-use programming interface for attempting needle-in-a-haystack solutions for the case when your training data is overwhelmingly dominated by a single class. You need to set the constructor parameter looking_for_needles_in_haystack to invoke the logic that constructs multiple decision trees, each using the minority class samples along with samples drawn randomly from the majority class. The final classification is made through a majority vote from all the decision trees. (3) Assuming you are faced with a big-data problem --- in the sense that you have been given a training database with a very large number of training records --- the class RandomizedTreesForBigData will also let you construct multiple decision trees by pulling training data randomly from your training database (without paying attention to the relative populations of the classes). The final classification decision for a test sample is based on a majority vote from all the decision trees thus constructed. See the ExamplesRandomizedTrees directory for how to use these new features of the module. And, finally, (4) Support for the old-style '.dat' training files has been dropped in this version.

Version 3.21: This version makes it easier to use a CSV training file that violates the assumption that a comma be used only to separate the different field values in a line record. Some large econometrics databases use double-quoted values for fields, and these values may contain commas (presumably for better readability). This version also allows you to specify the leftmost entry in the first CSV record that names all the fields. Previously, this entry was required to be an empty double-quoted string. I have also made some minor changes to the 'get_training_data_from_csv()' method to make it more user friendly for large training files that may contain tens of thousands of records. When pulling training data from such files, this method prints out a dot on the terminal screen for every 10000 records it has processed.

Version 3.20: This version brings the boosting capability to the DecisionTree module.

Version 3.0: This version adds bagging to the DecisionTree module. If your training dataset is large enough, you can ask the module to construct multiple decision trees using data bags extracted from your dataset. The module can show you the results returned by the individual decision trees and also the results obtained by taking a majority vote of the classification decisions made by the individual trees. You can specify any arbitrary extent of overlap between the data bags.

Version 2.31: The introspection capability in this version packs more of a punch. For each training data sample, you can now figure out not only the decision-tree nodes that are affected directly by that sample, but also those nodes that are affected indirectly through the generalization achieved by the probabilistic modeling of the data. The 'examples' directory of this version includes additional scripts that illustrate these enhancements to the introspection capability. See the section "The Introspection API" for a declaration of the introspection related methods, old and new.

Version 2.30: In response to requests from several users, this version includes a new capability: You can now ask the module to introspect about the classification decisions returned by the decision tree. Toward that end, the module includes a new class named DTIntrospection. Perhaps the most important bit of information you are likely to seek through DT introspection is the list of the training samples that fall directly in the portion of the feature space that is assigned to a node. CAVEAT: When training samples are non-uniformly distributed in the underlying feature space, IT IS POSSIBLE FOR A NODE TO EXIST EVEN WHEN NO TRAINING SAMPLES FALL IN THE PORTION OF THE FEATURE SPACE ASSIGNED TO THE NODE. (This is an important part of the generalization achieved by probabilistic modeling of the training data.) For additional information related to DT introspection, see the section titled "DECISION TREE INTROSPECTION" in this documentation page.

Version 2.26 fixes a bug in the part of the module that some folks use for generating synthetic data for experimenting with decision tree construction and classification. In the class TrainingDataGeneratorNumeric that is a part of the module, there was a problem with the order in which the features were recorded from the user-supplied parameter file. The basic code for decision tree construction and classification remains unchanged.

Version 2.25 further downshifts the required version of Perl for this module. This was a result of testing the module with Version 5.10.1 of Perl. Only one statement in the module code needed to be changed for the module to work with the older version of Perl.

Version 2.24 fixes the Makefile.PL restriction on the required Perl version. This version should work with Perl versions 5.14.0 and higher.

Version 2.23 changes the required version of Perl from 5.18.0 to 5.14.0. Everything else remains the same.

Version 2.22 should prove more robust when the probability distribution for the values of a feature is expected to be heavy-tailed; that is, when the supposedly rare observations can occur with significant probabilities. A new option in the DecisionTree constructor lets the user specify the precision with which the probability distributions are estimated for such features.

Version 2.21 fixes a bug that was caused by the explicitly set zero values for numerical features being misconstrued as "false" in the conditional statements in some of the method definitions.

Version 2.2 makes it easier to write code for classifying in one go all of your test data samples in a CSV file. The bulk classifications obtained can be written out to either a CSV file or to a regular text file. See the script classify_test_data_in_a_file_numeric.pl in the Examples directory for how to classify all of your test data records in a CSV file. This version also includes improved code for generating synthetic numeric/symbolic training and test data records for experimenting with the decision tree classifier.

Version 2.1 allows you to test the quality of your training data by running a 10-fold cross-validation test on the data. This test divides all of the training data into ten parts, with nine parts used for training a decision tree and one part used for testing its ability to classify correctly. This selection of nine parts for training and one part for testing is carried out in all of the ten different ways that are possible. This testing functionality in Version 2.1 can also be used to find the best values to use for the constructor parameters entropy_threshold, max_depth_desired, and symbolic_to_numeric_cardinality_threshold.

Version 2.0 is a major rewrite of this module. Now you can use both numeric and symbolic features for constructing a decision tree. A feature is numeric if it can take any floating-point value over an interval.

Version 1.71 fixes a bug in the code that was triggered by 0 being declared as one of the features values in the training datafile. Version 1.71 also include an additional safety feature that is useful for training datafiles that contain a very large number of features. The new version makes sure that the number of values you declare for each sample record matches the number of features declared at the beginning of the training datafile.

Version 1.7 includes safety checks on the consistency of the data you place in your training datafile. When a training file contains thousands of samples, it is difficult to manually check that you used the same class names in your sample records that you declared at top of your training file or that the values you have for your features are legal vis-a-vis the earlier declarations of the values in the training file. Another safety feature incorporated in this version is the non-consideration of classes that are declared at the top of the training file but that have no sample records in the file.

Version 1.6 uses probability caching much more extensively compared to the previous versions. This should result in faster construction of large decision trees. Another new feature in Version 1.6 is the use of a decision tree for interactive classification. In this mode, after you have constructed a decision tree from the training data, the user is prompted for answers to the questions pertaining to the feature tests at the nodes of the tree.

Some key elements of the documentation were cleaned up and made more readable in Version 1.41. The implementation code remains unchanged from Version 1.4.

Version 1.4 should make things faster (and easier) for folks who want to use this module with training data that creates very large decision trees (that is, trees with tens of thousands or more decision nodes). The speedup in Version 1.4 has been achieved by eliminating duplicate calculation of probabilities as the tree grows. In addition, this version provides an additional constructor parameter, max_depth_desired for controlling the size of the decision tree. This is in addition to the tree size control achieved by the parameter entropy_threshold that was introduced in Version 1.3. Since large decision trees can take a long time to create, you may find yourself wishing you could store the tree you just created in a disk file and that, subsequently, you could use the stored tree for classification work. The Examples directory contains two scripts, store_dt_on_disk.pl and classify_from_disk_stored_dt.pl, that show how you can do exactly that with the help of Perl's Storable module.

Version 1.3 addresses the issue that arises when the header of a training datafile declares a certain possible value for a feature but that (feature,value) pair does NOT show up anywhere in the training data. Version 1.3 also makes it possible for a user to control the size of the decision tree by changing the value of the parameter entropy_threshold. Additionally, Version 1.3 includes a method called determine_data_condition() that displays useful information regarding the size and some other attributes of the training data. It also warns the user if the training data might result in a decision tree that would simply be much too large --- unless the user controls the size with the entropy_threshold parameter.

In addition to the removal of a couple of serious bugs, version 1.2 incorporates a number of enhancements: (1) Version 1.2 includes checks on the names of the features and values used in test data --- this is the data you want to classify with the decision tree classifier constructed by this module. (2) Version 1.2 includes a separate constructor for generating test data. To make it easier to generate test data whose probabilistic parameters may not be identical to that used for the training data, I have used separate routines for generating the test data. (3) Version 1.2 also includes in its examples directory a script that classifies the test data in a file and outputs the class labels into another file. This is for folks who do not wish to write their own scripts using this module. (4) Version 1.2 also includes addition to the documentation regarding the issue of numeric values for features.

DESCRIPTION

Algorithm::DecisionTree is a perl5 module for constructing a decision tree from a training datafile containing multidimensional data. In one form or another, decision trees have been around for about fifty years. From a statistical perspective, they are closely related to classification and regression by recursive partitioning of multidimensional data. Early work that demonstrated the usefulness of such partitioning of data for classification and regression can be traced to the work of Terry Therneau in the early 1980's in the statistics community, and to the work of Ross Quinlan in the mid 1990's in the machine learning community.

For those not familiar with decision tree ideas, the traditional way to classify multidimensional data is to start with a feature space whose dimensionality is the same as that of the data. Each feature in this space corresponds to the attribute that each dimension of the data measures. You then use the training data to carve up the feature space into different regions, each corresponding to a different class. Subsequently, when you try to classify a new data sample, you locate it in the feature space and find the class label of the region to which it belongs. One can also give the new data point the same class label as that of the nearest training sample. This is referred to as the nearest neighbor classification. There exist hundreds of variations of varying power on these two basic approaches to the classification of multidimensional data.

A decision tree classifier works differently. When you construct a decision tree, you select for the root node a feature test that partitions the training data in a way that causes maximal disambiguation of the class labels associated with the data. In terms of information content as measured by entropy, such a feature test would cause maximum reduction in class entropy in going from all of the training data taken together to the data as partitioned by the feature test. You then drop from the root node a set of child nodes, one for each partition of the training data created by the feature test at the root node. When your features are purely symbolic, you'll have one child node for each value of the feature chosen for the feature test at the root. When the test at the root involves a numeric feature, you find the decision threshold for the feature that best bipartitions the data and you drop from the root node two child nodes, one for each partition. Now at each child node you pose the same question that you posed when you found the best feature to use at the root: Which feature at the child node in question would maximally disambiguate the class labels associated with the training data corresponding to that child node?

As the reader would expect, the two key steps in any approach to decision-tree based classification are the construction of the decision tree itself from a file containing the training data, and then using the decision tree thus obtained for classifying new data.

What is cool about decision tree classification is that it gives you soft classification, meaning it may associate more than one class label with a given data vector. When this happens, it may mean that your classes are indeed overlapping in the underlying feature space. It could also mean that you simply have not supplied sufficient training data to the decision tree classifier. For a tutorial introduction to how a decision tree is constructed and used, visit https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf

This module also allows you to generate your own synthetic training and test data. Generating your own training data, using it for constructing a decision-tree classifier, and subsequently testing the classifier on a synthetically generated test set of data is a good way to develop greater proficiency with decision trees.

WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE

If you are new to the concept of a decision tree, their practical utility is best understood with an example that only involves symbolic features. However, as mentioned earlier, versions of the module higher than 2.0 allow you to use both symbolic and numeric features.

Consider the following scenario: Let's say you are running a small investment company that employs a team of stockbrokers who make buy/sell decisions for the customers of your company. Assume that your company has asked the traders to make each investment decision on the basis of the following four criteria:

price_to_earnings_ratio   (P_to_E)

price_to_sales_ratio      (P_to_S)

return_on_equity          (R_on_E)

market_share              (MS)

Since you are the boss, you keep track of the buy/sell decisions made by the individual traders. But one unfortunate day, all of your traders decide to quit because you did not pay them enough. So what do you do? If you had a module like the one here, you could still run your company and do so in such a way that, on the average, would do better than any of the individual traders who worked for your company. This is what you do: You pool together the individual trader buy/sell decisions you have accumulated during the last one year. This pooled information is likely to look like:

example      buy/sell     P_to_E     P_to_S     R_on_E      MS
============================================================+=

example_1     buy          high       low        medium    low
example_2     buy          medium     medium     low       low
example_3     sell         low        medium     low       high
....
....

This data, when formatted according to CSV, would constitute your training file. You could feed this file into the module by calling:

my $dt = Algorithm::DecisionTree->new( 
                 training_datafile => $training_datafile,
                 csv_class_column_index => 1,
                 csv_columns_for_features => [2,3,4,5],
         );
$dt->get_training_data(); 
$dt->calculate_first_order_probabilities();
$dt->calculate_class_priors();

Subsequently, you would construct a decision tree by calling:

my $root_node = $dt->construct_decision_tree_classifier();

Now you and your company (with practically no employees) are ready to service the customers again. Suppose your computer needs to make a buy/sell decision about an investment prospect that is best described by:

price_to_earnings_ratio  =  low
price_to_sales_ratio     =  very_low
return_on_equity         =  none
market_share             =  medium    

All that your computer would need to do would be to construct a data vector like

my @data =   qw / P_to_E=low
                  P_to_S=very_low
                  R_on_E=none
                  MS=medium /;

and call the decision tree classifier you just constructed by

$dt->classify($root_node, \@data); 

The answer returned will be 'buy' and 'sell', along with the associated probabilities. So if the probability of 'buy' is considerably greater than the probability of 'sell', that's what you should instruct your computer to do.

The chances are that, on the average, this approach would beat the performance of any of your individual traders who worked for you previously since the buy/sell decisions made by the computer would be based on the collective wisdom of all your previous traders. DISCLAIMER: There is obviously a lot more to good investing than what is captured by the silly little example here. However, it does nicely the convey the sense in which the current module could be used.

SYMBOLIC FEATURES VERSUS NUMERIC FEATURES

A feature is symbolic when its values are compared using string comparison operators. By the same token, a feature is numeric when its values are compared using numeric comparison operators. Having said that, features that take only a small number of numeric values in the training data can be treated symbolically provided you are careful about handling their values in the test data. At the least, you have to set the test data value for such a feature to its closest value in the training data. The module does that automatically for you for those numeric features for which the number different numeric values is less than a user-specified threshold. For those numeric features that the module is allowed to treat symbolically, this snapping of the values of the features in the test data to the small set of values in the training data is carried out automatically by the module. That is, after a user has told the module which numeric features to treat symbolically, the user need not worry about how the feature values appear in the test data.

The constructor parameter symbolic_to_numeric_cardinality_threshold let's you tell the module when to consider an otherwise numeric feature symbolically. Suppose you set this parameter to 10, that means that all numeric looking features that take 10 or fewer different values in the training datafile will be considered to be symbolic features by the module. See the tutorial at https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf for further information on the implementation issues related to the symbolic and numeric features.

FEATURES WITH NOT SO "NICE" STATISTICAL PROPERTIES

For the purpose of estimating the probabilities, it is necessary to sample the range of values taken on by a numerical feature. For features with "nice" statistical properties, this sampling interval is set to the median of the differences between the successive feature values in the training data. (Obviously, as you would expect, you first sort all the values for a feature before computing the successive differences.) This logic will not work for the sort of a feature described below.

Consider a feature whose values are heavy-tailed, and, at the same time, the values span a million to one range. What I mean by heavy-tailed is that rare values can occur with significant probabilities. It could happen that most of the values for such a feature are clustered at one of the two ends of the range. At the same time, there may exist a significant number of values near the end of the range that is less populated. (Typically, features related to human economic activities --- such as wealth, incomes, etc. --- are of this type.) With the logic described in the previous paragraph, you could end up with a sampling interval that is much too small, which could result in millions of sampling points for the feature if you are not careful.

Beginning with Version 2.22, you have two options in dealing with such features. You can choose to go with the default behavior of the module, which is to sample the value range for such a feature over a maximum of 500 points. Or, you can supply an additional option to the constructor that sets a user-defined value for the number of points to use. The name of the option is number_of_histogram_bins. The following script

construct_dt_for_heavytailed.pl 

in the Examples directory shows an example of how to call the constructor of the module with the number_of_histogram_bins option.

TESTING THE QUALITY OF YOUR TRAINING DATA

Versions 2.1 and higher include a new class named EvalTrainingData, derived from the main class DecisionTree, that runs a 10-fold cross-validation test on your training data to test its ability to discriminate between the classes mentioned in the training file.

The 10-fold cross-validation test divides all of the training data into ten parts, with nine parts used for training a decision tree and one part used for testing its ability to classify correctly. This selection of nine parts for training and one part for testing is carried out in all of the ten different possible ways.

The following code fragment illustrates how you invoke the testing function of the EvalTrainingData class:

my $training_datafile = "training.csv";                                         
my $eval_data = EvalTrainingData->new(
                              training_datafile => $training_datafile,
                              csv_class_column_index => 1,
                              csv_columns_for_features => [2,3],
                              entropy_threshold => 0.01,
                              max_depth_desired => 3,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              csv_cleanup_needed => 1,
                );
$eval_data->get_training_data();
$eval_data->evaluate_training_data()

The last statement above prints out a Confusion Matrix and the value of Training Data Quality Index on a scale of 0 to 100, with 100 designating perfect training data. The Confusion Matrix shows how the different classes were mislabeled in the 10-fold cross-validation test.

This testing functionality can also be used to find the best values to use for the constructor parameters entropy_threshold, max_depth_desired, and symbolic_to_numeric_cardinality_threshold.

The following two scripts in the Examples directory illustrate the use of the EvalTrainingData class for testing the quality of your data:

evaluate_training_data1.pl
evaluate_training_data2.pl

HOW TO MAKE THE BEST CHOICES FOR THE CONSTRUCTOR PARAMETERS

Assuming your training data is good, the quality of the results you get from a decision tree would depend on the choices you make for the constructor parameters entropy_threshold, max_depth_desired, and symbolic_to_numeric_cardinality_threshold. You can optimize your choices for these parameters by running the 10-fold cross-validation test that is made available in Versions 2.2 and higher through the new class EvalTrainingData that is included in the module file. A description of how to run this test is in the previous section of this document.

DECISION TREE INTROSPECTION

Starting with Version 2.30, you can ask the DTIntrospection class of the module to explain the classification decisions made at the different nodes of the decision tree.

Perhaps the most important bit of information you are likely to seek through DT introspection is the list of the training samples that fall directly in the portion of the feature space that is assigned to a node.

However, note that, when training samples are non-uniformly distributed in the underlying feature space, it is possible for a node to exist even when there are no training samples in the portion of the feature space assigned to the node. That is because the decision tree is constructed from the probability densities estimated from the training data. When the training samples are non-uniformly distributed, it is entirely possible for the estimated probability densities to be non-zero in a small region around a point even when there are no training samples specifically in that region. (After you have created a statistical model for, say, the height distribution of people in a community, the model may return a non-zero probability for the height values in a small interval even if the community does not include a single individual whose height falls in that interval.)

That a decision-tree node can exist even when there are no training samples in that portion of the feature space that belongs to the node is an important indication of the generalization ability of a decision-tree-based classifier.

In light of the explanation provided above, before the DTIntrospection class supplies any answers at all, it asks you to accept the fact that features can take on non-zero probabilities at a point in the feature space even though there are zero training samples at that point (or in a small region around that point). If you do not accept this rudimentary fact, the introspection class will not yield any answers (since you are not going to believe the answers anyway).

The point made above implies that the path leading to a node in the decision tree may test a feature for a certain value or threshold despite the fact that the portion of the feature space assigned to that node is devoid of any training data.

See the following three scripts in the Examples directory for how to carry out DT introspection:

introspection_in_a_loop_interactive.pl

introspection_show_training_samples_at_all_nodes_direct_influence.pl

introspection_show_training_samples_to_nodes_influence_propagation.pl

The first script places you in an interactive session in which you will first be asked for the node number you are interested in. Subsequently, you will be asked for whether or not you are interested in specific questions that the introspection can provide answers for. The second script descends down the decision tree and shows for each node the training samples that fall directly in the portion of the feature space assigned to that node. The third script shows for each training sample how it affects the decision-tree nodes either directly or indirectly through the generalization achieved by the probabilistic modeling of the data.

The output of the script introspection_show_training_samples_at_all_nodes_direct_influence.pl looks like:

Node 0: the samples are: None
Node 1: the samples are: [sample_46 sample_58]
Node 2: the samples are: [sample_1 sample_4 sample_7 .....]
Node 3: the samples are: []
Node 4: the samples are: []
...
...            

The nodes for which no samples are listed come into existence through the generalization achieved by the probabilistic modeling of the data.

The output produced by the script introspection_show_training_samples_to_nodes_influence_propagation.pl looks like

sample_1:                                                                 
   nodes affected directly: [2 5 19 23]                                
   nodes affected through probabilistic generalization:                   
        2=> [3 4 25]                                                    
            25=> [26]                                                     
        5=> [6]                                                           
            6=> [7 13]                                                   
                7=> [8 11]                                               
                    8=> [9 10]                                           
                    11=> [12]                                             
                13=> [14 18]                                             
                    14=> [15 16]                                         
                        16=> [17]                                         
        19=> [20]                                                         
            20=> [21 22]                                                 
        23=> [24]                                                         
                         
sample_4:                                                                 
   nodes affected directly: [2 5 6 7 11]                              
   nodes affected through probabilistic generalization:                   
        2=> [3 4 25]                                                    
            25=> [26]                                                     
        5=> [19]                                                          
            19=> [20 23]                                                 
                20=> [21 22]                                             
                23=> [24]                                                 
        6=> [13]                                                          
            13=> [14 18]                                                 
                14=> [15 16]                                             
                    16=> [17]                                             
        7=> [8]                                                           
            8=> [9 10]                                                   
        11=> [12]                                                         
                                                                          
...                                                                       
...  
...

For each training sample, the display shown above first presents the list of nodes that are directly affected by the sample. A node is affected directly by a sample if the latter falls in the portion of the feature space that belongs to the former. Subsequently, for each training sample, the display shows a subtree of the nodes that are affected indirectly by the sample through the generalization achieved by the probabilistic modeling of the data. In general, a node is affected indirectly by a sample if it is a descendant of another node that is affected directly.

Also see the section titled The Introspection API regarding how to invoke the introspection capabilities of the module in your own code.

METHODS

The module provides the following methods for constructing a decision tree from training data in a disk file and for classifying new data records with the decision tree thus constructed:

new():
my $dt = Algorithm::DecisionTree->new( 
                          training_datafile => $training_datafile,
                          csv_class_column_index => 2,
                          csv_columns_for_features => [3,4,5,6,7,8],
                          entropy_threshold => 0.01,
                          max_depth_desired => 8,
                          symbolic_to_numeric_cardinality_threshold => 10,
                          csv_cleanup_needed => 1,
        );

A call to new() constructs a new instance of the Algorithm::DecisionTree class. For this call to make sense, the training data in the training datafile must be in the CSV format.

The Constructor Parameters

training_datafile:

This parameter supplies the name of the file that contains the training data.

csv_class_column_index:

When using a CSV file for your training data, this parameter supplies the zero-based column index for the column that contains the class label for each data record in the training file.

csv_cleanup_needed:

You need to set this parameter to 1 if your CSV file has double quoted strings (which may include commas) as values for the fields and if such values are allowed to include commas for, presumably, better readability.

csv_columns_for_features:

When using a CSV file for your training data, this parameter supplies a list of columns corresponding to the features you wish to use for decision tree construction. Each column is specified by its zero-based index.

entropy_threshold:

This parameter sets the granularity with which the entropies are sampled by the module. For example, a feature test at a node in the decision tree is acceptable if the entropy gain achieved by the test exceeds this threshold. The larger the value you choose for this parameter, the smaller the tree. Its default value is 0.001.

max_depth_desired:

This parameter sets the maximum depth of the decision tree. For obvious reasons, the smaller the value you choose for this parameter, the smaller the tree.

symbolic_to_numeric_cardinality_threshold:

This parameter allows the module to treat an otherwise numeric feature symbolically if the number of different values the feature takes in the training data file does not exceed the value of this parameter.

number_of_histogram_bins:

This parameter gives the user the option to set the number of points at which the value range for a feature should be sampled for estimating the probabilities. This parameter is effective only for those features that occupy a large value range and whose probability distributions are heavy tailed. This parameter is also important when you have a very large training dataset: In general, the larger the dataset, the smaller the smallest difference between any two values for a numeric feature in relation to the overall range of values for that feature. In such cases, the module may use too large a number of bins for estimating the probabilities and that may slow down the calculation of the decision tree. You can get around this difficulty by explicitly giving a value to the 'number_of_histogram_bins' parameter.

You can choose the best values to use for the last three constructor parameters by running a 10-fold cross-validation test on your training data through the class EvalTrainingData that comes with Versions 2.1 and higher of this module. See the section "TESTING THE QUALITY OF YOUR TRAINING DATA" of this document page.

get_training_data():

After you have constructed a new instance of the Algorithm::DecisionTree class, you must now read in the training data that is the file named in the call to the constructor. This you do by:

$dt->get_training_data(); 
show_training_data():

If you wish to see the training data that was just digested by the module, call

$dt->show_training_data(); 
calculate_first_order_probabilities():
calculate_class_priors():

After the module has read the training data file, it needs to initialize the probability cache. This you do by invoking:

$dt->calculate_first_order_probabilities()
$dt->calculate_class_priors() 
construct_decision_tree_classifier():

With the probability cache initialized, it is time to construct a decision tree classifier. This you do by

my $root_node = $dt->construct_decision_tree_classifier();

This call returns an instance of type DTNode. The DTNode class is defined within the main package file. So, don't forget, that $root_node in the above example call will be instantiated to an object of type DTNode.

$root_node->display_decision_tree(" "):
$root_node->display_decision_tree("   ");

This will display the decision tree in your terminal window by using a recursively determined offset for each node as the display routine descends down the tree.

I have intentionally left the syntax fragment $root_node in the above call to remind the reader that display_decision_tree() is NOT called on the instance of the DecisionTree we constructed earlier, but on the DTNode instance returned by the call to construct_decision_tree_classifier().

classify($root_node, \@test_sample):

Let's say you want to classify the following data record:

my @test_sample  = qw /  g2=4.2
                         grade=2.3
                         gleason=4
                         eet=1.7
                         age=55.0
                         ploidy=diploid /;

you'd make the following call:

my $classification = $dt->classify($root_node, \@test_sample);

where, again, $root_node is an instance of type DTNode returned by the call to construct_decision_tree_classifier(). The variable $classification holds a reference to a hash whose keys are the class names and whose values the associated probabilities. The hash that is returned by the above call also includes a special key-value pair for a key named solution_path. The value associated with this key is an anonymous array that holds the path, in the form of a list of nodes, from the root node to the leaf node in the decision tree where the final classification was made.

classify_by_asking_questions($root_node):

This method allows you to use a decision-tree based classifier in an interactive mode. In this mode, a user is prompted for answers to the questions pertaining to the feature tests at the nodes of the tree. The syntax for invoking this method is:

my $classification = $dt->classify_by_asking_questions($root_node);

where $dt is an instance of the Algorithm::DecisionTree class returned by a call to new() and $root_node the root node of the decision tree returned by a call to construct_decision_tree_classifier().

THE INTROSPECTION API

To construct an instance of DTIntrospection, you call

my $introspector = DTIntrospection->new($dt);

where you supply the instance of the DecisionTree class you used for constructing the decision tree through the parameter $dt. After you have constructed an instance of the introspection class, you must initialize it by

$introspector->initialize();

Subsequently, you can invoke either of the following methods:

$introspector->explain_classification_at_one_node($node);

$introspector->explain_classifications_at_multiple_nodes_interactively();

depending on whether you want introspection at a single specified node or inside an infinite loop for an arbitrary number of nodes.

If you want to output a tabular display that shows for each node in the decision tree all the training samples that fall in the portion of the feature space that belongs to that node, call

$introspector->display_training_samples_at_all_nodes_direct_influence_only();

If you want to output a tabular display that shows for each training sample a list of all the nodes that are affected directly AND indirectly by that sample, call

$introspector->display_training_training_samples_to_nodes_influence_propagation();

A training sample affects a node directly if the sample falls in the portion of the features space assigned to that node. On the other hand, a training sample is considered to affect a node indirectly if the node is a descendant of a node that is affected directly by the sample.

BULK CLASSIFICATION OF DATA RECORDS

For large test datasets, you would obviously want to process an entire file of test data at a time. The following scripts in the Examples directory illustrate how you can do that:

classify_test_data_in_a_file.pl

This script requires three command-line arguments, the first argument names the training datafile, the second the test datafile, and the third the file in which the classification results are to be deposited.

The other examples directories, ExamplesBagging, ExamplesBoosting, and ExamplesRandomizedTrees, also contain scripts that illustrate how to carry out bulk classification of data records when you wish to take advantage of bagging, boosting, or tree randomization. In their respective directories, these scripts are named:

bagging_for_bulk_classification.pl
boosting_for_bulk_classification.pl
classify_database_records.pl

HOW THE CLASSIFICATION RESULTS ARE DISPLAYED

It depends on whether you apply the classifier at once to all the data samples in a file, or whether you feed one data sample at a time into the classifier.

In general, the classifier returns soft classification for a test data vector. What that means is that, in general, the classifier will list all the classes to which a given data vector could belong and the probability of each such class label for the data vector. Run the examples scripts in the Examples directory to see how the output of classification can be displayed.

With regard to the soft classifications returned by this classifier, if the probability distributions for the different classes overlap in the underlying feature space, you would want the classifier to return all of the applicable class labels for a data vector along with the corresponding class probabilities. Another reason for why the decision tree classifier may associate significant probabilities with multiple class labels is that you used inadequate number of training samples to induce the decision tree. The good thing is that the classifier does not lie to you (unlike, say, a hard classification rule that would return a single class label corresponding to the partitioning of the underlying feature space). The decision tree classifier give you the best classification that can be made given the training data you fed into it.

USING BAGGING

Starting with Version 3.0, you can use the class DecisionTreeWithBagging that comes with the module to incorporate bagging in your decision tree based classification. Bagging means constructing multiple decision trees for different (possibly overlapping) segments of the data extracted from your training dataset and then aggregating the decisions made by the individual decision trees for the final classification. The aggregation of the classification decisions can average out the noise and bias that may otherwise affect the classification decision obtained from just one tree.

Calling the bagging constructor::

A typical call to the constructor for the DecisionTreeWithBagging class looks like:

use Algorithm::DecisionTreeWithBagging;

my $training_datafile = "stage3cancer.csv";

my $dtbag = Algorithm::DecisionTreeWithBagging->new(
                              training_datafile => $training_datafile,
                              csv_class_column_index => 2,
                              csv_columns_for_features => [3,4,5,6,7,8],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              how_many_bags => 4,
                              bag_overlap_fraction => 0.2,
                              csv_cleanup_needed => 1,
             );

Note in particular the following two constructor parameters:

how_many_bags

bag_overlap_fraction

where, as the name implies, the parameter how_many_bags controls how many bags (and, therefore, how many decision trees) will be constructed from your training dataset; and where the parameter bag_overlap_fraction controls the degree of overlap between the bags. To understand what exactly is achieved by setting the parameter bag_overlap_fraction to 0.2 in the above example, let's say that the non-overlapping partitioning of the training data between the bags results in 100 training samples per bag. With bag_overlap_fraction set to 0.2, additional 20 samples drawn randomly from the other bags will be added to the data in each bag.

Methods defined for DecisionTreeWithBagging class

get_training_data_for_bagging():

This method reads your training datafile, randomizes it, and then partitions it into the specified number of bags. Subsequently, if the constructor parameter bag_overlap_fraction is non-zero, it adds to each bag additional samples drawn at random from the other bags. The number of these additional samples added to each bag is controlled by the constructor parameter bag_overlap_fraction. If this parameter is set to, say, 0.2, the size of each bag will grow by 20% with the samples drawn from the other bags.

show_training_data_in_bags():

Shows for each bag the names of the training data samples in that bag.

calculate_first_order_probabilities():

Calls on the appropriate methods of the main DecisionTree class to estimate the first-order probabilities from the data samples in each bag.

calculate_class_priors():

Calls on the appropriate method of the main DecisionTree class to estimate the class priors for the data samples in each bag.

construct_decision_trees_for_bags():

Calls on the appropriate method of the main DecisionTree class to construct a decision tree from the training data in each bag.

display_decision_trees_for_bags():

Display separately the decision tree for each bag..

classify_with_bagging( test_sample ):

Calls on the appropriate methods of the main DecisionTree class to classify the argument test sample.

display_classification_results_for_each_bag():

Displays separately the classification decision made by each the decision tree constructed for each bag.

get_majority_vote_classification():

Using majority voting, this method aggregates the classification decisions made by the individual decision trees into a single decision.

See the example scripts in the directory bagging_examples for how to call these methods for classifying individual samples and for bulk classification when you place all your test samples in a single file.

USING BOOSTING

Starting with Version 3.20, you can use the class BoostedDecisionTree for constructing a boosted decision-tree classifier. Boosting results in a cascade of decision trees in which each decision tree is constructed with samples that are mostly those that are misclassified by the previous decision tree. To be precise, you create a probability distribution over the training samples for the selection of samples for training each decision tree in the cascade. To start out, the distribution is uniform over all of the samples. Subsequently, this probability distribution changes according to the misclassifications by each tree in the cascade: if a sample is misclassified by a given tree in the cascade, the probability of its being selected for training the next tree is increased significantly. You also associate a trust factor with each decision tree depending on its power to classify correctly all of the training data samples. After a cascade of decision trees is constructed in this manner, you construct a final classifier that calculates the class label for a test data sample by taking into account the classification decisions made by each individual tree in the cascade, the decisions being weighted by the trust factors associated with the individual classifiers. These boosting notions --- generally referred to as the AdaBoost algorithm --- are based on a now celebrated paper "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" by Yoav Freund and Robert Schapire that appeared in 1995 in the Proceedings of the 2nd European Conf. on Computational Learning Theory. For a tutorial introduction to AdaBoost, see https://engineering.purdue.edu/kak/Tutorials/AdaBoost.pdf

Keep in mind the fact that, ordinarily, the theoretical guarantees provided by boosting apply only to the case of binary classification. Additionally, your training dataset must capture all of the significant statistical variations in the classes represented therein.

Calling the BoostedDecisionTree constructor:

If you'd like to experiment with boosting, a typical call to the constructor for the BoostedDecisionTree class looks like:

use Algorithm::BoostedDecisionTree;
my $training_datafile = "training6.csv";
my $boosted = Algorithm::BoostedDecisionTree->new(
                          training_datafile => $training_datafile,
                          csv_class_column_index => 1,
                          csv_columns_for_features => [2,3],
                          entropy_threshold => 0.01,
                          max_depth_desired => 8,
                          symbolic_to_numeric_cardinality_threshold => 10,
                          how_many_stages => 4,
                          csv_cleanup_needed => 1,
              );

Note in particular the constructor parameter:

how_many_stages

As its name implies, this parameter controls how many stages will be used in the boosted decision tree classifier. As mentioned above, a separate decision tree is constructed for each stage of boosting using a set of training samples that are drawn through a probability distribution maintained over the entire training dataset.

Methods defined for BoostedDecisionTree class

get_training_data_for_base_tree():

This method reads your training datafile, creates the data structures from the data ingested for constructing the base decision tree.

show_training_data_for_base_tree():

Writes to the standard output the training data samples and also some relevant properties of the features used in the training dataset.

calculate_first_order_probabilities_and_class_priors():

Calls on the appropriate methods of the main DecisionTree class to estimate the first-order probabilities and the class priors.

construct_base_decision_tree():

Calls on the appropriate method of the main DecisionTree class to construct the base decision tree.

display_base_decision_tree():

Displays the base decision tree in your terminal window. (The textual form of the decision tree is written out to the standard output.)

construct_cascade_of_trees():

Uses the AdaBoost algorithm to construct a cascade of decision trees. As mentioned earlier, the training samples for each tree in the cascade are drawn using a probability distribution over the entire training dataset. This probability distribution for any given tree in the cascade is heavily influenced by which training samples are misclassified by the previous tree.

display_decision_trees_for_different_stages():

Displays separately in your terminal window the decision tree constructed for each stage of the cascade. (The textual form of the trees is written out to the standard output.)

classify_with_boosting( $test_sample ):

Calls on each decision tree in the cascade to classify the argument $test_sample.

display_classification_results_for_each_stage():

You can call this method to display in your terminal window the classification decision made by each decision tree in the cascade. The method also prints out the trust factor associated with each decision tree. It is important to look simultaneously at the classification decision and the trust factor for each tree --- since a classification decision made by a specific tree may appear bizarre for a given test sample. This method is useful primarily for debugging purposes.

show_class_labels_for_misclassified_samples_in_stage( $stage_index ):

As with the previous method, this method is useful mostly for debugging. It returns class labels for the samples misclassified by the stage whose integer index is supplied as an argument to the method. Say you have 10 stages in your cascade. The value of the argument stage_index would go from 0 to 9, with 0 corresponding to the base tree.

trust_weighted_majority_vote_classifier():

Uses the "final classifier" formula of the AdaBoost algorithm to pool together the classification decisions made by the individual trees while taking into account the trust factors associated with the trees. As mentioned earlier, we associate with each tree of the cascade a trust factor that depends on the overall misclassification rate associated with that tree.

See the example scripts in the ExamplesBoosting subdirectory for how to call the methods listed above for classifying individual data samples with boosting and for bulk classification when you place all your test samples in a single file.

USING RANDOMIZED DECISION TREES

Consider the following two situations that call for using randomized decision trees, meaning multiple decision trees that are trained using data extracted randomly from a large database of training samples:

(1) Consider a two-class problem for which the training database is grossly imbalanced in how many majority-class samples it contains vis-a-vis the number of minority class samples. Let's assume for a moment that the ratio of majority class samples to minority class samples is 1000 to 1. Let's also assume that you have a test dataset that is drawn randomly from the same population mixture from which the training database was created. Now consider a stupid data classification program that classifies everything as belonging to the majority class. If you measure the classification accuracy rate as the ratio of the number of samples correctly classified to the total number of test samples selected randomly from the population, this classifier would work with an accuracy of 99.99%.

(2) Let's now consider another situation in which we are faced with a huge training database but in which every class is equally well represented. Feeding all the data into a single decision tree would be akin to polling all of the population of the United States for measuring the Coke-versus-Pepsi preference in the country. You are likely to get better results if you construct multiple decision trees, each trained with a collection of training samples drawn randomly from the training database. After you have created all the decision trees, your final classification decision could then be based on, say, majority voting by the trees.

In summary, the RandomizedTreesForBigData class allows you to solve the following two problems: (1) Data classification using the needle-in-a-haystack metaphor, that is, when a vast majority of your training samples belong to just one class. And (2) You have access to a very large database of training samples and you wish to construct an ensemble of decision trees for classification.

Calling the RandomizedTreesForBigData constructor:

Here is how you'd call the RandomizedTreesForBigData constructor for needle-in-a-haystack classification:

use Algorithm::RandomizedTreesForBigData;
my $training_datafile = "your_database.csv";
my $rt = Algorithm::RandomizedTreesForBigData->new(
                          training_datafile => $training_datafile,
                          csv_class_column_index => 48,
                          csv_columns_for_features => [24,32,33,34,41],
                          entropy_threshold => 0.01,
                          max_depth_desired => 8,
                          symbolic_to_numeric_cardinality_threshold => 10,
                          how_many_trees => 5,
                          looking_for_needles_in_haystack => 1,
                          csv_cleanup_needed => 1,
         );

Note in particular the constructor parameters:

looking_for_needles_in_haystack
how_many_trees

The first of these parameters, looking_for_needles_in_haystack, invokes the logic for constructing an ensemble of decision trees, each based on a training dataset that uses all of the minority class samples, and a random drawing from the majority class samples.

Here is how you'd call the RandomizedTreesForBigData constructor for a more general attempt at constructing an ensemble of decision trees, with each tree trained with randomly drawn samples from a large database of training data (without paying attention to the differences in the sizes of the populations for the different classes):

use Algorithm::RandomizedTreesForBigData;
my $training_datafile = "your_database.csv";
my $rt = Algorithm::RandomizedTreesForBigData->new(
                          training_datafile => $training_datafile,
                          csv_class_column_index => 2,
                          csv_columns_for_features => [3,4,5,6,7,8],
                          entropy_threshold => 0.01,
                          max_depth_desired => 8,
                          symbolic_to_numeric_cardinality_threshold => 10,
                          how_many_trees => 3,
                          how_many_training_samples_per_tree => 50,
                          csv_cleanup_needed => 1,
         );

Note in particular the constructor parameters:

how_many_training_samples_per_tree
how_many_trees

When you set the how_many_training_samples_per_tree parameter, you are not allowed to also set the looking_for_needles_in_haystack parameter, and vice versa.

Methods defined for RandomizedTreesForBigData class

get_training_data_for_N_trees():

What this method does depends on which of the two constructor parameters --- looking_for_needles_in_haystack or how_many_training_samples_per_tree --- is set. When the former is set, it creates a collection of training datasets for how_many_trees number of decision trees, with each dataset being a mixture of the minority class and sample drawn randomly from the majority class. However, when the latter option is set, all the datasets are drawn randomly from the training database with no particular attention given to the relative populations of the two classes.

show_training_data_for_all_trees():

As the name implies, this method shows the training data being used for all the decision trees. This method is useful for debugging purposes using small datasets.

calculate_first_order_probabilities():

Calls on the appropriate method of the main DecisionTree class to estimate the first-order probabilities for the training dataset to be used for each decision tree.

calculate_class_priors():

Calls on the appropriate method of the main DecisionTree class to estimate the class priors for the training dataset to be used for each decision tree.

construct_all_decision_trees():

Calls on the appropriate method of the main DecisionTree class to construct the decision trees.

display_all_decision_trees():

Displays all the decision trees in your terminal window. (The textual form of the decision trees is written out to the standard output.)

classify_with_all_trees( $test_sample ):

The test_sample is sent to each decision tree for classification.

display_classification_results_for_all_trees():

The classification decisions returned by the individual decision trees are written out to the standard output.

get_majority_vote_classification()

This method aggregates the classification results returned by the individual decision trees and returns the majority decision.

CONSTRUCTING REGRESSION TREES:

Decision tree based modeling requires that the class labels be distinct. That is, the training dataset must contain a relatively small number of discrete class labels for all of your data records if you want to model the data with one or more decision trees. However, when one is trying to understand all of the associational relationships that exist in a large database, one often runs into situations where, instead of discrete class labels, you have a continuously valued variable as a dependent variable whose values are predicated on a set of feature values. It is for such situations that you will find useful the new class RegressionTree that is now a part of the DecisionTree module. The RegressionTree class has been programmed as a subclass of the main DecisionTree class.

You can think of regression with a regression tree as a powerful generalization of the very commonly used Linear Regression algorithms. Although you can certainly carry out polynomial regression with run-of-the-mill Linear Regression algorithms for modeling nonlinearities between the predictor variables and the dependent variable, specifying the degree of the polynomial is often tricky. Additionally, a polynomial can inject continuities between the predictor and the predicted variables that may not really exist in the real data. Regression trees, on the other hand, give you a piecewise linear relationship between the predictor and the predicted variables that is freed from the constraints of superimposed continuities at the joins between the different segments. See the following tutorial for further information regarding the standard linear regression approach and the regression that can be achieved with the RegressionTree class in this module: https://engineering.purdue.edu/kak/Tutorials/RegressionTree.pdf

The RegressionTree class in the current version of the module assumes that all of your data is numerical. That is, unlike what is possible with the DecisionTree class (and the other more closely related classes in this module) that allow your training file to contain a mixture of numerical and symbolic data, the RegressionTree class requires that ALL of your data be numerical. I hope to relax this constraint in future versions of this module. Obviously, the dependent variable will always be numerical for regression.

See the example scripts in the directory ExamplesRegression if you wish to become more familiar with the regression capabilities of the module.

Calling the RegressionTree constructor:
my $training_datafile = "gendata5.csv";
my $rt = Algorithm::RegressionTree->new(
                          training_datafile => $training_datafile,
                          dependent_variable_column => 2,
                          predictor_columns => [1],
                          mse_threshold => 0.01,
                          max_depth_desired => 2,
                          jacobian_choice => 0,
                          csv_cleanup_needed => 1,
         );

Note in particular the constructor parameters:

dependent_variable
predictor_columns
mse_threshold
jacobian_choice

The first of these parameters, dependent_variable, is set to the column index in the CSV file for the dependent variable. The second constructor parameter, predictor_columns, tells the system as to which columns contain values for the predictor variables. The third parameter, mse_threshold, is for deciding when to partition the data at a node into two child nodes as a regression tree is being constructed. If the minmax of MSE (Mean Squared Error) that can be achieved by partitioning any of the features at a node is smaller than mse_threshold, that node becomes a leaf node of the regression tree.

The last parameter, jacobian_choice, must be set to either 0 or 1 or 2. Its default value is 0. When this parameter equals 0, the regression coefficients are calculated using the linear least-squares method and no further "refinement" of the coefficients is carried out using gradient descent. This is the fastest way to calculate the regression coefficients. When jacobian_choice is set to 1, you get a weak version of gradient descent in which the Jacobian is set to the "design matrix" itself. Choosing 2 for jacobian_choice results in a more reasonable approximation to the Jacobian. That, however, is at a cost of much longer computation time. NOTE: For most cases, using 0 for jacobian_choice is the best choice. See my tutorial "Linear Regression and Regression Trees" for why that is the case.

Methods defined for RegressionTree class

get_training_data_for_regression():

Only CSV training datafiles are allowed. Additionally, the first record in the file must list the names of the fields, and the first column must contain an integer ID for each record.

construct_regression_tree():

As the name implies, this is the method that construct a regression tree.

display_regression_tree(" "):

Displays the regression tree, as the name implies. The white-space string argument specifies the offset to use in displaying the child nodes in relation to a parent node.

prediction_for_single_data_point( $root_node, $test_sample ):

You call this method after you have constructed a regression tree if you want to calculate the prediction for one sample. The parameter $root_node is what is returned by the call construct_regression_tree(). The formatting of the argument bound to the $test_sample parameter is important. To elaborate, let's say you are using two variables named $xvar1 and $xvar2 as your predictor variables. In this case, the $test_sample parameter will be bound to a list that will look like

['xvar1 = 23.4', 'xvar2 = 12.9'] 

Arbitrary amount of white space, including none, on the two sides of the equality symbol is allowed in the construct shown above. A call to this method returns a dictionary with two key-value pairs. One of the keys is called solution_path and the other prediction. The value associated with key solution_path is the path in the regression tree to the leaf node that yielded the prediction. And the value associated with the key prediction is the answer you are looking for.

predictions_for_all_data_used_for_regression_estimation( $root_node ):

This call calculates the predictions for all of the predictor variables data in your training file. The parameter $root_node is what is returned by the call to construct_regression_tree(). The values for the dependent variable thus predicted can be seen by calling display_all_plots(), which is the method mentioned below.

display_all_plots():

This method displays the results obtained by calling the prediction method of the previous entry. This method also creates a hardcopy of the plots and saves it as a .png disk file. The name of this output file is always regression_plots.png.

mse_for_tree_regression_for_all_training_samples( $root_node ):

This method carries out an error analysis of the predictions for the samples in your training datafile. It shows you the overall MSE (Mean Squared Error) with tree-based regression, the MSE for the data samples at each of the leaf nodes of the regression tree, and the MSE for the plain old Linear Regression as applied to all of the data. The parameter $root_node in the call syntax is what is returned by the call to construct_regression_tree().

bulk_predictions_for_data_in_a_csv_file( $root_node, $filename, $columns ):

Call this method if you want to apply the regression tree to all your test data in a disk file. The predictions for all of the test samples in the disk file are written out to another file whose name is the same as that of the test file except for the addition of _output in the name of the file. The parameter $filename is the name of the disk file that contains the test data. And the parameter $columns is a list of the column indices for the predictor variables in the test file.

GENERATING SYNTHETIC TRAINING DATA

The module file contains the following additional classes: (1) TrainingDataGeneratorNumeric, and (2) TrainingDataGeneratorSymbolic for generating synthetic training data.

The class TrainingDataGeneratorNumeric outputs one CSV file for the training data and another one for the test data for experimenting with numeric features. The numeric values are generated using a multivariate Gaussian distribution whose mean and covariance are specified in a parameter file. See the file param_numeric.txt in the Examples directory for an example of such a parameter file. Note that the dimensionality of the data is inferred from the information you place in the parameter file.

The class TrainingDataGeneratorSymbolic generates synthetic training for the purely symbolic case. The relative frequencies of the different possible values for the features is controlled by the biasing information you place in a parameter file. See param_symbolic.txt for an example of such a file.

THE Examples DIRECTORY

See the Examples directory in the distribution for how to construct a decision tree, and how to then classify new data using the decision tree. To become more familiar with the module, run the scripts

construct_dt_and_classify_one_sample_case1.pl
construct_dt_and_classify_one_sample_case2.pl
construct_dt_and_classify_one_sample_case3.pl
construct_dt_and_classify_one_sample_case4.pl

The first script is for the purely symbolic case, the second for the case that involves both numeric and symbolic features, the third for the case of purely numeric features, and the last for the case when the training data is synthetically generated by the script generate_training_data_numeric.pl.

Next run the following script as it is for bulk classification of data records placed in a CSV file:

classify_test_data_in_a_file.pl   training4.csv   test4.csv   out4.csv

The script first constructs a decision tree using the training data in the training file supplied by the first argument file training4.csv. The script then calculates the class label for each data record in the test data file supplied through the second argument file, test4.csv. The estimated class labels are written out to the output file which in the call shown above is out4.csv. An important thing to note here is that your test file --- in this case test4.csv --- must have a column for class labels. Obviously, in real-life situations, there will be no class labels in this column. What that is the case, you can place an empty string "" there for each data record. This is demonstrated by the following call:

classify_test_data_in_a_file.pl   training4.csv   test4_no_class_labels.csv   out4.csv

The following script in the Examples directory

classify_by_asking_questions.pl

shows how you can use a decision-tree classifier interactively. In this mode, you first construct the decision tree from the training data and then the user is prompted for answers to the feature tests at the nodes of the tree.

If your training data has a feature whose values span a large range and, at the same time, are characterized by a heavy-tail distribution, you should look at the script

construct_dt_for_heavytailed.pl                                                     

to see how to use the option number_of_histogram_bins in the call to the constructor. This option was introduced in Version 2.22 for dealing with such features. If you do not set this option, the module will use the default value of 500 for the number of points at which to sample the value range for such a feature.

The Examples directory also contains the following scripts:

generate_training_data_numeric.pl
generate_training_data_symbolic.pl

that show how you can use the module to generate synthetic training. Synthetic training is generated according to the specifications laid out in a parameter file. There are constraints on how the information is laid out in a parameter file. See the files param_numeric.txt and param_symbolic.txt in the Examples directory for how to structure these files.

The Examples directory of Versions 2.1 and higher of the module also contains the following two scripts:

evaluate_training_data1.pl
evaluate_training_data2.pl

that illustrate how the Perl class EvalTrainingData can be used to evaluate the quality of your training data (as long as it resides in a `.csv' file.) This new class is a subclass of the DecisionTree class in the module file. See the README in the Examples directory for further information regarding these two scripts.

The Examples directory of Versions 2.31 and higher of the module contains the following three scripts:

introspection_in_a_loop_interactive.pl

introspection_show_training_samples_at_all_nodes_direct_influence.pl

introspection_show_training_samples_to_nodes_influence_propagation.pl

The first script illustrates how to use the DTIntrospection class of the module interactively for generating explanations for the classification decisions made at the nodes of the decision tree. In the interactive session you are first asked for the node number you are interested in. Subsequently, you are asked for whether or not you are interested in specific questions that the introspector can provide answers for. The second script generates a tabular display that shows for each node of the decision tree a list of the training samples that fall directly in the portion of the feature space assigned that node. (As mentioned elsewhere in this documentation, when this list is empty for a node, that means the node is a result of the generalization achieved by probabilistic modeling of the data. Note that this module constructs a decision tree NOT by partitioning the set of training samples, BUT by partitioning the domains of the probability density functions.) The third script listed above also generates a tabular display, but one that shows how the influence of each training sample propagates in the tree. This display first shows the list of nodes that are affected directly by the data in a training sample. This list is followed by an indented display of the nodes that are affected indirectly by the training sample. A training sample affects a node indirectly if the node is a descendant of one of the nodes affected directly.

The latest addition to the Examples directory is the script:

get_indexes_associated_with_fields.py

As to why you may find this script useful, note that large database files may have hundreds of fields and it is not always easy to figure out what numerical index is associated with a given field. At the same time, the constructor of the DecisionTree module requires that the field that holds the class label and the fields that contain the feature values be specified by their numerical zero-based indexes. If you have a large database and you are faced with this problem, you can run this script to see the zero-based numerical index values associated with the different columns of your CSV file.

THE ExamplesBagging DIRECTORY

The ExamplesBagging directory contains the following scripts:

bagging_for_classifying_one_test_sample.pl
                                                                                           
bagging_for_bulk_classification.pl

As the names of the scripts imply, the first shows how to call the different methods of the DecisionTreeWithBagging class for classifying a single test sample. When you are classifying a single test sample, you can also see how each bag is classifying the test sample. You can, for example, display the training data used in each bag, the decision tree constructed for each bag, etc.

The second script is for the case when you place all of the test samples in a single file. The demonstration script displays for each test sample a single aggregate classification decision that is obtained through majority voting by all the decision trees.

THE ExamplesBoosting DIRECTORY

The ExamplesBoosting subdirectory in the main installation directory contains the following three scripts:

boosting_for_classifying_one_test_sample_1.pl

boosting_for_classifying_one_test_sample_2.pl

boosting_for_bulk_classification.pl

As the names of the first two scripts imply, these show how to call the different methods of the BoostedDecisionTree class for classifying a single test sample. When you are classifying a single test sample, you can see how each stage of the cascade of decision trees is classifying the test sample. You can also view each decision tree separately and also see the trust factor associated with the tree.

The third script is for the case when you place all of the test samples in a single file. The demonstration script outputs for each test sample a single aggregate classification decision that is obtained through trust-factor weighted majority voting by all the decision trees.

THE ExamplesRandomizedTrees DIRECTORY

The ExamplesRandomizedTrees directory shows example scripts that you can use to become more familiar with the RandomizedTreesForBigData class for solving needle-in-a-haystack and big-data data classification problems. These scripts are:

randomized_trees_for_classifying_one_test_sample_1.pl

randomized_trees_for_classifying_one_test_sample_2.pl

classify_database_records.pl

The first script shows the constructor options to use for solving a needle-in-a-haystack problem --- that is, a problem in which a vast majority of the training data belongs to just one class. The second script shows the constructor options for using randomized decision trees for the case when you have access to a very large database of training samples and you'd like to construct an ensemble of decision trees using training samples pulled randomly from the training database. The last script illustrates how you can evaluate the classification power of an ensemble of decision trees as constructed by RandomizedTreesForBigData by classifying a large number of test samples extracted randomly from the training database.

THE ExamplesRegression DIRECTORY

The ExamplesRegression subdirectory in the main installation directory shows example scripts that you can use to become familiar with regression trees and how they can be used for nonlinear regression. If you are new to the concept of regression trees, start by executing the following scripts without changing them and see what sort of output is produced by them:

regression4.pl

regression5.pl

regression6.pl

regression8.pl

The regression4.pl script involves only one predictor variable and one dependent variable. The training data for this exercise is drawn from the file gendata4.csv. This data file contains strongly nonlinear data. When you run the script regression4.pl, you will see how much better the result from tree regression is compared to what you can get with linear regression.

The regression5.pl script is essentially the same as the previous script except for the fact that the training datafile used in this case, gendata5.csv, consists of three noisy segments, as opposed to just two in the previous case.

The script regression6.pl deals with the case when we have two predictor variables and one dependent variable. You can think of the data as consisting of noisy height values over an (x1,x2) plane. The data used in this script is drawn from the csv file gen3Ddata1.csv.

Finally, the script regression8.pl shows how you can carry out bulk prediction for all your test data records in a disk file. The script writes all the calculated predictions into another disk file whose name is derived from the name of the test data file.

EXPORT

None by design.

BUGS

Please notify the author if you encounter any bugs. When sending email, please place the string 'DecisionTree' in the subject line.

INSTALLATION

Download the archive from CPAN in any directory of your choice. Unpack the archive with a command that on a Linux machine would look like:

tar zxvf Algorithm-DecisionTree-3.43.tar.gz

This will create an installation directory for you whose name will be Algorithm-DecisionTree-3.43. Enter this directory and execute the following commands for a standard install of the module if you have root privileges:

perl Makefile.PL
make
make test
sudo make install

If you do not have root privileges, you can carry out a non-standard install the module in any directory of your choice by:

perl Makefile.PL prefix=/some/other/directory
make
make test
make install

With a non-standard install, you may also have to set your PERL5LIB environment variable so that this module can find the required other modules. How you do that would depend on what platform you are working on. In order to install this module in a Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment variable by

setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

If I used bash, I'd need to declare:

export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

Another option for a non-standard install would be to place at the top of your own scripts the following invocation:

BEGIN {
    unshift @INC, "/some/other/directory/blib/lib";
}    

THANKS

I wish to thank many users of this module for their feedback. Many of the improvements I have made to the module over the years are a result of the feedback received.

I thank Slaven Rezic for pointing out that the module worked with Perl 5.14.x. For Version 2.22, I had set the required version of Perl to 5.18.0 since that's what I used for testing the module. Slaven's feedback in the form of the Bug report #96547 resulted in Version 2.23 of the module. Version 2.25 further downshifts the required version of Perl to 5.10.

On the basis of the report posted by Slaven at rt.cpan.org regarding Version 2.27, I am removing the Perl version restriction altogether from Version 2.30. Thanks Slaven!

Version 3.43 was prompted by a bug report sent by Jan Trukenmüller regarding a missing backslash in a regex in the csv_cleanup_needed() function. Thanks, Jan!

AUTHOR

The author, Avinash Kak, recently finished a 17-year long "Objects Trilogy Project" with the publication of the book Designing with Objects by John-Wiley. If interested, visit his web page at Purdue to find out what this project was all about. You might like Designing with Objects especially if you enjoyed reading Harry Potter as a kid (or even as an adult, for that matter).

If you send email regarding this module, please place the string "DecisionTree" in your subject line to get past my spam filter. Avi Kak's email address is kak@purdue.edu

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Copyright 2017 Avinash Kak