The Random Forest Algorithm

Random Forest for Classification

The following section explains the steps of the Random Forest classification algorithm. Keep the following parameters in mind:

  • T = Number of Decision Trees to Create
  • N = Number of observations in the Training Data Set
  • n = Number of observations selected from N with replacement. "With replacement" means every observation has an equal chance of being selected on each draw. If the same record is selected more than once, it does not get included twice in the same sample; records drawn more than once reduce to a single instance of that record in that sample (see the short sketch after this list).
  • M = Number of input variables included in the model
  • m = Number of input variables randomly chosen from M. m is typically less than M.
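To make sampling with replacement concrete, here is a minimal sketch (a hypothetical illustration using NumPy; the seed and the small values of N and n are arbitrary) that draws n record indices from N and shows how duplicate draws reduce to a single instance:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N = 10   # observations in the training set
n = 10   # observations drawn with replacement (here n = N)

# Draw n indices with replacement: every index has an equal chance on each draw.
drawn = rng.choice(N, size=n, replace=True)
print("indices drawn:", sorted(drawn))

# Records drawn more than once reduce to a single instance in the sample.
unique_in_sample = np.unique(drawn)
print("unique records kept for training:", unique_in_sample)
print("records never drawn (available for testing):",
      np.setdiff1d(np.arange(N), unique_in_sample))
```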

Step 1: Build T trees using the CART algorithm

Figure 15.1: Random Forest Simplified
  • For each tree, randomly select n observations with replacement from the N available observations in the training set. The number of observations selected (n) is always the same for all T trees. In practice, when n = N, each tree is grown using about 63% of the original training observations, due to the bootstrap sampling process, i.e., sampling with replacement. The remaining 37% of the data is available to test the tree. The sampled training set and the remaining test data are different for each tree.

  • At each node to be split, randomly select m input variables from the M available candidates. The number of input variables m will always be the same for all splits in all trees. Typically, m = SquareRoot(M). For example, if there are 100 possible input variables, 10 are randomly selected as candidate variables for the best split. So the node is split using the best variable among the 10, not the best variable among the 100. This forces different variables to be considered for splits. It also radically speeds up the tree-growing process because fewer variables are evaluated to pick the best splitter for each node split.

  • The best splitter is picked from among the m candidates (10 in the example) to split the node in question. The splitter might be the best overall, but probably is not. The splitter might be a fairly good splitter, or it may not be very helpful. If the splitter is not very good, the two children resulting from the split are essentially alike. For example, if we split on GENDER when there is little difference between males and females, the essential work has been deferred to lower levels of the tree. A new list of eligible splitters is selected at random for each node. With a large number of predictors (M), the splitter candidates that make up m will be quite different from node to node. In this way, important predictors eventually make it into the tree. This explains in part why the trees must be grown to their maximum full size.

  • Grow each tree to the fullest extent possible. Aim for terminal nodes with one data record. There is no need for pruning. The voting across trees limits what would otherwise be a serious overfitting problem. (A code sketch of this step follows the list.)
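The following is a minimal sketch of Step 1, assuming scikit-learn's DecisionTreeClassifier as a stand-in for a CART implementation; the dataset X, y and the parameter values are illustrative, not from the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=0)

# Illustrative training data: N observations, M input variables.
N, M = 500, 16
X = rng.normal(size=(N, M))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

T = 25                       # number of trees
n = N                        # bootstrap sample size (n = N)
m = int(round(np.sqrt(M)))   # candidate variables per split

trees, oob_indices = [], []
for _ in range(T):
    # Bootstrap: n observations drawn with replacement.
    in_bag = rng.choice(N, size=n, replace=True)
    out_of_bag = np.setdiff1d(np.arange(N), in_bag)

    # Grow the tree to full depth; max_features=m limits the candidate
    # splitters considered at each node to m randomly chosen variables.
    tree = DecisionTreeClassifier(max_features=m)
    tree.fit(X[in_bag], y[in_bag])

    trees.append(tree)
    oob_indices.append(out_of_bag)
```

Setting max_features=m is what makes each node consider only a random subset of the M variables, matching the random variable selection described above.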

Step 2: Cast Votes for test observations

Figure 15.2: Random Forest Voting for Classification

To generate a predicted class for a given test observation, feed the observation to each of the T trees and generate a prediction from each. Each tree casts one vote, determined by the terminal node the observation reaches. The predicted class is the one predicted by the majority of the individual trees.
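Continuing the Step 1 sketch (the trees list, rng, and M are assumed from that code), majority voting for a single test observation might look like this:

```python
import numpy as np

# x_new: a single test observation with M input values (illustrative).
x_new = rng.normal(size=(1, M))

# Each tree casts one vote: the class of the terminal node x_new lands in.
votes = np.array([tree.predict(x_new)[0] for tree in trees])

# The forest's prediction is the majority class among the T votes.
classes, counts = np.unique(votes, return_counts=True)
predicted_class = classes[np.argmax(counts)]
print("vote counts per class:", dict(zip(classes, counts)))
print("predicted class:", predicted_class)
```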

Training and Testing of Random Forests

The parameter values for T, m, and n are chosen by the analyst before processing begins. Typically n is set equal to N and the number of input variables (m) is set to the square root of M rounded to the nearest integer.

When n = N, because of the sampling with replacement, about 63% of the observations in the full dataset will appear in the resulting training set. The formula to compute this percent is:

1 − (1 − 1/N)^N

When N is large this converges to:

1 − 1/e

which is approximately 0.632.
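As a quick numerical check (an illustrative calculation, not part of the original text):

```python
import math

# Probability a given record appears at least once in a sample of size N
# drawn with replacement from N records.
for N in (10, 100, 1000, 100000):
    p_included = 1 - (1 - 1 / N) ** N
    print(N, round(p_included, 4))

print("limit 1 - 1/e =", round(1 - 1 / math.e, 4))   # approximately 0.6321
```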

Why are only about 63% of the total records in the dataset included in each training sample? It is because sampling is done with replacement. This means that some records are randomly selected multiple times for the same tree. When this happens, duplicates are dropped so that only one of any given record is retained for training. Because of this choosing and dropping of duplicates, some records are not selected at all for training. Those not selected for training are used for testing the trained model.

For example, assume that the number of records (N) in the entire dataset is 1000 and the number of records (n) to be selected for each subsample is set to 1000. 1000 records are drawn with replacement. About 37% of the draws are duplicates of previously sampled records and are dropped, so no record is retained more than once in that sample. This duplication and reduction means that only about 63% of the records get used for training each tree. The records not chosen at all are used to test the tree. So this partition is used to train and validate one tree. A different division of records is used for each tree.
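The same 1000-record example can be verified empirically with a short simulation (illustrative; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
N = n = 1000

sample = rng.choice(N, size=n, replace=True)   # draw 1000 records with replacement
unique_records = np.unique(sample)             # duplicates reduce to one instance

print("records used for training:", unique_records.size)         # about 632
print("records left out for testing:", N - unique_records.size)  # about 368
```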

When sampling begins again to select records to train the next tree, some of the records selected are the same ones selected for the previous tree, and some are different. Because of this, all records get to participate in training some of the trees and in testing some of the trees.
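A small simulation (again illustrative) shows that, over many trees, every record ends up in the training sample of some trees and in the held-out test data of others:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
N, T = 1000, 100

in_bag_counts = np.zeros(N, dtype=int)   # trees each record helped train
oob_counts = np.zeros(N, dtype=int)      # trees each record was held out of

for _ in range(T):
    sample = np.unique(rng.choice(N, size=N, replace=True))
    mask = np.zeros(N, dtype=bool)
    mask[sample] = True
    in_bag_counts += mask
    oob_counts += ~mask

print("every record trained at least one tree:", (in_bag_counts > 0).all())
print("every record tested at least one tree: ", (oob_counts > 0).all())
print("average trees per record as test data: ", oob_counts.mean())  # about 0.37 * T
```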

Random Forests for Regression

Figure 15.3: Random Forest for Regression

The random forest algorithm for regression is similar to that used for classification. The differences are:

  • The split variables used at each node are selected based on their ability to maximize the difference in mean response values between the resulting child nodes.
  • The predicted value of the forest is the mean of the predicted values of the individual trees, as sketched below.
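A minimal regression sketch, assuming scikit-learn's DecisionTreeRegressor as the base learner (the data, parameters, and seed are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=3)

# Illustrative data with a continuous response.
N, M = 500, 16
X = rng.normal(size=(N, M))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=N)

T = 25
m = int(round(np.sqrt(M)))

trees = []
for _ in range(T):
    in_bag = rng.choice(N, size=N, replace=True)
    # Splits are chosen to make the child-node mean responses as different
    # as possible (equivalently, to reduce within-node variance).
    tree = DecisionTreeRegressor(max_features=m)
    tree.fit(X[in_bag], y[in_bag])
    trees.append(tree)

# The forest's prediction is the mean of the individual tree predictions.
x_new = rng.normal(size=(1, M))
forest_prediction = np.mean([tree.predict(x_new)[0] for tree in trees])
print("forest prediction:", forest_prediction)
```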

A single tree behaves like a nearest-neighbor method: when a new record to be estimated is "dropped" down a trained tree, it follows a path determined by the splits. To reach a node in the tree, a record must satisfy the conditions of every parent node (e.g., Age > 35, Education = 12, City = Yes, etc.). Reaching a terminal node therefore means the record is very similar to the training records that occupy that node. This is the same idea as nearest-neighbor prediction: find a historical record that is as similar as possible to the new record, then predict that the new record behaves like the historical record. It differs from KNN, however, in that each record to be predicted is dropped down every trained tree, rather than being compared directly to its k nearest neighbors.