Introduction to Random Forests

Limitations of CART

Classification trees built using the CART algorithm have the advantage of being easy to interpret and understand. However, that simplicity comes at a cost: in general, they do not classify as well as other classification algorithms, and the trees constructed are not necessarily optimal even when applied to the dataset used to build them.

It has previously been demonstrated that the splits in a CART tree at the second and subsequent lower levels depend on the predictor used to perform the first split. For example, if predictor A is chosen for the top-level split, the resulting tree may be quite different from the tree produced when predictor B is chosen. Yet when choosing the top-level split variable, the CART algorithm does not consider subsequent splits in the tree. At each split, CART chooses a variable and a level of that variable, calculating the gain for every potential predictor and, for each predictor, every record in the dataset.
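As a concrete illustration of this exhaustive search, the sketch below scans every predictor and every observed value as a candidate split point and keeps the pair with the largest impurity reduction. It is a minimal sketch assuming binary 0/1 class labels, numeric predictors, and Gini impurity as the gain measure; the function names are illustrative and this is not a reference implementation of CART.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of 0/1 class labels."""
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Scan every predictor and every observed value as a threshold and
    return the (predictor index, threshold, gain) with the largest gain."""
    best = (None, None, -np.inf)
    parent, n = gini(y), len(y)
    for j in range(X.shape[1]):            # every potential predictor
        for t in np.unique(X[:, j]):       # every record's value as a threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
            if gain > best[2]:
                best = (j, t, gain)
    return best
```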

CART has a singular focus: it looks only to maximize the gain generated at each immediate split. Thus CART is near-sighted. What is best for the immediate split is chosen even if it produces suboptimal results for the overall tree. In some cases, the gain of the variable and level chosen by CART for the top split is only slightly better than the gain from a contending variable, yet the overall tree that would result from splitting on the contending variable may outperform the tree produced by the original top-level variable.

CART calculates the gain for each split based on all of the records available. Sometimes splitting on one predictor produces only a slightly higher gain for the immediate split than would result from choosing the second-best split variable. Indeed, some input variables are so close in the gain calculation for the immediate split that, were a few observations removed from the training dataset, another variable would be chosen. And when that second variable is chosen over the first, the resulting tree may outperform the tree built using the first variable as the top split. In other words, by removing a few observations from the training set, CART can be led to split on another variable, and the overall tree that results can outperform a tree built using all of the records.
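The following sketch illustrates this sensitivity using scikit-learn's CART-style DecisionTreeClassifier on a standard benchmark dataset: the root split is compared before and after a handful of records is removed. Whether the chosen variable actually changes depends on the particular data, so treat this as an illustration rather than a guaranteed demonstration; the dataset and the number of rows removed are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Tree built from all of the records.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Tree built after dropping five randomly chosen observations.
rng = np.random.default_rng(0)
keep = np.setdiff1d(np.arange(len(y)), rng.choice(len(y), size=5, replace=False))
reduced_tree = DecisionTreeClassifier(random_state=0).fit(X[keep], y[keep])

# tree_.feature[0] is the index of the predictor used at the root split.
print("Top split, all records:    feature", full_tree.tree_.feature[0])
print("Top split, 5 rows removed: feature", reduced_tree.tree_.feature[0])
```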

Solution to CART's Limitations: Random Forest

One way to remediate CART's weaknesses would be to construct an algorithm that restricts each split decision to a randomly chosen subset of the candidate variables and then picks whichever variable in that smaller set produces the best split. A second way to remediate the problem would be to create a variety of samples from the available training records.

To accomplish this, Leo Breiman of the University of California, Berkeley, and Adele Cutler came up with the idea of building multiple trees. Each tree generates a prediction. For categorical predictions, the prediction of the collection of trees (the forest) is determined by a majority vote of the individual tree predictions. For continuous numeric prediction problems, the average of the values produced across the trees is used as the prediction.
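A minimal sketch of this aggregation step is shown below, assuming the individual tree predictions are already in hand; the function names and the toy vote matrix are illustrative only.

```python
import numpy as np

def aggregate_classification(tree_predictions):
    """Majority vote across trees; tree_predictions has shape (n_trees, n_samples)."""
    preds = np.asarray(tree_predictions)
    # For each sample (column), count the votes per class label and keep the winner.
    return np.array([np.bincount(col).argmax() for col in preds.T])

def aggregate_regression(tree_predictions):
    """Average across trees for continuous numeric targets."""
    return np.asarray(tree_predictions).mean(axis=0)

# Example: three trees voting on four samples.
votes = [[0, 1, 1, 0],
         [1, 1, 0, 0],
         [1, 1, 1, 0]]
print(aggregate_classification(votes))   # -> [1 1 1 0]
```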

They proposed that each tree be built using only a randomly sampled subset of the training observations, and they designed the algorithm so that each split selects from a different randomly chosen subset of the predictor variables. Thus, the problem of one or two variables dominating the split decisions is avoided.
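In scikit-learn's implementation of the algorithm (one common realization, not the authors' original code), these two sources of randomness correspond to the bootstrap and max_features parameters of RandomForestClassifier; the settings below are typical choices rather than prescribed values.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,     # number of trees in the forest
    bootstrap=True,       # each tree sees a bootstrap sample of the training records
    max_features="sqrt",  # each split considers only a random subset of the predictors
    random_state=0,
)
```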

They then showed that the resulting algorithm outperforms CART and other classification algorithms when tested on numerous benchmark datasets. Later, they developed mathematical formulations demonstrating the advantages of the random forest algorithm over competing algorithms.

Advantages of Random Forests

They reported the following benefits of the random forest algorithm (Breiman, 2001):

  • It is often the most accurate algorithm of those currently available. High levels of predictive accuracy are delivered automatically.
  • It runs efficiently on large databases.
  • There is no need for prior culling of many input variables down to a small number; the process handles variable selection automatically (see the usage sketch after this list).
  • Because it draws many subsamples of the data, it avoids relying on a single, non-representative sample (sampling error), as can occur when only one random sample is drawn to create the training dataset.
  • Random forests tend not to overfit the data.
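A brief usage sketch consistent with these points is shown below: the forest is fit with default settings, with no prior variable selection, and evaluated by cross-validation on a benchmark dataset. The dataset and the resulting accuracy are illustrative only and do not reproduce Breiman's experiments.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)        # all 30 predictors, no prior culling
forest = RandomForestClassifier(random_state=0)   # default settings
scores = cross_val_score(forest, X, y, cv=5)
print("Cross-validated accuracy:", round(scores.mean(), 3))
```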