The Overfitting Problem

Two key ideas are essential for understanding how to manage the overfitting problem when working with decision trees: recursive partitioning and pruning.

Recursive Partitioning

As discussed previously, decision trees are created by splitting the data into subgroups based on a set of rules. This splitting is referred to as recursive partitioning: the records are repeatedly divided into two groups so as to achieve maximum homogeneity within each resulting subgroup.
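
To make this concrete, below is a minimal sketch in Python (using a tiny made-up dataset, not the mushroom data) of how a single split is chosen: each candidate threshold on one feature is scored by how homogeneous the two resulting groups would be, measured here with Gini impurity, and the threshold with the lowest weighted impurity wins. A tree-growing algorithm such as CART simply repeats this search within each resulting subgroup.

import numpy as np

def gini(labels):
    # Gini impurity: 0 means the group is perfectly homogeneous (pure)
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def best_split(x, y):
    # Score every candidate threshold on a single numeric feature and keep
    # the one whose two subgroups have the lowest weighted Gini impurity
    best_threshold, best_score = None, float("inf")
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Tiny made-up example: the two classes separate cleanly around x = 5
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(best_split(x, y))   # chooses the split x <= 5, weighted impurity 0.0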

Pruning

Finding the right balance between accuracy and simplicity is accomplished through pruning. Pruning simplifies the tree by removing peripheral branches to avoid overfitting.

Overfitting

The process of recursive partitioning naturally ends when the tree has split the data so that every leaf (terminal node) is 100% pure, or when all possible splits have been tried and no further splitting will help. Reaching this point, however, overfits the data by absorbing the noise in the training data set. In other words, the decision tree learns the training data so well that accuracy falls when its rules are applied to unseen data.

Overfitting occurs when a model learns not only the actual general patterns but also the noise in its training data. This hurts the model's predictive accuracy on unseen data. In short, overfitting leads to low predictive accuracy on new data.
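
This gap is easy to reproduce. The sketch below is a hypothetical example using Python's scikit-learn on noisy synthetic data (not the textbook's data): a tree grown with no limits on splitting classifies its own training records almost perfectly, yet its accuracy on held-out records is noticeably lower.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data stands in for a real training set (flip_y adds label noise)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# No limits on splitting: recursive partitioning runs until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

print("Training accuracy:  ", full_tree.score(X_train, y_train))   # essentially 1.0
print("Validation accuracy:", full_tree.score(X_valid, y_valid))   # noticeably lower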

You can see evidence of overfitting in the error rates on the training and validation data. As recursive splitting continues and more split decisions are added, the lower branches become more differentiated and the error on the training data continues to decline. Past a certain point, however, the error rate on the validation data bottoms out and then starts to increase rather than decrease (see Figure 5.21).

Figure 5.21: Decision Tree Error Rate for # of Splits
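
A curve like the one in Figure 5.21 can be approximated by growing trees of increasing size and recording the error rate on both data sets. The sketch below is a hypothetical scikit-learn example in which max_leaf_nodes stands in for the number of splits (a binary tree with k leaves contains k - 1 splits).

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# A binary tree with k leaves contains k - 1 splits, so max_leaf_nodes controls size
for leaves in (2, 4, 8, 16, 32, 64, 128, 256):
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=1)
    tree.fit(X_train, y_train)
    train_error = 1 - tree.score(X_train, y_train)
    valid_error = 1 - tree.score(X_valid, y_valid)
    print(f"{leaves - 1:4d} splits   training error {train_error:.3f}   "
          f"validation error {valid_error:.3f}")

# Training error keeps falling as splits are added; validation error typically
# bottoms out and then starts to climb, which is the overfitting signature.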

Below is an example Split History graph from JMP, produced when creating a regression tree. The blue and red lines represent the R-square of the model on the training data and validation data, respectively, as the number of splits increases. As the tree evolves during training, CART adds more and more splits. Early on, each added split improves R-square on both the training data and the validation data. As more splits are added, the tree begins to overfit the training data. Because the details memorized from the training data do not generalize to the pattern in the validation data, the R-square on the validation data peaks and then declines.

Figure 5.22: JMP splits graph on CART regression
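
The same kind of split history can be sketched outside of JMP. The hypothetical scikit-learn example below records the R-square of a regression tree on the training and validation data as the number of splits grows; it is an approximation of the idea, not JMP's actual output.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic regression data; the noise is what an over-grown tree memorizes
X, y = make_regression(n_samples=1500, n_features=10, n_informative=4,
                       noise=25.0, random_state=2)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=2)

for leaves in (2, 4, 8, 16, 32, 64, 128, 256, 512):
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=2)
    tree.fit(X_train, y_train)
    # .score() returns R-square for regression models
    print(f"{leaves - 1:4d} splits   training R^2 {tree.score(X_train, y_train):.3f}   "
          f"validation R^2 {tree.score(X_valid, y_valid):.3f}")

# Training R-square keeps rising toward 1; validation R-square typically peaks
# and then declines as the added splits stop generalizing.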

Solving Overfitting through Pruning

Recall the mushroom classification tree example. Not all of the terminal nodes were completely one category or the other. See the Color (brown) terminal node, for example. It shows both blue and red, or both edible and poisonous mushrooms.

Figure 5.23: Pure and non-pure leaf nodes
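
The same purity check can be done programmatically. In the hypothetical scikit-learn sketch below, each leaf of a fitted tree stores its class distribution, so impure leaves (those containing both categories, like the Color (brown) node) are easy to list.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=3)

# A size-limited tree, so some leaves are left impure (a mix of both classes)
tree = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X, y)

t = tree.tree_
for node in range(t.node_count):
    if t.children_left[node] == -1:                  # -1 marks a leaf (terminal node)
        class_dist = t.value[node][0]                # per-class distribution in this leaf
        purity = class_dist.max() / class_dist.sum()
        label = "pure" if purity == 1.0 else "impure"
        print(f"leaf {node}: purity {purity:.2f} over "
              f"{int(t.n_node_samples[node])} records ({label})")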

The automatic stopping point in JMP finds a level of decomposition in the tree that strikes a good balance between accuracy and simplicity. If you were to continue splitting until every terminal node is entirely one category, or until no further split would help, the tree could become so complex that it does not generalize well enough to accurately estimate future records.

CART lets the tree grow and then prunes it back, generating successively smaller trees by pruning leaves; branches with too much splitting are removed. Pruning stops at the point where the validation error begins to rise.
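
As a hedged illustration of this grow-then-prune idea (using scikit-learn's cost-complexity pruning as a stand-in for JMP's behavior, on made-up data), cost_complexity_pruning_path enumerates a sequence of successively smaller trees, with larger values of ccp_alpha pruning away more leaves; keeping the tree with the lowest validation error is one simple way to stop just before validation error begins to rise.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=4)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=4)

# Grow the full tree first, then ask for its cost-complexity pruning schedule
full_tree = DecisionTreeClassifier(random_state=4).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

best_alpha, best_error = 0.0, float("inf")
for alpha in alphas:
    # Larger alpha prunes more aggressively, producing a smaller tree
    pruned = DecisionTreeClassifier(random_state=4, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    valid_error = 1 - pruned.score(X_valid, y_valid)
    if valid_error < best_error:
        best_alpha, best_error = alpha, valid_error

print(f"chosen ccp_alpha: {best_alpha:.5f}   validation error: {best_error:.3f}")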