5.2 How Decision Trees Are Created
Tree Structure and Terminology
A decision tree is made up of nodes, branches, and leaves. Non-leaf nodes represent a set of records that will be split. Branches connect nodes to other nodes. Terminal (leaf) nodes are nodes at the bottom of the tree that will not be split further. An example tree is shown below. The root node is the node that represents the pool of all data before the first split, or first decision. Each time the data is split, a rule is derived that best splits the data based on the value of one input variable. Records that meet the rule branch off to a node on one side, and records that do not meet the rule branch off to a different node.
After passing through the root node of a tree, a record may next arrive at either a decision node or a terminal node. A decision node is another node that contains a decision, that is, a splitting rule. A terminal node, commonly known as a leaf node, contains records that do not pass through any further decisions. This occurs when a stopping rule tells CART to stop splitting. Common stopping rules include stopping when the outcome values in the leaf node are all the same (or sufficiently similar) or when the number of records in the leaf node falls below some specified minimum.
The image below shows all portions of the structure of a decision tree.
The predicted value of a new record is determined by starting the record at the top of the tree. Each decision rule directs the new record down the splits and branches of the tree until the record reaches a terminal (leaf) node. The values of the records in that leaf node are then used to determine the predicted value.
In the above example, a white mushroom would go down the left branch of the tree. If it has zero or one ring, it would be classified as poison. If it has two rings, it would go to the next node. Then, if it is green, it would be classified as poison. Otherwise, it would be classified as edible.
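To make this routing process concrete, the sketch below hard-codes a tiny tree that loosely follows the mushroom example above. The attribute names ("cap_color", "rings", "gill_color") and the exact rules are illustrative assumptions, not the exact tree from the figure.

```python
# A minimal sketch of routing one record down a decision tree, loosely
# following the mushroom example above. Attribute names and rules are
# illustrative assumptions only.

def classify_mushroom(record):
    """Return the predicted class by walking the record from root to leaf."""
    if record["cap_color"] == "white":        # root node: first splitting rule
        if record["rings"] <= 1:              # decision node: ring count
            return "poison"                   # leaf node
        else:                                 # two rings: go to the next node
            if record["gill_color"] == "green":
                return "poison"
            return "edible"
    else:
        # The right side of the tree is not described in the text,
        # so a placeholder class is returned here.
        return "edible"

# Example: a white mushroom with two rings and green gills.
print(classify_mushroom({"cap_color": "white", "rings": 2, "gill_color": "green"}))
# -> poison
```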
Steps for Building a Decision Tree
Different data mining tools implement the CART algorithm a bit differently, but the general process for creating decision trees is described in this section.
Building a decision tree consists of three steps that are repeated until tree growth is stopped by some stopping rule.
The following three steps are used to create a decision tree:
Step 1 - Consider each input variable as a possible splitter. For each input variable, determine which value of that variable would produce the best split, that is, the split with the most homogeneity on each side. All input variables, and all possible split point values for each input variable, are evaluated. The choice of split is made in a greedy manner: the best splitter variable and the best split value for that variable are chosen to produce each immediate split, regardless of how it will affect subsequent splits lower down the tree (a sketch of this greedy search appears after these steps).
Step 2 - Rank the candidate splits according to which would produce the best results, and choose the best one. A split therefore consists of a splitter variable and the value of that variable that produced the best results.
Step 3 - For each node obtained by splitting, repeat Steps 1 and 2, until no more good splits are possible.
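As a rough illustration of Steps 1 and 2, the sketch below evaluates every input variable and every candidate split point, scoring each by Gini impurity (one common homogeneity measure; CART implementations may use others) and returning the single best split. The function and variable names are illustrative, not part of any particular tool.

```python
# A rough sketch of Steps 1 and 2: greedily evaluate every input variable
# and every candidate split point, scoring each by weighted Gini impurity.
# Names and the choice of Gini impurity are illustrative assumptions.

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(records, labels):
    """Return (variable_index, split_value, score) of the best split found."""
    best = None
    for var in range(len(records[0])):                    # Step 1: each variable
        for value in sorted({r[var] for r in records}):   # each candidate split point
            left = [y for r, y in zip(records, labels) if r[var] <= value]
            right = [y for r, y in zip(records, labels) if r[var] > value]
            if not left or not right:
                continue
            # Weighted impurity of the two child nodes: lower is more homogeneous.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:           # Step 2: keep the best split
                best = (var, value, score)
    return best

# Toy data: two numeric inputs per record, binary class labels.
X = [[60, 18.4], [85, 16.8], [64, 21.6], [108, 17.6]]
y = ["non-owner", "non-owner", "owner", "owner"]
print(best_split(X, y))
```

Applying this search again to each child node, as Step 3 describes, grows the tree level by level.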
Splitting stops when the leaf nodes of the tree have reached complete purity or when some other stopping rule prevents further splitting. Purity is having complete homogeneity (one class only) in each leaf/terminal node.
Unless the predictive pattern learned from the data is especially powerful and consistent, purity for all leaf nodes is neither achieved nor even desirable with decision trees. Why? Because obtaining purity would require too much splitting. The tree would become so tall and bushy that it would not generalize well to new data. This would result in overfitting. So, typically, some terminal nodes are pure and some are not.
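In practice, stopping rules are often applied through the growth-limiting parameters that tree implementations expose. The sketch below uses scikit-learn's DecisionTreeClassifier (an assumption about tooling, since the text does not name a specific package) to cap tree depth and leaf size, which keeps the tree from splitting all the way to pure leaves.

```python
# A brief sketch of stopping rules in practice, assuming scikit-learn is the tool.
# max_depth and min_samples_leaf stop splitting before every leaf becomes pure,
# which helps the tree generalize rather than overfit.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Without limits, the tree keeps splitting toward pure leaves.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# With stopping rules: stop at depth 3 and require at least 5 records per leaf.
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                     random_state=0).fit(X, y)

print("Leaves without stopping rules:", full_tree.get_n_leaves())
print("Leaves with stopping rules:   ", pruned_tree.get_n_leaves())
```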
Riding Lawn Mowers Example
To help you understand how nodes on a decision tree are determined, we will use a simple Riding Lawn Mowers dataset, visualizing cutoff points and subgroups with scatter plots and parallel plots. A screenshot of the data is shown below.
First Split
Most machine learning tools begin by looking at the training data and checking each variable to see where the best split can be found. The candidate splits are ranked according to which produces the most homogeneous set of subgroups. For classification, each leaf node is then labeled with a class using a probability cutoff; the default cutoff in most cases is 0.5, meaning the leaf node is assigned whichever class makes up the majority of its records.
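The tiny snippet below illustrates this labeling idea: a leaf's predicted class is simply the class whose proportion in the leaf exceeds the 0.5 cutoff. The function name, class labels, and data are illustrative, not from the Riding Lawn Mowers dataset.

```python
# Illustrative only: label a leaf node by the class whose proportion
# exceeds the cutoff (0.5 by default, i.e., the majority class).
def leaf_label(leaf_classes, positive="owner", negative="non-owner", cutoff=0.5):
    proportion = leaf_classes.count(positive) / len(leaf_classes)
    return positive if proportion > cutoff else negative

print(leaf_label(["owner", "owner", "non-owner"]))  # -> owner
```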
Figure 5.5 shows that the best first split is on Lot Size at 19,000 square feet. As you can see, a majority of the lawn mower owners (dark circles) are above the split, while a majority of non-owners are below it.
The first split from the scatter plot results in the following decision tree (Figure 5.6).
Second Split
Now looking at the lower half of the split (Lot Size less than 19,000 square feet), the next best split is on the variable Income, shown on the x-axis, at $84.75. Looking at Figure 5.7, this second split gives us complete purity of riding lawn mower owners in the lower right-hand section of the scatter plot.
The second split from the scatter plot results in the following decision tree (Figure 5.8).
All Splits Performed
This process of splitting for each node will continue until purity has been reached. Figure 5.9 shows all splits that have been performed for the Riding Lawn Mowers dataset.
All splits from the scatter plot above result in the overall decision tree (Figure 5.10).
The Riding Lawn Mowers problem is a simple example of how decision trees are split and created. Since the dataset contains only two input variables, we can easily see how this is performed on a two-dimensional scatter plot. Adding more variables makes the process harder to visualize, but the CART algorithm has no trouble keeping track of the different splits and subgroups within the data.
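For readers who want to reproduce this kind of tree in code, the sketch below fits a CART-style classifier on a small two-variable dataset with scikit-learn. The records are made-up placeholders in the spirit of the Riding Lawn Mowers data, not the actual values from the textbook dataset, so the split points it prints will differ from those in Figures 5.5 through 5.10.

```python
# A sketch of growing a two-variable classification tree with scikit-learn.
# The records below are illustrative placeholders, NOT the textbook's actual
# Riding Lawn Mowers data, so the learned split points will differ from the figures.
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed units: Income in $1000s, Lot Size in 1000s of square feet.
X = [[60, 18.5], [75, 19.5], [85, 21.0], [95, 22.0],
     [50, 16.5], [65, 17.0], [80, 18.0], [55, 20.5]]
y = ["non-owner", "owner", "owner", "owner",
     "non-owner", "non-owner", "non-owner", "owner"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Print the splitting rules from root to leaves.
print(export_text(tree, feature_names=["Income", "Lot_Size"]))
```

The printed rules take the same form as the splits in the tree diagrams above: one splitter variable and one cutoff value per decision node, with a predicted class at each leaf.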