5.3 Creating CART Examples
CART Logic and Rules
Classification and Regression Trees (CART), as mentioned before, is a method commonly used for both classification and prediction, and it works particularly well in situations where it is important to easily interpret and explain model results. However, the logic behind the trees, and the way their rules are created, differs slightly between Classification and Regression Trees.
Classification Trees
The purpose of a classification tree is to classify records into the correct category. The rules that make up a classification tree can then be used to classify the likely category of new records. Classification trees involve the use of a categorical outcome variable. Input variables can be either categorical or numeric.
In the mushroom example used previously, the classification tree set out to classify mushrooms as either "Edible" or "Poison". The end result was a set of rules, or a tree, that allows users to correctly classify mushrooms as either edible or poisonous. Figure 5.11 below details those rules. As you can see, these rules are easy to interpret and communicate to other interested groups.
Note: This is not the entire finished tree, so the rules shown are only partial; the finished tree likely contains further splits.
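To give a concrete feel for what such rules look like, here is a small sketch in Python. The attributes, categories, and thresholds below are invented for illustration only; the actual rules come from Figure 5.11 and the data.

```python
# Hypothetical classification rules, invented for illustration only;
# the real rules come from the fitted tree in Figure 5.11.
def classify_mushroom(odor, spore_print_color):
    if odor == "foul":
        return "Poison"
    if spore_print_color == "green":
        return "Poison"
    return "Edible"

print(classify_mushroom(odor="none", spore_print_color="brown"))  # Edible
```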
How to Build a Classification Tree in JMP
To see the entire finished tree of the Mushroom data, let’s build the classification tree in JMP.
Step 1 - Open the MyMushroomDemo dataset in JMP.
Step 2 - Create a Validation column.
We need to divide our data into training data and validation data. Validation data is held aside while the model is built using the training data, and is then used to assess the accuracy of the model. For example, if you were to leave 10 records out of your model as validation data, your model would be created with the remaining training data. After your model is complete, it can generate predictions for those 10 held-out records. You can then evaluate the accuracy of your model by comparing those 10 predictions to the actual values of the 10 validation records.
To create a Validation column, select Analyze -> Predictive Modeling -> Make Validation Column.
Next, make sure that the proportion of training records is .75 and the proportion of validation records is .25. Place a “1” in the “Seed” input box, and click the “Fixed Random” button.
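JMP's dialog handles this split directly. For readers who prefer to see the same idea in code, here is a minimal sketch in Python with scikit-learn, assuming the data has been exported to a file named MyMushroomDemo.csv (the file name is an assumption for illustration):

```python
# A minimal sketch of a 75/25 train/validation split, mirroring the
# JMP settings above (proportions .75/.25, fixed random seed of 1).
import pandas as pd
from sklearn.model_selection import train_test_split

mushrooms = pd.read_csv("MyMushroomDemo.csv")  # assumed file name
train, validation = train_test_split(
    mushrooms, test_size=0.25, random_state=1
)
print(len(train), len(validation))
```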
Step 3 - Create a classification tree.
Select Analyze -> Predictive Modeling -> Partition. This is JMP’s name for CART. Fill in the Y, X, and Validation inputs as shown in the image below. Click OK.
You will then be shown a screen like the image below.
Click the Go button to create the tree. JMP automatically creates all of the splits necessary to classify records as either edible or poisonous.
Congratulations, you have built your first classification tree!
It is helpful to see the number of edible and poisonous mushrooms in each node. Select Display Options -> Show Split Count.
Now the count and proportion of edible and poisonous mushrooms are shown for each node.
Above the tree, you can see the number of splits in the tree and the R-Squared for both the training and validation data.
In a later chapter, we will explain how to evaluate the performance of the decision tree model.
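The Partition platform does all of this interactively. As a point of comparison, here is a minimal sketch of an equivalent workflow using scikit-learn's CART implementation; the outcome column name Class, the file name, and the one-hot encoding step are assumptions for illustration, not details taken from the JMP dataset.

```python
# A sketch of the same workflow with scikit-learn's CART implementation.
# The outcome column name "Class" is an assumption for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

mushrooms = pd.read_csv("MyMushroomDemo.csv")
X = pd.get_dummies(mushrooms.drop(columns=["Class"]))  # encode categoricals
y = mushrooms["Class"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

tree = DecisionTreeClassifier(random_state=1)
tree.fit(X_train, y_train)

# Accuracy on both sets, analogous to comparing training and validation
# fit in JMP (JMP's Partition report shows R-Squared instead).
print("Training accuracy:  ", tree.score(X_train, y_train))
print("Validation accuracy:", tree.score(X_val, y_val))
```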
Regression Trees
The purpose of a regression tree is to accurately predict the values of future records based on a defined set of rules. Regression trees differ from Classification trees in that they use a numeric outcome variable. Similar to Classification trees, however, the input variables of Regression trees can be either categorical or numeric.
The predictions in a regression tree are computed as the average of the numeric target variable in the rectangle, whereas classification trees use a majority vote. In other words, each leaf of a regression tree predicts the average value of the records that reach that node. Impurity in a rectangle is measured by the sum of squared deviations from that leaf's mean.
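To make this concrete, here is a tiny worked example with made-up values: a leaf containing the target values 10, 12, and 14 predicts their mean, 12, and its impurity is (10 - 12)^2 + (12 - 12)^2 + (14 - 12)^2 = 8.

```python
# Worked example with made-up values: a leaf containing the targets
# [10, 12, 14] predicts their mean, and its impurity is the sum of
# squared deviations from that mean.
values = [10, 12, 14]
prediction = sum(values) / len(values)                  # 12.0
impurity = sum((v - prediction) ** 2 for v in values)   # 8.0
print(prediction, impurity)
```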
How to Create a Regression Tree in JMP
Keep in mind that when creating a regression tree, we are no longer trying to classify our data, but rather to predict an average value for our output variable. The performance of regression trees is measured by the RMSE (Root Mean Squared Error) and the Average Error.
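For reference, here is a quick sketch of how these two measures can be computed, using made-up numbers. "Average Error" is taken here as the mean of the raw residuals, which is one common definition; JMP's exact report may differ.

```python
# RMSE and average error for a handful of made-up predictions.
# "Average error" is computed as the mean of the raw residuals here,
# which is one common definition; JMP's report may define it differently.
import math

actual    = [20.0, 35.0, 15.0, 50.0]
predicted = [22.0, 30.0, 18.0, 48.0]

errors = [a - p for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
avg_error = sum(errors) / len(errors)
print(f"RMSE: {rmse:.2f}, Average error: {avg_error:.2f}")
```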
We will be using the Wine dataset, which contains input variables of Country, Score, Grape_Varietal, and Type to predict the price of a bottle of wine. Notice that Score is a numeric variable, and the rest are categorical. Since you’ve already had practice building a classification decision tree, let’s learn more about regression trees by walking through our Wine example.
Step 1 - Open the Wine Dataset.
Step 2 - Create a Validation Column.
Select Analyze -> Predictive Modeling -> Make Validation Column. Verify that the proportion of training records is .75 and the proportion of validation records is .25. Place a “1” in the “Seed” input box, and click the “Fixed Random” button.
Step 3 - Create a regression tree.
Select Analyze -> Predictive Modeling -> Partition. Remember that this is JMP’s name for CART. Fill in the Y, X, and Validation inputs as shown in the image below. Click OK.
You will then be shown a screen like the image below. Click the Go button to create the tree. JMP automatically creates all of the splits necessary to predict the price of a bottle of wine.
You’ll notice that this tree contains more splits than the classification tree used to classify mushrooms. Take a look at the tree to see its set of splits, then scroll up to view the R-Squared of the tree.
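As with the mushroom example, a minimal scikit-learn analogue may help readers working outside JMP. The file name Wine.csv and the outcome column name Price are assumptions; the input columns come from the description above, and they are one-hot encoded because scikit-learn's trees require numeric inputs (JMP handles categorical inputs natively).

```python
# A sketch of the regression tree with scikit-learn. The file name
# "Wine.csv" and the outcome column "Price" are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

wine = pd.read_csv("Wine.csv")
X = pd.get_dummies(wine[["Country", "Score", "Grape_Varietal", "Type"]])
y = wine["Price"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

tree = DecisionTreeRegressor(random_state=1)
tree.fit(X_train, y_train)

# R-squared on training and validation data, analogous to the values
# JMP reports above the tree.
print("Training R-squared:  ", tree.score(X_train, y_train))
print("Validation R-squared:", tree.score(X_val, y_val))
```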
Congratulations, you’ve now built both a classification tree and a regression tree.