9.4 Classifier Quality Metrics
A number of quality metrics should be considered when evaluating how well a classifier performs. The image below shows the formulas for common quality metrics.
By looking down the predicted columns, you can see that the algorithm predicted 65% of the instances to be positive and the remaining 35% to be negative. By looking across the actual rows, you can see that 70% of the instances are actually positive and the remaining 30% are actually negative.
Classification Accuracy. Accuracy is the proportion of instances the algorithm classifies correctly. It is the sum of the true positive (TP) and true negative (TN) rates. In the example, the accuracy rate is 85%, which is the sum of 60% and 25%.
Misclassification Error. Error occurs when a record is predicted to belong to one class but actually belongs to another. The error rate is the sum of the false positive (FP) and false negative (FN) rates. In the example, the error rate is 15%, which is the sum of 5% and 10%. Note that error = 1 – accuracy and accuracy = 1 – error.
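As a quick check, here is a minimal Python sketch that reproduces the accuracy and error calculation, assuming the example's confusion-matrix entries are counts per 100 instances (TP = 60, FP = 5, FN = 10, TN = 25):

```python
# Illustrative counts per 100 instances, taken from the example
tp, fp, fn, tn = 60, 5, 10, 25

total = tp + fp + fn + tn           # 100 instances
accuracy = (tp + tn) / total        # (60 + 25) / 100 = 0.85
error = (fp + fn) / total           # (5 + 10) / 100 = 0.15

print(f"Accuracy: {accuracy:.2f}")  # 0.85
print(f"Error:    {error:.2f}")     # 0.15, which equals 1 - accuracy
```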
It is important to look at specific measures in addition to accuracy and error.
Recall, also known as Sensitivity and the True Positive Rate (TPR), is the proportion of actual positive instances that are correctly predicted to be positive. In other words, it is the proportion of the actual positives in the dataset that the classifier found. In the example, the recall is 60 out of 70, which is 0.857.
Specificity, also known as the True Negative Rate (TNR), is recall computed on the negatives. It is the proportion of actual negative instances that are correctly predicted to be negative; in other words, the proportion of the actual negatives in the dataset that the classifier found. In the example, the algorithm found 25 of the 30 actual negatives, so specificity is 0.833.
Precision is the number of true positives divided by the total number of instances predicted to be positive. In other words, precision is the proportion of the instances predicted to be positive that are actually positive. In the example, the algorithm predicted 65 instances as positive, of which 60 were true positives. The precision is 0.923, which is 60/65.
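The following Python sketch reproduces the recall, specificity, and precision calculations, again assuming the example's counts per 100 instances:

```python
# Same illustrative counts per 100 instances as before
tp, fp, fn, tn = 60, 5, 10, 25

recall = tp / (tp + fn)        # 60 / 70 = 0.857  (true positive rate)
specificity = tn / (tn + fp)   # 25 / 30 = 0.833  (true negative rate)
precision = tp / (tp + fp)     # 60 / 65 = 0.923

print(f"Recall:      {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision:   {precision:.3f}")
```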
The F-score is a single measure of a classifier’s usefulness that considers both the recall and the precision of the procedure. The higher the F-score, the better the predictive power of the classifier. The F1 score ranges from zero to one: 0 ≤ F ≤ 1. A score of one means the classifier is perfect. A score of zero means the classifier is not effective at all; specifically, that precision, recall, or both are zero.
The F-score is the harmonic mean of precision and recall. The figure below shows how the F-score is calculated, and the figure above shows how it was calculated for the example.
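As a rough check, the F1 value for the example can be reproduced in Python from the precision and recall computed above (small differences from the figure may be due to rounding):

```python
# Harmonic mean of the precision and recall from the example
precision, recall = 0.923, 0.857

f1 = 2 * precision * recall / (precision + recall)
print(f"F1 score: {f1:.3f}")   # approximately 0.889
```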
Tradeoff between Recall and Precision
In categorical predictive modeling, a perfect precision score of 1.0 means that every item predicted to belong to the class of interest does indeed belong to it (but says nothing about whether all items of that class were found), whereas a perfect recall score of 1.0 means that all items belonging to the class of interest were found (but says nothing about how many items outside that class were also flagged).
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. Brain surgery provides an illustrative example of the tradeoff. Consider a brain surgeon tasked with removing a cancerous tumor from a patient’s brain. The surgeon needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon does not want to remove healthy brain cells since that would leave the patient with impaired brain function.
The surgeon may decide to remove more of the brain to ensure she has extracted all the cancer cells. This decision increases recall but reduces precision. Greater recall increases the chances of removing all cancer cells (positive outcome) but also increases the chances of removing healthy cells (negative outcome).
Conversely, the surgeon may be very careful to try to remove only cancer cells. This increases precision but reduces recall. The greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).
This trade-off between recall and precision is the reason that both measures should be considered.
Summary of classifier evaluation metrics
The table below summarizes the classifier evaluation measures.
How Data Mining Tools Display Categorization Quality Metrics
Some data mining tool manufacturers put the characteristic of interest in the left-most predicted column; others put it in the right-most column. Likewise, sometimes the actual positives are on the top row and sometimes they are on the second row. So you must pay attention to the headings, not just the order of the columns and rows. For example, the Azure ML Studio classification matrix has the class of interest in the left column, whereas JMP and SciKit Learn put the class of interest in the right column.
The classifier evaluation metrics presented by Azure ML Studio, JMP, and SciKit Learn (in Python) have some similarities and some differences.
Azure ML Studio
Azure does not show the confusion matrix for the training partition, which means you cannot easily evaluate the degree of overfitting. On the other hand, Azure shows the confusion matrix for the validation partition along with accuracy, recall, precision, and F1 (F-measure) for that partition. This is convenient because you do not have to calculate recall, precision, and the F-measure from the confusion matrix yourself.
JMP
JMP provides the confusion matrix for both the training and validation partitions, and it shows the misclassification rate for each. Comparing performance between the training and validation partitions makes it easy to see the amount of overfitting. JMP does not, however, calculate recall, precision, or the F-measure, although these can easily be calculated from the data in the classification matrix.
JMP also includes approximations of measures typically used to evaluate numeric prediction models, such as regression. For example, it calculates an R-square approximation (Entropy RSquare), RMSE, and mean absolute error.
SciKit Learn
The figure below shows an example of how SciKit Learn presents the confusion matrix and a classification report. SciKit Learn only shows the confusion matrix for the validation data. In the figure, an Excel table has been added below the SciKit Learn output so you can more easily see what the columns and rows of the confusion matrix represent and how the classification report maps to the confusion matrix.
In the classification report, the term support refers to the number of actual instances of each class; it is reported separately for the actual negatives and the actual positives. SciKit Learn calculates recall not only for the actual positives but also for the actual negatives; what it calls recall on the actual negatives is what we typically refer to as specificity. SciKit Learn also calculates precision for both the predicted positives and the predicted negatives, so you can see how accurate the classifier is for both types of predicted values. The classification report also shows accuracy.
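For readers working in Python, here is a minimal sketch of how this kind of output is produced with SciKit Learn's confusion_matrix and classification_report functions. The labels below are made up for illustration and are not the data shown in the figure:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical validation labels and predictions, for illustration only;
# in practice y_true and y_pred come from your own data and trained model.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes (label 0 first, then 1)
print(confusion_matrix(y_true, y_pred))

# Per-class precision, recall, F1, and support, plus overall accuracy
print(classification_report(y_true, y_pred))
```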