9.3 Cutoff Rate and Probability Table
DM algorithms classify via a two-step process. First, the algorithm estimates the probability that an instance belongs to the class of interest. Next, it compares that probability to the cutoff value and classifies the instance accordingly. If the probability meets or exceeds the cutoff value, the instance is assigned to the class of interest.
The default cutoff is 50%:
If probability of being true >= 0.50, classify as "class of interest"
If < 0.50, classify as "not class of interest"
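The comparison step can be sketched in a few lines of Python. This is only an illustration of the rule above, not how any particular tool implements it; the function name classify and the example probabilities are hypothetical.

```python
def classify(probability, cutoff=0.50):
    """Return the predicted class for one estimated probability."""
    # Step 2 of the process: compare the probability to the cutoff.
    if probability >= cutoff:
        return "class of interest"
    return "not class of interest"


# Hypothetical probabilities, not taken from any real scored dataset.
example_probabilities = [0.00, 0.50, 0.72, 0.49]
predictions = [classify(p) for p in example_probabilities]
print(predictions)
```

Note that a probability exactly equal to the cutoff is assigned to the class of interest, because the rule uses "greater than or equal to".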
For example, below are some records from a scored dataset in Azure ML Studio. The four columns on the left are actual values. The Scored Labels column shows the predicted outcome. The Scored Probabilities column shows the probability calculated by Azure that the instance in the record is poison. The first record has no probability of being poison, so it is classified as edible. The remaining three records in the example have probabilities of being poison at or above the cutoff probability of 0.50, so they are classified as poison.
Typically, the overall error rate is lowest when the cutoff is 0.50, so this is the default cutoff rate in data mining tools. This makes sense if both types of errors have approximately equal costs.
This is rarely true. Typically, one type of error is more expensive than the other, and which one varies from problem to problem. For example, in a breast cancer detection problem, which error is more expensive? Is it more expensive to predict that people who do not have cancer have it (a false positive), or is it costlier to predict that people who in fact have cancer do not have it (a false negative)? This can be determined by thinking about the consequences of each type of error. If people are falsely predicted to have cancer, further tests are typically run to confirm the diagnosis. They incur some uncertainty and the cost of additional testing, but the additional tests will disconfirm the initial prediction. Conversely, if a person who really has cancer is predicted not to have it, they may assume no treatment is necessary, when in fact treatment might stop the cancer from progressing. Clearly this is the more expensive outcome.
In practice, competent medical doctors learn the false positive and false negative rates of tests so that, when a test returns a result, they can make informed decisions about whether to order additional tests. Sometimes a cheap and fast but less reliable test is conducted first. If the results are a cause for concern, a more expensive but more reliable diagnostic test can be ordered.
When a specific outcome is very costly, the analyst may choose to move the cutoff rate away from the default of 50% in order to shift the balance between false positives and false negatives. Consider the table below. It shows the actual class of each instance and the assigned probability that the instance belongs to the class of interest. The instances are sorted from the highest to the lowest probability. The blue line shows what happens if the cutoff is set to the default of 50%. Three out of 24 records, or 12.5%, are misclassified: two false positives and one false negative.
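Counting the two kinds of error at a given cutoff can be sketched as follows. The function count_errors and the ten records are hypothetical and chosen only for illustration; they are not the 24 records from the table.

```python
def count_errors(actual, probabilities, cutoff=0.50):
    """Count false positives and false negatives at a given cutoff.

    actual: 1 for the class of interest, 0 otherwise.
    probabilities: estimated probability of the class of interest.
    """
    false_positives = 0
    false_negatives = 0
    for truth, prob in zip(actual, probabilities):
        predicted = 1 if prob >= cutoff else 0
        if predicted == 1 and truth == 0:
            false_positives += 1
        elif predicted == 0 and truth == 1:
            false_negatives += 1
    return false_positives, false_negatives


# Hypothetical records, sorted from highest to lowest probability.
# These are NOT the records from the table in the text.
actual = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
probabilities = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]

fp, fn = count_errors(actual, probabilities, cutoff=0.50)
error_rate = (fp + fn) / len(actual)
print(fp, fn, error_rate)
```

Records above the cutoff line whose actual class is 0 are the false positives; records below the line whose actual class is 1 are the false negatives.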
The image below shows what happens when the cutoff point is moved to 0.75. The blue line shows this cutoff. Now 6/24, or 25%, of the instances are misclassified: one false positive and five false negatives. This higher overall error rate may be tolerated if false positives are very expensive.
The image below shows what happens when the cutoff point is moved to 0.25. Now 5/24, or 21%, of the records are misclassified: four false positives and one false negative. Again, this higher overall error rate may be tolerated if false negatives are very expensive, as in the breast cancer detection scenario.
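One way to make this trade-off concrete is to attach a cost to each type of error and compare the total cost at each candidate cutoff. The sketch below assumes hypothetical records and hypothetical unit costs (cost_fp, cost_fn); the function names are illustrative, and in a real problem the costs would have to come from the business or clinical context.

```python
def total_cost(actual, probabilities, cutoff, cost_fp, cost_fn):
    """Total misclassification cost at one cutoff."""
    pairs = list(zip(actual, probabilities))
    # False positive: predicted class of interest, actually not.
    fp = sum(1 for truth, prob in pairs if prob >= cutoff and truth == 0)
    # False negative: predicted not class of interest, actually is.
    fn = sum(1 for truth, prob in pairs if prob < cutoff and truth == 1)
    return fp * cost_fp + fn * cost_fn


def best_cutoff(actual, probabilities, cutoffs, cost_fp, cost_fn):
    """Return the candidate cutoff with the lowest total cost."""
    return min(
        cutoffs,
        key=lambda c: total_cost(actual, probabilities, c, cost_fp, cost_fn),
    )


# Hypothetical records (NOT the table in the text).
actual = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
probabilities = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
cutoffs = [0.25, 0.50, 0.75]

# Assumed costs: a false negative is ten times as costly as a false positive.
print(best_cutoff(actual, probabilities, cutoffs, cost_fp=1, cost_fn=10))

# Assumed costs reversed: a false positive is ten times as costly.
print(best_cutoff(actual, probabilities, cutoffs, cost_fp=10, cost_fn=1))
```

With these assumed numbers, expensive false negatives favor the low cutoff and expensive false positives favor the high one, which mirrors the reasoning in the text.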