Graphs for Classifier Evaluation

In addition to the metrics used to evaluate the quality of a classifier, two graphs are useful for the same purpose: the ROC curve and the lift chart.

ROC Curve

The Receiver Operating Characteristic (ROC) curve helps the analyst evaluate how well the data mining algorithm differentiates between true positives (TPs) and false positives (FPs). It provides a way to visualize the tradeoff between the classifier’s ability to correctly identify TPs and its tendency to incorrectly flag records as positive, which produces FPs.

How is the ROC curve constructed? To build the ROC curve, the probability table is sorted by the probability of each record belonging to the class of interest, from the most probable to the least probable record. Then, beginning with the most probable records at the top, each record is checked to determine whether it is a true positive or a false positive. The vertical axis of the ROC curve shows the true positive rate, that is, the proportion of all actual positives found so far. The horizontal axis shows the false positive rate, that is, the proportion of all actual negatives that have been counted as positives so far.

As long as the classifier finds true positives with no false positives, the curve stays at the leftmost position on the horizontal axis and climbs along the vertical axis. For example, if the classifier gets through 58% of the TPs before it finds an FP, the curve hugs the vertical axis until the true positive rate is 0.58. As the classifier encounters false positives, the curve begins to move to the right. For example, if at 0.58 on the vertical axis it then encountered 8% of all false positive records, the curve would move 8% of the way to the right along the horizontal axis. In the ROC curve below, 100% of the true positives had been found at the same time that 36% of the false positives had been encountered. All of the remaining records were then false positives. The ROC curve terminates in the upper right corner once every record, whether a TP or an FP, has been considered.
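The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by JMP or any other package: the function name roc_points and the sample scores and labels are hypothetical, and the sketch assumes no two records share the same probability.

```python
def roc_points(scores, labels):
    """Return a list of (false positive rate, true positive rate) points.

    scores: predicted probability of the class of interest, one per record
    labels: 1 if the record truly belongs to the class of interest, else 0
    Assumption: no tied scores (ties are simply processed one record at a time).
    """
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative record")

    # Sort records from most probable to least probable.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]  # the curve starts in the lower left corner
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1  # true positive: the curve moves up
        else:
            fp += 1  # false positive: the curve moves right
        points.append((fp / total_neg, tp / total_pos))
    return points


# Hypothetical data: the two most probable records are true positives,
# so the curve climbs the vertical axis to a TPR of 2/3 before moving right.
example_points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
for fpr, tpr in example_points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting the returned points with the false positive rate on the horizontal axis and the true positive rate on the vertical axis gives the ROC curve.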

The diagonal line on the ROC curve shows where the curve would fall if TPs and FPs were found at the same rate. That is, if the process of finding TPs and FPs were random, 20% of the TPs would have been found at the point where 20% of the FPs had been found, and so on along the line.

The Area Under the Curve (AUC) is calculated from the ROC curve: after the ROC curve is plotted, the area below it is calculated. This provides an indication of the classifier’s quality. A perfect classifier has an AUC of 1.0. A classifier that does no better than random, following the diagonal line, has an AUC of 0.5. A classifier that finds all FPs before it finds any TPs has an AUC of zero.
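One common way to compute the area is the trapezoid rule applied to the plotted points. The sketch below assumes that approach; the function name auc_trapezoid is hypothetical, and statistical packages may use a different but equivalent calculation.

```python
def auc_trapezoid(points):
    """Area under a curve given as (x, y) points ordered by increasing x.

    For an ROC curve, x is the false positive rate and y is the
    true positive rate. Vertical segments have zero width and add no area.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Three reference curves from the text (coordinates are illustrative).
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # all TPs found before any FP
diagonal = [(0.0, 0.0), (1.0, 1.0)]              # TPs and FPs found at the same rate
worst = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]     # all FPs found before any TP

print(auc_trapezoid(perfect))   # 1.0
print(auc_trapezoid(diagonal))  # 0.5
print(auc_trapezoid(worst))     # 0.0
```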

In the ROC curve below from JMP, about 45% of the TPs were found before any FPs were found, and 100% of the TPs had been found by the point at which about 3% of the FPs had been found. The AUC of 0.9941 shows that the classifier was quite successful in differentiating between TPs and FPs.

Figure 9.11: JMP ROC Curve

Limitations of the ROC Curve

ROC curves are good for evaluating classifier performance in terms of TPs and FPs, but they ignore the benefits of the classifier being correct and the costs of it being incorrect. ROC curves also ignore TNs and FNs. Thus, ROC curves are not sufficient for making management decisions. Conversely, the information in the confusion matrix is complete in the sense that it considers not just TPs and FPs but also TNs and FNs. Because the confusion matrix is complete in this way, it can be mapped to costs and benefits and can therefore support management decisions. This will be discussed more in a later chapter.

Lift Charts

Lift charts reflect how well a classifier does in relation to the base rate. The base rate is the proportion of the overall population that has the characteristic of interest. For example, if 20% of the records in the sample belong to the class of interest, 20% (or 0.20) is the base rate. The base rate reflects how well random sampling would be expected to find records that have the characteristic of interest. Lift is the ratio of the proportion of records belonging to the class of interest among those selected by the classifier to the base rate.

Lift = positive found proportion / base rate

For example, if the proportion of records belonging to the class of interest among those selected by the classifier is three times the base rate, the lift is 3. That is, using the classifier, you would find three times as many records that belong to the class of interest as you would by randomly sampling the same number of records.
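The formula can be checked with a short calculation. The sketch below is only an illustration: the function name lift and the counts are hypothetical, chosen to match the 20% base rate and the lift of 3 used in the text.

```python
def lift(positives_found, records_examined, base_rate):
    """Lift = positive found proportion / base rate."""
    found_proportion = positives_found / records_examined
    return found_proportion / base_rate


# Hypothetical counts: the classifier's top 100 records contain 60 records
# of the class of interest, and the base rate in the whole sample is 20%.
example_lift = lift(positives_found=60, records_examined=100, base_rate=0.20)
print(round(example_lift, 2))  # 3.0
```

Random sampling of 100 records would be expected to find about 20 records of the class of interest, so finding 60 corresponds to a lift of 3.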

To build a lift chart, the probability table is sorted by the probability of each record belonging to the class of interest, from the most probable to the least probable record. The horizontal axis shows the proportion of the records that have been included so far, and the vertical axis shows the lift for those records. Lift diminishes as more records are included in the evaluation. At first, only the records that are most likely to belong to the class of interest are examined. But as more records are included, the probability of them belonging to the class of interest drops, and therefore so does the average lift.
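The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the calculation used by any particular package: the function name cumulative_lift and the sample data are hypothetical, and tied probabilities are not given special treatment.

```python
def cumulative_lift(scores, labels):
    """Return (proportion of records included, lift so far) pairs.

    scores: predicted probability of the class of interest, one per record
    labels: 1 if the record truly belongs to the class of interest, else 0
    """
    n = len(labels)
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("no records belong to the class of interest")

    # Sort records from most probable to least probable.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    curve = []
    found = 0
    for k, (_, label) in enumerate(ranked, start=1):
        found += label
        # Proportion of positives among the top k records, relative to the base rate.
        curve.append((k / n, (found / k) / base_rate))
    return curve


# Hypothetical data: 8 records, 3 of which belong to the class of interest.
example_curve = cumulative_lift(
    [0.95, 0.90, 0.80, 0.70, 0.50, 0.40, 0.20, 0.10],
    [1, 1, 0, 1, 0, 0, 0, 0],
)
for proportion, value in example_curve:
    print(f"{proportion:.3f}  lift={value:.2f}")
```

Once all records are included, the proportion of positives found equals the base rate, so the final lift is 1, which is the same as no lift.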

Continuous Lift Charts

There are two common types of lift charts: continuous lift charts and decile lift charts.

The JMP lift chart below is an example of a continuous lift chart. In this example, when about the 12% of records that are most likely to belong to the class of interest are considered, the lift is about 3.4. By the time the top half of the records that are most likely to belong to the class of interest have been considered, the lift has dropped to about 2.3. By the time all of the records have been considered, there is no lift.

Figure 9.12: JMP lift chart

Decile Lift Chart

The decile lift chart is another form of lift chart. In this chart, lift is calculated by decile. That is, it reflects how well a classifier does in relation to the base rate for each decile.

As with the continuous lift chart described above, the probability table is sorted by the probability of each record belonging to the class of interest, from the most probable records at the top to the least probable records at the bottom. A base rate is calculated that represents the proportion of the overall sample that belongs to the class of interest.

Then the sorted records in the probability table are taken in ten 10% sections, starting from the top decile. That is, the first decile contains the records most likely to belong to the class of interest. The second decile contains the next most probable records. This process continues down to the bottom decile, which contains the records that are least likely to belong to the class of interest.

In the decile lift chart, the deciles are shown along the horizontal axis starting at the first decile and progressing to the tenth decile. The chart shows the lift calculated for each decile.
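The decile procedure can be sketched as follows. This is an illustrative sketch only: the function name decile_lift and the sample data are hypothetical, and for simplicity it assumes the number of records is divisible by ten so that every decile has the same size.

```python
def decile_lift(scores, labels):
    """Return ten lift values, one per decile, top decile first.

    Assumption: len(labels) is a multiple of 10.
    """
    n = len(labels)
    if n == 0 or n % 10 != 0:
        raise ValueError("this sketch needs a record count divisible by 10")
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("no records belong to the class of interest")

    # Sort records from most probable to least probable, keeping only the labels.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ranked_labels = [label for _, label in ranked]

    size = n // 10
    lifts = []
    for d in range(10):
        section = ranked_labels[d * size:(d + 1) * size]
        response_rate = sum(section) / size
        lifts.append(response_rate / base_rate)
    return lifts


# Hypothetical data: 20 records with distinct scores, 4 of the class of interest.
example_scores = [1 - i / 20 for i in range(20)]
example_labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0] + [0] * 10
for decile, value in enumerate(decile_lift(example_scores, example_labels), start=1):
    print(f"decile {decile}: lift={value:.1f}")
```

Each value is the height of one bar in the decile lift chart; a value below 1 means that decile contains fewer records of the class of interest than random sampling would be expected to find.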

Example Decile Lift Chart

Assume that a firm has mailed out 1,000 offers to potential customers. A total of 50 customers responded positively to the offer. Thus, the base rate is 5% (50/1,000). The lift per decile is calculated by dividing the response rate per decile by the base rate.

The summary table used to calculate the lift per decile is shown below. In this example, in decile 1, there are 16 responders out of the 100 mailings (16%). So the lift calculated for the first decile is 16%/5% = 3.2. In the second decile, the lift is 2.4, which corresponds to a response rate of 12%. As successive deciles are evaluated, lift declines. As the process continues, the lift per decile falls below 1.

Figure 9.13: Summary Table for Decile Lift

The corresponding decile lift chart is shown below.

Why Organizations Use Lift Charts

An advantage of both continuous lift charts and decile lift charts is that organizations can use them to decide how to invest their valuable time and money. Managers want to spend these resources only on the instances that are most likely to have the desired outcome. Why? Because it costs money for an organization to find, evaluate, and act on the positive recommendations of the predictive process.

Let us consider two instructive examples. For the first example, consider the marketing campaign that was used as the basis for the example decile lift chart above. It costs money to identify potential customers, perform the analysis, prepare the offer, and contact each potential customer.

Consider another example. The IRS uses predictive algorithms to predict which returns are likely to be fraudulent. It costs money to perform the predictive process. It also costs money to examine each return that is predicted to be fraudulent and to perform actual audits of the returns.

The firm in the first example does not want to spend resources on customers who are unlikely to respond positively to the offer. Likewise, in the tax return scenario, the IRS does not want to spend resources examining and auditing returns that are not likely to be fraudulent. Lift charts provide valuable insight into which instances the organization should spend money on and how many of them are likely to have the payoff the organization seeks.