Evaluating Model Quality for Numeric Predictions

This chapter describes measures of the predictive accuracy of models that predict numbers. The same measures can be used first to evaluate how well a model "fits" the existing data it was built on, and then to measure how well it predicts new "unseen" data. The gap between the two results reflects the amount of overfitting.

The methods of evaluating prediction accuracy for numeric outcome variables explained in this chapter are used for all numeric prediction methods, not just for multiple linear regression.

Predictions and Residuals

After a model is created, it can be used to generate predictions. Models typically do not produce perfect predictions. The difference between the actual value and the predicted value of the outcome variable is called a residual. Error is another name for the residual. It is the part of the actual value of the outcome variable that was not accounted for by the model's prediction, and it is the key component used in measures of model quality.

Figure 6.10: Definition of error

Below is an example of some actual and predicted values of an outcome variable along with the resulting residuals. Notice that some of the residuals are negative and some are positive. Some are small and some are large.

Figure 6.11: Actual, predicted, and residual values
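As a minimal sketch of the definition above, the following Python snippet computes residuals as actual minus predicted. The five actual and predicted values are made up for illustration; they are not the values shown in Figure 6.11.

```python
# Hypothetical actual and predicted values (illustrative only,
# not the values from Figure 6.11).
actual = [10.0, 12.0, 15.0, 11.0, 14.0]
predicted = [11.0, 11.5, 13.0, 12.0, 14.5]

# Residual (error) = actual value - predicted value.
residuals = [a - p for a, p in zip(actual, predicted)]

for a, p, e in zip(actual, predicted, residuals):
    print(f"actual={a:5.1f}  predicted={p:5.1f}  residual={e:5.1f}")
```

In this made-up data the residuals have mixed signs and mixed sizes, which is the pattern described above.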

Measures Based on Averaging Errors

These measures of numeric predictive quality are based on some form of average of prediction residuals. When evaluating the quality of numeric prediction models, at least one of these statistics should be examined along with R-squared because of the blind spots of the R-squared measure.

Mean absolute error (MAE) is the average of the absolute value of the errors. By taking the absolute value of the errors, negative and positive values of errors do not cancel each other out.

This table shows examples of how MAE and the remaining measures in this section are calculated.
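The MAE calculation can be sketched in a few lines of Python. The data below is hypothetical and chosen so that the plain average of the errors is zero, which shows why the absolute value is needed.

```python
# Hypothetical actual and predicted values (illustrative only).
actual = [10.0, 12.0, 15.0, 11.0, 14.0]
predicted = [11.0, 11.5, 13.0, 12.0, 14.5]
errors = [a - p for a, p in zip(actual, predicted)]

# A plain average lets negative and positive errors cancel out.
mean_error = sum(errors) / len(errors)

# MAE: average of the absolute values of the errors.
mae = sum(abs(e) for e in errors) / len(errors)

print("mean error:", mean_error)
print("MAE:", mae)
```

Here the plain mean error is 0 even though every prediction is off, while the MAE is 1.0.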

Mean Squared Error (MSE) is calculated by squaring the errors and taking their average. This calculation of MSE is similar to how variance is calculated. The difference is that variance is a measure of variability around a mean. MSE is a measure of the difference between predicted and actual values of the outcome variable.

Root mean square error (RMSE) is calculated by squaring the errors, taking their average, and then taking the square root of that average. In the example, the average of the squared deviations is 2.2. This is similar to variance around a mean, but these are squared deviations between the actual and predicted values. The square root of this average is the RMSE. The calculation of RMSE is very similar to how standard deviation is calculated. The difference is that standard deviation is a measure of variability around a mean, whereas RMSE is a measure of the difference between predicted and actual values of the outcome variable.
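A short Python sketch of MSE and RMSE follows, using hypothetical values rather than the chapter's example data. Because RMSE is the square root of MSE, it is expressed in the same units as the outcome variable.

```python
import math

# Hypothetical actual and predicted values (illustrative only).
actual = [10.0, 12.0, 15.0, 11.0, 14.0]
predicted = [11.0, 11.5, 13.0, 12.0, 14.5]
errors = [a - p for a, p in zip(actual, predicted)]

# MSE: average of the squared errors.
mse = sum(e ** 2 for e in errors) / len(errors)

# RMSE: square root of the MSE.
rmse = math.sqrt(mse)

print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}")
```

For this data the MAE is 1.0 and the RMSE is slightly larger, because squaring gives the largest error (2.0) more weight than the small ones.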

Mean absolute percentage error (MAPE) is calculated by dividing the absolute value of each error by the actual value of the outcome variable associated with that error. Those ratios are then averaged and converted into a percentage. The advantage of MAPE is that it reflects average error as a percentage, which is very useful for providing a perspective on how large the average error is relative to the size of the actual values.
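The MAPE calculation can be sketched as below. The function name `mape` and the data are illustrative assumptions. The sketch also assumes that no actual value is zero, since the ratio is undefined in that case.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero, because each error is
    divided by its actual value.
    """
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical actual and predicted values (illustrative only).
actual = [10.0, 12.0, 15.0, 11.0, 14.0]
predicted = [11.0, 11.5, 13.0, 12.0, 14.5]

print(f"MAPE = {mape(actual, predicted):.2f}%")
```

For this data the result is roughly 8%, meaning the predictions are off by about 8% of the actual value on average.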

R-Squared

Definition of R-Squared (R2)

R2 is the coefficient of determination, which broadly speaking is a measure of how much of the variance in the outcome variable is explained by the model, given the data. The formulas below show how the sums of squares used to calculate R2 are computed and how R2 is calculated from them.

Figure 6.12: R-squared and Sum of Squares
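For reference, the sums of squares can be written out as follows. This is a reconstruction from the surrounding text rather than a copy of the figure; it assumes the usual notation in which y_i is the actual value, \hat{y}_i is the predicted value, and \bar{y} is the mean of the actual values.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
SSR &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
SSE &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
\end{aligned}
```

The second form of R2 relies on SST = SSR + SSE, which holds for a least-squares linear regression that includes an intercept and is evaluated on the data used to fit it.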

The process of calculating the best fit using linear regression finds the linear equation that produces the smallest difference between all of the observed values and predicted (fitted) values. Linear regression finds the smallest sum of squared residuals that is possible for the given data. A regression model fits the data well if the differences between the actual values and predicted values of the outcome variable are small and unbiased. Unbiased in this context means that the fitted values are not systematically too high or too low anywhere in the observation space.

What is the intuitive meaning of R2? Start with the basic idea of variation. Think of the actual observations of yi as manifestations of a beginning model, the Total model. The R2 value is the proportion of the variation of the total model that is accounted for by using an alternative model, the Regression model.

Consider a typical scatterplot of data points where the value of the predictor variable, X, is on the horizontal axis and the value of the response variable, Y, is on the vertical axis. A regression line of best fit is calculated and plotted on the scatter plot. The mean of the actual values of Y is plotted as a horizontal line where Y is constant.

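The following distances are calculated for each observation: T, the distance between the actual value of Y and the mean of Y; R, the distance between the predicted value of Y on the regression line and the mean of Y; and e, the distance between the actual value of Y and the predicted value of Y.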

The total variation in Y actual is the sum of squared T distances. The total variation of Y predicted is the sum of squared R distances. The total variation of errors is the sum of squared e distances.

The example below shows how the sum of squares is calculated and used to calculate R2.

Figure 6.13: Sum of Squares and R-Squared Example
Figure 6.14: Graphical presentations of components of sum of squares
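The calculation can also be sketched in Python. The five (x, y) pairs below are hypothetical and are not the data in Figures 6.13 and 6.14; `fit_line` is an illustrative helper for a one-predictor least-squares fit.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
y_mean = sum(y) / len(y)

# Per-observation distances.
T = [yi - y_mean for yi in y]               # actual vs. mean
R = [yh - y_mean for yh in y_hat]           # predicted vs. mean
e = [yi - yh for yi, yh in zip(y, y_hat)]   # actual vs. predicted

sst = sum(t ** 2 for t in T)
ssr = sum(r ** 2 for r in R)
sse = sum(err ** 2 for err in e)
r_squared = ssr / sst

print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}  R2={r_squared:.2f}")
```

For this data SST is 6.0, SSR is 3.6, and SSE is 2.4, so R2 is 0.6 and SSR plus SSE equals SST.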

For an individual observation, the length of R is sometimes longer than the length of T, so it is not correct to assume that T is always longer. But in general, when T is longer than R, the magnitude of the difference between T and R is greater than when R is longer than T. Thus, when all observations are considered together, the sum of squares for R is never greater than the sum of squares for T. For a least-squares regression evaluated on the data used to fit it, dividing SSR by SST will therefore always produce a number within the range from zero to one inclusive. When numeric prediction is perfect, the value of R-squared will be equal to one. If there is no predictive value in a continuous numeric prediction model, the value of R-squared will be zero. These are the extreme cases. In most cases the value of R-squared is greater than zero and less than one.
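The following sketch illustrates these points on hypothetical data. The fitted values are the least-squares predictions for y regressed on x = 1 to 5 and are hard-coded to keep the example short; `r_squared` is an illustrative helper that follows the SSR / SST definition used here, so it is only meaningful for least-squares fitted values.

```python
# Hypothetical actual values and their least-squares fitted values
# (fit of y on x = 1, 2, 3, 4, 5: slope 0.6, intercept 2.2).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
y_mean = sum(y) / len(y)


def r_squared(actual, predicted):
    """R2 as SSR / SST, following the definition in the text."""
    mean = sum(actual) / len(actual)
    sst = sum((a - mean) ** 2 for a in actual)
    ssr = sum((p - mean) ** 2 for p in predicted)
    return ssr / sst


# Observations where the R distance is longer than the T distance.
longer_r = [
    i for i, (yi, yh) in enumerate(zip(y, y_hat))
    if abs(yh - y_mean) > abs(yi - y_mean)
]

print("observations with |R| > |T|:", longer_r)
print("R2 of the fitted line:", round(r_squared(y, y_hat), 3))
print("R2 of perfect predictions:", r_squared(y, y))
print("R2 when always predicting the mean:", r_squared(y, [y_mean] * len(y)))
```

Three of the five observations have a longer R than T, yet SSR (3.6) is still smaller than SST (6.0). Perfect predictions give 1, and always predicting the mean gives 0.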

Limitations of R-Squared

Unfortunately, the coefficient of determination, R2, like the correlation coefficient, r, is often misused and misunderstood. Why? Because both R2 and r can provide useful information; they can tell part of the story but they do not tell the whole story. In fact, R2 has significant limitations and blind spots.

Non-Linear Relationships

Both R2 and r were created to quantify the strength of a linear relationship. When the relationship between x and y is curvilinear, it is possible for both R2 and r to be zero, suggesting there is no linear relationship between x and y, even though a perfect curved (or "curvilinear") relationship exists. In the example below, R2, r, and the slope are all zero. Without visual inspection, it would be easy to conclude that there is no relationship between x and y. But that is not true! A strong relationship between x and y exists; it is just not linear. In this example, when a linear model is fit to the data, the result is a perfectly horizontal line: the predicted value of y is 14 regardless of the value of x. This is a clear example of a blind spot of R2. It is also a clear example of underfitting and bias. If, instead of a linear model, the curvilinear model y = x^2 is fit to the data, R2 is 1, or 100%.
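This behavior can be reproduced with a small Python sketch. The data is an assumption: x takes the integer values from -6 to 6 and y = x^2, which is one dataset consistent with the description above (the mean of y is 14), though not necessarily the one used in the original example. `fit_line` and `r_squared` are illustrative helpers.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def r_squared(actual, predicted):
    """R2 as SSR / SST."""
    mean = sum(actual) / len(actual)
    sst = sum((a - mean) ** 2 for a in actual)
    ssr = sum((p - mean) ** 2 for p in predicted)
    return ssr / sst


# Assumed data: a perfect curve, symmetric around x = 0.
x = [float(v) for v in range(-6, 7)]
y = [v ** 2 for v in x]

# Linear model: a horizontal line at the mean of y.
slope, intercept = fit_line(x, y)
linear_pred = [intercept + slope * v for v in x]

# Curvilinear model y = x^2 reproduces the data exactly.
curve_pred = [v ** 2 for v in x]

print(f"slope={slope:.2f}  intercept={intercept:.2f}")
print(f"R2 linear = {r_squared(y, linear_pred):.2f}")
print(f"R2 curve  = {r_squared(y, curve_pred):.2f}")
```

The linear fit has slope 0 and intercept 14, so its R2 is 0, while the curvilinear model has an R2 of 1.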

R2 Can Be Increased Just by Adding Input Variables

The problem with R2 is that it will either stay the same or increase when more input variables are added, even if they have no relationship with the output variable. Obviously, this is not a desirable property of a goodness-of-fit statistic. In contrast, adjusted R-squared adjusts the R-squared statistic for the number of input variables: an input variable that is sufficiently correlated with y increases adjusted R-squared, while a variable without a strong correlation makes adjusted R-squared decrease. That is the desired property of a goodness-of-fit statistic.
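The adjustment can be sketched as follows. The formula is the standard definition of adjusted R-squared, not one taken from this chapter, and the R2 values, sample size, and predictor counts are hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjusted R-squared:
    1 - (1 - R2) * (n - 1) / (n - p - 1)
    where n is the number of observations and p the number of predictors.
    """
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical scenario with 30 observations: adding 8 weak inputs
# raises R2 slightly, from 0.50 (2 inputs) to 0.52 (10 inputs).
small_model = adjusted_r_squared(0.50, n_obs=30, n_predictors=2)
large_model = adjusted_r_squared(0.52, n_obs=30, n_predictors=10)

print(f"adjusted R2,  2 inputs: {small_model:.3f}")
print(f"adjusted R2, 10 inputs: {large_model:.3f}")
```

In this scenario R2 goes up when the extra inputs are added, but adjusted R-squared drops from about 0.46 to about 0.27, signaling that the added inputs contribute little.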

More Examples of R2 Blind Spots

Consider the four datasets below, which were created by Anscombe to illustrate possible pitfalls in linear regression. [Anscombe, F.J. "Graphs in Statistical Analysis." American Statistician 27(1): 17–21].

Figure 6.15: Anscombe Datasets

When a simple linear regression modeler is applied to each of the four datasets, the resulting models are nearly identical. The value of R2 is 0.67, the intercept is 3.00, and the slope coefficient is 0.50. Yet, the scatter plots shown for the four datasets below reflect that each dataset has unique characteristics.

Figure 6.16: Four Scatterplots
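The near-identical fits can be checked with a short Python sketch. It assumes the values of Anscombe's quartet as commonly published from the cited paper, labeled (a) through (d) to match the text; `fit_line` is an illustrative helper that returns the slope, the intercept, and R2 for a one-predictor least-squares fit.

```python
def fit_line(x, y):
    """Least-squares slope, intercept, and R2 for a single predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    slope = sxy / sxx
    # For one predictor, R2 equals the squared correlation.
    return slope, y_mean - slope * x_mean, sxy ** 2 / (sxx * syy)


# Anscombe's quartet, as commonly published (assumed to match Figure 6.15).
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_d = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
datasets = {
    "a": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "b": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "c": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "d": (x_d, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_line(x, y) for name, (x, y) in datasets.items()}
for name, (slope, intercept, r2) in results.items():
    print(f"({name}) slope={slope:.2f}  intercept={intercept:.2f}  R2={r2:.2f}")
```

All four fits round to the same slope, intercept, and R2, which is why the summary statistics alone cannot distinguish the datasets.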

Dataset (a) appears to contain points scattered around an imaginary upward-sloping line. This is what one would typically expect to find in a dataset with a linear relationship between variables. In dataset (b), the points appear to perfectly fit a curve, which would be better fit by a curvilinear model. Dataset (c) looks like it should generate a perfect linear fit except for the one outlying point. If the outlier is removed, R2 would be 1.0. Dataset (d) has no distribution that would support a relationship between y and x by any kind of model. All of the x values are 8 except one. It is this single outlier that dictates the slope and intercept of the fitted line. If, for example, the outlier had a y value of 2.5 instead of 12.5, the slope would be negative instead of positive, because the fitted line would pass through this point no matter where it appeared on the y axis.
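The sensitivity of dataset (d) to its single outlier can be sketched as below. The values are assumed to be those of the commonly published fourth Anscombe dataset, and `fit_line` is an illustrative helper.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Dataset (d), as commonly published: every x is 8 except the outlier at x = 19.
x = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
y_original = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Move only the outlier's y value from 12.5 to 2.5.
y_moved = list(y_original)
y_moved[7] = 2.5

slope_orig, intercept_orig = fit_line(x, y_original)
slope_moved, intercept_moved = fit_line(x, y_moved)

print(f"outlier y=12.5: slope={slope_orig:.3f}, line at x=19 -> {intercept_orig + slope_orig * 19:.2f}")
print(f"outlier y= 2.5: slope={slope_moved:.3f}, line at x=19 -> {intercept_moved + slope_moved * 19:.2f}")
```

In both cases the fitted line passes through the outlier, and moving that one point flips the sign of the slope.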

Anscombe's examples highlight the need to explore and understand the nature of the data before choosing and applying a modeling technique. Dataset (b) needs to have a non-linear modeler applied; in dataset (c) the outlier should be removed before modeling; and since dataset (d) does not suggest any kind of relationship, no modeling at all is recommended.

Because of these limitations, R2 is not a reliable measure of whether a regression model provides an adequate fit to your data. A good model can have a low R2 value. On the other hand, a biased model can have a high R2 value!

How to Avoid Being Misled by R2

The following steps can help you avoid being misled by R2. First, remember that, like the correlation coefficient, R2 is only part of the story. Given its limitations, it should not be used as the sole measure of the predictive power of a model. Other measures of model quality that are based on the prediction errors, like MAE, MSE, RMSE, or MAPE, should be used as well. These additional measures do not suffer from the blind spots of R2.

Second, when more than one or two predictor variables are used, you should also look at adjusted R2. If the adjusted value is significantly smaller than the unadjusted value, this suggests possible overfitting or the inclusion of inputs that do not contribute to the predictive power of the model.

Finally, examine the residual plots produced from the model. Residual plots show the part of the outcome variable not predicted by the model. If a pattern such as bias is visible in the residual plot, it demonstrates that the model was not able to capture some systematic aspect(s) of the relationships in the data.

Are low R2 values in regression models inherently bad?

No! There are two main reasons why it can be just fine to have low R2 values in regression models.

In some fields, because prediction is inherently difficult, it is expected that your R-squared values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R2 values lower than 50%. Humans are simply harder to predict than, say, physical processes. This occurs because human behavior is very variable and because it is impossible to capture all input variables that would lead to better prediction.

Furthermore, it is possible to have a very low R2 value for the overall model but still have statistically significant predictors. When this happens, you can still draw important conclusions about how changes in the predictor values are associated with changes in the response value. Regardless of the R2, the significant coefficients still represent the mean change in the response for one unit of change in the predictor while holding other predictors in the model constant. Obviously, this type of information can be extremely valuable.
