Introduction to MLR

Introduction

Multiple linear regression (MLR) attempts to model the relationship between two or more explanatory variables and an outcome variable. Explanatory variables are also called predictor variables. In MLR, the outcome variable is explained by fitting a linear equation to the observed data. Every combination of predictor values x is associated with a value of the outcome variable y.

Regression models are extensions of the classic y = mx + b equation for a straight line, where y is the value of the outcome variable, m is the slope of the relationship between x and y, and b is the intercept. This reading uses different notation for the same linear equation, but it is the same concept. The figure below shows the notation for a single-predictor regression model. It also extends the linear model to include multiple predictors instead of just one.

Figure 6.1: Components of Multiple Linear Regression Equation
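To make the notation concrete, the short sketch below fits a two-predictor linear model in Python using scikit-learn. The numbers and variable names are hypothetical and chosen only for illustration.

# A minimal sketch of a two-predictor regression model, using scikit-learn.
# The data below are hypothetical and used only for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0],   # columns are the predictors x1 and x2
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.1, 5.8, 11.2, 10.9, 14.8])   # outcome variable

model = LinearRegression().fit(X, y)

# The fitted equation has the same components as the figure:
# predicted y = intercept + (coefficient for x1) * x1 + (coefficient for x2) * x2
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)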

An important difference between a perfect line and the models we usually create is that our models typically do not fit the data perfectly. Instead, the observed values are scattered around the line, and we hope they are approximated well by linear relationships between the outcome and predictor variables. The part of each outcome value that is not explained by the systematic portion of the model is called the error or residual. That is, the residual is what is left over after the model has explained what it can.
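As a small illustration (again with hypothetical data), the residuals are simply the observed outcome values minus the values the fitted model predicts:

# Residuals are the part of the outcome the fitted model does not explain.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # a single predictor
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3])            # observed outcomes

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)      # systematic portion of the model
residuals = y - y_hat         # what is left over
print(residuals)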

Inferential Statistics vs. Predictive Modeling

Multiple linear regression has a long history of being used in inferential statistics. MLR models used in inferential statistics are designed primarily to explain rather than predict. The main purpose of creating MLR models in inferential statistics is to fit a model that accurately explains the relationship between the predictors and the outcome variable. The goal is to provide accurate descriptions of the regression coefficients and to estimate the confidence intervals around those coefficients.
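A minimal sketch of this inferential workflow, assuming hypothetical simulated data and the statsmodels library, looks like this: the model is fit on all of the data, and the focus is on the coefficient estimates and their confidence intervals.

# Inferential workflow sketch: fit on all the data, then examine the
# coefficients and their 95% confidence intervals (hypothetical data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                   # two predictors
y = 3.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=100)

results = sm.OLS(y, sm.add_constant(X)).fit()   # add_constant adds the intercept
print(results.params)       # estimated intercept and coefficients
print(results.conf_int())   # confidence intervals around those estimates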

Conversely, the primary purpose of predictive modeling is to create a model that does a good job of predicting the outcome variable. Explanation is an important but secondary objective. It is more important to come up with accurate predictions than to provide exact estimates of the coefficients and confidence intervals for those coefficients. This is not to say that useful estimates of the coefficients are unimportant in predictive modeling, but obtaining them is a secondary benefit of the modeling process.

In inferential statistics, there are a number of assumptions that the data must meet to be able to make valid estimates of confidence intervals around regression coefficients. Failure to meet these assumptions means that you cannot make valid inferences about the width of the confidence intervals. In predictive modeling, we typically do not define confidence intervals around the regression coefficients. Therefore, we are less concerned about making sure that every assumption is met.

Finally, with inferential statistics, you never know whether your coefficient estimates are correct because you use all of your data to train the model. Inferential statistics were developed at a time when data were scarce and expensive. Therefore, the objective was to use a small sample and make educated estimates of what the actual coefficients are for the larger population.

With inferential statistics you do not use a separate validation dataset that contains "unseen data" to assess the degree of overfitting. Thus, we do not know whether our models predict accurately. Conversely, in predictive modeling we use unseen data to assess the degree of overfitting. We measure accuracy directly by predicting outcome values and then seeing how well the predicted outcome values match the actual outcome values.
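A minimal sketch of the predictive workflow, again with hypothetical simulated data and scikit-learn, holds out a portion of the data as unseen data and measures how well the predictions match the actual outcomes:

# Predictive workflow sketch: train on one portion of the data, then check
# accuracy on held-out, unseen data (hypothetical simulated data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.4, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("R-squared on training data:", r2_score(y_train, model.predict(X_train)))
print("R-squared on unseen data:  ", r2_score(y_test, model.predict(X_test)))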

Assumptions of MLR Models

Linear relationships. A linear relationship is assumed between the outcome variable and the predictor variables. When this is true and when those relationships are strong, MLR is often a very good modeling approach. Sometimes this assumption that the relationships are linear is valid and sometimes it is not. If the relationship between the predictors and the outcome variable is not approximately linear, regression models may not produce the most accurate predictions, and other modeling methods that can approximate non-linear relationships may be better.

MLR models are additive. This means that the contributions of the different variables are added together to produce the predicted value of the outcome variable.

Figure 6.2: Additive nature of components in multiple regression models
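The additive structure can be sketched directly in code. The intercept and coefficients below are hypothetical; the point is only that each predictor contributes a separate term, and the terms are summed.

# Additive structure of an MLR prediction (hypothetical coefficients).
intercept = 10.0
b1, b2 = 2.5, -1.2        # coefficients for predictors x1 and x2
x1, x2 = 4.0, 3.0         # one observation's predictor values

contribution_1 = b1 * x1  # 10.0
contribution_2 = b2 * x2  # -3.6

y_hat = intercept + contribution_1 + contribution_2   # 10.0 + 10.0 - 3.6
print(y_hat)              # 16.4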

Relationships are compensatory. That is, a high value of one predictor can compensate somewhat for a low value of another variable. The example below reflects how a newer vehicle can compensate somewhat for high mileage, and, to a lesser extent, how low mileage can compensate for an older vehicle. Thus, one variable can to some extent compensate for the other.

Figure 6.3: Compensatory nature of multiple regression models
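To sketch the compensatory idea, the code below uses hypothetical coefficients for a vehicle-price model. A newer vehicle with high mileage and an older vehicle with low mileage end up with similar predicted prices, because the strength of one predictor offsets the weakness of the other.

# Compensatory behavior sketch with hypothetical vehicle-price coefficients.
intercept = 30000.0
coef_mileage = -0.10     # dollars lost per mile driven
coef_age = -1500.0       # dollars lost per year of age

def predicted_price(mileage, age):
    return intercept + coef_mileage * mileage + coef_age * age

print(predicted_price(mileage=90000, age=2))   # newer, high mileage -> 18000.0
print(predicted_price(mileage=20000, age=7))   # older, low mileage  -> 17500.0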

Absence of strong collinearity between predictor variables. Multicollinearity exists whenever two or more of the predictors in a regression model are moderately or highly correlated. When there is a high correlation between predictor variables, they are collinear. This can distort the estimates of the regression coefficients. In effect, collinear variables make it difficult for the MLR algorithm to differentiate the contribution of each variable, because the predictors are so highly correlated that their relative contributions can be impossible to estimate accurately.
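One common diagnostic for collinearity is the variance inflation factor (VIF); values well above roughly 5 to 10 are usually taken as a warning sign. The sketch below uses statsmodels and deliberately constructs two nearly identical (and therefore collinear) hypothetical predictors.

# Collinearity check sketch using variance inflation factors (VIF).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1 -> collinear
x3 = rng.normal(size=200)                   # unrelated predictor

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["intercept", "x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))   # x1 and x2 will be very large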

Homoscedasticity. This means constant or uniform variance. Heteroscedasticity exists when the variability of the outcome variable increases or decreases as the value of one or more predictors increases. The diagram below shows two examples of heteroscedasticity. The graphs include a line of best fit, which is the best estimate of the outcome variable. As the variable X increases, the amount of the outcome variable left unexplained by the line of best fit changes. In the example on the left, as X increases, the variability in Y increases; the opposite is true of the example on the right. By implication, a linear predictive model will have difficulty being accurate in these situations. It would be better to use a modeling method that can take into account how the variability changes as the value of the predictor variable increases.

Figure 6.4: Examples of Heteroscedasticity

There are two easy ways to determine whether heteroscedasticity exists. One way is to create a scatter plot of the outcome variable against the predictor variable. The other way is to plot the errors (residuals) against the estimated values of the outcome variable. In the residual plot on the left in the graphic below, the error variance is essentially constant. Conversely, in the residual plot on the right, the error variance increases, which shows that heteroscedasticity exists.

Figure 6.5: Residuals that are and are not homoscedastic
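The residual plot itself is straightforward to create. The sketch below simulates hypothetical data whose variability grows with the predictor (and is therefore heteroscedastic), fits a linear model with scikit-learn, and plots the residuals against the predicted values using matplotlib; the fan shape it produces is the telltale sign.

# Residual plot sketch for checking homoscedasticity (hypothetical data whose
# noise grows with x, so the residuals fan out).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 200)
y = 2.0 * x + rng.normal(scale=0.5 * x)     # error variance increases with x

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_hat = model.predict(x.reshape(-1, 1))

plt.scatter(y_hat, y - y_hat, s=10)
plt.axhline(0, color="gray")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()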