Create MLR Model

Understanding Dummy Variables

Not all modeling methods use dummy variables to represent categorical input variables, but MLR requires them. A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of a categorical effect that may be expected to shift the outcome. Because regression estimates coefficients numerically, it must work with numbers, so each category has to be converted into a number before its effect on the outcome can be determined.

The graphic below shows how categorical values can be converted to binary dummy variables. Each category becomes its own column of data, where a 1 indicates the presence of that category and a 0 indicates its absence. One category's column may then be removed. Notice how this works with two categories. In the example on the left, the column shows whether a car has an automatic transmission (Yes or No). The presence can be adequately represented by a single column where 1 means yes and 0 means no.

Figure 6.6: Converting categories to dummy variables

In the example on the right, there are three types of fuel that may power a car: diesel, petrol (gasoline), or compressed natural gas (CNG). Only two dummy variable columns are necessary because the third category can be derived from the other two: if both columns contain a 0, the car must belong to the omitted category.
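One common way to create dummy variables is the pandas get_dummies function. The following is a minimal sketch; the data and column names are hypothetical, mirroring the two examples in the figure:

```python
import pandas as pd

# Hypothetical car data mirroring the figure's two examples.
cars = pd.DataFrame({
    "automatic": ["Yes", "No", "Yes"],
    "fuel": ["diesel", "petrol", "CNG"],
})

# drop_first=True omits one redundant column per variable:
# "automatic" collapses to a single 0/1 column, and "fuel" yields
# two columns (a car with zeros in both runs on the omitted fuel).
dummies = pd.get_dummies(cars, columns=["automatic", "fuel"],
                         drop_first=True, dtype=int)
print(dummies)
```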

When one category is omitted in this way, the regression creates estimates for two of the categories. The omitted one is absorbed into the intercept value of the regression equation; when one of the other categories is present, its coefficient represents the difference in value from that default. For example, assume that dummy variables are created for diesel and CNG but not gasoline. Also assume that a diesel-powered car is worth $1000 more than a gasoline car and a CNG-powered car is worth $1400 more. MLR will fold the value of a gasoline-powered car into the intercept. Thus, if neither dummy variable is 1, the intercept reflects the value of a gasoline-powered car. If the diesel dummy variable is 1, $1000 is added to the price of the car on top of the value included in the intercept. Likewise, if the CNG dummy variable is 1, $1400 is added to the price of the car.
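The arithmetic can be sketched directly. The coefficient values below are the hypothetical $1000 and $1400 premiums from the example, with an assumed base price for the gasoline car:

```python
# Hypothetical fitted equation: the intercept is the predicted price of the
# omitted category (gasoline); each dummy adds its premium on top of that.
INTERCEPT = 20000      # assumed value of a gasoline-powered car
COEF_DIESEL = 1000     # diesel premium from the example
COEF_CNG = 1400        # CNG premium from the example

def predicted_price(diesel, cng):
    return INTERCEPT + COEF_DIESEL * diesel + COEF_CNG * cng

print(predicted_price(diesel=0, cng=0))  # 20000: gasoline car
print(predicted_price(diesel=1, cng=0))  # 21000: diesel car
print(predicted_price(diesel=0, cng=1))  # 21400: CNG car
```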

Managing the Number of Dummy Variables

The rule of thumb for MLR is that you should have at least 10 times as many observations (records) as independent variables. For a categorical variable, the number of dummy variables is n-1, where n is the number of unique categories in the variable. So, for example, a categorical variable with six categories requires 5 dummy variables, and one with 30 categories requires 29. Large numbers of categories can thus quickly result in a prohibitive number of dummy variables. This needs to be managed so that too many dummy variables are not created.
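This bookkeeping is easy to automate. The sketch below counts the dummy variables a set of categorical columns would generate and applies the 10x rule of thumb; the data and column names are hypothetical:

```python
import pandas as pd

def observations_needed(df, categorical_cols):
    """Apply the 10x rule of thumb: each categorical column contributes
    (number of unique categories - 1) dummy variables."""
    n_dummies = sum(df[col].nunique() - 1 for col in categorical_cols)
    return 10 * n_dummies

# Hypothetical data: 6 body styles -> 5 dummies; 30 colors -> 29 dummies.
df = pd.DataFrame({
    "body_style": [f"style_{i % 6}" for i in range(100)],
    "color": [f"color_{i % 30}" for i in range(100)],
})
print(observations_needed(df, ["body_style", "color"]))  # 340
```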

Figure 6.7: Avoid too many dummy variables

There are a number of ways to keep the number of dummy variables at a reasonable level. The number of categories can be reduced by combining like categories. You can also determine which categories make a difference to the outcome and which do not. Those that do not can be combined, and those that do can also be combined if they have the same effect on the outcome variable. You can also determine which categorical variables do not help improve predictions. For example, the color of a car's paint may not influence price; if not, color can be eliminated as a predictor variable. This example is instructive. Consider how many different car colors actually exist. There are dozens of common colors alone, such as blue, red, and black. When variations of these colors are included, such as the many different shades of red, there are easily hundreds of colors.
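Combining like categories can be done with a simple mapping. In the sketch below, the color names and groupings are hypothetical; any unmapped value is lumped into an "other" category:

```python
import pandas as pd

# Hypothetical paint colors collapsed into a few broad groups.
colors = pd.Series(["crimson", "scarlet", "navy", "sky blue", "onyx", "pearl"])

color_groups = {
    "crimson": "red", "scarlet": "red",
    "navy": "blue", "sky blue": "blue",
    "onyx": "black",
}

# Anything not in the mapping becomes "other".
grouped = colors.map(color_groups).fillna("other")
print(grouped.tolist())  # ['red', 'red', 'blue', 'blue', 'black', 'other']
```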

Select Input Variables

When creating models, it is best to build the simplest model that provides adequate prediction. Occam's (or Ockham's) razor is a principle attributed to the 14th-century English logician William of Ockham. It states that it is best to prefer simpler explanatory theories until simplicity results in a loss of explanatory power. Suppose two explanations exist for an occurrence; the simpler one that explains the occurrence is usually better. Another way of saying this is that the more assumptions an explanation requires, the more unlikely it is.

When creating models, simple models can sometimes be quite predictive. If adding more input variables provides little or no benefit, those predictors should be omitted. The analyst should determine whether a model can be created with fewer variables.
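One way to make this determination is to fit a full model and a reduced model and compare their adjusted R-squared values. The sketch below uses statsmodels with simulated data; the variable names and effect sizes are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data: price depends on age and mileage; "color_red" is noise.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.uniform(1, 10, n),
    "mileage": rng.uniform(10, 150, n),
    "color_red": rng.integers(0, 2, n),
})
y = 25000 - 1200 * X["age"] - 50 * X["mileage"] + rng.normal(0, 500, n)

full = sm.OLS(y, sm.add_constant(X)).fit()
reduced = sm.OLS(y, sm.add_constant(X[["age", "mileage"]])).fit()

# If the simpler model predicts about as well, prefer it (Occam's razor).
print(full.rsquared_adj, reduced.rsquared_adj)
```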

Evaluating Coefficients

Regression coefficients represent how much the outcome variable will change with a one-unit increase or decrease in a predictor variable. In effect, they are slopes: each coefficient is the average change in Y for a one-unit change in X, holding the effects of the other input variables constant.
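A brief sketch of this interpretation, using simulated data in which price falls by about $1200 for each additional year of a car's age (all values are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with a known slope of -1200.
rng = np.random.default_rng(1)
age = rng.uniform(1, 10, 100)
price = 25000 - 1200 * age + rng.normal(0, 500, 100)

model = sm.OLS(price, sm.add_constant(age)).fit()

# params[0] is the intercept; params[1] is the slope: the average change
# in price for a one-unit (one-year) change in age.
print(model.params)  # intercept near 25000, slope near -1200
```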

Figure 6.8: How to interpret regression coefficients

The importance of a regression coefficient is determined by whether it is statistically significant, which is reflected by its p-value. The p-value measures the likelihood that the coefficient is really zero, that is, that the variable has no effect on the outcome.

Figure 6.9: How to interpret p-value of regression coefficients

A variable is considered significant if its p-value is .05 or less. If the coefficient of an input variable is not statistically significant, the variable should be removed from the model and model quality re-evaluated. If the model's predictive quality is as good or better without the variable, the variable should be excluded. Sometimes excluding insignificant variables will increase model quality, and most of the time dropping them does not reduce overall model quality. However, there are exceptional cases where dropping a non-significant variable reduces the predictive quality of the model; in that case, the predictor should be added back, unless the improvement is too small to justify doing so.
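The sketch below illustrates this workflow with statsmodels: fit the model, inspect the p-values, drop any predictor above .05, and compare adjusted R-squared before and after. The data is simulated and the variable names are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data: "noise" has no real effect on the outcome.
rng = np.random.default_rng(2)
n = 200
X = pd.DataFrame({
    "age": rng.uniform(1, 10, n),
    "noise": rng.normal(0, 1, n),
})
y = 25000 - 1200 * X["age"] + rng.normal(0, 500, n)

full = sm.OLS(y, sm.add_constant(X)).fit()
print(full.pvalues)  # "noise" should have a p-value well above .05

# Keep only predictors whose p-value is .05 or less, then re-evaluate.
keep = [col for col in X.columns if full.pvalues[col] <= 0.05]
reduced = sm.OLS(y, sm.add_constant(X[keep])).fit()
print(full.rsquared_adj, reduced.rsquared_adj)  # should be nearly equal
```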