7.3 Related input variables
Collinearity and Multi-Collinearity
This section will answer the following questions: What is multi-collinearity? Why is multi-collinearity important? How can it be measured? What can be done about it?
What is collinearity?
Collinearity occurs when there is a strong bivariate correlation between two input variables (IVs). Multi-collinearity occurs when there are multiple strong bivariate correlations among IVs that together would predict a lot of the variance in another IV.
In Example 1, there is a single high bivariate correlation between two IVs.
In Example 1, IV2 and IV3 are so strongly related that you could predict most of IV2 with IV3 or vice versa. In this case, we should drop IV2 because it has a weaker correlation with the outcome variable than IV3.
In Example 2, there is a strong negative bivariate correlation between IVs. This is also collinearity. The problem is just as bad as with strong positive correlations between input variables.
Here again, we would drop IV2.
In Example 3, we see multi-collinearity, which occurs when there are multiple high bivariate relationships between IVs that together would predict a lot of variance in another IV.
What should be done in Example 3? IV3 is strongly collinear with both IV1 and IV2, but IV1 and IV2 are not collinear. By dropping IV3, we can keep both IV1 and IV2.
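Because collinearity and multi-collinearity show up as strong bivariate correlations, the simplest check is a correlation matrix of the input variables. Below is a minimal sketch in Python using pandas; the data frame and the column names (IV1, IV2, IV3, Outcome) are hypothetical, and the 0.90 threshold is only an illustration, not a fixed rule.

```python
import pandas as pd

# Hypothetical toy data: IV2 and IV3 are strongly related; IV1 is not.
df = pd.DataFrame({
    "IV1":     [2, 9, 4, 10, 3, 8],
    "IV2":     [1, 3, 6, 6, 9, 11],
    "IV3":     [1, 3, 5, 7, 9, 11],
    "Outcome": [2, 4, 8, 10, 14, 17],
})

ivs = ["IV1", "IV2", "IV3"]

# Pairwise (bivariate) correlations among the input variables.
iv_corr = df[ivs].corr()
print(iv_corr.round(2))

# Flag IV pairs whose correlation is strong in either direction.
threshold = 0.90
for i, a in enumerate(ivs):
    for b in ivs[i + 1:]:
        r = iv_corr.loc[a, b]
        if abs(r) >= threshold:
            # Rule of thumb: keep the IV with the stronger correlation
            # to the outcome, and drop the other.
            keep = a if abs(df[a].corr(df["Outcome"])) >= abs(df[b].corr(df["Outcome"])) else b
            drop = b if keep == a else a
            print(f"{a} and {b} are collinear (r = {r:.2f}); consider dropping {drop}")

# With this toy data, only the IV2/IV3 pair is flagged,
# and IV2 is the one suggested for dropping.
```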
Problems caused by Collinearity
- Collinearity causes the β weights for collinear or multi-collinear input variables to be poorly estimated. This can lead to misleading interpretations when using the regression coefficients.
- Collinearity causes regression coefficients to be unstable ("bouncing betas"). Drawing a different sample can lead to big changes in the estimates of the coefficients. Also, adding or removing a variable causes a large change in the coefficients. (The sketch after this list illustrates this.)
- Collinearity can lead to bizarre results, such as a large R² while none of the individual beta weights is statistically significant. Also, the beta weight estimates can have the opposite sign from what you would expect based on the correlations in the correlation matrix.
- Collinearity can confuse predictive data mining algorithms other than MLR as well, leading to confusion over whether collinear IVs represent one or multiple input variables. Should influence be ascribed to one or all of the collinear variables?
- Collinearity can also cause problems for data mining software. Very high collinearity (r ≥ .90) can cause DM software to be unable to compute some algorithms because the algorithms cannot complete the math. When this happens, the software may fail or warn you that you have a collinearity problem.
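To make the "bouncing betas" point concrete, here is a brief simulation sketch. It is not from a specific dataset: two hypothetical IVs are generated to be nearly identical, the outcome actually depends only on IV1, and refitting the regression on different samples shows the coefficient estimates swinging widely.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_betas(n=50):
    # Two highly collinear IVs: IV2 is IV1 plus a little noise.
    iv1 = rng.normal(size=n)
    iv2 = iv1 + rng.normal(scale=0.05, size=n)
    # The outcome truly depends only on IV1.
    y = 2.0 * iv1 + rng.normal(scale=0.5, size=n)
    X = np.column_stack([np.ones(n), iv1, iv2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]  # weights for IV1 and IV2

# Refit on several different samples; the weights bounce around
# even though the underlying relationship never changes.
for _ in range(5):
    b1, b2 = fit_betas()
    print(f"beta_IV1 = {b1:6.2f}   beta_IV2 = {b2:6.2f}")
```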
Correlation Ranges associated with Problems
The severity of problems caused by collinearity depends on the strength of the correlation. The following image shows ranges associated with distorted regression coefficients and problems caused for data mining tools.
What to do when collinearity exists
The analyst has three primary options when collinearity is present.
- Use collinear input variables anyway. If your main goal is to optimize prediction and not to get accurate estimates for regression coefficients, you can use collinear input variables, provided the collinearity problem is not so bad that it prevents the data mining software from running. But it is important to communicate to possible users of the model that the regression coefficients are not well estimated. If the regression weights are likely distorted, inform the decision makers of that fact so they do not make bad decisions based on these estimates.
- Drop one or more collinear input variables. A good rule of thumb is to retain the IV that has the strongest correlation with the outcome variable. Drop the other(s). We do this a lot in this class. Always check for collinearity when evaluating a dataset.
- Combine collinear input variables. There are multiple ways to combine information from input variables. An in-depth treatment of this topic is beyond the scope of this class, but here is a brief summary of what can be done. If you have multiple indicators of the same variable (e.g., two IQ tests) on a similar scale, consider averaging them or adding them together. If the IVs are on different scales (e.g., one is 1-to-5 and one is 1-to-7), adding or averaging the raw scores does not make sense, but you can standardize the scores (as standard deviations from the mean) and average the standardized scores. Another option is to use principal components analysis to extract the information from multiple variables into fewer synthesized variables. This can be an effective way to reduce related variables to fewer factors. The cost is that it becomes difficult to explain what the factors represent.
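To make the combining option concrete, here is a brief sketch of two collinear indicators on different scales being merged: standardize each to z-scores and average them, or compress them with principal components analysis. The column names (IQ_test_A, IQ_test_B) are hypothetical, and the PCA step assumes scikit-learn is available.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical collinear indicators on different scales (e.g., 1-to-5 and 1-to-7).
df = pd.DataFrame({
    "IQ_test_A": [3, 4, 2, 5, 4, 3],
    "IQ_test_B": [5, 6, 3, 7, 6, 4],
})

# Option 1: standardize each IV (z-scores), then average the standardized scores.
z = (df - df.mean()) / df.std()
df["IQ_combined"] = z.mean(axis=1)

# Option 2: principal components analysis compresses the collinear IVs
# into a single synthesized variable (the first principal component).
pca = PCA(n_components=1)
df["IQ_pc1"] = pca.fit_transform(z).ravel()

print(df.round(2))
```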
An important implication of this discussion of collinearity is that you should do a collinearity analysis before any predictive data mining exercise. It should be a standard step in the data preparation phase. Then you can make an informed decision about whether you have collinearity and determine what to do about it.
Related Categorical variables
Sometimes categorical variables are not independent. In other words, one categorical variable will determine or match categories in one or more other categorical variables. When this occurs, it is sometimes referred to as a singularity problem. Such dependencies between input variables will confuse machine learning algorithms.
Assume you are trying to predict home prices in a specific county. The dataset has a number of nominal (categorical) data columns, including zip code, city name, school district, high school name, elementary school name, and junior high school name. All of these have very specific deterministic relationships to one another. What does this mean? By knowing one, you know the corresponding value, or a small number of possible values, in another categorical variable.
In a small city, a city name usually corresponds to only a single zip code, so if you know the zip code, you know the city name or vice versa. In a large city, a city name typically corresponds to a small number of zip codes, so if you know the city name, you know the zip code is one of a small set of codes. Likewise, a high school name can correspond to a specific city or zip code; or, in a larger city or with school choice, a single city may have multiple high schools, and so forth. So these input variables are not independent.
Is this information valuable? It might be if there are systematic relationships between the finer-grained information and the outcome variable, price. For example, if specific zip codes within the cities have higher- or lower-priced homes, zip code could be better than city. If some high schools correspond to richer or poorer parts of a city, high school might be a better predictor of price than city or zip code.
You cannot use all of the related categories because doing so will likely confuse your machine learning algorithm. So which categorical variable has the best degree of specificity as it relates to price: city, zip code, or high school? We need to answer this question by testing how helpful each related input variable is for our objective of predicting price.
Ideally, the information in different input variables is independent. It is independent when the values in one column do not depend on the values in the other column.
This problem occurs when different input variables contain redundant information or when the values in one input variable determine the values in a different input variable. When data has deterministic relationships between categorical input variables, it creates problems for data mining algorithms, and the problem is especially severe for MLR. The problem arises because the learning algorithm tries to estimate the independent effect of each input variable on the outcome variable, but the information in the input variables is not independent.
In this example, if you create an MLR model, you will need to choose which of the related data columns to use; you should not use all of them.
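One simple way to test whether one categorical column determines another is to count how many distinct values of the second column appear within each value of the first. Below is a sketch with hypothetical city, zip_code, and high_school columns; a result of 1.0 means knowing the first column fully determines the second.

```python
import pandas as pd

# Hypothetical housing data; the names and values are illustrative.
df = pd.DataFrame({
    "city":        ["Springfield", "Springfield", "Shelbyville", "Shelbyville", "Ogden"],
    "zip_code":    ["11111", "11111", "22222", "22233", "33333"],
    "high_school": ["Springfield HS", "Springfield HS", "Shelbyville HS", "Shelbyville HS", "Ogden HS"],
})

def dependency(df, a, b):
    """Average number of distinct values of column b per value of column a.
    A value of 1.0 means knowing a fully determines b."""
    return df.groupby(a)[b].nunique().mean()

print(dependency(df, "city", "zip_code"))     # ~1.33: city nearly determines zip code
print(dependency(df, "zip_code", "city"))     # 1.0: zip code fully determines city
print(dependency(df, "city", "high_school"))  # 1.0: city determines high school here
```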
Also, if a data column contains only one value that applies to all instances, it cannot be predictive. For example, if you are estimating housing prices for Boston, Massachusetts, all homes in Boston have the same state name. Thus, state cannot be differentially predictive for any given home, and it should not be used as an input variable in this situation. Some data mining tools will tell you that this is a problem; some will not.
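A quick check for single-valued columns can catch this before modeling. The sketch below uses hypothetical column names for a Boston-only dataset, where state is constant across all rows.

```python
import pandas as pd

# Hypothetical Boston-only housing slice; every row shares the same state.
df = pd.DataFrame({
    "price":        [410000, 530000, 365000],
    "neighborhood": ["Back Bay", "Fenway", "Dorchester"],
    "state":        ["MA", "MA", "MA"],
    "zip_code":     ["02116", "02115", "02121"],
})

# Columns with a single distinct value carry no predictive information.
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
print(constant_cols)  # ['state'] -- drop these before modeling
df = df.drop(columns=constant_cols)
```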