3.5 To Remove or Not to Remove Outliers
Knowing When to Remove Outliers
When a potential outlier is found, the analyst must decide whether the observation should be fixed, removed, or retained. Typically, we fix or omit records that contain obvious data errors. But outliers are not necessarily data errors. When they are not, judgment must be used to decide whether to include or exclude them. For example, suppose you find that a few young people earn very high incomes. Should these individuals be included in your analysis? It depends on what you are trying to accomplish. If you want to create a model of typical cases, it may make sense to exclude them, because they are so unusual that they could distort machine learning algorithms trying to model typical income earners. On the other hand, if your purpose is simply to describe the population of income earners accurately, it may make sense to include extreme outliers.
We have not yet covered how to run and evaluate predictive models, but when outliers are present, separate models should be run with and without them. The models can then be compared to see whether removing the outliers improves predictive performance. If outliers are removed to improve the model, document that fact and describe the removed outliers to relevant decision makers.
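The with-and-without comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the age and income records are made up for the example, the outlier flags are assumed to have been assigned in an earlier step, and a one-predictor least-squares line stands in for whatever predictive model you would actually use. The helper names (fit_line, mean_abs_error) are hypothetical, not part of any library.

```python
from statistics import mean

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_abs_error(model, xs, ys):
    """Average absolute gap between predicted and actual values."""
    slope, intercept = model
    return mean(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))

# Made-up training records: (age, income in $1000s, flagged as outlier).
train = [
    (22, 30, False), (25, 36, False), (28, 41, False), (31, 47, False),
    (35, 55, False), (40, 64, False), (45, 72, False), (50, 83, False),
    (23, 400, True), (24, 520, True),  # young, very high earners
]

# Made-up held-out records representing typical earners.
test_ages = [27, 33, 42, 48]
test_incomes = [40, 52, 68, 79]

# Model 1: trained on every record, outliers included.
model_with = fit_line([r[0] for r in train], [r[1] for r in train])

# Model 2: trained only on records that were not flagged.
kept = [r for r in train if not r[2]]
removed = [r for r in train if r[2]]
model_without = fit_line([r[0] for r in kept], [r[1] for r in kept])

# Compare both models on the same held-out typical cases.
mae_with = mean_abs_error(model_with, test_ages, test_incomes)
mae_without = mean_abs_error(model_without, test_ages, test_incomes)
print(f"MAE with outliers:    {mae_with:.2f}")
print(f"MAE without outliers: {mae_without:.2f}")

# Record what was removed so it can be reported to decision makers.
for age, income, _ in removed:
    print(f"Removed outlier: age={age}, income={income}k")
```

Both models are scored on the same held-out typical cases, which matches the goal of modeling typical income earners. If the goal were instead to describe the full population, the evaluation set would need to include the extreme earners as well, and the comparison could come out differently.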