3.6 Outliers Summary
Outliers are observations that are extreme or distant from the overall pattern of the data. “Distant,” however, is a subjective term and should be judged using prior knowledge of the data and the main objectives at hand. The purpose of identifying and removing outliers is to help our models perform more accurately and avoid distorted results. Detecting outliers can also be the main purpose of a model, a task known as anomaly detection.
The first step is to visualize the data before building predictive models. Many outliers can be detected through visual inspection of histograms, box plots, scatter plots, and density plots. Multivariate statistical methods for outlier detection can also be used to confirm visual analysis or to reveal outliers that are not obvious through visual inspection. In this way, many data errors and outliers can be identified.
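As a numeric companion to the box plot, the whisker rule that box plots conventionally draw can be sketched in a few lines. This is a minimal illustration, not a method prescribed by this chapter: the function name iqr_outliers, the multiplier k = 1.5, and the ages list are all assumptions made for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR].

    This is the conventional box-plot whisker rule; k=1.5 is a common
    default, not a universal threshold.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data: nine plausible ages and one suspicious entry.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 120]
flagged_ages = iqr_outliers(ages)
print(flagged_ages)
```

A value flagged this way is only a candidate. Whether it is an error or a legitimate extreme still depends on knowledge of the data.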
Logical analysis from both the univariate perspective and the multivariate perspective should be performed to determine whether suspected outliers are impossible or merely extremely unlikely.
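A logical check of this kind can be written as a small set of rules. The sketch below is illustrative only: the field names (age, years_employed), the limits (max_age, min_working_age), and the records are hypothetical and would need to come from real domain knowledge. The first rule is univariate (a single value is impossible on its own). The second is multivariate (each value is plausible alone, but the combination is not).

```python
# Hypothetical records used only for illustration.
records = [
    {"id": 1, "age": 34, "years_employed": 10},
    {"id": 2, "age": 212, "years_employed": 5},   # impossible age
    {"id": 3, "age": 25, "years_employed": 30},   # impossible combination
]

def logical_flags(rec, max_age=120, min_working_age=14):
    """Return a list of rule violations for one record (empty if none)."""
    flags = []
    # Univariate rule: age must fall in a humanly possible range.
    if not 0 <= rec["age"] <= max_age:
        flags.append("age out of range")
    # Multivariate rule: nobody can have worked longer than they have
    # been of working age.
    if rec["years_employed"] > rec["age"] - min_working_age:
        flags.append("years_employed inconsistent with age")
    return flags

# Keep only the records that violate at least one rule.
flagged_records = {}
for rec in records:
    problems = logical_flags(rec)
    if problems:
        flagged_records[rec["id"]] = problems

print(flagged_records)
```

Record 3 would pass any univariate screen, which is why the multivariate rule is needed.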
If outliers are found in the data, the decision of whether to keep or remove them can be made by evaluating the effect of keeping or excluding them in predictive models. If including the outliers diminishes the performance of the predictive models, they should likely be removed.
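The keep-or-remove comparison can be sketched by fitting the same model twice, once with and once without the suspected outliers, and scoring both on the same held-out data. The example below is a toy illustration under stated assumptions: the data are synthetic, the model is a simple least-squares line, the error measure is mean absolute error, and the held-out set is assumed to be free of outliers. None of these choices comes from this chapter.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a fitted line."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)

# Synthetic training data on the line y = 2x + 1, with one suspected outlier.
train = [(x, 2 * x + 1) for x in range(10)]
train[9] = (9, 100)
suspected = {9}  # indices flagged by an earlier screening step (hypothetical)

# Held-out data, assumed clean, used to score both models.
holdout_x = [2.5, 4.5, 6.5, 10.5]
holdout_y = [2 * x + 1 for x in holdout_x]

kept = [p for i, p in enumerate(train) if i not in suspected]
model_with = fit_line([x for x, _ in train], [y for _, y in train])
model_without = fit_line([x for x, _ in kept], [y for _, y in kept])

mae_with = mean_abs_error(model_with, holdout_x, holdout_y)
mae_without = mean_abs_error(model_without, holdout_x, holdout_y)
print(f"MAE with outlier: {mae_with:.3f}, without: {mae_without:.3f}")
```

In this toy case the single outlier pulls the fitted slope well away from 2, so the model trained without it scores better on the held-out points. On real data the difference may be small or may go the other way, in which case keeping the observations is the safer choice.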