Intro to Outliers

Outliers

What is an outlier?

An outlier is an observation that is extreme or distant from the overall pattern in a sample. “Distant,” however, is a deliberately subjective term. What counts as distant is often left to the discretion of the analyst or data scientist, who must decide whether the observation should be included. To make sound decisions about which data points to include or exclude, the analyst must keep the main objective in mind, make an extra effort to understand the data, and know its limitations.

Why look for outliers?

In machine learning, algorithms are designed to learn from the data. Outliers, which are not entirely representative of the population, can bias the results and have a disproportionate influence on models. This can lead analysts to make incorrect judgments or decisions. For this reason, it is important to check for outliers before creating models based on the data.

The influence of outliers

To illustrate how outliers can influence descriptive statistics, let’s begin by observing the effect a single outlier can have on a small sample. Take, for example, the five numbers 22, 24, 21, 26, and 22. These numbers have a mean of 23, a median of 22, and a sample standard deviation of 2. Now suppose we add an outlier with a value of 299. With 299 included, the mean becomes 69, the median becomes 23, and the standard deviation becomes 112.69. The mean and the standard deviation are strongly influenced by the outlier, while the median is far more resistant. A large gap between the mean and the median can therefore be a good indicator that an outlier is influencing your data.

Figure 3.1: Simple Influence of Outliers
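The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch using the standard library's statistics module; the helper name summarize is our own, and the standard deviation is assumed to be the sample version (n - 1 denominator), which is what reproduces the values quoted in the text.

```python
import statistics

# The five observations from the text, then the same sample with the outlier added.
sample = [22, 24, 21, 26, 22]
with_outlier = sample + [299]

def summarize(values):
    """Return the mean, median, and sample standard deviation of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample SD (n - 1 denominator)
    }

before = summarize(sample)
after = summarize(with_outlier)

# Print each statistic without and with the outlier, side by side.
for name in ("mean", "median", "stdev"):
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The mean and standard deviation jump while the median moves by only one unit, which is the pattern described above.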

Outliers can distort fully implemented models as well. In the BeesAndPollen dataset, we can see the effect that an outlier has on a simple linear regression model and its line of fit, which will be explained in more detail in Chapter 6. The BeesAndPollen dataset records how much pollen was removed by queen and worker bees based on the duration of their visit. In Figure 3.2, the red line of fit for pollen collected by queen bees does not seem entirely representative of the sample data. This is due to the two outliers that sit far to the right of the main cluster of data.

Figure 3.2: Queen Bee Simple Linear Model Including Outliers

However, after removing these two data points, we can see in Figure 3.3 that the green line of fit has shifted dramatically and now appears to be more representative of the majority of the sample. Notably, two outliers were all it took to change the results of our model.

Figure 3.3: Queen Bee Simple Linear Model Excluding Outliers
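The same effect can be reproduced on a small scale. The sketch below does not use the BeesAndPollen data: the (duration, pollen) pairs are made up purely for illustration, and fit_line is a hypothetical helper that computes an ordinary least-squares line by hand. It fits the line twice, once with two distant points included and once without them.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up (duration, pollen) pairs, NOT the BeesAndPollen data.
# The first eight points follow a clear upward trend.
cluster = [(2, 0.10), (4, 0.20), (5, 0.24), (6, 0.31),
           (8, 0.39), (10, 0.52), (12, 0.58), (14, 0.71)]
# Two very long visits that fall well below that trend act as outliers.
outliers = [(60, 0.40), (75, 0.35)]

def fit_points(points):
    """Split (x, y) pairs into coordinate lists and fit a line to them."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return fit_line(xs, ys)

slope_all, intercept_all = fit_points(cluster + outliers)
slope_clean, intercept_clean = fit_points(cluster)

print(f"with outliers:    slope={slope_all:.4f}, intercept={intercept_all:.4f}")
print(f"without outliers: slope={slope_clean:.4f}, intercept={intercept_clean:.4f}")
```

With these illustrative values, the fitted slope is far flatter when the two distant points are included, which mirrors the difference between the red and green lines in Figures 3.2 and 3.3.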

Anomaly Detection

Sometimes the entire purpose of data modeling is to find outliers. This is known as anomaly detection. Anomaly detection involves identifying items, events, or observations that do not conform to an expected pattern in the data. It also involves finding unusual spikes and changes in the data over time.

Detecting unauthorized expenditures or fraud within a company is an example of anomaly detection. Amongst the many transactions and noise, the analyst’s objective is to discover who may be committing fraud. Different models that we will discuss later can be used to identify potential outliers in anomaly detection.

The following are some common applications of anomaly detection:

  • Cybersecurity and network intrusion

  • Identifying spam

  • Airport security screening

  • Credit card fraud

  • Counterfeit currency

  • Stock market

Common Causes of Outliers

Outliers can come about in many ways. Below is a list of common causes for outliers in data:

  • Data entry errors (human errors)

  • Measurement errors (instrument errors)

  • Experimental errors (data extraction or experiment planning/execution errors)

  • Intentional outliers (dummy outliers created to test detection methods)

  • Data processing errors (data manipulation or unintended changes to the dataset)

  • Sampling errors (extracting non-representative data or mixing data from wrong or various sources)

  • Naturally occurring outliers (not errors)

Outlier Detection Methods in This Chapter

This chapter will discuss some of the most common methods for identifying outliers:

  • Univariate Methods

    • Logical Detection

    • Z-Score/Standard Deviation Approach

    • Tukey’s Boxplots

  • Multivariate Methods

    • Scatter Plots

    • Density Plots

    • Mahalanobis Distances

    • Jackknife Distances

    • K-Nearest Neighbor