3.2 Univariate Outlier Detection
As mentioned before, a data analyst or scientist must make sure that their data is understood. Obvious outliers can be spotted by first checking for values that are logically impossible. In addition, visualizations can be used to help understand the data better and identify extreme observations.
Univariate outlier detection methods are designed to examine each column of data by itself to see if some of the values are unusual. Detection methods are typically a combination of calculating descriptive statistics relative to the distribution of values and using visualizations along with those statistics.
This section will discuss four common methods of univariate outlier detection: logical detection, histograms, Z-Score based methods, and Tukey's Box Plot method.
Logical Detection
Impossible Values. When beginning your search for outliers, you need to look at each variable and ask yourself, “What values are impossible?” For example, in a dataset containing human body temperatures, a value of 120° is impossible, as it would be fatal to the patient. However, a temperature of 102°, which has the same digits in a different order, is quite common. Likewise, a record of someone running a 3-minute mile is almost certainly an error; otherwise, you have just witnessed the fastest mile ever run. Many of these logical outliers result from data recording or human error.
Unlikely Values. The analyst must also identify values that are extremely rare or unlikely. Such a determination cannot be made without domain knowledge, which is necessary when evaluating possible data errors and outliers. For example, domain knowledge of typical human body temperatures is needed to determine whether a high or low temperature is impossible or merely unlikely. If the analyst lacks adequate domain knowledge, he or she should acquire it or consult experts who have it.
A great way to begin searching for logical errors is by checking the summary statistics for all your variables, which include mean, median, mode, standard deviation, minimum, maximum, and many other statistics. Many statistical packages offer this feature.
In JMP, you can access the summary statistics by going to Analyze -> Distribution and then selecting which variables you desire to see summary statistics for.
You can customize your summary statistics by going to File -> Preferences -> Platforms -> Distribution Summary Statistics and then selecting what you would like to see as your default. In this example, additional values are selected via the preferences.
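Outside of JMP, the same two steps — computing summary statistics and checking for logically impossible values — can be sketched in a few lines of Python. The temperature readings and the 93°–108° plausibility bounds below are assumptions chosen purely for illustration, not clinical limits.

```python
import statistics

# Hypothetical body-temperature readings (degrees F); values are illustrative.
temps = [98.6, 99.1, 97.8, 102.0, 120.0, 98.2, 96.9]

# Summary statistics, analogous to JMP's Distribution report.
print("mean:", statistics.mean(temps))
print("median:", statistics.median(temps))
print("stdev:", statistics.stdev(temps))
print("min/max:", min(temps), max(temps))

# Logical check: flag values outside a plausible range for a living patient.
# The 93-108 bounds are assumed for illustration only.
impossible = [t for t in temps if not 93 <= t <= 108]
print("logically impossible values:", impossible)  # flags 120.0
```

Scanning the minimum and maximum first is usually the fastest way to spot a value like 120.0 that no amount of statistical reasoning is needed to reject.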
Histograms
Histograms can help analysts quickly identify outliers that fall on the outside edges of a data distribution. When working with histograms, be aware of the bin sizes of your distribution. Large bins at the ends of the distribution may conceal outliers. Conversely, smaller bins spread out the histogram and are more likely to reveal outliers. Overall, the bin size is left to the discretion of the analyst and his or her understanding of the data.
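The effect of bin size can be sketched numerically by bucketing the same data at two different bin widths; the dataset below is illustrative, with one extreme value of 110 hidden among readings in the 68–83 range.

```python
# Bucket the same data with two different bin widths to show how bin
# size affects whether an extreme value stands out. Data are illustrative.
data = [68, 70, 71, 72, 72, 73, 74, 75, 76, 78, 79, 80, 81, 83, 110]

def bin_counts(values, width):
    """Count observations per bin of the given width, keyed by bin start."""
    counts = {}
    for v in values:
        start = (v // width) * width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Wide bins: almost everything collapses into one bar, and the extreme
# value lands in the bin right next to it, easy to overlook.
print(bin_counts(data, 30))  # {60: 14, 90: 1}

# Narrow bins: the extreme value sits in its own bin, separated from the
# main mass by a long run of empty bins.
print(bin_counts(data, 5))   # {65: 1, 70: 6, 75: 4, 80: 3, 110: 1}
```

The gap of empty narrow bins between 85 and 110 is exactly the visual cue an analyst looks for on a histogram.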
Z-Score/Standard Deviation Approach
With normally distributed data, obvious outliers can be identified by combining descriptive statistics with visualizations and by calculating the Z-Score equivalent of observations.
What Is a Z-Score?
A Z-Score standardizes a data point by expressing it in terms of the number of standard deviations it lies from the mean.
However, you should only use this approach when the data is continuous and approximately normally distributed. Continuous data means that data is numeric and there are many possible values such as height, weight, or age that can be divided into a series of ranges or bins. Normally distributed data looks like a bell curve and can easily be identified visually by looking at a histogram (see Figure 3.7). If the data is not normally distributed, you should use Tukey’s boxplot method.
When the data is normally distributed, statistical theory allows us to make assumptions about the distribution of our data. Figure 3.8 shows the percentage of data contained within each standard deviation of the mean. Going one standard deviation above and below the mean captures over half (68.2%) of the data points. Expanding to two standard deviations above and below the mean captures about 95.4% of the data, and expanding to three standard deviations captures about 99.7%. Only 0.15% of the area under the curve lies beyond three standard deviations on each side of the distribution.
As a result, a common rule is to flag any data point more than three standard deviations above or below the mean as a potential outlier, because a data point is highly unlikely to occur outside this range. If a data point does fall outside three standard deviations, the analyst should investigate and determine, based on the situation, whether the outlier should be removed.
The power of the normal distribution is that it can be easily mapped to values in the observed data. For example, IQ scores are standardized to have a mean of 100; the Wechsler IQ test has a mean of 100 and a standard deviation of 15.
The graph below shows this distribution. Because it’s standardized, we know that about 68% of IQ scores will be between 85 and 115. This is also where percentiles come from. A person with an IQ of 130 is at the 97.7th percentile: about 98% of IQ scores fall at or below that score, so that person is in the 98th percentile. Applying this to outlier detection, IQ values below 55 are in the bottom 0.1% (p = 0.001) of the distribution, and IQ values above 145 are in the top 0.1% of the population.
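These IQ percentiles can be verified directly from the normal distribution; a short Python check using the standard library:

```python
from statistics import NormalDist

# Wechsler IQ scale: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# Share of scores within one standard deviation (85 to 115): about 68%.
within_1sd = iq.cdf(115) - iq.cdf(85)
print(round(within_1sd, 3))   # 0.683

# Percentile of an IQ of 130 (two standard deviations above the mean).
print(round(iq.cdf(130), 3))  # 0.977 -> roughly the 98th percentile

# Tail share below 55 (three standard deviations below the mean).
print(round(iq.cdf(55), 4))   # about 0.1% of the population
```

The same `cdf` call answers the general outlier question: how much of the population is expected at or beyond any given observation.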
How to Identify Normally Distributed Data
JMP allows us to identify which variables or columns of data are normally distributed by building a Normal Quantile plot. This can be done by doing the following:
Select Analyze -> Distributions.
Select a desired number of known continuous variables you would like to analyze for normality and click Y.
Click OK, and you will see a histogram and summary statistics.
Click the red drop-down arrow (see Figure 3.11) and select Normal Quantile Plot.
A Normal Quantile plot shows observations plotted on the X axis against the expected Z-Score. If the points follow an approximately straight line and fall within the red-dotted confidence intervals, we can conclude that the data is normally distributed (Figure 3.12, Left Graph). However, if the data points do not follow an approximately straight line or fall outside of the red-dotted confidence intervals, we can conclude the data is not normally distributed (Figure 3.12, Right Graph).
To statistically determine if the data is normally distributed, you can have JMP perform the Shapiro-Wilk W Test. This can be found by clicking the red drop-down arrow for your continuous variable (see Figure 3.11) and selecting Continuous Fit -> Normal. Then click the red drop-down arrow next to Fitted Normal and select Goodness of Fit. A ProbW statistic greater than .05 indicates the data is consistent with a normal distribution (Figure 3.13, Left Table); a ProbW statistic less than .05 indicates the data is most likely not normally distributed (Figure 3.13, Right Table).
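The straight-line idea behind the Normal Quantile Plot can also be sketched numerically: correlate the sorted data against the theoretical normal quantiles and see how close the relationship is to a perfect line. This is a rough illustration of the plot's logic, not the Shapiro-Wilk test itself, and the two simulated datasets below are assumptions for demonstration.

```python
from statistics import NormalDist, mean, stdev
import random

random.seed(1)
normal_data = [random.gauss(100, 15) for _ in range(200)]   # bell-shaped
skewed_data = [random.expovariate(1.0) for _ in range(200)]  # right-skewed

def qq_correlation(data):
    """Correlation between sorted data and theoretical normal quantiles.

    A rough stand-in for eyeballing JMP's Normal Quantile Plot: values
    near 1.0 suggest the points follow a straight line (normality).
    """
    n = len(data)
    xs = sorted(data)
    # Expected z-scores at plotting positions (i + 0.5) / n.
    zs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mz = mean(xs), mean(zs)
    cov = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    return cov / ((n - 1) * stdev(xs) * stdev(zs))

print(round(qq_correlation(normal_data), 3))  # close to 1
print(round(qq_correlation(skewed_data), 3))  # noticeably lower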
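The straight-line idea behind the Normal Quantile Plot can also be sketched numerically: correlate the sorted data against the theoretical normal quantiles and see how close the relationship is to a perfect line. This is a rough illustration of the plot's logic, not the Shapiro-Wilk test itself, and the two simulated datasets below are assumptions for demonstration.

```python
from statistics import NormalDist, mean, stdev
import random

random.seed(1)
normal_data = [random.gauss(100, 15) for _ in range(200)]    # bell-shaped
skewed_data = [random.expovariate(1.0) for _ in range(200)]  # right-skewed

def qq_correlation(data):
    """Correlation between sorted data and theoretical normal quantiles.

    A rough stand-in for eyeballing JMP's Normal Quantile Plot: values
    near 1.0 suggest the points follow a straight line (normality).
    """
    n = len(data)
    xs = sorted(data)
    # Expected z-scores at plotting positions (i + 0.5) / n.
    zs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mz = mean(xs), mean(zs)
    cov = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    return cov / ((n - 1) * stdev(xs) * stdev(zs))

print(round(qq_correlation(normal_data), 3))  # close to 1
print(round(qq_correlation(skewed_data), 3))  # noticeably lower
```

For the formal decision, use the actual Shapiro-Wilk test (as JMP does); this sketch only captures the visual intuition.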
How to Perform the Z-Score/Standard Deviation Calculation
To standardize your data, calculate the Z-Score for each observation. The formula for the standardized value, that is, the Z-Score, is z = (x − μ) / σ, where x is the observation, μ is the mean, and σ is the standard deviation.
After the data has been standardized, you can then check for any Z-Scores that have a value greater than 3 or less than -3 and flag them as potential outliers.
Many software and statistical packages can flag outliers that lie a given number of standard deviations from the mean. If this option is not available, you can use simple Excel formulas such as =AVERAGE(data) and =STDEV.S(data). A non-standardized approach identifies the average/mean and standard deviation, multiplies the standard deviation by 3, and then adds and subtracts that number from the average/mean to identify a range. Any data point that falls outside this range can be flagged as a potential outlier. Figure 3.15 shows how to find outliers in Excel using both the non-standardized approach and the standardized (Z-Score) approach.
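Both the standardized and non-standardized versions of the 3-standard-deviation rule can be sketched in a few lines of Python; the dataset below is an illustrative sample with one extreme value planted at 120.

```python
from statistics import mean, stdev

# Illustrative sample of 30 observations with one extreme value.
data = [48, 52] * 14 + [50, 120]

m, s = mean(data), stdev(data)

# Standardized approach: compute z = (x - mean) / stdev, flag |z| > 3.
z_scores = [(x - m) / s for x in data]
z_flagged = [x for x, z in zip(data, z_scores) if abs(z) > 3]

# Non-standardized approach: flag points outside mean +/- 3 * stdev,
# mirroring the Excel recipe with =AVERAGE(data) and =STDEV.S(data).
lower, upper = m - 3 * s, m + 3 * s
range_flagged = [x for x in data if x < lower or x > upper]

print(z_flagged, range_flagged)  # both methods flag the same point: 120
```

The two approaches are algebraically identical; the z-score form is just the range check rescaled to standard-deviation units.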
How to Perform the Z-Score Approach in JMP
To identify outliers in JMP using the Z-Score approach, perform the following steps:
From the JMP data table, select Analyze -> Distribution.
Select a known continuous and normally distributed variable and click Y.
Click OK, and you will see a histogram and summary statistics.
Click on the red drop-down arrow for your variable and go to Save -> Standardized. A new column, Std [variable name], will then be saved to your table.
The new column that has just been created is the Z-Score for each observation. You can then scan the data table or use the Graph Builder in JMP to plot points and identify which observations have Z-Scores greater than 3 to be flagged as outliers for that specific variable you chose.
Quantile Box Plot
Another way to check for outliers is to compare a histogram side by side with a Quantile Box Plot. A quantile boxplot has whiskers that extend to the min and max values. Keep in mind that a quantile box plot does not show outliers; instead, it marks each pth quantile with a tick within the whiskers. For example, 10% of the data will lie below the 10th quantile and 90% of the data will lie below the 90th quantile. A snapshot of a Quantile Box Plot in JMP is shown below.
The Quantile Box Plot is not shown by default in JMP. To see the Quantile Box plot for your continuous variables, perform the following:
Select Analyze -> Distributions.
Select the known continuous variable(s) and click Y.
Click OK, and you will see a histogram and summary statistics.
Click on the red drop-down arrow for your variable and select Quantile Box Plot. Remember not to confuse the Quantile Box Plot with the Outlier Box Plot.
By comparing the Quantile Box Plot to the histogram, you can find which data lies below the 10th or above the 90th quantile. These points can then be further investigated to see if they are potential outliers.
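The 10th/90th quantile check can be reproduced outside JMP with the standard library; the sample data below is illustrative, and note that different tools (including JMP) may use slightly different quantile interpolation conventions.

```python
import statistics

# Illustrative sample with one large value (45) and one small value (9).
data = [12, 15, 14, 10, 18, 20, 11, 16, 13, 45, 17, 14, 9, 19, 22]

# statistics.quantiles with n=10 returns the 9 cut points between deciles;
# the first is the 10th quantile and the last is the 90th.
deciles = statistics.quantiles(data, n=10)
q10, q90 = deciles[0], deciles[-1]
print("10th quantile:", q10, "90th quantile:", q90)

# Points beyond the 10th/90th quantiles: candidates for closer inspection.
tails = [x for x in data if x < q10 or x > q90]
print("values in the tails:", tails)
```

As the text notes, landing in a tail does not make a point an outlier by itself; by construction roughly 10% of any dataset sits beyond each of these quantiles.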
Tukey’s Boxplots
Due to the nature of different datasets, not all data will be normally distributed. Four other types of data distributions are Bi-Modal, Uniform, Negatively Skewed, and Positively Skewed (see Figure 3.17).
A method to identify outliers in these types of distributions is the Tukey boxplot. John Tukey, a famous statistician, created the box plot in 1970 (see Figure 3.18); it helps an analyst quickly visualize and identify which data points could be potential outliers.
How to Calculate Tukey’s Boxplot by Hand
A Tukey boxplot is constructed from the following values:
Median (50% or middle point of the data)
3rd Quartile (the value below which 75% of the data lies)
1st Quartile (the value below which 25% of the data lies)
Interquartile Range (3rd Quartile – 1st Quartile)
Upper Threshold (3rd Quartile + 1.5 × Interquartile Range)
Lower Threshold (1st Quartile – 1.5 × Interquartile Range)
As Tukey worked with vast amounts of data, he found that a threshold distance of 1.5 times the interquartile range beyond the quartiles worked well; that multiplier remains common practice among statisticians and data scientists today. The thresholds are marked (with a dotted line); any data point equal to or outside of a threshold is marked as an outlier. View the figures below for more detail on Tukey’s boxplots and how to create them.
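The threshold calculation above can be sketched directly in Python; the data is illustrative, and the "inclusive" quantile method is one common textbook quartile convention (JMP may compute quartiles slightly differently).

```python
import statistics

# Illustrative right-skewed sample with one extreme value.
data = [7, 9, 10, 11, 12, 13, 14, 15, 16, 18, 40]

# 1st and 3rd quartiles; "inclusive" matches a common textbook convention.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # Interquartile Range = 3rd Quartile - 1st Quartile

# Tukey's fences: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)  # flags only the extreme value, 40
```

Because the fences are built from quartiles rather than the mean and standard deviation, a single extreme value cannot stretch them, which is why this method works on skewed and non-normal distributions.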