3.3 Multivariate Outlier Detection
Comparison of Multiple Variables
Multivariate statistical analysis involves working with multiple variables to understand their relationships. Using multivariate analysis to detect outliers is important because univariate methods cannot detect some kinds of outliers: sometimes an outlier becomes apparent only when the relationships among multiple variables are examined.
Multivariate methods for outlier detection compare the values an instance takes on one variable against its values on other variables to look for extreme combinations. A simple example is comparing age with income. High income is typically associated with adults in their primary income-earning years, roughly ages 30 to 65, while younger and older individuals more commonly have lower incomes. Comparing income against age can highlight young or old individuals with unusually high incomes, and it can also reveal unusually low incomes among individuals in their primary earning years. Notice that this is another example of logical analysis, where the analyst must use reason, common sense, and domain knowledge to identify and evaluate potential outliers.
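As a minimal sketch of this kind of logical check, the following Python snippet flags unusually high incomes reported outside the primary earning years. The data, column names, and thresholds are hypothetical and would come from the dataset and domain knowledge at hand.

```python
import pandas as pd

# Hypothetical data: each row is one individual.
df = pd.DataFrame({
    "age":    [25, 42, 19, 70, 55, 33, 16],
    "income": [38000, 95000, 250000, 310000, 87000, 61000, 12000],
})

# Flag unusually high incomes reported outside the primary earning years (30-65).
# The income threshold is arbitrary and would be set from domain knowledge.
high_income = df["income"] > 200000
outside_prime_years = (df["age"] < 30) | (df["age"] > 65)
suspects = df[high_income & outside_prime_years]

print(suspects)
```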
Scatter Plots and Density Plots
To assist your efforts in finding values that are logically impossible or extreme, you can use visualizations to identify outliers. Two commonly used visualizations to identify outliers are Scatter Plots and Density Plots.
Scatter Plots
Scatter plots are commonly used because they allow the analyst to easily identify trends, patterns, and clusters in the data. Any extreme points that fall outside the pattern can be spotted and, at the analyst's discretion, removed. Figure 3.21 shows the relationship between the palmitic and palmitoleic chemicals found in corn samples. The scatter plot contains many data points; however, it is still easy to spot the extreme data point to the far-left side of the cluster.
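A scatter plot like the one in Figure 3.21 can be produced with a few lines of Python. The file name and column names below are assumptions for illustration; substitute the variables you want to compare.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file containing the two chemical measurements per sample.
df = pd.read_csv("samples.csv")

plt.scatter(df["palmitic"], df["palmitoleic"], s=12, alpha=0.6)
plt.xlabel("palmitic")
plt.ylabel("palmitoleic")
plt.title("Scatter plot of palmitic vs. palmitoleic")
plt.show()
```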
Density Plots
Density plots allow an analyst to view the frequency at which different data values occur within a dataset. Density plots can be drawn in different ways; some resemble scatter plots, histograms, or line curves, with a color map indicating frequency. Regions of extremely low frequency that lie far from the bulk of the data can be an indicator of an influential outlier. Figure 3.22 shows a density plot of the same palmitic and palmitoleic data shown in Figure 3.21.
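One simple way to approximate this kind of density view in Python is a hexbin plot, where a color map shows how many observations fall in each bin; isolated low-count cells far from the main mass are candidates for closer inspection. The file and column names are again assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("samples.csv")  # hypothetical file with the same two columns

# Color indicates how many points fall in each hexagonal bin.
plt.hexbin(df["palmitic"], df["palmitoleic"], gridsize=30, cmap="viridis")
plt.colorbar(label="count")
plt.xlabel("palmitic")
plt.ylabel("palmitoleic")
plt.show()
```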
Detecting Outliers Through Multivariate Statistical Methods
After visualizing the data and checking for logically impossible and extreme values, you can use multivariate statistical methods to identify outliers. These methods can identify outliers you may have missed or confirm whether the outliers you identified earlier are influential.
Three multivariate methods that will be discussed for detecting outliers are (1) Mahalanobis Distances, (2) Jackknife Distances, and (3) K-Nearest Neighbor.
Mahalanobis Distance
The Mahalanobis Distance is a measure of the distance between a point P and a distribution D. In other words, the Mahalanobis distance measures how many standard deviations a point P is from the mean (centroid) of D. The distance depends on estimates of the mean, the standard deviation, and the correlation of the data.
A more intuitive explanation is to consider many observations plotted on an X and Y graph, as shown in Figure 3.24. The first objective is to find the center of mass of these observations. As we test each observation, we check how many standard deviations it lies from that center. The more standard deviations an observation is away from the center of mass, the more likely it is to be classified as not belonging to the set of data (see Figure 3.25).
The formula for computing the Mahalanobis distance is the following:

$$M_i = \sqrt{\left(Y_i - \bar{Y}\right)^{\top} S^{-1} \left(Y_i - \bar{Y}\right)}$$

Where:
$Y_i$ is the data for the i th row
$\bar{Y}$ is the row of means
$S$ is the estimated covariance matrix for the data
Keep in mind that the Mahalanobis distance assumes that the sample observations are distributed about the center of mass in a roughly spherical manner. If the distribution of the data is non-spherical, we should be hesitant to use the Mahalanobis distance, since it could produce distorted results and flag normal observations as outliers.
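A minimal NumPy sketch of this calculation is shown below; the simulated data are only for illustration.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row of X from the column means, following the formula above."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)               # Y-bar: the row of means
    cov = np.cov(X, rowvar=False)       # S: estimated covariance matrix
    cov_inv = np.linalg.inv(cov)
    diffs = X - mean                    # Y_i - Y-bar for every row
    # sqrt((Y_i - Ybar)' S^-1 (Y_i - Ybar)), computed row by row
    return np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))

# Two correlated variables plus one point that sits far from the cloud.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [4.0, -4.0]])
d = mahalanobis_distances(X)
print("largest distance belongs to row", d.argmax())
```

Observations whose distances stand far above the rest are the candidates to investigate.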
Jackknife Distance
The Jackknife Distance is essentially a refined version of the Mahalanobis Distance. The Jackknife distance for each observation is calculated with estimates of the mean, standard deviation, and correlation matrix that do not include the observation itself.
Because each observation is excluded from its own estimates, the Jackknife distance is not distorted by potential outliers; in other words, it removes the bias an outlier would otherwise introduce. When an observation is included in the estimates, as with the Mahalanobis distance, an outlier tends to disguise itself or to make other points look more outlying than they are.
The formula for computing the Jackknife distance is expressed in terms of:
$n$ = number of observations
$M_i$ = Mahalanobis distance for the i th observation
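Because the closed-form expression is not reproduced here, the sketch below computes the Jackknife distance directly from its definition: for each observation, the mean and covariance are re-estimated without that observation before its distance is measured. This is a straightforward (if slower) way to obtain the same kind of measure.

```python
import numpy as np

def jackknife_distances(X):
    """Leave-one-out version of the Mahalanobis distance: each row is measured
    against estimates computed without that row."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    distances = np.empty(n)
    for i in range(n):
        rest = np.delete(X, i, axis=0)                    # drop observation i
        mean = rest.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(rest, rowvar=False))
        diff = X[i] - mean
        distances[i] = np.sqrt(diff @ cov_inv @ diff)
    return distances

# Applied to the same simulated X as the Mahalanobis sketch above:
# d_jack = jackknife_distances(X)
```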
K-Nearest Neighbor
K-Nearest Neighbor is an algorithm that can be used both to build data models and to detect outliers based on how similar an observation is to its neighbors. K-Nearest Neighbors will be explained in more detail in a later chapter, but for now we will cover the basics needed to flag potential outliers.
Unlike the Mahalanobis and Jackknife methods, where distance is measured from the center of the data, K-Nearest Neighbor measures the distance of an observation from the neighbors most similar to it. When looking for outliers, K refers to the rank of the closest neighbor: K = 1 uses the distance from an observation to its single most similar observation, K = 2 uses the distance to its second most similar observation, and so on. As K increases, the distance from one observation to its Kth most similar neighbor generally increases. Outliers can be detected by computing these distances for all observations at the same K and then flagging any distances that are extreme compared to the other observations' distances.
An observation that is extremely distant from all other observations at K = 1 is a good indicator of an outlier. However, some outliers may go unnoticed at K = 1, and they are more likely to surface as K increases. Keep in mind, however, that increasing K too much may cause normal observations to be flagged as false outliers. JMP recommends K = 8 for finding potential outliers, but the choice is left to the discretion and judgment of the analyst.
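A short scikit-learn sketch of this idea is given below, using simulated data and the K = 8 suggestion mentioned above; the function name and data are illustrative only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kth_neighbor_distances(X, k=8):
    """Distance from each observation to its kth nearest neighbor."""
    # Ask for k + 1 neighbors because, when querying the training data,
    # each point is returned as its own closest neighbor at distance zero.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    return dists[:, k]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X = np.vstack([X, [6.0, 6.0, 6.0]])   # a point far from all of its neighbors
d8 = kth_neighbor_distances(X, k=8)
print("rows with the largest k = 8 distances:", np.argsort(d8)[-3:])
```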
Performing K-Nearest Neighbor in JMP
You can perform a K-Nearest Neighbor outlier test in JMP as follows:
- Highlight the continuous variables you wish to test by clicking the column headers in the JMP data table. Hold down Ctrl to select multiple columns.
- Go to Cols -> Modeling Utilities -> Explore Outliers.
- Under Command, click Multivariate k-Nearest Neighbor Outliers.
- Enter the value for k.
- Analyze the charts for data points that are extremely high or distant from the rest, and mark these as potential outliers.