9.7 Managing Outliers
Outliers are extreme values that deviate significantly from the rest of the dataset. They can skew results, impact model performance, and distort statistical analyses. Proper identification and handling of outliers are crucial for effective data analysis.
Identifying Outliers
There are three primary techniques for identifying outliers, which we'll discuss below:
Empirical Rule (Z-Score Method)
Tukey’s Boxplot Rule (IQR Method)
DBSCAN Clustering Algorithm
Each method has its own strengths and weaknesses, making them suitable for different types of datasets.
Empirical Rule (a.k.a. Z-Score Method; a.k.a., 68-95-99.7 Rule)
The Empirical Rule is appropriate when the data can be assumed to be normally distributed. It states that, for a normal distribution:
68% of values fall within 1 standard deviation (σ) of the mean.
95% of values fall within 2 standard deviations (σ).
99.7% of values fall within 3 standard deviations (σ).
Again, this technique for identifying outliers is effective when the data follows a normal distribution. It is also easy to compute and interpret. However, it fails to accurately identify outliers when the data are skewed or non-normal. In addition, extreme outliers can distort the mean and standard deviation.
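To make this concrete, here is a minimal sketch of z-score outlier detection, assuming a hypothetical DataFrame df with a numeric 'Value' column:
import pandas as pd, numpy as np
# Hypothetical numeric feature; in practice, use one of your own columns
df = pd.DataFrame({'Value': np.random.normal(loc=50, scale=10, size=1000)})
# Compute the z-score: how many standard deviations each value is from the mean
df['Z-Score'] = (df['Value'] - df['Value'].mean()) / df['Value'].std()
# Flag values more than 3 standard deviations from the mean as outliers
outliers = df[df['Z-Score'].abs() > 3]
print(f'{outliers.shape[0]} outliers found out of {df.shape[0]} rows')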
Tukey’s Boxplot Rule (a.k.a., Interquartile Range - IQR)
Tukey's box plot is a particular type of box plot intended for skewed distributions, where the max/min (or "whiskers") of the plot are defined as the lowest/highest data points that are still within 1.5 * the interquartile range ([IQR] defined as Q3 - Q1) of the first/third quartiles, as depicted in the image below:
Conveniently, the box plot function in the Matplotlib package (plt.boxplot()) defaults to the Tukey box plot, with a 1.5 * IQR distance for the min and max values. We can generate the box plot without actually showing it, just so that we can retrieve the 1.5 * IQR min/max values. Then, as previously, we can use those values to perform the same logic as above to generate outlier replacement values at the theoretical min/max (see the sketch after the list below).
The boxplot() function returns a dictionary containing the following information:
boxes: the main body of the boxplot showing the quartiles and the median’s confidence intervals (if enabled).
medians: the horizontal lines at the median of each box.
whiskers: the vertical lines extending to the most extreme, non-outlier data points.
caps: the horizontal lines at the ends of the whiskers.
fliers: the points representing data that extend beyond the whiskers.
means: the points or lines representing the means.
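As a sketch of retrieving those values (again assuming a hypothetical 'Value' column), we can call plt.boxplot() and read the whisker caps out of the returned dictionary without ever displaying the figure:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
# Hypothetical skewed feature
df = pd.DataFrame({'Value': np.random.exponential(scale=10, size=1000)})
# Generate the Tukey box plot only to retrieve its statistics (whiskers default to 1.5 * IQR)
box = plt.boxplot(df['Value'].dropna())
plt.close()  # we only need the returned dictionary, not the figure itself
# 'caps' holds two Line2D objects: the ends of the lower and upper whiskers
cap_values = [cap.get_ydata()[0] for cap in box['caps']]
lower_bound, upper_bound = min(cap_values), max(cap_values)
print(f'Theoretical min: {lower_bound:.2f}, theoretical max: {upper_bound:.2f}')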
In summary, the IQR (Tukey box plot) method works well when the data are skewed or non-normal, and it is not distorted by extreme values. However, it is less effective when the data have multiple peaks (a.k.a. "multimodal"). It also assumes a roughly symmetric spread of the data around the median, which may not always be true.
DBSCAN Clustering Algorithm
Another limitation of both the Empirical Rule and Tukey's box plot methods is that they identify outliers in only one feature at a time. A given value may appear to be an outlier within a single feature, but when combined with all other values in the record, it may become apparent that it is not, in fact, an outlier. For example, suppose we have a dataset of employees, and one employee has a salary higher than all others. If you only examine the salary feature, this employee may appear to be an outlier. But if you include all other values, perhaps you'll realize that this is the CEO, who has worked at the company longer than anyone else. In that case, you may decide that he or she is not an outlier after all.
This is the primary limitation solved by clustering-based outlier detection techniques like DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN is an unsupervised learning algorithm that clusters data points based on their density. Outliers are points not assigned to any cluster.
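Here is a minimal sketch of the idea, assuming two hypothetical numeric features (salary and tenure) that are scaled before clustering; any point DBSCAN cannot assign to a cluster is labeled -1:
import numpy as np, pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler
# Hypothetical two-feature dataset
df = pd.DataFrame({'salary': np.random.normal(60000, 10000, 500), 'tenure': np.random.normal(5, 2, 500)})
# Scale the features so that neither dominates the distance calculation
X = MinMaxScaler().fit_transform(df)
# Points that cannot be assigned to any dense cluster receive the label -1
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
df['outlier'] = db.labels_ == -1
print(f"{df['outlier'].sum()} multivariate outliers flagged")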
Addressing Outliers
Once outliers have been identified, the next step is to decide how to handle them. The best approach depends on the context of the data, the impact of outliers, and the goals of the analysis. Let's briefly review the options.
Remove Outliers
The first option is to remove outlier records altogether:
# Remove outliers detected by Z-Score (assumes a 'Z-Score' column has already been computed)
df_cleaned = df[df['Z-Score'].abs() <= 3]
# Remove outliers detected by IQR (assumes lower_bound/upper_bound have already been computed, e.g., from the Tukey whiskers)
df_cleaned = df[(df['Value'] >= lower_bound) & (df['Value'] <= upper_bound)]
This method is clean and simple, but it may eliminate useful data that would actually benefit the analysis. Only remove outliers if you are sure those values will not appear in future data or you know they are not relevant.
Cap or Winsorize Outliers
Next, you can cap your outliers by replacing the outlier value with the theoretical max or min value that represents the meaningful edge of the range.
# Cap outliers using percentile-based Winsorization
lower_cap = df['Value'].quantile(0.05)
upper_cap = df['Value'].quantile(0.95)
df['Capped'] = np.where(df['Value'] < lower_cap, lower_cap,
np.where(df['Value'] > upper_cap, upper_cap, df['Value']))
The advantage is that this method keeps as much data as possible. The disadvantage is that it creates artificial "bumps" at the edges of the data set so that it appears the distribution is multimodal.
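For a quick sanity check of that effect (continuing the hypothetical 'Capped' column and caps from the snippet above), you can count how many rows now sit exactly at the caps; those piles of identical values are the artificial "bumps":
# Count rows pushed onto each cap by the Winsorization above
print((df['Capped'] == lower_cap).sum(), 'rows now sit exactly at the lower cap')
print((df['Capped'] == upper_cap).sum(), 'rows now sit exactly at the upper cap')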
Transform Data
We likely already covered in a prior section how you can mathematically transform an entire feature from skewed to (approximately) normal. Doing so often eliminates most or all outliers in the feature. Although you may have seen this before, here is a brief example again:
# Log Transformation
df['Log_Transformed'] = np.log1p(df['Value']) # log1p avoids log(0) issues
# Square Root Transformation
df['Sqrt_Transformed'] = np.sqrt(df['Value'])
# Box-Cox Transformation (only for positive values)
from scipy.stats import boxcox
df['BoxCox_Transformed'], lambda_ = boxcox(df['Value'] + 1) # Avoid zero values
This is ideal for skewed features. But it may not be effective enough when the skewness is extreme.
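A quick way to check whether a transformation helped (continuing the hypothetical columns created in the snippet above) is to compare the skewness before and after:
# Skewness near 0 indicates a roughly symmetric, more normal-looking distribution
print(f"Original skew: {df['Value'].skew():.2f}")
print(f"Log-transformed skew: {df['Log_Transformed'].skew():.2f}")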
Use Robust Models
Another option is to ignore the outliers altogether and simply plan to use algorithms later in the modeling phase that are immune to the effect of outliers. You might be wondering why we don't always use this method. The answer is that linear algorithms that depend on a normal distribution will sometimes perform better than these robust algorithms (e.g., decision trees) if the outliers can be addressed.
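As a rough sketch of this approach (with hypothetical feature and target names), you simply leave the outliers in place and fit a tree-based model, which splits on thresholds and is therefore largely insensitive to extreme values:
import numpy as np, pandas as pd
from sklearn.ensemble import RandomForestRegressor
# Hypothetical data with a few extreme values deliberately left in place
df = pd.DataFrame({'age': np.random.randint(18, 65, 500), 'bmi': np.random.normal(30, 6, 500)})
df['charges'] = 250 * df['age'] + 100 * df['bmi'] + np.random.normal(0, 2000, 500)
df.loc[df.sample(5, random_state=1).index, 'charges'] *= 10  # inject a handful of outliers
# Tree-based models split on thresholds, so extreme values only affect the branches they land in
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(df[['age', 'bmi']], df['charges'])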
Automating in Functions
Now that we have an idea of how to identify and address outliers, let's create a couple of functions to put this all together. In particular, let's create two functions: one to handle the one-at-a-time techniques (Empirical Rule and Tukey box plot) and one to handle the more advanced all-at-once technique (DBSCAN).
One-At-A-Time
Let's begin with a function that uses either the Empirical Rule or the Tukey box plot to identify and clean outliers, depending on whether the skewness of each numeric feature is within acceptable boundaries (-1 to 1).
# Documentation
# required:
# Pandas DataFrame
# A list of features to clean outliers
# optional:
# The method of cleaning:
# remove = delete outlier rows
# replace = replace the value with the theoretical min/max
# impute = fill in a predicted amount based on a linear model
# null = keep the rows but replace the outliers with null
# Whether to include skip messages
# The skewness threshold (default 1) beyond which the Tukey rule is used instead of the Empirical Rule
def clean_outlier(df, features=[], method="remove", messages=True, skew_threshold=1):
import pandas as pd, numpy as np
for feat in features:
if feat in df.columns:
if pd.api.types.is_numeric_dtype(df[feat]):
if df[feat].nunique() != 1:
if not all(df[feat].value_counts().index.isin([0, 1])):
skew = df[feat].skew()
if skew < (-1 * skew_threshold) or skew > skew_threshold: # Tukey boxplot rule: values more than 1.5 * IQR beyond Q1/Q3 are outliers
q1 = df[feat].quantile(0.25)
q3 = df[feat].quantile(0.75)
min = q1 - (1.5 * (q3 - q1))
max = q3 + (1.5 * (q3 - q1))
else: # Empirical rule: any value more than 3 standard deviations above or below the mean is an outlier
min = df[feat].mean() - (df[feat].std() * 3)
max = df[feat].mean() + (df[feat].std() * 3)
min_count = df.loc[df[feat] < min].shape[0]
max_count = df.loc[df[feat] > max].shape[0]
if messages: print(f'{feat} has {max_count} values above max={max} and {min_count} below min={min}')
if min_count > 0 or max_count > 0:
if method == "remove": # Remove the rows with outliers
df = df[df[feat] > min]
df = df[df[feat] < max]
elif method == "replace": # Replace the outliers with the min/max cutoff
df.loc[df[feat] < min, feat] = min
df.loc[df[feat] > max, feat] = max
elif method == "impute": # Impute the outliers by deleting them and then prediting the values based on a linear regression
df.loc[df[feat] < min, feat] = np.nan
df.loc[df[feat] > max, feat] = np.nan
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=10)
df_temp = df.copy()
df_temp = bin_categories(df_temp, features=df_temp.columns, messages=False)
df_temp = basic_wrangling(df_temp, features=df_temp.columns, messages=False)
df_temp = pd.get_dummies(df_temp, drop_first=True)
df_temp = pd.DataFrame(imp.fit_transform(df_temp), columns=df_temp.columns, index=df_temp.index, dtype='float')
df_temp.columns = df_temp.columns.get_level_values(0)
df_temp.index = df_temp.index.astype('int64')
# Save only the column from df_temp that we are iterating on in the main loop because we may not want every new column
df[feat] = df_temp[feat]
elif method == "null":
df.loc[df[feat] < min, feat] = np.nan
df.loc[df[feat] > max, feat] = np.nan
else:
if messages: print(f'{feat} is a dummy code (0/1) and was ignored')
else:
if messages: print(f'{feat} has only one value ({df[feat].unique()[0]}) and was ignored')
else:
if messages: print(f'{feat} is categorical and was ignored')
else:
if messages: print(f'{feat} is not found in the DataFrame provided')
return df
Let's review what's going on in this function. If this list below is not helpful enough, then I recommend following along with the video above to understand it more completely.
Iterates Through the Given Features
Checks if each feature exists in the DataFrame.
Skips features that are:
- Categorical (non-numeric data).
- Binary (0/1) (likely dummy variables).
- Constant (only one unique value).
Detects Outliers Using Two Methods:
Tukey’s Boxplot Rule (for Skewed Data)
- If the data are highly skewed (|skew| > skew_threshold), outliers are defined as values more than 1.5 times the IQR (interquartile range) beyond Q1 or Q3.
- min = Q1 - (1.5 * IQR)
- max = Q3 + (1.5 * IQR)
Empirical Rule (for Normally Distributed Data)
- If the data is not highly skewed, outliers are values more than 3 standard deviations from the mean.
- min = mean - (3 * std)
- max = mean + (3 * std)
Counts and Logs the Number of Outliers
Calculates how many values fall below min or above max.
If messages=True, prints how many outliers exist in each column.
Handles Outliers Based on the method Parameter:
"remove": drops rows with outliers
"replace": replaces outliers with the min/max cutoff
"impute": users IterativeImputer to predict missing values
"null": replaces outliers with NaN
Returns the Cleaned DataFrame
The function modifies and returns the DataFrame with outliers handled based on the selected method.
Let's try out our new function on the datasets we have imported. First, let's examine the properties of the insurance dataset and the BMI feature in particular to see how it will change:
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
print(df_insurance.shape)
df_top5 = df_insurance.sort_values(by=['bmi'], ascending=False)['bmi'].head()
print(df_top5)
# Output:
# (1338, 7)
# 1317 53.1300
# 1047 52.5800
# 847 50.3800
# 116 49.0600
# 286 48.0700
# Name: bmi, dtype: float64
Keep these numbers for reference and now let's run the function on this dataset:
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
df_insurance = clean_outlier(df_insurance, features=df_insurance.columns, method='remove')
print(df_insurance.shape)
df_insurance.loc[df_insurance.index.isin(df_top5.index)].head()
# Output:
# age has 0 values above max=81.3569065487098 and 0 below min=-2.9428557265872257
# sex is categorical and was ignored
# bmi has 4 values above max=48.957957596023604 and 0 below min=12.368836125949496
# children has 18 values above max=4.716345622635248 and 0 below min=-2.5229423242844238
# smoker is categorical and was ignored
# region is categorical and was ignored
# charges has 129 values above max=35232.529987500006 and 0 below min=-13588.807712500002
# (1187, 7)
# age sex bmi children smoker region charges
# 286 46 female 48.0700 2 no northeast 9432.9253
After running the insurance dataset through our new function, we can see that bmi, children, and charges each have some outliers. Knowing that children is an integer with few values, I would prefer to remove that feature from the outlier detection. Notice that after removing the four outliers from bmi, we have only one of the original top-5 records left. After removing the outliers for bmi, children, and charges, we went from 1338 total records down to 1187.
Let's run this function again using the "replace" technique instead of "remove". Let's also ignore the children column:
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
df_insurance = clean_outlier(df_insurance, features=df_insurance.drop(columns=['children']).columns, method='replace')
print(df_insurance.shape)
print(df_insurance.loc[df_insurance.index.isin(df_top5.index)].head())
# Output:
# age has 0 values above max=81.3569065487098 and 0 below min=-2.9428557265872257
# sex is categorical and was ignored
# bmi has 4 values above max=48.957957596023604 and 0 below min=12.368836125949496
# smoker is categorical and was ignored
# region is categorical and was ignored
# charges has 139 values above max=34489.350562499996 and 0 below min=-13109.1508975
# (1338, 7)
# age sex bmi children smoker region charges
# 116 58 male 48.9580 0 no southeast 11381.3254
# 286 46 female 48.0700 2 no northeast 9432.9253
# 847 23 male 48.9580 1 no southeast 2438.0552
# 1047 22 male 48.9580 1 yes southeast 34489.3506
# 1317 18 male 48.9580 0 no southeast 1163.4627
Notice that, this time, we kept all 1338 records. But the values that were previously dropped for bmi and charges have now been replaced with the theoretical max or min. You can see the potential downside of this practice if you have too many outliers. Take a look at the histograms (sns.histplot()) of charges before and after replacing outliers in the output image of the code below:
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
# Save original values before outlier cleaning
df_original = df_insurance.copy()
# Apply outlier cleaning (replacing outliers with min/max cutoff)
df_insurance = clean_outlier(df_insurance, features=df_insurance.drop(columns=['children']).columns, method='replace', messages=False)
# Create subplots: 1 row, 2 columns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Plot histogram for original 'charges' values
sns.histplot(df_original['charges'], ax=axes[0], kde=True)
axes[0].set_title("Charges Distribution (Before Cleaning)")
axes[0].set_xlabel("Charges")
axes[0].set_ylabel("Count")
# Plot histogram for cleaned 'charges' values
sns.histplot(df_insurance['charges'], ax=axes[1], kde=True)
axes[1].set_title("Charges Distribution (After Cleaning)")
axes[1].set_xlabel("Charges")
axes[1].set_ylabel("Count")
# Adjust layout for better visibility
plt.tight_layout()
plt.show()
Finally, let's try it again using imputation:
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
df_insurance = clean_outlier(df_insurance, features=df_insurance.columns, method='impute')
print(df_insurance.loc[df_insurance.index.isin(df_top5.index)].head())
# Output
# age has 0 values above max=81.3569065487098 and 0 below min=-2.9428557265872257
# sex is categorical and was ignored
# bmi has 4 values above max=48.957957596023604 and 0 below min=12.368836125949496
# children has 18 values above max=4.711396007088629 and 0 below min=-2.5215604316028286
# smoker is categorical and was ignored
# region is categorical and was ignored
# charges has 139 values above max=34489.350562499996 and 0 below min=-13109.1508975
# age sex bmi children smoker region charges
# 116 58 male 33.6658 0.0000 no southeast 11381.3254
# 286 46 female 48.0700 2.0000 no northeast 9432.9253
# 847 23 male 31.9898 1.0000 no southeast 2438.0552
# 1047 22 male 36.8380 1.0000 yes southeast 18595.6066
# 1317 18 male 31.8094 0.0000 no southeast 1163.4627
All-At-Once
Next, let's create another function that uses the DBSCAN clustering algorithm.
# Documentation
# required: a Pandas DataFrame
# optional:
# messages = True -> whether or not to include detailed information about outliers
# drop_percent = 0.02 -> the percent of the dataset you want dropped as outliers. The eps parameter will be automatically adjusted to accomplish this
# distance = 'manhattan' -> the distance metric used for the clustering algorithm. Options: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
# min_samples = 5 -> the minimum number of points required to form a dense cluster
def clean_outliers(df, messages=True, drop_percent=0.02, distance='manhattan', min_samples=5):
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn import preprocessing
# Clean the dataset first
if messages: print(f"{df.shape[1] - df.dropna(axis='columns').shape[1]} columns were dropped first due to missing data")
df.dropna(axis='columns', inplace=True)
if messages: print(f"{df.shape[0] - df.dropna().shape[0]} rows were dropped first due to missing data")
df.dropna(inplace=True)
df_temp = df.copy()
df_temp = bin_categories(df_temp, features=df_temp.columns, messages=False)
df_temp = basic_wrangling(df_temp, features=df_temp.columns, messages=False)
df_temp = pd.get_dummies(df_temp, drop_first=True)
# Normalize the dataset
df_temp = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(df_temp), columns=df_temp.columns, index=df_temp.index)
# Calculate the number of outliers based on a range of eps values
outliers_per_eps = []
outliers = df_temp.shape[0]
eps = 0
if df_temp.shape[0] < 500:
iterator = 0.01
elif df_temp.shape[0] < 2000:
iterator = 0.05
elif df_temp.shape[0] < 10000:
iterator = 0.1
elif df_temp.shape[0] < 25000:
iterator = 0.2
else: # assumption: for datasets of 25,000+ rows, fall back to a coarser step so the search still runs
iterator = 0.25
while outliers > 0:
eps += iterator
db = DBSCAN(metric=distance, min_samples=min_samples, eps=eps).fit(df_temp)
outliers = np.count_nonzero(db.labels_ == -1)
outliers_per_eps.append(outliers)
if messages: print(f'eps: {round(eps, 2)}, outliers: {outliers}, percent: {round((outliers / df_temp.shape[0])*100, 3)}%')
drops = min(outliers_per_eps, key=lambda x:abs(x-round(df_temp.shape[0] * drop_percent)))
eps = (outliers_per_eps.index(drops) + 1) * iterator
db = DBSCAN(metric=distance, min_samples=min_samples, eps=eps).fit(df_temp)
df['outlier'] = db.labels_
if messages:
print(f"{df[df['outlier'] == -1].shape[0]} outlier rows removed from the DataFrame")
sns.lineplot(x=range(1, len(outliers_per_eps) + 1), y=outliers_per_eps)
sns.scatterplot(x=[eps/iterator], y=[drops])
plt.xlabel(f'eps (divide by {iterator})')
plt.ylabel('Number of Outliers')
plt.show()
# Drop rows that are outliers
df = df[df['outlier'] != -1]
return df
This function is a bit complicated, but well worth the time it takes to understand it. Follow along with the video above if needed. To summarize, this function does the following:
Data Preprocessing
Removes columns with missing data.
Removes rows with missing data.
Encodes categorical variables:
- Uses bin_categories() to group rare categories.
- Uses basic_wrangling() to remove redundant or low-value columns.
- Converts categorical features into dummy variables (pd.get_dummies()).
Normalizes all features using Min-Max Scaling to ensure numerical stability.
Finding the Optimal eps (Epsilon) for DBSCAN
Initial Setup: Defines an iterator step for eps:
- Small datasets (<500 rows): eps increases in 0.01 steps.
- Medium datasets (500–2,000 rows): eps increases in 0.05 steps.
- Large datasets (2,000–10,000 rows): eps increases in 0.1 steps.
- Very large datasets (10,000–25,000 rows): eps increases in 0.2 steps.
- Datasets of 25,000+ rows: eps increases in 0.25 steps.
Iterates through eps values:
- Runs DBSCAN with increasing eps.
- Counts outliers (db.labels_ == -1).
- Logs the number of detected outliers at each step.
Selecting the Best eps
Finds the eps value that results in approximately drop_percent% of the dataset being labeled as outliers.
Re-runs DBSCAN with the chosen eps.
Removing Outliers
Assigns an “outlier” label (-1) to detected outliers in df['outlier'].
Drops rows where outlier == -1.
Visualizing Outlier Detection (Optional)
Plots a line graph showing the number of outliers detected at each eps value.
Marks the chosen eps on the graph.
Optional Parameters:
messages=True : If True, prints detailed logs and shows plots of the outlier detection process.
drop_percent=0.02 : The percentage of the dataset to be removed as outliers (e.g., 0.02 → removes ~2% of rows as outliers).
distance='manhattan' : The distance metric used for DBSCAN clustering. Options: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'].
min_samples=5 : The minimum number of points required to form a cluster. Lower values detect more outliers, higher values detect fewer outliers.
Let's go ahead and use this function on the insurance dataset:
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
df_insurance = clean_outliers(df_insurance, min_samples=5)
print(df_insurance.shape)
print(df_insurance.head())
# Output
# 0 columns were dropped first due to missing data
# 0 rows were dropped first due to missing data
# eps: 0.05, outliers: 1327, percent: 99.178%
# eps: 0.1, outliers: 1221, percent: 91.256%
# eps: 0.15, outliers: 934, percent: 69.806%
# eps: 0.2, outliers: 680, percent: 50.822%
# eps: 0.25, outliers: 470, percent: 35.127%
# eps: 0.3, outliers: 322, percent: 24.066%
# eps: 0.35, outliers: 236, percent: 17.638%
# eps: 0.4, outliers: 153, percent: 11.435%
# eps: 0.45, outliers: 108, percent: 8.072%
# eps: 0.5, outliers: 65, percent: 4.858%
# eps: 0.55, outliers: 36, percent: 2.691%
# eps: 0.6, outliers: 9, percent: 0.673%
# eps: 0.65, outliers: 6, percent: 0.448%
# eps: 0.7, outliers: 1, percent: 0.075%
# eps: 0.75, outliers: 1, percent: 0.075%
# eps: 0.8, outliers: 0, percent: 0.0%
# 36 outlier rows removed from the DataFrame
#
# (1302, 8)
# age sex bmi children smoker region charges outlier
# 0 19 female 27.9000 0 yes southwest 16884.9240 0
# 1 18 male 33.7700 1 no southeast 1725.5523 1
# 2 28 male 33.0000 3 no southeast 4449.4620 1
# 3 33 male 22.7050 0 no northwest 21984.4706 2
# 4 32 male 28.8800 0 no northwest 3866.8552 2
As you can see, the default drop_percent value of 0.02 drops the 36 (1338 - 1302) most extreme records. This is a much better way to remove outliers if you aren't going to use a linear algorithm later in the modeling phase.