Managing Outliers

Outliers are extreme values that deviate significantly from the rest of the dataset. They can skew results, impact model performance, and distort statistical analyses. Proper identification and handling of outliers are crucial for effective data analysis.

Identifying Outliers

There are three primary techniques for identifying outliers, which we'll discuss below:

  1. Empirical Rule (Z-Score Method)

  2. Tukey’s Boxplot Rule (IQR Method)

  3. DBSCAN Clustering Algorithm

Each method has its own strengths and weaknesses, making them suitable for different types of datasets.

Empirical Rule (a.k.a. Z-Score Method or 68-95-99.7 Rule)

The Empirical Rule is appropriate when the data are assumed to be normally distributed (as with linear models) and states that for a normal distribution:

  • 68% of values fall within 1 standard deviation (σ) of the mean.

  • 95% of values fall within 2 standard deviations (σ).

  • 99.7% of values fall within 3 standard deviations (σ).

Figure 9.2: Normal Distribution and Outliers

This technique for identifying outliers is effective when the data follow a normal distribution, and it is easy to compute and interpret. However, it fails to accurately identify outliers when the data are skewed or non-normal. In addition, extreme outliers can themselves distort the mean and standard deviation.
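
To make the rule concrete in code, here is a minimal sketch (assuming a hypothetical pandas DataFrame df with a roughly normal numeric column named Value) that flags any value more than 3 standard deviations from the mean:

      import pandas as pd
      import numpy as np

      # Hypothetical data: 'Value' is assumed to be an approximately normal numeric column
      # Compute the z-score of each value: its distance from the mean in standard deviations
      z_scores = (df['Value'] - df['Value'].mean()) / df['Value'].std()

      # Flag values more than 3 standard deviations from the mean as outliers
      outlier_mask = z_scores.abs() > 3
      print(f"{outlier_mask.sum()} potential outliers found")
      print(df.loc[outlier_mask, 'Value'])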

Tukey’s Boxplot Rule (a.k.a., Interquartile Range - IQR)

This is a particular type of box plot intended for skewed distributions, where the max/min (or whiskers) of the plot are defined as the lowest/highest data points that are still within 1.5 × the interquartile range (IQR, defined as Q3 − Q1), as depicted in the image below:

Figure 9.3: Box Plot Details Mapped to Normal Distribution

Conveniently, the box plot function in the Matplotlib package (plt.boxplot()) defaults to the Tukey box plot with a 1.5 × IQR distance for the min and max values. We can generate the box plot without actually showing it, just so that we can retrieve the 1.5 × IQR min/max values. Then, as previously, we can use those values to apply the same logic as above to generate outlier replacement values at the theoretical min/max.

Boxplot Returned Values

The boxplot() function returns a dictionary containing the following information:

  • boxes: the main body of the boxplot showing the quartiles and the median’s confidence intervals (if enabled).

  • medians: the horizontal lines at the median of each box.

  • whiskers: the vertical lines extending to the most extreme, non-outlier data points.

  • caps: the horizontal lines at the ends of the whiskers.

  • fliers: the points representing data that extend beyond the whiskers.

  • means: the points or lines representing the means.
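
For example, here is a minimal sketch (assuming a hypothetical DataFrame df with a numeric column named Value) of how the whisker end points described above can be pulled out of the dictionary returned by plt.boxplot() without ever displaying the figure:

      import matplotlib.pyplot as plt

      # Build the boxplot only to harvest its computed values; close it so nothing is displayed
      results = plt.boxplot(df['Value'])
      plt.close()

      # For a single box, caps[0] is the lower cap and caps[1] is the upper cap
      tukey_min = results['caps'][0].get_ydata()[0]
      tukey_max = results['caps'][1].get_ydata()[0]
      print(f'Non-outlier range: {tukey_min} to {tukey_max}')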

In summary, the IQR, or Tukey's Box Plot, method works well when the data are skewed or non-normal, and it is not distorted by extreme values. However, it is less effective when there are multiple peaks in the data (a.k.a. "multimodal"). It also assumes a symmetric spread around the median, which may not always be true.

DBSCAN Clustering Algorithm

Another limitation of both the Empirical Rule and Tukey's BoxPlot methods is that they identify outliers in only one feature at a time. A given value may appear to be an outlier within a single feature, but when combined with all the other values in the record, it may become apparent that it is not, in fact, an outlier. For example, say there is a dataset of employees, and one employee has a salary higher than all others. If you only examine the salary feature, this employee appears to be an outlier. But if you include all the other values, perhaps you'll realize that this is the CEO, who has worked at the company longer than anyone else. In that case, you may decide that she or he is not an outlier after all.

This is the primary limitation solved by clustering-based outlier detection techniques like DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN is an unsupervised learning algorithm that clusters data points based on their density. Outliers are points not assigned to any cluster.
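
Before building a full function for this later in the chapter, here is a minimal sketch of the core idea (assuming a hypothetical DataFrame df_num containing only numeric features): DBSCAN assigns the label -1 to any point that does not belong to a dense cluster, and those points are the multivariate outliers.

      from sklearn.cluster import DBSCAN
      from sklearn.preprocessing import MinMaxScaler

      # Scale first so that no single feature dominates the distance calculations
      X = MinMaxScaler().fit_transform(df_num)

      # eps and min_samples are illustrative values; points left out of every cluster get the label -1
      labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
      outliers = df_num[labels == -1]
      print(f'{len(outliers)} multivariate outliers detected')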

Figure 9.4: DBSCAN Clusters with Outliers Identified as Noise Points

Addressing Outliers

Once outliers have been identified, the next step is to decide how to handle them. The best approach depends on the context of the data, the impact of outliers, and the goals of the analysis. Let's briefly review the options.

Remove Outliers

The first option is to remove outlier records altogether:

      # Remove outliers detected by Z-Score (computed here as the distance from the mean in standard deviations)
      df['Z-Score'] = (df['Value'] - df['Value'].mean()) / df['Value'].std()
      df_cleaned = df[df['Z-Score'].abs() <= 3]

      # Remove outliers detected by IQR (lower_bound/upper_bound are the Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR)
      df_cleaned = df[(df['Value'] >= lower_bound) & (df['Value'] <= upper_bound)]
      

This method is clean and simple, but it may eliminate useful data that we could actually benefit from including in our dataset. Only do this if you are sure those values will not appear in future data or if you know that they will not be relevant.

Cap or Winsorize Outliers

Next, you can cap your outliers by replacing the outlier value with the theoretical max or min value that represents the meaningful edge of the range.

      # Cap outliers using percentile-based Winsorization
      lower_cap = df['Value'].quantile(0.05)
      upper_cap = df['Value'].quantile(0.95)
        
      df['Capped'] = np.where(df['Value'] < lower_cap, lower_cap,
                              np.where(df['Value'] > upper_cap, upper_cap, df['Value']))
      

The advantage is that this method keeps as much data as possible. The disadvantage is that it creates artificial "bumps" at the edges of the data set so that it appears the distribution is multimodal.

Transform Data

You likely saw in a prior section how to mathematically transform an entire feature from skewed to normal. Doing this often eliminates most or all outliers in a feature. Although you have seen this before, here is a brief example again:

      # Log Transformation
      df['Log_Transformed'] = np.log1p(df['Value'])  # log1p avoids log(0) issues
        
      # Square Root Transformation
      df['Sqrt_Transformed'] = np.sqrt(df['Value'])
        
      # Box-Cox Transformation (only for positive values)
      from scipy.stats import boxcox
      df['BoxCox_Transformed'], lambda_ = boxcox(df['Value'] + 1)  # Avoid zero values
      

This is ideal for skewed features. But it may not be effective enough when the skewness is extreme.

Use Robust Models

Another option is to ignore the outliers altogether and simply plan to use algorithms later in the modeling phase that are immune to the effect of outliers (e.g., decision trees). You might be wondering why we don't always use this method. The answer is that linear algorithms that depend on a normal distribution will sometimes perform better than these outlier-robust algorithms if the outliers can be addressed.
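
As a brief illustration (a sketch only, not a required step), a tree-based model can be fit on the raw insurance data without any outlier handling, because trees split on value thresholds rather than relying on means and variances:

      from sklearn.tree import DecisionTreeRegressor

      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      X = pd.get_dummies(df_insurance.drop(columns=['charges']), drop_first=True)
      y = df_insurance['charges']

      # Extreme charges or bmi values do not distort the tree's splits the way they
      # would distort the coefficients of a linear regression; max_depth=4 is an arbitrary choice
      model = DecisionTreeRegressor(max_depth=4).fit(X, y)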

Automating in Functions

Now that we have an idea of how to identify and address outliers, let's create a couple of functions to put this all together. In particular, let's create two functions: one to handle the one-at-a-time techniques (Empirical Rule and Tukey BoxPlot) and one to handle the more advanced all-at-once technique (DBSCAN).

One-At-A-Time

Let's begin with a function that uses either the Empirical Rule or the Tukey BoxPlot to identify and clean outliers depending on whether the skewness of each numeric feature is within acceptable boundaries (-1 to 1).

      # Documentation
      # required:
      #   Pandas DataFrame
      #   A list of features to clean outliers
      # optional:
      #   The method of cleaning:
      #     remove = delete outlier rows
      #     replace = replace the value with the theoretical min/max
      #     impute = fill in a predicted amount based on a linear model
      #     null = keep the rows but replace the outliers with null
      #   Whether to include skip messages
      #   The skewness threshold (skew_threshold) beyond which the Tukey rule is used instead of the Empirical Rule
        
      def clean_outlier(df, features=[], method="remove", messages=True, skew_threshold=1):
        import pandas as pd, numpy as np
        
        for feat in features:
          if feat in df.columns:
            if pd.api.types.is_numeric_dtype(df[feat]):
              if df[feat].nunique() != 1:
                if not all(df[feat].value_counts().index.isin([0, 1])):
                  skew = df[feat].skew()
                  if skew < (-1 * skew_threshold) or skew > skew_threshold: # Tukey boxplot rule: values beyond 1.5 * IQR from the quartiles are outliers
                    q1 = df[feat].quantile(0.25)
                    q3 = df[feat].quantile(0.75)
                    min = q1 - (1.5 * (q3 - q1))
                    max = q3 + (1.5 * (q3 - q1))
                  else:  # Empirical rule: any value more than 3 std above or below the mean is an outlier
                    min = df[feat].mean() - (df[feat].std() * 3)
                    max = df[feat].mean() + (df[feat].std() * 3)
      
                  min_count = df.loc[df[feat] < min].shape[0]
                  max_count = df.loc[df[feat] > max].shape[0]
                  if messages: print(f'{feat} has {max_count} values above max={max} and {min_count} below min={min}')
      
                  if min_count > 0 or max_count > 0:
                    if method == "remove": # Remove the rows with outliers
                      df = df[df[feat] > min]
                      df = df[df[feat] < max]
                    elif method == "replace":   # Replace the outliers with the min/max cutoff
                      df.loc[df[feat] < min, feat] = min
                      df.loc[df[feat] > max, feat] = max
                    elif method == "impute": # Impute the outliers by deleting them and then predicting the values based on a linear regression
                      df.loc[df[feat] < min, feat] = np.nan
                      df.loc[df[feat] > max, feat] = np.nan
      
                      from sklearn.experimental import enable_iterative_imputer
                      from sklearn.impute import IterativeImputer
                      imp = IterativeImputer(max_iter=10)
                      df_temp = df.copy()
                      df_temp = bin_categories(df_temp, features=df_temp.columns, messages=False)
                      df_temp = basic_wrangling(df_temp, features=df_temp.columns, messages=False)
                      df_temp = pd.get_dummies(df_temp, drop_first=True)
                      df_temp = pd.DataFrame(imp.fit_transform(df_temp), columns=df_temp.columns, index=df_temp.index, dtype='float')
                      df_temp.columns = df_temp.columns.get_level_values(0)
                      df_temp.index = df_temp.index.astype('int64')
      
                      # Save only the column from df_temp that we are iterating on in the main loop because we may not want every new column
                      df[feat] = df_temp[feat]
                    elif method == "null":
                      df.loc[df[feat] < min, feat] = np.nan
                      df.loc[df[feat] > max, feat] = np.nan
                else:
                  if messages: print(f'{feat} is a dummy code (0/1) and was ignored')
              else:
                if messages: print(f'{feat} has only one value ({df[feat].unique()[0]}) and was ignored')
            else:
              if messages: print(f'{feat} is categorical and was ignored')
          else:
            if messages: print(f'{feat} is not found in the DataFrame provided')
      
        return df
      

Let's review what's going on in this function. If this list below is not helpful enough, then I recommend following along with the video above to understand it more completely.

  1. Iterates Through the Given Features

    • Checks if each feature exists in the DataFrame.

    • Skips features that are:

      • Categorical (non-numeric data).
      • Binary (0/1) (likely dummy variables).
      • Constant (only one unique value).
  2. Detects Outliers Using Two Methods:

    • Tukey’s Boxplot Rule (for Skewed Data)

      • If the data is highly skewed (|skew| > skew_threshold), outliers are defined as values beyond 1.5 times the IQR (Interquartile Range).
      • min = Q1 - (1.5 * IQR)
      • max = Q3 + (1.5 * IQR)
    • Empirical Rule (for Normally Distributed Data)

      • If the data is not highly skewed, outliers are values more than 3 standard deviations from the mean.
      • min = mean - (3 * std)
      • max = mean + (3 * std)
  3. Counts and Logs the Number of Outliers

    • Calculates how many values fall below min or above max.

    • If messages=True, prints how many outliers exist in each column.

  4. Handles Outliers Based on the method Parameter:

    • "remove": drops rows with outliers

    • "replace": replaces outliers with the min/max cutoff

    • "impute": uses IterativeImputer to predict the (now missing) outlier values

    • "null": replaces outliers with NaN

  5. Returns the Cleaned DataFrame

    • The function modifies and returns the DataFrame with outliers handled based on the selected method

Let's try out our new function on the datasets we have imported. First, let's examine the properties of the insurance dataset and the BMI feature in particular to see how it will change:

      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      print(df_insurance.shape)
      df_top5 = df_insurance.sort_values(by=['bmi'], ascending=False)['bmi'].head()
      print(df_top5)
      
      # Output:
      # (1338, 7)
      # 1317   53.1300
      # 1047   52.5800
      # 847    50.3800
      # 116    49.0600
      # 286    48.0700
      # Name: bmi, dtype: float64
      

Keep these numbers for reference and now let's run the function on this dataset:

      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      df_insurance = clean_outlier(df_insurance, features=df_insurance.columns, method='remove')
      print(df_insurance.shape)
      df_insurance.loc[df_insurance.index.isin(df_top5.index)].head()
      
      # Output:
      # age has 0 values above max=81.3569065487098 and 0 below min=-2.9428557265872257
      # sex is categorical and was ignored
      # bmi has 4 values above max=48.957957596023604 and 0 below min=12.368836125949496
      # children has 18 values above max=4.716345622635248 and 0 below min=-2.5229423242844238
      # smoker is categorical and was ignored
      # region is categorical and was ignored
      # charges has 129 values above max=35232.529987500006 and 0 below min=-13588.807712500002
      # (1187, 7)
      #      age     sex     bmi  children smoker     region   charges
      # 286   46  female 48.0700         2     no  northeast 9432.9253
      

After running the insurance dataset through our new function, we can see that bmi, children, and charges each have some outliers. Knowing that children is an integer with few values, I would prefer to remove that from the outlier detection. Notice that after removing the four outliers from BMI, we have only one of the original top 5 records left. After removing the outliers for charges and children, we went from 1338 total records down to 1187.

Let's run this function again using the "replace" technique instead of "remove". Let's also ignore the children column:

      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      df_insurance = clean_outlier(df_insurance, features=df_insurance.drop(columns=['children']).columns, method='replace')
      print(df_insurance.shape)
      print(df_insurance.loc[df_insurance.index.isin(df_top5.index)].head())
      
      # Output:
      # age has 0 values above max=81.3569065487098 and 0 below min=-2.9428557265872257
      # sex is categorical and was ignored
      # bmi has 4 values above max=48.957957596023604 and 0 below min=12.368836125949496
      # smoker is categorical and was ignored
      # region is categorical and was ignored
      # charges has 139 values above max=34489.350562499996 and 0 below min=-13109.1508975
      # (1338, 7)
      #       age     sex     bmi  children smoker     region    charges
      # 116    58    male 48.9580         0     no  southeast 11381.3254
      # 286    46  female 48.0700         2     no  northeast  9432.9253
      # 847    23    male 48.9580         1     no  southeast  2438.0552
      # 1047   22    male 48.9580         1    yes  southeast 34489.3506
      # 1317   18    male 48.9580         0     no  southeast  1163.4627
      

Notice that, this time, we kept all 1338 records. But the values that were previously dropped for bmi and charges have now been replaced with the theoretical max or min. You can see the potential downside of this practice if you have too many outliers. Take a look at the histplot() of charges before and after replacing outliers in the output image of the code below:

      import seaborn as sns
      import matplotlib.pyplot as plt
        
      # Load dataset
      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
        
      # Save a copy of the original data before outlier cleaning
      df_original = df_insurance.copy()
        
      # Apply outlier cleaning (replacing outliers with min/max cutoff)
      df_insurance = clean_outlier(df_insurance, features=df_insurance.drop(columns=['children']).columns, method='replace', messages=False)
        
      # Create subplots: 1 row, 2 columns
      fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        
      # Plot histogram for original 'charges' values
      sns.histplot(df_original['charges'], ax=axes[0], kde=True)
      axes[0].set_title("Charges Distribution (Before Cleaning)")
      axes[0].set_xlabel("Charges")
      axes[0].set_ylabel("Count")
        
      # Plot histogram for cleaned 'charges' values
      sns.histplot(df_insurance['charges'], ax=axes[1], kde=True)
      axes[1].set_title("Charges Distribution (After Cleaning)")
      axes[1].set_xlabel("Charges")
      axes[1].set_ylabel("Count")
        
      # Adjust layout for better visibility
      plt.tight_layout()
      plt.show()
      

Finally, let's try it again using imputation:

      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      df_insurance = clean_outlier(df_insurance, features=df_insurance.columns, method='impute')
      print(df_insurance.loc[df_insurance.index.isin(df_top5.index)].head())
      
      # Output
      # age has 0 values above max=81.3569065487098 and 0 below min=-2.9428557265872257
      # sex is categorical and was ignored
      # bmi has 4 values above max=48.957957596023604 and 0 below min=12.368836125949496
      # children has 18 values above max=4.711396007088629 and 0 below min=-2.5215604316028286
      # smoker is categorical and was ignored
      # region is categorical and was ignored
      # charges has 139 values above max=34489.350562499996 and 0 below min=-13109.1508975
      #       age     sex     bmi  children smoker     region    charges
      # 116    58    male 33.6658    0.0000     no  southeast 11381.3254
      # 286    46  female 48.0700    2.0000     no  northeast  9432.9253
      # 847    23    male 31.9898    1.0000     no  southeast  2438.0552
      # 1047   22    male 36.8380    1.0000    yes  southeast 18595.6066
      # 1317   18    male 31.8094    0.0000     no  southeast  1163.4627
      

All-At-Once

Next, let's create another function that uses the DBSCAN clustering algorithm.

      # Documentation
      # required: a Pandas DataFrame
      # optional:
      #  messages = True        -> whether or not to include detailed information about outliers
      #  drop_percent = 0.02    -> the percent of the dataset you want dropped as outliers. The eps parameter will be automatically adjusted to accomplish this
      #  distance = 'manhattan' -> the distance metric used for the clustering algorithm. Options: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
      #  min_samples = 5        -> the minimum number of neighboring points required for a point to belong to a dense cluster
        
      def clean_outliers(df, messages=True, drop_percent=0.02, distance='manhattan', min_samples=5):
        import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
        from sklearn.cluster import DBSCAN
        from sklearn import preprocessing
        
        # Clean the dataset first
        if messages: print(f"{df.shape[1] - df.dropna(axis='columns').shape[1]} columns were dropped first due to missing data")
        df.dropna(axis='columns', inplace=True)
        if messages: print(f"{df.shape[0] - df.dropna().shape[0]} rows were dropped first due to missing data")
        df.dropna(inplace=True)
        df_temp = df.copy()
        df_temp = bin_categories(df_temp, features=df_temp.columns, messages=False)
        df_temp = basic_wrangling(df_temp, features=df_temp.columns, messages=False)
        df_temp = pd.get_dummies(df_temp, drop_first=True)
        # Normalize the dataset
        df_temp = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(df_temp), columns=df_temp.columns, index=df_temp.index)
      
        # Calculate the number of outliers based on a range of eps values
        outliers_per_eps = []
        outliers = df_temp.shape[0]
        eps = 0
      
        if df_temp.shape[0] < 500:
          iterator = 0.01
        elif df_temp.shape[0] < 2000:
          iterator = 0.05
        elif df_temp.shape[0] < 10000:
          iterator = 0.1
        else:  # 10,000 rows or more (also prevents iterator from being undefined for very large datasets)
          iterator = 0.2
        
        while outliers > 0:
          eps += iterator
          db = DBSCAN(metric=distance, min_samples=min_samples, eps=eps).fit(df_temp)
          outliers = np.count_nonzero(db.labels_ == -1)
          outliers_per_eps.append(outliers)
          if messages: print(f'eps: {round(eps, 2)}, outliers: {outliers}, percent: {round((outliers / df_temp.shape[0])*100, 3)}%')
       
        drops = min(outliers_per_eps, key=lambda x:abs(x-round(df_temp.shape[0] * drop_percent)))
        eps = (outliers_per_eps.index(drops) + 1) * iterator
        db = DBSCAN(metric=distance, min_samples=min_samples, eps=eps).fit(df_temp)
        df['outlier'] = db.labels_
        
        if messages:
          print(f"{df[df['outlier'] == -1].shape[0]} outlier rows removed from the DataFrame")
          sns.lineplot(x=range(1, len(outliers_per_eps) + 1), y=outliers_per_eps)
          sns.scatterplot(x=[eps/iterator], y=[drops])
          plt.xlabel(f'eps (divide by {iterator})')
          plt.ylabel('Number of Outliers')
          plt.show()
        
        # Drop rows that are outliers
        df = df[df['outlier'] != -1]
        return df
      

This function is a bit complicated, but well worth the time it takes to understand it. Follow along with the video above if needed. To summarize, this function does the following:

  1. Data Preprocessing

    • Removes columns with missing data.

    • Removes rows with missing data.

    • Encodes categorical variables:

      • Uses bin_categories() to group rare categories.
      • Uses basic_wrangling() to remove redundant or low-value columns.
      • Converts categorical features into dummy variables (pd.get_dummies()).
    • Normalizes all features using Min-Max Scaling to ensure numerical stability.

  2. Finding the Optimal eps (Epsilon) for DBSCAN

    • Initial Setup: Defines an iterator step for eps:

      • Small datasets (<500 rows): eps increases in 0.01 steps.
      • Medium datasets (500–2,000 rows): eps increases in 0.05 steps.
      • Large datasets (2,000–10,000 rows): eps increases in 0.1 steps.
      • Very large datasets (10,000 rows or more): eps increases in 0.2 steps.
    • Iterates through eps values:

      • Runs DBSCAN with increasing eps.
      • Counts outliers (db.labels_ == -1).
      • Logs the number of detected outliers at each step.
  3. Selecting the Best eps

    • Finds the eps value that results in approximately drop_percent% of the dataset being labeled as outliers.

    • Re-runs DBSCAN with the chosen eps.

  4. Removing Outliers

    • Assigns an “outlier” label (-1) to detected outliers in df['outlier'].

    • Drops rows where outlier == -1.

  5. Visualizing Outlier Detection (Optional)

    • Plots a line graph showing the number of outliers detected at each eps value.

    • Marks the chosen eps on the graph.

  6. Optional Parameters:

    • messages=True : If True, prints detailed logs and shows plots of the outlier detection process.

    • drop_percent=0.02 : The percentage of the dataset to be removed as outliers (e.g., 0.02 → removes ~2% of rows as outliers).

    • distance='manhattan' : The distance metric used for DBSCAN clustering. Options: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'].

    • min_samples=5 : The minimum number of points required to form a cluster. Lower values detect more outliers, higher values detect fewer outliers.

Let's go ahead and use this function on the insurance dataset:

      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      df_insurance = clean_outliers(df_insurance, min_samples=5)
      print(df_insurance.shape)
      print(df_insurance.head())
      
      # Output
      # 0 columns were dropped first due to missing data
      # 0 rows were dropped first due to missing data
      # eps: 0.05, outliers: 1327, percent: 99.178%
      # eps: 0.1, outliers: 1221, percent: 91.256%
      # eps: 0.15, outliers: 934, percent: 69.806%
      # eps: 0.2, outliers: 680, percent: 50.822%
      # eps: 0.25, outliers: 470, percent: 35.127%
      # eps: 0.3, outliers: 322, percent: 24.066%
      # eps: 0.35, outliers: 236, percent: 17.638%
      # eps: 0.4, outliers: 153, percent: 11.435%
      # eps: 0.45, outliers: 108, percent: 8.072%
      # eps: 0.5, outliers: 65, percent: 4.858%
      # eps: 0.55, outliers: 36, percent: 2.691%
      # eps: 0.6, outliers: 9, percent: 0.673%
      # eps: 0.65, outliers: 6, percent: 0.448%
      # eps: 0.7, outliers: 1, percent: 0.075%
      # eps: 0.75, outliers: 1, percent: 0.075%
      # eps: 0.8, outliers: 0, percent: 0.0%
      # 36 outlier rows removed from the DataFrame
      #  
      # (1302, 8)
      #    age     sex     bmi  children smoker     region    charges  outlier
      # 0   19  female 27.9000         0    yes  southwest 16884.9240        0
      # 1   18    male 33.7700         1     no  southeast  1725.5523        1
      # 2   28    male 33.0000         3     no  southeast  4449.4620        1
      # 3   33    male 22.7050         0     no  northwest 21984.4706        2
      # 4   32    male 28.8800         0     no  northwest  3866.8552        2
      

As you can see, the default drop_percent value of 0.02 targets roughly 1338 × 0.02 ≈ 27 outliers; the closest count in the eps sweep is 36 (at eps = 0.55), so the function drops the 36 (1338 − 1302) most extreme records. This is a much better way to remove outliers if you aren't going to use a linear algorithm later in the modeling phase.