Missing Data

Missing data is a common challenge in data analysis and machine learning projects. If not handled correctly, it can lead to biased conclusions, inaccurate predictions, and ineffective models.

Types of Missing Data

There are three key types of missing data:

  • MCAR (Missing Completely at Random)

    • Data is missing randomly with no pattern.

    • Least problematic; missingness is independent of the data.

  • MAR (Missing at Random)

    • Missingness depends on other observed variables but not the missing values themselves.

    • Can be corrected using statistical methods.

  • MNAR (Missing Not at Random)

    • Missingness depends on unobserved values or the missing values themselves.

    • Potentially (but not necessarily) the most problematic; typically requires domain knowledge to handle.

These categories can be hard to grasp in the abstract, so let's make them concrete with a small simulation and then look at common reasons why data goes missing.
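
The sketch below simulates each of the three mechanisms; the DataFrame and its columns (age, income) are invented purely for illustration and are not part of the datasets used later in this section:

      import numpy as np
      import pandas as pd
      
      rng = np.random.default_rng(42)
      n = 1000
      toy = pd.DataFrame({'age': rng.integers(18, 80, n),
                          'income': rng.normal(50000, 15000, n)})
      
      # MCAR: a random 10% of income values go missing, unrelated to anything in the data
      mcar = toy.copy()
      mcar.loc[rng.random(n) < 0.10, 'income'] = np.nan
      
      # MAR: income is more likely to be missing for younger respondents
      # (missingness depends on the observed age column, not on income itself)
      mar = toy.copy()
      mar.loc[(mar['age'] < 30) & (rng.random(n) < 0.30), 'income'] = np.nan
      
      # MNAR: high earners are more likely to withhold their income
      # (missingness depends on the value that is missing)
      mnar = toy.copy()
      mnar.loc[(mnar['income'] > 70000) & (rng.random(n) < 0.50), 'income'] = np.nan
      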

Causes of Missing Data

Missing data can arise due to various reasons, often categorized into the following:

  • Human Error: Data entry mistakes, forgetting to record values.

    • Type: MCAR

    • Examples

      • A survey respondent skips a question accidentally.
      • A data entry operator mistypes a value or leaves a field blank.
      • A researcher misplaces handwritten data before digitizing it.
  • Data Processing Issues: Errors in merging, exporting, or loading datasets.

    • Type: MCAR

    • Examples

      • A CSV file is corrupted, resulting in missing values.
      • A database migration fails, causing some records to be lost.
      • A bug in an ETL (Extract, Transform, Load) pipeline causes missing entries.
  • System Constraints: Sensors failing to capture data, software crashes.

    • Type: MCAR

    • Examples

      • A temperature sensor stops working intermittently.
      • A power outage interrupts data logging.
      • A website’s tracking system fails to record user activity due to server downtime.
  • Respondent Unwillingness: Participants deliberately leave fields blank based on personal preference.

    • Type: MNAR

    • Examples

      • People avoid disclosing income in a survey.
      • Patients in a medical study refuse to answer personal questions.
      • Employees skip job satisfaction questions for fear of anonymity issues.
  • Non-Applicable Data: Some values are not relevant to specific cases.

    • Type: MNAR

    • Examples

      • “Marital Status” is missing for respondents who are children.
      • “Previous Employer” is missing for someone’s first job.
      • “Car Ownership” is missing for people who do not drive.

Methods to Handle Missing Data

  1. Removing Data

    • Drop Missing Rows: Remove records with missing values.
      Advantages: Simple and easy to implement.
      Disadvantages: Risk of losing valuable data, especially if many rows have missing values.
    • Drop Missing Columns: Remove entire columns with high missing percentages.
      Advantages: Useful when a column is mostly empty.
      Disadvantages: Could remove important features.

  2. Simple Imputation (Filling Missing Data)

    • Fill with a Fixed Value: Replace missing values with a constant (e.g., 0, “Unknown”).
      Advantages: Simple and does not require computation.
      Disadvantages: May introduce bias if the value is not representative.
    • Fill with Mean/Median/Mode: Replace missing values with the column’s mean, median, or mode.
      Advantages: Works well when data is normally distributed.
      Disadvantages: Can distort skewed distributions.
    • Forward/Backward Fill: Use the previous or next value in the column.
      Advantages: Useful for time-series data.
      Disadvantages: Can propagate errors if the missing values are not sequential.

  3. Advanced Imputation Techniques

    • K-Nearest Neighbors (KNN) Imputation: Uses the nearest neighbors to predict missing values.
      Advantages: Maintains relationships between variables.
      Disadvantages: Computationally expensive for large datasets.
    • Regression Imputation: Predicts missing values using regression models.
      Advantages: More accurate for MAR data.
      Disadvantages: Assumes a linear relationship that may not always hold.
    • Multiple Imputation: Uses probabilistic modeling to generate multiple imputed datasets and averages the results.
      Advantages: Reduces bias and increases robustness.
      Disadvantages: Requires significant computational resources.

  4. Machine Learning Imputation: Instead of filling missing data manually, we can train models to predict missing values.

    • Deep Learning-based Imputation: Uses neural networks to estimate missing values.
      Advantages: Can handle complex patterns in data.
      Disadvantages: Requires a large dataset and high computational power.
    • Random Forest Imputation: Uses multiple decision trees to predict missing values.
      Advantages: Handles non-linear relationships well.
      Disadvantages: Computationally expensive.
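
Before digging into cost tradeoffs, here is a minimal pandas sketch of the simplest options from the tables above (removal, fixed/mean fill, and forward fill); the small DataFrame df_demo and its columns are invented purely for illustration:

      import numpy as np
      import pandas as pd
      
      # Invented demo frame with a few holes
      df_demo = pd.DataFrame({'age':  [25, np.nan, 47, 52, np.nan],
                              'city': ['NY', 'LA', None, 'NY', 'LA'],
                              'temp': [70.1, 68.4, np.nan, 71.0, 69.3]})
      
      # 1. Removing data
      rows_dropped = df_demo.dropna(axis=0)            # drop any row containing a missing value
      cols_dropped = df_demo.dropna(axis=1, thresh=4)  # keep only columns with at least 4 non-null values
      
      # 2. Simple imputation
      fixed  = df_demo.fillna({'city': 'Unknown'})             # fill with a fixed value
      mean_f = df_demo.fillna({'age': df_demo['age'].mean()})  # fill with the mean (median/mode work the same way)
      ffill  = df_demo['temp'].ffill()                         # forward fill, common for time-series data
      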

As you learn about all of these methods, you may be wondering why we don't just use the most advanced and accurate technique, such as Deep Learning-based Imputation, every time. Well, everything comes with a cost. The most accurate techniques are also typically the most expensive in terms of computational power required. If we are building a machine learning pipeline that needs to execute within an hour, it may not be possible to use the most advanced methods. You have to make tradeoff decisions as depicted in the image below:

Automated Missing Data Handling

The next question is how to automate this process as much as possible. Every situation is unique and needs serious thought to address the various causes of missing data correctly, but we can automate some of it. The video below will guide you through the fairly complex function that follows it:

Let's start by pulling in the datasets that we are going to experiment with (unless you already have them loaded from a prior section):

      # Mount Google Drive if needed and bring in some sample data
      from google.colab import drive
      drive.mount('/content/drive')
        
      import pandas as pd
        
      # Datasets with numeric label for testing
      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
      df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
        
      # Dataset with categorical label for testing
      df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
      

Next, let's create a function that will handle dropping data so that the maximum possible amount is retained. Follow along with the video above if you don't understand how part of the function works.

      def missing_drop(df, label="", features=[], messages=True, row_threshold=.9, col_threshold=.5):
        import pandas as pd
        
        start_count = df.count().sum()  # Store the initial count of non-null values
        
        # Drop columns with missing values beyond the specified column threshold
        df.dropna(axis=1, thresh=round(col_threshold * df.shape[0]), inplace=True)
        # Drop rows that have fewer non-null values than the row threshold allows
        df.dropna(axis=0, thresh=round(row_threshold * df.shape[1]), inplace=True)
        # If a label is specified, ensure it has no missing values
        if label != "": 
          df.dropna(axis=0, subset=[label], inplace=True)
        
        # Function to generate a summary of missing data for each column
        def generate_missing_table():
          df_results = pd.DataFrame(columns=['Missing', 'column', 'rows'])
          for feat in df:
            missing = df[feat].isna().sum()  # Count missing values in column
            if missing > 0:  # Only analyze features that actually have missing values
              memory_col = df.drop(columns=[feat]).count().sum()  # Non-null cells remaining if this column is dropped
              memory_rows = df.dropna(subset=[feat]).count().sum()  # Non-null cells remaining if the rows missing this feature are dropped
              df_results.loc[feat] = [missing, memory_col, memory_rows]  # Store results
          return df_results
        
        df_results = generate_missing_table()  # Generate initial missing data table
        
        # Iteratively remove the column or row that preserves the most non-null data
        while df_results.shape[0] > 0:
          max_val = df_results[['column', 'rows']].max(axis=1)[0]  # Find the max retained-cell value among the column/rows options
          max_axis = df_results.columns[df_results.isin([max_val]).any()][0]  # Determine whether that maximum comes from a column drop or a row drop
          if messages: print(max_val, max_axis)
        
          df_results.sort_values(by=[max_axis], ascending=False, inplace=True)  # Sort missing data table by max_axis
          if messages: print('\n', df_results)
        
          # Drop the most impactful missing data (either row or column)
          if max_axis == 'rows':
            df.dropna(axis=0, subset=[df_results.index[0]], inplace=True)  # Drop row with highest missing impact
          else:
            df.drop(columns=[df_results.index[0]], inplace=True)  # Drop column with highest missing impact
        
          df_results = generate_missing_table()  # Recalculate missing data table after dropping
        
        # Print the percentage of non-null values retained
        if messages: 
          print(f'{round(df.count().sum() / start_count * 100, 2)}% ({df.count().sum()}) / ({start_count}) of non-null cells were kept.')
          
        return df
      

So what exactly is going on in this function? Basically:

  1. Initial Setup & Count
    • It calculates the total number of non-null cells (start_count).
  2. Drops Columns with Too Many Missing Values
    • If a column has more than (1 - col_threshold)% missing values, it is removed.
  3. Drops Rows with Too Many Missing Values
    • If a row has more than (1 - row_threshold)% missing values, it is removed.
  4. Ensures the Label Column (if specified) Has No Missing Values
    • If label is provided, all rows where the label column is missing are removed.
  5. Iteratively Drops the Most Problematic Columns/Rows
    • The function builds a missing data report (generate_missing_table) that, for each feature with missing values, records how many non-null cells would remain if (a) that column were dropped or (b) the rows missing that feature were dropped.
    • While any features still have missing values:
      • It picks the option (drop the column, or drop the affected rows) that retains the most non-null cells.
      • It performs that drop.
      • It rebuilds the report and repeats.
  6. Final Report & Return
    • Prints the percentage of non-null data retained.
    • Returns the cleaned DataFrame.
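
Steps 2 and 3 lean on pandas' thresh argument to dropna. Here is a minimal sketch of how those two calls behave on a tiny invented frame, using the same default thresholds (row_threshold=.9, col_threshold=.5):

      import numpy as np
      import pandas as pd
      
      toy = pd.DataFrame({'a': [1, 2, 3, 4, np.nan, 6, 7, 8, 9, 10],
                          'b': [np.nan] * 6 + [1, 2, 3, 4],   # 60% missing, below col_threshold of .5
                          'c': list(range(10))})
      col_threshold, row_threshold = .5, .9
      
      # Keep only columns with at least 50% non-null values (column 'b' is dropped)
      toy = toy.dropna(axis=1, thresh=round(col_threshold * toy.shape[0]))
      
      # Keep only rows with at least ~90% of their values present
      # (with 2 columns left, thresh = round(.9 * 2) = 2, so the row missing 'a' is dropped)
      toy = toy.dropna(axis=0, thresh=round(row_threshold * toy.shape[1]))
      
      print(toy.shape)  # (9, 2)
      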

Function Definition

Necessary Parameters

  • df (required): The Pandas DataFrame that needs missing value handling.

Optional Parameters

  • label (default ""): If provided, ensures that this label (target) column has no missing values by dropping rows where it is missing. If "", this step is skipped.
  • features (default []): A list of specific columns to check for missing values. If empty, all columns are considered. (Note: in the version of the function shown above, this parameter is accepted but not yet used.)
  • messages (default True): If True, prints progress updates, including the percentage of non-null values retained. Set to False to suppress output.
  • row_threshold (default 0.9): Minimum proportion of non-missing values required to keep a row. Rows missing more than (1 - row_threshold)% of their values are removed (e.g., 0.9 → drop rows missing more than 10% of values).
  • col_threshold (default 0.5): Minimum proportion of non-missing values required to keep a column. Columns missing more than (1 - col_threshold)% of their values are removed (e.g., 0.5 → drop columns missing more than 50% of values).
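
You can tighten or loosen the thresholds per dataset. The call below is only a hypothetical illustration; it assumes the insurance data uses 'charges' as its numeric label, so substitute your own column name:

      # Hypothetical call: require rows to be at least 85% complete and columns at least 60% complete,
      # guarantee the (assumed) label column 'charges' has no missing values, and suppress progress output
      df_insurance_clean = missing_drop(df_insurance.copy(), label='charges',
                                        row_threshold=.85, col_threshold=.6, messages=False)
      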

Next, let's try it out on our datasets:

      missing_drop(df_airbnb.copy()).head()
      
      # Output:
      # 263796 rows

      #             Missing  column    rows
      # name             24  246536  263796
      # host_name       144  246656  262116
      # 261780 rows
        
      #             Missing  column    rows
      # host_name       144  246344  261780
      # 88.63% (261780) / (295375) of non-null cells were kept.
      
      # See the DataFrame head() in your own notebook
      

Let's keep working through the other datasets:

      missing_drop(df_nba.copy()).head()
      
      # Output:
      # 14505 rows

      #        Missing  column   rows
      #   2P%        3   14412  14784
      #   3P%       12   14421  14505
      #   FT%       22   14431  14197
      #   14414 rows
        
      #        Missing  column   rows
      #   3P%       12   14333  14414
      #   FT%       20   14341  14166
      #   13981 column
        
      #        Missing  column   rows
      #   FT%       18   13981  13856
      #   93.82% (13981) / (14902) of non-null cells were kept.
      
      # See the DataFrame head() in your own notebook
      
      missing_drop(df_airline.copy()).head()
      
      # Output:
      # 2978201 rows

      #                             Missing   column     rows
      #   Arrival Delay in Minutes      393  2857360  2978201
      #   99.71% (2978201) / (2986847) of non-null cells were kept.
      
      # See the DataFrame head() in your own notebook
      

As you've hopefully noticed from either the video or working through the code on your own, this missing_drop function does not drop every column or row with missing values. It only drops the biggest offenders--those where the missing percentage exceeds a particular threshold. For rows and columns where most of the data is available, we will want to impute the remaining missing values instead. This is why we also have the next function below:

      def missing_fill(df, label, features=[], row_threshold=.9, col_threshold=.5, acceptable=0.1, mar='drop', force_impute=False, large_dataset=200000, messages=True):
        import pandas as pd, numpy as np
        from scipy import stats
        from statsmodels.stats.proportion import proportions_ztest
        pd.set_option('display.float_format', lambda x: '%.4f' % x)  # Display float values with 4 decimal places
        from IPython.display import display
        
        # Ensure the provided label column exists in the DataFrame
        if not label in df.columns:
          print(f'The label provided ({label}) does not exist in the DataFrame provided')
          return df
        
        start_count = df.count().sum()  # Store the initial count of non-null values
      
        # Drop columns with missing data above the threshold
        df.dropna(axis=1, thresh=round(col_threshold * df.shape[0]), inplace=True)
        # Drop rows that have fewer non-null values than row_threshold allows
        df.dropna(axis=0, thresh=round(row_threshold * df.shape[1]), inplace=True)
        if label != "": df.dropna(axis=0, subset=[label], inplace=True)  # Ensure label column has no missing values
        
        # If no features are specified, consider all columns as features
        if len(features) == 0: features = df.columns  
        
        # If the label column is numeric, perform a t-test for missing vs non-missing groups
        if pd.api.types.is_numeric_dtype(df[label]):
          df_results = pd.DataFrame(columns=['total missing', 'null x̄', 'non-null x̄', 'null s', 'non-null s', 't', 'p'])
          for feat in features:
            missing = df[feat].isna().sum()  # Count missing values
            if missing > 0:
              null = df[df[feat].isna()]  # Subset where feature is missing
              nonnull = df[~df[feat].isna()]  # Subset where feature is present
              t, p = stats.ttest_ind(null[label], nonnull[label])  # Perform t-test to check for MAR vs MCAR
              df_results.loc[feat] = [round(missing), round(null[label].mean(), 6), round(nonnull[label].mean(), 6),
                                      round(null[label].std(), 6), round(nonnull[label].std(), 6), t, p]
        else:
          # If label is categorical, use proportions_ztest to check for MAR vs MCAR
          df_results = pd.DataFrame(columns=['total missing', 'null p̂', 'non-null p̂', 'Z', 'p'])
          for feat in features:
            missing = df[feat].isna().sum()
            if missing > 0:
              null = df[df[feat].isna()]
              nonnull = df[~df[feat].isna()]
              for group in null[label].unique():
                p1_num = null[null[label]==group].shape[0]  # Count of group in missing subset
                p1_den = null[null[label]!=group].shape[0]  # Count of others in missing subset
                p2_num = nonnull[nonnull[label]==group].shape[0]  # Count of group in non-missing subset
                p2_den = nonnull[nonnull[label]!=group].shape[0]  # Count of others in non-missing subset
                
                if p1_num < p1_den:  # Avoid division by zero
                  numerators = np.array([p1_num, p2_num])
                  denominators = np.array([p1_den, p2_den])
                  z, p = proportions_ztest(numerators, denominators)  # Conduct z-test
                  df_results.loc[f'{feat}_{group}'] = [round(missing), round(p1_num/p1_den, 6), round(p2_num/p2_den, 6), z, p]
      
        # Display the missing data analysis results
        if messages: display(df_results)
        
        # Determine if data is MAR (Missing at Random) or MCAR (Missing Completely at Random)
        if df_results[df_results['p'] < 0.05].shape[0] / df_results.shape[0] > acceptable and not force_impute:
          if mar == 'drop':
            df.dropna(inplace=True)  # Drop all rows containing missing values
            if messages: print('null rows dropped')
          else:  # Last resort: fill missing values with the median
            for feat in df_results.index:
              if pd.api.types.is_numeric_dtype(df[feat]):
                df[feat].fillna(df[feat].median(), inplace=True)
                if messages: print(f'{feat} filled with median ({df[feat].median()})')
              else:
                df[feat].fillna('missing', inplace=True)  # Fill categorical missing values with "missing"
                if messages: print(f'{feat} filled with "missing"')
        else:
          # If missing data is MCAR, perform imputation using either KNN or IterativeImputer
          from sklearn.preprocessing import OrdinalEncoder
          oe = OrdinalEncoder()
          df_encoded = oe.fit_transform(df)  # Fit the encoder and convert categorical values to numeric codes
      
          if df.count().sum() > large_dataset:
            from sklearn.impute import KNNImputer
            imp = KNNImputer()  # Use K-Nearest Neighbors Imputation for large datasets
            df_imputed = imp.fit_transform(df_encoded)
            df_recoded = oe.inverse_transform(df_imputed)
            df = pd.DataFrame(df_recoded, columns=df.columns, index=df.index)
          else:
            from sklearn.experimental import enable_iterative_imputer  # Must be imported before IterativeImputer
            from sklearn.impute import IterativeImputer
            imp = IterativeImputer()  # Use Iterative Imputer for smaller datasets
            # Note: this branch imputes df directly rather than df_encoded, so it assumes the remaining features are numeric
            df = pd.DataFrame(imp.fit_transform(df), columns=df.columns, index=df.index)
          
          if messages: print(f'null values imputed')
      
        return df
      

How does this function work?

  1. Initial Setup & Dropping Rows/Columns with Too Much Missing Data
    • Removes columns that have more than (1 - col_threshold)% missing values.
    • Removes rows that have more than (1 - row_threshold)% missing values.
    • Ensures the label column has no missing values.
  2. Statistical Testing for Missing Data Bias
    • If the label column is numeric, it performs a t-test to see if the missing values in each feature affect the label’s distribution.
    • If the label column is categorical, it performs a proportion z-test to compare missing vs. non-missing groups.
  3. Determining if Missing Data is MAR (Missing At Random) or MCAR (Missing Completely At Random)
    • If more than the acceptable proportion of features show statistically significant differences (p < 0.05), the missingness is treated as MAR.
      • If mar='drop', rows with missing values are removed.
      • If mar='impute', missing numeric values are filled with the column median and missing categorical values with the string "missing".
  4. Imputation of Missing Data (If Needed)
    • If the missingness is not flagged as MAR (or if force_impute=True), missing values are imputed with scikit-learn:
      • Large datasets (more than large_dataset non-null cells) → K-Nearest Neighbors (KNN) Imputation.
      • Smaller datasets → Iterative Imputation (a multivariate imputer).
  5. Final Report & Return
    • Displays statistical tests on missing data.
    • Prints messages if requested.
    • Returns the cleaned DataFrame.
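
The heart of step 2 is simply comparing the label values of rows that are missing a feature against rows that are not. Here is a stripped-down sketch of that idea on an invented numeric label (the real function wraps this in a loop over features and formats the results into a table):

      import numpy as np
      import pandas as pd
      from scipy import stats
      
      rng = np.random.default_rng(0)
      toy = pd.DataFrame({'label': rng.normal(100, 20, 500),
                          'feature': rng.normal(0, 1, 500)})
      
      # Punch MAR-style holes: 'feature' goes missing more often when the label is high,
      # so the missing and non-missing groups should differ on the label
      toy.loc[(toy['label'] > 110) & (rng.random(500) < 0.5), 'feature'] = np.nan
      
      null = toy[toy['feature'].isna()]
      nonnull = toy[~toy['feature'].isna()]
      t, p = stats.ttest_ind(null['label'], nonnull['label'])
      print(f't = {t:.2f}, p = {p:.4f}')  # a small p suggests the missingness is related to the label (not MCAR)
      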

Function Definition

Necessary Parameters

  • df (required): The Pandas DataFrame that needs missing value handling.
  • label (required): The name of the column to be used as the target (dependent) variable. This is important for performing statistical tests on missing data.

Optional Parameters

  • features (default []): List of specific columns to check for missing values. If empty, all columns are considered.
  • row_threshold (default 0.9): Minimum proportion of non-missing values required to keep a row. Rows missing more than (1 - row_threshold)% of their values are removed (e.g., 0.9 → drop rows missing more than 10% of values).
  • col_threshold (default 0.5): Minimum proportion of non-missing values required to keep a column. Columns missing more than (1 - col_threshold)% of their values are removed (e.g., 0.5 → drop columns missing more than 50% of values).
  • acceptable (default 0.1): Proportion of features that may show statistically significant missingness (p < 0.05). If more than this proportion of the tested features appear MAR (Missing At Random), the model-based imputation step is skipped unless force_impute=True; the mar strategy below is applied instead.
  • mar (default "drop"): How to handle missing values judged MAR: "drop" removes rows with missing values; "impute" fills numeric features with the median and categorical features with the string "missing".
  • force_impute (default False): If True, forces model-based imputation even if the MAR analysis suggests dropping or median-filling. Useful when you want to retain all data points.
  • large_dataset (default 200000): Threshold for deciding which imputer to use. If df contains more than this many non-null cells (df.count().sum()), KNNImputer is used instead of IterativeImputer.
  • messages (default True): If True, prints the statistical test results and progress updates. Set to False to suppress output.
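
If you want to force the model-based imputation path even when the statistical tests flag bias, you can raise acceptable or set force_impute=True. The call below is only a hypothetical illustration (it reuses df_airbnb and the 'average_review' label from the examples that follow):

      # Hypothetical call: tolerate more significant tests and force imputation rather than dropping rows
      df_airbnb_imputed = missing_fill(df_airbnb.copy(), 'average_review',
                                       acceptable=0.3, force_impute=True, messages=False)
      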

As you can tell, this is a complex function. Let's test it out:

      df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
      df_airbnb = bin_categories(df_airbnb)
      df_airbnb_clean = missing_fill(df_airbnb.copy(), 'average_review', mar='drop')
      print(f'\nOriginal df_airbnb:\t{df_airbnb.shape}')
      print(f'Cleaned df_airbnb:\t{df_airbnb_clean.shape}\n')
      print(df_airbnb_clean.name.value_counts())
      df_airbnb_clean.head()
      
      # Output:
      # null rows dropped

      # Original df_airbnb:	(20025, 16)
      # Cleaned df_airbnb:	(17452, 15)
        
      # name
      # Amsterdam                                 31
      # Cosy apartment in Amsterdam West           6
      # Apartment in Amsterdam                     6
      # Spacious apartment in Amsterdam            6
      # Spacious apartment near Vondelpark         6
      #                                        ..
      # Wonderful apartment in Amsterdam           1
      # Very spacious appartment (85m2)            1
      # Lovely family house with patio             1
      # Private room in nice Amsterdam             1
      # Celebrate winter in heart of Amsterdam     1
      # Name: count, Length: 17116, dtype: int64
      
      df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
      df_airline_clean = missing_fill(df_airline.copy(), 'satisfaction')
      print(f'\nOriginal df_airline:\t{df_airline.shape}')
      print(f'Cleaned df_airline:\t{df_airline_clean.shape}\n')
      df_airline_clean.head()
      
      # Output:
      # null rows dropped

      # Original df_airline: (129880, 23)
      # Cleaned df_airline:	 (129487, 23)
      
      df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
      df_nba = basic_wrangling(df_nba)
      print('...basic wrangling done\n')
      df_nba_clean = missing_fill(df_nba.copy(), 'Salary', mar='impute')
      print(df_nba.shape)
      print(df_nba_clean.shape)
      df_nba_clean.head()
      
      # Output:
      # Too many unique values (467 out of 467, 1.0) for Unnamed: 0
      # Too many unique values (467 out of 467, 1.0) for Player Name
      # Too many unique values (467 out of 467, 1.0) for Player-additional
      # ...basic wrangling done
        
      # 3P% filled with median (0.3445)
      # 2P% filled with median (0.537)
      # FT% filled with median (0.769)
      # (467, 29)
      # (466, 29)
      

In summary, we drop only those features and rows missing too much data--much like the prior function we created. After that, however, we test for bias before choosing an imputation technique. If the label is numeric, the missing and non-missing groups are compared with a t-test; if the label is categorical, they are compared with a z-test for proportions. If there is no evidence of bias, then we can predict the missing values with KNNImputer or IterativeImputer. While these are good techniques, there are even more advanced imputation methods we could have used. If there is evidence of bias, then we can choose whether to drop the affected rows entirely or fill them in with the median (or a "missing" category).
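
If you want to experiment with these imputers outside the wrapper function, here is a minimal standalone sketch on an invented, purely numeric frame (both imputers expect numeric input, which is why missing_fill encodes categorical columns with an OrdinalEncoder first):

      import numpy as np
      import pandas as pd
      from sklearn.impute import KNNImputer
      from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
      from sklearn.impute import IterativeImputer
      
      toy = pd.DataFrame({'x1': [1.0, 2.0, np.nan, 4.0, 5.0],
                          'x2': [2.0, np.nan, 6.0, 8.0, 10.0],
                          'x3': [5.0, 4.0, 3.0, 2.0, 1.0]})
      
      # KNN: each missing value becomes the average of that column in the most similar rows
      knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns)
      
      # Iterative: each column with missing values is modeled from the other columns
      iter_filled = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(toy), columns=toy.columns)
      
      print(knn_filled.round(2), iter_filled.round(2), sep='\n\n')
      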