9.6 Missing Data
Missing data is a common challenge in data analysis and machine learning projects. If not handled correctly, it can lead to biased conclusions, inaccurate predictions, and ineffective models.
Types of Missing Data
There are three key types of missing data:
- MCAR (Missing Completely at Random): Data is missing randomly with no pattern. This is the least problematic type; missingness is independent of the data.
- MAR (Missing at Random): Missingness depends on other observed variables but not on the missing values themselves. It can be corrected using statistical methods.
- MNAR (Missing Not at Random): Missingness depends on unobserved values or on the missing values themselves. It can be (but is not necessarily) the most problematic type; it requires domain knowledge to handle.
It can be difficult to understand these types of missing data without some examples. Let's cover some examples of why data might go missing.
Causes of Missing Data
Missing data can arise for various reasons, which often fall into the following categories:
Human Error: Data entry mistakes, forgetting to record values. Type: MCAR. Examples:
- A survey respondent skips a question accidentally.
- A data entry operator mistypes a value or leaves a field blank.
- A researcher misplaces handwritten data before digitizing it.
Data Processing Issues: Errors in merging, exporting, or loading datasets. Type: MCAR. Examples:
- A CSV file is corrupted, resulting in missing values.
- A database migration fails, causing some records to be lost.
- A bug in an ETL (Extract, Transform, Load) pipeline causes missing entries.
System Constraints: Sensors failing to capture data, software crashes. Type: MCAR. Examples:
- A temperature sensor stops working intermittently.
- A power outage interrupts data logging.
- A website’s tracking system fails to record user activity due to server downtime.
Respondent Unwillingness: Participants deliberately leave fields blank based on personal preference. Type: MNAR. Examples:
- People avoid disclosing income in a survey.
- Patients in a medical study refuse to answer personal questions.
- Employees skip job satisfaction questions for fear of anonymity issues.
Non-Applicable Data: Some values are not relevant to specific cases. Type: MNAR. Examples:
- “Marital Status” is missing for respondents who are children.
- “Previous Employer” is missing for someone’s first job.
- “Car Ownership” is missing for people who do not drive.
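Before moving on to handling techniques, it can help to see the three mechanisms in code. The following is a minimal simulated sketch (the column names, values, and probabilities are all invented for illustration) that deletes values from a toy DataFrame under MCAR, MAR, and MNAR assumptions:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Fabricated illustration data (columns and values are hypothetical)
df_sim = pd.DataFrame({
    'age': rng.integers(18, 70, n),
    'income': rng.normal(60000, 15000, n).round(2),
    'satisfaction': rng.integers(1, 11, n).astype(float),
})

# MCAR: every satisfaction score has the same 10% chance of being lost,
# regardless of any value in the data (e.g., random data-entry slips)
df_mcar = df_sim.copy()
df_mcar.loc[rng.random(n) < 0.10, 'satisfaction'] = np.nan

# MAR: whether income is missing depends on another *observed* column (age),
# not on the income value itself (younger respondents skip the question more often)
df_mar = df_sim.copy()
df_mar.loc[(df_sim['age'] < 30) & (rng.random(n) < 0.30), 'income'] = np.nan

# MNAR: whether income is missing depends on the (unobserved) income itself
# (high earners decline to disclose it)
df_mnar = df_sim.copy()
df_mnar.loc[(df_sim['income'] > 80000) & (rng.random(n) < 0.50), 'income'] = np.nan

for name, frame in [('MCAR', df_mcar), ('MAR', df_mar), ('MNAR', df_mnar)]:
    print(name, 'missing counts:\n', frame.isna().sum(), '\n')
Notice that only the MCAR deletions ignore the data entirely; the MAR deletions depend on an observed column (age), while the MNAR deletions depend on the very values that end up missing.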
Methods to Handle Missing Data
These methods fall into a few broad categories:
- Removing Data
- Simple Imputation (Filling Missing Data)
- Advanced Imputation Techniques
- Machine Learning Imputation: Instead of filling missing data manually, we can train models to predict missing values.
The table below compares the specific methods:
Method | Description | Advantages | Disadvantages |
---|---|---|---|
Drop Missing Rows | Remove records with missing values. | Simple and easy to implement. | Risk of losing valuable data, especially if many rows have missing values. |
Drop Missing Columns | Remove entire columns with high missing percentages. | Useful when a column is mostly empty. | Could remove important features. |
Fill with a Fixed Value | Replace missing values with a constant (e.g., 0, “Unknown”). | Simple and does not require computation. | May introduce bias if the value is not representative. |
Fill with Mean/Median/Mode | Replace missing values with the column’s mean, median, or mode. | Works well when data is normally distributed. | Can distort skewed distributions. |
Forward/Backward Fill | Use previous or next value in the column. | Useful for time-series data. | Can propagate errors if the missing values are not sequential. |
K-Nearest Neighbors (KNN) Imputation | Uses the nearest neighbors to predict missing values. | Maintains relationships between variables. | Computationally expensive for large datasets. |
Regression Imputation | Predicts missing values using regression models. | More accurate for MAR data. | Assumes a linear relationship that may not always hold. |
Multiple Imputation | Uses probabilistic modeling to generate multiple imputed datasets and averages results. | Reduces bias and increases robustness. | Requires significant computational resources. |
Deep Learning-based Imputation | Uses neural networks to estimate missing values. | Can handle complex patterns in data. | Requires a large dataset and high computational power. |
Random Forest Imputation | Uses multiple decision trees to predict missing values. | Handles non-linear relationships well. | Computationally expensive. |
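To make a few of the simpler rows of this table concrete, here is a short sketch on an invented toy DataFrame. The column names and values are made up, and KNNImputer is included as one example of the more advanced, model-based options:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy DataFrame invented for illustration
df_toy = pd.DataFrame({
    'temperature': [21.0, np.nan, 23.5, 22.1, np.nan, 24.0],
    'humidity':    [0.40, 0.42, np.nan, 0.39, 0.41, 0.45],
})

# Fill with a fixed value
fixed = df_toy.fillna(0)

# Fill with the column mean (median/mode work the same way)
mean_filled = df_toy.fillna(df_toy.mean(numeric_only=True))

# Forward fill, then backward fill to catch a leading NaN (time-series style)
ffilled = df_toy.ffill().bfill()

# KNN imputation: each missing value is estimated from the k most similar rows
knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df_toy), columns=df_toy.columns)

for name, result in [('fixed', fixed), ('mean', mean_filled), ('ffill/bfill', ffilled), ('knn', knn)]:
    print(f'--- {name} ---')
    print(result, '\n')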
As you learn about all of these methods of handling missing data, you may be wondering why we don't just use the most advanced and accurate method, such as deep learning-based imputation, every time. Well, everything comes with a cost. The techniques that are most accurate are also typically the most costly in terms of computational power required. If we are building a machine learning pipeline that needs to execute within an hour, then it may not be possible to use the most advanced methods. You have to make tradeoff decisions, as depicted in the image below:
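To put a rough number on that tradeoff, here is a small sketch that times a cheap imputer against a more expensive one on synthetic data. The dataset size is arbitrary and the exact timings depend on your hardware, but the gap illustrates why pipeline time budgets matter:
import time
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Synthetic numeric data with roughly 10% of values knocked out at random
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 20))
X[rng.random(X.shape) < 0.10] = np.nan

# Time a simple statistical fill against a neighbor-based model fill
for name, imputer in [('SimpleImputer (median)', SimpleImputer(strategy='median')),
                      ('KNNImputer', KNNImputer())]:
    start = time.time()
    imputer.fit_transform(X)
    print(f'{name}: {time.time() - start:.2f} seconds')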
Automated Missing Data Handling
The next question is how to automate this process as much as possible. Every situation is unique and requires serious thought to address the various causes of missing data correctly, but we can automate some of it. The video below will guide you through the fairly complex function that follows it:
Let's start by pulling in the datasets that we are going to experiment with (unless you already have them from a prior section):
# Mount Google Drive if needed and bring in some sample data
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
# Datasets with numeric label for testing
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
# Dataset with categorical label for testing
df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
Next, let's create a function that will handle dropping data so that the maximum possible amount is retained. Follow along with the video above if you don't understand how part of the function works.
def missing_drop(df, label="", features=[], messages=True, row_threshold=.9, col_threshold=.5):
    import pandas as pd

    start_count = df.count().sum()  # Store the initial count of non-null values

    # Drop columns with missing values beyond the specified column threshold
    df.dropna(axis=1, thresh=round(col_threshold * df.shape[0]), inplace=True)
    # Drop rows that have fewer non-null values than the row threshold allows
    df.dropna(axis=0, thresh=round(row_threshold * df.shape[1]), inplace=True)
    # If a label is specified, ensure it has no missing values
    if label != "":
        df.dropna(axis=0, subset=[label], inplace=True)

    # Function to generate a summary of missing data for each column
    def generate_missing_table():
        df_results = pd.DataFrame(columns=['Missing', 'column', 'rows'])
        for feat in df:
            missing = df[feat].isna().sum()  # Count missing values in column
            if missing > 0:
                memory_col = df.drop(columns=[feat]).count().sum()  # Count non-null values if this column is dropped
                memory_rows = df.dropna(subset=[feat]).count().sum()  # Count non-null values if this column is kept
                df_results.loc[feat] = [missing, memory_col, memory_rows]  # Store results
        return df_results

    df_results = generate_missing_table()  # Generate initial missing data table

    # Iteratively remove the column or row that preserves the most non-null data
    while df_results.shape[0] > 0:
        max = df_results[['column', 'rows']].max(axis=1).iloc[0]  # Find the max value in columns or rows
        max_axis = df_results.columns[df_results.isin([max]).any()][0]  # Determine whether to drop column or row
        if messages: print(max, max_axis)
        df_results.sort_values(by=[max_axis], ascending=False, inplace=True)  # Sort missing data table by max_axis
        if messages: print('\n', df_results)
        # Drop the most impactful missing data (either row or column)
        if max_axis == 'rows':
            df.dropna(axis=0, subset=[df_results.index[0]], inplace=True)  # Drop the rows where this column is missing
        else:
            df.drop(columns=[df_results.index[0]], inplace=True)  # Drop the column with the highest missing impact
        df_results = generate_missing_table()  # Recalculate missing data table after dropping

    # Print the percentage of non-null values retained
    if messages:
        print(f'{round(df.count().sum() / start_count * 100, 2)}% ({df.count().sum()}) / ({start_count}) of non-null cells were kept.')
    return df
So what exactly is going on in this function? Basically:
- Initial Setup & Count
- It calculates the total number of non-null cells (start_count).
- Drops Columns with Too Many Missing Values
- If a column has more than (1 - col_threshold)% missing values, it is removed.
- Drops Rows with Too Many Missing Values
- If a row has more than (1 - row_threshold)% missing values, it is removed (see the thresh sketch after this list for how these thresholds map onto pandas dropna).
- Ensures the Label Column (if specified) Has No Missing Values
- If label is provided, all rows where the label column is missing are removed.
- Iteratively Drops the Most Problematic Columns/Rows
- The function creates a missing data report (generate_missing_table), listing how many values are missing per column.
- While there are still missing values:
- For each remaining column with missing values, it compares how much non-null data would be kept by dropping that column versus dropping the rows where it is missing.
- It applies whichever action (drop the column, or drop those rows) preserves the most data.
- Updates the missing data report and repeats the process.
- Final Report & Return
- Prints the percentage of non-null data retained.
- Returns the cleaned DataFrame.
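If the threshold steps above feel abstract, here is a minimal sketch of how the two dropna(thresh=...) calls behave, using an invented 4-row, 4-column DataFrame and the function's default thresholds:
import numpy as np
import pandas as pd

# Invented toy DataFrame: 4 rows, 4 columns
df_demo = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, np.nan, 3, 4],
    'c': [np.nan, np.nan, np.nan, 4],   # 75% missing
    'd': [1, np.nan, np.nan, np.nan],   # also 75% missing
})

row_threshold, col_threshold = 0.9, 0.5

# Keep only columns with at least col_threshold (50%) non-null values:
# thresh is the minimum number of non-null cells a column must have to survive.
df_demo = df_demo.dropna(axis=1, thresh=round(col_threshold * df_demo.shape[0]))
print(df_demo.columns.tolist())   # 'c' and 'd' are gone

# Keep only rows with at least row_threshold (90%) non-null values.
df_demo = df_demo.dropna(axis=0, thresh=round(row_threshold * df_demo.shape[1]))
print(df_demo)
The iterative column-versus-rows comparison then only has to deal with whatever missing values survive these two blanket drops.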
Function Definition
Parameter | Description |
---|---|
df (required) | The Pandas DataFrame that needs missing value handling. |
Parameter | Default | Description |
---|---|---|
label | "" (empty string) | If provided, ensures that this label (target) column has no missing values by dropping rows where it is missing. If "", this step is skipped. |
features | [] (empty list) | A list of specific columns to check for missing values. If empty, all columns are considered. |
messages | True | If True, prints progress updates, including the percentage of non-null values retained. Set to False to suppress output. |
row_threshold | 0.9 | Minimum proportion of non-missing values required to keep a row. Rows missing more than (1 - row_threshold)% are removed. (e.g., 0.9 → drop rows missing more than 10% of values). |
col_threshold | 0.5 | Minimum proportion of non-missing values required to keep a column. Columns missing more than (1 - col_threshold)% are removed. (e.g., 0.5 → drop columns missing more than 50% of values). |
Next, let's try it out on our datasets:
missing_drop(df_airbnb.copy()).head()
# Output:
# 263796 rows
# Missing column rows
# name 24 246536 263796
# host_name 144 246656 262116
# 261780 rows
# Missing column rows
# host_name 144 246344 261780
# 88.63% (261780) / (295375) of non-null cells were kept.
# See the DataFrame head() in your own notebook
Let's keep working through the other datasets:
missing_drop(df_nba.copy()).head()
# Output:
# 14505 rows
# Missing column rows
# 2P% 3 14412 14784
# 3P% 12 14421 14505
# FT% 22 14431 14197
# 14414 rows
# Missing column rows
# 3P% 12 14333 14414
# FT% 20 14341 14166
# 13981 column
# Missing column rows
# FT% 18 13981 13856
# 93.82% (13981) / (14902) of non-null cells were kept.
# See the DataFrame head() in your own notebook
missing_drop(df_airline.copy()).head()
# Output:
# 2978201 rows
# Missing column rows
# Arrival Delay in Minutes 393 2857360 2978201
# 99.71% (2978201) / (2986847) of non-null cells were kept.
# See the DataFrame head() in your own notebook
As you've hopefully noticed from either the video or from working through the code on your own, this missing_drop function does not drop every column or row with missing values. It only drops the biggest offenders--those where the missing percentage is over a particular threshold. For rows and columns with most of the data available, we will want to impute the remaining missing values. This is why we also have the next function below:
def missing_fill(df, label, features=[], row_threshold=.9, col_threshold=.5, acceptable=0.1, mar='drop', force_impute=False, large_dataset=200000, messages=True):
    import pandas as pd, numpy as np
    from scipy import stats
    from statsmodels.stats.proportion import proportions_ztest
    pd.set_option('display.float_format', lambda x: '%.4f' % x)  # Display float values with 4 decimal places
    from IPython.display import display

    # Ensure the provided label column exists in the DataFrame
    if label not in df.columns:
        print(f'The label provided ({label}) does not exist in the DataFrame provided')
        return df

    start_count = df.count().sum()  # Store the initial count of non-null values

    # Drop columns with missing data above the threshold
    df.dropna(axis=1, thresh=round(col_threshold * df.shape[0]), inplace=True)
    # Drop rows that have fewer non-null values than row_threshold allows
    df.dropna(axis=0, thresh=round(row_threshold * df.shape[1]), inplace=True)
    if label != "": df.dropna(axis=0, subset=[label], inplace=True)  # Ensure label column has no missing values

    # If no features are specified, consider all columns as features
    if len(features) == 0: features = df.columns

    # If the label column is numeric, perform a t-test for missing vs non-missing groups
    if pd.api.types.is_numeric_dtype(df[label]):
        df_results = pd.DataFrame(columns=['total missing', 'null x̄', 'non-null x̄', 'null s', 'non-null s', 't', 'p'])
        for feat in features:
            missing = df[feat].isna().sum()  # Count missing values
            if missing > 0:
                null = df[df[feat].isna()]  # Subset where feature is missing
                nonnull = df[~df[feat].isna()]  # Subset where feature is present
                t, p = stats.ttest_ind(null[label], nonnull[label])  # Perform t-test to check for MAR vs MCAR
                df_results.loc[feat] = [round(missing), round(null[label].mean(), 6), round(nonnull[label].mean(), 6),
                                        round(null[label].std(), 6), round(nonnull[label].std(), 6), t, p]
    else:
        # If label is categorical, use proportions_ztest to check for MAR vs MCAR
        df_results = pd.DataFrame(columns=['total missing', 'null p̂', 'non-null p̂', 'Z', 'p'])
        for feat in features:
            missing = df[feat].isna().sum()
            if missing > 0:
                null = df[df[feat].isna()]
                nonnull = df[~df[feat].isna()]
                for group in null[label].unique():
                    p1_num = null[null[label]==group].shape[0]  # Count of group in missing subset
                    p1_den = null[null[label]!=group].shape[0]  # Count of others in missing subset
                    p2_num = nonnull[nonnull[label]==group].shape[0]  # Count of group in non-missing subset
                    p2_den = nonnull[nonnull[label]!=group].shape[0]  # Count of others in non-missing subset
                    if p1_num < p1_den:  # Avoid division by zero
                        numerators = np.array([p1_num, p2_num])
                        denominators = np.array([p1_den, p2_den])
                        z, p = proportions_ztest(numerators, denominators)  # Conduct z-test
                        df_results.loc[f'{feat}_{group}'] = [round(missing), round(p1_num/p1_den, 6), round(p2_num/p2_den, 6), z, p]

    # Display the missing data analysis results
    if messages: display(df_results)

    # Determine if data is MAR (Missing at Random) or MCAR (Missing Completely at Random)
    if df_results[df_results['p'] < 0.05].shape[0] / df_results.shape[0] > acceptable and not force_impute:
        if mar == 'drop':
            df.dropna(inplace=True)  # Drop all rows containing missing values
            if messages: print('null rows dropped')
        else:  # Last resort: fill missing values with the median
            for feat in df_results.index:
                if pd.api.types.is_numeric_dtype(df[feat]):
                    df[feat].fillna(df[feat].median(), inplace=True)
                    if messages: print(f'{feat} filled with median ({df[feat].median()})')
                else:
                    df[feat].fillna('missing', inplace=True)  # Fill categorical missing values with "missing"
                    if messages: print(f'{feat} filled with "missing"')
    else:
        # If missing data is MCAR, perform imputation using either KNN or IterativeImputer
        from sklearn.preprocessing import OrdinalEncoder
        oe = OrdinalEncoder()
        df_encoded = oe.fit_transform(df)  # Convert categorical values to numeric codes
        if df.count().sum() > large_dataset:
            from sklearn.impute import KNNImputer
            imp = KNNImputer()  # Use K-Nearest Neighbors Imputation for large datasets
            df_imputed = imp.fit_transform(df_encoded)
            df_recoded = oe.inverse_transform(df_imputed)
            df = pd.DataFrame(df_recoded, columns=df.columns, index=df.index)
        else:
            from sklearn.experimental import enable_iterative_imputer
            from sklearn.impute import IterativeImputer
            imp = IterativeImputer()  # Use Iterative Imputer for smaller datasets
            df_imputed = imp.fit_transform(df_encoded)  # Impute on the encoded (numeric) data
            df_recoded = oe.inverse_transform(df_imputed)  # Map encoded values back to original categories
            df = pd.DataFrame(df_recoded, columns=df.columns, index=df.index)
        if messages: print('null values imputed')
    return df
How does this function work?
- Initial Setup & Dropping Rows/Columns with Too Much Missing Data
- Removes columns that have more than (1 - col_threshold)% missing values.
- Removes rows that have more than (1 - row_threshold)% missing values.
- Ensures the label column has no missing values.
- Statistical Testing for Missing Data Bias
- If the label column is numeric, it performs a t-test to check whether the rows missing each feature have a different label distribution than the rows that don't (a standalone sketch of this test appears after this list).
- If the label column is categorical, it performs a proportion z-test to compare missing vs. non-missing groups.
- Determining if Missing Data is MAR (Missing At Random) or MCAR (Missing Completely At Random)
- If too many features show statistically significant differences (p < 0.05 in tests), data is assumed MAR.
- If mar='drop', rows with missing values are removed.
- Otherwise, missing values are imputed (filled in) with median/missing categories.
- Imputation of Missing Data (If Needed)
- If data is not MAR, missing values are imputed using machine learning:
- Large datasets (more non-null cells than large_dataset, 200,000 by default) → Uses K-Nearest Neighbors (KNN) Imputation.
- Smaller datasets → Uses Iterative Imputation (Multivariate Imputer).
- Final Report & Return
- Displays statistical tests on missing data.
- Prints messages if requested.
- Returns the cleaned DataFrame.
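To see the bias test in isolation, here is a minimal standalone sketch of the numeric-label case. The data is fabricated so that the feature goes missing more often when the label is high, which is exactly the kind of pattern the t-test inside missing_fill is looking for:
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 2000

# Invented data: 'feature' goes missing more often when 'label' is high,
# so its missingness is related to the label (MAR-style bias), not completely random
label = rng.normal(100, 20, n)
feature = rng.normal(50, 10, n)
feature[(label > 110) & (rng.random(n) < 0.4)] = np.nan

df_bias = pd.DataFrame({'label': label, 'feature': feature})

# Same comparison missing_fill makes: label values where the feature is null
# vs. label values where it is not
null = df_bias[df_bias['feature'].isna()]
nonnull = df_bias[~df_bias['feature'].isna()]
t, p = stats.ttest_ind(null['label'], nonnull['label'])

print(f"missing: {null.shape[0]}, t = {t:.2f}, p = {p:.4g}")
# A small p-value (< 0.05) means rows with the feature missing have a noticeably
# different label mean, so the missingness is probably not MCAR.
With a pattern like this, the p-value should come out very small, so missing_fill would treat the column as biased and apply the mar strategy rather than blindly imputing.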
Function Definition
Parameter | Description |
---|---|
df (required) | The Pandas DataFrame that needs missing value handling. |
label (required) | The name of the column to be used as the target (dependent) variable. This is important for performing statistical tests on missing data. |
Parameter | Default | Description |
---|---|---|
features | [] (empty list) | List of specific columns to check for missing values. If empty, all columns are considered. |
row_threshold | 0.9 | Minimum proportion of non-missing values required to keep a row. Rows missing more than (1 - row_threshold)% are removed. (e.g., 0.9 → drop rows missing more than 10% of values). |
col_threshold | 0.5 | Minimum proportion of non-missing values required to keep a column. Columns missing more than (1 - col_threshold)% are removed. (e.g., 0.5 → drop columns missing more than 50% of values). |
acceptable | 0.1 | Proportion of features that can have statistically significant missingness (based on p < 0.05). If more than this percentage of features are MAR (Missing At Random), missing values are not imputed unless force_impute=True. |
mar | "drop" | If missing values are MAR, decides how to handle them: - "drop" → Remove rows with missing values. - "impute" → Fill missing values with the median (for numeric) or "missing" (for categorical). |
force_impute | False | If True, forces imputation even if MAR analysis suggests dropping rows. Useful when you want to retain all data points. |
large_dataset | 200000 | Defines the large-dataset threshold. If df has more than this many non-null cells, KNN Imputer is used instead of Iterative Imputer for faster processing.
messages | True | If True, prints progress updates and displays the missing-data analysis table. Set to False to suppress output.
As you can tell, this is a complex function. Let's test it out:
df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
df_airbnb = bin_categories(df_airbnb)
df_airbnb_clean = missing_fill(df_airbnb.copy(), 'average_review', mar='drop')
print(f'\nOriginal df_airbnb:\t{df_airbnb.shape}')
print(f'Cleaned df_airbnb:\t{df_airbnb_clean.shape}\n')
print(df_airbnb_clean.name.value_counts())
df_airbnb_clean.head()
# Output:
# null rows dropped
# Original df_airbnb: (20025, 16)
# Cleaned df_airbnb: (17452, 15)
# name
# Amsterdam 31
# Cosy apartment in Amsterdam West 6
# Apartment in Amsterdam 6
# Spacious apartment in Amsterdam 6
# Spacious apartment near Vondelpark 6
# ..
# Wonderful apartment in Amsterdam 1
# Very spacious appartment (85m2) 1
# Lovely family house with patio 1
# Private room in nice Amsterdam 1
# Celebrate winter in heart of Amsterdam 1
# Name: count, Length: 17116, dtype: int64
df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
df_airline_clean = missing_fill(df_airline.copy(), 'satisfaction')
print(f'\nOriginal df_airline:\t{df_airline.shape}')
print(f'Cleaned df_airline:\t{df_airline_clean.shape}\n')
df_airline_clean.head()
# Output:
# null rows dropped
# Original df_airline: (129880, 23)
# Cleaned df_airline: (129487, 23)
df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
df_nba = basic_wrangling(df_nba)
print('...basic wrangling done\n')
df_nba_clean = missing_fill(df_nba.copy(), 'Salary', mar='impute')
print(df_nba.shape)
print(df_nba_clean.shape)
df_nba_clean.head()
# Output:
# Too many unique values (467 out of 467, 1.0) for Unnamed: 0
# Too many unique values (467 out of 467, 1.0) for Player Name
# Too many unique values (467 out of 467, 1.0) for Player-additional
# ...basic wrangling done
# 3P% filled with median (0.3445)
# 2P% filled with median (0.537)
# FT% filled with median (0.769)
# (467, 29)
# (466, 29)
In summary, we drop only those features and rows missing too much data--much like the prior function we created. However, after that, we test for bias before determining the best technique for imputation. When the label is numeric, missingness in each feature is tested with a t-test; when the label is categorical, a z-test for proportions is used. If there is no evidence of bias, then we can predict the missing values with KNNImputer or IterativeImputer. While these are good techniques, there are even more advanced imputations we could have used. If there is bias, then we can choose whether to drop the rows entirely or fill them in with a median value (or a "missing" category for non-numeric features).
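As one example of those more advanced options (not part of the missing_fill function above, just an illustrative sketch on synthetic data), scikit-learn's IterativeImputer can be given a random forest as its estimator, which approximates the Random Forest Imputation row from the earlier table:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic numeric data with some values removed at random (illustration only)
rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=list('abcde'))
X = X.mask(rng.random(X.shape) < 0.1)

# MissForest-style imputation: each feature with missing values is modeled
# from the others with a random forest, iterating until estimates stabilize
imp = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0), max_iter=5)
X_imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns)
print(X_imputed.isna().sum())
The same pattern works with other estimators, which is one way to trade additional compute for potentially better imputations.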