9.1 Introduction

In the previous chapter, we learned how to automate simple univariate statistics and visualizations. Exploratory data analysis (EDA) typically also includes basic bivariate analyses, but before diving into those, I prefer to skip ahead to the Data Preparation phase. Univariate analyses reveal data issues that need cleaning, and addressing them before completing the Data Understanding phase makes the rest of the workflow smoother. So let’s pause EDA for a moment and explore how to automate some of the most common data preparation tasks.

There are many possible data cleaning steps, but some of the key ones include:

Basic wrangling

Removing unique identifiers (ID columns)
Eliminating empty or blank columns
Converting dates into usable formats
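As a rough preview, the three wrangling steps above might be sketched in pandas as follows. The function name `basic_wrangling`, its parameters `date_cols` and `id_ratio`, and the rule "a non-float column whose values are all distinct looks like an ID" are assumptions made for illustration, not the functions we will build later in the chapter.

```python
import pandas as pd

def basic_wrangling(df, date_cols=(), id_ratio=1.0):
    """Drop empty and ID-like columns, then parse the named date columns.

    Hypothetical helper: the names and the id_ratio rule are illustrative.
    """
    out = df.copy()
    # Drop columns in which every value is missing.
    out = out.dropna(axis=1, how="all")
    if len(out) == 0:
        return out
    for col in list(out.columns):
        s = out[col]
        if col in date_cols:
            # Entries that cannot be parsed become NaT instead of raising.
            out[col] = pd.to_datetime(s, errors="coerce")
            continue
        # Float and datetime columns are often all-distinct without being IDs.
        if pd.api.types.is_float_dtype(s) or pd.api.types.is_datetime64_any_dtype(s):
            continue
        # Integer or text columns whose values are (almost) all distinct
        # are treated as unique identifiers and removed.
        if s.nunique() / len(out) >= id_ratio:
            out = out.drop(columns=col)
    return out
```

Passing the date columns explicitly is a deliberate choice in this sketch: guessing which text columns hold dates is error-prone, so the caller names them.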

Handling missing data

Dropping rows and columns with excessive missing values
Imputing missing data

Replacing with mean, median, or mode
Bivariate imputation: using the most highly correlated column to predict missing values
Multivariate imputation: using all available features to estimate missing values
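To make the simplest of these options concrete, here is a minimal sketch that drops sparse columns and rows and then fills what is left with the median (or mean) for numeric columns and the mode for everything else. The function name `handle_missing` and the 50% cutoffs are assumptions chosen for illustration; bivariate and multivariate imputation are not shown.

```python
import pandas as pd

def handle_missing(df, max_col_missing=0.5, max_row_missing=0.5, numeric_fill="median"):
    """Drop sparse columns/rows, then impute with the median/mean or the mode.

    Hypothetical helper: the thresholds are illustrative defaults.
    """
    out = df.copy()
    # Keep columns whose share of missing values is within the cutoff.
    out = out.loc[:, out.isna().mean() <= max_col_missing]
    # Then keep rows whose share of missing values is within the cutoff.
    out = out.loc[out.isna().mean(axis=1) <= max_row_missing].copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].mean() if numeric_fill == "mean" else out[col].median()
        else:
            modes = out[col].mode()
            if modes.empty:
                continue  # nothing observed to impute from
            fill = modes.iloc[0]  # ties are broken by taking the first mode
        out[col] = out[col].fillna(fill)
    return out
```

Columns are filtered before rows in this sketch so that a nearly empty column does not cause otherwise complete rows to be dropped.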

Handling numeric outliers

Identifying outliers based on defined rules

Univariate methods: Empirical Rule or Tukey’s Boxplot
Multivariate methods: Clustering

Addressing identified outliers

Deleting the entire row
Removing only the outlier value, and then either:

Replacing it with a theoretical minimum or maximum value
Imputing it with one of the missing data techniques listed above
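The two univariate rules can be sketched in a few lines. This example assumes the common conventions of three standard deviations for the Empirical Rule and 1.5 times the interquartile range for Tukey's boxplot rule, and it treats the resulting fences as the replacement minimum and maximum. The function names `outlier_bounds` and `clip_outliers` are hypothetical.

```python
import pandas as pd

def outlier_bounds(s, method="tukey"):
    """Return (low, high) fences for a numeric Series.

    Assumed conventions: 1.5 * IQR for Tukey, 3 standard deviations
    for the Empirical Rule.
    """
    if method == "tukey":
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mean, sd = s.mean(), s.std()
    return mean - 3 * sd, mean + 3 * sd

def clip_outliers(s, method="tukey"):
    """Replace values outside the fences with the nearest fence."""
    low, high = outlier_bounds(s, method)
    return s.clip(lower=low, upper=high)

values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
low, high = outlier_bounds(values, "tukey")
is_outlier = (values < low) | (values > high)  # flags only the 95.0
clipped = clip_outliers(values, "tukey")       # 95.0 becomes the upper fence
```

In this small sample the Empirical Rule does not flag 95.0, because the extreme value itself inflates the standard deviation. That is one reason the boxplot rule is often preferred for skewed or small data.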

Handling categorical outliers

Identifying infrequent categorical groups (e.g., applying the 5% rule)
Addressing rare categories

Relabeling infrequent categories as "Other" or merging with similar groups
Dropping rows containing infrequent categories
Dropping entire categorical columns if they contain too many infrequent groups
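As an illustration of the relabeling option, the sketch below lumps every group that makes up less than 5% of the rows into a single "Other" label. The function name `lump_rare_categories`, the strict "less than" comparison, and the assumption that the column holds plain text labels (not a pandas categorical dtype) are mine.

```python
import pandas as pd

def lump_rare_categories(s, min_share=0.05, other_label="Other"):
    """Relabel groups whose share of rows is below min_share.

    Hypothetical helper; assumes a text (object/string) Series.
    Missing values are left untouched.
    """
    shares = s.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; replace the rest with other_label.
    return s.where(~s.isin(rare), other_label)
```

Merging with a similar group instead of "Other" requires domain knowledge about which groups belong together, so it is harder to automate than this threshold rule.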

Feature engineering

Creating new features by combining existing ones
Modifying existing features using mathematical transformations
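Both kinds of feature engineering are one-liners in pandas. The column names and values below are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the columns are invented for this example.
df = pd.DataFrame({
    "income": [40000.0, 85000.0, 120000.0],
    "household_size": [1, 2, 4],
})

# New feature created by combining two existing ones.
df["income_per_person"] = df["income"] / df["household_size"]

# Existing feature modified with a mathematical transformation;
# log1p computes log(1 + x), which is defined at zero.
df["log_income"] = np.log1p(df["income"])
```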

While we cannot cover every possible data preparation step in this chapter, we will focus on automating several of the most essential and commonly used techniques.
