Introduction

In the previous chapter, we learned how to automate simple univariate statistics and visualizations. Exploratory data analysis (EDA) typically also includes basic bivariate analyses. Before diving into those, however, I prefer to skip ahead to the Data Preparation phase. Univariate analyses often reveal data issues, such as missing values and outliers, that need to be cleaned, and addressing them before completing the Data Understanding phase makes for a smoother workflow. So let’s pause EDA for a moment and explore how to automate some of the most common data preparation tasks.

There are many possible data cleaning steps. Some of the key ones are:

  • Basic wrangling
    • Removing unique identifiers (ID columns)
    • Eliminating empty or blank columns
    • Converting dates into usable formats
  • Handling missing data
    • Dropping rows and columns with excessive missing values
    • Imputing missing data
      • Replacing with mean, median, or mode
      • Bivariate imputation: using the most highly correlated column to predict missing values
      • Multivariate imputation: using all available features to estimate missing values
  • Handling numeric outliers
    • Identifying outliers based on defined rules
      • Univariate methods: Empirical Rule or Tukey’s Boxplot
      • Multivariate methods: Clustering
    • Addressing identified outliers
      • Deleting the entire row
      • Removing only the outlier value
        • Replacing with a theoretical minimum or maximum value
        • Imputing a replacement value using the missing data techniques listed above
  • Handling categorical outliers
    • Identifying infrequent categorical groups (e.g., applying the 5% rule)
    • Addressing rare categories
      • Relabeling infrequent categories as "Other" or merging with similar groups
      • Dropping rows containing infrequent categories
      • Dropping entire categorical columns if they contain too many infrequent groups
  • Feature engineering
    • Creating new features by combining existing ones
    • Modifying existing features using mathematical transformations

While we cannot cover every possible data preparation step in this chapter, we will focus on automating several of the most essential and commonly used techniques.
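
As a preview, three of the steps listed above (simple imputation, Tukey's boxplot rule, and relabeling rare categories) can be sketched with pandas. This is a minimal illustration rather than the implementation developed later: the function names, the default multiplier of 1.5, and the default 5% threshold are assumptions chosen for the example, and the categorical helper assumes a plain string (object) column rather than a pandas categorical dtype.

```python
import numpy as np
import pandas as pd


def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their median and other columns with their mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()  # mode() ignores missing values by default
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


def tukey_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def relabel_rare(s: pd.Series, min_share: float = 0.05, other: str = "Other") -> pd.Series:
    """Replace categories whose share of non-missing rows is below min_share."""
    shares = s.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # where() keeps values for which the condition is True and replaces the rest
    return s.where(~s.isin(rare), other)


if __name__ == "__main__":
    # Hypothetical toy data, used only to exercise the helpers
    df = pd.DataFrame({
        "income": [10.0, 11.0, np.nan, 13.0, 14.0, 100.0],
        "region": ["north", "north", None, "south", "north", "south"],
    })
    clean = impute_simple(df)
    print(clean)
    print(tukey_outlier_mask(clean["income"]))
    print(relabel_rare(clean["region"], min_share=0.4))
```

Keeping each step in its own small function lets the steps be applied in whatever order a dataset calls for, and makes the thresholds easy to change per column.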