Data Wrangling

Data wrangling, also known as data munging, is the process of cleaning, transforming, and structuring raw data into a usable format for analysis. This process involves handling missing values (which we will cover in a separate section), removing duplicates, standardizing formats, correcting errors, and merging datasets to make the data more accessible and meaningful.
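A few of these operations take only a line or two in pandas. Here is a minimal sketch using a made-up DataFrame (the column names and values are invented for illustration): it standardizes inconsistent text formatting, then removes the duplicate rows that formatting had hidden.

```python
import pandas as pd

# A tiny made-up dataset with duplicates and inconsistent formatting
df = pd.DataFrame({
    'city':  ['Boston', 'boston ', 'Chicago', 'Chicago'],
    'sales': [100, 100, 250, 250],
})

# Standardize formats: trim whitespace and normalize capitalization
df['city'] = df['city'].str.strip().str.title()

# Remove the duplicates the inconsistent formatting was hiding
df = df.drop_duplicates().reset_index(drop=True)

print(df)  # two rows remain: Boston and Chicago
```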

Some of these tasks can be automated, but many require manual intervention. Here, we'll focus on a few that can be automated. Let's start by importing a few datasets, each with different data wrangling needs.

      # Load sample datasets
    
      import pandas as pd
        
      # Datasets with numeric labels for testing
      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
      df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
        
      # Dataset with categorical labels for testing
      df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
      

Creating the Function

There are many steps involved in data wrangling. However, to keep our scope manageable, we'll focus on these key steps:

  1. Remove empty columns, or those with 95% or more missing values
  2. Remove columns where all values are unique (e.g., primary keys), or where 95% or more of the values are unique
  3. Remove columns containing only a single repeated value
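Before automating these steps, it helps to see the raw statistics they rely on. The sketch below computes the per-column missing and unique percentages on a made-up DataFrame (the column names are hypothetical), which is exactly the information each of the three rules needs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':        range(100),                        # all unique, like a primary key
    'constant':  ['same'] * 100,                    # a single repeated value
    'mostly_na': [np.nan] * 97 + [1.0, 2.0, 3.0],   # 97% missing
})

missing_pct = df.isna().mean()        # fraction of missing values per column
unique_pct = df.nunique() / len(df)   # fraction of unique values per column

print(missing_pct)  # mostly_na stands out at 0.97
print(unique_pct)   # id stands out at 1.00
```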

I've considered different approaches to teaching this process. In the previous chapter, we built the unistats function step by step. To keep this chapter concise, I’ll assume you remember the format we used for function development and won't review those steps again. Instead, I’ll provide a function that performs all the tasks listed above. Before looking at my solution, I encourage you to try writing your own function to achieve these objectives. Then, compare your approach with mine below:

      # Function for basic data wrangling
      # Required parameter: a Pandas DataFrame
      # Optional parameters:
      #   features = list of column names to process
      #   missing_threshold = percentage of missing values required to drop a column
      #   unique_threshold = percentage of unique values in a column before it's removed
      #   messages = boolean flag to print messages when columns are dropped
        
      def basic_wrangling(df, features=[], missing_threshold=0.95, unique_threshold=0.95, messages=True):
        import pandas as pd
        
        if len(features) == 0:  # len() works for both lists and a pandas Index; `not features` would raise an error on an Index
            features = df.columns
        
        for feat in features:
          if feat in df.columns:
            missing = df[feat].isna().sum()
            unique = df[feat].nunique()
            rows = df.shape[0]
        
            if missing / rows >= missing_threshold:
              if messages: print(f"Dropping {feat}: {missing} missing values out of {rows} ({round(missing/rows, 2)})")
              df.drop(columns=[feat], inplace=True)
            elif unique / rows >= unique_threshold:
              if df[feat].dtype in ['int64', 'object']:
                if messages: print(f"Dropping {feat}: {unique} unique values out of {rows} ({round(unique/rows, 2)})")
                df.drop(columns=[feat], inplace=True)
            elif unique == 1:
              if messages: print(f"Dropping {feat}: Contains only one unique value ({df[feat].unique()[0]})")
              df.drop(columns=[feat], inplace=True)
          else:
            if messages: print(f"Skipping \"{feat}\": Column not found in DataFrame")
        
        return df
      

Now, let's break down what this function does. The function takes several parameters. The first is a Pandas DataFrame, not a single column or Series. If you designed a similar function that processes only one column at a time, that approach is also valid—you would just need to loop through the DataFrame columns.
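For instance, a single-column version of the same decision logic might look like the sketch below. The name should_drop is hypothetical (it isn't part of the function above); the rules it applies are the same three:

```python
import pandas as pd

def should_drop(series, missing_threshold=0.95, unique_threshold=0.95):
    """Decide whether one column should be dropped, using the same three rules."""
    rows = len(series)
    if series.isna().sum() / rows >= missing_threshold:
        return True   # too many missing values
    if series.nunique() == 1:
        return True   # a single repeated value
    if (series.nunique() / rows >= unique_threshold
            and series.dtype in ['int64', 'object']):
        return True   # likely a primary key or free-text identifier
    return False

# Loop over the columns of a made-up DataFrame, keeping the survivors
df = pd.DataFrame({'id': range(50), 'group': ['a', 'b'] * 25})
kept = [col for col in df.columns if not should_drop(df[col])]
print(kept)  # → ['group'], since 'id' is 100% unique
```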

The optional features parameter allows users to specify a subset of columns for processing. While this isn't strictly necessary (since users could simply pass in a DataFrame with only the desired columns), it demonstrates how to add flexibility to function design.

Additional parameters include missing_threshold, unique_threshold, and messages. These have default values but can be customized as needed. You’ll see their role as we step through the function.

First, the function imports Pandas. Then, it checks whether a specific list of features has been provided. If not, it defaults to processing all columns in the DataFrame.

Next, the function iterates through each column. If a column name in the features list isn't found in the DataFrame, it is skipped. Otherwise, the function calculates the number of missing values, unique values, and total rows—key statistics for data wrangling.

The function then evaluates each column against the specified thresholds. If a column has too many missing values, too many unique values, or consists of only a single repeated value, it is removed. Note that the unique-value rule applies only to integer and object (text) columns: floating-point measurements naturally have many unique values, so they are spared. If the messages parameter is set to True, the function prints an explanation for each dropped column. This feedback is useful for reporting and debugging.

Finally, the function returns the cleaned DataFrame. This approach ensures that only meaningful features remain, improving the dataset’s quality for further analysis or modeling.

Calling the Function

Let's start by testing our function on the NBA salary dataset. This dataset contains several features that will be affected by our wrangling process, but to better evaluate the function, we’ll add two additional columns: one filled entirely with missing values and another containing only a single unique value. Then, we’ll fill in a few of the missing values to test whether the missing_threshold parameter functions correctly. Finally, we'll print the results:

      import numpy as np
        
      df_nba['missing'] = np.nan
      df_nba['single'] = 'single value'
            
      for row in range(10):
        df_nba.loc[row, 'missing'] = 1
            
      df_nba.head()
      

Now that we've intentionally introduced some "dirty" data, let's clean it up using our function:

      df_nba = basic_wrangling(df_nba, features=list(df_nba.columns))
      df_nba.head()
        
      # Output:
      # Dropping Unnamed: 0: 467 unique values out of 467 (1.0)
      # Dropping Player Name: 467 unique values out of 467 (1.0)
      # Dropping Player-additional: 467 unique values out of 467 (1.0)
      # Dropping missing: 457 missing values out of 467 (0.98)
      # Dropping single: Contains only one unique value (single value)
      

Not bad, right? Now, try using your newly created basic_wrangling function on the other datasets you imported. Experimenting with different datasets will help you understand how well the function generalizes across various types of data.
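If you don't have the CSV files handy, you can also exercise the function on a synthetic DataFrame. The sketch below repeats the function so it runs on its own (using a len() check for the empty-features default so any iterable works), then applies it to made-up columns designed to trigger each rule. The column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

def basic_wrangling(df, features=[], missing_threshold=0.95,
                    unique_threshold=0.95, messages=True):
    if len(features) == 0:
        features = df.columns
    for feat in features:
        if feat in df.columns:
            missing = df[feat].isna().sum()
            unique = df[feat].nunique()
            rows = df.shape[0]
            if missing / rows >= missing_threshold:
                if messages: print(f"Dropping {feat}: {missing} missing out of {rows}")
                df.drop(columns=[feat], inplace=True)
            elif unique / rows >= unique_threshold:
                if df[feat].dtype in ['int64', 'object']:
                    if messages: print(f"Dropping {feat}: {unique} unique out of {rows}")
                    df.drop(columns=[feat], inplace=True)
            elif unique == 1:
                if messages: print(f"Dropping {feat}: only one unique value")
                df.drop(columns=[feat], inplace=True)
        else:
            if messages: print(f"Skipping \"{feat}\": column not found")
    return df

rng = np.random.default_rng(0)
df_toy = pd.DataFrame({
    'id':       range(100),                     # primary key: 100% unique -> dropped
    'constant': ['same'] * 100,                 # single repeated value -> dropped
    'sparse':   [np.nan] * 96 + [1, 2, 3, 4],   # 96% missing -> dropped
    'price':    rng.normal(50, 10, 100),        # unique floats, but kept (not int64/object)
    'region':   ['north', 'south'] * 50,        # ordinary categorical -> kept
})

df_toy = basic_wrangling(df_toy)
print(list(df_toy.columns))  # → ['price', 'region']
```

Notice that 'price' survives even though every value is unique: the dtype check deliberately protects continuous numeric columns from the unique-value rule.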