Math Transformations

Figure 9.1: Representation of Skewness

Photo by Diva Jain, August 23, 2018, CC BY-SA 4.0 via Wikimedia Commons.

Why Do We Transform?

Mathematical transformations are often applied to numeric features in data analytics to improve the performance of machine learning models, enhance interpretability, and correct distributional issues. Below are some key reasons why transformations are beneficial:

  • Normalizing Skewed Data: Many machine learning algorithms assume normality in the input data. If a feature is highly skewed (e.g., income data), applying a transformation can help normalize it.

  • Reducing the Impact of Outliers: Outliers can disproportionately affect models like linear regression. Logarithmic or square root transformations can reduce their influence.

  • Improving Linearity: Some machine learning models perform better when relationships between features and the target variable are linear. Transformations can help achieve this.

  • Stabilizing Variance: If a feature’s variance increases as its value increases, a transformation can help achieve homoscedasticity (constant variance), which is desirable in linear regression.

  • Making Data More Interpretable: Some transformations help present data in a way that is easier to understand, such as converting exponential growth to a linear scale (illustrated just below this list).
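As a quick illustration of that last point (the numbers below are made up purely for demonstration), values that grow exponentially become evenly spaced, and therefore much easier to read, once a log is applied:

      import numpy as np

      values = np.array([10, 100, 1000, 10000, 100000])  # Exponential growth
      print(np.log10(values))  # [1. 2. 3. 4. 5.] -- now a linear scale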

The list above gives several reasons for performing math transformations. The next obvious questions are "how?" and then "can we automate this?"

How Do We Transform?

The table below summarizes several common transformations:

Table 9.2
Summary of Common Transformations
Transformation | Use Case | Effect | Reverse
Log | Right-skewed data (e.g., prices, income) | Reduces skewness, stabilizes variance | Exponentiation (np.exp(x))
Square Root (including nth root) | Count data (e.g., population, transaction counts) | Normalizes variance while preserving order | Squaring (x**2)
Box-Cox | Non-normal data | Stabilizes variance and improves normality | Inverse Box-Cox (scipy.special.inv_boxcox(x, lambda))
Min-Max Scaling | Features with different units | Scales features to [0, 1] range | x * (max - min) + min
Z-Score Standardization | Features with different scales | Centers data around 0 with unit variance | x * std + mean

Again, I realize that you may not fully understand why we need these transformations at this point in the Data Project process. The need for each will become clearer in the modeling phase. For now, let's simply learn to automate them. Libraries such as SciPy and scikit-learn already provide functions for Box-Cox, Min-Max scaling, and Z-Score standardization, so let's focus on logarithmic and nth root transformations to reduce skewness.
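If you would like to see those library routines in action, here is a minimal sketch using SciPy and scikit-learn. The synthetic amount column is made up purely for illustration; in practice you would apply these to your own features:

      import numpy as np, pandas as pd
      from scipy import stats
      from scipy.special import inv_boxcox
      from sklearn.preprocessing import MinMaxScaler, StandardScaler

      # Synthetic right-skewed feature, just for illustration
      demo = pd.DataFrame({'amount': np.random.lognormal(mean=9, sigma=1, size=1000)})

      # Box-Cox: returns the transformed values and the fitted lambda (requires positive values)
      transformed, lam = stats.boxcox(demo['amount'])
      recovered = inv_boxcox(transformed, lam)  # Reverse the Box-Cox transformation

      # Min-Max scaling to the [0, 1] range
      demo['amount_minmax'] = MinMaxScaler().fit_transform(demo[['amount']]).ravel()

      # Z-Score standardization (mean 0, unit variance)
      demo['amount_z'] = StandardScaler().fit_transform(demo[['amount']]).ravel()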

Let's begin by examining the histogram of a skewed feature and seeing how various transformations affect its shape. Remember that we have four datasets to test our functions on; here, let's use the charges feature from the medical insurance dataset:

      # Mount Google Drive if needed and bring in some sample data
      from google.colab import drive
      drive.mount('/content/drive')
        
      import pandas as pd
        
      # Datasets with numeric label for testing
      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
      df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
        
      # Dataset with categorical label for testing
      df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
        
      import seaborn as sns, matplotlib.pyplot as plt, numpy as np
        
      print(f"Original charges: {df_insurance['charges'].skew()}")
      print(f"Square root transform: {(df_insurance['charges']**(1/2)).skew()}")
      print(f"Cubed root transform: {(df_insurance['charges']**(1/3)).skew()}")
      print(f"Natural log transform: {np.log2(df_insurance['charges']).skew()}")
        
      # Create subplots (1 row, 4 columns)
      fig, axes = plt.subplots(1, 4, figsize=(20, 5))
        
      # Original charges histogram
      sns.histplot(data=df_insurance, x='charges', ax=axes[0])
      axes[0].set_title("Original Distribution")
        
      # Square root transform
      sns.histplot(df_insurance['charges']**(1/2), ax=axes[1])
      axes[1].set_title("Square Root Transform")
        
      # Cube root transform
      sns.histplot(df_insurance['charges']**(1/3), ax=axes[2])
      axes[2].set_title("Cube Root Transform")

      # Natural log transform
      sns.histplot(np.log(df_insurance['charges']), ax=axes[3])
      axes[3].set_title("Log Transform")
        
      # Adjust layout and show plots
      plt.tight_layout()
      plt.show()
      
      # Output: 
      # Original charges:	    1.5158796580240388
      # Square root transform:	0.7958625166976426
      # Cube root transform:	0.515182615434519
      # Natural log transform:	-0.09009752473024946
      

Notice how the skewness gets closer to zero and the histogram looks more normally distributed as we apply increasingly strong transformations. However, stronger transformations are not always better. In this case, the most extreme transformation (logarithmic) was needed to bring the skewness close to zero, but for other features a log transform can be too strong and push the skewness negative. The goal is to select the transformation that brings the skewness score as close to zero as possible. How should we do that?
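One rough way (the candidate exponents below are arbitrary choices, not the search we are about to build) is simply to compute the skew of several transformations and keep whichever lands closest to zero:

      import numpy as np

      # Candidate transformations: several nth roots plus the natural log
      candidates = {f'1/{n} power': df_insurance['charges'] ** (1/n) for n in range(2, 11)}
      candidates['natural log'] = np.log(df_insurance['charges'])

      # Keep whichever transformation yields skewness closest to zero
      best = min(candidates, key=lambda name: abs(candidates[name].skew()))
      print(best, round(candidates[best].skew(), 4))

The function below generalizes this idea: it increments the power in small steps instead of testing a fixed list, handles left-skewed as well as right-skewed features, and falls back to a binary (0/1) encoding when no power gets the skew close enough to zero: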

      def skew_correct(df, feature, max_power=50, messages=True):
        import pandas as pd, numpy as np
        import seaborn as sns, matplotlib.pyplot as plt
        
        # Ensure the feature is numeric before proceeding
        if not pd.api.types.is_numeric_dtype(df[feature]):
          if messages: print(f'{feature} is not numeric. No transformation performed')
          return df
        
        # Address missing data by running the basic_wrangling() function we built previously
        df = basic_wrangling(df, messages=False)
        if messages: print(f"{df.shape[0] - df.dropna().shape[0]} rows were dropped first due to missing data")
        df.dropna(inplace=True)
        
        # Reduce dataset size if it is too large
        df_temp = df.copy()
        if df_temp.memory_usage().sum() > 1000000:  # If memory usage is greater than 1MB
          df_temp = df.sample(frac=round(5000 / df.shape[0], 2))  # Take a representative sample
        
        # Identify the appropriate transformation to correct skewness
        i = 1  # Initial transformation power
        skew = df_temp[feature].skew()  # Compute initial skewness
        if messages: print(f'Starting skew:\t{round(skew, 5)}')
        
        # Try different transformations to reduce skewness
        while round(skew, 2) != 0 and i <= max_power:
          i += 0.01  # Increment transformation power slightly
          if skew > 0:
            skew = np.power(df_temp[feature], 1/i).skew()  # Apply root transformations for right-skewed data
          else:
            skew = np.power(df_temp[feature], i).skew()  # Apply power transformations for left-skewed data
        
        if messages: print(f'Final skew:\t{round(skew, 5)} based on raising to {round(i, 2)}')
        
        # Apply the best-found transformation
        if skew > -0.1 and skew < 0.1:  # If skew is sufficiently corrected
          if skew > 0:
            corrected = np.power(df[feature], 1/round(i, 3))
            name = f'{feature}_1/{round(i, 3)}'  # Naming convention for transformed feature
          else:
            corrected = np.power(df[feature], round(i, 3))
            name = f'{feature}_{round(i, 3)}'
          df[name] = corrected  # Add transformed feature to DataFrame
        else:
          # If skew correction is unsuccessful, convert the feature to binary (0/1)
          name = f'{feature}_binary'
          top_value = df[feature].value_counts().index[0]  # Most frequent value, computed once up front
          df[name] = df[feature]
          if skew > 0:
            df.loc[df[feature] == top_value, name] = 0  # Most frequent value = 0
            df.loc[df[feature] != top_value, name] = 1  # All other values = 1
          else:
            df.loc[df[feature] == top_value, name] = 1  # Most frequent value = 1
            df.loc[df[feature] != top_value, name] = 0  # All other values = 0
          if messages:
            print(f'The feature {feature} could not be transformed into a normal distribution.')
            print(f'Instead, it has been converted to a binary (0/1)')
        
        # Generate histograms to visualize the effect of skew correction
        if messages:
          f, axes = plt.subplots(1, 2, figsize=[7, 3.5])  # Create two subplots for before & after transformation
          sns.despine(left=True)
        
          # Plot original feature distribution
          sns.histplot(df_temp[feature], color='b', ax=axes[0], kde=True)
        
          # Plot corrected feature distribution
          if skew > -0.1 and skew < 0.1:
            if skew > 0:
              corrected = np.power(df_temp[feature], 1/round(i, 3))
            else:
              corrected = np.power(df_temp[feature], round(i, 3))
            df_temp['corrected'] = corrected
            sns.histplot(df_temp.corrected, color='g', ax=axes[1], kde=True)
          else:
            df_temp['corrected'] = df_temp[feature]
            top_value = df_temp[feature].value_counts().index[0]  # Most frequent value in the sample
            if skew > 0:
              df_temp.loc[df_temp[feature] == top_value, 'corrected'] = 0  # Most frequent value = 0
              df_temp.loc[df_temp[feature] != top_value, 'corrected'] = 1  # Others = 1
            else:
              df_temp.loc[df_temp[feature] == top_value, 'corrected'] = 1  # Most frequent value = 1
              df_temp.loc[df_temp[feature] != top_value, 'corrected'] = 0  # Others = 0
            sns.countplot(data=df_temp, x='corrected', color='g', ax=axes[1])
        
          plt.setp(axes, yticks=[])
          plt.tight_layout()
          plt.show()
        
        return df
      

This is a fairly complex function. You may benefit from watching the video above to understand how and why we built it step by step. However, if you feel comfortable with it, let's go ahead and test it using the same charges feature we started with above:

      import pandas as pd
      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      skew_correct(df_insurance, 'charges').head()
      
      # Output:
      # Starting skew: 1.516
      # Final skew: 0.005
      

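One practical note: whether the new transformed column ends up on the DataFrame you passed in depends on how basic_wrangling() handles its input, so it is safest to capture the return value. The exact name of the new column depends on the power the search settles on, which is why the check below just lists every charges-related column:

      # Capture the returned DataFrame so the transformed column is kept
      df_insurance = skew_correct(df_insurance, 'charges', messages=False)

      # The new column follows the function's naming convention (e.g., 'charges_1/...')
      print([col for col in df_insurance.columns if col.startswith('charges')])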
You might find it beneficial to try this out on a few other datasets:

      df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
      skew_correct(df_nba, 'Salary').head()
      
      # Output:
      # Starting skew: 1.842
      # Final skew: 0.005
      
      df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
      skew_correct(df_airbnb, 'average_review').head()
      
      # Output:
      # Starting skew: 7.59
      # Final skew: 0.005
      
      df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
      skew_correct(df_airline, 'Departure Delay in Minutes').head()
      
      # Output:
      # Starting skew: 6.822
      # Final skew: 1.939
      
      # The feature Departure Delay in Minutes could not be transformed into a normal distribution.
      # Instead, it has been converted to a binary (0/1) where 0 = 0 and all other values = 1
      

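When the fallback triggers, a quick value_counts() on the new column confirms the 0/1 split. The _binary suffix comes from the function's naming convention, and capturing the return value is assumed here:

      df_airline = skew_correct(df_airline, 'Departure Delay in Minutes', messages=False)
      print(df_airline['Departure Delay in Minutes_binary'].value_counts())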
This is a pretty powerful function. However, it is only the beginning of automating mathematical operations; you can create more functions like it for yourself as you learn more about the modeling phase.
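As one example (a sketch only; the helper name and its skip parameter are my own, not something we built above), you could wrap skew_correct() in a loop that corrects every numeric feature at once:

      def skew_correct_all(df, skip=None, messages=False):
        # Hypothetical helper: apply skew_correct() to every numeric feature not in the skip list
        skip = skip if skip is not None else []
        for feature in df.select_dtypes(include='number').columns:
          if feature not in skip:
            df = skew_correct(df, feature, messages=messages)
        return df

      # Example usage on the insurance data
      df_insurance = skew_correct_all(df_insurance)

Helpers like this become more valuable once you reach the modeling phase and know which features actually need correcting and which, such as the label, are better left alone.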