9.5 Math Transformations
Why Do We Transform?
Mathematical transformations are often applied to numeric features in data analytics to improve the performance of machine learning models, enhance interpretability, and correct distributional issues. Below are some key reasons why transformations are beneficial:
Normalizing Skewed Data: Many statistical models and machine learning algorithms work best when the input data is approximately normal. If a feature is highly skewed (e.g., income data), applying a transformation can help normalize it, as illustrated in the short sketch after this list.
Reducing the Impact of Outliers: Outliers can disproportionately affect models like linear regression. Logarithmic or square root transformations can reduce their influence.
Improving Linearity: Some machine learning models perform better when relationships between features and the target variable are linear. Transformations can help achieve this.
Stabilizing Variance: If a feature’s variance increases as its value increases, a transformation can help achieve homoscedasticity (constant variance), which is desirable in linear regression.
Making Data More Interpretable: Some transformations help present data in a way that is easier to understand, such as converting exponential growth to a linear scale.
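To make the first point concrete, here is a minimal sketch showing how a log transform pulls skewness toward zero. The data is synthetic and right-skewed (generated with NumPy), so the exact numbers are illustrative only:

import numpy as np
import pandas as pd

# Synthetic right-skewed "income" data for illustration only
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

print(f"Skew before log transform: {income.skew():.2f}")  # strongly positive
print(f"Skew after log transform: {np.log(income).skew():.2f}")  # near zero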
The above list gives several reasons for performing math transformations. The next obvious questions are "how?" and, after that, "can we automate this?"
How Do We Transform?
The table below summarizes the most common transformations:

| Transformation | Use Case | Effect | Reverse |
|---|---|---|---|
| Log | Right-skewed data (e.g., prices, income) | Reduces skewness, stabilizes variance | Exponentiation (np.exp(x)) |
| Square Root (including nth root) | Count data (e.g., population, transaction counts) | Normalizes variance while preserving order | Squaring (x**2) |
| Box-Cox | Non-normal data | Stabilizes variance and improves normality | Inverse Box-Cox (scipy.special.inv_boxcox(x, lambda)) |
| Min-Max Scaling | Features with different units | Scales features to the [0, 1] range | x * (max - min) + min |
| Z-Score Standardization | Features with different scales | Centers data around 0 with unit variance | x * std + mean |
Again, I realize that you will likely not fully understand why we need these transformations at this point in the Data Project process. The need for each of them will become clearer in the modeling phase. For now, let's simply learn to automate them. There are already library functions that perform Box-Cox, Min-Max scaling, and Z-Score standardization, so let's focus on logarithmic and nth root transformations to reduce skewness.
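For reference, the ready-made versions mentioned above might look like this in practice. This is a minimal sketch, assuming scipy and scikit-learn are installed; the data is synthetic and the variable names are illustrative:

import numpy as np
from scipy import stats
from scipy.special import inv_boxcox
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.random.default_rng(0).lognormal(size=500)  # synthetic positive data

# Box-Cox requires strictly positive input; returns the transformed data and lambda
transformed, lam = stats.boxcox(data)
restored = inv_boxcox(transformed, lam)  # reverse the transform

# Min-Max scaling and Z-score standardization expect a 2D array
scaled = MinMaxScaler().fit_transform(data.reshape(-1, 1))
standardized = StandardScaler().fit_transform(data.reshape(-1, 1))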
Let's begin by examining the histogram of a skewed feature to see how various transformations affect its shape. Remember that we have four datasets to test our functions on. Let's use the charges feature from the medical insurance dataset:
# Mount Google Drive if needed and bring in some sample data
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
# Datasets with numeric label for testing
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
# Dataset with categorical label for testing
df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
import seaborn as sns, matplotlib.pyplot as plt, numpy as np
print(f"Original charges: {df_insurance['charges'].skew()}")
print(f"Square root transform: {(df_insurance['charges']**(1/2)).skew()}")
print(f"Cubed root transform: {(df_insurance['charges']**(1/3)).skew()}")
print(f"Natural log transform: {np.log2(df_insurance['charges']).skew()}")
# Create subplots (1 row, 4 columns)
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
# Original charges histogram
sns.histplot(data=df_insurance, x='charges', ax=axes[0])
axes[0].set_title("Original Distribution")
# Square root transform
sns.histplot(df_insurance['charges']**(1/2), ax=axes[1])
axes[1].set_title("Square Root Transform")
# Cubed root transform
sns.histplot(df_insurance['charges']**(1/3), ax=axes[2])
axes[2].set_title("Cubed Root Transform")
# Natural log transform
sns.histplot(np.log(df_insurance['charges']), ax=axes[3])
axes[3].set_title("Log Transform")
# Adjust layout and show plots
plt.tight_layout()
plt.show()
# Output:
# Original charges: 1.5158796580240388
# Square root transform: 0.7958625166976426
# Cubed root transform: 0.515182615434519
# Natural log transform: -0.09009752473024946
Notice how the skewness score moves closer to zero and the histogram becomes more normally distributed as we apply increasingly strong transformations. However, it is important to note that stronger transformations are not always better. In this case, it took the strongest transformation (logarithmic) to bring the skewness close to zero, but in other cases the log may be too strong and overshoot into negative skewness. The goal is to select the transformation that brings the skewness score as close to zero as possible. How should we do that? Take a look at the example below:
def skew_correct(df, feature, max_power=50, messages=True):
    import pandas as pd, numpy as np
    import seaborn as sns, matplotlib.pyplot as plt

    # Ensure the feature is numeric before proceeding
    if not pd.api.types.is_numeric_dtype(df[feature]):
        if messages: print(f'{feature} is not numeric. No transformation performed')
        return df

    # Address missing data by running the basic_wrangling function from earlier in this chapter
    df = basic_wrangling(df, messages=False)
    if messages: print(f"{df.shape[0] - df.dropna().shape[0]} rows were dropped first due to missing data")
    df.dropna(inplace=True)

    # Reduce dataset size if it is too large
    df_temp = df.copy()
    if df_temp.memory_usage().sum() > 1000000:  # If memory usage is greater than ~1 MB
        df_temp = df.sample(frac=min(1.0, round(5000 / df.shape[0], 2)))  # Take a representative sample

    # Identify the appropriate transformation to correct skewness
    i = 1  # Initial transformation power
    skew = df_temp[feature].skew()  # Compute initial skewness
    if messages: print(f'Starting skew:\t{round(skew, 5)}')

    # Try different transformations to reduce skewness
    while round(skew, 2) != 0 and i <= max_power:
        i += 0.01  # Increment transformation power slightly
        if skew > 0:
            skew = np.power(df_temp[feature], 1/i).skew()  # Apply root transformations for right-skewed data
        else:
            skew = np.power(df_temp[feature], i).skew()  # Apply power transformations for left-skewed data
    if messages: print(f'Final skew:\t{round(skew, 5)} based on raising to {round(i, 2)}')

    # Apply the best-found transformation
    if -0.1 < skew < 0.1:  # If skew is sufficiently corrected
        if skew > 0:
            corrected = np.power(df[feature], 1/round(i, 3))
            name = f'{feature}_1/{round(i, 3)}'  # Naming convention for transformed feature
        else:
            corrected = np.power(df[feature], round(i, 3))
            name = f'{feature}_{round(i, 3)}'
        df[name] = corrected  # Add transformed feature to DataFrame
    else:
        # If skew correction is unsuccessful, convert the feature to binary (0/1)
        name = f'{feature}_binary'
        mode = df[feature].value_counts().index[0]  # Most frequent value, computed once
        if skew > 0:
            df[name] = np.where(df[feature] == mode, 0, 1)  # Most frequent value = 0, others = 1
        else:
            df[name] = np.where(df[feature] == mode, 1, 0)  # Most frequent value = 1, others = 0
        if messages:
            print(f'The feature {feature} could not be transformed into a normal distribution.')
            print(f'Instead, it has been converted to a binary (0/1)')

    # Generate histograms to visualize the effect of skew correction
    if messages:
        f, axes = plt.subplots(1, 2, figsize=[7, 3.5])  # Two subplots: before & after transformation
        sns.despine(left=True)
        # Plot original feature distribution
        sns.histplot(df_temp[feature], color='b', ax=axes[0], kde=True)
        # Plot corrected feature distribution
        if -0.1 < skew < 0.1:
            if skew > 0:
                corrected = np.power(df_temp[feature], 1/round(i, 3))
            else:
                corrected = np.power(df_temp[feature], round(i, 3))
            df_temp['corrected'] = corrected
            sns.histplot(df_temp['corrected'], color='g', ax=axes[1], kde=True)
        else:
            # Plot the binary version created above; pandas aligns on index with the sampled rows
            df_temp['corrected'] = df[name]
            sns.countplot(data=df_temp, x='corrected', color='g', ax=axes[1])
        plt.setp(axes, yticks=[])
        plt.tight_layout()
        plt.show()

    return df
This is a fairly complex function. You may benefit from watching the video above to understand how and why we built it step by step. However, if you feel comfortable with it, let's go ahead and test it using the same charges feature we started with above:
import pandas as pd
df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
skew_correct(df_insurance, 'charges').head()
# Output:
# Starting skew: 1.516
# Final skew: 0.005
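Notice the "Reverse" column in the table earlier: because the new column's name records the power used, the transform is reversible. As a hypothetical example, if skew_correct reported raising to 1/1.35 and created a column named charges_1/1.35 (your exact exponent will differ), you could recover the original scale like this:

# Hypothetical example: undo the root transform using the exponent in the column name
df_transformed = skew_correct(df_insurance, 'charges', messages=False)
power = 1.35  # substitute the exponent from your generated column name
recovered = df_transformed[f'charges_1/{power}'] ** power  # back to the original scale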
You might find it beneficial to try this out on a few other datasets:
df_nba = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/nba_salaries.csv')
skew_correct(df_nba, 'Salary').head()
# Output:
# Starting skew: 1.842
# Final skew: 0.005
df_airbnb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/listings.csv')
skew_correct(df_airbnb, 'average_review').head()
# Output:
# Starting skew: 7.59
# Final skew: 0.005
df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
skew_correct(df_airline, 'Departure Delay in Minutes').head()
# Output:
# Starting skew: 6.822
# Final skew: 1.939
# The feature Departure Delay in Minutes could not be transformed into a normal distribution.
# Instead, it has been converted to a binary (0/1) where 0 = 0 and all other values = 1
This is a pretty powerful function. However, it's only the beginning of automating mathematical operations. You can create more for yourself as you learn more about the modeling phase.
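As one possible next step, here is a sketch of a hypothetical wrapper (skew_correct_all is not part of the course code) that loops skew_correct over every numeric feature in a DataFrame:

def skew_correct_all(df, messages=False):
    # Hypothetical convenience wrapper: correct skew in every numeric column
    for feature in df.select_dtypes('number').columns:
        df = skew_correct(df, feature, messages=messages)
    return df

df_insurance = skew_correct_all(df_insurance)
df_insurance.head()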