Statistics Function

Let’s begin by recalling the eight general steps to an automation function:

  1. Define the automation function

  2. Import Python packages

  3. Create variables needed for processing

  4. Define the iteration (i.e., for loop)

  5. Perform processing needed for every iteration

  6. Define the decision criterion

  7. Perform processing required in each branch of the decision tree

  8. Perform any final processing required to synthesize all branches (if any)

We followed these steps one at a time in the prior chapter. They are listed above only as a reminder, because we are going to move much more quickly through the automation process in this chapter.

      # Remember to mount Google Drive if needed
      from google.colab import drive
      drive.mount('/content/drive')
          
      import pandas as pd

      # Dataset with numeric label for testing
      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
    
      # Dataset with categorical label for testing
      df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
      df_airline.head()
      
      # Examine the output in your own notebook
      

Numeric-to-Numeric

Next, let's create a function that will automatically calculate an effect size for the first relationship type: numeric-to-numeric (N2N). In this case, if both the label and feature are numeric, then we want to calculate a Pearson correlation coefficient (r). However, it also helps to calculate the regression line and a p-value, so we will do all three. By the way, we will later enhance this function to calculate Kendall's tau (τ) and Spearman's rho (ρ), which apply to features that don't meet the standard assumptions for linear relationships.
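
Before building the function, here is a quick standalone preview of these statistics on a few made-up numbers. Nothing below is part of the final function; it is only meant to show what each scipy call returns (the toy values of x and y are just for illustration):

      import pandas as pd
      from scipy import stats

      x = pd.Series([1, 2, 3, 4, 5])            # Toy feature
      y = pd.Series([2, 4, 5, 4, 6])            # Toy label

      m, b, r, p, err = stats.linregress(x, y)  # Slope, intercept, Pearson r, p-value, std. error
      print(f'r = {r:.4f}, p = {p:.4f}, y = {m:.4f}(x) + {b:.4f}')

      tau, p_tau = stats.kendalltau(x, y)       # Kendall's tau (rank-based)
      rho, p_rho = stats.spearmanr(x, y)        # Spearman's rho (rank-based)
      print(f'tau = {tau:.4f}, rho = {rho:.4f}')

Now, here is the first version of the function: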

      def bivariate_stats(df, label, roundto=4):
        import pandas as pd
        from scipy import stats
          
        output_df = pd.DataFrame(columns=['p', 'r', 'y = m(x) + b'])
        
        for feature in df.columns:
          if feature != label: # No need to calculate the relationship of the label with itself
            if pd.api.types.is_numeric_dtype(df[feature]):
              m, b, r, p, err = stats.linregress(df[feature], df[label]) # Calculate the regression line
              output_df.loc[feature] = [round(p, roundto), round(r, roundto), f'y = {round(m, roundto)}(x) + {round(b, roundto)}']
        
        return output_df
      

In this first version of the function above, we have implemented several of the steps all at once:

  1. Defined the function as 'bivariate_stats.' This includes three inputs: 1) the entire DataFrame to be analyzed, 2) the name of the label, and 3) an option to specify how many decimals to include in the results

  2. Imported the Python packages pandas and scipy

  3. Created the output_df needed to store the results. Other variables are created later in the branch logic

  4. Defined an iteration through all of the features while skipping the label. This is the for loop.

  5. Defined a decision criterion to filter out the categorical features so that we initially focus only on N2N relationships.

  6. Performed at least the minimum required processing in each branch. This includes the statistics generated for N2N relationships.

Notice that these steps align with the concepts at the top of this page. If you completed the chapter on univariate automation, this should all feel familiar to you. Next, let's test out the function using the two practice datasets provided:

      bivariate_stats(df_insurance, 'charges', 3)
      

See how we have a p-value, Pearson correlation r, and linear formula for each of the numeric features (because the label is also numeric), but no rows for the categorical features since the function skips them for now. What if we want to analyze a dataset that has a categorical label? Print out the head of the df_airline dataset as an example. This dataset is used to predict or explain what causes airline passengers to be satisfied (or dissatisfied) with their flight experience.

      df_airline.head()
      
      # See the output in your own notebook
      

Notice that the label column 'satisfaction' contains a text indication of whether the customer was 'satisfied' or 'dissatisfied'. If we try our function on this dataset with 'satisfaction' as a label, it will break because you can't calculate a correlation with text data. Let's modify the decision criteria of our function to handle C2N and N2C relationships without implementing the statistical analyses yet and then test it with the airline satisfaction data.
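
The decision criterion we will rely on is pandas' built-in dtype check. A quick sanity check on the two labels (assuming the DataFrames loaded earlier) shows why the airline label needs different handling:

      import pandas as pd

      print(pd.api.types.is_numeric_dtype(df_insurance['charges']))     # True: numeric label
      print(pd.api.types.is_numeric_dtype(df_airline['satisfaction']))  # False: categorical label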

Categorical-to-Numeric

In this version of the function below, we add two more branches to the decision tree: one for the case where both the feature and label are categorical (C2C), and an 'else' branch that catches the mixed C2N/N2C cases. For now, both new branches simply insert dashes as placeholders.

      def bivariate_stats(df, label, roundto=4):
        import pandas as pd
        from scipy import stats
      
        output_df = pd.DataFrame(columns=['p', 'r', 'y = m(x) + b'])
    
        for feature in df.columns:
          if feature != label:
          
            # If both the feature and label are numeric
            if pd.api.types.is_numeric_dtype(df[feature]) and pd.api.types.is_numeric_dtype(df[label]):
              m, b, r, p, err = stats.linregress(df[feature], df[label])
              output_df.loc[feature] = [round(p, roundto), round(r, roundto), f'y = {round(m, roundto)}(x) + {round(b, roundto)}']
          
            # If both the feature and label are categorical (we will come back to this condition, leave it with dashes for now)
            elif not pd.api.types.is_numeric_dtype(df[feature]) and not pd.api.types.is_numeric_dtype(df[label]):
              output_df.loc[feature] = ['-', '-', '-']
          
            # If one of the feature/label is categorical and the other is numeric (we will come back to this condition, leave it with dashes for now)
            else:
              output_df.loc[feature] = ['-', '-', '-']
        return output_df
      

Notice that we now have two additional decision-tree branches: one for scenarios where both the feature and label are categorical (C2C) and a simple "else" criterion that handles all other conditions. If N2N is handled in the first branch and C2C in the second branch, the final 'else' criterion handles both C2N and N2C. This is ideal because we would use the same statistics and visualizations for either scenario. If we run this function with the airline dataset, what do we get?

      bivariate_stats(df_airline, 'satisfaction', 3)
      

Notice that we get a table with a row for every feature in the airline dataset, but no statistics calculated since we simply inserted dashes for the C2C and C2N/N2C conditions. That makes sense. Let's now implement the logic needed to calculate the necessary statistics for C2N/N2C relationships.
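
The trickiest part of the C2N/N2C branch is splitting the numeric values into one list per category. If that pattern is new to you, here is a standalone sketch of it first (assuming df_insurance is loaded, using 'smoker' as the category and 'charges' as the numeric column):

      groups = df_insurance['smoker'].unique()  # The unique categories, e.g., ['yes', 'no']
      group_lists = []
      for g in groups:
        g_list = df_insurance[df_insurance['smoker'] == g]['charges']  # Charges for this one category
        group_lists.append(g_list)
      print([len(g) for g in group_lists])      # One list of charges per smoker status

With that pattern in hand, here is the updated function: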

      def bivariate_stats(df, label, roundto=4):
        import pandas as pd
        from scipy import stats
      
        output_df = pd.DataFrame(columns=['p', 'r', 'y = m(x) + b', 'F'])
    
        for feature in df.columns:
          if feature != label:
          
            # If both the feature and label are numeric
            if pd.api.types.is_numeric_dtype(df[feature]) and pd.api.types.is_numeric_dtype(df[label]):
              m, b, r, p, err = stats.linregress(df[feature], df[label])
              output_df.loc[feature] = [round(p, roundto), round(r, roundto), f'y = {round(m, roundto)}(x) + {round(b, roundto)}', '-']
          
            # If both the feature and label are categorical (we will come back to this condition, leave it with dashes for now)
            elif not pd.api.types.is_numeric_dtype(df[feature]) and not pd.api.types.is_numeric_dtype(df[label]):
              output_df.loc[feature] = ['-', '-', '-', '-']
          
            # If one of the feature/label is categorical and the other is numeric
            else:
              # Handle whether it's the feature or label that is numeric; basically, set which one is numeric and which is categorical
              if pd.api.types.is_numeric_dtype(df[feature]): 
                num = feature
                cat = label
              else:
                num = label
                cat = feature
            
              groups = df[cat].unique()         # Identify the unique categorical values in a list
              group_lists = []                  # Create a list to keep track of the lists of numeric values for each category
              for g in groups:                  # Loop through each category
                g_list = df[df[cat] == g][num]  # Add values to a temporary list
                group_lists.append(g_list)      # Add temp list to master nested list; NOTE: if you are unsure of what's going on here, try printing out each of the variables
    
              results = stats.f_oneway(*group_lists) # The "*" is the same as (group_lists[0], group_lists[1], ... group_lists[n])
              F = results[0]
              p = results[1]
              output_df.loc[feature] = [round(p, roundto), '-', '-', round(F, roundto)]
        return output_df
      

In this version, after the "else" condition, we first added logic to determine whether it is the feature or the label that is numeric, assigning the numeric one to the variable "num" and the categorical one to "cat". After that, we used logic introduced earlier in this book to identify the unique categorical values, loop through those values, and create a separate list of numeric values for each category. Once we had those sub-lists, we passed them into the stats.f_oneway function to generate the F and p values. Since a one-way ANOVA with only two groups returns the same p-value as a t-test, we can simply use a one-way ANOVA for all C2N/N2C conditions regardless of whether there are two or more unique categorical values.
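
To convince yourself of that last claim, here is a minimal check (assuming df_insurance is loaded and its 'smoker' column contains 'yes'/'no'): with only two groups, a one-way ANOVA produces the same p-value as an independent-samples t-test, and F equals t squared.

      from scipy import stats

      smokers = df_insurance[df_insurance['smoker'] == 'yes']['charges']
      nonsmokers = df_insurance[df_insurance['smoker'] == 'no']['charges']

      F, p_anova = stats.f_oneway(smokers, nonsmokers)
      t, p_ttest = stats.ttest_ind(smokers, nonsmokers)
      print(abs(p_anova - p_ttest) < 1e-10)  # True: identical p-values
      print(abs(F - t**2) < 1e-6)            # True: F equals t squared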

Let's test it out on the airline data:

      bivariate_stats(df_airline, 'satisfaction', 3)
      

Notice that we now have p and F values for every numeric feature (e.g. Age, Flight Distance, etc.). This is starting to look good. Why didn't we get values for Gender, Customer Type, Type of Travel, and Class? Because those features are also categorical. That means we need to create a condition for C2C relationships. Why are there no N2N results showing up? Because the label 'satisfaction' is categorical. So the only two types of relationships we will get from this dataset (as long as we are using satisfaction as the label) will be C2C and N2C.

However, there is one other issue we need to address first. Notice that Arrival Delay in Minutes came back with NaN values. That is because there is missing data in that feature. Let's modify our code to account for missing data and also add a column to indicate how much data is missing in each feature. Since we will want to do this regardless of what type of relationship we are analyzing (i.e., N2N, C2N/N2C, or C2C), we will add this code just after the check that ensures we are not analyzing the label's relationship with itself. Notice that we also have to add a 'missing' column to the DataFrame creation statement, include the percent of missing data whenever we insert a new record into output_df, and change the remaining DataFrame references to refer to df_temp--the version without missing data.
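
In isolation, the pairwise missing-data calculation looks like this (using the Arrival Delay in Minutes feature mentioned above):

      df_temp = df_airline[['Arrival Delay in Minutes', 'satisfaction']].dropna()
      missing = (df_airline.shape[0] - df_temp.shape[0]) / df_airline.shape[0]
      print(f'{missing:.2%}')  # Percent of rows dropped because either column was missing

Now let's fold this into the function: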

      def bivariate_stats(df, label, roundto=4):
        import pandas as pd
        from scipy import stats
        
        output_df = pd.DataFrame(columns=['missing', 'p', 'r', 'y = m(x) + b', 'F'])
        
        for feature in df.columns:
          if feature != label:
            df_temp = df[[feature, label]]
            df_temp = df_temp.dropna()
            missing = (df.shape[0] - df_temp.shape[0]) / df.shape[0] # Calculate the percent of rows missing if either the feature or label are missing
        
            if pd.api.types.is_numeric_dtype(df[feature]) and pd.api.types.is_numeric_dtype(df[label]):
              m, b, r, p, err = stats.linregress(df_temp[feature], df_temp[label])
              output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), round(r, roundto), f'y = {round(m, roundto)}(x) + {round(b, roundto)}', '-']
            
            elif not pd.api.types.is_numeric_dtype(df_temp[feature]) and not pd.api.types.is_numeric_dtype(df_temp[label]):
              output_df.loc[feature] = [f'{missing:.2%}', '-', '-', '-', '-']
            
            else:
              if pd.api.types.is_numeric_dtype(df_temp[feature]): 
                num = feature
                cat = label
              else:
                num = label
                cat = feature
              
              groups = df_temp[cat].unique()
              group_lists = []
              for g in groups:
                g_list = df_temp[df_temp[cat] == g][num]
                group_lists.append(g_list)
        
              results = stats.f_oneway(*group_lists)
              F = results[0]
              p = results[1]
              output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), '-', '-', round(F, roundto)]
        return output_df
      

Let's test out this version:

      bivariate_stats(df_airline, 'satisfaction', 3)
      

That looks better. We now have a p-value and F stat for Arrival Delay in Minutes, as well as the percent of missing data in the first column so that we know when a modified version of the dataset was used to calculate the statistics. Next, let's handle the C2C scenario.

Categorical-to-Categorical

For C2C relationships, let's calculate the Pearson chi-square statistic (χ2) in the branch where both the feature and label are categorical. To include this stat, we need to add a new 'X2' column to the DataFrame creation statement and then a dash to the irrelevant branches.
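
In isolation, the chi-square calculation pairs pd.crosstab with stats.chi2_contingency. Here is a quick standalone sketch using the categorical Class feature (assuming df_airline is loaded):

      import pandas as pd
      from scipy import stats

      contingency_table = pd.crosstab(df_airline['Class'], df_airline['satisfaction'])  # Counts per category combination
      X2, p, dof, expected = stats.chi2_contingency(contingency_table)                  # Chi-square test on the crosstab
      print(f'X2 = {X2:.4f}, p = {p:.4f}, dof = {dof}')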

Finally, now that we have all relationship types accounted for, we can sort output_df by p-value to get an idea of which features have the largest effect on the label. Please note that the p-value is only a surrogate for effect size here; it is not actually a measure of effect size.

      def bivariate_stats(df, label, roundto=4):
        import pandas as pd
        from scipy import stats
          
        output_df = pd.DataFrame(columns=['missing', 'p', 'r', 'y = m(x) + b', 'F', 'X2'])
        
        for feature in df.columns:
          if feature != label:
            df_temp = df[[feature, label]]
            df_temp = df_temp.dropna()
            missing = (df.shape[0] - df_temp.shape[0]) / df.shape[0]
        
            if pd.api.types.is_numeric_dtype(df_temp[feature]) and pd.api.types.is_numeric_dtype(df_temp[label]):
              m, b, r, p, err = stats.linregress(df_temp[feature], df_temp[label])
              output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), round(r, roundto), f'y = {round(m, roundto)}(x) + {round(b, roundto)}', '-', '-']
              
            elif not pd.api.types.is_numeric_dtype(df_temp[feature]) and not pd.api.types.is_numeric_dtype(df_temp[label]):
              contingency_table = pd.crosstab(df_temp[feature], df_temp[label]) # Calculate the crosstab
              X2, p, dof, expected = stats.chi2_contingency(contingency_table)  # Calculate the Chi-square based on the crosstab
              output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), '-', '-', '-', round(X2, roundto)]
                
            else:
              if pd.api.types.is_numeric_dtype(df_temp[feature]): 
                num = feature
                cat = label
              else:
                num = label
                cat = feature
                
              groups = df_temp[cat].unique()
              group_lists = []
              for g in groups:
                g_list = df_temp[df_temp[cat] == g][num]
                group_lists.append(g_list)
        
              results = stats.f_oneway(*group_lists)
              F = results[0]
              p = results[1]
              output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), '-', '-', round(F, roundto), '-']
        return output_df.sort_values(by=['p'])
      

Let's test out this version:

      bivariate_stats(df_airline, 'satisfaction', 3)
      

Now that we have every condition accounted for, let's try this with the insurance.csv dataset again using a couple of different labels:

      bivariate_stats(df_insurance, 'charges', 3)
      
      bivariate_stats(df_insurance, 'smoker', 3)
      

Hopefully, you realize how cool this is. You now have a function that automatically calculates the correct statistics for any dataset based on the type of data.