Visualization Functions

We are going to approach visualization automation a bit differently than the statistics. Rather than create one overall function that handles everything, let's create a separate function for each type of visualization we want. Then, we will create a controller function that integrates those smaller functions. Why would we do this? Because it allows us to reuse portions of code in a variety of places while editing them all in one spot. This maximizes reusability while limiting the potential for inconsistency and errors because it limits the total amount of code. Make sense? Probably not if you are still somewhat new to coding. But you'll get the picture as we go along.

Once again, let's create the functions for each data type relationship: N2N, C2N/N2C, and C2C. Then we will integrate them at the end.
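To make that plan concrete, here is a rough sketch of the kind of controller we will eventually build. The function name plot_type and its dtype checks are placeholders for illustration only; rather than drawing anything, it just reports which chart each feature/label pair would get:

```python
import pandas as pd

def plot_type(df, feature, label):
    # Decide which visualization fits based on the column dtypes.
    # The real controller will call the plotting functions we build below
    # instead of returning a label.
    feature_numeric = pd.api.types.is_numeric_dtype(df[feature])
    label_numeric = pd.api.types.is_numeric_dtype(df[label])
    if feature_numeric and label_numeric:
        return 'scatterplot'  # N2N
    elif feature_numeric != label_numeric:
        return 'bar chart'    # C2N or N2C
    else:
        return 'heat map'     # C2C
```

Notice that one dispatch function lets every caller share the same dtype logic, which is exactly the reuse argument made above.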

We are going to work with the same datasets from the prior section. Read them into DataFrames if you haven't already:

      # Bring in some sample data

      import pandas as pd
    
      df_insurance = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/insurance.csv')
      df_airline = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
      df_housing = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/housing.csv')
      

Numeric-to-Numeric (N2N): Scatterplots

Let's begin by creating a function that generates a scatterplot with a trendline. We could do this using either the matplotlib or seaborn package. If we were to use matplotlib, it would require generating two different plots (one for the scatterplot and one for the trendline) and embedding them together. But seaborn will create both in one plot using the .regplot() function. What will we need to pass into the function to make this work? At a minimum, .regplot() requires two vectors of data (one for the feature and one for the label). Let's design our function to accept a DataFrame and two column names. Notice in the code below that we also allow for a number of decimals to round to and the color of the regression line, which we default to 'darkorange'. By providing default values for both the roundto and linecolor parameters, these become optional parameters that the user may or may not specify when calling the function.

      def scatterplot(df, feature, label, roundto=3, linecolor='darkorange'):
        import pandas as pd
        from matplotlib import pyplot as plt
        import seaborn as sns
        from scipy import stats
    
        # Create the plot
        sns.regplot(x=df[feature], y=df[label], line_kws={"color": linecolor})
    
        # Calculate the regression line so that we can print the text
        m, b, r, p, err = stats.linregress(df[feature], df[label])
    
        # Add all descriptive statistics to the diagram
        textstr  = 'Regression line:' + '\n'
        textstr += 'y  = ' + str(round(m, roundto)) + 'x + ' + str(round(b, roundto)) + '\n'
        textstr += 'r   = ' + str(round(r, roundto)) + '\n'
        textstr += 'r2 = ' + str(round(r**2, roundto)) + '\n'
        textstr += 'p  = ' + str(round(p, roundto)) + '\n\n'
    
        plt.text(1, 0.1, textstr, fontsize=12, transform=plt.gcf().transFigure)
        plt.show()
      

Why did we not simply use the regplot function? Why create our own function at all? Because it can also be useful to calculate the statistics we created in the prior section and embed them alongside the chart so that we don't have to refer back to that statistics summary table to find out whether there is a significant relationship between the feature and label. That is the primary value added by this function. All of the code after the call to .regplot() is used to embed those statistics.
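If the five values returned by stats.linregress() are unfamiliar, here is a quick standalone look at them on made-up data (slope m, intercept b, correlation r, p-value, and standard error):

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]  # exactly y = 2x + 1

m, b, r, p, err = stats.linregress(x, y)
print(m, b, r)  # m ≈ 2.0, b ≈ 1.0, r ≈ 1.0 for this perfect fit
```

These are the same m, b, r, and p values that scatterplot() rounds and writes next to the chart.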

Next, let's try this function out with three N2N relationships in the insurance dataset:

      scatterplot(df_insurance, 'age', 'charges')
      
      scatterplot(df_insurance, 'bmi', 'charges')
      
      scatterplot(df_insurance, 'children', 'charges')
      

This wasn't too bad. Let's apply all of the same logic to create a visualization for C2N/N2C relationships next.

Categorical-to-Numeric (C2N/N2C): Bar charts

There are many options in the seaborn package for C2N/N2C relationships. The most basic and straightforward is a bar chart using the .barplot() function in seaborn, so we will use that. See if you can create a function similar to the one above for a .barplot() that includes a one-way ANOVA F stat and p-value embedded in the chart. If you have trouble, you can examine the code below for help.

      def bar_chart(df, feature, label, roundto=3):
        import pandas as pd
        from scipy import stats
        from matplotlib import pyplot as plt
        import seaborn as sns
        
        sns.barplot(df, x=feature, y=label)
        
        # Create the label lists needed to calculate oneway-ANOVA F  
        groups = df[feature].unique()
        group_lists = []
        for g in groups:
          g_list = df[df[feature] == g][label]
          group_lists.append(g_list)
      
        results = stats.f_oneway(*group_lists)
        F = results[0]
        p = results[1]
      
        # Next, calculate t-tests with Bonferroni correction for p-value threshold
        ttests = []
        for i1, g1 in enumerate(groups): # Use the enumerate() function to add an index for counting to a list of values
          # For each item, loop through a second list of each item to compare each pair
          for i2, g2 in enumerate(groups):
            if i2 > i1: # If the inner index is greater than the outer index, then go ahead and run a t-test
              type_1 = df[df[feature] == g1]
              type_2 = df[df[feature] == g2]
              t, p_t = stats.ttest_ind(type_1[label], type_2[label]) # Use a new name (p_t) so we don't overwrite the ANOVA p-value
            
              # Add each t-test result to a list of t, p pairs
              ttests.append([str(g1) + ' - ' + str(g2), round(t, roundto), round(p_t, roundto)])
      
        p_threshold = 0.05 / len(ttests) # Bonferroni-corrected p-value determined
      
        # Add all descriptive statistics to the diagram
        textstr  = '   ANOVA' + '\n'
        textstr += 'F: ' + str(round(F, roundto)) + '\n'
        textstr += 'p: ' + str(round(p, roundto)) + '\n\n'
      
        # Only include the significant t-tests in the printed results for brevity
        for ttest in ttests:
          if ttest[2] <= p_threshold:
            if 'Sig. comparisons (Bonferroni-corrected)' not in textstr: # Only include the header if there is at least one significant result
              textstr += 'Sig. comparisons (Bonferroni-corrected)' + '\n'
            textstr += str(ttest[0]) + ": t=" + str(ttest[1]) + ", p=" + str(ttest[2]) + '\n'
      
        plt.text(1, 0.1, textstr, fontsize=12, transform=plt.gcf().transFigure)
        plt.show()
      

In the bar_chart() function above, we implemented the seaborn .barplot() function but also embedded the ANOVA F stat and p-value. In addition, I added individual pairwise t-tests between each pair of categorical values. You may remember from stats classes or a prior chapter in this book that it is common to make an adjustment or correction to the p-value criterion when we examine every pairwise t-test to avoid finding significant results by random chance. In this function, I calculated the Bonferroni-adjusted p-value threshold. Then, with a second for loop over the t-test results, I added only those pairwise relationships that were significant according to the Bonferroni-corrected threshold into the embedded text in order to simplify the results.
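As a quick sanity check on the arithmetic, a feature with 4 categories produces 6 unique pairs, so the Bonferroni-corrected threshold becomes 0.05 / 6:

```python
from math import comb

n_groups = 4
n_tests = comb(n_groups, 2)       # 6 unique pairwise comparisons
p_threshold = 0.05 / n_tests      # Bonferroni-corrected threshold
print(n_tests, round(p_threshold, 4))  # 6 0.0083
```

This matches the p_threshold = 0.05 / len(ttests) line in bar_chart(), since len(ttests) is exactly the number of unique pairs.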

One way that we could improve this function would be to bin categorical values that represent less than 5 percent of the total rows in the dataset into a category called "Other". This is useful since we wouldn't trust ANOVA or t-test results for categories with inadequate representation anyway. It would also reduce the number of t-tests for features with many values. We will come back to this shortly.

      bar_chart(df_insurance, 'region', 'charges')
      
      bar_chart(df_insurance, 'smoker', 'charges')
      
      bar_chart(df_housing, 'Electrical', 'SalePrice')
      
Notice that the last call runs into trouble: the 'Electrical' column in the housing dataset contains missing values, which break the ANOVA and t-test calculations. Let's revise the function to drop rows with missing data in the feature or label before plotting:

      def bar_chart(df, feature, label, roundto=3):
        import pandas as pd
        from scipy import stats
        from matplotlib import pyplot as plt
        import seaborn as sns
        
        # Handle missing data
        df_temp = df[[feature, label]]
        df_temp = df_temp.dropna()
        
        sns.barplot(df_temp, x=feature, y=label)
        
        # Create the label lists needed to calculate oneway-ANOVA F  
        groups = df_temp[feature].unique()
        group_lists = []
        for g in groups:
          g_list = df_temp[df_temp[feature] == g][label]
          group_lists.append(g_list)
       
        results = stats.f_oneway(*group_lists)
        F = results[0]
        p = results[1]
        
        # Next, calculate t-tests with Bonferroni correction for p-value threshold
        ttests = []
        for i1, g1 in enumerate(groups): # Use the enumerate() function to add an index for counting to a list of values
          # For each item, loop through a second list of each item to compare each pair
          for i2, g2 in enumerate(groups):
            if i2 > i1: # If the inner index is greater than the outer index, then go ahead and run a t-test
              type_1 = df_temp[df_temp[feature] == g1]
              type_2 = df_temp[df_temp[feature] == g2]
              t, p_t = stats.ttest_ind(type_1[label], type_2[label]) # Use a new name (p_t) so we don't overwrite the ANOVA p-value
            
              # Add each t-test result to a list of t, p pairs
              ttests.append([str(g1) + ' - ' + str(g2), round(t, roundto), round(p_t, roundto)])
        
        p_threshold = 0.05 / len(ttests) # Bonferroni-corrected p-value determined
        
        # Add all descriptive statistics to the diagram
        textstr  = '   ANOVA' + '\n'
        textstr += 'F: ' + str(round(F, roundto)) + '\n'
        textstr += 'p: ' + str(round(p, roundto)) + '\n\n'
        
        # Only include the significant t-tests in the printed results for brevity
        for ttest in ttests:
          if ttest[2] <= p_threshold:
            if 'Sig. comparisons (Bonferroni-corrected)' not in textstr: # Only include the header if there is at least one significant result
              textstr += 'Sig. comparisons (Bonferroni-corrected)' + '\n'
            textstr += str(ttest[0]) + ": t=" + str(ttest[1]) + ", p=" + str(ttest[2]) + '\n'
        
        plt.text(1, 0.1, textstr, fontsize=12, transform=plt.gcf().transFigure)
        plt.show()
      
      bar_chart(df_housing, 'Electrical', 'SalePrice')
      

Categorical-to-Categorical (C2C): Heat maps

Before building the C2C visualization, let's implement the binning idea mentioned earlier: a helper function that lumps any category representing less than a given share of the rows (5 percent by default) into an 'Other' category.

      def bin_categories(df, feature, cutoff=0.05, replace_with='Other'):
        # create a list of feature values that are below the cutoff percentage
        other_list = df[feature].value_counts()[df[feature].value_counts() / len(df) < cutoff].index
        
        # Replace the value of any category in that list (using the .isin() method) with 'Other'
        df.loc[df[feature].isin(other_list), feature] = replace_with
        
        return df
      
      # print out a list of values before and after using the bin_categories function
        
      print(df_housing['Neighborhood'].value_counts() / df_housing.shape[0], '\n\n')
      df_housing = bin_categories(df_housing, 'Neighborhood')
      print(df_housing['Neighborhood'].value_counts() / df_housing.shape[0])
      
      # Output:
      # Neighborhood
      # NAmes      0.154110
      # CollgCr    0.102740
      # OldTown    0.077397
      # Edwards    0.068493
      # Somerst    0.058904
      # Gilbert    0.054110
      # NridgHt    0.052740
      # Sawyer     0.050685
      # NWAmes     0.050000
      # SawyerW    0.040411
      # BrkSide    0.039726
      # Crawfor    0.034932
      # Mitchel    0.033562
      # NoRidge    0.028082
      # Timber     0.026027
      # IDOTRR     0.025342
      # ClearCr    0.019178
      # SWISU      0.017123
      # StoneBr    0.017123
      # MeadowV    0.011644
      # Blmngtn    0.011644
      # BrDale     0.010959
      # Veenker    0.007534
      # NPkVill    0.006164
      # Blueste    0.001370
      # Name: count, dtype: float64 
        
       
      # Neighborhood
      # Other      0.330822
      # NAmes      0.154110
      # CollgCr    0.102740
      # OldTown    0.077397
      # Edwards    0.068493
      # Somerst    0.058904
      # Gilbert    0.054110
      # NridgHt    0.052740
      # Sawyer     0.050685
      # NWAmes     0.050000
      # Name: count, dtype: float64
      
Now let's create the C2C function. For two categorical columns, we will build a crosstab table, run a chi-squared (X2) test of independence on it, and display the result as a heat map with the X2 statistic, p-value, and degrees of freedom embedded:

      def crosstab(df, feature, label, roundto=3):
        import pandas as pd
        from scipy.stats import chi2_contingency
        from matplotlib import pyplot as plt
        import seaborn as sns
        import numpy as np
        
        # Handle missing data
        df_temp = df[[feature, label]]
        df_temp = df_temp.dropna()
        
        # Bin categories
        df_temp = bin_categories(df_temp, feature)
        
        # Generate the crosstab table required for X2
        crosstab = pd.crosstab(df_temp[feature], df_temp[label])
        
        # Calculate X2, p-value, degrees of freedom, and the expected frequencies
        X, p, dof, expected = chi2_contingency(crosstab)
        
        textstr  = 'X2: ' + str(round(X, roundto)) + '\n'
        textstr += 'p = ' + str(round(p, roundto)) + '\n'
        textstr += 'dof  = ' + str(dof)
        plt.text(0.9, 0.1, textstr, fontsize=12, transform=plt.gcf().transFigure)
        
        # Display the expected frequencies (rounded to whole counts) as a heat map
        expected_df = pd.DataFrame(np.rint(expected).astype('int64'), columns=crosstab.columns, index=crosstab.index)
        sns.heatmap(expected_df, annot=True, fmt='d', cmap='coolwarm')
        plt.show()
      
      crosstab(df_housing, 'Neighborhood', 'OverallCond')
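One detail worth knowing: the fourth value returned by chi2_contingency() is the table of expected frequencies, meaning the counts we would see if the two variables were completely independent, rather than the raw observed counts. A tiny made-up table shows the difference:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts for a hypothetical 2x2 relationship
observed = np.array([[30, 10],
                     [20, 40]])
X, p, dof, expected = chi2_contingency(observed)
print(dof)       # 1
print(expected)  # [[20. 20.]
                 #  [30. 30.]]
```

Each expected cell is (row total * column total) / grand total, so the farther the observed counts fall from these values, the larger X2 gets and the smaller the p-value becomes.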