Putting It All Together

Now that we have a set of good visualization functions to go with our bivariate() stats function, let's combine all of these into a single function that will allow us to complete the exploratory data analysis phase of data mining projects. To do this, let's copy the most recent version of the bivariate() function and then integrate each of the visualization functions we created into the appropriate places in its flow control:

      def bivariate(df, label, roundto=4):
        import pandas as pd
        from scipy import stats
        
        output_df = pd.DataFrame(columns=['missing', 'p', 'r', 'τ', 'ρ', 'y = m(x) + b', 'F', 'X2', 'skew', 'unique', 'values'])
        
        for feature in df.columns:
          if feature != label:
            df_temp = df[[feature, label]]
            df_temp = df_temp.dropna()
            missing = (df.shape[0] - df_temp.shape[0]) / df.shape[0]
            unique = df_temp[feature].nunique()
        
            # Bin low-count categories in the working copy so the tests and plots below use the binned values
            if not pd.api.types.is_numeric_dtype(df_temp[feature]):
              df_temp = bin_categories(df_temp, feature)
        
            # Numeric feature and numeric label: linear regression plus rank correlations
            if pd.api.types.is_numeric_dtype(df_temp[feature]) and pd.api.types.is_numeric_dtype(df_temp[label]):
              m, b, r, p, err = stats.linregress(df_temp[feature], df_temp[label])
              tau, tp = stats.kendalltau(df_temp[feature], df_temp[label])
              rho, rp = stats.spearmanr(df_temp[feature], df_temp[label])
              output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), round(r, roundto), round(tau, roundto),
                                        round(rho, roundto), f'y = {round(m, roundto)}(x) + {round(b, roundto)}', '-', '-',
                                        round(df_temp[feature].skew(), roundto), unique, '-']
        
              scatterplot(df_temp, feature, label, roundto) # Call the scatterplot function
            # Categorical feature and categorical label: chi-square test of independence
            elif not pd.api.types.is_numeric_dtype(df_temp[feature]) and not pd.api.types.is_numeric_dtype(df_temp[label]):
              contingency_table = pd.crosstab(df_temp[feature], df_temp[label])
              X2, p, dof, expected = stats.chi2_contingency(contingency_table)
              output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), '-', '-', '-', '-', '-', round(X2, roundto), '-',
                                        unique, df_temp[feature].unique()]
      
              crosstab(df_temp, feature, label, roundto) # Call the crosstab function
            # One numeric and one categorical feature: one-way ANOVA across the category groups
            else:
              if pd.api.types.is_numeric_dtype(df_temp[feature]):
                skew = round(df_temp[feature].skew(), roundto)
                num = feature
                cat = label
              else:
                skew = '-'
                num = label
                cat = feature
      
              groups = df_temp[cat].unique()
              group_lists = []
              for g in groups:
                g_list = df_temp[df_temp[cat] == g][num]
                group_lists.append(g_list)
      
              F, p = stats.f_oneway(*group_lists) # One-way ANOVA: F statistic and p-value
              output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), '-', '-', '-', '-', round(F, roundto), '-', skew,
                                        unique, df_temp[cat].unique()]
        
              bar_chart(df_temp, cat, num, roundto) # Call the bar_chart function
        
        return output_df.sort_values(by=['p']) # Return the summary only after every feature has been processed
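
As a quick reminder, this version of bivariate() assumes the four helper functions we built earlier are defined in the same notebook (or imported). The stubs below are only a sketch, with signatures inferred from how the helpers are called above; your own versions may accept additional parameters:

      def bin_categories(df, feature):              # groups categories; built earlier in the chapter
        ...
      
      def scatterplot(df, feature, label, roundto): # numeric feature vs. numeric label
        ...
      
      def crosstab(df, feature, label, roundto):    # categorical feature vs. categorical label
        ...
      
      def bar_chart(df, cat, num, roundto):         # categorical feature vs. numeric feature
        ...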
      

Notice how the four new functions are integrated: bin_categories() is called right after the working copy of the data is created, before the flow control that checks the relationship type, and the three visualization functions are each called inside the branch that matches their relationship type. Now, let's test out the function with a few datasets:

      bivariate(df_insurance, 'charges')
      
      # See the output in your own notebook; very long
      
      bivariate(df_airline, 'satisfaction')
      
      # See the output in your own notebook; very long
      
      bivariate(df_housing, 'SalePrice')
      
      # See the output in your own notebook; very long
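
Because bivariate() also returns the summary table, you can capture the result and query it like any other DataFrame. Here is a minimal sketch (assuming df_insurance is already loaded, as above) that keeps only the features whose p-values fall below 0.05:

      results = bivariate(df_insurance, 'charges') # Capture the returned summary DataFrame
      significant = results[results['p'] < 0.05]   # Keep features with p-values below 0.05
      print(significant.index.tolist())            # List the feature names that pass the cutoff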
      

Okay, we have just begun to scratch the surface of the kind of automation that can help you perform the exploratory data analysis, or "Data Understanding," phase of the data mining life cycle. Hopefully, you have already begun to think about ways you can improve these functions. But before we end, don't forget to add these functions to an external .py file that you can use to keep track of all your favorite automation functions. Then you can call them from within any .ipynb file like this:

      import sys
      sys.path.append('/content/drive/MyDrive/Colab Notebooks/class/IS455/In-class notebooks') # Folder that contains functions.py
      import functions as fun
        
      fun.bivariate(df_insurance, "charges")
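
One caveat worth knowing: Python caches imported modules, so if you edit functions.py after importing it, your changes will not take effect until you restart the runtime or reload the module. A minimal sketch using the standard library's importlib:

      import importlib
      
      fun = importlib.reload(fun) # Re-import functions.py so recent edits take effect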