6.4 Putting It All Together
Now that we have a set of visualization functions to go with our bivariate() stats function, let's combine them into a single function that handles the bivariate portion of the exploratory data analysis phase of a data mining project. To do this, let's copy the last version of the bivariate() function and call each of the helper functions we created at the appropriate place in the flow control:
def bivariate(df, label, roundto=4):
    import pandas as pd
    from scipy import stats

    output_df = pd.DataFrame(columns=['missing', 'p', 'r', 'τ', 'ρ', 'y = m(x) + b', 'F', 'X2', 'skew', 'unique', 'values'])

    for feature in df.columns:
        if feature != label:
            # Bin categories before building df_temp so the binned values are the ones analyzed
            if not pd.api.types.is_numeric_dtype(df[feature]):
                df = bin_categories(df, feature)

            # Keep only the rows where both the feature and the label are present
            df_temp = df[[feature, label]]
            df_temp = df_temp.dropna()
            missing = (df.shape[0] - df_temp.shape[0]) / df.shape[0]
            unique = df_temp[feature].nunique()

            # Numeric feature, numeric label: correlation and regression
            if pd.api.types.is_numeric_dtype(df_temp[feature]) and pd.api.types.is_numeric_dtype(df_temp[label]):
                m, b, r, p, err = stats.linregress(df_temp[feature], df_temp[label])
                tau, tp = stats.kendalltau(df_temp[feature], df_temp[label])
                rho, rp = stats.spearmanr(df_temp[feature], df_temp[label])
                output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), round(r, roundto), round(tau, roundto),
                                          round(rho, roundto), f'y = {round(m, roundto)}(x) + {round(b, roundto)}', '-', '-',
                                          round(df_temp[feature].skew(), roundto), unique, '-']
                scatterplot(df_temp, feature, label, roundto)  # Call the scatterplot function

            # Categorical feature, categorical label: chi-square test
            elif not pd.api.types.is_numeric_dtype(df_temp[feature]) and not pd.api.types.is_numeric_dtype(df_temp[label]):
                contingency_table = pd.crosstab(df_temp[feature], df_temp[label])
                X2, p, dof, expected = stats.chi2_contingency(contingency_table)
                output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), '-', '-', '-', '-', '-', round(X2, roundto), '-',
                                          unique, df_temp[feature].unique()]
                crosstab(df_temp, feature, label, roundto)  # Call the crosstab function

            # One numeric and one categorical variable: one-way ANOVA
            else:
                if pd.api.types.is_numeric_dtype(df_temp[feature]):
                    skew = round(df_temp[feature].skew(), roundto)
                    num = feature
                    cat = label
                else:
                    skew = '-'
                    num = label
                    cat = feature

                # Build one list of numeric values per category
                groups = df_temp[cat].unique()
                group_lists = []
                for g in groups:
                    g_list = df_temp[df_temp[cat] == g][num]
                    group_lists.append(g_list)

                results = stats.f_oneway(*group_lists)
                F = results[0]
                p = results[1]
                output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto), '-', '-', '-', '-', round(F, roundto), '-', skew,
                                          unique, df_temp[cat].unique()]
                bar_chart(df_temp, cat, num, roundto)  # Call the bar_chart function

    return output_df.sort_values(by=['p'])
Notice that we integrated the four new functions we created. The bin_categories() function is called before the flow control that determines the relationship type. Then, each of the three visualization functions is called within its own branch of that flow control: scatterplot() for numeric-to-numeric relationships, crosstab() for categorical-to-categorical relationships, and bar_chart() for relationships that mix one numeric and one categorical variable. Now, let's test out the function with a few datasets:
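To see the branching logic in isolation, here is a minimal sketch that only classifies each feature/label pair and runs no statistics or plots. The relationship_type() helper and the df_demo DataFrame are hypothetical names with made-up values, used purely for illustration; only pd.api.types.is_numeric_dtype() comes from the function above.

```python
import pandas as pd

def relationship_type(df, feature, label):
    """Classify a feature/label pair the same way the branches in bivariate() do."""
    feature_numeric = pd.api.types.is_numeric_dtype(df[feature])
    label_numeric = pd.api.types.is_numeric_dtype(df[label])
    if feature_numeric and label_numeric:
        return 'numeric-numeric'          # regression/correlation + scatterplot()
    if not feature_numeric and not label_numeric:
        return 'categorical-categorical'  # chi-square + crosstab()
    return 'mixed'                        # one-way ANOVA + bar_chart()

# Tiny made-up DataFrame, for illustration only
df_demo = pd.DataFrame({
    'age': [19, 33, 45, 52],
    'smoker': ['yes', 'no', 'no', 'yes'],
    'region': ['east', 'west', 'east', 'west'],
    'charges': [1200.5, 830.0, 990.25, 2100.0],
})

for col in ['age', 'smoker']:
    print(col, 'vs charges ->', relationship_type(df_demo, col, 'charges'))
print('region vs smoker ->', relationship_type(df_demo, 'region', 'smoker'))
```

Because the final branch is a catch-all, it does not matter whether the numeric variable is the feature or the label; bivariate() sorts that out inside the branch by assigning num and cat.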
bivariate(df_insurance, 'charges')
# See the output in your own notebook; very long
bivariate(df_airline, 'satisfaction')
# See the output in your own notebook; very long
bivariate(df_housing, 'SalePrice')
# See the output in your own notebook; very long
Okay, we have only begun to scratch the surface of the kind of automation that can help you carry out exploratory data analysis, the "Data Understanding" phase of the data mining life cycle. Hopefully, you have already begun to think about ways that you can improve these functions. But before we end, don't forget to add these functions to an external .py file that you can use to keep track of all your favorite automation functions. Then you can call these functions from within any .ipynb file like this:
import sys
sys.path.append('/content/drive/MyDrive/Colab Notebooks/class/IS455/In-class notebooks')
import functions as fun
fun.bivariate(df_insurance, "charges")
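If you want to try the import pattern without touching your own Drive folder, the following sketch reproduces it in a temporary directory. The module name my_functions and the toy describe() function are hypothetical stand-ins, not part of the functions built in this chapter. The sketch also shows importlib.reload(), which may be useful when you edit the .py file while a notebook session is still running.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical folder and module, created in a temp directory for this demo
module_dir = Path(tempfile.mkdtemp())
(module_dir / 'my_functions.py').write_text(
    "def describe(values):\n"
    "    # Toy stand-in for a real automation function\n"
    "    return {'n': len(values), 'mean': sum(values) / len(values)}\n"
)

# Make the folder importable, avoiding duplicate entries if the cell is re-run
if str(module_dir) not in sys.path:
    sys.path.append(str(module_dir))

# Clear the import system's cached directory listings, since the file is brand new
importlib.invalidate_caches()

import my_functions as fun

# After editing the .py file, reload it so the running session sees the changes
fun = importlib.reload(fun)

summary = fun.describe([1, 2, 3, 4])
print(summary)  # {'n': 4, 'mean': 2.5}
```

Without the reload step, Python keeps using the version of the module that was imported first, so edits to the file would not show up until the runtime is restarted.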