6.5 Practice
Next, work through these practice problems to assess how well you can automate your analysis:
The bivariate_stats function is a good start, but there are ways to improve it. As written, it does not tell us whether the statistics we used to analyze each relationship are truly valid. What does that mean? You might remember that the Pearson correlation coefficient depends on the assumption that both the feature and the label are normally distributed. What if they aren't?
Modify the function to also calculate Kendall's tau (τ) correlation (best for rank-ordered/ordinal data) and Spearman's rho (ρ) correlation (best when the feature or label is not normally distributed or the relationship is non-linear). Add these columns next to the Pearson r column. To keep things simpler, do not worry about including the separate p-values for the tau and rho correlations. However, add two more columns at the end of the output_df: one with the skewness of each numeric feature and the other with a count of the number of unique values in that column. The analyst will need this information to determine which correlation metric to use.
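If you want a starting point, here is a minimal sketch of how the extra columns could be computed. It shows only the numeric-feature branch, and the column names, rounding, and handling of categorical features are assumptions; adapt them to match the chapter's version of bivariate_stats.

```python
import pandas as pd
from scipy import stats

def bivariate_stats(df, label, roundto=4):
    # Sketch: numeric features only; categorical features would be handled
    # however the chapter's original function handles them
    output_df = pd.DataFrame(columns=['missing', 'p', 'r', 'tau', 'rho',
                                      'skew', 'unique'])

    for feature in df.columns:
        if feature == label:
            continue
        # Drop rows where either the feature or the label is missing
        df_temp = df[[feature, label]].dropna()
        missing = (len(df) - len(df_temp)) / len(df)

        if pd.api.types.is_numeric_dtype(df_temp[feature]):
            r, p = stats.pearsonr(df_temp[feature], df_temp[label])
            tau, _ = stats.kendalltau(df_temp[feature], df_temp[label])  # p-value skipped
            rho, _ = stats.spearmanr(df_temp[feature], df_temp[label])   # p-value skipped
            output_df.loc[feature] = [f'{missing:.2%}', round(p, roundto),
                                      round(r, roundto), round(tau, roundto),
                                      round(rho, roundto),
                                      round(df_temp[feature].skew(), roundto),
                                      df_temp[feature].nunique()]
        else:
            # Placeholder for the categorical branch of the original function
            output_df.loc[feature] = [f'{missing:.2%}', '-', '-', '-', '-',
                                      '-', df_temp[feature].nunique()]

    return output_df
```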
Test this function on the housing dataset using SalesPrice as the label. The output should look similar to this:
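A usage sketch might look like the following; the file name housing.csv is a placeholder, and depending on which version of the housing data you have, the label column may be spelled SalesPrice or SalePrice.

```python
df = pd.read_csv('housing.csv')    # placeholder path; point this at your copy of the data
bivariate_stats(df, 'SalesPrice')  # or 'SalePrice', depending on the file
```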
If you have learned about data cleaning, then you might remember that categorical variables need a minimum representation for each categorical value (often 5%). For example, let's say you have a 'Country' variable with 200 rows distributed something like this: US = 75 rows, China = 75 rows, Brazil = 40 rows, Germany = 8 rows, Thailand = 1 row, Japan = 1 row. Three of those countries have value counts that comprise less than 5% of the overall dataset. Typically, we will bin those values into a bucket like 'Other' since it wouldn't be valid to draw conclusions from under-represented data.
The "5 percent" rule discussed above is just a rough guideline and not necessarily a strict rule. For example, what if you have very large datasets where 4 percent of the records is still 50 total records? The point is that it may be useful to have both a percent cutoff as well as a numeric count cutoff. For example, maybe we only want to bin if the category group value is BOTH below 5 percent AND lower than 20 (or some other n) records.
Modify the bin_categories() function from the chapter to include this logic and test it using the housing dataset with the neighborhood feature. To do this, print out the value_counts() of the neighborhood feature before and after binning, showing both the count and the percent of values in the "before" version. It should look similar to this:
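A testing sketch under the same assumptions follows; the column name 'Neighborhood', the file path, and the min_count of 50 are guesses based on the expected output, so adjust them to your data.

```python
df = pd.read_csv('housing.csv')  # placeholder path

# "Before": counts and percentages side by side
before = pd.concat([df['Neighborhood'].value_counts(),
                    df['Neighborhood'].value_counts(normalize=True)],
                   axis=1, keys=['count', 'percent'])
print(before)

# Bin with both cutoffs, then look at the "after" counts
df = bin_categories(df, 'Neighborhood', cutoff=0.05, min_count=50)
print(df['Neighborhood'].value_counts())
```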
Notice that more of the neighborhoods were kept (without being binned into "Other") with this modified function because they still have enough rows (more than 50), even though they make up less than 5 percent of the dataset. Now you have a function that gives you greater control over binning.