19.4 Data Understanding
The main question for data exploration in this process is whether any data are missing. We check the size of the DataFrame and count the missing values in each column.
print(df.shape)   # (number of rows, number of columns)
df.isna().sum()   # count of missing values in each column
# Output:
# (8807, 12)
# show_id 0
# type 0
# title 0
# director 2634
# cast 825
# country 831
# date_added 10
# release_year 0
# rating 4
# duration 3
# listed_in 0
# description 0
# dtype: int64
We do have some missing data to address: director, cast, country, date_added, rating, and duration all contain missing values, with director missing in 2,634 of the 8,807 rows. We'll take care of that in the next phase. In supervised machine learning contexts, we often examined the distributions of categorical features to see whether any groups were under-represented. That requirement does not apply to the type of algorithm we will be using here, so we won't do any other data exploration right now.
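To put the missing counts in perspective, it can help to express them as a share of all rows. The following is a minimal sketch, not part of the original workflow: df_demo is a small hypothetical DataFrame standing in for df, and the summary table and its column names are our own choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the real dataset has 8,807 rows and 12 columns
df_demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "R", np.nan, "TV-MA"],
})

# Count of missing values and percentage of rows missing, per column
summary = pd.DataFrame({
    "missing": df_demo.isna().sum(),
    "pct_missing": (df_demo.isna().mean() * 100).round(2),
})

# Keep only the columns that have missing values, largest first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Applied to df, the same steps would show that director is missing in roughly 30% of rows (2,634 of 8,807), while date_added, rating, and duration are each missing in well under 1%.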