Data Preparation

Address Missing Data

Let's address that missing data now. The best fix would be to look up each movie or TV show with missing data on imdb.com. I checked the first couple of records with missing data, and their full details were available online. However, let's say hypothetically that the information is unavailable for some reason. Maybe there is no cast because it's a nature documentary with no narration. In that case, I'd recommend creating a single category to represent all of these gaps. In the code below, I'm replacing missing directors, casts, and countries with the text "unknown". You could just as well name it "not applicable". The label carries the same information whichever name you choose.

Next, I'm filling the missing date_added and duration values with the mode (the most frequent value) of each column. Finally, I'm dropping any row that is missing rating. Is this the right thing to do? Maybe, maybe not. You can decide. I just wanted to remind you of some of the options. You may also remember from a prior chapter that we can predict missing values using sklearn's IterativeImputer or KNNImputer, although both expect numeric features, so text columns like these would need to be encoded first.

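      # Label missing categorical values with an explicit "unknown" category
      df['director'] = df['director'].fillna('unknown')
      df['cast'] = df['cast'].fillna('unknown')
      df['country'] = df['country'].fillna('unknown')
      
      # Fill date_added and duration with each column's own most frequent value
      df['date_added'] = df['date_added'].fillna(df['date_added'].mode()[0])
      df['duration'] = df['duration'].fillna(df['duration'].mode()[0])
      
      # Drop the rows that have no rating
      df.dropna(subset=['rating'], inplace=True)
      
      # Very important step: renumber the rows 0..n-1 (drop=True discards the old index)
      df.reset_index(drop=True, inplace=True)
      
      print(df.isna().sum(), '\n')
      df.shape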
      
      # Output:
      # show_id         0
      # type            0
      # title           0
      # director        0
      # cast            0
      # country         0
      # date_added      0
      # release_year    0
      # rating          0
      # duration        0
      # listed_in       0
      # description     0
      # dtype: int64 
        
      # (8803, 12)
      

Okay, missing data addressed. But what is with that line that resets the df index with the comment "Very important step"? Remember during the collaborative filtering example when we had to create an itemID to matrix index mapping dictionary? Then we created an item_inv_mapper object that reversed the mapping? We did that because we had to drop a bunch of movie IDs, which left gaps in the movie ID list (e.g., 1, 2, 5, 10, 23). However, the user-item matrix needed to be consecutively ordered. The same is true for the matrix we are about to create: dropping the rows with a missing rating leaves gaps in the DataFrame's index. We could, once again, create an item_mapper and item_inv_mapper pair of dictionaries to handle this. Or we could simply reset the index of the DataFrame so that each movie/show is consecutively numbered. I wanted you to see that both are valid options for creating the similarity matrix.
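To make the two options concrete, here is a minimal sketch on a toy DataFrame. The data and the names toy, gapped_labels, and toy_reset are made up for illustration; item_mapper and item_inv_mapper only mimic the dictionaries from the collaborative filtering example and are not that chapter's exact code.

```python
import pandas as pd

# Hypothetical five-row catalog standing in for the real DataFrame.
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "rating": ["PG", None, "R", None, "TV-MA"],
    }
)

# Dropping rows keeps the original index labels, so gaps appear.
toy = toy.dropna(subset=["rating"])
gapped_labels = toy.index.tolist()

# Option 1: keep the gapped labels and translate with two dictionaries.
item_mapper = {label: pos for pos, label in enumerate(gapped_labels)}
item_inv_mapper = {pos: label for label, pos in item_mapper.items()}

# Option 2: renumber the rows so each label equals its matrix position.
toy_reset = toy.reset_index(drop=True)

print(gapped_labels)             # [0, 2, 4]
print(item_mapper)               # {0: 0, 2: 1, 4: 2}
print(toy_reset.index.tolist())  # [0, 1, 2]
```

With option 1, every lookup has to pass through a dictionary. With option 2, row i of the DataFrame is row i of the similarity matrix, which is why a single reset_index call is enough here.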

Let's proceed with some of the modeling-specific data preparation that needs to happen before we can make recommendations.