Practice

Work through the practice problems below.

Practice #1: Create a pipeline for content filtering.

As you know, functions create reusable code that can save us time and enable smooth pipelines. You already have a function from this chapter for making content filtering-based recommendations. For this practice, you will create functions for all remaining steps in the pipeline.

First, create a function that performs the same cleaning steps that we performed in the chapter for the netflix_titles.csv dataset. This function does not need to be particularly dynamic. It should just return a dataset with no missing values that has had the index reset so that all movie IDs are consecutive in the index.
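As a starting point, here is a minimal sketch of such a cleaning function. It assumes that cleaning amounts to dropping every row with a missing value and then resetting the index; the chapter's exact steps may differ. The function name clean_titles() and the toy rows are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

def clean_titles(df):
    """Drop rows with missing values and renumber the index from 0."""
    # dropna() removes any row that contains at least one missing value;
    # reset_index(drop=True) discards the old index so IDs are consecutive.
    return df.dropna().reset_index(drop=True)

# Toy rows standing in for netflix_titles.csv (hypothetical data)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "cast": ["x, y", None, "z"],
})
cleaned = clean_titles(toy)
print(cleaned)
```

With the real file, you would pass the result of pd.read_csv("netflix_titles.csv") instead of the toy DataFrame.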

Next, create a function that calculates TF-IDF scores and generates the similarity matrix. It should take the cleaned DataFrame returned by the prior function and, as a second parameter, the name of the unstructured text feature used to calculate the TF-IDF scores. Call this function tfidf_matrix().
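One possible shape for tfidf_matrix() is sketched below. It assumes scikit-learn's TfidfVectorizer with default settings and cosine similarity as the similarity measure; the chapter may use different settings or a different kernel. The toy cast strings are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_matrix(df, feature="description"):
    """Return a square item-by-item similarity matrix for one text feature."""
    # Turn each row's text into a vector of TF-IDF scores
    vectorizer = TfidfVectorizer()
    scores = vectorizer.fit_transform(df[feature])
    # Compare every row against every other row (assumed: cosine similarity)
    return cosine_similarity(scores)

# Toy data (hypothetical); rows 0 and 1 share a cast member, row 2 does not
toy = pd.DataFrame({
    "cast": ["Tom Hanks, Meg Ryan", "Tom Hanks, Tim Allen", "Emma Stone"],
})
sim = tfidf_matrix(toy, feature="cast")
print(sim.shape)
```

Because the feature name is a parameter, the same function works for "description", "cast", or any other text column.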

Then, either use the existing function from the chapter called get_recommendations() or create a new one to accomplish the same purpose. It should accept an item_id, the similarity matrix, and the number of recommendations you want returned. It should return the recommended movie IDs (not titles) in the order of similarity as well as the similarity scores. Again, you could just copy the function from the chapter.
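If you write your own version, a sketch along these lines would meet the requirements. It is not the chapter's get_recommendations(); it is a stand-in that assumes the similarity matrix is a square NumPy array whose row positions match the movie IDs.

```python
import numpy as np

def get_recommendations(item_id, sim_matrix, n=5):
    """Return (ids, scores) of the n items most similar to item_id."""
    scores = np.asarray(sim_matrix[item_id], dtype=float)
    # Sort positions from most to least similar
    order = np.argsort(-scores, kind="stable")
    # Leave out the item itself, then keep the top n
    order = order[order != item_id][:n]
    return order.tolist(), scores[order].tolist()

# Hand-made 3x3 similarity matrix (hypothetical values)
sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])
ids, scores = get_recommendations(0, sim, n=2)
print(ids, scores)
```

Returning IDs rather than titles keeps this function independent of the DataFrame, so the title lookup can happen in a separate step.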

Then, create a function that prints its results instead of returning them. It should accept a movie title and anything else it needs to work well. It prints the n titles recommended for the submitted title. In other words, it calls get_recommendations() to get movie IDs and then converts them into titles. If you enter an invalid title, it should tell you that the title was invalid and suggest 10 other random titles to try instead. For a valid title, it should print the following text: "If you liked [movie name] starring [cast], then you may like these other movies including a similar cast:" Then it should print a DataFrame of the recommended movie titles, casts, and similarity scores.
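The sketch below shows one way this printing step could look. The name print_recommendations() is hypothetical, the included get_recommendations() is a compact stand-in rather than the chapter's version, and the toy data and similarity values are made up.

```python
import numpy as np
import pandas as pd

def get_recommendations(item_id, sim_matrix, n=5):
    # Stand-in for the chapter's function: IDs and scores, most similar first
    scores = np.asarray(sim_matrix[item_id], dtype=float)
    order = np.argsort(-scores, kind="stable")
    order = order[order != item_id][:n]
    return order.tolist(), scores[order].tolist()

def print_recommendations(title, df, sim_matrix, n=5):
    """Print n recommended titles for `title`; nothing is returned."""
    matches = df.index[df["title"] == title]
    if len(matches) == 0:
        # Invalid title: say so and suggest up to 10 random alternatives
        print(f'"{title}" is not a valid title. Try one of these instead:')
        print(df["title"].sample(n=min(10, len(df))).to_string(index=False))
        return
    item_id = matches[0]
    ids, scores = get_recommendations(item_id, sim_matrix, n)
    print(f"If you liked {title} starring {df.loc[item_id, 'cast']}, then you may "
          "like these other movies including a similar cast:")
    # Convert the recommended IDs back into titles and casts
    print(df.loc[ids, ["title", "cast"]].assign(similarity=scores))

# Toy data and a hand-made similarity matrix (hypothetical values)
toy = pd.DataFrame({"title": ["A", "B", "C"], "cast": ["x, y", "x, z", "w"]})
sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print_recommendations("A", toy, sim, n=2)
print_recommendations("Not a movie", toy, sim, n=2)
```

This sketch matches titles exactly, so a misspelled title takes the invalid-title branch. If several rows share a title, it uses the first match.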

Finally, call these functions in the proper order using the netflix_titles.csv dataset once again, but use the "cast" feature instead of the "description" feature to calculate the similarity scores. Get 5 recommendations for the movie "The Polar Express". Then try spelling the movie title incorrectly to check that the invalid-title logic in the last function works.

Click the Colab icon to the right to see one possible solution to this practice problem.

Practice #2: Extending Similarity

Using the same Netflix dataset as found in the chapter, and the pipeline you created in the prior practice problem, expand the content filtering to include all of the following features together: type, title, director, cast, country, listed_in, and description. One way to do this is to merge/concatenate each of those string features into a single column and base your model on that combined column. However, be sure to clean the data first by addressing all missing data.
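The merge/concatenate idea can be sketched as follows. This version assumes missing values are filled with empty strings before joining; dropping incomplete rows is another valid way to address missing data. The function name add_combined_text(), the column name "combined", and the toy row are hypothetical.

```python
import pandas as pd

FEATURES = ["type", "title", "director", "cast",
            "country", "listed_in", "description"]

def add_combined_text(df, features=FEATURES, new_col="combined"):
    """Return a copy of df with the text features joined into one column."""
    out = df.copy()
    # Fill missing values with "" (an assumption), then join each row's
    # values with single spaces into one string
    out[new_col] = out[features].fillna("").astype(str).agg(" ".join, axis=1)
    return out

# One toy row with a missing director (hypothetical data)
toy = pd.DataFrame({
    "type": ["Movie"], "title": ["A"], "director": [None], "cast": ["x"],
    "country": ["US"], "listed_in": ["Drama"], "description": ["d"],
})
combined = add_combined_text(toy)
print(combined["combined"].iloc[0])
```

The new column can then be passed by name as the text feature to tfidf_matrix() from Practice #1.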

There are many ways to accomplish this task using the pipeline you have already created in Practice #1. Feel free to modify any of the prior functions to make this work.

Click the Colab icon to the right to see one possible solution to this practice problem.