19.6 Modeling
Modeling Preparation
Similar to the collaborative filtering context, we have to get the data into a particular format before we can establish similarity scores. You might remember that for collaborative filtering we generated a sparse user-item-rating matrix, which was basically a table indicating the rating each user gave each item. We are going to do something similar again. The difference is that we don't have ratings in this data. Instead of a rating, we are going to generate a tokenized version of the movie/show description column. Then we are going to calculate a score that represents how important, or unique, each word is and sum up the score each description gets for having those words. Understand? Probably not yet. Let's run the code and then it will be a bit clearer.
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer and Remove stopwords
tfidf = TfidfVectorizer(stop_words='english')
# Fit and transform the data to a tfidf matrix
tfidf_matrix = tfidf.fit_transform(df['description'])
# Print the shape of the tfidf_matrix
print(tfidf_matrix.shape)
# Preview the matrix by placing it into a DataFrame (which we won't need later)
df_tfidf = pd.DataFrame(tfidf_matrix.T.todense(), index=tfidf.get_feature_names_out(), columns=df['description'])
df_tfidf.iloc[2221:2226]
# Output
# (8803, 18891)
So what exactly are you looking at? This table shows the tfidf_matrix we generated, placed in a Pandas DataFrame. To be clear, we don't need this DataFrame; I just printed it out so that you could see what the tfidf_matrix looks like. Each of the 8803 columns contains one of the movie/show descriptions. The rows contain the 18891 unique words (minus stopwords) that appear in the corpus of those descriptions, and the row index is sorted alphabetically. I've highlighted the index for the word "boy" because it is an n-gram that appears in the description of the farthest-right movie in the image above. That is why the cell for that row and column combination has a positive score, which we call a TF-IDF score.
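If you want to poke at that row yourself, a quick optional check is to pull the "boy" row out of the preview DataFrame and see which descriptions weight that word most heavily. This is just a sketch; it assumes "boy" survived the stopword filter and is in the fitted vocabulary, as the index above suggests.
# Optional sketch: inspect the TF-IDF weights for the word "boy" across all
# descriptions (assumes "boy" is in the fitted vocabulary)
df_tfidf.loc['boy'].sort_values(ascending=False).head()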
The score in the table is the term frequency - inverse document frequency (TF-IDF). TF-IDF is a common statistic used in natural language processing that measures how important, or unique, a term is within a document relative to the overall document collection. It is the product of two measures: term frequency (TF) and inverse document frequency (IDF). TF is the number of times a word or n-gram appears in a specific document divided by the number of words in that document. IDF is calculated for each n-gram as: ln((the total number of documents) / (the number of documents containing the n-gram)). For example, a word that appears in only one description gets the largest possible IDF, so it tends to earn a high TF-IDF score within that description. The table below gives the exact TF, IDF, and TF-IDF scores for four sample documents:
Document | the | dog | ran | cat | fast | was | happy | again
---|---|---|---|---|---|---|---|---
The dog ran | (1/3) * ln(4/4) = 0 | (1/3) * ln(4/2) = 0.231 | (1/3) * ln(4/2) = 0.231 | (0/3) * ln(4/2) = 0 | (0/3) * ln(4/2) = 0 | (0/3) * ln(4/2) = 0 | (0/3) * ln(4/1) = 0 | (0/3) * ln(4/1) = 0
The cat ran fast | (1/4) * ln(4/4) = 0 | (0/4) * ln(4/2) = 0 | (1/4) * ln(4/2) = 0.173 | (1/4) * ln(4/2) = 0.173 | (1/4) * ln(4/2) = 0.173 | (0/4) * ln(4/2) = 0 | (0/4) * ln(4/1) = 0 | (0/4) * ln(4/1) = 0
The dog was happy | (1/4) * ln(4/4) = 0 | (1/4) * ln(4/2) = 0.173 | (0/4) * ln(4/2) = 0 | (0/4) * ln(4/2) = 0 | (0/4) * ln(4/2) = 0 | (1/4) * ln(4/2) = 0.173 | (1/4) * ln(4/1) = 0.347 | (0/4) * ln(4/1) = 0
The cat was fast again | (1/5) * ln(4/4) = 0 | (0/5) * ln(4/2) = 0 | (0/5) * ln(4/2) = 0 | (1/5) * ln(4/2) = 0.138 | (1/5) * ln(4/2) = 0.138 | (1/5) * ln(4/2) = 0.138 | (0/5) * ln(4/1) = 0 | (1/5) * ln(4/1) = 0.277
Let's break down the first row of data. In the document "The dog ran," the word "The" appears 1 time and there are 3 words total, making the TF score 0.3333 repeating. There are 4 total docs and all 4 include the word "The," making the IDF score zero because ln(4/4) = 0. Therefore, the TF-IDF score is 0.3333 * 0 = 0. Again, the purpose of TF-IDF is to provide a relative measure of how unique a word is across all documents and within the document the score is being calculated for. As a result, the word "The" has no importance because it appears in every document; that is no different from a word that doesn't appear in a particular document at all. Both get a TF-IDF score of zero. The highest scores come from words that appear in few documents and that sit in documents with few total words.
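If you'd like to verify the toy table above, here is a minimal sketch that computes the same textbook TF-IDF formula in plain Python. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each document vector, so its scores won't match this formula exactly; the point here is only to make the arithmetic concrete.
import math
# A minimal sketch of the textbook TF-IDF formula used in the table above.
# (scikit-learn's TfidfVectorizer smooths the IDF and L2-normalizes rows,
# so its numbers differ slightly from these.)
docs = ["the dog ran",
        "the cat ran fast",
        "the dog was happy",
        "the cat was fast again"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

def tf_idf(word, tokens):
    tf = tokens.count(word) / len(tokens)                      # term frequency within this doc
    doc_freq = sum(word in other for other in tokenized)       # number of docs containing the word
    return tf * math.log(n_docs / doc_freq)                    # TF * IDF

for tokens in tokenized:
    print(" ".join(tokens), {w: round(tf_idf(w, tokens), 3) for w in vocab})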
Hopefully that makes a bit more sense and helps you understand what the tfidf_matrix object we created above holds. The next step is to calculate a similarity matrix based on the TF-IDF scores. Just as with the collaborative filtering model, there are many similarity scores we could use here; for a complete review, see 18.7. For this demonstration, we will stick with a cosine-based similarity matrix just like that example. The code below creates the matrix and prints it out in a temporary DataFrame just for viewing purposes.
from sklearn.metrics.pairwise import linear_kernel
# Compute the cosine similarity between each movie description
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# For easier viewing, put it in a dataframe
pd.DataFrame(cosine_sim)
To be clear, this is an 8803 by 8803 matrix that contains a cosine "similarity" score representing how similar the descriptions are between each pair of movies/shows. The cosine similarity score was calculated using the vectors of TF-IDF scores from the prior step. Essentially, higher scores in the matrix above indicate that the two movies/shows represented by the row and column ID numbers have descriptions with more of the same words. If these movies/shows have useful descriptions that truly represent what the content is about, then these cosine similarity scores will accurately indicate which items to recommend based on an item of interest.
You may notice that the cosine similarity scores generated in this matrix are different from those we calculated for the collaborative filtering example. That is because we used the linear_kernel() function from sklearn to compute them. For a more thorough discussion of how linear_kernel() relates to cosine similarity, see the scikit-learn documentation. But essentially, the scale is reversed here: higher numbers mean "more similar," whereas in the collaborative filtering example in the prior chapter a lower score represented a smaller angle and, thus, greater similarity. As a result, the diagonal of the table contains all 1.0000 values (the maximum possible) because that is the cosine similarity of a movie description with itself.
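As a quick sanity check (a sketch, assuming the tfidf_matrix from earlier is still in memory), you can confirm that the dot products from linear_kernel() match sklearn's cosine_similarity() here. That works because TfidfVectorizer L2-normalizes each row by default, so the plain dot product of two rows is already their cosine similarity.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
# Because TfidfVectorizer L2-normalizes each row by default, the dot products
# from linear_kernel() equal the cosine similarities (within floating-point tolerance).
print(np.allclose(linear_kernel(tfidf_matrix, tfidf_matrix),
                  cosine_similarity(tfidf_matrix, tfidf_matrix)))
# Expected output: True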
This cosine_sim matrix object is basically the trained model we need to make recommendations. For any movie/show ID, we only need to sort its cosine similarity scores in descending order and return the top n as recommendations. In fact, let's try it to see if that works. The code below sorts by column 0 in descending order and shows that movies 4877, 1066, 7503, and 5047 have the most similar descriptions to movie 0.
df_sorted = pd.DataFrame(cosine_sim).sort_values(by=[0], ascending=False)
# Print movie 0 itself (similarity of 1.0) followed by its four closest matches
for movie_id in df_sorted.index[0:5]:
    print(movie_id, '\t', df.loc[movie_id, 'title'])
display(df_sorted)
# Output:
# 0 Dick Johnson Is Dead
# 4877 End Game
# 1066 The Soul
# 7503 Moon
# 5047 The Cloverfield Paradox
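To make this lookup reusable, here is a small helper sketch (not part of the original walkthrough) that generalizes the idea: given any title in df, it returns the n most similar titles by description. It assumes df has a default integer index aligned with the rows of cosine_sim and a 'title' column, as in the code above.
# Hypothetical helper (a sketch): recommend the n titles whose descriptions
# are most similar to a given title, using the cosine_sim matrix built above.
def recommend(title, n=5):
    idx = df.index[df['title'] == title][0]                   # row position of the query title
    scores = pd.Series(cosine_sim[idx])                       # similarity of this title to every other
    top = scores.sort_values(ascending=False).iloc[1:n + 1]   # skip the title itself
    return df.loc[top.index, 'title']

# Example usage:
# recommend('Dick Johnson Is Dead', n=4)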
You may or may not recognize these shows, but if you read their descriptions, you'll find that they truly do have the most similar text to movie ID 0, Dick Johnson Is Dead. Let's explore some good ways to deploy these recommendations in the next section.