Content Filtering

As mentioned in the prior section, content filtering does not depend on user ratings. It does, however, rest on some of the same assumptions as collaborative filtering. For example, both types of filtering assume that similar people like similar things and that similar items will be similarly preferred by users. However, each technique measures "similarity" quite differently. Collaborative filtering measures user similarity based on the ratings that users give to items: if two users rate the same items similarly, then they are similar users. Content-based filtering measures similarity based on the characteristics of the users and items themselves.

For example, movie descriptions with the same words and phrases are similar. Movies with the same director, actors, or genre are similar. Users with the same demographic characteristics are similar. Likewise, users who have the same stated preferences for certain item features are similar. The diagram below visualizes the data required for collaborative vs. content filtering.

Figure 21.1: Collaborative vs Content Filtering Data Requirements
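To make the two notions of similarity concrete, here is a minimal sketch in plain Python. The users, movies, ratings, and descriptions are made-up toy data, and raw cosine similarity over word counts is only one assumed way to compare descriptions; it is not necessarily the measure used later in this chapter.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Collaborative view: each user is described only by the ratings they gave.
# Hypothetical users and ratings, for illustration only.
ratings = {
    "ann": {"Alien": 5, "Heat": 4, "Up": 1},
    "bob": {"Alien": 5, "Heat": 5, "Up": 1},
    "cat": {"Alien": 1, "Heat": 2, "Up": 5},
}

# Content view: each movie is described by its own features
# (here, the words in a hypothetical description). No ratings are used.
descriptions = {
    "Alien": "crew fights a deadly alien creature in space",
    "Aliens": "marines fight deadly alien creatures in space colony",
    "Up": "an old man flies his house to south america with balloons",
}
word_counts = {title: Counter(text.split()) for title, text in descriptions.items()}

# Users who rate the same items similarly are similar users.
user_sim_ann_bob = cosine(ratings["ann"], ratings["bob"])
user_sim_ann_cat = cosine(ratings["ann"], ratings["cat"])

# Movies whose descriptions share words are similar movies.
item_sim_alien_aliens = cosine(word_counts["Alien"], word_counts["Aliens"])
item_sim_alien_up = cosine(word_counts["Alien"], word_counts["Up"])

print(user_sim_ann_bob > user_sim_ann_cat)         # ann is closer to bob than to cat
print(item_sim_alien_aliens > item_sim_alien_up)   # Alien is closer to Aliens than to Up
```

Notice that the collaborative comparison never looks at what the movies are about, and the content comparison never looks at a rating. In practice, descriptions are usually cleaned and weighted (for example with TF-IDF) before they are compared.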

Let's summarize the advantages and disadvantages of both approaches.

Table 21.1
Comparing Collaborative and Content Filtering-Based Recommendation

| Collaborative | Content |
|---|---|
| Based on user-item-rating data. | Based on user features and/or item features. |
| Very accurate when there are many ratings and the user-item-rating matrix is not too sparse. | Very accurate when there is a rich set of user and item features and those features represent the ways that users and items are similar. |
| Only requires three features to generate very accurate recommendations: userID, itemID, rating. No domain knowledge is necessary; just accurate ratings. | Requires theoretically valid features that indicate similarity. For example, movie descriptions must be accurate depictions of movie content and quality if they are going to be used as similarity vectors. |
| Can be used for item-specific and user-specific recommendations. | Can only be used for item-specific recommendations. For example, you couldn't generate recommendations for a "home screen view" (i.e., not an item details page) that are unique to a specific user unless you combine it with a collaborative model. |
| Loses accuracy when there are new items with few or no ratings. | No loss of accuracy when there are new items and no ratings. |

The great advantage of content-based filtering is that you don't need historical user-item-rating data to generate recommendations. This means it will generate equally valid recommendations whether the items have been thoroughly rated or are brand new with no ratings. You only need item characteristics for item-based recommendations. The downside is that you can't generate user-based recommendations (e.g., recommendations for the user's home screen that are specific to that user and do not depend on a specific item, unlike the recommendations shown on an item's details page). Having said that, if you have theoretically valid features about users, then you can identify similar users, but you would have to pair the content-based model with a collaborative model in order to know which items to recommend to new users. It is actually quite common to build hybrid recommender systems, which combine elements of both collaborative and content-based filtering to best address all scenarios.
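As a rough illustration of the hybrid idea, the sketch below blends a collaborative score and a content score for each candidate item. The weighting rule, the function name hybrid_score, the constant k, and all of the numbers are assumptions made for this example; real hybrid systems can combine the two models in many other ways.

```python
def hybrid_score(collab_score, content_score, n_ratings, k=10):
    """Blend a collaborative and a content-based score for one candidate item.

    The collaborative weight n_ratings / (n_ratings + k) grows with the number
    of ratings, so an item with no ratings is scored by content alone.
    k is a hypothetical tuning constant, not a standard value.
    """
    weight = n_ratings / (n_ratings + k)
    return weight * collab_score + (1 - weight) * content_score


# Hypothetical candidates: (collaborative score, content score, number of ratings)
candidates = {
    "well_rated_movie": (0.9, 0.4, 990),
    "new_movie": (0.0, 0.8, 0),
}

scores = {name: hybrid_score(*values) for name, values in candidates.items()}

for name, score in scores.items():
    print(name, round(score, 3))
```

With these made-up inputs, the brand-new movie is ranked entirely by its content score because it has no ratings, while the well-rated movie is ranked almost entirely by its collaborative score.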

If you want to dive into hybrid recommendation in more detail, you can find a useful discussion of various ways to combine collaborative- and content-based approaches here: https://medium.com/analytics-vidhya/7-types-of-hybrid-recommendation-system-3e4f78266ad8

It may still be difficult to understand the differences between collaborative and content filtering without digging into the code and seeing how it works. For this example, we will switch to a similar but distinct dataset based on Netflix data, which includes many more details about the movies and series available on the platform. Download the dataset below, which can also be found here.