Content-based recommender system
If you want an overview of recommender engines first, I would suggest going through this article: https://www.garudax.id/pulse/recommender-system-himanshu-singh/ — otherwise, you are ready to rock.
The main idea behind a content-based recommender system is to recommend to a user only those items that are similar to the items she has rated highly in the past.
For example, in the case of movies, recommend movies with the same actors, directors, etc. In the case of blogs, recommend blogs with content similar to what she has already read. In the case of social media, recommend people whose likes and dislikes match hers.
The general flow of a content-based recommendation engine goes like this:
1. The user likes some items
2. Create item profiles of the liked items
3. Create a user profile using item profiles
4. Match the user profile with the item profiles from the catalog of items that have not been rated by the user
5. Recommend the items that have a high similarity with the user profile.
A profile is a set of features. In the case of movies, actor names, director names, IMDB ratings, etc. are the features that make up an item profile. In the case of images and videos, metadata and tags form the item profile. Even though a profile is a set of features, it is convenient to represent it as a vector. The vector can be either boolean or real-valued: if a feature is present in the item, it is given a value of 1, and if it is absent, a value of 0.
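A boolean item profile can be sketched as below. The feature list and the example movie are made up purely for illustration:

```python
# Hypothetical feature vocabulary shared by all item profiles.
FEATURES = ["Tom Hanks", "Spielberg", "Nolan", "sci-fi", "drama"]

def item_profile(item_features):
    """Return a boolean vector: 1 if the feature is present in the item, else 0."""
    return [1 if f in item_features else 0 for f in FEATURES]

# A movie tagged with "Nolan" and "sci-fi":
print(item_profile({"Nolan", "sci-fi"}))  # [0, 0, 1, 1, 0]
```

Every item is then a point in the same feature space, which is what makes profiles comparable later on.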
Blog or news article recommendation
To understand content-based filtering better, let’s consider a special case of text recommendation i.e. which blog or news article should be recommended next. In the case of text, a profile can be considered as a set of important words in the document.
1. Calculating document profile
How to find which words are more important in a document?
From the field of Information Retrieval, we can borrow a concept called TF-IDF to find out important words in a document.
TF-IDF score is composed of two terms: 1. TF (term frequency) and 2. IDF (inverse document frequency). TF-IDF is computed for each word in a document.
Term frequency (TF) = (word count) / (total word count)
where word count is the number of times the word appears in the document, and total word count is the total number of words in that document.
Term frequency of the word banana for document k = (number of times the word banana appears in document k) / (total number of words in document k)
NOTE: Division by the total word count in the document discounts longer documents. Without this division, longer documents would get higher relevance simply because they contain more words.
Intuitively, the more frequently a word appears in a document, the more important that word is as a feature of the document. For example, if a document mentions the word apple ten times, then apple is more important in that document than in a document that mentions apple only once.
But how do we calculate the relative weights of different terms? A rare word like banana should be more important than a common word like the, even if the appears a thousand times in the document. This is where inverse document frequency comes into play: if a word appears in many documents, it is not a discriminative word. For example, the appears in almost all documents, and hence we can say that it is not an important word.
Inverse Document Frequency (IDF) = log(N/n)
N = total number of documents in the corpus
n = number of documents where this word has appeared
TF-IDF score = TF * IDF
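The TF and IDF formulas above can be sketched in a few lines. The toy corpus is made up for illustration, and the tokenizer is deliberately naive (lowercase and split on whitespace):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF for every word in every document.

    TF  = (count of word in doc) / (total words in doc)
    IDF = log(N / n), where N = number of docs, n = docs containing the word
    """
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                     # n: documents containing each word
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({w: (c / total) * math.log(N / df[w])
                       for w, c in counts.items()})
    return scores

docs = ["the banana is yellow", "the sky is blue", "the sun is bright"]
scores = tf_idf(docs)
# "the" appears in all 3 documents, so IDF = log(3/3) = 0 and its score is 0,
# while the rarer "banana" gets a positive score.
print(scores[0]["the"])     # 0.0
print(scores[0]["banana"])  # (1/4) * log(3) ≈ 0.2747
```

Note that with this plain IDF, a word appearing in every document scores exactly zero, which is precisely the behavior we want for words like the.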
To build the document profile, keep the words whose TF-IDF score is greater than a chosen threshold and ignore the words below the threshold. In this case, a document profile is a real-valued vector rather than a boolean one. We can then use the profiles of the documents the user has already rated to calculate the user profile.
2. Calculate user profile using document profiles
An over-simplistic way of calculating the user profile can be to take the simple average of the document profiles that the user has already rated.
But since users rate some documents higher and others lower, we can use these ratings as weights when averaging, i.e. give more weight to highly rated documents and less weight to lower-rated ones.
Some users are more generous than others. For some users a rating of 4 is a wildly positive rating while for some users a rating of 4 is just okay.
A rating of 1 or 2 is considered a negative rating i.e. the user didn’t like the blog or news article.
So, how do we capture these two notions, i.e. 1. tough vs. easy raters and 2. negative ratings by users? We can do so by normalizing the ratings: deduct the user's mean rating from each rating she has given. Let's understand this mean deduction through another example of movie ratings.
Assume that for Actor A’s movies the user has given ratings of 3 and 5, and for Actor B’s movies ratings of 1, 2, and 4. The average rating given by the user is (3+5+1+2+4)/5 = 3
Mean deducted ratings:
Actor A’s movies: 0, +2
Actor B’s movies: -2, -1, +1
We can see that the 1 and 2 ratings given by the user to Actor B’s movies are actually negative ratings.
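Mean deduction is a one-liner. Using the five ratings from the example above (3 and 5, plus 1, 2, and 4, with mean 3):

```python
def mean_center(ratings):
    """Subtract the user's mean rating; below-mean ratings become negative."""
    mean = sum(ratings) / len(ratings)
    return [r - mean for r in ratings]

print(mean_center([3, 5, 1, 2, 4]))  # [0.0, 2.0, -2.0, -1.0, 1.0]
```

These centered ratings are then the weights used when averaging document profiles into the user profile, so disliked documents actively push the profile away from their features.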
3. Make recommendations
Now that we have document profiles and user profiles as vectors in a high-dimensional space, we can use cosine similarity to find the most similar documents to recommend next.
cos(theta) = (u.v)/(|u||v|) where u: user profile and v: document profile.
If two vectors are similar, i.e. pointing in roughly the same direction, the angle between them is small: the smaller the angle, the larger the similarity. We could use 180° − θ as a similarity measure, but it is simpler to use the cosine of the angle, since the cosine decreases as the angle between two vectors increases. Hence, if two vectors are close, the cosine of the angle between them is large.
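The cosine formula above translates directly to code. The profile vectors here are made-up numbers for illustration:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

user  = [0.4, 0.0, 0.9]   # hypothetical user profile
doc_a = [0.5, 0.1, 0.8]   # points in roughly the same direction as the user
doc_b = [0.0, 1.0, 0.0]   # orthogonal to the user profile

print(cosine_similarity(user, doc_a) > cosine_similarity(user, doc_b))  # True
```

To make recommendations, score every unrated document against the user profile this way and pick the top few.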
Cons of Content-based recommendation:
- Finding the features is a difficult problem. In the case of movies, we can think of the names of actors or directors as features, but in real life many users are not loyal to a particular actor or director, i.e. a Salman fan today is not necessarily a Salman fan forever.
- Users have a variety of tastes. This kind of system fails to satisfy all kinds of tastes of users because it only depends on the already watched movies. Consider the case when a user has watched only action movies till now but she has a liking for comedy movies as well. But this system will never recommend a comedy movie to her because she never watched or rated any comedy movie till now.
- For a new user, the system creates an average profile, i.e. the average profile of all users. As time passes, the new user’s profile becomes more fine-tuned according to the movies she has watched. So there is a cold-start problem here: in the starting phase, users get recommendations based not on their own preferences but on the preferences of an average audience.
In the coming post, we’ll look at collaborative filtering, which tries to alleviate some of these cons of content-based recommender systems.
I read that Netflix never actually deployed the algorithm which won :D The reasons are also quite interesting https://www.wired.com/2012/04/netflix-prize-costs/