Recommender System
Imagine a situation where a user needs to interact with a very large catalog of items (e.g. tens of thousands of items). These items can be anything: products on Amazon, movies on Netflix, videos on YouTube, or friends on Facebook.
Users can interact with the catalog in two ways:
1. The user knows what she wants and types it into the search bar. The search request goes to the server, which sends the results back to the client.
2. The user does not know what she really wants. In this case, the recommender system comes into the picture.
But why do we need a recommender system? Isn’t life good enough without one? Is the problem really compelling enough to be researched so deeply?
With digitalization, we are moving from an era of scarcity to an era of abundance.
A physical bookstore can hold only a small number of books, limited by the size of the store. A retail shop can stock only a few items, limited by the size of the shop. We can have only a few people in our neighborhood, limited by the size of the colony. Scarcity!
But a virtual store, e.g. Amazon, can offer a practically unlimited variety of items. And in a virtual world, e.g. Facebook, we can have the whole world in our neighborhood. Abundance!
Storing an item in a physical store takes space. If an item does not sell often enough, there is no point in keeping it in the store, because its profit cannot justify the rent for the space it occupies. Imagine keeping a refrigerator of an unknown brand in a mart. Refrigerators take up a lot of space. If there are not enough purchases of that particular brand, it is just eating up space that could have been used for more popular items that lead to more sales and hence more profit. So only a limited number of items (e.g. a few well-known brands) can be sold offline, and most other things have to be sold online.
This leads to the Long Tail phenomenon. As shown in the picture, the area under the curve to the right of the vertical line, i.e. “Items available only online”, may sometimes be larger than the area under the curve to the left of the vertical line, i.e. “Retail and Online”, because most unpopular brands are available only online. In this scenario, how would a user come to know about these “only online” items when there can be literally thousands of brands for each item? The recommender system comes to the rescue here.
Fun fact
Touching the Void is a 1988 book. At that time, there were no online bookstores, so the book got limited attention even though it is awesome. In 1997, a book named Into Thin Air was released. It became quite popular because it was a good book and was published in the internet era. Amazon noticed that a few people who bought Into Thin Air had also bought Touching the Void. So when someone bought Into Thin Air, she was informed that “Users who have bought this item have also bought Touching the Void”. In this way, people started buying Touching the Void too, and later on it became an even bigger success than Into Thin Air. So a good recommender system can help expose hidden gems like Touching the Void.
Types of Recommender System
- Editorial and hand-curated: These are items handpicked by the owner of the website to show to the users. This method does not account for any user input. The owner decides that something is good and puts it on the website, but the users might not always like it.
- Simple aggregate: To account for user actions, simple aggregations of items can be used, e.g. top 10 most popular, recent uploads, etc. But this kind of recommendation does not represent the view of a single user; it is an aggregation of the views of the masses, so it is often not helpful to an individual user.
- Tailored to individual users: This kind of recommendation system shows content specific to the user e.g. Amazon, Netflix, etc. This is where most of the revenues come from. And since we are living in a profit-driven society, this will be our focus.
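The simple aggregate approach from the list above can be sketched in a few lines. This is only an illustration: the `interactions` list of (user, item) pairs and the function name `top_n_popular` are assumptions made for this example, not part of any real system.

```python
from collections import Counter

def top_n_popular(interactions, n=10):
    """Return the n most-interacted-with items, most popular first.

    `interactions` is a hypothetical list of (user_id, item_id) pairs.
    """
    # Count each (user, item) pair once so one heavy user cannot inflate an item.
    counts = Counter(item for _, item in set(interactions))
    # Sort by descending count, breaking ties by item id for a stable order.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:n]]

interactions = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
    ("u1", "A"),  # repeat interaction by u1, counted once
]
top_items = top_n_popular(interactions, n=2)
```

Note that every user gets the same list, which is exactly the limitation described above.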
Utility Matrix
The utility matrix holds the ratings given to certain movies by certain users. For example, User 1 has rated Movie A but not Movie B. User 2 has rated Movie B and Movie D but not Movie A and Movie C. It could be that User 2 has not seen Movie A and Movie C, or it could be that User 2 has seen them but didn’t bother to rate them. There are tens of thousands of movies, and most users have not watched most of them. Hence, in general, the utility matrix is very sparse.
The key problem in a recommender system is to predict what rating a user would give a particular movie if she watched it, e.g. what rating User 1 would give to Movie B. Our concern here is only to find movies that the user would rate highly. We don’t want to recommend movies that she would dislike or give an average rating.
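A minimal sketch of the utility matrix described above, stored as a dictionary of dictionaries so that only known ratings take up space. Which cells are filled follows the example in the text; the rating values themselves and the helper names `get_rating` and `sparsity` are invented for illustration.

```python
# Sparse utility matrix: only known ratings are stored.
# The rating values are invented for illustration.
utility = {
    "User 1": {"Movie A": 4},
    "User 2": {"Movie B": 5, "Movie D": 3},
}
movies = ["Movie A", "Movie B", "Movie C", "Movie D"]

def get_rating(utility, user, movie):
    """Return the known rating, or None when the cell is empty."""
    return utility.get(user, {}).get(movie)

def sparsity(utility, movies):
    """Fraction of (user, movie) cells that have no rating."""
    total_cells = len(utility) * len(movies)
    known = sum(len(ratings) for ratings in utility.values())
    return 1 - known / total_cells

missing = get_rating(utility, "User 1", "Movie B")  # the cell we want to predict
empty_fraction = sparsity(utility, movies)
```

Even in this tiny example most cells are empty. With tens of thousands of movies the empty fraction gets very close to 1, which is why a dense table would be wasteful.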
Key problems in a recommender system:
- Gathering known ratings for utility matrix: Most of the time a user does not bother to give any rating to the movies she has watched.
- Extrapolating unknown ratings from known ratings: We are mainly interested in the movies a user would rate highly, not the ones she would rate lower. We don’t want to recommend movies that are average or below average for her.
- Evaluating the extrapolation method: How do we know that the rating determined by the recommender system is correct?
Gathering known ratings:
- Explicit method: Ask the user if she liked the movie after she is done watching it on Netflix. This method is good because it gives an exact measure of what kind of movies a user likes or dislikes. But it does not scale, because most users don’t rate a movie.
- Implicit method: Don’t ask the user to rate a movie but observe her behavior, e.g. if she has fast-forwarded or skipped some part of the movie, assume that she does not like it. Or, in the case of shopping websites, assume that if a user has bought a product, she liked it. But this method has its own disadvantages. On shopping sites, it’s difficult to say that a user dislikes a product just because she has not bought it. Similarly, on movie streaming sites, it is difficult to say that she liked a movie just because she did not fast-forward or skip it.
In practice, we use a combination of explicit and implicit methods to gather the ratings.
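One way to picture such a combination is the sketch below. Everything in it is an assumption made for illustration: the signal names (`watched_fraction`, `skipped`), the 1 to 5 scale, and the weights are hypothetical and not taken from any real service.

```python
def estimate_rating(explicit_rating=None, watched_fraction=None, skipped=False):
    """Return a 1-5 rating estimate, or None when there is no signal.

    All thresholds and weights here are hypothetical.
    """
    # An explicit rating is the most direct signal, so it wins when present.
    if explicit_rating is not None:
        return float(explicit_rating)
    # No behavioral data either: leave the cell of the utility matrix empty.
    if watched_fraction is None:
        return None
    # Map the watched fraction (0.0 to 1.0) linearly onto the 1-5 scale.
    estimate = 1.0 + 4.0 * watched_fraction
    # Skipping or fast-forwarding is treated as a sign of dislike.
    if skipped:
        estimate -= 1.0
    # Clamp to the valid rating range.
    return max(1.0, min(5.0, estimate))

from_explicit = estimate_rating(explicit_rating=4, watched_fraction=0.1)
from_implicit = estimate_rating(watched_fraction=0.5, skipped=True)
no_signal = estimate_rating()
```

The design choice here is simply that explicit feedback overrides inferred feedback, and that a missing signal stays missing rather than being treated as a dislike.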
Extrapolating unknown ratings from known ratings: There are two main problems here:
- Sparsity: The utility matrix is very sparse, i.e. most users have not watched most of the movies, or have not rated them even after watching them.
- Cold start: New movies have no rating and new users have no history.
There are three approaches to building a recommender system:
- Content-based filtering: These systems recommend an item to a user based on a description of the item and a profile of the user’s interests.
- Collaborative filtering: It is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
- Latent factor-based filtering: In the latent factor-based method, we feed in only the users’ history and do not need to define descriptors ourselves. The algorithm finds the hidden descriptors that influence a user’s preferences, such as the actors in a movie or its genre. No movie description is required; the users’ history alone is enough.
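To make the collaborative idea from the list above concrete, here is a minimal sketch of one common variant: user-user collaborative filtering with cosine similarity. The choice of similarity measure, the function names, and the rating values are all assumptions for illustration, not the only way to do it.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts (missing = 0)."""
    common = set(a) & set(b)
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_rating(utility, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, ratings in utility.items():
        # Only other users who have actually rated the movie can vote.
        if other == user or movie not in ratings:
            continue
        sim = cosine_similarity(utility[user], ratings)
        if sim > 0:
            weighted_sum += sim * ratings[movie]
            weight_total += sim
    # None signals that no similar user has rated the movie.
    return weighted_sum / weight_total if weight_total else None

# Invented ratings: User 1 agrees with User 2 and disagrees with User 3.
utility = {
    "User 1": {"Movie A": 5, "Movie C": 1},
    "User 2": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "User 3": {"Movie A": 1, "Movie B": 2, "Movie C": 5},
}
predicted = predict_rating(utility, "User 1", "Movie B")
```

Because User 1's tastes look more like User 2's, the prediction for Movie B lands closer to User 2's rating of 4 than to User 3's rating of 2. A movie nobody has rated yields `None`, which is the cold start problem in miniature.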
We’ll discuss each one of them in detail in future articles.