Recommendations Based on Correlation

Lokesh Vijay Kumar

Published Jul 10, 2018

We will look at correlation-based recommendation systems. This kind of recommender offers a basic form of collaborative filtering, as it recommends items based on similarities in their user reviews. We use Pearson's R correlation to recommend the item that is most similar to an item a user has already chosen. The Pearson R correlation coefficient is a measure of linear correlation between two variables or, in this case, two items' ratings. It is represented by the symbol R and ranges from -1 to 1: an R value close to one or negative one means the two variables are strongly linearly correlated, positively or negatively respectively.

As R values get closer to zero, the linear relationship between the two variables gets weaker, and at zero the two variables are not linearly correlated. Let's look at the logic of this.
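
To make the coefficient concrete, here is a minimal sketch that computes Pearson's R by hand with NumPy. The helper name pearson_r and the rating lists are made up for illustration; they are not part of the restaurant dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations from the mean of x
    dy = y - y.mean()          # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical ratings given by the same five users to four items
item_a = [0, 1, 2, 2, 1]
item_b = [0, 1, 2, 2, 1]   # same pattern as item_a
item_c = [2, 1, 0, 0, 1]   # reversed pattern
item_d = [1, 1, 2, 0, 1]   # no linear relationship with item_a

r_same = pearson_r(item_a, item_b)       # close to 1
r_opposite = pearson_r(item_a, item_c)   # close to -1
r_none = pearson_r(item_a, item_d)       # close to 0
print(r_same, r_opposite, r_none)
```

The same values can be cross-checked with np.corrcoef(item_a, item_b)[0, 1].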

Let's say we have a mystery shopper, Shopper D. We see that she has already chosen and reviewed the camera, giving it a rating of four stars. Users A, B, and C also reviewed the camera, but look at the ratings each of them gave. If user A gave four stars, user B gave four stars, and user C gave 2.5 stars, then based on correlations between user ratings we'd say that user A's and user B's ratings are more similar to, or more highly correlated with, user D's ratings. If we then look at what other items users A and B liked and find that both gave pretty good ratings to the printer, then, based on how well their review scores correlate with user D's score for the camera and on their shared preference for the printer, we would recommend the printer to user D as well.

Now let's see this theory in practice by making recommendations based on the Pearson correlation. We will work through an example of an item-based recommendation system, where the recommender compares items based on user reviews.

In our dataset, though, the items are different places to eat and the users are restaurant-goers. Making recommendations based on correlation is a simple form of collaborative filtering (item-to-item in this example), because items are recommended based on similarities in user reviews.

These datasets are hosted on: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

Citation: They were originally published by: Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys’11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.

Let's get started.

import numpy as np
import pandas as pd
print(np.__version__)
1.13.3

So the first thing we need to do is read the datasets into our Jupyter notebook, and we'll do that by calling the read_csv function.

data_frame = pd.read_csv('rating_final.csv')
cuisine = pd.read_csv('chefmozcuisine.csv')
geodata = pd.read_csv('geoplaces2.csv', encoding='latin-1')

Now let's take a quick look at the first few records in all the data frames we have.

data_frame.head()

geodata.head()

places = geodata[['placeID', 'name']]
places.head()

cuisine.head()

Now let's group and rank this data by looking at the ratings these places are getting. To do that, we will look at the mean value of all the ratings that are given to each place and group it by place ID.

rating = pd.DataFrame(data_frame.groupby('placeID')['rating'].mean())
rating.head()

In addition to the mean value, we also want to look at how popular each of these places is. To do this, let's add a column called rating_count that holds the number of reviews each place got.

rating['rating_count'] = data_frame.groupby('placeID')['rating'].count()
rating.head()
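
As a side note, the mean and the count can also be computed in a single groupby. The sketch below uses a small made-up table (the toy values are assumptions, only the column names mirror rating_final.csv).

```python
import pandas as pd

# Toy ratings table (hypothetical values) with the same column names
# as rating_final.csv: userID, placeID, rating
toy = pd.DataFrame({
    'userID':  ['U1', 'U2', 'U3', 'U1', 'U2', 'U3'],
    'placeID': [101, 101, 101, 102, 102, 103],
    'rating':  [2, 1, 0, 2, 2, 1],
})

# Mean rating and number of ratings per place, in one pass
toy_rating = toy.groupby('placeID')['rating'].agg(['mean', 'count'])
toy_rating = toy_rating.rename(columns={'mean': 'rating', 'count': 'rating_count'})
print(toy_rating)
```

Either approach should give the same table; the two-step version above just makes each column explicit.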

Now let's look at a statistical description of this rating data frame.

rating.describe()

The maximum of the rating_count column tells us that the most popular place in the dataset has a total of 36 reviews. Sorting the results in descending order shows that this most popular place has a place ID of 135085. Let's find the name of this place and also look at the type of cuisine it serves.

rating.sort_values('rating_count', ascending=False).head()

places[places['placeID']==135085]

cuisine[cuisine['placeID']==135085]

Let us now prepare the data for analysis. The next thing we need to do is build a user-by-item utility matrix. To do that, we call the pivot_table function, which cross-tabulates each user against each place and outputs a matrix. If you look at the first five records of places_crossTabulation, you will notice that it's full of null values. That's because nobody reviews that many places: just a few people review just a few places. Hence the sparsity of this matrix.

places_crossTabulation = pd.pivot_table(data=data_frame, values='rating', index='userID', columns='placeID')
places_crossTabulation.head()
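
To see where the null values come from, here is a small sketch on made-up ratings (the toy table is an assumption, not the real data). It builds the same kind of user-by-place matrix and measures what fraction of its cells is empty.

```python
import pandas as pd

# Hypothetical ratings: four users, three places, six ratings in total
toy = pd.DataFrame({
    'userID':  ['U1', 'U1', 'U2', 'U3', 'U3', 'U4'],
    'placeID': [101, 102, 101, 102, 103, 103],
    'rating':  [2, 1, 0, 2, 1, 2],
})

# One row per user, one column per place, NaN where a user did not rate a place
toy_crosstab = pd.pivot_table(data=toy, values='rating', index='userID', columns='placeID')

# Fraction of user/place cells that hold no rating
sparsity = toy_crosstab.isna().values.mean()
print(toy_crosstab)
print(sparsity)
```

The same isna().values.mean() expression could be applied to places_crossTabulation to quantify how sparse the real matrix is.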

Let us isolate the user ratings for our restaurant called Tortas. We select the column that's indexed with the number 135085 and filter the Tortas ratings so that we see only the non-null values. As we recall, Tortas is the most popular place, with 36 ratings, so let's look at what those ratings are. We create a filter where the Tortas ratings are greater than or equal to zero; comparisons with null values evaluate to False, so only the actual ratings remain.

Tortas_ratings = places_crossTabulation[135085]
Tortas_ratings[Tortas_ratings>=0]
userID
U1001    0.0
U1002    1.0
U1007    1.0
U1013    1.0
U1016    2.0
U1027    1.0
U1029    1.0
U1032    1.0
U1033    2.0
U1036    2.0
U1045    2.0
U1046    1.0
U1049    0.0
U1056    2.0
U1059    2.0
U1062    0.0
U1077    2.0
U1081    1.0
U1084    2.0
U1086    2.0
U1089    1.0
U1090    2.0
U1092    0.0
U1098    1.0
U1104    2.0
U1106    2.0
U1108    1.0
U1109    2.0
U1113    1.0
U1116    2.0
U1120    0.0
U1122    2.0
U1132    2.0
U1134    2.0
U1135    0.0
U1137    2.0
Name: 135085, dtype: float64
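
The ratings shown above are 0, 1 and 2, so the >= 0 filter keeps every non-null value. A minimal sketch on a made-up series suggests that dropna gives the same result and states the intent more directly.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings column for one place; NaN means "user did not rate it"
s = pd.Series([0.0, np.nan, 2.0, np.nan, 1.0],
              index=['U1', 'U2', 'U3', 'U4', 'U5'], name=101)

by_filter = s[s >= 0]     # NaN >= 0 is False, so NaN rows are dropped
by_dropna = s.dropna()    # drops NaN rows explicitly

print(by_filter.equals(by_dropna))
```

Note that the two only agree as long as no rating is negative, which holds for the values we see here.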

Let us evaluate similarity based on correlation. To find the correlation between each of the places and the Tortas restaurant, we call the corrwith method on places_crossTabulation and pass it the Tortas ratings series. This generates a Pearson R correlation coefficient between Tortas and every other place that's been reviewed in the dataset. Let us drop the null values too. On printing the data, we see a data frame that contains each place ID and a Pearson R correlation coefficient indicating how well each place correlates with Tortas based on user ratings.

similar_to_Tortas = places_crossTabulation.corrwith(Tortas_ratings)

corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
corr_Tortas.dropna(inplace=True)
corr_Tortas.head()
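
Here is a minimal sketch of what corrwith does on a tiny made-up matrix (all place IDs and ratings below are hypothetical). Each column is correlated with the target column using only the users who rated both places, and a place with no shared reviewers gets a null value.

```python
import numpy as np
import pandas as pd

toy_crosstab = pd.DataFrame(
    {
        101: [2.0, 1.0, 0.0, 2.0, np.nan],
        102: [2.0, 1.0, 0.0, np.nan, 1.0],           # agrees with 101 where both are rated
        103: [0.0, 1.0, 2.0, 0.0, np.nan],           # opposite of 101
        104: [np.nan, np.nan, np.nan, np.nan, 2.0],  # no reviewers shared with 101
    },
    index=['U1', 'U2', 'U3', 'U4', 'U5'],
)

target = toy_crosstab[101]
similar = toy_crosstab.corrwith(target)           # one Pearson R per place
toy_corr = similar.to_frame('PearsonR').dropna()  # place 104 drops out here
print(toy_corr)
```

Place 101 correlates perfectly with itself, which is why the target place always shows up in its own results.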

Think about it: a place with just two ratings probably isn't really all that similar to Tortas. Maybe it got ratings similar to those of Tortas, but with so few reviews that correlation wouldn't be significant. So in addition to how well the review scores correlate, we also need to take stock of how popular each of these places is.

So to do that, let's join our corr_Tortas data frame with the rating data frame and create a filter so that we see only the places that have at least 10 user reviews. We will look at the Pearson R correlation coefficient sorted in descending order. We now have a list of top reviewed places that are most similar to Tortas. Places with Pearson R values of exactly one aren't meaningful here. They appear because those places share very few reviewers with Tortas: with only two shared reviewers, for example, the coefficient, whenever it is defined, is always exactly 1 or -1, and with a single shared reviewer no coefficient can be computed at all, so those places were already dropped as nulls. A correlation based on so few shared ratings is not meaningful. The places need to have more reviewers in common.

Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])
Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)
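
One way to make the "reviewers in common" idea explicit is to count, for each place, how many users rated both it and the target place. The sketch below does this on made-up data; the threshold min_shared and all ratings are assumptions for illustration, not values from the article.

```python
import numpy as np
import pandas as pd

toy_crosstab = pd.DataFrame(
    {
        101: [2.0, 1.0, 0.0, 2.0, 1.0],
        102: [1.0, 0.0, np.nan, np.nan, np.nan],   # shares 2 reviewers with 101
        103: [2.0, 1.0, 0.0, 1.0, np.nan],         # shares 4 reviewers with 101
    },
    index=['U1', 'U2', 'U3', 'U4', 'U5'],
)
target = toy_crosstab[101]

pearson = toy_crosstab.corrwith(target)

# Keep only users who rated the target, then count non-null ratings per place
shared = toy_crosstab[target.notna()].notna().sum()

toy_summary = pd.DataFrame({'PearsonR': pearson, 'shared_reviewers': shared})

min_shared = 3   # hypothetical cut-off
trusted = toy_summary[toy_summary['shared_reviewers'] >= min_shared]
print(toy_summary)
print(trusted)
```

Place 102 gets a Pearson R of 1 from just two shared reviewers, even though their scores differ from the ones given to place 101, so filtering on shared_reviewers removes it while rating_count alone might not.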

So we'll throw those places out. Now let's take the top seven correlated results that remain and see if any of these places also serve fast food. We create a data frame called places_corr_Tortas and pass in the place IDs of the top correlated places. Then we build a table called summary by merging places_corr_Tortas with cuisine, which gives us each of the top correlated place IDs together with the types of food it serves. When we print this out, we get only five results, even though we included seven place IDs in the data frame. The reason is that not all of the places are listed in the cuisine dataset, and the merge cannot return places that are missing from it.

places_corr_Tortas = pd.DataFrame([135085, 132754, 135045, 135062, 135028, 135042, 135046], index=np.arange(7), columns=['placeID'])
summary = pd.merge(places_corr_Tortas, cuisine, on='placeID')
summary
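
To see which places drop out of a merge like this, a left merge with indicator=True can help. The sketch below uses made-up place IDs and cuisine labels, so treat the values as assumptions.

```python
import pandas as pd

# Hypothetical top places and a cuisine table that does not cover all of them
top_places = pd.DataFrame({'placeID': [101, 102, 103]})
toy_cuisine = pd.DataFrame({
    'placeID':  [101, 101, 103],
    'Rcuisine': ['Fast_Food', 'Mexican', 'Bar'],
})

# Default merge is an inner join: place 102 silently disappears
inner = pd.merge(top_places, toy_cuisine, on='placeID')

# Left join keeps every top place and flags the ones without a cuisine entry
left = pd.merge(top_places, toy_cuisine, on='placeID', how='left', indicator=True)
missing = left.loc[left['_merge'] == 'left_only', 'placeID'].tolist()

print(inner)
print(missing)
```

The toy output also shows that a place with several cuisine entries appears on several rows, so the row count of the merged table is not the same as the number of places.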

What we are seeing here is that among the top six places most correlated with Tortas, at least one also serves fast food. Let's get a name for this place so we don't have to refer to it by number.

places[places['placeID']==135046]

So we'll call it Reyecito. To evaluate how relevant the similarity metric really is, though, let's consider the entire set of possibilities, meaning how many cuisine types are served at places in this dataset. To do that, we'll use the describe method.

cuisine['Rcuisine'].describe()
count         916
unique         59
top       Mexican
freq          239
Name: Rcuisine, dtype: object

There are 59 unique types of cuisine served. In the final analysis, what we got back were six top places that were similar to Tortas based on correlation and popularity. Of these six places, one also serves fast food. Considering that there are 59 cuisine types that could have been offered, and that we got back another fast food place among our six most similar places, it looks like our correlation-based recommendation system is on track. In this case, we'd be safe recommending Restaurante El Reyecito to users who also like Tortas.
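
Since the describe output shows that cuisine labels are not equally common (Mexican alone accounts for 239 of the 916 rows), the share of rows carrying a given label may be a more informative baseline than the number of unique labels. Here is a minimal sketch on made-up data; the place IDs and labels are hypothetical.

```python
import pandas as pd

# Hypothetical cuisine table: one row per place/cuisine pair
toy_cuisine = pd.DataFrame({
    'placeID':  [101, 102, 103, 104, 105, 106, 107, 108],
    'Rcuisine': ['Mexican', 'Mexican', 'Mexican', 'Mexican',
                 'Fast_Food', 'Bar', 'Bar', 'Cafeteria'],
})

# Share of rows per cuisine label, as a rough baseline for a chance match
share = toy_cuisine['Rcuisine'].value_counts(normalize=True)
fast_food_share = share['Fast_Food']
print(share)
```

Running the same value_counts call on cuisine['Rcuisine'] would show how often each label appears in the real dataset.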

Hope you liked this example of using correlation for recommendation. I'll use machine learning algorithms for recommendations in my next post. Let me tell you, they will be a lot simpler than what we have seen so far... If you liked this post, please leave a comment or like it, and hang in there for more!
