Recommender System using Matrix Factorisation

[Image: Netflix makes heavy use of recommender systems]

Matrix factorisation is a common technique used in recommender systems to predict user preferences or ratings. If you can predict how a user is likely to rate something, you can then show them a list of recommendations.

Imagine a big table where the rows are users, the columns are items (like movies or products), and the cells contain ratings. Many cells in this table are empty, because not every user has rated every item. Matrix factorisation works by breaking this large table into two smaller tables: one representing users and the other representing items. These smaller tables capture the underlying factors that influence user preferences, such as genre preferences in movies. Note that you don't have to specify the factors manually; the model "works them out" during training (via gradient descent).

By taking the dot product of a user's row in one table with an item's row in the other, we can estimate the missing ratings in the original table, helping us recommend items that users are likely to enjoy.
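Here's a minimal NumPy sketch of the idea (the factor values are made up purely for illustration; the real model below learns them during training):

import numpy as np

# Two small tables: P holds user factors, Q holds movie factors.
P = np.array([[1.2, 0.8],
              [0.3, 1.5],
              [1.0, 1.0]])   # 3 users x 2 factors
Q = np.array([[1.0, 0.5],
              [0.2, 1.4],
              [1.1, 0.9],
              [0.7, 0.3]])   # 4 movies x 2 factors

# The full estimated ratings table is the matrix product P @ Q.T.
estimates = P @ Q.T

# A single missing cell (user 0, movie 1) is just one dot product.
print(np.dot(P[0], Q[1]))    # same value as estimates[0, 1]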


[Figure: splitting out a ratings matrix into factors]

Dataset

Here we're using the small MovieLens dataset (ml-latest-small). We're also using the Dataset and DataLoader classes from PyTorch to make batching easier.

An important bit to note is the encoding of the userIds and movieIds. We encode them into consecutive numerical indices starting from zero, because the embedding layers look up vectors by row index; the encoded ids are what map each user and movie to its embedding vector.
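Here's a quick illustration of what that encoding does (the ids are made up, but raw MovieLens ids are similarly sparse):

from sklearn.preprocessing import LabelEncoder

# Raw movieIds are sparse; LabelEncoder maps them onto 0..n-1
# so they line up with the rows of an embedding matrix.
enc = LabelEncoder()
print(enc.fit_transform([1, 3, 6, 47, 50]))  # -> [0 1 2 3 4]
print(enc.inverse_transform([3]))            # -> [47]

The dataset class below applies the same idea to the real userId and movieId columns.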

import torch
from torch.utils.data import Dataset, DataLoader
import zipfile
import pandas as pd
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

class MovieDataset(Dataset):
  def __init__(self):
    # Download the popular MovieLens dataset (the leading "!" runs a shell command in a notebook)
    ! curl http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o ml-latest-small.zip

    with zipfile.ZipFile('ml-latest-small.zip', 'r') as zf:
      zf.extractall('data')

    movies_df = pd.read_csv('data/ml-latest-small/movies.csv')
    ratings_df = pd.read_csv('data/ml-latest-small/ratings.csv')

    # Label Encode the ids (the encodings will then match the indexes of the embeddings)
    self.d = defaultdict(LabelEncoder)
    for c in ['userId', 'movieId']:
      # Encode the ids
      self.d[c].fit(ratings_df[c].unique())

      # Swap out the ids for the encoded values
      ratings_df[c] = self.d[c].transform(ratings_df[c])


    self.x = ratings_df.drop(['rating', 'timestamp'], axis=1).values
    self.y = ratings_df['rating'].values
    self.x, self.y = torch.tensor(self.x), torch.tensor(self.y)

    users = ratings_df.userId.unique()
    movies = ratings_df.movieId.unique()

    self.n_users = len(users)
    self.n_items = len(movies)

  def __getitem__(self, index):
    return (self.x[index], self.y[index])

  def __len__(self):
    return len(self.x)
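A quick sanity check of the dataset (a hypothetical snippet; the exact counts come from whatever version of ml-latest-small is downloaded):

ds = MovieDataset()
print(len(ds))                  # total number of ratings (~100k)
print(ds.n_users, ds.n_items)   # number of distinct users and movies
print(ds[0])                    # (tensor([userIndex, movieIndex]), tensor(rating))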

Matrix Factorisation

The MatrixFactorization class inherits from a PyTorch module. As you can see, it's quite simple. The __init__ constructor creates the two embedding layers, and the forward method takes the dot product of the user and item embeddings. The forward method is called during training to perform the forward pass.

import torch
import numpy as np

class MatrixFactorization(torch.nn.Module):
  def __init__(self, n_users, n_items, n_factors=20):
    super().__init__()
    # Create the embeddings that will be trained
    self.user_factors = torch.nn.Embedding(n_users, n_factors)
    self.item_factors = torch.nn.Embedding(n_items, n_factors)

    # Initialise the embeddings with small random weights
    self.user_factors.weight.data.uniform_(0, 0.05)
    self.item_factors.weight.data.uniform_(0, 0.05)

  def forward(self, data):
    # First column holds the user indices, second column the item indices
    users, items = data[:, 0], data[:, 1]

    user_embedding = self.user_factors(users)
    item_embedding = self.item_factors(items)

    # Element-wise multiply and sum over the factor dimension:
    # a batched dot product, giving one predicted rating per row
    dot_product = (user_embedding * item_embedding).sum(1)

    return dot_product

  def predict(self, user, item):
    data = torch.tensor([[user, item]], dtype=torch.long)
    return self.forward(data)
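As a quick sanity check, the model can be run on a hand-made batch (a toy example with made-up sizes, separate from the real model trained below):

toy_model = MatrixFactorization(n_users=5, n_items=7, n_factors=3)

# A batch of two (userIndex, itemIndex) pairs, in the same layout as the Dataset's x tensor
batch = torch.tensor([[0, 2],
                      [4, 6]], dtype=torch.long)

print(toy_model(batch))         # two predicted ratings, one per pair
print(toy_model.predict(1, 3))  # a single user/item prediction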

Train Test Split

Here, we split our dataset 80/20 into training and test sets. We're using a standard mean squared error (MSE) loss function and the Adam optimiser. We're passing in 8 for the number of factors. This tells the model to try to find 8 factors, or features, that it can use to make predictions. The embeddings holding these factors are the all-important weights of the model.

We're training for 128 epochs. This means the training loop makes 128 complete passes over the training data, performing a forward and backward pass for each batch and updating the weights each time as it descends the gradient (computed via back-propagation), trying to lower the value of the loss function.

from torch.utils.data import DataLoader, random_split

# Load the full dataset, then split it 80/20 into training and testing sets
dataset = MovieDataset()

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

# Define the loss function, model and optimizer
epochs = 128
loss_fn = torch.nn.MSELoss()
model = MatrixFactorization(dataset.n_users, dataset.n_items, n_factors=8)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(epochs):
    model.train()
    epoch_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        outputs = model(x)
        loss = loss_fn(outputs, y.type(torch.float32))

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Accumulate the loss
        epoch_loss += loss.item()

    # Calculate the average loss for the epoch
    epoch_loss = epoch_loss / len(train_loader)

    # Print the training loss for this epoch
    print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss}")

# Evaluation function
def evaluate(model, loader):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for x, y in loader:
            predictions = model(x)
            loss = loss_fn(predictions, y.type(torch.float32))
            total_loss += loss.item()

    return total_loss / len(loader)

# Evaluate on the test set
test_loss = evaluate(model, test_loader)
print(f"Test Loss: {test_loss}")        

Predict

We can now use the model to predict ratings for user/movie pairs. Note that the raw ids have to be translated into their encoded indices before they're passed to the model.

def predict(userId, movieId):
  # Translate the raw ids into the encoded indices the embeddings were trained on
  movieIndex = dataset.d['movieId'].transform([movieId])[0]
  userIndex = dataset.d['userId'].transform([userId])[0]
  predicted_rating = model.predict(userIndex, movieIndex).item()
  print(f"Predicted rating for user {userId} and item {movieId}: {predicted_rating:.4f}")

predict(1, 1)
predict(1, 2)

Predicted rating for user 1 and item 1: 3.6863

Predicted rating for user 1 and item 2: 3.1846
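To turn these point predictions into an actual recommendation list (as described at the start), one simple approach is to score every movie for a user and take the top few. A minimal sketch, assuming the trained model and dataset from above (the recommend function and its top_n parameter are introduced here for illustration, and it doesn't filter out movies the user has already rated):

def recommend(userId, top_n=5):
  # Encode the raw userId, then pair it with every movie index
  userIndex = int(dataset.d['userId'].transform([userId])[0])
  users = torch.full((dataset.n_items,), userIndex, dtype=torch.long)
  items = torch.arange(dataset.n_items, dtype=torch.long)

  # Score all movies for this user in one forward pass
  with torch.no_grad():
    scores = model(torch.stack([users, items], dim=1))

  # Take the highest-scoring indices and map them back to raw movieIds
  best = scores.argsort(descending=True)[:top_n]
  return dataset.d['movieId'].inverse_transform(best.numpy())

print(recommend(1))  # the top 5 movieIds for user 1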



