Recommender System using Matrix Factorisation
Matrix factorisation is a common technique used in recommender systems to predict user preferences or ratings. If you can predict how a user is likely to rate something, you can then show them a list of recommendations.
Imagine a big table where rows are users, columns are items (like movies or products), and the cells contain ratings or preferences. Many cells in this table are empty because not every user has rated every item. Matrix factorisation works by breaking this large table into two smaller tables: one representing users and the other representing items. These smaller tables capture the underlying factors that influence user preferences, like genre preferences in movies. Note: you don't have to manually specify the factors; the model "works them out" during training (via gradient descent).
By taking the dot product of the corresponding rows of these smaller tables (an element-wise multiplication followed by a sum), we can estimate the missing ratings in the original table, helping us recommend items that users are likely to enjoy.
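As a rough illustration (separate from the code below, with hand-picked rather than learned factor matrices), here is how two small tables reconstruct a full ratings table:

```python
import numpy as np

# Hypothetical example: 3 users, 4 movies, 2 latent factors.
# In a real system these factor matrices are learned during training;
# here they are hand-picked for illustration.
user_factors = np.array([
    [0.9, 0.1],   # user 0 strongly prefers factor 0 (say, action)
    [0.2, 0.8],   # user 1 prefers factor 1 (say, romance)
    [0.5, 0.5],   # user 2 likes both
])
item_factors = np.array([
    [4.0, 1.0],   # movie 0 leans heavily on factor 0
    [1.0, 4.0],   # movie 1 leans heavily on factor 1
    [3.0, 3.0],   # movie 2 mixes both
    [0.5, 0.5],   # movie 3 is weak on both
])

# Reconstruct the full ratings table: each cell is the dot product
# of one user row with one item row.
predicted = user_factors @ item_factors.T
print(predicted.shape)   # one rating estimate for every (user, movie) cell

# A single cell, e.g. user 0 / movie 1, is just:
print(np.dot(user_factors[0], item_factors[1]))
```

This fills every cell of the table, including the ones that were empty, which is exactly what lets us rank unseen items for a user.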
Dataset
Here we're using a small MovieLens dataset. We're also using the Dataset and DataLoader classes from PyTorch to make batching a bit easier.
An important bit to note is the encoding of the userIds and movieIds. We encode them into consecutive numerical indices starting from zero because the embedding layers require these indices to map each user and movie to their respective embedding vectors correctly.
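A quick sketch of what that encoding does (a toy example, not the dataset class itself): scikit-learn's LabelEncoder maps arbitrary, possibly gappy ids onto consecutive indices starting at zero, which is what an embedding table can be indexed with.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical movieIds with gaps - raw ids like these can't be used
# to index an embedding table directly.
raw_ids = [3, 101, 7, 101, 3]

enc = LabelEncoder()
encoded = enc.fit_transform(raw_ids)
print(list(encoded))       # consecutive indices starting at 0
print(list(enc.classes_))  # the original ids, in sorted order

# inverse_transform maps the indices back to the original ids
print(list(enc.inverse_transform([0, 1, 2])))
```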
import zipfile
from collections import defaultdict

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder

class MovieDataset(Dataset):
    def __init__(self):
        # Download the popular MovieLens dataset
        ! curl http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o ml-latest-small.zip
        with zipfile.ZipFile('ml-latest-small.zip', 'r') as zip:
            zip.extractall('data')
        movies_df = pd.read_csv('data/ml-latest-small/movies.csv')
        ratings_df = pd.read_csv('data/ml-latest-small/ratings.csv')
        # Label encode the ids (the encodings will then match the indexes of the embeddings)
        self.d = defaultdict(LabelEncoder)
        for c in ['userId', 'movieId']:
            # Fit the encoder on the unique ids
            self.d[c].fit(ratings_df[c].unique())
            # Swap out the ids for the encoded values
            ratings_df[c] = self.d[c].transform(ratings_df[c])
        self.x = ratings_df.drop(['rating', 'timestamp'], axis=1).values
        self.y = ratings_df['rating'].values
        self.x, self.y = torch.tensor(self.x), torch.tensor(self.y)
        users = ratings_df.userId.unique()
        movies = ratings_df.movieId.unique()
        self.n_users = len(users)
        self.n_items = len(movies)

    def __getitem__(self, index):
        return (self.x[index], self.y[index])

    def __len__(self):
        return len(self.x)
Matrix Factorisation
The MatrixFactorization class inherits from a PyTorch module. As you can see, it's quite simple. The __init__ constructor creates the two embedding layers; the forward method, which is called during training to perform the forward pass, computes the dot product of the corresponding user and item embeddings.
import torch

class MatrixFactorization(torch.nn.Module):
    def __init__(self, n_users, n_items, n_factors=20):
        super().__init__()
        # Create the embeddings that will be trained
        self.user_factors = torch.nn.Embedding(n_users, n_factors)
        self.item_factors = torch.nn.Embedding(n_items, n_factors)
        # Initialise to small random weights
        self.user_factors.weight.data.uniform_(0, 0.05)
        self.item_factors.weight.data.uniform_(0, 0.05)

    def forward(self, data):
        users, items = data[:, 0], data[:, 1]
        user_embedding = self.user_factors(users)
        item_embedding = self.item_factors(items)
        # Element-wise product summed over the factor dimension = dot product
        dot_product = (user_embedding * item_embedding).sum(1)
        return dot_product

    def predict(self, user, item):
        data = torch.tensor([[user, item]], dtype=torch.long)
        return self.forward(data)
Train Test Split
Here, we split our dataset 80/20 into training and test sets. We're using a standard mean squared error loss function and the Adam optimiser. We're passing in 8 for the number of factors. This tells the model to try and find 8 factors or features that can be used to make predictions. These are the all-important weights of the model.
We're using 128 epochs, meaning training loops over the full training set 128 times, performing a forward and backward pass on each batch and updating the weights each time as it descends the gradient (computed in the back-propagation step using calculus), trying to lower the value of the loss function.
from torch.utils.data import DataLoader, random_split

train_set = MovieDataset()

# Split the dataset into training and testing sets
train_size = int(0.8 * len(train_set))
test_size = len(train_set) - train_size
train_dataset, test_dataset = random_split(train_set, [train_size, test_size])

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

# Define the loss function, model and optimiser
epochs = 128
loss_fn = torch.nn.MSELoss()
model = MatrixFactorization(train_set.n_users, train_set.n_items, n_factors=8)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(epochs):
    model.train()
    epoch_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        outputs = model(x)
        loss = loss_fn(outputs, y.type(torch.float32))
        # Backward pass and optimisation
        loss.backward()
        optimizer.step()
        # Accumulate the loss
        epoch_loss += loss.item()
    # Calculate and print the average loss for the epoch
    epoch_loss = epoch_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss}")

# Evaluation function
def evaluate(model, loader):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for x, y in loader:
            predictions = model(x)
            loss = loss_fn(predictions, y.type(torch.float32))
            total_loss += loss.item()
    return total_loss / len(loader)

# Evaluate on the test set
test_loss = evaluate(model, test_loader)
print(f"Test Loss: {test_loss}")
Predict
We can now use the model to predict ratings for user/movie pairs.
def predict(userId, movieId):
    # Map the raw ids to the encoded indices used by the embedding layers
    userIndex = train_set.d['userId'].transform([userId])[0]
    movieIndex = train_set.d['movieId'].transform([movieId])[0]
    # Pass the encoded indices (not the raw ids) to the model
    predicted_rating = model.predict(userIndex, movieIndex)
    print(f"Predicted rating for user {userId} and item {movieId}: {predicted_rating}")
predict(1, 1)
predict(1, 2)
Predicted rating for user 1 and item 1: [3.6863]
Predicted rating for user 1 and item 2: [3.1846]
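To turn single predictions into a recommendation list, we can score every item for a user and keep the highest-scoring ones. Here's a self-contained sketch of that idea: it uses a small stand-in model with random (untrained) embeddings and a hypothetical recommend_top_k helper, so the rankings are meaningless until a trained model like the one above is swapped in.

```python
import torch

class MatrixFactorization(torch.nn.Module):
    # Minimal stand-in for the model defined earlier
    def __init__(self, n_users, n_items, n_factors=8):
        super().__init__()
        self.user_factors = torch.nn.Embedding(n_users, n_factors)
        self.item_factors = torch.nn.Embedding(n_items, n_factors)

    def forward(self, data):
        users, items = data[:, 0], data[:, 1]
        return (self.user_factors(users) * self.item_factors(items)).sum(1)

def recommend_top_k(model, user_index, n_items, k=5):
    # Score this one user against every item in a single batch
    items = torch.arange(n_items)
    users = torch.full_like(items, user_index)
    with torch.no_grad():
        scores = model(torch.stack([users, items], dim=1))
    # Highest predicted ratings first
    _, top_items = torch.topk(scores, k)
    return top_items.tolist()

model = MatrixFactorization(n_users=10, n_items=50)
print(recommend_top_k(model, user_index=0, n_items=50, k=5))
```

In practice you would also filter out items the user has already rated, and map the recommended item indices back to raw movieIds with the inverse of the label encoding.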