Upper Confidence Bound-1 Algorithm (Reinforcement Learning Basics)
When you start to dive deeper into Machine Learning, you are bound to hear the term Reinforcement Learning. A dog is trained by rewarding her with treats when she does the right thing; similarly, we reward the machine as it moves toward the goal we set. This style of teaching is called Reinforcement Learning.
A good first step towards understanding Reinforcement Learning is to make sense of the Upper Confidence Bound (UCB) algorithm.
UCB is applied to solve one of the famous problems in the field, the 'Multi-Armed Bandit' problem. Back in the day, a slot machine in a casino was nicknamed a 'one-armed bandit' (makes sense), and a whole row of them gives us the multi-armed version.
Problem description: You have been given the freedom to use n slot machines in a casino, and as a professional gambler (are you?), your goal is to maximize the total reward gained from the slot machines.
Let's think about it a bit. Some slot machines in a casino are rigged to pay out as little as possible. To estimate the reward probabilities of the given machines, we need to spend a lot of time and coins playing them. This step is called Exploration. After gaining information about their reward distributions, we can spend the rest of the coins on the machine with the highest estimated mean reward. This step is called Exploitation. Please read the Wikipedia article below for more information on the different ways of approaching the problem-
An effective way of solving this type of problem and maximizing profit is the Upper Confidence Bound algorithm. A dilemma arises between exploration and exploitation because the gambler cannot spend the same coin on both at once. The UCB algorithm resolves this exploration-exploitation dilemma. Please read the GeeksforGeeks article below for the maths behind solving this problem-
I've implemented the UCB-1 algorithm in Python. It uses the upper confidence bounds to tell us which machine to play next, balancing exploration and exploitation while maximizing profit. The GitHub repository below contains the .ipynb notebook with the UCB code in it-
Initially, UCB tries every machine and computes an upper confidence bound for each one:
average_reward = Sum_of_rewards[i] / Number_of_selections[i] #Exploitation
delta_i = math.sqrt(3/2 * math.log(n + 1) / Number_of_selections[i]) #Exploration
upper_bound = average_reward + delta_i
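The three lines above can be put together into a runnable sketch. Note that everything here is illustrative: the machine win probabilities, the Bernoulli (win/lose) reward model, and the function name ucb1 are my own assumptions, not taken from the notebook mentioned above.

```python
import math
import random

def ucb1(true_probs, n_rounds, seed=0):
    """Simulate UCB-1 on slot machines with hypothetical win probabilities."""
    random.seed(seed)
    d = len(true_probs)
    counts = [0] * d      # Number_of_selections per machine
    sums = [0.0] * d      # Sum_of_rewards per machine
    selections = []       # which machine was picked each round
    for n in range(n_rounds):
        best, best_ub = 0, -1.0
        for i in range(d):
            if counts[i] == 0:
                ub = float('inf')  # force each machine to be tried once
            else:
                average_reward = sums[i] / counts[i]                           # Exploitation
                delta_i = math.sqrt(3/2 * math.log(n + 1) / counts[i])         # Exploration
                ub = average_reward + delta_i
            if ub > best_ub:
                best, best_ub = i, ub
        # Bernoulli reward: coin returned with the machine's (assumed) probability
        reward = 1 if random.random() < true_probs[best] else 0
        counts[best] += 1
        sums[best] += reward
        selections.append(best)
    return counts, sums, selections

counts, sums, picks = ucb1([0.2, 0.5, 0.8], 1000)
```

After 1000 rounds, the machine with the highest win probability ends up with the most selections, while the weaker machines are sampled just often enough to stay confident they are worse.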
After enough exploration, UCB tells us to pick, on each move, the machine with the maximum upper confidence bound.
Finally, we plot a histogram of how often each machine was picked over the N rounds. The most picked machine is the one that yielded the most reward and profit, so we can spend the rest of our coins exploiting it.
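A minimal sketch of that last step, where the list of selections is a hypothetical log of which machine was chosen each round (not real output):

```python
from collections import Counter

selections = [2, 0, 2, 1, 2, 2, 1, 2]  # hypothetical per-round picks
tally = Counter(selections)
most_picked, times = tally.most_common(1)[0]
print(f"Machine {most_picked} was picked {times} times")

# To draw the actual histogram (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.hist(selections, bins=range(max(selections) + 2))
# plt.show()
```

The Counter alone already answers the question the histogram is meant to answer: which machine should receive the remaining coins.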
To conclude, this toy version of the algorithm may not be directly practical, but it helps us understand how reinforcement learning works, and the same idea is applied in various real-life applications such as clinical trials, movie recommendations, advertisements, etc.
Thank you for reading the Article.