Upper Confidence Bound-1 Algorithm (Reinforcement Learning Basics)
[Image: Sony Aibo robot dog — image credit Sony]


When you start to dive deeper into Machine Learning, you are bound to hear the term Reinforcement Learning. A dog is trained by rewarding her with treats when she does the right thing; similarly, we reward a machine as it moves toward an estimated goal. This style of teaching is called Reinforcement Learning.

The first step to understanding Reinforcement Learning is to make sense of the Upper Confidence Bound (UCB) algorithm.

UCB is applied to solve a famous problem called the 'Multi-Armed Bandit.' Back in the day, a slot machine in a casino was called a one-armed bandit (makes sense), and a row of them forms a multi-armed bandit.

Problem description: you have been given the freedom to use n slot machines in a casino, and as a professional gambler (are you?), your goal is to maximize the rewards gained from those machines.

Let's think about it a bit. Some slot machines in casinos are rigged to some extent to ensure the gambler receives minimal rewards. But to learn the payout probabilities of the given machines, we need to spend a lot of time and coins on them. This step is called Exploration. After gaining information about their probability distributions, we can spend the rest of the coins on the machine with the highest mean reward and the most favorable distribution (left-skewed, with most of its mass at higher rewards). Please see the Wikipedia article on the multi-armed bandit problem for more information on the different ways of approaching it.
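To make the exploration step concrete, here is a minimal sketch of pure exploration: pull every machine the same number of times and estimate its payout probability from the observed rewards. The win probabilities below are hypothetical and are used only to simulate the pulls (a real gambler would not know them).

```python
import random

# Hypothetical true win probabilities of three slot machines
# (unknown to the gambler; used here only to simulate pulls).
rng = random.Random(42)
true_probs = [0.2, 0.5, 0.75]
pulls_per_machine = 500

estimates = []
for p in true_probs:
    # Each pull pays 1 coin with probability p, else 0.
    wins = sum(1 for _ in range(pulls_per_machine) if rng.random() < p)
    estimates.append(wins / pulls_per_machine)

# The empirical means approach the true probabilities, but we "spent"
# 500 coins on every machine, including the bad ones.
```

This is the cost of exploration: the estimates become accurate, but a lot of coins go into machines that turn out to be poor.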

An effective way to solve this type of problem and maximize profit is the Upper Confidence Bound algorithm. A dilemma arises between exploration and exploitation because the gambler cannot both explore and exploit at the same time. The UCB algorithm is designed to resolve this exploration-exploitation dilemma. Please see the GeeksforGeeks article on UCB for the maths behind solving this problem.

I've implemented the UCB-1 algorithm in Python. It uses the upper bounds to tell us which machine to play next, balancing exploration and exploitation while maximizing profit. The GitHub repository below contains the .ipynb notebook with the UCB code:

ReinforcementLearning/Multi_Armed_Bandit_Problem.ipynb at main · XzeraAnu/ReinforcementLearning · GitHub

Initially, UCB explores all the machines and determines the upper confidence bound for each one.

import math  # needed for sqrt and log

average_reward = Sum_of_rewards[i] / Number_of_selections[i]  # Exploitation term

delta_i = math.sqrt(3/2 * math.log(n + 1) / Number_of_selections[i])  # Exploration term

upper_bound = average_reward + delta_i
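Putting the pieces above into a full loop, here is a minimal, self-contained sketch of UCB-1 on simulated Bernoulli slot machines. The win probabilities and function name are assumptions for illustration; the per-round update follows the same exploitation and exploration terms shown above.

```python
import math
import random

def ucb1(true_probs, n_rounds, seed=0):
    """Run UCB-1 on simulated slot machines.

    true_probs: hypothetical win probability of each machine (unknown
    to the algorithm; used here only to simulate pulls).
    Returns per-machine selection counts and the total reward.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    selections = [0] * k        # Number_of_selections
    sums = [0.0] * k            # Sum_of_rewards
    total_reward = 0.0
    for n in range(n_rounds):
        # Pick the machine with the highest upper confidence bound.
        best_i, best_ucb = 0, -1.0
        for i in range(k):
            if selections[i] == 0:
                ucb = float('inf')  # force one pull of every machine first
            else:
                avg = sums[i] / selections[i]                              # exploitation
                delta = math.sqrt(3/2 * math.log(n + 1) / selections[i])   # exploration
                ucb = avg + delta
            if ucb > best_ucb:
                best_i, best_ucb = i, ucb
        # Simulate pulling the chosen machine's arm.
        reward = 1.0 if rng.random() < true_probs[best_i] else 0.0
        selections[best_i] += 1
        sums[best_i] += reward
        total_reward += reward
    return selections, total_reward

selections, total = ucb1([0.2, 0.5, 0.75], n_rounds=2000)
# After enough rounds, the machine with the highest true probability
# (index 2 here) is selected far more often than the others.
```

Note how the exploration term shrinks as a machine is selected more often, so the algorithm gradually shifts its coins toward the machine with the best observed average.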

Once it has explored enough, UCB tells us to pick, on each subsequent move, the machine with the maximum upper confidence bound.

Finally, we plot a histogram of how often each machine was picked over the N rounds. This shows which machine yielded the most reward and profit, so we can spend the rest of the coins exploiting that machine.

To conclude, this algorithm might not seem immediately practical on its own, but it helps us understand how reinforcement learning works, and it can be applied to various real-life applications such as clinical trials, movie recommendation, and advertising.

Thank you for reading the article.
