Upper Confidence Bound-1 Algorithm (Reinforcement Learning Basics)
When you start to dive deeper into Machine Learning, you are bound to hear the term Reinforcement Learning. A dog is trained by rewarding her with treats when she does the right thing; similarly, we reward the machine as it moves toward the goal we set. This style of teaching is called Reinforcement Learning.
A good first step towards understanding Reinforcement Learning is to make sense of the Upper Confidence Bound (UCB) algorithm.
UCB is applied to solve one of the famous problems in the field, the 'Multi-Armed Bandit' problem. Back in the day, a slot machine in a casino was nicknamed a 'one-armed bandit' (makes sense), and a whole row of them gives us the multi-armed version.
Problem description: You have been given the freedom to use n slot machines in a casino, and as a professional gambler (are you?), your goal is to maximize the total reward gained from the slot machines.
Let's think about it a bit. Some slot machines in a casino are rigged to pay out as little as possible. To estimate the reward probabilities of the given machines, we need to spend a lot of time and coins playing them. This step is called Exploration. After gaining information about their reward distributions, we can spend the rest of the coins on the machine with the highest estimated mean reward. This step is called Exploitation. Please read the Wikipedia article below for more information on the different ways of approaching the problem-
An effective way of solving this type of problem and maximizing profit is the Upper Confidence Bound algorithm. A dilemma arises between exploration and exploitation because the gambler cannot spend the same coin on both at once. The UCB algorithm resolves this exploration-exploitation dilemma. Please read the GeeksforGeeks article below for the maths behind solving this problem-
I've implemented the UCB-1 algorithm in Python. It uses the upper confidence bounds to tell us which machine to play next, balancing exploration and exploitation while maximizing profit. The GitHub repository below contains the .ipynb notebook with the UCB code in it-
Initially, UCB tries every machine and computes an upper confidence bound for each one:
average_reward = Sum_of_rewards[i] / Number_of_selections[i] #Exploitation
delta_i = math.sqrt(3/2 * math.log(n + 1) / Number_of_selections[i]) #Exploration
upper_bound = average_reward + delta_i
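The three lines above can be put together into a runnable sketch. Note that everything here is illustrative: the machine win probabilities, the Bernoulli (win/lose) reward model, and the function name ucb1 are my own assumptions, not taken from the notebook mentioned above.

```python
import math
import random

def ucb1(true_probs, n_rounds, seed=0):
    """Simulate UCB-1 on slot machines with hypothetical win probabilities."""
    random.seed(seed)
    d = len(true_probs)
    counts = [0] * d      # Number_of_selections per machine
    sums = [0.0] * d      # Sum_of_rewards per machine
    selections = []       # which machine was picked each round
    for n in range(n_rounds):
        best, best_ub = 0, -1.0
        for i in range(d):
            if counts[i] == 0:
                ub = float('inf')  # force each machine to be tried once
            else:
                average_reward = sums[i] / counts[i]                           # Exploitation
                delta_i = math.sqrt(3/2 * math.log(n + 1) / counts[i])         # Exploration
                ub = average_reward + delta_i
            if ub > best_ub:
                best, best_ub = i, ub
        # Bernoulli reward: coin returned with the machine's (assumed) probability
        reward = 1 if random.random() < true_probs[best] else 0
        counts[best] += 1
        sums[best] += reward
        selections.append(best)
    return counts, sums, selections

counts, sums, picks = ucb1([0.2, 0.5, 0.8], 1000)
```

After 1000 rounds, the machine with the highest win probability ends up with the most selections, while the weaker machines are sampled just often enough to stay confident they are worse.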
After enough exploration, UCB tells us to pick, on each move, the machine with the maximum upper confidence bound.
Finally, we plot a histogram of how often each machine was picked over the N rounds. The most picked machine is the one that yielded the most reward and profit, so we can spend the rest of our coins exploiting it.
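A minimal sketch of that last step, where the list of selections is a hypothetical log of which machine was chosen each round (not real output):

```python
from collections import Counter

selections = [2, 0, 2, 1, 2, 2, 1, 2]  # hypothetical per-round picks
tally = Counter(selections)
most_picked, times = tally.most_common(1)[0]
print(f"Machine {most_picked} was picked {times} times")

# To draw the actual histogram (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.hist(selections, bins=range(max(selections) + 2))
# plt.show()
```

The Counter alone already answers the question the histogram is meant to answer: which machine should receive the remaining coins.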
To conclude, this toy version of the algorithm may not be directly practical, but it helps us understand how reinforcement learning works, and the same idea is applied in various real-life applications such as clinical trials, movie recommendations, advertisements, etc.
Thank you for reading the Article.