Upper Confidence Bound

Upper Confidence Bound

UCB is a Reinforcement algorithm which uses simple strategy to maximize our profits or score. In simple, We can say that It solves Multi-Armed Bandit Problem.

MULTI-ARMED BANDIT problem:

This is a famous game in which there will be many multi armed bandit machines. You have to select a machine and have to pull the arm. So that your score will be recorded from the machine you choose. So to maximize our score we have to choose the best machine for every round, so we can book good profits. As we can't guess the machine which gives good profit, we have to go for randoms.

No alt text provided for this image

But using machine learning and AI we can do wonders.Here we can use UCB algo which can solve this game and can maximize our profits greatly.

Instead of performing exploration by simply selecting an arbitrary action, chosen with a probability that remains constant, the UCB algorithm changes its exploration-exploitation balance as it gathers more knowledge of the environment. It moves from being primarily focused on exploration, when actions that have been tried the least are preferred, to instead concentrate on exploitation, selecting the action with the highest estimated reward.

With UCB, ‘Aₜ’, the action chosen at time step ‘t’, is given by:

No alt text provided for this image


where:

  • Qₜ(a) is the estimated value of action ‘a’ at time step ‘t’.
  • Nₜ(a) is the number of times that action ‘a’ has been selected, prior to time ‘t’.
  • ‘c’ is a confidence value that controls the level of exploration.

The formula for UCB can be thought of as being formed from 2 distinct parts:

Exploitation:

  • Qₜ(a) represents the exploitation part of the equation. UCB is based on the principle of “optimism in the fact of uncertainty”, which basically means if you don’t know which action is best then choose the one that currently looks to be the best. Taking this half of the equation by itself will do exactly that: the action that currently has the highest estimated reward will be the chosen action.

Exploration:

  • The second half of the equation adds exploration, with the degree of exploration being controlled by the hyper-parameter ‘c’. Effectively this part of the equation provides a measure of the uncertainty for the action’s reward estimate.
  • If an action hasn’t been tried very often, or not at all, then Nₜ(a) will be small. Consequently the uncertainty term will be large, making this action more likely to be selected. Every time an action is taken we become more confident about its estimate. In this case Nₜ(a) increments, and so the uncertainty term decreases, making it less likely that this action will be selected as a result of exploration (although it may still be selected as the action with the highest value, due to the exploitation term).
  • When an action is not being selected, the uncertainty term will grow slowly, due to the log function in the numerator. Whereas, every time that the action is selected, the uncertainty will shrink rapidly due to the increase in Nₜ(a) being linear. So the exploration term will be larger for actions that have been selected infrequently, due to the uncertainty in the estimates of their rewards.
  • As time progresses the exploration term gradually decreases (since as ’n’ goes to infinity log n/n goes to zero), until eventually actions are selected based only on the exploitation term.

The Upper Confidence Bound follows the principle of optimism in the face of uncertainty which implies that if we are uncertain about an action, we should optimistically assume that it is the correct action.



Initially, UCB explores more to systematically reduce uncertainty but its exploration reduces over time. Thus we can say that UCB obtains greater reward on average than other algorithms such as Epsilon-greedy, Optimistic Initial Values, etc.



To view or add a comment, sign in

More articles by Jithendra Katta

  • Deep Q-Learning

    This article will talk about Reinforcement Learning and Deep Q-Learning . I assume that readers have a good…

  • User Research & Field Study on E-Commerce BS

    Project Batch Number : SDP2 GROUP 107 - ECS Project Business System : E-commerce Project name : E-Store E-Commerce…

Others also viewed

Explore content categories