Upper Confidence Bound

Jithendra Katta

Published Oct 29, 2021

UCB is a Reinforcement algorithm which uses simple strategy to maximize our profits or score. In simple, We can say that It solves Multi-Armed Bandit Problem.

MULTI-ARMED BANDIT problem:

This is a famous game in which there will be many multi armed bandit machines. You have to select a machine and have to pull the arm. So that your score will be recorded from the machine you choose. So to maximize our score we have to choose the best machine for every round, so we can book good profits. As we can't guess the machine which gives good profit, we have to go for randoms.

But using machine learning and AI we can do wonders.Here we can use UCB algo which can solve this game and can maximize our profits greatly.

Instead of performing exploration by simply selecting an arbitrary action, chosen with a probability that remains constant, the UCB algorithm changes its exploration-exploitation balance as it gathers more knowledge of the environment. It moves from being primarily focused on exploration, when actions that have been tried the least are preferred, to instead concentrate on exploitation, selecting the action with the highest estimated reward.

With UCB, ‘Aₜ’, the action chosen at time step ‘t’, is given by:

where:

Qₜ(a) is the estimated value of action ‘a’ at time step ‘t’.
Nₜ(a) is the number of times that action ‘a’ has been selected, prior to time ‘t’.
‘c’ is a confidence value that controls the level of exploration.

Recommended by LinkedIn

Invention as Exploration

Michael Frank Martin 8 months ago

When Matt Shumer Meets Mine Reconciliation: What AI…

Julia Oliveira 2 months ago

Visualise Identify Understand: practical examples

Reda Meftahi 8 years ago

The formula for UCB can be thought of as being formed from 2 distinct parts:

Exploitation:

Qₜ(a) represents the exploitation part of the equation. UCB is based on the principle of “optimism in the fact of uncertainty”, which basically means if you don’t know which action is best then choose the one that currently looks to be the best. Taking this half of the equation by itself will do exactly that: the action that currently has the highest estimated reward will be the chosen action.

Exploration:

The second half of the equation adds exploration, with the degree of exploration being controlled by the hyper-parameter ‘c’. Effectively this part of the equation provides a measure of the uncertainty for the action’s reward estimate.
If an action hasn’t been tried very often, or not at all, then Nₜ(a) will be small. Consequently the uncertainty term will be large, making this action more likely to be selected. Every time an action is taken we become more confident about its estimate. In this case Nₜ(a) increments, and so the uncertainty term decreases, making it less likely that this action will be selected as a result of exploration (although it may still be selected as the action with the highest value, due to the exploitation term).
When an action is not being selected, the uncertainty term will grow slowly, due to the log function in the numerator. Whereas, every time that the action is selected, the uncertainty will shrink rapidly due to the increase in Nₜ(a) being linear. So the exploration term will be larger for actions that have been selected infrequently, due to the uncertainty in the estimates of their rewards.
As time progresses the exploration term gradually decreases (since as ’n’ goes to infinity log n/n goes to zero), until eventually actions are selected based only on the exploitation term.

The Upper Confidence Bound follows the principle of optimism in the face of uncertainty which implies that if we are uncertain about an action, we should optimistically assume that it is the correct action.

Initially, UCB explores more to systematically reduce uncertainty but its exploration reduces over time. Thus we can say that UCB obtains greater reward on average than other algorithms such as Epsilon-greedy, Optimistic Initial Values, etc.

To view or add a comment, sign in

Upper Confidence Bound

Jithendra Katta

Recommended by LinkedIn

More articles by Jithendra Katta

Others also viewed

Network analysis as a tool for integrating mineralisation data, visualisation and outlier detection

New alchemy: Turning lead into energy

Automobiles Innovation Tweets Text Mining and Analyzing. Recommendation System by Using Machine Learning.

AI Cannot Do This For You

The crop circles’ mystery finally solved… The revelations are shocking.

Julia´s Study Guide - Ensembling in Machine Learning: Teamwork of Models Driving Smarter Mining

From Pit to Pixel: Unearthing AI's Socio-Technical Roots

More information, less accurate a prediction?

Why Feature mining is the most important thing in data science?

INTRODUCTION TO COMPLEXITY SCIENCE

Explore content categories

Recommended by LinkedIn

More articles by Jithendra Katta

Deep Q-Learning

User Research & Field Study on E-Commerce BS

Others also viewed

Network analysis as a tool for integrating mineralisation data, visualisation and outlier detection

New alchemy: Turning lead into energy

Automobiles Innovation Tweets Text Mining and Analyzing. Recommendation System by Using Machine Learning.

AI Cannot Do This For You

The crop circles’ mystery finally solved… The revelations are shocking.

Julia´s Study Guide - Ensembling in Machine Learning: Teamwork of Models Driving Smarter Mining

From Pit to Pixel: Unearthing AI's Socio-Technical Roots

More information, less accurate a prediction?

Why Feature mining is the most important thing in data science?

INTRODUCTION TO COMPLEXITY SCIENCE

Similar topics

Machine Learning Algorithms for Quantum System Modeling

How to Optimize Machine Learning Performance

Explore content categories