Upper Confidence Bound
UCB (Upper Confidence Bound) is a reinforcement learning algorithm that uses a simple strategy to maximize our total reward. Put simply, it solves the multi-armed bandit problem.
The MULTI-ARMED BANDIT problem:
Imagine a casino floor with many slot machines ("one-armed bandits"). Each round you select a machine and pull its arm, and your score from that machine is recorded. To maximize your score you would need to play the best machine every round, but since you cannot know in advance which machine pays best, a naive strategy is to pick machines at random.
But with machine learning we can do much better: the UCB algorithm solves this game and can greatly increase our expected profit.
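To make the setting concrete, here is a minimal sketch of the game described above, using hypothetical payout probabilities and the naive random strategy as a baseline. The machine payout values and the `pull` helper are assumptions for illustration, not part of any real casino API.

```python
import random

# Each "machine" pays out with its own hidden probability (made-up values).
true_payout_probs = [0.10, 0.25, 0.55, 0.40]  # machine 2 is best, but the player doesn't know that

def pull(machine: int) -> int:
    """Pull a machine's arm; return 1 on a win, 0 otherwise."""
    return 1 if random.random() < true_payout_probs[machine] else 0

# Baseline strategy from the text: pick a machine at random every round.
random.seed(0)
rounds = 10_000
total = sum(pull(random.randrange(len(true_payout_probs))) for _ in range(rounds))
print(f"random strategy won {total} of {rounds} rounds")
```

Random play only earns the average of all machines' payout rates; a good algorithm should approach the best machine's rate instead.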
Instead of performing exploration by simply selecting an arbitrary action, chosen with a probability that remains constant, the UCB algorithm changes its exploration-exploitation balance as it gathers more knowledge of the environment. It moves from being primarily focused on exploration, when actions that have been tried the least are preferred, to instead concentrate on exploitation, selecting the action with the highest estimated reward.
With UCB, ‘Aₜ’, the action chosen at time step ‘t’, is given by:

Aₜ = argmaxₐ [ Qₜ(a) + c · √( ln t / Nₜ(a) ) ]

where:

- Qₜ(a) is the estimated value (average reward so far) of action ‘a’ at time step ‘t’
- Nₜ(a) is the number of times action ‘a’ has been selected before time step ‘t’
- ln t is the natural logarithm of the time step
- c is a confidence parameter that controls how much exploration is done
The formula for UCB can be thought of as being formed from 2 distinct parts:
Exploitation: the estimated average reward Qₜ(a). On its own, this term would always pick whichever action has paid best so far.
Exploration: the uncertainty bonus c · √( ln t / Nₜ(a) ). This term is large for actions that have been tried only a few times and shrinks as an action's count Nₜ(a) grows, nudging the algorithm to revisit under-sampled actions.
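The two parts above can be sketched directly in code. This is a minimal UCB1-style implementation under stated assumptions: the toy payout probabilities are invented, and the exploration constant `c` is a tunable assumption (common presentations use c = 2 or √2).

```python
import math
import random

def ucb_select(counts, values, t, c=2.0):
    """Pick the arm maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).

    counts[a] is N_t(a), how often arm a was pulled so far;
    values[a] is Q_t(a), its average reward so far.
    Untried arms are selected first so every term is well defined.
    """
    for a, n in enumerate(counts):
        if n == 0:  # initial exploration: try each arm once
            return a
    scores = [values[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(counts))]
    return scores.index(max(scores))

# Run it on a toy 3-armed bandit; arm 1 has the best (hidden) payout rate.
random.seed(1)
probs = [0.2, 0.7, 0.4]          # hypothetical payout probabilities
counts = [0] * 3
values = [0.0] * 3
for t in range(1, 2001):
    a = ucb_select(counts, values, t)
    reward = 1 if random.random() < probs[a] else 0
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]  # incremental average
print("pulls per arm:", counts)  # the best arm should dominate
```

Notice how the pull counts concentrate on the best arm over time, while the weaker arms still receive occasional pulls: that is the exploration term shrinking but never vanishing.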
The Upper Confidence Bound follows the principle of "optimism in the face of uncertainty": if we are uncertain about an action's value, we should optimistically assume it could be the best action and try it.
Initially, UCB explores more in order to systematically reduce uncertainty, and its exploration decreases over time. As a result, UCB often obtains a greater average reward than simpler strategies such as epsilon-greedy or optimistic initial values.