Dynamic Programming - Policy Improvement - Intro to Reinforcement Learning
Introduction
Alright, so you work through algorithms built on fierce mathematics, especially ones full of probabilities, convince yourself they are right, and then a week later find yourself asking what the algorithm was supposed to do 😁.
One of the simple algorithms whose proof I hadn't dug into until yesterday 😁 is policy iteration (specifically the policy improvement theorem part) from dynamic programming (DP), which is considered one of the fundamental concepts of Reinforcement Learning.
Policy Iteration
The policy iteration algorithm consists of two steps. The first is policy evaluation, which addresses the question "Given a certain policy, what are the estimated value functions of my states?" The second is the policy improvement step, which addresses the question "How can I improve my policy?"
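To make the evaluation step concrete, here is a minimal sketch of iterative policy evaluation. The two-state MDP, its transition format, and all the numbers are made up purely for illustration; only the update rule comes from the algorithm itself.

```python
import numpy as np

# A made-up 2-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def policy_evaluation(policy, P, gamma, theta=1e-8):
    """Iteratively compute v_pi for a deterministic policy (dict: state -> action)."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            a = policy[s]
            # Bellman expectation backup for the action the policy picks.
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop once the sweep barely changes anything
            break
    return V

policy = {0: 1, 1: 1}  # "always take action 1"
V = policy_evaluation(policy, P, gamma)
print(V)
```

For this toy chain the fixed point can be checked by hand: v(1) = 2 + 0.9 v(1) = 20, and v(0) = 1 + 0.9 · 20 = 19.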
The second step shows that we can use the Bellman optimality equation to obtain a new policy by selecting, in every state, the action that maximizes the action-value function (not the optimal action-value function, simply because we don't know it), and this new policy will be equal to or better than our old policy.
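The greedy improvement step can be sketched in a few lines. Again the toy MDP and its transition format are my own assumption, not from the article; the code just computes q_pi from v_pi and takes the argmax per state.

```python
import numpy as np

# Same made-up toy MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
V = np.array([0.0, 0.0])  # v_pi for the policy "always take action 0"

def policy_improvement(V, P, gamma):
    """Return the policy that is greedy with respect to q_pi computed from v_pi."""
    new_policy = {}
    for s in P:
        # q_pi(s, a) = sum over transitions of p * (r + gamma * v_pi(s'))
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in P[s]}
        new_policy[s] = max(q, key=q.get)  # argmax_a q_pi(s, a)
    return new_policy

pi_new = policy_improvement(V, P, gamma)
print(pi_new)
```

Starting from the all-zeros value function of "always take action 0", the greedy step switches both states to action 1, since that action has the higher one-step q-value everywhere.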
This major change to the Bellman optimality equation (replacing V*[S'] with V[S']) is still valid thanks to the policy improvement theorem.
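In the standard Sutton & Barto notation (my assumption, since the article's figures are not shown), the greedy update with the substitution looks like this: the only change from the Bellman optimality equation is that v_pi appears where v_* would.

```latex
\pi'(s) \;=\; \arg\max_a \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
```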
Policy Improvement Theorem
The Policy Improvement Theorem states that if, for every state s, choosing the action π'(s) and thereafter following π is at least as good as following π (i.e. q_π(s, π'(s)) ≥ v_π(s) for all s), then π' is as good as or better than π: v_π'(s) ≥ v_π(s) for all s.
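Before the proof, the theorem's conclusion can be checked numerically on the same kind of made-up toy MDP (everything here is my own illustrative example): evaluate the old policy and the greedy one, and confirm the greedy policy's value is at least as large in every state.

```python
import numpy as np

# Made-up 2-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def evaluate(policy, sweeps=2000):
    """Crude iterative evaluation: enough sweeps to converge on this tiny MDP."""
    V = np.zeros(len(P))
    for _ in range(sweeps):
        for s in P:
            a = policy[s]
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    return V

V_old = evaluate({0: 0, 1: 0})  # old policy pi
V_new = evaluate({0: 1, 1: 1})  # greedy policy pi'
print(V_old, V_new)
assert all(V_new >= V_old - 1e-9)  # pi' is at least as good in every state
```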
So, to prove this theorem, we return to the equation in Figure A; expressing it in terms of the action-value function yields Equation 1.
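Since the figure itself is not visible here, Equation 1 is presumably the greedy-policy identity in the standard notation (my assumption): the greedy policy's action in each state attains the maximum q-value, which is at least the value of the current policy.

```latex
q_\pi\bigl(s, \pi'(s)\bigr) \;=\; \max_a q_\pi(s, a)
\;=\; \max_a \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
\;\ge\; v_\pi(s)
```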
And by definition, the state-value function of a given state is the expectation of the action-value function over actions drawn from policy π (Equation 2).
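Written out in the usual notation (again my assumption for the missing figure), this definition is:

```latex
v_\pi(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[q_\pi(s, a)\bigr]
\;=\; \sum_a \pi(a \mid s)\, q_\pi(s, a)
```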
Combining Equations 1 and 2 yields the following inequality.
Tracing this inequality: replace its right-hand side with the expectation of the return after being in state S and taking an action drawn from π', then apply Inequality 1 again, and keep unrolling. We obtain that V[S] under π' is greater than or equal to V[S] under π for every state, which means π' is as good as or better than π. This proves the policy improvement step.
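That unrolling argument, in the notation of Sutton and Barto's proof (my reconstruction, since the article's equation images are not shown), is the following chain: each step substitutes the inequality q_π(s, π'(s)) ≥ v_π(s) one transition deeper.

```latex
\begin{aligned}
v_\pi(s) &\le q_\pi\bigl(s, \pi'(s)\bigr) \\
         &= \mathbb{E}\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t = \pi'(s)\bigr] \\
         &\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma\, q_\pi\bigl(S_{t+1}, \pi'(S_{t+1})\bigr) \mid S_t = s\bigr] \\
         &\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2\, v_\pi(S_{t+2}) \mid S_t = s\bigr] \\
         &\;\;\vdots \\
         &\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\bigr]
          \;=\; v_{\pi'}(s)
\end{aligned}
```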
References:
1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction.
2. University of Illinois Urbana-Champaign, Online Learning and Decision-Making lectures.
3. Fundamentals of Reinforcement Learning, University of Alberta (Coursera).