Dynamic Programming - Policy Improvement - Intro to Reinforcement Learning
Introduction
Alright, so you work through algorithms built on fierce mathematics, especially ones full of probabilities, convince yourself they are right, and then a week later find yourself asking what the algorithm was supposed to do 😁.
One of the simple algorithms whose proof I hadn't dug into until yesterday 😁 is policy iteration (specifically the policy improvement theorem part) from dynamic programming (DP), which is considered one of the fundamental concepts of Reinforcement Learning.
Policy Iteration
The policy iteration algorithm consists of two steps. The first is policy evaluation, which addresses the question "Given a certain policy, what are the estimated value functions of my states?" The second is the policy improvement step, which addresses the question "How can I improve my policy?"
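To make the evaluation step concrete, here is a minimal sketch of iterative policy evaluation. The two-state MDP, its transition format, and all the numbers are made up purely for illustration; only the update rule comes from the algorithm itself.

```python
import numpy as np

# A made-up 2-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def policy_evaluation(policy, P, gamma, theta=1e-8):
    """Iteratively compute v_pi for a deterministic policy (dict: state -> action)."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            a = policy[s]
            # Bellman expectation backup for the action the policy picks.
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop once the sweep barely changes anything
            break
    return V

policy = {0: 1, 1: 1}  # "always take action 1"
V = policy_evaluation(policy, P, gamma)
print(V)
```

For this toy chain the fixed point can be checked by hand: v(1) = 2 + 0.9 v(1) = 20, and v(0) = 1 + 0.9 · 20 = 19.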
The second step shows that we can use the Bellman optimality equation to obtain a new policy by selecting, in every state, the action that maximizes the action-value function (not the optimal action-value function, simply because we don't know it), and this new policy will be equal to or better than our old policy.
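The greedy improvement step can be sketched in a few lines. Again the toy MDP and its transition format are my own assumption, not from the article; the code just computes q_pi from v_pi and takes the argmax per state.

```python
import numpy as np

# Same made-up toy MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
V = np.array([0.0, 0.0])  # v_pi for the policy "always take action 0"

def policy_improvement(V, P, gamma):
    """Return the policy that is greedy with respect to q_pi computed from v_pi."""
    new_policy = {}
    for s in P:
        # q_pi(s, a) = sum over transitions of p * (r + gamma * v_pi(s'))
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in P[s]}
        new_policy[s] = max(q, key=q.get)  # argmax_a q_pi(s, a)
    return new_policy

pi_new = policy_improvement(V, P, gamma)
print(pi_new)
```

Starting from the all-zeros value function of "always take action 0", the greedy step switches both states to action 1, since that action has the higher one-step q-value everywhere.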
This major change to the Bellman optimality equation (replacing V*[S'] with V[S']) is still valid thanks to the policy improvement theorem.
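In the standard Sutton & Barto notation (my assumption, since the article's figures are not shown), the greedy update with the substitution looks like this: the only change from the Bellman optimality equation is that v_pi appears where v_* would.

```latex
\pi'(s) \;=\; \arg\max_a \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
```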
Policy Improvement Theorem
The Policy Improvement Theorem states that if, for every state s, choosing the action π'(s) and thereafter following π is at least as good as following π (i.e. q_π(s, π'(s)) ≥ v_π(s) for all s), then π' is as good as or better than π: v_π'(s) ≥ v_π(s) for all s.
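Before the proof, the theorem's conclusion can be checked numerically on the same kind of made-up toy MDP (everything here is my own illustrative example): evaluate the old policy and the greedy one, and confirm the greedy policy's value is at least as large in every state.

```python
import numpy as np

# Made-up 2-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def evaluate(policy, sweeps=2000):
    """Crude iterative evaluation: enough sweeps to converge on this tiny MDP."""
    V = np.zeros(len(P))
    for _ in range(sweeps):
        for s in P:
            a = policy[s]
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    return V

V_old = evaluate({0: 0, 1: 0})  # old policy pi
V_new = evaluate({0: 1, 1: 1})  # greedy policy pi'
print(V_old, V_new)
assert all(V_new >= V_old - 1e-9)  # pi' is at least as good in every state
```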
So, to prove this theorem, we return to the equation in Figure A; expressing it in terms of the action-value function yields Equation 1.
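Since the figure itself is not visible here, Equation 1 is presumably the greedy-policy identity in the standard notation (my assumption): the greedy policy's action in each state attains the maximum q-value, which is at least the value of the current policy.

```latex
q_\pi\bigl(s, \pi'(s)\bigr) \;=\; \max_a q_\pi(s, a)
\;=\; \max_a \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
\;\ge\; v_\pi(s)
```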
And by definition, the state-value function of a given state is the expectation of the action-value function over actions drawn from policy π (Equation 2).
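Written out in the usual notation (again my assumption for the missing figure), this definition is:

```latex
v_\pi(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[q_\pi(s, a)\bigr]
\;=\; \sum_a \pi(a \mid s)\, q_\pi(s, a)
```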
Combining Equations 1 and 2 yields the following inequality.
Tracing this inequality: replace its right-hand side with the expectation of the return after being in state S and taking an action drawn from π', then apply Inequality 1 again, and keep unrolling. We obtain that V[S] under π' is greater than or equal to V[S] under π for every state, which means π' is as good as or better than π. This proves the policy improvement step.
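That unrolling argument, in the notation of Sutton and Barto's proof (my reconstruction, since the article's equation images are not shown), is the following chain: each step substitutes the inequality q_π(s, π'(s)) ≥ v_π(s) one transition deeper.

```latex
\begin{aligned}
v_\pi(s) &\le q_\pi\bigl(s, \pi'(s)\bigr) \\
         &= \mathbb{E}\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t = \pi'(s)\bigr] \\
         &\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma\, q_\pi\bigl(S_{t+1}, \pi'(S_{t+1})\bigr) \mid S_t = s\bigr] \\
         &\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2\, v_\pi(S_{t+2}) \mid S_t = s\bigr] \\
         &\;\;\vdots \\
         &\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\bigr]
          \;=\; v_{\pi'}(s)
\end{aligned}
```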
References:
1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction.
2. University of Illinois Urbana-Champaign, Online Learning and Decision-Making lectures.
3. Fundamentals of Reinforcement Learning, University of Alberta (Coursera).