L-14 Learning Agents Using Q-Learning (Theory & Code) | Complete Agentic AI Course

A learning agent is a system that improves its behavior over time by learning from experience. One of the most widely used methods for training such agents is Q-learning, which lets an agent learn, through trial and error, which action is best in each situation.

Imagine an agent moving through a maze. It doesn’t know the way out at first. But over time, it learns which paths lead to the goal and which ones lead to dead ends. It does this without being told the correct route. It just tries things, sees what happens, and learns from it.

This is exactly what Q-learning does. It helps the agent learn how good each action is in each state, based on what it experiences.

The Big Idea Behind Q-Learning

In reinforcement learning, the agent wants to maximize the total reward it receives over time. To do that, it must learn which actions give better results. Q-learning gives the agent a way to estimate the quality of each action in each state. This estimate is stored in a table called the Q-table.

Each entry in the Q-table represents a guess:

If I'm in state s and I take action a, how much total reward can I expect?

This guess is written as:

Q(s, a) — the expected total future reward for taking action a in state s.
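
To make the Q-table concrete before we train anything, here is a minimal sketch of the data structure itself. The 16-state, 4-action sizes are just an assumption for illustration (they happen to match the FrozenLake grid used later):

import numpy as np

# One row per state, one column per action; every entry is a guess at Q(s, a).
n_states, n_actions = 16, 4   # assumed sizes for a small grid world
q_table = np.zeros((n_states, n_actions))

# Reading Q(s, a) is a plain table lookup: the guess for action 2 in state 6.
print(q_table[6, 2])   # 0.0, since the agent hasn't learned anything yet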

How the Agent Learns with Q-Learning

At first, the Q-table is filled with zeros. The agent doesn’t know anything. It explores randomly. Every time it takes an action, it gets feedback in the form of a reward. It uses that feedback to update its guess for Q(s, a). Here’s the formula used to update the Q-value:

Q(s, a) ← Q(s, a) + α · [ R + γ · max(Q(s', a')) − Q(s, a) ]

Let’s break that down step by step (a code sketch of the same update follows the list):

  • Q(s, a) is the current guess.
  • α (alpha) is the learning rate — how much the new information should affect the old guess.
  • R is the reward the agent gets after taking action a in state s.
  • γ (gamma) is the discount factor — how much future rewards are considered.
  • max(Q(s', a')) is the best Q-value available from the next state s', i.e. the agent's estimate of the best possible future reward.
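
Seen in code, the whole update is a few lines of arithmetic. Here is a minimal sketch; the function name q_update and the toy numbers are illustrative, not from any library:

import numpy as np

def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    # Target: the reward just received plus the discounted best value from the next state.
    best_future = np.max(q_table[s_next, :])                  # max(Q(s', a'))
    td_error = reward + gamma * best_future - q_table[s, a]   # how wrong the old guess was
    q_table[s, a] += alpha * td_error                         # move a fraction α toward the target

# Toy usage: action 2 took the agent from state 6 to state 7 and earned a reward of 1.
q_table = np.zeros((16, 4))
q_update(q_table, s=6, a=2, reward=1.0, s_next=7)
print(q_table[6, 2])   # 0.1, i.e. the old guess nudged by α × td_error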

Exploration vs Exploitation

At each step, the agent must decide:

  • Should it explore and try new actions to learn more?
  • Or should it exploit what it already knows and pick the best action so far?

To manage this trade-off, we often use the epsilon-greedy strategy. The agent picks a random action with some small probability ε (say 10%) and picks the best-known action the rest of the time. This way, the agent keeps learning but also uses what it already knows to do well.
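
As a sketch, epsilon-greedy action selection looks like this; choose_action is an illustrative name, and the Q-table is assumed to be a NumPy array like the one built below:

import numpy as np

def choose_action(q_table, state, epsilon, n_actions):
    # With probability ε (say 0.1), explore: pick any action at random.
    if np.random.uniform(0, 1) < epsilon:
        return np.random.randint(n_actions)
    # Otherwise exploit: pick the action with the highest current Q-value.
    return int(np.argmax(q_table[state, :]))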

Coding Q-Learning from Scratch

Let’s look at how Q-learning works in practice, using a simple grid environment: FrozenLake. Here’s a basic version of Q-learning written in Python using the gym library (version 0.26 or later; the successor library gymnasium works the same way).

import numpy as np
import gym   # gym 0.26+; the successor library also works: import gymnasium as gym

# Deterministic 4x4 FrozenLake: no slipping, so actions do exactly what they say.
env = gym.make("FrozenLake-v1", is_slippery=False)

state_space_size = env.observation_space.n
action_space_size = env.action_space.n

# One row per state, one column per action; all guesses start at zero.
q_table = np.zeros((state_space_size, action_space_size))

num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1       # α: how strongly each update overwrites the old guess
discount_rate = 0.99      # γ: how much future rewards count
exploration_rate = 1.0    # ε: probability of picking a random action
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

for episode in range(num_episodes):
    state = env.reset()[0]    # gym 0.26+ returns (observation, info)
    done = False

    for step in range(max_steps_per_episode):
        # Epsilon-greedy: explore with probability ε, otherwise exploit the Q-table.
        if np.random.uniform(0, 1) < exploration_rate:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state, :])

        # gym 0.26+ splits the old `done` flag into `terminated` and `truncated`.
        new_state, reward, done, truncated, info = env.step(action)

        # The Q-learning update: nudge Q(s, a) toward R + γ · max(Q(s', a')).
        q_table[state, action] = q_table[state, action] + learning_rate * (
            reward + discount_rate * np.max(q_table[new_state, :]) - q_table[state, action]
        )

        state = new_state

        if done or truncated:
            break

    # Decay ε exponentially so the agent explores less as it learns more.
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

This code shows a working agent learning how to act in a simple environment. It starts knowing nothing and slowly builds up its Q-table until it understands which actions give the best long-term rewards.
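
One quick way to see what the table has learned is to collapse it into a policy: the best-known action for each state. A small sketch, assuming the q_table trained above:

# For each of FrozenLake's 16 states, take the action with the highest Q-value.
policy = np.argmax(q_table, axis=1)
print(policy.reshape(4, 4))   # 4x4 grid; actions: 0=left, 1=down, 2=right, 3=up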

Watching the Agent in Action

Once the Q-table is trained, the agent can use it to act intelligently. You can watch it make decisions by running it without exploration:

# Rendering in gym 0.26+ requires render_mode to be set when the env is created.
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="human")

state = env.reset()[0]
done = False

for step in range(100):
    env.render()
    action = np.argmax(q_table[state, :])   # pure exploitation: best-known action
    new_state, reward, done, truncated, info = env.step(action)
    state = new_state
    if done or truncated:
        env.render()
        break

Now the agent chooses the best-known action at every step. It’s no longer guessing. It’s using what it has learned.

What Makes Q-Learning Special

Q-learning doesn’t need a model of the environment. The agent doesn’t have to understand how the world works; it only needs to interact with it, learn from the feedback, and adjust its Q-table. That makes Q-learning simple, flexible, and effective for many problems where the environment is small enough to fit into a table.

Final Thought

Q-learning teaches the agent what actions work best by assigning a score to each action in each state. These scores get better with experience. Over time, the agent goes from guessing randomly to making smart decisions that maximize long-term rewards. The method is simple, the process is clear, and the results can be powerful when combined with more advanced methods in larger environments.
