L-14 Learning Agents Using Q-Learning (Theory & Code) | Complete Agentic AI Course

A learning agent is a system that improves its behavior over time by learning from experience. One of the most widely used methods for training such agents is Q-learning, which lets an agent learn, through trial and error, which action is best in each situation.

Imagine an agent moving through a maze. It doesn’t know the way out at first. But over time, it learns which paths lead to the goal and which ones lead to dead ends. It does this without being told the correct route. It just tries things, sees what happens, and learns from it.

This is exactly what Q-learning does. It helps the agent learn how good each action is in each state, based on what it experiences.

The Big Idea Behind Q-Learning

In reinforcement learning, the agent wants to maximize the total reward it receives over time. To do that, it must learn which actions give better results. Q-learning gives the agent a way to estimate the quality of each action in each state. This estimate is stored in a table called the Q-table.

Each entry in the Q-table represents a guess:

If I'm in state s and I take action a, how much total reward can I expect?

This guess is written as:

Q(s, a) — the expected total future reward for taking action a in state s.
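
To make the Q-table concrete before we train anything, here is a minimal sketch of the data structure itself. The 16-state, 4-action sizes are just an assumption for illustration (they happen to match the FrozenLake grid used later):

import numpy as np

# One row per state, one column per action; every entry is a guess at Q(s, a).
n_states, n_actions = 16, 4   # assumed sizes for a small grid world
q_table = np.zeros((n_states, n_actions))

# Reading Q(s, a) is a plain table lookup: the guess for action 2 in state 6.
print(q_table[6, 2])   # 0.0, since the agent hasn't learned anything yet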

How the Agent Learns with Q-Learning

At first, the Q-table is filled with zeros. The agent doesn’t know anything. It explores randomly. Every time it takes an action, it gets feedback in the form of a reward. It uses that feedback to update its guess for Q(s, a). Here’s the formula used to update the Q-value:

Q(s, a) ← Q(s, a) + α · [ R + γ · max(Q(s', a')) − Q(s, a) ]

Let’s break that down step by step (a code sketch of the same update follows the list):

  • Q(s, a) is the current guess.
  • α (alpha) is the learning rate — how much the new information should affect the old guess.
  • R is the reward the agent gets after taking action a in state s.
  • γ (gamma) is the discount factor — how much future rewards are considered.
  • max(Q(s', a')) is the best Q-value available from the next state s', i.e. the agent's estimate of the best possible future reward.
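
Seen in code, the whole update is a few lines of arithmetic. Here is a minimal sketch; the function name q_update and the toy numbers are illustrative, not from any library:

import numpy as np

def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    # Target: the reward just received plus the discounted best value from the next state.
    best_future = np.max(q_table[s_next, :])                  # max(Q(s', a'))
    td_error = reward + gamma * best_future - q_table[s, a]   # how wrong the old guess was
    q_table[s, a] += alpha * td_error                         # move a fraction α toward the target

# Toy usage: action 2 took the agent from state 6 to state 7 and earned a reward of 1.
q_table = np.zeros((16, 4))
q_update(q_table, s=6, a=2, reward=1.0, s_next=7)
print(q_table[6, 2])   # 0.1, i.e. the old guess nudged by α × td_error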

Exploration vs Exploitation

At each step, the agent must decide:

  • Should it explore and try new actions to learn more?
  • Or should it exploit what it already knows and pick the best action so far?

To manage this trade-off, we often use the epsilon-greedy strategy. The agent picks a random action with some small probability ε (say 10%) and picks the best-known action the rest of the time. This way, the agent keeps learning but also uses what it already knows to do well.
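
As a sketch, epsilon-greedy action selection looks like this; choose_action is an illustrative name, and the Q-table is assumed to be a NumPy array like the one built below:

import numpy as np

def choose_action(q_table, state, epsilon, n_actions):
    # With probability ε (say 0.1), explore: pick any action at random.
    if np.random.uniform(0, 1) < epsilon:
        return np.random.randint(n_actions)
    # Otherwise exploit: pick the action with the highest current Q-value.
    return int(np.argmax(q_table[state, :]))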

Coding Q-Learning from Scratch

Let’s look at how Q-learning works in practice, using a simple grid environment: FrozenLake. Here’s a basic version of Q-learning written in Python using the gym library (version 0.26 or later; the successor library gymnasium works the same way).

import numpy as np
import gym   # gym 0.26+; the successor library also works: import gymnasium as gym

# Deterministic 4x4 FrozenLake: no slipping, so actions do exactly what they say.
env = gym.make("FrozenLake-v1", is_slippery=False)

state_space_size = env.observation_space.n
action_space_size = env.action_space.n

# One row per state, one column per action; all guesses start at zero.
q_table = np.zeros((state_space_size, action_space_size))

num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1       # α: how strongly each update overwrites the old guess
discount_rate = 0.99      # γ: how much future rewards count
exploration_rate = 1.0    # ε: probability of picking a random action
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

for episode in range(num_episodes):
    state = env.reset()[0]    # gym 0.26+ returns (observation, info)
    done = False

    for step in range(max_steps_per_episode):
        # Epsilon-greedy: explore with probability ε, otherwise exploit the Q-table.
        if np.random.uniform(0, 1) < exploration_rate:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state, :])

        # gym 0.26+ splits the old `done` flag into `terminated` and `truncated`.
        new_state, reward, done, truncated, info = env.step(action)

        # The Q-learning update: nudge Q(s, a) toward R + γ · max(Q(s', a')).
        q_table[state, action] = q_table[state, action] + learning_rate * (
            reward + discount_rate * np.max(q_table[new_state, :]) - q_table[state, action]
        )

        state = new_state

        if done or truncated:
            break

    # Decay ε exponentially so the agent explores less as it learns more.
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

This code shows a working agent learning how to act in a simple environment. It starts knowing nothing and slowly builds up its Q-table until it understands which actions give the best long-term rewards.
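
One quick way to see what the table has learned is to collapse it into a policy: the best-known action for each state. A small sketch, assuming the q_table trained above:

# For each of FrozenLake's 16 states, take the action with the highest Q-value.
policy = np.argmax(q_table, axis=1)
print(policy.reshape(4, 4))   # 4x4 grid; actions: 0=left, 1=down, 2=right, 3=up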

Watching the Agent in Action

Once the Q-table is trained, the agent can use it to act intelligently. You can watch it make decisions by running it without exploration:

# Rendering in gym 0.26+ requires render_mode to be set when the env is created.
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="human")

state = env.reset()[0]
done = False

for step in range(100):
    env.render()
    action = np.argmax(q_table[state, :])   # pure exploitation: best-known action
    new_state, reward, done, truncated, info = env.step(action)
    state = new_state
    if done or truncated:
        env.render()
        break

Now the agent chooses the best-known action at every step. It’s no longer guessing. It’s using what it has learned.

What Makes Q-Learning Special

Q-learning doesn’t need a model of the environment. The agent doesn’t have to understand how the world works; it only needs to interact with it, learn from the feedback, and adjust its Q-table. That makes Q-learning simple, flexible, and effective for many problems where the environment is small enough to fit into a table.

Final Thought

Q-learning teaches the agent what actions work best by assigning a score to each action in each state. These scores get better with experience. Over time, the agent goes from guessing randomly to making smart decisions that maximize long-term rewards. The method is simple, the process is clear, and the results can be powerful when combined with more advanced methods in larger environments.
