Efficiently Serving LLMs (Part 3): How Speculative Decoding Boosts Decode Speed
https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/

In the first two parts of this series, we explored the economics of inference - why output tokens cost more than input tokens, and how cached inputs can make generation cheaper and faster. Now, let’s look at another technique for improving throughput: Speculative Decoding (SD).

Why Speculative Decoding?

Classic generation is strictly sequential: one forward pass per token. Speculative decoding breaks that rhythm:

  1. A draft model proposes several next tokens ahead.
  2. The target model verifies them all in a single forward pass, accepting the longest prefix that matches its own predictions and correcting only the first mismatch.

Done right, this reduces inter-token latency and boosts tokens-per-second without sacrificing quality. Explained simply (ELI5): imagine a student quickly drafting an answer while a teacher reviews and corrects it in one pass - faster progress, same accuracy.

Speculative decoding primarily boosts decode throughput and, as a result, lowers inter-token latency once generation starts. In principle it leaves time to first token (TTFT) largely unchanged, because the prefill phase isn't modified - SD only kicks in after the first token (though, as the results below show, its setup overhead can still add to TTFT in practice). The mechanism: a drafter proposes several next tokens, the main model verifies them in one pass, and the number of forward passes per output token drops. This makes streaming feel faster and shortens end-to-end time for longer responses. To validate it, watch tokens/sec during decode (it should rise) and inter-token latency p50/p95 (they should fall), with TTFT staying roughly the same.
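To make the draft-and-verify loop concrete, here is a toy sketch of the greedy-acceptance variant. The two model functions are made-up stand-ins (not real models, and not vLLM's API); the point is the control flow: draft k tokens, verify them against the target, keep the matching prefix, and take one corrected or bonus token from the same verification pass.

# Toy sketch of the speculative decoding loop (greedy acceptance).
# draft_next and target_next are hypothetical stand-ins, not real models;
# both map a token sequence to the next token id.

def draft_next(seq):
    # Cheap drafter: usually agrees with the target.
    return (seq[-1] * 31 + 7) % 100

def target_next(seq):
    # Target model: occasionally disagrees with the drafter.
    t = (seq[-1] * 31 + 7) % 100
    return (t + 1) % 100 if len(seq) % 5 == 0 else t

def speculative_step(seq, k=3):
    """One speculation round: draft k tokens, verify, return emitted tokens."""
    # Draft phase: propose k tokens ahead.
    draft, ctx = [], list(seq)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # Verify phase: in a real system this is ONE batched target forward pass
    # over all k positions; here we emulate it with k sequential calls.
    accepted, ctx = [], list(seq)
    for t in draft:
        expect = target_next(ctx)
        if expect != t:          # first mismatch: take the target's token, stop
            accepted.append(expect)
            return accepted
        accepted.append(t)       # match: accept the drafted token
        ctx.append(t)
    # All k drafts accepted: the same verification pass yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted

seq = [1]
while len(seq) < 30:
    seq += speculative_step(seq)
print(seq)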

Experiment Setup

The current V1 release of vLLM doesn’t yet include support for the draft-model variant of Speculative Decoding. To experiment with and validate the feature, vLLM was built from source on DGX Spark and the changes from PR #24322 were applied manually. Huge thanks to Tomas Ruiz for drafting and sharing this implementation - it’s a great starting point for exploring speculative decoding in action.

For this experiment, vllm bench serve is used - a CLI utility available after installing the benchmarking extras with `uv pip install -e .[bench]`. As outlined in the official vLLM benchmarks documentation, the [bench] extra must be included to enable the benchmarking commands.

Start vLLM using one of the following commands:

# Without Speculative Decoding
VLLM_USE_V1=1 vllm serve Qwen/Qwen3-32B \
  --max-model-len 20000 \
  --disable-uvicorn-access-log

# With Speculative Decoding
VLLM_USE_V1=1 vllm serve Qwen/Qwen3-32B \
  --speculative_config '{"method": "draft_model", "model": "Qwen/Qwen3-1.7B", "num_speculative_tokens": 3, "max_model_len": 20000, "disable_padded_drafter_batch": true}' \
  --max-model-len 20000 \
  --disable-uvicorn-access-log

Measure the online throughput using:

# nosd for no speculative decoding 
# sd for speculative decoding 
# c for concurrency
# t for temperature
vllm bench serve \
  --model Qwen/Qwen3-32B \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 80 \
  --max-concurrency (1|100) \
  --temperature (0.0|1.0) \
  --top-p 1.0 > results/qwen3-32b-(nosd|sd)-c(1|100)-t(0.0|1.0).out 2>&1
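Since the command above is parameterized over four configurations, a small script can run the sweep. This is a hypothetical convenience wrapper, not part of vLLM - it assumes vllm is on PATH, that the matching server variant (with or without SD) is already running, and it simply fills in the (1|100) and (0.0|1.0) slots:

# Hypothetical sweep over the four benchmark configurations above.
import itertools, pathlib, subprocess

pathlib.Path("results").mkdir(exist_ok=True)
mode = "sd"  # or "nosd"; restart the server accordingly before switching
for conc, temp in itertools.product([1, 100], [0.0, 1.0]):
    out = pathlib.Path(f"results/qwen3-32b-{mode}-c{conc}-t{temp}.out")
    cmd = [
        "vllm", "bench", "serve",
        "--model", "Qwen/Qwen3-32B",
        "--dataset-name", "hf",
        "--dataset-path", "philschmid/mt-bench",
        "--num-prompts", "80",
        "--max-concurrency", str(conc),
        "--temperature", str(temp),
        "--top-p", "1.0",
    ]
    with out.open("w") as f:
        # Send both stdout and stderr to the result file.
        subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, check=True)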

Benchmark comparison for Qwen3-32B with and without Speculative Decoding

Both tests used 80 requests on the Qwen/Qwen3-32B model at temperature 0.0, at max concurrency 1 and 100. The metrics used for comparison are listed below (a short sketch of how they relate follows the list):

  • TTFT - Time-to-First-Token
  • TPOT - Time-per-Output-Token
  • ITL - Inter-Token Latency
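For intuition, here is how the three numbers relate when computed from per-token arrival timestamps of a single streamed request. The timestamps are invented for illustration; vllm bench serve reports these metrics directly:

# How TTFT, TPOT, and ITL relate, from per-token arrival timestamps
# of one streamed request (all values are illustrative).
request_start = 0.00
token_times = [0.85, 0.90, 0.95, 1.00, 1.05]  # seconds when each token arrived

ttft = token_times[0] - request_start                         # time to first token
itls = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token gaps
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)  # avg decode pace

print(f"TTFT={ttft:.2f}s  TPOT={tpot:.3f}s/token  ITL={[round(g, 3) for g in itls]}")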

Looking at the result files generated by vllm bench serve without and with speculative decoding:

[Figure: Latency metrics comparison without and with speculative decoding, concurrency 1, temperature 0.0]
[Figure: Latency metrics comparison without and with speculative decoding, concurrency 100, temperature 0.0]

Plotting the throughput and latency metrics with the help of a small Python script, a few themes emerge.
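A minimal sketch of such a plotting script, assuming the mean TTFT/TPOT/ITL values have been copied by hand from the two result files (the numbers below are placeholders, not the measured results):

# Minimal plotting sketch; fill in the measured means from the result files.
import matplotlib.pyplot as plt

metrics = ["TTFT", "TPOT", "ITL"]
baseline = [1.0, 1.0, 1.0]   # placeholder means for the no-SD run
specdec  = [1.2, 0.6, 0.6]   # placeholder means for the SD run

x = range(len(metrics))
w = 0.35
plt.bar([i - w / 2 for i in x], baseline, w, label="no SD")
plt.bar([i + w / 2 for i in x], specdec, w, label="SD")
plt.xticks(list(x), metrics)
plt.ylabel("mean latency (normalized)")
plt.legend()
plt.savefig("qwen3-32b-sd-comparison.png")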

[Figure: Benchmark comparison for concurrency 1, temperature 0.0]
[Figure: Benchmark comparison for concurrency 100, temperature 0.0]

Slower TTFT with Speculative Decoding

The slower Time to First Token (TTFT) with Speculative Decoding is due to the additional overhead required to initialize and coordinate the speculative decoding process. Here's why:

Draft Model Initialization: Speculative decoding requires loading and initializing a smaller draft model alongside the target model, and both must be ready before generation begins. This dual startup introduces additional initialization overhead that isn’t present in the baseline setup.

Speculation Setup Overhead: Before the first token is emitted, the system routes the prompt through the draft model, has it generate speculative tokens, sets up the verification pass with the target model to accept or reject those proposals, and coordinates the KV cache between both models so accepted tokens can be reused without redundant computation.

TPOT is Better with Speculative Decoding

Speculative decoding generates multiple tokens per iteration instead of just one. In a standard autoregressive loop, each new token requires a full, expensive forward pass of the target model, yielding only one token per pass. With SD, a lightweight draft model quickly proposes several tokens ahead (num_speculative_tokens is 3 in this setup), and the target model verifies them in a single, efficient pass, typically accepting most of them plus one bonus token. The result is multiple tokens per target forward pass, significantly boosting decode throughput. Why this works:

Parallel Verification: The target model can verify multiple draft tokens simultaneously using batched KV cache operations.

Amortized Cost: The expensive target model forward pass produces multiple accepted tokens instead of just one.

Draft Model Speed: The draft model (smaller) runs much faster than the target model, so its overhead is minimal.
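To put a rough number on the amortization, assume each drafted token is accepted independently with probability a (the i.i.d. simplification used in the speculative decoding literature). With k drafts, the expected tokens emitted per target forward pass - the accepted prefix plus one corrected or bonus token - works out to a geometric sum:

# Back-of-envelope: expected tokens emitted per target forward pass when
# each of k drafted tokens is accepted independently with probability a.
def expected_tokens_per_pass(a: float, k: int) -> float:
    # a^0 + a^1 + ... + a^k, i.e. (1 - a**(k + 1)) / (1 - a) for a < 1
    return sum(a**i for i in range(k + 1))

for a in (0.6, 0.8, 0.9):
    print(f"acceptance={a}: {expected_tokens_per_pass(a, k=3):.2f} tokens/pass")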

ITL Gap Between Bursts is Longer

Even though speculative decoding is more efficient overall, each "speculation round" takes longer than a single autoregressive step because it involves draft model execution, target model verification, and batch scheduling overhead (especially at high concurrency).
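A toy calculation makes the burst pattern visible. Assume (made-up numbers) an autoregressive step takes 40 ms while a full speculation round takes 70 ms but emits 3 tokens at once:

# Toy illustration of bursty ITL under speculative decoding
# (all timings are assumptions, not measurements).
step_ar = 0.040        # assumed seconds per autoregressive step
round_sd = 0.070       # assumed seconds per speculation round
# Autoregressive: one token every 40 ms -> every gap is 40 ms.
ar_gaps = [step_ar] * 9
# Speculative: 3 tokens arrive together every 70 ms -> gaps of ~0, 0, 70 ms.
sd_gaps = ([0.0, 0.0, round_sd] * 3)[:9]

print(f"AR mean gap: {sum(ar_gaps) / len(ar_gaps):.3f}s, max gap: {max(ar_gaps):.3f}s")
print(f"SD mean gap: {sum(sd_gaps) / len(sd_gaps):.3f}s, max gap: {max(sd_gaps):.3f}s")

The mean gap drops, but the gap between bursts (70 ms here) is longer than any single autoregressive step - exactly the trade-off the benchmark surfaces in the ITL percentiles.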

Why this Trade-off is Acceptable with Speculative Decoding

Even though SD adds overhead (draft model + verification), the ability to accept multiple tokens per target model iteration more than compensates, resulting in net speedup for token generation.
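Combining the two effects - more tokens per target pass versus the extra draft work - gives a back-of-envelope decode speedup estimate in the style of Leviathan et al. (2023). Here c is the assumed draft-to-target cost ratio; for a 1.7B drafter in front of a 32B target, c on the order of 0.1 is plausible, but all numbers below are assumptions, not measurements:

# Rough net-speedup estimate: tokens per target pass divided by the
# relative cost of one speculation round (one target pass + k draft steps).
def speedup(a: float, k: int, c: float) -> float:
    tokens_per_pass = sum(a**i for i in range(k + 1))
    round_cost = 1 + k * c
    return tokens_per_pass / round_cost

for a in (0.6, 0.8, 0.9):
    print(f"acceptance={a}: ~{speedup(a, k=3, c=0.1):.2f}x decode speedup")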

Parting question to consider

If speculative decoding offers up to 2–3× faster generation, what’s the next optimization you’d want to pair it with: quantization, or disaggregated serving? These will be the topics of my upcoming blogs - stay tuned!




