Efficiently Serving LLMs (Part 3): How Speculative Decoding Boosts Decode Speed
https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/

In the first two parts of this series, we explored the economics of inference - why output tokens cost more than input tokens, and how cached inputs can make generation cheaper and faster. Now, let’s look at another technique for improving throughput: Speculative Decoding (SD).

Why Speculative Decoding?

Classic generation is strictly sequential: one forward pass per token. Speculative decoding breaks that rhythm:

  1. A draft model proposes several next tokens ahead.
  2. The target model verifies them all in a single forward pass, accepting the longest prefix that matches its own predictions and correcting only the first mismatch.

Done right, this reduces inter-token latency and boosts tokens-per-second without sacrificing quality. Explained simply (ELI5): imagine a student quickly drafting an answer while a teacher reviews and corrects it in one pass - faster progress, same accuracy.

Speculative decoding primarily boosts decode throughput and, as a result, lowers inter-token latency once generation starts. In principle it leaves time to first token (TTFT) largely unchanged, because the prefill phase isn't modified - SD only kicks in after the first token (though, as the results below show, its setup overhead can still add to TTFT in practice). The mechanism: a drafter proposes several next tokens, the main model verifies them in one pass, and the number of forward passes per output token drops. This makes streaming feel faster and shortens end-to-end time for longer responses. To validate it, watch tokens/sec during decode (it should rise) and inter-token latency p50/p95 (they should fall), with TTFT staying roughly the same.
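To make the draft-and-verify loop concrete, here is a toy sketch of the greedy-acceptance variant. The two model functions are made-up stand-ins (not real models, and not vLLM's API); the point is the control flow: draft k tokens, verify them against the target, keep the matching prefix, and take one corrected or bonus token from the same verification pass.

# Toy sketch of the speculative decoding loop (greedy acceptance).
# draft_next and target_next are hypothetical stand-ins, not real models;
# both map a token sequence to the next token id.

def draft_next(seq):
    # Cheap drafter: usually agrees with the target.
    return (seq[-1] * 31 + 7) % 100

def target_next(seq):
    # Target model: occasionally disagrees with the drafter.
    t = (seq[-1] * 31 + 7) % 100
    return (t + 1) % 100 if len(seq) % 5 == 0 else t

def speculative_step(seq, k=3):
    """One speculation round: draft k tokens, verify, return emitted tokens."""
    # Draft phase: propose k tokens ahead.
    draft, ctx = [], list(seq)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # Verify phase: in a real system this is ONE batched target forward pass
    # over all k positions; here we emulate it with k sequential calls.
    accepted, ctx = [], list(seq)
    for t in draft:
        expect = target_next(ctx)
        if expect != t:          # first mismatch: take the target's token, stop
            accepted.append(expect)
            return accepted
        accepted.append(t)       # match: accept the drafted token
        ctx.append(t)
    # All k drafts accepted: the same verification pass yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted

seq = [1]
while len(seq) < 30:
    seq += speculative_step(seq)
print(seq)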

Experiment Setup

The current V1 release of vLLM doesn’t yet include support for the draft-model variant of Speculative Decoding. To experiment with and validate the feature, vLLM was built from source on DGX Spark and the changes from PR #24322 were applied manually. Huge thanks to Tomas Ruiz for drafting and sharing this implementation - it’s a great starting point for exploring speculative decoding in action.

For this experiment, vllm bench serve is used - a CLI utility available after installing the benchmarking extras with `uv pip install -e .[bench]`. As outlined in the official vLLM benchmarks documentation, the [bench] extra must be included to enable the benchmarking commands.

Start vLLM using one of the following commands:

# Without Speculative Decoding
VLLM_USE_V1=1 vllm serve Qwen/Qwen3-32B \
  --max-model-len 20000 \
  --disable-uvicorn-access-log

# With Speculative Decoding
VLLM_USE_V1=1 vllm serve Qwen/Qwen3-32B \
  --speculative_config '{"method": "draft_model", "model": "Qwen/Qwen3-1.7B", "num_speculative_tokens": 3, "max_model_len": 20000, "disable_padded_drafter_batch": true}' \
  --max-model-len 20000 \
  --disable-uvicorn-access-log

Measure the online throughput using:

# nosd for no speculative decoding 
# sd for speculative decoding 
# c for concurrency
# t for temperature
vllm bench serve \
  --model Qwen/Qwen3-32B \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 80 \
  --max-concurrency (1|100) \
  --temperature (0.0|1.0) \
  --top-p 1.0 > results/qwen3-32b-(nosd|sd)-c(1|100)-t(0.0|1.0).out 2>&1
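Since the command above is parameterized over four configurations, a small script can run the sweep. This is a hypothetical convenience wrapper, not part of vLLM - it assumes vllm is on PATH, that the matching server variant (with or without SD) is already running, and it simply fills in the (1|100) and (0.0|1.0) slots:

# Hypothetical sweep over the four benchmark configurations above.
import itertools, pathlib, subprocess

pathlib.Path("results").mkdir(exist_ok=True)
mode = "sd"  # or "nosd"; restart the server accordingly before switching
for conc, temp in itertools.product([1, 100], [0.0, 1.0]):
    out = pathlib.Path(f"results/qwen3-32b-{mode}-c{conc}-t{temp}.out")
    cmd = [
        "vllm", "bench", "serve",
        "--model", "Qwen/Qwen3-32B",
        "--dataset-name", "hf",
        "--dataset-path", "philschmid/mt-bench",
        "--num-prompts", "80",
        "--max-concurrency", str(conc),
        "--temperature", str(temp),
        "--top-p", "1.0",
    ]
    with out.open("w") as f:
        # Send both stdout and stderr to the result file.
        subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, check=True)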

Benchmark comparison for Qwen3-32B with and without Speculative Decoding

Both tests used 80 requests on the Qwen/Qwen3-32B model at temperature 0.0, at max concurrency 1 and 100. The metrics used for comparison are listed below (a short sketch of how they relate follows the list):

  • TTFT - Time-to-First-Token
  • TPOT - Time-per-Output-Token
  • ITL - Inter-Token Latency
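For intuition, here is how the three numbers relate when computed from per-token arrival timestamps of a single streamed request. The timestamps are invented for illustration; vllm bench serve reports these metrics directly:

# How TTFT, TPOT, and ITL relate, from per-token arrival timestamps
# of one streamed request (all values are illustrative).
request_start = 0.00
token_times = [0.85, 0.90, 0.95, 1.00, 1.05]  # seconds when each token arrived

ttft = token_times[0] - request_start                         # time to first token
itls = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token gaps
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)  # avg decode pace

print(f"TTFT={ttft:.2f}s  TPOT={tpot:.3f}s/token  ITL={[round(g, 3) for g in itls]}")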

Looking at the result files generated by vllm bench serve without and with speculative decoding:

[Figure: Latency metrics comparison without and with speculative decoding, concurrency 1, temperature 0.0]
[Figure: Latency metrics comparison without and with speculative decoding, concurrency 100, temperature 0.0]

Plotting the throughput and latency metrics with the help of a small Python script, a few themes emerge.
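A minimal sketch of such a plotting script, assuming the mean TTFT/TPOT/ITL values have been copied by hand from the two result files (the numbers below are placeholders, not the measured results):

# Minimal plotting sketch; fill in the measured means from the result files.
import matplotlib.pyplot as plt

metrics = ["TTFT", "TPOT", "ITL"]
baseline = [1.0, 1.0, 1.0]   # placeholder means for the no-SD run
specdec  = [1.2, 0.6, 0.6]   # placeholder means for the SD run

x = range(len(metrics))
w = 0.35
plt.bar([i - w / 2 for i in x], baseline, w, label="no SD")
plt.bar([i + w / 2 for i in x], specdec, w, label="SD")
plt.xticks(list(x), metrics)
plt.ylabel("mean latency (normalized)")
plt.legend()
plt.savefig("qwen3-32b-sd-comparison.png")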

[Figure: Benchmark comparison for concurrency 1, temperature 0.0]
[Figure: Benchmark comparison for concurrency 100, temperature 0.0]

Slower TTFT with Speculative Decoding

The slower Time to First Token (TTFT) with Speculative Decoding is due to the additional overhead required to initialize and coordinate the speculative decoding process. Here's why:

Draft Model Initialization: Speculative decoding requires loading and initializing a smaller draft model alongside the target model, and both must be ready before generation begins. This dual startup introduces additional initialization overhead that isn’t present in the baseline setup.

Speculation Setup Overhead: Before the first token is emitted, the system routes the prompt through the draft model, has it generate speculative tokens, sets up the verification pass with the target model to accept or reject those proposals, and coordinates the KV cache between both models so accepted tokens can be reused without redundant computation.

TPOT is Better with Speculative Decoding

Speculative decoding generates multiple tokens per iteration instead of just one. In a standard autoregressive loop, each new token requires a full, expensive forward pass of the target model, yielding only one token per pass. With SD, a lightweight draft model quickly proposes several tokens ahead (num_speculative_tokens is 3 in this setup), and the target model verifies them in a single, efficient pass, typically accepting most of them plus one bonus token. The result is multiple tokens per target forward pass, significantly boosting decode throughput. Why this works:

Parallel Verification: The target model can verify multiple draft tokens simultaneously using batched KV cache operations.

Amortized Cost: The expensive target model forward pass produces multiple accepted tokens instead of just one.

Draft Model Speed: The draft model (smaller) runs much faster than the target model, so its overhead is minimal.
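To put a rough number on the amortization, assume each drafted token is accepted independently with probability a (the i.i.d. simplification used in the speculative decoding literature). With k drafts, the expected tokens emitted per target forward pass - the accepted prefix plus one corrected or bonus token - works out to a geometric sum:

# Back-of-envelope: expected tokens emitted per target forward pass when
# each of k drafted tokens is accepted independently with probability a.
def expected_tokens_per_pass(a: float, k: int) -> float:
    # a^0 + a^1 + ... + a^k, i.e. (1 - a**(k + 1)) / (1 - a) for a < 1
    return sum(a**i for i in range(k + 1))

for a in (0.6, 0.8, 0.9):
    print(f"acceptance={a}: {expected_tokens_per_pass(a, k=3):.2f} tokens/pass")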

ITL Gap Between Bursts is Longer

Even though speculative decoding is more efficient overall, each "speculation round" takes longer than a single autoregressive step because it involves draft model execution, target model verification, and batch scheduling overhead (especially at high concurrency).
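A toy calculation makes the burst pattern visible. Assume (made-up numbers) an autoregressive step takes 40 ms while a full speculation round takes 70 ms but emits 3 tokens at once:

# Toy illustration of bursty ITL under speculative decoding
# (all timings are assumptions, not measurements).
step_ar = 0.040        # assumed seconds per autoregressive step
round_sd = 0.070       # assumed seconds per speculation round
# Autoregressive: one token every 40 ms -> every gap is 40 ms.
ar_gaps = [step_ar] * 9
# Speculative: 3 tokens arrive together every 70 ms -> gaps of ~0, 0, 70 ms.
sd_gaps = ([0.0, 0.0, round_sd] * 3)[:9]

print(f"AR mean gap: {sum(ar_gaps) / len(ar_gaps):.3f}s, max gap: {max(ar_gaps):.3f}s")
print(f"SD mean gap: {sum(sd_gaps) / len(sd_gaps):.3f}s, max gap: {max(sd_gaps):.3f}s")

The mean gap drops, but the gap between bursts (70 ms here) is longer than any single autoregressive step - exactly the trade-off the benchmark surfaces in the ITL percentiles.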

Why this Trade-off is Acceptable with Speculative Decoding

Even though SD adds overhead (draft model + verification), the ability to accept multiple tokens per target model iteration more than compensates, resulting in net speedup for token generation.
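Combining the two effects - more tokens per target pass versus the extra draft work - gives a back-of-envelope decode speedup estimate in the style of Leviathan et al. (2023). Here c is the assumed draft-to-target cost ratio; for a 1.7B drafter in front of a 32B target, c on the order of 0.1 is plausible, but all numbers below are assumptions, not measurements:

# Rough net-speedup estimate: tokens per target pass divided by the
# relative cost of one speculation round (one target pass + k draft steps).
def speedup(a: float, k: int, c: float) -> float:
    tokens_per_pass = sum(a**i for i in range(k + 1))
    round_cost = 1 + k * c
    return tokens_per_pass / round_cost

for a in (0.6, 0.8, 0.9):
    print(f"acceptance={a}: ~{speedup(a, k=3, c=0.1):.2f}x decode speedup")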

Parting question to consider

If speculative decoding offers up to 2–3× faster generation, what’s the next optimization you’d want to pair it with: quantization, or disaggregated serving? These will be the topics of my upcoming blogs - stay tuned!




