Efficiently Serving LLMs (Part 3): How Speculative Decoding Boosts Decode Speed
In the first two parts of this series, we explored the economics of inference - why output tokens cost more than input tokens, and how cached inputs can make generation cheaper and faster. Now, let’s look at another technique for improving throughput: Speculative Decoding (SD).
Why Speculative Decoding?
Classic generation is strictly sequential: one forward pass per token. Speculative decoding breaks that rhythm: a small draft model proposes several tokens ahead, and the target model verifies them all in a single forward pass.
Done right, this reduces inter-token latency and boosts tokens-per-second without sacrificing quality. Explained simply (ELI5): imagine a student quickly drafting an answer while a teacher reviews and corrects it in one pass - faster progress, same accuracy.
Speculative decoding primarily boosts decode throughput and, as a result, lowers inter-token latency once generation starts. Time to first token (TTFT) is largely unchanged, because the prefill phase is untouched; SD only kicks in after the first token. A drafter proposes several next tokens that the main model verifies in one pass, reducing the number of forward passes per output token - streaming feels faster and longer responses finish sooner. To validate it, watch tokens/sec during decode (should rise) and inter-token latency p50/p95 (should fall), with TTFT staying roughly the same.
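The draft-then-verify loop can be sketched in a few lines of Python. This is a toy illustration with greedy (exact-match) verification; `draft_next` and `target_next` are hypothetical stand-ins for the two models' next-token functions, and a real engine verifies all proposals in one batched target forward pass rather than a loop:

```python
def speculative_round(prefix, draft_next, target_next, k=3):
    """One speculation round: draft k tokens, verify, return accepted tokens.

    Every round yields at least one token, so quality never regresses:
    on the first mismatch the target's own token is emitted instead.
    """
    # 1. Draft proposes k tokens autoregressively (cheap passes).
    proposals = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposals.append(t)
        ctx.append(t)

    # 2. Target verifies: accept proposals while they match what the
    #    target would have produced itself.
    accepted = []
    ctx = list(prefix)
    for t in proposals:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            # Mismatch: fall back to the target's token and stop.
            accepted.append(expected)
            break
    else:
        # All k accepted: the verification pass also yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

When the draft agrees with the target, one expensive target pass produces k+1 tokens; in the worst case it still produces one, just like the baseline.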
Experiment Setup
The current V1 release of vLLM doesn’t yet include support for the draft-model variant of Speculative Decoding. To experiment with and validate the feature, vLLM was built from source on DGX Spark and the changes from PR #24322 were applied manually. Huge thanks to Tomas Ruiz for drafting and sharing this implementation - it’s a great starting point for exploring speculative decoding in action.
For this experiment, the benchmarks are run with vllm bench serve, a CLI utility installed via `uv pip install -e .[bench]`. As outlined in the official vLLM benchmarks documentation, the [bench] extra must be included to pull in the additional dependencies required by the benchmarking commands.
Start vLLM using one of the following commands:
# Without Speculative Decoding
VLLM_USE_V1=1 vllm serve Qwen/Qwen3-32B \
--max-model-len 20000 \
--disable-uvicorn-access-log
# With Speculative Decoding
VLLM_USE_V1=1 vllm serve Qwen/Qwen3-32B \
--speculative_config '{"method": "draft_model", "model": "Qwen/Qwen3-1.7B", "num_speculative_tokens": 3, "max_model_len": 20000, "disable_padded_drafter_batch": true}' \
--max-model-len 20000 \
--disable-uvicorn-access-log
Measure the online throughput using:
# nosd for no speculative decoding
# sd for speculative decoding
# c for concurrency
# t for temperature
vllm bench serve \
--model Qwen/Qwen3-32B \
--dataset-name hf \
--dataset-path philschmid/mt-bench \
--num-prompts 80 \
--max-concurrency (1|100) \
--temperature (0.0|1.0) \
--top-p 1.0 2>&1 > results/qwen3-32b-(nosd|sd)-c(1|100)-t(0.0|1.0).out
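The headline metrics can then be pulled out of the captured .out files with a small helper. This is a hypothetical sketch that assumes the benchmark summary contains lines like `Mean TTFT (ms):    102.31`; the exact wording may differ between vLLM versions, so adjust the pattern to match your files:

```python
import re

# Matches summary lines such as "Mean TTFT (ms):    102.31" or
# "P99 ITL (ms):    45.67" (assumed format; adapt as needed).
METRIC_RE = re.compile(r"^(Mean|Median|P99)\s+(TTFT|TPOT|ITL)\s+\(ms\):\s+([\d.]+)")

def parse_metrics(path):
    """Return a {"Mean TTFT": 102.31, ...} dict from a bench output file."""
    metrics = {}
    with open(path) as f:
        for line in f:
            m = METRIC_RE.match(line.strip())
            if m:
                stat, name, value = m.groups()
                metrics[f"{stat} {name}"] = float(value)
    return metrics
```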
Benchmark comparison for Qwen3-32B with and without Speculative Decoding for concurrency 1/temperature 0 (c1-t0.0)
Both tests used 80 requests, max concurrency 1, temperature 0.0, on the Qwen/Qwen3-32B model. The metrics used for comparison are time to first token (TTFT), time per output token (TPOT), inter-token latency (ITL), and overall tokens/sec throughput.
Looking at the result files generated by vllm bench serve without and with speculative decoding, and plotting the throughput and latency metrics with the help of a Python script, a few themes emerge:
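A minimal version of that plotting script is sketched below. The metric values are placeholders only, not measured results; substitute the numbers parsed from your own qwen3-32b-*-c1-t0.0.out files:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs on a server
import matplotlib.pyplot as plt

metrics = ["Mean TTFT (ms)", "Mean TPOT (ms)", "Mean ITL (ms)"]
baseline = [100.0, 30.0, 30.0]     # placeholder values: no speculative decoding
speculative = [120.0, 15.0, 20.0]  # placeholder values: with speculative decoding

x = range(len(metrics))
width = 0.35
plt.bar([i - width / 2 for i in x], baseline, width, label="No SD")
plt.bar([i + width / 2 for i in x], speculative, width, label="SD")
plt.xticks(list(x), metrics)
plt.ylabel("Latency (ms)")
plt.title("Qwen3-32B: with vs. without speculative decoding (c1, t0.0)")
plt.legend()
plt.savefig("qwen3-32b-sd-comparison.png")
```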
Slower TTFT with Speculative Decoding
The slower Time to First Token (TTFT) with Speculative Decoding is due to the additional overhead required to initialize and coordinate the speculative decoding process. Here's why:
Draft Model Initialization: Speculative decoding requires loading and initializing a smaller draft model alongside the target model, and both must be ready before generation begins. This dual startup introduces additional initialization overhead that isn’t present in the baseline setup.
Speculation Setup Overhead: Before the first token is emitted, the system routes the prompt through the draft model, has it generate speculative tokens, sets up the verification pass with the target model to accept or reject those proposals, and coordinates the KV cache between both models so accepted tokens can be reused without redundant computation.
TPOT is Better with Speculative Decoding
Speculative decoding generates multiple tokens per iteration instead of just one. In a standard autoregressive loop, each new token requires a full, expensive forward pass of the target model, yielding only one token per pass. With SD, a lightweight draft model quickly proposes several tokens (num_speculative_tokens, set to 3 in this experiment), and the target model verifies them in a single, efficient pass, typically accepting most of them. The result is multiple tokens per target forward pass, significantly boosting decode throughput. Why this works:
Parallel Verification: The target model can verify multiple draft tokens simultaneously using batched KV cache operations.
Amortized Cost: The expensive target model forward pass produces multiple accepted tokens instead of just one.
Draft Model Speed: The draft model (smaller) runs much faster than the target model, so its overhead is minimal.
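The amortization argument can be made quantitative with the standard speculative-decoding estimate: if each drafted token is accepted independently with probability alpha and k tokens are drafted per round, the expected number of tokens produced per target forward pass is (1 - alpha^(k+1)) / (1 - alpha). This is an analytical sketch that ignores draft-model cost and treats acceptances as independent, so real numbers will differ:

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens per target forward pass: the accepted prefix plus
    the one token the verification pass always yields.
    Equals (1 - alpha**(k+1)) / (1 - alpha) for alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With k=3 drafted tokens (as in this experiment) and an assumed 80%
# per-token acceptance rate, each expensive target pass yields ~2.95
# tokens instead of 1.
print(round(expected_tokens_per_pass(0.8, 3), 2))
```

With alpha = 0 (nothing accepted) the formula falls back to exactly 1 token per pass, i.e. the baseline autoregressive loop.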
ITL Gap Between Bursts is Longer
Even though speculative decoding is more efficient overall, each "speculation round" takes longer than a single autoregressive step because it involves draft model execution, target model verification, and batch scheduling overhead (especially at high concurrency).
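The bursty ITL pattern is easy to see in a toy timing model. The pass costs below are assumed round numbers, not measurements: each baseline token costs one target pass, while each SD round pays for k draft passes plus one verification pass up front and then emits a burst of tokens almost instantly:

```python
TARGET_PASS_MS = 30.0   # assumed cost of one target-model forward pass
DRAFT_PASS_MS = 3.0     # assumed cost of one draft-model forward pass
K = 3                   # drafted tokens per speculation round
ACCEPTED = 3            # tokens emitted per round in this toy model

def baseline_itl(n_tokens):
    """Baseline: every token waits for one full target pass."""
    return [TARGET_PASS_MS] * n_tokens

def sd_itl(n_tokens):
    """SD: a longer gap per round, then the rest of the burst arrives ~free."""
    itl = []
    round_cost = K * DRAFT_PASS_MS + TARGET_PASS_MS  # draft k, verify once
    while len(itl) < n_tokens:
        itl.append(round_cost)               # first token of the burst
        itl.extend([0.0] * (ACCEPTED - 1))   # remaining accepted tokens
    return itl[:n_tokens]
```

In this model the worst single gap is longer with SD (39 ms vs 30 ms), yet the mean ITL drops from 30 ms to 13 ms per token, which is exactly the trade-off described above.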
Why this Trade-off is Acceptable with Speculative Decoding
Even though SD adds overhead (draft model + verification), the ability to accept multiple tokens per target model iteration more than compensates, resulting in net speedup for token generation.
Parting question to consider
If speculative decoding offers up to 2-3x faster generation, what's the next optimization you'd want to pair it with: quantization, or disaggregated serving? These will be the topics of my upcoming blogs - stay tuned!