An engineering view of decoupling prefill and decoding

Prefill and decoding are the two critical stages of Large Language Model (LLM) generation. Here's a detailed breakdown of each stage as implemented in the Mooncake architecture, which decouples them so that Mooncake can handle high loads and long-context scenarios while meeting latency-related Service Level Objectives (SLOs).

A picture is worth a thousand words

[Figure: Overall Workflow]

Show me the code

Check out the draft code here.

Prefill Process

1. Input Token Processing:

The prefill stage begins by processing all input tokens in parallel. This stage is computationally intensive: it generates the first output token while storing the intermediate keys and values computed along the way, known as the KVCache.
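
To make the contract concrete, here is a minimal Python sketch (not Mooncake's code) of what prefill produces: one parallel pass over the whole prompt yields the first output token plus the per-layer KVCache that decoding will later extend. `TinyModel` and its `forward` method are toy stand-ins.

```python
# A minimal sketch (not Mooncake's code) of the prefill contract.
import random

class TinyModel:
    n_layers = 2
    def forward(self, tokens):
        # Pretend each layer yields one (key, value) pair per prompt token.
        kvs = [[(t + layer, t - layer) for t in tokens]
               for layer in range(self.n_layers)]
        logits = [random.random() for _ in range(100)]  # fake scores for the last position
        return logits, kvs

def prefill(model, tokens):
    logits, layer_kvs = model.forward(tokens)           # all prompt tokens in parallel
    kvcache = dict(enumerate(layer_kvs))                # layer index -> list of (k, v)
    first_token = max(range(len(logits)), key=logits.__getitem__)  # greedy first token
    return first_token, kvcache

tok, cache = prefill(TinyModel(), [101, 7592, 2088])
print(tok, {layer: len(kv) for layer, kv in cache.items()})  # e.g. 42 {0: 3, 1: 3}
```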

2. KVCache Reuse:

The selected prefill node receives a request that includes the raw input, the block IDs of the prefix cache that can be reused, and the block IDs of the full cache allocated to the request. It loads the prefix cache from remote CPU memory into GPU memory based on the prefix cache block IDs to bootstrap the request. This step is skipped if no prefix cache exists.
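
A hedged sketch of that bootstrap step follows. The request fields (`prefix_block_ids`, `full_block_ids`), the `cpu_pool` dict, and the plain assignment standing in for a high-speed block copy are all illustrative, not Mooncake's actual API.

```python
# Illustrative sketch of prefix-cache bootstrap on the prefill node.
def bootstrap_prefix_cache(request, cpu_pool, gpu_cache):
    """Load reusable prefix blocks from remote CPU memory into GPU memory."""
    if not request["prefix_block_ids"]:           # no prefix cache: skip this step
        return 0
    for block_id in request["prefix_block_ids"]:
        gpu_cache[block_id] = cpu_pool[block_id]  # stands in for a DMA/RDMA block copy
    return len(request["prefix_block_ids"])

cpu_pool = {7: b"kv-block-7", 9: b"kv-block-9"}
gpu_cache = {}
req = {"input": [101, 7592], "prefix_block_ids": [7, 9], "full_block_ids": [7, 9, 12]}
print(bootstrap_prefix_cache(req, cpu_pool, gpu_cache), "prefix blocks loaded")  # 2
```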

3. Incremental Prefill:

The prefill node completes the prefill stage using the prefix cache and stores the newly generated incremental KVCache back into CPU memory. If the number of uncached input tokens exceeds a threshold (prefill_chunk), the prefill stage is split into multiple chunks and executed in a pipelined manner. The threshold is chosen to fully utilize the GPU's computational power and is typically larger than 1000 tokens.
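
A sketch of the chunking logic under those assumptions; `PREFILL_CHUNK` and `run_prefill_chunk` are illustrative names for the threshold and the per-chunk kernel launch, not Mooncake's API.

```python
# Illustrative sketch of chunked (incremental) prefill.
PREFILL_CHUNK = 1024  # chosen to saturate the GPU; typically > 1000 tokens

def incremental_prefill(uncached_tokens, run_prefill_chunk):
    """Split long prompts into chunks so each chunk fully utilizes the GPU."""
    for start in range(0, len(uncached_tokens), PREFILL_CHUNK):
        chunk = uncached_tokens[start:start + PREFILL_CHUNK]
        run_prefill_chunk(chunk)  # successive chunks can be pipelined

incremental_prefill(list(range(2500)),
                    lambda c: print(f"prefill chunk of {len(c)} tokens"))
# prefill chunk of 1024 tokens (x2), then 452 tokens
```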

4. Layer-wise Prefill:

KVCache loading and storing are executed asynchronously via launch and wait operations. Before each layer's attention computation begins, the model waits for the asynchronous loading of that layer's KVCache to complete and triggers the next layer's asynchronous KVCache loading. After the attention calculation is complete, asynchronous storage of that layer's KVCache is launched.
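
The launch/wait pattern can be sketched with Python threads standing in for asynchronous CUDA/RDMA transfers; `load_layer_kv`, `attention`, and `store_layer_kv` are hypothetical callables, not the real kernel interfaces.

```python
# Illustrative sketch of the layer-wise launch/wait pipeline.
from concurrent.futures import ThreadPoolExecutor

def layerwise_prefill(n_layers, load_layer_kv, attention, store_layer_kv):
    pool = ThreadPoolExecutor(max_workers=2)
    pending = pool.submit(load_layer_kv, 0)        # launch layer 0's load up front
    for layer in range(n_layers):
        kv = pending.result()                      # wait for this layer's KVCache
        if layer + 1 < n_layers:
            pending = pool.submit(load_layer_kv, layer + 1)  # trigger next layer's load
        new_kv = attention(layer, kv)              # attention compute for this layer
        pool.submit(store_layer_kv, layer, new_kv) # launch asynchronous store
    pool.shutdown(wait=True)                       # drain outstanding stores

layerwise_prefill(
    n_layers=3,
    load_layer_kv=lambda l: f"kv[{l}]",
    attention=lambda l, kv: f"new-{kv}",
    store_layer_kv=lambda l, kv: print(f"stored layer {l}: {kv}"),
)
```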

5. KVCache Transfer:

The Messenger service is deployed in each node to manage and transfer these caches. Each Messenger operates as an independent process within its respective inference instance, receiving signals to facilitate high-speed, cross-machine KVCache transfer. This step is asynchronously executed and overlapped with the above incremental prefill step, streaming the KVCache generated by each model layer to the destination decoding node's CPU memory to reduce waiting time.
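
Here is a toy sketch of such per-layer streaming: a queue plus a background thread stand in for a separate Messenger process and real RDMA sends. The `Messenger` class and its `send` method are illustrative only.

```python
# Illustrative sketch of per-layer KVCache streaming to a decoding node.
import queue
import threading

class Messenger:
    def __init__(self):
        self.outbox = queue.Queue()
        threading.Thread(target=self._pump, daemon=True).start()
    def _pump(self):
        while True:
            dst, layer, blob = self.outbox.get()
            print(f"streamed layer {layer} KVCache to {dst}")  # stands in for an RDMA send
            self.outbox.task_done()
    def send(self, dst, layer, blob):
        self.outbox.put((dst, layer, blob))  # returns immediately, overlapping with prefill

m = Messenger()
for layer in range(3):                       # called as each layer's prefill finishes
    m.send("decode-node-3", layer, b"...")
m.outbox.join()                              # wait for the stream to drain
```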

Decoding Process

1. KVCache Loading:

Conductor pre-selects the decoding node based on its current load, ensuring that admitting the request will not violate the Time Between Tokens (TBT) SLO. Once the full KVCache has arrived in the decoding node's CPU DRAM, the request joins the next batch in a continuous-batching manner.
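
A toy sketch of that pre-selection, assuming each decoding node can report a predicted per-token latency; the field names and the 50 ms SLO value are made up for illustration.

```python
# Illustrative sketch of load-aware decoding-node selection.
TBT_SLO_MS = 50.0

def pick_decoding_node(nodes):
    """Pick the least-loaded node whose predicted TBT stays within the SLO."""
    eligible = [n for n in nodes if n["predicted_tbt_ms"] <= TBT_SLO_MS]
    if not eligible:
        return None  # admitting the request anywhere would violate the SLO
    return min(eligible, key=lambda n: n["predicted_tbt_ms"])

nodes = [{"id": "d0", "predicted_tbt_ms": 63.0},
         {"id": "d1", "predicted_tbt_ms": 41.5}]
print(pick_decoding_node(nodes)["id"])  # d1
```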

2. Continuous Batching:

Before each iteration, the scheduler checks the status of all requests, adding newly arrived requests to the batch's prefill stage while removing completed requests. This continuous batching process helps in maximizing the Model FLOPs Utilization (MFU) by aggregating as many tokens as possible in a decoding batch.
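
A minimal sketch of such a scheduler loop; the request dicts, `step` callback, and token budget are illustrative stand-ins for the real scheduler interface.

```python
# Illustrative sketch of a continuous-batching scheduler loop.
def continuous_batching(waiting, step, max_batch_tokens=4096):
    running = []
    while waiting or running:
        # Admit newly arrived requests before each iteration,
        # packing as many tokens as the budget allows (for MFU).
        while waiting and sum(r["len"] for r in running) + waiting[0]["len"] <= max_batch_tokens:
            running.append(waiting.pop(0))
        step(running)                                   # one decode iteration for the batch
        running = [r for r in running if not r["done"]] # retire completed requests

def demo_step(batch):
    for r in batch:
        r["len"] += 1                                   # one token per request per iteration
        r["done"] = r["len"] >= r["target"]

reqs = [{"id": i, "len": 8, "target": 10 + i, "done": False} for i in range(3)]
continuous_batching(list(reqs), demo_step)
print([r["len"] for r in reqs])  # [10, 11, 12]
```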

3. Autoregressive Token Generation:

The decoding stage processes only one token at a time per request in the batch, an inherent limitation of autoregressive generation. It uses the KVCache to autoregressively generate new tokens, appending the newly computed keys and values to the KVCache.
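
A sketch of the decode loop under those constraints, with a hypothetical `decode_one` single-token forward and a toy `StubModel`; note that the cache grows by exactly one entry per layer per step.

```python
# Illustrative sketch of the autoregressive decode loop.
def decode(model, first_token, kvcache, max_new_tokens):
    tokens = [first_token]
    for _ in range(max_new_tokens - 1):
        logits, new_kv = model.decode_one(tokens[-1], kvcache)
        for layer, kv in enumerate(new_kv):
            kvcache[layer].append(kv)   # the cache grows by one entry per layer
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

class StubModel:
    def decode_one(self, token, kvcache):
        return [0.1, 0.9, 0.3], [(token, token)]  # fake logits and one layer's (k, v)

print(decode(StubModel(), 5, {0: []}, max_new_tokens=4))  # [5, 1, 1, 1]
```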

4. Asynchronous Loading:

For decoding instances, asynchronous loading of the KVCache is performed concurrently with GPU decoding to prevent GPU idle time, keeping the decoding process efficient and per-token latency low.
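
A sketch of that overlap: a background thread prefetches the next request's KVCache while the current one is decoded, so the GPU (here, a plain function) never waits on a load. All names are illustrative.

```python
# Illustrative sketch of overlapping KVCache loading with decoding.
from concurrent.futures import ThreadPoolExecutor
import time

def serve(requests, load_kvcache, decode_batch):
    pool = ThreadPoolExecutor(max_workers=1)
    prefetch = pool.submit(load_kvcache, requests[0])
    for i, req in enumerate(requests):
        kv = prefetch.result()                     # usually already complete
        if i + 1 < len(requests):
            prefetch = pool.submit(load_kvcache, requests[i + 1])  # overlaps with decode
        decode_batch(req, kv)
    pool.shutdown(wait=True)

serve(["r0", "r1", "r2"],
      load_kvcache=lambda r: (time.sleep(0.01), f"kv({r})")[1],
      decode_batch=lambda r, kv: print(f"decoding {r} with {kv}"))
```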

5. Final Output Generation:

The decoding stage continues until the full sequence of output tokens has been generated. The process is constrained by the TBT SLO, which bounds the latency between successive tokens of the same request.

Summary

1. KVCache Reuse: Load reusable KVCache blocks into GPU memory for the prefill stage.

2. Incremental Prefill: Process input tokens in chunks, store new KVCache in CPU memory.

3. KVCache Transfer: Use Messenger service for high-speed KVCache transfer to decoding nodes.

4. KVCache Loading: Load KVCache into decoding nodes' CPU DRAM, join requests in continuous batching.

5. Autoregressive Generation: Generate tokens one-by-one using KVCache, update KVCache with new keys and values.

6. Asynchronous Operations: Overlap KVCache loading and storing with computation to reduce latency and improve efficiency.
