An engineering view of decoupling prefill and decoding

Prefill and decoding are the two critical stages of Large Language Model (LLM) generation. Here's a detailed breakdown of each stage as implemented in the Mooncake architecture, which decouples them so that Mooncake can handle high loads and long-context scenarios while meeting latency-related Service Level Objectives (SLOs).

A picture is worth a thousand words

[Figure: Overall Workflow]

Show me the code

Check out the draft code here.

Prefill Process

1. Input Token Processing:

The prefill stage begins by processing all input tokens in parallel. This stage is computationally intensive: it generates the first output token while storing the intermediate keys and values computed along the way, known as the KVCache.
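
To make the contract concrete, here is a minimal Python sketch (not Mooncake's code) of what prefill produces: one parallel pass over the whole prompt yields the first output token plus the per-layer KVCache that decoding will later extend. `TinyModel` and its `forward` method are toy stand-ins.

```python
# A minimal sketch (not Mooncake's code) of the prefill contract.
import random

class TinyModel:
    n_layers = 2
    def forward(self, tokens):
        # Pretend each layer yields one (key, value) pair per prompt token.
        kvs = [[(t + layer, t - layer) for t in tokens]
               for layer in range(self.n_layers)]
        logits = [random.random() for _ in range(100)]  # fake scores for the last position
        return logits, kvs

def prefill(model, tokens):
    logits, layer_kvs = model.forward(tokens)           # all prompt tokens in parallel
    kvcache = dict(enumerate(layer_kvs))                # layer index -> list of (k, v)
    first_token = max(range(len(logits)), key=logits.__getitem__)  # greedy first token
    return first_token, kvcache

tok, cache = prefill(TinyModel(), [101, 7592, 2088])
print(tok, {layer: len(kv) for layer, kv in cache.items()})  # e.g. 42 {0: 3, 1: 3}
```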

2. KVCache Reuse:

The selected prefill node receives a request that includes the raw input, the block IDs of the prefix cache that can be reused, and the block IDs of the full cache allocated to the request. It loads the prefix cache from remote CPU memory into GPU memory based on the prefix cache block IDs to bootstrap the request. This step is skipped if no prefix cache exists.
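
A hedged sketch of that bootstrap step follows. The request fields (`prefix_block_ids`, `full_block_ids`), the `cpu_pool` dict, and the plain assignment standing in for a high-speed block copy are all illustrative, not Mooncake's actual API.

```python
# Illustrative sketch of prefix-cache bootstrap on the prefill node.
def bootstrap_prefix_cache(request, cpu_pool, gpu_cache):
    """Load reusable prefix blocks from remote CPU memory into GPU memory."""
    if not request["prefix_block_ids"]:           # no prefix cache: skip this step
        return 0
    for block_id in request["prefix_block_ids"]:
        gpu_cache[block_id] = cpu_pool[block_id]  # stands in for a DMA/RDMA block copy
    return len(request["prefix_block_ids"])

cpu_pool = {7: b"kv-block-7", 9: b"kv-block-9"}
gpu_cache = {}
req = {"input": [101, 7592], "prefix_block_ids": [7, 9], "full_block_ids": [7, 9, 12]}
print(bootstrap_prefix_cache(req, cpu_pool, gpu_cache), "prefix blocks loaded")  # 2
```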

3. Incremental Prefill:

The prefill node completes the prefill stage using the prefix cache and stores the newly generated incremental KVCache back into CPU memory. If the number of uncached input tokens exceeds a threshold (prefill_chunk), the prefill stage is split into multiple chunks and executed in a pipelined manner. The threshold is chosen to fully utilize the GPU's computational power and is typically larger than 1000 tokens.
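
A sketch of the chunking logic under those assumptions; `PREFILL_CHUNK` and `run_prefill_chunk` are illustrative names for the threshold and the per-chunk kernel launch, not Mooncake's API.

```python
# Illustrative sketch of chunked (incremental) prefill.
PREFILL_CHUNK = 1024  # chosen to saturate the GPU; typically > 1000 tokens

def incremental_prefill(uncached_tokens, run_prefill_chunk):
    """Split long prompts into chunks so each chunk fully utilizes the GPU."""
    for start in range(0, len(uncached_tokens), PREFILL_CHUNK):
        chunk = uncached_tokens[start:start + PREFILL_CHUNK]
        run_prefill_chunk(chunk)  # successive chunks can be pipelined

incremental_prefill(list(range(2500)),
                    lambda c: print(f"prefill chunk of {len(c)} tokens"))
# prefill chunk of 1024 tokens (x2), then 452 tokens
```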

4. Layer-wise Prefill:

KVCache loading and storing are executed asynchronously via launch and wait operations. Before each layer's attention computation begins, the model waits for the asynchronous loading of that layer's KVCache to complete and triggers the next layer's asynchronous KVCache loading. After the attention calculation is complete, asynchronous storage of that layer's KVCache is launched.
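
The launch/wait pattern can be sketched with Python threads standing in for asynchronous CUDA/RDMA transfers; `load_layer_kv`, `attention`, and `store_layer_kv` are hypothetical callables, not the real kernel interfaces.

```python
# Illustrative sketch of the layer-wise launch/wait pipeline.
from concurrent.futures import ThreadPoolExecutor

def layerwise_prefill(n_layers, load_layer_kv, attention, store_layer_kv):
    pool = ThreadPoolExecutor(max_workers=2)
    pending = pool.submit(load_layer_kv, 0)        # launch layer 0's load up front
    for layer in range(n_layers):
        kv = pending.result()                      # wait for this layer's KVCache
        if layer + 1 < n_layers:
            pending = pool.submit(load_layer_kv, layer + 1)  # trigger next layer's load
        new_kv = attention(layer, kv)              # attention compute for this layer
        pool.submit(store_layer_kv, layer, new_kv) # launch asynchronous store
    pool.shutdown(wait=True)                       # drain outstanding stores

layerwise_prefill(
    n_layers=3,
    load_layer_kv=lambda l: f"kv[{l}]",
    attention=lambda l, kv: f"new-{kv}",
    store_layer_kv=lambda l, kv: print(f"stored layer {l}: {kv}"),
)
```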

5. KVCache Transfer:

The Messenger service is deployed in each node to manage and transfer these caches. Each Messenger operates as an independent process within its respective inference instance, receiving signals to facilitate high-speed, cross-machine KVCache transfer. This step is asynchronously executed and overlapped with the above incremental prefill step, streaming the KVCache generated by each model layer to the destination decoding node's CPU memory to reduce waiting time.
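
Here is a toy sketch of such per-layer streaming: a queue plus a background thread stand in for a separate Messenger process and real RDMA sends. The `Messenger` class and its `send` method are illustrative only.

```python
# Illustrative sketch of per-layer KVCache streaming to a decoding node.
import queue
import threading

class Messenger:
    def __init__(self):
        self.outbox = queue.Queue()
        threading.Thread(target=self._pump, daemon=True).start()
    def _pump(self):
        while True:
            dst, layer, blob = self.outbox.get()
            print(f"streamed layer {layer} KVCache to {dst}")  # stands in for an RDMA send
            self.outbox.task_done()
    def send(self, dst, layer, blob):
        self.outbox.put((dst, layer, blob))  # returns immediately, overlapping with prefill

m = Messenger()
for layer in range(3):                       # called as each layer's prefill finishes
    m.send("decode-node-3", layer, b"...")
m.outbox.join()                              # wait for the stream to drain
```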

Decoding Process

1. KVCache Loading:

Conductor pre-selects the decoding node based on its current load, ensuring that admitting the request will not violate the Time Between Tokens (TBT) SLO. Once the full KVCache has arrived in the decoding node's CPU DRAM, the request joins the next batch in a continuous-batching manner.
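
A toy sketch of that pre-selection, assuming each decoding node can report a predicted per-token latency; the field names and the 50 ms SLO value are made up for illustration.

```python
# Illustrative sketch of load-aware decoding-node selection.
TBT_SLO_MS = 50.0

def pick_decoding_node(nodes):
    """Pick the least-loaded node whose predicted TBT stays within the SLO."""
    eligible = [n for n in nodes if n["predicted_tbt_ms"] <= TBT_SLO_MS]
    if not eligible:
        return None  # admitting the request anywhere would violate the SLO
    return min(eligible, key=lambda n: n["predicted_tbt_ms"])

nodes = [{"id": "d0", "predicted_tbt_ms": 63.0},
         {"id": "d1", "predicted_tbt_ms": 41.5}]
print(pick_decoding_node(nodes)["id"])  # d1
```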

2. Continuous Batching:

Before each iteration, the scheduler checks the status of all requests, adding newly arrived requests to the batch's prefill stage while removing completed requests. This continuous batching process helps in maximizing the Model FLOPs Utilization (MFU) by aggregating as many tokens as possible in a decoding batch.
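
A minimal sketch of such a scheduler loop; the request dicts, `step` callback, and token budget are illustrative stand-ins for the real scheduler interface.

```python
# Illustrative sketch of a continuous-batching scheduler loop.
def continuous_batching(waiting, step, max_batch_tokens=4096):
    running = []
    while waiting or running:
        # Admit newly arrived requests before each iteration,
        # packing as many tokens as the budget allows (for MFU).
        while waiting and sum(r["len"] for r in running) + waiting[0]["len"] <= max_batch_tokens:
            running.append(waiting.pop(0))
        step(running)                                   # one decode iteration for the batch
        running = [r for r in running if not r["done"]] # retire completed requests

def demo_step(batch):
    for r in batch:
        r["len"] += 1                                   # one token per request per iteration
        r["done"] = r["len"] >= r["target"]

reqs = [{"id": i, "len": 8, "target": 10 + i, "done": False} for i in range(3)]
continuous_batching(list(reqs), demo_step)
print([r["len"] for r in reqs])  # [10, 11, 12]
```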

3. Autoregressive Token Generation:

The decoding stage processes only one token at a time per request in the batch, an inherent limitation of autoregressive generation. It uses the KVCache to autoregressively generate new tokens, appending the newly computed keys and values to the KVCache.
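
A sketch of the decode loop under those constraints, with a hypothetical `decode_one` single-token forward and a toy `StubModel`; note that the cache grows by exactly one entry per layer per step.

```python
# Illustrative sketch of the autoregressive decode loop.
def decode(model, first_token, kvcache, max_new_tokens):
    tokens = [first_token]
    for _ in range(max_new_tokens - 1):
        logits, new_kv = model.decode_one(tokens[-1], kvcache)
        for layer, kv in enumerate(new_kv):
            kvcache[layer].append(kv)   # the cache grows by one entry per layer
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

class StubModel:
    def decode_one(self, token, kvcache):
        return [0.1, 0.9, 0.3], [(token, token)]  # fake logits and one layer's (k, v)

print(decode(StubModel(), 5, {0: []}, max_new_tokens=4))  # [5, 1, 1, 1]
```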

4. Asynchronous Loading:

For decoding instances, asynchronous loading of the KVCache is performed concurrently with GPU decoding to prevent GPU idle time, keeping the decoding process efficient and per-token latency low.
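
A sketch of that overlap: a background thread prefetches the next request's KVCache while the current one is decoded, so the GPU (here, a plain function) never waits on a load. All names are illustrative.

```python
# Illustrative sketch of overlapping KVCache loading with decoding.
from concurrent.futures import ThreadPoolExecutor
import time

def serve(requests, load_kvcache, decode_batch):
    pool = ThreadPoolExecutor(max_workers=1)
    prefetch = pool.submit(load_kvcache, requests[0])
    for i, req in enumerate(requests):
        kv = prefetch.result()                     # usually already complete
        if i + 1 < len(requests):
            prefetch = pool.submit(load_kvcache, requests[i + 1])  # overlaps with decode
        decode_batch(req, kv)
    pool.shutdown(wait=True)

serve(["r0", "r1", "r2"],
      load_kvcache=lambda r: (time.sleep(0.01), f"kv({r})")[1],
      decode_batch=lambda r, kv: print(f"decoding {r} with {kv}"))
```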

5. Final Output Generation:

The decoding stage continues until the full sequence of output tokens has been generated. The process is constrained by the TBT SLO, which bounds the latency between successive tokens of the same request.

Summary

1. KVCache Reuse: Load reusable KVCache blocks into GPU memory for the prefill stage.

2. Incremental Prefill: Process input tokens in chunks, store new KVCache in CPU memory.

3. KVCache Transfer: Use Messenger service for high-speed KVCache transfer to decoding nodes.

4. KVCache Loading: Load KVCache into decoding nodes' CPU DRAM, join requests in continuous batching.

5. Autoregressive Generation: Generate tokens one-by-one using KVCache, update KVCache with new keys and values.

6. Asynchronous Operations: Overlap KVCache loading and storing with computation to reduce latency and improve efficiency.
