From Sequential Tokens to Parallel Refinement: Implementing Discrete Diffusion Adaptation
Why Another Decoder?
Large generative models have made spectacular progress in creating coherent text, images and videos, but most of them still decode one token at a time. Whether you are sampling a thousand pixels for a picture or hundreds of words for a caption, the model marches from left to right, predicting each token based only on what came before. This autoregressive approach is simple and stable, but it means inference cost scales linearly with sequence length. A single 256×256 image can require over 16k tokens, and every one of them costs a forward pass.
Discrete Diffusion Adaptation (DiDA) offers an alternative. Instead of drawing one token after another, DiDA treats the entire sequence as a noisy canvas and refines all positions in parallel. In the original Emu 3.5 release, the authors show that this bidirectional refinement can make image generation roughly 20× faster than classic autoregressive sampling without compromising quality. The key insight is to borrow the noise‑schedule ideas of diffusion models and apply them to discrete sequences. Our open‑source project implements this idea from scratch and adapts it for any vocabulary of tokens—text, image patches or mixed modalities.
A Mosaic Metaphor
Imagine building a mosaic. In the autoregressive workflow you place tiles one by one, unable to see the full picture until the very end. If you decide midway that a corner needs a different colour, you must remove and replace every tile that comes after. In the DiDA workflow, you throw down all the tiles at once in an initial random configuration and then refine the entire mosaic together. Each pass looks at tiles on the left and right to decide if they need to change. By the final pass the mosaic settles into a coherent image.
This metaphor highlights two properties of DiDA:
- Parallel updates: every tile (token) can change on each pass, instead of one placement per forward pass.
- Bidirectional context: each decision looks at neighbours on both the left and the right, so early mistakes can still be corrected in later passes.
What’s Inside Our Implementation?
The project is designed to be a clear, modular reference for DiDA. It contains a few core components, each of which can be adapted to your own model:
1. DiscreteDiffusionScheduler
This class defines a noise schedule over a fixed number of refinement steps. As in continuous diffusion models, early steps inject a lot of noise (allowing the model to overwrite most positions), while later steps inject very little, focusing on fine detail. The scheduler exposes a get_timesteps(reverse=True) method that yields timesteps from the noisiest state down to the clean state, and provides helper functions that decide how many tokens to update at each step.
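As a rough sketch, such a scheduler can be just a few lines; the cosine shape and the constructor arguments below are illustrative choices, not the exact code:

```python
import math

class DiscreteDiffusionScheduler:
    """Noise schedule over a fixed number of refinement steps (sketch)."""

    def __init__(self, num_steps: int = 16):
        self.num_steps = num_steps

    def get_timesteps(self, reverse: bool = True):
        # reverse=True walks from the noisiest timestep down to the clean state.
        steps = list(range(self.num_steps))
        return steps[::-1] if reverse else steps

    def mask_ratio(self, t: int) -> float:
        # Near 1.0 at the noisy end (most positions may be overwritten),
        # near 0.0 at the clean end (only fine detail changes).
        progress = (self.num_steps - 1 - t) / max(self.num_steps - 1, 1)
        return math.cos(0.5 * math.pi * progress)

    def num_tokens_to_update(self, t: int, seq_len: int) -> int:
        # Helper used by the sampler: how many positions to resample at step t.
        return max(1, int(self.mask_ratio(t) * seq_len))
```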
2. DiDAAttentionMask
For multimodal tasks, we often want to keep a text prefix fixed while refining the image portion. DiDAAttentionMask constructs hybrid masks for this scenario. The mask ensures that text tokens attend to themselves and cross‑attend to image tokens, while image tokens attend to both text and other image tokens. When sampling a pure image without a text prefix, it falls back to a full attention mask. This hybrid attention is crucial for aligning images with prompts and is one of the distinguishing features of our implementation.
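A minimal sketch of such a mask builder follows; the function name, shapes and the explicit block structure are illustrative, not the exact API:

```python
import torch

def build_hybrid_mask(text_len: int, image_len: int) -> torch.Tensor:
    """Boolean (L, L) mask where True means "query i may attend to key j"."""
    total = text_len + image_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:text_len, :text_len] = True   # text attends to itself
    mask[:text_len, text_len:] = True   # text cross-attends to image tokens
    mask[text_len:, :] = True           # image attends to text and other images
    # With text_len == 0 this reduces to the full attention mask used for
    # pure image sampling.
    return mask
```

Writing the mask block by block looks redundant when every block is enabled, but it makes each text/image interaction an explicit switch you can flip when experimenting, for example blocking the text-to-image block to keep the prefix representation independent of the evolving image state.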
3. DiDACore
DiDACore wraps a Transformer decoder that processes the whole sequence in one forward pass. The core method denoise_step accepts the current sequence of tokens and a timestep and returns logits for every position. Internally it:
- embeds the tokens,
- adds a timestep embedding so the network knows how noisy the input is,
- applies a stack of encoder-style layers without a causal mask, and
- projects the hidden states to logits over the vocabulary.
This design matches the forward and denoise_step methods in our code base. Because the decoder is bidirectional, it can refine tokens in any order while keeping the whole sequence consistent.
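To make this concrete, here is a simplified PyTorch sketch; nn.TransformerEncoder and the layer sizes below stand in for the actual backbone:

```python
import torch
import torch.nn as nn
from typing import Optional

class DiDACore(nn.Module):
    """Bidirectional denoiser (simplified sketch; sizes are illustrative)."""

    def __init__(self, vocab_size: int, dim: int = 512, depth: int = 8,
                 num_heads: int = 8, max_steps: int = 16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.time_emb = nn.Embedding(max_steps, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True, norm_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_logits = nn.Linear(dim, vocab_size)

    def denoise_step(self, tokens: torch.Tensor, t: int,
                     attend: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Embed the noisy tokens and add a timestep embedding so the network
        # knows how much noise to expect at this step.
        x = self.token_emb(tokens) + self.time_emb(
            torch.tensor(t, device=tokens.device))
        # Encoder-style layers, no causal mask: every position sees both its
        # left and right context. PyTorch attention masks mean "True = blocked",
        # so an "allowed to attend" matrix must be inverted.
        x = self.layers(x, mask=None if attend is None else ~attend)
        # Logits for every position in one forward pass.
        return self.to_logits(x)
```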
4. DiDASampler
The sampler orchestrates the refinement loop. It begins with a sequence initialised either to random tokens (for pure image generation) or to a text prefix followed by mask tokens (for text-conditioned images). For each timestep provided by the scheduler it:
- runs denoise_step to obtain logits for every position,
- asks the scheduler how many tokens may change at this step,
- selects the positions to update (the text prefix, if present, is never touched), and
- replaces those positions with newly sampled tokens.
This loop continues until all timesteps are processed. In pure image mode and in hybrid mode the update rules differ slightly, but the overall pattern remains the same.
Additional Features
Pseudocode at a Glance
Below is high-level, Python-flavoured pseudocode that mirrors our implementation; the identifiers are simplified for readability and the helper names are placeholders rather than the real API. It emphasises the interplay between the scheduler, the core model and the sampler.
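```python
# Python-flavoured pseudocode; initial_sequence, build_attention_mask and
# least_confident are placeholders, not the real API.
scheduler = DiscreteDiffusionScheduler(num_steps=16)
core = DiDACore(vocab_size=vocab_size)

tokens = initial_sequence()          # random tokens, or text prefix + mask tokens
attn_mask = build_attention_mask()   # hybrid mask when a text prefix is present

for t in scheduler.get_timesteps(reverse=True):
    # One bidirectional forward pass over the whole sequence.
    logits = core.denoise_step(tokens, t, attn_mask)
    probs = logits.softmax(dim=-1)
    proposal = probs.argmax(dim=-1)          # candidate token at every position
    confidence = probs.max(dim=-1).values    # per-position certainty

    # The noise schedule decides how many positions may change at this step:
    # many in the early, noisy steps; only a few near the clean state.
    k = scheduler.num_tokens_to_update(t, seq_len=tokens.shape[-1])
    positions = least_confident(confidence, k)   # excludes the text prefix
    tokens[positions] = proposal[positions]

# `tokens` now holds the refined sequence.
```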
Two notes about this pseudocode. First, the number of positions updated at each step comes from the scheduler's helpers, so early, noisy steps may rewrite most of the sequence while the final steps touch only a handful of tokens. Second, in hybrid mode the text prefix is excluded from the update set: only image positions are ever resampled.
Implementing Your Own DiDA Pipeline
Here’s a step‑by‑step guide to using our project. You can adapt it to your own models or datasets.
1. Install and Set Up
Clone the repository and install dependencies (PyTorch 2.0+ is required). The README walks through the exact commands and includes a simple end-to-end example.
2. Construct the Components
Import the scheduler, core and sampler from the library. You need to specify the vocabulary size (text and image), the embedding dimension and the depth of the transformer. Something like the following, where the import path and keyword names are illustrative (check the README for the exact signatures):
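```python
from dida import DiscreteDiffusionScheduler, DiDACore, DiDASampler

scheduler = DiscreteDiffusionScheduler(num_steps=16)   # refinement steps
core = DiDACore(
    vocab_size=40_000,   # combined text + image vocabulary
    dim=512,             # embedding dimension
    depth=8,             # number of transformer layers
)
sampler = DiDASampler(core=core, scheduler=scheduler)
```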
3. Generate a Pure Image
To generate an image without text conditioning, call the sample_image method. You must specify the desired sequence length (the number of image tokens) and optionally a batch size; the argument names below are illustrative:
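```python
# Continuing from the setup above.
image_tokens = sampler.sample_image(
    seq_len=1024,     # e.g. a 32x32 grid of image tokens
    batch_size=4,     # optional: generate four images in parallel
)
# image_tokens: LongTensor of shape (4, 1024), ready for your image detokenizer.
```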
4. Generate an Image with a Text Prefix
To guide the generation with a caption, first encode your text into tokens and then call sample_image_and_text (method and argument names may vary slightly):
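```python
import torch

# `tokenizer` stands in for whatever text tokenizer your model uses.
prompt_ids = tokenizer.encode("a watercolour painting of a lighthouse at dusk")
text_tokens = torch.tensor(prompt_ids).unsqueeze(0)   # shape (1, text_len)

image_tokens = sampler.sample_image_and_text(
    text_tokens=text_tokens,
    image_seq_len=1024,   # number of image tokens to refine after the prefix
)
```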
The sampler will keep the text tokens fixed (they are not updated) and refine only the image portion. Because the attention mask allows the image tokens to attend back to the text, the generated image should align well with the caption.
5. Customize and Experiment
The components are deliberately decoupled, so you can experiment freely: change the number of refinement steps, reshape the noise schedule, adjust how many tokens are updated per step, or modify the attention mask for your own modality layout.
6. Evaluate and Extend
Our project comes with unit tests and example scripts. Use them to verify that your modifications do not break the refinement loop. Because the architecture is modular, you can plug in a different backbone (e.g., a mixture‑of‑experts transformer) or adjust the attention mask to handle more than two modalities.
Final Thoughts
Discrete Diffusion Adaptation is more than a clever speed hack. It fundamentally reframes generation as a global refinement problem rather than a strictly sequential process. By combining the strengths of diffusion (noise schedules and parallel updates) with the flexibility of transformers, DiDA enables models to produce coherent images and sequences much faster than traditional decoding. Our implementation emphasises clarity, testability and extensibility. With a few dozen lines of code you can plug DiDA into your own projects and experiment with new ways of generating and refining discrete sequences.
If you’ve read my previous posts on context engineering or atoms of thought, you’ll recognise a recurring theme: rethink the interface between information and inference. DiDA continues that exploration by reimagining how decoders operate. I hope this guide demystifies the core ideas and helps you build your own parallel refinement pipelines. As always, feedback and contributions are welcome!