From Sequential Tokens to Parallel Refinement: Implementing Discrete Diffusion Adaptation


Why Another Decoder?

Large generative models have made spectacular progress in creating coherent text, images and videos, but most of them still decode one token at a time. Whether you are sampling a thousand pixels for a picture or hundreds of words for a caption, the model marches from left to right, predicting each token based only on what came before. This autoregressive approach is simple and stable, but it means inference scales linearly with sequence length. A single 256×256 image might require over 16,000 tokens — that’s a lot of forward passes.

Discrete Diffusion Adaptation (DiDA) offers an alternative. Instead of drawing one token after another, DiDA treats the entire sequence as a noisy canvas and refines all positions in parallel. In the original Emu 3.5 release, the authors show that this bidirectional refinement can make image generation roughly 20× faster than classic autoregressive sampling without compromising quality. The key insight is to borrow the noise‑schedule ideas of diffusion models and apply them to discrete sequences. Our open‑source project implements this idea from scratch and adapts it for any vocabulary of tokens—text, image patches or mixed modalities.

A Mosaic Metaphor

Imagine building a mosaic. In the autoregressive workflow you place tiles one by one, unable to see the full picture until the very end. If you decide midway that a corner needs a different colour, you must remove and replace every tile that comes after. In the DiDA workflow, you throw down all the tiles at once in an initial random configuration and then refine the entire mosaic together. Each pass looks at tiles on the left and right to decide if they need to change. By the final pass the mosaic settles into a coherent image.

This metaphor highlights two properties of DiDA:

  • Bidirectional context – tokens are refined using information from both directions rather than only from the past. In our implementation the causal mask is removed so the transformer can attend across the full sequence.
  • Parallel updates – every pass updates many positions at once, gradually reducing the amount of noise according to a schedule. This yields significant speed‑ups on modern hardware.

What’s Inside Our Implementation?

The project is designed to be a clear, modular reference for DiDA. It contains a few core components, each of which can be adapted to your own model:

1. DiscreteDiffusionScheduler

This class defines a noise schedule over a fixed number of refinement steps. As in continuous diffusion models, early steps inject a lot of noise (allowing the model to overwrite most positions), while later steps inject very little, focusing on fine detail. The scheduler exposes a get_timesteps(reverse=True) method that yields timesteps from the noisiest state down to the clean state, along with helper functions that decide how many tokens to update at each step.
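To make that concrete, here is a minimal, self-contained sketch of such a scheduler in plain Python. The class and method names follow the description above, but the cosine shape of the schedule and the exact signatures are assumptions, not the project's actual API:

```python
import math

class DiscreteDiffusionScheduler:
    """Minimal sketch of a discrete-diffusion noise schedule.

    The cosine shape and the helper names are illustrative assumptions."""

    def __init__(self, num_steps: int = 10):
        self.num_steps = num_steps

    def get_timesteps(self, reverse: bool = True):
        # Timesteps run from the noisiest state (num_steps - 1) down to 0.
        steps = list(range(self.num_steps))
        return steps[::-1] if reverse else steps

    def noise_fraction(self, t: int) -> float:
        # Fraction of positions still considered noisy at step t:
        # close to 1.0 at the noisiest step, near 0.0 at the final step.
        progress = 1.0 - t / max(self.num_steps - 1, 1)
        return math.cos(0.5 * math.pi * progress)

    def tokens_to_update(self, t: int, seq_len: int) -> int:
        # Rewrite many positions early and progressively fewer later.
        return max(1, round(self.noise_fraction(t) * seq_len))
```

With num_steps=10 and a 256-token sequence, the first pass may rewrite all 256 positions while the last pass touches only one.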

2. DiDAAttentionMask

For multimodal tasks, we often want to keep a text prefix fixed while refining the image portion. DiDAAttentionMask constructs hybrid masks for this scenario. The mask ensures that text tokens attend to themselves and cross‑attend to image tokens, while image tokens attend to both text and other image tokens. When sampling a pure image without a text prefix, it falls back to a full attention mask. This hybrid attention is crucial for aligning images with prompts and is one of the distinguishing features of our implementation.
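A minimal sketch of that logic follows, using nested lists instead of tensors. The helper name and the paired update mask are illustrative assumptions, not the library's API:

```python
def build_hybrid_masks(text_len: int, image_len: int):
    """Sketch of hybrid masking for a text prefix plus image tokens.

    attention[i][j] is True when position i may attend to position j.
    updatable[i] is True when the sampler may rewrite position i."""
    total = text_len + image_len
    # Text tokens attend to themselves and cross-attend to image tokens;
    # image tokens attend to both text and other image tokens. Taken
    # together these rules let every position see every other one, and
    # with text_len == 0 this falls back to a plain full-attention mask.
    attention = [[True] * total for _ in range(total)]
    # What the text prefix really changes is which positions may be
    # rewritten during refinement: only the image portion is updatable.
    updatable = [i >= text_len for i in range(total)]
    return attention, updatable
```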

3. DiDACore

DiDACore wraps a Transformer decoder that processes the whole sequence in one forward pass. It embeds tokens, adds timestep embeddings and applies a stack of encoder‑style layers without a causal mask. The core method denoise_step accepts the current sequence of tokens and a timestep and returns logits for every position. Internally it:

  1. Embeds the tokens and adds positional and timestep encodings.
  2. Passes the hidden states through a stack of self‑attention and feed‑forward layers. Because we remove the causal mask, each token can attend to both left and right context.
  3. Projects the final hidden states back to token logits through a linear head.

This design matches the forward and denoise_step methods in our code base. Because the decoder is bidirectional, it can consistently refine tokens in any order.
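A toy, self-contained stand-in shows the shape of this contract. Every stage below is a stub standing in for learned layers; only the interface (sequence plus timestep in, per-position logits out) mirrors the description:

```python
def denoise_step(tokens, timestep, vocab_size):
    """Toy stand-in for a bidirectional denoise_step (illustrative only).

    A real implementation would (1) embed tokens and add positional and
    timestep encodings, (2) run self-attention and feed-forward layers
    with no causal mask, and (3) project hidden states to logits."""
    # (1) "Embed": a fake hidden state per position.
    hidden = [[float(tok), float(pos), float(timestep)]
              for pos, tok in enumerate(tokens)]
    # (2) "Mix": blend in the sequence mean so every position sees
    # information from both its left and right context.
    mean = [sum(col) / len(hidden) for col in zip(*hidden)]
    hidden = [[h + m for h, m in zip(row, mean)] for row in hidden]
    # (3) "Project": one logit per vocabulary entry for every position.
    return [[-abs(v - sum(row)) for v in range(vocab_size)]
            for row in hidden]
```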

4. DiDASampler

The sampler orchestrates the refinement loop. It begins with a sequence initialised either to random tokens (for pure image generation) or to a text prefix followed by mask tokens (for text‑conditioned images). For each timestep provided by the scheduler it performs:

  1. Construct attention mask: If there is a text prefix, build the hybrid mask; otherwise use a full attention mask.
  2. Predict new tokens: Call DiDACore.denoise_step to get logits for all positions at the current timestep.
  3. Sample and update: Decide which positions to update based on the noise schedule. For those positions, pick the most likely token or sample from the distribution and write it back into the sequence.

This loop continues until all timesteps are processed. In pure image mode and in hybrid mode the update rules differ slightly, but the overall pattern remains the same.
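The three steps above can be sketched end-to-end in a few lines. The sketch is self-contained and illustrative rather than the project's actual sampler: the model call is a stub that returns random logits, and the schedule simply halves the number of rewritten positions each pass:

```python
import random

def refine(seq_len, vocab_size, num_steps=4, prefix=(), rng=random.Random(0)):
    """Illustrative DiDA-style refinement loop (not the project's API).

    Positions covered by `prefix` stay fixed; the rest start as random
    tokens and are rewritten in progressively smaller batches."""
    tokens = list(prefix) + [rng.randrange(vocab_size)
                             for _ in range(seq_len - len(prefix))]
    updatable = list(range(len(prefix), seq_len))
    for step in range(num_steps):
        # 1. Attention mask: full attention in this sketch (a hybrid
        #    mask would be built here when a text prefix is present).
        # 2. Predict: stub model -- random logits per updatable position.
        logits = {i: [rng.random() for _ in range(vocab_size)]
                  for i in updatable}
        # 3. Sample and write back: rewrite a shrinking subset (half the
        #    updatable positions per pass, at least one), greedily.
        n = max(1, len(updatable) // 2)
        for i in rng.sample(updatable, n):
            row = logits[i]
            tokens[i] = max(range(vocab_size), key=row.__getitem__)
    return tokens
```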

Additional Features

  • Tests and Documentation – The repository includes unit tests and thorough documentation, making it easy to understand each component and modify it for your own needs.
  • Plug‑and‑Play Components – You can swap out the underlying transformer with your own architecture, and the scheduler and sampler will remain unchanged.
  • Separate Sampling Pipelines – There are dedicated functions for sampling pure images and for sampling images conditioned on text. This separation makes it straightforward to use DiDA in a vision‑only context or in a vision‑language model.

Pseudocode at a Glance

Below is high‑level pseudocode that mirrors our implementation. It emphasises the interplay between the scheduler, the core model and the sampler.

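In outline (the names compute_update_mask and dida_core come from the notes below; the remaining function names are illustrative):

```
tokens = init_sequence(prefix, seq_len)          # random tokens, or prefix + mask tokens
for t in scheduler.get_timesteps(reverse=True):
    attn_mask   = build_attention_mask(prefix)               # hybrid or full
    logits      = dida_core.denoise_step(tokens, t)
    update_mask = scheduler.compute_update_mask(t, seq_len)  # positions to rewrite
    tokens      = apply_updates(tokens, logits, update_mask) # argmax or sampled
return tokens
```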


Two notes about this pseudocode:

  • The scheduler’s compute_update_mask typically updates many positions in early steps and progressively fewer in later steps, analogous to the variance schedule of diffusion models.
  • Because dida_core is bidirectional, it never relies on a causal mask and therefore can incorporate context from both sides at every step.

Implementing Your Own DiDA Pipeline

Here’s a step‑by‑step guide to using our project. You can adapt it to your own models or datasets.

1. Install and Set Up

Clone the repository and install the dependencies (PyTorch 2.0+ is required). The README provides detailed instructions along with a simple example.

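A typical setup looks something like this (the repository URL and directory are placeholders; use the ones from the README):

```
git clone <repository-url>
cd <repository-directory>
pip install -e .        # PyTorch 2.0+ required
```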

2. Construct the Components

Import the scheduler, core and sampler from the library. You need to specify the vocabulary size (text and image), embedding dimension and depth of the transformer:

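Construction looks roughly like the following. The class names match the components above, but the exact parameters are assumptions; check the repository for the real signatures:

```
scheduler = DiscreteDiffusionScheduler(num_steps=10)
core      = DiDACore(vocab_size=text_vocab + image_vocab,
                     embed_dim=512, num_layers=12)
sampler   = DiDASampler(core=core, scheduler=scheduler)
```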

3. Generate a Pure Image

To generate an image without text conditioning, call the sample_image method. You must specify the desired sequence length (number of image tokens) and optionally a batch size:

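For example (argument names are illustrative):

```
image_tokens = sampler.sample_image(seq_len=256, batch_size=4)
# image_tokens: batch of 256 token ids each, to be decoded back
# into pixels by your image tokenizer
```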

4. Generate an Image with a Text Prefix

To guide the generation with a caption, first encode your text into tokens and then call sample_image_and_text (method names may vary slightly):

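A sketch of the call, with a made-up caption and illustrative argument names:

```
prefix = tokenizer.encode("a red bicycle leaning against a brick wall")
image_tokens = sampler.sample_image_and_text(text_prefix=prefix,
                                             image_len=256)
```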

The sampler will keep the text tokens fixed (they are not updated) and refine only the image portion. Because the attention mask allows the image tokens to attend back to the text, the generated image should align well with the caption.

5. Customize and Experiment

  • Number of steps – Changing num_steps in the scheduler trades quality for speed. Fewer steps lead to faster inference but potentially less refined images.
  • Update schedule – You can override the scheduler’s default update schedule to experiment with different noise schedules or dynamic token selection strategies.
  • Sampler strategies – Instead of always choosing the argmax token, try sampling with a temperature or applying classifier‑free guidance for conditional tasks.
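As a concrete example of the last point, temperature sampling over a single position's logits can be written in a few lines of plain Python (self-contained, independent of the library):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random.Random(0)):
    """Pick a token id from a logit vector.

    temperature == 0 means greedy argmax; higher values flatten the
    distribution and make lower-ranked tokens more likely."""
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)
    # Softmax with the usual max-subtraction for numerical stability.
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    # Inverse-CDF sampling over the unnormalised weights.
    r = rng.random() * sum(weights)
    acc = 0.0
    for token_id, w in enumerate(weights):
        acc += w
        if r < acc:
            return token_id
    return len(logits) - 1
```

Dropping this in place of the argmax inside the sampler's write-back step is usually a one-line change.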

6. Evaluate and Extend

Our project comes with unit tests and example scripts. Use them to verify that your modifications do not break the refinement loop. Because the architecture is modular, you can plug in a different backbone (e.g., a mixture‑of‑experts transformer) or adjust the attention mask to handle more than two modalities.

Final Thoughts

Discrete Diffusion Adaptation is more than a clever speed hack. It fundamentally reframes generation as a global refinement problem rather than a strictly sequential process. By combining the strengths of diffusion (noise schedules and parallel updates) with the flexibility of transformers, DiDA enables models to produce coherent images and sequences much faster than traditional decoding. Our implementation emphasises clarity, testability and extensibility. With a few dozen lines of code you can plug DiDA into your own projects and experiment with new ways of generating and refining discrete sequences.

If you’ve read my previous posts on context engineering or atoms of thought, you’ll recognise a recurring theme: rethink the interface between information and inference. DiDA continues that exploration by reimagining how decoders operate. I hope this guide demystifies the core ideas and helps you build your own parallel refinement pipelines. As always, feedback and contributions are welcome!



More articles by Harel Wilner
