
How AI Creates Images from Thin Air: Understanding Diffusion Models

Have you ever wondered how ChatGPT or Google's Gemini can create images just from your text description? You type "a cat wearing sunglasses on a beach," and seconds later, there's your picture. One of the technologies behind this magic, called diffusion models, follows a surprisingly logical process.

Imagine dust slowly settling on a beautiful painting until the image completely disappears. Diffusion models learn to reverse this process. They watch thousands of images gradually disappear under noise, then learn to run the process backwards. Starting with pure static, they methodically remove noise until a clear image emerges. But here's the puzzle everyone asks: if we start with random noise containing no image, where does the cat actually come from? Let's walk through the complete journey.

Part 1: Training the AI to Recognize Patterns in Noise

To understand where images come from, we start with training. Think of teaching a child to recognize animals by showing many examples. Researchers gather thousands of real cat photographs from the internet: orange tabbies, black cats, Persian cats, cats playing, sleeping, indoors and outdoors. The AI starts out knowing nothing, like a newborn child.

Researchers take each real photo and deliberately destroy it through forward diffusion. Consider one photo of an orange tabby on a window. The destruction happens gradually over one thousand steps. At step zero, we have the crystal-clear photograph. By step 100, it looks slightly grainy, like an old 1970s photo, with about 90% of the cat still visible. At step 300, the image has become noticeably foggy, perhaps 70% recognizable. By step 500, it's very blurry, only 40% identifiable. At step 800, almost pure noise remains, maybe 10% perceptible. Finally, at step 1000, the image becomes pure random static with 0% of the original cat visible.

The mathematics behind this can be written compactly. At each step, the noisy image equals most of the previous image plus a little fresh random noise:

x_t = √(α_t) · x_{t−1} + √(1 − α_t) · ε,  where ε is random Gaussian noise

The α_t values act like a volume control, gradually turning down the "cat signal" while turning up the noise.
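This forward process can be sketched in a few lines of NumPy. A convenient identity from [1] lets us jump from the clean image straight to any step t in one shot, using the running product ᾱ_t of the α values. The 4×4 "image" and the noise schedule below are illustrative assumptions, not the values used in real systems:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, t, alphas, rng):
    """Noise a clean image x0 directly to step t in one jump, using the
    closed-form identity x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*noise,
    where abar_t is the cumulative product of the alpha schedule."""
    alpha_bar = np.prod(alphas[: t + 1])       # how much "cat signal" survives
    noise = rng.standard_normal(x0.shape)      # fresh Gaussian static
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# Toy setup: a tiny 4x4 "image" of ones and a slowly decaying schedule
x0 = np.ones((4, 4))
alphas = np.linspace(0.999, 0.99, 1000)        # illustrative noise schedule

slightly_noisy = forward_diffuse(x0, 100, alphas, rng)   # still mostly image
pure_static = forward_diffuse(x0, 999, alphas, rng)      # almost no image left
```

By step 999 the cumulative ᾱ has fallen below 1%, so essentially none of the original image survives, matching the "0% visible" endpoint described above.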

Here's where real training begins. Researchers show the AI the blurry image from step 500 and ask: what did this look like at step 499? The AI guesses. At first, its guesses are completely wrong. But researchers have the correct answer because they saved step 499 when creating the noisy versions. They show the AI the right answer and adjust its internal parameters. This repeats millions of times, using thousands of cat photos, at every step from 1000 down to 1.

What the AI learns can be expressed as ε_θ(x_t, t), meaning it identifies exactly what random noise was mixed into an image. Once it identifies the noise, it can subtract it to clean up the image. The training uses thousands of diverse cat images: tabbies, Siamese, Maine Coons, hairless Sphynx cats, or even just a "kucing kampung" (village cat) from a small village in West Java. Through this training, the AI learns that certain blur patterns at step 500 usually indicate a cat ear, or that particular noise at step 300 typically conceals whiskers. After weeks of training, the AI possesses the remarkable ability to predict what noise corrupts any image at any step [1].
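The training objective in [1] reduces to a simple mean-squared error between the noise that was actually injected and the network's guess ε_θ(x_t, t). A minimal sketch, with a stand-in function in place of a real neural network (the sizes and schedule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def training_loss(model, x0, t, alphas, rng):
    """One training example: corrupt x0 to step t with known noise, then
    score the model on how exactly it recovers that noise (DDPM's loss)."""
    alpha_bar = np.prod(alphas[: t + 1])
    noise = rng.standard_normal(x0.shape)               # the "right answer"
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * noise
    predicted = model(xt, t)                            # epsilon_theta(x_t, t)
    return np.mean((noise - predicted) ** 2)

# An untrained network is like a model that always guesses "no noise":
untrained = lambda xt, t: np.zeros_like(xt)

x0 = rng.standard_normal((8, 8))                        # stand-in "photo"
alphas = np.linspace(0.999, 0.99, 1000)
loss = training_loss(untrained, x0, 500, alphas, rng)
```

Because the target noise has unit variance, always guessing zero gives a loss near 1; training adjusts the network's parameters so this number falls toward zero.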

Part 2: Generation, Creating Something from Nothing

Now for the magic. The AI has completed training and knows how to identify noise at every corruption level. We want to create a brand new cat image that never existed before.

We begin with pure randomness: a 512 by 512 grid of completely random numbers. Visualized as pixels, this looks like television snow. No cat exists here, just pure chaos with no structure or meaning.

We feed this random noise to our trained AI and pose a hypothetical question: if this static were step 1000 of destroying a cat photo, what would step 999 look like? Based on training from thousands of real photos, the AI predicts what slightly cleaner image might have produced this pattern. The response still looks like noise, but the AI has made tiny adjustments based on learned knowledge of how cat images hide under noise.

We repeat this process continuously. At step 999, we ask about step 998. Then 997, 996, and so on. For the first few hundred steps, changes seem meaningless. But around step 900, vague shapes emerge. By step 700, something roundish forms. At step 500, a definite animal-shaped blob appears. By step 300, clear cat features emerge: head, body, ears, tail. At step 100, you have a recognizable cat with increasing detail. Finally, at step 0, after one thousand denoising iterations, you see a sharp, clear photograph with texture, depth, and delicate whiskers. This cat never existed before. The AI didn't copy any training photo but learned the essence of "cat-ness" and constructed something entirely new from randomness.

The mathematics for each denoising step:

x_{t−1} = (1 / √(α_t)) · ( x_t − ((1 − α_t) / √(1 − ᾱ_t)) · ε_θ(x_t, t) ) + σ_t · z

Here ᾱ_t is the product of all the α values up to step t, ε_θ(x_t, t) is the AI's noise prediction, and z is a small amount of fresh noise (dropped at the final step).

This means: take the noisy image, subtract the AI's predicted noise, and adjust the scale. Perform this one thousand times, and pure noise transforms into coherent imagery [1].
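The full generation loop can be sketched with a placeholder function standing in for the trained network; the short 50-step schedule and 4×4 output are illustrative assumptions to keep the example small:

```python
import numpy as np

def generate(model, shape, alphas, rng):
    """Reverse process: start from pure static and apply the denoising
    update once per step, from t = T-1 down to t = 0."""
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # step T: television snow
    for t in range(len(alphas) - 1, -1, -1):
        eps = model(x, t)                               # predicted noise
        coef = (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])       # subtract noise, rescale
        if t > 0:                                       # re-inject a little noise,
            x += np.sqrt(1 - alphas[t]) * rng.standard_normal(shape)
    return x                                            # step 0: the final image

rng = np.random.default_rng(7)
alphas = np.linspace(0.999, 0.99, 50)                   # short illustrative schedule
placeholder = lambda x, t: np.zeros_like(x)             # a real model is a trained network
image = generate(placeholder, (4, 4), alphas, rng)
```

With a trained ε_θ plugged in, each pass through the loop is exactly the "subtract predicted noise, rescale, add a touch of fresh noise" step described above.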

Guiding the AI with Text: From Words to Images

If we start with pure random noise, how does the AI know to create a cat instead of a dog or tree? In basic form, it doesn't. But modern diffusion models add text guidance. During training, each photo comes paired with descriptions like "orange tabby cat sleeping" or "fluffy Persian cat playing." The AI learns connections between words and visual patterns: "sunglasses" typically means dark circular shapes near faces, "beach" means sand-colored textures.

When you type "a steampunk cat wearing goggles," a text encoder converts your words into mathematical representations the AI understands. Generation begins with pure random noise, but at each denoising step, the AI checks whether the emerging image matches your description. At step 1000, pure noise doesn't match. By step 800, shapes form, and the AI guides denoising toward cat patterns. At step 500, a cat outline emerges, and the AI starts revealing goggle shapes near the face. At step 300, the cat's face is clear, and "steampunk" triggers brass, copper, and Victorian aesthetics. By step 100, you have a clear steampunk cat with goggles. Final refinements add details like lens reflections and gear patterns.

This text guidance operates through cross-attention, simultaneously asking at each step: "What noise should I remove?" and "Am I moving toward the text description?" This dual process ensures the final image matches your specific request.
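One common way this text influence is strengthened in practice is classifier-free guidance: at each step the model predicts the noise twice, once with the text and once without, and the two guesses are blended so the text's contribution is exaggerated. A toy sketch (the two-pixel arrays and the scale of 7.5, a typical default, are illustrative assumptions):

```python
import numpy as np

def guided_noise(eps_text, eps_uncond, scale):
    """Classifier-free guidance: amplify the difference the text prompt
    makes, pushing the prediction toward the described content."""
    return eps_uncond + scale * (eps_text - eps_uncond)

# Toy 2-pixel example: without text the model predicts no particular noise;
# with the prompt, removing its predicted noise would reveal "cat" structure.
eps_uncond = np.array([0.0, 0.0])
eps_text = np.array([0.2, -0.2])
blended = guided_noise(eps_text, eps_uncond, 7.5)   # -> [ 1.5, -1.5]
```

At scale 1 the blend is just the text-conditioned prediction; larger scales follow the prompt more aggressively at some cost in image diversity.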

Why One Thousand Steps?

You might ask why we need one thousand steps instead of one giant leap. The answer lies in learning complexity. Imagine describing someone's face in one sentence versus building up through simple questions: "Dark or light hair?" "Round or angular face?" Each small question is manageable; together they capture everything.

Similarly, teaching an AI to transform noise into an image in one step requires learning an incredibly complicated function. But teaching it to make one small cleaning step is much simpler. The AI only needs to know how to make a slightly noisy image slightly cleaner, not how to create from nothing. Like drawing a portrait, you start with basic shapes, then add features, then details, then refinement. Each decision is simple; together they're remarkable [1].

Applications and Innovations

Diffusion models power applications such as DALL-E 3, Stable Diffusion [2], and Google Gemini's image generation. The same technique helps enhance medical scans, lets architects visualize buildings, enables fashion designers to preview clothing, and helps game developers create textures. Recent innovations like DDIM reduce the required steps from 1000 to around 20 without sacrificing quality, transforming minutes into seconds.
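The step-skipping idea behind DDIM can be illustrated with its deterministic update: estimate the clean image from the current noise prediction, then re-noise that estimate directly to a much earlier timestep. The 20-step subsequence and placeholder network below are illustrative assumptions:

```python
import numpy as np

def ddim_step(x, eps, abar_t, abar_prev):
    """One deterministic DDIM update: estimate the clean image x0 from the
    noise prediction, then re-noise it to the (much earlier) previous step."""
    x0_est = (x - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_est + np.sqrt(1 - abar_prev) * eps

alphas = np.linspace(0.999, 0.99, 1000)       # illustrative 1000-step schedule
alpha_bars = np.cumprod(alphas)
timesteps = np.linspace(999, 0, 20).astype(int)   # visit only 20 of 1000 steps

rng = np.random.default_rng(3)
placeholder = lambda x, t: np.zeros_like(x)       # stands in for a trained network
x = rng.standard_normal((4, 4))                   # start from pure static
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, placeholder(x, t), alpha_bars[t], alpha_bars[t_prev])
```

Because each update lands exactly on the chosen earlier timestep rather than crawling one step back, the loop runs 19 times instead of a thousand.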

Before diffusion models, GANs dominated image generation through competition between two networks. GANs generate instantly but train unstably. Diffusion models learn one task reliably: removing noise. They're slower but more stable and produce higher quality, more diverse outputs. As techniques improve, the speed gap narrows.

Understanding the Big Picture

Diffusion models learn a single skill: cleaning up noise. During training, they observe millions of images being progressively destroyed, learning what each noise level looks like. During generation, they start with chaos and patiently remove noise one thousand times, guided by learned patterns and text descriptions. The next time you see AI-generated images from ChatGPT, Stable Diffusion, or Gemini, you'll understand the remarkable journey. That image began as pure static. Through one thousand patient denoising steps, guided by patterns from millions of photographs, chaos transformed into meaning. That is the fundamental elegance of diffusion models.

References

[1] J. Ho, A. Jain, and P. Abbeel, "Denoising Diffusion Probabilistic Models," in Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020. https://arxiv.org/abs/2006.11239

[2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-Resolution Image Synthesis With Latent Diffusion Models," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 10684–10695, doi: https://doi.org/10.1109/CVPR52688.2022.01042

Some parts of this article were refined using AI to correct grammar and improve clarity.
