---
license: mit
library_name: pytorch
tags:
- diffusion
- ddpm
- pixel-art
- image-generation
- conditional-generation
- pytorch
metrics:
- mse
pipeline_tag: image-to-image
---

# Pixel Diffusion Model

[Python](https://www.python.org/downloads/)
[PyTorch](https://pytorch.org/)
[License: MIT](https://opensource.org/licenses/MIT)
[Gradio](https://gradio.app/)
[Kaggle notebook](https://www.kaggle.com/code/jalpan04/pixel-diffusion-model)

A conditional Denoising Diffusion Probabilistic Model (DDPM) for generating 16x16 pixel art sprites with class-based control and real-time visualization.

---

## Overview

This project has two phases: a **training phase** (in `Training.ipynb`) and an **inference/application phase** (in `app.py`). The model trained in the first phase is loaded by the second to power an interactive application for generating pixel art sprites.

---

## How It Works: A Detailed Breakdown

The core of this project is a conditional Denoising Diffusion Probabilistic Model (DDPM). The process breaks down into four parts: data handling and noise scheduling, model architecture, training, and inference.

### 1. Data and Scheduling

* **Data Handling:** The model is trained on 16x16 pixel art sprites, each paired with a class label. The custom `PixelArtDataset` class in the training notebook loads this data.
* **Noise Schedule:** A `DiffusionSchedule` class implements a **cosine noise schedule**, which defines how much noise is added to an image at each of `T=1000` timesteps. The model learns to reverse this process: starting from pure noise, it gradually denoises the sample back to a clean image.
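
The schedule and the forward (noising) step can be sketched as follows. This is a minimal NumPy sketch, not the project's `DiffusionSchedule` class: it assumes the standard cosine schedule from Improved DDPM (Nichol & Dhariwal, 2021) with offset `s = 0.008` and betas clipped at `0.999`, and images scaled to `[-1, 1]`. The names `cosine_schedule` and `q_sample` are hypothetical.

```python
import numpy as np

def cosine_schedule(T: int = 1000, s: float = 0.008):
    """Cosine noise schedule; returns (betas, alphas, alpha_bar), each of length T."""
    steps = np.arange(T + 1, dtype=np.float64)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar_cont = f / f[0]  # continuous alpha_bar, equal to 1 at step 0
    # Per-step noise variance, clipped (assumed 0.999) to avoid a singular last step
    betas = np.clip(1.0 - alpha_bar_cont[1:] / alpha_bar_cont[:-1], 0.0, 0.999)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)  # alpha_bar[t] for t = 0 .. T-1
    return betas, alphas, alpha_bar

def q_sample(x0: np.ndarray, t: int, alpha_bar: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

# Usage: noise a fake 16x16 RGB sprite (values in [-1, 1]) to timestep 500
betas, alphas, alpha_bar = cosine_schedule(T=1000)
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 16, 16))
x_t = q_sample(x0, t=500, alpha_bar=alpha_bar, noise=rng.standard_normal(x0.shape))
```

`alpha_bar[t]` falls smoothly from almost 1 at `t = 0` to almost 0 at `t = T-1`, so the image at the final timestep is essentially pure Gaussian noise.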

### 2. The Model: `ContextUNet`

The model's "brain" is the `ContextUNet`, a U-Net designed to be conditioned on external information.

* **U-Net Structure:** It is a standard U-Net with a downsampling path, a bottleneck, and an upsampling path. Skip connections link each downsampling level to the matching upsampling level, which helps the model preserve fine detail (crucial for pixel art).
* **Context Injection:** At every step the model receives three inputs:
    1. **The Noisy Image (`x_t`)**
    2. **The Timestep (`t`)**
    3. **The Class Condition (`c`)**: the control mechanism (e.g., "Characters" or "Monsters").
* **Embedding Combination:** The time and class embeddings are summed (`emb = t_emb + c_emb`) and injected into every `ResidualBlock`, so each block is conditioned on both the target category and the current noise level.
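
The embedding combination can be illustrated with a small NumPy sketch. It assumes a transformer-style sinusoidal timestep embedding, a class-embedding table with one extra "null" row for unconditional passes, and a single linear projection to a per-channel bias; `sinusoidal_embedding`, `class_table`, `proj`, `inject` and the sizes (including `NUM_CLASSES = 4`) are hypothetical, and the weights are random stand-ins for learned parameters. A real `ResidualBlock` would typically apply this shift inside the block, between its convolutions.

```python
import numpy as np

EMB_DIM, NUM_CLASSES, CHANNELS = 32, 4, 64  # hypothetical sizes
rng = np.random.default_rng(0)

# Hypothetical stand-ins for learned parameters (random here, trained in practice)
class_table = rng.standard_normal((NUM_CLASSES + 1, EMB_DIM)) * 0.02  # last row = "null" class
proj = rng.standard_normal((EMB_DIM, CHANNELS)) * 0.02                # emb -> per-channel bias

def sinusoidal_embedding(t: int, dim: int = EMB_DIM) -> np.ndarray:
    """Transformer-style sinusoidal embedding of an integer timestep."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def inject(h: np.ndarray, t: int, c: int) -> np.ndarray:
    """Add the combined time + class embedding to a (C, H, W) feature map."""
    emb = sinusoidal_embedding(t) + class_table[c]  # emb = t_emb + c_emb
    bias = emb @ proj                               # one value per channel
    return h + bias[:, None, None]                  # same shift at every pixel
```

Because the sum is projected to one value per channel and broadcast over height and width, every spatial location in the block sees the same timestep and class signal.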

### 3. Training: Learning to Denoise

The training loop teaches the model to predict the noise that was added to a clean image:

1. Load a clean image `x` and its label `c`.
2. Sample a random timestep `t`.
3. Add noise to `x` according to the cosine schedule to get the noisy image `x_t`.
4. Feed `x_t`, `t`, and `c` into the `ContextUNet`.
5. Minimize the Mean Squared Error (`MSE`) between the predicted and the actual noise.
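
The five training steps map onto a short loss function. This NumPy sketch is illustrative only: `model` is any callable returning a noise prediction, and randomly replacing labels with a hypothetical `NULL_CLASS` token (probability `P_UNCOND = 0.1`) is an assumed way of training the unconditional branch that CFG needs at inference. In the actual PyTorch training loop, the returned loss would be backpropagated and passed to an optimizer.

```python
import numpy as np

T, NUM_CLASSES, P_UNCOND = 1000, 4, 0.1  # hypothetical settings
NULL_CLASS = NUM_CLASSES                  # hypothetical label index meaning "no class"

def cosine_alpha_bar(T: int, s: float = 0.008) -> np.ndarray:
    """Cumulative alpha products of the cosine schedule, one per timestep."""
    steps = np.arange(T + 1, dtype=np.float64)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    betas = np.clip(1.0 - f[1:] / f[:-1], 0.0, 0.999)
    return np.cumprod(1.0 - betas)

def training_loss(model, x0, c, alpha_bar, rng):
    """DDPM noise-prediction loss for clean images x0 (B, C, H, W) with labels c (step 1)."""
    B = x0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=B)           # 2. random timestep per image
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t][:, None, None, None]                  # broadcast over C, H, W
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise       # 3. forward noising
    c = np.where(rng.random(B) < P_UNCOND, NULL_CLASS, c)  # assumed label dropout for CFG
    eps_pred = model(x_t, t, c)                            # 4. predict the added noise
    return float(np.mean((eps_pred - noise) ** 2))         # 5. MSE vs. the actual noise

def zero_model(x_t, t, c):
    """Placeholder for the ContextUNet that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 3, 16, 16))  # a batch of fake clean sprites
c = rng.integers(0, NUM_CLASSES, size=8)          # and their labels
loss = training_loss(zero_model, x0, c, cosine_alpha_bar(T), rng)
```

A model that always predicts zero scores a loss of about 1.0 (the variance of the Gaussian noise), which makes a convenient sanity baseline: a trained network should land well below it.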

### 4. Inference: Guided Generation

Generation uses **Classifier-Free Guidance (CFG)** for explicit control over the class:

1. **Start:** Sample pure Gaussian noise.
2. **Denoising Loop:** Iterate backward from `t = T-1` to `0`.
3. **CFG Step:** Run the model twice per step, once conditioned on the class and once unconditionally.
4. **Guidance:** Combine the two noise predictions: `eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)`.
5. **Step:** Use the guided noise estimate to compute a slightly less noisy image for the next timestep.
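
The guided sampling loop can be sketched as a Python generator, which also shows how intermediate frames can be yielded for a live-updating display. This NumPy sketch assumes standard DDPM ancestral sampling with variance `sigma_t^2 = beta_t`, a `null_c` label for the unconditional pass, and a default `guidance_scale` of 3.0; `sample_cfg`, `null_c` and `yield_every` are hypothetical names, not the API of `app.py`.

```python
import numpy as np

def cosine_schedule(T: int = 1000, s: float = 0.008):
    """Cosine schedule: returns (betas, alphas, alpha_bar), each of length T."""
    steps = np.arange(T + 1, dtype=np.float64)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    betas = np.clip(1.0 - f[1:] / f[:-1], 0.0, 0.999)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def sample_cfg(model, c, null_c, shape, guidance_scale=3.0, T=1000, yield_every=50, seed=0):
    """DDPM sampling with classifier-free guidance; yields (t, x) every `yield_every` steps."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bar = cosine_schedule(T)
    x = rng.standard_normal(shape)                                   # 1. pure noise
    for t in range(T - 1, -1, -1):                                   # 2. t = T-1 ... 0
        eps_cond = model(x, t, c)                                    # 3. conditional pass
        eps_uncond = model(x, t, null_c)                             #    unconditional pass
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)  # 4. guidance
        # 5. mean of x_{t-1} given the guided noise estimate, plus fresh noise (none at t = 0)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        z = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * z
        if t % yield_every == 0:                                     # intermediate frame for the UI
            yield t, x
```

With `guidance_scale = 1` the guided estimate reduces to the purely conditional prediction, and with `0` to the unconditional one; larger values push samples more strongly toward the class. The final frame should be clipped to `[-1, 1]` before being converted to pixels.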

---

## Key Improvements

1. **Cosine Noise Schedule:** Improves sample quality and training stability compared to a linear schedule.
2. **Classifier-Free Guidance (CFG):** Lets users control, via the guidance scale, how strictly the model follows the class prompt.
3. **Exponential Moving Average (EMA):** Keeps a "shadow" copy of the weights, averaged over training steps, which produces more stable, higher-quality final images.
4. **Nearest-Neighbor Interpolation:** Preserves the sharp, blocky look of pixel art when the output is resized.
5. **Attention Blocks:** Capture long-range spatial relationships in the deeper U-Net layers.
6. **Live-Updating Generator:** Yields intermediate denoising steps, producing a real-time "fade-in" effect in the UI.
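
Two of these improvements, EMA and nearest-neighbor upscaling, reduce to a few lines each. In this NumPy sketch, parameters are a plain dict of arrays standing in for a model's weights, the decay of `0.999` is an assumed value, and the 16x upscaling factor matches the 16x16 to 256x256 resize listed under Technical Details. `ema_update`, `to_uint8` and `upscale_nearest` are hypothetical helpers.

```python
import numpy as np

def ema_update(shadow: dict, params: dict, decay: float = 0.999) -> None:
    """In-place EMA update: shadow <- decay * shadow + (1 - decay) * params."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value

def to_uint8(x: np.ndarray) -> np.ndarray:
    """Map model output in [-1, 1] to 8-bit pixel values (out-of-range values are clipped)."""
    return ((np.clip(x, -1.0, 1.0) + 1.0) * 127.5).round().astype(np.uint8)

def upscale_nearest(img: np.ndarray, factor: int = 16) -> np.ndarray:
    """Nearest-neighbor upscaling of an (H, W, C) image: each pixel becomes a factor x factor block."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

# Usage: a (3, 16, 16) sample in [-1, 1] becomes a 256x256 RGB image
sample = np.random.default_rng(0).uniform(-1.0, 1.0, size=(3, 16, 16))
display = upscale_nearest(to_uint8(sample).transpose(1, 2, 0))  # CHW -> HWC, then 16x
```

In PyTorch the same resize is available as `torch.nn.functional.interpolate(x, scale_factor=16, mode="nearest")` on an `(N, C, H, W)` tensor; bilinear or bicubic modes would blur the sprite's hard edges. The EMA shadow weights are updated after each optimizer step and used in place of the raw weights when sampling.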

---

## Technical Details

- **Architecture:** Conditional U-Net with attention blocks
- **Timesteps:** 1000 diffusion steps
- **Resolution:** 16x16 pixels (upscaled to 256x256)
- **Guidance:** Classifier-Free Guidance (CFG)
- **Noise Schedule:** Cosine schedule

---

## License

This project is licensed under the MIT License.

---

## Acknowledgments

Inspired by modern diffusion research, including DDPM and Classifier-Free Guidance (CFG).