Pixel Diffusion Model

Python 3.8+ PyTorch License: MIT Gradio Kaggle

A conditional Denoising Diffusion Probabilistic Model (DDPM) for generating 16x16 pixel art sprites with class-based control and real-time visualization.


Overview

This project operates in two phases: a training phase (detailed in Training.ipynb) and an inference/application phase (detailed in app.py). The model from the first phase is loaded into the second to create an interactive application for generating pixel art sprites.


How It Works: A Detailed Breakdown

The core of this project is a conditional Denoising Diffusion Probabilistic Model (DDPM). The process can be broken down into data handling, model architecture, training, and inference.

1. Data and Scheduling

  • Data Handling: The model is trained on 16x16 pixel art sprites. The PixelArtDataset class in the training notebook is custom-built for this data.
  • Noise Schedule: A DiffusionSchedule class implements a cosine noise schedule. This defines how noise is added to an image over T=1000 timesteps. The model's job is to learn how to reverse this process, starting from pure noise and gradually denoising it back to a clean image.
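
Below is a minimal sketch of what such a cosine schedule can look like; the actual DiffusionSchedule class in Training.ipynb may use a different parameterization (the offset s and the clamping here follow the common improved-DDPM formulation and are assumptions):

```python
import math
import torch

# Sketch of a cosine noise schedule; the notebook's DiffusionSchedule may differ.
class DiffusionSchedule:
    def __init__(self, timesteps: int = 1000, s: float = 0.008):
        steps = torch.linspace(0, timesteps, timesteps + 1)
        # Cumulative signal level alpha_bar: ~1 at t=0, decaying smoothly toward 0.
        f = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
        self.alphas_cumprod = (f / f[0])[1:].clamp(1e-5, 0.9999)

    def add_noise(self, x0, t):
        """Forward process q(x_t | x_0): noise a batch of clean images x0 at timesteps t."""
        a_bar = self.alphas_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
        return x_t, noise
```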

2. The Model: ContextUNet

The model's "brain" is the ContextUNet. This architecture is specifically designed to handle and be controlled by external information.

  • U-Net Structure: It is a standard U-Net with a downsampling path, a bottleneck, and an upsampling path. Skip-connections link the downsampling layers to the upsampling layers, which helps the model preserve fine details (crucial for pixel art).
  • Context Injection: The model is given three pieces of information at every step:
    1. The Noisy Image (x_t)
    2. The Timestep (t)
    3. The Class Condition (c): The control mechanism that selects which sprite category to generate (e.g., "Characters" or "Monsters").
  • Embedding Combination: The time and class embeddings are combined (emb = t_emb + c_emb) and injected into every ResidualBlock. This ensures the model is constantly reminded of the target category and current noise level.
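
The sketch below illustrates how this kind of injection can look inside a residual block; the layer names and sizes are illustrative rather than the notebook's exact ContextUNet code:

```python
import torch
import torch.nn as nn

# Illustrative residual block with time + class conditioning; details may
# differ from the actual ContextUNet implementation.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int, emb_dim: int):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)        # assumes channels divisible by 8
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.emb_proj = nn.Linear(emb_dim, channels)  # project embedding to a per-channel bias

    def forward(self, x, t_emb, c_emb):
        emb = t_emb + c_emb                               # combine time and class context
        h = self.conv1(torch.relu(self.norm1(x)))
        h = h + self.emb_proj(emb)[:, :, None, None]      # inject context into the feature map
        h = self.conv2(torch.relu(self.norm2(h)))
        return x + h                                      # residual connection
```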

3. Training: Learning to Denoise

The training loop teaches the model to predict the noise that was added to a clean image.

  1. Load clean image x and label c.
  2. Choose random timestep t.
  3. Add noise according to the cosine schedule.
  4. Feed noisy image, t, and c into the ContextUNet.
  5. Optimize using Mean Squared Error (MSE) between predicted and actual noise.
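
A single training step, sketched under the assumption of helper names like schedule.add_noise and a model signature model(x_t, t, c); the notebook's code will differ in detail. Note that classifier-free guidance (used at inference, below) typically requires randomly dropping the class label for a fraction of training examples so the model also learns an unconditional prediction.

```python
import torch
import torch.nn.functional as F

def train_step(model, schedule, optimizer, x0, labels, timesteps=1000, p_uncond=0.1):
    # Step 2: pick a random timestep for every image in the batch.
    t = torch.randint(0, timesteps, (x0.size(0),))
    # Step 3: add noise according to the cosine schedule.
    x_t, noise = schedule.add_noise(x0, t)
    # For CFG, drop the class label with probability p_uncond (assumed convention:
    # a None label means "unconditional").
    if torch.rand(1).item() < p_uncond:
        labels = None
    # Step 4: the ContextUNet predicts the noise from (x_t, t, c).
    pred_noise = model(x_t, t, labels)
    # Step 5: MSE between predicted and actual noise.
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```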

4. Inference: Guided Generation

Inference uses Classifier-Free Guidance (CFG) for explicit control over how strongly the class condition is followed:

  1. Start: Pure random noise.
  2. Denoising Loop: Iterate backward from T-1 to 0.
  3. CFG Step: At each timestep the model runs twice, once with the class condition and once unconditionally.
  4. Guidance: The two noise predictions are combined as eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond).
  5. Step: The guided noise estimate is used to take one reverse-diffusion step, producing a slightly cleaner image.
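
A condensed sketch of this loop under the same assumptions (label is a LongTensor of class indices, a None label signals the unconditional pass, and the update is a standard DDPM step derived from alphas_cumprod):

```python
import torch

@torch.no_grad()
def sample(model, schedule, label, guidance_scale=3.0, timesteps=1000, shape=(1, 3, 16, 16)):
    a_bar = schedule.alphas_cumprod
    x = torch.randn(shape)                                   # 1. start from pure noise
    for t in reversed(range(timesteps)):                     # 2. iterate T-1 .. 0
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_cond = model(x, t_batch, label)                  # 3. conditional pass
        eps_uncond = model(x, t_batch, None)                 #    unconditional pass
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)  # 4. guidance

        # 5. one reverse-diffusion (DDPM) step using the guided noise estimate
        alpha_t = a_bar[t] / (a_bar[t - 1] if t > 0 else 1.0)
        x = (x - (1 - alpha_t) / (1 - a_bar[t]).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            x = x + (1 - alpha_t).sqrt() * torch.randn_like(x)
    return x
```

Yielding the intermediate x every few steps, instead of only returning the final image, is what enables the live-updating preview described under Key Improvements below.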

Key Improvements

  1. Cosine Noise Schedule: Improves sample quality and training stability compared to linear schedules.
  2. Classifier-Free Guidance (CFG): Allows users to control how strictly the model follows the class prompt.
  3. Exponential Moving Average (EMA): Uses a "shadow" copy of the weights to produce more stable and higher-quality final images (see the sketch after this list).
  4. Nearest Neighbor Interpolation: Preserves the sharp, blocky nature of pixel art during resizing.
  5. Attention Blocks: Learn long-range spatial relationships in the deeper U-Net layers.
  6. Live-Updating Generator: Yields intermediate denoising steps for a real-time "fade-in" effect in the UI.
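
A minimal sketch of the EMA idea referenced above (the decay value and helper names are assumptions; the actual implementation may differ):

```python
import copy
import torch

class EMA:
    """Keeps a 'shadow' copy of the model weights, updated after every optimizer step."""
    def __init__(self, model, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # Exponentially blend the live weights into the shadow copy.
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)
```

At sampling time the shadow weights are used in place of the raw training weights, which smooths out the noise of individual gradient updates.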

Technical Details

  • Architecture: Conditional U-Net with attention blocks
  • Timesteps: 1000 diffusion steps
  • Resolution: 16x16 pixels (upscaled to 256x256; see the sketch after this list)
  • Guidance: Classifier-Free Guidance (CFG)
  • Noise Schedule: Cosine schedule
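
The 16x16 to 256x256 upscaling can be done with nearest-neighbor interpolation so each generated pixel remains a crisp block; a short sketch, assuming the generated sprite is an (N, 3, 16, 16) tensor:

```python
import torch.nn.functional as F

def upscale(sprite):
    # mode="nearest" copies pixels instead of blending them, preserving hard edges.
    return F.interpolate(sprite, size=(256, 256), mode="nearest")
```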

License

This project is licensed under the MIT License.


Acknowledgments

Inspiration drawn from modern diffusion research including DDPM and CFG techniques.
