---
license: mit
library_name: pytorch
tags:
- diffusion
- ddpm
- pixel-art
- image-generation
- conditional-generation
- pytorch
metrics:
- mse
pipeline_tag: image-to-image
---
# Pixel Diffusion Model
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Gradio](https://img.shields.io/badge/Gradio-UI-orange.svg)](https://gradio.app/)
[![Kaggle](https://img.shields.io/badge/Kaggle-Notebook-20BEFF.svg)](https://www.kaggle.com/code/jalpan04/pixel-diffusion-model)
A conditional Denoising Diffusion Probabilistic Model (DDPM) for generating 16x16 pixel art sprites with class-based control and real-time visualization.
---
## Overview
This project has two phases: a **training phase** (`Training.ipynb`) and an **inference/application phase** (`app.py`). The model trained in the first phase is loaded in the second to power an interactive application for generating pixel art sprites.
---
## How It Works: A Detailed Breakdown
The core of this project is a conditional Denoising Diffusion Probabilistic Model (DDPM). The process can be broken down into data handling, model architecture, training, and inference.
### 1. Data and Scheduling
* **Data Handling:** The model is trained on 16x16 pixel art sprites. The `PixelArtDataset` class in the training notebook is custom-built for this data.
* **Noise Schedule:** A `DiffusionSchedule` class implements a **cosine noise schedule**. This defines how noise is added to an image over `T=1000` timesteps. The model's job is to learn how to reverse this process, starting from pure noise and gradually denoising it back to a clean image.
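As an illustration of the noise schedule described above, here is a minimal sketch of a cosine schedule and the forward noising step. It is not the project's `DiffusionSchedule` class: the function names `cosine_betas` and `q_sample`, the offset `s = 0.008`, and the `0.999` clamp are assumptions taken from the commonly used cosine formulation, and images are assumed to be scaled to `[-1, 1]`.

```python
import math
import torch

def cosine_betas(T: int = 1000, s: float = 0.008, max_beta: float = 0.999) -> torch.Tensor:
    """Per-step noise variances beta_t for a cosine schedule (assumed formulation)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    # f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2); alpha_bar(t) = f(t) / f(0)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    # Clamp so the final steps never reach beta = 1 exactly.
    return betas.clamp(max=max_beta).float()

def q_sample(x0, t, noise, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)  # broadcast over (C, H, W)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

betas = cosine_betas(T=1000)
alpha_bar = torch.cumprod(1 - betas, dim=0)  # fraction of signal kept after each step

x0 = torch.rand(2, 3, 16, 16) * 2 - 1        # fake sprites scaled to [-1, 1]
t = torch.tensor([0, 999])                   # one nearly clean, one nearly pure noise
noise = torch.randn_like(x0)
x_t = q_sample(x0, t, noise, alpha_bar)
```

At `t = 0` the result is almost identical to the clean sprite, while at `t = 999` it is almost identical to the sampled noise.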
### 2. The Model: `ContextUNet`
The denoising network is the `ContextUNet`, a U-Net that is conditioned on external information: the current timestep and the class label.
* **U-Net Structure:** It is a standard U-Net with a downsampling path, a bottleneck, and an upsampling path. Skip-connections link the downsampling layers to the upsampling layers, which helps the model preserve fine details (crucial for pixel art).
* **Context Injection:** The model is given three pieces of information at every step:
    1. **The Noisy Image (`x_t`)**
    2. **The Timestep (`t`)**
    3. **The Class Condition (`c`)**: The control mechanism (e.g., "Characters" or "Monsters").
* **Embedding Combination:** The time and class embeddings are summed (`emb = t_emb + c_emb`) and injected into every `ResidualBlock`, so each block is conditioned on both the current noise level and the target category.
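The embedding combination above can be sketched as follows. This is an illustrative guess at the mechanism, not the actual `ContextUNet` code: the sinusoidal time embedding, the `ContextEmbedding` helper, the layer sizes, and the choice to add the projected embedding as a per-channel bias are all assumptions. Only `emb = t_emb + c_emb` and the `ResidualBlock` name come from the description.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps (B,) to (B, dim) sin/cos features; dim must be even."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32, device=t.device) / half
    )
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=1)

class ContextEmbedding(nn.Module):
    """Hypothetical helper that builds emb = t_emb + c_emb."""
    def __init__(self, num_classes: int, emb_dim: int):
        super().__init__()
        self.emb_dim = emb_dim
        self.time_mlp = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim)
        )
        self.class_emb = nn.Embedding(num_classes, emb_dim)

    def forward(self, t, c):
        t_emb = self.time_mlp(sinusoidal_embedding(t, self.emb_dim))
        c_emb = self.class_emb(c)
        return t_emb + c_emb

class ResidualBlock(nn.Module):
    """Conv block that receives the combined embedding as a per-channel bias."""
    def __init__(self, in_ch: int, out_ch: int, emb_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.emb_proj = nn.Linear(emb_dim, out_ch)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.SiLU()

    def forward(self, x, emb):
        h = self.act(self.conv1(x))
        h = h + self.emb_proj(emb)[:, :, None, None]  # inject time + class context
        h = self.act(self.conv2(h))
        return h + self.skip(x)                        # residual connection

embed = ContextEmbedding(num_classes=5, emb_dim=32)
block = ResidualBlock(in_ch=3, out_ch=16, emb_dim=32)
x_t = torch.randn(4, 3, 16, 16)
t = torch.randint(0, 1000, (4,))
c = torch.randint(0, 5, (4,))
out = block(x_t, embed(t, c))
```

Summing the two embeddings keeps the conditioning path to a single vector per sample, so every block can use the same projection logic regardless of how many conditioning signals exist.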
### 3. Training: Learning to Denoise
The training loop teaches the model to predict the noise that was added to a clean image.
1. Load clean image `x` and label `c`.
2. Choose random timestep `t`.
3. Add noise according to the cosine schedule.
4. Feed noisy image, `t`, and `c` into the `ContextUNet`.
5. Optimize using Mean Squared Error (`MSE`) between predicted and actual noise.
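The five steps above can be sketched as a single training step. `TinyDenoiser` is a hypothetical stand-in with the same `(x_t, t, c)` call signature described for `ContextUNet`, not the real architecture. The label dropout (`p_uncond`, `null_class`) is also an assumption: classifier-free guidance needs the model to see unconditional inputs during training, and replacing some labels with a reserved "no class" index is one common way to do that.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for ContextUNet with the same (x_t, t, c) signature."""
    def __init__(self, num_classes: int, T: int, ch: int = 16):
        super().__init__()
        self.t_emb = nn.Embedding(T, ch)
        self.c_emb = nn.Embedding(num_classes + 1, ch)  # extra index = "no class"
        self.inp = nn.Conv2d(3, ch, 3, padding=1)
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x_t, t, c):
        emb = self.t_emb(t) + self.c_emb(c)
        h = F.silu(self.inp(x_t) + emb[:, :, None, None])
        return self.out(h)

def cosine_alpha_bar(T: int, s: float = 0.008) -> torch.Tensor:
    """Cumulative signal level per timestep for a cosine schedule (assumed form)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    betas = (1 - f[1:] / f[:-1]).clamp(max=0.999)
    return torch.cumprod(1 - betas, dim=0).float()

def train_step(model, optimizer, x, c, alpha_bar, null_class, p_uncond=0.1):
    B = x.shape[0]
    # 2. Choose a random timestep per image.
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x.device)
    # 3. Add noise according to the schedule.
    noise = torch.randn_like(x)
    ab = alpha_bar[t].view(B, 1, 1, 1)
    x_t = ab.sqrt() * x + (1 - ab).sqrt() * noise
    # Assumption: randomly drop labels so the model also learns the unconditional case.
    drop = torch.rand(B, device=x.device) < p_uncond
    c = torch.where(drop, torch.full_like(c, null_class), c)
    # 4-5. Predict the noise and minimize MSE against the true noise.
    loss = F.mse_loss(model(x_t, t, c), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

T, num_classes = 1000, 5
alpha_bar = cosine_alpha_bar(T)
model = TinyDenoiser(num_classes=num_classes, T=T)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 1. A fake batch of clean sprites in [-1, 1] with class labels.
x = torch.rand(8, 3, 16, 16) * 2 - 1
c = torch.randint(0, num_classes, (8,))
loss = train_step(model, optimizer, x, c, alpha_bar, null_class=num_classes)
```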
### 4. Inference: Guided Generation
Sampling uses **Classifier-Free Guidance (CFG)** for explicit class control:
1. **Start:** Pure random noise.
2. **Denoising Loop:** Iterate backward from `T-1` to `0`.
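3. **CFG Step:** At each timestep the model runs twice: once with the class label (conditional) and once without it (unconditional).
4. **Guidance:** `eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)`. A `guidance_scale` of 1 recovers the conditional prediction; larger values push the sample harder toward the class.
5. **Step:** Use the guided noise estimate to take one denoising step from `x_t` to `x_{t-1}`.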
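A minimal sketch of this loop, under several assumptions: `TinyDenoiser` is a hypothetical stand-in for `ContextUNet`, the unconditional pass uses a reserved `null_class` index, the update is the standard DDPM ancestral step with variance `beta_t`, and the demo uses a small `T` only to stay fast (the project uses `T=1000`). The sampler in `app.py` may differ in its details.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for ContextUNet with the same (x_t, t, c) signature."""
    def __init__(self, num_classes: int, T: int, ch: int = 16):
        super().__init__()
        self.t_emb = nn.Embedding(T, ch)
        self.c_emb = nn.Embedding(num_classes + 1, ch)  # extra index = "no class"
        self.inp = nn.Conv2d(3, ch, 3, padding=1)
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x_t, t, c):
        emb = self.t_emb(t) + self.c_emb(c)
        return self.out(F.silu(self.inp(x_t) + emb[:, :, None, None]))

def cosine_betas(T: int, s: float = 0.008) -> torch.Tensor:
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    return (1 - f[1:] / f[:-1]).clamp(max=0.999).float()

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Guidance formula from the step list above."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

@torch.no_grad()
def sample_cfg(model, betas, c, null_class, guidance_scale=3.0, shape=(3, 16, 16)):
    alphas = 1 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    B = c.shape[0]
    x = torch.randn(B, *shape, device=c.device)          # 1. start from pure noise
    c_null = torch.full_like(c, null_class)
    for i in reversed(range(betas.shape[0])):            # 2. iterate T-1 ... 0
        t = torch.full((B,), i, dtype=torch.long, device=c.device)
        eps_cond = model(x, t, c)                        # 3. conditional pass
        eps_uncond = model(x, t, c_null)                 #    unconditional pass
        eps = cfg_combine(eps_uncond, eps_cond, guidance_scale)  # 4. guidance
        # 5. DDPM posterior mean, plus fresh noise on every step except the last.
        mean = (x - betas[i] / (1 - alpha_bar[i]).sqrt() * eps) / alphas[i].sqrt()
        z = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + betas[i].sqrt() * z
    return x.clamp(-1, 1)

T, num_classes = 50, 5                                   # small T for a quick demo
model = TinyDenoiser(num_classes=num_classes, T=T).eval()
labels = torch.tensor([0, 1, 2, 3])
sprites = sample_cfg(model, cosine_betas(T), labels, null_class=num_classes)
```

With an untrained stand-in the output is just noise; the point is the control flow. Note that each step costs two forward passes, which roughly doubles sampling time compared with unguided sampling.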
---
## Key Improvements
1. **Cosine Noise Schedule:** Adds noise more gradually than a linear schedule, which is intended to improve sample quality and training stability.
2. **Classifier-Free Guidance (CFG):** Allows users to control how strictly the model follows the class prompt.
3. **Exponential Moving Average (EMA):** Maintains a "shadow" copy of the weights, averaged over training steps, which is used at inference to produce more stable and higher-quality final images.
4. **Nearest Neighbor Interpolation:** Preserves the sharp, blocky nature of pixel art during resizing.
5. **Attention Blocks:** Learns long-range spatial relationships in deeper U-Net layers.
6. **Live-Updating Generator:** Yields intermediate denoising steps for a real-time "fade-in" effect in the UI.
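Two of the items above, EMA (item 3) and nearest-neighbor upscaling (item 4), are easy to illustrate in isolation. The sketch below is an assumption about how they could look, not the project's code: the `EMA` class, its `decay` value, and the `upscale_nearest` helper are hypothetical names.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMA:
    """Keeps a shadow copy of a model whose weights track a moving average."""
    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: nn.Module) -> None:
        # shadow = decay * shadow + (1 - decay) * current
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)
        # Buffers (e.g. normalization statistics) are copied as-is.
        for s, b in zip(self.shadow.buffers(), model.buffers()):
            s.copy_(b)

def upscale_nearest(img: torch.Tensor, size: int = 256) -> torch.Tensor:
    """Resize (B, C, H, W) sprites without blurring: each pixel becomes a solid block."""
    return F.interpolate(img, size=(size, size), mode="nearest")

# EMA usage: call update() after every optimizer step, sample from ema.shadow.
model = nn.Conv2d(3, 3, 3, padding=1)
ema = EMA(model, decay=0.9)
old_shadow = ema.shadow.weight.clone()
with torch.no_grad():
    model.weight.add_(1.0)          # pretend an optimizer step changed the weights
ema.update(model)

# Upscaling usage: 16x16 sprite to 256x256, so every pixel becomes a 16x16 block.
sprite = torch.rand(1, 3, 16, 16)
big = upscale_nearest(sprite, size=256)
```

Bilinear or bicubic resizing would blend neighboring colors and soften the edges, which is why nearest-neighbor is the usual choice for pixel art.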
---
## Technical Details
- **Architecture:** Conditional U-Net with attention blocks
- **Timesteps:** 1000 diffusion steps
- **Resolution:** 16x16 pixels (upscaled to 256x256)
- **Guidance:** Classifier-Free Guidance (CFG)
- **Noise Schedule:** Cosine schedule
---
## License
This project is licensed under the MIT License.
---
## Acknowledgments
This project draws on modern diffusion research, including DDPM and classifier-free guidance (CFG).