	---
	license: mit
	library_name: pytorch
	tags:
	- diffusion
	- ddpm
	- pixel-art
	- image-generation
	- conditional-generation
	- pytorch
	metrics:
	- mse
	pipeline_tag: image-to-image
	---

	# Pixel Diffusion Model

	[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
	[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Gradio](https://img.shields.io/badge/Gradio-UI-orange.svg)](https://gradio.app/)
	[![Kaggle](https://img.shields.io/badge/Kaggle-Notebook-20BEFF.svg)](https://www.kaggle.com/code/jalpan04/pixel-diffusion-model)

	A conditional Denoising Diffusion Probabilistic Model (DDPM) for generating 16x16 pixel art sprites with class-based control and real-time visualization.

	---

	## Overview

	The project has two phases: training (`Training.ipynb`) and inference (`app.py`). The model trained in the first phase is loaded by the second, which serves an interactive application for generating pixel art sprites.

	---

	## How It Works: A Detailed Breakdown

	The core of this project is a conditional Denoising Diffusion Probabilistic Model (DDPM). The process can be broken down into data handling, model architecture, training, and inference.

	### 1. Data and Scheduling

	* Data Handling: The model is trained on 16x16 pixel art sprites. The `PixelArtDataset` class in the training notebook is custom-built for this data.
	* Noise Schedule: A `DiffusionSchedule` class implements a cosine noise schedule. This defines how noise is added to an image over `T=1000` timesteps. The model's job is to learn how to reverse this process, starting from pure noise and gradually denoising it back to a clean image.
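
A cosine schedule of the kind described above can be sketched in a few lines. This is a minimal illustration, not the notebook's `DiffusionSchedule` class: the function names, the offset `s=0.008`, and the `0.999` clip are assumptions taken from the commonly used cosine formulation, and plain Python lists stand in for PyTorch tensors.

```python
import math

def cosine_alpha_bar(T=1000, s=0.008):
    """Cumulative signal fraction alpha_bar[0..T] for a cosine schedule.

    alpha_bar[t] is the share of the original image variance that
    survives after t noising steps (1.0 at t=0, close to 0 at t=T).
    The offset s is an assumed default, not a value from the notebook.
    """
    def f(t):
        return math.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step noise variances: beta_t = 1 - alpha_bar[t] / alpha_bar[t-1].

    Values are clipped at max_beta (assumed) to avoid a degenerate last step.
    """
    return [
        min(1.0 - alpha_bar[t] / alpha_bar[t - 1], max_beta)
        for t in range(1, len(alpha_bar))
    ]

alpha_bar = cosine_alpha_bar()           # T + 1 values
betas = betas_from_alpha_bar(alpha_bar)  # T values, one per step
```

Early entries of `betas` are small, so the first steps add little noise; the values grow toward the end of the schedule, where the image is almost pure noise.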

	### 2. The Model: `ContextUNet`

	The denoising network is `ContextUNet`, a U-Net that takes the timestep and a class label as conditioning inputs alongside the noisy image.

	* U-Net Structure: It is a standard U-Net with a downsampling path, a bottleneck, and an upsampling path. Skip-connections link the downsampling layers to the upsampling layers, which helps the model preserve fine details (crucial for pixel art).
	* Context Injection: At every denoising step the model receives three inputs:
	    1. The Noisy Image (`x_t`)
	    2. The Timestep (`t`), which tells the model the current noise level
	    3. The Class Condition (`c`), the control signal (e.g., "Characters" or "Monsters")
	* Embedding Combination: The time and class embeddings are summed (`emb = t_emb + c_emb`) and injected into every `ResidualBlock`, so each block is conditioned on both the target category and the current noise level.
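
The sum `emb = t_emb + c_emb` can be illustrated as follows. Only the addition itself comes from the description above; the sinusoidal timestep embedding, the lookup-table class embedding, the names `timestep_embedding`, `ClassEmbedding`, and `combine`, and the sizes (`dim=8`, five classes) are assumptions made for this sketch. Plain Python lists stand in for PyTorch tensors.

```python
import math
import random

def timestep_embedding(t, dim):
    """Sinusoidal embedding of an integer timestep (an assumed choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

class ClassEmbedding:
    """Lookup table with one vector per class.

    In a trained model these vectors are learned; here they are random.
    """
    def __init__(self, num_classes, dim, seed=0):
        rng = random.Random(seed)
        self.table = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_classes)
        ]

    def __call__(self, c):
        return self.table[c]

def combine(t_emb, c_emb):
    """emb = t_emb + c_emb, elementwise."""
    return [a + b for a, b in zip(t_emb, c_emb)]

dim = 8                                   # hypothetical embedding width
class_emb = ClassEmbedding(num_classes=5, dim=dim)  # hypothetical class count
t_emb = timestep_embedding(t=500, dim=dim)
c_emb = class_emb(2)
emb = combine(t_emb, c_emb)               # vector handed to each residual block
```

Because the two embeddings are added rather than concatenated, they must share the same width, and the residual blocks see a single conditioning vector.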

	### 3. Training: Learning to Denoise

	The training loop teaches the model to predict the noise that was added to a clean image:

	1. Load a clean image `x` and its label `c`.
	2. Choose a random timestep `t`.
	3. Add noise to `x` according to the cosine schedule, producing the noisy image `x_t`.
	4. Feed `x_t`, `t`, and `c` into the `ContextUNet`.
	5. Minimize the Mean Squared Error (MSE) between the predicted and actual noise.
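
The five steps above map onto a single loss computation, sketched below. This is not the notebook's training loop: `q_sample`, `mse`, `training_loss`, and `zero_model` are hypothetical names, the schedule is a simplified cosine curve, and the image is a flat list of 16x16x3 values in `[-1, 1]` (the RGB layout and value range are assumptions). A real loop would use PyTorch tensors and backpropagate through the loss.

```python
import math
import random

def q_sample(x0, alpha_bar_t, rng):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

def mse(pred, target):
    """Mean squared error between two equally sized vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def training_loss(model, x0, c, alpha_bar, rng):
    """Loss for one sample: pick t, add noise, predict it, compare."""
    t = rng.randrange(len(alpha_bar))            # step 2: random timestep
    x_t, eps = q_sample(x0, alpha_bar[t], rng)   # step 3: add noise
    eps_pred = model(x_t, t, c)                  # step 4: predict the noise
    return mse(eps_pred, eps)                    # step 5: MSE objective

# Toy setup (all values below are illustrative assumptions).
T = 1000
alpha_bar = [math.cos(0.5 * math.pi * (t + 1) / (T + 1)) ** 2 for t in range(T)]
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16 * 16 * 3)]  # step 1: clean image

def zero_model(x_t, t, c):
    """Stand-in for ContextUNet that always predicts zero noise."""
    return [0.0] * len(x_t)

loss = training_loss(zero_model, x0, c=0, alpha_bar=alpha_bar, rng=rng)
```

With the zero-predicting stand-in the loss is simply the mean squared value of the sampled noise, so it sits near 1; a trained model should drive it well below that.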

	### 4. Inference: Guided Generation

	Sampling uses Classifier-Free Guidance (CFG) for explicit class control:

	1. Start: Begin from pure random noise.
	2. Denoising Loop: Iterate backward from `T-1` to `0`.
	3. CFG Step: At each timestep, run the model twice, once with the class label (conditional) and once without it (unconditional).
	4. Guidance: Combine the two predictions as `eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)`.
	5. Step: Use the guided noise estimate `eps` to remove a small amount of noise from the image.
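
A compact sketch of this sampling loop follows. The guidance formula is the one given above; everything else is an assumption for illustration: the function names, the use of `None` as the "no class" label, the standard DDPM ancestral update with `sigma_t^2 = beta_t`, the short 50-step toy schedule, and the `toy_model` stand-in for `ContextUNet`. The unconditional pass only makes sense if the model was trained to accept a null label, which is also assumed here.

```python
import math
import random

def toy_schedule(T):
    """Short cosine-style schedule, for demonstration only."""
    f = [math.cos(0.5 * math.pi * (t / T + 0.008) / 1.008) ** 2 for t in range(T + 1)]
    betas = [min(1.0 - f[t + 1] / f[t], 0.999) for t in range(T)]
    alpha_bar, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bar.append(running)
    return betas, alpha_bar

def guided_eps(model, x_t, t, c, guidance_scale, null_class=None):
    """CFG: eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)."""
    eps_cond = model(x_t, t, c)             # pass 1: with the class label
    eps_uncond = model(x_t, t, null_class)  # pass 2: without it
    return [u + guidance_scale * (k - u) for k, u in zip(eps_cond, eps_uncond)]

def ddpm_step(x_t, eps, t, betas, alpha_bar, rng):
    """One ancestral DDPM step from x_t to x_{t-1}."""
    beta_t = betas[t]
    coef = beta_t / math.sqrt(1.0 - alpha_bar[t])
    scale = 1.0 / math.sqrt(1.0 - beta_t)
    mean = [scale * (x - coef * e) for x, e in zip(x_t, eps)]
    if t == 0:
        return mean                         # no noise is added on the last step
    sigma = math.sqrt(beta_t)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def sample(model, n_values, c, betas, alpha_bar, guidance_scale=3.0, seed=0):
    """Start from noise and denoise from t = T-1 down to t = 0."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_values)]
    for t in range(len(betas) - 1, -1, -1):
        eps = guided_eps(model, x, t, c, guidance_scale)
        x = ddpm_step(x, eps, t, betas, alpha_bar, rng)
    return x

def toy_model(x_t, t, c):
    """Stand-in for ContextUNet: predicts 1.0 with a label, 0.0 without."""
    return [0.0 if c is None else 1.0 for _ in x_t]

betas, alpha_bar = toy_schedule(50)
sprite = sample(toy_model, 16 * 16 * 3, c=1, betas=betas, alpha_bar=alpha_bar)
```

A `guidance_scale` of 0 reproduces the unconditional prediction and 1 reproduces the conditional one; values above 1 push the sample further toward the requested class.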

	---

	## Key Improvements

	1. Cosine Noise Schedule: Adds noise more gradually than a linear schedule, which is intended to improve sample quality and training stability.
	2. Classifier-Free Guidance (CFG): Allows users to control how strictly the model follows the class prompt.
	3. Exponential Moving Average (EMA): Maintains a "shadow" copy of the weights, averaged over training steps, that is used at inference for more stable, higher-quality images.
	4. Nearest Neighbor Interpolation: Preserves the sharp, blocky nature of pixel art during resizing.
	5. Attention Blocks: Capture long-range spatial relationships in the deeper U-Net layers.
	6. Live-Updating Generator: Yields intermediate denoising steps for a real-time "fade-in" effect in the UI.
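
Two of the items above, the EMA weights and the live-updating generator, are small enough to sketch directly. Both functions are hypothetical illustrations rather than code from `Training.ipynb` or `app.py`: parameters are modeled as a dict of floats instead of PyTorch tensors, and the `decay` and `yield_every` defaults are assumptions.

```python
def ema_update(shadow, params, decay=0.999):
    """Update the shadow weights in place.

    shadow[name] = decay * shadow[name] + (1 - decay) * params[name]
    """
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

def denoise_stream(x, step_fn, num_steps, yield_every=10):
    """Run the reverse loop and yield intermediate results for a live preview.

    step_fn(x, t) performs one denoising step. A frame is yielded whenever
    t is a multiple of yield_every, so the final step (t = 0) is always shown.
    """
    for t in range(num_steps - 1, -1, -1):
        x = step_fn(x, t)
        if t % yield_every == 0:
            yield t, x

# EMA demo: the shadow weight trails the raw weight as it changes.
params = {"w": 0.0}
shadow = dict(params)
for _ in range(100):
    params["w"] += 1.0          # pretend an optimizer step moved the weight
    ema_update(shadow, params, decay=0.9)

# Streaming demo with a dummy step function that just counts steps.
frames = list(denoise_stream(0, lambda x, t: x + 1, num_steps=100, yield_every=25))
```

A generator like `denoise_stream` fits UI frameworks that accept functions yielding successive outputs, which is how the "fade-in" effect can be produced without waiting for the full loop to finish.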

	---

	## Technical Details

	- Architecture: Conditional U-Net with attention blocks
	- Timesteps: 1000 diffusion steps
	- Resolution: 16x16 pixels (upscaled to 256x256 with nearest-neighbor interpolation)
	- Guidance: Classifier-Free Guidance (CFG)
	- Noise Schedule: Cosine schedule

	---

	## License

	This project is licensed under the MIT License.

	---

	## Acknowledgments

	This project draws on diffusion research, in particular DDPM and Classifier-Free Guidance (CFG).