DDPM CIFAR-10 Diffusion Model

A Denoising Diffusion Probabilistic Model (DDPM) trained on CIFAR-10 for 300 epochs. The model is unconditional and generates 32×32 synthetic images resembling the 10 CIFAR-10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).

Model Architecture

Component	Specification
Architecture	U-Net with self-attention
Parameters	26.8 M
Base channels	128
Channel multipliers	[1, 2, 2, 2]
Attention resolutions	16×16, 8×8, 4×4 (4 heads)
ResBlocks per stage	2
Dropout	0.1
Normalization	GroupNorm (32 groups)
Activation	SiLU
Time embedding	Sinusoidal → MLP(128→512→512)
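
The "Sinusoidal" half of the time embedding row can be illustrated with a small framework-free sketch. It assumes the common Transformer-style layout (sines followed by cosines, `max_period` of 10000); the function name `sinusoidal_embedding` and both of those choices are assumptions, not details confirmed by this repository. The resulting 128-dimensional vector is what the MLP(128→512→512) would then project to 512 dimensions.

```python
import math

def sinusoidal_embedding(t, dim=128, max_period=10000.0):
    """Return a `dim`-dimensional sinusoidal embedding of integer timestep `t`."""
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # Assumed layout: first half sines, second half cosines.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_embedding(t=10)
print(len(emb))
```

Each timestep gets a distinct, bounded vector, so the network can condition on the noise level without a learned lookup table.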

U-Net Data Flow

Input (3×32×32)
  → Init Conv (3→128)
  → Down[0]: ResBlocks(128) → 32×32 skip
  → Down[1]: ResBlocks(256) + SelfAttn → 16×16 skip
  → Down[2]: ResBlocks(256) + SelfAttn → 8×8 skip
  → Down[3]: ResBlocks(256) + SelfAttn → 4×4 skip
  → Mid: ResBlock + SelfAttn + ResBlock
  → Up[0]: SkipCat + ResBlocks(256) + SelfAttn → upsample 8×8
  → Up[1]: SkipCat + ResBlocks(256) + SelfAttn → upsample 16×16
  → Up[2]: SkipCat + ResBlocks(256) + SelfAttn → upsample 32×32
  → Up[3]: SkipCat + ResBlocks(128)
  → Out: GroupNorm + SiLU + Conv → 3×32×32
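
The channel widths and resolutions in the data flow above follow directly from the base channels, the channel multipliers, and one 2× downsample between stages. The sketch below reproduces that bookkeeping; `trace_stages` is a hypothetical helper written for illustration and is not part of the repository's `model.py`.

```python
def trace_stages(base_channels=128, multipliers=(1, 2, 2, 2), image_size=32,
                 attention_resolutions=(16, 8, 4)):
    """List channel width, resolution, and attention use for each encoder stage."""
    stages = []
    size = image_size
    for i, mult in enumerate(multipliers):
        stages.append({
            "stage": i,
            "channels": base_channels * mult,
            "resolution": size,
            "attention": size in attention_resolutions,
        })
        # Every stage except the last halves the spatial resolution.
        if i < len(multipliers) - 1:
            size //= 2
    return stages

stages = trace_stages()
for s in stages:
    print(s)
```

The decoder mirrors these stages in reverse order, concatenating the matching skip connection at each resolution.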

Diffusion Process

Forward diffusion: Gaussian noise is added over T=1000 timesteps; the network predicts the noise ε (ε-prediction) rather than x₀
Schedule: Linear β schedule from 1e-4 to 0.02 over T=1000 timesteps
x₀ clipping: The predicted x₀ is clipped to [-1, 1] before the posterior mean is computed, which keeps sampling numerically stable
Sampling: DDPM ancestral sampler with EMA shadow weights
Objective: Simple MSE loss between predicted and true noise
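
The points above can be sketched on a single scalar "pixel" without any framework, which makes the arithmetic easy to check. This is a minimal illustration assuming a linear β schedule with the listed endpoints and the posterior mean from Ho et al. (2020); the names `q_sample`, `noise_mse`, and `posterior_mean` are illustrative and may not match the functions in `diffusion.py`.

```python
import math

T = 1000
BETA_START, BETA_END = 1e-4, 0.02

# Linear beta schedule and cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s).
betas = [BETA_START + (BETA_END - BETA_START) * t / (T - 1) for t in range(T)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps

def noise_mse(eps_pred, eps_true):
    """Simple objective: mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(eps_pred, eps_true)) / len(eps_true)

def posterior_mean(x_t, t, eps_pred):
    """Mean of q(x_{t-1} | x_t, x0_hat), with x0_hat clipped to [-1, 1]."""
    a_bar = alpha_bars[t]
    a_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    # Invert the forward process to estimate x0 from the predicted noise.
    x0_hat = (x_t - math.sqrt(1.0 - a_bar) * eps_pred) / math.sqrt(a_bar)
    x0_hat = max(-1.0, min(1.0, x0_hat))
    coef_x0 = betas[t] * math.sqrt(a_bar_prev) / (1.0 - a_bar)
    coef_xt = (1.0 - a_bar_prev) * math.sqrt(1.0 - betas[t]) / (1.0 - a_bar)
    return coef_x0 * x0_hat + coef_xt * x_t

x_t = q_sample(x0=0.5, t=500, eps=0.3)
print(posterior_mean(x_t, t=500, eps_pred=0.3))
```

Without the clip, a poor noise prediction at large t is divided by a very small sqrt(alpha_bar_t), so the x₀ estimate can land far outside the valid pixel range and drag the posterior mean with it.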

Training

Setting	Value
Dataset	CIFAR-10 (50k)
Epochs	300
Batch size	256
Optimizer	AdamW
Learning rate	2×10⁻⁴
Mixed precision	BF16 (AMP)
EMA decay	0.9999 (with warmup)
Steps	58,500
Final loss	~0.0547
Hardware	RTX 5080 16 GB
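
Two rows of the table can be made concrete. The step count is consistent with dropping the last partial batch of each epoch, which is an inference from the numbers rather than a documented setting. The EMA warmup rule below, `min(max_decay, (1 + step) / (10 + step))`, is one widely used form and is only an assumption about what "warmup" means here; `ema_decay` and `ema_update` are hypothetical names.

```python
DATASET_SIZE = 50_000
BATCH_SIZE = 256
EPOCHS = 300

# Assumes the final partial batch of each epoch is dropped.
steps_per_epoch = DATASET_SIZE // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS

def ema_decay(step, max_decay=0.9999):
    """Assumed warmup rule: decay starts at 0.1 and rises toward max_decay."""
    return min(max_decay, (1.0 + step) / (10.0 + step))

def ema_update(shadow, weights, step):
    """Blend current weights into the shadow copy using the warmed-up decay."""
    d = ema_decay(step)
    return [d * s + (1.0 - d) * w for s, w in zip(shadow, weights)]

print(steps_per_epoch, total_steps)
print(ema_decay(0), ema_decay(total_steps))
```

Under this particular rule the decay would still be slightly below 0.9999 at the end of a 58,500-step run, so the actual schedule used for this checkpoint may differ.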

Usage

import torch
import json
from safetensors.torch import load_file

from config import Config
from model import UNet
from diffusion import GaussianDiffusion

# Load config and model
with open('config.json') as f:
    cfg_dict = json.load(f)

cfg = Config()
cfg.model.base_channels = cfg_dict['base_channels']
cfg.model.channel_multipliers = tuple(cfg_dict['channel_multipliers'])
cfg.model.attention_resolutions = tuple(cfg_dict['attention_resolutions'])
# ... (set remaining fields from config.json)

model = UNet(cfg.model)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval().cuda()

# Set up diffusion
diff = GaussianDiffusion(
    timesteps=1000,
    beta_start=1e-4,
    beta_end=0.02,
)

# Generate 64 images (8×8 grid)
with torch.no_grad():
    samples = diff.p_sample_loop(model, (64, 3, 32, 32), device='cuda')
    # samples.shape → (64, 3, 32, 32), range [-1, 1]
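
To view the result, the samples need to be mapped from [-1, 1] to 8-bit pixels and tiled. The sketch below does this with plain tensor operations; `to_uint8_grid` is a hypothetical helper, not part of this repository, and a random tensor stands in for the output of `p_sample_loop` so the snippet runs on its own.

```python
import torch

def to_uint8_grid(samples, nrow=8):
    """Tile (N, C, H, W) samples in [-1, 1] into one (rows*H, nrow*W, C) uint8 image."""
    n, c, h, w = samples.shape
    assert n % nrow == 0, "number of samples must be a multiple of nrow"
    # Map [-1, 1] to [0, 255], clamping any values that fall outside the range.
    images = ((samples.clamp(-1.0, 1.0) + 1.0) * 127.5).round().to(torch.uint8)
    rows = n // nrow
    # (rows, nrow, C, H, W) -> (rows, H, nrow, W, C) -> (rows*H, nrow*W, C)
    grid = images.reshape(rows, nrow, c, h, w).permute(0, 3, 1, 4, 2)
    return grid.reshape(rows * h, nrow * w, c)

# Stand-in for the tensor returned by the sampling loop above.
fake_samples = torch.rand(64, 3, 32, 32) * 2.0 - 1.0
grid = to_uint8_grid(fake_samples)
print(grid.shape, grid.dtype)
```

The returned array is in height × width × channel order, so it can be handed to an image library after moving it to the CPU and converting to NumPy.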

Samples

Training progression — same 8 random seeds tracked across the run:

Step 500	Step 30,000	Step 58,500 (final)
Early blurry shapes	Semi-recognizable objects	Sharp, diverse CIFAR-10 samples

Limitations

Resolution: Fixed 32×32 — CIFAR-10 native resolution
Class conditioning: This is an unconditional model; no class labels used during training
FID: Not evaluated for this checkpoint
Artifacts: Some generated samples may have checkerboard artifacts or unnatural colors

Citation

@inproceedings{ho2020denoising,
  title={Denoising Diffusion Probabilistic Models},
  author={Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},
  booktitle={Advances in Neural Information Processing Systems},
  year={2020}
}
