# Imagen v1 Medium
A prototype modern diffusion transformer trained on high-quality recaptioned data
| ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
| "A silver pot on wood" | "Entrance of a luxury house" | "A house in a green field" | "A forest" | "A handwritten poem" |
Pipeline: Text Prompt → T5-Large (770M) → DiT-320M ← Noise + Timestep → SDXL VAE Decoder → Image (256×256)
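The shapes implied by this pipeline can be traced with a little arithmetic. A minimal sketch, assuming the SDXL VAE's standard 8× spatial downsampling into a 4-channel latent (the patch size and hidden dimension come from the architecture table):

```python
# Trace tensor shapes through the pipeline (pure arithmetic, no models loaded).
# Assumption: the SDXL VAE downsamples 8x spatially into a 4-channel latent.

image_hw = 256
vae_downsample = 8
latent_channels = 4

latent_hw = image_hw // vae_downsample       # 256x256 image -> 32x32 latent
patch = 2
num_tokens = (latent_hw // patch) ** 2       # 16x16 grid = 256 patch tokens
patch_dim = latent_channels * patch * patch  # 16 raw values per patch,
                                             # linearly projected to hidden dim 1024

print(latent_hw, num_tokens, patch_dim)      # 32 256 16
```

So at 256×256, the DiT attends over a sequence of just 256 tokens, which keeps attention cheap at this resolution.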
Model specifications:

| Parameter | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Parameters | ~320M (DiT only) |
| Hidden Dimension | 1024 |
| Transformer Depth | 12 layers |
| Attention Heads | 16 |
| Patch Size | 2×2 |
| MLP Ratio | 4× |
| Context Dimension | 1024 (T5-Large) |

Each transformer block incorporates the following modern features:
| Feature | Description | Origin |
|---|---|---|
| RoPE | Rotary Positional Embeddings for 2D patches | LLaMA, Flux |
| QK-Normalization | Stabilizes attention at scale | ViT-22B |
| SwiGLU | Gated activation for better gradient flow | PaLM, LLaMA |
| AdaLN-Zero | Adaptive layer norm for timestep conditioning | DiT |
| RMSNorm | Faster than LayerNorm with similar quality | LLaMA |
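Of these, RMSNorm is the simplest to show in isolation. A minimal sketch in plain Python (no framework; the function name and test values are illustrative):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Normalize x by its root-mean-square, then scale by a learned gain.

    Unlike LayerNorm there is no mean subtraction and no bias term --
    one fewer reduction pass, which is where the speed advantage comes from.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

out = rms_norm([3.0, -4.0], [1.0, 1.0])
# RMS of [3, -4] is sqrt((9 + 16) / 2) ~= 3.536, so out ~= [0.849, -1.131]
```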
Training configuration:

| Hyperparameter | Value |
|---|---|
| Resolution | 256×256 |
| Batch Size | 80 (effective: 10 × 8 gradient accumulation) |
| Learning Rate | 2×10⁻⁴ |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Precision | bfloat16 |
| EMA Decay | 0.9999 |
| Warmup Steps | 500 |
| Gradient Clipping | 1.0 |
| Timesteps | 1000 |
| Schedule | Cosine (Improved DDPM) |
| Prediction | ε-prediction |
| CFG Dropout | 10% |
| Sampler | DDIM |
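The Improved DDPM cosine schedule has a closed form, so it can be sketched directly. A minimal version, using the paper's formula with its default offset s = 0.008 (the function name is illustrative):

```python
import math

def alpha_bar(t, T=1000, s=0.008):
    """Cumulative signal level for the Improved-DDPM cosine noise schedule."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)  # normalized so the signal starts at exactly 1

# Signal decays smoothly from 1 at t=0 to ~0 at t=1000, with gentler
# noise injection near both endpoints than a linear schedule.
print(alpha_bar(0), alpha_bar(500), alpha_bar(1000))
```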
```bash
pip install torch transformers diffusers einops huggingface_hub
```
Download and run the inference script:

```bash
# Download inference script
wget https://huggingface.co/kerzgrr/imagenv1m/resolve/main/inference.py

# Generate an image
python inference.py "A cat sitting on a windowsill"

# With options
python inference.py "A forest at sunset" --steps 100 --cfg 7.5 --seed 42 --output forest.png
```
The script automatically downloads all required models from HuggingFace:

- kerzgrr/imagenv1m
- google/flan-t5-large
- stabilityai/sdxl-vae

All model sizes are available as separate repositories in the Imagen Collection.
| Version | Parameters | Repository | Status |
|---|---|---|---|
| v1-nano | ~50M | kerzgrr/imagenv1n | 🔜 Planned |
| v1-mini | ~150M | kerzgrr/imagenv1s | 🔜 Planned |
| v1-medium | ~320M | kerzgrr/imagenv1m | ✅ Available |
| v1-large | ~700M | kerzgrr/imagenv1l | 🔜 Planned |
| v1-xlarge | ~1.5B | kerzgrr/imagenv1xl | 🔜 Planned |
The checkpoint is a dictionary with the following layout:

```python
{
    "model_state_dict": ...,  # DiT weights (EMA)
    "step": int,              # Training step
    "config": dict,           # Model config
}
```
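A small sanity check for that layout (pure Python; the key names follow the dict above, while the helper name and the `torch.load` usage in the comment are illustrative):

```python
def validate_checkpoint(ckpt):
    """Check that a loaded checkpoint dict matches the documented layout."""
    required = {"model_state_dict": dict, "step": int, "config": dict}
    for key, typ in required.items():
        if key not in ckpt:
            raise KeyError(f"checkpoint missing key: {key}")
        if not isinstance(ckpt[key], typ):
            raise TypeError(f"{key} should be a {typ.__name__}")
    return True

# With PyTorch, loading would look roughly like:
#   ckpt = torch.load("checkpoint.pt", map_location="cpu")  # filename illustrative
#   validate_checkpoint(ckpt)
#   model.load_state_dict(ckpt["model_state_dict"])
```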
```bibtex
@misc{imagenv1m,
  title={Imagen v1 Medium: A Diffusion Transformer for Text-to-Image Generation},
  author={kerzgrr},
  year={2025},
  url={https://huggingface.co/kerzgrr/imagenv1m}
}
```
Built on the shoulders of giants: DiT, LLaMA, Flux, ViT-22B, PaLM, and Improved DDPM.
Trained for around 3 days on an RTX 5090 laptop GPU, specifically in an MSI Titan 18 HX Dragon Edition Norse Myth A2XW.