# Imagen v1 Medium
A prototype modern diffusion transformer trained on high-quality recaptioned data
| ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
| "A silver pot on wood" | "Entrance of a luxury house" | "A house in a green field" | "A forest" | "A handwritten poem" |
Pipeline: Text Prompt → T5-Large (770M) → DiT-320M ← Noise + Timestep → SDXL VAE Decoder → Image (256×256)
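The shapes implied by this pipeline can be traced with a little arithmetic. A minimal sketch, assuming the SDXL VAE's standard 8× spatial downsampling into a 4-channel latent (the patch size and hidden dimension come from the architecture table):

```python
# Trace tensor shapes through the pipeline (pure arithmetic, no models loaded).
# Assumption: the SDXL VAE downsamples 8x spatially into a 4-channel latent.

image_hw = 256
vae_downsample = 8
latent_channels = 4

latent_hw = image_hw // vae_downsample       # 256x256 image -> 32x32 latent
patch = 2
num_tokens = (latent_hw // patch) ** 2       # 16x16 grid = 256 patch tokens
patch_dim = latent_channels * patch * patch  # 16 raw values per patch,
                                             # linearly projected to hidden dim 1024

print(latent_hw, num_tokens, patch_dim)      # 32 256 16
```

So at 256×256, the DiT attends over a sequence of just 256 tokens, which keeps attention cheap at this resolution.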
Model specifications:

| Parameter | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Parameters | ~320M (DiT only) |
| Hidden Dimension | 1024 |
| Transformer Depth | 12 layers |
| Attention Heads | 16 |
| Patch Size | 2×2 |
| MLP Ratio | 4× |
| Context Dimension | 1024 (T5-Large) |

Each transformer block incorporates the following modern features:
| Feature | Description | Origin |
|---|---|---|
| RoPE | Rotary Positional Embeddings for 2D patches | LLaMA, Flux |
| QK-Normalization | Stabilizes attention at scale | ViT-22B |
| SwiGLU | Gated activation for better gradient flow | PaLM, LLaMA |
| AdaLN-Zero | Adaptive layer norm for timestep conditioning | DiT |
| RMSNorm | Faster than LayerNorm with similar quality | LLaMA |
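Of these, RMSNorm is the simplest to show in isolation. A minimal sketch in plain Python (no framework; the function name and test values are illustrative):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Normalize x by its root-mean-square, then scale by a learned gain.

    Unlike LayerNorm there is no mean subtraction and no bias term --
    one fewer reduction pass, which is where the speed advantage comes from.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

out = rms_norm([3.0, -4.0], [1.0, 1.0])
# RMS of [3, -4] is sqrt((9 + 16) / 2) ~= 3.536, so out ~= [0.849, -1.131]
```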
Training configuration:

| Hyperparameter | Value |
|---|---|
| Resolution | 256×256 |
| Batch Size | 80 (effective: 10 × 8 gradient accumulation) |
| Learning Rate | 2×10⁻⁴ |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Precision | bfloat16 |
| EMA Decay | 0.9999 |
| Warmup Steps | 500 |
| Gradient Clipping | 1.0 |
| Timesteps | 1000 |
| Schedule | Cosine (Improved DDPM) |
| Prediction | ε-prediction |
| CFG Dropout | 10% |
| Sampler | DDIM |
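The Improved DDPM cosine schedule has a closed form, so it can be sketched directly. A minimal version, using the paper's formula with its default offset s = 0.008 (the function name is illustrative):

```python
import math

def alpha_bar(t, T=1000, s=0.008):
    """Cumulative signal level for the Improved-DDPM cosine noise schedule."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)  # normalized so the signal starts at exactly 1

# Signal decays smoothly from 1 at t=0 to ~0 at t=1000, with gentler
# noise injection near both endpoints than a linear schedule.
print(alpha_bar(0), alpha_bar(500), alpha_bar(1000))
```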
```bash
pip install torch transformers diffusers einops huggingface_hub
```
Download and run the inference script:

```bash
# Download inference script
wget https://huggingface.co/kerzgrr/imagenv1m/resolve/main/inference.py

# Generate an image
python inference.py "A cat sitting on a windowsill"

# With options
python inference.py "A forest at sunset" --steps 100 --cfg 7.5 --seed 42 --output forest.png
```
The script automatically downloads all required models from HuggingFace:

- kerzgrr/imagenv1m
- google/flan-t5-large
- stabilityai/sdxl-vae

All model sizes are available as separate repositories in the Imagen Collection.
| Version | Parameters | Repository | Status |
|---|---|---|---|
| v1-nano | ~50M | kerzgrr/imagenv1n | 🔜 Planned |
| v1-mini | ~150M | kerzgrr/imagenv1s | 🔜 Planned |
| v1-medium | ~320M | kerzgrr/imagenv1m | ✅ Available |
| v1-large | ~700M | kerzgrr/imagenv1l | 🔜 Planned |
| v1-xlarge | ~1.5B | kerzgrr/imagenv1xl | 🔜 Planned |
The checkpoint is a dictionary with the following layout:

```python
{
    "model_state_dict": ...,  # DiT weights (EMA)
    "step": int,              # Training step
    "config": dict,           # Model config
}
```
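A small sanity check for that layout (pure Python; the key names follow the dict above, while the helper name and the `torch.load` usage in the comment are illustrative):

```python
def validate_checkpoint(ckpt):
    """Check that a loaded checkpoint dict matches the documented layout."""
    required = {"model_state_dict": dict, "step": int, "config": dict}
    for key, typ in required.items():
        if key not in ckpt:
            raise KeyError(f"checkpoint missing key: {key}")
        if not isinstance(ckpt[key], typ):
            raise TypeError(f"{key} should be a {typ.__name__}")
    return True

# With PyTorch, loading would look roughly like:
#   ckpt = torch.load("checkpoint.pt", map_location="cpu")  # filename illustrative
#   validate_checkpoint(ckpt)
#   model.load_state_dict(ckpt["model_state_dict"])
```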
```bibtex
@misc{imagenv1m,
  title={Imagen v1 Medium: A Diffusion Transformer for Text-to-Image Generation},
  author={kerzgrr},
  year={2025},
  url={https://huggingface.co/kerzgrr/imagenv1m}
}
```
Built on the shoulders of giants: DiT, LLaMA, Flux, ViT-22B, PaLM, and Improved DDPM.
Trained for around 3 days on an RTX 5090 laptop GPU, specifically in an MSI Titan 18 HX Dragon Edition Norse Myth A2XW.