+---
+license: mit
+library_name: pytorch
+pipeline_tag: text-to-image
+tags:
+  - stable-diffusion
+  - latent-diffusion
+  - text-to-image
+  - from-scratch
+  - pytorch
+  - diffusion
+  - generative-ai
+  - rtx-5090
+  - blackwell
+  - sd-1.x
+language:
+  - en
+base_model: []
+datasets:
+  - laion/laion2B-en-aesthetic
+  - poloclub/diffusiondb
+  - JourneyDB/JourneyDB
+---
+# SD-From-Scratch v1 (epoch 42)
+A **Stable Diffusion 1.x-class** latent diffusion model **trained from scratch** on **2× RTX 5090 (Blackwell, 33.7 GB VRAM each)** using the [SD_Train.py](https://github.com/atandra2000/StableDiffusion/blob/main/SD_Train.py) pipeline.
+This checkpoint (`sd_epoch_042.pt`) is the visually best checkpoint from a run of **48 epochs across 7 training phases** (LAION broad → LAION refined → DiffusionDB/JourneyDB mix → VGGFace2 face fine-tune → COCO full-body → consolidation → final mixed).
+- **Architecture:** UNet (ch=320, ch_mults=(1,2,4,4), attn_lvls=(1,2,3), heads=8) — ~860M trainable parameters
+- **Frozen components:** VAE `stabilityai/sd-vae-ft-mse`, text encoder `openai/clip-vit-large-patch14`
+- **Latent space:** 64×64×4 (8× spatial compression of 512×512 RGB), scale factor `0.18215`
+- **Training schedule:** DDPM scaled-linear betas (0.00085 → 0.012, 1000 steps), Min-SNR γ=5.0 (γ=3.0 in late refinement)
+- **Precision:** BF16 native autocast (no GradScaler), gradient checkpointing, Fused AdamW
+- **EMA decay:** 0.9999 (GPU-resident shadow)
+- **Best training loss:** 0.0947 (epoch 16, on filtered LAION 213k subset)
+- **Released checkpoint:** epoch 42 (~12.5 GB) — best visual coherence across portraits, landscapes and scenes
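
The noise schedule and loss weighting listed above can be sketched in a few lines of plain Python. This is an illustration only, not the code in `SD_Train.py`: the function names are hypothetical, and the weight formula `min(SNR, γ) / SNR` assumes a noise-prediction (epsilon) objective, which this card does not state explicitly.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Scaled-linear schedule: interpolate linearly in sqrt(beta), then square."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def snr_per_step(betas):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar) at every timestep."""
    snr, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta          # cumulative product of (1 - beta)
        snr.append(alpha_bar / (1.0 - alpha_bar))
    return snr

def min_snr_weights(snr, gamma=5.0):
    """Min-SNR loss weights, assuming an epsilon-prediction objective."""
    return [min(s, gamma) / s for s in snr]

betas = scaled_linear_betas()
snr = snr_per_step(betas)
weights = min_snr_weights(snr, gamma=5.0)
```

Under these assumptions, low-noise timesteps (high SNR) are down-weighted, while timesteps whose SNR is already below γ keep a weight of 1.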
+## Files
+- `sd_epoch_042.pt` — full training checkpoint: UNet `state_dict`, EMA shadow weights, optimizer/scheduler state, epoch/step/best_loss metadata. **Load EMA weights at inference for best quality** (see usage below).
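
For context on what the EMA shadow weights in the checkpoint represent, here is a minimal sketch of the standard exponential-moving-average update with the decay of 0.9999 listed above. It uses scalar toy "weights" and a hypothetical `ema_update` helper; the training script's actual implementation and the checkpoint's key names are not reproduced here.

```python
def ema_update(shadow, params, decay=0.9999):
    """One EMA step, in place: shadow <- decay * shadow + (1 - decay) * param."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

# Toy example: the live weight is held at 1.0 while the shadow starts at 0.0.
# Real training would apply the same update to every tensor after each step.
shadow = {"w": 0.0}
for _ in range(10_000):
    ema_update(shadow, {"w": 1.0})
```

With a decay of 0.9999 the shadow averages over roughly the last 10,000 optimizer steps, which is why it tends to be smoother than the raw weights at inference time.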
+## License
+MIT. See [LICENSE](https://github.com/atandra2000/StableDiffusion/blob/main/LICENSE) in the source repo.
+## Intended use
+Research and educational reproduction of from-scratch latent-diffusion training. Not intended for production deployment without further safety filtering. Not aligned for instruction-following or for any specific style/persona.
+## How to use
+The checkpoint is loaded by the inference scripts in the source repo.
+```bash
+git clone https://github.com/atandra2000/StableDiffusion.git
+cd StableDiffusion
+pip install torch torchvision diffusers transformers huggingface_hub pillow
+# Download the checkpoint from this HF model repo:
+huggingface-cli download atandra2000/sd-from-scratch-v1 sd_epoch_042.pt --local-dir .
+# Generate an image (uses EMA weights internally and removes the DDIM
+# pred_x0 clamp, which would otherwise destroy SD latent signal):
+python inference.py \
+    --checkpoint sd_epoch_042.pt \
+    --prompt "a beautiful sunset over mountain peaks, cinematic lighting" \
+    --steps 50 \
+    --guidance 7.5 \
+    --output sample.png
+```
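
The comment about the `pred_x0` clamp can be illustrated with a deterministic DDIM step (eta = 0). The sketch below is an assumption-laden toy, not the code in `inference.py`: `ddim_step` and its `clamp_x0` flag are hypothetical names, and latents are plain Python lists. Latents scaled by `0.18215` are roughly unit-variance, so values outside [-1, 1] are common, and clamping them (a pixel-space habit) discards signal.

```python
import math

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, clamp_x0=False):
    """One deterministic DDIM step (eta = 0) on a flat list of latent values."""
    out = []
    for x, e in zip(x_t, eps):
        # Estimate the clean latent from the noisy latent and predicted noise.
        pred_x0 = (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        if clamp_x0:
            # Pixel-space style clamp; it truncates latents outside [-1, 1].
            pred_x0 = max(-1.0, min(1.0, pred_x0))
        # Re-noise the estimate to the previous (less noisy) timestep.
        out.append(math.sqrt(alpha_bar_prev) * pred_x0
                   + math.sqrt(1.0 - alpha_bar_prev) * e)
    return out

# Toy latent whose values lie outside [-1, 1], noised with a known eps.
x0 = [2.5, -3.0]
eps = [0.1, -0.2]
ab_t, ab_prev = 0.5, 0.8
x_t = [math.sqrt(ab_t) * a + math.sqrt(1.0 - ab_t) * e for a, e in zip(x0, eps)]

free = ddim_step(x_t, eps, ab_t, ab_prev)                    # no clamp
clamped = ddim_step(x_t, eps, ab_t, ab_prev, clamp_x0=True)  # with clamp
```

In this toy case the unclamped step lands on the trajectory implied by `x0`, while the clamped step pulls both values toward ±1.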
+For portraits and faces, raise the DDIM step count to 100 and the guidance scale (CFG) to 8.5:
+```bash
+python inference.py \
+    --checkpoint sd_epoch_042.pt \
+    --prompt "a photorealistic portrait of a woman with blue eyes, soft studio lighting" \
+    --steps 100 \
+    --guidance 8.5 \
+    --output portrait.png
+```
+For negative-prompt CFG (CUDA only), use `SD_ImageGen.py`:
+```bash
+python SD_ImageGen.py \
+    --checkpoint sd_epoch_042.pt \
+    --prompts "a futuristic city skyline at night with neon lights" \
+    --negative "blurry, low quality, distorted, watermark" \
+    --steps 50 --guidance 7.5
+```
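
As a rough picture of what the `--guidance` and `--negative` flags control, the sketch below shows the usual classifier-free guidance combination. It assumes `SD_ImageGen.py` follows the common convention of computing the "unconditional" noise prediction from the negative-prompt embedding instead of an empty prompt; that is not confirmed here, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(eps_cond, eps_uncond, guidance=7.5):
    """Classifier-free guidance: move from the unconditional (or negative-prompt)
    noise prediction toward the prompt-conditioned one, scaled by `guidance`."""
    return [u + guidance * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Toy noise predictions for two latent values.
eps_prompt = [1.0, 2.0]     # conditioned on the prompt
eps_negative = [0.0, 1.0]   # conditioned on the negative (or empty) prompt
guided = cfg_combine(eps_prompt, eps_negative, guidance=7.5)
```

A guidance of 1.0 reduces to the prompt-conditioned prediction; larger values push the sample further from whatever the negative branch describes.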
+## Training summary
+| Phase | Epochs | Dataset (after filtering) | LR | Notes |
+|-------|--------|---------------------------|----|-------|
+| 1 | 1–10  | LAION-2B-en aesthetic ≥ 6.5 (1.32M) | 1e-5 | Broad pretraining (~3 h/epoch) |
+| 2 | 11–17 | LAION aesthetic ≥ 7.5, CLIP ≥ 0.30 (213k) | 1e-5 | Best loss 0.0947 @ ep 16 |
+| 3 | 18–22 | DiffusionDB + JourneyDB (~482k) | 1e-5 | Synthetic / Midjourney domain mix |
+| 4 | 23–29 | VGGFace2 (51,786) | 2e-6 | Face anatomy fine-tune |
+| 5 | 30–38 | COCO person crops (59,494) | 1.5e-6 | Full-body fine-tune (caused face regression) |
+| 6 | 39–42 | Mixed LAION + VGGFace2 + COCO (250k) | 1e-6 | **Released checkpoint — sweet spot** |
+| 7 | 43–48 | Comprehensive mix (572k) | 1e-6 | Final consolidation |
+Detailed write-up and engineering lessons:
+- 📘 [`blog_post.md`](https://github.com/atandra2000/StableDiffusion/blob/main/blog_post.md) — full technical Medium-style write-up
+- 📋 [`summary.md`](https://github.com/atandra2000/StableDiffusion/blob/main/summary.md) — authoritative engineering reference
+## Known limitations
+- Faces are usable, but eyes are still slightly oversized at higher CFG
+- Full-body anatomy: left-arm geometry is occasionally weak
+- Animal faces are the weakest category in current evaluation
+- Food prompts confuse categories (pasta ↔ noodles, etc.)
+- This is an **SD 1.x-class** model — no SDXL-grade detail, no controllable spatial conditioning (use ControlNet downstream if needed)
+## Reproducibility
+All training code, data-pipeline scripts (LAION metadata download / filtering / WebDataset tar shards / VAE latent pre-encoding), training loop, and inference scripts are open-source at:
+> 🔗 **https://github.com/atandra2000/StableDiffusion**
+## Citation
+```bibtex
+@software{atandra2000_sd_from_scratch_v1,
+  author  = {Bharati, Atandra},
+  title   = {SD-From-Scratch v1: A Stable-Diffusion-class latent diffusion model trained from scratch on dual RTX 5090s},
+  year    = {2026},
+  url     = {https://huggingface.co/atandra2000/sd-from-scratch-v1},
+  license = {MIT}
+}
+```