Add model card
README.md
ADDED
@@ -0,0 +1,135 @@
---
license: mit
library_name: pytorch
pipeline_tag: text-to-image
tags:
- stable-diffusion
- latent-diffusion
- text-to-image
- from-scratch
- pytorch
- diffusion
- generative-ai
- rtx-5090
- blackwell
- sd-1.x
language:
- en
base_model: []
datasets:
- laion/laion2B-en-aesthetic
- poloclub/diffusiondb
- JourneyDB/JourneyDB
---

# SD-From-Scratch v1 (epoch 42)

A **Stable Diffusion 1.x-class** latent diffusion model **trained from scratch** on **2× RTX 5090 (Blackwell, 33.7 GB VRAM each)** using the [SD_Train.py](https://github.com/atandra2000/StableDiffusion/blob/main/SD_Train.py) pipeline.

This checkpoint (`sd_epoch_042.pt`) is the visually best checkpoint from a run of **48 epochs across 7 training phases** (LAION broad → LAION refined → DiffusionDB/JourneyDB mix → VGGFace2 face fine-tune → COCO full-body → consolidation → final mixed).

- **Architecture:** UNet (ch=320, ch_mults=(1,2,4,4), attn_lvls=(1,2,3), heads=8), ~860M trainable parameters
- **Frozen components:** VAE `stabilityai/sd-vae-ft-mse`, text encoder `openai/clip-vit-large-patch14`
- **Latent space:** 64×64×4 (8× spatial compression of 512×512 RGB), scale factor `0.18215`
- **Training schedule:** DDPM scaled-linear betas (0.00085 → 0.012, 1000 steps), Min-SNR γ=5.0 (γ=3.0 in late refinement)
- **Precision:** BF16 native autocast (no GradScaler), gradient checkpointing, fused AdamW
- **EMA decay:** 0.9999 (GPU-resident shadow)
- **Best training loss:** 0.0947 (epoch 16, on filtered LAION 213k subset)
- **Released checkpoint:** epoch 42 (~12.5 GB), best visual coherence across portraits, landscapes and scenes
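
The training-schedule bullet above can be made concrete with a short, dependency-free sketch. It builds the scaled-linear beta schedule from the stated endpoints and derives Min-SNR loss weights from it. The function names are illustrative, and the weight formula assumes epsilon prediction, which this card does not state explicitly; the actual code in `SD_Train.py` may differ.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Betas spaced linearly in sqrt-space, then squared ('scaled-linear')."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def alphas_cumprod(betas):
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def min_snr_weight(alpha_bar, gamma=5.0):
    """Min-SNR weight, assuming epsilon prediction: min(SNR, gamma) / SNR."""
    snr = alpha_bar / (1.0 - alpha_bar)
    return min(snr, gamma) / snr

betas = scaled_linear_betas()
abar = alphas_cumprod(betas)
weights = [min_snr_weight(a, gamma=5.0) for a in abar]
```

Low-noise timesteps have a very high SNR, so their loss is scaled down sharply, while high-noise timesteps (SNR below γ) keep a weight of 1. Lowering γ to 3.0 simply widens the range of timesteps that are down-weighted.
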

## Files

- `sd_epoch_042.pt`: full training checkpoint containing the UNet `state_dict`, EMA shadow weights, optimizer/scheduler state, and epoch/step/best_loss metadata. **Load EMA weights at inference for best quality** (see usage below).
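
For readers unfamiliar with EMA shadow weights, the following is a minimal sketch of the standard update rule with the decay of 0.9999 quoted above. It uses plain floats in a dict instead of tensors, and the key name `w` is hypothetical; it does not reflect the checkpoint's actual layout.

```python
def ema_update(shadow, params, decay=0.9999):
    """In-place EMA: shadow <- decay * shadow + (1 - decay) * param."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

# Toy example: a single scalar "weight" that sits at 1.0 for 10,000 steps.
shadow = {"w": 0.0}
for _ in range(10_000):
    ema_update(shadow, {"w": 1.0})
```

With a decay this close to 1, the shadow changes slowly and averages over roughly the last 10,000 optimizer steps, which is why it tends to give smoother samples than the raw weights.
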

## License

MIT. See [LICENSE](https://github.com/atandra2000/StableDiffusion/blob/main/LICENSE) in the source repo.

## Intended use

Research and educational reproduction of from-scratch latent-diffusion training. Not intended for production deployment without further safety filtering. Not aligned for instruction-following or for any specific style/persona.

## How to use

The checkpoint is loaded by the inference scripts in the source repo.

```bash
git clone https://github.com/atandra2000/StableDiffusion.git
cd StableDiffusion
pip install torch torchvision diffusers transformers huggingface_hub pillow

# Download the checkpoint from this HF model repo:
huggingface-cli download atandra2000/sd-from-scratch-v1 sd_epoch_042.pt --local-dir .

# Generate an image (uses EMA weights internally and removes the DDIM
# pred_x0 clamp, which would otherwise destroy SD latent signal):
python inference.py \
  --checkpoint sd_epoch_042.pt \
  --prompt "a beautiful sunset over mountain peaks, cinematic lighting" \
  --steps 50 \
  --guidance 7.5 \
  --output sample.png
```
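
The comment about the `pred_x0` clamp can be illustrated with a toy deterministic DDIM step (eta = 0). This is a sketch on flat Python lists, not the code in `inference.py`; the function name and the `clamp_x0` flag are hypothetical. It assumes the usual situation in which scaled SD latents have roughly unit variance and therefore regularly fall outside [-1, 1].

```python
import math

def ddim_step(x_t, eps, abar_t, abar_prev, clamp_x0=False):
    """One deterministic DDIM update (eta = 0) on a flat list of latent values."""
    x_prev = []
    for x, e in zip(x_t, eps):
        # Predicted clean latent from the noisy latent and the noise estimate.
        x0 = (x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        if clamp_x0:
            # Pixel-space habit: clip to [-1, 1]. Harmful for latents.
            x0 = max(-1.0, min(1.0, x0))
        x_prev.append(math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e)
    return x_prev

# Toy check: noise a known latent, then step straight to abar_prev = 1.
x0_true = [2.5, -0.3]
noise = [0.1, -0.2]
abar_t = 0.5
x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * n
       for x, n in zip(x0_true, noise)]
free = ddim_step(x_t, noise, abar_t, abar_prev=1.0)
clamped = ddim_step(x_t, noise, abar_t, abar_prev=1.0, clamp_x0=True)
```

Without the clamp the step recovers the original latent; with it, the value 2.5 is truncated to 1.0, which is the kind of signal loss the comment in the command above refers to.
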

For portraits and faces, push DDIM steps to 100 and CFG to 8.5:

```bash
python inference.py \
  --checkpoint sd_epoch_042.pt \
  --prompt "a photorealistic portrait of a woman with blue eyes, soft studio lighting" \
  --steps 100 \
  --guidance 8.5 \
  --output portrait.png
```

For negative-prompt CFG (CUDA only), use `SD_ImageGen.py`:

```bash
python SD_ImageGen.py \
  --checkpoint sd_epoch_042.pt \
  --prompts "a futuristic city skyline at night with neon lights" \
  --negative "blurry, low quality, distorted, watermark" \
  --steps 50 --guidance 7.5
```
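
As a rough illustration of what the `--guidance` and `--negative` options control, here is the standard classifier-free guidance combination on plain lists. This is a sketch of the general technique; whether `SD_ImageGen.py` implements negative prompts exactly this way is an assumption, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(eps_cond, eps_base, guidance=7.5):
    """Classifier-free guidance: base + g * (cond - base), elementwise.

    eps_base is the noise prediction for the empty prompt, or for the
    negative prompt when one is supplied.
    """
    return [b + guidance * (c - b) for c, b in zip(eps_cond, eps_base)]

# Toy noise predictions for two latent values.
guided = cfg_combine([1.0, 0.0], [0.0, 0.0], guidance=7.5)
unguided = cfg_combine([0.3, -0.2], [0.1, 0.4], guidance=1.0)
```

A guidance of 1.0 reduces to the conditional prediction, and larger values push the sample further away from the base (unconditional or negative-prompt) prediction.
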

## Training summary

| Phase | Epochs | Dataset (after filtering) | LR | Notes |
|-------|--------|---------------------------|----|-------|
| 1 | 1–10 | LAION-2B-en aesthetic ≥ 6.5 (1.32M) | 1e-5 | Broad pretraining (~3 h/epoch) |
| 2 | 11–17 | LAION aesthetic ≥ 7.5, CLIP ≥ 0.30 (213k) | 1e-5 | Best loss 0.0947 @ ep 16 |
| 3 | 18–22 | DiffusionDB + JourneyDB (~482k) | 1e-5 | Synthetic / Midjourney domain mix |
| 4 | 23–29 | VGGFace2 (51,786) | 2e-6 | Face anatomy fine-tune |
| 5 | 30–38 | COCO person crops (59,494) | 1.5e-6 | Full-body fine-tune (caused face regression) |
| 6 | 39–42 | Mixed LAION + VGGFace2 + COCO (250k) | 1e-6 | **Released checkpoint (sweet spot)** |
| 7 | 43–48 | Comprehensive mix (572k) | 1e-6 | Final consolidation |

Detailed write-up and engineering lessons:
- [`blog_post.md`](https://github.com/atandra2000/StableDiffusion/blob/main/blog_post.md): full technical Medium-style write-up
- [`summary.md`](https://github.com/atandra2000/StableDiffusion/blob/main/summary.md): authoritative engineering reference

## Known limitations

- Faces are usable, but eyes are still slightly oversized at higher CFG
- Full-body anatomy: left-arm geometry is occasionally weak
- Animal faces are the weakest category in current evaluation
- Food prompts confuse categories (pasta vs. noodles, etc.)
- This is an **SD 1.x-class** model: no SDXL-grade detail, no controllable spatial conditioning (use ControlNet downstream if needed)

## Reproducibility

All training code, data-pipeline scripts (LAION metadata download / filtering / WebDataset tar shards / VAE latent pre-encoding), training loop, and inference scripts are open-source at:

> **https://github.com/atandra2000/StableDiffusion**

## Citation

```bibtex
@software{atandra2000_sd_from_scratch_v1,
  author = {Bharati, Atandra},
  title = {SD-From-Scratch v1: A Stable-Diffusion-class latent diffusion model trained from scratch on dual RTX 5090s},
  year = {2026},
  url = {https://huggingface.co/atandra2000/sd-from-scratch-v1},
  license = {MIT}
}
```