atandra2000 committed on
Commit
595a6b9
·
verified ·
1 Parent(s): 0d0b786

Add model card

  1. README.md +135 -0
README.md ADDED
@@ -0,0 +1,135 @@
1
+ ---
2
+ license: mit
3
+ library_name: pytorch
4
+ pipeline_tag: text-to-image
5
+ tags:
6
+ - stable-diffusion
7
+ - latent-diffusion
8
+ - text-to-image
9
+ - from-scratch
10
+ - pytorch
11
+ - diffusion
12
+ - generative-ai
13
+ - rtx-5090
14
+ - blackwell
15
+ - sd-1.x
16
+ language:
17
+ - en
18
+ base_model: []
19
+ datasets:
20
+ - laion/laion2B-en-aesthetic
21
+ - poloclub/diffusiondb
22
+ - JourneyDB/JourneyDB
23
+ ---
24
+
25
+ # SD-From-Scratch v1 (epoch 42)
26
+
27
+ A **Stable Diffusion 1.x-class** latent diffusion model **trained from scratch** on **2× RTX 5090 (Blackwell, 33.7 GB VRAM each)** with the [SD_Train.py](https://github.com/atandra2000/StableDiffusion/blob/main/SD_Train.py) pipeline.
28
+
29
+ This checkpoint (`sd_epoch_042.pt`) is the visually best checkpoint from **48 epochs across 7 training phases** (LAION broad → LAION refined → DiffusionDB/JourneyDB mix → VGGFace2 face fine-tune → COCO full-body → consolidation → final mixed).
30
+
31
+ - **Architecture:** UNet (ch=320, ch_mults=(1,2,4,4), attn_lvls=(1,2,3), heads=8), ~860M trainable parameters
32
+ - **Frozen components:** VAE `stabilityai/sd-vae-ft-mse`, text encoder `openai/clip-vit-large-patch14`
33
+ - **Latent space:** 64×64×4 (8× spatial compression of 512×512 RGB), scale factor `0.18215`
34
+ - **Training schedule:** DDPM scaled-linear betas (0.00085 → 0.012, 1000 steps), Min-SNR γ=5.0 (γ=3.0 in late refinement)
35
+ - **Precision:** BF16 native autocast (no GradScaler), gradient checkpointing, Fused AdamW
36
+ - **EMA decay:** 0.9999 (GPU-resident shadow)
37
+ - **Best training loss:** 0.0947 (epoch 16, on filtered LAION 213k subset)
38
+ - **Released checkpoint:** epoch 42 (~12.5 GB), best visual coherence across portraits, landscapes and scenes
39
+
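The training schedule listed above can be sketched in a few lines of plain Python. This is only an illustration, not the code in `SD_Train.py`: the function names are hypothetical, and the weight formula `min(SNR, γ) / SNR` is the usual Min-SNR form for an epsilon-prediction objective, which is an assumption about this model.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Scaled-linear schedule: interpolate linearly in sqrt(beta), then square."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def snr_from_betas(betas):
    """Signal-to-noise ratio per timestep: alpha_bar / (1 - alpha_bar)."""
    snr, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta  # running product of (1 - beta_t)
        snr.append(alpha_bar / (1.0 - alpha_bar))
    return snr

def min_snr_weights(snr, gamma=5.0):
    """Min-SNR loss weight, assuming epsilon prediction: min(SNR, gamma) / SNR."""
    return [min(s, gamma) / s for s in snr]

betas = scaled_linear_betas()
snr = snr_from_betas(betas)
weights = min_snr_weights(snr, gamma=5.0)  # gamma=3.0 for the late-refinement setting
```

Low-noise timesteps (large SNR) get weights well below 1, while high-noise timesteps with SNR below γ keep a weight of 1, so the clamp mainly down-weights the easy, nearly clean steps.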
40
+ ## Files
41
+
42
+ - `sd_epoch_042.pt`: full training checkpoint containing the UNet `state_dict`, EMA shadow weights, optimizer/scheduler state, and epoch/step/best_loss metadata. **Load EMA weights at inference for best quality** (see usage below).
43
+
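For readers unfamiliar with EMA shadow weights, the update rule behind them is small enough to show with plain floats. This is a generic sketch, not the repo's implementation: `ema_update` and the dictionary layout are hypothetical, and the real shadow is a set of GPU tensors rather than Python numbers.

```python
def ema_update(shadow, params, decay=0.9999):
    """One EMA step per parameter: shadow <- decay * shadow + (1 - decay) * param."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

def effective_window(decay=0.9999):
    """Rough number of recent steps the average spans: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)

# Toy example with one scalar "weight" and an exaggerated decay of 0.9.
shadow = {"w": 1.0}
ema_update(shadow, {"w": 0.0}, decay=0.9)
```

With a decay of 0.9999 the average spans roughly the last 10,000 optimizer steps, which is why the shadow weights are smoother than the raw UNet weights stored alongside them.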
44
+ ## License
45
+
46
+ MIT. See [LICENSE](https://github.com/atandra2000/StableDiffusion/blob/main/LICENSE) in the source repo.
47
+
48
+ ## Intended use
49
+
50
+ Research and educational reproduction of from-scratch latent-diffusion training. Not intended for production deployment without further safety filtering. Not aligned for instruction-following or for any specific style/persona.
51
+
52
+ ## How to use
53
+
54
+ The checkpoint is loaded by the inference scripts in the source repo.
55
+
56
+ ```bash
57
+ git clone https://github.com/atandra2000/StableDiffusion.git
58
+ cd StableDiffusion
59
+ pip install torch torchvision diffusers transformers huggingface_hub pillow
60
+
61
+ # Download the checkpoint from this HF model repo:
62
+ huggingface-cli download atandra2000/sd-from-scratch-v1 sd_epoch_042.pt --local-dir .
63
+
64
+ # Generate an image (uses EMA weights internally and removes the DDIM
65
+ # pred_x0 clamp, which would otherwise destroy SD latent signal):
66
+ python inference.py \
67
+ --checkpoint sd_epoch_042.pt \
68
+ --prompt "a beautiful sunset over mountain peaks, cinematic lighting" \
69
+ --steps 50 \
70
+ --guidance 7.5 \
71
+ --output sample.png
72
+ ```
73
+
74
+ For portraits and faces, push DDIM steps to 100 and CFG to 8.5:
75
+
76
+ ```bash
77
+ python inference.py \
78
+ --checkpoint sd_epoch_042.pt \
79
+ --prompt "a photorealistic portrait of a woman with blue eyes, soft studio lighting" \
80
+ --steps 100 \
81
+ --guidance 8.5 \
82
+ --output portrait.png
83
+ ```
84
+
85
+ For negative-prompt CFG (CUDA only), use `SD_ImageGen.py`:
86
+
87
+ ```bash
88
+ python SD_ImageGen.py \
89
+ --checkpoint sd_epoch_042.pt \
90
+ --prompts "a futuristic city skyline at night with neon lights" \
91
+ --negative "blurry, low quality, distorted, watermark" \
92
+ --steps 50 --guidance 7.5
93
+ ```
94
+
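The `--guidance` and `--negative` flags above correspond to classifier-free guidance. The sketch below shows the standard combination step on plain lists. It is an assumption that `inference.py` and `SD_ImageGen.py` follow this exact form, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: eps = eps_uncond + s * (eps_cond - eps_uncond).

    eps_cond is the noise prediction for the prompt. eps_uncond is the
    prediction for the empty prompt, or for the negative prompt when one
    is supplied (the common convention, assumed here).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy 3-element "noise predictions" standing in for full latent tensors.
eps_uncond = [0.0, 0.2, -0.1]
eps_cond = [1.0, 0.5, -0.3]
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5)
```

A scale of 1.0 reproduces the conditional prediction, and larger values push the result further away from the unconditional (or negative-prompt) prediction.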
95
+ ## Training summary
96
+
97
+ | Phase | Epochs | Dataset (after filtering) | LR | Notes |
98
+ |-------|--------|---------------------------|----|-------|
99
+ | 1 | 1–10 | LAION-2B-en aesthetic ≥ 6.5 (1.32M) | 1e-5 | Broad pretraining (~3 h/epoch) |
100
+ | 2 | 11–17 | LAION aesthetic ≥ 7.5, CLIP ≥ 0.30 (213k) | 1e-5 | Best loss 0.0947 @ ep 16 |
101
+ | 3 | 18–22 | DiffusionDB + JourneyDB (~482k) | 1e-5 | Synthetic / Midjourney domain mix |
102
+ | 4 | 23–29 | VGGFace2 (51,786) | 2e-6 | Face anatomy fine-tune |
103
+ | 5 | 30–38 | COCO person crops (59,494) | 1.5e-6 | Full-body fine-tune (caused face regression) |
104
+ | 6 | 39–42 | Mixed LAION + VGGFace2 + COCO (250k) | 1e-6 | **Released checkpoint (sweet spot)** |
105
+ | 7 | 43–48 | Comprehensive mix (572k) | 1e-6 | Final consolidation |
106
+
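The phase table can also be read as a simple schedule lookup. The sketch below encodes only the epoch ranges and learning rates from the table. The `Phase` tuple and `lr_for_epoch` helper are hypothetical names, not part of the training code.

```python
from typing import NamedTuple

class Phase(NamedTuple):
    number: int
    first_epoch: int
    last_epoch: int
    lr: float

# Epoch ranges and learning rates copied from the training summary table.
PHASES = [
    Phase(1, 1, 10, 1e-5),
    Phase(2, 11, 17, 1e-5),
    Phase(3, 18, 22, 1e-5),
    Phase(4, 23, 29, 2e-6),
    Phase(5, 30, 38, 1.5e-6),
    Phase(6, 39, 42, 1e-6),
    Phase(7, 43, 48, 1e-6),
]

def lr_for_epoch(epoch: int) -> float:
    """Return the learning rate of the phase that contains the given epoch."""
    for phase in PHASES:
        if phase.first_epoch <= epoch <= phase.last_epoch:
            return phase.lr
    raise ValueError(f"epoch {epoch} is outside the 48-epoch schedule")

total_epochs = sum(p.last_epoch - p.first_epoch + 1 for p in PHASES)
```

The ranges are contiguous and add up to the 48 epochs stated at the top of the card, with the released epoch 42 falling at the end of phase 6.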
107
+ Detailed write-up and engineering lessons:
108
+ - 📘 [`blog_post.md`](https://github.com/atandra2000/StableDiffusion/blob/main/blog_post.md): full technical Medium-style write-up
109
+ - 📋 [`summary.md`](https://github.com/atandra2000/StableDiffusion/blob/main/summary.md): authoritative engineering reference
110
+
111
+ ## Known limitations
112
+
113
+ - Faces are usable, but eyes are still slightly oversized at higher CFG
114
+ - Full-body anatomy: left-arm geometry is occasionally weak
115
+ - Animal faces are the weakest category in current evaluation
116
+ - Food prompts confuse categories (pasta ↔ noodles, etc.)
117
+ - This is an **SD 1.x-class** model: no SDXL-grade detail and no controllable spatial conditioning (use ControlNet downstream if needed)
118
+
119
+ ## Reproducibility
120
+
121
+ All training code, data-pipeline scripts (LAION metadata download / filtering / WebDataset tar shards / VAE latent pre-encoding), training loop, and inference scripts are open-source at:
122
+
123
+ > 🔗 **https://github.com/atandra2000/StableDiffusion**
124
+
125
+ ## Citation
126
+
127
+ ```bibtex
128
+ @software{atandra2000_sd_from_scratch_v1,
129
+ author = {Bharati, Atandra},
130
+ title = {SD-From-Scratch v1: A Stable-Diffusion-class latent diffusion model trained from scratch on dual RTX 5090s},
131
+ year = {2026},
132
+ url = {https://huggingface.co/atandra2000/sd-from-scratch-v1},
133
+ license = {MIT}
134
+ }
135
+ ```