	---
	license: mit
	library_name: pytorch
	pipeline_tag: text-to-image
	tags:
	- stable-diffusion
	- latent-diffusion
	- text-to-image
	- from-scratch
	- pytorch
	- diffusion
	- generative-ai
	- rtx-5090
	- blackwell
	- sd-1.x
	language:
	- en
	base_model: []
	datasets:
	- laion/laion2B-en-aesthetic
	- poloclub/diffusiondb
	- JourneyDB/JourneyDB
	---

	# SD-From-Scratch v1 (epoch 42)

	A Stable Diffusion 1.x-class latent diffusion model trained from scratch on 2× RTX 5090 GPUs (Blackwell, 33.7 GB VRAM each) using the [SD_Train.py](https://github.com/atandra2000/StableDiffusion/blob/main/SD_Train.py) pipeline.

	This checkpoint (`sd_epoch_042.pt`) is the visually best of the checkpoints from the full 48-epoch run, which spanned 7 training phases (LAION broad → LAION refined → DiffusionDB/JourneyDB mix → VGGFace2 face fine-tune → COCO full-body → consolidation → final mixed).

	- Architecture: UNet (ch=320, ch_mults=(1,2,4,4), attn_lvls=(1,2,3), heads=8) — ~860M trainable parameters
	- Frozen components: VAE `stabilityai/sd-vae-ft-mse`, text encoder `openai/clip-vit-large-patch14`
	- Latent space: 64×64×4 (8× spatial compression of 512×512 RGB), scale factor `0.18215`
	- Training schedule: DDPM scaled-linear betas (0.00085 → 0.012, 1000 steps), Min-SNR γ=5.0 (γ=3.0 in late refinement)
	- Precision: BF16 native autocast (no GradScaler), gradient checkpointing, Fused AdamW
	- EMA decay: 0.9999 (GPU-resident shadow)
	- Best training loss: 0.0947 (epoch 16, on filtered LAION 213k subset)
	- Released checkpoint: epoch 42 (~12.5 GB) — best visual coherence across portraits, landscapes and scenes
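
The noise schedule and loss weighting listed above can be sketched in a few lines. This is a dependency-free illustration, not code taken from `SD_Train.py`. It assumes that "scaled-linear" means betas spaced linearly in square-root space and then squared, and that the model predicts epsilon, so the Min-SNR weight is min(SNR, γ) / SNR. All function names are hypothetical.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Betas spaced linearly in sqrt-space, then squared (assumed 'scaled-linear')."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def alphas_cumprod(betas):
    """Running product of (1 - beta_t), i.e. the cumulative signal fraction."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def min_snr_weight(abar_t, gamma=5.0):
    """Min-SNR loss weight for epsilon prediction (assumption): min(SNR, gamma) / SNR."""
    snr = abar_t / (1.0 - abar_t)
    return min(snr, gamma) / snr

betas = scaled_linear_betas()
abar = alphas_cumprod(betas)
weights = [min_snr_weight(a, gamma=5.0) for a in abar]

# Low-noise timesteps (high SNR) are down-weighted; high-noise timesteps keep weight 1.
print(f"weight at t=0: {weights[0]:.5f}, weight at t=999: {weights[-1]:.5f}")
```

Lowering γ from 5.0 to 3.0, as the late refinement phases do, caps the weight of low-noise timesteps even further.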

	## Files

	- `sd_epoch_042.pt` — full training checkpoint: UNet `state_dict`, EMA shadow weights, optimizer/scheduler state, epoch/step/best_loss metadata. Load EMA weights at inference for best quality (see usage below).
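
If you load the checkpoint yourself instead of going through the repo's scripts, you need to pick the EMA weights out of the checkpoint dictionary. The sketch below shows one way to do that. The key names (`ema`, `model`, and so on) are guesses for illustration only; inspect the real top-level keys before relying on any of them.

```python
def pick_inference_weights(
    ckpt,
    ema_keys=("ema", "ema_state_dict", "ema_shadow"),   # hypothetical key names
    model_keys=("model", "unet", "state_dict"),         # hypothetical key names
):
    """Return (source_key, weights), preferring an EMA entry over raw weights."""
    for key in tuple(ema_keys) + tuple(model_keys):
        value = ckpt.get(key)
        if isinstance(value, dict) and value:
            return key, value
    raise KeyError(f"no weight dict found; top-level keys: {sorted(ckpt)}")

# Stand-in for the real checkpoint. With PyTorch installed, the dictionary would
# come from something like: ckpt = torch.load("sd_epoch_042.pt", map_location="cpu")
demo_ckpt = {"model": {"w": 1.0}, "ema": {"w": 0.9}, "optimizer": {}, "epoch": 42}

source, weights = pick_inference_weights(demo_ckpt)
print(f"using weights from '{source}'")
```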

	## License

	MIT. See [LICENSE](https://github.com/atandra2000/StableDiffusion/blob/main/LICENSE) in the source repo.

	## Intended use

	Research and educational reproduction of from-scratch latent-diffusion training. Not intended for production deployment without further safety filtering. Not aligned for instruction-following or for any specific style/persona.

	## How to use

	The checkpoint is loaded by the inference scripts in the source repo.

	```bash
	git clone https://github.com/atandra2000/StableDiffusion.git
	cd StableDiffusion
	pip install torch torchvision diffusers transformers huggingface_hub pillow

	# Download the checkpoint from this HF model repo:
	huggingface-cli download atandra2000/sd-from-scratch-v1 sd_epoch_042.pt --local-dir .

	# Generate an image (uses EMA weights internally and removes the DDIM
	# pred_x0 clamp, which would otherwise destroy SD latent signal):
	python inference.py \
	--checkpoint sd_epoch_042.pt \
	--prompt "a beautiful sunset over mountain peaks, cinematic lighting" \
	--steps 50 \
	--guidance 7.5 \
	--output sample.png
	```
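
The comment about the `pred_x0` clamp can be illustrated with a single deterministic DDIM update (eta = 0). This is a minimal sketch on plain Python lists, not the code in `inference.py`. It assumes epsilon prediction, and the latent value 2.5 is a made-up example of a scaled VAE latent whose magnitude exceeds 1.

```python
import math

def ddim_step(x_t, eps, abar_t, abar_prev, clamp_x0=False):
    """One deterministic DDIM update (eta = 0) on a flat list of latent values."""
    sqrt_abar_t = math.sqrt(abar_t)
    sqrt_one_minus_t = math.sqrt(1.0 - abar_t)
    # Predicted clean latent implied by the noise estimate.
    x0_pred = [(x - sqrt_one_minus_t * e) / sqrt_abar_t for x, e in zip(x_t, eps)]
    if clamp_x0:
        # Pixel-space samplers often clip x0 to [-1, 1]; latents are not bounded that way.
        x0_pred = [max(-1.0, min(1.0, v)) for v in x0_pred]
    c0 = math.sqrt(abar_prev)
    c1 = math.sqrt(1.0 - abar_prev)
    return [c0 * x0 + c1 * e for x0, e in zip(x0_pred, eps)]

# Hypothetical clean latent and noise, mixed at abar_t = 0.5.
true_x0 = [2.5, -0.3]
noise = [0.1, -0.2]
s = math.sqrt(0.5)
x_t = [s * x + s * e for x, e in zip(true_x0, noise)]

# With a perfect noise estimate, stepping to abar_prev = 1.0 recovers the clean latent.
recovered = ddim_step(x_t, noise, abar_t=0.5, abar_prev=1.0)
clamped = ddim_step(x_t, noise, abar_t=0.5, abar_prev=1.0, clamp_x0=True)
print(recovered, clamped)
```

In this toy case the clamped run flattens the 2.5 component to 1.0, which is the kind of signal loss the comment in the command above refers to.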

	For portraits and faces, raise the DDIM step count to 100 and the CFG scale to 8.5:

	```bash
	python inference.py \
	--checkpoint sd_epoch_042.pt \
	--prompt "a photorealistic portrait of a woman with blue eyes, soft studio lighting" \
	--steps 100 \
	--guidance 8.5 \
	--output portrait.png
	```

	For negative-prompt CFG (CUDA only), use `SD_ImageGen.py`:

	```bash
	python SD_ImageGen.py \
	--checkpoint sd_epoch_042.pt \
	--prompts "a futuristic city skyline at night with neon lights" \
	--negative "blurry, low quality, distorted, watermark" \
	--steps 50 --guidance 7.5
	```
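
The `--guidance` and `--negative` flags both feed into classifier-free guidance (CFG). The sketch below shows the standard formulation, in which the guided noise estimate extrapolates from a baseline prediction toward the prompt-conditioned one, and a negative prompt simply replaces the empty-prompt baseline. Whether `SD_ImageGen.py` implements it exactly this way is an assumption, and the function name and noise values are hypothetical.

```python
def cfg_combine(eps_cond, eps_base, guidance_scale=7.5):
    """Classifier-free guidance: eps_base + scale * (eps_cond - eps_base).

    eps_cond: noise predicted with the prompt embedding.
    eps_base: noise predicted with the empty-prompt embedding, or with the
              negative-prompt embedding when a negative prompt is given.
    """
    return [b + guidance_scale * (c - b) for c, b in zip(eps_cond, eps_base)]

# Toy noise predictions for a two-element latent (illustrative values only).
eps_prompt = [0.40, -0.10]
eps_negative = [0.10, 0.20]

guided = cfg_combine(eps_prompt, eps_negative, guidance_scale=7.5)
print(guided)
```

A scale of 1.0 reproduces the prompt-conditioned prediction. Larger scales push further away from the baseline, which is why high CFG values can exaggerate features such as eye size.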

	## Training summary

	| Phase | Epochs | Dataset (after filtering) | LR | Notes |
	|-------|--------|---------------------------|----|-------|
	| 1 | 1–10 | LAION-2B-en aesthetic ≥ 6.5 (1.32M) | 1e-5 | Broad pretraining (~3 h/epoch) |
	| 2 | 11–17 | LAION aesthetic ≥ 7.5, CLIP ≥ 0.30 (213k) | 1e-5 | Best loss 0.0947 @ ep 16 |
	| 3 | 18–22 | DiffusionDB + JourneyDB (~482k) | 1e-5 | Synthetic / Midjourney domain mix |
	| 4 | 23–29 | VGGFace2 (51,786) | 2e-6 | Face anatomy fine-tune |
	| 5 | 30–38 | COCO person crops (59,494) | 1.5e-6 | Full-body fine-tune (caused face regression) |
	| 6 | 39–42 | Mixed LAION + VGGFace2 + COCO (250k) | 1e-6 | Released checkpoint (sweet spot) |
	| 7 | 43–48 | Comprehensive mix (572k) | 1e-6 | Final consolidation |
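
For anyone reproducing the run, the phase table can be captured as plain data so that a training script can look up the dataset and learning rate for a given epoch. The sketch below copies the values from the table. The `Phase` class and `phase_for_epoch` helper are hypothetical and do not appear in the source repo.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    number: int
    first_epoch: int
    last_epoch: int
    dataset: str
    lr: float

# Values transcribed from the training summary table.
PHASES = [
    Phase(1, 1, 10, "LAION-2B-en aesthetic >= 6.5 (1.32M)", 1e-5),
    Phase(2, 11, 17, "LAION aesthetic >= 7.5, CLIP >= 0.30 (213k)", 1e-5),
    Phase(3, 18, 22, "DiffusionDB + JourneyDB (~482k)", 1e-5),
    Phase(4, 23, 29, "VGGFace2 (51,786)", 2e-6),
    Phase(5, 30, 38, "COCO person crops (59,494)", 1.5e-6),
    Phase(6, 39, 42, "Mixed LAION + VGGFace2 + COCO (250k)", 1e-6),
    Phase(7, 43, 48, "Comprehensive mix (572k)", 1e-6),
]

def phase_for_epoch(epoch):
    """Return the Phase whose epoch range contains `epoch` (1-indexed)."""
    for phase in PHASES:
        if phase.first_epoch <= epoch <= phase.last_epoch:
            return phase
    raise ValueError(f"epoch {epoch} is outside the 1-48 schedule")

released = phase_for_epoch(42)
total_epochs = sum(p.last_epoch - p.first_epoch + 1 for p in PHASES)
print(f"epoch 42 -> phase {released.number}, lr={released.lr}, total epochs={total_epochs}")
```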

	Detailed write-up and engineering lessons:
	- 📘 [`blog_post.md`](https://github.com/atandra2000/StableDiffusion/blob/main/blog_post.md) — full technical Medium-style write-up
	- 📋 [`summary.md`](https://github.com/atandra2000/StableDiffusion/blob/main/summary.md) — authoritative engineering reference

	## Known limitations

	- Faces are usable, but eyes are still slightly oversized at higher CFG
	- Full-body anatomy: left-arm geometry is occasionally weak
	- Animal faces are the weakest category in current evaluation
	- Food prompts confuse categories (pasta ↔ noodles, etc.)
	- This is an SD 1.x-class model — no SDXL-grade detail, no controllable spatial conditioning (use ControlNet downstream if needed)

	## Reproducibility

	All training code, data-pipeline scripts (LAION metadata download / filtering / WebDataset tar shards / VAE latent pre-encoding), training loop, and inference scripts are open-source at:

	> 🔗 https://github.com/atandra2000/StableDiffusion

	## Citation

	```bibtex
	@software{atandra2000_sd_from_scratch_v1,
	author = {Bharati, Atandra},
	title = {SD-From-Scratch v1: A Stable-Diffusion-class latent diffusion model trained from scratch on dual RTX 5090s},
	year = {2026},
	url = {https://huggingface.co/atandra2000/sd-from-scratch-v1},
	license = {MIT}
	}
	```