---
license: mit
library_name: pytorch
pipeline_tag: text-to-image
tags:
- stable-diffusion
- latent-diffusion
- text-to-image
- from-scratch
- pytorch
- diffusion
- generative-ai
- rtx-5090
- blackwell
- sd-1.x
language:
- en
base_model: []
datasets:
- laion/laion2B-en-aesthetic
- poloclub/diffusiondb
- JourneyDB/JourneyDB
---

# SD-From-Scratch v1 (epoch 42)

A **Stable Diffusion 1.x-class** latent diffusion model **trained from scratch** on **2× RTX 5090 (Blackwell, 33.7 GB VRAM each)** using the [SD_Train.py](https://github.com/atandra2000/StableDiffusion/blob/main/SD_Train.py) pipeline.

This checkpoint (`sd_epoch_042.pt`) is the visually best checkpoint from a run of **48 epochs across 7 training phases** (LAION broad → LAION refined → DiffusionDB/JourneyDB mix → VGGFace2 face fine-tune → COCO full-body → consolidation → final mixed).

- **Architecture:** UNet (ch=320, ch_mults=(1,2,4,4), attn_lvls=(1,2,3), heads=8), ~860M trainable parameters
- **Frozen components:** VAE `stabilityai/sd-vae-ft-mse`, text encoder `openai/clip-vit-large-patch14`
- **Latent space:** 64×64×4 (8× spatial compression of 512×512 RGB), scale factor `0.18215`
- **Training schedule:** DDPM scaled-linear betas (0.00085 → 0.012, 1000 steps), Min-SNR γ=5.0 (γ=3.0 in late refinement)
- **Precision:** BF16 native autocast (no GradScaler), gradient checkpointing, fused AdamW
- **EMA decay:** 0.9999 (GPU-resident shadow)
- **Best training loss:** 0.0947 (epoch 16, on filtered LAION 213k subset)
- **Released checkpoint:** epoch 42 (~12.5 GB), chosen for the best visual coherence across portraits, landscapes, and scenes

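
The training-schedule numbers above can be made concrete with a small standard-library sketch. It assumes the usual "scaled-linear" definition (betas linear in `sqrt(beta)`, then squared) and the Min-SNR weight for a noise-prediction (epsilon) objective, `min(SNR, γ) / SNR`. Whether `SD_Train.py` computes them exactly this way is an assumption, and the function names here are illustrative only.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Betas spaced linearly in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def alphas_cumprod(betas):
    """Running product of (1 - beta_t), i.e. alpha-bar at each timestep."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def min_snr_weight(alpha_bar, gamma=5.0):
    """Min-SNR loss weight, assuming an epsilon-prediction objective."""
    snr = alpha_bar / (1.0 - alpha_bar)
    return min(snr, gamma) / snr

betas = scaled_linear_betas()
abar = alphas_cumprod(betas)

# Low-noise steps (high SNR) are down-weighted; noisy steps keep weight 1.
for t in (0, 250, 500, 999):
    print(t, round(abar[t], 4), round(min_snr_weight(abar[t]), 4))
```

Lowering γ from 5.0 to 3.0, as in the late refinement phases, caps the weight at a smaller SNR and so shifts more of the loss toward noisier timesteps.
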
## Files

- `sd_epoch_042.pt`: full training checkpoint containing the UNet `state_dict`, EMA shadow weights, optimizer/scheduler state, and epoch/step/best_loss metadata. **Load EMA weights at inference for best quality** (see usage below).

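
To illustrate why the EMA shadow is the set of weights to load, here is a minimal sketch of the standard EMA update with decay 0.9999, plus a helper that prefers the shadow over the raw weights. The key names `"ema"` and `"unet"` are hypothetical, since the checkpoint's actual layout is defined by the source repo, and plain floats stand in for tensors.

```python
def ema_update(shadow, params, decay=0.9999):
    """One EMA step: shadow = decay * shadow + (1 - decay) * param."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

def select_inference_weights(checkpoint, ema_key="ema", raw_key="unet"):
    """Prefer EMA shadow weights; fall back to the raw UNet weights.

    `ema_key` and `raw_key` are assumed names, not confirmed by the repo.
    """
    if checkpoint.get(ema_key):
        return checkpoint[ema_key]
    return checkpoint[raw_key]

# Toy run: the shadow trails a parameter that jumped from 0 to 1.
shadow = {"w": 0.0}
for _ in range(10_000):
    ema_update(shadow, {"w": 1.0})
print(shadow["w"])  # close to 1 - 0.9999**10000, roughly 0.632

toy_checkpoint = {"unet": {"w": 1.0}, "ema": shadow}
print(select_inference_weights(toy_checkpoint) is shadow)  # True
```

With decay 0.9999 the shadow averages over roughly the last 10,000 optimizer steps, which smooths out step-to-step noise in the raw weights.
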
## License

MIT. See [LICENSE](https://github.com/atandra2000/StableDiffusion/blob/main/LICENSE) in the source repo.

## Intended use

Research and educational reproduction of from-scratch latent-diffusion training. Not intended for production deployment without further safety filtering. Not aligned for instruction-following or for any specific style/persona.

## How to use

The inference scripts in the source repo load the checkpoint.

```bash
git clone https://github.com/atandra2000/StableDiffusion.git
cd StableDiffusion
pip install torch torchvision diffusers transformers huggingface_hub pillow

# Download the checkpoint from this HF model repo:
huggingface-cli download atandra2000/sd-from-scratch-v1 sd_epoch_042.pt --local-dir .

# Generate an image (uses EMA weights internally and removes the DDIM
# pred_x0 clamp, which would otherwise destroy SD latent signal):
python inference.py \
  --checkpoint sd_epoch_042.pt \
  --prompt "a beautiful sunset over mountain peaks, cinematic lighting" \
  --steps 50 \
  --guidance 7.5 \
  --output sample.png
```

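
The comment in the command above mentions removing the DDIM `pred_x0` clamp. The sketch below shows one deterministic DDIM update (eta = 0) on a flat list of values, with the clamp as an optional switch. It is an illustration of the standard update, not the code in `inference.py`; the toy latent values are made up, and the underlying assumption is that scaled SD latents regularly fall outside [-1, 1].

```python
import math

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, clamp_x0=False):
    """One deterministic DDIM update (eta = 0) on a flat list of latent values."""
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - alpha_bar_t)
    x_prev = []
    for x, e in zip(x_t, eps):
        # Predicted clean latent from the noisy latent and the predicted noise.
        x0 = (x - sqrt_one_minus_ab_t * e) / sqrt_ab_t
        if clamp_x0:
            # A [-1, 1] clamp suits pixel-space models, not unbounded latents.
            x0 = max(-1.0, min(1.0, x0))
        x_prev.append(
            math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
        )
    return x_prev

# Toy latents outside [-1, 1], noised to alpha_bar = 0.5 with known noise.
x0_true = [2.5, -3.0]
eps = [0.1, -0.2]
ab_t = 0.5
x_t = [math.sqrt(ab_t) * x + math.sqrt(1.0 - ab_t) * e for x, e in zip(x0_true, eps)]

print(ddim_step(x_t, eps, ab_t, 1.0))                 # values close to x0_true
print(ddim_step(x_t, eps, ab_t, 1.0, clamp_x0=True))  # [1.0, -1.0]
```

With the clamp on, both toy values collapse to the boundary, which is the kind of signal loss the comment warns about.
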
For portraits and faces, raise DDIM steps to 100 and the CFG scale to 8.5:

```bash
python inference.py \
  --checkpoint sd_epoch_042.pt \
  --prompt "a photorealistic portrait of a woman with blue eyes, soft studio lighting" \
  --steps 100 \
  --guidance 8.5 \
  --output portrait.png
```

For negative-prompt CFG (CUDA only), use `SD_ImageGen.py`:

```bash
python SD_ImageGen.py \
  --checkpoint sd_epoch_042.pt \
  --prompts "a futuristic city skyline at night with neon lights" \
  --negative "blurry, low quality, distorted, watermark" \
  --steps 50 --guidance 7.5
```

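
For readers unfamiliar with how a negative prompt enters classifier-free guidance, the usual formulation is sketched below: the noise prediction for the negative prompt takes the place of the unconditional (empty-prompt) prediction. This is the textbook combination rule, assumed rather than copied from `SD_ImageGen.py`, and the function name is illustrative.

```python
def cfg_combine(eps_cond, eps_base, guidance=7.5):
    """Classifier-free guidance: base + guidance * (cond - base), element-wise.

    `eps_base` is the prediction for the empty prompt, or for the negative
    prompt when one is supplied.
    """
    return [b + guidance * (c - b) for c, b in zip(eps_cond, eps_base)]

eps_prompt = [1.0, 2.0]     # toy prediction conditioned on the prompt
eps_negative = [0.5, 1.0]   # toy prediction conditioned on the negative prompt

print(cfg_combine(eps_prompt, eps_negative, guidance=1.0))  # [1.0, 2.0]
print(cfg_combine(eps_prompt, eps_negative, guidance=7.5))  # [4.25, 8.5]
```

A guidance of 1.0 returns the prompt-conditioned prediction unchanged. Larger values push the result further away from whatever `eps_base` describes, which is why listing artifacts in the negative prompt steers the sample away from them.
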
## Training summary

| Phase | Epochs | Dataset (after filtering) | LR | Notes |
|-------|--------|---------------------------|----|-------|
| 1 | 1–10 | LAION-2B-en aesthetic ≥ 6.5 (1.32M) | 1e-5 | Broad pretraining (~3 h/epoch) |
| 2 | 11–17 | LAION aesthetic ≥ 7.5, CLIP ≥ 0.30 (213k) | 1e-5 | Best loss 0.0947 @ ep 16 |
| 3 | 18–22 | DiffusionDB + JourneyDB (~482k) | 1e-5 | Synthetic / Midjourney domain mix |
| 4 | 23–29 | VGGFace2 (51,786) | 2e-6 | Face anatomy fine-tune |
| 5 | 30–38 | COCO person crops (59,494) | 1.5e-6 | Full-body fine-tune (caused face regression) |
| 6 | 39–42 | Mixed LAION + VGGFace2 + COCO (250k) | 1e-6 | **Released checkpoint (sweet spot)** |
| 7 | 43–48 | Comprehensive mix (572k) | 1e-6 | Final consolidation |

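
The table can also be read as a small config. The sketch below encodes it as data and derives the epoch total and a rough count of sample passes. It assumes each epoch is one full pass over the listed dataset size; the sizes are the rounded figures from the table (phase 3 is approximate), so the sample total is an estimate, not a logged statistic. `PHASES` and `phase_for_epoch` are illustrative names.

```python
PHASES = [
    # (phase, first_epoch, last_epoch, samples, learning_rate)
    (1, 1, 10, 1_320_000, 1e-5),
    (2, 11, 17, 213_000, 1e-5),
    (3, 18, 22, 482_000, 1e-5),   # approximate size in the table
    (4, 23, 29, 51_786, 2e-6),
    (5, 30, 38, 59_494, 1.5e-6),
    (6, 39, 42, 250_000, 1e-6),
    (7, 43, 48, 572_000, 1e-6),
]

def phase_for_epoch(epoch):
    """Return the phase number whose epoch range contains `epoch`."""
    for phase, first, last, _, _ in PHASES:
        if first <= epoch <= last:
            return phase
    raise ValueError(f"epoch {epoch} is outside the 48-epoch run")

total_epochs = sum(last - first + 1 for _, first, last, _, _ in PHASES)
sample_passes = sum((last - first + 1) * n for _, first, last, n, _ in PHASES)

print(total_epochs)         # 48
print(sample_passes)        # 22430948, i.e. roughly 22.4M under the assumption above
print(phase_for_epoch(42))  # 6, the phase of the released checkpoint
```
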
Detailed write-up and engineering lessons:

- [`blog_post.md`](https://github.com/atandra2000/StableDiffusion/blob/main/blog_post.md): full technical write-up in Medium style
- [`summary.md`](https://github.com/atandra2000/StableDiffusion/blob/main/summary.md): authoritative engineering reference

## Known limitations

- Faces are usable, but eyes are still slightly oversized at higher CFG
- Full-body anatomy: left-arm geometry is occasionally weak
- Animal faces are the weakest category in current evaluation
- Food prompts confuse categories (pasta vs. noodles, etc.)
- This is an **SD 1.x-class** model: no SDXL-grade detail and no controllable spatial conditioning (use ControlNet downstream if needed)

## Reproducibility

The training loop, data-pipeline scripts (LAION metadata download, filtering, WebDataset tar shards, VAE latent pre-encoding), and inference scripts are all open source at:

> **https://github.com/atandra2000/StableDiffusion**

## Citation

```bibtex
@software{atandra2000_sd_from_scratch_v1,
  author  = {Bharati, Atandra},
  title   = {SD-From-Scratch v1: A Stable-Diffusion-class latent diffusion model trained from scratch on dual RTX 5090s},
  year    = {2026},
  url     = {https://huggingface.co/atandra2000/sd-from-scratch-v1},
  license = {MIT}
}
```