---
license: apache-2.0
pipeline_tag: text-to-image
library_name: safetensors
tags: [hobbylm, text-to-image, diffusion, dit, flow-matching]
---

# HobbyLM-Image: 1024px text-to-image DiT

The odd one out in the HobbyLM family: not a language model, but a **333M in-context flow-matching DiT** that generates 1024×1024 images. It was built to see how good a text-to-image model you can train on a genuinely small budget. The whole thing came together for roughly **$300 of Modal GPU time** by working in a heavily compressed latent space instead of pixels.

It runs in the **DC-AE f32c32 (SANA-1.1)** latent (32× spatial compression → a 32×32×32 latent at 1024px) and is conditioned on **CLIP-L** text features, with classifier-free guidance.

## Intended use

Text-to-image generation at 1024×1024. Strongest on single objects and cinematic scenes. A sibling 512px checkpoint additionally does instruction-based image editing.

## How it works

```
CLIP-L(prompt) ─┐
                ├─► DiT ──(rectified-flow / CFG sampler, ~100 steps)──► latent ──► DC-AE decode ──► 1024² image
Gaussian noise ─┘   (this repo)                                                    (frozen VAE)
```

The two frozen components are **not** included (download them from their own repos): `mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers` (VAE) and `openai/clip-vit-large-patch14` (text encoder).

A full from-scratch CPU implementation of this pipeline (CLIP + DiT + DC-AE, in Rust) lives in [`hobby-rs`](https://github.com/harishsg993010/HobbyLM).

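The sampler stage of the pipeline above can be sketched roughly as follows. This is a minimal NumPy illustration, not the code shipped in this repo. It assumes a common rectified-flow convention (`t = 1` is pure noise, `t = 0` is the clean latent, integrated with plain Euler steps) and the standard classifier-free guidance mix. All function names are hypothetical, and `toy_velocity` is a stand-in for the DiT so the loop can run without weights.

```python
import numpy as np

# 1024 px / 32x spatial compression = 32, with 32 latent channels.
LATENT_SHAPE = (32, 32, 32)


def cfg_velocity(velocity_fn, x, t, cond, uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    v_cond = velocity_fn(x, t, cond)
    v_uncond = velocity_fn(x, t, uncond)
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def sample(velocity_fn, cond, uncond, steps=100, guidance_scale=5.0, seed=0):
    """Euler integration of the flow ODE from t=1 (noise) to t=0 (latent).

    Assumed convention: x_t = (1 - t) * x0 + t * noise, so the model
    predicts the velocity v = noise - x0.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(LATENT_SHAPE)
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = cfg_velocity(velocity_fn, x, t_cur, cond, uncond, guidance_scale)
        x = x + (t_next - t_cur) * v  # t_next < t_cur, so this steps toward t=0
    return x


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the DiT: the exact velocity when the data
    distribution is a single fixed latent `target`. In the real pipeline
    the third argument would be CLIP-L text features instead."""
    return (x - target) / t


cond_target = np.full(LATENT_SHAPE, 0.5)
uncond_target = np.zeros(LATENT_SHAPE)
latent = sample(toy_velocity, cond_target, uncond_target, guidance_scale=1.0)
```

With `guidance_scale=1.0` the toy run lands on the conditional target; larger scales push the result further from the unconditional one. In the real pipeline the returned latent would then be passed to the frozen DC-AE decoder.
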
## Samples

1024×1024, generated by this model (CFG ≈ 5, ~100 steps):

![HobbyLM-Image scene samples](sample_scenes.png)

## Results

This is a hobby-scale generator, so the honest "benchmark" is the training curve and qualitative behaviour rather than FID / GenEval (which we did not compute):

| Property | Value |
|---|---|
| Flow-matching loss (final) | **0.76** (lowest of the model lineage; still decreasing) |
| Parameters | 333M (DiT only) |
| Resolution | 1024×1024 (32×32×32 latent) |
| VAE reconstruction | ~26 dB PSNR @512px; sharper at 1024px (32×32 latent) |

Qualitatively, the final checkpoint produces accurate objects and cinematic scenes. It is **soft on people, hands, and multi-person scenes**, which is the real small-model / latent-resolution ceiling. Loss was still dropping at the end of training, so the 333M DiT is not yet saturated.

## Files

- `model.safetensors`: the DiT weights.
- `config.json`: DiT config, `lat_std`, and the VAE `scaling_factor`.

There is no GGUF build: image-generation DiTs have no standard GGUF runtime.

## Limitations

- Hands and multi-person scenes are unreliable.
- Fine object crispness is capped by the 32× DC-AE latent; a less-compressed VAE would sharpen it at higher cost.
- Instruction-based **editing** is limited (the CLIP-L text encoder is a weak instruction follower); the real fix is a stronger conditioner, which is future work.

## License

Apache-2.0.