| --- |
| license: apache-2.0 |
| pipeline_tag: text-to-image |
| library_name: safetensors |
| tags: [hobbylm, text-to-image, diffusion, dit, flow-matching] |
| --- |
| |
| # HobbyLM-Image: 1024px text-to-image DiT |
|
|
| The odd one out in the HobbyLM family: not a language model, but a **333M in-context flow-matching DiT** that |
| generates 1024×1024 images. It was built to see how good a text-to-image model you can train on a genuinely |
| small budget: the whole thing came together for roughly **$300 of Modal GPU time** by working in a heavily |
| compressed latent space instead of pixels. |
|
|
| It runs in the **DC-AE f32c32 (SANA-1.1)** latent (32× spatial compression, so a 32×32×32 latent at 1024px) and |
| is conditioned on **CLIP-L** text features, with classifier-free guidance. |
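
As a quick check of the latent arithmetic above, the sketch below computes the latent shape and the element-count reduction for a given image size. The function names and the channels-first ordering are illustrative assumptions, not part of this repo; only the 32× spatial factor and the 32 latent channels come from the DC-AE f32c32 description.

```python
def latent_shape(height, width, spatial_factor=32, channels=32):
    """Return (channels, h, w) of the latent for an image of the given size.

    The channels-first ordering is an assumption made for illustration.
    """
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("image sides must be multiples of the spatial factor")
    return (channels, height // spatial_factor, width // spatial_factor)


def compression_ratio(height, width, spatial_factor=32, channels=32, image_channels=3):
    """Ratio of RGB pixel values to latent values."""
    c, h, w = latent_shape(height, width, spatial_factor, channels)
    return (height * width * image_channels) / (c * h * w)


print(latent_shape(1024, 1024))       # (32, 32, 32)
print(compression_ratio(1024, 1024))  # 96.0
```

Under these assumptions the DiT processes 96× fewer values than it would in pixel space, which is where most of the training-budget saving comes from.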
|
|
| ## Intended use |
|
|
| Text-to-image generation at 1024×1024. Strongest on single objects and cinematic scenes. A sibling 512px |
| checkpoint additionally does instruction-based image editing. |
|
|
| ## How it works |
|
|
| ``` |
| CLIP-L(prompt) ─┐ |
|                 ├─► DiT ──(rectified-flow / CFG sampler, ~100 steps)──► latent ──► DC-AE decode ──► 1024² image |
| Gaussian noise ─┘   (this repo)                                                    (frozen VAE) |
| ``` |
|
|
| The two frozen components are **not** included (download them from their own repos): |
| `mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers` (VAE) and `openai/clip-vit-large-patch14` (text encoder). |
| A full from-scratch CPU implementation of this pipeline (CLIP + DiT + DC-AE, in Rust) lives in |
| [`hobby-rs`](https://github.com/harishsg993010/HobbyLM). |
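
The sampling stage in the diagram can be sketched as a plain Euler integration of the learned velocity field with classifier-free guidance. This is a minimal illustration and not the repo's sampler: the time convention (t = 1 is noise, t = 0 is data), the uniform step schedule, and the names `sample`, `cfg_velocity`, and `toy_velocity` are all assumptions, and the toy velocity function merely stands in for the DiT.

```python
import random


def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction towards the conditional one."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample(velocity_fn, noise, steps=100, cfg_scale=5.0):
    """Euler-integrate dx/dt = v(x, t) from t=1 (noise) down to t=0.

    velocity_fn(x, t, conditional) is a hypothetical stand-in for the DiT,
    called once with and once without the text condition per step.
    """
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v_c = velocity_fn(x, t, True)
        v_u = velocity_fn(x, t, False)
        v = cfg_velocity(v_c, v_u, cfg_scale)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, conditional):
    """Exact velocity for a 'dataset' that is a single point:
    1.0 when conditioned, 0.0 when unconditioned (illustration only)."""
    target = 1.0 if conditional else 0.0
    return [(xi - target) / t for xi in x]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
latent = sample(toy_velocity, noise, steps=100, cfg_scale=1.0)
print(latent)
```

With the toy field and `cfg_scale=1.0` the result lands on the conditional target; larger scales push the sample past it, which is the extrapolation CFG performs. In the real pipeline `x` would be the 32×32×32 latent tensor and the output would go to the DC-AE decoder.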
|
|
| ## Samples |
|
|
| 1024×1024, generated by this model (CFG ≈ 5, ~100 steps): |
|
|
|  |
|
|
| ## Results |
|
|
| This is a hobby-scale generator, so the honest "benchmark" is the training curve and qualitative behaviour |
| rather than FID / GenEval (which we did not compute): |
|
|
| | Property | Value | |
| |---|---| |
| Flow-matching loss (final) | **0.76** (lowest of the model lineage; still decreasing) | |
| Parameters | 333M (DiT only) | |
| Resolution | 1024×1024 (32×32×32 latent) | |
| VAE reconstruction | ~26 dB PSNR @512px; sharper at 1024px (32×32 latent) | |
|
|
| Qualitatively, the final checkpoint produces accurate objects and cinematic scenes. It is **soft on people, |
| hands, and multi-person scenes**, which is the real small-model / latent-resolution ceiling. Loss was still |
| dropping at the end of training, so the 333M DiT is not yet saturated. |
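
For readers unfamiliar with the metric, the sketch below shows one common form of the flow-matching (rectified-flow) objective: the mean squared error between the predicted velocity and the straight-line velocity from data to noise. The card does not state the exact time convention, timestep sampling, or loss weighting used in training, so treat the parameterization and the names `flow_matching_loss` and `predict_velocity` as assumptions.

```python
import random


def flow_matching_loss(predict_velocity, x0, rng):
    """Single-sample rectified-flow loss for one latent x0 (a flat list).

    Assumed convention: x_t = (1 - t) * x0 + t * noise, target = noise - x0.
    predict_velocity(x_t, t) is a hypothetical stand-in for the DiT.
    """
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]
    target = [e - a for a, e in zip(x0, noise)]
    pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x0)


def zero_predictor(x_t, t):
    """Baseline that always predicts zero velocity."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
x0 = [0.5] * 8  # toy "latent"
print(flow_matching_loss(zero_predictor, x0, rng))
```

Because the target contains fresh Gaussian noise that the model cannot fully infer from `x_t`, this loss does not approach zero on real data, so absolute values such as 0.76 are mainly meaningful relative to other runs with the same setup.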
|
|
| ## Files |
|
|
| - `model.safetensors`: the DiT weights. |
| - `config.json`: DiT config, `lat_std`, and the VAE `scaling_factor`. |
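
To sanity-check a downloaded checkpoint without loading any weights, the sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, and shapes) and sums the element counts. The helper name `count_parameters` and the local path in the comment are hypothetical; the tensor names inside this repo's file are not assumed.

```python
import json
import math
import struct


def count_parameters(path):
    """Sum the element counts of all tensors listed in a safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        total += math.prod(info["shape"])
    return total


# Example, assuming the file has been downloaded next to this script:
# print(count_parameters("model.safetensors"))
```

If the file matches this card, the result should be in the neighbourhood of 333M.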
|
|
| There is no GGUF build: image-generation DiTs have no standard GGUF runtime. |
|
|
| ## Limitations |
|
|
| - Hands and multi-person scenes are unreliable. |
| - Fine object crispness is capped by the 32× DC-AE latent; a less-compressed VAE would sharpen it at higher cost. |
| - Instruction-based **editing** is limited (the CLIP-L text encoder is a weak instruction follower); the real |
| fix is a stronger conditioner, which is future work. |
|
|
| ## License |
|
|
| Apache-2.0. |
|
|