rootxhacker/HobbyLM-Image

+---
+license: apache-2.0
+pipeline_tag: text-to-image
+library_name: safetensors
+tags: [hobbylm, text-to-image, diffusion, dit, flow-matching]
+---
+# HobbyLM-Image (1024px text-to-image DiT)
+An in-context latent **flow-matching DiT** that generates 1024×1024 images, trained on a $300-class budget.
+It operates in the **DC-AE f32c32 (SANA-1.1)** latent space and is conditioned on **CLIP-L** text features.
+## Components (frozen, not included)
+- VAE: `mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers` (32× spatial compression → 32×32×32 latent at 1024px).
+- Text encoder: `openai/clip-vit-large-patch14`.
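The latent shape quoted above follows from the VAE name: `f32` is the spatial downsampling factor and `c32` the number of latent channels. A minimal sketch of that arithmetic is below; the helper name `latent_shape` and the channel-first ordering are illustrative assumptions, not part of this repository.

```python
def latent_shape(image_size: int, spatial_factor: int = 32, latent_channels: int = 32):
    """Return (channels, height, width) of the latent for a square image.

    Assumes a channel-first layout; the helper itself is hypothetical.
    """
    if image_size % spatial_factor != 0:
        raise ValueError("image_size must be a multiple of the spatial factor")
    side = image_size // spatial_factor  # 1024 // 32 == 32
    return (latent_channels, side, side)


channels, height, width = latent_shape(1024)
values_per_image = channels * height * width  # number of latent values the DiT predicts
```

At 1024px the DiT therefore works on a 32×32 grid of 32-channel latent vectors rather than on roughly a million pixels.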
+## Files
+- `model.safetensors`: the DiT weights.
+- `config.json`: the DiT config, plus `lat_std` and the VAE `scaling_factor`.
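As a rough illustration of reading the two latent statistics named above, the sketch below parses a `config.json`-style file with the standard library. It assumes `lat_std` and `scaling_factor` are top-level keys, which this card does not confirm; the function name `read_latent_stats` is hypothetical, and how the sampler applies the values is not specified here.

```python
import json
from pathlib import Path


def read_latent_stats(config_path):
    """Return (lat_std, scaling_factor) from a config.json-style file.

    Assumption: both values are top-level keys; the real file may nest them.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config["lat_std"], config["scaling_factor"]
```

The weights file can be opened separately with the `safetensors` library once the DiT module definition is available.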
+## Pipeline (sketch)
+1. Encode the text prompt with CLIP-L.
+2. Start from Gaussian noise in the latent space.
+3. Run the DiT's rectified-flow sampler with classifier-free guidance (CFG) for ~100 steps.
+4. Decode the final latent with the DC-AE VAE to obtain the 1024px image.
+
+No GGUF export is provided: image-generation DiTs have no standard GGUF runtime.
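The sampling stage of the pipeline can be sketched as a plain Euler integration of a velocity field with classifier-free guidance. This is a generic illustration under stated assumptions, not the repository's sampler: the `velocity(x, t, cond)` signature is hypothetical, time is assumed to run from noise at t=0 to data at t=1 (some implementations reverse this), and the default `guidance_scale` is a placeholder. Text encoding and VAE decoding are omitted, and a toy velocity field stands in for the DiT.

```python
import random
from typing import Callable, List, Optional

Latent = List[float]
# Hypothetical model signature: velocity(x, t, cond) -> list the same length as x.
VelocityFn = Callable[[Latent, float, Optional[object]], Latent]


def sample_rectified_flow_cfg(
    velocity: VelocityFn,
    cond: object,
    num_values: int,
    steps: int = 100,
    guidance_scale: float = 4.0,  # placeholder value, not taken from the model card
    seed: int = 0,
) -> Latent:
    """Euler-integrate a velocity field from noise (t=0) to data (t=1) with CFG."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_values)]  # Gaussian latent noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity(x, t, cond)    # text-conditioned prediction
        v_uncond = velocity(x, t, None)  # unconditional prediction
        # Classifier-free guidance: extrapolate from unconditional toward conditional.
        v = [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x


def toy_velocity(x: Latent, t: float, cond: Optional[object]) -> Latent:
    """Stand-in for the DiT: straight-line flow toward 1.0 (cond) or 0.0 (uncond)."""
    target = 1.0 if cond is not None else 0.0
    return [(target - xi) / (1.0 - t) for xi in x]


toy_latent = sample_rectified_flow_cfg(toy_velocity, "a prompt", num_values=4, guidance_scale=1.0)
```

In a real pipeline, `velocity` would wrap the DiT forward pass on a 32×32×32 latent, and each step costs two forward passes (conditional and unconditional) unless they are batched together.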
+## Capabilities
+Outputs are watermark-free, with accurate objects, cinematic scenes, and usable single-person portraits.
+Hands and multi-person scenes come out soft, which reflects the ceiling of a small model.
+Editing is available in a sibling 512px checkpoint.
+## License
+Apache-2.0.