Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,31 @@
---
license: apache-2.0
pipeline_tag: text-to-image
library_name: safetensors
tags: [hobbylm, text-to-image, diffusion, dit, flow-matching]
---

# HobbyLM-Image (1024px text-to-image DiT)

An in-context latent **flow-matching DiT** that generates 1024×1024 images, trained on a budget of roughly $300.
It operates in the **DC-AE f32c32 (SANA-1.1)** latent space and is conditioned on **CLIP-L** text features.

## Components (frozen, not included)

- VAE: `mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers` (32× spatial compression, so a 1024px image maps to a 32×32×32 latent).
- Text encoder: `openai/clip-vit-large-patch14`.

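The latent shape quoted above follows from the autoencoder's name: `f32` is the spatial downsampling factor and `c32` is the number of latent channels. A minimal sketch of that arithmetic, where the helper `latent_shape` is hypothetical and not part of this repository:

```python
def latent_shape(image_size: int, factor: int = 32, channels: int = 32) -> tuple[int, int, int]:
    """Return (channels, height, width) of the latent for a square image.

    Defaults mirror the DC-AE f32c32 naming: 32x spatial compression, 32 channels.
    """
    if image_size % factor != 0:
        raise ValueError(f"image_size must be a multiple of {factor}")
    side = image_size // factor
    return (channels, side, side)


print(latent_shape(1024))  # (32, 32, 32)
```
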
## Files
- `model.safetensors`: the DiT weights.
- `config.json`: the DiT config, plus `lat_std` and the VAE `scaling_factor`.

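As a rough sketch of how the two normalization constants might be read from `config.json`, the snippet below parses a config and extracts them. The key names `lat_std` and `scaling_factor` come from the file list above, but their exact position in the real file is an assumption, and the sample values are placeholders rather than the shipped ones.

```python
import json

# Placeholder config text; the real config.json also holds the DiT
# hyperparameters, and its actual values will differ.
SAMPLE_CONFIG = '{"lat_std": 1.0, "scaling_factor": 0.5}'


def read_latent_stats(config_text: str) -> tuple[float, float]:
    """Return (lat_std, scaling_factor), assuming both are top-level keys."""
    cfg = json.loads(config_text)
    missing = [k for k in ("lat_std", "scaling_factor") if k not in cfg]
    if missing:
        raise KeyError(f"config is missing expected keys: {missing}")
    return float(cfg["lat_std"]), float(cfg["scaling_factor"])


lat_std, scaling_factor = read_latent_stats(SAMPLE_CONFIG)
```

With the real file, pass the contents of `config.json` instead of `SAMPLE_CONFIG`.
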
## Pipeline (sketch)
Encode the text prompt with CLIP-L → start from Gaussian latent noise → run the DiT's rectified-flow sampler with classifier-free guidance (CFG) for ~100 steps → decode the latent with the DC-AE VAE → 1024px image.
No GGUF export is provided, because image-generation DiTs have no standard GGUF runtime.

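The sampling step in the pipeline can be illustrated with a plain Euler integrator and classifier-free guidance. This is a minimal sketch under stated assumptions, not this repository's sampler: the time convention (t = 1 is noise, t = 0 is data, velocity = noise - data), the guidance scale, the uniform step schedule, and the `velocity_fn` signature are all hypothetical. A toy velocity function stands in for the DiT so the snippet runs without weights.

```python
import random
from typing import Callable, List, Optional

Latent = List[float]  # flattened latent; the real one is 32x32x32
# Hypothetical signature: velocity_fn(x, t, cond) -> predicted velocity.
VelocityFn = Callable[[Latent, float, Optional[str]], Latent]


def sample(velocity_fn: VelocityFn, prompt: str, size: int,
           steps: int = 100, cfg_scale: float = 4.0, seed: int = 0) -> Latent:
    """Euler-integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data) with CFG."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # Gaussian latent noise
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v_cond = velocity_fn(x, t, prompt)   # text-conditioned prediction
        v_uncond = velocity_fn(x, t, None)   # unconditional prediction
        # Classifier-free guidance: extrapolate away from the unconditional branch.
        v = [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        # Step toward t=0; velocity points from data to noise, so subtract.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Latent, t: float, cond: Optional[str]) -> Latent:
    """Stand-in for the DiT: the exact velocity when the data is a single point.

    With x_t = (1 - t) * x0 + t * noise, velocity = noise - x0 = (x_t - x0) / t.
    """
    target = 1.0 if cond is not None else 0.0
    return [(xi - target) / t for xi in x]


latent = sample(toy_velocity, "a red bicycle", size=8, steps=100, cfg_scale=1.0)
```

In a real run, `velocity_fn` would wrap the DiT with CLIP-L features as the condition, and the returned latent would be decoded by the DC-AE VAE.
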
## Capabilities
Outputs are watermark-free, with accurate objects, cinematic scenes, and usable single-person portraits.
Hands and multi-person scenes come out soft, which is the ceiling for a model this small. Editing is available in a sibling 512px checkpoint.

## License
Apache-2.0.