rootxhacker/HobbyLM-Image

	---
	license: apache-2.0
	pipeline_tag: text-to-image
	library_name: safetensors
	tags: [hobbylm, text-to-image, diffusion, dit, flow-matching]
	---

	# HobbyLM-Image — 1024px text-to-image DiT

	The odd one out in the HobbyLM family: not a language model, but a 333M-parameter in-context flow-matching
	DiT that generates 1024×1024 images. It was built to see how good a text-to-image model can be trained on a
	genuinely small budget: the whole thing came together for roughly $300 of Modal GPU time by working in a
	heavily compressed latent space instead of pixels.

	It operates in the DC-AE f32c32 (SANA-1.1) latent space (32× spatial compression and 32 channels, so a
	1024px image becomes a 32×32×32 latent) and is conditioned on CLIP-L text features, with classifier-free
	guidance (CFG).
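
The latent size quoted above follows from simple arithmetic. As a minimal sketch (the function names and the channels-first ordering are illustrative assumptions, not taken from this repo's code), the shape and the reduction in raw element count can be computed like this:

```python
def latent_shape(height: int, width: int,
                 spatial_factor: int = 32, channels: int = 32) -> tuple:
    """Latent shape (C, H, W) for an f32c32 autoencoder.

    Channels-first ordering is an assumption made for illustration.
    """
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("image size must be a multiple of the spatial factor")
    return (channels, height // spatial_factor, width // spatial_factor)


def compression_ratio(height: int, width: int,
                      spatial_factor: int = 32, channels: int = 32) -> float:
    """Ratio of RGB pixel values to latent values (element counts only)."""
    c, h, w = latent_shape(height, width, spatial_factor, channels)
    return (3 * height * width) / (c * h * w)


print(latent_shape(1024, 1024))       # (32, 32, 32)
print(compression_ratio(1024, 1024))  # 96.0
```

A 1024×1024 RGB image therefore maps to 32,768 latent values, 96× fewer than its 3,145,728 pixel values, which is what keeps the DiT's sequence length small.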

	## Intended use

	Text-to-image generation at 1024×1024. Strongest on single objects and cinematic scenes. A sibling 512px
	checkpoint additionally does instruction-based image editing.

	## How it works

	```
	CLIP-L(prompt) ─┐
	                ├─► DiT ──(rectified-flow / CFG sampler, ~100 steps)──► latent ──► DC-AE decode ──► 1024² image
	Gaussian noise ─┘   (this repo)                                                    (frozen VAE)
	```
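
The sampling loop in the diagram can be sketched as follows. This is a minimal illustration, not this repo's code. It assumes the common rectified-flow convention in which `t = 1` is pure noise, `t = 0` is data, and the network predicts the velocity `noise - data`; it takes plain Euler steps on a uniform time grid and applies the standard CFG combination. `velocity_fn`, `toy_velocity`, and all argument names are hypothetical, and the actual model may use a different time convention or step schedule.

```python
def cfg_velocity(v_cond, v_uncond, scale: float):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    return v_uncond + scale * (v_cond - v_uncond)


def sample_rectified_flow(velocity_fn, noise, cond, uncond,
                          steps: int = 100, cfg_scale: float = 5.0):
    """Euler integration of dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data).

    velocity_fn(x, t, conditioning) stands in for the DiT forward pass.
    """
    x = noise
    for i in range(steps):
        t = 1.0 - i / steps
        t_next = 1.0 - (i + 1) / steps
        v = cfg_velocity(velocity_fn(x, t, cond),
                         velocity_fn(x, t, uncond),
                         cfg_scale)
        x = x + (t_next - t) * v  # t_next < t, so this moves towards the data end
    return x


def toy_velocity(x, t, target):
    """Exact velocity field when the 'dataset' is the single point `target`."""
    return (x - target) / t


# Toy run on scalars: conditional target 2.0, unconditional target 0.0.
result = sample_rectified_flow(toy_velocity, noise=3.0, cond=2.0, uncond=0.0,
                               steps=100, cfg_scale=1.0)
print(result)
```

Because the loop only uses `+` and `*`, the same function works on floats, NumPy arrays, or tensors. In the toy field every trajectory heads to its target, so with `cfg_scale=1.0` the sampler lands on the conditional target, while larger scales extrapolate further away from the unconditional one.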

	The two frozen components are not included (download them from their own repos):
	`mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers` (VAE) and `openai/clip-vit-large-patch14` (text encoder).
	A full from-scratch CPU implementation of this pipeline (CLIP + DiT + DC-AE, in Rust) lives in
	[`hobby-rs`](https://github.com/harishsg993010/HobbyLM).
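
As a rough sketch of fetching the two frozen components, the snippet below uses the `AutoencoderDC` class from `diffusers` and the CLIP classes from `transformers`. The helper names are hypothetical, and using the per-token `last_hidden_state` as the text features is an assumption: this card does not say which CLIP-L output the DiT consumes, nor how `lat_std` and `scaling_factor` are applied before decoding.

```python
VAE_REPO = "mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers"
CLIP_REPO = "openai/clip-vit-large-patch14"


def load_frozen_components(device: str = "cpu"):
    """Download the VAE and CLIP-L text encoder and freeze their weights."""
    # Imported lazily so that defining this helper needs no heavy dependencies.
    from diffusers import AutoencoderDC
    from transformers import CLIPTextModel, CLIPTokenizer

    vae = AutoencoderDC.from_pretrained(VAE_REPO).to(device).eval()
    tokenizer = CLIPTokenizer.from_pretrained(CLIP_REPO)
    text_encoder = CLIPTextModel.from_pretrained(CLIP_REPO).to(device).eval()
    for module in (vae, text_encoder):
        module.requires_grad_(False)  # both stay frozen
    return vae, tokenizer, text_encoder


def encode_prompt(prompt: str, tokenizer, text_encoder):
    """Return per-token CLIP-L features (assumed conditioning format)."""
    import torch

    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(tokens.input_ids.to(text_encoder.device))
    return out.last_hidden_state  # shape (1, 77, 768) for CLIP-L
```

Calling `load_frozen_components()` downloads both checkpoints on first use, so it needs network access and the `torch`, `diffusers`, and `transformers` packages installed.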

	## Samples

	1024×1024, generated by this model (CFG ≈ 5, ~100 steps):

	![HobbyLM-Image scene samples](sample_scenes.png)

	## Results

	This is a hobby-scale generator, so the honest "benchmark" is the training curve and qualitative behaviour
	rather than FID / GenEval (which we did not compute):

	| Property | Value |
	|---|---|
	| Flow-matching loss (final) | 0.76 (lowest of the model lineage; still decreasing) |
	| Parameters | 333M (DiT only) |
	| Resolution | 1024×1024 (32×32×32 latent) |
	| VAE reconstruction | ~26 dB PSNR @512px; sharper at 1024px (32×32 latent) |
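
For reference, PSNR is the usual peak signal-to-noise ratio, 10·log10(MAX² / MSE). A minimal pure-Python sketch follows; the function name and the flat-sequence input format are illustrative assumptions, and this is not the evaluation script behind the number in the table.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 25.5 on a 0..255 scale is 10% of the range, i.e. 20 dB.
print(psnr([0.0, 100.0], [25.5, 125.5]))
```

On images scaled to [0, 1], 26 dB corresponds to an RMS reconstruction error of about 0.05, since 10^(-26/20) ≈ 0.050.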

	Qualitatively, the final checkpoint renders objects and cinematic scenes accurately. It is **soft on people,
	hands, and multi-person scenes**, which is the practical ceiling for a model this small at this latent
	resolution. Loss was still dropping at the end of training, so the 333M DiT is not yet saturated.

	## Files

	- `model.safetensors` — the DiT weights.
	- `config.json` — DiT config, `lat_std`, and the VAE `scaling_factor`.
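
To sanity-check the parameter count without loading any weights, the safetensors header can be read directly: the file starts with an 8-byte little-endian length, followed by a JSON header that lists each tensor's dtype, shape, and byte offsets. A minimal standard-library sketch (the function names are illustrative, and the path assumes `model.safetensors` has already been downloaded):

```python
import json
import struct
from math import prod


def read_safetensors_header(path: str) -> dict:
    """Parse only the JSON header of a .safetensors file (no tensor data is read)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # unsigned 64-bit, little-endian
        return json.loads(f.read(header_len).decode("utf-8"))


def count_parameters(header: dict) -> int:
    """Sum the element counts of all tensors, skipping the optional metadata entry."""
    return sum(prod(entry["shape"])
               for name, entry in header.items()
               if name != "__metadata__")


# Example usage once the file is on disk:
# header = read_safetensors_header("model.safetensors")
# print(f"{count_parameters(header) / 1e6:.1f}M parameters")
```

Reading only the header keeps the check fast and avoids needing `torch` or the `safetensors` package.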

	There is no GGUF build: image-generation DiTs have no standard GGUF runtime.

	## Limitations

	- Hands and multi-person scenes are unreliable.
	- Fine object crispness is capped by the 32× DC-AE latent; a less-compressed VAE would sharpen it at higher cost.
	- Instruction-based editing is limited (the CLIP-L text encoder is a weak instruction follower); the real
	fix is a stronger conditioner, which is future work.

	## License

	Apache-2.0.