---
license: apache-2.0
pipeline_tag: text-to-image
library_name: safetensors
tags: [hobbylm, text-to-image, diffusion, dit, flow-matching]
---

# HobbyLM-Image: 1024px text-to-image DiT

The odd one out in the HobbyLM family: not a language model, but a **333M in-context flow-matching DiT** that
generates 1024×1024 images. It was built to see how good a text-to-image model you can train on a genuinely
small budget: the whole thing came together for roughly **$300 of Modal GPU time** by working in a heavily
compressed latent space instead of pixels.

It runs in the **DC-AE f32c32 (SANA-1.1)** latent (32× spatial compression → a 32×32×32 latent at 1024px) and
is conditioned on **CLIP-L** text features, with classifier-free guidance.
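
The latent shape quoted above follows directly from the compression factor. A minimal sketch of that arithmetic; the function name `latent_shape` and the channels-first ordering are illustrative choices, not something this repo defines:

```python
def latent_shape(image_px: int, spatial_factor: int = 32, channels: int = 32) -> tuple[int, int, int]:
    """Return (channels, height, width) of the latent for a square image.

    Defaults mirror DC-AE f32c32: 32x spatial compression, 32 latent channels.
    The channels-first ordering is an assumption made for this example.
    """
    if image_px % spatial_factor != 0:
        raise ValueError("image size must be a multiple of the compression factor")
    side = image_px // spatial_factor
    return (channels, side, side)


shape_1024 = latent_shape(1024)  # (32, 32, 32)
shape_512 = latent_shape(512)    # (32, 16, 16)

# Values per image: RGB pixels versus latent entries.
pixel_values = 3 * 1024 * 1024
latent_values = shape_1024[0] * shape_1024[1] * shape_1024[2]
compression_ratio = pixel_values / latent_values  # 96.0
```

At 1024px the DiT therefore works on 32,768 latent values instead of roughly 3.1 million pixel values, which is where most of the training-cost saving comes from.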

## Intended use

Text-to-image generation at 1024×1024. Strongest on single objects and cinematic scenes. A sibling 512px
checkpoint additionally does instruction-based image editing.

## How it works

```
CLIP-L(prompt) ─┐
                ├─►  DiT  ──(rectified-flow / CFG sampler, ~100 steps)──►  latent  ──►  DC-AE decode  ──►  1024² image
 Gaussian noise ─┘     (this repo)                                                       (frozen VAE)
```

The two frozen components are **not** included (download them from their own repos):
`mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers` (VAE) and `openai/clip-vit-large-patch14` (text encoder).
A full from-scratch CPU implementation of this pipeline (CLIP + DiT + DC-AE, in Rust) lives in
[`hobby-rs`](https://github.com/harishsg993010/HobbyLM).
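
The sampling stage in the diagram can be sketched as a plain Euler integration with classifier-free guidance. This is a sketch under stated assumptions, not the repo's sampler: the time convention (t=1 is noise, t=0 is data), the uniform step schedule, and the `velocity_fn(x, t, emb)` signature are all hypothetical, and `toy_velocity` is a stand-in so the example runs without the DiT.

```python
import numpy as np


def sample_latent(velocity_fn, text_emb, null_emb, shape=(32, 32, 32),
                  steps=100, cfg_scale=5.0, seed=0):
    """Euler integration of a rectified-flow ODE with classifier-free guidance.

    ASSUMPTIONS (not taken from this repo): t=1 is pure noise, t=0 is data,
    and velocity_fn(x, t, emb) predicts dx/dt along the noise-to-data path.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)          # start from Gaussian noise at t=1
    ts = np.linspace(1.0, 0.0, steps + 1)   # uniform time grid from noise to data
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v_cond = velocity_fn(x, t_cur, text_emb)        # prompt-conditioned velocity
        v_uncond = velocity_fn(x, t_cur, null_emb)      # unconditional velocity
        v = v_uncond + cfg_scale * (v_cond - v_uncond)  # CFG extrapolation
        x = x + (t_next - t_cur) * v                    # Euler step (dt is negative)
    return x


# Toy stand-in for the DiT: the exact velocity when the data is a single point.
# With x_t = (1 - t) * target + t * noise, the velocity is (x_t - target) / t.
# Here the "embedding" argument is abused to carry the target latent.
def toy_velocity(x, t, emb):
    return (x - emb) / t


target = np.full((32, 32, 32), 0.5)
latent = sample_latent(toy_velocity, text_emb=target, null_emb=target)
```

In a real pipeline `velocity_fn` would wrap the DiT forward pass, `text_emb` and `null_emb` would be CLIP-L features for the prompt and for an empty prompt, and the returned latent would go to the DC-AE decoder.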

## Samples

1024×1024, generated by this model (CFG ≈ 5, ~100 steps):

![HobbyLM-Image scene samples](sample_scenes.png)

## Results

This is a hobby-scale generator, so the honest "benchmark" is the training curve and qualitative behaviour
rather than FID / GenEval (which we did not compute):

| Property | Value |
|---|---|
| Flow-matching loss (final) | **0.76** (lowest of the model lineage; still decreasing) |
| Parameters | 333M (DiT only) |
| Resolution | 1024×1024 (32×32×32 latent) |
| VAE reconstruction | ~26 dB PSNR @512px; sharper at 1024px (32×32 latent) |

Qualitatively, the final checkpoint produces accurate objects and cinematic scenes. It is **soft on people,
hands, and multi-person scenes**, which is the real small-model / latent-resolution ceiling. Loss was still dropping
at the end of training, so the 333M DiT is not yet saturated.
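
For readers unfamiliar with the PSNR figure in the table, the metric is 10·log10(MAX² / MSE). The sketch below shows the standard definition on synthetic data; how the ~26 dB value was actually measured (value range, image set) is not stated here, so the `[0, 1]` range is an assumption.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value**2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Synthetic example: a uniform error of 0.05 on a [0, 1] image gives
# MSE = 0.0025, which is about 26 dB.
image = np.full((8, 8, 3), 0.5)
noisy = image + 0.05
value = psnr(image, noisy)
```

As a rough intuition, 26 dB on a `[0, 1]` scale corresponds to a root-mean-square error of about 5% of the full intensity range.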

## Files

- `model.safetensors`: the DiT weights.
- `config.json`: DiT config, `lat_std`, and the VAE `scaling_factor`.

There is no GGUF build: image-generation DiTs have no standard GGUF runtime.
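
To check the weights file without loading any tensors, the safetensors header can be read with the standard library alone. This sketch relies only on the published file layout (an 8-byte little-endian header length followed by a JSON header); the helper names are illustrative and not part of this repo.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict.

    The file starts with an 8-byte little-endian unsigned integer giving the
    header length, followed by that many bytes of UTF-8 JSON.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    total = 0
    for name, entry in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        n = 1
        for dim in entry["shape"]:
            n *= dim
        total += n
    return total
```

Running `count_parameters(read_safetensors_header("model.safetensors"))` on the downloaded file should land near 333M if the file holds only the DiT weights, as the table above suggests.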

## Limitations

- Hands and multi-person scenes are unreliable.
- Fine object crispness is capped by the 32× DC-AE latent; a less-compressed VAE would sharpen it at higher cost.
- Instruction-based **editing** is limited (the CLIP-L text encoder is a weak instruction follower); the real
  fix is a stronger conditioner, which is future work.

## License

Apache-2.0.