---
license: mit
language:
- en
tags:
- diffusion
- flow-matching
- flux
- text-to-image
- image-generation
- tiny
- experimental
library_name: pytorch
pipeline_tag: text-to-image
base_model:
- black-forest-labs/FLUX.1-schnell
datasets:
- AbstractPhil/flux-schnell-teacher-latents
---

# TinyFlux

A **1/12-scale** Flux architecture for experimentation and research. TinyFlux keeps the core MMDiT (Multimodal Diffusion Transformer) design of Flux while dramatically reducing parameter count for faster iteration and lower resource requirements.

## Model Description

TinyFlux is a miniaturized version of [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) that preserves the essential architectural components:

- **Double-stream blocks** (MMDiT style) - separate text/image pathways with joint attention
- **Single-stream blocks** - concatenated text+image with shared weights  
- **AdaLN-Zero modulation** - adaptive layer norm with gating
- **3D RoPE** - rotary position embeddings for temporal + spatial positions
- **Flow matching** - rectified flow training objective
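
To make the AdaLN-Zero bullet concrete, here is a minimal sketch of that modulation pattern: the conditioning vector produces a shift, scale, and gate, and the projection is zero-initialized so each block starts as an identity map. Class and field names are illustrative, not taken from this repo's `model.py`.

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """AdaLN-Zero modulation (sketch): conditioning -> shift, scale, gate."""

    def __init__(self, dim: int):
        super().__init__()
        # No learned affine in the norm; all modulation comes from the conditioning.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(dim, 3 * dim)
        # Zero-init is the "-Zero" part: blocks contribute nothing at init.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        # x: (batch, seq, dim), cond: (batch, dim)
        shift, scale, gate = self.proj(nn.functional.silu(cond)).chunk(3, dim=-1)
        out = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return out, gate.unsqueeze(1)  # gate multiplies the block's residual output
```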

### Architecture Comparison

| Component | Flux | TinyFlux | Scale |
|-----------|------|----------|-------|
| Hidden size | 3072 | 256 | /12 |
| Attention heads | 24 | 2 | /12 |
| Head dimension | 128 | 128 | preserved |
| Double-stream layers | 19 | 3 | /6 |
| Single-stream layers | 38 | 3 | /12 |
| VAE channels | 16 | 16 | preserved |
| **Total params** | ~12B | ~8M | /1500 |
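
The table above maps directly onto a handful of configuration fields. A minimal sketch of what such a config might look like (field names are hypothetical; `model.py` in this repo is authoritative):

```python
from dataclasses import dataclass

@dataclass
class TinyFluxConfigSketch:
    # Hypothetical field names -- see model.py in the repo for the real TinyFluxConfig.
    hidden_size: int = 256        # Flux: 3072
    num_heads: int = 2            # Flux: 24
    head_dim: int = 128           # preserved from Flux
    num_double_layers: int = 3    # Flux: 19
    num_single_layers: int = 3    # Flux: 38
    vae_channels: int = 16        # preserved from Flux

    def __post_init__(self):
        # Hidden size must factor as heads x head_dim for attention to work.
        assert self.hidden_size == self.num_heads * self.head_dim
```

Note that keeping `head_dim = 128` while dividing `hidden_size` by 12 is what forces the head count down to 2.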

### Text Encoders

TinyFlux uses smaller text encoders than standard Flux:

| Role | Flux | TinyFlux |
|------|------|----------|
| Sequence encoder | T5-XXL (4096 dim) | flan-t5-base (768 dim) |
| Pooled encoder | CLIP-L (768 dim) | CLIP-L (768 dim) |

## Training

### Dataset

Trained on [AbstractPhil/flux-schnell-teacher-latents](https://huggingface.co/datasets/AbstractPhil/flux-schnell-teacher-latents):
- 10,000 samples
- Pre-computed VAE latents (16, 64, 64) from 512×512 images
- Diverse prompts covering people, objects, scenes, styles

### Training Details

- **Objective**: Flow matching (rectified flow)
- **Timestep sampling**: Logit-normal with Flux shift (s=3.0)
- **Loss weighting**: Min-SNR-γ (γ=5.0)
- **Optimizer**: AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01)
- **Schedule**: Cosine with warmup
- **Precision**: bfloat16

### Flow Matching Formulation

```
Interpolation: x_t = (1 - t) * noise + t * data
Target velocity: v = data - noise
Loss: MSE(predicted_v, target_v) * min_snr_weight(t)
```
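
In code, the three lines above translate to something like this sketch. The min-SNR weight assumes SNR(t) = (t / (1 - t))² under this interpolation; function and variable names are illustrative:

```python
import torch

def flow_matching_loss(pred_v, data, noise, t, gamma: float = 5.0):
    """Rectified-flow MSE loss with Min-SNR-gamma weighting (sketch)."""
    # For x_t = (1 - t) * noise + t * data, the target velocity is constant:
    target_v = data - noise
    # SNR of the interpolant: signal scale t, noise scale (1 - t).
    snr = (t / (1 - t).clamp(min=1e-4)) ** 2
    # Min-SNR-gamma: cap the weight so easy (high-SNR) timesteps don't dominate.
    weight = torch.clamp(snr, max=gamma) / snr.clamp(min=1e-4)
    per_sample = ((pred_v - target_v) ** 2).flatten(1).mean(dim=1)
    return (weight * per_sample).mean()
```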

## Usage

### Installation

```bash
pip install torch transformers diffusers safetensors huggingface_hub
```

### Inference

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL

# Load model (requires the TinyFlux class definition from model.py in this repo)
config = TinyFluxConfig()
model = TinyFlux(config).to("cuda").to(torch.bfloat16)

weights = load_file(hf_hub_download("AbstractPhil/tiny-flux", "model.safetensors"))
model.load_state_dict(weights)
model.eval()

# Load encoders
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to("cuda")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16).to("cuda")
vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-schnell", subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")

# Encode prompt
prompt = "a photo of a cat"
t5_in = t5_tok(prompt, max_length=128, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
t5_out = t5_enc(**t5_in).last_hidden_state
clip_in = clip_tok(prompt, max_length=77, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
clip_out = clip_enc(**clip_in).pooler_output

# Euler sampling (t: 0→1, noise→data)
x = torch.randn(1, 64*64, 16, device="cuda", dtype=torch.bfloat16)
img_ids = TinyFlux.create_img_ids(1, 64, 64, "cuda")
timesteps = torch.linspace(0, 1, 21, device="cuda")

guidance = torch.tensor([3.5], device="cuda", dtype=torch.bfloat16)

for i in range(20):
    t = timesteps[i].unsqueeze(0)
    dt = timesteps[i+1] - timesteps[i]

    v = model(
        hidden_states=x,
        encoder_hidden_states=t5_out,
        pooled_projections=clip_out,
        timestep=t,
        img_ids=img_ids,
        guidance=guidance,
    )
    x = x + v * dt

# Decode
latents = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)
latents = latents / vae.config.scaling_factor + vae.config.shift_factor  # Flux VAE uses both scale and shift
image = vae.decode(latents.float()).sample
image = (image / 2 + 0.5).clamp(0, 1)
```
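
To turn the clamped tensor into a saved file, a small helper along these lines works (assumes Pillow is installed; the helper name is illustrative):

```python
import torch
from PIL import Image

def tensor_to_pil(image: torch.Tensor) -> Image.Image:
    """Convert a (1, 3, H, W) float tensor in [0, 1] to a PIL image."""
    arr = (image[0].permute(1, 2, 0).cpu().float().numpy() * 255).round().astype("uint8")
    return Image.fromarray(arr)

# e.g. tensor_to_pil(image).save("sample.png")
```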

### Full Inference Script

See the [inference_colab.py](https://huggingface.co/AbstractPhil/tiny-flux/blob/main/inference_colab.py) for a complete generation pipeline with:
- Classifier-free guidance
- Batch generation
- Image saving

## Files

```
AbstractPhil/tiny-flux/
├── model.safetensors      # Model weights (~32MB)
├── config.json            # Model configuration
├── README.md              # This file
├── model.py               # Model architecture definition
├── inference_colab.py     # Inference script
├── train_colab.py         # Training script
├── checkpoints/           # Training checkpoints
│   └── step_*.safetensors
├── logs/                  # Tensorboard logs
└── samples/               # Generated samples during training
```

## Limitations

- **Resolution**: Trained on 512×512 only
- **Quality**: Significantly lower than full Flux due to reduced capacity
- **Text understanding**: Limited by smaller T5 encoder (768 vs 4096 dim)
- **Fine details**: May struggle with complex scenes or fine-grained details
- **Experimental**: Intended for research and learning, not production use

## Intended Use

- Understanding Flux/MMDiT architecture
- Rapid prototyping and experimentation
- Educational purposes
- Resource-constrained environments
- Baseline for architecture modifications

## Citation

If you use TinyFlux in your research, please cite:

```bibtex
@misc{tinyflux2025,
  title={TinyFlux: A Miniaturized Flux Architecture for Experimentation},
  author={AbstractPhil},
  year={2025},
  url={https://huggingface.co/AbstractPhil/tiny-flux}
}
```

## Acknowledgments

- [Black Forest Labs](https://blackforestlabs.ai/) for the original Flux architecture
- [Hugging Face](https://huggingface.co/) for diffusers and transformers libraries

## License

MIT License - See LICENSE file for details.

---

**Note**: This is an experimental research model. For high-quality image generation, use the full [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) or [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) models.