---
license: mit
language:
- en
tags:
- diffusion
- flow-matching
- flux
- text-to-image
- image-generation
- tiny
- experimental
library_name: pytorch
pipeline_tag: text-to-image
base_model:
- black-forest-labs/FLUX.1-schnell
datasets:
- AbstractPhil/flux-schnell-teacher-latents
---
# TinyFlux
A **1/12-scale** Flux architecture for experimentation and research. TinyFlux maintains the core MMDiT (Multimodal Diffusion Transformer) design of Flux while dramatically reducing the parameter count for faster iteration and lower resource requirements.
## Model Description
TinyFlux is a miniaturized version of [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) that preserves the essential architectural components:
- **Double-stream blocks** (MMDiT style) - separate text/image pathways with joint attention
- **Single-stream blocks** - concatenated text+image with shared weights
- **AdaLN-Zero modulation** - adaptive layer norm with gating
- **3D RoPE** - rotary position embeddings for temporal + spatial positions
- **Flow matching** - rectified flow training objective
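The AdaLN-Zero pattern listed above can be sketched in scalar form (a minimal sketch, omitting the LayerNorm and using illustrative names; the real modulation acts on tensors inside each block):

```python
def adaln_zero(x: float, block_fn, shift: float, scale: float, gate: float) -> float:
    # AdaLN-Zero: modulate the (normalized) input with a conditioning-derived
    # shift and scale, run the block, then gate its residual contribution.
    # gate is zero-initialized in practice, so every block starts as identity.
    h = block_fn(x * (1.0 + scale) + shift)
    return x + gate * h

# With gate = 0 the block is a no-op; with gate = 1 the residual is added fully.
identity_out = adaln_zero(2.0, lambda v: v * 10.0, 0.5, 0.3, 0.0)  # returns 2.0
```

The zero-initialized gate is what makes deep stacks of these blocks trainable from scratch: the network begins as the identity and gradually learns to use each block.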
### Architecture Comparison
| Component | Flux | TinyFlux | Scale |
|-----------|------|----------|-------|
| Hidden size | 3072 | 256 | /12 |
| Attention heads | 24 | 2 | /12 |
| Head dimension | 128 | 128 | preserved |
| Double-stream layers | 19 | 3 | /6 |
| Single-stream layers | 38 | 3 | /12 |
| VAE channels | 16 | 16 | preserved |
| **Total params** | ~12B | ~8M | /1500 |
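The scaled dimensions in the table can be collected into a small config sketch (field names here are illustrative; the actual `TinyFluxConfig` is defined in `model.py` in this repo):

```python
from dataclasses import dataclass

@dataclass
class TinyFluxConfigSketch:
    hidden_size: int = 256        # Flux: 3072 (/12)
    num_heads: int = 2            # Flux: 24 (/12)
    head_dim: int = 128           # preserved from Flux
    num_double_layers: int = 3    # Flux: 19 (/6)
    num_single_layers: int = 3    # Flux: 38 (/12)
    latent_channels: int = 16     # VAE channels, preserved

cfg = TinyFluxConfigSketch()
```

Note that the hidden size is exactly `num_heads * head_dim` (2 × 128 = 256): the scaling keeps the per-head dimension of Flux and shrinks the number of heads instead.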
### Text Encoders
TinyFlux uses smaller text encoders than standard Flux:
| Role | Flux | TinyFlux |
|------|------|----------|
| Sequence encoder | T5-XXL (4096 dim) | flan-t5-base (768 dim) |
| Pooled encoder | CLIP-L (768 dim) | CLIP-L (768 dim) |
## Training
### Dataset
Trained on [AbstractPhil/flux-schnell-teacher-latents](https://huggingface.co/datasets/AbstractPhil/flux-schnell-teacher-latents):
- 10,000 samples
- Pre-computed VAE latents (16, 64, 64) from 512×512 images
- Diverse prompts covering people, objects, scenes, styles
### Training Details
- **Objective**: Flow matching (rectified flow)
- **Timestep sampling**: Logit-normal with Flux shift (s=3.0)
- **Loss weighting**: Min-SNR-γ (γ=5.0)
- **Optimizer**: AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01)
- **Schedule**: Cosine with warmup
- **Precision**: bfloat16
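The timestep sampling scheme above can be sketched as follows (the exact form of the Flux shift is an assumption here; s = 1 leaves the draw unchanged):

```python
import math
import random

def flux_shift(u: float, s: float = 3.0) -> float:
    # Assumed form of the Flux timestep shift: warps u in (0, 1) so that
    # s > 1 biases training toward one end of the trajectory.
    return s * u / (1.0 + (s - 1.0) * u)

def sample_timestep(s: float = 3.0) -> float:
    # Logit-normal sampling: squash a standard normal draw through a sigmoid,
    # concentrating timesteps away from the endpoints, then apply the shift.
    u = 1.0 / (1.0 + math.exp(-random.gauss(0.0, 1.0)))
    return flux_shift(u, s)
```

For example, `flux_shift(0.5, 3.0)` maps 0.5 to 0.75, so with s = 3.0 the sampled timesteps are pushed noticeably toward 1.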
### Flow Matching Formulation
```
Interpolation:   x_t = (1 - t) * noise + t * data
Target velocity: v_target = data - noise
Loss:            MSE(v_pred, v_target) * min_snr_weight(t)
```
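In scalar form, the formulation reduces to a linear path and its constant slope (a minimal sketch; training applies this elementwise to latent tensors):

```python
def interpolate(noise: float, data: float, t: float) -> float:
    # Linear (rectified-flow) path from pure noise at t=0 to data at t=1.
    return (1.0 - t) * noise + t * data

def target_velocity(noise: float, data: float) -> float:
    # Time derivative of the path above: constant along the whole trajectory.
    return data - noise

# The target velocity is exactly the slope of the interpolation in t:
n, d, t, h = 0.2, 0.9, 0.4, 1e-6
slope = (interpolate(n, d, t + h) - interpolate(n, d, t)) / h
```

Because the path is linear, the velocity field the model regresses toward does not depend on t, which is what makes few-step Euler sampling viable.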
## Usage
### Installation
```bash
pip install torch transformers diffusers safetensors huggingface_hub
```
### Inference
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL
# Load model (requires the TinyFlux and TinyFluxConfig classes from model.py in this repo)
config = TinyFluxConfig()
model = TinyFlux(config).to("cuda").to(torch.bfloat16)
weights = load_file(hf_hub_download("AbstractPhil/tiny-flux", "model.safetensors"))
model.load_state_dict(weights)
model.eval()
# Load encoders
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to("cuda")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16).to("cuda")
vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-schnell", subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")
# Encode prompt
prompt = "a photo of a cat"
t5_in = t5_tok(prompt, max_length=128, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
t5_out = t5_enc(**t5_in).last_hidden_state
clip_in = clip_tok(prompt, max_length=77, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
clip_out = clip_enc(**clip_in).pooler_output
# Euler sampling (t: 0 → 1, noise → data)
x = torch.randn(1, 64 * 64, 16, device="cuda", dtype=torch.bfloat16)
img_ids = TinyFlux.create_img_ids(1, 64, 64, "cuda")
timesteps = torch.linspace(0, 1, 21, device="cuda")
guidance = torch.tensor([3.5], device="cuda", dtype=torch.bfloat16)
with torch.no_grad():  # avoid building a graph across the 20 Euler steps
    for i in range(20):
        t = timesteps[i].unsqueeze(0)
        dt = timesteps[i + 1] - timesteps[i]
        v = model(
            hidden_states=x,
            encoder_hidden_states=t5_out,
            pooled_projections=clip_out,
            timestep=t,
            img_ids=img_ids,
            guidance=guidance,
        )
        x = x + v * dt
# Decode
latents = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)
latents = latents / vae.config.scaling_factor
image = vae.decode(latents.float()).sample
image = (image / 2 + 0.5).clamp(0, 1)
```
### Full Inference Script
See [inference_colab.py](https://huggingface.co/AbstractPhil/tiny-flux/blob/main/inference_colab.py) for a complete generation pipeline with:
- Classifier-free guidance
- Batch generation
- Image saving
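Classifier-free guidance combines two model passes per timestep, one conditional and one unconditional. In scalar form (a sketch with illustrative names; the full script applies this to velocity tensors):

```python
def cfg_velocity(v_uncond: float, v_cond: float, scale: float) -> float:
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one. scale = 1 recovers the conditional output,
    # scale = 0 the unconditional one; scale > 1 strengthens prompt adherence.
    return v_uncond + scale * (v_cond - v_uncond)
```

With the guidance scale of 3.5 used in the inference example above, the update moves 3.5× as far from the unconditional prediction as the conditional pass alone would.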
## Files
```
AbstractPhil/tiny-flux/
├── model.safetensors       # Model weights (~32MB)
├── config.json             # Model configuration
├── README.md               # This file
├── model.py                # Model architecture definition
├── inference_colab.py      # Inference script
├── train_colab.py          # Training script
├── checkpoints/            # Training checkpoints
│   └── step_*.safetensors
├── logs/                   # TensorBoard logs
└── samples/                # Generated samples during training
```
## Limitations
- **Resolution**: Trained on 512×512 only
- **Quality**: Significantly lower than full Flux due to reduced capacity
- **Text understanding**: Limited by smaller T5 encoder (768 vs 4096 dim)
- **Fine details**: May struggle with complex scenes or fine-grained details
- **Experimental**: Intended for research and learning, not production use
## Intended Use
- Understanding Flux/MMDiT architecture
- Rapid prototyping and experimentation
- Educational purposes
- Resource-constrained environments
- Baseline for architecture modifications
## Citation
If you use TinyFlux in your research, please cite:
```bibtex
@misc{tinyflux2025,
  title={TinyFlux: A Miniaturized Flux Architecture for Experimentation},
  author={AbstractPhil},
  year={2025},
  url={https://huggingface.co/AbstractPhil/tiny-flux}
}
```
## Acknowledgments
- [Black Forest Labs](https://blackforestlabs.ai/) for the original Flux architecture
- [Hugging Face](https://huggingface.co/) for diffusers and transformers libraries
## License
MIT License - See LICENSE file for details.
---
**Note**: This is an experimental research model. For high-quality image generation, use the full [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) or [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) models.