---
license: mit
language:
- en
tags:
- diffusion
- flow-matching
- flux
- text-to-image
- image-generation
- tiny
- experimental
library_name: pytorch
pipeline_tag: text-to-image
base_model:
- black-forest-labs/FLUX.1-schnell
datasets:
- AbstractPhil/flux-schnell-teacher-latents
---

# TinyFlux

A **1/12-scale** Flux architecture for experimentation and research. TinyFlux keeps the core MMDiT (Multimodal Diffusion Transformer) design of Flux while cutting the parameter count dramatically, for faster iteration and lower resource requirements.

## Model Description

TinyFlux is a miniaturized version of [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) that preserves the essential architectural components:

- **Double-stream blocks** (MMDiT style) - separate text/image pathways with joint attention
- **Single-stream blocks** - concatenated text+image with shared weights
- **AdaLN-Zero modulation** - adaptive layer norm with gating
- **3D RoPE** - rotary position embeddings for temporal + spatial positions
- **Flow matching** - rectified flow training objective
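
To make the 3D RoPE item concrete, here is a minimal sketch of how per-token position ids could be laid out for an image latent grid. It assumes the convention of the reference Flux pipelines, where each image token carries a `(t, row, col)` triple and the temporal axis stays 0 for still images. `make_img_ids` is a hypothetical helper written for illustration; it is not the `TinyFlux.create_img_ids` method from `model.py`, whose output shape and dtype may differ.

```python
def make_img_ids(height: int, width: int) -> list[tuple[int, int, int]]:
    """Return one (t, row, col) position triple per latent token, in row-major order.

    Assumption: the temporal index is always 0 for a single still image.
    """
    return [(0, row, col) for row in range(height) for col in range(width)]


ids = make_img_ids(64, 64)
print(len(ids))  # 4096
print(ids[65])   # (0, 1, 1)
```

Each of the three axes would then get its own rotary frequencies, so attention can distinguish tokens by row and column rather than by a single flattened index.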

### Architecture Comparison

| Component | Flux | TinyFlux | Scale |
|-----------|------|----------|-------|
| Hidden size | 3072 | 256 | /12 |
| Attention heads | 24 | 2 | /12 |
| Head dimension | 128 | 128 | preserved |
| Double-stream layers | 19 | 3 | /6 |
| Single-stream layers | 38 | 3 | /12 |
| VAE channels | 16 | 16 | preserved |
| **Total params** | ~12B | ~8M | /1500 |

### Text Encoders

TinyFlux uses a smaller sequence encoder than standard Flux and keeps the same pooled encoder:

| Role | Flux | TinyFlux |
|------|------|----------|
| Sequence encoder | T5-XXL (4096 dim) | flan-t5-base (768 dim) |
| Pooled encoder | CLIP-L (768 dim) | CLIP-L (768 dim) |
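
The numbers in the two tables above can be collected into a small configuration object. The sketch below is illustrative only: `TinyFluxConfigSketch` and its field names are hypothetical and are not the `TinyFluxConfig` class shipped in `model.py`. It also checks the one constraint the architecture table implies, namely that the hidden size equals the number of heads times the head dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TinyFluxConfigSketch:
    """Hypothetical config mirroring the tables above (field names are assumptions)."""

    hidden_size: int = 256
    num_attention_heads: int = 2
    attention_head_dim: int = 128
    num_double_layers: int = 3
    num_single_layers: int = 3
    in_channels: int = 16             # VAE latent channels
    joint_attention_dim: int = 768    # flan-t5-base hidden size
    pooled_projection_dim: int = 768  # CLIP-L pooled output size

    def __post_init__(self) -> None:
        expected = self.num_attention_heads * self.attention_head_dim
        if self.hidden_size != expected:
            raise ValueError(
                f"hidden_size={self.hidden_size} must equal heads * head_dim = {expected}"
            )


cfg = TinyFluxConfigSketch()
print(cfg.hidden_size // cfg.num_attention_heads)  # 128
```

Keeping the head dimension at 128 while shrinking the head count is what lets the hidden size drop to 256 without changing the per-head attention geometry.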

## Training

### Dataset

Trained on [AbstractPhil/flux-schnell-teacher-latents](https://huggingface.co/datasets/AbstractPhil/flux-schnell-teacher-latents):

- 10,000 samples
- Pre-computed VAE latents (16, 64, 64) from 512×512 images
- Diverse prompts covering people, objects, scenes, styles
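
The latent shape follows from simple arithmetic, sketched below. The 8× spatial downsampling factor is an assumption inferred from 512 → 64; the flattening to one token per latent pixel matches the `(1, 64*64, 16)` tensor used in the inference snippet later in this README. Both function names are hypothetical.

```python
def latent_shape(image_size: int, channels: int = 16, downsample: int = 8) -> tuple[int, int, int]:
    """Shape (C, H, W) of the VAE latent for a square image (assumes 8x downsampling)."""
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by the VAE downsampling factor")
    side = image_size // downsample
    return (channels, side, side)


def token_sequence_shape(shape: tuple[int, int, int]) -> tuple[int, int]:
    """Shape (num_tokens, channels) after flattening the latent grid to a sequence."""
    channels, height, width = shape
    return (height * width, channels)


shape = latent_shape(512)
print(shape)                        # (16, 64, 64)
print(token_sequence_shape(shape))  # (4096, 16)
```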

### Training Details

- **Objective**: Flow matching (rectified flow)
- **Timestep sampling**: Logit-normal with Flux shift (s=3.0)
- **Loss weighting**: Min-SNR-γ (γ=5.0)
- **Optimizer**: AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01)
- **Schedule**: Cosine with warmup
- **Precision**: bfloat16
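
The timestep sampling and loss weighting items can be sketched in plain Python. Several details here are assumptions rather than facts taken from `train_colab.py`: the shift is taken to have the common form `s*t / (1 + (s-1)*t)`, the SNR is derived from the interpolation `x_t = (1 - t) * noise + t * data` as `(t / (1 - t))**2`, and the weight uses the `min(SNR, γ) / SNR` variant of Min-SNR. The function names are hypothetical.

```python
import math
import random


def flux_shift(t: float, s: float = 3.0) -> float:
    """Assumed shift form: remaps t in [0, 1] to s*t / (1 + (s-1)*t)."""
    return s * t / (1.0 + (s - 1.0) * t)


def sample_timestep(rng: random.Random, s: float = 3.0) -> float:
    """Logit-normal sample (sigmoid of a standard normal), then shifted."""
    u = rng.gauss(0.0, 1.0)
    t = 1.0 / (1.0 + math.exp(-u))
    return flux_shift(t, s)


def min_snr_weight(t: float, gamma: float = 5.0, eps: float = 1e-5) -> float:
    """Assumed Min-SNR variant: min(SNR, gamma) / SNR with SNR = (t / (1 - t))^2."""
    t = min(max(t, eps), 1.0 - eps)  # keep the ratio finite at the endpoints
    snr = (t / (1.0 - t)) ** 2
    return min(snr, gamma) / snr


rng = random.Random(0)
ts = [sample_timestep(rng) for _ in range(5)]
print(flux_shift(0.5))      # 0.75
print(min_snr_weight(0.5))  # 1.0
```

Under these assumptions the weight stays at 1 while the SNR is below γ and decays as the sample approaches clean data, which keeps nearly noise-free timesteps from dominating the loss.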

### Flow Matching Formulation

```
Interpolation: x_t = (1 - t) * noise + t * data
Target velocity: v = data - noise
Loss: MSE(predicted_v, target_v) * min_snr_weight(t)
```
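
A small worked example of the formulation above, using plain Python lists in place of tensors. The values are arbitrary and the helper names are hypothetical. It also shows why Euler sampling runs from t=0 to t=1: stepping from `x_t` along the target velocity for the remaining time `1 - t` lands exactly on the data.

```python
def interpolate(noise: list[float], data: list[float], t: float) -> list[float]:
    """x_t = (1 - t) * noise + t * data, element-wise."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: list[float], data: list[float]) -> list[float]:
    """v = data - noise, constant along the straight path."""
    return [d - n for n, d in zip(noise, data)]


def mse(pred: list[float], target: list[float]) -> float:
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


noise = [0.5, -1.0, 2.0]
data = [1.0, 0.0, -1.0]
t = 0.25

x_t = interpolate(noise, data, t)
v = target_velocity(noise, data)
# One Euler step of size (1 - t) along the true velocity recovers the data.
x_1 = [x + (1.0 - t) * vi for x, vi in zip(x_t, v)]

print(x_t)        # [0.625, -0.75, 1.25]
print(v)          # [0.5, 1.0, -3.0]
print(x_1)        # [1.0, 0.0, -1.0]
print(mse(v, v))  # 0.0
```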

## Usage

### Installation

```bash
pip install torch transformers diffusers safetensors huggingface_hub
```

### Inference

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL

# Load model (copy the TinyFlux class definitions from model.py first)
config = TinyFluxConfig()
model = TinyFlux(config).to("cuda").to(torch.bfloat16)

weights = load_file(hf_hub_download("AbstractPhil/tiny-flux", "model.safetensors"))
model.load_state_dict(weights)
model.eval()

# Load encoders
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to("cuda")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16).to("cuda")
vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-schnell", subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")

# Encode prompt (no gradients are needed anywhere during inference)
prompt = "a photo of a cat"
t5_in = t5_tok(prompt, max_length=128, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
clip_in = clip_tok(prompt, max_length=77, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    t5_out = t5_enc(**t5_in).last_hidden_state
    clip_out = clip_enc(**clip_in).pooler_output

# Euler sampling (t: 0→1, noise→data)
x = torch.randn(1, 64*64, 16, device="cuda", dtype=torch.bfloat16)
img_ids = TinyFlux.create_img_ids(1, 64, 64, "cuda")
timesteps = torch.linspace(0, 1, 21, device="cuda")
guidance = torch.tensor([3.5], device="cuda", dtype=torch.bfloat16)

with torch.no_grad():
    for i in range(20):
        t = timesteps[i].unsqueeze(0)
        dt = timesteps[i+1] - timesteps[i]

        v = model(
            hidden_states=x,
            encoder_hidden_states=t5_out,
            pooled_projections=clip_out,
            timestep=t,
            img_ids=img_ids,
            guidance=guidance,
        )
        x = x + v * dt

    # Decode: (1, H*W, C) -> (1, C, H, W), then undo the latent scaling
    latents = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)
    latents = latents / vae.config.scaling_factor
    # Match the VAE's dtype (bfloat16 here); a float32 input would raise a dtype mismatch
    image = vae.decode(latents.to(vae.dtype)).sample
    image = (image / 2 + 0.5).clamp(0, 1)
```

### Full Inference Script

See [inference_colab.py](https://huggingface.co/AbstractPhil/tiny-flux/blob/main/inference_colab.py) for a complete generation pipeline with:

- Classifier-free guidance
- Batch generation
- Image saving
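
For readers unfamiliar with classifier-free guidance, the sketch below shows the standard way a conditional and an unconditional velocity prediction are combined. It is a generic illustration, not code from `inference_colab.py`, whose implementation may differ; `cfg_velocity` is a hypothetical name and the numbers are arbitrary.

```python
def cfg_velocity(v_uncond: list[float], v_cond: list[float], scale: float) -> list[float]:
    """Standard CFG mix: v = v_uncond + scale * (v_cond - v_uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


v_uncond = [0.0, 1.0]  # prediction with an empty prompt
v_cond = [1.0, 3.0]    # prediction with the real prompt

print(cfg_velocity(v_uncond, v_cond, 1.0))  # [1.0, 3.0]
print(cfg_velocity(v_uncond, v_cond, 3.5))  # [3.5, 8.0]
```

A scale of 1.0 reproduces the conditional prediction, and larger scales push the sample further along the direction the prompt contributes. This costs two model evaluations per Euler step.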

## Files

```
AbstractPhil/tiny-flux/
├── model.safetensors      # Model weights (~32MB)
├── config.json            # Model configuration
├── README.md              # This file
├── model.py               # Model architecture definition
├── inference_colab.py     # Inference script
├── train_colab.py         # Training script
├── checkpoints/           # Training checkpoints
│   └── step_*.safetensors
├── logs/                  # TensorBoard logs
└── samples/               # Generated samples during training
```

## Limitations

- **Resolution**: Trained on 512×512 only
- **Quality**: Significantly lower than full Flux due to reduced capacity
- **Text understanding**: Limited by smaller T5 encoder (768 vs 4096 dim)
- **Fine details**: May struggle with complex scenes or fine-grained details
- **Experimental**: Intended for research and learning, not production use

## Intended Use

- Understanding Flux/MMDiT architecture
- Rapid prototyping and experimentation
- Educational purposes
- Resource-constrained environments
- Baseline for architecture modifications

## Citation

If you use TinyFlux in your research, please cite:

```bibtex
@misc{tinyflux2025,
  title={TinyFlux: A Miniaturized Flux Architecture for Experimentation},
  author={AbstractPhil},
  year={2025},
  url={https://huggingface.co/AbstractPhil/tiny-flux}
}
```

## Acknowledgments

- [Black Forest Labs](https://blackforestlabs.ai/) for the original Flux architecture
- [Hugging Face](https://huggingface.co/) for the diffusers and transformers libraries

## License

MIT License - See LICENSE file for details.

---

**Note**: This is an experimental research model. For high-quality image generation, use the full [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) or [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) models.