	---
	license: mit
	language:
	- en
	tags:
	- diffusion
	- flow-matching
	- flux
	- text-to-image
	- image-generation
	- tiny
	- experimental
	library_name: pytorch
	pipeline_tag: text-to-image
	base_model:
	- black-forest-labs/FLUX.1-schnell
	datasets:
	- AbstractPhil/flux-schnell-teacher-latents
	---

	# TinyFlux

	TinyFlux is a Flux architecture with the hidden size scaled down 12× (3072 → 256) for experimentation and research. It keeps the core MMDiT (Multimodal Diffusion Transformer) design of Flux while cutting the parameter count from ~12B to ~8M, which allows faster iteration and lower resource requirements.

	## Model Description

	TinyFlux is a miniaturized version of [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) that preserves the essential architectural components:

	- Double-stream blocks (MMDiT style) - separate text/image pathways with joint attention
	- Single-stream blocks - concatenated text+image with shared weights
	- AdaLN-Zero modulation - adaptive layer norm with gating
	- 3D RoPE - rotary position embeddings for temporal + spatial positions
	- Flow matching - rectified flow training objective

	### Architecture Comparison

	| Component | Flux | TinyFlux | Scale |
	|-----------|------|----------|-------|
	| Hidden size | 3072 | 256 | /12 |
	| Attention heads | 24 | 2 | /12 |
	| Head dimension | 128 | 128 | preserved |
	| Double-stream layers | 19 | 3 | ~/6 |
	| Single-stream layers | 38 | 3 | ~/12 |
	| VAE channels | 16 | 16 | preserved |
	| Total params | ~12B | ~8M | ~/1500 |
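The width figures in the table are linked: hidden size equals attention heads times head dimension, so keeping the head dimension at 128 while cutting the heads from 24 to 2 yields 256. A minimal sketch of that arithmetic follows. `ArchSpec` is a hypothetical container used only for illustration; it is not the repository's `TinyFluxConfig`, whose fields are not shown in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchSpec:
    """Illustrative container only; the field names are assumptions."""
    num_heads: int
    head_dim: int
    double_layers: int
    single_layers: int

    @property
    def hidden_size(self) -> int:
        # Hidden size is fully determined by heads * head_dim.
        return self.num_heads * self.head_dim


# Values copied from the comparison table.
flux = ArchSpec(num_heads=24, head_dim=128, double_layers=19, single_layers=38)
tiny = ArchSpec(num_heads=2, head_dim=128, double_layers=3, single_layers=3)

width_ratio = flux.hidden_size / tiny.hidden_size
print(flux.hidden_size, tiny.hidden_size, width_ratio)
```

Because the head dimension is preserved, each attention head works in the same 128-dimensional space as in full Flux; only the number of heads and layers shrinks.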

	### Text Encoders

	TinyFlux replaces T5-XXL with the much smaller flan-t5-base for the token sequence and keeps CLIP-L for the pooled embedding:

	| Role | Flux | TinyFlux |
	|------|------|----------|
	| Sequence encoder | T5-XXL (4096 dim) | flan-t5-base (768 dim) |
	| Pooled encoder | CLIP-L (768 dim) | CLIP-L (768 dim) |

	## Training

	### Dataset

	Trained on [AbstractPhil/flux-schnell-teacher-latents](https://huggingface.co/datasets/AbstractPhil/flux-schnell-teacher-latents):
	- 10,000 samples
	- Pre-computed VAE latents (16, 64, 64) from 512×512 images
	- Diverse prompts covering people, objects, scenes, styles
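The latent shape follows from the Flux VAE's 8× spatial downsampling (512 / 8 = 64) and its 16 latent channels. The inference example in this README feeds the model a sequence of shape `(batch, 64*64, 16)`, i.e. one token per latent pixel. A minimal sketch of converting a `(16, 64, 64)` latent to that layout and back, assuming the channels-last flattening implied by the decode step; the helper names are hypothetical and a random tensor stands in for a dataset sample:

```python
import torch


def latent_to_tokens(latent: torch.Tensor) -> torch.Tensor:
    """(B, C, H, W) -> (B, H*W, C): one token per latent pixel."""
    b, c, h, w = latent.shape
    return latent.permute(0, 2, 3, 1).reshape(b, h * w, c)


def tokens_to_latent(tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """(B, H*W, C) -> (B, C, H, W), the inverse of latent_to_tokens."""
    b, n, c = tokens.shape
    return tokens.reshape(b, h, w, c).permute(0, 3, 1, 2)


latent = torch.randn(1, 16, 64, 64)  # stand-in for one pre-computed latent
tokens = latent_to_tokens(latent)
restored = tokens_to_latent(tokens, 64, 64)
print(tokens.shape)
```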

	### Training Details

	- Objective: Flow matching (rectified flow)
	- Timestep sampling: Logit-normal with Flux shift (s=3.0)
	- Loss weighting: Min-SNR-γ (γ=5.0)
	- Optimizer: AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01)
	- Schedule: Cosine with warmup
	- Precision: bfloat16
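The timestep sampling line can be illustrated with a short sketch. It assumes a logit-normal with location 0 and scale 1 (the sigmoid of a standard normal draw) and the commonly used shift form `s*t / (1 + (s-1)*t)`. The README states neither the logit-normal parameters nor the direction in which the shift is applied, so treat both as assumptions; the function names are illustrative and `train_colab.py` is the reference.

```python
import math
import random


def flux_shift(t: float, s: float = 3.0) -> float:
    """Assumed shift form s*t / (1 + (s-1)*t); maps [0, 1] onto [0, 1]."""
    return s * t / (1.0 + (s - 1.0) * t)


def sample_timestep(rng: random.Random, s: float = 3.0) -> float:
    """Logit-normal draw (sigmoid of a standard normal), then shifted."""
    u = rng.gauss(0.0, 1.0)
    t = 1.0 / (1.0 + math.exp(-u))
    return flux_shift(t, s)


rng = random.Random(0)
samples = [sample_timestep(rng) for _ in range(1000)]
print(min(samples), max(samples))
```

With `s=3.0` the midpoint `t=0.5` maps to `0.75`, so under this assumed form the sampled timesteps concentrate toward one end of the interval rather than around the middle.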

	### Flow Matching Formulation

	```
	Interpolation: x_t = (1 - t) * noise + t * data
	Target velocity: v = data - noise
	Loss: MSE(predicted_v, target_v) * min_snr_weight(t)
	```
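The three lines above can be sketched in PyTorch as follows. The interpolation and target follow the formulation directly. The weighting is an assumption: the README does not define `min_snr_weight`, so the sketch uses one common form, `min(SNR, γ) / SNR` with `SNR = (t / (1 - t))^2` for this interpolation. `model_fn` is a hypothetical stand-in for the TinyFlux forward pass.

```python
import torch
import torch.nn.functional as F


def min_snr_weight(t: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """Assumed form: min(SNR, gamma) / SNR with SNR = (t / (1 - t))^2."""
    snr = (t / (1.0 - t).clamp(min=1e-5)).pow(2).clamp(min=1e-5)
    return torch.clamp(snr, max=gamma) / snr


def flow_matching_loss(model_fn, data, t, gamma=5.0):
    """data: (B, N, C); t: (B,) in (0, 1); model_fn(x_t, t) -> velocity."""
    noise = torch.randn_like(data)
    t_b = t.view(-1, 1, 1)                  # broadcast t over tokens and channels
    x_t = (1.0 - t_b) * noise + t_b * data  # t=0 is pure noise, t=1 is data
    target_v = data - noise
    pred_v = model_fn(x_t, t)
    per_sample = F.mse_loss(pred_v, target_v, reduction="none").mean(dim=(1, 2))
    return (per_sample * min_snr_weight(t, gamma)).mean()


torch.manual_seed(0)
data = torch.randn(4, 16, 8)
t = torch.tensor([0.1, 0.4, 0.6, 0.9])
# A dummy model that always predicts zero velocity, just to exercise the loss.
loss = flow_matching_loss(lambda x, t: torch.zeros_like(x), data, t)
print(loss.item())
```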

	## Usage

	### Installation

	```bash
	pip install torch transformers diffusers safetensors huggingface_hub
	```

	### Inference

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
	from diffusers import AutoencoderKL

	# Load model (define TinyFlux and TinyFluxConfig first, e.g. by copying them from model.py)
	config = TinyFluxConfig()
	model = TinyFlux(config).to("cuda").to(torch.bfloat16)

	weights = load_file(hf_hub_download("AbstractPhil/tiny-flux", "model.safetensors"))
	model.load_state_dict(weights)
	model.eval()

	# Load encoders
	t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
	t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to("cuda")
	clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
	clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16).to("cuda")
	vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-schnell", subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")

	# Encode prompt
	prompt = "a photo of a cat"
	t5_in = t5_tok(prompt, max_length=128, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
	t5_out = t5_enc(**t5_in).last_hidden_state
	clip_in = clip_tok(prompt, max_length=77, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
	clip_out = clip_enc(**clip_in).pooler_output

	# Euler sampling (t: 0→1, noise→data)
	x = torch.randn(1, 64*64, 16, device="cuda", dtype=torch.bfloat16)
	img_ids = TinyFlux.create_img_ids(1, 64, 64, "cuda")
	timesteps = torch.linspace(0, 1, 21, device="cuda")

	guidance = torch.tensor([3.5], device="cuda", dtype=torch.bfloat16)

	with torch.no_grad():  # inference only, no gradients needed
	    for i in range(20):
	        t = timesteps[i].unsqueeze(0)
	        dt = timesteps[i + 1] - timesteps[i]

	        # Predict the velocity at the current timestep
	        v = model(
	            hidden_states=x,
	            encoder_hidden_states=t5_out,
	            pooled_projections=clip_out,
	            timestep=t,
	            img_ids=img_ids,
	            guidance=guidance,
	        )
	        x = x + v * dt  # Euler step toward the data

	# Decode
	latents = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)
	latents = latents / vae.config.scaling_factor
	image = vae.decode(latents.to(vae.dtype)).sample  # input dtype must match the bfloat16 VAE weights
	image = (image.float() / 2 + 0.5).clamp(0, 1)  # map [-1, 1] to [0, 1]
	```

	### Full Inference Script

	See [inference_colab.py](https://huggingface.co/AbstractPhil/tiny-flux/blob/main/inference_colab.py) for a complete generation pipeline with:
	- Classifier-free guidance
	- Batch generation
	- Image saving
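Classifier-free guidance is usually implemented by running the model twice per step, once with the prompt and once with an empty prompt, and extrapolating between the two velocity predictions. A minimal sketch of that combination rule follows. The function name and the `scale` default are illustrative, and `inference_colab.py` may implement guidance differently. This is also distinct from the `guidance` tensor passed as a model input in the snippet above.

```python
import torch


def cfg_velocity(v_cond: torch.Tensor, v_uncond: torch.Tensor, scale: float = 3.5) -> torch.Tensor:
    """Standard CFG rule: start at the unconditional prediction and
    move `scale` times the distance toward the conditional one."""
    return v_uncond + scale * (v_cond - v_uncond)


# Toy tensors standing in for two model outputs of shape (B, N, C).
v_cond = torch.ones(1, 4, 16)
v_uncond = torch.zeros(1, 4, 16)
v = cfg_velocity(v_cond, v_uncond, scale=3.5)
print(v[0, 0, 0].item())
```

With `scale=1.0` the rule returns the conditional prediction unchanged; larger values push the sample further toward the prompt at the cost of a second forward pass per step.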

	## Files

	```
	AbstractPhil/tiny-flux/
	├── model.safetensors # Model weights (~32MB)
	├── config.json # Model configuration
	├── README.md # This file
	├── model.py # Model architecture definition
	├── inference_colab.py # Inference script
	├── train_colab.py # Training script
	├── checkpoints/ # Training checkpoints
	│ └── step_*.safetensors
	├── logs/ # Tensorboard logs
	└── samples/ # Generated samples during training
	```

	## Limitations

	- Resolution: Trained on 512×512 only
	- Quality: Significantly lower than full Flux due to reduced capacity
	- Text understanding: Limited by smaller T5 encoder (768 vs 4096 dim)
	- Fine details: May struggle with complex scenes or fine-grained details
	- Experimental: Intended for research and learning, not production use

	## Intended Use

	- Understanding Flux/MMDiT architecture
	- Rapid prototyping and experimentation
	- Educational purposes
	- Resource-constrained environments
	- Baseline for architecture modifications

	## Citation

	If you use TinyFlux in your research, please cite:

	```bibtex
	@misc{tinyflux2025,
	  title={TinyFlux: A Miniaturized Flux Architecture for Experimentation},
	  author={AbstractPhil},
	  year={2025},
	  url={https://huggingface.co/AbstractPhil/tiny-flux}
	}
	```

	## Acknowledgments

	- [Black Forest Labs](https://blackforestlabs.ai/) for the original Flux architecture
	- [Hugging Face](https://huggingface.co/) for the diffusers and transformers libraries

	## License

	MIT License - See LICENSE file for details.

	---

	Note: This is an experimental research model. For high-quality image generation, use the full [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) or [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) models.