---
license: mit
language:
- en
tags:
- diffusion
- flow-matching
- flux
- text-to-image
- image-generation
- tiny
- experimental
library_name: pytorch
pipeline_tag: text-to-image
base_model:
- black-forest-labs/FLUX.1-schnell
datasets:
- AbstractPhil/flux-schnell-teacher-latents
---

# TinyFlux

A **/12 scaled** Flux architecture for experimentation and research. TinyFlux keeps the core MMDiT (Multimodal Diffusion Transformer) design of Flux while cutting the parameter count from roughly 12B to roughly 8M, for faster iteration and lower resource requirements.

## Model Description

TinyFlux is a miniaturized version of [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) that preserves the essential architectural components:

- **Double-stream blocks** (MMDiT style) - separate text/image pathways with joint attention
- **Single-stream blocks** - concatenated text+image with shared weights
- **AdaLN-Zero modulation** - adaptive layer norm with gating
- **3D RoPE** - rotary position embeddings for temporal + spatial positions
- **Flow matching** - rectified flow training objective

### Architecture Comparison

| Component | Flux | TinyFlux | Scale |
|-----------|------|----------|-------|
| Hidden size | 3072 | 256 | /12 |
| Attention heads | 24 | 2 | /12 |
| Head dimension | 128 | 128 | preserved |
| Double-stream layers | 19 | 3 | ~/6 |
| Single-stream layers | 38 | 3 | ~/12 |
| VAE channels | 16 | 16 | preserved |
| **Total params** | ~12B | ~8M | ~/1500 |

### Text Encoders

TinyFlux uses a smaller sequence encoder than standard Flux and keeps the same pooled encoder:

| Role | Flux | TinyFlux |
|------|------|----------|
| Sequence encoder | T5-XXL (4096 dim) | flan-t5-base (768 dim) |
| Pooled encoder | CLIP-L (768 dim) | CLIP-L (768 dim) |

## Training

### Dataset

Trained on [AbstractPhil/flux-schnell-teacher-latents](https://huggingface.co/datasets/AbstractPhil/flux-schnell-teacher-latents):

- 10,000 samples
- Pre-computed VAE latents (16, 64, 64) from 512×512 images
- Diverse prompts covering people, objects, scenes, styles

### Training Details

- **Objective**: Flow matching (rectified flow)
- **Timestep sampling**: Logit-normal with Flux shift (s=3.0)
- **Loss weighting**: Min-SNR-γ (γ=5.0)
- **Optimizer**: AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01)
- **Schedule**: Cosine with warmup
- **Precision**: bfloat16

### Flow Matching Formulation

```
Interpolation: x_t = (1 - t) * noise + t * data
Target velocity: v = data - noise
Loss: MSE(predicted_v, target_v) * min_snr_weight(t)
```

## Usage

### Installation

```bash
pip install torch transformers diffusers safetensors huggingface_hub sentencepiece
```

### Inference

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL

# Load model. TinyFluxConfig and TinyFlux are not imported here:
# copy the class definitions into scope first (see model.py in the repo).
config = TinyFluxConfig()
model = TinyFlux(config).to("cuda").to(torch.bfloat16)
weights = load_file(hf_hub_download("AbstractPhil/tiny-flux", "model.safetensors"))
model.load_state_dict(weights)
model.eval()

# Load encoders and VAE
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to("cuda")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16).to("cuda")
vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-schnell", subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")

prompt = "a photo of a cat"

# Inference only, so gradients are disabled throughout
with torch.no_grad():
    # Encode prompt
    t5_in = t5_tok(prompt, max_length=128, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
    t5_out = t5_enc(**t5_in).last_hidden_state
    clip_in = clip_tok(prompt, max_length=77, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
    clip_out = clip_enc(**clip_in).pooler_output

    # Euler sampling (t: 0→1, noise→data)
    x = torch.randn(1, 64*64, 16, device="cuda", dtype=torch.bfloat16)
    img_ids = TinyFlux.create_img_ids(1, 64, 64, "cuda")
    timesteps = torch.linspace(0, 1, 21, device="cuda")
    guidance = torch.tensor([3.5], device="cuda", dtype=torch.bfloat16)

    for i in range(20):
        t = timesteps[i].unsqueeze(0)
        dt = timesteps[i+1] - timesteps[i]
        v = model(
            hidden_states=x,
            encoder_hidden_states=t5_out,
            pooled_projections=clip_out,
            timestep=t,
            img_ids=img_ids,
            guidance=guidance,
        )
        x = x + v * dt

    # Decode: (1, 4096, 16) tokens -> (1, 16, 64, 64) latents
    latents = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)
    latents = latents / vae.config.scaling_factor
    # Match the VAE's dtype (bfloat16 here); a float32 input would raise a dtype mismatch
    image = vae.decode(latents.to(vae.dtype)).sample
    image = (image.float() / 2 + 0.5).clamp(0, 1)
```

### Full Inference Script

See [inference_colab.py](https://huggingface.co/AbstractPhil/tiny-flux/blob/main/inference_colab.py) for a complete generation pipeline with:

- Classifier-free guidance
- Batch generation
- Image saving

## Files

```
AbstractPhil/tiny-flux/
├── model.safetensors      # Model weights (~32MB)
├── config.json            # Model configuration
├── README.md              # This file
├── model.py               # Model architecture definition
├── inference_colab.py     # Inference script
├── train_colab.py         # Training script
├── checkpoints/           # Training checkpoints
│   └── step_*.safetensors
├── logs/                  # Tensorboard logs
└── samples/               # Generated samples during training
```

## Limitations

- **Resolution**: Trained on 512×512 only
- **Quality**: Significantly lower than full Flux due to reduced capacity
- **Text understanding**: Limited by smaller T5 encoder (768 vs 4096 dim)
- **Fine details**: May struggle with complex scenes or fine-grained details
- **Experimental**: Intended for research and learning, not production use

## Intended Use

- Understanding Flux/MMDiT architecture
- Rapid prototyping and experimentation
- Educational purposes
- Resource-constrained environments
- Baseline for architecture modifications

## Citation

If you use TinyFlux in your research, please cite:

```bibtex
@misc{tinyflux2025,
  title={TinyFlux: A Miniaturized Flux Architecture for Experimentation},
  author={AbstractPhil},
  year={2025},
  url={https://huggingface.co/AbstractPhil/tiny-flux}
}
```

## Acknowledgments

- [Black Forest Labs](https://blackforestlabs.ai/) for the original Flux architecture
- [Hugging Face](https://huggingface.co/) for the diffusers and transformers libraries

## License

MIT License - See LICENSE file for details.

---

**Note**: This is an experimental research model. For high-quality image generation, use the full [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) or [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) models.
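
The quantities named under Training Details and Flow Matching Formulation, together with the Euler loop from the Inference example, can be sketched in plain Python without any model. This is an illustrative sketch, not the code in `train_colab.py`. All function names are hypothetical, and two forms are assumptions: the shift is taken to be `s * t / (1 + (s - 1) * t)` applied directly to `t`, and the Min-SNR weight is taken to be `min(SNR, γ) / SNR` with `SNR = (t / (1 - t))^2`, which follows from the interpolation if data and noise have unit variance. The training script may define either differently.

```python
import math
import random


def flux_shift(t, shift=3.0):
    """Assumed form of the timestep shift: s*t / (1 + (s-1)*t)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def sample_timestep(rng, shift=3.0):
    """Logit-normal sample (sigmoid of a standard normal), then the assumed shift."""
    u = rng.gauss(0.0, 1.0)
    t = 1.0 / (1.0 + math.exp(-u))
    return flux_shift(t, shift)


def min_snr_weight(t, gamma=5.0, eps=1e-5):
    """Assumed Min-SNR weight: min(SNR, gamma) / SNR, SNR = (t / (1 - t))^2."""
    snr = (t / max(1.0 - t, eps)) ** 2
    return min(snr, gamma) / max(snr, eps)


def interpolate(noise, data, t):
    """x_t = (1 - t) * noise + t * data, elementwise."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """v = data - noise, elementwise."""
    return [d - n for n, d in zip(noise, data)]


def weighted_loss(pred_v, target_v, t):
    """MSE(predicted_v, target_v) * min_snr_weight(t)."""
    mse = sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(pred_v)
    return mse * min_snr_weight(t)


def euler_sample(velocity_fn, noise, steps=20):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = list(noise)
    for i in range(steps):
        t = i / steps
        dt = (i + 1) / steps - t
        v = velocity_fn(x, t)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x


# Toy check: when the "dataset" is a single point, the exact velocity field
# is (data - x) / (1 - t), and Euler sampling should land on that point.
rng = random.Random(0)
data = [1.0, -2.0, 0.5]
noise = [rng.gauss(0.0, 1.0) for _ in data]

t = sample_timestep(rng)
x_t = interpolate(noise, data, t)
v_target = target_velocity(noise, data)
loss_at_target = weighted_loss(v_target, v_target, t)  # perfect prediction

result = euler_sample(
    lambda x, tt: [(d - xi) / (1.0 - tt) for d, xi in zip(data, x)],
    noise,
)
print("sampled t:", t)
print("loss for a perfect prediction:", loss_at_target)
print("max error after sampling:", max(abs(r - d) for r, d in zip(result, data)))
```

In the toy check a closed-form velocity field stands in for the model, so the only thing exercised is that the sampler's direction (t from 0 to 1, `x = x + v * dt`) is consistent with the training interpolation and target velocity.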