---
license: mit
---

# Flow Matching & Diffusion Prediction Types

## A Practical Guide to Sol, Lune, and Epsilon Prediction

---

## Overview

This document covers three distinct prediction paradigms used in diffusion and flow-matching models. Each was designed for a different purpose and requires its own sampling procedure.

| Model | Prediction Type | What It Learned | Output Character |
|-------|----------------|-----------------|------------------|
| **Standard SD1.5** | ε (epsilon/noise) | Remove noise | General purpose |
| **Sol** | v (velocity) via DDPM | Geometric structure | Flat silhouettes, mass placement |
| **Lune** | v (velocity) via flow | Texture and detail | Rich, detailed images |

---

SD15-Flow-Sol (velocity prediction, converted to epsilon for sampling):
https://huggingface.co/AbstractPhil/tinyflux-experts/resolve/main/inference_sd15_flow_sol.py

![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/FeF5L08KaozTq8X4TXaTU.png)

SD15-Flow-Lune (rectified flow shift=2):
https://huggingface.co/AbstractPhil/tinyflux-experts/resolve/main/inference_sd15_flow_lune.py

![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/a33DpYjD_cwdfXm43SlS8.png)

TinyFlux-Lailah: TinyFlux is still in planning and training and is not yet ready for production use.
https://huggingface.co/AbstractPhil/tiny-flux-deep

![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/9Ek_vTrMDQUA1id37Lwys.png)

## 1. Epsilon (ε) Prediction: Standard Diffusion

### Core Concept

> **"Predict the noise that was added"**

The model learns to identify and remove noise from corrupted images.

### The Formula (Simplified)

```
TRAINING:
    x_noisy = √(α) * x_clean + √(1-α) * noise
        ↓
    Model predicts: ε̂ = "what noise was added?"
        ↓
    Loss = ||ε̂ - noise||²

SAMPLING:
    Start with pure noise
    Repeatedly ask: "what noise is in this?"
    Subtract a fraction of predicted noise
    Repeat until clean
```

### Reading the Math

- **α (alpha)**: "How much original image remains" (1 = all original, 0 = all noise); in the code below this is the cumulative value `alphas_cumprod[t]`
- **√(1-α)**: "How much noise was mixed in"
- **ε**: The actual noise that was added
- **ε̂**: Model's guess of what noise was added

### Training Process

```python
import torch
import torch.nn.functional as F

# Forward diffusion (corruption)
noise = torch.randn_like(x_clean)
# Cumulative alpha per sampled timestep, reshaped to broadcast over (B, C, H, W)
alpha_bar = scheduler.alphas_cumprod[t].view(-1, 1, 1, 1)
x_noisy = alpha_bar.sqrt() * x_clean + (1 - alpha_bar).sqrt() * noise

# Model predicts noise
eps_pred = model(x_noisy, t)

# Loss: "Did you correctly identify the noise?"
loss = F.mse_loss(eps_pred, noise)
```

### Sampling Process

```python
# DDPM/DDIM sampling
for t in reversed(timesteps):  # 999 → 0
    eps_pred = model(x, t)
    # diffusers schedulers return an output object; .prev_sample is the denoised latent
    x = scheduler.step(eps_pred, t, x).prev_sample  # Removes predicted noise
```

### Utility & Behavior

- **Strength**: General-purpose image generation
- **Weakness**: No explicit understanding of image structure
- **Use case**: Standard text-to-image generation

---

## 2. Velocity (v) Prediction: Sol (DDPM Framework)

### Core Concept

> **"Predict the direction from noise to data"**

Sol predicts velocity but operates within the DDPM scheduler framework, so its output must be converted from velocity to epsilon before each scheduler step.

### The Formula (Simplified)

```
TRAINING:
    x_t = α * x_clean + σ * noise    (same as DDPM)
    v   = α * noise - σ * x_clean    (velocity target)
        ↓
    Model predicts: v̂ = "which way is the image?"
        ↓
    Loss = ||v̂ - v||²

SAMPLING:
    Convert velocity → epsilon
    Use standard DDPM scheduler stepping
```

### Reading the Math

- **v (velocity)**: Direction vector in latent space
- **α (alpha)**: √(α_cumprod), the signal strength
- **σ (sigma)**: √(1 - α_cumprod), the noise strength
- **The velocity formula**: `v = α * ε - σ * x₀`
  - "Velocity is the signal-weighted noise minus noise-weighted data"

### Why Velocity in DDPM?

Sol was trained with David (the geometric assessor) providing loss weighting.
This setup used:

- DDPM noise schedule for interpolation
- Velocity prediction for training target
- Knowledge distillation from a teacher

The result: Sol learned **geometric structure** rather than textures.

### Training Process (David-Weighted)

```python
import torch

# DDPM-style corruption
noise = torch.randn_like(latents)
t = torch.randint(0, 1000, (batch,))
# α = sqrt(ᾱ_t), σ = sqrt(1 - ᾱ_t), reshaped to broadcast over (B, C, H, W)
alpha_bar = scheduler.alphas_cumprod[t].view(-1, 1, 1, 1)
alpha = alpha_bar.sqrt()
sigma = (1 - alpha_bar).sqrt()
x_t = alpha * latents + sigma * noise

# Velocity target (NOT epsilon!)
v_target = alpha * noise - sigma * latents

# Model predicts velocity
v_pred = model(x_t, t)

# David assesses geometric quality → adjusts loss weights
# (david_assessor, features, and weighted_MSE are project-specific and not defined here)
loss_weights = david_assessor(features, t)
loss = weighted_MSE(v_pred, v_target, loss_weights)
```

### Sampling Process (CRITICAL: v → ε conversion)

```python
from diffusers import DDPMScheduler

# Must convert velocity to epsilon for DDPM scheduler
scheduler = DDPMScheduler(num_train_timesteps=1000)

for t in scheduler.timesteps:  # descending timesteps (high noise → low noise)
    v_pred = model(x, t)

    # Convert velocity → epsilon
    alpha_bar = scheduler.alphas_cumprod[t]
    alpha = alpha_bar.sqrt()
    sigma = (1 - alpha_bar).sqrt()

    # Solve: v = α*ε - σ*x₀  and  x_t = α*x₀ + σ*ε
    # Result: x₀ = (α*x_t - σ*v) / (α² + σ²)
    #         ε  = (x_t - α*x₀) / σ
    # Since α² + σ² = 1 here, this is the same as ε = σ*x_t + α*v
    x0_hat = (alpha * x - sigma * v_pred) / (alpha**2 + sigma**2)
    eps_hat = (x - alpha * x0_hat) / sigma

    x = scheduler.step(eps_hat, t, x).prev_sample  # Standard DDPM step with epsilon
```

### Utility & Behavior

- **What Sol learned**: Platonic forms, silhouettes, mass distribution
- **Visual output**: Flat geometric shapes, correct spatial layout, no texture
- **Why this happened**: David rewarded geometric coherence, and Sol optimized for clean David classification
- **Use case**: Structural guidance, composition anchoring, "what goes where"

### Sol's Unique Property

Sol never "collapsed"; it learned the **skeleton** of images:

- Castle prompt → Castle silhouette, horizon line, sky gradient
- Portrait prompt → Head oval, shoulder mass, figure-ground separation
- City prompt → Building masses, street perspective, light positions

This is the "WHAT before HOW" that most diffusion models skip.

---

## 3. Velocity (v) Prediction: Lune (Rectified Flow)

### Core Concept

> **"Predict the straight-line velocity between data and noise"**

Lune uses true rectified flow matching, where samples travel in straight lines between noise and data in latent space.

### The Formula (Simplified)

```
TRAINING:
    x_t = σ * noise + (1-σ) * data    (linear interpolation)
    v   = noise - data                (constant velocity)
        ↓
    Model predicts: v̂ = "straight line to noise"
        ↓
    Loss = ||v̂ - v||²

SAMPLING:
    Start at σ=1 (noise)
    Walk OPPOSITE to velocity (toward data)
    End at σ=0 (clean image)
```

### Reading the Math

- **σ (sigma)**: Interpolation parameter (1 = noise, 0 = data)
- **x_t = σ·noise + (1-σ)·data**: Linear blend between noise and data
- **v = noise - data**: The velocity is CONSTANT along the path
- **Shift function**: `σ' = shift·σ / (1 + (shift-1)·σ)`
  - For shift > 1 this raises every intermediate σ (for example, σ = 0.5 becomes 0.75 at shift = 3), so a uniformly spaced schedule takes smaller steps near σ = 1 and spends more of its steps at high noise levels

### Key Difference from Sol

| Aspect | Sol | Lune |
|--------|-----|------|
| Interpolation | DDPM (α, σ from scheduler) | Linear (σ, 1-σ) |
| Velocity meaning | Complex (α·ε - σ·x₀) | Simple (noise - data) |
| Sampling | Convert v→ε, use scheduler | Direct Euler integration |
| Output | Geometric skeletons | Detailed images |

### Training Process

```python
# Linear interpolation (NOT DDPM schedule!)
noise = torch.randn_like(latents)
sigma = torch.rand(batch, device=latents.device)  # Random sigma in [0, 1)

# Apply shift during training
sigma = (shift * sigma) / (1 + (shift - 1) * sigma)
sigma_b = sigma.view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)

x_t = sigma_b * noise + (1 - sigma_b) * latents

# Velocity target: direction FROM data TO noise
v_target = noise - latents

# Model predicts velocity; the timestep stays 1-D with shape (B,)
v_pred = model(x_t, sigma * 1000)  # Timestep = σ * 1000

loss = torch.nn.functional.mse_loss(v_pred, v_target)
```

### Sampling Process (Direct Euler)

```python
def shift_sigma(sigmas, shift):
    # σ' = shift·σ / (1 + (shift-1)·σ)
    return shift * sigmas / (1 + (shift - 1) * sigmas)

steps = 20  # number of Euler steps (illustrative value)

# Start from pure noise (σ = 1)
x = torch.randn(1, 4, 64, 64)

# Sigma schedule: 1 → 0 with shift
sigmas = torch.linspace(1, 0, steps + 1)
sigmas = shift_sigma(sigmas, shift=3.0)

for i in range(steps):
    sigma = sigmas[i]
    sigma_next = sigmas[i + 1]
    dt = sigma - sigma_next  # Positive (going from 1 toward 0)

    timestep = sigma * 1000
    v_pred = model(x, timestep)

    # SUBTRACT velocity (v points toward noise, we go toward data)
    x = x - v_pred * dt

# x is now clean image latent
```

### Why SUBTRACT the Velocity?

```
v = noise - data   (points FROM data TO noise)

We want to go FROM noise TO data (opposite direction!)

So: x_new = x_current - v * dt
          = x_current - (noise - data) * dt
          = x_current + (data - noise) * dt   ← Moving toward data ✓
```

### Utility & Behavior

- **What Lune learned**: Rich textures, fine details, realistic rendering
- **Visual output**: Full detailed images with lighting, materials, depth
- **Training focus**: Portrait/pose data with caption augmentation
- **Use case**: High-quality image generation, detail refinement

---

## Comparison Summary

### Training Targets

```
EPSILON (ε):     target = noise
                 "What random noise was added?"

VELOCITY (Sol):  target = α·noise - σ·data
                 "What's the DDPM-weighted direction?"

VELOCITY (Lune): target = noise - data
                 "What's the straight-line direction?"
```

### Sampling Directions

```
EPSILON:
    x_new = scheduler.step(ε_pred, t, x)
    Scheduler handles noise removal internally

VELOCITY (Sol):
    Convert v → ε, then scheduler.step(ε, t, x)
    Must translate to epsilon for DDPM math

VELOCITY (Lune):
    x_new = x - v_pred * dt
    Direct Euler integration, subtract velocity
```

### Visual Intuition

```
EPSILON:
    "There's noise hiding the image"
    "I'll predict and remove the noise layer by layer"
    → General-purpose denoising

VELOCITY (Sol):
    "I know which direction the image is"
    "But I speak through DDPM's noise schedule"
    → Learned structure, outputs skeletons

VELOCITY (Lune):
    "Straight line from noise to image"
    "I'll walk that line step by step"
    → Learned detail, outputs rich images
```

---

## Practical Implementation Checklist

### For Epsilon Models (Standard SD1.5)

- [ ] Use DDPM/DDIM/Euler scheduler
- [ ] Pass timestep as integer [0, 999]
- [ ] Scheduler handles everything

### For Sol (Velocity + DDPM)

- [ ] Use DDPMScheduler
- [ ] Model outputs velocity, NOT epsilon
- [ ] Convert: `x0 = (α·x - σ·v) / (α² + σ²)`, then `ε = (x - α·x0) / σ`
- [ ] Call `scheduler.step(ε, t, x)`
- [ ] Expect geometric/structural output

### For Lune (Velocity + Flow)

- [ ] NO scheduler needed: direct Euler
- [ ] Sigma goes 1 → 0 (not 0 → 1!)
- [ ] Apply shift: `σ' = shift·σ / (1 + (shift-1)·σ)`
- [ ] Timestep to model: `σ * 1000`
- [ ] SUBTRACT velocity: `x = x - v * dt`
- [ ] Expect detailed textured output

---

## Why This Matters for TinyFlux

TinyFlux can leverage both experts:

1. **Sol (early timesteps)**: Provides geometric anchoring
   - "Where should the castle be?"
   - "What's the horizon line?"
   - "How is mass distributed?"

2. **Lune (mid/late timesteps)**: Provides detail refinement
   - "What texture is the stone?"
   - "How does light fall?"
   - "What color is the sky?"

By combining geometric structure (Sol) with textural detail (Lune), TinyFlux can achieve better composition AND quality than either alone.

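
The Sol and Lune checklist formulas above can be sanity-checked on scalar toy values before wiring them into a real pipeline. The following is a minimal pure-Python sketch, not code from the released inference scripts: the helper names `v_to_eps`, `shift_sigma`, and `euler_sample`, the toy numbers, and the "oracle" velocity that stands in for a trained model are all assumptions made for illustration.

```python
import math

def v_to_eps(x_t, v, alpha, sigma):
    """Sol-style conversion: recover (x0, eps) from a DDPM velocity value."""
    denom = alpha ** 2 + sigma ** 2  # equals 1 for a variance-preserving schedule
    x0 = (alpha * x_t - sigma * v) / denom
    eps = (x_t - alpha * x0) / sigma
    return x0, eps

def shift_sigma(sigma, shift):
    """Lune-style shift: sigma' = shift*sigma / (1 + (shift-1)*sigma)."""
    return shift * sigma / (1 + (shift - 1) * sigma)

def euler_sample(noise, velocity_fn, steps, shift):
    """Integrate x <- x - v*dt along a shifted sigma schedule from 1 down to 0."""
    sigmas = [shift_sigma(1 - i / steps, shift) for i in range(steps + 1)]
    x = noise
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x - velocity_fn(x, s) * (s - s_next)
    return x

# --- Sol: build x_t and v from known x0/eps, then invert them ---
alpha_bar = 0.36  # toy cumulative alpha (assumed value)
alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1 - alpha_bar)
x0_true, eps_true = 1.5, -0.7
x_t = alpha * x0_true + sigma * eps_true
v = alpha * eps_true - sigma * x0_true
x0_hat, eps_hat = v_to_eps(x_t, v, alpha, sigma)

# --- Lune: an oracle that returns the exact constant velocity noise - data ---
data, noise = 2.0, -1.0
x_final = euler_sample(noise, lambda x, s: noise - data, steps=8, shift=3.0)

print(x0_hat, eps_hat, x_final)
```

With an exact velocity the Euler walk ends on `data` for any step count, because the path is a straight line and the step sizes sum to 1; a trained model only approximates that velocity, which is where the step count and shift start to matter.
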

---

## Quick Reference Card

```
┌─────────────────────────────────────────────────────────────┐
│                      PREDICTION TYPES                       │
├─────────────────────────────────────────────────────────────┤
│  EPSILON (ε)                                                │
│  Train:  target = noise                                     │
│  Sample: scheduler.step(ε_pred, t, x)                       │
│  Output: General images                                     │
├─────────────────────────────────────────────────────────────┤
│  VELOCITY - SOL (DDPM framework)                            │
│  Train:  target = α·ε - σ·x₀                                │
│  Sample: v→ε conversion, then scheduler.step(ε, t, x)       │
│  Output: Geometric skeletons                                │
├─────────────────────────────────────────────────────────────┤
│  VELOCITY - LUNE (Rectified Flow)                           │
│  Train:  target = noise - data                              │
│  Sample: x = x - v·dt  (Euler, σ: 1→0)                      │
│  Output: Detailed textured images                           │
└─────────────────────────────────────────────────────────────┘
```

---

*Document Version: 1.0*

*Last Updated: January 2026*

*Authors: AbstractPhil & Claude OPUS 4.5*

License: MIT