krystv/LatentRecurrentFlow

@@ -8,7 +8,6 @@ tags:
   - recursive-reasoning
   - novel-architecture
   - subquadratic-attention
-  - gated-linear-attention
   - research
 library_name: lrf
 pipeline_tag: text-to-image
@@ -19,590 +18,192 @@ license: apache-2.0
 > A genuinely new architecture for image generation designed from scratch to run on consumer devices with 3–4 GB RAM, trained on 16 GB budgets.
----
-## Table of Contents
-1. [Architecture Overview](#1-architecture-overview)
-2. [Shortlist of Most Relevant Papers](#2-shortlist-of-most-relevant-papers)
-3. [Paper Critiques](#3-paper-critiques)
-4. [Full Proposed Architecture](#4-full-proposed-architecture-latentrecurrentflow)
-5. [Module-by-Module Diagram](#5-module-by-module-diagram)
-6. [Mathematical Formulation](#6-mathematical-formulation)
-7. [Training Objective & Losses](#7-training-objective--losses)
-8. [Memory & Compute Budget](#8-memory--compute-budget)
-9. [Training Curriculum](#9-training-curriculum)
-10. [Deployment Plan for Mobile](#10-deployment-plan-for-mobile)
-11. [Failure Mode Analysis](#11-failure-mode-analysis)
-12. [Ablation Plan](#12-ablation-plan)
-13. [Editing Roadmap](#13-editing-roadmap)
----
-## 1. Architecture Overview
-LRF combines five key innovations into a single coherent architecture:
-| Innovation | Source Inspiration | What It Does |
-|---|---|---|
-| **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
-| **Gated Linear Diffusion (GLD)** blocks | ViG/GLA + DyDiLA | O(N) subquadratic spatial mixing replacing O(N²) attention |
-| **Compact f=16 VAE** | SANA DC-AE + SnapGen | 16× spatial compression with ~280K decoder |
-| **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
-| **Multimodal Conditioning** | OmniGen | Same core supports text-to-image AND editing via additive image conditioning |
-### Key Numbers (Tiny Config — 5.7M params)
-| Component | Parameters | FP32 Size | INT8 Size |
-|---|---|---|---|
-| VAE Encoder | 777K | 3.0 MB | 0.7 MB |
-| VAE Decoder | 283K | 1.1 MB | 0.3 MB |
-| Text Encoder | 4.5M | 17.3 MB | 4.3 MB |
-| Denoising Core | 102K | 0.4 MB | 0.1 MB |
-| **Total** | **5.7M** | **21.7 MB** | **5.4 MB** |
-### Key Numbers (Default Config — 16.3M params)
-| Component | Parameters | FP32 Size | INT8 Size |
-|---|---|---|---|
-| VAE Encoder | 3.1M | 11.7 MB | 2.9 MB |
-| VAE Decoder | 1.1M | 4.1 MB | 1.0 MB |
-| Text Encoder | 11.5M | 43.9 MB | 11.0 MB |
-| Denoising Core | 651K | 2.5 MB | 0.6 MB |
-| **Total** | **16.3M** | **62.2 MB** | **15.6 MB** |
----
-## 2. Shortlist of Most Relevant Papers
-### A. Subquadratic Spatial Mixing for Image Generation
-| Paper | arxiv | Key Contribution | FID Result |
-|---|---|---|---|
-| **PDE-SSM-DiT** | 2603.13663 | Fourier PDE operator replaces attention, O(N log N), 34× speedup | 18.36 (CelebA-HQ 256) |
-| **DiMSUM** (NeurIPS 2024) | 2411.04168 | Mamba + wavelet subbands + shared transformer | **2.11** (CelebA-HQ 256) |
-| **ViG/GLA** | 2405.18425 | Gated Linear Attention with 2D locality injection | 90% less memory at 1024² |
-| **DyDiLA** | 2601.13683 | Dynamic differential linear attention | **6.80** (SubIN 256) |
-| **Mamba2D** | 2412.16146 | True 2D SSM with wavefront scan | 84.0% top-1 IN-1K (27M) |
-### B. Recursive/Iterative Reasoning
-| Paper | arxiv | Key Contribution |
-|---|---|---|
-| **HRM** | 2506.21734 | 2-level recurrent fixed-point reasoning, O(1) memory via IFT |
-| **TRM** (6473 ⭐) | 2510.04871 | 7M params → 45% ARC-AGI-1 via deep recursion |
-| **Thinking Pixel** | 2604.25299 | Sparse MoE adapters for recursive visual reasoning in DiT |
-### C. Compact Latent Spaces
-| Paper | arxiv | Compression | Quality |
-|---|---|---|---|
-| **SANA DC-AE** | 2410.10629 | f=32, C=32 → 32×32 latents for 1024² | PSNR 29.29, rFID 0.34 |
-| **SnapGen** | 2412.09619 | 1.38M tiny decoder (35× smaller than SD3) | PSNR 27.85 |
-| **TiTok** | 2406.07550 | 32 tokens per 256² image | gFID 1.97 (IN-256) |
-| **MobileDiffusion** | 2311.16567 | f=8, c=8 VAE, sub-second on iPhone | Better than SD-1.5 at 8 steps |
-### D. Few-Step Generation
-| Paper | arxiv | Key Result |
-|---|---|---|
-| **Consistency Models** | 2303.01469 | One-step generation from diffusion |
-| **LCM** | 2310.04378 | 2-4 step high-quality via consistency distillation |
-| **SD3.5-Flash** | 2509.21318 | Few-step distillation with timestep sharing |
-### E. Unified Generation + Editing
-| Paper | arxiv | Key Contribution |
-|---|---|---|
-| **OmniGen** | 2409.11340 | Single model for T2I + editing + control, interleaved image-text input |
-| **OmniGen2** | 2506.18871 | Dual decoding pathways, decoupled image tokenizer |
-| **InstructPix2Pix** | 2211.09800 | Image editing from text instructions |
-### F. Mobile Deployment
-| Paper | arxiv | Device Performance |
-|---|---|---|
-| **SnapGen** | 2412.09619 | 1.4s on iPhone 15 Pro, 372M UNet |
-| **SnapGen++** | 2601.08303 | 1.8s on iPhone 16, 0.4B sub-DiT |
-| **MobileDiffusion** | 2311.16567 | Sub-second on iPhone, ~400M params |
----
-## 3. Paper Critiques
-### PDE-SSM (2603.13663) ✅ Borrowed: Physical inductive bias concept
-- **Why it helps**: 34× speedup from FFT-based spatial operator with physically grounded bias
-- **What it fails at**: FID still behind DiMSUM (18.36 vs 2.11); requires FFT which is non-trivial on mobile
-- **Borrowed**: Concept of learnable PDE-style spatial operators; we adapt this to our GLD blocks
-### HRM/TRM (2506.21734, 2510.04871) ✅ Borrowed: Core recursive architecture
-- **Why it helps**: O(1) memory backprop via IFT; extreme parameter efficiency (7M → 45% ARC-AGI)
-- **What it fails at**: Never applied to image generation; fixed-point convergence not guaranteed for images
-- **Borrowed**: Two-level recursion (abstract + detail), IFT training, recursion depth embedding
-### ViG/GLA (2405.18425) ✅ Borrowed: Spatial mixing block
-- **Why it helps**: Hardware-aware, 90% memory savings, bidirectional GLA with locality injection
-- **What it fails at**: Only tested on classification/detection, not generation
-- **Borrowed**: Bidirectional GLA core, depthwise conv locality injection (GaLI), token differential (from DyDiLA)
-### SANA DC-AE (2410.10629) ✅ Borrowed: Latent space design principles
-- **Why it helps**: f=32 achieves similar quality to f=8 but 16× fewer tokens
-- **What it fails at**: Decoder is still large (50M); typography needs decoder-only LLM text encoder
-- **Borrowed**: High-compression VAE principle; we use f=16 as a compromise for fine detail
-### SnapGen (2412.09619) ✅ Borrowed: Tiny decoder architecture
-- **Why it helps**: 35× smaller decoder, 54× faster decode, negligible quality loss
-- **What it fails at**: Proprietary weights; still uses quadratic attention in the UNet backbone
-- **Borrowed**: Attention-free decoder, SepConv, minimal GroupNorm, SiLU instead of GELU
-### TiTok (2406.07550) ❌ Rejected: Too aggressive compression
-- **Why it was considered**: 32 tokens per image is incredibly compact
-- **Why rejected**: rFID=16.2 means visible artifacts; fine detail and typography badly degraded at 32 tokens
-### DiMSUM (2411.04168) ⚠️ Partially borrowed: Wavelet concept
-- **Why it helps**: Best FID (2.11) among SSM-based approaches
-- **What it fails at**: Still uses cross-attention fusion → partially quadratic; complex architecture
-- **Borrowed**: Wavelet decomposition concept for frequency-aware processing
----
-## 4. Full Proposed Architecture: LatentRecurrentFlow
-### Name: **LatentRecurrentFlow (LRF)**
-LRF is a **recursive flow-matching image generator** that uses:
-- A compact VAE with f=16 compression and a ~280K tiny decoder
-- A **Recursive Latent Refinement (RLR) core** that iteratively refines image latents through shared GLD blocks
-- A **rectified flow** training objective for clean few-step generation
-- **Additive image conditioning** for editing-readiness
-The core insight: **instead of stacking many unique layers, reuse a small set of blocks recursively**. This exploits the observation from HRM/TRM that iterative application of the same function can converge to a fixed point that represents the solution — analogous to how diffusion models iteratively denoise.
----
-## 5. Module-by-Module Diagram
 ```
-┌─────────────────────────────────────────────────────────────┐
-│                    LatentRecurrentFlow                        │
-│                                                              │
-│  ┌─────────────┐    ┌──────────────┐    ┌────────────────┐  │
-│  │   Compact    │    │   Simple     │    │   Rectified    │  │
-│  │     VAE      │    │   Text       │    │   Flow         │  │
-│  │  (f=16)      │    │   Encoder    │    │   Scheduler    │  │
-│  │              │    │              │    │                │  │
-│  │  Encoder ────┤    │  Embed ──────┤    │  t ~ U[0,1]   │  │
-│  │  (3.1M)      │    │  Transformer │    │  z_t = (1-t)  │  │
-│  │              │    │  (11.5M)     │    │    z_0 + tε    │  │
-│  │  Decoder ────┤    │              │    │                │  │
-│  │  (1.1M, tiny)│    │  → text_emb  │    │  v = ε - z_0  │  │
-│  └──────┬───────┘    │  → text_glob │    └────────┬───────┘  │
-│         │            └──────┬───────┘             │          │
-│         │                   │                     │          │
-│  ┌──────▼───────────────────▼─────────────────────▼──────┐  │
-│  │              Recursive Latent Core (RLR)                │  │
-│  │                                                         │  │
-│  │  ┌─────────────────────────────────────────────────┐   │  │
-│  │  │  OUTER LOOP (j = 1..T_outer)                     │   │  │
-│  │  │                                                   │   │  │
-│  │  │  z_abstract ← f_slow(z, z_pooled)  [H-module]   │   │  │
-│  │  │                                                   │   │  │
-│  │  │  ┌─────────────────────────────────────────┐     │   │  │
-│  │  │  │  INNER LOOP (i = 1..T_inner)             │     │   │  │
-│  │  │  │                                           │     │   │  │
-│  │  │  │  cond = t_emb + text_global + rec_emb    │     │   │  │
-│  │  │  │  z_in = z + z_abstract                   │     │   │  │
-│  │  │  │                                           │     │   │  │
-│  │  │  │  FOR block in GLD_blocks:                │     │   │  │
-│  │  │  │  ┌─────────────────────────────────┐     │     │   │  │
-│  │  │  │  │  GLD Block                       │     │     │   │  │
-│  │  │  │  │                                   │     │     │   │  │
-│  │  │  │  │  1. AdaLN-modulate(z, cond)      │     │     │   │  │
-│  │  │  │  │  2. GLA: BiDir scan + DiffToken  │     │     │   │  │
-│  │  │  │  │     + DW-Conv locality gate      │     │     │   │  │
-│  │  │  │  │  3. Cross-attn to text_emb       │     │     │   │  │
-│  │  │  │  │  4. AdaLN-modulate(z, cond)      │     │     │   │  │
-│  │  │  │  │  5. SwiGLU FFN                   │     │     │   │  │
-│  │  │  │  └─────────────────────────────────┘     │     │   │  │
-│  │  │  │                                           │     │   │  │
-│  │  │  │  z = z + 0.5 * (blocks(z_in) - z)       │     │   │  │
-│  │  │  └─────────────────────────────────────────┘     │   │  │
-│  │  └─────────────────────────────────────────────────┘   │  │
-│  │                                                         │  │
-│  │  v = out_proj(out_norm(z))  ← velocity prediction      │  │
-│  └─────────────────────────────────────────────────────────┘  │
-│                                                              │
-│  Training: IFT backprop (O(1) memory through recursion)      │
-│  Inference: Full recursion (no grad needed)                   │
-└─────────────────────────────────────────────────────────────┘
 ```
 ---
-## 6. Mathematical Formulation
-### Forward Process (Rectified Flow)
-Given clean latent z₀ and noise ε ~ N(0, I):
-```
-z_t = (1 - t) · z₀ + t · ε,    t ∈ [0, 1]
-```
-### Velocity Target
-```
-v* = ε - z₀
-```
-### Denoising Core (RLR)
-Let f_θ denote the shared GLD blocks, and g_φ denote the abstract updater.
-**Initialization:**
-```
-z⁽⁰⁾ = input_proj(flatten(z_t))
-c = time_embed(sinusoidal(t)) + text_global
-z_abs⁽⁰⁾ = mean_pool(z⁽⁰⁾)
-```
-**Outer loop** (j = 1..T_outer):
-```
-z_abs⁽ʲ⁾ = z_abs⁽ʲ⁻¹⁾ + tanh(α) · g_φ([norm(z), mean_pool(z)])
-```
-**Inner loop** (i = 1..T_inner):
-```
-c_step = c + recursion_embed(j · T_inner + i)
-z_in = z + z_abs⁽ʲ⁾
-z ← z + 0.5 · (f_θ(z_in, c_step, text_emb) - z)
-```
-**Output:**
-```
-v_θ(z_t, t, c) = out_proj(out_norm(z))
-```
-### GLA Block (within f_θ)
-```
-Q, K, V = W_qkv · x            (linear projection)
-Q̃ = Q - λ · shift(Q)           (token differential)
-K̃ = K - λ · shift(K)
-Q̃ = φ(Q̃), K̃ = φ(K̃)           where φ(x) = 1 + elu(x)
-Forward scan:  S_i = γ · S_{i-1} + K̃_i^T · V_i;  O_i^fwd = Q̃_i · S_i
-Backward scan: (same in reverse)
-O = O^fwd + O^bwd
-O = sigmoid(W_g · x) · norm(O) · sigmoid(DWConv(W_local · x))
-output = W_out · O
-```
-Complexity: **O(N · d²)** per direction, where d is head dimension and N is token count.
-### IFT Training (O(1) Memory)
-During training, we detach gradients for all but the last recursion:
-```
-with no_grad():
-    for j in range(T_outer - 1):
-        z = recursive_refinement(z, c, text_emb)
-z = recursive_refinement(z, c, text_emb)  # grad only here
-```
-By the Implicit Function Theorem, if z* is a fixed point of f, then:
 ```
-∂z*/∂θ = (I - ∂f/∂z)⁻¹ · ∂f/∂θ
-```
-The 1-step gradient approximates this, giving correct gradient direction with O(1) memory.
 ---
-## 7. Training Objective & Losses
-### Stage 1: VAE Training
-```
-L_VAE = L_recon + λ_perc · L_perceptual + λ_KL · L_KL
-L_recon = |x - x̂|₁                              (L1 reconstruction)
-L_perceptual = (1/3) Σ_{s=0}^{2} MSE(pool_s(x), pool_s(x̂))  (multi-scale)
-L_KL = -0.5 · E[1 + log(σ²) - μ² - σ²]         (KL divergence)
-λ_perc = 1.0, λ_KL = 1e-6
-```
-### Stage 2: Flow Matching
-```
-L_flow = E_{t,z₀,ε} [ w(t) · ‖v_θ(z_t, t, c) - (ε - z₀)‖² ]
-w(t) = 1 / (t(1-t) + 0.01)    (SNR weighting, normalized)
-With 10% classifier-free guidance dropout:
-P(c = ∅) = 0.1
-```
-### Stage 3: Consistency Distillation
-```
-L_CD = ‖f_θ(z_{t_n}, t_n, c) - sg[f_{teacher}(z_{t_{n-1}}, t_{n-1}, c)]‖²
-where f_teacher uses the trained flow model with one Euler step:
-z_{t_{n-1}} = z_{t_n} - (t_n - t_{n-1}) · v_teacher(z_{t_n}, t_n, c)
 ```
-### Stage 4: Editing Fine-tuning
-Same flow matching loss, but with additional image condition:
 ```
-v_θ(z_t, t, c, z_src)    where z_src = encode(source_image)
-```
-Additive conditioning: `z_input = z + z_src` before the RLR core.
 ---
-## 8. Memory & Compute Budget
-### Inference (1024×1024, Default Config, INT8)
-| Component | Memory |
 |---|---|
-| Text Encoder (INT8) | 11 MB |
-| VAE Decoder (INT8) | 1 MB |
-| Denoising Core (INT8) | 0.6 MB |
-| Latent activations (64×64×32) | 0.5 MB |
-| Peak activation memory | ~200 MB |
-| **Total** | **~213 MB** |
-This comfortably fits within 3-4 GB mobile RAM.
-### Training (16 GB GPU, Default Config)
-| Item | Memory |
-|---|---|
-| Model parameters (FP32) | 62 MB |
-| Optimizer states (AdamW, 2×) | 124 MB |
-| Gradients | 62 MB |
-| Batch activations (BS=8, 64×64) | ~500 MB |
-| IFT overhead (only last recursion) | ~50 MB |
-| **Total** | **~800 MB** |
-Leaves ample room for larger batch sizes or higher resolution on 16 GB.
 ---
-## 9. Training Curriculum
-### Stage 1: VAE (50K steps)
-- **Data**: ImageNet or COCO (any large image dataset)
-- **Resolution**: 256×256
-- **What to freeze**: Nothing
-- **What to train**: Full VAE
-- **LR**: 1e-4, AdamW, weight_decay=0.01
-- **Key**: Train until L_recon < 0.1
-### Stage 2: Flow Matching — Low Resolution (100K steps)
-- **Data**: Synthetic captions from teacher (SDXL) + LAION-aesthetic subset
-- **Resolution**: 64×64
-- **What to freeze**: VAE
-- **What to train**: Core + Text Encoder
-- **LR**: 1e-4
-- **Key**: Focus on learning composition and prompt adherence
-### Stage 3: Flow Matching — Mid Resolution (200K steps)
-- **Data**: Filtered LAION-aesthetic (score > 6.0) + synthetic
-- **Resolution**: 256×256
-- **What to freeze**: VAE
-- **What to train**: Core + Text Encoder
-- **LR**: 5e-5
-- **Key**: Focus on texture and detail
-### Stage 4: Flow Matching — High Resolution (100K steps)
-- **Data**: High-quality curated + JourneyDB
-- **Resolution**: 512×512
-- **What to freeze**: VAE
-- **What to train**: Core + Text Encoder
-- **LR**: 2e-5
-- **Key**: Focus on fine detail and typography
-### Stage 5: Consistency Distillation (50K steps)
-- **Data**: Same as Stage 4
-- **What to freeze**: VAE + Text Encoder
-- **What to train**: Core only
-- **LR**: 1e-5
-- **Key**: Distill from own multi-step model to 4-step generation
-### Stage 6: Editing Fine-tuning (50K steps)
-- **Data**: InstructPix2Pix + MagicBrush + synthetic edit pairs
-- **What to freeze**: VAE
-- **What to train**: Core + Text Encoder
-- **LR**: 1e-5
-- **Key**: Add image conditioning channel
----
-## 10. Deployment Plan for Mobile
-### Step 1: Quantization
-- INT8 per-channel weight quantization (static)
-- INT8 per-token activation quantization (dynamic)
-- Result: ~4× model size reduction
-### Step 2: Operator Optimization
-- Replace GELU → SiLU throughout (MobileDiffusion finding: GELU causes float16 instability)
-- Fuse norm + activation + linear into single kernels
-- Use CoreML (iOS) or NNAPI (Android) for hardware acceleration
-### Step 3: Step Reduction
-- After consistency distillation: 4 Euler steps sufficient
-- With further adversarial distillation: 1-2 steps possible
-### Step 4: Latent Size Optimization
-- f=16 compression: 1024² → 64×64 latents
-- 32 channels per position
-- Total latent: 64×64×32 = 131,072 values ≈ 0.5 MB
-### Projected Performance
-| Device | Steps | Estimated Time |
-|---|---|---|
-| iPhone 16 Pro (ANE) | 4 | ~0.5-1.0s |
-| Pixel 8 Pro (GPU) | 4 | ~1.0-2.0s |
-| iPhone 14 (GPU) | 8 | ~2.0-3.0s |
----
-## 11. Failure Mode Analysis
-| Failure Mode | Cause | Detection | Fix |
-|---|---|---|---|
-| **Fixed-point non-convergence** | Recursion doesn't converge | Monitor z change per recursion | Damped update (α=0.5), reduce T_inner |
-| **Oversmoothing** | GLA loses high-frequency detail | Blurry outputs, low LPIPS | Increase token-differential λ, add DW-conv skip |
-| **Mode collapse** | Small model capacity | FID increases, low diversity | Increase num_blocks or dim |
-| **Training instability** | IFT gradient approximation error | Loss spikes | Reduce LR, increase warmup, disable IFT temporarily |
-| **Poor text adherence** | Weak cross-attention | Low CLIP score | Increase cross-attention gates, add more cross-attn layers |
-| **VAE artifacts** | Aggressive compression | Reconstruction artifacts | Lower f (use f=8), increase decoder capacity |
-| **CFG artifacts** | High guidance scale | Oversaturated images | Train with 10% unconditional, use CFG 3-5 range |
----
-## 12. Ablation Plan
-### Ablation 1: Recursion Depth vs Quality
-- **Vary**: T_inner ∈ {1, 2, 4, 6, 8}, T_outer ∈ {1, 2, 3}
-- **Measure**: FID, CLIP score, inference time
-- **Hypothesis**: Quality plateaus around T_inner=4-6; diminishing returns beyond T_outer=2
-### Ablation 2: GLA vs Standard Attention
-- **Compare**: GLA blocks vs softmax attention blocks (same dim, same depth)
-- **Measure**: FID, memory, throughput
-- **Hypothesis**: GLA matches attention quality at 3-5× lower memory
-### Ablation 3: Token Differential
-- **Vary**: λ ∈ {0, 0.05, 0.1, 0.2, learned}
-- **Measure**: FID, sharpness metrics (gradient magnitude)
-- **Hypothesis**: λ=0.1 optimal; λ=0 causes oversmoothing
-### Ablation 4: IFT vs Full Backprop
-- **Compare**: IFT training vs full BPTT (at small T for memory comparison)
-- **Measure**: Final FID, training memory, convergence speed
-- **Hypothesis**: IFT within 2% FID of full backprop at 8-16× memory savings
-### Ablation 5: VAE Compression
-- **Vary**: f ∈ {8, 16, 32}, C ∈ {8, 16, 32}
-- **Measure**: rFID, PSNR, generation FID
-- **Hypothesis**: f=16, C=16-32 is the sweet spot for mobile quality
-### Ablation 6: Abstract State (H-module)
-- **Compare**: With/without abstract state update
-- **Measure**: FID, coherence metrics
-- **Hypothesis**: Abstract state improves global composition coherence
----
-## 13. Editing Roadmap
-The LRF architecture is designed for editing-readiness through **additive image conditioning**:
-### Phase 1: Inpainting
-- Add binary mask channel to condition input
-- `z_input = z + z_src * mask + z_noise * (1 - mask)`
-- Train on random masking + MagicBrush data
-### Phase 2: Image-to-Image Translation
-- Source image encoded to latent, added to noisy latent
-- Noise level controls edit strength (low noise = subtle edit)
-- No architectural changes needed
-### Phase 3: Instruction-Based Editing (OmniGen-style)
-- Text encoder receives both instruction AND image description
-- Source image latent added as conditioning
-- Train on InstructPix2Pix + SEED-edit data
-### Phase 4: Super-Resolution
-- Low-res image encoded, upscaled in latent space
-- Decoder generates high-res output
-- Train on paired low/high-res data
-### Phase 5: Style Transfer & Identity Preservation
-- Reference image encoded to separate latent
-- Cross-attention between reference and generation
-- Train on same-identity different-image pairs (GRIT-Entity)
-### Phase 6: Multi-Image Conditioning
-- OmniGen-style interleaved image-text input
-- Multiple source images encoded and concatenated in latent space
-- Enables try-on, compositing, scene editing
-### Why This Works
-The key architectural decisions that enable editing:
-1. **Additive conditioning** preserves spatial correspondence (pixel i in source maps to token i in latent)
-2. **Recursive refinement** naturally handles conditioning — the model can "reason" about how to modify the latent
-3. **Cross-attention to text** at every recursion step allows the model to follow editing instructions progressively
-4. **Same parameter reuse** means editing capability doesn't require new parameters — just new training data
 ---
-## Quick Start
-```python
-# Clone and install
-!pip install torch einops safetensors
-# Use the pipeline
-from lrf.model import LatentRecurrentFlow
-from lrf.pipeline import LRFPipeline
-# Create model
-model = LatentRecurrentFlow(LatentRecurrentFlow.tiny_config())
-pipe = LRFPipeline(model)
-# Generate
-images = pipe("a sunset over the ocean", num_steps=10, height=64, width=64)
-# Or train
-from lrf.training import run_prototype_training
-model, trainer = run_prototype_training(num_vae_steps=100, num_flow_steps=100)
-```
-See `notebook.ipynb` for the full interactive walkthrough.
 ---
-## Citation
-```bibtex
-@software{lrf2026,
-  title={LatentRecurrentFlow: A Novel Mobile-First Image Generation Architecture},
-  author={LRF Research},
-  year={2026},
-  url={https://huggingface.co/krystv/LatentRecurrentFlow}
-}
-```
 ## License
 Apache 2.0

   - recursive-reasoning
   - novel-architecture
   - subquadratic-attention
   - research
 library_name: lrf
 pipeline_tag: text-to-image
 > A genuinely new architecture for image generation designed from scratch to run on consumer devices with 3–4 GB RAM, trained on 16 GB budgets.
+## 🔥 v2 Training Results (CIFAR-10)
+The denoising core was **trained on CIFAR-10** (50K images, 10 classes) using:
+- **Pre-trained TAESD** (2.4M frozen params) as the VAE — f=8 compression, 32×32 → 4×4×4 latents
+- **1.47M parameter denoising core** with recursive refinement (4 shared blocks × 2 recursions = 8 effective layers)
+- **Rectified flow** matching with SNR-weighted loss and 10% CFG dropout
+- Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999
+| Metric | Value |
+|--------|-------|
+| Final Loss | 0.931 |
+| Training Time | ~70 min (CPU only!) |
+| VAE Recon MSE | 0.068 |
+| All 10 classes produce colorful images | ✅ |
+### Sample Outputs
+VAE Reconstruction (top: original, bottom: TAESD reconstruction):
+![VAE Reconstruction](samples/vae_reconstruction.png)
+Training progression (epoch 5 → 30):
+![Epoch 5](samples/samples_epoch005.png)
+![Epoch 30](samples/samples_epoch030.png)
+Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):
+![Final Samples](samples/final_class_conditional.png)
+Loss curve:
+![Loss](samples/loss.png)
+### Validation: No Grey Images
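+Every class produces images with non-trivial pixel variance. Per-class standard deviation and value range of the generated samples (pixel values are clamped to [-1, 1], so the maximum possible range is 2.0):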
 ```
+airplane    : std=0.383, range=1.908 ✅
+automobile  : std=0.448, range=2.000 ✅
+bird        : std=0.341, range=1.663 ✅
+cat         : std=0.521, range=2.000 ✅
+deer        : std=0.401, range=1.869 ✅
+dog         : std=0.477, range=1.994 ✅
+frog        : std=0.366, range=1.996 ✅
+horse       : std=0.499, range=1.972 ✅
+ship        : std=0.448, range=1.786 ✅
+truck       : std=0.510, range=1.944 ✅
 ```
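A check of this kind can be sketched as follows. This is a minimal illustration and not the repository's validation script: the function name `variance_report`, the two thresholds, and the toy pixel data are all assumptions, and pixel values are assumed to lie in [-1, 1].

```python
import statistics

def variance_report(pixels, min_std=0.05, min_range=0.5):
    """Return (std, value_range, ok) for a flat list of pixel values in [-1, 1].

    min_std and min_range are illustrative thresholds, not values from the repo.
    """
    std = statistics.pstdev(pixels)
    value_range = max(pixels) - min(pixels)
    ok = std >= min_std and value_range >= min_range
    return std, value_range, ok

# Toy data standing in for decoded samples (hypothetical values).
grey_image = [0.0] * 64                                      # collapsed, uniform output
varied_image = [(-1.0) ** i * (i / 63) for i in range(64)]   # alternating signs, growing magnitude

for name, img in [("grey", grey_image), ("varied", varied_image)]:
    std, value_range, ok = variance_report(img)
    status = "ok" if ok else "FAIL"
    print(f"{name:<8}: std={std:.3f}, range={value_range:.3f} {status}")
```

A collapsed (all-grey) sample has zero standard deviation and zero range, so it fails both thresholds, while a sample that spans most of [-1, 1] passes.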
 ---
+## Architecture Overview
+LRF combines five key innovations into a single coherent architecture:
+| Innovation | Source Inspiration | What It Does |
+|---|---|---|
+| **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
+| **Efficient Spatial Mixer** | ViG/GLA + DyDiLA | Attention + DW-Conv locality (adapts to sequence length) |
+| **Pre-trained TAESD VAE** | madebyollin/taesd | f=8 compression, 2.4M params, works out of the box |
+| **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
+| **Additive Image Conditioning** | OmniGen | Same core supports text-to-image AND editing |
+### v2 Architecture (Trained & Validated)
+| Component | Parameters | Description |
+|---|---|---|
+| TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder |
+| Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions |
+| Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 |
+| **Trainable Total** | **1.47M** | |
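The "4 shared blocks × 2 inner recursions" entry means the same four parameter sets are applied twice, giving eight block applications. The sketch below illustrates only that weight-sharing pattern. The affine `make_block` stand-ins and the `recursive_core` helper are hypothetical and are not taken from `lrf/model_v2.py`.

```python
def make_block(scale, shift):
    """A stand-in for one denoising block: an affine map over a list of floats."""
    def block(z):
        return [scale * x + shift for x in z]
    return block

def recursive_core(z, blocks, num_recursions):
    """Apply the same list of blocks num_recursions times, reusing their parameters."""
    applications = 0
    for _ in range(num_recursions):
        for block in blocks:
            z = block(z)
            applications += 1
    return z, applications

# Four shared blocks, each with its own (hypothetical) parameters.
blocks = [
    make_block(0.9, 0.1),
    make_block(1.1, -0.05),
    make_block(0.8, 0.0),
    make_block(1.0, 0.02),
]

z, effective_depth = recursive_core([1.0, -1.0], blocks, num_recursions=2)
print(f"unique blocks: {len(blocks)}, effective layers: {effective_depth}")
```

The parameter count depends only on `len(blocks)`, while compute and effective depth scale with `num_recursions`.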
+### How It Works
+```python
+# 1. Encode image to latent (TAESD, frozen)
+z_0 = vae.encode(image)                    # [B, 4, 4, 4]
+# 2. Add noise (rectified flow)
+z_t = (1-t) * z_0 + t * noise              # Linear interpolation
+# 3. Predict velocity (recursive denoising core)
+v = core(z_t, t, class_label)              # 4 blocks × 2 recursions
+# 4. Training target
+loss = MSE(v, noise - z_0)                 # Velocity matching
+# 5. Sampling (Euler ODE solver, t=1→0)
+for t in timesteps:                        # t decreases from 1 toward 0 in steps of dt
+    v = core(z, t, class_label)
+    z = z - dt * v
+# 6. Decode to image (TAESD, frozen)
+image = vae.decode(z)
 ```
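The steps above can be exercised on a one-dimensional toy problem. In this sketch the "dataset" is a single value `MU`, so the exact velocity is known in closed form and `oracle_core` stands in for the trained network. All names here are hypothetical, and nothing in it reflects the actual `LRFv2` implementation.

```python
import random

MU = 2.0  # toy "dataset": every clean latent z_0 equals MU (assumption for illustration)

def interpolate(z0, noise, t):
    """Rectified-flow forward process: z_t = (1 - t) * z_0 + t * noise."""
    return (1.0 - t) * z0 + t * noise

def oracle_core(z_t, t):
    """Stand-in for the trained core: the exact velocity noise - z_0 when z_0 is always MU."""
    return (z_t - MU) / t

def flow_loss(core, z0, noise, t):
    """Squared error between the predicted velocity and the target noise - z_0."""
    v_pred = core(interpolate(z0, noise, t), t)
    return (v_pred - (noise - z0)) ** 2

def euler_sample(core, z, num_steps):
    """Integrate dz/dt = v from t = 1 down toward t = 0 with a fixed step dt."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt          # current time; the last step uses t = dt, never t = 0
        z = z - dt * core(z, t)
    return z

rng = random.Random(0)
print("loss of oracle core:", flow_loss(oracle_core, MU, rng.gauss(0.0, 1.0), rng.uniform(0.05, 1.0)))
print("sample from noise:  ", euler_sample(oracle_core, rng.gauss(0.0, 1.0), num_steps=10))
```

With the exact velocity, the loss is zero up to floating-point error and Euler integration carries any starting noise to `MU`. A learned core only approximates this, which is why more sampling steps generally help.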
 ---
+## Quick Start
+### Generate from trained model:
+```python
+import torch
+from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
+from diffusers import AutoencoderTiny
+# Load
+vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
+ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
+model = LRFv2(ckpt['config'])
+for name, p in model.named_parameters():
+    p.data.copy_(ckpt['ema_params'][name])
+model.eval()
+# Generate (class 3 = cat)
+scheduler = RectifiedFlowScheduler()
+labels = torch.full((4,), 3, dtype=torch.long)
+z = scheduler.sample(model, (4,4,4,4), labels, num_steps=50, cfg_scale=3.0)
+images = vae.decode(z).sample.clamp(-1, 1)
 ```
+### Train from scratch:
+```bash
+python lrf/train_v2.py
 ```
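The `cfg_scale=3.0` argument in the generation snippet and the 10% CFG dropout used in training both refer to classifier-free guidance. The sketch below shows the standard formulation. Whether `RectifiedFlowScheduler.sample` combines the two predictions in exactly this way is an assumption, and `NULL_LABEL`, `cfg_velocity`, and `maybe_drop_label` are hypothetical names.

```python
import random

NULL_LABEL = 10  # hypothetical index for "no class", one past the 10 CIFAR-10 classes

def maybe_drop_label(label, drop_prob, rng):
    """Training side: replace the label with NULL_LABEL with probability drop_prob."""
    return NULL_LABEL if rng.random() < drop_prob else label

def cfg_velocity(v_cond, v_uncond, cfg_scale):
    """Sampling side: extrapolate from the unconditional toward the conditional velocity."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]

rng = random.Random(0)
labels = [maybe_drop_label(3, drop_prob=0.1, rng=rng) for _ in range(10_000)]
print("fraction of dropped labels:", labels.count(NULL_LABEL) / len(labels))
print("guided velocity:", cfg_velocity([1.0, 2.0], [0.0, 1.0], cfg_scale=3.0))
```

A scale of 1.0 reproduces the conditional prediction, and larger values push samples further toward the class at the cost of diversity.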
 ---
+## Files
+| File | Description |
 |---|---|
+| `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) |
+| `lrf/train_v2.py` | CIFAR-10 training pipeline with TAESD VAE |
+| `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) |
+| `trained/config.json` | Model configuration |
+| `samples/` | Generated sample images at various epochs |
+| `lrf/model.py` | v1 architecture (research prototype) |
+| `lrf/training.py` | v1 training pipeline |
+| `lrf/pipeline.py` | HF-compatible inference pipeline |
+| `notebook.ipynb` | Interactive walkthrough |
 ---
+## Training Curriculum (Full Scale)
+| Stage | Resolution | Data | Freeze | Train | LR | Steps |
+|---|---|---|---|---|---|---|
+| 1. VAE | 256² | ImageNet/COCO | - | VAE | 1e-4 | 50K |
+| 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K |
+| 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K |
+| 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K |
+| 5. Distill | 512² | Same as 4 | VAE+Text | Core | 1e-5 | 50K |
+| 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K |
+**Shortcut (demonstrated in this repo on CIFAR-10):** Skip Stage 1 entirely by using the pre-trained TAESD VAE and start directly at Stage 2.
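The curriculum table can also be written as plain data, which makes the step budget easy to total. This is an illustrative encoding only: the `Stage` dataclass and `total_steps` helper are hypothetical and not part of the repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    resolution: int   # square side length in pixels
    train: str        # which modules receive gradients
    lr: float
    steps: int

CURRICULUM = [
    Stage("VAE", 256, "VAE", 1e-4, 50_000),
    Stage("Flow (low)", 64, "Core+Text", 1e-4, 100_000),
    Stage("Flow (mid)", 256, "Core+Text", 5e-5, 200_000),
    Stage("Flow (high)", 512, "Core+Text", 2e-5, 100_000),
    Stage("Distill", 512, "Core", 1e-5, 50_000),
    Stage("Editing", 512, "Core+Text", 1e-5, 50_000),
]

def total_steps(stages, use_pretrained_vae=False):
    """Sum the step budget, optionally skipping the VAE stage (the TAESD shortcut)."""
    return sum(
        s.steps for s in stages
        if not (use_pretrained_vae and s.name == "VAE")
    )

print(total_steps(CURRICULUM))                            # 550000
print(total_steps(CURRICULUM, use_pretrained_vae=True))   # 500000
```

Using a pre-trained VAE removes 50K of the 550K planned steps and also avoids needing a reconstruction dataset for Stage 1.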
 ---
+## Relevant Papers (Grouped by Problem)
+### Subquadratic Spatial Mixing
+- PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34× speedup
+- DiMSUM (2411.04168): Mamba + wavelet, FID 2.11
+- ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
+- DyDiLA (2601.13683): Dynamic differential linear attention
+### Recursive Reasoning
+- HRM (2506.21734): Fixed-point recurrence, O(1) memory via IFT
+- TRM (2510.04871): 7M params → 45% ARC-AGI-1
+### Compact Latent Spaces
+- SANA DC-AE (2410.10629): f=32, PSNR 29.29
+- SnapGen (2412.09619): 1.38M tiny decoder
+- TAESD (madebyollin): 2.4M params, f=8, usable without further training
+### Few-Step Generation
+- Consistency Models (2303.01469): One-step from diffusion
+- LCM (2310.04378): 2-4 step via consistency distillation
+### Editing Architectures
+- OmniGen (2409.11340): Unified generation + editing
+- InstructPix2Pix (2211.09800): Text-guided editing
 ---
 ## License
 Apache 2.0