+---
+tags:
+  - image-generation
+  - latent-recurrent-flow
+  - lrf
+  - mobile-first
+  - flow-matching
+  - recursive-reasoning
+  - novel-architecture
+  - subquadratic-attention
+  - gated-linear-attention
+  - research
+library_name: lrf
+pipeline_tag: text-to-image
+license: apache-2.0
+---
+# LatentRecurrentFlow (LRF) — A Novel Mobile-First Image Generation Architecture
+> A new image-generation architecture designed from scratch to run on consumer devices with 3–4 GB of RAM and to train within a 16 GB GPU memory budget.
+---
+## Table of Contents
+1. [Architecture Overview](#1-architecture-overview)
+2. [Shortlist of Most Relevant Papers](#2-shortlist-of-most-relevant-papers)
+3. [Paper Critiques](#3-paper-critiques)
+4. [Full Proposed Architecture](#4-full-proposed-architecture-latentrecurrentflow)
+5. [Module-by-Module Diagram](#5-module-by-module-diagram)
+6. [Mathematical Formulation](#6-mathematical-formulation)
+7. [Training Objective & Losses](#7-training-objective--losses)
+8. [Memory & Compute Budget](#8-memory--compute-budget)
+9. [Training Curriculum](#9-training-curriculum)
+10. [Deployment Plan for Mobile](#10-deployment-plan-for-mobile)
+11. [Failure Mode Analysis](#11-failure-mode-analysis)
+12. [Ablation Plan](#12-ablation-plan)
+13. [Editing Roadmap](#13-editing-roadmap)
+---
+## 1. Architecture Overview
+LRF combines five key innovations into a single coherent architecture:
+| Innovation | Source Inspiration | What It Does |
+|---|---|---|
+| **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
+| **Gated Linear Diffusion (GLD)** blocks | ViG/GLA + DyDiLA | O(N) subquadratic spatial mixing replacing O(N²) attention |
+| **Compact f=16 VAE** | SANA DC-AE + SnapGen | 16× spatial compression per side with a ~280K-parameter decoder (Tiny config) |
+| **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
+| **Multimodal Conditioning** | OmniGen | Same core supports text-to-image AND editing via additive image conditioning |
+### Key Numbers (Tiny Config — 5.7M params)
+| Component | Parameters | FP32 Size | INT8 Size |
+|---|---|---|---|
+| VAE Encoder | 777K | 3.0 MB | 0.7 MB |
+| VAE Decoder | 283K | 1.1 MB | 0.3 MB |
+| Text Encoder | 4.5M | 17.3 MB | 4.3 MB |
+| Denoising Core | 102K | 0.4 MB | 0.1 MB |
+| **Total** | **5.7M** | **21.7 MB** | **5.4 MB** |
+### Key Numbers (Default Config — 16.3M params)
+| Component | Parameters | FP32 Size | INT8 Size |
+|---|---|---|---|
+| VAE Encoder | 3.1M | 11.7 MB | 2.9 MB |
+| VAE Decoder | 1.1M | 4.1 MB | 1.0 MB |
+| Text Encoder | 11.5M | 43.9 MB | 11.0 MB |
+| Denoising Core | 651K | 2.5 MB | 0.6 MB |
+| **Total** | **16.3M** | **62.2 MB** | **15.6 MB** |
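The size columns follow from the parameter counts. Below is a minimal sketch of that arithmetic, assuming the tables report MiB (2^20 bytes) with 4 bytes per FP32 weight and 1 byte per INT8 weight. `model_size_mib` is a hypothetical helper for illustration, not part of the `lrf` package.

```python
def model_size_mib(num_params: float, bytes_per_param: int) -> float:
    """Storage for the weights alone, in MiB (assumed unit of the tables)."""
    return num_params * bytes_per_param / 2**20


# Rounded totals taken from the two "Key Numbers" tables.
for name, params in [("tiny", 5.7e6), ("default", 16.3e6)]:
    fp32 = model_size_mib(params, 4)
    int8 = model_size_mib(params, 1)
    print(f"{name}: FP32 {fp32:.1f} MiB, INT8 {int8:.1f} MiB")
```

Because the inputs are already rounded, the last digit can differ slightly from the table entries.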
+---
+## 2. Shortlist of Most Relevant Papers
+### A. Subquadratic Spatial Mixing for Image Generation
+| Paper | arxiv | Key Contribution | FID Result |
+|---|---|---|---|
+| **PDE-SSM-DiT** | 2603.13663 | Fourier PDE operator replaces attention, O(N log N), 34× speedup | 18.36 (CelebA-HQ 256) |
+| **DiMSUM** (NeurIPS 2024) | 2411.04168 | Mamba + wavelet subbands + shared transformer | **2.11** (CelebA-HQ 256) |
+| **ViG/GLA** | 2405.18425 | Gated Linear Attention with 2D locality injection | 90% less memory at 1024² |
+| **DyDiLA** | 2601.13683 | Dynamic differential linear attention | **6.80** (SubIN 256) |
+| **Mamba2D** | 2412.16146 | True 2D SSM with wavefront scan | 84.0% top-1 IN-1K (27M) |
+### B. Recursive/Iterative Reasoning
+| Paper | arxiv | Key Contribution |
+|---|---|---|
+| **HRM** | 2506.21734 | 2-level recurrent fixed-point reasoning, O(1) memory via IFT |
+| **TRM** (6473 ⭐) | 2510.04871 | 7M params → 45% ARC-AGI-1 via deep recursion |
+| **Thinking Pixel** | 2604.25299 | Sparse MoE adapters for recursive visual reasoning in DiT |
+### C. Compact Latent Spaces
+| Paper | arxiv | Compression | Quality |
+|---|---|---|---|
+| **SANA DC-AE** | 2410.10629 | f=32, C=32 → 32×32 latents for 1024² | PSNR 29.29, rFID 0.34 |
+| **SnapGen** | 2412.09619 | 1.38M tiny decoder (35× smaller than SD3) | PSNR 27.85 |
+| **TiTok** | 2406.07550 | 32 tokens per 256² image | gFID 1.97 (IN-256) |
+| **MobileDiffusion** | 2311.16567 | f=8, c=8 VAE, sub-second on iPhone | Better than SD-1.5 at 8 steps |
+### D. Few-Step Generation
+| Paper | arxiv | Key Result |
+|---|---|---|
+| **Consistency Models** | 2303.01469 | One-step generation from diffusion |
+| **LCM** | 2310.04378 | 2-4 step high-quality via consistency distillation |
+| **SD3.5-Flash** | 2509.21318 | Few-step distillation with timestep sharing |
+### E. Unified Generation + Editing
+| Paper | arxiv | Key Contribution |
+|---|---|---|
+| **OmniGen** | 2409.11340 | Single model for T2I + editing + control, interleaved image-text input |
+| **OmniGen2** | 2506.18871 | Dual decoding pathways, decoupled image tokenizer |
+| **InstructPix2Pix** | 2211.09800 | Image editing from text instructions |
+### F. Mobile Deployment
+| Paper | arxiv | Device Performance |
+|---|---|---|
+| **SnapGen** | 2412.09619 | 1.4s on iPhone 15 Pro, 372M UNet |
+| **SnapGen++** | 2601.08303 | 1.8s on iPhone 16, 0.4B sub-DiT |
+| **MobileDiffusion** | 2311.16567 | Sub-second on iPhone, ~400M params |
+---
+## 3. Paper Critiques
+### PDE-SSM (2603.13663) ✅ Borrowed: Physical inductive bias concept
+- **Why it helps**: 34× speedup from FFT-based spatial operator with physically grounded bias
+- **What it fails at**: FID still behind DiMSUM (18.36 vs 2.11); requires FFT which is non-trivial on mobile
+- **Borrowed**: Concept of learnable PDE-style spatial operators; we adapt this to our GLD blocks
+### HRM/TRM (2506.21734, 2510.04871) ✅ Borrowed: Core recursive architecture
+- **Why it helps**: O(1) memory backprop via IFT; extreme parameter efficiency (7M → 45% ARC-AGI)
+- **What it fails at**: Never applied to image generation; fixed-point convergence not guaranteed for images
+- **Borrowed**: Two-level recursion (abstract + detail), IFT training, recursion depth embedding
+### ViG/GLA (2405.18425) ✅ Borrowed: Spatial mixing block
+- **Why it helps**: Hardware-aware, 90% memory savings, bidirectional GLA with locality injection
+- **What it fails at**: Only tested on classification/detection, not generation
+- **Borrowed**: Bidirectional GLA core, depthwise conv locality injection (GaLI), token differential (from DyDiLA)
+### SANA DC-AE (2410.10629) ✅ Borrowed: Latent space design principles
+- **Why it helps**: f=32 achieves similar quality to f=8 but 16× fewer tokens
+- **What it fails at**: Decoder is still large (50M); typography needs decoder-only LLM text encoder
+- **Borrowed**: High-compression VAE principle; we use f=16 as a compromise for fine detail
+### SnapGen (2412.09619) ✅ Borrowed: Tiny decoder architecture
+- **Why it helps**: 35× smaller decoder, 54× faster decode, negligible quality loss
+- **What it fails at**: Proprietary weights; still uses quadratic attention in the UNet backbone
+- **Borrowed**: Attention-free decoder, SepConv, minimal GroupNorm, SiLU instead of GELU
+### TiTok (2406.07550) ❌ Rejected: Too aggressive compression
+- **Why it was considered**: 32 tokens per image is incredibly compact
+- **Why rejected**: rFID=16.2 means visible artifacts; fine detail and typography badly degraded at 32 tokens
+### DiMSUM (2411.04168) ⚠️ Partially borrowed: Wavelet concept
+- **Why it helps**: Best FID (2.11) among SSM-based approaches
+- **What it fails at**: Still uses cross-attention fusion → partially quadratic; complex architecture
+- **Borrowed**: Wavelet decomposition concept for frequency-aware processing
+---
+## 4. Full Proposed Architecture: LatentRecurrentFlow
+### Name: **LatentRecurrentFlow (LRF)**
+LRF is a **recursive flow-matching image generator** that uses:
+- A compact VAE with f=16 compression and a tiny decoder (~280K parameters in the Tiny config, 1.1M in the Default config)
+- A **Recursive Latent Refinement (RLR) core** that iteratively refines image latents through shared GLD blocks
+- A **rectified flow** training objective for clean few-step generation
+- **Additive image conditioning** for editing-readiness
+The core insight: **instead of stacking many unique layers, reuse a small set of blocks recursively**. This exploits the observation from HRM/TRM that iterative application of the same function can converge to a fixed point that represents the solution — analogous to how diffusion models iteratively denoise.
+---
+## 5. Module-by-Module Diagram
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    LatentRecurrentFlow                        │
+│                                                              │
+│  ┌─────────────┐    ┌──────────────┐    ┌────────────────┐  │
+│  │   Compact    │    │   Simple     │    │   Rectified    │  │
+│  │     VAE      │    │   Text       │    │   Flow         │  │
+│  │  (f=16)      │    │   Encoder    │    │   Scheduler    │  │
+│  │              │    │              │    │                │  │
+│  │  Encoder ────┤    │  Embed ──────┤    │  t ~ U[0,1]   │  │
+│  │  (3.1M)      │    │  Transformer │    │  z_t = (1-t)  │  │
+│  │              │    │  (11.5M)     │    │    z_0 + tε    │  │
+│  │  Decoder ────┤    │              │    │                │  │
+│  │  (1.1M, tiny)│    │  → text_emb  │    │  v = ε - z_0  │  │
+│  └──────┬───────┘    │  → text_glob │    └────────┬───────┘  │
+│         │            └──────┬───────┘             │          │
+│         │                   │                     │          │
+│  ┌──────▼───────────────────▼─────────────────────▼──────┐  │
+│  │              Recursive Latent Core (RLR)                │  │
+│  │                                                         │  │
+│  │  ┌─────────────────────────────────────────────────┐   │  │
+│  │  │  OUTER LOOP (j = 1..T_outer)                     │   │  │
+│  │  │                                                   │   │  │
+│  │  │  z_abstract ← f_slow(z, z_pooled)  [H-module]   │   │  │
+│  │  │                                                   │   │  │
+│  │  │  ┌─────────────────────────────────────────┐     │   │  │
+│  │  │  │  INNER LOOP (i = 1..T_inner)             │     │   │  │
+│  │  │  │                                           │     │   │  │
+│  │  │  │  cond = t_emb + text_global + rec_emb    │     │   │  │
+│  │  │  │  z_in = z + z_abstract                   │     │   │  │
+│  │  │  │                                           │     │   │  │
+│  │  │  │  FOR block in GLD_blocks:                │     │   │  │
+│  │  │  │  ┌─────────────────────────────────┐     │     │   │  │
+│  │  │  │  │  GLD Block                       │     │     │   │  │
+│  │  │  │  │                                   │     │     │   │  │
+│  │  │  │  │  1. AdaLN-modulate(z, cond)      │     │     │   │  │
+│  │  │  │  │  2. GLA: BiDir scan + DiffToken  │     │     │   │  │
+│  │  │  │  │     + DW-Conv locality gate      │     │     │   │  │
+│  │  │  │  │  3. Cross-attn to text_emb       │     │     │   │  │
+│  │  │  │  │  4. AdaLN-modulate(z, cond)      │     │     │   │  │
+│  │  │  │  │  5. SwiGLU FFN                   │     │     │   │  │
+│  │  │  │  └─────────────────────────────────┘     │     │   │  │
+│  │  │  │                                           │     │   │  │
+│  │  │  │  z = z + 0.5 * (blocks(z_in) - z)       │     │   │  │
+│  │  │  └─────────────────────────────────────────┘     │   │  │
+│  │  └─────────────────────────────────────────────────┘   │  │
+│  │                                                         │  │
+│  │  v = out_proj(out_norm(z))  ← velocity prediction      │  │
+│  └─────────────────────────────────────────────────────────┘  │
+│                                                              │
+│  Training: IFT backprop (O(1) memory through recursion)      │
+│  Inference: Full recursion (no grad needed)                   │
+└─────────────────────────────────────────────────────────────┘
+```
+---
+## 6. Mathematical Formulation
+### Forward Process (Rectified Flow)
+Given clean latent z₀ and noise ε ~ N(0, I):
+```
+z_t = (1 - t) · z₀ + t · ε,    t ∈ [0, 1]
+```
+### Velocity Target
+```
+v* = ε - z₀
+```
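The interpolation and velocity target can be checked numerically. The sketch below uses plain Python lists as stand-ins for latents, and the function names are illustrative assumptions rather than the `lrf` API. It shows that integrating with the exact target velocity from t = 1 down to t = 0 recovers z₀, which is why a well-trained velocity field supports few-step Euler sampling.

```python
import random


def interpolate(z0, eps, t):
    """z_t = (1 - t) * z0 + t * eps, applied element-wise."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, eps)]


def velocity_target(z0, eps):
    """v* = eps - z0, constant along the straight path."""
    return [b - a for a, b in zip(z0, eps)]


def euler_sample(eps, velocity_fn, num_steps):
    """Integrate dz/dt = v from t = 1 (noise) down to t = 0 (data)."""
    z, dt = list(eps), 1.0 / num_steps
    for n in range(num_steps):
        t = 1.0 - n * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


random.seed(0)
z0 = [random.gauss(0.0, 1.0) for _ in range(4)]   # stand-in for a clean latent
eps = [random.gauss(0.0, 1.0) for _ in range(4)]  # Gaussian noise
v_star = velocity_target(z0, eps)

# With the exact (oracle) velocity, a 4-step Euler solve lands back on z0.
z_rec = euler_sample(eps, lambda z, t: v_star, num_steps=4)
print(max(abs(a - b) for a, b in zip(z_rec, z0)))
```

In practice `velocity_fn` would be the learned v_θ, so the recovery is only approximate.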
+### Denoising Core (RLR)
+Let f_θ denote the shared GLD blocks, and g_φ denote the abstract updater.
+**Initialization:**
+```
+z⁽⁰⁾ = input_proj(flatten(z_t))
+c = time_embed(sinusoidal(t)) + text_global
+z_abs⁽⁰⁾ = mean_pool(z⁽⁰⁾)
+```
+**Outer loop** (j = 1..T_outer):
+```
+z_abs⁽ʲ⁾ = z_abs⁽ʲ⁻¹⁾ + tanh(α) · g_φ([norm(z), mean_pool(z)])
+```
+**Inner loop** (i = 1..T_inner):
+```
+c_step = c + recursion_embed((j - 1) · T_inner + i)
+z_in = z + z_abs⁽ʲ⁾
+z ← z + 0.5 · (f_θ(z_in, c_step, text_emb) - z)
+```
+**Output:**
+```
+v_θ(z_t, t, c) = out_proj(out_norm(z))
+```
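The nested loops and the damped update can be sketched on a scalar toy state. Everything here is an assumption made for illustration: `toy_f` and `toy_g` stand in for the GLD blocks and the abstract updater, the loops are 0-based, and the gate value `alpha` is arbitrary. The sketch records the per-step change in z, which is the quantity one would monitor to check convergence.

```python
import math


def refine(z, z_abs, f, g, alpha=0.1, t_outer=2, t_inner=4, damping=0.5):
    """Nested refinement on a scalar state; returns final z and per-step |dz|."""
    deltas = []
    for j in range(t_outer):
        # Outer loop: gated update of the abstract state.
        z_abs = z_abs + math.tanh(alpha) * g(z)
        for i in range(t_inner):
            step = j * t_inner + i  # 0-based recursion index (would feed recursion_embed)
            # Inner loop: damped move toward f(z + z_abs).
            z_new = z + damping * (f(z + z_abs, step) - z)
            deltas.append(abs(z_new - z))
            z = z_new
    return z, deltas


def toy_f(z_in, step):
    """Contractive stand-in for the shared blocks (ignores the step index)."""
    return 0.5 * z_in + 1.0


def toy_g(z):
    """Identity stand-in for the abstract updater."""
    return z


z_final, deltas = refine(0.0, 0.0, toy_f, toy_g)
print([round(d, 4) for d in deltas])
```

Within each inner loop the change shrinks geometrically for this contractive toy map. A real model offers no such guarantee, which is why the change is worth logging.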
+### GLA Block (within f_θ)
+```
+Q, K, V = W_qkv · x            (linear projection)
+Q̃ = Q - λ · shift(Q)           (token differential)
+K̃ = K - λ · shift(K)
+Q̃ = φ(Q̃), K̃ = φ(K̃)           where φ(x) = 1 + elu(x)
+Forward scan:  S_i = γ · S_{i-1} + K̃_i^T · V_i;  O_i^fwd = Q̃_i · S_i
+Backward scan: (same in reverse)
+O = O^fwd + O^bwd
+O = sigmoid(W_g · x) · norm(O) · sigmoid(DWConv(W_local · x))
+output = W_out · O
+```
+Complexity: **O(N · d²)** per direction, where d is head dimension and N is token count.
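To make the linear-time claim concrete, here is a minimal single-head sketch of the decayed scan in pure Python. It is a simplification under stated assumptions: the token differential, output gate, normalization and depthwise-conv locality gate are all omitted, γ is a fixed scalar, and the function names are hypothetical. The recurrent form keeps a d×d state regardless of N, and `quadratic_reference` computes the same forward output as explicit O(N²) attention for comparison.

```python
import math
import random


def phi(x):
    """Feature map from the text: 1 + elu(x), always positive."""
    return 1.0 + (x if x > 0.0 else math.exp(x) - 1.0)


def causal_scan(q, k, v, gamma):
    """Recurrent form: S_i = gamma * S_{i-1} + k_i^T v_i, o_i = q_i S_i."""
    d = len(q[0])
    state = [[0.0] * d for _ in range(d)]  # d x d running summary, independent of N
    out = []
    for q_i, k_i, v_i in zip(q, k, v):
        for a in range(d):
            for b in range(d):
                state[a][b] = gamma * state[a][b] + k_i[a] * v_i[b]
        out.append([sum(q_i[a] * state[a][b] for a in range(d)) for b in range(d)])
    return out


def bidirectional_scan(q, k, v, gamma):
    """O = O_fwd + O_bwd; the backward pass is the same scan on reversed tokens."""
    fwd = causal_scan(q, k, v, gamma)
    bwd = causal_scan(q[::-1], k[::-1], v[::-1], gamma)[::-1]
    return [[f + b for f, b in zip(fr, br)] for fr, br in zip(fwd, bwd)]


def quadratic_reference(q, k, v, gamma):
    """Forward output written as explicit O(N^2) decayed attention."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        row = [0.0] * d
        for j in range(i + 1):
            w = gamma ** (i - j) * sum(qa * ka for qa, ka in zip(q[i], k[j]))
            row = [r + w * vb for r, vb in zip(row, v[j])]
        out.append(row)
    return out


def random_tokens(n, d):
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
N, d, gamma = 6, 3, 0.9
q = [[phi(x) for x in row] for row in random_tokens(N, d)]
k = [[phi(x) for x in row] for row in random_tokens(N, d)]
v = random_tokens(N, d)
out = bidirectional_scan(q, k, v, gamma)
```

Each token costs O(d²) work per direction in the recurrent form, which gives the O(N · d²) total stated above.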
+### IFT Training (O(1) Memory)
+During training, we detach gradients for all but the last recursion:
+```
+with no_grad():
+    for j in range(T_outer - 1):
+        z = recursive_refinement(z, c, text_emb)
+z = recursive_refinement(z, c, text_emb)  # grad only here
+```
+By the Implicit Function Theorem, if z* is a fixed point of f, then:
+```
+∂z*/∂θ = (I - ∂f/∂z)⁻¹ · ∂f/∂θ
+```
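+The 1-step gradient approximates this by replacing (I - ∂f/∂z)⁻¹ with the identity, i.e. keeping only the first term of its Neumann series. Memory then stays O(1) in recursion depth, and the approximation is most accurate when ∂f/∂z is small (the refinement map is strongly contractive).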
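A scalar toy problem shows how the 1-step gradient relates to the implicit one. The sketch assumes a linear map f(z, θ) = a·z + θ with |a| < 1, chosen only because its fixed point z* = θ / (1 − a) is known in closed form; it is not the model's refinement function.

```python
def fixed_point(theta, a=0.6, num_iters=200):
    """Iterate z <- a * z + theta until it settles at z* = theta / (1 - a)."""
    z = 0.0
    for _ in range(num_iters):
        z = a * z + theta
    return z


a, theta, h = 0.6, 1.0, 1e-6

# Reference: finite-difference derivative of the converged fixed point.
dz_dtheta_fd = (fixed_point(theta + h, a) - fixed_point(theta - h, a)) / (2 * h)

# Implicit Function Theorem: (1 - df/dz)^-1 * df/dtheta, with df/dz = a, df/dtheta = 1.
dz_dtheta_ift = 1.0 / (1.0 - a)

# 1-step approximation: differentiate the last application only, i.e. df/dtheta.
dz_dtheta_1step = 1.0

print(dz_dtheta_fd, dz_dtheta_ift, dz_dtheta_1step)
```

In this toy case the 1-step estimate has the correct sign but is smaller than the implicit gradient by a factor of 1 / (1 − a). With vector-valued states the direction can differ as well, so the gap is worth checking empirically.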
+---
+## 7. Training Objective & Losses
+### Stage 1: VAE Training
+```
+L_VAE = L_recon + λ_perc · L_perceptual + λ_KL · L_KL
+L_recon = |x - x̂|₁                              (L1 reconstruction)
+L_perceptual = (1/3) Σ_{s=0}^{2} MSE(pool_s(x), pool_s(x̂))  (multi-scale)
+L_KL = -0.5 · E[1 + log(σ²) - μ² - σ²]         (KL divergence)
+λ_perc = 1.0, λ_KL = 1e-6
+```
+### Stage 2: Flow Matching
+```
+L_flow = E_{t,z₀,ε} [ w(t) · ‖v_θ(z_t, t, c) - (ε - z₀)‖² ]
+w(t) = 1 / (t(1-t) + 0.01)    (SNR weighting, normalized)
+With 10% classifier-free guidance dropout:
+P(c = ∅) = 0.1
+```
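The weighting and conditioning dropout are simple enough to sketch directly. The helper names below are hypothetical, latents are flat Python lists, and the null condition is represented by `None`, which is purely an assumption about how an empty prompt might be encoded.

```python
import random


def flow_weight(t: float) -> float:
    """w(t) = 1 / (t * (1 - t) + 0.01), largest near t = 0 and t = 1."""
    return 1.0 / (t * (1.0 - t) + 0.01)


def weighted_flow_loss(v_pred, z0, eps, t):
    """w(t) * squared L2 distance between v_pred and the target eps - z0."""
    target = [e - z for z, e in zip(z0, eps)]
    sq_norm = sum((p - g) ** 2 for p, g in zip(v_pred, target))
    return flow_weight(t) * sq_norm


def maybe_drop_condition(cond, rng, p_uncond=0.1, null_cond=None):
    """Swap the condition for a null token with probability p_uncond."""
    return null_cond if rng.random() < p_uncond else cond


rng = random.Random(0)
kept = sum(maybe_drop_condition("prompt", rng) is not None for _ in range(10_000))
print(f"w(0.5) = {flow_weight(0.5):.3f}, conditioned fraction = {kept / 10_000:.3f}")
```

The 0.01 offset caps the weight at 100 at the endpoints, so samples near t = 0 or t = 1 are emphasized without the loss diverging.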
+### Stage 3: Consistency Distillation
+```
+L_CD = ‖f_θ(z_{t_n}, t_n, c) - sg[f_{teacher}(z_{t_{n-1}}, t_{n-1}, c)]‖²
+where f_teacher uses the trained flow model with one Euler step:
+z_{t_{n-1}} = z_{t_n} - (t_n - t_{n-1}) · v_teacher(z_{t_n}, t_n, c)
+```
+### Stage 4: Editing Fine-tuning
+Same flow matching loss, but with additional image condition:
+```
+v_θ(z_t, t, c, z_src)    where z_src = encode(source_image)
+```
+Additive conditioning: `z_input = z + z_src` before the RLR core.
+---
+## 8. Memory & Compute Budget
+### Inference (1024×1024, Default Config, INT8)
+| Component | Memory |
+|---|---|
+| Text Encoder (INT8) | 11 MB |
+| VAE Decoder (INT8) | 1 MB |
+| Denoising Core (INT8) | 0.6 MB |
+| Latent activations (64×64×32, FP32) | 0.5 MB |
+| Peak activation memory | ~200 MB |
+| **Total** | **~213 MB** |
+This comfortably fits within 3-4 GB mobile RAM.
+### Training (16 GB GPU, Default Config)
+| Item | Memory |
+|---|---|
+| Model parameters (FP32) | 62 MB |
+| Optimizer states (AdamW, 2×) | 124 MB |
+| Gradients | 62 MB |
+| Batch activations (BS=8, 64×64) | ~500 MB |
+| IFT overhead (only last recursion) | ~50 MB |
+| **Total** | **~800 MB** |
+Leaves ample room for larger batch sizes or higher resolution on 16 GB.
+---
+## 9. Training Curriculum
+### Stage 1: VAE (50K steps)
+- **Data**: ImageNet or COCO (any large image dataset)
+- **Resolution**: 256×256
+- **What to freeze**: Nothing
+- **What to train**: Full VAE
+- **LR**: 1e-4, AdamW, weight_decay=0.01
+- **Key**: Train until L_recon < 0.1
+### Stage 2: Flow Matching — Low Resolution (100K steps)
+- **Data**: Synthetic captions from teacher (SDXL) + LAION-aesthetic subset
+- **Resolution**: 64×64
+- **What to freeze**: VAE
+- **What to train**: Core + Text Encoder
+- **LR**: 1e-4
+- **Key**: Focus on learning composition and prompt adherence
+### Stage 3: Flow Matching — Mid Resolution (200K steps)
+- **Data**: Filtered LAION-aesthetic (score > 6.0) + synthetic
+- **Resolution**: 256×256
+- **What to freeze**: VAE
+- **What to train**: Core + Text Encoder
+- **LR**: 5e-5
+- **Key**: Focus on texture and detail
+### Stage 4: Flow Matching — High Resolution (100K steps)
+- **Data**: High-quality curated + JourneyDB
+- **Resolution**: 512×512
+- **What to freeze**: VAE
+- **What to train**: Core + Text Encoder
+- **LR**: 2e-5
+- **Key**: Focus on fine detail and typography
+### Stage 5: Consistency Distillation (50K steps)
+- **Data**: Same as Stage 4
+- **What to freeze**: VAE + Text Encoder
+- **What to train**: Core only
+- **LR**: 1e-5
+- **Key**: Distill from own multi-step model to 4-step generation
+### Stage 6: Editing Fine-tuning (50K steps)
+- **Data**: InstructPix2Pix + MagicBrush + synthetic edit pairs
+- **What to freeze**: VAE
+- **What to train**: Core + Text Encoder
+- **LR**: 1e-5
+- **Key**: Add image conditioning channel
+---
+## 10. Deployment Plan for Mobile
+### Step 1: Quantization
+- INT8 per-channel weight quantization (static)
+- INT8 per-token activation quantization (dynamic)
+- Result: ~4× model size reduction relative to FP32 (62.2 MB → 15.6 MB for the Default config)
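Per-channel weight quantization can be sketched in a few lines. The version below assumes symmetric quantization to the range [-127, 127] with one scale per output channel (row); the document does not specify symmetric versus asymmetric, so treat that choice and the function names as assumptions.

```python
def quantize_per_channel(weight):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map integer codes back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.02, -0.5, 0.31], [4.0, -1.5, 0.25]]  # toy 2x3 weight matrix
q_rows, scales = quantize_per_channel(weight)
restored = dequantize(q_rows, scales)
max_err = max(abs(w - r) for wr, rr in zip(weight, restored) for w, r in zip(wr, rr))
print(q_rows, max_err)
```

Giving each row its own scale keeps the small-magnitude first row from being crushed by the larger second row. Dynamic per-token activation quantization applies the same idea to activation rows at run time.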
+### Step 2: Operator Optimization
+- Replace GELU → SiLU throughout (MobileDiffusion finding: GELU causes float16 instability)
+- Fuse norm + activation + linear into single kernels
+- Use Core ML (iOS) or NNAPI (Android) for hardware acceleration
+### Step 3: Step Reduction
+- After consistency distillation: 4 Euler steps sufficient
+- With further adversarial distillation: 1-2 steps possible
+### Step 4: Latent Size Optimization
+- f=16 compression: 1024² → 64×64 latents
+- 32 channels per position
+- Total latent: 64×64×32 = 131,072 values ≈ 0.5 MB in FP32 (4 bytes per value)
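The latent-size figures above can be reproduced with a short calculation. `latent_shape` is a hypothetical helper, and the 4 bytes per value assumes FP32 storage.

```python
def latent_shape(image_size: int, f: int = 16, channels: int = 32):
    """Latent grid for a square image under f-times downsampling per side."""
    side = image_size // f
    return side, side, channels


h, w, c = latent_shape(1024)
num_tokens = h * w                 # sequence length seen by the GLD blocks
num_values = h * w * c
size_mib = num_values * 4 / 2**20  # 4 bytes per FP32 value (assumption)
print((h, w, c), num_tokens, num_values, size_mib)
```

For a 1024² image this gives a 64×64×32 latent, 4,096 tokens and 0.5 MiB. The same helper shows that a 256² image yields only 256 tokens.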
+### Projected Performance
+| Device | Steps | Estimated Time |
+|---|---|---|
+| iPhone 16 Pro (ANE) | 4 | ~0.5-1.0s |
+| Pixel 8 Pro (GPU) | 4 | ~1.0-2.0s |
+| iPhone 14 (GPU) | 8 | ~2.0-3.0s |
+---
+## 11. Failure Mode Analysis
+| Failure Mode | Cause | Detection | Fix |
+|---|---|---|---|
+| **Fixed-point non-convergence** | Refinement map is not contractive, so iterates oscillate or diverge | Monitor the change in z per recursion | Damped update (damping factor 0.5), reduce T_inner |
+| **Oversmoothing** | GLA loses high-frequency detail | Blurry outputs, high LPIPS distance to references | Increase token-differential λ, add DW-conv skip |
+| **Mode collapse** | Small model capacity | FID increases, low diversity | Increase num_blocks or dim |
+| **Training instability** | IFT gradient approximation error | Loss spikes | Reduce LR, increase warmup, disable IFT temporarily |
+| **Poor text adherence** | Weak cross-attention | Low CLIP score | Increase cross-attention gates, add more cross-attn layers |
+| **VAE artifacts** | Aggressive compression | Reconstruction artifacts | Lower f (use f=8), increase decoder capacity |
+| **CFG artifacts** | High guidance scale | Oversaturated images | Train with 10% unconditional, use CFG 3-5 range |
+---
+## 12. Ablation Plan
+### Ablation 1: Recursion Depth vs Quality
+- **Vary**: T_inner ∈ {1, 2, 4, 6, 8}, T_outer ∈ {1, 2, 3}
+- **Measure**: FID, CLIP score, inference time
+- **Hypothesis**: Quality plateaus around T_inner=4-6; diminishing returns beyond T_outer=2
+### Ablation 2: GLA vs Standard Attention
+- **Compare**: GLA blocks vs softmax attention blocks (same dim, same depth)
+- **Measure**: FID, memory, throughput
+- **Hypothesis**: GLA matches attention quality at 3-5× lower memory
+### Ablation 3: Token Differential
+- **Vary**: λ ∈ {0, 0.05, 0.1, 0.2, learned}
+- **Measure**: FID, sharpness metrics (gradient magnitude)
+- **Hypothesis**: λ=0.1 optimal; λ=0 causes oversmoothing
+### Ablation 4: IFT vs Full Backprop
+- **Compare**: IFT training vs full BPTT (at small T for memory comparison)
+- **Measure**: Final FID, training memory, convergence speed
+- **Hypothesis**: IFT within 2% FID of full backprop at 8-16× memory savings
+### Ablation 5: VAE Compression
+- **Vary**: f ∈ {8, 16, 32}, C ∈ {8, 16, 32}
+- **Measure**: rFID, PSNR, generation FID
+- **Hypothesis**: f=16, C=16-32 is the sweet spot for mobile quality
+### Ablation 6: Abstract State (H-module)
+- **Compare**: With/without abstract state update
+- **Measure**: FID, coherence metrics
+- **Hypothesis**: Abstract state improves global composition coherence
+---
+## 13. Editing Roadmap
+The LRF architecture is designed for editing-readiness through **additive image conditioning**:
+### Phase 1: Inpainting
+- Add binary mask channel to condition input
+- `z_input = z + z_src * mask + z_noise * (1 - mask)`
+- Train on random masking + MagicBrush data
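The masked blend for inpainting can be written out element-wise. This sketch treats latents as flat lists and reads the formula as mask = 1 where the source content is kept and mask = 0 where new content is generated; that interpretation and the function name are assumptions.

```python
def inpaint_input(z, z_src, z_noise, mask):
    """z + z_src * mask + z_noise * (1 - mask), element-wise over flat latents."""
    return [
        zi + si * m + ni * (1.0 - m)
        for zi, si, ni, m in zip(z, z_src, z_noise, mask)
    ]


z = [0.0, 0.0, 0.0, 0.0]         # current latent state
z_src = [1.0, 2.0, 3.0, 4.0]     # encoded source image
z_noise = [9.0, 9.0, 9.0, 9.0]   # stand-in for sampled noise
mask = [1.0, 1.0, 0.0, 0.0]      # keep the first two positions, regenerate the rest

print(inpaint_input(z, z_src, z_noise, mask))
```

Positions with mask = 1 receive the source latent and the remaining positions receive noise, so the conditioning stays spatially aligned with the latent grid.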
+### Phase 2: Image-to-Image Translation
+- Source image encoded to latent, added to noisy latent
+- Noise level controls edit strength (low noise = subtle edit)
+- No architectural changes needed
+### Phase 3: Instruction-Based Editing (OmniGen-style)
+- Text encoder receives both instruction AND image description
+- Source image latent added as conditioning
+- Train on InstructPix2Pix + SEED-edit data
+### Phase 4: Super-Resolution
+- Low-res image encoded, upscaled in latent space
+- Decoder generates high-res output
+- Train on paired low/high-res data
+### Phase 5: Style Transfer & Identity Preservation
+- Reference image encoded to separate latent
+- Cross-attention between reference and generation
+- Train on same-identity different-image pairs (GRIT-Entity)
+### Phase 6: Multi-Image Conditioning
+- OmniGen-style interleaved image-text input
+- Multiple source images encoded and concatenated in latent space
+- Enables try-on, compositing, scene editing
+### Why This Works
+The key architectural decisions that enable editing:
+1. **Additive conditioning** preserves spatial correspondence (with f=16, each 16×16 source patch maps to the latent token at the same grid position)
+2. **Recursive refinement** naturally handles conditioning — the model can "reason" about how to modify the latent
+3. **Cross-attention to text** at every recursion step allows the model to follow editing instructions progressively
+4. **Same parameter reuse** means editing capability doesn't require new parameters — just new training data
+---
+## Quick Start
+```python
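+# Clone this repository, then install the dependencies from a shell
+# (in a notebook cell, prefix the command with "!"):
+#   pip install torch einops safetensors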
+# Use the pipeline
+from lrf.model import LatentRecurrentFlow
+from lrf.pipeline import LRFPipeline
+# Create model
+model = LatentRecurrentFlow(LatentRecurrentFlow.tiny_config())
+pipe = LRFPipeline(model)
+# Generate
+images = pipe("a sunset over the ocean", num_steps=10, height=64, width=64)
+# Or train
+from lrf.training import run_prototype_training
+model, trainer = run_prototype_training(num_vae_steps=100, num_flow_steps=100)
+```
+See `notebook.ipynb` for the full interactive walkthrough.
+---
+## Citation
+```bibtex
+@software{lrf2026,
+  title={LatentRecurrentFlow: A Novel Mobile-First Image Generation Architecture},
+  author={LRF Research},
+  year={2026},
+  url={https://huggingface.co/krystv/LatentRecurrentFlow}
+}
+```
+## License
+Apache 2.0