Add comprehensive README with full architecture documentation
README.md
ADDED
|
@@ -0,0 +1,608 @@
| 1 |
+
---
|
| 2 |
+
tags:
|
| 3 |
+
- image-generation
|
| 4 |
+
- latent-recurrent-flow
|
| 5 |
+
- lrf
|
| 6 |
+
- mobile-first
|
| 7 |
+
- flow-matching
|
| 8 |
+
- recursive-reasoning
|
| 9 |
+
- novel-architecture
|
| 10 |
+
- subquadratic-attention
|
| 11 |
+
- gated-linear-attention
|
| 12 |
+
- research
|
| 13 |
+
library_name: lrf
|
| 14 |
+
pipeline_tag: text-to-image
|
| 15 |
+
license: apache-2.0
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
# LatentRecurrentFlow (LRF) — A Novel Mobile-First Image Generation Architecture
|
| 19 |
+
|
| 20 |
+
> A new image-generation architecture designed from scratch to run on consumer devices with 3–4 GB of RAM and to train within a 16 GB GPU memory budget.
|
| 21 |
+
|
| 22 |
+
---
|
| 23 |
+
|
| 24 |
+
## Table of Contents
|
| 25 |
+
|
| 26 |
+
1. [Architecture Overview](#1-architecture-overview)
|
| 27 |
+
2. [Shortlist of Most Relevant Papers](#2-shortlist-of-most-relevant-papers)
|
| 28 |
+
3. [Paper Critiques](#3-paper-critiques)
|
| 29 |
+
4. [Full Proposed Architecture](#4-full-proposed-architecture-latentrecurrentflow)
|
| 30 |
+
5. [Module-by-Module Diagram](#5-module-by-module-diagram)
|
| 31 |
+
6. [Mathematical Formulation](#6-mathematical-formulation)
|
| 32 |
+
7. [Training Objective & Losses](#7-training-objective--losses)
|
| 33 |
+
8. [Memory & Compute Budget](#8-memory--compute-budget)
|
| 34 |
+
9. [Training Curriculum](#9-training-curriculum)
|
| 35 |
+
10. [Deployment Plan for Mobile](#10-deployment-plan-for-mobile)
|
| 36 |
+
11. [Failure Mode Analysis](#11-failure-mode-analysis)
|
| 37 |
+
12. [Ablation Plan](#12-ablation-plan)
|
| 38 |
+
13. [Editing Roadmap](#13-editing-roadmap)
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## 1. Architecture Overview
|
| 43 |
+
|
| 44 |
+
LRF combines five key innovations into a single coherent architecture:
|
| 45 |
+
|
| 46 |
+
| Innovation | Source Inspiration | What It Does |
|
| 47 |
+
|---|---|---|
|
| 48 |
+
| **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
|
| 49 |
+
| **Gated Linear Diffusion (GLD)** blocks | ViG/GLA + DyDiLA | O(N) subquadratic spatial mixing replacing O(N²) attention |
|
| 50 |
+
| **Compact f=16 VAE** | SANA DC-AE + SnapGen | 16× downsampling per side (1024² → 64×64 latents) with a ~280K-parameter decoder in the tiny config |
|
| 51 |
+
| **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
|
| 52 |
+
| **Multimodal Conditioning** | OmniGen | Same core supports text-to-image AND editing via additive image conditioning |
|
| 53 |
+
|
| 54 |
+
### Key Numbers (Tiny Config — 5.7M params)
|
| 55 |
+
|
| 56 |
+
| Component | Parameters | FP32 Size | INT8 Size |
|
| 57 |
+
|---|---|---|---|
|
| 58 |
+
| VAE Encoder | 777K | 3.0 MB | 0.7 MB |
|
| 59 |
+
| VAE Decoder | 283K | 1.1 MB | 0.3 MB |
|
| 60 |
+
| Text Encoder | 4.5M | 17.3 MB | 4.3 MB |
|
| 61 |
+
| Denoising Core | 102K | 0.4 MB | 0.1 MB |
|
| 62 |
+
| **Total** | **5.7M** | **21.7 MB** | **5.4 MB** |
|
| 63 |
+
|
| 64 |
+
### Key Numbers (Default Config — 16.3M params)
|
| 65 |
+
|
| 66 |
+
| Component | Parameters | FP32 Size | INT8 Size |
|
| 67 |
+
|---|---|---|---|
|
| 68 |
+
| VAE Encoder | 3.1M | 11.7 MB | 2.9 MB |
|
| 69 |
+
| VAE Decoder | 1.1M | 4.1 MB | 1.0 MB |
|
| 70 |
+
| Text Encoder | 11.5M | 43.9 MB | 11.0 MB |
|
| 71 |
+
| Denoising Core | 651K | 2.5 MB | 0.6 MB |
|
| 72 |
+
| **Total** | **16.3M** | **62.2 MB** | **15.6 MB** |
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
## 2. Shortlist of Most Relevant Papers
|
| 77 |
+
|
| 78 |
+
### A. Subquadratic Spatial Mixing for Image Generation
|
| 79 |
+
|
| 80 |
+
| Paper | arxiv | Key Contribution | Reported Result |
|---|---|---|---|
| **PDE-SSM-DiT** | 2603.13663 | Fourier PDE operator replaces attention, O(N log N), 34× speedup | FID 18.36 (CelebA-HQ 256) |
| **DiMSUM** (NeurIPS 2024) | 2411.04168 | Mamba + wavelet subbands + shared transformer | FID **2.11** (CelebA-HQ 256) |
| **ViG/GLA** | 2405.18425 | Gated Linear Attention with 2D locality injection | 90% less memory at 1024² |
| **DyDiLA** | 2601.13683 | Dynamic differential linear attention | FID **6.80** (SubIN 256) |
| **Mamba2D** | 2412.16146 | True 2D SSM with wavefront scan | 84.0% top-1 IN-1K (27M) |
|
| 87 |
+
|
| 88 |
+
### B. Recursive/Iterative Reasoning
|
| 89 |
+
|
| 90 |
+
| Paper | arxiv | Key Contribution |
|
| 91 |
+
|---|---|---|
|
| 92 |
+
| **HRM** | 2506.21734 | 2-level recurrent fixed-point reasoning, O(1) memory via IFT |
|
| 93 |
+
| **TRM** (6473 ⭐) | 2510.04871 | 7M params → 45% ARC-AGI-1 via deep recursion |
|
| 94 |
+
| **Thinking Pixel** | 2604.25299 | Sparse MoE adapters for recursive visual reasoning in DiT |
|
| 95 |
+
|
| 96 |
+
### C. Compact Latent Spaces
|
| 97 |
+
|
| 98 |
+
| Paper | arxiv | Compression | Quality |
|
| 99 |
+
|---|---|---|---|
|
| 100 |
+
| **SANA DC-AE** | 2410.10629 | f=32, C=32 → 32×32 latents for 1024² | PSNR 29.29, rFID 0.34 |
|
| 101 |
+
| **SnapGen** | 2412.09619 | 1.38M tiny decoder (35× smaller than SD3) | PSNR 27.85 |
|
| 102 |
+
| **TiTok** | 2406.07550 | 32 tokens per 256² image | gFID 1.97 (IN-256) |
|
| 103 |
+
| **MobileDiffusion** | 2311.16567 | f=8, c=8 VAE, sub-second on iPhone | Better than SD-1.5 at 8 steps |
|
| 104 |
+
|
| 105 |
+
### D. Few-Step Generation
|
| 106 |
+
|
| 107 |
+
| Paper | arxiv | Key Result |
|
| 108 |
+
|---|---|---|
|
| 109 |
+
| **Consistency Models** | 2303.01469 | One-step generation from diffusion |
|
| 110 |
+
| **LCM** | 2310.04378 | 2-4 step high-quality via consistency distillation |
|
| 111 |
+
| **SD3.5-Flash** | 2509.21318 | Few-step distillation with timestep sharing |
|
| 112 |
+
|
| 113 |
+
### E. Unified Generation + Editing
|
| 114 |
+
|
| 115 |
+
| Paper | arxiv | Key Contribution |
|
| 116 |
+
|---|---|---|
|
| 117 |
+
| **OmniGen** | 2409.11340 | Single model for T2I + editing + control, interleaved image-text input |
|
| 118 |
+
| **OmniGen2** | 2506.18871 | Dual decoding pathways, decoupled image tokenizer |
|
| 119 |
+
| **InstructPix2Pix** | 2211.09800 | Image editing from text instructions |
|
| 120 |
+
|
| 121 |
+
### F. Mobile Deployment
|
| 122 |
+
|
| 123 |
+
| Paper | arxiv | Device Performance |
|
| 124 |
+
|---|---|---|
|
| 125 |
+
| **SnapGen** | 2412.09619 | 1.4s on iPhone 15 Pro, 372M UNet |
|
| 126 |
+
| **SnapGen++** | 2601.08303 | 1.8s on iPhone 16, 0.4B sub-DiT |
|
| 127 |
+
| **MobileDiffusion** | 2311.16567 | Sub-second on iPhone, ~400M params |
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## 3. Paper Critiques
|
| 132 |
+
|
| 133 |
+
### PDE-SSM (2603.13663) ✅ Borrowed: Physical inductive bias concept
|
| 134 |
+
- **Why it helps**: 34× speedup from FFT-based spatial operator with physically grounded bias
|
| 135 |
+
- **What it fails at**: FID still behind DiMSUM (18.36 vs 2.11); requires FFT which is non-trivial on mobile
|
| 136 |
+
- **Borrowed**: Concept of learnable PDE-style spatial operators; we adapt this to our GLD blocks
|
| 137 |
+
|
| 138 |
+
### HRM/TRM (2506.21734, 2510.04871) ✅ Borrowed: Core recursive architecture
|
| 139 |
+
- **Why it helps**: O(1) memory backprop via IFT; extreme parameter efficiency (7M → 45% ARC-AGI)
|
| 140 |
+
- **What it fails at**: Never applied to image generation; fixed-point convergence not guaranteed for images
|
| 141 |
+
- **Borrowed**: Two-level recursion (abstract + detail), IFT training, recursion depth embedding
|
| 142 |
+
|
| 143 |
+
### ViG/GLA (2405.18425) ✅ Borrowed: Spatial mixing block
|
| 144 |
+
- **Why it helps**: Hardware-aware, 90% memory savings, bidirectional GLA with locality injection
|
| 145 |
+
- **What it fails at**: Only tested on classification/detection, not generation
|
| 146 |
+
- **Borrowed**: Bidirectional GLA core, depthwise conv locality injection (GaLI), token differential (from DyDiLA)
|
| 147 |
+
|
| 148 |
+
### SANA DC-AE (2410.10629) ✅ Borrowed: Latent space design principles
|
| 149 |
+
- **Why it helps**: f=32 achieves quality similar to f=8 with 16× fewer tokens
|
| 150 |
+
- **What it fails at**: Decoder is still large (50M); typography needs decoder-only LLM text encoder
|
| 151 |
+
- **Borrowed**: High-compression VAE principle; we use f=16 as a compromise for fine detail
|
| 152 |
+
|
| 153 |
+
### SnapGen (2412.09619) ✅ Borrowed: Tiny decoder architecture
|
| 154 |
+
- **Why it helps**: 35× smaller decoder, 54× faster decode, negligible quality loss
|
| 155 |
+
- **What it fails at**: Proprietary weights; still uses quadratic attention in the UNet backbone
|
| 156 |
+
- **Borrowed**: Attention-free decoder, SepConv, minimal GroupNorm, SiLU instead of GELU
|
| 157 |
+
|
| 158 |
+
### TiTok (2406.07550) ❌ Rejected: Too aggressive compression
|
| 159 |
+
- **Why it was considered**: 32 tokens per image is incredibly compact
|
| 160 |
+
- **Why rejected**: rFID=16.2 means visible artifacts; fine detail and typography badly degraded at 32 tokens
|
| 161 |
+
|
| 162 |
+
### DiMSUM (2411.04168) ⚠️ Partially borrowed: Wavelet concept
|
| 163 |
+
- **Why it helps**: Best FID (2.11) among SSM-based approaches
|
| 164 |
+
- **What it fails at**: Still uses cross-attention fusion → partially quadratic; complex architecture
|
| 165 |
+
- **Borrowed**: Wavelet decomposition concept for frequency-aware processing
|
| 166 |
+
|
| 167 |
+
---
|
| 168 |
+
|
| 169 |
+
## 4. Full Proposed Architecture: LatentRecurrentFlow
|
| 170 |
+
|
| 171 |
+
### Name: **LatentRecurrentFlow (LRF)**
|
| 172 |
+
|
| 173 |
+
LRF is a **recursive flow-matching image generator** that uses:
|
| 174 |
+
- A compact VAE with f=16 compression and a ~280K tiny decoder
|
| 175 |
+
- A **Recursive Latent Refinement (RLR) core** that iteratively refines image latents through shared GLD blocks
|
| 176 |
+
- A **rectified flow** training objective for clean few-step generation
|
| 177 |
+
- **Additive image conditioning** for editing-readiness
|
| 178 |
+
|
| 179 |
+
The core insight: **instead of stacking many unique layers, reuse a small set of blocks recursively**. This exploits the observation from HRM/TRM that iterative application of the same function can converge to a fixed point that represents the solution — analogous to how diffusion models iteratively denoise.
|
| 180 |
+
|
| 181 |
+
---
|
| 182 |
+
|
| 183 |
+
## 5. Module-by-Module Diagram
|
| 184 |
+
|
| 185 |
+
```
|
| 186 |
+
┌─────────────────────────────────────────────────────────────┐
|
| 187 |
+
│ LatentRecurrentFlow │
|
| 188 |
+
│ │
|
| 189 |
+
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │
|
| 190 |
+
│ │ Compact │ │ Simple │ │ Rectified │ │
|
| 191 |
+
│ │ VAE │ │ Text │ │ Flow │ │
|
| 192 |
+
│ │ (f=16) │ │ Encoder │ │ Scheduler │ │
|
| 193 |
+
│ │ │ │ │ │ │ │
|
| 194 |
+
│ │ Encoder ────┤ │ Embed ──────┤ │ t ~ U[0,1] │ │
|
| 195 |
+
│ │ (3.1M) │ │ Transformer │ │ z_t = (1-t) │ │
|
| 196 |
+
│ │ │ │ (11.5M) │ │ z_0 + tε │ │
|
| 197 |
+
│ │ Decoder ────┤ │ │ │ │ │
|
| 198 |
+
│ │ (1.1M, tiny)│ │ → text_emb │ │ v = ε - z_0 │ │
|
| 199 |
+
│ └──────┬───────┘ │ → text_glob │ └────────┬───────┘ │
|
| 200 |
+
│ │ └──────┬───────┘ │ │
|
| 201 |
+
│ │ │ │ │
|
| 202 |
+
│ ┌──────▼─────────────────────▼─────────────────────▼──────┐ │
|
| 203 |
+
│ │ Recursive Latent Core (RLR) │ │
|
| 204 |
+
│ │ │ │
|
| 205 |
+
│ │ ┌─────────────────────────────────────────────────┐ │ │
|
| 206 |
+
│ │ │ OUTER LOOP (j = 1..T_outer) │ │ │
|
| 207 |
+
│ │ │ │ │ │
|
| 208 |
+
│ │ │ z_abstract ← f_slow(z, z_pooled) [H-module] │ │ │
|
| 209 |
+
│ │ │ │ │ │
|
| 210 |
+
│ │ │ ┌─────────────────────────────────────────┐ │ │ │
|
| 211 |
+
│ │ │ │ INNER LOOP (i = 1..T_inner) │ │ │ │
|
| 212 |
+
│ │ │ │ │ │ │ │
|
| 213 |
+
│ │ │ │ cond = t_emb + text_global + rec_emb │ │ │ │
|
| 214 |
+
│ │ │ │ z_in = z + z_abstract │ │ │ │
|
| 215 |
+
│ │ │ │ │ │ │ │
|
| 216 |
+
│ │ │ │ FOR block in GLD_blocks: │ │ │ │
|
| 217 |
+
│ │ │ │ ┌─────────────────────────────────┐ │ │ │ │
|
| 218 |
+
│ │ │ │ │ GLD Block │ │ │ │ │
|
| 219 |
+
│ │ │ │ │ │ │ │ │ │
|
| 220 |
+
│ │ │ │ │ 1. AdaLN-modulate(z, cond) │ │ │ │ │
|
| 221 |
+
│ │ │ │ │ 2. GLA: BiDir scan + DiffToken │ │ │ │ │
|
| 222 |
+
│ │ │ │ │ + DW-Conv locality gate │ │ │ │ │
|
| 223 |
+
│ │ │ │ │ 3. Cross-attn to text_emb │ │ │ │ │
|
| 224 |
+
│ │ │ │ │ 4. AdaLN-modulate(z, cond) │ │ │ │ │
|
| 225 |
+
│ │ │ │ │ 5. SwiGLU FFN │ │ │ │ │
|
| 226 |
+
│ │ │ │ └─────────────────────────────────┘ │ │ │ │
|
| 227 |
+
│ │ │ │ │ │ │ │
|
| 228 |
+
│ │ │ │ z = z + 0.5 * (blocks(z_in) - z) │ │ │ │
|
| 229 |
+
│ │ │ └─────────────────────────────────────────┘ │ │ │
|
| 230 |
+
│ │ └─────────────────────────────────────────────────┘ │ │
|
| 231 |
+
│ │ │ │
|
| 232 |
+
│ │ v = out_proj(out_norm(z)) ← velocity prediction │ │
|
| 233 |
+
│ └─────────────────────────────────────────────────────────┘ │
|
| 234 |
+
│ │
|
| 235 |
+
│ Training: IFT backprop (O(1) memory through recursion) │
|
| 236 |
+
│ Inference: Full recursion (no grad needed) │
|
| 237 |
+
└─────────────────────────────────────────────────────────────┘
|
| 238 |
+
```
|
| 239 |
+
|
| 240 |
+
---
|
| 241 |
+
|
| 242 |
+
## 6. Mathematical Formulation
|
| 243 |
+
|
| 244 |
+
### Forward Process (Rectified Flow)
|
| 245 |
+
|
| 246 |
+
Given clean latent z₀ and noise ε ~ N(0, I):
|
| 247 |
+
|
| 248 |
+
```
|
| 249 |
+
z_t = (1 - t) · z₀ + t · ε, t ∈ [0, 1]
|
| 250 |
+
```
|
| 251 |
+
|
| 252 |
+
### Velocity Target
|
| 253 |
+
|
| 254 |
+
```
|
| 255 |
+
v* = ε - z₀
|
| 256 |
+
```
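A minimal NumPy sketch of the interpolation and velocity target above. The helper name `make_flow_pair` and the toy latent shape are illustrative assumptions, not part of the `lrf` package:

```python
import numpy as np

def make_flow_pair(z0, t, rng):
    """Return (z_t, v_target) for rectified flow at time t in [0, 1]."""
    eps = rng.standard_normal(z0.shape)   # noise ε ~ N(0, I)
    z_t = (1.0 - t) * z0 + t * eps        # straight line between data (t=0) and noise (t=1)
    v_target = eps - z0                   # velocity is constant along that line
    return z_t, v_target

rng = np.random.default_rng(0)
z0 = rng.standard_normal((32, 4, 4))      # toy latent: 32 channels on a 4x4 grid
z_t, v = make_flow_pair(z0, t=0.3, rng=rng)

# Moving back along the velocity by t recovers the clean latent.
assert np.allclose(z_t - 0.3 * v, z0)
```

Because the path is linear, a perfect velocity prediction at any `t` would recover `z0` in a single step, which is what makes few-step sampling plausible for this objective.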
|
| 257 |
+
|
| 258 |
+
### Denoising Core (RLR)
|
| 259 |
+
|
| 260 |
+
Let f_θ denote the shared GLD blocks, and g_φ denote the abstract updater.
|
| 261 |
+
|
| 262 |
+
**Initialization:**
|
| 263 |
+
```
|
| 264 |
+
z⁽⁰⁾ = input_proj(flatten(z_t))
|
| 265 |
+
c = time_embed(sinusoidal(t)) + text_global
|
| 266 |
+
z_abs⁽⁰⁾ = mean_pool(z⁽⁰⁾)
|
| 267 |
+
```
|
| 268 |
+
|
| 269 |
+
**Outer loop** (j = 1..T_outer):
|
| 270 |
+
```
|
| 271 |
+
z_abs⁽ʲ⁾ = z_abs⁽ʲ⁻¹⁾ + tanh(α) · g_φ([norm(z), mean_pool(z)])
|
| 272 |
+
```
|
| 273 |
+
|
| 274 |
+
**Inner loop** (i = 1..T_inner):
|
| 275 |
+
```
|
| 276 |
+
c_step = c + recursion_embed((j - 1) · T_inner + i)
|
| 277 |
+
z_in = z + z_abs⁽ʲ⁾
|
| 278 |
+
z ← z + 0.5 · (f_θ(z_in, c_step, text_emb) - z)
|
| 279 |
+
```
|
| 280 |
+
|
| 281 |
+
**Output:**
|
| 282 |
+
```
|
| 283 |
+
v_θ(z_t, t, c) = out_proj(out_norm(z))
|
| 284 |
+
```
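The nested loops above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's implementation: `f_theta` and `g_phi` are toy stand-ins passed in as callables, the `norm(z)` input to `g_phi` and the cross-attention to `text_emb` are folded into those callables, and the recursion embedding is a plain lookup table with a 0-based index.

```python
import numpy as np

def rlr_core(z, c, f_theta, g_phi, rec_embed, t_outer=2, t_inner=4, alpha=0.1):
    """Nested refinement: slow abstract update outside, damped shared-block update inside."""
    z_abs = z.mean(axis=0)                             # z_abs^(0) = mean_pool(z^(0))
    for j in range(t_outer):
        # Slow (abstract) state update, gated by tanh(alpha).
        z_abs = z_abs + np.tanh(alpha) * g_phi(z, z.mean(axis=0))
        for i in range(t_inner):
            c_step = c + rec_embed[j * t_inner + i]    # conditioning + recursion-depth embedding
            z_in = z + z_abs                           # inject the abstract state into every token
            z = z + 0.5 * (f_theta(z_in, c_step) - z)  # damped update towards f_theta(z_in)
    return z

# Toy stand-ins (hypothetical): N tokens of width d.
rng = np.random.default_rng(0)
N, d = 16, 8
z_init = rng.standard_normal((N, d))
cond = np.zeros(d)
rec_table = np.zeros((2 * 4, d))
toy_f = lambda z_in, c_step: 0.5 * np.tanh(z_in) + c_step
toy_g = lambda z, pooled: pooled
out = rlr_core(z_init, cond, toy_f, toy_g, rec_table)
print(out.shape)
```

The same `f_theta` is applied `t_outer * t_inner` times, so the parameter count stays fixed while effective depth grows.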
|
| 285 |
+
|
| 286 |
+
### GLA Block (within f_θ)
|
| 287 |
+
|
| 288 |
+
```
|
| 289 |
+
Q, K, V = W_qkv · x (linear projection)
|
| 290 |
+
Q̃ = Q - λ · shift(Q) (token differential)
|
| 291 |
+
K̃ = K - λ · shift(K)
|
| 292 |
+
Q̃ = φ(Q̃), K̃ = φ(K̃) where φ(x) = 1 + elu(x)
|
| 293 |
+
|
| 294 |
+
Forward scan: S_i = γ · S_{i-1} + K̃_i^T · V_i; O_i^fwd = Q̃_i · S_i
|
| 295 |
+
Backward scan: (same in reverse)
|
| 296 |
+
|
| 297 |
+
O = O^fwd + O^bwd
|
| 298 |
+
O = sigmoid(W_g · x) · norm(O) · sigmoid(DWConv(W_local · x))
|
| 299 |
+
output = W_out · O
|
| 300 |
+
```
|
| 301 |
+
|
| 302 |
+
Complexity: **O(N · d²)** per direction, where d is head dimension and N is token count.
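A minimal single-head NumPy sketch of the scan above. It assumes `shift` means "previous token" with zero padding, uses a scalar decay γ, and leaves out the output gate, normalisation and depthwise-conv locality gate; the function names are illustrative and not taken from the `lrf` package. The explicit loop shows where the linear cost in N comes from: one state update per token.

```python
import numpy as np

def phi(x):
    """Feature map φ(x) = 1 + elu(x), which is positive everywhere."""
    return 1.0 + np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def decayed_scan(q, k, v, gamma):
    """Recurrent scan: S_i = γ·S_{i-1} + k_iᵀ v_i, o_i = q_i · S_i."""
    n, d = q.shape
    state = np.zeros((d, v.shape[1]))                  # running key-value summary
    out = np.empty_like(v)
    for i in range(n):
        state = gamma * state + np.outer(k[i], v[i])
        out[i] = q[i] @ state
    return out

def token_diff(x, lam):
    """Subtract λ times the previous token (the first token has no predecessor)."""
    shifted = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return x - lam * shifted

def bidirectional_gla(q, k, v, gamma=0.9, lam=0.1):
    q, k = phi(token_diff(q, lam)), phi(token_diff(k, lam))
    fwd = decayed_scan(q, k, v, gamma)
    bwd = decayed_scan(q[::-1], k[::-1], v[::-1], gamma)[::-1]  # same scan on the reversed sequence
    return fwd + bwd

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
print(bidirectional_gla(q, k, v).shape)
```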
|
| 303 |
+
|
| 304 |
+
### IFT Training (O(1) Memory)
|
| 305 |
+
|
| 306 |
+
During training, we detach gradients for all but the last recursion:
|
| 307 |
+
```
|
| 308 |
+
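import torch

# Run all but the last outer recursion without building an autograd graph.
with torch.no_grad():
    for j in range(T_outer - 1):
        z = recursive_refinement(z, c, text_emb)
# Only this final recursion is tracked, so activation memory does not grow with T_outer.
z = recursive_refinement(z, c, text_emb)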
|
| 312 |
+
```
|
| 313 |
+
|
| 314 |
+
By the Implicit Function Theorem, if z* is a fixed point of f, then:
|
| 315 |
+
```
|
| 316 |
+
∂z*/∂θ = (I - ∂f/∂z)⁻¹ · ∂f/∂θ
|
| 317 |
+
```
|
| 318 |
+
|
| 319 |
+
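The 1-step gradient approximates this by replacing (I - ∂f/∂z)⁻¹ with the identity, i.e. keeping only the first term of its Neumann series. Memory stays O(1) in the recursion depth; the approximation is most accurate when f is strongly contractive near z*, so that ∂f/∂z is small.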
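To make the relationship between the exact implicit gradient and the 1-step gradient concrete, here is a scalar toy example. The map `f(z; θ) = a·z + θ` is a hypothetical stand-in for the recursive core, chosen because its fixed point has a closed form.

```python
def f(z, theta, a=0.6):
    """Toy contractive map f(z; θ) = a·z + θ with |a| < 1."""
    return a * z + theta

def fixed_point(theta, iters=200):
    z = 0.0
    for _ in range(iters):
        z = f(z, theta)
    return z

a, theta, h = 0.6, 2.0, 1e-6
exact = 1.0 / (1.0 - a)   # (I - ∂f/∂z)⁻¹ · ∂f/∂θ, with ∂f/∂θ = 1
one_step = 1.0            # ∂f/∂θ alone: gradient through the last application only
numeric = (fixed_point(theta + h) - fixed_point(theta - h)) / (2 * h)  # finite-difference check

print(exact, one_step, numeric)
```

In this scalar case the 1-step gradient has the right sign but is smaller than the exact one by a factor of `1 - a`. In the multi-dimensional, nonlinear setting the direction can also differ, which is why the approximation error appears in the failure-mode table.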
|
| 320 |
+
|
| 321 |
+
---
|
| 322 |
+
|
| 323 |
+
## 7. Training Objective & Losses
|
| 324 |
+
|
| 325 |
+
### Stage 1: VAE Training
|
| 326 |
+
|
| 327 |
+
```
|
| 328 |
+
L_VAE = L_recon + λ_perc · L_perceptual + λ_KL · L_KL
|
| 329 |
+
|
| 330 |
+
L_recon = |x - x̂|₁ (L1 reconstruction)
|
| 331 |
+
L_perceptual = (1/3) Σ_{s=0}^{2} MSE(pool_s(x), pool_s(x̂)) (multi-scale)
|
| 332 |
+
L_KL = -0.5 · E[1 + log(σ²) - μ² - σ²] (KL divergence)
|
| 333 |
+
|
| 334 |
+
λ_perc = 1.0, λ_KL = 1e-6
|
| 335 |
+
```
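A NumPy sketch of the three terms above. It assumes `pool_s` is average pooling by a factor of 2^s, that images are (H, W, C) arrays, and that all terms use mean reduction; none of these details are confirmed by this README.

```python
import numpy as np

def avg_pool(x, factor):
    """Average-pool an (H, W, C) image by an integer factor (H and W must divide evenly)."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def vae_loss(x, x_hat, mu, logvar, lam_perc=1.0, lam_kl=1e-6):
    l_recon = np.abs(x - x_hat).mean()                             # L1 reconstruction
    l_perc = np.mean([
        np.mean((avg_pool(x, 2 ** s) - avg_pool(x_hat, 2 ** s)) ** 2)
        for s in range(3)                                          # scales s = 0, 1, 2
    ])
    l_kl = -0.5 * np.mean(1.0 + logvar - mu ** 2 - np.exp(logvar)) # KL to N(0, I)
    return l_recon + lam_perc * l_perc + lam_kl * l_kl

rng = np.random.default_rng(0)
x = rng.random((8, 8, 3))
print(vae_loss(x, x, np.zeros(4), np.zeros(4)))   # perfect reconstruction, unit-Gaussian posterior
```

With λ_KL = 1e-6 the KL term acts as a very light regulariser, so reconstruction quality dominates the objective.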
|
| 336 |
+
|
| 337 |
+
### Stage 2: Flow Matching
|
| 338 |
+
|
| 339 |
+
```
|
| 340 |
+
L_flow = E_{t,z₀,ε} [ w(t) · ‖v_θ(z_t, t, c) - (ε - z₀)‖² ]
|
| 341 |
+
|
| 342 |
+
w(t) = 1 / (t(1-t) + 0.01) (SNR weighting, normalized)
|
| 343 |
+
|
| 344 |
+
With 10% classifier-free guidance dropout:
|
| 345 |
+
P(c = ∅) = 0.1
|
| 346 |
+
```
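A sketch of the weighted loss and the conditioning dropout, assuming batched NumPy arrays. The shapes, the `null_emb` placeholder for the empty condition, and the function names are illustrative assumptions.

```python
import numpy as np

def flow_weight(t):
    """w(t) = 1 / (t(1-t) + 0.01): largest near t = 0 and t = 1, smallest at t = 0.5."""
    return 1.0 / (t * (1.0 - t) + 0.01)

def flow_loss(v_pred, z0, eps, t):
    """Weighted squared error between the predicted velocity and ε - z₀, averaged over the batch."""
    target = eps - z0
    per_sample = ((v_pred - target) ** 2).mean(axis=tuple(range(1, v_pred.ndim)))
    return np.mean(flow_weight(t) * per_sample)

def drop_condition(text_emb, null_emb, rng, p_drop=0.1):
    """Swap each sample's (L, D) text embedding for the null embedding with probability p_drop."""
    drop = rng.random(text_emb.shape[0]) < p_drop
    return np.where(drop[:, None, None], null_emb, text_emb)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 32, 4, 4))
eps = rng.standard_normal(z0.shape)
t = rng.random(8)                                  # t ~ U[0, 1], one value per sample
print(flow_loss(np.zeros_like(z0), z0, eps, t))    # loss of a model that always predicts zero velocity
```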
|
| 347 |
+
|
| 348 |
+
### Stage 3: Consistency Distillation
|
| 349 |
+
|
| 350 |
+
```
|
| 351 |
+
L_CD = ‖f_θ(z_{t_n}, t_n, c) - sg[f_{teacher}(z_{t_{n-1}}, t_{n-1}, c)]‖²
|
| 352 |
+
|
| 353 |
+
where f_teacher uses the trained flow model with one Euler step:
|
| 354 |
+
z_{t_{n-1}} = z_{t_n} - (t_n - t_{n-1}) · v_teacher(z_{t_n}, t_n, c)
|
| 355 |
+
```
|
| 356 |
+
|
| 357 |
+
### Stage 4: Editing Fine-tuning
|
| 358 |
+
|
| 359 |
+
Same flow matching loss, but with additional image condition:
|
| 360 |
+
```
|
| 361 |
+
v_θ(z_t, t, c, z_src) where z_src = encode(source_image)
|
| 362 |
+
```
|
| 363 |
+
|
| 364 |
+
Additive conditioning: `z_input = z + z_src` before the RLR core.
|
| 365 |
+
|
| 366 |
+
---
|
| 367 |
+
|
| 368 |
+
## 8. Memory & Compute Budget
|
| 369 |
+
|
| 370 |
+
### Inference (1024×1024, Default Config, INT8)
|
| 371 |
+
|
| 372 |
+
| Component | Memory |
|
| 373 |
+
|---|---|
|
| 374 |
+
| Text Encoder (INT8) | 11 MB |
|
| 375 |
+
| VAE Decoder (INT8) | 1 MB |
|
| 376 |
+
| Denoising Core (INT8) | 0.6 MB |
|
| 377 |
+
| Latent activations (64×64×32) | 0.5 MB |
|
| 378 |
+
| Peak activation memory | ~200 MB |
|
| 379 |
+
| **Total** | **~213 MB** |
|
| 380 |
+
|
| 381 |
+
This comfortably fits within 3-4 GB mobile RAM.
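The weight and latent figures quoted in this README follow from simple arithmetic. The sketch below assumes "MB" is read as 2^20 bytes and uses the rounded 16.3M parameter count from the Key Numbers table, so the last digit can differ slightly from per-component sums.

```python
MIB = 1024 ** 2

def size_mib(num_values, bytes_per_value):
    """Storage for num_values scalars at the given width, in MiB."""
    return num_values * bytes_per_value / MIB

default_params = 16.3e6
weights_fp32 = size_mib(default_params, 4)   # FP32 weights, 4 bytes each
adamw_states = 2 * weights_fp32              # AdamW keeps two moment tensors per parameter
latent_values = 64 * 64 * 32                 # 1024² image at f=16 with 32 channels
latent_fp32 = size_mib(latent_values, 4)

print(round(weights_fp32, 1), round(adamw_states, 1), latent_values, latent_fp32)
```

The ~200 MB peak activation estimate is not derived here; it depends on the implementation and runtime.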
|
| 382 |
+
|
| 383 |
+
### Training (16 GB GPU, Default Config)
|
| 384 |
+
|
| 385 |
+
| Item | Memory |
|
| 386 |
+
|---|---|
|
| 387 |
+
| Model parameters (FP32) | 62 MB |
|
| 388 |
+
| Optimizer states (AdamW, 2×) | 124 MB |
|
| 389 |
+
| Gradients | 62 MB |
|
| 390 |
+
| Batch activations (BS=8, 64×64) | ~500 MB |
|
| 391 |
+
| IFT overhead (only last recursion) | ~50 MB |
|
| 392 |
+
| **Total** | **~800 MB** |
|
| 393 |
+
|
| 394 |
+
Leaves ample room for larger batch sizes or higher resolution on 16 GB.
|
| 395 |
+
|
| 396 |
+
---
|
| 397 |
+
|
| 398 |
+
## 9. Training Curriculum
|
| 399 |
+
|
| 400 |
+
### Stage 1: VAE (50K steps)
|
| 401 |
+
- **Data**: ImageNet or COCO (any large image dataset)
|
| 402 |
+
- **Resolution**: 256×256
|
| 403 |
+
- **What to freeze**: Nothing
|
| 404 |
+
- **What to train**: Full VAE
|
| 405 |
+
- **LR**: 1e-4, AdamW, weight_decay=0.01
|
| 406 |
+
- **Key**: Train until L_recon < 0.1
|
| 407 |
+
|
| 408 |
+
### Stage 2: Flow Matching — Low Resolution (100K steps)
|
| 409 |
+
- **Data**: Synthetic captions from teacher (SDXL) + LAION-aesthetic subset
|
| 410 |
+
- **Resolution**: 64×64
|
| 411 |
+
- **What to freeze**: VAE
|
| 412 |
+
- **What to train**: Core + Text Encoder
|
| 413 |
+
- **LR**: 1e-4
|
| 414 |
+
- **Key**: Focus on learning composition and prompt adherence
|
| 415 |
+
|
| 416 |
+
### Stage 3: Flow Matching — Mid Resolution (200K steps)
|
| 417 |
+
- **Data**: Filtered LAION-aesthetic (score > 6.0) + synthetic
|
| 418 |
+
- **Resolution**: 256×256
|
| 419 |
+
- **What to freeze**: VAE
|
| 420 |
+
- **What to train**: Core + Text Encoder
|
| 421 |
+
- **LR**: 5e-5
|
| 422 |
+
- **Key**: Focus on texture and detail
|
| 423 |
+
|
| 424 |
+
### Stage 4: Flow Matching — High Resolution (100K steps)
|
| 425 |
+
- **Data**: High-quality curated + JourneyDB
|
| 426 |
+
- **Resolution**: 512×512
|
| 427 |
+
- **What to freeze**: VAE
|
| 428 |
+
- **What to train**: Core + Text Encoder
|
| 429 |
+
- **LR**: 2e-5
|
| 430 |
+
- **Key**: Focus on fine detail and typography
|
| 431 |
+
|
| 432 |
+
### Stage 5: Consistency Distillation (50K steps)
|
| 433 |
+
- **Data**: Same as Stage 4
|
| 434 |
+
- **What to freeze**: VAE + Text Encoder
|
| 435 |
+
- **What to train**: Core only
|
| 436 |
+
- **LR**: 1e-5
|
| 437 |
+
- **Key**: Distill from own multi-step model to 4-step generation
|
| 438 |
+
|
| 439 |
+
### Stage 6: Editing Fine-tuning (50K steps)
|
| 440 |
+
- **Data**: InstructPix2Pix + MagicBrush + synthetic edit pairs
|
| 441 |
+
- **What to freeze**: VAE
|
| 442 |
+
- **What to train**: Core + Text Encoder
|
| 443 |
+
- **LR**: 1e-5
|
| 444 |
+
- **Key**: Add image conditioning channel
|
| 445 |
+
|
| 446 |
+
---
|
| 447 |
+
|
| 448 |
+
## 10. Deployment Plan for Mobile
|
| 449 |
+
|
| 450 |
+
### Step 1: Quantization
|
| 451 |
+
- INT8 per-channel weight quantization (static)
|
| 452 |
+
- INT8 per-token activation quantization (dynamic)
|
| 453 |
+
- Result: ~4× model size reduction
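A minimal sketch of per-channel weight quantization, assuming a symmetric scheme with one scale per output row. The README does not specify the exact scheme or toolchain, so this only illustrates where the ~4× figure comes from.

```python
import numpy as np

def quantize_per_channel(w):
    """Symmetric INT8 quantization with one scale per output channel (row of w)."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)   # avoid dividing by zero for all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
q, scale = quantize_per_channel(w)

print(w.nbytes / q.nbytes)                       # FP32 -> INT8 storage ratio (scales excluded)
print(np.abs(dequantize(q, scale) - w).max())    # worst-case rounding error
```

The per-row scales add a small overhead (one float per output channel), which is why the reduction is approximately, not exactly, 4×.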
|
| 454 |
+
|
| 455 |
+
### Step 2: Operator Optimization
|
| 456 |
+
- Replace GELU → SiLU throughout (MobileDiffusion finding: GELU causes float16 instability)
|
| 457 |
+
- Fuse norm + activation + linear into single kernels
|
| 458 |
+
- Use CoreML (iOS) or NNAPI (Android) for hardware acceleration
|
| 459 |
+
|
| 460 |
+
### Step 3: Step Reduction
|
| 461 |
+
- After consistency distillation: 4 Euler steps sufficient
|
| 462 |
+
- With further adversarial distillation: 1-2 steps possible
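A sketch of few-step Euler sampling for the rectified-flow ODE, integrating from t = 1 (noise) to t = 0 (data). The `velocity_fn` signature, the uniform step schedule and the standard classifier-free-guidance combination are assumptions; the actual `LRFPipeline` may differ.

```python
import numpy as np

def euler_sample(velocity_fn, shape, num_steps=4, guidance_scale=3.0, rng=None):
    """Integrate dz/dt = v from t=1 to t=0 with uniform Euler steps."""
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal(shape)                 # start from pure noise at t = 1
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v_cond = velocity_fn(z, t_cur, True)       # text-conditioned prediction
        v_uncond = velocity_fn(z, t_cur, False)    # unconditional prediction
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        z = z - (t_cur - t_next) * v               # step towards t = 0
    return z

# Toy check: an oracle velocity for a known clean latent lands exactly on it.
target = np.full((32, 4, 4), 0.5)
oracle = lambda z, t, conditional: (z - target) / t
sample = euler_sample(oracle, target.shape, rng=np.random.default_rng(0))
print(np.abs(sample - target).max())
```

Each step costs two forward passes when guidance is used, so 4 steps correspond to 8 evaluations of the denoising core.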
|
| 463 |
+
|
| 464 |
+
### Step 4: Latent Size Optimization
|
| 465 |
+
- f=16 compression: 1024² → 64×64 latents
|
| 466 |
+
- 32 channels per position
|
| 467 |
+
- Total latent: 64×64×32 = 131,072 values ≈ 0.5 MB
|
| 468 |
+
|
| 469 |
+
### Projected Performance
|
| 470 |
+
| Device | Steps | Estimated Time |
|
| 471 |
+
|---|---|---|
|
| 472 |
+
| iPhone 16 Pro (ANE) | 4 | ~0.5-1.0s |
|
| 473 |
+
| Pixel 8 Pro (GPU) | 4 | ~1.0-2.0s |
|
| 474 |
+
| iPhone 14 (GPU) | 8 | ~2.0-3.0s |
|
| 475 |
+
|
| 476 |
+
---
|
| 477 |
+
|
| 478 |
+
## 11. Failure Mode Analysis
|
| 479 |
+
|
| 480 |
+
| Failure Mode | Cause | Detection | Fix |
|
| 481 |
+
|---|---|---|---|
|
| 482 |
+
| **Fixed-point non-convergence** | Recursion doesn't converge | Monitor z change per recursion | Damped update (damping factor 0.5, as in the inner loop), reduce T_inner |
|
| 483 |
+
| **Oversmoothing** | GLA loses high-frequency detail | Blurry outputs, high LPIPS distance to reference images | Increase token-differential λ, add DW-conv skip |
|
| 484 |
+
| **Mode collapse** | Small model capacity | FID increases, low diversity | Increase num_blocks or dim |
|
| 485 |
+
| **Training instability** | IFT gradient approximation error | Loss spikes | Reduce LR, increase warmup, disable IFT temporarily |
|
| 486 |
+
| **Poor text adherence** | Weak cross-attention | Low CLIP score | Increase cross-attention gates, add more cross-attn layers |
|
| 487 |
+
| **VAE artifacts** | Aggressive compression | Reconstruction artifacts | Lower f (use f=8), increase decoder capacity |
|
| 488 |
+
| **CFG artifacts** | High guidance scale | Oversaturated images | Train with 10% unconditional, use CFG 3-5 range |
|
| 489 |
+
|
| 490 |
+
---
|
| 491 |
+
|
| 492 |
+
## 12. Ablation Plan
|
| 493 |
+
|
| 494 |
+
### Ablation 1: Recursion Depth vs Quality
|
| 495 |
+
- **Vary**: T_inner ∈ {1, 2, 4, 6, 8}, T_outer ∈ {1, 2, 3}
|
| 496 |
+
- **Measure**: FID, CLIP score, inference time
|
| 497 |
+
- **Hypothesis**: Quality plateaus around T_inner=4-6; diminishing returns beyond T_outer=2
|
| 498 |
+
|
| 499 |
+
### Ablation 2: GLA vs Standard Attention
|
| 500 |
+
- **Compare**: GLA blocks vs softmax attention blocks (same dim, same depth)
|
| 501 |
+
- **Measure**: FID, memory, throughput
|
| 502 |
+
- **Hypothesis**: GLA matches attention quality at 3-5× lower memory
|
| 503 |
+
|
| 504 |
+
### Ablation 3: Token Differential
|
| 505 |
+
- **Vary**: λ ∈ {0, 0.05, 0.1, 0.2, learned}
|
| 506 |
+
- **Measure**: FID, sharpness metrics (gradient magnitude)
|
| 507 |
+
- **Hypothesis**: λ=0.1 optimal; λ=0 causes oversmoothing
|
| 508 |
+
|
| 509 |
+
### Ablation 4: IFT vs Full Backprop
|
| 510 |
+
- **Compare**: IFT training vs full BPTT (at small T for memory comparison)
|
| 511 |
+
- **Measure**: Final FID, training memory, convergence speed
|
| 512 |
+
- **Hypothesis**: IFT within 2% FID of full backprop at 8-16× memory savings
|
| 513 |
+
|
| 514 |
+
### Ablation 5: VAE Compression
|
| 515 |
+
- **Vary**: f ∈ {8, 16, 32}, C ∈ {8, 16, 32}
|
| 516 |
+
- **Measure**: rFID, PSNR, generation FID
|
| 517 |
+
- **Hypothesis**: f=16, C=16-32 is the sweet spot for mobile quality
|
| 518 |
+
|
| 519 |
+
### Ablation 6: Abstract State (H-module)
|
| 520 |
+
- **Compare**: With/without abstract state update
|
| 521 |
+
- **Measure**: FID, coherence metrics
|
| 522 |
+
- **Hypothesis**: Abstract state improves global composition coherence
|
| 523 |
+
|
| 524 |
+
---
|
| 525 |
+
|
| 526 |
+
## 13. Editing Roadmap
|
| 527 |
+
|
| 528 |
+
The LRF architecture is designed for editing-readiness through **additive image conditioning**:
|
| 529 |
+
|
| 530 |
+
### Phase 1: Inpainting
|
| 531 |
+
- Add binary mask channel to condition input
|
| 532 |
+
- `z_input = z + z_src * mask + z_noise * (1 - mask)`
|
| 533 |
+
- Train on random masking + MagicBrush data
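A small sketch of the masked blend above, reading the formula as "mask = 1 where the source latent is kept, mask = 0 where content is regenerated". The (H, W, C) layout and the function name are illustrative assumptions.

```python
import numpy as np

def inpaint_condition(z, z_src, z_noise, mask):
    """Blend source latent and noise into the core input; mask has shape (H, W, 1)."""
    return z + z_src * mask + z_noise * (1.0 - mask)

h, w, c = 4, 4, 32
z = np.zeros((h, w, c))
z_src = np.ones((h, w, c))          # stand-in for encode(source_image)
z_noise = np.full((h, w, c), 5.0)   # stand-in for sampled noise
mask = np.zeros((h, w, 1))
mask[:, : w // 2] = 1.0             # keep the left half, regenerate the right half

z_input = inpaint_condition(z, z_src, z_noise, mask)
print(z_input[0, 0, 0], z_input[0, -1, 0])
```

The mask is defined at latent resolution here; a pixel-space mask would first need to be downsampled by the same factor of 16 as the image.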
|
| 534 |
+
|
| 535 |
+
### Phase 2: Image-to-Image Translation
|
| 536 |
+
- Source image encoded to latent, added to noisy latent
|
| 537 |
+
- Noise level controls edit strength (low noise = subtle edit)
|
| 538 |
+
- No architectural changes needed
|
| 539 |
+
|
| 540 |
+
### Phase 3: Instruction-Based Editing (OmniGen-style)
|
| 541 |
+
- Text encoder receives both instruction AND image description
|
| 542 |
+
- Source image latent added as conditioning
|
| 543 |
+
- Train on InstructPix2Pix + SEED-edit data
|
| 544 |
+
|
| 545 |
+
### Phase 4: Super-Resolution
|
| 546 |
+
- Low-res image encoded, upscaled in latent space
|
| 547 |
+
- Decoder generates high-res output
|
| 548 |
+
- Train on paired low/high-res data
|
| 549 |
+
|
| 550 |
+
### Phase 5: Style Transfer & Identity Preservation
|
| 551 |
+
- Reference image encoded to separate latent
|
| 552 |
+
- Cross-attention between reference and generation
|
| 553 |
+
- Train on same-identity different-image pairs (GRIT-Entity)
|
| 554 |
+
|
| 555 |
+
### Phase 6: Multi-Image Conditioning
|
| 556 |
+
- OmniGen-style interleaved image-text input
|
| 557 |
+
- Multiple source images encoded and concatenated in latent space
|
| 558 |
+
- Enables try-on, compositing, scene editing
|
| 559 |
+
|
| 560 |
+
### Why This Works
|
| 561 |
+
The key architectural decisions that enable editing:
|
| 562 |
+
1. **Additive conditioning** preserves spatial correspondence (source and target latents share the same grid, so position i in the source latent aligns with token i in the generated latent)
|
| 563 |
+
2. **Recursive refinement** naturally handles conditioning — the model can "reason" about how to modify the latent
|
| 564 |
+
3. **Cross-attention to text** at every recursion step allows the model to follow editing instructions progressively
|
| 565 |
+
4. **Same parameter reuse** means editing capability doesn't require new parameters — just new training data
|
| 566 |
+
|
| 567 |
+
---
|
| 568 |
+
|
| 569 |
+
## Quick Start
|
| 570 |
+
|
| 571 |
+
```python
|
| 572 |
+
# Install dependencies (notebook syntax; in a shell, drop the leading "!")
!pip install torch einops safetensors
|
| 574 |
+
|
| 575 |
+
# Use the pipeline
|
| 576 |
+
from lrf.model import LatentRecurrentFlow
|
| 577 |
+
from lrf.pipeline import LRFPipeline
|
| 578 |
+
|
| 579 |
+
# Create model
|
| 580 |
+
model = LatentRecurrentFlow(LatentRecurrentFlow.tiny_config())
|
| 581 |
+
pipe = LRFPipeline(model)
|
| 582 |
+
|
| 583 |
+
# Generate
|
| 584 |
+
images = pipe("a sunset over the ocean", num_steps=10, height=64, width=64)
|
| 585 |
+
|
| 586 |
+
# Or train
|
| 587 |
+
from lrf.training import run_prototype_training
|
| 588 |
+
model, trainer = run_prototype_training(num_vae_steps=100, num_flow_steps=100)
|
| 589 |
+
```
|
| 590 |
+
|
| 591 |
+
See `notebook.ipynb` for the full interactive walkthrough.
|
| 592 |
+
|
| 593 |
+
---
|
| 594 |
+
|
| 595 |
+
## Citation
|
| 596 |
+
|
| 597 |
+
```bibtex
|
| 598 |
+
@software{lrf2026,
|
| 599 |
+
title={LatentRecurrentFlow: A Novel Mobile-First Image Generation Architecture},
|
| 600 |
+
author={LRF Research},
|
| 601 |
+
year={2026},
|
| 602 |
+
url={https://huggingface.co/krystv/LatentRecurrentFlow}
|
| 603 |
+
}
|
| 604 |
+
```
|
| 605 |
+
|
| 606 |
+
## License
|
| 607 |
+
|
| 608 |
+
Apache 2.0
|