krystv committed on
Commit
e80dae2
·
verified ·
1 Parent(s): 46be4ab

Add comprehensive README with full architecture documentation

Files changed (1)
  1. README.md +608 -0
README.md ADDED
@@ -0,0 +1,608 @@
1
+ ---
2
+ tags:
3
+ - image-generation
4
+ - latent-recurrent-flow
5
+ - lrf
6
+ - mobile-first
7
+ - flow-matching
8
+ - recursive-reasoning
9
+ - novel-architecture
10
+ - subquadratic-attention
11
+ - gated-linear-attention
12
+ - research
13
+ library_name: lrf
14
+ pipeline_tag: text-to-image
15
+ license: apache-2.0
16
+ ---
17
+
18
+ # LatentRecurrentFlow (LRF) — A Novel Mobile-First Image Generation Architecture
19
+
20
+ > A new image-generation architecture designed from scratch to run on consumer devices with 3–4 GB of RAM and to train within a 16 GB GPU memory budget.
21
+
22
+ ---
23
+
24
+ ## Table of Contents
25
+
26
+ 1. [Architecture Overview](#1-architecture-overview)
27
+ 2. [Shortlist of Most Relevant Papers](#2-shortlist-of-most-relevant-papers)
28
+ 3. [Paper Critiques](#3-paper-critiques)
29
+ 4. [Full Proposed Architecture](#4-full-proposed-architecture-latentrecurrentflow)
30
+ 5. [Module-by-Module Diagram](#5-module-by-module-diagram)
31
+ 6. [Mathematical Formulation](#6-mathematical-formulation)
32
+ 7. [Training Objective & Losses](#7-training-objective--losses)
33
+ 8. [Memory & Compute Budget](#8-memory--compute-budget)
34
+ 9. [Training Curriculum](#9-training-curriculum)
35
+ 10. [Deployment Plan for Mobile](#10-deployment-plan-for-mobile)
36
+ 11. [Failure Mode Analysis](#11-failure-mode-analysis)
37
+ 12. [Ablation Plan](#12-ablation-plan)
38
+ 13. [Editing Roadmap](#13-editing-roadmap)
39
+
40
+ ---
41
+
42
+ ## 1. Architecture Overview
43
+
44
+ LRF combines five key innovations into a single coherent architecture:
45
+
46
+ | Innovation | Source Inspiration | What It Does |
47
+ |---|---|---|
48
+ | **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
49
+ | **Gated Linear Diffusion (GLD)** blocks | ViG/GLA + DyDiLA | O(N) subquadratic spatial mixing replacing O(N²) attention |
50
+ | **Compact f=16 VAE** | SANA DC-AE + SnapGen | 16× downsampling per spatial side with a ~283K-parameter decoder (tiny config; 1.1M in the default config) |
51
+ | **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
52
+ | **Multimodal Conditioning** | OmniGen | Same core supports text-to-image AND editing via additive image conditioning |
53
+
54
+ ### Key Numbers (Tiny Config — 5.7M params)
55
+
56
+ | Component | Parameters | FP32 Size | INT8 Size |
57
+ |---|---|---|---|
58
+ | VAE Encoder | 777K | 3.0 MB | 0.7 MB |
59
+ | VAE Decoder | 283K | 1.1 MB | 0.3 MB |
60
+ | Text Encoder | 4.5M | 17.3 MB | 4.3 MB |
61
+ | Denoising Core | 102K | 0.4 MB | 0.1 MB |
62
+ | **Total** | **5.7M** | **21.7 MB** | **5.4 MB** |
63
+
64
+ ### Key Numbers (Default Config — 16.3M params)
65
+
66
+ | Component | Parameters | FP32 Size | INT8 Size |
67
+ |---|---|---|---|
68
+ | VAE Encoder | 3.1M | 11.7 MB | 2.9 MB |
69
+ | VAE Decoder | 1.1M | 4.1 MB | 1.0 MB |
70
+ | Text Encoder | 11.5M | 43.9 MB | 11.0 MB |
71
+ | Denoising Core | 651K | 2.5 MB | 0.6 MB |
72
+ | **Total** | **16.3M** | **62.2 MB** | **15.6 MB** |
73
+
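The size columns in the two tables above appear to follow from the parameter counts at 4 bytes per parameter for FP32 and 1 byte for INT8, reported in binary megabytes. That reading is an assumption, not something the README states. A minimal sketch of the arithmetic, using the rounded tiny-config counts (so the results match the table only approximately):

```python
def size_mib(num_params: int, bytes_per_param: int) -> float:
    """Weight storage in MiB for a given parameter count and precision."""
    return num_params * bytes_per_param / 2**20

# Rounded tiny-config counts taken from the table above.
tiny = {
    "vae_encoder": 777_000,
    "vae_decoder": 283_000,
    "text_encoder": 4_500_000,
    "denoising_core": 102_000,
}

total = sum(tiny.values())
for name, n in tiny.items():
    print(f"{name:15s} fp32={size_mib(n, 4):5.1f} MiB  int8={size_mib(n, 1):4.1f} MiB")
print(f"total params: {total / 1e6:.1f}M")
```

The text encoder dominates both configurations, so shrinking or sharing it would change the on-device footprint far more than any change to the denoising core.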
74
+ ---
75
+
76
+ ## 2. Shortlist of Most Relevant Papers
77
+
78
+ ### A. Subquadratic Spatial Mixing for Image Generation
79
+
80
+ | Paper | arxiv | Key Contribution | FID Result |
81
+ |---|---|---|---|
82
+ | **PDE-SSM-DiT** | 2603.13663 | Fourier PDE operator replaces attention, O(N log N), 34× speedup | 18.36 (CelebA-HQ 256) |
83
+ | **DiMSUM** (NeurIPS 2024) | 2411.04168 | Mamba + wavelet subbands + shared transformer | **2.11** (CelebA-HQ 256) |
84
+ | **ViG/GLA** | 2405.18425 | Gated Linear Attention with 2D locality injection | 90% less memory at 1024² |
85
+ | **DyDiLA** | 2601.13683 | Dynamic differential linear attention | **6.80** (SubIN 256) |
86
+ | **Mamba2D** | 2412.16146 | True 2D SSM with wavefront scan | 84.0% top-1 IN-1K (27M) |
87
+
88
+ ### B. Recursive/Iterative Reasoning
89
+
90
+ | Paper | arxiv | Key Contribution |
91
+ |---|---|---|
92
+ | **HRM** | 2506.21734 | 2-level recurrent fixed-point reasoning, O(1) memory via IFT |
93
+ | **TRM** (6473 ⭐) | 2510.04871 | 7M params → 45% ARC-AGI-1 via deep recursion |
94
+ | **Thinking Pixel** | 2604.25299 | Sparse MoE adapters for recursive visual reasoning in DiT |
95
+
96
+ ### C. Compact Latent Spaces
97
+
98
+ | Paper | arxiv | Compression | Quality |
99
+ |---|---|---|---|
100
+ | **SANA DC-AE** | 2410.10629 | f=32, C=32 → 32×32 latents for 1024² | PSNR 29.29, rFID 0.34 |
101
+ | **SnapGen** | 2412.09619 | 1.38M tiny decoder (35× smaller than SD3) | PSNR 27.85 |
102
+ | **TiTok** | 2406.07550 | 32 tokens per 256² image | gFID 1.97 (IN-256) |
103
+ | **MobileDiffusion** | 2311.16567 | f=8, c=8 VAE, sub-second on iPhone | Better than SD-1.5 at 8 steps |
104
+
105
+ ### D. Few-Step Generation
106
+
107
+ | Paper | arxiv | Key Result |
108
+ |---|---|---|
109
+ | **Consistency Models** | 2303.01469 | One-step generation from diffusion |
110
+ | **LCM** | 2310.04378 | 2-4 step high-quality via consistency distillation |
111
+ | **SD3.5-Flash** | 2509.21318 | Few-step distillation with timestep sharing |
112
+
113
+ ### E. Unified Generation + Editing
114
+
115
+ | Paper | arxiv | Key Contribution |
116
+ |---|---|---|
117
+ | **OmniGen** | 2409.11340 | Single model for T2I + editing + control, interleaved image-text input |
118
+ | **OmniGen2** | 2506.18871 | Dual decoding pathways, decoupled image tokenizer |
119
+ | **InstructPix2Pix** | 2211.09800 | Image editing from text instructions |
120
+
121
+ ### F. Mobile Deployment
122
+
123
+ | Paper | arxiv | Device Performance |
124
+ |---|---|---|
125
+ | **SnapGen** | 2412.09619 | 1.4s on iPhone 15 Pro, 372M UNet |
126
+ | **SnapGen++** | 2601.08303 | 1.8s on iPhone 16, 0.4B sub-DiT |
127
+ | **MobileDiffusion** | 2311.16567 | Sub-second on iPhone, ~400M params |
128
+
129
+ ---
130
+
131
+ ## 3. Paper Critiques
132
+
133
+ ### PDE-SSM (2603.13663) ✅ Borrowed: Physical inductive bias concept
134
+ - **Why it helps**: 34× speedup from FFT-based spatial operator with physically grounded bias
135
+ - **What it fails at**: FID still behind DiMSUM (18.36 vs 2.11); requires FFT which is non-trivial on mobile
136
+ - **Borrowed**: Concept of learnable PDE-style spatial operators; we adapt this to our GLD blocks
137
+
138
+ ### HRM/TRM (2506.21734, 2510.04871) ✅ Borrowed: Core recursive architecture
139
+ - **Why it helps**: O(1) memory backprop via IFT; extreme parameter efficiency (7M params → 45% on ARC-AGI-1)
140
+ - **What it fails at**: Never applied to image generation; fixed-point convergence not guaranteed for images
141
+ - **Borrowed**: Two-level recursion (abstract + detail), IFT training, recursion depth embedding
142
+
143
+ ### ViG/GLA (2405.18425) ✅ Borrowed: Spatial mixing block
144
+ - **Why it helps**: Hardware-aware, 90% memory savings, bidirectional GLA with locality injection
145
+ - **What it fails at**: Only tested on classification/detection, not generation
146
+ - **Borrowed**: Bidirectional GLA core, depthwise conv locality injection (GaLI), token differential (from DyDiLA)
147
+
148
+ ### SANA DC-AE (2410.10629) ✅ Borrowed: Latent space design principles
149
+ - **Why it helps**: f=32 achieves similar quality to f=8 but 16× fewer tokens
150
+ - **What it fails at**: Decoder is still large (50M); typography needs decoder-only LLM text encoder
151
+ - **Borrowed**: High-compression VAE principle; we use f=16 as a compromise for fine detail
152
+
153
+ ### SnapGen (2412.09619) ✅ Borrowed: Tiny decoder architecture
154
+ - **Why it helps**: 35× smaller decoder, 54× faster decode, negligible quality loss
155
+ - **What it fails at**: Proprietary weights; still uses quadratic attention in the UNet backbone
156
+ - **Borrowed**: Attention-free decoder, SepConv, minimal GroupNorm, SiLU instead of GELU
157
+
158
+ ### TiTok (2406.07550) ❌ Rejected: Too aggressive compression
159
+ - **Why it was considered**: 32 tokens per image is incredibly compact
160
+ - **Why rejected**: rFID=16.2 means visible artifacts; fine detail and typography badly degraded at 32 tokens
161
+
162
+ ### DiMSUM (2411.04168) ⚠️ Partially borrowed: Wavelet concept
163
+ - **Why it helps**: Best FID (2.11) among SSM-based approaches
164
+ - **What it fails at**: Still uses cross-attention fusion → partially quadratic; complex architecture
165
+ - **Borrowed**: Wavelet decomposition concept for frequency-aware processing
166
+
167
+ ---
168
+
169
+ ## 4. Full Proposed Architecture: LatentRecurrentFlow
170
+
171
+ ### Name: **LatentRecurrentFlow (LRF)**
172
+
173
+ LRF is a **recursive flow-matching image generator** that uses:
174
+ - A compact VAE with f=16 compression and a tiny decoder (~283K parameters in the tiny config, 1.1M in the default config)
175
+ - A **Recursive Latent Refinement (RLR) core** that iteratively refines image latents through shared GLD blocks
176
+ - A **rectified flow** training objective for clean few-step generation
177
+ - **Additive image conditioning** for editing-readiness
178
+
179
+ The core insight: **instead of stacking many unique layers, reuse a small set of blocks recursively**. This exploits the observation from HRM/TRM that iterative application of the same function can converge to a fixed point that represents the solution — analogous to how diffusion models iteratively denoise.
180
+
181
+ ---
182
+
183
+ ## 5. Module-by-Module Diagram
184
+
185
+ ```
186
+ ┌─────────────────────────────────────────────────────────────┐
187
+ │ LatentRecurrentFlow │
188
+ │ │
189
+ │ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │
190
+ │ │ Compact │ │ Simple │ │ Rectified │ │
191
+ │ │ VAE │ │ Text │ │ Flow │ │
192
+ │ │ (f=16) │ │ Encoder │ │ Scheduler │ │
193
+ │ │ │ │ │ │ │ │
194
+ │ │ Encoder ────┤ │ Embed ──────┤ │ t ~ U[0,1] │ │
195
+ │ │ (3.1M) │ │ Transformer │ │ z_t = (1-t) │ │
196
+ │ │ │ │ (11.5M) │ │ z_0 + tε │ │
197
+ │ │ Decoder ────┤ │ │ │ │ │
198
+ │ │ (1.1M, tiny)│ │ → text_emb │ │ v = ε - z_0 │ │
199
+ │ └──────┬───────┘ │ → text_glob │ └────────┬───────┘ │
200
+ │ │ └──────┬───────┘ │ │
201
+ │ │ │ │ │
202
+ │ ┌──────▼───────────────────▼─────────────────────▼──────┐ │
203
+ │ │ Recursive Latent Core (RLR) │ │
204
+ │ │ │ │
205
+ │ │ ┌─────────────────────────────────────────────────┐ │ │
206
+ │ │ │ OUTER LOOP (j = 1..T_outer) │ │ │
207
+ │ │ │ │ │ │
208
+ │ │ │ z_abstract ← f_slow(z, z_pooled) [H-module] │ │ │
209
+ │ │ │ │ │ │
210
+ │ │ │ ┌─────────────────────────────────────────┐ │ │ │
211
+ │ │ │ │ INNER LOOP (i = 1..T_inner) │ │ │ │
212
+ │ │ │ │ │ │ │ │
213
+ │ │ │ │ cond = t_emb + text_global + rec_emb │ │ │ │
214
+ │ │ │ │ z_in = z + z_abstract │ │ │ │
215
+ │ │ │ │ │ │ │ │
216
+ │ │ │ │ FOR block in GLD_blocks: │ │ │ │
217
+ │ │ │ │ ┌─────────────────────────────────┐ │ │ │ │
218
+ │ │ │ │ │ GLD Block │ │ │ │ │
219
+ │ │ │ │ │ │ │ │ │ │
220
+ │ │ │ │ │ 1. AdaLN-modulate(z, cond) │ │ │ │ │
221
+ │ │ │ │ │ 2. GLA: BiDir scan + DiffToken │ │ │ │ │
222
+ │ │ │ │ │ + DW-Conv locality gate │ │ │ │ │
223
+ │ │ │ │ │ 3. Cross-attn to text_emb │ │ │ │ │
224
+ │ │ │ │ │ 4. AdaLN-modulate(z, cond) │ │ │ │ │
225
+ │ │ │ │ │ 5. SwiGLU FFN │ │ │ │ │
226
+ │ │ │ │ └─────────────────────────────────┘ │ │ │ │
227
+ │ │ │ │ │ │ │ │
228
+ │ │ │ │ z = z + 0.5 * (blocks(z_in) - z) │ │ │ │
229
+ │ │ │ └─────────────────────────────────────────┘ │ │ │
230
+ │ │ └─────────────────────────────────────────────────┘ │ │
231
+ │ │ │ │
232
+ │ │ v = out_proj(out_norm(z)) ← velocity prediction │ │
233
+ │ └─────────────────────────────────────────────────────────┘ │
234
+ │ │
235
+ │ Training: IFT backprop (O(1) memory through recursion) │
236
+ │ Inference: Full recursion (no grad needed) │
237
+ └─────────────────────────────────────────────────────────────┘
238
+ ```
239
+
240
+ ---
241
+
242
+ ## 6. Mathematical Formulation
243
+
244
+ ### Forward Process (Rectified Flow)
245
+
246
+ Given clean latent z₀ and noise ε ~ N(0, I):
247
+
248
+ ```
249
+ z_t = (1 - t) · z₀ + t · ε, t ∈ [0, 1]
250
+ ```
251
+
252
+ ### Velocity Target
253
+
254
+ ```
255
+ v* = ε - z₀
256
+ ```
257
+
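The forward process and velocity target above can be checked with a few lines of plain Python. This is only a sketch on flat lists of floats; the toy latent values and the helper names `interpolate` and `velocity_target` are illustrative and not part of the `lrf` package.

```python
import random

def interpolate(z0, eps, t):
    """z_t = (1 - t) * z0 + t * eps, applied elementwise."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, eps)]

def velocity_target(z0, eps):
    """v* = eps - z0, constant along the straight path from z0 to eps."""
    return [b - a for a, b in zip(z0, eps)]

rng = random.Random(0)
z0 = [rng.uniform(-1, 1) for _ in range(4)]   # stand-in for a clean latent
eps = [rng.gauss(0, 1) for _ in range(4)]     # Gaussian noise
t = 0.3
z_t = interpolate(z0, eps, t)
v = velocity_target(z0, eps)

# Stepping back along the velocity recovers the clean latent: z0 = z_t - t * v*.
z0_rec = [zt - t * vi for zt, vi in zip(z_t, v)]
```

Because the path is a straight line, a perfect velocity prediction would let a single step of size t land exactly on z₀, which is the motivation for few-step sampling.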
258
+ ### Denoising Core (RLR)
259
+
260
+ Let f_θ denote the shared GLD blocks, and g_φ denote the abstract updater.
261
+
262
+ **Initialization:**
263
+ ```
264
+ z⁽⁰⁾ = input_proj(flatten(z_t))
265
+ c = time_embed(sinusoidal(t)) + text_global
266
+ z_abs⁽⁰⁾ = mean_pool(z⁽⁰⁾)
267
+ ```
268
+
269
+ **Outer loop** (j = 1..T_outer):
270
+ ```
271
+ z_abs⁽ʲ⁾ = z_abs⁽ʲ⁻¹⁾ + tanh(α) · g_φ([norm(z), mean_pool(z)])
272
+ ```
273
+
274
+ **Inner loop** (i = 1..T_inner):
275
+ ```
276
+ c_step = c + recursion_embed(j · T_inner + i)
277
+ z_in = z + z_abs⁽ʲ⁾
278
+ z ← z + 0.5 · (f_θ(z_in, c_step, text_emb) - z)
279
+ ```
280
+
281
+ **Output:**
282
+ ```
283
+ v_θ(z_t, t, c) = out_proj(out_norm(z))
284
+ ```
285
+
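The nested loop structure above can be illustrated on a scalar toy problem. In the sketch below, `f` and `g` are hypothetical stand-ins for the shared blocks f_θ and the abstract updater g_φ (a contractive map and a small correction), chosen only so that the damped iteration visibly settles; conditioning and the recursion embedding are omitted.

```python
import math

def refine(z, z_abs, f, t_inner, damping=0.5):
    """Inner loop: z <- z + damping * (f(z + z_abs) - z)."""
    for _ in range(t_inner):
        z = z + damping * (f(z + z_abs) - z)
    return z

def rlr_toy(z, f, g, t_outer=2, t_inner=4):
    """Outer loop updates a slow state z_abs, then runs the inner loop."""
    z_abs = 0.0
    residuals = []
    for _ in range(t_outer):
        z_abs = z_abs + g(z)               # slow (abstract) update
        z_new = refine(z, z_abs, f, t_inner)
        residuals.append(abs(z_new - z))   # change in z per outer iteration
        z = z_new
    return z, residuals

def f(x):
    """Toy contractive map standing in for the shared GLD blocks."""
    return 0.5 * math.tanh(x) + 0.2

def g(x):
    """Toy slow update standing in for the abstract updater."""
    return 0.01 * x

z, residuals = rlr_toy(1.5, f, g, t_outer=3, t_inner=6)
print(z, residuals)
```

The shrinking residuals are the same quantity one would monitor in a real model to detect non-convergence of the recursion.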
286
+ ### GLA Block (within f_θ)
287
+
288
+ ```
289
+ Q, K, V = W_qkv · x (linear projection)
290
+ Q̃ = Q - λ · shift(Q) (token differential)
291
+ K̃ = K - λ · shift(K)
292
+ Q̃ = φ(Q̃), K̃ = φ(K̃) where φ(x) = 1 + elu(x)
293
+
294
+ Forward scan: S_i = γ · S_{i-1} + K̃_i^T · V_i; O_i^fwd = Q̃_i · S_i
295
+ Backward scan: (same in reverse)
296
+
297
+ O = O^fwd + O^bwd
298
+ O = sigmoid(W_g · x) · norm(O) · sigmoid(DWConv(W_local · x))
299
+ output = W_out · O
300
+ ```
301
+
302
+ Complexity: **O(N · d²)** per direction, where d is head dimension and N is token count.
303
+
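To make the linear cost concrete, the sketch below implements the forward scan exactly as written above, with a scalar decay γ, and compares it against the equivalent quadratic form O_i = Σ_{j≤i} γ^(i−j) (Q̃_i · K̃_j) V_i. It uses plain lists and tiny inputs for readability; the feature map φ, the backward scan, and the gates are left out.

```python
def gla_forward_scan(Q, K, V, gamma):
    """Recurrent form: S_i = gamma * S_{i-1} + K_i^T V_i, O_i = Q_i S_i."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]   # running d_k x d_v state
    out = []
    for q, k, v in zip(Q, K, V):
        for a in range(d_k):
            for b in range(d_v):
                S[a][b] = gamma * S[a][b] + k[a] * v[b]
        out.append([sum(q[a] * S[a][b] for a in range(d_k)) for b in range(d_v)])
    return out

def gla_quadratic(Q, K, V, gamma):
    """Reference form: O_i = sum_{j<=i} gamma^(i-j) (Q_i . K_j) V_j, O(N^2)."""
    out = []
    for i, q in enumerate(Q):
        o = [0.0] * len(V[0])
        for j in range(i + 1):
            w = gamma ** (i - j) * sum(a * b for a, b in zip(q, K[j]))
            o = [x + w * y for x, y in zip(o, V[j])]
        out.append(o)
    return out

Q = [[1.0, 0.5], [0.2, 1.5], [0.7, 0.3]]
K = [[0.4, 1.0], [1.2, 0.1], [0.6, 0.9]]
V = [[1.0, -1.0], [0.5, 2.0], [-0.3, 0.8]]
fast = gla_forward_scan(Q, K, V, gamma=0.9)
slow = gla_quadratic(Q, K, V, gamma=0.9)
```

The scan touches each token once and keeps only a d×d state, which is where the O(N · d²) cost per direction comes from; the reference form revisits all earlier tokens and grows as O(N²).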
304
+ ### IFT Training (O(1) Memory)
305
+
306
+ During training, we detach gradients for all but the last recursion:
307
+ ```
308
+ with torch.no_grad():
309
+     for j in range(T_outer - 1):
310
+         z = recursive_refinement(z, c, text_emb)
311
+ z = recursive_refinement(z, c, text_emb)  # gradients flow only through this call
312
+ ```
313
+
314
+ By the Implicit Function Theorem, if z* is a fixed point of f, then:
315
+ ```
316
+ ∂z*/∂θ = (I - ∂f/∂z)⁻¹ · ∂f/∂θ
317
+ ```
318
+
319
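+ The 1-step gradient approximates this by replacing (I − ∂f/∂z)⁻¹ with the identity, the leading term of its Neumann series. The result is an approximate gradient whose memory cost does not grow with recursion depth; it is not the exact IFT gradient.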
320
+
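A scalar example helps show what the one-step approximation keeps and what it drops. The sketch below assumes a toy linear map f(z, θ) = a·z + θ with |a| < 1, which is not the LRF core; it compares the exact IFT gradient, a finite-difference check, and the one-step gradient.

```python
def fixed_point(a, theta, steps=200):
    """Iterate z <- f(z, theta) = a * z + theta until (numerical) convergence."""
    z = 0.0
    for _ in range(steps):
        z = a * z + theta
    return z

a, theta = 0.6, 1.0                      # |a| < 1 makes f a contraction
z_star = fixed_point(a, theta)

# Exact IFT gradient: (1 - df/dz)^-1 * df/dtheta = 1 / (1 - a).
ift_grad = 1.0 / (1.0 - a)

# One-step gradient: differentiate only the last application of f,
# treating its input z as a constant, so dz/dtheta = df/dtheta = 1.
one_step_grad = 1.0

# Finite-difference check of the exact gradient.
h = 1e-6
fd_grad = (fixed_point(a, theta + h) - fixed_point(a, theta - h)) / (2 * h)
print(z_star, ift_grad, one_step_grad, fd_grad)
```

In this toy case the one-step gradient has the right sign but underestimates the magnitude by a factor of 1/(1 − a). In higher dimensions the inverse term can also rotate the gradient, so agreement in direction is something to verify empirically rather than assume.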
321
+ ---
322
+
323
+ ## 7. Training Objective & Losses
324
+
325
+ ### Stage 1: VAE Training
326
+
327
+ ```
328
+ L_VAE = L_recon + λ_perc · L_perceptual + λ_KL · L_KL
329
+
330
+ L_recon = |x - x̂|₁ (L1 reconstruction)
331
+ L_perceptual = (1/3) Σ_{s=0}^{2} MSE(pool_s(x), pool_s(x̂)) (multi-scale)
332
+ L_KL = -0.5 · E[1 + log(σ²) - μ² - σ²] (KL divergence)
333
+
334
+ λ_perc = 1.0, λ_KL = 1e-6
335
+ ```
336
+
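The L1 and KL terms above are simple enough to write out directly. The sketch below works on flat lists, takes the perceptual term as a precomputed number (the multi-scale pooling needs real images), and uses a mean reduction for both terms; the reduction is an assumption, since the README only writes E[·].

```python
import math

def l1_recon(x, x_hat):
    """Mean absolute error between an image and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def kl_to_standard_normal(mu, logvar):
    """-0.5 * mean(1 + log(sigma^2) - mu^2 - sigma^2) for a diagonal Gaussian."""
    terms = [1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)]
    return -0.5 * sum(terms) / len(terms)

def vae_loss(x, x_hat, mu, logvar, l_perceptual=0.0, lam_perc=1.0, lam_kl=1e-6):
    """L_VAE = L_recon + lam_perc * L_perceptual + lam_kl * L_KL."""
    kl = kl_to_standard_normal(mu, logvar)
    return l1_recon(x, x_hat) + lam_perc * l_perceptual + lam_kl * kl

x, x_hat = [0.0, 0.5, 1.0], [0.1, 0.5, 0.8]
mu, logvar = [0.0, 1.0], [0.0, 0.0]
print(l1_recon(x, x_hat), kl_to_standard_normal(mu, logvar), vae_loss(x, x_hat, mu, logvar))
```

With λ_KL = 1e-6 the KL term acts as a very light regularizer, so the loss is dominated by reconstruction quality.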
337
+ ### Stage 2: Flow Matching
338
+
339
+ ```
340
+ L_flow = E_{t,z₀,ε} [ w(t) · ‖v_θ(z_t, t, c) - (ε - z₀)‖² ]
341
+
342
+ w(t) = 1 / (t(1-t) + 0.01) (SNR weighting, normalized)
343
+
344
+ With 10% classifier-free guidance dropout:
345
+ P(c = ∅) = 0.1
346
+ ```
347
+
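The weighting and the conditioning dropout above can be sketched as follows. Representing the null condition ∅ by an empty prompt string is an assumption made for illustration; the README does not say how the unconditional input is encoded.

```python
import random

def flow_weight(t: float, eps: float = 0.01) -> float:
    """w(t) = 1 / (t * (1 - t) + eps): smallest at t = 0.5, largest at the ends."""
    return 1.0 / (t * (1.0 - t) + eps)

def drop_condition(prompt: str, rng: random.Random, p_uncond: float = 0.1) -> str:
    """With probability p_uncond, replace the prompt by the empty (null) condition."""
    return "" if rng.random() < p_uncond else prompt

rng = random.Random(0)
prompts = [drop_condition("a sunset over the ocean", rng) for _ in range(10_000)]
dropped_fraction = prompts.count("") / len(prompts)

print(flow_weight(0.5), flow_weight(0.0), dropped_fraction)
```

The 0.01 offset caps the weight at 100 when t reaches 0 or 1, so the endpoints are emphasized without the loss blowing up.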
348
+ ### Stage 3: Consistency Distillation
349
+
350
+ ```
351
+ L_CD = ‖f_θ(z_{t_n}, t_n, c) - sg[f_{teacher}(z_{t_{n-1}}, t_{n-1}, c)]‖²
352
+
353
+ where the teacher input z_{t_{n-1}} is obtained from the trained flow model with one Euler step:
354
+ z_{t_{n-1}} = z_{t_n} - (t_n - t_{n-1}) · v_teacher(z_{t_n}, t_n, c)
355
+ ```
356
+
357
+ ### Stage 4: Editing Fine-tuning
358
+
359
+ Same flow matching loss, but with additional image condition:
360
+ ```
361
+ v_θ(z_t, t, c, z_src) where z_src = encode(source_image)
362
+ ```
363
+
364
+ Additive conditioning: `z_input = z + z_src` before the RLR core.
365
+
366
+ ---
367
+
368
+ ## 8. Memory & Compute Budget
369
+
370
+ ### Inference (1024×1024, Default Config, INT8)
371
+
372
+ | Component | Memory |
373
+ |---|---|
374
+ | Text Encoder (INT8) | 11 MB |
375
+ | VAE Decoder (INT8) | 1 MB |
376
+ | Denoising Core (INT8) | 0.6 MB |
377
+ | Latent activations (64×64×32, FP32) | 0.5 MB |
378
+ | Peak activation memory | ~200 MB |
379
+ | **Total** | **~213 MB** |
380
+
381
+ This comfortably fits within 3-4 GB mobile RAM.
382
+
383
+ ### Training (16 GB GPU, Default Config)
384
+
385
+ | Item | Memory |
386
+ |---|---|
387
+ | Model parameters (FP32) | 62 MB |
388
+ | Optimizer states (AdamW, 2×) | 124 MB |
389
+ | Gradients | 62 MB |
390
+ | Batch activations (BS=8, 64×64) | ~500 MB |
391
+ | IFT overhead (only last recursion) | ~50 MB |
392
+ | **Total** | **~800 MB** |
393
+
394
+ Leaves ample room for larger batch sizes or higher resolution on 16 GB.
395
+
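The first three rows of the training table follow from the parameter count alone. The sketch below reproduces them under the assumptions of FP32 weights and two AdamW moment buffers per parameter; the activation and last-recursion figures are simply copied from the table as inputs, not derived.

```python
MIB = 2**20

def training_memory_mib(num_params, bytes_per_param=4, adam_states=2,
                        activations_mib=500.0, last_recursion_mib=50.0):
    """Rough budget: weights + AdamW moments + gradients + activations."""
    weights = num_params * bytes_per_param / MIB
    return {
        "weights": weights,
        "optimizer": adam_states * weights,   # first and second moments
        "gradients": weights,
        "activations": activations_mib,
        "last_recursion": last_recursion_mib,
    }

budget = training_memory_mib(16_300_000)
total = sum(budget.values())
print({k: round(v) for k, v in budget.items()}, round(total))
```

Parameter-dependent memory is about a third of the total here, so activation memory (batch size and resolution) is the term that will grow first when scaling up.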
396
+ ---
397
+
398
+ ## 9. Training Curriculum
399
+
400
+ ### Stage 1: VAE (50K steps)
401
+ - **Data**: ImageNet or COCO (any large image dataset)
402
+ - **Resolution**: 256×256
403
+ - **What to freeze**: Nothing
404
+ - **What to train**: Full VAE
405
+ - **LR**: 1e-4, AdamW, weight_decay=0.01
406
+ - **Key**: Train until L_recon < 0.1
407
+
408
+ ### Stage 2: Flow Matching — Low Resolution (100K steps)
409
+ - **Data**: Synthetic captions from teacher (SDXL) + LAION-aesthetic subset
410
+ - **Resolution**: 64×64
411
+ - **What to freeze**: VAE
412
+ - **What to train**: Core + Text Encoder
413
+ - **LR**: 1e-4
414
+ - **Key**: Focus on learning composition and prompt adherence
415
+
416
+ ### Stage 3: Flow Matching — Mid Resolution (200K steps)
417
+ - **Data**: Filtered LAION-aesthetic (score > 6.0) + synthetic
418
+ - **Resolution**: 256×256
419
+ - **What to freeze**: VAE
420
+ - **What to train**: Core + Text Encoder
421
+ - **LR**: 5e-5
422
+ - **Key**: Focus on texture and detail
423
+
424
+ ### Stage 4: Flow Matching — High Resolution (100K steps)
425
+ - **Data**: High-quality curated + JourneyDB
426
+ - **Resolution**: 512×512
427
+ - **What to freeze**: VAE
428
+ - **What to train**: Core + Text Encoder
429
+ - **LR**: 2e-5
430
+ - **Key**: Focus on fine detail and typography
431
+
432
+ ### Stage 5: Consistency Distillation (50K steps)
433
+ - **Data**: Same as Stage 4
434
+ - **What to freeze**: VAE + Text Encoder
435
+ - **What to train**: Core only
436
+ - **LR**: 1e-5
437
+ - **Key**: Distill from own multi-step model to 4-step generation
438
+
439
+ ### Stage 6: Editing Fine-tuning (50K steps)
440
+ - **Data**: InstructPix2Pix + MagicBrush + synthetic edit pairs
441
+ - **What to freeze**: VAE
442
+ - **What to train**: Core + Text Encoder
443
+ - **LR**: 1e-5
444
+ - **Key**: Add image conditioning channel
445
+
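The six stages above can be captured as data, which makes the step budget and freeze schedule easy to check. This is a sketch only: the `Stage` class and the stage names are hypothetical and do not correspond to anything in `lrf.training`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    steps: int
    lr: float
    trainable: tuple   # modules updated in this stage; everything else is frozen

CURRICULUM = (
    Stage("vae", 50_000, 1e-4, ("vae",)),
    Stage("flow_64", 100_000, 1e-4, ("core", "text_encoder")),
    Stage("flow_256", 200_000, 5e-5, ("core", "text_encoder")),
    Stage("flow_512", 100_000, 2e-5, ("core", "text_encoder")),
    Stage("consistency_distillation", 50_000, 1e-5, ("core",)),
    Stage("editing", 50_000, 1e-5, ("core", "text_encoder")),
)

total_steps = sum(s.steps for s in CURRICULUM)
for s in CURRICULUM:
    print(f"{s.name:26s} steps={s.steps:7d} lr={s.lr:.0e} train={','.join(s.trainable)}")
print("total steps:", total_steps)
```

Written this way, the curriculum totals 550K optimizer steps, with the learning rate never increasing from one stage to the next.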
446
+ ---
447
+
448
+ ## 10. Deployment Plan for Mobile
449
+
450
+ ### Step 1: Quantization
451
+ - INT8 per-channel weight quantization (static)
452
+ - INT8 per-token activation quantization (dynamic)
453
+ - Result: ~4× model size reduction
454
+
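Per-channel weight quantization can be sketched in a few lines. The version below assumes symmetric quantization with one scale per output channel (row) and a code range of [-127, 127]; the README does not specify the scheme, and a real deployment would rely on the target runtime's quantizer rather than this hand-written one.

```python
def quantize_per_channel(weight):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Map INT8 codes back to floats: w is approximately q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

W = [[0.02, -0.5, 0.31], [1.7, -0.004, 0.9]]
q, scales = quantize_per_channel(W)
W_hat = dequantize(q, scales)
max_err = max(abs(a - b) for ra, rb in zip(W, W_hat) for a, b in zip(ra, rb))
print(q, scales, max_err)
```

Using one scale per row keeps a channel with small weights from being crushed by a channel with large ones; the rounding error in each row is bounded by half of that row's scale.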
455
+ ### Step 2: Operator Optimization
456
+ - Replace GELU → SiLU throughout (MobileDiffusion finding: GELU causes float16 instability)
457
+ - Fuse norm + activation + linear into single kernels
458
+ - Use Core ML (iOS) or NNAPI (Android) for hardware acceleration
459
+
460
+ ### Step 3: Step Reduction
461
+ - After consistency distillation: 4 Euler steps are expected to suffice
462
+ - With further adversarial distillation: 1-2 steps may be possible
463
+
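A few-step Euler sampler for the rectified-flow ODE is short enough to show in full. The sketch below integrates from t = 1 (noise) to t = 0 (data) with a hypothetical toy velocity field in which all data sits at a single point `target`; for that case the exact velocity is (z − target) / t, and the trained network would take its place in practice.

```python
def euler_sample(z, velocity, num_steps=4):
    """Integrate dz/dt = v(z, t) from t = 1 (noise) to t = 0 (data)."""
    dt = 1.0 / num_steps
    for n in range(num_steps):
        t = 1.0 - n * dt
        v = velocity(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z

# Toy velocity field (assumption): every data sample equals `target`.
target = [0.25, -1.0, 2.0]

def toy_velocity(z, t):
    """Exact rectified-flow velocity when the data distribution is a point mass."""
    return [(zi - ti) / t for zi, ti in zip(z, target)]

noise = [1.3, 0.4, -0.7]
sample = euler_sample(noise, toy_velocity, num_steps=4)
print(sample)
```

With an exact velocity and straight paths, the number of steps does not matter in this toy case. For a learned model the paths are only approximately straight, which is why distillation is used to make a small step count viable.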
464
+ ### Step 4: Latent Size Optimization
465
+ - f=16 compression: 1024² → 64×64 latents
466
+ - 32 channels per position
467
+ - Total latent: 64×64×32 = 131,072 values ≈ 0.5 MB at FP32 (4 bytes per value)
468
+
469
+ ### Projected Performance
470
+ | Device | Steps | Estimated Time |
471
+ |---|---|---|
472
+ | iPhone 16 Pro (ANE) | 4 | ~0.5-1.0s |
473
+ | Pixel 8 Pro (GPU) | 4 | ~1.0-2.0s |
474
+ | iPhone 14 (GPU) | 8 | ~2.0-3.0s |
475
+
476
+ ---
477
+
478
+ ## 11. Failure Mode Analysis
479
+
480
+ | Failure Mode | Cause | Detection | Fix |
481
+ |---|---|---|---|
482
+ | **Fixed-point non-convergence** | Recursion does not converge | Monitor the change in z between successive recursions | Damped update (damping factor 0.5), reduce T_inner |
483
+ | **Oversmoothing** | GLA loses high-frequency detail | Blurry outputs, high LPIPS distance to reference images | Increase token-differential λ, add DW-conv skip |
484
+ | **Mode collapse** | Small model capacity | FID increases, low diversity | Increase num_blocks or dim |
485
+ | **Training instability** | IFT gradient approximation error | Loss spikes | Reduce LR, increase warmup, disable IFT temporarily |
486
+ | **Poor text adherence** | Weak cross-attention | Low CLIP score | Increase cross-attention gates, add more cross-attn layers |
487
+ | **VAE artifacts** | Aggressive compression | Reconstruction artifacts | Lower f (use f=8), increase decoder capacity |
488
+ | **CFG artifacts** | High guidance scale | Oversaturated images | Train with 10% unconditional, use CFG 3-5 range |
489
+
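The guidance scale mentioned in the last row enters through the standard classifier-free guidance combination, v = v_uncond + s · (v_cond − v_uncond). The README does not spell this formula out, so the sketch below assumes LRF uses the usual form; the velocity values are made up for illustration.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: v = v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def l2_norm(v):
    """Euclidean norm of a flat vector."""
    return sum(x * x for x in v) ** 0.5

v_cond = [0.8, -0.2, 0.1]
v_uncond = [0.5, 0.0, 0.1]

for scale in (1.0, 3.0, 5.0, 10.0):
    guided = cfg_velocity(v_cond, v_uncond, scale)
    print(f"scale={scale:4.1f}  |v|={l2_norm(guided):.3f}")
```

The guided velocity grows roughly linearly with the scale, which gives one intuition for why large scales push samples toward oversaturated outputs.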
490
+ ---
491
+
492
+ ## 12. Ablation Plan
493
+
494
+ ### Ablation 1: Recursion Depth vs Quality
495
+ - **Vary**: T_inner ∈ {1, 2, 4, 6, 8}, T_outer ∈ {1, 2, 3}
496
+ - **Measure**: FID, CLIP score, inference time
497
+ - **Hypothesis**: Quality plateaus around T_inner=4-6; diminishing returns beyond T_outer=2
498
+
499
+ ### Ablation 2: GLA vs Standard Attention
500
+ - **Compare**: GLA blocks vs softmax attention blocks (same dim, same depth)
501
+ - **Measure**: FID, memory, throughput
502
+ - **Hypothesis**: GLA matches attention quality at 3-5× lower memory
503
+
504
+ ### Ablation 3: Token Differential
505
+ - **Vary**: λ ∈ {0, 0.05, 0.1, 0.2, learned}
506
+ - **Measure**: FID, sharpness metrics (gradient magnitude)
507
+ - **Hypothesis**: λ=0.1 optimal; λ=0 causes oversmoothing
508
+
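The effect of λ is easiest to see on a toy sequence. The sketch below assumes `shift` moves tokens one position to the right with zero padding at the front; the README does not define the shift, so this is one plausible reading rather than the implemented one.

```python
def shift(seq):
    """Shift a token sequence right by one position, zero-padding the front."""
    zero = [0.0] * len(seq[0])
    return [zero] + [list(tok) for tok in seq[:-1]]

def token_differential(seq, lam):
    """x_i - lam * x_{i-1}, applied per feature."""
    return [[a - lam * b for a, b in zip(tok, prev)]
            for tok, prev in zip(seq, shift(seq))]

smooth = [[1.0, 1.0] for _ in range(4)]                     # constant sequence
edge = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]     # step at index 2

print(token_differential(smooth, 0.1))
print(token_differential(edge, 0.1))
```

Tokens that repeat their predecessor are attenuated by a factor of (1 − λ), while the token at the step keeps its full magnitude. That relative boost of changes is the mechanism the ablation is meant to test against oversmoothing.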
509
+ ### Ablation 4: IFT vs Full Backprop
510
+ - **Compare**: IFT training vs full BPTT (at small T for memory comparison)
511
+ - **Measure**: Final FID, training memory, convergence speed
512
+ - **Hypothesis**: IFT within 2% FID of full backprop at 8-16× memory savings
513
+
514
+ ### Ablation 5: VAE Compression
515
+ - **Vary**: f ∈ {8, 16, 32}, C ∈ {8, 16, 32}
516
+ - **Measure**: rFID, PSNR, generation FID
517
+ - **Hypothesis**: f=16, C=16-32 is the sweet spot for mobile quality
518
+
519
+ ### Ablation 6: Abstract State (H-module)
520
+ - **Compare**: With/without abstract state update
521
+ - **Measure**: FID, coherence metrics
522
+ - **Hypothesis**: Abstract state improves global composition coherence
523
+
524
+ ---
525
+
526
+ ## 13. Editing Roadmap
527
+
528
+ The LRF architecture is designed for editing-readiness through **additive image conditioning**:
529
+
530
+ ### Phase 1: Inpainting
531
+ - Add a binary mask channel to the condition input (mask = 1 keeps the source latent, mask = 0 is regenerated from noise)
532
+ - `z_input = z + z_src * mask + z_noise * (1 - mask)`
533
+ - Train on random masking + MagicBrush data
534
+
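The masked blend in Phase 1 can be sketched on a flattened toy latent. Only the conditioning term `z_src * mask + z_noise * (1 - mask)` is computed here; adding it to z and running the core is left out, and all values are illustrative.

```python
def blend_condition(z_src, z_noise, mask):
    """z_src * mask + z_noise * (1 - mask), elementwise over latent positions."""
    return [s * m + n * (1.0 - m) for s, n, m in zip(z_src, z_noise, mask)]

z_src = [0.9, 0.8, 0.7, 0.6]       # encoded source image (flattened toy latent)
z_noise = [-1.2, 0.3, 2.1, -0.5]   # Gaussian noise sample
mask = [1.0, 1.0, 0.0, 0.0]        # 1 = keep source, 0 = regenerate

cond = blend_condition(z_src, z_noise, mask)
print(cond)
```

Since the mask lives on the latent grid, a pixel-space mask would first need to be downsampled by the VAE factor before it can be applied this way.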
535
+ ### Phase 2: Image-to-Image Translation
536
+ - Source image encoded to latent, added to noisy latent
537
+ - Noise level controls edit strength (low noise = subtle edit)
538
+ - No architectural changes needed
539
+
540
+ ### Phase 3: Instruction-Based Editing (OmniGen-style)
541
+ - Text encoder receives both instruction AND image description
542
+ - Source image latent added as conditioning
543
+ - Train on InstructPix2Pix + SEED-edit data
544
+
545
+ ### Phase 4: Super-Resolution
546
+ - Low-res image encoded, upscaled in latent space
547
+ - Decoder generates high-res output
548
+ - Train on paired low/high-res data
549
+
550
+ ### Phase 5: Style Transfer & Identity Preservation
551
+ - Reference image encoded to separate latent
552
+ - Cross-attention between reference and generation
553
+ - Train on same-identity different-image pairs (GRIT-Entity)
554
+
555
+ ### Phase 6: Multi-Image Conditioning
556
+ - OmniGen-style interleaved image-text input
557
+ - Multiple source images encoded and concatenated in latent space
558
+ - Enables try-on, compositing, scene editing
559
+
560
+ ### Why This Works
561
+ The key architectural decisions that enable editing:
562
+ 1. **Additive conditioning** preserves spatial correspondence (with f=16, each 16×16 source patch maps to the latent token at the same grid position)
563
+ 2. **Recursive refinement** naturally handles conditioning — the model can "reason" about how to modify the latent
564
+ 3. **Cross-attention to text** at every recursion step allows the model to follow editing instructions progressively
565
+ 4. **Same parameter reuse** means editing capability doesn't require new parameters — just new training data
566
+
567
+ ---
568
+
569
+ ## Quick Start
570
+
571
+ ```python
572
+ # Install dependencies (notebook cell syntax; in a shell, drop the leading "!")
573
+ !pip install torch einops safetensors
574
+
575
+ # Use the pipeline
576
+ from lrf.model import LatentRecurrentFlow
577
+ from lrf.pipeline import LRFPipeline
578
+
579
+ # Create model
580
+ model = LatentRecurrentFlow(LatentRecurrentFlow.tiny_config())
581
+ pipe = LRFPipeline(model)
582
+
583
+ # Generate
584
+ images = pipe("a sunset over the ocean", num_steps=10, height=64, width=64)
585
+
586
+ # Or train
587
+ from lrf.training import run_prototype_training
588
+ model, trainer = run_prototype_training(num_vae_steps=100, num_flow_steps=100)
589
+ ```
590
+
591
+ See `notebook.ipynb` for the full interactive walkthrough.
592
+
593
+ ---
594
+
595
+ ## Citation
596
+
597
+ ```bibtex
598
+ @software{lrf2026,
599
+ title={LatentRecurrentFlow: A Novel Mobile-First Image Generation Architecture},
600
+ author={LRF Research},
601
+ year={2026},
602
+ url={https://huggingface.co/krystv/LatentRecurrentFlow}
603
+ }
604
+ ```
605
+
606
+ ## License
607
+
608
+ Apache 2.0