krystv committed on
Commit cfaa4f6 · verified · 1 Parent(s): 1fd6fa3

Add v2 training results with CIFAR-10 validation

Files changed (1)
  1. README.md +127 -526
README.md CHANGED
@@ -8,7 +8,6 @@ tags:
8
  - recursive-reasoning
9
  - novel-architecture
10
  - subquadratic-attention
11
- - gated-linear-attention
12
  - research
13
  library_name: lrf
14
  pipeline_tag: text-to-image
@@ -19,590 +18,192 @@ license: apache-2.0
19
 
20
  > A genuinely new architecture for image generation designed from scratch to run on consumer devices with 3–4 GB RAM, trained on 16 GB budgets.
21
 
22
- ---
23
-
24
- ## Table of Contents
25
-
26
- 1. [Architecture Overview](#1-architecture-overview)
27
- 2. [Shortlist of Most Relevant Papers](#2-shortlist-of-most-relevant-papers)
28
- 3. [Paper Critiques](#3-paper-critiques)
29
- 4. [Full Proposed Architecture](#4-full-proposed-architecture-latentrecurrentflow)
30
- 5. [Module-by-Module Diagram](#5-module-by-module-diagram)
31
- 6. [Mathematical Formulation](#6-mathematical-formulation)
32
- 7. [Training Objective & Losses](#7-training-objective--losses)
33
- 8. [Memory & Compute Budget](#8-memory--compute-budget)
34
- 9. [Training Curriculum](#9-training-curriculum)
35
- 10. [Deployment Plan for Mobile](#10-deployment-plan-for-mobile)
36
- 11. [Failure Mode Analysis](#11-failure-mode-analysis)
37
- 12. [Ablation Plan](#12-ablation-plan)
38
- 13. [Editing Roadmap](#13-editing-roadmap)
39
-
40
- ---
41
-
42
- ## 1. Architecture Overview
43
-
44
- LRF combines five key innovations into a single coherent architecture:
45
-
46
- | Innovation | Source Inspiration | What It Does |
47
- |---|---|---|
48
- | **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
49
- | **Gated Linear Diffusion (GLD)** blocks | ViG/GLA + DyDiLA | O(N) subquadratic spatial mixing replacing O(N²) attention |
50
- | **Compact f=16 VAE** | SANA DC-AE + SnapGen | 16× spatial compression with ~280K decoder |
51
- | **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
52
- | **Multimodal Conditioning** | OmniGen | Same core supports text-to-image AND editing via additive image conditioning |
53
 
54
- ### Key Numbers (Tiny Config, 5.7M params)
 
 
 
 
55
 
56
- | Component | Parameters | FP32 Size | INT8 Size |
57
- |---|---|---|---|
58
- | VAE Encoder | 777K | 3.0 MB | 0.7 MB |
59
- | VAE Decoder | 283K | 1.1 MB | 0.3 MB |
60
- | Text Encoder | 4.5M | 17.3 MB | 4.3 MB |
61
- | Denoising Core | 102K | 0.4 MB | 0.1 MB |
62
- | **Total** | **5.7M** | **21.7 MB** | **5.4 MB** |
63
 
64
- ### Key Numbers (Default Config — 16.3M params)
65
 
66
- | Component | Parameters | FP32 Size | INT8 Size |
67
- |---|---|---|---|
68
- | VAE Encoder | 3.1M | 11.7 MB | 2.9 MB |
69
- | VAE Decoder | 1.1M | 4.1 MB | 1.0 MB |
70
- | Text Encoder | 11.5M | 43.9 MB | 11.0 MB |
71
- | Denoising Core | 651K | 2.5 MB | 0.6 MB |
72
- | **Total** | **16.3M** | **62.2 MB** | **15.6 MB** |
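The size columns in both tables look consistent with 4 bytes per parameter for FP32, 1 byte per parameter for INT8, and binary megabytes. That reading is inferred from the numbers rather than stated in the README, so the sketch below treats it as an assumption. `size_mb` is a hypothetical helper, and the inputs are the rounded totals from the tables.

```python
def size_mb(num_params, bytes_per_param):
    """Storage in binary megabytes for a parameter count at a given precision."""
    return num_params * bytes_per_param / (1024 ** 2)

# Rounded totals taken from the two "Key Numbers" tables.
configs = {"tiny": 5.7e6, "default": 16.3e6}

sizes = {
    name: {"fp32": size_mb(params, 4), "int8": size_mb(params, 1)}
    for name, params in configs.items()
}

for name, entry in sizes.items():
    print(f"{name}: FP32 {entry['fp32']:.1f} MB, INT8 {entry['int8']:.1f} MB")
```

Because the inputs are rounded parameter counts, the computed values can differ from the table entries by about 0.1 MB.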
73
 
74
- ---
75
 
76
- ## 2. Shortlist of Most Relevant Papers
77
 
78
- ### A. Subquadratic Spatial Mixing for Image Generation
 
79
 
80
- | Paper | arxiv | Key Contribution | FID Result |
81
- |---|---|---|---|
82
- | **PDE-SSM-DiT** | 2603.13663 | Fourier PDE operator replaces attention, O(N log N), 34× speedup | 18.36 (CelebA-HQ 256) |
83
- | **DiMSUM** (NeurIPS 2024) | 2411.04168 | Mamba + wavelet subbands + shared transformer | **2.11** (CelebA-HQ 256) |
84
- | **ViG/GLA** | 2405.18425 | Gated Linear Attention with 2D locality injection | 90% less memory at 1024² |
85
- | **DyDiLA** | 2601.13683 | Dynamic differential linear attention | **6.80** (SubIN 256) |
86
- | **Mamba2D** | 2412.16146 | True 2D SSM with wavefront scan | 84.0% top-1 IN-1K (27M) |
87
 
88
- ### B. Recursive/Iterative Reasoning
89
 
90
- | Paper | arxiv | Key Contribution |
91
- |---|---|---|
92
- | **HRM** | 2506.21734 | 2-level recurrent fixed-point reasoning, O(1) memory via IFT |
93
- | **TRM** (6473 ⭐) | 2510.04871 | 7M params → 45% ARC-AGI-1 via deep recursion |
94
- | **Thinking Pixel** | 2604.25299 | Sparse MoE adapters for recursive visual reasoning in DiT |
95
 
96
- ### C. Compact Latent Spaces
97
-
98
- | Paper | arxiv | Compression | Quality |
99
- |---|---|---|---|
100
- | **SANA DC-AE** | 2410.10629 | f=32, C=32 → 32×32 latents for 1024² | PSNR 29.29, rFID 0.34 |
101
- | **SnapGen** | 2412.09619 | 1.38M tiny decoder (35× smaller than SD3) | PSNR 27.85 |
102
- | **TiTok** | 2406.07550 | 32 tokens per 256² image | gFID 1.97 (IN-256) |
103
- | **MobileDiffusion** | 2311.16567 | f=8, c=8 VAE, sub-second on iPhone | Better than SD-1.5 at 8 steps |
104
-
105
- ### D. Few-Step Generation
106
-
107
- | Paper | arxiv | Key Result |
108
- |---|---|---|
109
- | **Consistency Models** | 2303.01469 | One-step generation from diffusion |
110
- | **LCM** | 2310.04378 | 2-4 step high-quality via consistency distillation |
111
- | **SD3.5-Flash** | 2509.21318 | Few-step distillation with timestep sharing |
112
-
113
- ### E. Unified Generation + Editing
114
-
115
- | Paper | arxiv | Key Contribution |
116
- |---|---|---|
117
- | **OmniGen** | 2409.11340 | Single model for T2I + editing + control, interleaved image-text input |
118
- | **OmniGen2** | 2506.18871 | Dual decoding pathways, decoupled image tokenizer |
119
- | **InstructPix2Pix** | 2211.09800 | Image editing from text instructions |
120
-
121
- ### F. Mobile Deployment
122
-
123
- | Paper | arxiv | Device Performance |
124
- |---|---|---|
125
- | **SnapGen** | 2412.09619 | 1.4s on iPhone 15 Pro, 372M UNet |
126
- | **SnapGen++** | 2601.08303 | 1.8s on iPhone 16, 0.4B sub-DiT |
127
- | **MobileDiffusion** | 2311.16567 | Sub-second on iPhone, ~400M params |
128
-
129
- ---
130
-
131
- ## 3. Paper Critiques
132
-
133
- ### PDE-SSM (2603.13663) ✅ Borrowed: Physical inductive bias concept
134
- - **Why it helps**: 34× speedup from FFT-based spatial operator with physically grounded bias
135
- - **What it fails at**: FID still behind DiMSUM (18.36 vs 2.11); requires FFT which is non-trivial on mobile
136
- - **Borrowed**: Concept of learnable PDE-style spatial operators; we adapt this to our GLD blocks
137
-
138
- ### HRM/TRM (2506.21734, 2510.04871) ✅ Borrowed: Core recursive architecture
139
- - **Why it helps**: O(1) memory backprop via IFT; extreme parameter efficiency (7M → 45% ARC-AGI)
140
- - **What it fails at**: Never applied to image generation; fixed-point convergence not guaranteed for images
141
- - **Borrowed**: Two-level recursion (abstract + detail), IFT training, recursion depth embedding
142
-
143
- ### ViG/GLA (2405.18425) ✅ Borrowed: Spatial mixing block
144
- - **Why it helps**: Hardware-aware, 90% memory savings, bidirectional GLA with locality injection
145
- - **What it fails at**: Only tested on classification/detection, not generation
146
- - **Borrowed**: Bidirectional GLA core, depthwise conv locality injection (GaLI), token differential (from DyDiLA)
147
-
148
- ### SANA DC-AE (2410.10629) ✅ Borrowed: Latent space design principles
149
- - **Why it helps**: f=32 achieves similar quality to f=8 but 16× fewer tokens
150
- - **What it fails at**: Decoder is still large (50M); typography needs decoder-only LLM text encoder
151
- - **Borrowed**: High-compression VAE principle; we use f=16 as a compromise for fine detail
152
-
153
- ### SnapGen (2412.09619) ✅ Borrowed: Tiny decoder architecture
154
- - **Why it helps**: 35× smaller decoder, 54× faster decode, negligible quality loss
155
- - **What it fails at**: Proprietary weights; still uses quadratic attention in the UNet backbone
156
- - **Borrowed**: Attention-free decoder, SepConv, minimal GroupNorm, SiLU instead of GELU
157
-
158
- ### TiTok (2406.07550) ❌ Rejected: Too aggressive compression
159
- - **Why it was considered**: 32 tokens per image is incredibly compact
160
- - **Why rejected**: rFID=16.2 means visible artifacts; fine detail and typography badly degraded at 32 tokens
161
-
162
- ### DiMSUM (2411.04168) ⚠️ Partially borrowed: Wavelet concept
163
- - **Why it helps**: Best FID (2.11) among SSM-based approaches
164
- - **What it fails at**: Still uses cross-attention fusion → partially quadratic; complex architecture
165
- - **Borrowed**: Wavelet decomposition concept for frequency-aware processing
166
-
167
- ---
168
-
169
- ## 4. Full Proposed Architecture: LatentRecurrentFlow
170
-
171
- ### Name: **LatentRecurrentFlow (LRF)**
172
-
173
- LRF is a **recursive flow-matching image generator** that uses:
174
- - A compact VAE with f=16 compression and a ~280K tiny decoder
175
- - A **Recursive Latent Refinement (RLR) core** that iteratively refines image latents through shared GLD blocks
176
- - A **rectified flow** training objective for clean few-step generation
177
- - **Additive image conditioning** for editing-readiness
178
-
179
- The core insight: **instead of stacking many unique layers, reuse a small set of blocks recursively**. This exploits the observation from HRM/TRM that iterative application of the same function can converge to a fixed point that represents the solution — analogous to how diffusion models iteratively denoise.
180
-
181
- ---
182
-
183
- ## 5. Module-by-Module Diagram
184
 
 
 
185
  ```
186
- ┌─────────────────────────────────────────────────────────────┐
187
- │ LatentRecurrentFlow │
188
- │ │
189
- │ ┌─────────────┐ ┌──────────────┐ ┌────────────────┐ │
190
- │ │ Compact │ │ Simple │ │ Rectified │ │
191
- │ │ VAE │ │ Text │ │ Flow │ │
192
- │ │ (f=16) │ │ Encoder │ │ Scheduler │ │
193
- │ │ │ │ │ │ │ │
194
- │ │ Encoder ────┤ │ Embed ──────┤ │ t ~ U[0,1] │ │
195
- │ │ (3.1M) │ │ Transformer │ │ z_t = (1-t) │ │
196
- │ │ │ │ (11.5M) │ │ z_0 + tε │ │
197
- │ │ Decoder ────┤ │ │ │ │ │
198
- │ │ (1.1M, tiny)│ │ → text_emb │ │ v = ε - z_0 │ │
199
- │ └──────┬───────┘ │ → text_glob │ └────────┬───────┘ │
200
- │ │ └──────┬───────┘ │ │
201
- │ │ │ │ │
202
- │ ┌──────▼───────────────────▼─────────────────────▼──────┐ │
203
- │ │ Recursive Latent Core (RLR) │ │
204
- │ │ │ │
205
- │ │ ┌─────────────────────────────────────────────────┐ │ │
206
- │ │ │ OUTER LOOP (j = 1..T_outer) │ │ │
207
- │ │ │ │ │ │
208
- │ │ │ z_abstract ← f_slow(z, z_pooled) [H-module] │ │ │
209
- │ │ │ │ │ │
210
- │ │ │ ┌─────────────────────────────────────────┐ │ │ │
211
- │ │ │ │ INNER LOOP (i = 1..T_inner) │ │ │ │
212
- │ │ │ │ │ │ │ │
213
- │ │ │ │ cond = t_emb + text_global + rec_emb │ │ │ │
214
- │ │ │ │ z_in = z + z_abstract │ │ │ │
215
- │ │ │ │ │ │ │ │
216
- │ │ │ │ FOR block in GLD_blocks: │ │ │ │
217
- │ │ │ │ ┌─────────────────────────────────┐ │ │ │ │
218
- │ │ │ │ │ GLD Block │ │ │ │ │
219
- │ │ │ │ │ │ │ │ │ │
220
- │ │ │ │ │ 1. AdaLN-modulate(z, cond) │ │ │ │ │
221
- │ │ │ │ │ 2. GLA: BiDir scan + DiffToken │ │ │ │ │
222
- │ │ │ │ │ + DW-Conv locality gate │ │ │ │ │
223
- │ │ │ │ │ 3. Cross-attn to text_emb │ │ │ │ │
224
- │ │ │ │ │ 4. AdaLN-modulate(z, cond) │ │ │ │ │
225
- │ │ │ │ │ 5. SwiGLU FFN │ │ │ │ │
226
- │ │ │ │ └─────────────────────────────────┘ │ │ │ │
227
- │ │ │ │ │ │ │ │
228
- │ │ │ │ z = z + 0.5 * (blocks(z_in) - z) │ │ │ │
229
- │ │ │ └─────────────────────────────────────────┘ │ │ │
230
- │ │ └─────────────────────────────────────────────────┘ │ │
231
- │ │ │ │
232
- │ │ v = out_proj(out_norm(z)) ← velocity prediction │ │
233
- │ └─────────────────────────────────────────────────────────┘ │
234
- │ │
235
- │ Training: IFT backprop (O(1) memory through recursion) │
236
- │ Inference: Full recursion (no grad needed) │
237
- └─────────────────────────────────────────────────────────────┘
238
  ```
239
 
240
  ---
241
 
242
- ## 6. Mathematical Formulation
243
-
244
- ### Forward Process (Rectified Flow)
245
-
246
- Given clean latent z₀ and noise ε ~ N(0, I):
247
-
248
- ```
249
- z_t = (1 - t) · z₀ + t · ε, t ∈ [0, 1]
250
- ```
251
-
252
- ### Velocity Target
253
-
254
- ```
255
- v* = ε - z₀
256
- ```
257
-
258
- ### Denoising Core (RLR)
259
-
260
- Let f_θ denote the shared GLD blocks, and g_φ denote the abstract updater.
261
-
262
- **Initialization:**
263
- ```
264
- z⁽⁰⁾ = input_proj(flatten(z_t))
265
- c = time_embed(sinusoidal(t)) + text_global
266
- z_abs⁽⁰⁾ = mean_pool(z⁽⁰⁾)
267
- ```
268
 
269
- **Outer loop** (j = 1..T_outer):
270
- ```
271
- z_abs⁽ʲ⁾ = z_abs⁽ʲ⁻¹⁾ + tanh(α) · g_φ([norm(z), mean_pool(z)])
272
- ```
273
 
274
- **Inner loop** (i = 1..T_inner):
275
- ```
276
- c_step = c + recursion_embed(j · T_inner + i)
277
- z_in = z + z_abs⁽ʲ⁾
278
- z ← z + 0.5 · (f_θ(z_in, c_step, text_emb) - z)
279
- ```
 
280
 
281
- **Output:**
282
- ```
283
- v_θ(z_t, t, c) = out_proj(out_norm(z))
284
- ```
285
 
286
- ### GLA Block (within f_θ)
 
 
 
 
 
287
 
288
- ```
289
- Q, K, V = W_qkv · x (linear projection)
290
- Q̃ = Q - λ · shift(Q) (token differential)
291
- K̃ = K - λ · shift(K)
292
- Q̃ = φ(Q̃), K̃ = φ(K̃) where φ(x) = 1 + elu(x)
293
 
294
- Forward scan: S_i = γ · S_{i-1} + K̃_i^T · V_i; O_i^fwd = Q̃_i · S_i
295
- Backward scan: (same in reverse)
 
296
 
297
- O = O^fwd + O^bwd
298
- O = sigmoid(W_g · x) · norm(O) · sigmoid(DWConv(W_local · x))
299
- output = W_out · O
300
- ```
301
 
302
- Complexity: **O(N · d²)** per direction, where d is head dimension and N is token count.
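To make the O(N · d²) claim concrete, here is a minimal pure-Python sketch of the scan recurrence above for a single head. It is an illustration under assumptions, not the repo's GLD block: the decay γ is a fixed scalar, and the token differential, output gates, normalization, and projections are left out. `forward_scan`, `quadratic_reference`, and `bidirectional` are hypothetical names.

```python
import math
import random

def phi(x):
    """Positive feature map phi(x) = 1 + elu(x)."""
    return (1.0 + x) if x > 0 else math.exp(x)

def forward_scan(Q, K, V, gamma):
    """Causal scan: S_i = gamma * S_{i-1} + K_i^T V_i, then O_i = Q_i S_i.

    Each token does one d x d state update and one d x d read,
    so the total cost is O(N * d^2).
    """
    d = len(Q[0])
    S = [[0.0] * d for _ in range(d)]
    out = []
    for q, k, v in zip(Q, K, V):
        S = [[gamma * S[a][b] + k[a] * v[b] for b in range(d)] for a in range(d)]
        out.append([sum(q[a] * S[a][b] for a in range(d)) for b in range(d)])
    return out

def quadratic_reference(Q, K, V, gamma):
    """Same result computed pairwise in O(N^2 * d), for checking only."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        o = [0.0] * d
        for j in range(i + 1):
            w = gamma ** (i - j) * sum(qa * ka for qa, ka in zip(q, K[j]))
            for b in range(d):
                o[b] += w * V[j][b]
        out.append(o)
    return out

def bidirectional(Q, K, V, gamma):
    """O = O_fwd + O_bwd, with the backward scan run on the reversed sequence.

    Note that the i == j term appears in both scans.
    """
    fwd = forward_scan(Q, K, V, gamma)
    bwd = forward_scan(Q[::-1], K[::-1], V[::-1], gamma)[::-1]
    return [[f + b for f, b in zip(fr, br)] for fr, br in zip(fwd, bwd)]

rng = random.Random(0)
N, d, gamma = 6, 3, 0.9
Q = [[phi(rng.gauss(0, 1)) for _ in range(d)] for _ in range(N)]
K = [[phi(rng.gauss(0, 1)) for _ in range(d)] for _ in range(N)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]

fast = forward_scan(Q, K, V, gamma)
slow = quadratic_reference(Q, K, V, gamma)
max_diff = max(abs(a - b) for fr, sr in zip(fast, slow) for a, b in zip(fr, sr))
```

The scan and the pairwise reference agree to floating-point precision, which is the property that lets the recurrent form replace the N × N attention matrix.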
 
303
 
304
- ### IFT Training (O(1) Memory)
 
305
 
306
- During training, we detach gradients for all but the last recursion:
307
- ```
308
- with no_grad():
309
-     for j in range(T_outer - 1):
310
-         z = recursive_refinement(z, c, text_emb)
311
- z = recursive_refinement(z, c, text_emb) # grad only here
312
- ```
313
 
314
- By the Implicit Function Theorem, if z* is a fixed point of f, then:
 
315
  ```
316
- ∂z*/∂θ = (I - ∂f/∂z)⁻¹ · ∂f/∂θ
317
- ```
318
-
319
- The 1-step gradient approximates this, giving correct gradient direction with O(1) memory.
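A scalar toy problem can illustrate the relationship between the implicit-function gradient and the 1-step gradient. The map `f(z, θ) = 0.5 · tanh(z) + θ` below is a hypothetical contraction chosen only for this illustration; it is not part of LRF. The sketch compares the IFT formula against a finite-difference estimate and against the 1-step approximation.

```python
import math

def f(z, theta):
    """Toy contraction: |df/dz| <= 0.5, so iteration converges."""
    return 0.5 * math.tanh(z) + theta

def fixed_point(theta, iters=100):
    """Find z* with z* = f(z*, theta) by plain iteration."""
    z = 0.0
    for _ in range(iters):
        z = f(z, theta)
    return z

theta = 0.3
z_star = fixed_point(theta)

# Partial derivatives of f evaluated at the fixed point.
df_dz = 0.5 * (1.0 - math.tanh(z_star) ** 2)
df_dtheta = 1.0

# Implicit Function Theorem: dz*/dtheta = (1 - df/dz)^-1 * df/dtheta
ift_grad = df_dtheta / (1.0 - df_dz)

# 1-step approximation: differentiate only through the last application of f.
one_step_grad = df_dtheta

# Finite-difference check of the true sensitivity of the fixed point.
eps = 1e-6
fd_grad = (fixed_point(theta + eps) - fixed_point(theta - eps)) / (2 * eps)
```

In this scalar case the 1-step gradient has the same sign as the exact gradient but a smaller magnitude, since it drops the `(1 - ∂f/∂z)⁻¹` factor. In higher dimensions the direction is only approximately preserved.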
320
 
321
  ---
322
 
323
- ## 7. Training Objective & Losses
324
-
325
- ### Stage 1: VAE Training
326
-
327
- ```
328
- L_VAE = L_recon + λ_perc · L_perceptual + λ_KL · L_KL
329
-
330
- L_recon = |x - x̂|₁ (L1 reconstruction)
331
- L_perceptual = (1/3) Σ_{s=0}^{2} MSE(pool_s(x), pool_s(x̂)) (multi-scale)
332
- L_KL = -0.5 · E[1 + log(σ²) - μ² - σ²] (KL divergence)
333
-
334
- λ_perc = 1.0, λ_KL = 1e-6
335
- ```
336
-
337
- ### Stage 2: Flow Matching
338
-
339
- ```
340
- L_flow = E_{t,z₀,ε} [ w(t) · ‖v_θ(z_t, t, c) - (ε - z₀)‖² ]
341
-
342
- w(t) = 1 / (t(1-t) + 0.01) (SNR weighting, normalized)
343
-
344
- With 10% classifier-free guidance dropout:
345
- P(c = ∅) = 0.1
346
- ```
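As a rough illustration of the weighting above, the sketch below evaluates `w(t)` and applies it to a per-sample velocity error. `flow_weight` and `weighted_flow_loss` are hypothetical helper names, and the README does not say how the weight is normalized, so no normalization is applied here.

```python
def flow_weight(t, eps=0.01):
    """w(t) = 1 / (t * (1 - t) + eps): largest near t=0 and t=1, smallest at t=0.5."""
    return 1.0 / (t * (1.0 - t) + eps)

def weighted_flow_loss(v_pred, z0, noise, t):
    """Weighted squared error against the rectified-flow target (noise - z0)."""
    target = [n - z for z, n in zip(z0, noise)]
    mse = sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)
    return flow_weight(t) * mse

z0 = [0.2, -0.4, 0.1]
noise = [1.0, 0.5, -0.3]
perfect = [n - z for z, n in zip(z0, noise)]   # exact velocity target
off_by_one = [p + 1.0 for p in perfect]        # constant error of 1.0

loss_mid = weighted_flow_loss(off_by_one, z0, noise, t=0.5)
loss_edge = weighted_flow_loss(off_by_one, z0, noise, t=0.0)
```

With the same prediction error, a sample drawn at `t=0` contributes far more to the loss than one drawn at `t=0.5`, which is worth keeping in mind when reading the raw loss values.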
347
 
348
- ### Stage 3: Consistency Distillation
 
 
 
 
349
 
350
- ```
351
- L_CD = ‖f_θ(z_{t_n}, t_n, c) - sg[f_{teacher}(z_{t_{n-1}}, t_{n-1}, c)]‖²
 
 
 
 
 
352
 
353
- where f_teacher uses the trained flow model with one Euler step:
354
- z_{t_{n-1}} = z_{t_n} - (t_n - t_{n-1}) · v_teacher(z_{t_n}, t_n, c)
 
 
 
355
  ```
356
 
357
- ### Stage 4: Editing Fine-tuning
358
-
359
- Same flow matching loss, but with additional image condition:
360
  ```
361
- v_θ(z_t, t, c, z_src) where z_src = encode(source_image)
362
- ```
363
-
364
- Additive conditioning: `z_input = z + z_src` before the RLR core.
365
 
366
  ---
367
 
368
- ## 8. Memory & Compute Budget
369
 
370
- ### Inference (1024×1024, Default Config, INT8)
371
-
372
- | Component | Memory |
373
  |---|---|
374
- | Text Encoder (INT8) | 11 MB |
375
- | VAE Decoder (INT8) | 1 MB |
376
- | Denoising Core (INT8) | 0.6 MB |
377
- | Latent activations (64×64×32) | 0.5 MB |
378
- | Peak activation memory | ~200 MB |
379
- | **Total** | **~213 MB** |
380
-
381
- This comfortably fits within 3-4 GB mobile RAM.
382
-
383
- ### Training (16 GB GPU, Default Config)
384
-
385
- | Item | Memory |
386
- |---|---|
387
- | Model parameters (FP32) | 62 MB |
388
- | Optimizer states (AdamW, 2×) | 124 MB |
389
- | Gradients | 62 MB |
390
- | Batch activations (BS=8, 64×64) | ~500 MB |
391
- | IFT overhead (only last recursion) | ~50 MB |
392
- | **Total** | **~800 MB** |
393
-
394
- Leaves ample room for larger batch sizes or higher resolution on 16 GB.
395
 
396
  ---
397
 
398
- ## 9. Training Curriculum
399
-
400
- ### Stage 1: VAE (50K steps)
401
- - **Data**: ImageNet or COCO (any large image dataset)
402
- - **Resolution**: 256×256
403
- - **What to freeze**: Nothing
404
- - **What to train**: Full VAE
405
- - **LR**: 1e-4, AdamW, weight_decay=0.01
406
- - **Key**: Train until L_recon < 0.1
407
-
408
- ### Stage 2: Flow Matching — Low Resolution (100K steps)
409
- - **Data**: Synthetic captions from teacher (SDXL) + LAION-aesthetic subset
410
- - **Resolution**: 64×64
411
- - **What to freeze**: VAE
412
- - **What to train**: Core + Text Encoder
413
- - **LR**: 1e-4
414
- - **Key**: Focus on learning composition and prompt adherence
415
-
416
- ### Stage 3: Flow Matching — Mid Resolution (200K steps)
417
- - **Data**: Filtered LAION-aesthetic (score > 6.0) + synthetic
418
- - **Resolution**: 256×256
419
- - **What to freeze**: VAE
420
- - **What to train**: Core + Text Encoder
421
- - **LR**: 5e-5
422
- - **Key**: Focus on texture and detail
423
-
424
- ### Stage 4: Flow Matching — High Resolution (100K steps)
425
- - **Data**: High-quality curated + JourneyDB
426
- - **Resolution**: 512×512
427
- - **What to freeze**: VAE
428
- - **What to train**: Core + Text Encoder
429
- - **LR**: 2e-5
430
- - **Key**: Focus on fine detail and typography
431
-
432
- ### Stage 5: Consistency Distillation (50K steps)
433
- - **Data**: Same as Stage 4
434
- - **What to freeze**: VAE + Text Encoder
435
- - **What to train**: Core only
436
- - **LR**: 1e-5
437
- - **Key**: Distill from own multi-step model to 4-step generation
438
-
439
- ### Stage 6: Editing Fine-tuning (50K steps)
440
- - **Data**: InstructPix2Pix + MagicBrush + synthetic edit pairs
441
- - **What to freeze**: VAE
442
- - **What to train**: Core + Text Encoder
443
- - **LR**: 1e-5
444
- - **Key**: Add image conditioning channel
445
-
446
- ---
447
-
448
- ## 10. Deployment Plan for Mobile
449
-
450
- ### Step 1: Quantization
451
- - INT8 per-channel weight quantization (static)
452
- - INT8 per-token activation quantization (dynamic)
453
- - Result: ~4× model size reduction
454
 
455
- ### Step 2: Operator Optimization
456
- - Replace GELU → SiLU throughout (MobileDiffusion finding: GELU causes float16 instability)
457
- - Fuse norm + activation + linear into single kernels
458
- - Use CoreML (iOS) or NNAPI (Android) for hardware acceleration
 
 
 
 
459
 
460
- ### Step 3: Step Reduction
461
- - After consistency distillation: 4 Euler steps sufficient
462
- - With further adversarial distillation: 1-2 steps possible
463
-
464
- ### Step 4: Latent Size Optimization
465
- - f=16 compression: 1024² → 64×64 latents
466
- - 32 channels per position
467
- - Total latent: 64×64×32 = 131,072 values ≈ 0.5 MB
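The latent-size arithmetic above can be checked with a small helper. `latent_shape` and `latent_bytes` are hypothetical names, and the 0.5 MB figure is assumed to mean 4 bytes per value (FP32), which is what the numbers imply.

```python
def latent_shape(height, width, f, channels):
    """Latent shape (C, H/f, W/f) for a VAE with spatial compression factor f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by the compression factor")
    return (channels, height // f, width // f)

def latent_bytes(shape, bytes_per_value=4):
    """Memory for one latent tensor at the given precision."""
    c, h, w = shape
    return c * h * w * bytes_per_value

# f=16, 32 channels at 1024x1024, as in the deployment plan.
big = latent_shape(1024, 1024, f=16, channels=32)
big_mb = latent_bytes(big) / (1024 ** 2)

# f=8, 4 channels at 32x32, matching the TAESD setup on CIFAR-10.
small = latent_shape(32, 32, f=8, channels=4)
```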
468
-
469
- ### Projected Performance
470
- | Device | Steps | Estimated Time |
471
- |---|---|---|
472
- | iPhone 16 Pro (ANE) | 4 | ~0.5-1.0s |
473
- | Pixel 8 Pro (GPU) | 4 | ~1.0-2.0s |
474
- | iPhone 14 (GPU) | 8 | ~2.0-3.0s |
475
-
476
- ---
477
-
478
- ## 11. Failure Mode Analysis
479
-
480
- | Failure Mode | Cause | Detection | Fix |
481
- |---|---|---|---|
482
- | **Fixed-point non-convergence** | Recursion doesn't converge | Monitor z change per recursion | Damped update (α=0.5), reduce T_inner |
483
- | **Oversmoothing** | GLA loses high-frequency detail | Blurry outputs, low LPIPS | Increase token-differential λ, add DW-conv skip |
484
- | **Mode collapse** | Small model capacity | FID increases, low diversity | Increase num_blocks or dim |
485
- | **Training instability** | IFT gradient approximation error | Loss spikes | Reduce LR, increase warmup, disable IFT temporarily |
486
- | **Poor text adherence** | Weak cross-attention | Low CLIP score | Increase cross-attention gates, add more cross-attn layers |
487
- | **VAE artifacts** | Aggressive compression | Reconstruction artifacts | Lower f (use f=8), increase decoder capacity |
488
- | **CFG artifacts** | High guidance scale | Oversaturated images | Train with 10% unconditional, use CFG 3-5 range |
489
-
490
- ---
491
-
492
- ## 12. Ablation Plan
493
-
494
- ### Ablation 1: Recursion Depth vs Quality
495
- - **Vary**: T_inner ∈ {1, 2, 4, 6, 8}, T_outer ∈ {1, 2, 3}
496
- - **Measure**: FID, CLIP score, inference time
497
- - **Hypothesis**: Quality plateaus around T_inner=4-6; diminishing returns beyond T_outer=2
498
-
499
- ### Ablation 2: GLA vs Standard Attention
500
- - **Compare**: GLA blocks vs softmax attention blocks (same dim, same depth)
501
- - **Measure**: FID, memory, throughput
502
- - **Hypothesis**: GLA matches attention quality at 3-5× lower memory
503
-
504
- ### Ablation 3: Token Differential
505
- - **Vary**: λ ∈ {0, 0.05, 0.1, 0.2, learned}
506
- - **Measure**: FID, sharpness metrics (gradient magnitude)
507
- - **Hypothesis**: λ=0.1 optimal; λ=0 causes oversmoothing
508
-
509
- ### Ablation 4: IFT vs Full Backprop
510
- - **Compare**: IFT training vs full BPTT (at small T for memory comparison)
511
- - **Measure**: Final FID, training memory, convergence speed
512
- - **Hypothesis**: IFT within 2% FID of full backprop at 8-16× memory savings
513
-
514
- ### Ablation 5: VAE Compression
515
- - **Vary**: f ∈ {8, 16, 32}, C ∈ {8, 16, 32}
516
- - **Measure**: rFID, PSNR, generation FID
517
- - **Hypothesis**: f=16, C=16-32 is the sweet spot for mobile quality
518
-
519
- ### Ablation 6: Abstract State (H-module)
520
- - **Compare**: With/without abstract state update
521
- - **Measure**: FID, coherence metrics
522
- - **Hypothesis**: Abstract state improves global composition coherence
523
-
524
- ---
525
-
526
- ## 13. Editing Roadmap
527
-
528
- The LRF architecture is designed for editing-readiness through **additive image conditioning**:
529
-
530
- ### Phase 1: Inpainting
531
- - Add binary mask channel to condition input
532
- - `z_input = z + z_src * mask + z_noise * (1 - mask)`
533
- - Train on random masking + MagicBrush data
534
-
535
- ### Phase 2: Image-to-Image Translation
536
- - Source image encoded to latent, added to noisy latent
537
- - Noise level controls edit strength (low noise = subtle edit)
538
- - No architectural changes needed
539
-
540
- ### Phase 3: Instruction-Based Editing (OmniGen-style)
541
- - Text encoder receives both instruction AND image description
542
- - Source image latent added as conditioning
543
- - Train on InstructPix2Pix + SEED-edit data
544
-
545
- ### Phase 4: Super-Resolution
546
- - Low-res image encoded, upscaled in latent space
547
- - Decoder generates high-res output
548
- - Train on paired low/high-res data
549
-
550
- ### Phase 5: Style Transfer & Identity Preservation
551
- - Reference image encoded to separate latent
552
- - Cross-attention between reference and generation
553
- - Train on same-identity different-image pairs (GRIT-Entity)
554
-
555
- ### Phase 6: Multi-Image Conditioning
556
- - OmniGen-style interleaved image-text input
557
- - Multiple source images encoded and concatenated in latent space
558
- - Enables try-on, compositing, scene editing
559
-
560
- ### Why This Works
561
- The key architectural decisions that enable editing:
562
- 1. **Additive conditioning** preserves spatial correspondence (pixel i in source maps to token i in latent)
563
- 2. **Recursive refinement** naturally handles conditioning — the model can "reason" about how to modify the latent
564
- 3. **Cross-attention to text** at every recursion step allows the model to follow editing instructions progressively
565
- 4. **Same parameter reuse** means editing capability doesn't require new parameters — just new training data
566
 
567
  ---
568
 
569
- ## Quick Start
570
-
571
- ```python
572
- # Clone and install
573
- !pip install torch einops safetensors
574
 
575
- # Use the pipeline
576
- from lrf.model import LatentRecurrentFlow
577
- from lrf.pipeline import LRFPipeline
 
 
578
 
579
- # Create model
580
- model = LatentRecurrentFlow(LatentRecurrentFlow.tiny_config())
581
- pipe = LRFPipeline(model)
582
 
583
- # Generate
584
- images = pipe("a sunset over the ocean", num_steps=10, height=64, width=64)
 
 
585
 
586
- # Or train
587
- from lrf.training import run_prototype_training
588
- model, trainer = run_prototype_training(num_vae_steps=100, num_flow_steps=100)
589
- ```
590
 
591
- See `notebook.ipynb` for the full interactive walkthrough.
 
 
592
 
593
  ---
594
 
595
- ## Citation
596
-
597
- ```bibtex
598
- @software{lrf2026,
599
- title={LatentRecurrentFlow: A Novel Mobile-First Image Generation Architecture},
600
- author={LRF Research},
601
- year={2026},
602
- url={https://huggingface.co/krystv/LatentRecurrentFlow}
603
- }
604
- ```
605
-
606
  ## License
607
 
608
  Apache 2.0
 
8
  - recursive-reasoning
9
  - novel-architecture
10
  - subquadratic-attention
 
11
  - research
12
  library_name: lrf
13
  pipeline_tag: text-to-image
 
18
 
19
  > A genuinely new architecture for image generation designed from scratch to run on consumer devices with 3–4 GB RAM, trained on 16 GB budgets.
20
 
21
+ ## 🔥 v2 Training Results (CIFAR-10)
22
 
23
+ **Trained end-to-end on CIFAR-10** (50K images, 10 classes) using:
24
+ - **Pre-trained TAESD** (2.4M frozen params) as the VAE — f=8 compression, 32×32 → 4×4×4 latents
25
+ - **1.47M parameter denoising core** with recursive refinement (4 shared blocks × 2 recursions = 8 effective layers)
26
+ - **Rectified flow** matching with SNR-weighted loss and 10% CFG dropout
27
+ - Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999
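For readers unfamiliar with the EMA setting in the recipe above, here is a minimal sketch of an exponential moving average over parameters with decay 0.999. It uses plain floats in a dict; `ema_update` is a hypothetical helper and not the training script's implementation.

```python
def ema_update(ema_params, params, decay=0.999):
    """Blend current parameters into the running average: ema = d * ema + (1 - d) * p."""
    return {
        name: decay * ema_params[name] + (1.0 - decay) * value
        for name, value in params.items()
    }

# Track a parameter that jumps from 0.0 to 1.0 and then stays there.
ema = {"w": 0.0}
for _ in range(1000):
    ema = ema_update(ema, {"w": 1.0})
```

With decay 0.999 the average has covered roughly 63% of the gap after 1000 updates, so EMA weights lag the raw weights by on the order of a thousand steps.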
28
 
29
+ | Metric | Value |
30
+ |--------|-------|
31
+ | Final Loss | 0.931 |
32
+ | Training Time | ~70 min (CPU only!) |
33
+ | VAE Recon MSE | 0.068 |
34
+ | Classes producing colorful images | 10 of 10 |
 
35
 
36
+ ### Sample Outputs
37
 
38
+ VAE Reconstruction (top: original, bottom: TAESD reconstruction):
 
 
 
 
 
 
39
 
40
+ ![VAE Reconstruction](samples/vae_reconstruction.png)
41
 
42
+ Training progression (epoch 5 → 30):
43
 
44
+ ![Epoch 5](samples/samples_epoch005.png)
45
+ ![Epoch 30](samples/samples_epoch030.png)
46
 
47
+ Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):
 
 
 
 
 
 
48
 
49
+ ![Final Samples](samples/final_class_conditional.png)
50
 
51
+ Loss curve:
 
 
 
 
52
 
53
+ ![Loss](samples/loss.png)
54
 
55
+ ### Validation: No Grey Images
56
+ Every class produces images with proper variance:
57
  ```
58
+ airplane : std=0.383, range=1.908 ✅
59
+ automobile : std=0.448, range=2.000 ✅
60
+ bird : std=0.341, range=1.663 ✅
61
+ cat : std=0.521, range=2.000 ✅
62
+ deer : std=0.401, range=1.869 ✅
63
+ dog : std=0.477, range=1.994 ✅
64
+ frog : std=0.366, range=1.996 ✅
65
+ horse : std=0.499, range=1.972 ✅
66
+ ship : std=0.448, range=1.786 ✅
67
+ truck : std=0.510, range=1.944 ✅
68
  ```
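A check of this kind can be sketched in a few lines. The version below works on a flat list of pixel values in [-1, 1] and reports the standard deviation and the max-minus-min range, as in the listing above. The thresholds `min_std` and `min_range` are assumptions picked for illustration, as is the use of the population standard deviation; the repo's actual validation code may differ.

```python
import statistics

def pixel_stats(pixels):
    """Return (std, range) over a flat list of pixel values in [-1, 1]."""
    std = statistics.pstdev(pixels)
    value_range = max(pixels) - min(pixels)
    return std, value_range

def is_grey(pixels, min_std=0.05, min_range=0.5):
    """Flag an image as collapsed when it has almost no variance or contrast."""
    std, value_range = pixel_stats(pixels)
    return std < min_std or value_range < min_range

# A nearly constant image versus one that spans the full [-1, 1] range.
flat_grey = [0.01, -0.01, 0.0, 0.02] * 64
contrasty = [-1.0, 1.0, 0.3, -0.4] * 64

grey_flag = is_grey(flat_grey)
contrast_flag = is_grey(contrasty)
```

Passing this check only rules out collapsed outputs. It says nothing about whether the samples resemble the target class.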
69
 
70
  ---
71
 
72
+ ## Architecture Overview
73
 
74
+ LRF combines five key innovations into a single coherent architecture:
 
 
 
75
 
76
+ | Innovation | Source Inspiration | What It Does |
77
+ |---|---|---|
78
+ | **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
79
+ | **Efficient Spatial Mixer** | ViG/GLA + DyDiLA | Attention + DW-Conv locality (adapts to sequence length) |
80
+ | **Pre-trained TAESD VAE** | madebyollin/taesd | f=8 compression, 2.4M params, works out-of-box |
81
+ | **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
82
+ | **Additive Image Conditioning** | OmniGen | Same core supports text-to-image AND editing |
83
 
84
+ ### v2 Architecture (Trained & Validated)
 
 
 
85
 
86
+ | Component | Parameters | Description |
87
+ |---|---|---|
88
+ | TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder |
89
+ | Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions |
90
+ | Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 |
91
+ | **Trainable Total** | **1.47M** | |
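The "4 shared blocks × 2 inner recursions" entry means the same four blocks are applied twice, giving eight block applications for four blocks' worth of parameters. The toy below illustrates that weight sharing with scalar `tanh` blocks. The damped update with factor 0.5 follows the v1 formulation in the earlier README; whether v2 uses the same update is an assumption. `make_block` and `recursive_core` are hypothetical names.

```python
import math

def make_block(scale, shift):
    """Return a toy 'block': an elementwise tanh(scale * x + shift)."""
    def block(z):
        return [math.tanh(scale * x + shift) for x in z]
    return block

def recursive_core(z, blocks, num_recursions, damping=0.5):
    """Apply the same list of blocks num_recursions times with a damped update."""
    calls = 0
    for _ in range(num_recursions):
        h = z
        for block in blocks:
            h = block(h)
            calls += 1
        # Damped residual update: z <- z + damping * (blocks(z) - z)
        z = [zi + damping * (hi - zi) for zi, hi in zip(z, h)]
    return z, calls

blocks = [make_block(s, b) for s, b in [(1.0, 0.1), (0.8, -0.2), (1.2, 0.0), (0.9, 0.3)]]
z_in = [0.5, -0.25, 0.0, 1.0]
z_out, calls = recursive_core(z_in, blocks, num_recursions=2)
```

Here `calls` counts block applications, while the number of distinct blocks (and therefore parameters) stays at four.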
92
 
93
+ ### How It Works
 
 
 
 
94
 
95
+ ```python
96
+ # 1. Encode image to latent (TAESD, frozen)
97
+ z_0 = vae.encode(image) # [B, 4, 4, 4]
98
 
99
+ # 2. Add noise (rectified flow)
100
+ z_t = (1-t) * z_0 + t * noise # Linear interpolation
 
 
101
 
102
+ # 3. Predict velocity (recursive denoising core)
103
+ v = core(z_t, t, class_label) # 4 blocks × 2 recursions
104
 
105
+ # 4. Training target
106
+ loss = MSE(v, noise - z_0) # Velocity matching
107
 
108
+ # 5. Sampling (Euler ODE solver, t=1→0)
109
+ for step in timesteps:
110
+     v = core(z, t, class_label)
111
+     z = z - dt * v
 
 
 
112
 
113
+ # 6. Decode to image (TAESD, frozen)
114
+ image = vae.decode(z)
115
  ```
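The six steps above can be exercised end to end without the trained model. The sketch below is a toy, not the repo's `RectifiedFlowScheduler`: it replaces `core` with a hypothetical `oracle_velocity` that returns the exact velocity `noise - z_0` for one fixed sample, and checks that Euler integration from t=1 to t=0 recovers `z_0`. Latents are flat Python lists rather than tensors.

```python
import random

def interpolate(z0, noise, t):
    """Rectified-flow forward process: z_t = (1 - t) * z0 + t * noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, noise)]

def velocity_target(z0, noise):
    """Training target for the velocity: v* = noise - z0."""
    return [b - a for a, b in zip(z0, noise)]

def euler_sample(velocity_fn, z, num_steps):
    """Integrate dz/dt = v from t=1 down to t=0 with fixed-size Euler steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z

rng = random.Random(0)
z0 = [rng.uniform(-1, 1) for _ in range(64)]   # stand-in for a flattened 4x4x4 latent
noise = [rng.gauss(0, 1) for _ in range(64)]

def oracle_velocity(z, t):
    """Hypothetical perfect model: always returns the true velocity."""
    return velocity_target(z0, noise)

z_start = interpolate(z0, noise, 1.0)          # at t=1 the latent is pure noise
z_end = euler_sample(oracle_velocity, z_start, num_steps=10)
max_err = max(abs(a - b) for a, b in zip(z_end, z0))
```

Because the true velocity along a straight path is constant, Euler steps are exact here up to floating-point error. A learned model only approximates that velocity, which is why the number of sampling steps still matters in practice.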
 
 
 
 
116
 
117
  ---
118
 
119
+ ## Quick Start
120
 
121
+ ### Generate from trained model:
122
+ ```python
123
+ import torch
124
+ from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
125
+ from diffusers import AutoencoderTiny
126
 
127
+ # Load
128
+ vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
129
+ ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
130
+ model = LRFv2(ckpt['config'])
131
+ for name, p in model.named_parameters():
132
+     p.data.copy_(ckpt['ema_params'][name])
133
+ model.eval()
134
 
135
+ # Generate (class 3 = cat)
136
+ scheduler = RectifiedFlowScheduler()
137
+ labels = torch.full((4,), 3, dtype=torch.long)
138
+ z = scheduler.sample(model, (4,4,4,4), labels, num_steps=50, cfg_scale=3.0)
139
+ images = vae.decode(z).sample.clamp(-1, 1)
140
  ```
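The `cfg_scale` argument and the 10% CFG dropout used in training fit together as follows. This is a sketch of the standard classifier-free guidance recipe on plain lists. Whether `RectifiedFlowScheduler.sample` combines the two predictions in exactly this way is an assumption, and `cfg_velocity`, `maybe_drop_label`, and `NULL_LABEL` are hypothetical names.

```python
import random

NULL_LABEL = 10  # assumed index for "no class"; CIFAR-10 classes are 0..9

def maybe_drop_label(label, null_label, p_drop, rng):
    """Training side: replace the label with the null label with probability p_drop."""
    return null_label if rng.random() < p_drop else label

def cfg_velocity(v_cond, v_uncond, scale):
    """Sampling side: v = v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

# With scale 1.0 the result is the conditional prediction.
# Larger scales extrapolate away from the unconditional one.
guided = cfg_velocity([1.0, 2.0], [0.0, 1.0], scale=3.0)

# Roughly 10% of training labels are replaced by the null label.
rng = random.Random(0)
labels = [maybe_drop_label(3, NULL_LABEL, 0.1, rng) for _ in range(10000)]
dropped_fraction = labels.count(NULL_LABEL) / len(labels)
```

Guidance needs two forward passes per step (conditional and unconditional), so a 50-step run at `cfg_scale=3.0` costs about 100 model evaluations unless the two passes are batched together.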
141
 
142
+ ### Train from scratch:
143
+ ```bash
144
+ python lrf/train_v2.py
145
  ```
 
 
 
 
146
 
147
  ---
148
 
149
+ ## Files
150
 
151
+ | File | Description |
 
 
152
  |---|---|
153
+ | `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) |
154
+ | `lrf/train_v2.py` | CIFAR-10 training pipeline with TAESD VAE |
155
+ | `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) |
156
+ | `trained/config.json` | Model configuration |
157
+ | `samples/` | Generated sample images at various epochs |
158
+ | `lrf/model.py` | v1 architecture (research prototype) |
159
+ | `lrf/training.py` | v1 training pipeline |
160
+ | `lrf/pipeline.py` | HF-compatible inference pipeline |
161
+ | `notebook.ipynb` | Interactive walkthrough |
 
 
 
 
 
 
 
 
 
 
 
 
162
 
163
  ---
164
 
165
+ ## Training Curriculum (Full Scale)
166
 
167
+ | Stage | Resolution | Data | Freeze | Train | LR | Steps |
168
+ |---|---|---|---|---|---|---|
169
+ | 1. VAE | 256² | ImageNet/COCO | - | VAE | 1e-4 | 50K |
170
+ | 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K |
171
+ | 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K |
172
+ | 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K |
173
+ | 5. Distill | 512² | Same as 4 | VAE+Text | Core | 1e-5 | 50K |
174
+ | 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K |
175
 
176
+ **Shortcut (demonstrated by the CIFAR-10 run in this repo):** Skip Stage 1 entirely by using the pre-trained TAESD VAE and start directly at Stage 2.
177
 
178
  ---
179
 
180
+ ## Relevant Papers (Grouped by Problem)
 
 
 
 
181
 
182
+ ### Subquadratic Spatial Mixing
183
+ - PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34× speedup
184
+ - DiMSUM (2411.04168): Mamba + wavelet, FID 2.11
185
+ - ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
186
+ - DyDiLA (2601.13683): Dynamic differential linear attention
187
 
188
+ ### Recursive Reasoning
189
+ - HRM (2506.21734): Fixed-point recurrence, O(1) memory via IFT
190
+ - TRM (2510.04871): 7M params → 45% ARC-AGI-1
191
 
192
+ ### Compact Latent Spaces
193
+ - SANA DC-AE (2410.10629): f=32, PSNR 29.29
194
+ - SnapGen (2412.09619): 1.38M tiny decoder
195
+ - TAESD (madebyollin): 2.4M params, f=8, works immediately
196
 
197
+ ### Few-Step Generation
198
+ - Consistency Models (2303.01469): One-step from diffusion
199
+ - LCM (2310.04378): 2-4 step via consistency distillation
 
200
 
201
+ ### Editing Architectures
202
+ - OmniGen (2409.11340): Unified generation + editing
203
+ - InstructPix2Pix (2211.09800): Text-guided editing
204
 
205
  ---
206
207
  ## License
208
 
209
  Apache 2.0