Add v2 training results with CIFAR-10 validation
README.md
CHANGED
|
@@ -8,7 +8,6 @@ tags:
|
|
| 8 |
- recursive-reasoning
|
| 9 |
- novel-architecture
|
| 10 |
- subquadratic-attention
|
| 11 |
-
- gated-linear-attention
|
| 12 |
- research
|
| 13 |
library_name: lrf
|
| 14 |
pipeline_tag: text-to-image
|
|
@@ -19,590 +18,192 @@ license: apache-2.0
|
|
| 19 |
|
| 20 |
> A genuinely new architecture for image generation designed from scratch to run on consumer devices with 3–4 GB RAM, trained on 16 GB budgets.
|
| 21 |
|
| 22 |
-
-
|
| 23 |
-
|
| 24 |
-
## Table of Contents
|
| 25 |
-
|
| 26 |
-
1. [Architecture Overview](#1-architecture-overview)
|
| 27 |
-
2. [Shortlist of Most Relevant Papers](#2-shortlist-of-most-relevant-papers)
|
| 28 |
-
3. [Paper Critiques](#3-paper-critiques)
|
| 29 |
-
4. [Full Proposed Architecture](#4-full-proposed-architecture-latentrecurrentflow)
|
| 30 |
-
5. [Module-by-Module Diagram](#5-module-by-module-diagram)
|
| 31 |
-
6. [Mathematical Formulation](#6-mathematical-formulation)
|
| 32 |
-
7. [Training Objective & Losses](#7-training-objective--losses)
|
| 33 |
-
8. [Memory & Compute Budget](#8-memory--compute-budget)
|
| 34 |
-
9. [Training Curriculum](#9-training-curriculum)
|
| 35 |
-
10. [Deployment Plan for Mobile](#10-deployment-plan-for-mobile)
|
| 36 |
-
11. [Failure Mode Analysis](#11-failure-mode-analysis)
|
| 37 |
-
12. [Ablation Plan](#12-ablation-plan)
|
| 38 |
-
13. [Editing Roadmap](#13-editing-roadmap)
|
| 39 |
-
|
| 40 |
-
---
|
| 41 |
-
|
| 42 |
-
## 1. Architecture Overview
|
| 43 |
-
|
| 44 |
-
LRF combines five key innovations into a single coherent architecture:
|
| 45 |
-
|
| 46 |
-
| Innovation | Source Inspiration | What It Does |
|
| 47 |
-
|---|---|---|
|
| 48 |
-
| **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
|
| 49 |
-
| **Gated Linear Diffusion (GLD)** blocks | ViG/GLA + DyDiLA | O(N) subquadratic spatial mixing replacing O(N²) attention |
|
| 50 |
-
| **Compact f=16 VAE** | SANA DC-AE + SnapGen | 16× spatial compression with ~280K decoder |
|
| 51 |
-
| **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
|
| 52 |
-
| **Multimodal Conditioning** | OmniGen | Same core supports text-to-image AND editing via additive image conditioning |
|
| 53 |
|
| 54 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 55 |
|
| 56 |
-
|
|
| 57 |
-
|---
|
| 58 |
-
|
|
| 59 |
-
|
|
| 60 |
-
|
|
| 61 |
-
|
|
| 62 |
-
| **Total** | **5.7M** | **21.7 MB** | **5.4 MB** |
|
| 63 |
|
| 64 |
-
###
|
| 65 |
|
| 66 |
-
|
| 67 |
-
|---|---|---|---|
|
| 68 |
-
| VAE Encoder | 3.1M | 11.7 MB | 2.9 MB |
|
| 69 |
-
| VAE Decoder | 1.1M | 4.1 MB | 1.0 MB |
|
| 70 |
-
| Text Encoder | 11.5M | 43.9 MB | 11.0 MB |
|
| 71 |
-
| Denoising Core | 651K | 2.5 MB | 0.6 MB |
|
| 72 |
-
| **Total** | **16.3M** | **62.2 MB** | **15.6 MB** |
|
| 73 |
|
| 74 |
-
|
| 75 |
|
| 76 |
-
|
| 77 |
|
| 78 |
-
|
|
|
|
| 79 |
|
| 80 |
-
|
| 81 |
-
|---|---|---|---|
|
| 82 |
-
| **PDE-SSM-DiT** | 2603.13663 | Fourier PDE operator replaces attention, O(N log N), 34× speedup | 18.36 (CelebA-HQ 256) |
|
| 83 |
-
| **DiMSUM** (NeurIPS 2024) | 2411.04168 | Mamba + wavelet subbands + shared transformer | **2.11** (CelebA-HQ 256) |
|
| 84 |
-
| **ViG/GLA** | 2405.18425 | Gated Linear Attention with 2D locality injection | 90% less memory at 1024² |
|
| 85 |
-
| **DyDiLA** | 2601.13683 | Dynamic differential linear attention | **6.80** (SubIN 256) |
|
| 86 |
-
| **Mamba2D** | 2412.16146 | True 2D SSM with wavefront scan | 84.0% top-1 IN-1K (27M) |
|
| 87 |
|
| 88 |
-
|
| 89 |
|
| 90 |
-
|
| 91 |
-
|---|---|---|
|
| 92 |
-
| **HRM** | 2506.21734 | 2-level recurrent fixed-point reasoning, O(1) memory via IFT |
|
| 93 |
-
| **TRM** (6473 ⭐) | 2510.04871 | 7M params → 45% ARC-AGI-1 via deep recursion |
|
| 94 |
-
| **Thinking Pixel** | 2604.25299 | Sparse MoE adapters for recursive visual reasoning in DiT |
|
| 95 |
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
| Paper | arxiv | Compression | Quality |
|
| 99 |
-
|---|---|---|---|
|
| 100 |
-
| **SANA DC-AE** | 2410.10629 | f=32, C=32 → 32×32 latents for 1024² | PSNR 29.29, rFID 0.34 |
|
| 101 |
-
| **SnapGen** | 2412.09619 | 1.38M tiny decoder (35× smaller than SD3) | PSNR 27.85 |
|
| 102 |
-
| **TiTok** | 2406.07550 | 32 tokens per 256² image | gFID 1.97 (IN-256) |
|
| 103 |
-
| **MobileDiffusion** | 2311.16567 | f=8, c=8 VAE, sub-second on iPhone | Better than SD-1.5 at 8 steps |
|
| 104 |
-
|
| 105 |
-
### D. Few-Step Generation
|
| 106 |
-
|
| 107 |
-
| Paper | arxiv | Key Result |
|
| 108 |
-
|---|---|---|
|
| 109 |
-
| **Consistency Models** | 2303.01469 | One-step generation from diffusion |
|
| 110 |
-
| **LCM** | 2310.04378 | 2-4 step high-quality via consistency distillation |
|
| 111 |
-
| **SD3.5-Flash** | 2509.21318 | Few-step distillation with timestep sharing |
|
| 112 |
-
|
| 113 |
-
### E. Unified Generation + Editing
|
| 114 |
-
|
| 115 |
-
| Paper | arxiv | Key Contribution |
|
| 116 |
-
|---|---|---|
|
| 117 |
-
| **OmniGen** | 2409.11340 | Single model for T2I + editing + control, interleaved image-text input |
|
| 118 |
-
| **OmniGen2** | 2506.18871 | Dual decoding pathways, decoupled image tokenizer |
|
| 119 |
-
| **InstructPix2Pix** | 2211.09800 | Image editing from text instructions |
|
| 120 |
-
|
| 121 |
-
### F. Mobile Deployment
|
| 122 |
-
|
| 123 |
-
| Paper | arxiv | Device Performance |
|
| 124 |
-
|---|---|---|
|
| 125 |
-
| **SnapGen** | 2412.09619 | 1.4s on iPhone 15 Pro, 372M UNet |
|
| 126 |
-
| **SnapGen++** | 2601.08303 | 1.8s on iPhone 16, 0.4B sub-DiT |
|
| 127 |
-
| **MobileDiffusion** | 2311.16567 | Sub-second on iPhone, ~400M params |
|
| 128 |
-
|
| 129 |
-
---
|
| 130 |
-
|
| 131 |
-
## 3. Paper Critiques
|
| 132 |
-
|
| 133 |
-
### PDE-SSM (2603.13663) ✅ Borrowed: Physical inductive bias concept
|
| 134 |
-
- **Why it helps**: 34× speedup from FFT-based spatial operator with physically grounded bias
|
| 135 |
-
- **What it fails at**: FID still behind DiMSUM (18.36 vs 2.11); requires FFT which is non-trivial on mobile
|
| 136 |
-
- **Borrowed**: Concept of learnable PDE-style spatial operators; we adapt this to our GLD blocks
|
| 137 |
-
|
| 138 |
-
### HRM/TRM (2506.21734, 2510.04871) ✅ Borrowed: Core recursive architecture
|
| 139 |
-
- **Why it helps**: O(1) memory backprop via IFT; extreme parameter efficiency (7M → 45% ARC-AGI)
|
| 140 |
-
- **What it fails at**: Never applied to image generation; fixed-point convergence not guaranteed for images
|
| 141 |
-
- **Borrowed**: Two-level recursion (abstract + detail), IFT training, recursion depth embedding
|
| 142 |
-
|
| 143 |
-
### ViG/GLA (2405.18425) ✅ Borrowed: Spatial mixing block
|
| 144 |
-
- **Why it helps**: Hardware-aware, 90% memory savings, bidirectional GLA with locality injection
|
| 145 |
-
- **What it fails at**: Only tested on classification/detection, not generation
|
| 146 |
-
- **Borrowed**: Bidirectional GLA core, depthwise conv locality injection (GaLI), token differential (from DyDiLA)
|
| 147 |
-
|
| 148 |
-
### SANA DC-AE (2410.10629) ✅ Borrowed: Latent space design principles
|
| 149 |
-
- **Why it helps**: f=32 achieves similar quality to f=8 but 16× fewer tokens
|
| 150 |
-
- **What it fails at**: Decoder is still large (50M); typography needs decoder-only LLM text encoder
|
| 151 |
-
- **Borrowed**: High-compression VAE principle; we use f=16 as a compromise for fine detail
|
| 152 |
-
|
| 153 |
-
### SnapGen (2412.09619) ✅ Borrowed: Tiny decoder architecture
|
| 154 |
-
- **Why it helps**: 35× smaller decoder, 54× faster decode, negligible quality loss
|
| 155 |
-
- **What it fails at**: Proprietary weights; still uses quadratic attention in the UNet backbone
|
| 156 |
-
- **Borrowed**: Attention-free decoder, SepConv, minimal GroupNorm, SiLU instead of GELU
|
| 157 |
-
|
| 158 |
-
### TiTok (2406.07550) ❌ Rejected: Too aggressive compression
|
| 159 |
-
- **Why it was considered**: 32 tokens per image is incredibly compact
|
| 160 |
-
- **Why rejected**: rFID=16.2 means visible artifacts; fine detail and typography badly degraded at 32 tokens
|
| 161 |
-
|
| 162 |
-
### DiMSUM (2411.04168) ⚠️ Partially borrowed: Wavelet concept
|
| 163 |
-
- **Why it helps**: Best FID (2.11) among SSM-based approaches
|
| 164 |
-
- **What it fails at**: Still uses cross-attention fusion → partially quadratic; complex architecture
|
| 165 |
-
- **Borrowed**: Wavelet decomposition concept for frequency-aware processing
|
| 166 |
-
|
| 167 |
-
---
|
| 168 |
-
|
| 169 |
-
## 4. Full Proposed Architecture: LatentRecurrentFlow
|
| 170 |
-
|
| 171 |
-
### Name: **LatentRecurrentFlow (LRF)**
|
| 172 |
-
|
| 173 |
-
LRF is a **recursive flow-matching image generator** that uses:
|
| 174 |
-
- A compact VAE with f=16 compression and a ~280K tiny decoder
|
| 175 |
-
- A **Recursive Latent Refinement (RLR) core** that iteratively refines image latents through shared GLD blocks
|
| 176 |
-
- A **rectified flow** training objective for clean few-step generation
|
| 177 |
-
- **Additive image conditioning** for editing-readiness
|
| 178 |
-
|
| 179 |
-
The core insight: **instead of stacking many unique layers, reuse a small set of blocks recursively**. This exploits the observation from HRM/TRM that iterative application of the same function can converge to a fixed point that represents the solution — analogous to how diffusion models iteratively denoise.
|
| 180 |
-
|
| 181 |
-
---
|
| 182 |
-
|
| 183 |
-
## 5. Module-by-Module Diagram
|
| 184 |
|
|
|
|
|
|
|
| 185 |
```
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
│ │ │ │ (11.5M) │ │ z_0 + tε │ │
|
| 197 |
-
│ │ Decoder ────┤ │ │ │ │ │
|
| 198 |
-
│ │ (1.1M, tiny)│ │ → text_emb │ │ v = ε - z_0 │ │
|
| 199 |
-
│ └──────┬───────┘ │ → text_glob │ └────────┬───────┘ │
|
| 200 |
-
│ │ └──────┬───────┘ │ │
|
| 201 |
-
│ │ │ │ │
|
| 202 |
-
│ ┌──────▼───────────────────▼─────────────────────▼──────┐ │
|
| 203 |
-
│ │ Recursive Latent Core (RLR) │ │
|
| 204 |
-
│ │ │ │
|
| 205 |
-
│ │ ┌─────────────────────────────────────────────────┐ │ │
|
| 206 |
-
│ │ │ OUTER LOOP (j = 1..T_outer) │ │ │
|
| 207 |
-
│ │ │ │ │ │
|
| 208 |
-
│ │ │ z_abstract ← f_slow(z, z_pooled) [H-module] │ │ │
|
| 209 |
-
│ │ │ │ │ │
|
| 210 |
-
│ │ │ ┌─────────────────────────────────────────┐ │ │ │
|
| 211 |
-
│ │ │ │ INNER LOOP (i = 1..T_inner) │ │ │ │
|
| 212 |
-
│ │ │ │ │ │ │ │
|
| 213 |
-
│ │ │ │ cond = t_emb + text_global + rec_emb │ │ │ │
|
| 214 |
-
│ │ │ │ z_in = z + z_abstract │ │ │ │
|
| 215 |
-
│ │ │ │ │ │ │ │
|
| 216 |
-
│ │ │ │ FOR block in GLD_blocks: │ │ │ │
|
| 217 |
-
│ │ │ │ ┌─────────────────────────────────┐ │ │ │ │
|
| 218 |
-
│ │ │ │ │ GLD Block │ │ │ │ │
|
| 219 |
-
│ │ │ │ │ │ │ │ │ │
|
| 220 |
-
│ │ │ │ │ 1. AdaLN-modulate(z, cond) │ │ │ │ │
|
| 221 |
-
│ │ │ │ │ 2. GLA: BiDir scan + DiffToken │ │ │ │ │
|
| 222 |
-
│ │ │ │ │ + DW-Conv locality gate │ │ │ │ │
|
| 223 |
-
│ │ │ │ │ 3. Cross-attn to text_emb │ │ │ │ │
|
| 224 |
-
│ │ │ │ │ 4. AdaLN-modulate(z, cond) │ │ │ │ │
|
| 225 |
-
│ │ │ │ │ 5. SwiGLU FFN │ │ │ │ │
|
| 226 |
-
│ │ │ │ └─────────────────────────────────┘ │ │ │ │
|
| 227 |
-
│ │ │ │ │ │ │ │
|
| 228 |
-
│ │ │ │ z = z + 0.5 * (blocks(z_in) - z) │ │ │ │
|
| 229 |
-
│ │ │ └─────────────────────────────────────────┘ │ │ │
|
| 230 |
-
│ │ └─────────────────────────────────────────────────┘ │ │
|
| 231 |
-
│ │ │ │
|
| 232 |
-
│ │ v = out_proj(out_norm(z)) ← velocity prediction │ │
|
| 233 |
-
│ └─────────────────────────────────────────────────────────┘ │
|
| 234 |
-
│ │
|
| 235 |
-
│ Training: IFT backprop (O(1) memory through recursion) │
|
| 236 |
-
│ Inference: Full recursion (no grad needed) │
|
| 237 |
-
└─────────────────────────────────────────────────────────────┘
|
| 238 |
```
|
| 239 |
|
| 240 |
---
|
| 241 |
|
| 242 |
-
##
|
| 243 |
-
|
| 244 |
-
### Forward Process (Rectified Flow)
|
| 245 |
-
|
| 246 |
-
Given clean latent z₀ and noise ε ~ N(0, I):
|
| 247 |
-
|
| 248 |
-
```
|
| 249 |
-
z_t = (1 - t) · z₀ + t · ε, t ∈ [0, 1]
|
| 250 |
-
```
|
| 251 |
-
|
| 252 |
-
### Velocity Target
|
| 253 |
-
|
| 254 |
-
```
|
| 255 |
-
v* = ε - z₀
|
| 256 |
-
```
|
| 257 |
-
|
| 258 |
-
### Denoising Core (RLR)
|
| 259 |
-
|
| 260 |
-
Let f_θ denote the shared GLD blocks, and g_φ denote the abstract updater.
|
| 261 |
-
|
| 262 |
-
**Initialization:**
|
| 263 |
-
```
|
| 264 |
-
z⁽⁰⁾ = input_proj(flatten(z_t))
|
| 265 |
-
c = time_embed(sinusoidal(t)) + text_global
|
| 266 |
-
z_abs⁽⁰⁾ = mean_pool(z⁽⁰⁾)
|
| 267 |
-
```
|
| 268 |
|
| 269 |
-
|
| 270 |
-
```
|
| 271 |
-
z_abs⁽ʲ⁾ = z_abs⁽ʲ⁻¹⁾ + tanh(α) · g_φ([norm(z), mean_pool(z)])
|
| 272 |
-
```
|
| 273 |
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
|
|
|
|
| 280 |
|
| 281 |
-
|
| 282 |
-
```
|
| 283 |
-
v_θ(z_t, t, c) = out_proj(out_norm(z))
|
| 284 |
-
```
|
| 285 |
|
| 286 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 287 |
|
| 288 |
-
|
| 289 |
-
Q, K, V = W_qkv · x (linear projection)
|
| 290 |
-
Q̃ = Q - λ · shift(Q) (token differential)
|
| 291 |
-
K̃ = K - λ · shift(K)
|
| 292 |
-
Q̃ = φ(Q̃), K̃ = φ(K̃) where φ(x) = 1 + elu(x)
|
| 293 |
|
| 294 |
-
|
| 295 |
-
|
|
|
|
| 296 |
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
output = W_out · O
|
| 300 |
-
```
|
| 301 |
|
| 302 |
-
|
|
|
|
| 303 |
|
| 304 |
-
#
|
|
|
|
| 305 |
|
| 306 |
-
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
z = recursive_refinement(z, c, text_emb)
|
| 311 |
-
z = recursive_refinement(z, c, text_emb) # grad only here
|
| 312 |
-
```
|
| 313 |
|
| 314 |
-
|
|
|
|
| 315 |
```
|
| 316 |
-
∂z*/∂θ = (I - ∂f/∂z)⁻¹ · ∂f/∂θ
|
| 317 |
-
```
|
| 318 |
-
|
| 319 |
-
The 1-step gradient approximates this, giving correct gradient direction with O(1) memory.
|
| 320 |
|
| 321 |
---
|
| 322 |
|
| 323 |
-
##
|
| 324 |
-
|
| 325 |
-
### Stage 1: VAE Training
|
| 326 |
-
|
| 327 |
-
```
|
| 328 |
-
L_VAE = L_recon + λ_perc · L_perceptual + λ_KL · L_KL
|
| 329 |
-
|
| 330 |
-
L_recon = |x - x̂|₁ (L1 reconstruction)
|
| 331 |
-
L_perceptual = (1/3) Σ_{s=0}^{2} MSE(pool_s(x), pool_s(x̂)) (multi-scale)
|
| 332 |
-
L_KL = -0.5 · E[1 + log(σ²) - μ² - σ²] (KL divergence)
|
| 333 |
-
|
| 334 |
-
λ_perc = 1.0, λ_KL = 1e-6
|
| 335 |
-
```
|
| 336 |
-
|
| 337 |
-
### Stage 2: Flow Matching
|
| 338 |
-
|
| 339 |
-
```
|
| 340 |
-
L_flow = E_{t,z₀,ε} [ w(t) · ‖v_θ(z_t, t, c) - (ε - z₀)‖² ]
|
| 341 |
-
|
| 342 |
-
w(t) = 1 / (t(1-t) + 0.01) (SNR weighting, normalized)
|
| 343 |
-
|
| 344 |
-
With 10% classifier-free guidance dropout:
|
| 345 |
-
P(c = ∅) = 0.1
|
| 346 |
-
```
|
| 347 |
|
| 348 |
-
###
|
|
|
|
|
|
|
|
|
|
|
|
|
| 349 |
|
| 350 |
-
|
| 351 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 352 |
|
| 353 |
-
|
| 354 |
-
|
|
|
|
|
|
|
|
|
|
| 355 |
```
|
| 356 |
|
| 357 |
-
###
|
| 358 |
-
|
| 359 |
-
|
| 360 |
```
|
| 361 |
-
v_θ(z_t, t, c, z_src) where z_src = encode(source_image)
|
| 362 |
-
```
|
| 363 |
-
|
| 364 |
-
Additive conditioning: `z_input = z + z_src` before the RLR core.
|
| 365 |
|
| 366 |
---
|
| 367 |
|
| 368 |
-
##
|
| 369 |
|
| 370 |
-
|
| 371 |
-
|
| 372 |
-
| Component | Memory |
|
| 373 |
|---|---|
|
| 374 |
-
|
|
| 375 |
-
|
|
| 376 |
-
|
|
| 377 |
-
|
|
| 378 |
-
|
|
| 379 |
-
|
|
| 380 |
-
|
| 381 |
-
|
| 382 |
-
|
| 383 |
-
### Training (16 GB GPU, Default Config)
|
| 384 |
-
|
| 385 |
-
| Item | Memory |
|
| 386 |
-
|---|---|
|
| 387 |
-
| Model parameters (FP32) | 62 MB |
|
| 388 |
-
| Optimizer states (AdamW, 2×) | 124 MB |
|
| 389 |
-
| Gradients | 62 MB |
|
| 390 |
-
| Batch activations (BS=8, 64×64) | ~500 MB |
|
| 391 |
-
| IFT overhead (only last recursion) | ~50 MB |
|
| 392 |
-
| **Total** | **~800 MB** |
|
| 393 |
-
|
| 394 |
-
Leaves ample room for larger batch sizes or higher resolution on 16 GB.
|
| 395 |
|
| 396 |
---
|
| 397 |
|
| 398 |
-
##
|
| 399 |
-
|
| 400 |
-
### Stage 1: VAE (50K steps)
|
| 401 |
-
- **Data**: ImageNet or COCO (any large image dataset)
|
| 402 |
-
- **Resolution**: 256×256
|
| 403 |
-
- **What to freeze**: Nothing
|
| 404 |
-
- **What to train**: Full VAE
|
| 405 |
-
- **LR**: 1e-4, AdamW, weight_decay=0.01
|
| 406 |
-
- **Key**: Train until L_recon < 0.1
|
| 407 |
-
|
| 408 |
-
### Stage 2: Flow Matching — Low Resolution (100K steps)
|
| 409 |
-
- **Data**: Synthetic captions from teacher (SDXL) + LAION-aesthetic subset
|
| 410 |
-
- **Resolution**: 64×64
|
| 411 |
-
- **What to freeze**: VAE
|
| 412 |
-
- **What to train**: Core + Text Encoder
|
| 413 |
-
- **LR**: 1e-4
|
| 414 |
-
- **Key**: Focus on learning composition and prompt adherence
|
| 415 |
-
|
| 416 |
-
### Stage 3: Flow Matching — Mid Resolution (200K steps)
|
| 417 |
-
- **Data**: Filtered LAION-aesthetic (score > 6.0) + synthetic
|
| 418 |
-
- **Resolution**: 256×256
|
| 419 |
-
- **What to freeze**: VAE
|
| 420 |
-
- **What to train**: Core + Text Encoder
|
| 421 |
-
- **LR**: 5e-5
|
| 422 |
-
- **Key**: Focus on texture and detail
|
| 423 |
-
|
| 424 |
-
### Stage 4: Flow Matching — High Resolution (100K steps)
|
| 425 |
-
- **Data**: High-quality curated + JourneyDB
|
| 426 |
-
- **Resolution**: 512×512
|
| 427 |
-
- **What to freeze**: VAE
|
| 428 |
-
- **What to train**: Core + Text Encoder
|
| 429 |
-
- **LR**: 2e-5
|
| 430 |
-
- **Key**: Focus on fine detail and typography
|
| 431 |
-
|
| 432 |
-
### Stage 5: Consistency Distillation (50K steps)
|
| 433 |
-
- **Data**: Same as Stage 4
|
| 434 |
-
- **What to freeze**: VAE + Text Encoder
|
| 435 |
-
- **What to train**: Core only
|
| 436 |
-
- **LR**: 1e-5
|
| 437 |
-
- **Key**: Distill from own multi-step model to 4-step generation
|
| 438 |
-
|
| 439 |
-
### Stage 6: Editing Fine-tuning (50K steps)
|
| 440 |
-
- **Data**: InstructPix2Pix + MagicBrush + synthetic edit pairs
|
| 441 |
-
- **What to freeze**: VAE
|
| 442 |
-
- **What to train**: Core + Text Encoder
|
| 443 |
-
- **LR**: 1e-5
|
| 444 |
-
- **Key**: Add image conditioning channel
|
| 445 |
-
|
| 446 |
-
---
|
| 447 |
-
|
| 448 |
-
## 10. Deployment Plan for Mobile
|
| 449 |
-
|
| 450 |
-
### Step 1: Quantization
|
| 451 |
-
- INT8 per-channel weight quantization (static)
|
| 452 |
-
- INT8 per-token activation quantization (dynamic)
|
| 453 |
-
- Result: ~4× model size reduction
|
| 454 |
|
| 455 |
-
|
| 456 |
-
-
|
| 457 |
-
|
| 458 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 459 |
|
| 460 |
-
|
| 461 |
-
- After consistency distillation: 4 Euler steps sufficient
|
| 462 |
-
- With further adversarial distillation: 1-2 steps possible
|
| 463 |
-
|
| 464 |
-
### Step 4: Latent Size Optimization
|
| 465 |
-
- f=16 compression: 1024² → 64×64 latents
|
| 466 |
-
- 32 channels per position
|
| 467 |
-
- Total latent: 64×64×32 = 131,072 values ≈ 0.5 MB
|
| 468 |
-
|
| 469 |
-
### Projected Performance
|
| 470 |
-
| Device | Steps | Estimated Time |
|
| 471 |
-
|---|---|---|
|
| 472 |
-
| iPhone 16 Pro (ANE) | 4 | ~0.5-1.0s |
|
| 473 |
-
| Pixel 8 Pro (GPU) | 4 | ~1.0-2.0s |
|
| 474 |
-
| iPhone 14 (GPU) | 8 | ~2.0-3.0s |
|
| 475 |
-
|
| 476 |
-
---
|
| 477 |
-
|
| 478 |
-
## 11. Failure Mode Analysis
|
| 479 |
-
|
| 480 |
-
| Failure Mode | Cause | Detection | Fix |
|
| 481 |
-
|---|---|---|---|
|
| 482 |
-
| **Fixed-point non-convergence** | Recursion doesn't converge | Monitor z change per recursion | Damped update (α=0.5), reduce T_inner |
|
| 483 |
-
| **Oversmoothing** | GLA loses high-frequency detail | Blurry outputs, low LPIPS | Increase token-differential λ, add DW-conv skip |
|
| 484 |
-
| **Mode collapse** | Small model capacity | FID increases, low diversity | Increase num_blocks or dim |
|
| 485 |
-
| **Training instability** | IFT gradient approximation error | Loss spikes | Reduce LR, increase warmup, disable IFT temporarily |
|
| 486 |
-
| **Poor text adherence** | Weak cross-attention | Low CLIP score | Increase cross-attention gates, add more cross-attn layers |
|
| 487 |
-
| **VAE artifacts** | Aggressive compression | Reconstruction artifacts | Lower f (use f=8), increase decoder capacity |
|
| 488 |
-
| **CFG artifacts** | High guidance scale | Oversaturated images | Train with 10% unconditional, use CFG 3-5 range |
|
| 489 |
-
|
| 490 |
-
---
|
| 491 |
-
|
| 492 |
-
## 12. Ablation Plan
|
| 493 |
-
|
| 494 |
-
### Ablation 1: Recursion Depth vs Quality
|
| 495 |
-
- **Vary**: T_inner ∈ {1, 2, 4, 6, 8}, T_outer ∈ {1, 2, 3}
|
| 496 |
-
- **Measure**: FID, CLIP score, inference time
|
| 497 |
-
- **Hypothesis**: Quality plateaus around T_inner=4-6; diminishing returns beyond T_outer=2
|
| 498 |
-
|
| 499 |
-
### Ablation 2: GLA vs Standard Attention
|
| 500 |
-
- **Compare**: GLA blocks vs softmax attention blocks (same dim, same depth)
|
| 501 |
-
- **Measure**: FID, memory, throughput
|
| 502 |
-
- **Hypothesis**: GLA matches attention quality at 3-5× lower memory
|
| 503 |
-
|
| 504 |
-
### Ablation 3: Token Differential
|
| 505 |
-
- **Vary**: λ ∈ {0, 0.05, 0.1, 0.2, learned}
|
| 506 |
-
- **Measure**: FID, sharpness metrics (gradient magnitude)
|
| 507 |
-
- **Hypothesis**: λ=0.1 optimal; λ=0 causes oversmoothing
|
| 508 |
-
|
| 509 |
-
### Ablation 4: IFT vs Full Backprop
|
| 510 |
-
- **Compare**: IFT training vs full BPTT (at small T for memory comparison)
|
| 511 |
-
- **Measure**: Final FID, training memory, convergence speed
|
| 512 |
-
- **Hypothesis**: IFT within 2% FID of full backprop at 8-16× memory savings
|
| 513 |
-
|
| 514 |
-
### Ablation 5: VAE Compression
|
| 515 |
-
- **Vary**: f ∈ {8, 16, 32}, C ∈ {8, 16, 32}
|
| 516 |
-
- **Measure**: rFID, PSNR, generation FID
|
| 517 |
-
- **Hypothesis**: f=16, C=16-32 is the sweet spot for mobile quality
|
| 518 |
-
|
| 519 |
-
### Ablation 6: Abstract State (H-module)
|
| 520 |
-
- **Compare**: With/without abstract state update
|
| 521 |
-
- **Measure**: FID, coherence metrics
|
| 522 |
-
- **Hypothesis**: Abstract state improves global composition coherence
|
| 523 |
-
|
| 524 |
-
---
|
| 525 |
-
|
| 526 |
-
## 13. Editing Roadmap
|
| 527 |
-
|
| 528 |
-
The LRF architecture is designed for editing-readiness through **additive image conditioning**:
|
| 529 |
-
|
| 530 |
-
### Phase 1: Inpainting
|
| 531 |
-
- Add binary mask channel to condition input
|
| 532 |
-
- `z_input = z + z_src * mask + z_noise * (1 - mask)`
|
| 533 |
-
- Train on random masking + MagicBrush data
|
| 534 |
-
|
| 535 |
-
### Phase 2: Image-to-Image Translation
|
| 536 |
-
- Source image encoded to latent, added to noisy latent
|
| 537 |
-
- Noise level controls edit strength (low noise = subtle edit)
|
| 538 |
-
- No architectural changes needed
|
| 539 |
-
|
| 540 |
-
### Phase 3: Instruction-Based Editing (OmniGen-style)
|
| 541 |
-
- Text encoder receives both instruction AND image description
|
| 542 |
-
- Source image latent added as conditioning
|
| 543 |
-
- Train on InstructPix2Pix + SEED-edit data
|
| 544 |
-
|
| 545 |
-
### Phase 4: Super-Resolution
|
| 546 |
-
- Low-res image encoded, upscaled in latent space
|
| 547 |
-
- Decoder generates high-res output
|
| 548 |
-
- Train on paired low/high-res data
|
| 549 |
-
|
| 550 |
-
### Phase 5: Style Transfer & Identity Preservation
|
| 551 |
-
- Reference image encoded to separate latent
|
| 552 |
-
- Cross-attention between reference and generation
|
| 553 |
-
- Train on same-identity different-image pairs (GRIT-Entity)
|
| 554 |
-
|
| 555 |
-
### Phase 6: Multi-Image Conditioning
|
| 556 |
-
- OmniGen-style interleaved image-text input
|
| 557 |
-
- Multiple source images encoded and concatenated in latent space
|
| 558 |
-
- Enables try-on, compositing, scene editing
|
| 559 |
-
|
| 560 |
-
### Why This Works
|
| 561 |
-
The key architectural decisions that enable editing:
|
| 562 |
-
1. **Additive conditioning** preserves spatial correspondence (pixel i in source maps to token i in latent)
|
| 563 |
-
2. **Recursive refinement** naturally handles conditioning — the model can "reason" about how to modify the latent
|
| 564 |
-
3. **Cross-attention to text** at every recursion step allows the model to follow editing instructions progressively
|
| 565 |
-
4. **Same parameter reuse** means editing capability doesn't require new parameters — just new training data
|
| 566 |
|
| 567 |
---
|
| 568 |
|
| 569 |
-
##
|
| 570 |
-
|
| 571 |
-
```python
|
| 572 |
-
# Clone and install
|
| 573 |
-
!pip install torch einops safetensors
|
| 574 |
|
| 575 |
-
#
|
| 576 |
-
|
| 577 |
-
|
|
|
|
|
|
|
| 578 |
|
| 579 |
-
#
|
| 580 |
-
|
| 581 |
-
|
| 582 |
|
| 583 |
-
#
|
| 584 |
-
|
|
|
|
|
|
|
| 585 |
|
| 586 |
-
#
|
| 587 |
-
|
| 588 |
-
|
| 589 |
-
```
|
| 590 |
|
| 591 |
-
|
|
|
|
|
|
|
| 592 |
|
| 593 |
---
|
| 594 |
|
| 595 |
-
## Citation
|
| 596 |
-
|
| 597 |
-
```bibtex
|
| 598 |
-
@software{lrf2026,
|
| 599 |
-
title={LatentRecurrentFlow: A Novel Mobile-First Image Generation Architecture},
|
| 600 |
-
author={LRF Research},
|
| 601 |
-
year={2026},
|
| 602 |
-
url={https://huggingface.co/krystv/LatentRecurrentFlow}
|
| 603 |
-
}
|
| 604 |
-
```
|
| 605 |
-
|
| 606 |
## License
|
| 607 |
|
| 608 |
Apache 2.0
|
|
|
|
| 8 |
- recursive-reasoning
|
| 9 |
- novel-architecture
|
| 10 |
- subquadratic-attention
|
|
|
|
| 11 |
- research
|
| 12 |
library_name: lrf
|
| 13 |
pipeline_tag: text-to-image
|
|
|
|
| 18 |
|
| 19 |
> A genuinely new architecture for image generation designed from scratch to run on consumer devices with 3–4 GB RAM, trained on 16 GB budgets.
|
| 20 |
|
| 21 |
+
## 🔥 v2 Training Results (CIFAR-10)
|
| 22 |
|
| 23 |
+
**Trained end-to-end on CIFAR-10** (50K images, 10 classes) using:
|
| 24 |
+
- **Pre-trained TAESD** (2.4M frozen params) as the VAE — f=8 compression, 32×32 → 4×4×4 latents
|
| 25 |
+
- **1.47M parameter denoising core** with recursive refinement (4 shared blocks × 2 recursions = 8 effective layers)
|
| 26 |
+
- **Rectified flow** matching with SNR-weighted loss and 10% CFG dropout
|
| 27 |
+
- Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999
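
The training ingredients listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `lrf/train_v2.py`: the weight `w(t) = 1 / (t(1-t) + 0.01)` is taken from the v1 formulation and is assumed to carry over to v2, and `NULL_CLASS` is a hypothetical index used to represent a dropped label.

```python
import random

NUM_CLASSES = 10
NULL_CLASS = NUM_CLASSES   # hypothetical index reserved for "no label"
CFG_DROPOUT = 0.10         # 10% label dropout, as stated in the list above
EMA_DECAY = 0.999          # EMA decay, as stated in the list above


def loss_weight(t: float, eps: float = 0.01) -> float:
    """Assumed SNR-style weight w(t) = 1 / (t * (1 - t) + eps)."""
    return 1.0 / (t * (1.0 - t) + eps)


def maybe_drop_label(label: int, rng: random.Random) -> int:
    """Swap the class label for the null class with probability CFG_DROPOUT."""
    return NULL_CLASS if rng.random() < CFG_DROPOUT else label


def ema_update(ema: list[float], params: list[float],
               decay: float = EMA_DECAY) -> list[float]:
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


rng = random.Random(0)
dropped = sum(maybe_drop_label(3, rng) == NULL_CLASS for _ in range(10_000))
print(f"w(0.5) = {loss_weight(0.5):.3f}")
print(f"fraction of dropped labels = {dropped / 10_000:.3f}")
print(ema_update([1.0, 0.0], [0.0, 1.0]))
```

Training on a fraction of unlabeled examples is what later allows conditional and unconditional predictions to be combined at sampling time.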
|
| 28 |
|
| 29 |
+
| Metric | Value |
|
| 30 |
+
|--------|-------|
|
| 31 |
+
| Final Loss | 0.931 |
|
| 32 |
+
| Training Time | ~70 min (CPU only!) |
|
| 33 |
+
| VAE Recon MSE | 0.068 |
|
| 34 |
+
| All 10 classes produce colorful images | ✅ |
|
|
|
|
| 35 |
|
| 36 |
+
### Sample Outputs
|
| 37 |
|
| 38 |
+
VAE Reconstruction (top: original, bottom: TAESD reconstruction):
|
| 39 |
|
| 40 |
+

|
| 41 |
|
| 42 |
+
Training progression (epoch 5 → 30):
|
| 43 |
|
| 44 |
+

|
| 45 |
+

|
| 46 |
|
| 47 |
+
Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):
|
| 48 |
|
| 49 |
+

|
| 50 |
|
| 51 |
+
Loss curve:
|
| 52 |
|
| 53 |
+

|
| 54 |
|
| 55 |
+
### Validation: No Grey Images
|
| 56 |
+
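Every class produces images with non-trivial pixel variance. The values below are the per-class standard deviation and max-minus-min range of generated pixel values; a collapsed grey output would show both near 0, and a range of 2.000 corresponds to the full [-1, 1] span: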
|
| 57 |
```
|
| 58 |
+
airplane : std=0.383, range=1.908 ✅
|
| 59 |
+
automobile : std=0.448, range=2.000 ✅
|
| 60 |
+
bird : std=0.341, range=1.663 ✅
|
| 61 |
+
cat : std=0.521, range=2.000 ✅
|
| 62 |
+
deer : std=0.401, range=1.869 ✅
|
| 63 |
+
dog : std=0.477, range=1.994 ✅
|
| 64 |
+
frog : std=0.366, range=1.996 ✅
|
| 65 |
+
horse : std=0.499, range=1.972 ✅
|
| 66 |
+
ship : std=0.448, range=1.786 ✅
|
| 67 |
+
truck : std=0.510, range=1.944 ✅
|
| 68 |
```
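
A check of this kind can be sketched as follows. This is an illustrative sketch, not the repository's validation script: `pixel_stats`, `is_grey`, and the `min_std` threshold are hypothetical names and values, and pixels are assumed to lie in [-1, 1].

```python
import statistics


def pixel_stats(pixels):
    """Return (std, range) for a flat sequence of pixel values in [-1, 1]."""
    values = list(pixels)
    return statistics.pstdev(values), max(values) - min(values)


def is_grey(pixels, min_std=0.05):
    """Flag a collapsed output; min_std is a hypothetical threshold."""
    std, _ = pixel_stats(pixels)
    return std < min_std


# Two toy "images": a constant grey one and one with alternating contrast.
flat_grey = [0.0] * 64
textured = [(-1.0) ** i * (i / 63) for i in range(64)]

for name, img in [("flat_grey", flat_grey), ("textured", textured)]:
    std, rng_ = pixel_stats(img)
    print(f"{name:10s}: std={std:.3f}, range={rng_:.3f}, grey={is_grey(img)}")
```

In practice the statistics would be computed over a batch of decoded samples per class rather than a single flat list.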
|
| 69 |
|
| 70 |
---
|
| 71 |
|
| 72 |
+
## Architecture Overview
|
| 73 |
|
| 74 |
+
LRF combines five key innovations into a single coherent architecture:
|
|
|
|
|
|
|
|
|
|
| 75 |
|
| 76 |
+
| Innovation | Source Inspiration | What It Does |
|
| 77 |
+
|---|---|---|
|
| 78 |
+
| **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
|
| 79 |
+
| **Efficient Spatial Mixer** | ViG/GLA + DyDiLA | Attention + DW-Conv locality (adapts to sequence length) |
|
| 80 |
+
| **Pre-trained TAESD VAE** | madebyollin/taesd | f=8 compression, 2.4M params, works out-of-box |
|
| 81 |
+
| **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
|
| 82 |
+
| **Additive Image Conditioning** | OmniGen | Same core supports text-to-image AND editing |
|
| 83 |
|
| 84 |
+
### v2 Architecture (Trained & Validated)
|
|
|
|
|
|
|
|
|
|
| 85 |
|
| 86 |
+
| Component | Parameters | Description |
|
| 87 |
+
|---|---|---|
|
| 88 |
+
| TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder |
|
| 89 |
+
| Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions |
|
| 90 |
+
| Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 |
|
| 91 |
+
| **Trainable Total** | **1.47M** | |
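
The "4 shared blocks × 2 inner recursions" entry means the same four blocks are applied twice, so depth grows without adding parameters. A toy sketch of that weight sharing, assuming a hypothetical `ToyBlock` (a scale and a shift) in place of the real denoising block:

```python
class ToyBlock:
    """Stand-in for one denoising block: a single scale and shift."""

    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift
        self.calls = 0  # how many times this block was applied

    def __call__(self, z):
        self.calls += 1
        return [self.scale * x + self.shift for x in z]

    def num_params(self):
        return 2  # scale and shift


def recursive_core(z, blocks, num_recursions):
    """Apply the same list of blocks num_recursions times in sequence."""
    for _ in range(num_recursions):
        for block in blocks:
            z = block(z)
    return z


blocks = [ToyBlock(0.9, 0.1) for _ in range(4)]
out = recursive_core([1.0, -1.0], blocks, num_recursions=2)

effective_layers = sum(b.calls for b in blocks)       # layer applications
unique_params = sum(b.num_params() for b in blocks)   # counted once per block
unshared_params = effective_layers * blocks[0].num_params()
print(effective_layers, unique_params, unshared_params)
```

An unshared stack of the same effective depth would need twice the parameters in this toy setting; the real core may add per-recursion embeddings that this sketch omits.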
|
| 92 |
|
| 93 |
+
### How It Works
|
| 94 |
|
| 95 |
+
```python
|
| 96 |
+
# 1. Encode image to latent (TAESD, frozen)
|
| 97 |
+
z_0 = vae.encode(image) # [B, 4, 4, 4]
|
| 98 |
|
| 99 |
+
# 2. Add noise (rectified flow)
|
| 100 |
+
z_t = (1-t) * z_0 + t * noise # Linear interpolation
|
|
|
|
|
|
|
| 101 |
|
| 102 |
+
# 3. Predict velocity (recursive denoising core)
|
| 103 |
+
v = core(z_t, t, class_label) # 4 blocks × 2 recursions
|
| 104 |
|
| 105 |
+
# 4. Training target
|
| 106 |
+
loss = MSE(v, noise - z_0) # Velocity matching
|
| 107 |
|
| 108 |
+
# 5. Sampling (Euler ODE solver, t=1→0)
|
| 109 |
+
for t in timesteps:              # t runs from 1 down to 0 in steps of dt
    v = core(z, t, class_label)
    z = z - dt * v               # Euler step toward t=0
|
|
|
|
|
|
|
|
|
|
| 112 |
|
| 113 |
+
# 6. Decode to image (TAESD, frozen)
|
| 114 |
+
image = vae.decode(z)
|
| 115 |
```
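
The interpolation, velocity target, and Euler sampler above can be exercised end to end on a toy latent. This is a sketch under stated assumptions: `oracle` is a hypothetical stand-in that returns the exact target velocity instead of a trained core, and plain Python lists replace tensors.

```python
import random


def interpolate(z0, noise, t):
    """z_t = (1 - t) * z0 + t * noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, noise)]


def target_velocity(z0, noise):
    """Velocity target v = noise - z0 (the time derivative of z_t)."""
    return [b - a for a, b in zip(z0, noise)]


def euler_sample(velocity_fn, z, num_steps):
    """Integrate dz/dt = v from t=1 down to t=0 with fixed-size Euler steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


rng = random.Random(0)
z0 = [0.5, -0.25, 1.0, 0.0]
noise = [rng.gauss(0.0, 1.0) for _ in z0]


def oracle(z, t):
    """Hypothetical perfect model: always returns the true velocity."""
    return target_velocity(z0, noise)


z_rec = euler_sample(oracle, list(noise), num_steps=8)
print(max(abs(a - b) for a, b in zip(z_rec, z0)))
```

With an exact, constant velocity the Euler solver recovers `z0` from pure noise up to floating-point error; a trained core only approximates this, which is why more steps generally help.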
|
| 116 |
|
| 117 |
---
|
| 118 |
|
| 119 |
+
## Quick Start
|
| 120 |
|
| 121 |
+
### Generate from trained model:
|
| 122 |
+
```python
|
| 123 |
+
import torch
|
| 124 |
+
from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
|
| 125 |
+
from diffusers import AutoencoderTiny
|
| 126 |
|
| 127 |
+
# Load
|
| 128 |
+
vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
|
| 129 |
+
# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust
ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
|
| 130 |
+
model = LRFv2(ckpt['config'])
|
| 131 |
+
for name, p in model.named_parameters():
    p.data.copy_(ckpt['ema_params'][name])
|
| 133 |
+
model.eval()
|
| 134 |
|
| 135 |
+
# Generate (class 3 = cat)
|
| 136 |
+
scheduler = RectifiedFlowScheduler()
|
| 137 |
+
labels = torch.full((4,), 3, dtype=torch.long)
|
| 138 |
+
z = scheduler.sample(model, (4,4,4,4), labels, num_steps=50, cfg_scale=3.0)
|
| 139 |
+
images = vae.decode(z).sample.clamp(-1, 1)
|
| 140 |
```
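
The `cfg_scale` argument refers to classifier-free guidance. The internals of `RectifiedFlowScheduler.sample` are not shown here, so the following is only the standard guidance formula applied to velocities, with `cfg_velocity` as a hypothetical helper name; the repository's implementation may differ.

```python
def cfg_velocity(v_cond, v_uncond, cfg_scale):
    """Standard classifier-free guidance: v = v_uncond + s * (v_cond - v_uncond)."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


v_cond = [1.0, 0.0, -0.5]    # toy prediction with the class label
v_uncond = [0.5, 0.0, 0.5]   # toy prediction with the label dropped

for scale in (0.0, 1.0, 3.0):
    print(scale, cfg_velocity(v_cond, v_uncond, scale))
```

A scale of 1.0 reproduces the conditional prediction, 0.0 the unconditional one, and values above 1.0 extrapolate away from the unconditional prediction.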
|
| 141 |
|
| 142 |
+
### Train from scratch:
|
| 143 |
+
```bash
|
| 144 |
+
python lrf/train_v2.py
|
| 145 |
```
|
| 146 |
|
| 147 |
---
|
| 148 |
|
| 149 |
+
## Files
|
| 150 |
|
| 151 |
+
| File | Description |
|
|
|
|
|
|
|
| 152 |
|---|---|
|
| 153 |
+
| `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) |
|
| 154 |
+
| `lrf/train_v2.py` | CIFAR-10 training pipeline with TAESD VAE |
|
| 155 |
+
| `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) |
|
| 156 |
+
| `trained/config.json` | Model configuration |
|
| 157 |
+
| `samples/` | Generated sample images at various epochs |
|
| 158 |
+
| `lrf/model.py` | v1 architecture (research prototype) |
|
| 159 |
+
| `lrf/training.py` | v1 training pipeline |
|
| 160 |
+
| `lrf/pipeline.py` | HF-compatible inference pipeline |
|
| 161 |
+
| `notebook.ipynb` | Interactive walkthrough |
|
| 162 |
|
| 163 |
---
|
| 164 |
|
| 165 |
+
## Training Curriculum (Full Scale)
|
| 166 |
|
| 167 |
+
| Stage | Resolution | Data | Freeze | Train | LR | Steps |
|
| 168 |
+
|---|---|---|---|---|---|---|
|
| 169 |
+
| 1. VAE | 256² | ImageNet/COCO | - | VAE | 1e-4 | 50K |
|
| 170 |
+
| 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K |
|
| 171 |
+
| 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K |
|
| 172 |
+
| 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K |
|
| 173 |
+
| 5. Distill | 512² | Same as 4 | VAE+Text | Core | 1e-5 | 50K |
|
| 174 |
+
| 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K |
|
| 175 |
|
| 176 |
+
**Shortcut (used for the CIFAR-10 run in this repo):** Skip Stage 1 entirely by using the pre-trained TAESD VAE, and start directly at Stage 2.
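
The step budget in the table can be tallied with a small sketch. The `Stage` dataclass and `total_steps` helper are hypothetical names introduced for illustration; the numbers are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    resolution: int  # square image side in pixels
    lr: float
    steps: int


CURRICULUM = [
    Stage("1. VAE", 256, 1e-4, 50_000),
    Stage("2. Flow (low)", 64, 1e-4, 100_000),
    Stage("3. Flow (mid)", 256, 5e-5, 200_000),
    Stage("4. Flow (high)", 512, 2e-5, 100_000),
    Stage("5. Distill", 512, 1e-5, 50_000),
    Stage("6. Editing", 512, 1e-5, 50_000),
]


def total_steps(stages, skip_vae=False):
    """Sum optimizer steps, optionally skipping Stage 1 (pre-trained VAE)."""
    return sum(
        s.steps for s in stages
        if not (skip_vae and s.name.startswith("1."))
    )


print(total_steps(CURRICULUM))                 # full curriculum
print(total_steps(CURRICULUM, skip_vae=True))  # with the pre-trained VAE
```

Using a pre-trained VAE removes the 50K VAE steps from the 550K-step total, leaving 500K steps across Stages 2 to 6.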
|
| 177 |
|
| 178 |
---
|
| 179 |
|
| 180 |
+
## Relevant Papers (Grouped by Problem)
|
| 181 |
|
| 182 |
+
### Subquadratic Spatial Mixing
|
| 183 |
+
- PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34× speedup
|
| 184 |
+
- DiMSUM (2411.04168): Mamba + wavelet, FID 2.11
|
| 185 |
+
- ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
|
| 186 |
+
- DyDiLA (2601.13683): Dynamic differential linear attention
|
| 187 |
|
| 188 |
+
### Recursive Reasoning
|
| 189 |
+
- HRM (2506.21734): Fixed-point recurrence, O(1) memory via IFT
|
| 190 |
+
- TRM (2510.04871): 7M params → 45% ARC-AGI-1
|
| 191 |
|
| 192 |
+
### Compact Latent Spaces
|
| 193 |
+
- SANA DC-AE (2410.10629): f=32, PSNR 29.29
|
| 194 |
+
- SnapGen (2412.09619): 1.38M tiny decoder
|
| 195 |
+
- TAESD (madebyollin): 2.4M params, f=8, works immediately
|
| 196 |
|
| 197 |
+
### Few-Step Generation
|
| 198 |
+
- Consistency Models (2303.01469): One-step from diffusion
|
| 199 |
+
- LCM (2310.04378): 2-4 step via consistency distillation
|
|
|
|
| 200 |
|
| 201 |
+
### Editing Architectures
|
| 202 |
+
- OmniGen (2409.11340): Unified generation + editing
|
| 203 |
+
- InstructPix2Pix (2211.09800): Text-guided editing
|
| 204 |
|
| 205 |
---
|
| 206 |
|
| 207 |
## License
|
| 208 |
|
| 209 |
Apache 2.0
|