---
tags:
- image-generation
- latent-recurrent-flow
- lrf
- mobile-first
- flow-matching
- recursive-reasoning
- novel-architecture
- subquadratic-attention
- research
library_name: lrf
pipeline_tag: text-to-image
license: apache-2.0
---
# LatentRecurrentFlow (LRF): A Novel Mobile-First Image Generation Architecture
> A genuinely new architecture for image generation designed from scratch to run on consumer devices with 3–4 GB RAM, trained on 16 GB budgets.
## 🔥 v2 Training Results (CIFAR-10)
**Trained end-to-end on CIFAR-10** (50K images, 10 classes) using:
- **Pre-trained TAESD** (2.4M frozen params) as the VAE: f=8 compression, 32×32 → 4×4×4 latents
- **1.47M parameter denoising core** with recursive refinement (4 shared blocks × 2 recursions = 8 effective layers)
- **Rectified flow** matching with SNR-weighted loss and 10% CFG dropout
- Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999
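The cosine schedule and EMA named in the list above can be sketched without any framework. This is a minimal illustration, not the code in `lrf/train_v2.py`: the function names and the base learning rate of `1e-3` are assumptions made for the example, and only the decay of 0.999 comes from the list.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_update(ema, params, decay=0.999):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Hypothetical base learning rate; the real value lives in the training script.
BASE_LR = 1e-3
lr_start = cosine_lr(0, 100, BASE_LR)
lr_mid = cosine_lr(50, 100, BASE_LR)
lr_end = cosine_lr(100, 100, BASE_LR)

# With decay 0.999 the EMA moves 0.1% of the way towards the live weights per step.
ema_after_one_step = ema_update([0.0], [1.0])
```

With a decay of 0.999 the averaged weights reflect roughly the last thousand optimizer steps, which is why the Quick Start loads the EMA parameters rather than the raw ones.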
| Metric | Value |
|--------|-------|
| Final Loss | 0.931 |
| Training Time | ~70 min (CPU only!) |
| VAE Recon MSE | 0.068 |
| All 10 classes produce colorful images | ✅ |
### Sample Outputs
VAE Reconstruction (top: original, bottom: TAESD reconstruction):

Training progression (epoch 5 → 30):


Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):

Loss curve:

### Validation: No Grey Images
Every class produces images with proper variance:
```
airplane   : std=0.383, range=1.908 ✅
automobile : std=0.448, range=2.000 ✅
bird       : std=0.341, range=1.663 ✅
cat        : std=0.521, range=2.000 ✅
deer       : std=0.401, range=1.869 ✅
dog        : std=0.477, range=1.994 ✅
frog       : std=0.366, range=1.996 ✅
horse      : std=0.499, range=1.972 ✅
ship       : std=0.448, range=1.786 ✅
truck      : std=0.510, range=1.944 ✅
```
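A check of this kind can be reproduced in a few lines. The sketch below is an assumption about how such a test might look, not the script that produced the numbers above: `grey_check` and both thresholds are hypothetical, and pixel values are taken to lie in [-1, 1], so the largest possible range is 2.0.

```python
import statistics

def grey_check(pixels, min_std=0.1, min_range=0.5):
    """Return (std, range, passed) for a flat list of pixel values in [-1, 1].

    min_std and min_range are illustrative thresholds, not values from the repo.
    """
    std = statistics.pstdev(pixels)
    value_range = max(pixels) - min(pixels)
    return std, value_range, (std >= min_std and value_range >= min_range)

# A collapsed, uniformly grey output fails the check.
flat_std, flat_range, flat_ok = grey_check([0.0] * 16)

# An output that uses the full [-1, 1] range passes it.
rich_std, rich_range, rich_ok = grey_check([-1.0, 1.0] * 8)
```

A near-zero standard deviation is the signature of the "grey image" failure mode, where the model outputs the dataset mean regardless of class.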
---
## Architecture Overview
LRF combines five key innovations into a single coherent architecture:
| Innovation | Source Inspiration | What It Does |
|---|---|---|
| **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
| **Efficient Spatial Mixer** | ViG/GLA + DyDiLA | Attention + DW-Conv locality (adapts to sequence length) |
| **Pre-trained TAESD VAE** | madebyollin/taesd | f=8 compression, 2.4M params, works out-of-box |
| **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
| **Additive Image Conditioning** | OmniGen | Same core supports text-to-image AND editing |
### v2 Architecture (Trained & Validated)
| Component | Parameters | Description |
|---|---|---|
| TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder |
| Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions |
| Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 |
| **Trainable Total** | **1.47M** | |
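The "4 shared blocks × 2 recursions" entry means the same four blocks are applied twice, giving eight block applications for the parameter cost of four. The toy sketch below illustrates only that weight-sharing pattern. `run_core` and the lambda "blocks" are hypothetical stand-ins, and the real `RecursiveLatentCore` may also inject timestep or class conditioning at each pass.

```python
def run_core(x, blocks, num_recursions):
    """Apply the same list of blocks num_recursions times, reusing their weights."""
    applications = 0
    for _ in range(num_recursions):
        for block in blocks:
            x = block(x)
            applications += 1
    return x, applications

# Four toy "blocks"; each one adds a constant so the result is easy to verify.
blocks = [lambda x, k=k: x + k for k in range(1, 5)]

out, effective_depth = run_core(0, blocks, num_recursions=2)
unique_blocks = len(blocks)
```

Here `effective_depth` is 8 while `unique_blocks` stays at 4, which is the trade the architecture makes: more compute per parameter rather than more parameters.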
### How It Works
```python
# Schematic pseudocode: vae, core, and MSE stand in for the real modules.
# 1. Encode image to latent (TAESD, frozen)
z_0 = vae.encode(image)            # [B, 4, 4, 4]
# 2. Add noise (rectified flow): t=0 is data, t=1 is pure noise
z_t = (1 - t) * z_0 + t * noise    # Linear interpolation
# 3. Predict velocity (recursive denoising core)
v = core(z_t, t, class_label)      # 4 blocks × 2 recursions
# 4. Training target
loss = MSE(v, noise - z_0)         # Velocity matching
# 5. Sampling (Euler ODE solver, t=1→0), starting from z = noise
for t in timesteps:                # decreasing from 1 towards 0 in steps of dt
    v = core(z, t, class_label)
    z = z - dt * v
# 6. Decode to image (TAESD, frozen)
image = vae.decode(z)
```
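To see why the Euler update `z = z - dt * v` runs from noise back to data, it helps to plug in the ideal velocity. The sketch below uses plain Python lists in place of tensors and replaces the learned core with the exact target `noise - z_0`. All function names are illustrative and are not part of the `lrf` package.

```python
def interpolate(z0, noise, t):
    """Rectified-flow path: z_t = (1 - t) * z_0 + t * noise."""
    return [(1.0 - t) * a + t * n for a, n in zip(z0, noise)]

def target_velocity(z0, noise):
    """Velocity along the straight path: d z_t / dt = noise - z_0."""
    return [n - a for a, n in zip(z0, noise)]

def euler_sample(z, velocity_fn, num_steps):
    """Integrate from t=1 down to t=0 with fixed step dt = 1 / num_steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z

z0 = [0.5, -0.25, 1.0]       # toy "clean latent"
noise = [-1.0, 0.75, 0.2]    # toy noise sample
v_star = target_velocity(z0, noise)

# A perfect model would return v_star everywhere on this path.
recovered = euler_sample(noise, lambda z, t: v_star, num_steps=10)
```

Because the ideal velocity is constant along a straight path, Euler integration is exact here for any step count. A trained core only approximates this field, which is why more steps (such as the 50 used in the Quick Start) can still help in practice.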
---
## Quick Start
### Generate from trained model:
```python
import torch
from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
from diffusers import AutoencoderTiny
# Load
vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
model = LRFv2(ckpt['config'])
for name, p in model.named_parameters():
    p.data.copy_(ckpt['ema_params'][name])
model.eval()
# Generate (class 3 = cat)
scheduler = RectifiedFlowScheduler()
labels = torch.full((4,), 3, dtype=torch.long)
z = scheduler.sample(model, (4,4,4,4), labels, num_steps=50, cfg_scale=3.0)
images = vae.decode(z).sample.clamp(-1, 1)
```
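The `cfg_scale` argument relies on the 10% label dropout used during training. The sketch below shows the standard classifier-free guidance recipe under stated assumptions: `NULL_LABEL`, `maybe_drop_label`, and `cfg_velocity` are hypothetical names, and `RectifiedFlowScheduler.sample` may combine the two predictions differently.

```python
import random

NULL_LABEL = 10  # hypothetical index reserved for the "unconditional" class

def maybe_drop_label(label, rng, drop_prob=0.1, null_label=NULL_LABEL):
    """Training: replace the class label by the null label with prob. drop_prob."""
    return null_label if rng.random() < drop_prob else label

def cfg_velocity(v_cond, v_uncond, cfg_scale):
    """Sampling: extrapolate from the unconditional towards the conditional velocity."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]

rng = random.Random(0)
labels = [maybe_drop_label(3, rng) for _ in range(1000)]
dropped_fraction = labels.count(NULL_LABEL) / len(labels)

# cfg_scale=1.0 reproduces the conditional prediction; larger values push past it.
guided = cfg_velocity([1.0, 2.0], [0.0, 0.5], cfg_scale=3.0)
```

Under this formulation each sampling step needs two forward passes of the core, one with the class label and one with the null label, so guidance roughly doubles the per-step cost.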
### Train from scratch:
```bash
python lrf/train_v2.py
```
---
## Files
| File | Description |
|---|---|
| `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) |
| `lrf/train_v2.py` | CIFAR-10 training pipeline with TAESD VAE |
| `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) |
| `trained/config.json` | Model configuration |
| `samples/` | Generated sample images at various epochs |
| `lrf/model.py` | v1 architecture (research prototype) |
| `lrf/training.py` | v1 training pipeline |
| `lrf/pipeline.py` | HF-compatible inference pipeline |
| `notebook.ipynb` | Interactive walkthrough |
---
## Training Curriculum (Full Scale)
| Stage | Resolution | Data | Freeze | Train | LR | Steps |
|---|---|---|---|---|---|---|
| 1. VAE | 256² | ImageNet/COCO | - | VAE | 1e-4 | 50K |
| 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K |
| 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K |
| 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K |
| 5. Distill | 512² | Same as 4 | VAE+Text | Core | 1e-5 | 50K |
| 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K |
**Shortcut (proven in this repo):** Skip Stage 1 entirely by using pre-trained TAESD. Start directly at Stage 2.
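For planning purposes, the step budget implied by the table can be tallied directly. The snippet below only restates the table as data. The `STAGES` list is an illustrative structure and is not a config format read by the training scripts.

```python
# (stage name, optimizer steps) copied from the curriculum table.
STAGES = [
    ("1. VAE", 50_000),
    ("2. Flow (low)", 100_000),
    ("3. Flow (mid)", 200_000),
    ("4. Flow (high)", 100_000),
    ("5. Distill", 50_000),
    ("6. Editing", 50_000),
]

total_steps = sum(steps for _, steps in STAGES)

# The TAESD shortcut removes Stage 1 from the budget.
shortcut_steps = sum(steps for name, steps in STAGES if not name.startswith("1."))
```

The full curriculum comes to 550K steps, and the shortcut reduces it to 500K while also removing the need to train a VAE at all.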
---
## Relevant Papers (Grouped by Problem)
### Subquadratic Spatial Mixing
- PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34× speedup
- DiMSUM (2411.04168): Mamba + wavelet, FID 2.11
- ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
- DyDiLA (2601.13683): Dynamic differential linear attention
### Recursive Reasoning
- HRM (2506.21734): Fixed-point recurrence, O(1) memory via IFT
- TRM (2510.04871): 7M params → 45% ARC-AGI-1
### Compact Latent Spaces
- SANA DC-AE (2410.10629): f=32, PSNR 29.29
- SnapGen (2412.09619): 1.38M tiny decoder
- TAESD (madebyollin): 2.4M params, f=8, works immediately
### Few-Step Generation
- Consistency Models (2303.01469): One-step from diffusion
- LCM (2310.04378): 2-4 step via consistency distillation
### Editing Architectures
- OmniGen (2409.11340): Unified generation + editing
- InstructPix2Pix (2211.09800): Text-guided editing
---
## License
Apache 2.0