---
tags:
- image-generation
- latent-recurrent-flow
- lrf
- mobile-first
- flow-matching
- recursive-reasoning
- novel-architecture
- subquadratic-attention
- research
library_name: lrf
pipeline_tag: text-to-image
license: apache-2.0
---

# LatentRecurrentFlow (LRF): A Novel Mobile-First Image Generation Architecture
|
|
> A new image generation architecture, designed from scratch to run on consumer devices with 3–4 GB of RAM and to train within a 16 GB budget.
|
|
## 🔥 v2 Training Results (CIFAR-10)
|
|
**Trained end-to-end on CIFAR-10** (50K images, 10 classes) using:
- **Pre-trained TAESD** (2.4M frozen params) as the VAE: f=8 compression, 32×32 images → 4×4×4 latents
- **1.47M-parameter denoising core** with recursive refinement (4 shared blocks × 2 recursions = 8 effective layers)
- **Rectified flow** matching with SNR-weighted loss and 10% CFG dropout
- Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999
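
Two of the training ingredients listed above, 10% CFG dropout and EMA with decay 0.999, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `lrf/train_v2.py`: the function names, the use of index 10 as a "null" class, and the list-of-floats parameters are all hypothetical.

```python
import random

NUM_CLASSES = 10          # CIFAR-10
NULL_CLASS = NUM_CLASSES  # assumption: one extra index stands for "no condition"


def drop_labels(labels, p_drop=0.1, rng=random):
    """Replace each label with NULL_CLASS with probability p_drop (CFG dropout)."""
    return [NULL_CLASS if rng.random() < p_drop else y for y in labels]


def ema_update(ema_params, params, decay=0.999):
    """In-place exponential moving average: ema = decay * ema + (1 - decay) * param."""
    for i, p in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * p


rng = random.Random(0)
labels = [3] * 10_000
dropped = drop_labels(labels, p_drop=0.1, rng=rng)
drop_rate = dropped.count(NULL_CLASS) / len(dropped)  # close to 0.1

ema = [0.0, 0.0]
ema_update(ema, [1.0, -2.0])  # ema moves 0.1% of the way toward the new weights
```

Dropping the label for a fraction of training samples is what lets a single network produce both conditional and unconditional predictions, which guidance needs at sampling time.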
|
|
| | Metric | Value | |
| |--------|-------| |
| | Final Loss | 0.931 | |
| | Training Time | ~70 min (CPU only!) | |
| | VAE Recon MSE | 0.068 | |
| | All 10 classes produce colorful images | ✅ | |
|
|
### Sample Outputs

VAE reconstruction (top: original, bottom: TAESD reconstruction):



Training progression (epoch 5 → 30):




Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):



Loss curve:


|
|
### Validation: No Grey Images
Every class produces images with substantial pixel variance (pixel values lie in [-1, 1], so the range is at most 2.0):
```
airplane   : std=0.383, range=1.908 ✅
automobile : std=0.448, range=2.000 ✅
bird       : std=0.341, range=1.663 ✅
cat        : std=0.521, range=2.000 ✅
deer       : std=0.401, range=1.869 ✅
dog        : std=0.477, range=1.994 ✅
frog       : std=0.366, range=1.996 ✅
horse      : std=0.499, range=1.972 ✅
ship       : std=0.448, range=1.786 ✅
truck      : std=0.510, range=1.944 ✅
```
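
A check of this kind can be sketched as follows. The helper names and the `min_std` threshold are hypothetical, and the repo may compute its statistics differently (for example per batch, or with a sample rather than population standard deviation).

```python
import statistics


def pixel_stats(pixels):
    """Return (std, range) for a flat list of pixel values in [-1, 1]."""
    return statistics.pstdev(pixels), max(pixels) - min(pixels)


def is_grey(pixels, min_std=0.05):
    """Flag a collapsed output. min_std is a hypothetical threshold."""
    std, _ = pixel_stats(pixels)
    return std < min_std


flat_grey = [0.01, 0.0, -0.01, 0.0]    # almost constant image
varied = [-1.0, -0.5, 0.0, 0.5, 1.0]   # spans the full [-1, 1] interval

grey_std, grey_range = pixel_stats(flat_grey)
varied_std, varied_range = pixel_stats(varied)
```

A near-zero standard deviation indicates that the sampler collapsed to a uniform image, which is the failure mode this validation rules out.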
|
|
---

## Architecture Overview

LRF combines five key innovations into a single coherent architecture:
|
|
| | Innovation | Source Inspiration | What It Does | |
| |---|---|---| |
| | **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop | |
| | **Efficient Spatial Mixer** | ViG/GLA + DyDiLA | Attention + DW-Conv locality (adapts to sequence length) | |
| | **Pre-trained TAESD VAE** | madebyollin/taesd | f=8 compression, 2.4M params, works out-of-box | |
| | **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling | |
| | **Additive Image Conditioning** | OmniGen | Same core supports text-to-image AND editing | |
|
|
### v2 Architecture (Trained & Validated)
|
|
| | Component | Parameters | Description | |
| |---|---|---| |
| | TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder | |
| | Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions | |
| | Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 | |
| | **Trainable Total** | **1.47M** | | |
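
The "4 shared blocks × 2 inner recursions" entry means the same four blocks are applied twice, so effective depth doubles while the parameter count stays fixed. The toy sketch below illustrates only that weight-sharing pattern. The scalar `make_block` stand-in and the `recursive_core` name are assumptions for the example; the real blocks in `lrf/model_v2.py` are spatial mixers, not scalar updates.

```python
def make_block(scale):
    """Stand-in for a learned block: a residual update h <- h + scale * h."""
    def block(h):
        return h + scale * h
    return block


def recursive_core(h, blocks, num_recursions):
    """Apply the same blocks num_recursions times; return (output, layers applied)."""
    applied = 0
    for _ in range(num_recursions):
        for block in blocks:      # same objects on every pass, so weights are shared
            h = block(h)
            applied += 1
    return h, applied


blocks = [make_block(0.5) for _ in range(4)]   # 4 unique sets of "weights"
out, effective_depth = recursive_core(1.0, blocks, num_recursions=2)
```

Here `effective_depth` is 8 while `len(blocks)` stays at 4, mirroring the "8 effective layers" figure quoted for the denoising core.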
|
|
### How It Works

```python
# Pseudocode overview of training and sampling

# 1. Encode image to latent (TAESD, frozen)
z_0 = vae.encode(image)              # [B, 4, 4, 4]

# 2. Add noise (rectified flow)
z_t = (1 - t) * z_0 + t * noise      # Linear interpolation, t in [0, 1]

# 3. Predict velocity (recursive denoising core)
v = core(z_t, t, class_label)        # 4 blocks × 2 recursions

# 4. Training target
loss = MSE(v, noise - z_0)           # Velocity matching

# 5. Sampling (Euler ODE solver, t=1→0), starting from z = pure noise
for t in timesteps:                  # dt = 1 / num_steps
    v = core(z, t, class_label)
    z = z - dt * v

# 6. Decode to image (TAESD, frozen)
image = vae.decode(z)
```
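
The rectified-flow steps above can be checked on a toy example without any model. The sketch below is plain Python with hypothetical helper names, and it substitutes an oracle that returns the exact velocity for the trained core. Under that assumption, Euler integration from t=1 to t=0 recovers the original latent.

```python
import random


def interpolate(z0, noise, t):
    """Rectified-flow forward process: z_t = (1 - t) * z0 + t * noise."""
    return [(1 - t) * a + t * b for a, b in zip(z0, noise)]


def velocity_target(z0, noise):
    """Training target: dz_t/dt = noise - z0 (constant along the straight path)."""
    return [b - a for a, b in zip(z0, noise)]


def euler_sample(velocity_fn, z, num_steps):
    """Integrate from t=1 (noise) to t=0 (data) with z <- z - dt * v."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


rng = random.Random(0)
z0 = [0.5, -0.25, 1.0]                      # stand-in for a flattened latent
noise = [rng.gauss(0, 1) for _ in z0]
target = velocity_target(z0, noise)

# Oracle returning the exact velocity; the trained core would replace it.
recovered = euler_sample(lambda z, t: target, noise, num_steps=10)
midpoint = interpolate(z0, noise, 0.5)
```

Because the path between data and noise is a straight line, the target velocity does not depend on `t`, which is why few-step Euler sampling is plausible for this objective.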
|
|
---

## Quick Start

### Generate from trained model:
```python
import torch
from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
from diffusers import AutoencoderTiny

# Load the frozen TAESD VAE and the trained checkpoint
vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
model = LRFv2(ckpt['config'])

# Copy the EMA weights into the model
for name, p in model.named_parameters():
    p.data.copy_(ckpt['ema_params'][name])
model.eval()

# Generate 4 samples of class 3 (cat)
scheduler = RectifiedFlowScheduler()
labels = torch.full((4,), 3, dtype=torch.long)
z = scheduler.sample(model, (4, 4, 4, 4), labels, num_steps=50, cfg_scale=3.0)  # latents: [B, 4, 4, 4]
images = vae.decode(z).sample.clamp(-1, 1)  # pixel values in [-1, 1]
```
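
The `cfg_scale` argument suggests classifier-free guidance at sampling time. How `RectifiedFlowScheduler.sample` applies it internally is not shown in this README. The sketch below uses the standard formulation, which blends the conditional and unconditional velocity predictions, and `guided_velocity` is a hypothetical helper name.

```python
def guided_velocity(v_cond, v_uncond, cfg_scale):
    """Standard classifier-free guidance: v = v_uncond + s * (v_cond - v_uncond)."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


v_cond = [1.0, 0.0]     # prediction with the class label
v_uncond = [0.5, 0.5]   # prediction with the label dropped

no_guidance = guided_velocity(v_cond, v_uncond, cfg_scale=1.0)  # equals v_cond
strong = guided_velocity(v_cond, v_uncond, cfg_scale=3.0)       # pushed past v_cond
```

With a scale of 1.0 the result is the plain conditional prediction. Larger values extrapolate away from the unconditional one, which typically trades sample diversity for class fidelity.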
|
|
### Train from scratch:
```bash
python lrf/train_v2.py
```

---

## Files
|
|
| | File | Description | |
| |---|---| |
| | `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) | |
| | `lrf/train_v2.py` | CIFAR-10 training pipeline with TAESD VAE | |
| | `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) | |
| | `trained/config.json` | Model configuration | |
| | `samples/` | Generated sample images at various epochs | |
| | `lrf/model.py` | v1 architecture (research prototype) | |
| | `lrf/training.py` | v1 training pipeline | |
| | `lrf/pipeline.py` | HF-compatible inference pipeline | |
| | `notebook.ipynb` | Interactive walkthrough | |
|
|
---

## Training Curriculum (Full Scale)
|
|
| | Stage | Resolution | Data | Freeze | Train | LR | Steps | |
| |---|---|---|---|---|---|---| |
| | 1. VAE | 256² | ImageNet/COCO | - | VAE | 1e-4 | 50K | |
| | 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K | |
| | 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K | |
| | 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K | |
| | 5. Distill | 512² | Same as 4 | VAE+Text | Core | 1e-5 | 50K | |
| | 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K | |
|
|
**Shortcut (used in this repo):** Skip Stage 1 entirely by using the pre-trained TAESD VAE and start directly at Stage 2.
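
The curriculum and the shortcut can be expressed as a small configuration, sketched below. The `Stage` dataclass, the stage identifiers, and the `plan` helper are hypothetical and not part of the repo. The resolutions, learning rates, and step counts are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    resolution: int
    lr: float
    steps: int


# Values copied from the curriculum table above.
CURRICULUM = [
    Stage("vae", 256, 1e-4, 50_000),
    Stage("flow_low", 64, 1e-4, 100_000),
    Stage("flow_mid", 256, 5e-5, 200_000),
    Stage("flow_high", 512, 2e-5, 100_000),
    Stage("distill", 512, 1e-5, 50_000),
    Stage("editing", 512, 1e-5, 50_000),
]


def plan(use_pretrained_vae=True):
    """Return the stages to run; skip VAE training when TAESD is reused."""
    return [s for s in CURRICULUM if not (use_pretrained_vae and s.name == "vae")]


full_steps = sum(s.steps for s in plan(use_pretrained_vae=False))     # 550,000
shortcut_steps = sum(s.steps for s in plan(use_pretrained_vae=True))  # 500,000
```

Reusing TAESD removes the 50K-step VAE stage, leaving 500K of the 550K total steps.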
|
|
---

## Relevant Papers (Grouped by Problem)

### Subquadratic Spatial Mixing
- PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34× speedup
- DiMSUM (2411.04168): Mamba + wavelet, FID 2.11
- ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
- DyDiLA (2601.13683): Dynamic differential linear attention

### Recursive Reasoning
- HRM (2506.21734): Fixed-point recurrence, O(1) memory via IFT
- TRM (2510.04871): 7M params → 45% ARC-AGI-1

### Compact Latent Spaces
- SANA DC-AE (2410.10629): f=32, PSNR 29.29
- SnapGen (2412.09619): 1.38M tiny decoder
- TAESD (madebyollin): 2.4M params, f=8, works immediately

### Few-Step Generation
- Consistency Models (2303.01469): One-step from diffusion
- LCM (2310.04378): 2-4 step via consistency distillation

### Editing Architectures
- OmniGen (2409.11340): Unified generation + editing
- InstructPix2Pix (2211.09800): Text-guided editing
|
|
---

## License

Apache 2.0
|
|