---
tags:
- image-generation
- latent-recurrent-flow
- lrf
- mobile-first
- flow-matching
- recursive-reasoning
- novel-architecture
- subquadratic-attention
- research
library_name: lrf
pipeline_tag: text-to-image
license: apache-2.0
---

# LatentRecurrentFlow (LRF) — A Novel Mobile-First Image Generation Architecture

> A genuinely new architecture for image generation, designed from scratch to run on consumer devices with 3–4 GB RAM and to train within a 16 GB memory budget.

## 🔥 v2 Training Results (CIFAR-10)

**Trained end-to-end on CIFAR-10** (50K images, 10 classes) using:

- **Pre-trained TAESD** (2.4M frozen params) as the VAE — f=8 compression, 32×32 images → 4×4×4 latents
- **1.47M-parameter denoising core** with recursive refinement (4 shared blocks × 2 recursions = 8 effective layers)
- **Rectified-flow** matching with SNR-weighted loss and 10% CFG dropout
- Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999

| Metric | Value |
|--------|-------|
| Final loss | 0.931 |
| Training time | ~70 min (CPU only!) |
| VAE recon MSE | 0.068 |
| All 10 classes produce colorful images | ✅ |

### Sample Outputs

VAE reconstruction (top: original, bottom: TAESD reconstruction):

![VAE Reconstruction](samples/vae_reconstruction.png)

Training progression (epoch 5 → 30):

![Epoch 5](samples/samples_epoch005.png)
![Epoch 30](samples/samples_epoch030.png)

Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):

![Final Samples](samples/final_class_conditional.png)

Loss curve:

![Loss](samples/loss.png)

### Validation: No Grey Images

Every class produces images with healthy pixel variance (no collapsed grey outputs):

```
airplane   : std=0.383, range=1.908 ✅
automobile : std=0.448, range=2.000 ✅
bird       : std=0.341, range=1.663 ✅
cat        : std=0.521, range=2.000 ✅
deer       : std=0.401, range=1.869 ✅
dog        : std=0.477, range=1.994 ✅
frog       : std=0.366, range=1.996 ✅
horse      : std=0.499, range=1.972 ✅
ship       : std=0.448, range=1.786 ✅
truck      : std=0.510, range=1.944 ✅
```

---

## Architecture Overview

LRF combines five key innovations into a single coherent architecture:

| Innovation | Source Inspiration | What It Does |
|---|---|---|
| **Recursive Latent Refinement (RLR)** | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1)-memory backprop |
| **Efficient Spatial Mixer** | ViG/GLA + DyDiLA | Attention + depthwise-conv locality (adapts to sequence length) |
| **Pre-trained TAESD VAE** | madebyollin/taesd | f=8 compression, 2.4M params, works out of the box |
| **Rectified Flow** objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
| **Additive Image Conditioning** | OmniGen | Same core supports text-to-image AND editing |

### v2 Architecture (Trained & Validated)

| Component | Parameters | Description |
|---|---|---|
| TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder |
| Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions |
| Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 |
| **Trainable Total** | **1.47M** | |

### How It Works

```python
# 1. Encode the image to a latent (TAESD, frozen)
z_0 = vae.encode(image)                # [B, 4, 4, 4]

# 2. Add noise (rectified flow): linear interpolation between data and noise
z_t = (1 - t) * z_0 + t * noise

# 3. Predict velocity with the recursive denoising core (4 blocks × 2 recursions)
v = core(z_t, t, class_label)

# 4. Training target: velocity matching
loss = MSE(v, noise - z_0)

# 5. Sampling: Euler ODE solver from t=1 to t=0
for step in timesteps:
    v = core(z, t, class_label)
    z = z - dt * v

# 6. Decode back to an image (TAESD, frozen)
image = vae.decode(z)
```
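The recursive refinement in step 3 is the architecture's central trick: a small stack of blocks is applied more than once with the same weights, so effective depth doubles without adding parameters. Below is a minimal, self-contained sketch of that idea; `TinyRecursiveCore` and its layer choices are illustrative assumptions for this README, not the actual `RecursiveLatentCore` in `lrf/model_v2.py`.

```python
import torch
import torch.nn as nn

class TinyRecursiveCore(nn.Module):
    """Weight-shared denoising core (a sketch, not the repo's implementation):
    n_blocks modules applied n_recursions times each, so effective depth is
    n_blocks * n_recursions with the parameter count of a single pass."""

    def __init__(self, dim=128, n_blocks=4, n_recursions=2, num_classes=10):
        super().__init__()
        self.proj_in = nn.Conv2d(4, dim, 1)         # 4-channel TAESD latent -> dim
        self.cond = nn.Embedding(num_classes, dim)  # class conditioning
        self.time = nn.Linear(1, dim)               # timestep conditioning
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.GroupNorm(8, dim),
                nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise spatial mixing
                nn.Conv2d(dim, dim, 1),                         # pointwise channel mixing
                nn.GELU(),
            )
            for _ in range(n_blocks)
        )
        self.n_recursions = n_recursions
        self.proj_out = nn.Conv2d(dim, 4, 1)        # predicted velocity, latent-shaped

    def forward(self, z_t, t, labels):
        h = self.proj_in(z_t)
        c = (self.cond(labels) + self.time(t[:, None]))[:, :, None, None]
        for _ in range(self.n_recursions):          # reuse the SAME weights each pass
            for block in self.blocks:
                h = h + block(h + c)                # residual refinement step
        return self.proj_out(h)
```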
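And a matching sketch of steps 2–5, a rectified-flow training step plus the Euler sampler. `training_step` and `euler_sample` are hypothetical helpers written against the core above, not the repo's `lrf/train_v2.py` or `RectifiedFlowScheduler`; the SNR loss weighting and CFG dropout used in the real run are omitted for brevity.

```python
import torch

def training_step(core, vae, images, labels):
    """One rectified-flow step: interpolate data toward noise, regress velocity."""
    with torch.no_grad():
        z_0 = vae.encode(images).latents            # frozen TAESD encoder
    noise = torch.randn_like(z_0)
    t = torch.rand(z_0.shape[0])                    # uniform t in [0, 1]
    t4 = t[:, None, None, None]
    z_t = (1 - t4) * z_0 + t4 * noise               # linear interpolation
    v = core(z_t, t, labels)
    return torch.nn.functional.mse_loss(v, noise - z_0)  # velocity target

@torch.no_grad()
def euler_sample(core, labels, num_steps=50, shape=(4, 4, 4, 4)):
    """Integrate dz/dt = v backward from pure noise at t=1 to data at t=0."""
    z = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t = ts[i].expand(shape[0])
        v = core(z, t, labels)
        z = z - (ts[i] - ts[i + 1]) * v             # one Euler step toward t=0
    return z
```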
---

## Quick Start

### Generate from the trained model

```python
import torch
from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
from diffusers import AutoencoderTiny

# Load the frozen TAESD VAE and the trained denoising core (EMA weights)
vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
model = LRFv2(ckpt['config'])
for name, p in model.named_parameters():
    p.data.copy_(ckpt['ema_params'][name])
model.eval()

# Generate four samples of class 3 (cat)
scheduler = RectifiedFlowScheduler()
labels = torch.full((4,), 3, dtype=torch.long)
z = scheduler.sample(model, (4, 4, 4, 4), labels, num_steps=50, cfg_scale=3.0)
images = vae.decode(z).sample.clamp(-1, 1)
```

### Train from scratch

```bash
python lrf/train_v2.py
```

---

## Files

| File | Description |
|---|---|
| `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) |
| `lrf/train_v2.py` | CIFAR-10 training pipeline with the TAESD VAE |
| `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) |
| `trained/config.json` | Model configuration |
| `samples/` | Generated sample images at various epochs |
| `lrf/model.py` | v1 architecture (research prototype) |
| `lrf/training.py` | v1 training pipeline |
| `lrf/pipeline.py` | HF-compatible inference pipeline |
| `notebook.ipynb` | Interactive walkthrough |

---

## Training Curriculum (Full Scale)

| Stage | Resolution | Data | Freeze | Train | LR | Steps |
|---|---|---|---|---|---|---|
| 1. VAE | 256² | ImageNet/COCO | - | VAE | 1e-4 | 50K |
| 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K |
| 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K |
| 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K |
| 5. Distill | 512² | Same as stage 4 | VAE+Text | Core | 1e-5 | 50K |
| 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K |

**Shortcut (proven in this repo):** Skip Stage 1 entirely by using the pre-trained TAESD VAE and start directly at Stage 2.

---

## Relevant Papers (Grouped by Problem)

### Subquadratic Spatial Mixing

- PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34× speedup
- DiMSUM (2411.04168): Mamba + wavelets, FID 2.11
- ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
- DyDiLA (2601.13683): Dynamic differential linear attention

### Recursive Reasoning

- HRM (2506.21734): Fixed-point recurrence, O(1) memory via the implicit function theorem
- TRM (2510.04871): 7M params → 45% on ARC-AGI-1

### Compact Latent Spaces

- SANA DC-AE (2410.10629): f=32, PSNR 29.29
- SnapGen (2412.09619): 1.38M-parameter tiny decoder
- TAESD (madebyollin): 2.4M params, f=8, works immediately

### Few-Step Generation

- Consistency Models (2303.01469): One-step generation from diffusion models
- LCM (2310.04378): 2–4 steps via consistency distillation

### Editing Architectures

- OmniGen (2409.11340): Unified generation + editing
- InstructPix2Pix (2211.09800): Text-guided editing

---

## License

Apache 2.0