ArtFlow / ARCHITECTURE.md

Add complete architecture specification (1000+ lines)

f0d55ac verified 18 days ago

preview code

raw

history blame contribute delete

48.7 kB

🎨 ArtFlow: Reasoning-Native Artistic Image Generation for Mobile Devices

A Novel Architecture for Intelligent, Lightweight Illustration Generation

Version: 1.0
Status: Architecture Specification + Prototype Implementation
Target: 2-4GB RAM, 1024px native generation, anime/illustration focus

Executive Summary
Research Foundations & Inspirations
Architecture Overview
Module 1: Latent Codec (Pretrained VAE)
Module 2: WaveMamba Denoising Backbone
Module 3: ArtStyle Matrix Encoder
Module 4: Concept Reasoning Engine (CRE)
Module 5: Mood & Philosophy Controller
Module 6: Text Understanding with Tiny Encoder
Mathematical Foundations
Training Pipeline
Datasets & Data Strategy
Inference Pipeline
Memory & Compute Analysis
Comparison with Existing Models

1. Executive Summary

ArtFlow is a novel image generation architecture designed from first principles to solve a specific problem: generating high-quality artistic/illustration images on mobile devices (2-4GB RAM) with native reasoning capabilities about art concepts, styles, moods, and composition.

Key Innovations

WaveMamba Denoising Core: A hybrid architecture combining wavelet-decomposed multi-scale processing with Selective State Space Models (Mamba) instead of transformer self-attention. Achieves O(n) complexity instead of O(n²) while maintaining global context awareness through the SSM hidden state. Inspired by DiMSUM [arXiv:2411.04168] and ZigMa [arXiv:2403.13802] but redesigned with a UNet topology and wavelet frequency routing.
Recursive Latent Reasoning (RLR): Borrowed from TRM/HRM [arXiv:2511.16886] — the denoising backbone performs iterative latent refinement where a "working memory" state z_L and "current solution" state z_H are updated recursively. This gives the model native reasoning about image content without increasing parameters. Each denoising step internally performs 2-3 reasoning recursions, letting the network "think" about composition, spatial relationships, and artistic coherence.
Disentangled Art Modules: Instead of a monolithic backbone, we decompose generation into:
- ArtStyle Matrix (S ∈ ℝ^{k×d}): Learned style vectors in a continuous style space. New styles = new vectors/matrices. Users can interpolate, combine, or invent entirely new styles by manipulating these compact representations.
- Concept Graph Embeddings: A lightweight module that encodes scene concepts (character poses, spatial relationships, object interactions) as graph-structured latent codes.
- Mood Controller: A small MLP that modulates generation based on emotional/atmospheric parameters (warm/cold, serene/chaotic, melancholic/joyful).
Flow Matching Training: We use rectified flow with logit-normal timestep sampling (from SD3/FLUX) for stable, fast convergence. Combined with a novel "Art-Aware Velocity Scaling" that weights the loss differently for high-frequency artistic details vs low-frequency composition.
Extreme Efficiency: Total denoising backbone ~250M parameters. With DC-AE [arXiv:2410.10733] f32 compression, we operate on tiny 32×32 latent maps for 1024px images. Combined with Mamba's O(n) complexity, inference requires <2GB VRAM and generates 1024px images in 4-8 steps.

Parameter Budget

Component	Parameters	RAM (fp16)
DC-AE f32 Decoder	~40M	~80MB
WaveMamba Backbone	~250M	~500MB
ArtStyle Matrix	~5M	~10MB
Concept Reasoning	~15M	~30MB
Mood Controller	~2M	~4MB
Text Encoder (TinyBERT)	~67M	~134MB
Total	~379M	~758MB

Peak inference RAM at 1024px: ~1.5-2.0 GB (including activations)

2. Research Foundations & Inspirations

2.1 Efficient Mobile Diffusion (What We Learned)

MobileDiffusion [arXiv:2311.16567]: Key insight — transformers are expensive at high resolution. They moved transformers to the UNet bottleneck only (16×16), used separable convolutions elsewhere, shared K-V projections, replaced softmax→ReLU for linear attention, replaced GELU→SiLU for mobile compatibility. Achieved 400M params, sub-second on mobile.

SnapGen [arXiv:2412.09619]: 372M params, FID 2.06 on ImageNet. Key techniques: removed self-attention from high-res stages, used expanded separable convolutions (UIB blocks), Multi-Query Attention (MQA), injected conditions from the very first stage with cross-attention (no self-attention), 2D RoPE, QK RMSNorm. Tiny 1.38M decoder.

DreamLite [arXiv:2603.28713]: 390M unified gen+edit model. In-context spatial concatenation for editing. Task-progressive joint pretraining. RLHF post-training. 4-step generation via adversarial distillation.

Our takeaway: UNet topology > pure ViT for mobile. Move heavy compute to lowest resolution. Separable convolutions for spatial blocks. Cross-attention is cheap and essential; self-attention is expensive and can be removed at high-res.

2.2 State Space Models for Vision (Our Core Innovation)

ZigMa [arXiv:2403.13802]: First successful Mamba-based diffusion. Used DiT-style architecture with zigzag scan patterns that maintain spatial continuity. Key finding: spatial continuity in scan order is critical — naive raster scan loses spatial relationships. Zigzag scan with heterogeneous layer-wise patterns adds zero memory overhead.

DiMSUM [arXiv:2411.04168]: Combined Mamba with wavelet decomposition. Wavelet transform decomposes images into frequency subbands, then each subband is processed by Mamba blocks. This gives Mamba local structure awareness (via high-frequency wavelets) while maintaining global context (via the SSM state). Outperformed DiT and DIFFUSSM.

Mamba2D [arXiv:2412.16146]: Native 2D state space model using a single 2D scan direction instead of multiple 1D scans. Better captures spatial dependencies.

Vision Mamba [arXiv:2401.09417]: Bidirectional Mamba blocks for vision. Outperformed DeiT with fewer parameters and better scaling to high-res.

Our synthesis: We combine the UNet topology (from MobileDiffusion/SnapGen efficiency findings) with Mamba-based processing at all resolutions. Instead of transformer self-attention blocks, we use WaveMamba blocks that perform wavelet decomposition → Mamba processing per subband → wavelet reconstruction. This gives O(n) global context at every resolution level while maintaining frequency-aware local processing.

2.3 Recursive Latent Reasoning (Our Reasoning Innovation)

TRM (Tiny Recursive Models) [Jolicoeur-Martineau 2025]: A single tiny transformer that recursively refines two latent states: z_H (current solution, directly supervised) and z_L (working memory/reasoning scratchpad, indirectly supervised). With just 2-layer transformers and ~1M params, achieved near-SOTA on ARC-AGI reasoning benchmarks. Key insight: z_L naturally becomes a "chain-of-thought" in latent space because it's only supervised through its effect on z_H.

HRM (Hierarchical Reasoning Models) [Wang et al. 2025]: Two recurrent networks at different update frequencies. Low-level module updates n times per high-level update. Deep supervision with detached states enables hundreds of effective layers from tiny models.

Deep Improvement Supervision (DIS) [arXiv:2511.16886]: Reframed TRM as policy improvement — each recursion step produces a reference policy and improved policy. Training each supervision step toward progressively less-corrupted targets reduced forward passes by 18× while maintaining performance.

LatentSeek [arXiv:2505.13308]: Test-time reasoning via policy gradient in latent space. No training needed — adapts pre-trained models at inference time.

Our application to image generation: We apply the TRM recursive reasoning principle directly to the denoising process. Each denoising step doesn't just predict noise once — it performs 2-3 internal recursions where:

z_L (working memory) processes the composition, spatial layout, and concept consistency
z_H (current image estimate) gets progressively refined by z_L's reasoning
This effectively gives the model a "thinking" capability about what it's generating, without any extra parameters

This is fundamentally different from simply running more denoising steps. The recursion happens within a single denoising step, using the same weights but different states.

2.4 Liquid Neural Networks & Continuous Dynamics

Liquid Time-Constant Networks [arXiv:2006.04439]: ODE-based neural networks with input-dependent time constants. The dynamics adapt to the input signal, making them extremely expressive per parameter. The key equation:

dx/dt = -[1/τ(x,I)] ⊙ x + [f(x,I)/τ(x,I)]

where τ is a learned, input-dependent time constant.

Neural ODEs [arXiv:1806.07366]: Continuous-depth models. Memory efficient via adjoint method. Adaptive evaluation speed.

Our application: We use a liquid-time-constant formulation for the Mood Controller — emotional/atmospheric parameters are encoded as time constants that modulate the dynamics of generation. A "serene" mood produces slow, smooth dynamics; a "chaotic" mood produces fast, turbulent dynamics. This is physics-inspired: mood literally changes the dynamics of how the image forms in latent space.

2.5 Art Style Disentanglement

USO [arXiv:2508.18966]: Unified style and subject generation via disentangled learning. Content-style decomposition training + style reward learning. State-of-the-art in both style similarity and subject consistency.

StyleGAN StyleSpace [arXiv:2011.12799]: Highly disentangled style control through channel-wise style parameters.

Illustrious [arXiv:2409.19946]: Anime model trained on Danbooru with: no-dropout tokens for sensitive content control, cosine annealing, quasi-register tokens for unknown concepts, multi-level score-based quality tags, resolution-specific training stages.

Our application: We create a learnable ArtStyle Matrix S ∈ ℝ^{K×d} where K is the number of base styles and d is the style dimension. Each style is a vector that modulates the Mamba SSM parameters (A, B, C, Δ). New styles are just new rows in the matrix. Interpolation between styles = interpolation between rows. This is like a "style periodic table" — atomic style elements that combine to form complex styles.

2.6 Wavelet Multi-Scale Processing

DiMSUM [arXiv:2411.04168]: Wavelet decomposition for Mamba-based diffusion.

WaveMix [arXiv:2203.03689]: 2D DWT for token mixing, competitive with ViTs/CNNs with fewer resources.

Wavelet Diffusion [arXiv:2211.16152]: Wavelet-based diffusion operating on frequency subbands.

Our synthesis: Wavelets are a perfect match for our architecture because:

They naturally decompose images into local frequency bands — the high-frequency bands capture artistic line work and details, low-frequency bands capture composition and color masses
Each subband is much smaller than the full image, so Mamba processing each subband is extremely efficient
We can apply different art-style modulation strengths to different frequency bands (e.g., strong style influence on line quality, moderate on color)
Wavelet transform/inverse is O(n) and parameter-free

2.7 Kolmogorov-Arnold Networks

KAN [arXiv:2404.19756]: Learnable activation functions on edges instead of fixed activations on nodes. More expressive per parameter for smooth functions. Good for learning scientific/mathematical relationships.

KA-Attention [arXiv:2503.10632]: KAN-based attention in ViTs showed competitive performance with learnable attention kernels.

Our application: We use KAN-inspired learnable activation functions in the Concept Reasoning Engine — the module that reasons about spatial relationships and scene composition. The idea is that compositional rules (rule of thirds, golden ratio, balance) are smooth mathematical functions that KAN can capture more efficiently than MLPs.

2.8 DC-AE for Extreme Latent Compression

DC-AE [arXiv:2410.10733]: Deep Compression Autoencoder achieving f32 and f64 compression ratios (vs f8 in SD). Key technique: Residual Autoencoding — non-parametric space-to-channel shortcuts that let the neural network learn residuals on top of a simple pixel shuffle. With Decoupled High-Resolution Adaptation, handles 1024px without quality loss.

DC-AE 1.5 [arXiv:2508.00413]: Structured Latent Space for even better diffusion model convergence.

Our application: We use DC-AE f32 as our frozen latent codec. A 1024×1024 image → 32×32×32 latent (32,768 values). This is 32× smaller sequence length than SD's 128×128. With Mamba's O(n) complexity, processing this tiny latent is extremely fast and memory-efficient.

3. Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                      ArtFlow Pipeline                             │
│                                                                    │
│  Text ──→ [TinyTextEnc] ──→ text_emb ──────────────────┐         │
│                                                          │         │
│  Style ──→ [ArtStyleMatrix] ──→ style_mod ──────────┐   │         │
│                                                      │   │         │
│  Mood ──→ [MoodController] ──→ mood_dyn ────────┐   │   │         │
│                                                  │   │   │         │
│  z_noise ──→ ┌─────────────────────────────────┐ │   │   │         │
│              │  WaveMamba UNet + RLR Reasoning  │◄┘   │   │         │
│              │                                  │◄────┘   │         │
│              │  [Down] → [Mid+Reason] → [Up]    │◄────────┘         │
│              │                                  │                   │
│              │  Internal per-step:              │                   │
│              │  for r in 1..R:                  │                   │
│              │    z_L = f(z_L + x + z_H)       │                   │
│              │    z_H = g(z_L + z_H)           │                   │
│              └──────────┬──────────────────────┘                   │
│                         │                                          │
│                    z_denoised                                      │
│                         │                                          │
│              ┌──────────┴──────────┐                               │
│              │  DC-AE f32 Decoder  │                               │
│              └──────────┬──────────┘                               │
│                         │                                          │
│                   1024×1024 Image                                  │
└──────────────────────────────────────────────────────────────────┘

Core Data Flow

Text → TinyTextEncoder → text_emb ∈ ℝ^{L×768} (L=77 tokens)
Art Style → ArtStyle Matrix lookup/interpolation → style_mod ∈ ℝ^d
Mood → Mood Controller → mood_dyn ∈ ℝ^d (time constants for liquid dynamics)
Noise z_t ∈ ℝ^{32×32×32} (from DC-AE f32 latent space)
Denoising: 4-8 flow matching steps, each with R=2 internal reasoning recursions
Decode: DC-AE decoder → 1024×1024×3 image

4. Module 1: Latent Codec (Pretrained DC-AE)

We use a pretrained, frozen DC-AE with spatial compression factor f=32 and channel dimension c=32.

Why DC-AE f32?

Codec	Spatial Factor	Latent Size (1024px)	Sequence Length	rFID
SD-VAE f8	8×	128×128×4	16,384	0.51
SD3-VAE f8	8×	128×128×16	16,384	0.28
DC-AE f32	32×	32×32×32	1,024	0.35
DC-AE f64	64×	16×16×128	256	0.50

f32 is the sweet spot: 16× fewer tokens than SD-VAE (1024 vs 16384), with comparable reconstruction quality. For our Mamba backbone with O(n) complexity, sequence length directly determines speed. 1024 tokens is trivially fast even on mobile.

Tiny Decoder Optimization

Following SnapGen [arXiv:2412.09619], we can optionally replace the full DC-AE decoder with a tiny ~1.4M parameter decoder that uses:

Single-layer ConvNeXt blocks instead of ResNet blocks
No attention in the decoder (purely convolutional upsampling)
Trained with a combination of L1 + perceptual (LPIPS) + GAN loss

This reduces decoder RAM from ~80MB to ~3MB while maintaining visual quality for illustration/anime styles (which have less fine texture detail than photorealistic images).

5. Module 2: WaveMamba Denoising Backbone (~250M params)

This is the core innovation. A UNet-shaped denoising network where every processing block uses WaveMamba instead of transformers.

5.1 UNet Topology

Input: z_t ∈ ℝ^{32×32×C_latent}    [C_latent=32 from DC-AE]

Encoder:
  Stage 1 (32×32): SepConv + CrossAttn(text)         [channels: 256]
  Stage 2 (16×16): WaveMamba + CrossAttn(text)        [channels: 512]  ← downsample 2×
  Stage 3 (8×8):   WaveMamba + CrossAttn(text)        [channels: 768]  ← downsample 2×

Bottleneck (8×8):
  WaveMamba × 4 + CrossAttn(text) + RecursiveReasoning  [channels: 768]

Decoder:
  Stage 3 (8×8→16×16):  WaveMamba + CrossAttn(text) + Skip  [channels: 512]
  Stage 2 (16×16→32×32): WaveMamba + CrossAttn(text) + Skip [channels: 256]
  Stage 1 (32×32):       SepConv + CrossAttn(text) + Skip    [channels: 256]

Output: v_predicted ∈ ℝ^{32×32×C_latent}

Key design decisions (informed by MobileDiffusion + SnapGen research):

No self-attention at 32×32 — too expensive; use SepConv only (with cross-attention for text)
WaveMamba at 16×16 and 8×8 — Mamba is efficient enough here, and we need global context
Heavy bottleneck — 4 WaveMamba blocks + recursive reasoning at 8×8 (only 64 tokens!)
Cross-attention everywhere — it's cheap (text is only 77 tokens) and crucial for prompt adherence
Skip connections — standard UNet skip connections for preserving details

5.2 WaveMamba Block

The core building block that replaces transformer self-attention:

Input: x ∈ ℝ^{H×W×C}

1. Wavelet Decomposition (parameter-free):
   x_LL, x_LH, x_HL, x_HH = DWT2D(x)
   # Each subband: ℝ^{H/2 × W/2 × C}

2. Flatten to sequences (zigzag scan for spatial continuity):
   seq_LL = zigzag_flatten(x_LL)  # ∈ ℝ^{HW/4 × C}
   seq_LH = zigzag_flatten(x_LH)
   seq_HL = zigzag_flatten(x_HL)
   seq_HH = zigzag_flatten(x_HH)

3. Selective SSM processing (Mamba) per subband:
   out_LL = Mamba(seq_LL, style_mod)  # Style modulates SSM parameters
   out_LH = Mamba(seq_LH, style_mod)
   out_HL = Mamba(seq_HL, style_mod)
   out_HH = Mamba(seq_HH, style_mod)

4. Inverse zigzag + Wavelet Reconstruction:
   out_LL = zigzag_unflatten(out_LL, H/2, W/2)
   ... (same for others)
   y = IDWT2D(out_LL, out_LH, out_HL, out_HH)

5. Residual + Norm:
   output = LayerNorm(x + y)

Why wavelets + Mamba?

The wavelet transform splits the signal into 4 subbands, each at half resolution → 4× less work per subband
Low-frequency (LL) captures composition; high-frequency (LH, HL, HH) captures line work and details
Each subband is processed independently by Mamba, so we get O(n) per subband, total O(n)
Style modulation can apply differently to each subband (strong in HH for line style, subtle in LL for composition)
Zigzag scan (from ZigMa) maintains spatial continuity within each subband

5.3 Style-Modulated Mamba

Standard Mamba has parameters (A, B, C, Δ) that are input-dependent. We add style-dependence:

Standard Mamba:
  B_t = Linear(x_t)
  C_t = Linear(x_t)  
  Δ_t = softplus(Linear(x_t))

Style-Modulated Mamba:
  B_t = Linear(x_t) + Linear_B(style_mod)     # Additive style bias
  C_t = Linear(x_t) + Linear_C(style_mod)
  Δ_t = softplus(Linear(x_t) * σ(Linear_Δ(style_mod)))  # Multiplicative time scale

The style vector modulates:

B (input projection): How much each input token contributes to the hidden state → controls what details the model attends to
C (output projection): What information to read from the hidden state → controls what features are expressed
Δ (time step): How quickly the hidden state evolves → controls the "rhythm" of the style (detailed vs smooth)

This is inspired by Liquid Neural Networks where the time constant τ modulates dynamics. Here, style acts as the time constant for how the image forms.

5.4 Expanded Separable Convolution Block (for Stage 1)

At 32×32 resolution, we use purely convolutional blocks (no Mamba/attention overhead):

Input: x ∈ ℝ^{H×W×C}

1. DepthwiseConv3x3(x)           # Spatial mixing, O(HW·C)
2. RMSNorm
3. PointwiseConv(C → 2C)          # Channel expansion
4. SiLU activation
5. PointwiseConv(2C → C)          # Channel reduction
6. Scale by timestep embedding

Output: x + scaled_output

UIB (Universal Inverted Bottleneck) design from SnapGen. Expansion ratio 2 balances parameters and quality.

5.5 Cross-Attention for Text Conditioning

Multi-Query Attention (MQA) for efficiency:

Q = Linear(image_features)     # ∈ ℝ^{N × h × d_k}    (h heads)
K = Linear(text_emb)           # ∈ ℝ^{L × 1 × d_k}    (1 shared head)
V = Linear(text_emb)           # ∈ ℝ^{L × 1 × d_v}    (1 shared head)

Attention = softmax(Q @ K.T / √d_k) @ V

MQA uses a single key-value head shared across all query heads, reducing text encoder memory by ~h× during inference. With 8 query heads and 1 KV head, this is 8× more efficient than standard multi-head attention.

5.6 Timestep & Conditioning Integration

Following DiT's AdaLN-Zero:

t_emb = MLP(sinusoidal_encoding(t))                    # Timestep
s_emb = MLP(style_mod)                                  # Style
m_emb = MLP(mood_dyn)                                   # Mood
c_emb = t_emb + s_emb + m_emb                          # Combined condition

# Applied as adaptive layer norm:
γ, β, α = chunk(Linear(c_emb), 3)
output = α * (γ * LayerNorm(x) + β)

The α (gate) starts near zero, providing stable training initialization.

6. Module 3: ArtStyle Matrix Encoder (~5M params)

6.1 Design Philosophy

Instead of learning styles implicitly in the backbone weights, we explicitly factor style into a learnable matrix:

S ∈ ℝ^{K × d_style}

where K = 256 base style vectors and d_style = 512.

Each style vector encodes a complete artistic style along dimensions like:

Line weight and quality (0-1: thin precise → thick expressive)
Color palette warmth (-1 to 1: cool → warm)
Detail density (0-1: minimal → intricate)
Shading type (categorical: cell-shaded, soft gradient, crosshatch, etc.)
Background treatment (0-1: abstract → detailed)
... (learned dimensions, not hand-coded)

6.2 Style Selection & Interpolation

# Single style:
style_vec = S[style_id]  # ∈ ℝ^d

# Style interpolation:
style_vec = α * S[style_a] + (1-α) * S[style_b]

# Multi-style composition:
style_vec = Σ_i w_i * S[style_i], where Σ w_i = 1

# Novel style invention:
style_vec = any_vector ∈ ℝ^d  # The space is continuous!

6.3 Style-to-Modulation Network

style_vec ∈ ℝ^d 
  → MLP(d → 4d → 4d → d_mod)
  → split into: style_B, style_C, style_Δ, style_adaLN

These modulation signals are injected into every WaveMamba block and AdaLN layer. The MLP is small (~3M params) but crucial — it translates abstract style codes into concrete modulations of the generation dynamics.

6.4 Training the Style Matrix

The style matrix is trained in Stage 2 of the training pipeline (after the backbone learns basic generation). We use a contrastive approach:

Sample images from the same artist/style → should produce similar style_vec
Sample images from different artists → should produce different style_vec
Style consistency loss: generated image's CLIP style embedding should match the input style_vec's implied style

The matrix S is randomly initialized and trained end-to-end with gradient descent. The continuous nature of the space means intermediate vectors (not in training data) produce coherent interpolated styles.

7. Module 4: Concept Reasoning Engine (CRE, ~15M params)

7.1 Purpose

The CRE gives the model explicit understanding of image concepts:

What objects/characters are present
Their spatial arrangement (who is in front, what's overlapping)
Actions and poses (standing, sitting, fighting)
Scene type (indoor, outdoor, abstract background)

7.2 Architecture

The CRE is a small graph neural network that operates on text-extracted concept tokens:

Input: text_emb → ConceptExtractor → concept_nodes ∈ ℝ^{M × d}  (M concepts)

GraphAttention layers × 3:
  for each concept node i:
    neighbors = top-k similar concepts (by learned similarity)
    node_i = node_i + Σ_j α_ij * V(node_j)    # Attend to related concepts

Output: concept_emb ∈ ℝ^{M × d}  → spatial layout hints

7.3 KAN-Based Composition Rules

We use Kolmogorov-Arnold Network layers for learning compositional rules:

class CompositionKAN(nn.Module):
    """Uses learnable activation functions to capture smooth compositional rules
    like rule-of-thirds, golden ratio, visual balance."""
    
    def __init__(self, d_in, d_out, grid_size=5):
        # B-spline basis functions on edges
        self.basis = BSplineBasis(grid_size)
        self.coeffs = nn.Parameter(torch.randn(d_in, d_out, grid_size))
    
    def forward(self, x):
        # Each edge has its own learned activation function
        basis_vals = self.basis(x.unsqueeze(-1))  # [B, d_in, grid_size]
        return torch.einsum('big,iog->bo', basis_vals, self.coeffs)

Why KAN here? Compositional rules are smooth mathematical functions (golden ratio ≈ 1.618, rule of thirds at 1/3 and 2/3 positions). KAN with B-spline basis can represent these functions more compactly than MLPs.

7.4 Spatial Layout Generation

The CRE produces a soft spatial layout that biases the denoising process:

concept_emb → LayoutMLP → spatial_bias ∈ ℝ^{32×32×1}

This spatial bias is added to the latent at each denoising step, gently guiding where concepts should appear. It's a soft prior, not a hard constraint — the denoising backbone can override it.

8. Module 5: Mood & Philosophy Controller (~2M params)

8.1 Liquid Dynamics Formulation

Inspired by Liquid Neural Networks [arXiv:2006.04439], the mood controller uses continuous dynamics:

Mood input: m ∈ {warm, cold, serene, chaotic, melancholic, joyful, ...}
  → mood_embedding ∈ ℝ^d_mood

Liquid Time Constants:
  τ(m) = τ_base * σ(W_τ * mood_embedding + b_τ)
  
  where τ ∈ ℝ^d_mod controls the temporal dynamics of each modulation dimension

Physics interpretation:

Large τ (serene mood) → slow dynamics → smooth, gradual color transitions, soft edges
Small τ (chaotic mood) → fast dynamics → sharp contrasts, dynamic compositions, high frequency detail
This is analogous to how diffusion coefficients in physics control the speed of spreading

8.2 Mood Modulation Injection

mood_signal = mood_embedding * (1/τ(m))  # Scaled by dynamics
→ Integrated into AdaLN: c_emb = t_emb + s_emb + mood_signal

The mood modulates the rate at which style and content evolve during denoising. Early steps (high noise) are dominated by composition; later steps (low noise) are dominated by details. The mood controller adjusts this balance:

Melancholic: Slow detail emergence, emphasis on composition and negative space
Joyful: Fast detail emergence, emphasis on bright colors and dynamic poses
Mysterious: Asymmetric — fast in dark regions, slow in light regions

8.3 Philosophy of Image Understanding

The mood controller also encodes what we call "artistic philosophy":

Narrative intent: Is this image telling a story? (learned from captioned illustration datasets)
Emotional depth: How much emotional weight does this image carry?
Visual metaphor: Does this image use visual metaphors? (learned from art-analysis datasets)

These are encoded as additional dimensions in the mood embedding, trained through:

Art-commentary datasets (descriptions of art that discuss mood, meaning, metaphor)
Emotion classification datasets (images + emotion labels)
Generated aesthetic score datasets (e.g., LAION aesthetic scores)

9. Module 6: Text Understanding (TinyTextEnc, ~67M params)

9.1 Architecture Choice

We use a distilled CLIP-ViT-B/32 text encoder (~~63M params) or TinyBERT (~~67M params):

Small enough for mobile (134MB in fp16)
Good text understanding for short prompts (anime tags + natural language)
Can be further distilled or quantized to 4-bit (~17MB) with minimal quality loss

9.2 Dual Prompt Format

Following Illustrious [arXiv:2409.19946]:

Format 1 (Tag-based): 
  "1girl, white hair, blue eyes, sword, standing, forest background, best quality"

Format 2 (Natural language):
  "A girl with white hair and blue eyes standing in a forest, holding a sword"

Format 3 (Mixed):
  "1girl, white hair, blue eyes | standing in a sunlit forest clearing, sword drawn"

The model handles both formats because training alternates between tag-based (Danbooru style) and natural language (BLIP2 captions).

9.3 Quasi-Register Tokens (from Illustrious)

For concepts the model can't express through text alone, we use register tokens — special learnable tokens appended to the sequence that capture residual information:

text_emb = TextEncoder([prompt_tokens, REG_1, REG_2, ..., REG_8])

The 8 register tokens are free to encode whatever the text prompt doesn't cover (implicit style cues, quality signals, etc.).

10. Mathematical Foundations

10.1 Flow Matching Objective

We use rectified flow with v-prediction following SD3/FLUX:

Forward process:  x_t = (1-t) * x_0 + t * ε,     ε ~ N(0, I)
Velocity:         v = dx_t/dt = ε - x_0
Training loss:    L = E_{t,x_0,ε} [ ||v_θ(x_t, t, c) - v||² ]

Timestep sampling: Logit-normal distribution shifted toward t=0.5 (from FLUX):

t ~ σ(μ + σ_ln * N(0,1))     where μ=0, σ_ln=1

This concentrates training on the mid-noise range where learning is most effective.

10.2 Art-Aware Velocity Scaling (Novel)

Standard flow matching weighs all spatial locations equally. But for artistic images:

Lines and edges (high-frequency) carry the most artistic identity
Color masses (low-frequency) carry composition
Details (mid-frequency) carry texture and style

We propose Frequency-Weighted Flow Matching:

L = E_{t,x_0,ε} [ Σ_b w_b * ||DWT_b(v_θ - v)||² ]

where b ∈ {LL, LH, HL, HH} are wavelet subbands and:
  w_LL = 1.0     (composition: standard weight)
  w_LH = 2.0     (horizontal lines: extra weight for art quality)
  w_HL = 2.0     (vertical lines: extra weight)
  w_HH = 1.5     (diagonal details: moderate extra weight)

This forces the model to pay more attention to getting line work right — crucial for illustration/anime quality.

10.3 Recursive Latent Reasoning (RLR) Formulation

Within each denoising step, we perform R recursions:

Initialize: z_H^0 = x_t (current noisy latent)
            z_L^0 = 0   (empty working memory)

For r = 1 to R:
  z_L^r = f_L(z_L^{r-1} + embed(x_t) + z_H^{r-1}; θ)    # Update working memory
  z_H^r = f_H(z_L^r + z_H^{r-1}; θ)                       # Update solution

Final: v_predicted = output_head(z_H^R)

where f_L and f_H share parameters (same WaveMamba blocks, different inputs). This is the TRM principle applied to denoising.

Key insight: z_L acts as a "reasoning scratchpad" — it can encode things like "the sword should overlap the character's hand" or "the background trees should be darker than the foreground" without explicitly representing these as images. It's a latent chain-of-thought.

10.4 Deep Improvement Supervision for Training RLR

From [arXiv:2511.16886], we train each recursion step toward progressively less-corrupted targets:

For supervision step s ∈ {1, ..., S}:
  target_s = corrupt(ground_truth, noise_level = (S-s)/S)
  
  # Step s sees a target with noise_level decreasing from ~1 to ~0
  L_s = ||output_head(z_H^s) - target_s||²

This gives each recursion a concrete learning signal: "improve the current estimate by this much." Without this, only the final recursion gets gradient signal, and earlier recursions become dead compute.

10.5 Mamba SSM Mathematics

The core State Space Model dynamics:

Continuous:  h'(t) = A·h(t) + B·x(t)
             y(t) = C·h(t)

Discrete (ZOH):  
  Ā = exp(Δ·A)
  B̄ = (Δ·A)^{-1} (exp(Δ·A) - I) · Δ·B
  
  h_t = Ā·h_{t-1} + B̄·x_t
  y_t = C·h_t

Selective Mamba (input-dependent):
  B_t = Linear(x_t)
  C_t = Linear(x_t)
  Δ_t = softplus(Linear(x_t))

Complexity: O(n) in sequence length (vs O(n²) for attention). With n=1024 (our latent size), Mamba is ~1000× cheaper than self-attention.

Memory: Hidden state h ∈ ℝ^{N×D} where N=state_dim (typically 16-64) and D=model_dim. This is constant regardless of sequence length — perfect for mobile.

10.6 Wavelet-Based Multi-Resolution Analysis

2D Discrete Wavelet Transform with Haar wavelets (simplest, no parameters):

LL = (x[::2,::2] + x[::2,1::2] + x[1::2,::2] + x[1::2,1::2]) / 2
LH = (x[::2,::2] + x[::2,1::2] - x[1::2,::2] - x[1::2,1::2]) / 2
HL = (x[::2,::2] - x[::2,1::2] + x[1::2,::2] - x[1::2,1::2]) / 2  
HH = (x[::2,::2] - x[::2,1::2] - x[1::2,::2] + x[1::2,1::2]) / 2

This is O(n) and fully differentiable. Inverse is equally simple.

11. Training Pipeline

Stage 0: Pretrain VAE (Skip — use existing)

We use pretrained DC-AE f32 from MIT Han Lab. Frozen during all subsequent training.

Alternative: Use SD3 VAE (f8, 16 channels) if DC-AE f32 isn't available. This gives 128×128 latent but is well-tested.

Stage 1: Base Generation Training (~100K steps)

Goal: Learn basic denoising (noise → latent image) without style/mood modules.

Config:

Dataset: ~10M image-text pairs (filtered for illustration/anime quality)
Resolution: 256px (8×8 latent with f32, or 32×32 with f8)
Batch size: 256
Learning rate: 1e-4 with cosine annealing
Optimizer: AdamW (β1=0.9, β2=0.99, wd=0.01)
Loss: MSE velocity prediction (standard flow matching)
No RLR recursion yet (R=1)
No style/mood modulation yet (set to zero)
AMP training (fp16/bf16)

Stability techniques:

QK RMSNorm in all attention layers (prevents softmax saturation)
Zero-initialized output projections in AdaLN (α starts near 0)
Gradient clipping at 1.0
EMA with decay 0.9999

Freezing: Text encoder frozen. DC-AE frozen. Only WaveMamba backbone trains.

Hardware: Single A100 80GB or 4× A10G 24GB. ~3-5 days.

Stage 2: Style Matrix Training (~50K steps)

Goal: Learn the ArtStyle Matrix to disentangle styles.

Config:

Dataset: Same as Stage 1 + artist/style labels
Resolution: 256px → 512px (progressive)
Unfreeze: ArtStyle Matrix + style modulation networks
Keep frozen: WaveMamba backbone (trained in Stage 1)
Loss: Standard flow matching + style consistency loss

Style Consistency Loss:

L_style = -cos_sim(CLIP_style(generated), CLIP_style(reference_of_same_style))

After 25K steps, unfreeze backbone for joint fine-tuning at lower LR (1e-5).

Stage 3: Resolution & Quality Scaling (~50K steps)

Goal: Scale to 1024px with high visual quality.

Config:

Resolution: 512px → 768px → 1024px (progressive over training)
Unfreeze: Everything except text encoder and DC-AE
Enable RLR recursion (R=2)
Enable Art-Aware Velocity Scaling loss
Loss: Frequency-weighted flow matching
Batch size: 64 (smaller due to resolution)

Progressive resolution prevents the model from needing to learn multi-resolution from scratch — it progressively extends its capability.

Stage 4: Reasoning & Concept Training (~30K steps)

Goal: Train the Concept Reasoning Engine and Mood Controller.

Config:

Unfreeze: CRE + Mood Controller
Freeze: Everything else
Loss: Standard + spatial layout guidance loss + mood classification loss
Datasets: Caption-enriched illustrations with mood/concept annotations

After 15K steps, unfreeze all for joint fine-tuning (1e-6 LR).

Stage 5: Quality Post-Training (SFT + RL, ~10K steps)

Goal: Align model with human aesthetic preferences.

Config:

Curated high-quality dataset (~100K best illustrations)
Loss: Flow matching + ImageReward score maximization
Step distillation: Train 4-step consistency model from the multi-step base

Following DreamLite's post-training recipe: SFT on curated data → RL with ImageReward → Step distillation.

Training Stability Summary

Technique	Purpose	Stage
QK RMSNorm	Prevent attention collapse	All
Zero-init AdaLN gates	Stable initialization	All
Gradient clipping (1.0)	Prevent explosion	All
EMA (0.9999)	Smooth training	All
Cosine annealing LR	Controlled convergence	All
Progressive resolution	Avoid resolution shock	Stage 3
Modular freeze/unfreeze	Stable staged training	All
Logit-normal timestep	Focus on informative t	All
Frequency-weighted loss	Art-quality emphasis	Stage 3+
Deep Improvement Supervision	Train RLR recursions	Stage 3+

Colab/Kaggle Feasibility

Stage 1 can be trained on Kaggle P100 (16GB) or Colab T4 (15GB):

Batch size 4 with gradient accumulation 64 = effective batch 256
Mixed precision (fp16)
Gradient checkpointing
256px resolution
~3-5 hours per 10K steps on T4

Total training budget for a proof-of-concept (Stages 1-3 at reduced scale):

Dataset: 1M images (subset)
Resolution: up to 512px
~48-72 hours on Kaggle (need to use multiple sessions)

12. Datasets & Data Strategy

12.1 Primary Datasets (Freely Available)

Dataset	Size	Purpose	Stage
Danbooru2023	~6M	Anime/illustration, tag-based	All
Pixiv Fanbox (filtered)	~2M	High-quality illustration	Stage 3+
ArtBench	60K	Style classification	Stage 2
WikiArt	80K	Art style diversity	Stage 2
LAION-Aesthetic V2 (≥6.5)	~600K	High aesthetic quality	Stage 1
JourneyDB	~4M	High-quality AI-assisted	Stage 1
Sakuga-42M	~42M clips	Anime understanding	Stage 4
Emotion/Mood datasets	~100K	Mood controller training	Stage 4

12.2 Illustration-Specific Data Preprocessing

Following Illustrious [arXiv:2409.19946]:

Tag ordering: person_count | character_names | rating | general_tags | artist | quality_score | year_modifier
Quality scoring: Percentile-based (worst → masterpiece scale)
No dropout on critical tokens (to prevent unwanted content generation)
Quasi-register tokens for unknown concepts
Mixed tag + natural language captions
Resolution filtering: Min 768×768, max aspect ratio 1:3
Aesthetic scoring: Filter with CLIP aesthetic predictor + hand-tuned thresholds

12.3 Art Style Dataset Construction

For the ArtStyle Matrix (Stage 2):

Cluster Danbooru by artist tags → ~5000 distinct artists
Select top 256 artists with most images (>500 each)
Each artist = one style vector in S
Additional synthetic styles from interpolation

12.4 Concept & Mood Annotation Pipeline

For CRE and Mood Controller (Stage 4):

Use existing VLM (e.g., InternVL2 or LLaVA) to generate:
- Object/character descriptions
- Spatial relationship descriptions
- Mood/emotion labels
- Scene type classifications
Filter and clean with rule-based heuristics
This creates a pseudo-labeled dataset for concept/mood training without manual annotation

13. Inference Pipeline

13.1 Standard Generation (4-8 steps)

def generate(prompt, style_id=None, mood=None, steps=8, cfg_scale=4.0):
    # 1. Encode text
    text_emb = text_encoder(tokenize(prompt))
    
    # 2. Get style modulation
    if style_id is not None:
        style_mod = art_style_matrix[style_id]
    else:
        style_mod = default_style  # or zero
    
    # 3. Get mood dynamics
    if mood is not None:
        mood_dyn = mood_controller(mood)
    else:
        mood_dyn = neutral_mood
    
    # 4. Sample noise
    z_t = torch.randn(1, 32, 32, 32)  # DC-AE f32 latent
    
    # 5. Flow matching denoising
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        
        # Classifier-free guidance
        v_cond = model(z_t, t, text_emb, style_mod, mood_dyn)
        v_uncond = model(z_t, t, null_text, style_mod, mood_dyn)
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        
        # Euler step
        z_t = z_t - v * dt
    
    # 6. Decode
    image = dc_ae_decoder(z_t)  # 1024×1024×3
    return image

13.2 Memory During Inference

Text encoder:    ~134 MB (fp16)
WaveMamba:       ~500 MB (fp16)  
ArtStyle Matrix:  ~10 MB
Mood Controller:   ~4 MB
DC-AE Decoder:   ~80 MB (or ~3 MB tiny decoder)
Latent tensor:    ~0.1 MB (32×32×32 × 2 bytes)
Activations:     ~200 MB (peak, during forward pass)
─────────────────────────
Total:           ~928 MB (with tiny decoder: ~851 MB)

Under 1GB for the model! With activation memory, peak is ~1.1-1.5 GB.

With INT8 quantization of the backbone: ~600 MB total. Well within 2-4 GB mobile budget.

13.3 Inference Speed Estimate

On a modern mobile GPU (Adreno 730 / Apple A16):

32×32 latent → 1024 tokens
Mamba: O(1024) per block
~50 WaveMamba blocks total
8 denoising steps with R=2 recursions = 16 backbone evaluations with CFG (×2) = 32 forward passes

Estimated: 1-3 seconds on flagship mobile (comparable to MobileDiffusion/SnapGen)

13.4 Future: Image Editing

The architecture naturally supports editing because:

Inpainting: Mask regions in the latent → denoise only masked regions
Style transfer: Change style_mod mid-generation
Mood editing: Change mood_dyn to alter atmosphere
Prompt editing: Change text_emb at different denoising steps
Super-resolution: Use the decoder at higher resolution with a fine-tuned upsampler

Following DreamLite's approach, we can add editing support by:

Concatenating source image latent with target latent (in-context conditioning)
Fine-tuning with editing pairs
No architecture change needed — just a training stage

14. Memory & Compute Analysis

14.1 FLOPs per Denoising Step

Component	Spatial Size	FLOPs (per step)
Stage 1 SepConv (×2)	32×32	~0.5 GFLOPs
Stage 2 WaveMamba (×2)	16×16	~1.0 GFLOPs
Stage 3 WaveMamba (×2)	8×8	~0.5 GFLOPs
Bottleneck WaveMamba (×4)	8×8	~1.0 GFLOPs
Cross-Attention (all stages)	various	~0.3 GFLOPs
RLR Recursion overhead (R=2)	8×8	~1.0 GFLOPs
Total per step		~4.3 GFLOPs

Per image (8 steps, CFG): ~69 GFLOPs

Compare: SDXL 600 GFLOPs per step, ~30,000 GFLOPs total. We're **430× more efficient**.

14.2 Attention Complexity Comparison

Method	Complexity	At 1024 tokens	At 16384 tokens (SD)
Self-Attention	O(n²d)	1×	256×
Mamba SSM	O(nd)	1×	16×
Our WaveMamba	O(n/4 × d) × 4	1×	16×

WaveMamba processes 4 subbands each at n/4 length, total work = O(nd) same as Mamba but with frequency awareness.

14.3 Mobile Deployment Considerations

Quantization-friendly: SiLU activations (not GELU), no complex operations
No self-attention: Eliminates the most VRAM-hungry operation
Constant memory Mamba: SSM state is fixed-size regardless of image resolution
Tiny latent space: 32×32 vs 128×128 = 16× less memory for activations
Separable convolutions: Efficient on mobile NPUs

15. Comparison with Existing Models

Feature	SDXL	FLUX	MobileDiffusion	SnapGen	ArtFlow
Params (backbone)	2.6B	12B	400M	372M	250M
Total params	~6B	~24B	~500M	~500M	379M
Latent size (1024px)	128²	128²	64²	128²	32²
Attention type	Self+Cross	Full	SA bottleneck	MQA	Mamba (O(n))
Native reasoning	❌	❌	❌	❌	✅ (RLR)
Style control	LoRA/fine-tune	LoRA	LoRA	-	Native matrix
Mood control	Prompt only	Prompt only	Prompt only	-	Native module
Art-focused	❌	❌	❌	❌	✅ by design
Mobile ready	❌	❌	✅	✅	✅
Training: Colab feasible	❌	❌	❌	❌	✅ (staged)
Editing support	Via separate model	Via fine-tune	❌	❌	Native
Peak RAM (1024px, fp16)	~8GB	~24GB	~1.5GB	~1.2GB	~1.0GB

Novel Contributions Summary

WaveMamba: First wavelet-decomposed Mamba denoising backbone in a UNet topology
Recursive Latent Reasoning for images: First application of TRM/HRM reasoning to image generation
ArtStyle Matrix: Explicit, manipulable style space for illustration generation
Liquid-dynamics Mood Control: Physics-inspired mood modulation using adaptive time constants
Art-Aware Velocity Scaling: Frequency-weighted flow matching loss for artistic quality
Deep Improvement Supervision for denoising: Training recursion steps with progressively cleaner targets
KAN-based Composition: Kolmogorov-Arnold Networks for learning smooth compositional rules

Appendix A: Key Paper References

MobileDiffusion [arXiv:2311.16567] - Mobile architecture optimization
SnapGen [arXiv:2412.09619] - Efficient UNet + knowledge distillation
DreamLite [arXiv:2603.28713] - Unified on-device gen+edit
ZigMa [arXiv:2403.13802] - Mamba for diffusion with zigzag scan
DiMSUM [arXiv:2411.04168] - Wavelet + Mamba for diffusion
DC-AE [arXiv:2410.10733] - Deep compression autoencoder f32/f64
TRM/DIS [arXiv:2511.16886] - Recursive reasoning as policy improvement
Liquid Neural Networks [arXiv:2006.04439] - Adaptive ODE dynamics
RWKV-7 [arXiv:2503.14456] - Linear-complexity language model
KAN [arXiv:2404.19756] - Kolmogorov-Arnold Networks
Illustrious [arXiv:2409.19946] - Anime-focused training methodology
Rectified Flow++ [arXiv:2405.20320] - Improved flow matching training
Stable Velocity [arXiv:2602.05435] - Variance reduction in flow matching
USO [arXiv:2508.18966] - Disentangled style+subject generation
Vision Mamba [arXiv:2401.09417] - Bidirectional Mamba for vision

ArtFlow Architecture v1.0 — Designed from research synthesis across 40+ papers spanning efficient architectures, state space models, latent reasoning, liquid neural networks, wavelet processing, and artistic style learning.

🎨 ArtFlow: Reasoning-Native Artistic Image Generation for Mobile Devices

A Novel Architecture for Intelligent, Lightweight Illustration Generation

Table of Contents

1. Executive Summary

Key Innovations

Parameter Budget

2. Research Foundations & Inspirations

2.1 Efficient Mobile Diffusion (What We Learned)

2.2 State Space Models for Vision (Our Core Innovation)

2.3 Recursive Latent Reasoning (Our Reasoning Innovation)

2.4 Liquid Neural Networks & Continuous Dynamics

2.5 Art Style Disentanglement

2.6 Wavelet Multi-Scale Processing

2.7 Kolmogorov-Arnold Networks

2.8 DC-AE for Extreme Latent Compression

3. Architecture Overview

Core Data Flow

4. Module 1: Latent Codec (Pretrained DC-AE)

Why DC-AE f32?

Tiny Decoder Optimization

5. Module 2: WaveMamba Denoising Backbone (~250M params)

5.1 UNet Topology

5.2 WaveMamba Block

5.3 Style-Modulated Mamba

5.4 Expanded Separable Convolution Block (for Stage 1)

5.5 Cross-Attention for Text Conditioning

5.6 Timestep & Conditioning Integration

6. Module 3: ArtStyle Matrix Encoder (~5M params)

6.1 Design Philosophy

6.2 Style Selection & Interpolation

6.3 Style-to-Modulation Network

6.4 Training the Style Matrix

7. Module 4: Concept Reasoning Engine (CRE, ~15M params)

7.1 Purpose

7.2 Architecture

7.3 KAN-Based Composition Rules

7.4 Spatial Layout Generation

8. Module 5: Mood & Philosophy Controller (~2M params)

8.1 Liquid Dynamics Formulation

8.2 Mood Modulation Injection

8.3 Philosophy of Image Understanding

9. Module 6: Text Understanding (TinyTextEnc, ~67M params)

9.1 Architecture Choice

9.2 Dual Prompt Format

9.3 Quasi-Register Tokens (from Illustrious)

10. Mathematical Foundations

10.1 Flow Matching Objective

10.2 Art-Aware Velocity Scaling (Novel)

10.3 Recursive Latent Reasoning (RLR) Formulation

10.4 Deep Improvement Supervision for Training RLR

10.5 Mamba SSM Mathematics

10.6 Wavelet-Based Multi-Resolution Analysis

11. Training Pipeline

Stage 0: Pretrain VAE (Skip — use existing)

Stage 1: Base Generation Training (~100K steps)

Stage 2: Style Matrix Training (~50K steps)

Stage 3: Resolution & Quality Scaling (~50K steps)

Stage 4: Reasoning & Concept Training (~30K steps)

Stage 5: Quality Post-Training (SFT + RL, ~10K steps)

Training Stability Summary

Colab/Kaggle Feasibility

12. Datasets & Data Strategy

12.1 Primary Datasets (Freely Available)

12.2 Illustration-Specific Data Preprocessing

12.3 Art Style Dataset Construction

12.4 Concept & Mood Annotation Pipeline

13. Inference Pipeline

13.1 Standard Generation (4-8 steps)

13.2 Memory During Inference

13.3 Inference Speed Estimate

13.4 Future: Image Editing

14. Memory & Compute Analysis

14.1 FLOPs per Denoising Step

14.2 Attention Complexity Comparison

14.3 Mobile Deployment Considerations

15. Comparison with Existing Models

Novel Contributions Summary

Appendix A: Key Paper References