# 🎨 ArtFlow: Reasoning-Native Artistic Image Generation for Mobile Devices
## A Novel Architecture for Intelligent, Lightweight Illustration Generation
**Version:** 1.0
**Status:** Architecture Specification + Prototype Implementation
**Target:** 2-4GB RAM, 1024px native generation, anime/illustration focus
---
## Table of Contents
1. [Executive Summary](#1-executive-summary)
2. [Research Foundations & Inspirations](#2-research-foundations)
3. [Architecture Overview](#3-architecture-overview)
4. [Module 1: Latent Codec (Pretrained DC-AE)](#4-latent-codec)
5. [Module 2: WaveMamba Denoising Backbone](#5-wavemamba-backbone)
6. [Module 3: ArtStyle Matrix Encoder](#6-artstyle-encoder)
7. [Module 4: Concept Reasoning Engine (CRE)](#7-concept-reasoning)
8. [Module 5: Mood & Philosophy Controller](#8-mood-controller)
9. [Module 6: Text Understanding with Tiny Encoder](#9-text-encoder)
10. [Mathematical Foundations](#10-mathematical-foundations)
11. [Training Pipeline](#11-training-pipeline)
12. [Datasets & Data Strategy](#12-datasets)
13. [Inference Pipeline](#13-inference)
14. [Memory & Compute Analysis](#14-memory-analysis)
15. [Comparison with Existing Models](#15-comparison)
---
## 1. Executive Summary
**ArtFlow** is a novel image generation architecture designed from first principles to solve a specific problem: **generating high-quality artistic/illustration images on mobile devices (2-4GB RAM) with native reasoning capabilities about art concepts, styles, moods, and composition.**
### Key Innovations
1. **WaveMamba Denoising Core**: A hybrid architecture combining wavelet-decomposed multi-scale processing with Selective State Space Models (Mamba) instead of transformer self-attention. It achieves O(n) complexity instead of O(n²) while maintaining global context awareness through the SSM hidden state. Inspired by DiMSUM [arXiv:2411.04168] and ZigMa [arXiv:2403.13802] but redesigned with a UNet topology and wavelet frequency routing.
2. **Recursive Latent Reasoning (RLR)**: Borrowed from TRM/HRM [arXiv:2511.16886]. The denoising backbone performs iterative latent refinement in which a "working memory" state z_L and a "current solution" state z_H are updated recursively. This gives the model native reasoning about image content without increasing parameters. Each denoising step internally performs 2-3 reasoning recursions, letting the network "think" about composition, spatial relationships, and artistic coherence.
3. **Disentangled Art Modules**: Instead of a monolithic backbone, we decompose generation into:
   - **ArtStyle Matrix** (S ∈ ℝ^{K×d}): Learned style vectors in a continuous style space. New styles = new vectors/matrices. Users can interpolate, combine, or invent entirely new styles by manipulating these compact representations.
   - **Concept Graph Embeddings**: A lightweight module that encodes scene concepts (character poses, spatial relationships, object interactions) as graph-structured latent codes.
   - **Mood Controller**: A small MLP that modulates generation based on emotional/atmospheric parameters (warm/cold, serene/chaotic, melancholic/joyful).
4. **Flow Matching Training**: We use rectified flow with logit-normal timestep sampling (from SD3/FLUX) for stable, fast convergence. We combine it with a novel "Art-Aware Velocity Scaling" that weights the loss differently for high-frequency artistic details and low-frequency composition.
5. **Extreme Efficiency**: The denoising backbone totals ~250M parameters. With DC-AE [arXiv:2410.10733] f32 compression, we operate on tiny 32×32 latent maps for 1024px images. Combined with Mamba's O(n) complexity, inference is designed to fit in <2GB of memory and to generate 1024px images in 4-8 steps.
### Parameter Budget
| Component | Parameters | RAM (fp16) |
|-----------|-----------|------------|
| DC-AE f32 Decoder | ~40M | ~80MB |
| WaveMamba Backbone | ~250M | ~500MB |
| ArtStyle Matrix | ~5M | ~10MB |
| Concept Reasoning | ~15M | ~30MB |
| Mood Controller | ~2M | ~4MB |
| Text Encoder (TinyBERT) | ~67M | ~134MB |
| **Total** | **~379M** | **~758MB** |
**Peak inference RAM at 1024px**: ~1.5-2.0 GB (including activations)
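The totals in the table can be re-derived with a few lines of arithmetic. The sketch below assumes 2 bytes per parameter (fp16) and decimal megabytes, which is what the table's figures imply; the names `PARAMS_M` and `weight_mb` are illustrative and not part of any ArtFlow code.

```python
# Parameter counts in millions, copied from the Parameter Budget table.
PARAMS_M = {
    "DC-AE f32 Decoder": 40,
    "WaveMamba Backbone": 250,
    "ArtStyle Matrix": 5,
    "Concept Reasoning": 15,
    "Mood Controller": 2,
    "Text Encoder (TinyBERT)": 67,
}

BYTES_PER_PARAM_FP16 = 2  # assumption: all weights are stored in fp16


def weight_mb(params_m):
    """Weight memory in decimal MB for a parameter count given in millions."""
    return params_m * BYTES_PER_PARAM_FP16  # 1e6 params * 2 bytes = 2 MB


total_params_m = sum(PARAMS_M.values())
total_weight_mb = sum(weight_mb(p) for p in PARAMS_M.values())
print(f"total: ~{total_params_m}M params, ~{total_weight_mb} MB of fp16 weights")
```

The difference between the weight total and the 1.5-2.0 GB peak figure is the budget left for activations and runtime overhead.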
---
## 2. Research Foundations & Inspirations
### 2.1 Efficient Mobile Diffusion (What We Learned)
**MobileDiffusion** [arXiv:2311.16567]: Key insight: transformers are expensive at high resolution. The authors moved transformers to the UNet bottleneck only (16×16), used separable convolutions elsewhere, shared K-V projections, replaced softmax with ReLU for linear attention, and replaced GELU with SiLU for mobile compatibility. The result: 400M params and sub-second generation on mobile.
**SnapGen** [arXiv:2412.09619]: 372M params, FID 2.06 on ImageNet. Key techniques: removed self-attention from high-res stages, used expanded separable convolutions (UIB blocks), Multi-Query Attention (MQA), injected conditions from the very first stage with cross-attention (no self-attention), 2D RoPE, QK RMSNorm. Tiny 1.38M decoder.
**DreamLite** [arXiv:2603.28713]: 390M unified gen+edit model. In-context spatial concatenation for editing. Task-progressive joint pretraining. RLHF post-training. 4-step generation via adversarial distillation.
**Our takeaway**: UNet topology > pure ViT for mobile. Move heavy compute to lowest resolution. Separable convolutions for spatial blocks. Cross-attention is cheap and essential; self-attention is expensive and can be removed at high-res.
### 2.2 State Space Models for Vision (Our Core Innovation)
**ZigMa** [arXiv:2403.13802]: First successful Mamba-based diffusion. Used a DiT-style architecture with zigzag scan patterns that maintain spatial continuity. Key finding: spatial continuity in scan order is critical, because a naive raster scan loses spatial relationships. Zigzag scan with heterogeneous layer-wise patterns adds zero memory overhead.
**DiMSUM** [arXiv:2411.04168]: Combined Mamba with wavelet decomposition. Wavelet transform decomposes images into frequency subbands, then each subband is processed by Mamba blocks. This gives Mamba local structure awareness (via high-frequency wavelets) while maintaining global context (via the SSM state). Outperformed DiT and DiffuSSM.
**Mamba2D** [arXiv:2412.16146]: Native 2D state space model using a single 2D scan direction instead of multiple 1D scans. Better captures spatial dependencies.
**Vision Mamba** [arXiv:2401.09417]: Bidirectional Mamba blocks for vision. Outperformed DeiT with fewer parameters and better scaling to high-res.
**Our synthesis**: We combine the UNet topology (from MobileDiffusion/SnapGen efficiency findings) with Mamba-based processing at all resolutions. Instead of transformer self-attention blocks, we use **WaveMamba blocks** that perform wavelet decomposition → Mamba processing per subband → wavelet reconstruction. This gives O(n) global context at every resolution level while maintaining frequency-aware local processing.
### 2.3 Recursive Latent Reasoning (Our Reasoning Innovation)
**TRM (Tiny Recursive Models)** [Jolicoeur-Martineau 2025]: A single tiny network that recursively refines two latent states: z_H (current solution, directly supervised) and z_L (working memory/reasoning scratchpad, indirectly supervised). With just 2 layers and ~7M params, it reported competitive results on ARC-AGI reasoning benchmarks. Key insight: z_L naturally becomes a "chain-of-thought" in latent space because it's only supervised through its effect on z_H.
**HRM (Hierarchical Reasoning Models)** [Wang et al. 2025]: Two recurrent networks at different update frequencies. Low-level module updates n times per high-level update. Deep supervision with detached states enables hundreds of effective layers from tiny models.
**Deep Improvement Supervision (DIS)** [arXiv:2511.16886]: Reframed TRM as policy improvement: each recursion step produces a reference policy and an improved policy. Training each supervision step toward progressively less-corrupted targets reduced forward passes by 18× while maintaining performance.
**LatentSeek** [arXiv:2505.13308]: Test-time reasoning via policy gradient in latent space. No training needed: it adapts pre-trained models at inference time.
**Our application to image generation**: We apply the TRM recursive reasoning principle directly to the denoising process. Each denoising step doesn't just predict once; it performs 2-3 internal recursions where:
- z_L (working memory) processes the composition, spatial layout, and concept consistency
- z_H (current image estimate) gets progressively refined by z_L's reasoning
- This effectively gives the model a "thinking" capability about what it's generating, without any extra parameters
This is fundamentally different from simply running more denoising steps. The recursion happens *within* a single denoising step, using the same weights but different states.
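To make the "same weights, different states" idea concrete, here is a minimal toy sketch of the recursion inside one denoising step. It is not the ArtFlow implementation: `f_L` and `f_H` are hypothetical stand-ins (a single shared scalar weight and a tanh) for the WaveMamba blocks, and the latent is a plain list of floats.

```python
import math


def f_L(z_L, x, z_H, w):
    """Toy working-memory update: z_L <- tanh(w * (z_L + x + z_H))."""
    return [math.tanh(w * (l + xi + h)) for l, xi, h in zip(z_L, x, z_H)]


def f_H(z_L, z_H, w):
    """Toy solution update driven by the refreshed working memory."""
    return [math.tanh(w * (l + h)) for l, h in zip(z_L, z_H)]


def denoise_step(x_t, w=0.5, R=2):
    """One denoising step with R internal recursions that all reuse weight w."""
    z_H = list(x_t)         # current solution starts from the noisy latent
    z_L = [0.0] * len(x_t)  # working memory starts empty
    for _ in range(R):
        z_L = f_L(z_L, x_t, z_H, w)
        z_H = f_H(z_L, z_H, w)
    return z_H


x_t = [0.5, -1.0, 2.0]
print(denoise_step(x_t, R=1))
print(denoise_step(x_t, R=3))
```

Raising `R` changes the output without adding parameters, since only the states `z_L` and `z_H` evolve between recursions.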
### 2.4 Liquid Neural Networks & Continuous Dynamics
**Liquid Time-Constant Networks** [arXiv:2006.04439]: ODE-based neural networks with input-dependent time constants. The dynamics adapt to the input signal, making them extremely expressive per parameter. The key equation:
```
dx/dt = -[1/τ(x,I)] ⊙ x + [f(x,I)/τ(x,I)]
```
where τ is a learned, input-dependent time constant.
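As a quick illustration of what the time constant does, the sketch below integrates the scalar form of this equation with forward Euler. It is a simplification under stated assumptions: τ and the drive f are held constant here (in the equation both depend on x and I), and the function name `ltc_euler` is hypothetical.

```python
def ltc_euler(x0, tau, target, dt=0.01, steps=100):
    """Integrate dx/dt = -x/tau + target/tau with forward Euler.

    tau and target are constants in this sketch; in a liquid
    time-constant network both are functions of the state and input.
    """
    x = x0
    for _ in range(steps):
        x += dt * (-x / tau + target / tau)
    return x


slow = ltc_euler(0.0, tau=2.0, target=1.0)  # large time constant
fast = ltc_euler(0.0, tau=0.2, target=1.0)  # small time constant
print(f"after one time unit: slow={slow:.3f}, fast={fast:.3f}")
```

With a large τ the state has covered only part of the distance to its target after one time unit, while with a small τ it has essentially arrived.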
**Neural ODEs** [arXiv:1806.07366]: Continuous-depth models. Memory efficient via adjoint method. Adaptive evaluation speed.
**Our application**: We use a liquid-time-constant formulation for the **Mood Controller**: emotional/atmospheric parameters are encoded as time constants that modulate the dynamics of generation. A "serene" mood produces slow, smooth dynamics; a "chaotic" mood produces fast, turbulent dynamics. This is physics-inspired: mood literally changes the *dynamics* of how the image forms in latent space.
### 2.5 Art Style Disentanglement
**USO** [arXiv:2508.18966]: Unified style and subject generation via disentangled learning. Content-style decomposition training + style reward learning. State-of-the-art in both style similarity and subject consistency.
**StyleGAN StyleSpace** [arXiv:2011.12799]: Highly disentangled style control through channel-wise style parameters.
**Illustrious** [arXiv:2409.19946]: Anime model trained on Danbooru with: no-dropout tokens for sensitive content control, cosine annealing, quasi-register tokens for unknown concepts, multi-level score-based quality tags, resolution-specific training stages.
**Our application**: We create a **learnable ArtStyle Matrix S ∈ ℝ^{K×d}** where K is the number of base styles and d is the style dimension. Each style is a vector that modulates the Mamba SSM parameters (A, B, C, Δ). New styles are just new rows in the matrix. Interpolation between styles = interpolation between rows. This is like a "style periodic table": atomic style elements that combine to form complex styles.
### 2.6 Wavelet Multi-Scale Processing
**DiMSUM** [arXiv:2411.04168]: Wavelet decomposition for Mamba-based diffusion.
**WaveMix** [arXiv:2203.03689]: 2D DWT for token mixing, competitive with ViTs/CNNs with fewer resources.
**Wavelet Diffusion** [arXiv:2211.16152]: Wavelet-based diffusion operating on frequency subbands.
**Our synthesis**: Wavelets are a perfect match for our architecture because:
1. They naturally decompose images into local frequency bands: the high-frequency bands capture artistic line work and details, while the low-frequency bands capture composition and color masses
2. Each subband is much smaller than the full image, so Mamba processing each subband is extremely efficient
3. We can apply different art-style modulation strengths to different frequency bands (e.g., strong style influence on line quality, moderate on color)
4. Wavelet transform/inverse is O(n) and parameter-free
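The points above can be checked on a toy example. The sketch below implements a one-level 2D Haar transform and its inverse on nested lists; Haar is an assumption chosen for brevity (the text does not name a wavelet family), and the LH/HL naming follows one common convention that differs between libraries.

```python
def haar_dwt2d(x):
    """One-level orthonormal Haar DWT of a 2D list with even height and width."""
    H, W = len(x), len(x[0])
    LL, LH, HL, HH = ([[0.0] * (W // 2) for _ in range(H // 2)] for _ in range(4))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            a, b, c, d = x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]
            LL[i // 2][j // 2] = (a + b + c + d) / 2  # local average
            LH[i // 2][j // 2] = (a + b - c - d) / 2  # change across rows
            HL[i // 2][j // 2] = (a - b + c - d) / 2  # change across columns
            HH[i // 2][j // 2] = (a - b - c + d) / 2  # diagonal change
    return LL, LH, HL, HH


def haar_idwt2d(LL, LH, HL, HH):
    """Exact inverse of haar_dwt2d."""
    h, w = len(LL), len(LL[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            ll, lh, hl, hh = LL[i][j], LH[i][j], HL[i][j], HH[i][j]
            x[2 * i][2 * j] = (ll + lh + hl + hh) / 2
            x[2 * i][2 * j + 1] = (ll + lh - hl - hh) / 2
            x[2 * i + 1][2 * j] = (ll - lh + hl - hh) / 2
            x[2 * i + 1][2 * j + 1] = (ll - lh - hl + hh) / 2
    return x


img = [[float(4 * r + c) for c in range(4)] for r in range(4)]
LL, LH, HL, HH = haar_dwt2d(img)
rec = haar_idwt2d(LL, LH, HL, HH)
print("subband size:", len(LL), "x", len(LL[0]))
print("max reconstruction error:",
      max(abs(rec[i][j] - img[i][j]) for i in range(4) for j in range(4)))
```

Both functions touch each pixel once and hold no learned weights, which is the sense in which the transform is O(n) and parameter-free.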
### 2.7 Kolmogorov-Arnold Networks
**KAN** [arXiv:2404.19756]: Learnable activation functions on edges instead of fixed activations on nodes. More expressive per parameter for smooth functions. Good for learning scientific/mathematical relationships.
**KA-Attention** [arXiv:2503.10632]: KAN-based attention in ViTs showed competitive performance with learnable attention kernels.
**Our application**: We use KAN-inspired learnable activation functions in the **Concept Reasoning Engine** β the module that reasons about spatial relationships and scene composition. The idea is that compositional rules (rule of thirds, golden ratio, balance) are smooth mathematical functions that KAN can capture more efficiently than MLPs.
### 2.8 DC-AE for Extreme Latent Compression
**DC-AE** [arXiv:2410.10733]: Deep Compression Autoencoder achieving f32 and f64 compression ratios (vs f8 in SD). Key technique: Residual Autoencoding, i.e. non-parametric space-to-channel shortcuts that let the neural network learn *residuals* on top of a simple pixel shuffle. With Decoupled High-Resolution Adaptation, handles 1024px without quality loss.
**DC-AE 1.5** [arXiv:2508.00413]: Structured Latent Space for even better diffusion model convergence.
**Our application**: We use DC-AE f32 as our frozen latent codec. A 1024×1024 image → 32×32×32 latent (32,768 values). This is a 16× shorter sequence than SD's 128×128 latent (1,024 vs 16,384 tokens). With Mamba's O(n) complexity, processing this tiny latent is extremely fast and memory-efficient.
---
## 3. Architecture Overview
```
┌──────────────────────────────────────────────────────────────────┐
│                         ArtFlow Pipeline                         │
│                                                                  │
│  Text ──→ [TinyTextEnc] ──→ text_emb ──────────────────────┐     │
│                                                            │     │
│  Style ──→ [ArtStyleMatrix] ──→ style_mod ─────────────┐   │     │
│                                                        │   │     │
│  Mood ──→ [MoodController] ──→ mood_dyn ───────────┐   │   │     │
│                                                    │   │   │     │
│  z_noise ──→ ┌─────────────────────────────────┐   │   │   │     │
│              │ WaveMamba UNet + RLR Reasoning  │←──┘   │   │     │
│              │                                 │←──────┘   │     │
│              │ [Down] → [Mid+Reason] → [Up]    │←──────────┘     │
│              │                                 │                 │
│              │ Internal per-step:              │                 │
│              │   for r in 1..R:                │                 │
│              │     z_L = f(z_L + x + z_H)      │                 │
│              │     z_H = g(z_L + z_H)          │                 │
│              └────────────────┬────────────────┘                 │
│                               │                                  │
│                          z_denoised                              │
│                               │                                  │
│                    ┌──────────┴──────────┐                       │
│                    │  DC-AE f32 Decoder  │                       │
│                    └──────────┬──────────┘                       │
│                               │                                  │
│                        1024×1024 Image                           │
└──────────────────────────────────────────────────────────────────┘
```
### Core Data Flow
1. **Text** → TinyTextEncoder → `text_emb ∈ ℝ^{L×768}` (L=77 tokens)
2. **Art Style** → ArtStyle Matrix lookup/interpolation → `style_mod ∈ ℝ^d`
3. **Mood** → Mood Controller → `mood_dyn ∈ ℝ^d` (time constants for liquid dynamics)
4. **Noise** `z_t ∈ ℝ^{32×32×32}` (from DC-AE f32 latent space)
5. **Denoising**: 4-8 flow matching steps, each with R=2 internal reasoning recursions
6. **Decode**: DC-AE decoder → 1024×1024×3 image
---
## 4. Module 1: Latent Codec (Pretrained DC-AE)
We use a **pretrained, frozen** DC-AE with spatial compression factor f=32 and channel dimension c=32.
### Why DC-AE f32?
| Codec | Spatial Factor | Latent Size (1024px) | Sequence Length | rFID |
|-------|---------------|---------------------|-----------------|------|
| SD-VAE f8 | 8× | 128×128×4 | 16,384 | 0.51 |
| SD3-VAE f8 | 8× | 128×128×16 | 16,384 | 0.28 |
| DC-AE f32 | 32× | 32×32×32 | 1,024 | 0.35 |
| DC-AE f64 | 64× | 16×16×128 | 256 | 0.50 |
**f32 is the sweet spot**: 16× fewer tokens than SD-VAE (1,024 vs 16,384), with comparable reconstruction quality. For our Mamba backbone with O(n) complexity, sequence length directly determines speed. 1,024 tokens is trivially fast even on mobile.
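The latent sizes and sequence lengths in the table follow directly from the spatial compression factor. A small sketch, assuming square images and one token per latent position (`latent_grid` is an illustrative helper, not a DC-AE API):

```python
def latent_grid(image_px, spatial_factor, channels):
    """Return (height, width, channels, tokens) of the latent for a square image."""
    side = image_px // spatial_factor
    return side, side, channels, side * side


sd_f8 = latent_grid(1024, 8, 4)       # SD-VAE f8
dc_f32 = latent_grid(1024, 32, 32)    # DC-AE f32
dc_f64 = latent_grid(1024, 64, 128)   # DC-AE f64

print("SD-VAE f8 :", sd_f8)
print("DC-AE f32 :", dc_f32)
print("DC-AE f64 :", dc_f64)
print("token reduction f8 -> f32:", sd_f8[3] // dc_f32[3])
```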
### Tiny Decoder Optimization
Following SnapGen [arXiv:2412.09619], we can optionally replace the full DC-AE decoder with a tiny ~1.4M parameter decoder that uses:
- Single-layer ConvNeXt blocks instead of ResNet blocks
- No attention in the decoder (purely convolutional upsampling)
- Trained with a combination of L1 + perceptual (LPIPS) + GAN loss
This reduces decoder RAM from ~80MB to ~3MB while maintaining visual quality for illustration/anime styles (which have less fine texture detail than photorealistic images).
---
## 5. Module 2: WaveMamba Denoising Backbone (~250M params)
This is the core innovation. A UNet-shaped denoising network where every processing block uses **WaveMamba** instead of transformers.
### 5.1 UNet Topology
```
Input: z_t ∈ ℝ^{32×32×C_latent}  [C_latent=32 from DC-AE]
Encoder:
  Stage 1 (32×32): SepConv + CrossAttn(text)    [channels: 256]
  Stage 2 (16×16): WaveMamba + CrossAttn(text)  [channels: 512]  ← downsample 2×
  Stage 3 (8×8):   WaveMamba + CrossAttn(text)  [channels: 768]  ← downsample 2×
Bottleneck (8×8):
  WaveMamba × 4 + CrossAttn(text) + RecursiveReasoning  [channels: 768]
Decoder:
  Stage 3 (8×8→16×16):   WaveMamba + CrossAttn(text) + Skip  [channels: 512]
  Stage 2 (16×16→32×32): WaveMamba + CrossAttn(text) + Skip  [channels: 256]
  Stage 1 (32×32):       SepConv + CrossAttn(text) + Skip    [channels: 256]
Output: v_predicted ∈ ℝ^{32×32×C_latent}
```
Key design decisions (informed by MobileDiffusion + SnapGen research):
- **No self-attention at 32×32**: too expensive; use SepConv only (with cross-attention for text)
- **WaveMamba at 16×16 and 8×8**: Mamba is efficient enough here, and we need global context
- **Heavy bottleneck**: 4 WaveMamba blocks + recursive reasoning at 8×8 (only 64 tokens!)
- **Cross-attention everywhere**: it's cheap (text is only 77 tokens) and crucial for prompt adherence
- **Skip connections**: standard UNet skip connections for preserving details
### 5.2 WaveMamba Block
The core building block that replaces transformer self-attention:
```
Input: x ∈ ℝ^{H×W×C}
1. Wavelet Decomposition (parameter-free):
   x_LL, x_LH, x_HL, x_HH = DWT2D(x)
   # Each subband: ℝ^{H/2 × W/2 × C}
2. Flatten to sequences (zigzag scan for spatial continuity):
   seq_LL = zigzag_flatten(x_LL)  # ∈ ℝ^{HW/4 × C}
   seq_LH = zigzag_flatten(x_LH)
   seq_HL = zigzag_flatten(x_HL)
   seq_HH = zigzag_flatten(x_HH)
3. Selective SSM processing (Mamba) per subband:
   out_LL = Mamba(seq_LL, style_mod)  # Style modulates SSM parameters
   out_LH = Mamba(seq_LH, style_mod)
   out_HL = Mamba(seq_HL, style_mod)
   out_HH = Mamba(seq_HH, style_mod)
4. Inverse zigzag + Wavelet Reconstruction:
   out_LL = zigzag_unflatten(out_LL, H/2, W/2)
   ... (same for others)
   y = IDWT2D(out_LL, out_LH, out_HL, out_HH)
5. Residual + Norm:
   output = LayerNorm(x + y)
```
**Why wavelets + Mamba?**
- The wavelet transform splits the signal into 4 subbands, each at half resolution per side, so each subband holds 4× fewer tokens than the input
- Low-frequency (LL) captures composition; high-frequency (LH, HL, HH) captures line work and details
- Each subband is processed independently by Mamba, so we get O(n) per subband, total O(n)
- Style modulation can apply differently to each subband (strong in HH for line style, subtle in LL for composition)
- Zigzag scan (from ZigMa) maintains spatial continuity within each subband
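The `zigzag_flatten` / `zigzag_unflatten` helpers named in the block above are not defined in this document. One plausible minimal reading, assumed here, is a serpentine (boustrophedon) scan that reverses every other row; ZigMa itself uses several layer-wise scan variants, so treat this as a sketch of the idea rather than the exact scheme.

```python
def zigzag_flatten(grid):
    """Flatten a 2D list row by row, reversing every other row (serpentine order)."""
    seq = []
    for r, row in enumerate(grid):
        seq.extend(row if r % 2 == 0 else row[::-1])
    return seq


def zigzag_unflatten(seq, H, W):
    """Inverse of zigzag_flatten for an H x W grid."""
    grid = []
    for r in range(H):
        row = list(seq[r * W:(r + 1) * W])
        grid.append(row if r % 2 == 0 else row[::-1])
    return grid


# Use coordinates as cell contents so the scan path can be inspected.
coords = [[(r, c) for c in range(4)] for r in range(3)]
path = zigzag_flatten(coords)

# Manhattan distance between consecutive sequence elements.
steps = [abs(r1 - r2) + abs(c1 - c2) for (r1, c1), (r2, c2) in zip(path, path[1:])]
raster = [cell for row in coords for cell in row]
raster_steps = [abs(r1 - r2) + abs(c1 - c2) for (r1, c1), (r2, c2) in zip(raster, raster[1:])]
print("serpentine max jump:", max(steps), "| raster max jump:", max(raster_steps))
```

In the serpentine order every consecutive pair of tokens is a direct grid neighbour, whereas a raster scan jumps across the full row width at each line end; that jump is the loss of spatial continuity the ZigMa finding refers to.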
### 5.3 Style-Modulated Mamba
Standard Mamba has parameters (A, B, C, Δ), of which B, C, and Δ are input-dependent. We add style-dependence:
```
Standard Mamba:
  B_t = Linear(x_t)
  C_t = Linear(x_t)
  Δ_t = softplus(Linear(x_t))
Style-Modulated Mamba:
  B_t = Linear(x_t) + Linear_B(style_mod)                 # Additive style bias
  C_t = Linear(x_t) + Linear_C(style_mod)
  Δ_t = softplus(Linear(x_t) * σ(Linear_Δ(style_mod)))    # Multiplicative time scale
```
The style vector modulates:
- **B** (input projection): How much each input token contributes to the hidden state; controls what details the model attends to
- **C** (output projection): What information to read from the hidden state; controls what features are expressed
- **Δ** (time step): How quickly the hidden state evolves; controls the "rhythm" of the style (detailed vs smooth)
This is inspired by Liquid Neural Networks where the time constant τ modulates dynamics. Here, style acts as the time constant for how the image forms.
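A scalar toy version of the style-modulated recurrence shows how the three signals enter the scan. This is a sketch under simplifying assumptions: the state is one-dimensional, each `Linear` is a single scalar weight, the style contributions (`style_B`, `style_C`, `style_gate_logit`) are passed in directly, and the discretization h_t = exp(Δ_t·A)·h_{t-1} + Δ_t·B_t·x_t is the usual simplified Mamba form. All names are illustrative.

```python
import math


def softplus(v):
    return math.log1p(math.exp(v))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def style_modulated_scan(xs, style_gate_logit, style_B=0.0, style_C=0.0,
                         A=-1.0, w_delta=1.0, w_B=1.0, w_C=1.0):
    """Scalar selective-SSM scan with style-dependent B, C and delta."""
    gate = sigmoid(style_gate_logit)           # multiplicative time-scale gate
    h, ys = 0.0, []
    for x in xs:
        delta = softplus(w_delta * x * gate)   # step size, always positive
        B = w_B * x + style_B                  # additive style bias on the input side
        C = w_C * x + style_C                  # additive style bias on the readout
        h = math.exp(delta * A) * h + delta * B * x
        ys.append(C * h)
    return ys


xs = [1.0, 0.5, -0.3]
ys_open = style_modulated_scan(xs, style_gate_logit=2.0)     # gate near 1
ys_closed = style_modulated_scan(xs, style_gate_logit=-2.0)  # gate near 0
print(ys_open)
print(ys_closed)
```

The same input sequence produces different outputs under the two gate settings, which is the mechanism by which a style vector can change the "rhythm" of the scan without touching the backbone weights.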
### 5.4 Expanded Separable Convolution Block (for Stage 1)
At 32×32 resolution, we use purely convolutional blocks (no Mamba/attention overhead):
```
Input: x ∈ ℝ^{H×W×C}
1. DepthwiseConv3x3(x)        # Spatial mixing, O(HW·C)
2. RMSNorm
3. PointwiseConv(C → 2C)      # Channel expansion
4. SiLU activation
5. PointwiseConv(2C → C)      # Channel reduction
6. Scale by timestep embedding
Output: x + scaled_output
```
UIB (Universal Inverted Bottleneck) design from SnapGen. Expansion ratio 2 balances parameters and quality.
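To see what the separable design saves, the sketch below counts convolution weights for this block and for a dense 3×3 convolution with the same channel width. Biases, normalization parameters, and the timestep scaling are ignored, and the function names are illustrative.

```python
def sepconv_block_params(C, expansion=2, k=3):
    """Weights in depthwise k x k + pointwise expand + pointwise reduce."""
    depthwise = k * k * C            # one k x k filter per channel
    expand = C * (expansion * C)     # 1x1 conv, C -> expansion*C
    reduce = (expansion * C) * C     # 1x1 conv, expansion*C -> C
    return depthwise + expand + reduce


def dense_conv_params(C, k=3):
    """Weights in a standard k x k convolution mapping C -> C channels."""
    return k * k * C * C


C = 256  # Stage 1 channel width from Section 5.1
sep = sepconv_block_params(C)
dense = dense_conv_params(C)
print(f"separable block: {sep:,} weights, dense 3x3 conv: {dense:,} weights")
```

At C=256 the whole expanded separable block uses fewer than half the weights of a single dense 3×3 convolution.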
### 5.5 Cross-Attention for Text Conditioning
Multi-Query Attention (MQA) for efficiency:
```
Q = Linear(image_features)  # ∈ ℝ^{N × h × d_k}  (h heads)
K = Linear(text_emb)        # ∈ ℝ^{L × 1 × d_k}  (1 shared head)
V = Linear(text_emb)        # ∈ ℝ^{L × 1 × d_v}  (1 shared head)
Attention = softmax(Q @ K.T / √d_k) @ V
```
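MQA uses a single key-value head shared across all query heads, shrinking the K and V projections of the text embeddings (and the memory needed to hold them during inference) by ~h×. With 8 query heads and 1 KV head, the K/V tensors are 8× smaller than in standard multi-head attention; the query-side compute is unchanged.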
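The K/V saving can be verified by counting tensor elements. The sketch assumes a head dimension of 64, which is not stated in this document; `kv_elements` is an illustrative helper.

```python
def kv_elements(num_tokens, kv_heads, d_head):
    """Number of elements in the K and V tensors together for one attention layer."""
    return 2 * num_tokens * kv_heads * d_head


L_text, h, d_head = 77, 8, 64  # d_head = 64 is an assumption for illustration
mha = kv_elements(L_text, kv_heads=h, d_head=d_head)  # one K/V head per query head
mqa = kv_elements(L_text, kv_heads=1, d_head=d_head)  # one shared K/V head
print(f"MHA K/V elements: {mha}, MQA K/V elements: {mqa}, ratio: {mha // mqa}x")
```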
### 5.6 Timestep & Conditioning Integration
Following DiT's AdaLN-Zero:
```
t_emb = MLP(sinusoidal_encoding(t)) # Timestep
s_emb = MLP(style_mod) # Style
m_emb = MLP(mood_dyn) # Mood
c_emb = t_emb + s_emb + m_emb # Combined condition
# Applied as adaptive layer norm:
γ, β, α = chunk(Linear(c_emb), 3)
output = α * (γ * LayerNorm(x) + β)
```
The α (gate) starts near zero, providing stable training initialization.
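The modulation formula can be traced on plain lists. The sketch below follows the formula exactly as written in this section (scale, shift, then gate); `layer_norm`, `linear`, and `adaln` are small illustrative helpers, and the zero-initialized projection is the assumption that makes the gate start at zero.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(W, b, c):
    """Dense layer: W has one row per output, b one bias per output."""
    return [sum(w * ci for w, ci in zip(row, c)) + bi for row, bi in zip(W, b)]


def adaln(x, c_emb, W, b):
    """output = alpha * (gamma * LayerNorm(x) + beta), with all three from c_emb."""
    d = len(x)
    mod = linear(W, b, c_emb)  # length 3*d, split into three chunks
    gamma, beta, alpha = mod[:d], mod[d:2 * d], mod[2 * d:]
    normed = layer_norm(x)
    return [a * (g * n + be) for a, g, n, be in zip(alpha, gamma, normed, beta)]


x = [1.0, 2.0, 3.0, 4.0]
c_emb = [0.3, -0.7]  # stands in for t_emb + s_emb + m_emb
d = len(x)

# Zero-initialized projection: the gate alpha is 0, so the branch outputs 0.
W_zero = [[0.0] * len(c_emb) for _ in range(3 * d)]
out_zero = adaln(x, c_emb, W_zero, [0.0] * (3 * d))
print(out_zero)
```

With the projection at zero the branch contributes nothing, so early training is driven only by whatever path surrounds it; the conditioning takes effect as the projection weights move away from zero.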
---
## 6. Module 3: ArtStyle Matrix Encoder (~5M params)
### 6.1 Design Philosophy
Instead of learning styles implicitly in the backbone weights, we explicitly factor style into a learnable matrix:
```
S ∈ ℝ^{K × d_style}
```
where K = 256 base style vectors and d_style = 512.
Each style vector encodes a complete artistic style along dimensions like:
- Line weight and quality (0-1: thin precise → thick expressive)
- Color palette warmth (-1 to 1: cool → warm)
- Detail density (0-1: minimal → intricate)
- Shading type (categorical: cel-shaded, soft gradient, crosshatch, etc.)
- Background treatment (0-1: abstract → detailed)
- ... (learned dimensions, not hand-coded)
### 6.2 Style Selection & Interpolation
```python
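import torch

K, d = 256, 512                        # sizes from Section 6.1
S = torch.randn(K, d)                  # stand-in for the learned style matrix
style_a, style_b, alpha = 3, 17, 0.7   # illustrative indices and mixing weight
# Single style:
style_vec = S[style_a]                                     # ∈ ℝ^d
# Style interpolation:
style_vec = alpha * S[style_a] + (1 - alpha) * S[style_b]
# Multi-style composition (weights w_i sum to 1):
ids = torch.tensor([3, 17, 42])
w = torch.tensor([0.5, 0.3, 0.2])
style_vec = (w[:, None] * S[ids]).sum(dim=0)
# Novel style invention: any vector in ℝ^d is a valid style code
style_vec = torch.randn(d)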
```
### 6.3 Style-to-Modulation Network
```
style_vec ∈ ℝ^d
  → MLP(d → 4d → 4d → d_mod)
  → split into: style_B, style_C, style_Δ, style_adaLN
```
These modulation signals are injected into every WaveMamba block and AdaLN layer. The MLP is small (~3M params) but crucial: it translates abstract style codes into concrete modulations of the generation dynamics.
### 6.4 Training the Style Matrix
The style matrix is trained in **Stage 2** of the training pipeline (after the backbone learns basic generation). We use a contrastive approach:
1. Sample images from the same artist/style β should produce similar style_vec
2. Sample images from different artists β should produce different style_vec
3. Style consistency loss: generated image's CLIP style embedding should match the input style_vec's implied style
The matrix S is randomly initialized and trained end-to-end with gradient descent. The continuous nature of the space means intermediate vectors (not in training data) produce coherent interpolated styles.
---
## 7. Module 4: Concept Reasoning Engine (CRE, ~15M params)
### 7.1 Purpose
The CRE gives the model explicit understanding of image concepts:
- What objects/characters are present
- Their spatial arrangement (who is in front, what's overlapping)
- Actions and poses (standing, sitting, fighting)
- Scene type (indoor, outdoor, abstract background)
### 7.2 Architecture
The CRE is a small graph neural network that operates on text-extracted concept tokens:
```
Input: text_emb → ConceptExtractor → concept_nodes ∈ ℝ^{M × d}  (M concepts)
GraphAttention layers × 3:
  for each concept node i:
    neighbors = top-k similar concepts (by learned similarity)
    node_i = node_i + Σ_j α_ij * V(node_j)  # Attend to related concepts
Output: concept_emb ∈ ℝ^{M × d} → spatial layout hints
```
### 7.3 KAN-Based Composition Rules
We use Kolmogorov-Arnold Network layers for learning compositional rules:
```python
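import torch
import torch.nn as nn


class CompositionKAN(nn.Module):
    """Uses learnable activation functions to capture smooth compositional rules
    like rule-of-thirds, golden ratio, visual balance."""

    def __init__(self, d_in, d_out, grid_size=5, x_min=-1.0, x_max=1.0):
        super().__init__()  # must run before parameters/buffers are registered
        # B-spline basis functions on edges. The original BSplineBasis helper is
        # not defined here, so this uses degree-1 (hat-shaped) B-splines on a
        # uniform grid as a stand-in; the input range [x_min, x_max] is assumed.
        self.register_buffer("knots", torch.linspace(x_min, x_max, grid_size))
        self.h = (x_max - x_min) / (grid_size - 1)
        self.coeffs = nn.Parameter(torch.randn(d_in, d_out, grid_size))

    def forward(self, x):
        # x: [B, d_in]. Each edge (i, o) has its own learned activation function.
        dist = (x.unsqueeze(-1) - self.knots).abs() / self.h
        basis_vals = torch.relu(1 - dist)  # [B, d_in, grid_size]
        return torch.einsum('big,iog->bo', basis_vals, self.coeffs)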
```
Why KAN here? Compositional rules are smooth mathematical functions (golden ratio ≈ 1.618, rule of thirds at 1/3 and 2/3 positions). KAN with B-spline basis can represent these functions more compactly than MLPs.
### 7.4 Spatial Layout Generation
The CRE produces a soft spatial layout that biases the denoising process:
```
concept_emb → LayoutMLP → spatial_bias ∈ ℝ^{32×32×1}
```
This spatial bias is added to the latent at each denoising step, gently guiding where concepts should appear. It's a soft prior, not a hard constraint: the denoising backbone can override it.
---
## 8. Module 5: Mood & Philosophy Controller (~2M params)
### 8.1 Liquid Dynamics Formulation
Inspired by Liquid Neural Networks [arXiv:2006.04439], the mood controller uses continuous dynamics:
```
Mood input: m ∈ {warm, cold, serene, chaotic, melancholic, joyful, ...}
  → mood_embedding ∈ ℝ^d_mood
Liquid Time Constants:
  τ(m) = τ_base * σ(W_τ * mood_embedding + b_τ)
  where τ ∈ ℝ^d_mod controls the temporal dynamics of each modulation dimension
```
**Physics interpretation**:
- Large τ (serene mood) → slow dynamics → smooth, gradual color transitions, soft edges
- Small τ (chaotic mood) → fast dynamics → sharp contrasts, dynamic compositions, high frequency detail
- This is analogous to how diffusion coefficients in physics control the speed of spreading
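The time-constant formula is a sigmoid-gated affine map, so each τ lies strictly between 0 and τ_base. A minimal sketch on plain lists, with made-up weights and a two-dimensional mood embedding chosen purely for illustration:

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mood_time_constants(mood_embedding, W_tau, b_tau, tau_base=1.0):
    """tau = tau_base * sigmoid(W_tau @ mood_embedding + b_tau), one per modulation dim."""
    taus = []
    for row, b in zip(W_tau, b_tau):
        pre = sum(w * m for w, m in zip(row, mood_embedding)) + b
        taus.append(tau_base * sigmoid(pre))
    return taus


# Hypothetical weights: the first output dimension responds positively to the
# first mood feature, the second responds negatively.
W_tau = [[2.0, 0.0], [-2.0, 0.0]]
b_tau = [0.0, 0.0]
taus = mood_time_constants([1.0, 0.0], W_tau, b_tau)
print([round(t, 3) for t in taus])
```

Dimensions with a large τ evolve slowly and dimensions with a small τ evolve quickly, which is how a single mood embedding can slow some modulation channels while speeding up others.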
### 8.2 Mood Modulation Injection
```
mood_signal = mood_embedding * (1/τ(m))  # Scaled by dynamics
  → Integrated into AdaLN: c_emb = t_emb + s_emb + mood_signal
```
The mood modulates the *rate* at which style and content evolve during denoising. Early steps (high noise) are dominated by composition; later steps (low noise) are dominated by details. The mood controller adjusts this balance:
- **Melancholic**: Slow detail emergence, emphasis on composition and negative space
- **Joyful**: Fast detail emergence, emphasis on bright colors and dynamic poses
- **Mysterious**: Asymmetric: fast in dark regions, slow in light regions
### 8.3 Philosophy of Image Understanding
The mood controller also encodes what we call "artistic philosophy":
- **Narrative intent**: Is this image telling a story? (learned from captioned illustration datasets)
- **Emotional depth**: How much emotional weight does this image carry?
- **Visual metaphor**: Does this image use visual metaphors? (learned from art-analysis datasets)
These are encoded as additional dimensions in the mood embedding, trained through:
1. Art-commentary datasets (descriptions of art that discuss mood, meaning, metaphor)
2. Emotion classification datasets (images + emotion labels)
3. Generated aesthetic score datasets (e.g., LAION aesthetic scores)
---
## 9. Module 6: Text Understanding (TinyTextEnc, ~67M params)
### 9.1 Architecture Choice
We use a **distilled CLIP-ViT-B/32 text encoder** (~63M params) or **TinyBERT** (~67M params):
- Small enough for mobile (134MB in fp16)
- Good text understanding for short prompts (anime tags + natural language)
- Can be further distilled or quantized to 4-bit (~34MB for 67M params) with minimal quality loss
### 9.2 Dual Prompt Format
Following Illustrious [arXiv:2409.19946]:
```
Format 1 (Tag-based):
"1girl, white hair, blue eyes, sword, standing, forest background, best quality"
Format 2 (Natural language):
"A girl with white hair and blue eyes standing in a forest, holding a sword"
Format 3 (Mixed):
"1girl, white hair, blue eyes | standing in a sunlit forest clearing, sword drawn"
```
The model handles tag-based prompts, natural-language prompts, and mixtures of the two because training alternates between tag-based (Danbooru style) and natural language (BLIP2 captions).
### 9.3 Quasi-Register Tokens (from Illustrious)
For concepts the model can't express through text alone, we use register tokens, special learnable tokens appended to the sequence to capture residual information:
```
text_emb = TextEncoder([prompt_tokens, REG_1, REG_2, ..., REG_8])
```
The 8 register tokens are free to encode whatever the text prompt doesn't cover (implicit style cues, quality signals, etc.).
---
## 10. Mathematical Foundations
### 10.1 Flow Matching Objective
We use rectified flow with v-prediction following SD3/FLUX:
```
Forward process:  x_t = (1-t) * x_0 + t * ε,   ε ~ N(0, I)
Velocity:         v = dx_t/dt = ε - x_0
Training loss:    L = E_{t,x_0,ε} [ ||v_θ(x_t, t, c) - v||² ]
```
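To make the objective concrete, here is a minimal pure-Python sketch of how one training pair could be built from the forward process and velocity definitions above. The helper names (`flow_matching_pair`, `squared_error`) and the toy 4-element "latent" are illustrative assumptions; a real implementation would operate on batched tensors.

```python
import random

def flow_matching_pair(x0, t, rng):
    """Return (x_t, v_target) for one sample: x_t = (1-t)*x0 + t*eps, v = eps - x0."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
    v_target = [e - a for a, e in zip(x0, eps)]
    return x_t, v_target

def squared_error(pred, target):
    """||pred - target||^2, the per-sample term inside the expectation."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0, 0.0]   # toy clean latent
t = 0.3
x_t, v_target = flow_matching_pair(x0, t, rng)

# A model that predicts the target velocity exactly has zero loss.
loss_perfect = squared_error(v_target, v_target)

# The velocity is constant along the straight path, so stepping from t back
# to 0 with the true velocity recovers the clean sample: x_0 = x_t - t * v.
x0_recovered = [a - t * b for a, b in zip(x_t, v_target)]
```

The last line is the same identity the Euler sampler relies on: integrating the predicted velocity from t=1 down to t=0 walks the latent from noise to data.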
Timestep sampling: **Logit-normal distribution** centered on t=0.5 (as in SD3/FLUX):
```
t ~ σ(μ + σ_ln * N(0,1))   where σ(·) is the sigmoid, μ=0, σ_ln=1
```
This concentrates training on the mid-noise range where learning is most effective.
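The sampler itself is only a few lines. The sketch below draws timesteps with the stated μ=0 and σ_ln=1 and measures how much mass lands in the middle of the range; the function name and the [0.25, 0.75] window are illustrative choices, not part of the design.

```python
import math
import random

def sample_logit_normal(rng, mu=0.0, sigma_ln=1.0):
    """Draw t in (0, 1) as sigmoid(mu + sigma_ln * z) with z ~ N(0, 1)."""
    z = rng.gauss(0.0, 1.0)
    return 1.0 / (1.0 + math.exp(-(mu + sigma_ln * z)))

rng = random.Random(0)
samples = [sample_logit_normal(rng) for _ in range(20_000)]
mean_t = sum(samples) / len(samples)
frac_mid = sum(0.25 <= t <= 0.75 for t in samples) / len(samples)
print(f"mean t = {mean_t:.3f}, fraction in [0.25, 0.75] = {frac_mid:.3f}")
```

In theory about 73% of draws fall in [0.25, 0.75] with these settings, compared with 50% under uniform sampling.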
### 10.2 Art-Aware Velocity Scaling (Novel)
Standard flow matching weighs all spatial locations equally. But for artistic images:
- **Lines and edges** (high-frequency) carry the most artistic identity
- **Color masses** (low-frequency) carry composition
- **Details** (mid-frequency) carry texture and style
We propose **Frequency-Weighted Flow Matching**:
```
L = E_{t,x_0,ε} [ Σ_b w_b * ||DWT_b(v_θ - v)||² ]
where b ∈ {LL, LH, HL, HH} are wavelet subbands and:
    w_LL = 1.0   (composition: standard weight)
    w_LH = 2.0   (horizontal lines: extra weight for art quality)
    w_HL = 2.0   (vertical lines: extra weight)
    w_HH = 1.5   (diagonal details: moderate extra weight)
```
This forces the model to pay more attention to getting line work right, which is crucial for illustration/anime quality.
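A small pure-Python sketch of the weighted loss follows, using a single-level orthonormal Haar transform on a 2D list. It is an illustration under simplifying assumptions (one channel, one decomposition level, no batching); `haar_dwt2` and `frequency_weighted_loss` are hypothetical names.

```python
def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT of a 2D list with even sides."""
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for i in range(0, len(x), 2):
        rows = {name: [] for name in bands}
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            rows["LL"].append((a + b + c + d) / 2)
            rows["LH"].append((a + b - c - d) / 2)
            rows["HL"].append((a - b + c - d) / 2)
            rows["HH"].append((a - b - c + d) / 2)
        for name in bands:
            bands[name].append(rows[name])
    return bands

BAND_WEIGHTS = {"LL": 1.0, "LH": 2.0, "HL": 2.0, "HH": 1.5}

def frequency_weighted_loss(v_pred, v_target, weights=BAND_WEIGHTS):
    """Sum over subbands b of w_b * ||DWT_b(v_pred - v_target)||^2."""
    diff = [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(v_pred, v_target)]
    bands = haar_dwt2(diff)
    return sum(
        weights[name] * sum(val ** 2 for row in band for val in row)
        for name, band in bands.items()
    )

zeros = [[0.0, 0.0], [0.0, 0.0]]
edge_error = [[1.0, 1.0], [0.0, 0.0]]   # a horizontal edge in the error
s = 0.5 ** 0.5
flat_error = [[s, s], [s, s]]           # same squared norm, but constant

# Both errors have squared norm 2; the edge error costs 3.0, the flat one about 2.0.
edge_loss = frequency_weighted_loss(edge_error, zeros)
flat_loss = frequency_weighted_loss(flat_error, zeros)
```

Because the Haar transform is orthonormal, setting every weight to 1.0 recovers the ordinary squared error, so the weights only redistribute emphasis between subbands.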
### 10.3 Recursive Latent Reasoning (RLR) Formulation
Within each denoising step, we perform R recursions:
```
Initialize: z_H^0 = x_t   (current noisy latent)
            z_L^0 = 0     (empty working memory)
For r = 1 to R:
    z_L^r = f_L(z_L^{r-1} + embed(x_t) + z_H^{r-1}; θ)   # Update working memory
    z_H^r = f_H(z_L^r + z_H^{r-1}; θ)                    # Update solution
Final: v_predicted = output_head(z_H^R)
```
where f_L and f_H **share parameters** (same WaveMamba blocks, different inputs). This is the TRM principle applied to denoising.
**Key insight**: z_L acts as a "reasoning scratchpad" that can encode things like "the sword should overlap the character's hand" or "the background trees should be darker than the foreground" without explicitly representing these as images. It's a latent chain-of-thought.
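The control flow of the recursion can be sketched in a few lines of pure Python. Everything concrete here is a stand-in: `shared_block` replaces the shared WaveMamba stack with an elementwise tanh layer, `embed()` is taken to be the identity, and the output head is omitted. Only the update order and the parameter sharing mirror the formulation above.

```python
import math

def shared_block(u, theta):
    """Stand-in for the shared WaveMamba stack: an elementwise tanh layer."""
    w, b = theta
    return [math.tanh(w * ui + b) for ui in u]

def add(*vectors):
    """Elementwise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def rlr_forward(x_t, theta, num_recursions=2):
    """Run R recursions of the (z_L, z_H) update; return every z_H^r."""
    embed_x = list(x_t)              # assumption: embed() is the identity
    z_h = list(x_t)                  # z_H^0 = x_t
    z_l = [0.0] * len(x_t)           # z_L^0 = 0
    history = []
    for _ in range(num_recursions):
        z_l = shared_block(add(z_l, embed_x, z_h), theta)   # f_L
        z_h = shared_block(add(z_l, z_h), theta)            # f_H, same theta
        history.append(z_h)
    return history                   # output_head would read history[-1]

states = rlr_forward([0.2, -0.4, 1.0], theta=(0.5, 0.0), num_recursions=3)
```

Returning every intermediate `z_H^r` rather than only the last one is deliberate: it is what allows each recursion to receive its own supervision signal during training.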
### 10.4 Deep Improvement Supervision for Training RLR
From [arXiv:2511.16886], we train each recursion step toward progressively less-corrupted targets:
```
For supervision step s ∈ {1, ..., S}:
    target_s = corrupt(ground_truth, noise_level = (S-s)/S)
    # Step s sees a target with noise_level decreasing from ~1 to ~0
    L_s = ||output_head(z_H^s) - target_s||²
```
This gives each recursion a concrete learning signal: "improve the current estimate by this much." Without this, only the final recursion gets gradient signal, and earlier recursions become dead compute.
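The target schedule can be illustrated as follows. The text does not define `corrupt`, so this sketch assumes the same linear blend toward noise used by the forward process, with one noise draw shared across all supervision steps; both choices are assumptions, as are the helper names.

```python
import random

def corrupt(ground_truth, noise_level, eps):
    """Assumed corruption: linear blend toward a fixed noise sample."""
    return [(1.0 - noise_level) * g + noise_level * e
            for g, e in zip(ground_truth, eps)]

def supervision_targets(ground_truth, num_steps, rng):
    """Targets for s = 1..S with noise_level = (S - s) / S."""
    eps = [rng.gauss(0.0, 1.0) for _ in ground_truth]
    levels = [(num_steps - s) / num_steps for s in range(1, num_steps + 1)]
    return levels, [corrupt(ground_truth, lv, eps) for lv in levels]

def dis_loss(step_outputs, targets):
    """Sum over recursion steps of the squared error to that step's target."""
    return sum(
        sum((o - t) ** 2 for o, t in zip(out, tgt))
        for out, tgt in zip(step_outputs, targets)
    )

rng = random.Random(0)
clean = [1.0, -2.0, 0.5]
levels, targets = supervision_targets(clean, num_steps=4, rng=rng)
print(levels)   # noise levels for s = 1..4
```

With S=4 the noise levels are 0.75, 0.5, 0.25 and 0.0, so the final recursion is supervised against the clean target while earlier ones only need to make partial progress.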
### 10.5 Mamba SSM Mathematics
The core State Space Model dynamics:
```
Continuous:  h'(t) = A·h(t) + B·x(t)
             y(t)  = C·h(t)
Discrete (ZOH):
    Ā = exp(Δ·A)
    B̄ = (Δ·A)^{-1} (exp(Δ·A) - I) · Δ·B
    h_t = Ā·h_{t-1} + B̄·x_t
    y_t = C·h_t
Selective Mamba (input-dependent):
    B_t = Linear(x_t)
    C_t = Linear(x_t)
    Δ_t = softplus(Linear(x_t))
```
**Complexity**: O(n) in sequence length (vs O(n²) for attention). With n=1024 (our latent size), the n²/n ratio is about 1000×, before constant factors such as the state dimension N are taken into account.
**Memory**: Hidden state h ∈ ℝ^{N×D} where N=state_dim (typically 16-64) and D=model_dim. Its size is constant regardless of sequence length, which suits mobile deployment.
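For a scalar state the zero-order-hold formulas reduce to two lines, which makes the linear-time recurrence easy to check. The sketch below is the non-selective case (fixed Δ, a, b, c); a selective layer would recompute Δ, B and C from each input. Function names are illustrative.

```python
import math

def zoh_discretize(a, b, delta):
    """Scalar zero-order hold for h' = a*h + b*x: returns (A_bar, B_bar)."""
    a_bar = math.exp(delta * a)
    # (delta*a)^-1 * (exp(delta*a) - 1) * delta*b simplifies to this:
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def ssm_scan(xs, a, b, c, delta):
    """One pass over the sequence: h_t = A_bar*h_{t-1} + B_bar*x_t, y_t = c*h_t."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x    # constant-size state, O(1) work per token
        ys.append(c * h)
    return ys

# Constant input: the output should settle at c * (-b / a) = 1.0.
ys = ssm_scan([1.0] * 200, a=-1.0, b=2.0, c=0.5, delta=0.1)
```

The loop touches each token once and carries only `h` between steps, which is where both the O(n) cost and the sequence-length-independent memory come from.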
### 10.6 Wavelet-Based Multi-Resolution Analysis
2D Discrete Wavelet Transform with Haar wavelets (simplest, no parameters):
```
LL = (x[::2,::2] + x[::2,1::2] + x[1::2,::2] + x[1::2,1::2]) / 2
LH = (x[::2,::2] + x[::2,1::2] - x[1::2,::2] - x[1::2,1::2]) / 2
HL = (x[::2,::2] - x[::2,1::2] + x[1::2,::2] - x[1::2,1::2]) / 2
HH = (x[::2,::2] - x[::2,1::2] - x[1::2,::2] + x[1::2,1::2]) / 2
```
This is O(n) and fully differentiable. Inverse is equally simple.
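The forward formulas and their inverse can be written out directly. This is a pure-Python sketch on nested lists for readability (a tensor implementation would use the strided slices shown above); the inverse is derived from the four forward equations and is not spelled out in the text.

```python
def haar_dwt2(x):
    """Single-level 2D Haar DWT; x is a 2D list with even height and width."""
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(x), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)
            r_lh.append((a + b - c - d) / 2)
            r_hl.append((a - b + c - d) / 2)
            r_hh.append((a - b - c + d) / 2)
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: rebuild each 2x2 block from its four coefficients."""
    out = []
    for r_ll, r_lh, r_hl, r_hh in zip(ll, lh, hl, hh):
        top, bottom = [], []
        for p, q, r, s in zip(r_ll, r_lh, r_hl, r_hh):
            top += [(p + q + r + s) / 2, (p + q - r - s) / 2]
            bottom += [(p - q + r - s) / 2, (p - q - r + s) / 2]
        out += [top, bottom]
    return out

image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
ll, lh, hl, hh = haar_dwt2(image)
restored = haar_idwt2(ll, lh, hl, hh)
```

With the division by 2 the transform is orthonormal, so it preserves the squared norm of the input as well as being exactly invertible.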
---
## 11. Training Pipeline
### Stage 0: Pretrain VAE (Skip: use existing)
We use **pretrained DC-AE f32** from MIT Han Lab. Frozen during all subsequent training.
Alternative: Use SD3 VAE (f8, 16 channels) if DC-AE f32 isn't available. This gives a larger 128×128 latent at 1024px, but the VAE is well-tested.
### Stage 1: Base Generation Training (~100K steps)
**Goal**: Learn basic denoising (noise → latent image) without style/mood modules.
**Config**:
- Dataset: ~10M image-text pairs (filtered for illustration/anime quality)
- Resolution: 256px (8×8 latent with f32, or 32×32 with f8)
- Batch size: 256
- Learning rate: 1e-4 with cosine annealing
- Optimizer: AdamW (β1=0.9, β2=0.99, wd=0.01)
- Loss: MSE velocity prediction (standard flow matching)
- No RLR recursion yet (R=1)
- No style/mood modulation yet (set to zero)
- AMP training (fp16/bf16)
**Stability techniques**:
- QK RMSNorm in all attention layers (prevents softmax saturation)
- Zero-initialized output projections in AdaLN (α starts near 0)
- Gradient clipping at 1.0
- EMA with decay 0.9999
**Freezing**: Text encoder frozen. DC-AE frozen. Only WaveMamba backbone trains.
**Hardware**: Single A100 80GB or 4× A10G 24GB. ~3-5 days.
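Two of the Stage 1 settings, the cosine-annealed learning rate and the EMA of the weights, are simple enough to sketch without a framework. The minimum learning rate of 0 and the absence of warmup are assumptions, since the config above does not specify them; the function names are illustrative.

```python
import math

def cosine_lr(step, total_steps=100_000, base_lr=1e-4, min_lr=0.0):
    """Cosine annealing from base_lr to min_lr (no warmup; min_lr is assumed)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

for step in (0, 25_000, 50_000, 75_000, 100_000):
    print(f"step {step:>7}: lr = {cosine_lr(step):.2e}")
```

A decay of 0.9999 gives the EMA an effective averaging window of roughly 1 / (1 - 0.9999) = 10,000 steps, so EMA weights only become representative well into the 100K-step run.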
### Stage 2: Style Matrix Training (~50K steps)
**Goal**: Learn the ArtStyle Matrix to disentangle styles.
**Config**:
- Dataset: Same as Stage 1 + artist/style labels
- Resolution: 256px → 512px (progressive)
- Unfreeze: ArtStyle Matrix + style modulation networks
- Keep frozen: WaveMamba backbone (trained in Stage 1)
- Loss: Standard flow matching + style consistency loss
**Style Consistency Loss**:
```
L_style = -cos_sim(CLIP_style(generated), CLIP_style(reference_of_same_style))
```
After 25K steps, unfreeze backbone for joint fine-tuning at lower LR (1e-5).
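The style consistency loss above is a negated cosine similarity between two embeddings. In this sketch plain Python lists stand in for the `CLIP_style` features, which is an assumption about their form; the vectors are also assumed to be non-zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def style_consistency_loss(generated_emb, reference_emb):
    """L_style = -cos_sim(generated, reference); lower is better."""
    return -cosine_similarity(generated_emb, reference_emb)

same_style = style_consistency_loss([3.0, 4.0], [6.0, 8.0])     # aligned
unrelated = style_consistency_loss([1.0, 0.0], [0.0, 1.0])      # orthogonal
opposite = style_consistency_loss([1.0, 2.0], [-1.0, -2.0])     # anti-aligned
```

The loss ranges from -1 (embeddings point the same way) to +1 (opposite directions) and ignores embedding magnitude, so it constrains style direction rather than feature scale.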
### Stage 3: Resolution & Quality Scaling (~50K steps)
**Goal**: Scale to 1024px with high visual quality.
**Config**:
- Resolution: 512px → 768px → 1024px (progressive over training)
- Unfreeze: Everything except text encoder and DC-AE
- Enable RLR recursion (R=2)
- Enable Art-Aware Velocity Scaling loss
- Loss: Frequency-weighted flow matching
- Batch size: 64 (smaller due to resolution)
**Progressive resolution** spares the model from learning multi-resolution generation from scratch: it extends its capability one resolution at a time.
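One possible form of the Stage 3 schedule is a piecewise-constant lookup. Splitting the ~50K steps into equal-length phases is an assumption (the text only gives the resolution order), and both function names are hypothetical.

```python
def resolution_at(step, total_steps=50_000, resolutions=(512, 768, 1024)):
    """Piecewise-constant schedule; equal-length phases are an assumption."""
    phase_len = total_steps / len(resolutions)
    index = min(int(step // phase_len), len(resolutions) - 1)
    return resolutions[index]

def latent_side(resolution, downsample=32):
    """Latent side length for an f32 autoencoder such as DC-AE f32."""
    return resolution // downsample

for step in (0, 20_000, 40_000):
    res = resolution_at(step)
    print(f"step {step:>6}: {res}px image, {latent_side(res)}x{latent_side(res)} latent")
```

With an f32 autoencoder the three phases correspond to 16×16, 24×24 and 32×32 latents, so the token count grows from 256 to 1024 over the stage.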
### Stage 4: Reasoning & Concept Training (~30K steps)
**Goal**: Train the Concept Reasoning Engine and Mood Controller.
**Config**:
- Unfreeze: CRE + Mood Controller
- Freeze: Everything else
- Loss: Standard + spatial layout guidance loss + mood classification loss
- Datasets: Caption-enriched illustrations with mood/concept annotations
After 15K steps, unfreeze all for joint fine-tuning (1e-6 LR).
### Stage 5: Quality Post-Training (SFT + RL, ~10K steps)
**Goal**: Align model with human aesthetic preferences.
**Config**:
- Curated high-quality dataset (~100K best illustrations)
- Loss: Flow matching + ImageReward score maximization
- Step distillation: Train 4-step consistency model from the multi-step base
Following DreamLite's post-training recipe: SFT on curated data → RL with ImageReward → Step distillation.
### Training Stability Summary
| Technique | Purpose | Stage |
|-----------|---------|-------|
| QK RMSNorm | Prevent attention collapse | All |
| Zero-init AdaLN gates | Stable initialization | All |
| Gradient clipping (1.0) | Prevent explosion | All |
| EMA (0.9999) | Smooth training | All |
| Cosine annealing LR | Controlled convergence | All |
| Progressive resolution | Avoid resolution shock | Stage 3 |
| Modular freeze/unfreeze | Stable staged training | All |
| Logit-normal timestep | Focus on informative t | All |
| Frequency-weighted loss | Art-quality emphasis | Stage 3+ |
| Deep Improvement Supervision | Train RLR recursions | Stage 3+ |
### Colab/Kaggle Feasibility
Stage 1 can be trained on **Kaggle P100** (16GB) or **Colab T4** (15GB):
- Batch size 4 with gradient accumulation 64 = effective batch 256
- Mixed precision (fp16)
- Gradient checkpointing
- 256px resolution
- ~3-5 hours per 10K steps on T4
Total training budget for a proof-of-concept (Stages 1-3 at reduced scale):
- Dataset: 1M images (subset)
- Resolution: up to 512px
- **~48-72 hours on Kaggle** (need to use multiple sessions)
---
## 12. Datasets & Data Strategy
### 12.1 Primary Datasets (Freely Available)
| Dataset | Size | Purpose | Stage |
|---------|------|---------|-------|
| Danbooru2023 | ~6M | Anime/illustration, tag-based | All |
| Pixiv Fanbox (filtered) | ~2M | High-quality illustration | Stage 3+ |
| ArtBench | 60K | Style classification | Stage 2 |
| WikiArt | 80K | Art style diversity | Stage 2 |
| LAION-Aesthetic V2 (≥6.5) | ~600K | High aesthetic quality | Stage 1 |
| JourneyDB | ~4M | High-quality AI-assisted | Stage 1 |
| Sakuga-42M | ~42M clips | Anime understanding | Stage 4 |
| Emotion/Mood datasets | ~100K | Mood controller training | Stage 4 |
### 12.2 Illustration-Specific Data Preprocessing
Following Illustrious [arXiv:2409.19946]:
1. **Tag ordering**: person_count | character_names | rating | general_tags | artist | quality_score | year_modifier
2. **Quality scoring**: Percentile-based (worst → masterpiece scale)
3. **No dropout on critical tokens** (to prevent unwanted content generation)
4. **Quasi-register tokens** for unknown concepts
5. **Mixed tag + natural language** captions
6. **Resolution filtering**: Min 768×768, max aspect ratio 1:3
7. **Aesthetic scoring**: Filter with CLIP aesthetic predictor + hand-tuned thresholds
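Two of the preprocessing steps above, tag ordering and resolution filtering, can be sketched as small helpers. The comma separator, the field keys, and the reading of "Min 768×768" as "both sides at least 768 px" are assumptions; the function names are hypothetical.

```python
TAG_ORDER = ("person_count", "character_names", "rating", "general_tags",
             "artist", "quality_score", "year_modifier")

def build_tag_prompt(fields):
    """Join tag groups in the fixed order; missing or empty groups are skipped."""
    parts = []
    for key in TAG_ORDER:
        value = fields.get(key)
        if not value:
            continue
        parts.extend(value if isinstance(value, (list, tuple)) else [value])
    return ", ".join(parts)

def passes_resolution_filter(width, height, min_side=768, max_ratio=3.0):
    """Keep images with both sides >= min_side and aspect ratio within 1:3."""
    if min(width, height) < min_side:
        return False
    return max(width, height) / min(width, height) <= max_ratio

prompt = build_tag_prompt({
    "general_tags": ["white hair", "sword"],
    "person_count": "1girl",
    "quality_score": "best quality",
})
print(prompt)
```

Fixing the group order at preprocessing time means the text encoder always sees subject count first and quality modifiers last, regardless of how the source metadata was ordered.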
### 12.3 Art Style Dataset Construction
For the ArtStyle Matrix (Stage 2):
1. Cluster Danbooru by artist tags → ~5000 distinct artists
2. Select top 256 artists with most images (>500 each)
3. Each artist = one style vector in S
4. Additional synthetic styles from interpolation
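Steps 2 and 3 amount to a frequency count followed by an index assignment. The sketch below assumes the artist tag of each image is available as a flat list of strings; the helper names are illustrative and the example uses scaled-down thresholds.

```python
from collections import Counter

def select_style_artists(artist_per_image, max_styles=256, min_images=500):
    """Up to max_styles artists with more than min_images images, most prolific first."""
    counts = Counter(artist_per_image)
    eligible = [name for name, n in counts.most_common() if n > min_images]
    return eligible[:max_styles]

def build_style_index(artists):
    """Map each selected artist to a row index of the style matrix S."""
    return {name: row for row, name in enumerate(artists)}

# Toy data with small thresholds so the behaviour is easy to follow.
labels = ["artist_a"] * 5 + ["artist_b"] * 3 + ["artist_c"] * 1
selected = select_style_artists(labels, max_styles=2, min_images=2)
style_index = build_style_index(selected)
```

The resulting index is what a `style_id` would refer to at inference time, so it needs to be stored alongside the trained style matrix.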
### 12.4 Concept & Mood Annotation Pipeline
For CRE and Mood Controller (Stage 4):
1. Use existing VLM (e.g., InternVL2 or LLaVA) to generate:
   - Object/character descriptions
   - Spatial relationship descriptions
   - Mood/emotion labels
   - Scene type classifications
2. Filter and clean with rule-based heuristics
3. This creates a pseudo-labeled dataset for concept/mood training without manual annotation
---
## 13. Inference Pipeline
### 13.1 Standard Generation (4-8 steps)
```python
import torch

# Assumes text_encoder, tokenize, art_style_matrix, mood_controller, model,
# dc_ae_decoder, null_text, default_style and neutral_mood are already loaded.
@torch.no_grad()
def generate(prompt, style_id=None, mood=None, steps=8, cfg_scale=4.0):
    # 1. Encode text
    text_emb = text_encoder(tokenize(prompt))
    # 2. Get style modulation
    if style_id is not None:
        style_mod = art_style_matrix[style_id]
    else:
        style_mod = default_style  # or zero
    # 3. Get mood dynamics
    if mood is not None:
        mood_dyn = mood_controller(mood)
    else:
        mood_dyn = neutral_mood
    # 4. Sample noise
    z_t = torch.randn(1, 32, 32, 32)  # DC-AE f32 latent
    # 5. Flow matching denoising, integrating from t=1 (noise) to t=0 (image)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        # Classifier-free guidance
        v_cond = model(z_t, t, text_emb, style_mod, mood_dyn)
        v_uncond = model(z_t, t, null_text, style_mod, mood_dyn)
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        # Euler step
        z_t = z_t - v * dt
    # 6. Decode
    image = dc_ae_decoder(z_t)  # 1024×1024×3
    return image
```
### 13.2 Memory During Inference
```
Text encoder:       ~134 MB (fp16)
WaveMamba:          ~500 MB (fp16)
ArtStyle Matrix:    ~10 MB
Mood Controller:    ~4 MB
DC-AE Decoder:      ~80 MB (or ~3 MB tiny decoder)
Latent tensor:      ~0.1 MB (32×32×32 × 2 bytes)
Activations:        ~200 MB (peak, during forward pass)
─────────────────────────
Total:              ~928 MB (with tiny decoder: ~851 MB)
```
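**Under 1 GB in total**, including the ~200 MB activation estimate (weights alone are ~728 MB). If activations peak higher than estimated, expect ~1.1-1.5 GB.
With INT8 quantization of the backbone (~500 MB → ~250 MB): ~680 MB total. **Well within a 2-4 GB mobile budget.**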
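The budget is easy to re-derive from the component estimates listed in the table. The sketch below sums them and recomputes the variants; the 250 MB INT8 backbone figure assumes quantization simply halves the fp16 size, and `total_mb` is a hypothetical helper.

```python
COMPONENTS_MB = {
    "text_encoder": 134.0,
    "wavemamba": 500.0,
    "artstyle_matrix": 10.0,
    "mood_controller": 4.0,
    "dc_ae_decoder": 80.0,
    "latent": 0.1,
    "activations": 200.0,
}

def total_mb(components, **overrides):
    """Sum the budget, optionally overriding individual entries (in MB)."""
    return sum({**components, **overrides}.values())

full = total_mb(COMPONENTS_MB)
tiny_decoder = total_mb(COMPONENTS_MB, dc_ae_decoder=3.0)
int8_backbone = total_mb(COMPONENTS_MB, wavemamba=250.0)   # assumed: half of fp16
weights_only = full - COMPONENTS_MB["activations"] - COMPONENTS_MB["latent"]

print(f"full: {full:.0f} MB, tiny decoder: {tiny_decoder:.0f} MB")
print(f"INT8 backbone: {int8_backbone:.0f} MB, weights only: {weights_only:.0f} MB")
```

Keeping the estimates in one table like this makes it straightforward to see which component dominates (the backbone) and how much each optimization actually buys.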
### 13.3 Inference Speed Estimate
On a modern mobile GPU (Adreno 730 / Apple A16):
- 32×32 latent → 1024 tokens
- Mamba: O(1024) per block
- ~50 WaveMamba blocks total
- 8 denoising steps × R=2 recursions = 16 backbone evaluations; with CFG (×2) = 32 forward passes
Estimated: **1-3 seconds on flagship mobile** (comparable to MobileDiffusion/SnapGen)
### 13.4 Future: Image Editing
The architecture naturally supports editing because:
1. **Inpainting**: Mask regions in the latent → denoise only masked regions
2. **Style transfer**: Change style_mod mid-generation
3. **Mood editing**: Change mood_dyn to alter atmosphere
4. **Prompt editing**: Change text_emb at different denoising steps
5. **Super-resolution**: Use the decoder at higher resolution with a fine-tuned upsampler
Following DreamLite's approach, we can add editing support by:
- Concatenating source image latent with target latent (in-context conditioning)
- Fine-tuning with editing pairs
- No architecture change needed, just an additional training stage
---
## 14. Memory & Compute Analysis
### 14.1 FLOPs per Denoising Step
| Component | Spatial Size | FLOPs (per step) |
|-----------|-------------|-----------------|
| Stage 1 SepConv (×2) | 32×32 | ~0.5 GFLOPs |
| Stage 2 WaveMamba (×2) | 16×16 | ~1.0 GFLOPs |
| Stage 3 WaveMamba (×2) | 8×8 | ~0.5 GFLOPs |
| Bottleneck WaveMamba (×4) | 8×8 | ~1.0 GFLOPs |
| Cross-Attention (all stages) | various | ~0.3 GFLOPs |
| RLR Recursion overhead (R=2) | 8×8 | ~1.0 GFLOPs |
| **Total per step** | | **~4.3 GFLOPs** |
**Per image (8 steps, CFG)**: ~69 GFLOPs
Compare: SDXL at ~600 GFLOPs per step is ~30,000 GFLOPs total over 50 steps, so ArtFlow needs **~430× fewer FLOPs** per image.
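The per-image figure and the comparison ratio follow directly from the table. This sketch reproduces the arithmetic; the component values and the SDXL total are the estimates quoted in this section, not measurements.

```python
PER_STEP_GFLOPS = {
    "stage1_sepconv": 0.5,
    "stage2_wavemamba": 1.0,
    "stage3_wavemamba": 0.5,
    "bottleneck_wavemamba": 1.0,
    "cross_attention": 0.3,
    "rlr_overhead": 1.0,
}

def per_image_gflops(per_step, steps=8, cfg_passes=2):
    """Cost of one image: per-step cost x denoising steps x CFG passes."""
    return sum(per_step.values()) * steps * cfg_passes

per_step = sum(PER_STEP_GFLOPS.values())
ours = per_image_gflops(PER_STEP_GFLOPS)
SDXL_TOTAL_GFLOPS = 30_000          # figure quoted in the text
ratio = SDXL_TOTAL_GFLOPS / ours

print(f"per step: {per_step:.1f} GFLOPs, per image: {ours:.1f} GFLOPs, ratio: {ratio:.0f}x")
```

Most of the gap comes from the per-step cost (4.3 vs ~600 GFLOPs); the shorter 8-step schedule contributes the rest.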
### 14.2 Attention Complexity Comparison
| Method | Complexity | At 1024 tokens | At 16384 tokens (SD) |
|--------|-----------|----------------|---------------------|
| Self-Attention | O(n²d) | 1× | 256× |
| Mamba SSM | O(nd) | 1× | 16× |
| Our WaveMamba | O(n/4 × d) × 4 | 1× | 16× |
Each row is normalized to that method's own cost at 1024 tokens. WaveMamba processes 4 subbands, each of length n/4, so its total work is O(nd), the same as Mamba, but with frequency awareness.
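The scaling factors in the table come from the complexity exponents alone. A minimal sketch, holding the model dimension d fixed and ignoring constant factors (the function name is illustrative):

```python
def relative_cost(n_tokens, base_tokens=1024, kind="attention"):
    """Cost at n_tokens relative to the same method at base_tokens (d fixed)."""
    scale = n_tokens / base_tokens
    if kind == "attention":              # O(n^2 d)
        return scale ** 2
    if kind in ("mamba", "wavemamba"):   # O(n d); WaveMamba: 4 subbands of n/4
        return scale
    raise ValueError(f"unknown kind: {kind}")

for kind in ("attention", "mamba", "wavemamba"):
    print(f"{kind:>10}: {relative_cost(16384, kind=kind):.0f}x at 16384 tokens")
```

Going from 1024 to 16384 tokens is a 16× increase in sequence length, which attention squares to 256× while the linear-time methods stay at 16×.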
### 14.3 Mobile Deployment Considerations
1. **Quantization-friendly**: SiLU activations (not GELU), no complex operations
2. **No self-attention**: Eliminates the most VRAM-hungry operation
3. **Constant memory Mamba**: SSM state is fixed-size regardless of image resolution
4. **Tiny latent space**: 32×32 vs 128×128 = 16× less memory for activations
5. **Separable convolutions**: Efficient on mobile NPUs
---
## 15. Comparison with Existing Models
| Feature | SDXL | FLUX | MobileDiffusion | SnapGen | **ArtFlow** |
|---------|------|------|-----------------|---------|-------------|
| Params (backbone) | 2.6B | 12B | 400M | 372M | **250M** |
| Total params | ~6B | ~24B | ~500M | ~500M | **379M** |
| Latent size (1024px) | 128² | 128² | 64² | 128² | **32²** |
| Attention type | Self+Cross | Full | SA bottleneck | MQA | **Mamba (O(n))** |
| Native reasoning | ❌ | ❌ | ❌ | ❌ | **✅ (RLR)** |
| Style control | LoRA/fine-tune | LoRA | LoRA | - | **Native matrix** |
| Mood control | Prompt only | Prompt only | Prompt only | - | **Native module** |
| Art-focused | ❌ | ❌ | ❌ | ❌ | **✅ by design** |
| Mobile ready | ❌ | ❌ | ✅ | ✅ | **✅** |
| Training: Colab feasible | ❌ | ❌ | ❌ | ❌ | **✅ (staged)** |
| Editing support | Via separate model | Via fine-tune | ❌ | ❌ | **Native** |
| Peak RAM (1024px, fp16) | ~8GB | ~24GB | ~1.5GB | ~1.2GB | **~1.0GB** |
### Novel Contributions Summary
1. **WaveMamba**: First wavelet-decomposed Mamba denoising backbone in a UNet topology
2. **Recursive Latent Reasoning for images**: First application of TRM/HRM reasoning to image generation
3. **ArtStyle Matrix**: Explicit, manipulable style space for illustration generation
4. **Liquid-dynamics Mood Control**: Physics-inspired mood modulation using adaptive time constants
5. **Art-Aware Velocity Scaling**: Frequency-weighted flow matching loss for artistic quality
6. **Deep Improvement Supervision for denoising**: Training recursion steps with progressively cleaner targets
7. **KAN-based Composition**: Kolmogorov-Arnold Networks for learning smooth compositional rules
---
## Appendix A: Key Paper References
1. MobileDiffusion [arXiv:2311.16567] - Mobile architecture optimization
2. SnapGen [arXiv:2412.09619] - Efficient UNet + knowledge distillation
3. DreamLite [arXiv:2603.28713] - Unified on-device gen+edit
4. ZigMa [arXiv:2403.13802] - Mamba for diffusion with zigzag scan
5. DiMSUM [arXiv:2411.04168] - Wavelet + Mamba for diffusion
6. DC-AE [arXiv:2410.10733] - Deep compression autoencoder f32/f64
7. TRM/DIS [arXiv:2511.16886] - Recursive reasoning as policy improvement
8. Liquid Neural Networks [arXiv:2006.04439] - Adaptive ODE dynamics
9. RWKV-7 [arXiv:2503.14456] - Linear-complexity language model
10. KAN [arXiv:2404.19756] - Kolmogorov-Arnold Networks
11. Illustrious [arXiv:2409.19946] - Anime-focused training methodology
12. Rectified Flow++ [arXiv:2405.20320] - Improved flow matching training
13. Stable Velocity [arXiv:2602.05435] - Variance reduction in flow matching
14. USO [arXiv:2508.18966] - Disentangled style+subject generation
15. Vision Mamba [arXiv:2401.09417] - Bidirectional Mamba for vision
---
*ArtFlow Architecture v1.0: designed from research synthesis across 40+ papers spanning efficient architectures, state space models, latent reasoning, liquid neural networks, wavelet processing, and artistic style learning.*