# 🎨 ArtFlow v2: Reasoning-Native Artistic Image Generation for Mobile Devices

**Version 2.0**: Real Mamba SSM backbone, real dataset support

**Target:** 2-4GB RAM, 1024px native, anime/illustration focus

## ⚡ What's New in v2

### 🐍 Real Mamba SSM (fixes `torch._utils` error)

- **Pure PyTorch implementation**: no `mamba-ssm` or `causal-conv1d` CUDA packages needed
- Implements the exact Mamba-1 selective scan algorithm (arXiv:2312.00752)
- **Style-modulated dt_bias**: art style directly modulates SSM selectivity per channel
- **AdaLN-Zero conditioning**: DiT-style zero-initialized conditioning on every Mamba block
- Works on CPU, CUDA, and mobile; no CUDA extension compilation needed

### 🖼️ Real Dataset Support

- **WikiArt** (80K paintings, 27 styles): `huggan/wikiart`
- **Teyvat** (anime illustrations with structured captions): `Fazzie/Teyvat`
- **Pokemon** (GPT-4 captioned illustrations): `diffusers/pokemon-gpt4-captions`
- **Danbooru2023** (6M+ anime images): `KBlueLeaf/danbooru2023-webp-4Mpixel`
- Auto-detects image/text/style columns from any HF dataset

### 🔧 Bug Fixes

- Fixed `AttributeError: module 'torch' has no attribute '_utils'`, caused by a mamba-ssm CUDA version mismatch
- Fixed batch dimension broadcasting when style_ids/mood_ids are None
- Proper handling of (1, d) vs (B, d) conditioning tensors in WaveMamba blocks

## Quick Start (Colab / Kaggle)

```python
# Install (no CUDA extensions needed!)
!pip install torch torchvision huggingface_hub datasets

# Download
from huggingface_hub import hf_hub_download
import shutil

for f in ['artflow_model.py', 'artflow_train.py']:
    shutil.copy(hf_hub_download('krystv/ArtFlow', f), f'./{f}')

# Train with real data
from artflow_model import ArtFlow, ArtFlowConfig
from artflow_train import TrainConfig, RealArtDataset, freeze_for_stage, train
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
config = ArtFlowConfig()
model = ArtFlow(config).to(device)
model = freeze_for_stage(model, 1)

# Use real WikiArt dataset!
dataset = RealArtDataset("huggan/wikiart", config=config, max_samples=5000)
tcfg = TrainConfig(lr=1e-4, batch_size=2, grad_accum=32,
                   num_steps=10000, warmup_steps=500, stage=1)
engine = train(model, config, tcfg, dataset, device)
```

## Validated Results

```
📊 104.5M params (backbone only)
💾 209 MB fp16 / 104.5 MB int8
📱 ~235 MB peak inference, fits mobile
✅ Forward/backward: no NaN, no Inf
✅ 30-step training: stable loss, no oscillation
✅ Real Mamba SSM selective scan (pure PyTorch)
🐍 No mamba-ssm package needed!
```

## Architecture: 8 Novel Contributions

1. **WaveMamba**: Wavelet × Real Mamba SSM denoising (O(n) complexity)
2. **Style-Modulated SSM**: Art style directly controls Mamba's dt_bias (selectivity)
3. **Recursive Latent Reasoning**: TRM-style "thinking" inside denoising steps
4. **ArtStyle Matrix**: Continuous style vectors, interpolatable
5. **Liquid-Dynamics Mood**: Physics-inspired atmosphere control
6. **Art-Aware Velocity Loss**: Frequency-weighted flow matching
7. **Deep Improvement Supervision**: Progressive recursion targets
8. **KAN Composition**: Smooth compositional rules via B-splines

## Real Datasets for Training

| Dataset | Size | Purpose | Stage |
|---------|------|---------|-------|
| [huggan/wikiart](https://hf.co/datasets/huggan/wikiart) | 80K | Art style diversity | 1-2 |
| [Fazzie/Teyvat](https://hf.co/datasets/Fazzie/Teyvat) | 446MB | Anime + structured concepts | 1-4 |
| [diffusers/pokemon-gpt4-captions](https://hf.co/datasets/diffusers/pokemon-gpt4-captions) | 49MB | Anime + NL captions | 1 |
| [KBlueLeaf/danbooru2023-webp-4Mpixel](https://hf.co/datasets/KBlueLeaf/danbooru2023-webp-4Mpixel) | 1.5TB | Full anime training | All |
| [Artificio/WikiArt](https://hf.co/datasets/Artificio/WikiArt) | 1.6GB | 27 styles + NL descriptions | 2 |

## 5-Stage Pipeline

```
Stage 1: Backbone learns denoising       (50K steps, lr=1e-4) ← freeze style/mood/concept
Stage 2: Style matrix disentanglement    (25K steps, lr=5e-5) ← freeze mood/concept
Stage 3: Resolution scaling + reasoning  (25K steps, lr=3e-5) ← freeze mood/concept
Stage 4: Concept & mood understanding    (15K steps, lr=2e-5) ← freeze backbone
Stage 5: Quality alignment               (5K steps,  lr=1e-5) ← all trainable
```

## Research Papers

- Mamba-1 selective scan: arXiv:2312.00752
- Mamba-2 SSD: arXiv:2405.21060
- ZigMa zigzag scan: arXiv:2403.13802
- DiMSUM wavelet+Mamba: arXiv:2411.04168
- DiT AdaLN-Zero: arXiv:2212.09748
- TRM recursive reasoning: arXiv:2511.16886
- SnapGen MQA: arXiv:2412.09619
- DC-AE latent compression: arXiv:2410.10733
- Min-SNR-γ: arXiv:2303.09556
- Pseudo-Huber loss: arXiv:2403.16728
- Illustrious training: arXiv:2409.19946

## License

MIT
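
## Appendix: Stage Schedule Sketch

The 5-stage pipeline listed earlier can be written down as plain data, which makes the freeze schedule and step budget easy to check. This is a minimal sketch, not the API of `artflow_train.py`: `Stage`, `STAGES`, `trainable_groups`, `total_steps`, and `effective_batch` are hypothetical names, and the group labels `backbone`, `style`, `mood`, and `concept` are assumptions read off the pipeline description. It also assumes that stage 4 freezes only the backbone and that the effective batch is `batch_size * grad_accum`.

```python
from dataclasses import dataclass

# Hypothetical parameter-group labels, read off the pipeline description.
ALL_GROUPS = frozenset({"backbone", "style", "mood", "concept"})


@dataclass(frozen=True)
class Stage:
    number: int
    goal: str
    steps: int
    lr: float
    frozen: frozenset  # groups that receive no gradient updates in this stage


# Steps, learning rates, and frozen groups copied from the 5-stage listing.
STAGES = (
    Stage(1, "Backbone learns denoising", 50_000, 1e-4,
          frozenset({"style", "mood", "concept"})),
    Stage(2, "Style matrix disentanglement", 25_000, 5e-5,
          frozenset({"mood", "concept"})),
    Stage(3, "Resolution scaling + reasoning", 25_000, 3e-5,
          frozenset({"mood", "concept"})),
    Stage(4, "Concept & mood understanding", 15_000, 2e-5,
          frozenset({"backbone"})),
    Stage(5, "Quality alignment", 5_000, 1e-5, frozenset()),
)


def trainable_groups(stage: Stage) -> frozenset:
    """Groups left trainable once the stage's frozen groups are removed."""
    return ALL_GROUPS - stage.frozen


def total_steps(stages=STAGES) -> int:
    """Total optimizer steps across all stages."""
    return sum(s.steps for s in stages)


def effective_batch(batch_size: int, grad_accum: int) -> int:
    """Samples contributing to one optimizer step under gradient accumulation."""
    return batch_size * grad_accum


if __name__ == "__main__":
    for s in STAGES:
        groups = sorted(trainable_groups(s))
        print(f"Stage {s.number}: {s.steps:>6} steps, lr={s.lr:g}, trainable={groups}")
    print("total steps:", total_steps())
    print("effective batch (Quick Start):", effective_batch(2, 32))
```

Under these assumptions the schedule sums to 120K optimizer steps, and the Quick Start settings (`batch_size=2`, `grad_accum=32`) give an effective batch of 64 samples per optimizer step.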