# 🎨 ArtFlow v2: Reasoning-Native Artistic Image Generation for Mobile Devices

**Version 2.0**: Real Mamba SSM backbone, real dataset support
**Target:** 2-4GB RAM, 1024px native, anime/illustration focus

## ⚡ What's New in v2

### Real Mamba SSM (fixes `torch._utils` error)
- **Pure PyTorch implementation**: no `mamba-ssm` or `causal-conv1d` CUDA packages needed
- Implements the exact Mamba-1 selective scan algorithm (arXiv:2312.00752)
- **Style-modulated dt_bias**: art style directly modulates SSM selectivity per channel
- **AdaLN-Zero conditioning**: DiT-style zero-initialized conditioning on every Mamba block
- Works on CPU, CUDA, and mobile, with no CUDA extension compilation needed

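
The recurrence behind the bullets above can be sketched in plain Python for a single channel. This is an illustrative, unvectorized version and not the code in `artflow_model.py`: the function names are made up for this example, and `style_shift` is a hypothetical stand-in for the style-dependent offset added to `dt_bias` (how ArtFlow actually computes that offset is not described here). It uses the discretization from the Mamba-1 reference scan: `exp(dt * A)` for the state matrix and `dt * B` for the input projection.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def selective_scan_1ch(x, dt_raw, A, B, C, D=0.0, dt_bias=0.0, style_shift=0.0):
    """Sequential selective scan for ONE channel with an N-dimensional state.

    x, dt_raw : length-L lists (input and pre-activation step size per token)
    A         : length-N list of negative reals (diagonal state matrix)
    B, C      : length-L lists of length-N lists (input-dependent projections)
    style_shift is a hypothetical per-channel offset added to dt_bias.
    """
    h = [0.0] * len(A)
    ys = []
    for t, x_t in enumerate(x):
        # Step size: larger dt means the state forgets faster and takes in more input.
        dt = softplus(dt_raw[t] + dt_bias + style_shift)
        for i, a_i in enumerate(A):
            a_bar = math.exp(dt * a_i)   # discretized state transition
            b_bar = dt * B[t][i]         # simplified discretization of B
            h[i] = a_bar * h[i] + b_bar * x_t
        ys.append(sum(c * h_i for c, h_i in zip(C[t], h)) + D * x_t)
    return ys

# Impulse response with and without a positive style shift.
L = 4
x = [1.0, 0.0, 0.0, 0.0]
A = [-1.0, -2.0]
B = [[1.0, 1.0]] * L
C = [[1.0, 1.0]] * L
base = selective_scan_1ch(x, [0.0] * L, A, B, C)
fast = selective_scan_1ch(x, [0.0] * L, A, B, C, style_shift=2.0)
print(base[1] / base[0], fast[1] / fast[0])
```

In this toy setup the share of the impulse that survives one step drops from 0.375 to about 0.067 when the shift is applied, which is the sense in which a style-dependent `dt_bias` changes how much context a channel retains.
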
### 🖼️ Real Dataset Support
- **WikiArt** (80K paintings, 27 styles): `huggan/wikiart`
- **Teyvat** (anime illustrations with structured captions): `Fazzie/Teyvat`
- **Pokemon** (GPT-4 captioned illustrations): `diffusers/pokemon-gpt4-captions`
- **Danbooru2023** (6M+ anime images): `KBlueLeaf/danbooru2023-webp-4Mpixel`
- Auto-detects image/text/style columns from any HF dataset

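
Column auto-detection of the kind mentioned in the last bullet could work by keyword matching on column names. The sketch below is a guess at one such heuristic, not the logic in `artflow_train.py`: `detect_columns`, the `CANDIDATES` keyword lists, and the sample column names are all hypothetical.

```python
# Keywords are tried in order, so earlier entries take priority.
CANDIDATES = {
    "image": ("image", "img", "picture"),
    "text": ("text", "caption", "prompt", "description"),
    "style": ("style", "genre", "artist"),
}

def detect_columns(column_names):
    """Guess which columns hold the image, the caption, and the style label.

    Returns a dict mapping each role to a column name, or None if nothing matched.
    """
    found = {}
    for role, keys in CANDIDATES.items():
        found[role] = None
        for key in keys:
            # Case-insensitive substring match against every column name.
            match = next((c for c in column_names if key in c.lower()), None)
            if match is not None:
                found[role] = match
                break
    return found

# Hypothetical column layouts for a style-labelled and a captioned dataset.
print(detect_columns(["image", "artist", "genre", "style"]))
print(detect_columns(["Image", "Caption"]))
```

A dataset with no caption column yields `None` for the `text` role, so a caller would need a fallback such as training without text conditioning.
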
### 🔧 Bug Fixes
- Fixed `AttributeError: module 'torch' has no attribute '_utils'`, caused by a mamba-ssm CUDA version mismatch
- Fixed batch dimension broadcasting when `style_ids`/`mood_ids` are `None`
- Proper handling of `(1, d)` vs `(B, d)` conditioning tensors in WaveMamba blocks

## Quick Start (Colab / Kaggle)

```python
# Install (no CUDA extensions needed!); `!` is notebook cell syntax
!pip install torch torchvision huggingface_hub datasets

# Download the model and training scripts into the working directory
from huggingface_hub import hf_hub_download
import shutil
for f in ['artflow_model.py', 'artflow_train.py']:
    shutil.copy(hf_hub_download('krystv/ArtFlow', f), f'./{f}')

# Train with real data
from artflow_model import ArtFlow, ArtFlowConfig
from artflow_train import TrainConfig, RealArtDataset, freeze_for_stage, train
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
config = ArtFlowConfig()
model = ArtFlow(config).to(device)
model = freeze_for_stage(model, 1)  # Stage 1: style/mood/concept frozen

# Use real WikiArt dataset!
dataset = RealArtDataset("huggan/wikiart", config=config, max_samples=5000)

# Effective batch size: batch_size * grad_accum = 2 * 32 = 64
tcfg = TrainConfig(lr=1e-4, batch_size=2, grad_accum=32, num_steps=10000,
                   warmup_steps=500, stage=1)
engine = train(model, config, tcfg, dataset, device)
```

## Validated Results
```
104.5M params (backbone only)
💾 209 MB fp16 / 104.5 MB int8
📱 ~235 MB peak inference (fits mobile)
✅ Forward/backward: no NaN, no Inf
✅ 30-step training: stable loss, no oscillation
✅ Real Mamba SSM selective scan in pure PyTorch
No mamba-ssm package needed!
```

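
The weight sizes above follow directly from the parameter count, as the short check below shows. It takes 1 MB as 10^6 bytes. The headroom line assumes the ~235 MB peak was measured with fp16 weights, which the results do not state.

```python
PARAMS = 104.5e6  # backbone parameter count reported above

def weights_mb(num_params, bytes_per_param):
    """Weight storage in MB, taking 1 MB = 10**6 bytes."""
    return num_params * bytes_per_param / 1e6

fp16_mb = weights_mb(PARAMS, 2)  # 2 bytes per fp16 weight
int8_mb = weights_mb(PARAMS, 1)  # 1 byte per int8 weight
print(f"fp16: {fp16_mb:.1f} MB, int8: {int8_mb:.1f} MB")

# Assumption: the reported peak includes fp16 weights; the rest is
# activations and buffers.
headroom_mb = 235 - fp16_mb
print(f"non-weight memory at peak: ~{headroom_mb:.0f} MB")
```
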
## Architecture: 8 Novel Contributions

1. **WaveMamba**: Wavelet × Real Mamba SSM denoising (O(n) complexity)
2. **Style-Modulated SSM**: Art style directly controls Mamba's dt_bias (selectivity)
3. **Recursive Latent Reasoning**: TRM-style "thinking" inside denoising steps
4. **ArtStyle Matrix**: Continuous style vectors, interpolatable
5. **Liquid-Dynamics Mood**: Physics-inspired atmosphere control
6. **Art-Aware Velocity Loss**: Frequency-weighted flow matching
7. **Deep Improvement Supervision**: Progressive recursion targets
8. **KAN Composition**: Smooth compositional rules via B-splines

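
Because the ArtStyle Matrix (item 4) stores styles as continuous vectors, two styles can in principle be blended by interpolating between their vectors. The sketch below shows plain linear interpolation. The function name, the 4-dimensional vectors, and the style labels are made up for illustration, and the actual vector layout in ArtFlow may differ.

```python
def lerp_style(style_a, style_b, alpha):
    """Linear blend: alpha=0 returns style_a, alpha=1 returns style_b."""
    if len(style_a) != len(style_b):
        raise ValueError("style vectors must have the same dimension")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(style_a, style_b)]

# Made-up 4-d style vectors; real ones would come from the trained model.
impressionism = [0.8, -0.2, 0.1, 0.5]
ukiyo_e = [-0.4, 0.6, 0.9, 0.1]

halfway = lerp_style(impressionism, ukiyo_e, 0.5)
print(halfway)
```
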
## Real Datasets for Training

| Dataset | Size | Purpose | Stage |
|---------|------|---------|-------|
| [huggan/wikiart](https://hf.co/datasets/huggan/wikiart) | 80K paintings | Art style diversity | 1-2 |
| [Fazzie/Teyvat](https://hf.co/datasets/Fazzie/Teyvat) | 446 MB | Anime + structured concepts | 1-4 |
| [diffusers/pokemon-gpt4-captions](https://hf.co/datasets/diffusers/pokemon-gpt4-captions) | 49 MB | Anime + NL captions | 1 |
| [KBlueLeaf/danbooru2023-webp-4Mpixel](https://hf.co/datasets/KBlueLeaf/danbooru2023-webp-4Mpixel) | 1.5 TB | Full anime training | All |
| [Artificio/WikiArt](https://hf.co/datasets/Artificio/WikiArt) | 1.6 GB | 27 styles + NL descriptions | 2 |

## 5-Stage Pipeline
```
Stage 1: Backbone learns denoising (50K steps, lr=1e-4) - freeze style/mood/concept
Stage 2: Style matrix disentanglement (25K steps, lr=5e-5) - freeze mood/concept
Stage 3: Resolution scaling + reasoning (25K steps, lr=3e-5) - freeze mood/concept
Stage 4: Concept & mood understanding (15K steps, lr=2e-5) - freeze backbone
Stage 5: Quality alignment (5K steps, lr=1e-5) - all trainable
```

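
The schedule above can also be written as data, which makes the step budget and the trainable parts of each stage easy to check. This is a sketch only: the group names are copied from the schedule text and may not match module names in the code, and `trainable_groups` is a hypothetical helper, not the repository's `freeze_for_stage`.

```python
STAGES = [
    {"stage": 1, "steps": 50_000, "lr": 1e-4, "frozen": ("style", "mood", "concept")},
    {"stage": 2, "steps": 25_000, "lr": 5e-5, "frozen": ("mood", "concept")},
    {"stage": 3, "steps": 25_000, "lr": 3e-5, "frozen": ("mood", "concept")},
    {"stage": 4, "steps": 15_000, "lr": 2e-5, "frozen": ("backbone",)},
    {"stage": 5, "steps": 5_000, "lr": 1e-5, "frozen": ()},
]

# Assumed parameter groups, named after the terms used in the schedule.
ALL_GROUPS = ("backbone", "style", "mood", "concept")

def trainable_groups(stage):
    """Return the groups left trainable in the given stage."""
    frozen = next(s["frozen"] for s in STAGES if s["stage"] == stage)
    return [g for g in ALL_GROUPS if g not in frozen]

total_steps = sum(s["steps"] for s in STAGES)
print(f"total optimizer steps: {total_steps:,}")
for s in STAGES:
    print(s["stage"], s["lr"], trainable_groups(s["stage"]))
```

The five stages add up to 120K steps, and the learning rate falls at every stage.
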
## Research Papers
- Mamba-1 selective scan: arXiv:2312.00752
- Mamba-2 SSD: arXiv:2405.21060
- ZigMa zigzag scan: arXiv:2403.13802
- DiMSUM wavelet+Mamba: arXiv:2411.04168
- DiT AdaLN-Zero: arXiv:2212.09748
- TRM recursive reasoning: arXiv:2511.16886
- SnapGen MQA: arXiv:2412.09619
- DC-AE latent compression: arXiv:2410.10733
- Min-SNR-γ: arXiv:2303.09556
- Pseudo-Huber loss: arXiv:2403.16728
- Illustrious training: arXiv:2409.19946

## License
MIT