# 🎨 ArtFlow v2: Reasoning-Native Artistic Image Generation for Mobile Devices
**Version 2.0**: Real Mamba SSM backbone, real dataset support
**Target:** 2-4GB RAM, 1024px native, anime/illustration focus
## ⚡ What's New in v2
### Real Mamba SSM (fixes `torch._utils` error)
- **Pure PyTorch implementation**: no `mamba-ssm` or `causal-conv1d` CUDA packages needed
- Implements the exact Mamba-1 selective scan algorithm (arXiv:2312.00752)
- **Style-modulated dt_bias**: art style directly modulates SSM selectivity per channel
- **AdaLN-Zero conditioning**: DiT-style zero-initialized conditioning on every Mamba block
- Works on CPU, CUDA, and mobile, with no CUDA extension compilation needed
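
The recurrence behind the selective scan can be sketched in a few lines. The version below uses plain Python lists for a single channel so the arithmetic stays visible; it is a minimal illustration of the Mamba-1 discretization (arXiv:2312.00752), not the code in `artflow_model.py`. The function name `selective_scan_1ch` and the `style_shift` argument are hypothetical, and adding a style-dependent offset to `dt_bias` is an assumption about how the style modulation is wired.

```python
import math

def softplus(v):
    # Numerically stable softplus; keeps the step size dt positive.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def selective_scan_1ch(x, dt_raw, A, B, C, D, dt_bias, style_shift=0.0):
    """Sequential selective scan for a single channel.

    x, dt_raw: length-L lists (input and input-dependent step logits)
    A: length-N list of negative state decay rates
    B, C: length-L lists of length-N lists (input-dependent projections)
    D: scalar skip weight; dt_bias: scalar bias on the step logits
    style_shift: hypothetical style-dependent offset added to dt_bias
    """
    n_state = len(A)
    h = [0.0] * n_state
    ys = []
    for t in range(len(x)):
        dt = softplus(dt_raw[t] + dt_bias + style_shift)
        for n in range(n_state):
            a_bar = math.exp(dt * A[n])   # zero-order hold on A
            b_bar = dt * B[t][n]          # simplified Euler step on B
            h[n] = a_bar * h[n] + b_bar * x[t]
        ys.append(sum(C[t][n] * h[n] for n in range(n_state)) + D * x[t])
    return ys
```

In this sketch a larger `style_shift` increases `dt`, so the state decays faster and the current input is weighted more heavily, which is one plausible way a style vector could steer selectivity per channel.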
### 🖼️ Real Dataset Support
- **WikiArt** (80K paintings, 27 styles): `huggan/wikiart`
- **Teyvat** (anime illustrations with structured captions): `Fazzie/Teyvat`
- **Pokemon** (GPT-4 captioned illustrations): `diffusers/pokemon-gpt4-captions`
- **Danbooru2023** (6M+ anime images): `KBlueLeaf/danbooru2023-webp-4Mpixel`
- Auto-detects image/text/style columns from any HF dataset
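
Column auto-detection could work along the following lines. This is a hedged sketch that uses name matching only; the candidate name lists and the function `detect_columns` are hypothetical and may differ from what `RealArtDataset` actually does.

```python
# Hypothetical candidate names, checked in priority order.
IMAGE_KEYS = ("image", "img", "picture")
TEXT_KEYS = ("text", "caption", "description", "prompt")
STYLE_KEYS = ("style", "genre", "artist")

def _first_match(columns, candidates):
    # Return the first column whose lowercase name contains a candidate key.
    for key in candidates:
        for col in columns:
            if key in col.lower():
                return col
    return None

def detect_columns(column_names):
    """Map a dataset's column names to image/text/style roles (None if absent)."""
    cols = list(column_names)
    return {
        "image": _first_match(cols, IMAGE_KEYS),
        "text": _first_match(cols, TEXT_KEYS),
        "style": _first_match(cols, STYLE_KEYS),
    }
```

With the `datasets` library, `dataset.column_names` would supply the input list. A role that comes back as `None` can then be handled explicitly, for example by training unconditionally on a dataset without captions.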
### 🔧 Bug Fixes
- Fixed `AttributeError: module 'torch' has no attribute '_utils'`, caused by a `mamba-ssm` CUDA version mismatch
- Fixed batch dimension broadcasting when style_ids/mood_ids are None
- Proper handling of (1, d) vs (B, d) conditioning tensors in WaveMamba blocks
## Quick Start (Colab / Kaggle)
```python
# Install (no CUDA extensions needed!)
!pip install torch torchvision huggingface_hub datasets
# Download
from huggingface_hub import hf_hub_download
import shutil
for f in ['artflow_model.py', 'artflow_train.py']:
    # Copy each file from the Hub cache into the working directory
    shutil.copy(hf_hub_download('krystv/ArtFlow', f), f'./{f}')
# Train with real data
from artflow_model import ArtFlow, ArtFlowConfig
from artflow_train import TrainConfig, RealArtDataset, freeze_for_stage, train
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
config = ArtFlowConfig()
model = ArtFlow(config).to(device)
model = freeze_for_stage(model, 1)
# Use real WikiArt dataset!
dataset = RealArtDataset("huggan/wikiart", config=config, max_samples=5000)
tcfg = TrainConfig(lr=1e-4, batch_size=2, grad_accum=32, num_steps=10000,
                   warmup_steps=500, stage=1)
engine = train(model, config, tcfg, dataset, device)
```
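
With `batch_size=2` and `grad_accum=32`, each optimizer update sees 2 × 32 = 64 samples. The sketch below works that out and shows a linear warmup over `warmup_steps`. The linear shape and the helper names `effective_batch` and `lr_at` are assumptions for illustration, since the schedule used by `train` is not described here.

```python
def effective_batch(batch_size, grad_accum):
    # Samples contributing to one optimizer update.
    return batch_size * grad_accum

def lr_at(step, base_lr=1e-4, warmup_steps=500):
    # Assumed linear warmup from near 0 to base_lr, then constant.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(effective_batch(2, 32))  # 64
```

If memory allows a larger `batch_size`, lowering `grad_accum` by the same factor keeps the effective batch unchanged.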
## Validated Results
```
104.5M params (backbone only)
💾 209 MB fp16 / 104.5 MB int8
📱 ~235 MB peak inference: fits mobile
✅ Forward/backward: no NaN, no Inf
✅ 30-step training: stable loss, no oscillation
✅ Real Mamba SSM selective scan, pure PyTorch
No mamba-ssm package needed!
```
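
The size figures follow from the parameter count: 104.5M parameters at 2 bytes each is 209 MB in fp16, and at 1 byte each is 104.5 MB in int8. A small check, assuming decimal megabytes (1 MB = 10^6 bytes) and ignoring quantization scales or file headers; the helper `weight_size_mb` is illustrative only:

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_size_mb(num_params, dtype):
    # Decimal megabytes; ignores quantization scales and file headers.
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

params = 104.5e6
for dtype in ("fp16", "int8"):
    print(dtype, weight_size_mb(params, dtype), "MB")
```

The ~235 MB peak inference figure sits about 26 MB above the fp16 weights, which presumably covers activations and temporary buffers.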
## Architecture: 8 Novel Contributions
1. **WaveMamba**: Wavelet × real Mamba SSM denoising (O(n) complexity)
2. **Style-Modulated SSM**: Art style directly controls Mamba's dt_bias (selectivity)
3. **Recursive Latent Reasoning**: TRM-style "thinking" inside denoising steps
4. **ArtStyle Matrix**: Continuous style vectors, interpolatable
5. **Liquid-Dynamics Mood**: Physics-inspired atmosphere control
6. **Art-Aware Velocity Loss**: Frequency-weighted flow matching
7. **Deep Improvement Supervision**: Progressive recursion targets
8. **KAN Composition**: Smooth compositional rules via B-splines
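
Because styles are continuous vectors, two styles can be blended by interpolating between them. A minimal sketch, assuming plain linear interpolation; the example vectors and the helper `lerp_style` are made up for illustration, and the model's own style lookup may differ:

```python
def lerp_style(style_a, style_b, t):
    """Linearly blend two style vectors; t=0 gives style_a, t=1 gives style_b."""
    if len(style_a) != len(style_b):
        raise ValueError("style vectors must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(style_a, style_b)]

# Hypothetical 4-dimensional style vectors.
impressionism = [1.0, 0.0, 0.5, -1.0]
ukiyo_e = [0.0, 2.0, 0.5, 1.0]
halfway = lerp_style(impressionism, ukiyo_e, 0.5)
```

Sweeping `t` from 0 to 1 would give a gradual transition between the two styles, provided the learned style space is reasonably smooth.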
## Real Datasets for Training
| Dataset | Size | Purpose | Stage |
|---------|------|---------|-------|
| [huggan/wikiart](https://hf.co/datasets/huggan/wikiart) | 80K | Art style diversity | 1-2 |
| [Fazzie/Teyvat](https://hf.co/datasets/Fazzie/Teyvat) | 446MB | Anime + structured concepts | 1-4 |
| [diffusers/pokemon-gpt4-captions](https://hf.co/datasets/diffusers/pokemon-gpt4-captions) | 49MB | Anime + NL captions | 1 |
| [KBlueLeaf/danbooru2023-webp-4Mpixel](https://hf.co/datasets/KBlueLeaf/danbooru2023-webp-4Mpixel) | 1.5TB | Full anime training | All |
| [Artificio/WikiArt](https://hf.co/datasets/Artificio/WikiArt) | 1.6GB | 27 styles + NL descriptions | 2 |
## 5-Stage Pipeline
```
Stage 1: Backbone learns denoising (50K steps, lr=1e-4) - freeze style/mood/concept
Stage 2: Style matrix disentanglement (25K steps, lr=5e-5) - freeze mood/concept
Stage 3: Resolution scaling + reasoning (25K steps, lr=3e-5) - freeze mood/concept
Stage 4: Concept & mood understanding (15K steps, lr=2e-5) - freeze backbone
Stage 5: Quality alignment (5K steps, lr=1e-5) - all trainable
```
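
The schedule above can also be written as data, which makes the totals easy to check. The `Stage` dataclass and the group names below are a hypothetical encoding, not the structure used by `freeze_for_stage`; the step counts and learning rates are copied from the listing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    steps: int
    lr: float
    frozen: tuple  # parameter groups kept frozen during this stage

STAGES = (
    Stage("backbone denoising", 50_000, 1e-4, ("style", "mood", "concept")),
    Stage("style disentanglement", 25_000, 5e-5, ("mood", "concept")),
    Stage("resolution + reasoning", 25_000, 3e-5, ("mood", "concept")),
    Stage("concept & mood", 15_000, 2e-5, ("backbone",)),
    Stage("quality alignment", 5_000, 1e-5, ()),
)

total_steps = sum(s.steps for s in STAGES)  # 120_000
```

The full pipeline adds up to 120K steps, with the learning rate stepping down at every stage.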
## Research Papers
- Mamba-1 selective scan: arXiv:2312.00752
- Mamba-2 SSD: arXiv:2405.21060
- ZigMa zigzag scan: arXiv:2403.13802
- DiMSUM wavelet+Mamba: arXiv:2411.04168
- DiT AdaLN-Zero: arXiv:2212.09748
- TRM recursive reasoning: arXiv:2511.16886
- SnapGen MQA: arXiv:2412.09619
- DC-AE latent compression: arXiv:2410.10733
- Min-SNR-γ: arXiv:2303.09556
- Pseudo-Huber loss: arXiv:2403.16728
- Illustrious training: arXiv:2409.19946
## License
MIT