# ArtiGen V1.0 - Adaptive Reasoning Token-Informed Generative Engine
## What is ArtiGen?
A **novel, lightweight, mobile-friendly** text-to-image generation architecture designed specifically for **anime/illustration** art. It runs under **3GB RAM** on consumer devices and trains on **Colab Free Tier**.
## Why a New Architecture?
- Existing models (SDXL, FLUX) are too heavy for mobile.
- Quantization destroys aesthetic quality.
- Old models (SD 1.5) lack prompt adherence and visual quality.
- Attention-based transformers have O(N²) memory that explodes on high-res latent grids.
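
As a rough illustration of the last point, the size of a dense attention score matrix can be estimated with simple arithmetic. The sketch below assumes one N×N matrix per head stored in FP16 (2 bytes per entry). The function names and the larger grid sizes are illustrative and are not taken from the ArtiGen code.

```python
def num_tokens(grid_h: int, grid_w: int) -> int:
    """Number of tokens when every latent grid cell becomes one token."""
    return grid_h * grid_w


def attention_matrix_bytes(n_tokens: int, bytes_per_entry: int = 2) -> int:
    """Bytes for one dense N x N attention score matrix (FP16 by default)."""
    return n_tokens * n_tokens * bytes_per_entry


if __name__ == "__main__":
    for side in (32, 64, 128):
        n = num_tokens(side, side)
        mib = attention_matrix_bytes(n) / 2**20
        print(f"{side}x{side} grid: N={n}, ~{mib:.0f} MiB per head per layer")
```

Doubling the grid side multiplies the matrix size by 16, which is the growth a linear-time scan avoids.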
## Core Innovations
1. **CARTEL Backbone**: Hybrid SSM (Mamba-style) + RWKV + Liquid Time-Constant gates. O(N) complexity, no heavy attention.
2. **PHI-SCAN**: Physics-informed multi-directional scanning (Hilbert, zigzag-diagonal, row/column-major) preserving 2D spatial continuity. Zero extra parameters.
3. **ASDL (Art-Style Disentangled Latent Space)**: Modular heads that natively learn style, content, concept, mood, and composition as separate vectors in latent space. Users can tweak vectors to invent new art styles.
4. **Flow Matching + Spectral Smoothness**: Replaces unstable diffusion training with rectified flow matching. Spectral Laplacian penalty reduces artifacts at 1024px native resolution.
5. **Progressive Modular Curriculum**: 5-stage freeze/thaw training that forces each module to specialize before end-to-end tuning. Prevents loss explosion.
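
To make the PHI-SCAN idea more concrete, the following NumPy sketch builds three of the scan orders named above as index permutations over an H×W token grid. Because a scan is only a reordering, it adds no parameters. This is an illustration under assumptions, not the ArtiGen implementation: the function names are hypothetical, the zigzag-diagonal order follows the familiar anti-diagonal traversal, and the Hilbert curve is omitted for brevity.

```python
import numpy as np


def row_major(h: int, w: int) -> np.ndarray:
    """Visit tokens row by row."""
    return np.arange(h * w)


def column_major(h: int, w: int) -> np.ndarray:
    """Visit tokens column by column."""
    return np.arange(h * w).reshape(h, w).T.reshape(-1)


def zigzag_diagonal(h: int, w: int) -> np.ndarray:
    """Visit anti-diagonals in turn, alternating direction on each one."""
    order = []
    for d in range(h + w - 1):
        rows = range(max(0, d - w + 1), min(h, d + 1))
        cells = [r * w + (d - r) for r in rows]  # flat index of (r, d - r)
        if d % 2 == 0:
            cells.reverse()
        order.extend(cells)
    return np.array(order)


def scan(tokens: np.ndarray, order: np.ndarray) -> np.ndarray:
    """Reorder an (N, C) token array into scan order."""
    return tokens[order]


def unscan(seq: np.ndarray, order: np.ndarray) -> np.ndarray:
    """Invert scan(): put every token back at its original grid position."""
    out = np.empty_like(seq)
    out[order] = seq
    return out
```

A block could run the same sequence model over several such orders and merge the results after `unscan`, so that every token sees neighbours from more than one direction.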
## Architecture
```
Text Prompt ──► Text Encoder ──► text_embed
                                 │
Timestep t ────► t_embed ──────► │
                                 ▼
Latent z_t ────► Patchify ─────► PHI-SCAN ──► [CARTEL Block × N] ──► v_t(z_t)
                                 ▲                               │
                                 └────── Long Skip ──────────────┘
                                                 │
                      ASDL Heads (style, content, concept, mood, composition)
```
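
The Patchify step in the diagram is not specified further in this README. A minimal NumPy sketch, assuming non-overlapping p×p patches over a `(C, H, W)` latent, could look like the following. The function names and the default patch size are assumptions made for illustration.

```python
import numpy as np


def patchify(z: np.ndarray, p: int = 1) -> np.ndarray:
    """Split a (C, H, W) latent into (N, C*p*p) tokens, N = (H/p) * (W/p)."""
    c, h, w = z.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    z = z.reshape(c, h // p, p, w // p, p)
    z = z.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return z.reshape((h // p) * (w // p), c * p * p)


def unpatchify(tokens: np.ndarray, c: int, h: int, w: int, p: int = 1) -> np.ndarray:
    """Inverse of patchify(): rebuild the (C, H, W) latent from tokens."""
    z = tokens.reshape(h // p, w // p, c, p, p)
    z = z.transpose(2, 0, 3, 1, 4)  # (C, H/p, p, W/p, p)
    return z.reshape(c, h, w)
```

With `p=1`, a `(4, 32, 32)` latent becomes 1024 tokens of width 4, which matches the token count used elsewhere in this document.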
## Memory Footprint
| Component | Parameters | FP16 VRAM |
|-----------------|------------|-----------|
| CARTEL Backbone | ~80M | ~160 MB |
| ASDL Heads | ~20M | ~40 MB |
| Pretrained VAE | ~50M | ~100 MB |
| **Total** | **~150M** | **~300 MB** |
With recurrent state, activations, and runtime overhead: **< 1.5 GB** at inference. Training on Colab Free Tier: **batch_size=2, embed_dim=256, 16 layers** fits in the 15 GB of VRAM on a T4.
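
The FP16 figures in the table follow from 2 bytes per parameter. The short check below reproduces them from the approximate parameter counts. It covers weights only, so it says nothing about the activation and overhead estimate.

```python
BYTES_PER_PARAM_FP16 = 2

# Approximate parameter counts taken from the table above.
PARAM_COUNTS = {
    "CARTEL Backbone": 80e6,
    "ASDL Heads": 20e6,
    "Pretrained VAE": 50e6,
}


def fp16_megabytes(num_params: float) -> float:
    """Weight storage in MB (10^6 bytes) at 2 bytes per parameter."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e6


if __name__ == "__main__":
    for name, count in PARAM_COUNTS.items():
        print(f"{name}: ~{fp16_megabytes(count):.0f} MB")
    total = sum(PARAM_COUNTS.values())
    print(f"Total: ~{fp16_megabytes(total):.0f} MB")
```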
## Training Stages
| Stage | Module Trained | Losses | Purpose |
|-------|--------------------|----------------------------------|---------------------------------|
| 1 | Style Head | L_flow + L_style | Learn artistic styles |
| 2 | Content Head | L_flow + L_content | Learn semantic objects/scenes |
| 3 | Concept Head | L_flow + L_concept | Learn abstract relationships |
| 4 | Mood + Composition | L_flow + L_mood | Learn emotion & layout |
| 5 | All (unfrozen) | L_flow + all aux + L_spectral | End-to-end fine-tuning |
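
One way to express this freeze/thaw schedule is as a map from stage number to the parameter-name prefixes that stay trainable. The sketch below is an assumption about how it could be organized: the prefixes (`asdl.style` and so on) are hypothetical names, and it reads the table literally, so only the listed heads train in stages 1 to 4.

```python
# Hypothetical parameter-name prefixes; the real module names may differ.
STAGE_TRAINABLE = {
    1: ("asdl.style",),
    2: ("asdl.content",),
    3: ("asdl.concept",),
    4: ("asdl.mood", "asdl.composition"),
    5: ("",),  # the empty prefix matches every parameter (all unfrozen)
}


def trainable_mask(param_names, stage: int) -> dict:
    """Return {parameter name: True if it should be trained in this stage}."""
    prefixes = STAGE_TRAINABLE[stage]
    return {name: name.startswith(prefixes) for name in param_names}


if __name__ == "__main__":
    names = ["backbone.block0.weight", "asdl.style.proj.weight", "asdl.mood.proj.weight"]
    for stage in sorted(STAGE_TRAINABLE):
        print(stage, trainable_mask(names, stage))
```

With a PyTorch module, the mask could be applied by looping over `model.named_parameters()` and calling `param.requires_grad_(mask[name])` at the start of each stage.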
## Key Design Decisions
- **SSM+RWKV over Transformers**: Linear O(N) vs quadratic O(N²). A 1024px image maps to a 32×32 latent grid, i.e. N = 1024 tokens. Attention needs ~1M pairwise operations per layer; an SSM scan needs ~1K steps.
- **Flow Matching over DDPM**: Stable training, fewer sampling steps (1 to 4), no exploding losses as t→0.
- **Wavelet spectral smoothness**: Penalizes unnatural high-frequency noise, aiming for native 1024px quality without upsampling hacks.
- **Modular curriculum**: Prevents catastrophic forgetting, forces each ASDL head to learn a clean, separable subspace.
- **LTC Gate**: Liquid Time-Constant residual dynamically adapts between fast (textures) and slow (structures) pathways.
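
The flow-matching objective and a smoothness penalty from the list above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not ArtiGen's training code. It assumes the straight-line path `z_t = (1 - t) * z0 + t * noise` with clean data at t = 0, so the regression target is `noise - z0`. The penalty is a plain finite-difference Laplacian standing in for the spectral/wavelet term, whose exact form is not given here. `velocity_fn` is a hypothetical callable.

```python
import numpy as np


def flow_matching_loss(velocity_fn, z0: np.ndarray, rng: np.random.Generator) -> float:
    """Rectified-flow style loss on clean latents z0 of shape (B, C, H, W)."""
    noise = rng.standard_normal(z0.shape)
    t = rng.uniform(0.0, 1.0, size=(z0.shape[0], 1, 1, 1))
    z_t = (1.0 - t) * z0 + t * noise  # point on the straight path
    target = noise - z0               # constant velocity along that path
    pred = velocity_fn(z_t, t)
    return float(np.mean((pred - target) ** 2))


def laplacian_penalty(z: np.ndarray) -> float:
    """Mean squared 5-point Laplacian over interior cells of (..., H, W)."""
    center = z[..., 1:-1, 1:-1]
    lap = (
        z[..., :-2, 1:-1] + z[..., 2:, 1:-1]
        + z[..., 1:-1, :-2] + z[..., 1:-1, 2:]
        - 4.0 * center
    )
    return float(np.mean(lap ** 2))
```

The penalty is zero for constant or linearly varying fields and grows with pixel-to-pixel noise, which is the behaviour a smoothness term is meant to reward.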
## Datasets (Suggested)
| Stage | Dataset | Source |
|-------|---------|--------|
| 1 | Anime illustrations with style tags | Danbooru / Safebooru filtered |
| 2-3 | Detailed caption dataset | `none-yet/anime-captions`, `latentcat/animesfw` |
| 4 | Mood-labeled artwork | Self-annotated via CLIP clustering |
| 5 | Full quality mix | Curated high-quality anime illustration set |
## Usage
### 1. Generate Image (with pretrained VAE)
```python
from artigen.model import ArtiGen
from artigen.sampling import sample
from diffusers import AutoencoderTiny
import torch

device = "cuda"

# Load a lightweight VAE. TAESD checkpoints load with AutoencoderTiny, not AutoencoderKL.
vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(device)

# Build the model
model = ArtiGen(
    embed_dim=256, num_layers=16,
    latent_h=32, latent_w=32,
).to(device)
model.load_state_dict(torch.load("artigen_stage5.pt", map_location=device)["ema"])
model.eval()

# Text embedding: random placeholder; replace with a real encoder output (e.g., CLIP)
text_embed = torch.randn(1, 768, device=device)

with torch.no_grad():
    # Sample a latent
    z0 = sample(model, text_embed, latent_shape=(4, 32, 32), num_steps=4, cfg_scale=2.0)
    # Decode the latent to image space
    img = vae.decode(z0).sample
```
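
The internals of `sample` are not shown in this README. For orientation, a few-step Euler sampler with classifier-free guidance could look like the NumPy sketch below. It is a hypothetical stand-in, not `artigen.sampling.sample`: it assumes noise at t = 1 and the clean latent at t = 0, and a `velocity_fn(z, t, cond)` callable that accepts `cond=None` for the unconditional branch.

```python
import numpy as np


def sample_euler(velocity_fn, cond, shape, num_steps=4, cfg_scale=2.0, rng=None):
    """Integrate dz/dt = v from t = 1 (noise) down to t = 0 with Euler steps."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        v_cond = velocity_fn(z, t, cond)
        v_uncond = velocity_fn(z, t, None)
        # Classifier-free guidance: extrapolate away from the unconditional prediction.
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        z = z - dt * v  # step toward t = 0
    return z
```

With `cfg_scale=1.0` the update reduces to the conditional velocity alone. Larger values push samples harder toward the prompt at the cost of diversity.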
### 2. Invent a New Art Style
```python
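# Extract ASDL vectors.
# z_t (noisy latent), t (timestep), and text_embed must already be defined.
with torch.no_grad():
    _, asdl = model(z_t, t, text_embed, return_asdl=True)
style_vec = asdl["style_vec"]  # (1, 64)

# Interpolate between two styles.
# style_a and style_b are style vectors extracted as above from two different inputs.
new_style = 0.7 * style_a + 0.3 * style_b

# Inject during generation by conditioning text_embed with the style vector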
```
### 3. Train (Colab Free Tier)
```bash
# In a Colab notebook cell
!git clone https://github.com/<repo>/artigen.git
%cd artigen
!python -m artigen.train \
    --epochs 5 --bs 2 --dim 256 --layers 16 \
    --latent_h 32 --latent_w 32 --device cuda
```
## Citation & References
Architecture inspired by:
1. **DiM** (2405.14224): SSM-based diffusion with multi-directional scan
2. **Zigzag Mamba** (2403.13802): Spatial continuity via zigzag scanning
3. **Diffusion-RWKV** (2404.04478): RWKV for diffusion generation
4. **MobileMamba** (2411.15941): Three-stage wavelet-enhanced SSM backbone
5. **MILR** (2509.22761): Test-time latent reasoning in unified space
6. **Unified Thinker** (2601.03127): Reasoning-decoupled generation core
7. **LatentMorph** (2602.02227): Implicit latent reasoning without decode loops
8. **LFM** (2307.08698): Flow matching in pretrained VAE latent space
9. **Liquid Time-Constant Networks** (2006.04439): Adaptive continuous-time gates
10. **Disentanglement via Latent Quantization** (2305.18378): Modular latent decomposition
## License
MIT License: free to use, modify, and deploy.