# ArtiGen V1.0 — Adaptive Reasoning Token-Informed Generative Engine

## What is ArtiGen?

ArtiGen is a **novel, lightweight, mobile-friendly** text-to-image generation architecture designed specifically for **anime/illustration** art. It runs in under **3 GB of RAM** on consumer devices and trains on the **Colab Free Tier**.

## Why a New Architecture?

- Existing models (SDXL, FLUX) are too heavy for mobile.
- Aggressive quantization degrades aesthetic quality.
- Older models (SD 1.5) lack prompt adherence and visual quality.
- Attention-based transformers need O(N²) memory in the token count N, which becomes prohibitive on high-resolution latent grids.

## Core Innovations

1. **CARTEL Backbone**: Hybrid SSM (Mamba-style) + RWKV + Liquid Time-Constant gates. O(N) complexity, no heavy attention.
2. **PHI-SCAN**: Physics-informed multi-directional scanning (Hilbert, zigzag-diagonal, row/column-major) preserving 2D spatial continuity. Zero extra parameters.
3. **ASDL (Art-Style Disentangled Latent Space)**: Modular heads that natively learn style, content, concept, mood, and composition as separate vectors in latent space. Users can tweak vectors to invent new art styles.
4. **Flow Matching + Spectral Smoothness**: Replaces unstable diffusion training with rectified flow matching. Spectral Laplacian penalty reduces artifacts at 1024px native resolution.
5. **Progressive Modular Curriculum**: 5-stage freeze/thaw training that forces each module to specialize before end-to-end tuning. Prevents loss explosion.
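
To make the rectified flow matching item above concrete, here is a toy sketch in plain Python. It assumes the convention that `z0` is the data latent at t=0 and `z1` is Gaussian noise at t=1, with a straight path between them; the source does not state its convention, and all function names here are illustrative rather than part of the ArtiGen code base. The `oracle` velocity stands in for a trained model.

```python
import random

def interpolate(z0, z1, t):
    """Point on the straight path between data z0 (t=0) and noise z1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def target_velocity(z0, z1):
    """Constant velocity of the straight path; the regression target for v_t."""
    return [b - a for a, b in zip(z0, z1)]

def flow_matching_loss(predicted_v, z0, z1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, target)) / len(target)

def euler_sample(velocity_fn, z1, num_steps=4):
    """Integrate dz/dt = v from t=1 (noise) down to t=0 (data) with Euler steps."""
    z, dt = list(z1), 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z

random.seed(0)
data = [0.5, -1.0, 2.0]                      # toy "latent"
noise = [random.gauss(0.0, 1.0) for _ in data]

# Oracle velocity: what a perfectly trained model would predict on this path.
oracle = lambda z, t: target_velocity(data, noise)
recovered = euler_sample(oracle, noise, num_steps=4)
print(recovered)
```

Because the path is a straight line, the exact velocity is constant and even a handful of Euler steps lands back on the data point, which is the intuition behind the low step counts quoted for flow matching.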

## Architecture

```
Text Prompt ──► Text Encoder ──► φ_text
                                        │
Timestep t ────► t_embed ──────►        │
                                        ▼
Latent z_t ────► Patchify ─────► PHI-SCAN ──► [CARTEL Block × N] ──► v_t(z_t)
                    ▲                              │
                    └────── Long Skip ─────────────┘
                           │
                    ASDL Heads (style, content, concept, mood, composition)
```
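
The PHI-SCAN stage in the diagram turns the 2D token grid into 1D sequences. The following is a minimal sketch of how such scan orders could be built as fixed index permutations, which is why they add no parameters. The function names are hypothetical and the actual ArtiGen implementation may differ; the Hilbert variant assumes a square grid whose side is a power of two.

```python
def row_major_order(h, w):
    return [(r, c) for r in range(h) for c in range(w)]

def column_major_order(h, w):
    return [(r, c) for c in range(w) for r in range(h)]

def zigzag_diagonal_order(h, w):
    """Walk anti-diagonals, alternating direction so consecutive cells touch."""
    order = []
    for s in range(h + w - 1):
        rows = range(max(0, s - w + 1), min(h, s + 1))
        cells = [(r, s - r) for r in rows]
        if s % 2 == 0:
            cells.reverse()
        order.extend(cells)
    return order

def hilbert_order(n):
    """Hilbert curve over an n x n grid (n must be a power of two)."""
    order = []
    for d in range(n * n):
        x = y = 0
        t, s = d, 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:              # rotate the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        order.append((y, x))         # store as (row, col)
    return order

def flatten(grid, order):
    """Read a 2D grid into a 1D sequence following the given scan order."""
    return [grid[r][c] for r, c in order]

def unflatten(seq, order, h, w):
    """Inverse of flatten: scatter a 1D sequence back onto the grid."""
    grid = [[None] * w for _ in range(h)]
    for value, (r, c) in zip(seq, order):
        grid[r][c] = value
    return grid

grid = [[4 * r + c for c in range(4)] for r in range(4)]
for name, order in [("row", row_major_order(4, 4)),
                    ("column", column_major_order(4, 4)),
                    ("zigzag", zigzag_diagonal_order(4, 4)),
                    ("hilbert", hilbert_order(4))]:
    print(name, flatten(grid, order))
```

Each order is a permutation of the grid cells, so the sequence a block processes can always be scattered back to its 2D layout with `unflatten`.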

## Memory Footprint

| Component       | Parameters | FP16 VRAM |
|-----------------|------------|-----------|
| CARTEL Backbone | ~80M       | ~160 MB   |
| ASDL Heads      | ~20M       | ~40 MB    |
| Pretrained VAE  | ~50M       | ~100 MB   |
| **Total**       | **~150M**  | **~300 MB** |

With recurrent state, activations, and runtime overhead: **< 1.5 GB** at inference. Training on Colab Free Tier: **batch_size=2, embed_dim=256, 16 layers** fits in the 15 GB of VRAM on a T4.
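
The weight figures in the table follow from two bytes per FP16 parameter. A quick check, using the approximate parameter counts from the table (the variable names are illustrative):

```python
BYTES_PER_FP16_PARAM = 2

# Approximate parameter counts taken from the table above.
param_counts = {
    "CARTEL Backbone": 80e6,
    "ASDL Heads": 20e6,
    "Pretrained VAE": 50e6,
}

def fp16_megabytes(num_params):
    """Memory needed to hold the weights alone, in MB (10^6 bytes)."""
    return num_params * BYTES_PER_FP16_PARAM / 1e6

weights_mb = {name: fp16_megabytes(n) for name, n in param_counts.items()}
total_mb = sum(weights_mb.values())

for name, mb in weights_mb.items():
    print(f"{name}: {mb:.0f} MB")
print(f"Total: {total_mb:.0f} MB")
```

This counts weights only; activations and runtime overhead come on top, which is why the inference budget is quoted separately.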

## Training Stages

| Stage | Module Trained      | Losses                           | Purpose                         |
|-------|--------------------|----------------------------------|---------------------------------|
| 1     | Style Head         | L_flow + L_style                 | Learn artistic styles           |
| 2     | Content Head       | L_flow + L_content               | Learn semantic objects/scenes   |
| 3     | Concept Head       | L_flow + L_concept               | Learn abstract relationships    |
| 4     | Mood + Composition | L_flow + L_mood                  | Learn emotion & layout          |
| 5     | All (unfrozen)     | L_flow + all aux + L_spectral    | End-to-end fine-tuning          |
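
The freeze/thaw schedule in the table can be expressed as a small lookup. This is a sketch under assumptions: the module names are hypothetical, the table is read literally so that everything not listed for a stage (including the backbone) stays frozen, and "all aux" in stage 5 is taken to mean the four auxiliary losses named in stages 1 to 4.

```python
# Hypothetical module names; the real attribute names may differ.
ALL_MODULES = ["backbone", "style_head", "content_head",
               "concept_head", "mood_head", "composition_head"]

# Stage -> modules to train and losses to sum, following the table above.
STAGES = {
    1: {"train": ["style_head"], "losses": ["L_flow", "L_style"]},
    2: {"train": ["content_head"], "losses": ["L_flow", "L_content"]},
    3: {"train": ["concept_head"], "losses": ["L_flow", "L_concept"]},
    4: {"train": ["mood_head", "composition_head"],
        "losses": ["L_flow", "L_mood"]},
    5: {"train": ALL_MODULES,
        "losses": ["L_flow", "L_style", "L_content", "L_concept",
                   "L_mood", "L_spectral"]},
}

def trainable_flags(stage):
    """Map every module name to True (thawed) or False (frozen) for a stage."""
    active = set(STAGES[stage]["train"])
    return {name: name in active for name in ALL_MODULES}

for stage in sorted(STAGES):
    thawed = [name for name, on in trainable_flags(stage).items() if on]
    losses = " + ".join(STAGES[stage]["losses"])
    print(f"stage {stage}: train {thawed} with {losses}")
```

In a PyTorch training loop, these flags would typically be applied by calling `requires_grad_(flag)` on each parameter of the corresponding submodule before building the optimizer for that stage.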

## Key Design Decisions

- **SSM+RWKV over Transformers**: Linear O(N) vs quadratic O(N²) in the token count N. For 1024px → 32×32 latent, N = 1024 tokens. Self-attention computes ~1M (1024²) pairwise scores per layer; an SSM scan takes ~1K sequential state updates.
- **Flow Matching over DDPM**: Stable training, fewer sampling steps (1–4), no exploding losses at t→0.
- **Wavelet spectral smoothness**: Penalizes unnatural high-frequency noise, targeting native 1024px quality without upsampling hacks.
- **Modular curriculum**: Prevents catastrophic forgetting and forces each ASDL head to learn a clean, separable subspace.
- **LTC Gate**: Liquid Time-Constant residual dynamically adapts between fast (textures) and slow (structures) pathways.
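
To illustrate how a smoothness penalty can single out high-frequency noise, here is a minimal sketch that assumes a 4-neighbour discrete Laplacian on a single-channel grid. The source names a spectral Laplacian penalty but does not define the operator or any wavelet weighting, so treat this as an illustration of the idea, not the ArtiGen loss; `laplacian_penalty` is a hypothetical name.

```python
def laplacian_penalty(img):
    """Mean squared 4-neighbour Laplacian over the interior pixels of a 2D grid."""
    h, w = len(img), len(img[0])
    total = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            lap = (img[r - 1][c] + img[r + 1][c]
                   + img[r][c - 1] + img[r][c + 1]
                   - 4.0 * img[r][c])
            total += lap * lap
    return total / ((h - 2) * (w - 2))

size = 6
flat = [[1.0] * size for _ in range(size)]                        # constant image
ramp = [[float(c) for c in range(size)] for _ in range(size)]     # smooth gradient
checker = [[float((r + c) % 2) for c in range(size)] for r in range(size)]  # pixel noise

print(laplacian_penalty(flat))
print(laplacian_penalty(ramp))
print(laplacian_penalty(checker))
```

Flat regions and linear gradients incur no penalty, while pixel-level alternation is penalized heavily, so a term of this shape discourages speckle without flattening smooth shading.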

## Datasets (Suggested)

| Stage | Dataset | Source |
|-------|---------|--------|
| 1     | Anime illustrations with style tags | Danbooru / Safebooru filtered |
| 2-3   | Detailed caption dataset | `none-yet/anime-captions`, `latentcat/animesfw` |
| 4     | Mood-labeled artwork | Self-annotated via CLIP clustering |
| 5     | Full quality mix | Curated high-quality anime illustration set |

## Usage

### 1. Generate Image (with pretrained VAE)

```python
from artigen.model import ArtiGen
from artigen.sampling import sample
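from diffusers import AutoencoderTiny
import torch

device = "cuda"

# Load lightweight VAE. TAESD (madebyollin/taesd) is an AutoencoderTiny
# checkpoint, so it is loaded with AutoencoderTiny rather than AutoencoderKL.
vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(device)

# Build model
model = ArtiGen(
    embed_dim=256, num_layers=16,
    latent_h=32, latent_w=32,
).to(device)
checkpoint = torch.load("artigen_stage5.pt", map_location=device)
model.load_state_dict(checkpoint["ema"])  # use the EMA weights
model.eval()

# Text embed: random placeholder; replace with a real encoder output (e.g., CLIP)
text_embed = torch.randn(1, 768, device=device)

with torch.no_grad():
    # Sample latent
    z0 = sample(model, text_embed, latent_shape=(4, 32, 32), num_steps=4, cfg_scale=2.0)

    # Decode
    img = vae.decode(z0).sample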
```

### 2. Invent a New Art Style

```python
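# Extract ASDL vectors at a noisy latent z_t and timestep t.
# Placeholder inputs; the shapes are assumed from the 4×32×32 latent above.
z_t = torch.randn(1, 4, 32, 32, device="cuda")
t = torch.rand(1, device="cuda")
with torch.no_grad():
    _, asdl = model(z_t, t, text_embed, return_asdl=True)
    style_vec = asdl["style_vec"]  # (1, 64)

# Interpolate between two styles. style_a and style_b are style_vec tensors
# extracted as above for two different prompts.
new_style = 0.7 * style_a + 0.3 * style_b
# Inject during generation by conditioning text_embed with style vector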
```

### 3. Train (Colab Free Tier)

```bash
# In a Colab notebook cell
!git clone https://github.com/<repo>/artigen.git
%cd artigen
!python -m artigen.train \
    --epochs 5 --bs 2 --dim 256 --layers 16 \
    --latent_h 32 --latent_w 32 --device cuda
```

## Citation & References

Architecture inspired by:

1. **DiM** (2405.14224): SSM-based diffusion with multi-directional scan
2. **Zigzag Mamba** (2403.13802): Spatial continuity via zigzag scanning
3. **Diffusion-RWKV** (2404.04478): RWKV for diffusion generation
4. **MobileMamba** (2411.15941): Three-stage wavelet-enhanced SSM backbone
5. **MILR** (2509.22761): Test-time latent reasoning in unified space
6. **Unified Thinker** (2601.03127): Reasoning-decoupled generation core
7. **LatentMorph** (2602.02227): Implicit latent reasoning without decode loops
8. **LFM** (2307.08698): Flow matching in pretrained VAE latent space
9. **Liquid Time-Constant Networks** (2006.04439): Adaptive continuous-time gates
10. **Disentanglement via Latent Quantization** (2305.18378): Modular latent decomposition

## License

MIT License — free to use, modify, and deploy.