Upload architecture.md
architecture.md
ADDED
|
@@ -0,0 +1,132 @@
# ARTIGEN V1.0: Novel Architecture Specification

## Overview
**ARTIGEN** (Adaptive Reasoning-based Token-Informed Generative ENgine) is a lightweight, modular image generation architecture designed for anime/illustration generation on mobile devices (<3GB RAM). It combines insights from DiM (Diffusion Mamba), Zigzag Mamba, RWKV diffusion, MobileMamba, unified latent reasoning (MILR, Unified Thinker), flow matching in latent space (LFM), disentangled modular representation learning, and liquid time-constant dynamics.

## Core Innovations
1. **CARTEL** (CAscading Reasoning Token Ensemble Layer): Modular latent reasoning backbone using hybrid SSM+RWKV blocks
2. **PHI-SCAN** (Physics-Informed multi-directional scan): Parameter-free scanning scheme preserving 2D geometric continuity
3. **Art-Style Disentangled Latent Space (ASDL)**: Modular decomposition of style/content/concept/mood/composition in latent vectors
4. **Flow-Matching in Wavelet Space**: Training using rectified flow matching with Haar wavelet decomposition
5. **Progressive Modular Curriculum**: Freeze/thaw training stages forcing specialization

## Mathematical Formulation

### 1. Pretrained VAE (Frozen)
Use `madebyollin/taesd` (1M params) or `stabilityai/sd-vae-ft-mse` for latent encoding. An image x is encoded to a latent z ∈ R^(h×w×c), where h=w=32 and c=4 for 256px images; both VAEs downsample 8× spatially, so a 1024px image would give h=w=128.

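As a quick sanity check on the latent shape, the sketch below computes (h, w, c) and the flattened token count for a square input, assuming the 8× spatial downsampling and 4 latent channels of Stable Diffusion-style VAEs. The helper names `latent_shape` and `token_count` are illustrative and not part of any library.

```python
def latent_shape(image_px: int, downsample: int = 8, channels: int = 4) -> tuple[int, int, int]:
    """Return (h, w, c) of the VAE latent for a square image_px x image_px input.

    Assumes the image side is divisible by the downsampling factor.
    """
    if image_px % downsample != 0:
        raise ValueError("image side must be divisible by the downsampling factor")
    side = image_px // downsample
    return side, side, channels


def token_count(image_px: int, downsample: int = 8) -> int:
    """Number of tokens after flattening the h x w latent grid."""
    h, w, _ = latent_shape(image_px, downsample)
    return h * w
```

Under these assumptions a 256px input yields the 32×32 grid (1024 tokens) used in the rest of this document.
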
### 2. CARTEL Backbone
Each CARTEL block consists of:
- **SSM branch**: Bidirectional selective state space model (Mamba-style)
- **RWKV branch**: Linear attention with spatial shift
- **LTC residual**: Liquid Time-Constant gate for adaptive information routing

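One way to read the LTC residual is as a single fused Euler step of the liquid time-constant ODE, dx/dt = -(1/τ + f)·x + f·A. The sketch below is a minimal per-channel illustration under that assumption. Treating τ and A as the two learnable per-channel parameters, and using a sigmoid of the branch output as the gate f, are choices made here for illustration, not details given in this spec.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def ltc_step(x: float, drive: float, tau: float, a: float, dt: float = 1.0) -> float:
    """One fused (semi-implicit) Euler step of dx/dt = -(1/tau + f) * x + f * a.

    x     : current channel state
    drive : pre-activation from the SSM/RWKV branches (hypothetical input)
    tau   : time constant; larger tau means slower dynamics
    a     : value the state is pulled toward when the gate opens
    """
    f = sigmoid(drive)  # input-dependent gate in (0, 1)
    return (x + dt * f * a) / (1.0 + dt * (1.0 / tau + f))


def ltc_residual(xs, drives, taus, targets, dt=1.0):
    """Apply the step independently to every channel (two parameters per channel)."""
    return [ltc_step(x, d, t, a, dt) for x, d, t, a in zip(xs, drives, taus, targets)]
```

With the gate nearly closed, a channel with a large τ keeps most of its state while a channel with a small τ decays quickly, which is the slow/fast behaviour this kind of gate is meant to provide.
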
### 3. PHI-SCAN Pattern
Physics-informed zigzag scan minimizing geodesic distance distortion when flattening 2D→1D:
- Uses Hilbert curve ordering (preserves locality)
- Alternates between row-major, column-major, and diagonal scans per layer
- Zero additional parameters (permutation matrices are fixed)

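Because every scan is a fixed permutation of the h×w grid, the orderings can be precomputed once and add no parameters. The sketch below builds row-major, column-major, and Hilbert-curve orderings for a square grid and cycles through them per layer. The function names and the cycling rule are illustrative assumptions, and the diagonal scan is omitted for brevity.

```python
def row_major(n: int) -> list[int]:
    return list(range(n * n))


def column_major(n: int) -> list[int]:
    return [y * n + x for x in range(n) for y in range(n)]


def hilbert_order(n: int) -> list[int]:
    """Flat indices of an n x n grid visited along a Hilbert curve (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    order = []
    for d in range(n * n):
        x = y = 0
        t, s = d, 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:  # rotate/flip the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        order.append(y * n + x)
    return order


def inverse(perm: list[int]) -> list[int]:
    """Permutation that undoes perm, used to restore the 2D layout."""
    inv = [0] * len(perm)
    for pos, idx in enumerate(perm):
        inv[idx] = pos
    return inv


SCANS = (row_major, column_major, hilbert_order)


def scan_for_layer(layer: int, n: int) -> list[int]:
    """Pick a fixed ordering for a layer by cycling through SCANS (assumed rule)."""
    return SCANS[layer % len(SCANS)](n)
```

A flattened latent `flat` is scanned with `[flat[i] for i in perm]` and restored with `[scanned[j] for j in inverse(perm)]`.
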
### 4. ASDL Latent Space
The model learns to decompose latent representations into:
- **Style vector**: s ∈ R^(d_s): artistic style (colored pencil, watercolor, anime, etc.)
- **Content vector**: c ∈ R^(d_c): semantic content (characters, objects, scenes)
- **Concept vector**: n ∈ R^(d_n): abstract concept relationships
- **Mood vector**: m ∈ R^(d_m): emotional tone, atmosphere, lighting feel
- **Composition vector**: p ∈ R^(d_p): spatial layout, perspective, framing

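To make the decomposition concrete, the sketch below treats the five factors as contiguous slices of one flat latent code, so a single factor (for example style) can be swapped without touching the others. The slice widths and helper names are hypothetical: the spec does not fix d_s, d_c, d_n, d_m, or d_p, nor does it say the factors are stored contiguously.

```python
# Hypothetical factor widths; the spec leaves d_s, d_c, d_n, d_m, d_p unspecified.
FACTOR_DIMS = {"style": 4, "content": 8, "concept": 4, "mood": 2, "composition": 2}


def split_factors(code, dims=FACTOR_DIMS):
    """Split a flat latent code into named factor vectors."""
    if len(code) != sum(dims.values()):
        raise ValueError("code length does not match the factor widths")
    factors, start = {}, 0
    for name, width in dims.items():
        factors[name] = list(code[start:start + width])
        start += width
    return factors


def merge_factors(factors, dims=FACTOR_DIMS):
    """Inverse of split_factors: concatenate the factors in a fixed order."""
    return [v for name in dims for v in factors[name]]


def swap_factor(code_a, code_b, name, dims=FACTOR_DIMS):
    """Return code_a with one factor replaced by the same factor from code_b."""
    fa, fb = split_factors(code_a, dims), split_factors(code_b, dims)
    fa[name] = fb[name]
    return merge_factors(fa, dims)
```

For example, `swap_factor(code_a, code_b, "style")` keeps the content, concept, mood, and composition of `code_a` while taking the style slice from `code_b`.
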
### 5. Training: Flow Matching with Auxiliary Reasoning Losses
Instead of diffusion, use flow matching (Rectified Flow) in latent space, which is more stable and converges faster for small models.

Loss function:
L = λ_flow * L_flow + λ_concept * L_concept + λ_style * L_style + λ_mood * L_mood + λ_consistency * L_VAE_recon + λ_physics * L_spectral_smoothness

where L_flow is the standard flow-matching loss on the predicted velocity field v_t(z_t).

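A minimal sketch of how the terms combine, assuming the common rectified-flow convention in which the training point interpolates linearly between a noise sample (t=0) and a data latent (t=1), so the regression target is the constant velocity `data - noise`. The weight values and function names are placeholders, not values from this spec.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Rectified-flow regression target: constant along the straight path."""
    return [d - n for n, d in zip(noise, data)]


def flow_loss(pred_velocity, noise, data):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(pred_velocity, target)) / len(target)


# Placeholder weights; the spec does not give values for the lambdas.
WEIGHTS = {"flow": 1.0, "concept": 0.1, "style": 0.1, "mood": 0.1,
           "consistency": 0.1, "physics": 0.01}


def total_loss(terms, weights=WEIGHTS):
    """Weighted sum of the named loss terms that are active in this step."""
    return sum(weights[name] * value for name, value in terms.items())
```

Passing only a subset of terms to `total_loss` (for instance `{"flow": ..., "style": ...}`) is one simple way to express the per-stage losses of the curriculum.
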
## Architecture Diagram (Simplified)

```
Text Prompt → CLIP/Text Encoder → Semantic Embedding φ_t
Timestep t → Time Embedding τ_t
Image Latent z_t → PHI-SCAN → [CARTEL Block × N] → Pred z_0 / v_t(z_t)
Long skip connections (U-Net style)
```

## Training Stages (Progressive Modular)

### Stage 1: Foundation + Style Learning
Dataset: High-quality anime illustrations with strong style tags
Objective: Learn ASDL style vectors. Style head predicts s from z.
Loss: L_style + L_flow + L_VAE_recon
Duration: ~50K steps

### Stage 2: Content Understanding
Dataset: Detailed caption dataset (e.g., Danbooru captions)
Freeze style head, train content head.
Objective: Content head predicts c from text embedding.
Loss: L_concept + cross-modal alignment

### Stage 3: Concept & Logic
Dataset: Reasoning datasets + attribute counting + spatial relationship data
Train concept head with structured reasoning traces.

### Stage 4: Mood & Philosophy
Dataset: Art with mood labels, emotional analysis data
Train mood head to predict m from latent z.

### Stage 5: Full End-to-End
Unfreeze entire model, train with all losses balanced.
Train for classifier-free guidance by dropping the text condition for 10% of training samples, so that guidance can be applied at sampling time.

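The freeze/thaw schedule can be written down as data. The sketch below lists which modules receive gradients in each stage and implements a 10% text-condition dropout as described for Stage 5. The module names, and the choice to keep the backbone trainable while earlier heads stay frozen in Stages 2 to 4, are assumptions, since the spec only states the Stage 2 freeze explicitly.

```python
import random

MODULES = ("backbone", "style_head", "content_head", "concept_head", "mood_head")

# Assumed schedule: each stage trains one new head; Stage 5 unfreezes everything.
STAGES = {
    1: {"backbone", "style_head"},
    2: {"backbone", "content_head"},
    3: {"backbone", "concept_head"},
    4: {"backbone", "mood_head"},
    5: set(MODULES),
}


def requires_grad(stage: int) -> dict[str, bool]:
    """Map each module name to whether it is trainable in the given stage."""
    return {name: name in STAGES[stage] for name in MODULES}


def drop_text(text_emb, null_emb, rng: random.Random, p: float = 0.1):
    """With probability p, replace the text embedding by a null embedding."""
    return null_emb if rng.random() < p else text_emb
```

In a real training loop the flags from `requires_grad(stage)` would be applied to the corresponding parameter groups before each stage begins.
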
## Key Design Decisions

1. **Why Flow Matching over Diffusion?**
- Fewer function evaluations for sampling (fewer steps = faster mobile inference)
- More stable training dynamics (no exploding loss from t→0 in DDPM)
- Compatible with rectified flow for 1-4 step generation

2. **Why SSM+RWKV over Transformers?**
- Linear complexity in sequence length: O(N) vs O(N²)
- For a 32×32 latent = 1024 tokens, transformer attention computes ~1M pairwise scores per layer (1024² ≈ 1.05M)
- An SSM needs ~1024 sequential state updates, roughly 1000× fewer
- Memory usage: ~50MB vs ~2GB for equivalent context

3. **Why Modular Heads?**
- Forces disentanglement in latent space
- Enables controllable generation: user can tweak style vectors post-training
- Prevents catastrophic forgetting during curriculum
- Each module can be trained/fine-tuned independently

4. **Why LTC (Liquid Time-Constant) residual?**
- Adapts effective receptive field dynamically
- For structured content (faces, buildings): slower dynamics (larger time constant)
- For texture/noise: faster dynamics (smaller time constant)
- Adds only 2 extra parameters per channel

5. **Why Wavelet Flow Matching?**
- Haar wavelet decomposes image into frequency bands
- Different CARTEL blocks can specialize on different frequencies
- LL subband: global structure (handled by deep SSM)
- LH/HL/HH: edges and textures (handled by RWKV with high-frequency bias)

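For reference, a single-level 2D Haar transform splits an even-sized grid into four half-resolution subbands. The sketch below uses the orthonormal normalization and is pure Python for clarity. Which detail band is called LH versus HL varies between libraries, so the labels here follow one common convention rather than a definition from this spec.

```python
def haar2d(img):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list."""
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 2  # local average (global structure)
            lh[r][s] = (a - b + c - d) / 2  # left-right differences
            hl[r][s] = (a + b - c - d) / 2  # top-bottom differences
            hh[r][s] = (a - b - c + d) / 2  # diagonal differences
    return ll, lh, hl, hh


def ihaar2d(ll, lh, hl, hh):
    """Inverse of haar2d: rebuild the full-resolution grid from the four subbands."""
    h, w = len(ll) * 2, len(ll[0]) * 2
    img = [[0.0] * w for _ in range(h)]
    for r in range(h // 2):
        for s in range(w // 2):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            img[2 * r][2 * s] = (p + q + u + v) / 2
            img[2 * r][2 * s + 1] = (p - q + u - v) / 2
            img[2 * r + 1][2 * s] = (p + q - u - v) / 2
            img[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2
    return img
```

Applied to a 32×32 latent channel, this gives four 16×16 subbands, and a flat region produces zeros in every band except LL.
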
## Implementation Notes

- Total model size target: **<150M parameters** (2 bytes per parameter in fp16, 1 byte in int8)
- With frozen VAE (~50M) + CARTEL backbone (~80M) + modular heads (~20M) = ~150M
- In fp16: ~300MB RAM
- In int8: ~150MB RAM
- Plus overhead for activations and recurrent state during sampling: <3GB total on device

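The budget above is simple arithmetic: weight memory is the parameter count times bytes per parameter. The sketch below reproduces it in decimal megabytes. The component counts are the approximate targets from this section, the names are illustrative, and activation memory is not modelled.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Approximate component targets from this section, in parameters.
COMPONENTS = {
    "vae": 50_000_000,
    "cartel_backbone": 80_000_000,
    "modular_heads": 20_000_000,
}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Weight storage in decimal megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


total_params = sum(COMPONENTS.values())
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_mb(total_params, dtype):.0f} MB")
```
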
## References (Inspiration Sources)

1. DiM (2405.14224): SSM for diffusion, multi-directional scan, skip connections
2. Zigzag Mamba (2403.13802): Zigzag scanning for spatial continuity
3. Diffusion-RWKV (2404.04478): RWKV for image generation, adaLN conditioning
4. MobileMamba (2411.15941): Three-stage network, wavelet-enhanced, multi-receptive field
5. MILR (2509.22761): Test-time latent reasoning, policy gradients in unified space
6. Unified Thinker (2601.03127): Reasoning-decoupled generation, task-agnostic reasoning core
7. LatentMorph (2602.02227): Implicit latent reasoning without decode-encode loops
8. LFM / Flow Matching in Latent Space (2307.08698): Flow matching with pretrained VAE
9. Liquid Time-Constant Networks (2006.04439): Continuous-time adaptive dynamics
10. Disentanglement via Latent Quantization (2305.18378): Modular latent decomposition
11. Modular Deep Learning (2302.11529): Parameter-efficient routing modules