krystv committed on
Commit
c9ecdca
·
verified ·
1 Parent(s): 9385d07

Upload architecture.md

  1. architecture.md +132 -0
architecture.md ADDED
@@ -0,0 +1,132 @@
1
+ # ARTIGEN V1.0: Novel Architecture Specification
2
+
3
+ ## Overview
4
+ **ARTIGEN** (Adaptive Reasoning-based Token-Informed Generative ENgine) is a lightweight, modular image generation architecture designed for anime/illustration generation on mobile devices (<3GB RAM). It combines insights from DiM (Diffusion Mamba), Zigzag Mamba, RWKV diffusion, MobileMamba, unified latent reasoning (MILR, Unified Thinker), flow matching in latent space (LFM), disentangled modular representation learning, and liquid time-constant dynamics.
5
+
6
+ ## Core Innovations
7
+ 1. **CARTEL** (CAscading Reasoning Token Ensemble Layer): Modular latent reasoning backbone using hybrid SSM+RWKV blocks
8
+ 2. **PHI-SCAN** (Physics-Informed multi-directional scan): Parameter-free scanning scheme that preserves 2D geometric continuity
9
+ 3. **Art-Style Disentangled Latent Space (ASDL)**: Modular decomposition of style, content, concept, mood, and composition in latent vectors
10
+ 4. **Flow-Matching in Wavelet Space**: Training with rectified flow matching on a Haar wavelet decomposition
11
+ 5. **Progressive Modular Curriculum**: Freeze/thaw training stages that force specialization
12
+
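Innovation 4 relies on a Haar wavelet decomposition. As a rough illustration of what that split looks like, the sketch below applies one level of a 2D Haar transform to a small grid and inverts it. It is pure Python for readability; the function names `haar2d` and `inverse_haar2d` are hypothetical, and the LH/HL naming follows one common convention (the spec does not fix one).

```python
def haar2d(image):
    """One-level orthonormal 2D Haar transform of an even-sized grid.

    Returns four half-resolution subbands (LL, LH, HL, HH).
    """
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(image), 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)  # local average: coarse structure
            lh_row.append((a - b + d - e) / 2.0)  # left-minus-right differences
            hl_row.append((a + b - d - e) / 2.0)  # top-minus-bottom differences
            hh_row.append((a - b - d + e) / 2.0)  # diagonal differences
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def inverse_haar2d(ll, lh, hl, hh):
    """Rebuild the full-resolution grid from the four subbands."""
    rows, cols = len(ll) * 2, len(ll[0]) * 2
    image = [[0.0] * cols for _ in range(rows)]
    for i in range(len(ll)):
        for j in range(len(ll[0])):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            image[2 * i][2 * j] = (s + x + y + z) / 2.0
            image[2 * i][2 * j + 1] = (s - x + y - z) / 2.0
            image[2 * i + 1][2 * j] = (s + x - y - z) / 2.0
            image[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2.0
    return image
```

A flat region produces zeros in the three detail subbands and puts all of its energy in LL, which is the property that would let different blocks specialize by frequency band.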
13
+ ## Mathematical Formulation
14
+
15
+ ### 1. Pretrained VAE (Frozen)
16
+ Use `madebyollin/taesd` (1M params) or `stabilityai/sd-vae-ft-mse` for latent encoding. Image x → z ∈ R^(h×w×c), where h=w=32 and c=4 for 256px images (both VAEs downsample by 8× per side, so a 1024px image would instead give h=w=128).
17
+
18
+ ### 2. CARTEL Backbone
19
+ Each CARTEL block consists of:
20
+ - **SSM branch**: Bidirectional selective state space model (Mamba-style)
21
+ - **RWKV branch**: Linear attention with spatial shift
22
+ - **LTC residual**: Liquid Time-Constant gate for adaptive information routing
23
+
24
+ ### 3. PHI-SCAN Pattern
25
+ Physics-informed zigzag scan minimizing geodesic distance distortion when flattening 2D→1D:
26
+ - Uses Hilbert curve ordering (preserves locality)
27
+ - Alternates between row-major, column-major, and diagonal scans per layer
28
+ - Zero additional parameters (permutation matrices are fixed)
29
+
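A fixed scan order costs no parameters because it is only a permutation of token indices. The sketch below builds a Hilbert-curve ordering for a 32×32 latent grid using the standard distance-to-coordinate conversion and compares its locality with a plain row-major scan. The helper names (`hilbert_d2xy`, `hilbert_order`, `max_jump`) are illustrative and not part of the spec, and the per-layer alternation between scan directions is not modelled here.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the current quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Row-major token indices listed in Hilbert-curve visiting order."""
    order = []
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)
        order.append(y * n + x)
    return order


def row_major_order(n):
    """Baseline scan: left to right, top to bottom."""
    return list(range(n * n))


def max_jump(order, n):
    """Largest Manhattan distance between consecutive tokens in a scan."""
    worst = 0
    for a, b in zip(order, order[1:]):
        ax, ay = a % n, a // n
        bx, by = b % n, b // n
        worst = max(worst, abs(ax - bx) + abs(ay - by))
    return worst
```

Applying the scan is then `[tokens[i] for i in hilbert_order(32)]`. Consecutive tokens on the Hilbert path are always grid neighbours, whereas a row-major scan jumps across the full row width at every line end.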
30
+ ### 4. ASDL Latent Space
31
+ The model learns to decompose latent representations into:
32
+ - **Style vector**: s ∈ R^(d_s), artistic style (colored pencil, watercolor, anime, etc.)
33
+ - **Content vector**: c ∈ R^(d_c), semantic content (characters, objects, scenes)
34
+ - **Concept vector**: n ∈ R^(d_n), abstract concept relationships
35
+ - **Mood vector**: m ∈ R^(d_m), emotional tone, atmosphere, lighting feel
36
+ - **Composition vector**: p ∈ R^(d_p), spatial layout, perspective, framing
37
+
38
+ ### 5. Training: Flow Matching with Auxiliary Reasoning Losses
39
+ Instead of diffusion, use flow matching (Rectified Flow) in latent space, which tends to be more stable and to converge faster for small models.
40
+
41
+ Loss function:
42
+ L = λ_flow * L_flow + λ_concept * L_concept + λ_style * L_style + λ_mood * L_mood + λ_consistency * L_VAE_recon + λ_physics * L_spectral_smoothness
43
+
44
+ where L_flow is the standard flow-matching loss (the squared error between the predicted velocity field v_t(z_t) and the target velocity) and each λ is a scalar weight.
45
+
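To make the L_flow term concrete, here is a minimal single-sample sketch of a rectified-flow loss on a flat latent vector. It assumes the straight-line path z_t = (1 - t) * z_data + t * noise with target velocity noise - z_data; the spec does not state which endpoint convention it uses, so treat that choice and the function names as assumptions. The auxiliary losses are left out.

```python
import random


def interpolate(z_data, noise, t):
    """Point at time t on the straight line from the data latent to the noise sample."""
    return [(1.0 - t) * zd + t * n for zd, n in zip(z_data, noise)]


def target_velocity(z_data, noise):
    """Constant velocity dz_t/dt along the straight path (assumed convention)."""
    return [n - zd for zd, n in zip(z_data, noise)]


def flow_matching_loss(predict_velocity, z_data, rng):
    """Single-sample Monte Carlo estimate of the flow-matching loss.

    predict_velocity(z_t, t) stands in for the network and returns a list
    with the same length as z_t.
    """
    t = rng.random()                               # t ~ U[0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in z_data]  # standard normal sample
    z_t = interpolate(z_data, noise, t)
    v_pred = predict_velocity(z_t, t)
    v_true = target_velocity(z_data, noise)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(z_data)
```

In training, this per-sample value would be averaged over a batch and scaled by λ_flow before being added to the other terms.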
46
+ ## Architecture Diagram (Simplified)
47
+
48
+ ```
49
+ Text Prompt → CLIP/Text Encoder → Semantic Embedding φ_text
50
+ Timestep t → Time Embedding φ_t
51
+ Image Latent z_t → PHI-SCAN → [CARTEL Block × N] → Pred z_0 / v_t(z_t)
52
+ Long skip connections (U-Net style)
53
+ ```
54
+
55
+ ## Training Stages (Progressive Modular)
56
+
57
+ ### Stage 1: Foundation + Style Learning
58
+ Dataset: High-quality anime illustrations with strong style tags
59
+ Objective: Learn ASDL style vectors. Style head predicts s from z.
60
+ Loss: L_style + L_flow + L_VAE_recon
61
+ Duration: ~50K steps
62
+
63
+ ### Stage 2: Content Understanding
64
+ Dataset: Detailed caption dataset (e.g., Danbooru captions)
65
+ Freeze style head, train content head.
66
+ Objective: Content head predicts c from text embedding.
67
+ Loss: L_concept + cross-modal alignment
68
+
69
+ ### Stage 3: Concept & Logic
70
+ Dataset: Reasoning datasets + attribute counting + spatial relationship data
71
+ Train concept head with structured reasoning traces.
72
+
73
+ ### Stage 4: Mood & Philosophy
74
+ Dataset: Art with mood labels, emotional analysis data
75
+ Train mood head to predict m from latent z.
76
+
77
+ ### Stage 5: Full End-to-End
78
+ Unfreeze entire model, train with all losses balanced.
79
+ Train with conditioning dropout (replace the text prompt with an empty prompt 10% of the time) so that classifier-free guidance can be applied at inference.
80
+
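The two halves of classifier-free guidance can be sketched separately: prompt dropout during training and velocity extrapolation during sampling. The empty-string null prompt, the function names, and the guidance formula as written here are assumptions for illustration; the spec only fixes the 10% dropout rate.

```python
import random

NULL_PROMPT = ""  # assumption: the empty string acts as the unconditional prompt


def drop_conditioning(prompts, drop_prob=0.1, rng=random):
    """Training side: swap each prompt for the null prompt with probability drop_prob."""
    return [NULL_PROMPT if rng.random() < drop_prob else p for p in prompts]


def guided_velocity(v_cond, v_uncond, scale):
    """Sampling side: extrapolate from the unconditional to the conditional prediction.

    scale = 0 gives the unconditional output, scale = 1 the conditional one,
    and scale > 1 pushes further toward the prompt.
    """
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]
```

Because the same network sees both real and null prompts during training, it can supply both `v_cond` and `v_uncond` at sampling time without a separate unconditional model.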
81
+ ## Key Design Decisions
82
+
83
+ 1. **Why Flow Matching over Diffusion?**
84
+ - Fewer function evaluations for sampling (fewer steps = faster mobile inference)
85
+ - More stable training dynamics (no exploding loss from t→0 in DDPM)
86
+ - Compatible with rectified flow for 1-4 step generation
87
+
88
+ 2. **Why SSM+RWKV over Transformers?**
89
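+ - Linear complexity in sequence length: O(N) vs O(N²)
90
+ - For a 32×32 latent = 1024 tokens, transformer attention computes 1024² ≈ 1M pairwise scores per head per layer
91
+ - An SSM needs ~1024 sequential state updates, roughly 1000x fewer token-mixing steps (before accounting for state and channel dimensions)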
92
+ - Memory usage: ~50MB vs ~2GB for equivalent context
93
+
94
+ 3. **Why Modular Heads?**
95
+ - Forces disentanglement in latent space
96
+ - Enables controllable generation: user can tweak style vectors post-training
97
+ - Prevents catastrophic forgetting during curriculum
98
+ - Each module can be trained/fine-tuned independently
99
+
100
+ 4. **Why LTC (Liquid Time-Constant) residual?**
101
+ - Adapts effective receptive field dynamically
102
+ - For structured content (faces, buildings): slower dynamics (larger time constant)
103
+ - For texture/noise: faster dynamics (smaller time constant)
104
+ - Adds only 2 extra parameters per channel
105
+
106
+ 5. **Why Wavelet Flow Matching?**
107
+ - Haar wavelet decomposes image into frequency bands
108
+ - Different CARTEL blocks can specialize on different frequencies
109
+ - LL subband: global structure (handled by deep SSM)
110
+ - LH/HL/HH: edges and textures (handled by RWKV with high-frequency bias)
111
+
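The token-count arithmetic behind the SSM-versus-transformer comparison in decision 2 can be checked directly. This sketch counts only token-mixing interactions per layer and deliberately ignores head count, channel width, and SSM state size, so the ratio is a rough scaling argument and not a FLOP estimate. The function names are illustrative.

```python
def attention_pair_count(height, width):
    """Pairwise token interactions in full self-attention over a height x width grid."""
    n = height * width
    return n * n


def scan_step_count(height, width):
    """Sequential state updates for one linear-time scan over the same grid."""
    return height * width


tokens = scan_step_count(32, 32)
pairs = attention_pair_count(32, 32)
print(f"tokens={tokens}, attention pairs={pairs}, ratio={pairs // tokens}x")
```

Doubling the latent side length to 64 quadruples the scan cost but multiplies the attention pair count by sixteen, which is where the linear-time backbone matters most.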
112
+ ## Implementation Notes
113
+
114
+ - Total model size target: **<150M parameters** (~300MB in fp16, ~150MB in int8)
115
+ - With frozen VAE (~50M) + CARTEL backbone (~80M) + modular heads (~20M) = ~150M
116
+ - In fp16: ~300MB RAM
117
+ - In int8: ~150MB RAM
118
+ - Plus overhead for activations, recurrent state, and sampling buffers: <3GB total on device
119
+
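The weight-memory figures follow from bytes per parameter. A small sketch, using the component sizes listed above as given and counting weights only (no activations or runtime buffers); the dictionary and function names are illustrative:

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Parameter counts as stated in the notes above (approximate targets).
COMPONENTS = {
    "vae": 50_000_000,
    "cartel_backbone": 80_000_000,
    "modular_heads": 20_000_000,
}


def weight_megabytes(num_params, dtype):
    """Storage for the weights alone, in MB (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


total_params = sum(COMPONENTS.values())
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_megabytes(total_params, dtype):.0f} MB")
```

At 150M parameters this gives 600 MB in fp32, 300 MB in fp16, and 150 MB in int8, which leaves most of the 3GB device budget for activations and the runtime.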
120
+ ## References (Inspiration Sources)
121
+
122
+ 1. DiM (2405.14224): SSM for diffusion, multi-directional scan, skip connections
123
+ 2. Zigzag Mamba (2403.13802): Zigzag scanning for spatial continuity
124
+ 3. Diffusion-RWKV (2404.04478): RWKV for image generation, adaLN conditioning
125
+ 4. MobileMamba (2411.15941): Three-stage network, wavelet-enhanced, multi-receptive field
126
+ 5. MILR (2509.22761): Test-time latent reasoning, policy gradients in unified space
127
+ 6. Unified Thinker (2601.03127): Reasoning-decoupled generation, task-agnostic reasoning core
128
+ 7. LatentMorph (2602.02227): Implicit latent reasoning without decode-encode loops
129
+ 8. LFM / Flow Matching in Latent Space (2307.08698): Flow matching with pretrained VAE
130
+ 9. Liquid Time-Constant Networks (2006.04439): Continuous-time adaptive dynamics
131
+ 10. Disentanglement via Latent Quantization (2305.18378): Modular latent decomposition
132
+ 11. Modular Deep Learning (2302.11529): Parameter-efficient routing modules