NoesisLab/Spartacus-1B-Instruct

@@ -31,144 +31,6 @@ A 1.3B parameter language model that replaces softmax attention with **causal mo
 ![Core Mechanism: The Monoid Recurrence](ARCH.png)
-## Architecture Overview
-```
- ╔═══════════════════════════════════════════════════════════════════════════╗
- ║                        MonoidForCausalLM  (1.34B)                        ║
- ╠═══════════════════════════════════════════════════════════════════════════╣
- ║                                                                          ║
- ║   token_ids ──> [ embed_tokens  128256 × 2048 ] ──> x_0                 ║
- ║                                                                          ║
- ║                    ┌─────────────────────────┐                           ║
- ║                    │  MonoidDecoderLayer × 16 │ ◄── see detail below     ║
- ║                    └─────────────────────────┘                           ║
- ║                                │                                         ║
- ║                          [ RMSNorm ]                                     ║
- ║                                │                                         ║
- ║                     [ lm_head  2048 × 128256 ] ──> logits               ║
- ║                     (tied with embed_tokens)                             ║
- ╚═══════════════════════════════════════════════════════════════════════════╝
- ╔═══════════════════════════════════════════════════════════════════════════╗
- ║                MonoidDecoderLayer  (× 16 layers)                         ║
- ╠═══════════════════════════════════════════════════════════════════════════╣
- ║                                                                          ║
- ║   x ─────────────────────────────────────────┐  (residual)              ║
- ║   │                                          │                           ║
- ║   [ input_layernorm  RMSNorm ]               │                           ║
- ║   │                                          │                           ║
- ║   [ MonoidAttention ] ◄── see detail below   │                           ║
- ║   │                                          │                           ║
- ║   + <────────────────────────────────────────┘                           ║
- ║   │                                                                      ║
- ║   x ─────────────────────────────────────────┐  (residual)              ║
- ║   │                                          │                           ║
- ║   [ post_attention_layernorm  RMSNorm ]      │                           ║
- ║   │                                          │                           ║
- ║   [ LlamaMLP  2048 → 8192 → 2048 ]          │                           ║
- ║   │    gate_proj ─┐                          │                           ║
- ║   │    up_proj ───┤─> SiLU(gate) ⊙ up       │                           ║
- ║   │               └──> down_proj ──> out     │                           ║
- ║   │                                          │                           ║
- ║   + <────────────────────────────────────────┘                           ║
- ║   │                                                                      ║
- ║   out                                                                    ║
- ╚═══════════════════════════════════════════════════════════════════════════╝
- ╔═══════════════════════════════════════════════════════════════════════════╗
- ║              MonoidAttention  (32 heads, d=64 per head)                  ║
- ╠═══════════════════════════════════════════════════════════════════════════╣
- ║                                                                          ║
- ║   x_t ∈ R^{2048}                                                        ║
- ║    │                                                                     ║
- ║    ├──> q_proj ──> [B,H,T,d] ──> RMSNorm ──> ×(1/√d) ──────> q_t       ║
- ║    │                                                                     ║
- ║    ├──> k_proj ──> [B,H,T,d] ──> RMSNorm ──> SiLU ──────────> k_t  ≥0  ║
- ║    │                                                                     ║
- ║    ├──> v_proj ──> [B,H,T,d] ────────────────────────────────> v_t      ║
- ║    │                                                                     ║
- ║    ├──> decay_proj ──> Sigmoid ──> α_t ∈ (0,1)^d   (vector decay gate)  ║
- ║    │                                          bias init = 3.0            ║
- ║    │                                          → σ(3) ≈ 0.95 at start    ║
- ║    │                                                                     ║
- ║    └──> gate_proj ──> SiLU ──────> g_t ∈ R^{H*d}   (output gate)       ║
- ║                                                                          ║
- ║   ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
- ║   Monoid Recurrence (training: parallel prefix scan, decode: O(1))       ║
- ║   ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
- ║                                                                          ║
- ║    k_t ⊗ v_t  ──────────────┐                                           ║
- ║         [d×d]               v                                            ║
- ║                  ┌─────────────────────────┐                             ║
- ║    S_{t-1} ────> │  S_t = diag(α_t)·S_{t-1}│                            ║
- ║     [d×d]        │       + k_t ⊗ v_t       │──> S_t                     ║
- ║                  └─────────────────────────┘    [d×d]                    ║
- ║                   "compressed causal history"                            ║
- ║                                                                          ║
- ║    h0 (learnable, zero-init) ──> S_0 at sequence start                   ║
- ║                                                                          ║
- ║   ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
- ║   Readout + Output Projection                                            ║
- ║   ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
- ║                                                                          ║
- ║    q_t ──> einsum(q, S_t) ──> o_t ──> RMSNorm ──┐                       ║
- ║                                   (o_norm)       │                       ║
- ║                                                  v                       ║
- ║    g_t ──────────────────────────────────> g_t ⊙ o_t ──> o_proj ──> out ║
- ║                                                                          ║
- ╚═══════════════════════════════════════════════════════════════════════════╝
- ╔═══════════════════════════════════════════════════════════════════════════╗
- ║         MonoidCache — O(1) State  (replaces O(T) KV-Cache)              ║
- ╠═══════════════════════════════════════════════════════════════════════════╣
- ║                                                                          ║
- ║   Transformer KV-Cache:          Monoid State Cache:                     ║
- ║   ┌──────────────────┐           ┌──────────────────┐                    ║
- ║   │ K: [B,H,T,d]     │           │ S: [B,H,d,d]     │  ← fixed size    ║
- ║   │ V: [B,H,T,d]     │           │ α_acc: [B,H,d]   │                   ║
- ║   │ grows with T ↑↑↑  │           │ per layer         │                   ║
- ║   └──────────────────┘           └──────────────────┘                    ║
- ║   Memory: O(T·H·d)              Memory: O(H·d²)                         ║
- ║   1000 tok → 2M floats/layer    ANY length → 131K floats/layer          ║
- ║                                                                          ║
- ║   Decode step:                   Decode step:                            ║
- ║   o = softmax(q·K^T)·V          S_t = α_t·S_{t-1} + k_t⊗v_t           ║
- ║   scan T keys ↑                  o_t = q_t · S_t                        ║
- ║   Time: O(T·d)                   Time: O(d²)  ← constant!              ║
- ╚═══════════════════════════════════════════════════════════════════════════╝
- ╔═══════════════════════════════════════════════════════════════════════════╗
- ║           Weight Transfer from Llama-3.2-1B-Instruct                     ║
- ╠═══════════════════════════════════════════════════════════════════════════╣
- ║                                                                          ║
- ║   Reused directly (frozen-compatible):                                   ║
- ║   ┌──────────────────────────────────────────────┐                       ║
- ║   │  embed_tokens       128256 × 2048            │                       ║
- ║   │  lm_head            2048 × 128256 (tied)     │                       ║
- ║   │  LlamaMLP × 16      gate/up/down_proj        │                       ║
- ║   │  LlamaRMSNorm × 33  input/post_attn/final    │                       ║
- ║   │  q_proj × 16        2048 → 2048              │                       ║
- ║   │  k_proj × 16        2048 → 2048  (tiled 8→32 heads from GQA)  │     ║
- ║   │  v_proj × 16        2048 → 2048  (tiled 8→32 heads from GQA)  │     ║
- ║   │  o_proj × 16        2048 → 2048              │                       ║
- ║   └──────────────────────────────────────────────┘                       ║
- ║                                                                          ║
- ║   Novel (randomly initialized):                                          ║
- ║   ┌──────────────────────────────────────────────┐                       ║
- ║   │  decay_proj × 16    2048 → 2048  (bias=3.0)  │                       ║
- ║   │  gate_proj × 16     2048 → 2048  (std=0.01)  │                       ║
- ║   │  q_norm × 16        RMSNorm(64)              │                       ║
- ║   │  k_norm × 16        RMSNorm(64)              │                       ║
- ║   │  o_norm × 16        RMSNorm(64)  (weight=1)  │                       ║
- ║   │  h0 × 16            [1,32,64,64] (zeros)     │                       ║
- ║   └──────────────────────────────────────────────┘                       ║
- ╚═══════════════════════════════════════════════════════════════════════════╝
-```
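Note on the removed diagram: the recurrence `S_t = diag(α_t)·S_{t-1} + k_t ⊗ v_t` and the fixed-size cache it implies can be illustrated with a small sketch. This is a pure-Python illustration for a single head with a tiny `d`, not the repository's implementation. All function names (`step`, `readout`, `combine`, `tree_reduce`, `run_demo`, `kv_cache_floats`, `monoid_state_floats`) are hypothetical, the inputs are random, and it assumes `diag(α_t)` scales the rows of `S` (the key dimension), as the notation in the diagram suggests. It checks that folding `(α, k ⊗ v)` pairs with an associative `combine` in a tree order, as a parallel prefix scan would, matches the token-by-token decode recurrence.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def step(S, alpha, k, v):
    # Decode update: S_t = diag(alpha_t) . S_{t-1} + k_t (outer) v_t
    return [[alpha[i] * S[i][j] + k[i] * v[j] for j in range(len(v))]
            for i in range(len(k))]


def readout(q, S):
    # o_t[j] = sum_i q[i] * S[i][j]
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


def combine(left, right):
    # Monoid operation on (accumulated decay, state) pairs; `left` is earlier in time.
    a1, S1 = left
    a2, S2 = right
    a = [x * y for x, y in zip(a1, a2)]
    S = [[a2[i] * S1[i][j] + S2[i][j] for j in range(len(S1[0]))]
         for i in range(len(S1))]
    return a, S


def tree_reduce(elems):
    # Combine neighbours pairwise, preserving time order (assumes a non-empty list).
    while len(elems) > 1:
        paired = [combine(elems[i], elems[i + 1]) for i in range(0, len(elems) - 1, 2)]
        if len(elems) % 2:
            paired.append(elems[-1])
        elems = paired
    return elems[0]


def run_demo(T=7, d=4, seed=0):
    rng = random.Random(seed)
    S = [[0.0] * d for _ in range(d)]  # zero-initialised start state, like h0
    elems = []
    for _ in range(T):
        alpha = [sigmoid(3.0 + rng.gauss(0, 1)) for _ in range(d)]  # decay bias 3.0
        k = [rng.gauss(0, 1) for _ in range(d)]
        v = [rng.gauss(0, 1) for _ in range(d)]
        S = step(S, alpha, k, v)
        elems.append((alpha, [[k[i] * v[j] for j in range(d)] for i in range(d)]))
    _, S_tree = tree_reduce(elems)
    return S, S_tree


def kv_cache_floats(T, H=32, d=64):
    # Size of ONE cached tensor (K or V) per layer; it grows with T.
    return H * T * d


def monoid_state_floats(H=32, d=64):
    # Size of the state S per layer; independent of sequence length.
    return H * d * d
```

With the diagram's shapes (`H=32`, `d=64`), `monoid_state_floats()` gives 131,072 floats per layer, and `kv_cache_floats(1000)` gives 2,048,000, which matches the "2M floats/layer" figure if that figure counts a single 32-head tensor; counting K and V together would double it.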
 ## Key Properties

