## Architecture Overview

```
MonoidForCausalLM (1.34B)
═════════════════════════

  token_ids ──> [ embed_tokens 128256 × 2048 ] ──> x_0
                          │
            ┌─────────────────────────┐
            │ MonoidDecoderLayer × 16 │  <── see detail below
            └─────────────────────────┘
                          │
                     [ RMSNorm ]
                          │
            [ lm_head 2048 × 128256 ] ──> logits
              (tied with embed_tokens)

MonoidDecoderLayer (× 16 layers)
════════════════════════════════

  x
  │
  [ input_layernorm (RMSNorm) ]
  │
  [ MonoidAttention ]   <── see detail below
  │
  +  <── residual (x)
  │
  [ post_attention_layernorm (RMSNorm) ]
  │
  [ LlamaMLP 2048 → 8192 → 2048 ]
  │     gate_proj ─┐
  │     up_proj ───┼──> SiLU(gate) ⊙ up
  │                └──> down_proj ──> out
  │
  +  <── residual
  │
  out

MonoidAttention (32 heads, d=64 per head)
═════════════════════════════════════════

  x_t ∈ R^{2048}
   │
   ├──> q_proj ──> [B,H,T,d] ──> RMSNorm ──> ×(1/√d) ──────> q_t
   ├──> k_proj ──> [B,H,T,d] ──> RMSNorm ──> SiLU ─────────> k_t ≥ 0
   ├──> v_proj ──> [B,H,T,d] ──────────────────────────────> v_t
   ├──> decay_proj ──> Sigmoid ──> α_t ∈ (0,1)^d   (vector decay gate;
   │                               bias init = 3.0, so σ(3) ≈ 0.95 at start)
   └──> gate_proj ──> SiLU ──> g_t ∈ R^{H·d}       (output gate)

  Monoid recurrence (training: parallel prefix scan, decode: O(1))
  ────────────────────────────────────────────────────────────────

    S_{t-1} ──> S_t = diag(α_t)·S_{t-1} + k_t ⊗ v_t
                S_t: [d×d]  "compressed causal history"

    h0 (learnable, zero-init) ──> S_0 at sequence start

  Readout + output projection
  ───────────────────────────

    q_t ──> einsum(q_t, S_t) ──> o_t ──> RMSNorm (o_norm)
    g_t ⊙ o_t ──> o_proj ──> out

MonoidCache: O(1) state (replaces the O(T) KV-cache)
════════════════════════════════════════════════════

  Transformer KV-cache:               Monoid state cache:
    K: [B,H,T,d]                        S:     [B,H,d,d]   <── fixed size,
    V: [B,H,T,d]                        α_acc: [B,H,d]         per layer
    grows with T

  Memory: O(T·H·d)                    Memory: O(H·d²)
    1000 tok ≈ 2M floats/layer          ANY length ≈ 131K floats/layer

  Decode step:                        Decode step:
    o = softmax(q·K^T)·V                S_t = α_t·S_{t-1} + k_t ⊗ v_t
    scan over T keys                    o_t = q_t · S_t
    Time: O(T·d)                        Time: O(d²)   <── constant!

Weight Transfer from Llama-3.2-1B-Instruct
══════════════════════════════════════════

  Reused directly (frozen-compatible):
    embed_tokens        128256 × 2048
    lm_head             2048 × 128256   (tied)
    LlamaMLP × 16       gate/up/down_proj
    LlamaRMSNorm × 33   input/post_attn/final
    q_proj × 16         2048 → 2048
    k_proj × 16         2048 → 2048    (tiled 8 → 32 heads from GQA)
    v_proj × 16         2048 → 2048    (tiled 8 → 32 heads from GQA)
    o_proj × 16         2048 → 2048

  Novel (randomly initialized):
    decay_proj × 16     2048 → 2048    (bias = 3.0)
    gate_proj × 16      2048 → 2048    (std = 0.01)
    q_norm × 16         RMSNorm(64)
    k_norm × 16         RMSNorm(64)
    o_norm × 16         RMSNorm(64)    (weight = 1)
    h0 × 16             [1,32,64,64]   (zeros)
```
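Llama-3.2-1B uses grouped-query attention with 8 KV heads, so its `k_proj`/`v_proj` weights cover only 512 of the 2048 output dims; the transfer table above tiles each KV head 4× so all 32 heads get a key/value projection. A numpy sketch of that tiling (the exact head layout is an assumption for illustration; the repo may order heads differently):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 2048, 64
n_kv_heads, n_heads = 8, 32            # GQA source -> full multi-head target

# donor k_proj weight: [8 * 64, 2048] = [512, 2048]
W_k_gqa = rng.normal(size=(n_kv_heads * d_head, d_model)).astype(np.float32)

# view as [kv_heads, d_head, d_model] and repeat each KV head 4 times,
# mirroring the 4 query heads that shared it in the GQA model
W_k_full = (
    W_k_gqa.reshape(n_kv_heads, d_head, d_model)
    .repeat(n_heads // n_kv_heads, axis=0)
    .reshape(n_heads * d_head, d_model)
)
print(W_k_full.shape)                  # (2048, 2048)
```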
## Key Properties