reaperdoesntknow committed on
Commit
a9467db
·
verified ·
1 Parent(s): 784c21a

Update README.md

Files changed (1)
  1. README.md +20 -23
README.md CHANGED
@@ -11,14 +11,17 @@ tags:
11
  - mqa
12
  - hyperffn
13
  - router-gating
 
 
 
14
  ---
15
 
16
- # MoAMetricLM-185M — Mixture of Attentions (MoA)
17
 
18
  **A geometry-aware Transformer with a mixture of attention mechanisms and metric-based routing.**
19
- **Parameters:** ~185M | **Type:** Causal LM (decoder-only) | **KV cache:** not yet implemented
 
20
 
21
- ---
22
 
23
  ## Model Index
24
 
@@ -26,10 +29,10 @@ tags:
26
  - **Task:** text generation (`text-generation`)
27
  - **Library:** 🤗 Transformers
28
  - **License:** Apache-2.0 (change here & add LICENSE file if different)
29
- - **Datasets (examples used in dev runs):**
30
- - `WeMake/Intelligent-Content-Understanding` (+ another 250k-token dataset)
 
31
 
32
- ---
33
 
34
  ## Overview
35
 
@@ -49,7 +52,6 @@ tags:
49
 
50
  **Design goals:** geometric consistency, diverse inductive biases, structured efficiency, and full HF compatibility.
51
 
52
- ---
53
 
54
  ## What’s different from a standard Transformer?
55
 
@@ -63,7 +65,7 @@ tags:
63
  - **Up/Down projections** (SwiGLU-style) inside heads to expand/contract the value stream.
64
  - **HyperFFN** provides non-lazy capacity with token-wise branch routing.
65
 
66
- ---
67
 
68
  ## Intended Use & Limitations
69
 
@@ -76,7 +78,7 @@ tags:
76
 
77
  **Out-of-scope:** high-stakes applications (medical/legal/etc.) without further training, evaluation, and safeguards.
78
 
79
- ---
80
 
81
  ## Training Details
82
 
@@ -98,7 +100,7 @@ tags:
98
 
99
  **Stability aids:** safe softmax (subtract max), PreNorm, LayerScale (≈1e-4), DropPath (optional), label masking (`-100` on padding).
100
 
101
- ---
102
 
103
  ## Configuration (example)
104
 
@@ -133,6 +135,8 @@ tags:
133
  "bos_token_id": 50256,
134
  "eos_token_id": 50256
135
  }
 
 
136
 
137
  If you use the GPT-2 tokenizer, set `pad_token = eos_token` and ensure that `vocab_size` and the eos/pad token IDs match the tokenizer.
138
 
@@ -142,6 +146,7 @@ Usage
142
 
143
  Inference
144
 
 
145
  from transformers import AutoTokenizer, AutoModelForCausalLM
146
 
147
  model_id = "your-hf-username/MoAMetricLM-185M"
@@ -186,8 +191,8 @@ def collate(examples):
186
  # out.loss.backward()
187
  # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.2)
188
  # optimizer.step(); optimizer.zero_grad()
 
189
 
190
- ---
191
  ## Evaluation
192
 
193
  For meaningful comparisons, run:
@@ -198,9 +203,7 @@ For meaningful comparisons, run:
198
  • With vs without HyperFFN branch router/gates
199
  • With vs without TI regularizer
200
 
201
- Please share results via Issues/PRs.
202
 
203
-
204
 
205
  Efficiency Notes
206
  • Ball pruning: masks keys outside per-head radius → structured sparsity.
@@ -208,22 +211,17 @@ Efficiency Notes
208
  • HyperFFN: token-wise branch router (optional top-k) to avoid paying for all branches equally.
209
  • CPU tips: set OMP_NUM_THREADS/MKL_NUM_THREADS to core count; use pad_token = eos_token.
210
 
211
- Roadmap: metric-aware KV cache for long contexts; kernelized distance approximations (e.g., RFF) for sub-quadratic regimes; quantization & mixed precision exploration.
212
-
213
-
214
-
215
  Safety, Bias & Risks
216
  • May produce biased, offensive, or factually incorrect outputs.
217
  • No safety/alignment training included.
218
- • Do not deploy in high-stakes contexts without additional safeguards.
219
 
220
-
221
 
222
  License
223
 
224
  Apache-2.0 (update if different).
225
 
226
-
227
 
228
  Citation
229
 
@@ -235,16 +233,15 @@ Citation
235
  }
236
 
237
 
238
-
239
 
240
  Changelog
241
  • v0.2 (2025-09-20) — 500k-token CPU run, GPT-2 tokenizer, LR=5e-4, final loss ≈ 0.30.
242
  • v0.1 (2025-09-20) — initial public release: metric heads, MQA, ball pruning, HyperFFN, router & gates; HF-compatible; no KV cache.
243
 
244
-
245
 
246
  Maintainers
247
  • Author: reaper (Convergent Intelligence LLC)
248
  • Contact: add preferred contact
249
  • Issues: HF model hub issues tab
250
- ---
 
11
  - mqa
12
  - hyperffn
13
  - router-gating
14
+ datasets:
15
+ - nvidia/Nemotron-Math-HumanReasoning
16
+ - WeMake/Intelligent-Content-Understanding
17
  ---
18
 
19
+ # MoAMetricLM-100M — Mixture of Attentions (MoA)
20
 
21
  **A geometry-aware Transformer with a mixture of attention mechanisms and metric-based routing.**
22
+ **Parameters:** ~100M | **Type:** Causal LM (decoder-only) | **KV cache:** not yet implemented
23
+
24
 
 
25
 
26
  ## Model Index
27
 
 
29
  - **Task:** text generation (`text-generation`)
30
  - **Library:** 🤗 Transformers
31
  - **License:** Apache-2.0 (change here & add LICENSE file if different)
32
+ - **Datasets:**
33
+ - nvidia/Nemotron-Math-HumanReasoning: ~256k tokens
34
+ - WeMake/Intelligent-Content-Understanding: ~256k tokens
35
 
 
36
 
37
  ## Overview
38
 
 
52
 
53
  **Design goals:** geometric consistency, diverse inductive biases, structured efficiency, and full HF compatibility.
54
 
 
55
 
56
  ## What’s different from a standard Transformer?
57
 
 
65
  - **Up/Down projections** (SwiGLU-style) inside heads to expand/contract the value stream.
66
  - **HyperFFN** provides non-lazy capacity with token-wise branch routing.
67
 
68
+
69
 
70
  ## Intended Use & Limitations
71
 
 
78
 
79
  **Out-of-scope:** high-stakes applications (medical/legal/etc.) without further training, evaluation, and safeguards.
80
 
81
+
82
 
83
  ## Training Details
84
 
 
100
 
101
  **Stability aids:** safe softmax (subtract max), PreNorm, LayerScale (≈1e-4), DropPath (optional), label masking (`-100` on padding).
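
Two of the stability aids listed above, safe softmax and `-100` label masking, can be illustrated in a few lines. This is a minimal pure-Python sketch, not the model's actual training code; the function names `safe_softmax` and `masked_cross_entropy` are hypothetical and chosen only for illustration.

```python
import math

def safe_softmax(logits):
    """Softmax that subtracts the max logit first so exp() cannot overflow."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_entropy(logits_per_token, labels, ignore_index=-100):
    """Mean negative log-likelihood over tokens whose label is not ignore_index."""
    total, count = 0.0, 0
    for logits, label in zip(logits_per_token, labels):
        if label == ignore_index:
            continue  # padding position: contributes nothing to the loss
        probs = safe_softmax(logits)
        total -= math.log(probs[label])
        count += 1
    return total / count if count else 0.0

# Large logits would overflow a naive exp(); the shifted version stays finite.
print(safe_softmax([1000.0, 1000.0]))
# The second position is padding (label -100) and is skipped.
print(masked_cross_entropy([[0.0, 0.0], [5.0, 1.0]], [0, -100]))
```

The value `-100` matches the default `ignore_index` of PyTorch's cross-entropy loss, which is why padded label positions set to it drop out of the loss.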
102
 
103
+
104
 
105
  ## Configuration (example)
106
 
 
135
  "bos_token_id": 50256,
136
  "eos_token_id": 50256
137
  }
138
+ ```
139
+ ---
140
 
141
  If you use the GPT-2 tokenizer, set `pad_token = eos_token` and ensure that `vocab_size` and the eos/pad token IDs match the tokenizer.
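
A quick consistency check along these lines might look like the sketch below. The helper `find_token_mismatches` is hypothetical, and the `vocab_size` and `pad_token_id` entries are assumptions added for illustration (50257 is the GPT-2 vocabulary size and 50256 its end-of-text ID); only the `bos_token_id` and `eos_token_id` values come from the example configuration.

```python
def find_token_mismatches(config, tokenizer_size, eos_id, pad_id):
    """Return a list of problems; an empty list means config and tokenizer agree."""
    problems = []
    if config.get("vocab_size", 0) < tokenizer_size:
        problems.append("vocab_size is smaller than the tokenizer vocabulary")
    if config.get("eos_token_id") != eos_id:
        problems.append("eos_token_id differs from the tokenizer")
    if config.get("pad_token_id") != pad_id:
        problems.append("pad_token_id differs from the tokenizer")
    return problems

# Assumed values: GPT-2 has 50257 tokens, and pad reuses the eos ID 50256.
config = {
    "vocab_size": 50257,
    "bos_token_id": 50256,
    "eos_token_id": 50256,
    "pad_token_id": 50256,
}
print(find_token_mismatches(config, tokenizer_size=50257, eos_id=50256, pad_id=50256))
```

With a real 🤗 Transformers tokenizer, the arguments would come from `len(tokenizer)`, `tokenizer.eos_token_id`, and `tokenizer.pad_token_id` after setting `tokenizer.pad_token = tokenizer.eos_token`.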
142
 
 
146
 
147
  Inference
148
 
149
+ ```python
150
  from transformers import AutoTokenizer, AutoModelForCausalLM
151
 
152
  model_id = "your-hf-username/MoAMetricLM-185M"
 
191
  # out.loss.backward()
192
  # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.2)
193
  # optimizer.step(); optimizer.zero_grad()
194
+ ```
195
 
 
196
  ## Evaluation
197
 
198
  For meaningful comparisons, run:
 
203
  • With vs without HyperFFN branch router/gates
204
  • With vs without TI regularizer
205
 
 
206
 
 
207
 
208
  Efficiency Notes
209
  • Ball pruning: masks keys outside per-head radius → structured sparsity.
 
211
  • HyperFFN: token-wise branch router (optional top-k) to avoid paying for all branches equally.
212
  • CPU tips: set OMP_NUM_THREADS/MKL_NUM_THREADS to core count; use pad_token = eos_token.
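
The ball-pruning note above can be pictured with a toy example. This is only a sketch under stated assumptions: it uses Euclidean distance and a negative-distance score, neither of which the README confirms for the real metric heads, and `ball_pruned_weights` is a hypothetical name.

```python
import math

def ball_pruned_weights(query, keys, radius):
    """Attention weights for one query, with keys outside the ball masked out."""
    dists = [math.dist(query, k) for k in keys]
    # Assumed scoring rule: closer keys score higher; pruned keys get -inf.
    scores = [-d if d <= radius else -math.inf for d in dists]
    m = max(scores)
    if m == -math.inf:
        return [0.0] * len(keys)  # every key fell outside the radius
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Distances are 0, 5 and 1; with radius 2 the middle key is pruned.
weights = ball_pruned_weights([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]], radius=2.0)
print(weights)
```

Pruned keys receive exactly zero weight, so a sparse kernel could skip them entirely; this dense version only demonstrates the masking.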
213
 
214
+ Roadmap: metric-aware KV cache for long contexts; kernelized distance approximations (e.g., RFF) for sub-quadratic regimes; quantization & mixed precision exploration.
 
 
 
215
  Safety, Bias & Risks
216
  • May produce biased, offensive, or factually incorrect outputs.
217
  • No safety/alignment training included.
218
+ • Do not deploy in high-stakes contexts without additional safeguards.
219
 
 
220
 
221
  License
222
 
223
  Apache-2.0 (update if different).
224
 
 
225
 
226
  Citation
227
 
 
233
  }
234
 
235
 
 
236
 
237
  Changelog
238
  • v0.2 (2025-09-20) — 500k-token CPU run, GPT-2 tokenizer, LR=5e-4, final loss ≈ 0.30.
239
  • v0.1 (2025-09-20) — initial public release: metric heads, MQA, ball pruning, HyperFFN, router & gates; HF-compatible; no KV cache.
240
 
241
+
242
 
243
  Maintainers
244
  • Author: reaper (Convergent Intelligence LLC)
245
  • Contact: add preferred contact
246
  • Issues: HF model hub issues tab
247
+ ---