reaperdoesntknow committed on
Commit
a9b9ecc
·
verified ·
1 Parent(s): fa3e7e6

Update README.md

Files changed (1)
  1. README.md +209 -191
README.md CHANGED
@@ -3,6 +3,12 @@ library_name: transformers
3
  tags:
4
  - trl
5
  - sft
6
  license: cc
7
  datasets:
8
  - nohurry/Opus-4.6-Reasoning-3000x-filtered
@@ -13,197 +19,209 @@ language:
13
  pipeline_tag: text-generation
14
  ---
15
 
16
- # Model Card for Model ID
17
-
18
- <!-- Provide a quick summary of what the model is/does. -->
19
-
20
-
21
-
22
- ## Model Details
23
-
24
- ### Model Description
25
-
26
- <!-- Provide a longer summary of what this model is. -->
27
-
28
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
29
-
30
- - **Developed by:** [More Information Needed]
31
- - **Funded by [optional]:** [More Information Needed]
32
- - **Shared by [optional]:** [More Information Needed]
33
- - **Model type:** [More Information Needed]
34
- - **Language(s) (NLP):** [More Information Needed]
35
- - **License:** [More Information Needed]
36
- - **Finetuned from model [optional]:** [More Information Needed]
37
-
38
- ### Model Sources [optional]
39
-
40
- <!-- Provide the basic links for the model. -->
41
-
42
- - **Repository:** [More Information Needed]
43
- - **Paper [optional]:** [More Information Needed]
44
- - **Demo [optional]:** [More Information Needed]
45
-
46
- ## Uses
47
-
48
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
49
-
50
- ### Direct Use
51
-
52
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
53
-
54
- [More Information Needed]
55
-
56
- ### Downstream Use [optional]
57
-
58
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
59
-
60
- [More Information Needed]
61
-
62
- ### Out-of-Scope Use
63
-
64
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
65
-
66
- [More Information Needed]
67
-
68
- ## Bias, Risks, and Limitations
69
-
70
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
71
-
72
- [More Information Needed]
73
-
74
- ### Recommendations
75
-
76
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
77
-
78
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
79
-
80
- ## How to Get Started with the Model
81
-
82
- Use the code below to get started with the model.
83
-
84
- [More Information Needed]
85
-
86
- ## Training Details
87
-
88
- ### Training Data
89
-
90
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
91
-
92
- [More Information Needed]
93
-
94
- ### Training Procedure
95
-
96
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
97
-
98
- #### Preprocessing [optional]
99
-
100
- [More Information Needed]
101
-
102
-
103
- #### Training Hyperparameters
104
-
105
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
106
-
107
- #### Speeds, Sizes, Times [optional]
108
-
109
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
110
-
111
- [More Information Needed]
112
-
113
- ## Evaluation
114
-
115
- <!-- This section describes the evaluation protocols and provides the results. -->
116
-
117
- ### Testing Data, Factors & Metrics
118
-
119
- #### Testing Data
120
-
121
- <!-- This should link to a Dataset Card if possible. -->
122
-
123
- [More Information Needed]
124
-
125
- #### Factors
126
-
127
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
128
-
129
- [More Information Needed]
130
-
131
- #### Metrics
132
-
133
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
134
-
135
- [More Information Needed]
136
 
137
  ### Results
138
 
139
- [More Information Needed]
140
-
141
- #### Summary
142
-
143
-
144
-
145
- ## Model Examination [optional]
146
-
147
- <!-- Relevant interpretability work for the model goes here -->
148
-
149
- [More Information Needed]
150
-
151
- ## Environmental Impact
152
-
153
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
154
-
155
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
156
-
157
- - **Hardware Type:** [More Information Needed]
158
- - **Hours used:** [More Information Needed]
159
- - **Cloud Provider:** [More Information Needed]
160
- - **Compute Region:** [More Information Needed]
161
- - **Carbon Emitted:** [More Information Needed]
162
-
163
- ## Technical Specifications [optional]
164
-
165
- ### Model Architecture and Objective
166
-
167
- [More Information Needed]
168
-
169
- ### Compute Infrastructure
170
-
171
- [More Information Needed]
172
-
173
- #### Hardware
174
-
175
- [More Information Needed]
176
-
177
- #### Software
178
-
179
- [More Information Needed]
180
-
181
- ## Citation [optional]
182
-
183
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
184
-
185
- **BibTeX:**
186
-
187
- [More Information Needed]
188
-
189
- **APA:**
190
-
191
- [More Information Needed]
192
-
193
- ## Glossary [optional]
194
-
195
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
196
-
197
- [More Information Needed]
198
-
199
- ## More Information [optional]
200
-
201
- [More Information Needed]
202
-
203
- ## Model Card Authors [optional]
204
-
205
- [More Information Needed]
206
-
207
- ## Model Card Contact
208
-
209
- [More Information Needed]
3
  tags:
4
  - trl
5
  - sft
6
+ - metric-attention
7
+ - mixture-of-attentions
8
+ - triangle-inequality
9
+ - blackhole-rope
10
+ - discrepancy-calculus
11
+ - discover
12
  license: cc
13
  datasets:
14
  - nohurry/Opus-4.6-Reasoning-3000x-filtered
 
19
  pipeline_tag: text-generation
20
  ---
21
 
22
+ # DiscoverLM-70M
23
+
24
+ A 69M-parameter causal language model built on the **Mixture-of-Attentions (MoA)** architecture: distance-based metric attention trained to respect the triangle inequality through an explicit regularizer.
25
+
26
+ Every attention head operates in a proper metric space. The geometry is enforced, not hoped for.
27
+
28
+ ## What Makes This Different
29
+
30
+ Standard transformers compute attention as a dot product: Q·Kᵀ. The dot product is a bilinear similarity, not a distance, so two tokens can score as "close" even though the score violates basic metric properties such as the triangle inequality.
31
+
32
+ MoA replaces this with **negative squared distance** under a learned diagonal Mahalanobis metric, then penalizes triangle inequality violations through a regularizer over random triples sampled during training. The result: attention weights reflect geometric proximity in a space trained so that d(a,c) ≤ d(a,b) + d(b,c) holds.
33
+
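As a rough illustration of the scoring rule described above, the following sketch computes negative squared diagonal-Mahalanobis distances for a single query and turns them into attention weights. It is not taken from the model's source: the function names, the toy vectors, and the `scale` values are all hypothetical, and it uses plain Python lists rather than tensors.

```python
import math

def metric_scores(q, keys, scale):
    """Score each key by the negative squared diagonal-Mahalanobis distance to q."""
    return [-sum(s * (qi - ki) ** 2 for s, qi, ki in zip(scale, q, k)) for k in keys]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy data: one query, three keys, 2-dimensional head space.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
scale = [1.0, 0.5]  # assumed positive per-dimension weights (the diagonal metric)

scores = metric_scores(q, keys, scale)
weights = softmax(scores)
print(scores)
print(weights)
```

The key identical to the query gets the highest weight, and weight falls off with distance, which is the behavior the paragraph above attributes to metric attention.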
34
+ This isn't a constraint that fights the model. It's structure the model uses.
35
+
36
+ ## Architecture
37
+
38
+ ```
39
+ Input → Token Embedding (48K vocab, Qwen3)
40
+                        │
41
+                        ▼
42
+ ┌──────────────────────────────────────────────────┐
43
+ │                  MoA Block × 4                   │
44
+ │                                                  │
45
+ │ ┌─────────┐ ┌──────────┐ ┌────────┐ ┌────────┐   │
46
+ │ │  Local  │ │  Global  │ │Channel │ │  MQA   │   │
47
+ │ │  Conv   │ │  Metric  │ │  Mix   │ │ Metric │   │
48
+ │ │         │ │(64 heads)│ │        │ │ (64 Q) │   │
49
+ │ └────┬────┘ └────┬─────┘ └───┬────┘ └───┬────┘   │
50
+ │      └──────┬────┴───────────┴──────────┘        │
51
+ │             ▼                                    │
52
+ │     Feature Gates + Token Router (top-2)         │
53
+ │             ▼                                    │
54
+ │    Residual + DropPath                           │
55
+ └──────────────────────┬───────────────────────────┘
56
+                        ▼
57
+    HyperFFN (SwiGLU + CausalConv + LowRank)
58
+                        ▼
59
+                    LayerNorm
60
+                        ▼
61
+ ┌──────────────────────────────────────────────────┐
62
+ │             MoA Language Model Head              │
63
+ │   (same 4-path mixture → SwiGLU → tied vocab)    │
64
+ └──────────────────────┬───────────────────────────┘
65
+                        ▼
66
+                 Logits (48,000)
67
+ ```
68
+
69
+ ### Core Components
70
+
71
+ **Metric Attention.** Queries attend to keys via learned Mahalanobis distance. Each of 64 heads has an 8-dimensional head space with its own diagonal scaling, learnable ball origin, and adaptive radius for sparse pruning. Pairs outside the ball are masked before softmax.
72
+
73
+ **Mixture-of-Attentions Routing.** Four parallel paths per token: local depthwise convolution, full multi-head metric attention, gated channel mixing, and multi-query metric attention. A learned router selects the top-2 paths at each token position. Feature gates scale each path's output before mixing.
74
+
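The routing step can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the model's code: `route_top2` is a hypothetical name, the mixture weights are assumed to be a softmax renormalized over the selected paths only, and the feature gates are left out.

```python
import math

def route_top2(router_logits, path_outputs, k=2):
    """Mix the outputs of the k highest-scoring paths for one token."""
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected paths (an assumption of this sketch).
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    dim = len(path_outputs[0])
    mixed = [sum(weights[i] * path_outputs[i][d] for i in top) for d in range(dim)]
    return top, weights, mixed

# Hypothetical logits and outputs for the four paths:
# local conv, global metric attention, channel mix, MQA metric attention.
router_logits = [0.1, 2.0, -1.0, 2.0]
path_outputs = [[1.0, 1.0], [2.0, 0.0], [5.0, 5.0], [0.0, 4.0]]

top, weights, mixed = route_top2(router_logits, path_outputs)
print(top, weights, mixed)
```

Only two of the four path outputs contribute to the mixed vector, which is what makes the routing sparse per token.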
75
+ **BlackHoleRoPE.** Rotary position encoding with learned phase perturbations from a compact Fourier basis. Q/K rotations stay unitary. V amplitudes get bounded energy gating clamped to [0.5, 2.0] with optional discrepancy-state modulation.
76
+
77
+ **HyperFFN.** Three-branch feedforward: SwiGLU channel MLP, causal depthwise separable convolution, and gated low-rank bottleneck, routed per token with top-2 sparse selection.
78
+
79
+ **MoA LM Head.** The vocabulary projection runs its own mixture-of-attentions (32 heads, head_dim=16) before projecting to logits through a SwiGLU transform. Weight-tied to the input embedding.
80
+
81
+ ## Parameter Budget
82
+
83
+ | Component | Parameters | % |
84
+ |---|---|---|
85
+ | Token embedding (tied) | 24.6M | 35.5% |
86
+ | MoA blocks × 4 | 28.9M | 41.8% |
87
+ | HyperFFN (shared) | 4.2M | 6.1% |
88
+ | MoA LM head | 10.8M | 15.6% |
89
+ | RoPE + norms | 0.6M | 0.9% |
90
+ | **Total** | **69.1M** | |
91
+
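A few of the figures in this card can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the card (vocabulary size, model width, head counts, and the rounded per-component parameter counts); the variable names are illustrative.

```python
# Values taken from the card's configuration and parameter table.
vocab_size, dim = 48_000, 512
attn_heads, lm_attn_heads = 64, 32

# Tied token embedding: one vocab_size x dim matrix.
embedding_params = vocab_size * dim

# Per-head widths implied by the model width and head counts.
head_dim = dim // attn_heads
lm_head_dim = dim // lm_attn_heads

# Rounded component sizes in millions, as listed in the table.
budget_m = {
    "Token embedding (tied)": 24.6,
    "MoA blocks x 4": 28.9,
    "HyperFFN (shared)": 4.2,
    "MoA LM head": 10.8,
    "RoPE + norms": 0.6,
}
total_m = sum(budget_m.values())

print(embedding_params, head_dim, lm_head_dim, round(total_m, 1))
```

The embedding count (24,576,000) rounds to the 24.6M in the table, the head widths match the 8 and 16 quoted under Core Components, and the rounded components sum to the 69.1M total.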
92
+ ## vs Standard Transformers
93
+
94
+ | | Transformer | MoA |
95
+ |---|---|---|
96
+ | Attention scoring | Dot product (Q·Kᵀ) | Negative Mahalanobis distance |
97
+ | Geometric guarantee | None | Triangle inequality regularized |
98
+ | Position encoding | RoPE | BlackHoleRoPE (learned phase + bounded V energy) |
99
+ | Attention sparsity | Causal mask only | Ball pruning + top-k routing |
100
+ | Head combination | Concatenation | Per-token routed mixture of 4 path types |
101
+ | FFN | Single MLP | 3-branch routed (SwiGLU + CausalConv + LowRank) |
102
+ | LM head | Linear projection | Full MoA mixture → SwiGLU → tied projection |
103
+
104
+ ## Training
105
+
106
+ ### Data
107
+
108
+ | Dataset | Domain |
109
+ |---|---|
110
+ | [Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | Multi-step reasoning |
111
+ | [UltraData-Math](https://huggingface.co/datasets/openbmb/UltraData-Math) | Mathematical problem solving |
112
+ | [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | General instruction following |
113
+
114
+ ### Hyperparameters
115
+
116
+ | Parameter | Value |
117
+ |---|---|
118
+ | Optimizer | AdamW |
119
+ | Learning rate | 3e-4 → 0 (cosine) |
120
+ | Batch size | 4 |
121
+ | Max sequence length | 1,024 |
122
+ | Steps | 512 |
123
+ | Epochs | 8 |
124
+ | Tokens seen | 262,144 per epoch (4 × 1,024 × 64 batches) |
125
+ | Precision | fp32 |
126
+ | Hardware | NVIDIA H100 (Colab) |
127
+ | TI regularization | λ=0.01, 64 samples/batch |
128
+ | Router top-k | 2 of 4 paths |
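The schedule figures in this card fit together as follows. This is a small arithmetic sketch using the batch size, sequence length, and step count from the table, plus the "8 epochs over 64 batches" statement from the Limitations section. It assumes every sequence is filled to the maximum length, so the token counts are upper bounds.

```python
batch_size, seq_len = 4, 1024
epochs, batches_per_epoch = 8, 64

steps = epochs * batches_per_epoch                             # optimizer steps
tokens_per_epoch = batch_size * seq_len * batches_per_epoch    # one pass over the data
token_positions_processed = tokens_per_epoch * epochs          # counting repeats

# Epoch containing the best reported step (393), with steps counted from 1.
best_step_epoch = (393 - 1) // batches_per_epoch + 1

print(steps, tokens_per_epoch, token_positions_processed, best_step_epoch)
```

Under these assumptions the 262,144 "tokens seen" figure corresponds to a single pass over the data, and step 393 falls in epoch 7, the epoch whose minimum loss matches the best-step loss reported in the results.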
129
 
130
  ### Results
131
 
132
+ | Epoch | Avg Loss | Min Loss | σ | Token Accuracy |
133
+ |---|---|---|---|---|
134
+ | 1 | 2.887 | 2.285 | 0.291 | 59.2% |
135
+ | 2 | 2.324 | 1.651 | 0.259 | 63.4% |
136
+ | 3 | 1.931 | 1.232 | 0.211 | 68.4% |
137
+ | 4 | 1.616 | 1.012 | 0.201 | 74.4% |
138
+ | 5 | 1.432 | 0.954 | 0.169 | 77.0% |
139
+ | 6 | 1.211 | 0.677 | 0.180 | 79.0% |
140
+ | 7 | 1.075 | 0.599 | 0.151 | 80.1% |
141
+ | 8 | 1.014 | 0.718 | 0.142 | 80.8% |
142
+
143
+ **Best single step:** 393, with loss **0.599** and token accuracy **88.4%**
144
+
145
+ The standard deviation of the loss roughly halved across training (σ: 0.291 → 0.142), which is consistent with the mixture-of-attentions settling into stable routing preferences as training progressed.
146
+
147
+ ## Configuration
148
+
149
+ ```json
150
+ {
151
+ "dim": 512,
152
+ "num_layers": 4,
153
+ "attn_heads": 64,
154
+ "mqa_q_heads": 64,
155
+ "lm_attn_heads": 32,
156
+ "lm_mqa_q_heads": 32,
157
+ "metric": "maha_diag",
158
+ "vocab_size": 48000,
159
+ "max_position_embeddings": 1024,
160
+ "ffn_hidden": 1536,
161
+ "mixer_hidden": 768,
162
+ "n_branches": 3,
163
+ "router_topk": 2,
164
+ "use_balls": true,
165
+ "radius_init": 3.5,
166
+ "ti_reg_weight": 0.01,
167
+ "ti_reg_samples": 64,
168
+ "energy_amplification": 9.87,
169
+ "theta_base": 10000.0,
170
+ "tie_word_embeddings": true
171
+ }
172
+ ```
173
+
174
+ ## Usage
175
+
176
+ ```python
177
+ from transformers import AutoTokenizer
178
+ from MoA import MoAMetricLM, MoAMetricConfig
179
+
180
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
181
+ model = MoAMetricLM.from_pretrained("reaperdoesntknow/DiscoverLM-70M")
182
+
183
+ inputs = tokenizer("The triangle inequality guarantees that", return_tensors="pt")
184
+ outputs = model.generate(**inputs, max_new_tokens=128)
185
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
186
+ ```
187
+
188
+ ## Mathematical Foundation
189
+
190
+ The metric attention mechanism is grounded in the Discrepancy Calculus (DISC), a measure-theoretic framework for singularity analysis developed by the author. The triangle inequality regularizer penalizes violations of d(a,c) ≤ d(a,b) + d(b,c) across sampled triples, pushing the distance function used for attention scoring toward a proper metric rather than a mere similarity function.
191
+
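A triangle inequality regularizer over sampled triples can be sketched as a hinge penalty on d(a,c) − d(a,b) − d(b,c). The code below is an illustration under assumptions, not the author's implementation: the function names are hypothetical, the points are toy 1-D values, and the card does not say whether the penalty is applied to the distance or to its square. Both are shown, because the squared distance used for scoring can violate the inequality while the distance itself does not.

```python
import math
import random

def maha_dist(x, y, scale):
    """Diagonal Mahalanobis distance with positive per-dimension weights."""
    return math.sqrt(sum(s * (xi - yi) ** 2 for s, xi, yi in zip(scale, x, y)))

def ti_violation(a, b, c, dist):
    """Hinge on the triangle inequality: positive only when d(a,c) > d(a,b) + d(b,c)."""
    return max(0.0, dist(a, c) - dist(a, b) - dist(b, c))

def ti_penalty(points, dist, num_triples, rng):
    """Average violation over randomly sampled ordered triples of distinct points."""
    total = 0.0
    for _ in range(num_triples):
        a, b, c = rng.sample(points, 3)
        total += ti_violation(a, b, c, dist)
    return total / num_triples

scale = [1.0]                              # hypothetical diagonal metric
points = [[0.0], [1.0], [2.0], [3.0]]      # toy points on a line
rng = random.Random(0)

def metric(x, y):
    return maha_dist(x, y, scale)

def squared(x, y):
    return maha_dist(x, y, scale) ** 2

penalty_metric = ti_penalty(points, metric, 64, rng)    # 64 triples, as in ti_reg_samples
penalty_squared = ti_penalty(points, squared, 64, rng)
worst_case = ti_violation([0.0], [1.0], [2.0], squared)
print(penalty_metric, penalty_squared, worst_case)
```

In training, a term like this would be multiplied by the regularization weight (λ=0.01 in the hyperparameter table) and added to the language-modeling loss.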
192
+ The ball pruning mechanism (learnable per-head origins and radii) creates adaptive sparse attention patterns that emerge from the geometry itself rather than from fixed masking heuristics.
193
+
194
+ BlackHoleRoPE extends standard rotary position encoding with learned phase perturbations synthesized from a Fourier basis, maintaining the unitary property on Q/K while adding bounded amplitude modulation on V, so that position-dependent energy gating stays within Lyapunov-stable bounds.
195
+
196
+ ## Lineage
197
+
198
+ This architecture derives from research in metric-native neural computation:
199
+
200
+ - **DISC**: Discrepancy Calculus, measure-theoretic singularity analysis (Colca, 2025)
201
+ - **MoA**: Mixture-of-Attentions with triangle inequality enforcement
202
+ - **BlackHoleRoPE**: Learned rotary position encoding with bounded energy gating
203
+
204
+ ## Limitations
205
+
206
+ - Trained on 262K tokens. The architecture works, but this is proof-of-concept scale, and generalization to unseen distributions is not yet validated.
207
+ - No eval split was used; training metrics only.
208
+ - 8 epochs over 64 batches (512 steps) means the model has seen each example eight times. Overfitting is likely at this data scale.
209
+ - fp32 training only; bf16/fp16 behavior is untested.
210
+
211
+ ## Citation
212
+
213
+ ```bibtex
214
+ @misc{colca2025discoverLM,
215
+ author = {Colca, Roy},
216
+ title = {DiscoverLM-70M: Metric-Attention Mixture of Attentions with Triangle Inequality Enforcement},
217
+ year = {2025},
218
+ publisher = {HuggingFace},
219
+ url = {https://huggingface.co/reaperdoesntknow/DiscoverLM-70M}
220
+ }
221
+ ```
222
+
223
+ ## Author
224
+
225
+ Roy Colca Jr., [Convergent Intelligence LLC](https://convergentintel.com)
226
+ Mercyhurst University, M.S. Applied Intelligence
227
+ HuggingFace: [reaperdoesntknow](https://huggingface.co/reaperdoesntknow)