---
library_name: transformers
tags:
- trl
- sft
- metric-attention
- mixture-of-attentions
- triangle-inequality
- blackhole-rope
- discrepancy-calculus
- discover
- convergentintel
license: cc
datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- openbmb/UltraData-Math
- yahma/alpaca-cleaned
language:
- en
pipeline_tag: text-generation
---

# DiscoverLM-70M

A 69M parameter causal language model built on the **Mixture-of-Attentions (MoA)** architecture β€” distance-based metric attention that respects the triangle inequality by construction, not approximation.

Every attention head operates in a proper metric space. The geometry is enforced, not hoped for.

## What Makes This Different

Standard transformers compute attention as a dot product: QΒ·Kα΅€. This has no geometric meaning β€” it's a bilinear form, not a distance. Two tokens can be "close" by dot product while violating basic metric properties.

MoA replaces this with **negative squared distance** under a learned diagonal Mahalanobis metric, then enforces the triangle inequality through a regularizer over random triples sampled during training. The result: attention weights reflect actual geometric proximity in a space where d(a,c) ≀ d(a,b) + d(b,c) holds.

This isn't a constraint that fights the model. It's structure the model uses.
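As a concrete sketch, the triple-based regularizer can be written as a hinge penalty on a per-head distance matrix. The function name, sampling scheme, and interface below are illustrative assumptions, not the model's actual implementation:

```python
import torch

def triangle_inequality_penalty(dist: torch.Tensor, n_samples: int = 64) -> torch.Tensor:
    """Hinge penalty over random triples (a, b, c): relu(d(a,c) - d(a,b) - d(b,c)).

    `dist` is a [T, T] matrix of pairwise token distances from one attention
    head. A true metric yields (near-)zero penalty; violations are penalized
    linearly. Names and sampling scheme are hypothetical.
    """
    T = dist.size(0)
    a, b, c = (torch.randint(0, T, (n_samples,)) for _ in range(3))
    violation = dist[a, c] - dist[a, b] - dist[b, c]
    return torch.relu(violation).mean()

# The card's setting: weight 0.01, 64 triples per batch, added to the LM loss:
# loss = ce_loss + 0.01 * triangle_inequality_penalty(dist, n_samples=64)
```

On a genuinely metric distance matrix (e.g. pairwise Euclidean distances) the penalty is zero up to floating-point error, so the regularizer only pushes where the learned geometry actually violates the triangle inequality.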

## Architecture

```
Input β†’ Token Embedding (48K vocab, custom tokenizer)
  β”‚
  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              MoA Block Γ— 4                       β”‚
β”‚                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  Local  β”‚ β”‚  Global  β”‚ β”‚ Channel β”‚ β”‚  MQA   β”‚ β”‚
β”‚  β”‚  Conv   β”‚ β”‚  Metric  β”‚ β”‚   Mix   β”‚ β”‚ Metric β”‚ β”‚
β”‚  β”‚         β”‚ β”‚(64 heads)β”‚ β”‚         β”‚ β”‚ (64 Q) β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜ β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚              β–Ό                                   β”‚
β”‚     Feature Gates + Token Router (top-2)         β”‚
β”‚              β–Ό                                   β”‚
β”‚        Residual + DropPath                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β–Ό
         HyperFFN (SwiGLU + CausalConv + LowRank)
                       β–Ό
                   LayerNorm
                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            MoA Language Model Head               β”‚
β”‚  (same 4-path mixture β†’ SwiGLU β†’ tied vocab)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β–Ό
                 Logits (48,000)
```

### Core Components

**Metric Attention.** Queries attend to keys via learned Mahalanobis distance. Each of 64 heads has an 8-dimensional head space with its own diagonal scaling, learnable ball origin, and adaptive radius for sparse pruning. Pairs outside the ball are masked before softmax.
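A minimal single-head sketch of this scoring rule follows. Shapes, names, and the exact masking order are assumptions; the real heads also learn a ball origin and adaptive radius, which are simplified to a fixed scalar radius here:

```python
import torch

def metric_attention_scores(q, k, log_scale, radius):
    """One head of metric attention: score = negative squared Mahalanobis distance.

    q, k: [T, d] per-head queries and keys; log_scale: [d] learned diagonal
    metric (exp keeps the weights positive); radius: ball radius for pruning.
    Illustrative sketch only -- the learned ball origin is omitted.
    """
    w = torch.exp(log_scale)                            # positive diagonal weights
    diff = q.unsqueeze(1) - k.unsqueeze(0)              # [T, T, d] pairwise differences
    sq_dist = (w * diff.pow(2)).sum(-1)                 # squared Mahalanobis distance
    scores = -sq_dist                                   # closer pairs score higher
    scores = scores.masked_fill(sq_dist > radius ** 2, float("-inf"))  # ball pruning
    causal = torch.triu(torch.ones_like(sq_dist, dtype=torch.bool), diagonal=1)
    return torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
```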

**Mixture-of-Attentions Routing.** Four parallel paths per token β€” local depthwise convolution, full multi-head metric attention, gated channel mixing, and multi-query metric attention. A learned router selects top-2 paths per token position. Feature gates scale each path's output before mixing.
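The top-2 routing step might look like the following; the renormalized-softmax gating and tensor layout are illustrative guesses, not the model's code:

```python
import torch

def route_top2(path_outputs, router_logits):
    """Combine 4 path outputs per token, keeping only the top-2 routes.

    path_outputs: [4, T, d] stacked outputs of the four paths;
    router_logits: [T, 4] learned per-token routing scores.
    """
    top_vals, top_idx = router_logits.topk(2, dim=-1)   # [T, 2] best two paths
    weights = torch.softmax(top_vals, dim=-1)           # renormalize over the top-2
    T = path_outputs.size(1)
    # Gather each token's two selected path outputs: [T, 2, d]
    picked = path_outputs.permute(1, 0, 2)[torch.arange(T).unsqueeze(1), top_idx]
    return (weights.unsqueeze(-1) * picked).sum(dim=1)  # [T, d] mixed output
```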

**BlackHoleRoPE.** Rotary position encoding with learned phase perturbations from a compact Fourier basis. Q/K rotations stay unitary. V amplitudes get bounded energy gating clamped to [0.5, 2.0] with optional discrepancy-state modulation.
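The bounded V-gate can be sketched as a clamped Fourier modulation. Only the [0.5, 2.0] clamp comes from the card; the `1 + phase` parameterization and integer frequency grid are hypothetical:

```python
import torch

def bounded_v_gate(v, positions, fourier_coeffs):
    """Bounded amplitude modulation on V, clamped to [0.5, 2.0].

    v: [T, d] value vectors; positions: [T] token positions;
    fourier_coeffs: [n_freq] learned coefficients. Q/K rotations stay unitary
    elsewhere; only V energy is gated here. Parameterization is illustrative.
    """
    pos = positions.float().unsqueeze(-1)               # [T, 1]
    freqs = torch.arange(1, fourier_coeffs.numel() + 1, dtype=torch.float32)
    phase = torch.sin(pos * freqs) @ fourier_coeffs     # [T] compact Fourier basis
    gain = torch.clamp(1.0 + phase, 0.5, 2.0)           # bounded energy gating
    return v * gain.unsqueeze(-1)
```

The clamp is what keeps the per-token energy change bounded regardless of what the Fourier coefficients learn.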

**HyperFFN.** Three-branch feedforward: SwiGLU channel MLP, causal depthwise separable convolution, and gated low-rank bottleneck β€” routed per-token with top-2 sparse selection.

**MoA LM Head.** The vocabulary projection runs its own mixture-of-attentions (32 heads, head_dim=16) before projecting to logits through a SwiGLU transform. Weight-tied to the input embedding.

## Parameter Budget

| Component | Parameters | % |
|---|---|---|
| Token embedding (tied) | 24.6M | 35.5% |
| MoA blocks Γ— 4 | 28.9M | 41.8% |
| HyperFFN (shared) | 4.2M | 6.1% |
| MoA LM head | 10.8M | 15.6% |
| RoPE + norms | 0.6M | 0.9% |
| **Total** | **69.1M** | |

## vs Standard Transformers

| | Transformer | MoA |
|---|---|---|
| Attention scoring | Dot product (QΒ·Kα΅€) | Negative Mahalanobis distance |
| Geometric guarantee | None | Triangle inequality regularized |
| Position encoding | RoPE | BlackHoleRoPE (learned phase + bounded V energy) |
| Attention sparsity | Causal mask only | Ball pruning + top-k routing |
| Head combination | Concatenation | Per-token routed mixture of 4 path types |
| FFN | Single MLP | 3-branch routed (SwiGLU + CausalConv + LowRank) |
| LM head | Linear projection | Full MoA mixture β†’ SwiGLU β†’ tied projection |

## Training

### Data

| Dataset | Domain |
|---|---|
| [Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | Multi-step reasoning |
| [UltraData-Math](https://huggingface.co/datasets/openbmb/UltraData-Math) | Mathematical problem solving |
| [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | General instruction following |

### Hyperparameters

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 β†’ 0 (cosine) |
| Batch size | 4 |
| Max sequence length | 1,024 |
| Steps | 512 |
| Epochs | 8 |
| Unique tokens (per epoch) | 262,144 |
| Precision | fp32 |
| Hardware | NVIDIA H100 (Colab) |
| TI regularization | Ξ»=0.01, 64 samples/batch |
| Router top-k | 2 of 4 paths |

### Results

| Epoch | Avg Loss | Min Loss | Οƒ | Token Accuracy |
|---|---|---|---|---|
| 1 | 2.887 | 2.285 | 0.291 | 59.2% |
| 2 | 2.324 | 1.651 | 0.259 | 63.4% |
| 3 | 1.931 | 1.232 | 0.211 | 68.4% |
| 4 | 1.616 | 1.012 | 0.201 | 74.4% |
| 5 | 1.432 | 0.954 | 0.169 | 77.0% |
| 6 | 1.211 | 0.677 | 0.180 | 79.0% |
| 7 | 1.075 | 0.599 | 0.151 | 80.1% |
| 8 | 1.014 | 0.718 | 0.142 | 80.8% |

**Best single step:** 393 β€” loss **0.599**, token accuracy **88.4%**

Loss variance halved across training (Οƒ: 0.291 β†’ 0.142), indicating the mixture-of-attentions learned stable routing preferences as training progressed.

## Configuration

```json
{
  "dim": 512,
  "num_layers": 4,
  "attn_heads": 64,
  "mqa_q_heads": 64,
  "lm_attn_heads": 32,
  "lm_mqa_q_heads": 32,
  "metric": "maha_diag",
  "vocab_size": 48000,
  "max_position_embeddings": 1024,
  "ffn_hidden": 1536,
  "mixer_hidden": 768,
  "n_branches": 3,
  "router_topk": 2,
  "use_balls": true,
  "radius_init": 3.5,
  "ti_reg_weight": 0.01,
  "ti_reg_samples": 64,
  "energy_amplification": 9.87,
  "theta_base": 10000.0,
  "tie_word_embeddings": true
}
```

### Tokenizer

Custom 48K vocabulary tokenizer with structured generation tokens built in:

```json
{
  "backend": "tokenizers",
  "model_max_length": 2048,
  "bos_token": "<|bos|>",
  "eos_token": "<|eos|>",
  "pad_token": "<|pad|>",
  "unk_token": "<|unk|>",
  "extra_special_tokens": [
    "<|system|>", "<|user|>", "<|assistant|>",
    "<|think|>", "<|reasoning|>"
  ]
}
```

## Usage

```python
from transformers import AutoTokenizer
from MoA import MoAMetricLM, MoAMetricConfig

tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/DiscoverLM-70M")
model = MoAMetricLM.from_pretrained("reaperdoesntknow/DiscoverLM-70M")

inputs = tokenizer("The triangle inequality guarantees that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Chat Format

The tokenizer includes built-in special tokens for structured generation:

| Token | Role |
|---|---|
| `<\|system\|>` | System prompt boundary |
| `<\|user\|>` | User turn boundary |
| `<\|assistant\|>` | Assistant turn boundary |
| `<\|think\|>` | Internal reasoning start |
| `<\|reasoning\|>` | Reasoning chain marker |
| `<\|bos\|>` | Beginning of sequence |
| `<\|eos\|>` | End of sequence |
| `<\|pad\|>` | Padding |

```python
# Chat-style prompting
prompt = "<|system|>You are DiscoverLM, a small language model with metric attention.<|user|>What is the triangle inequality?<|assistant|><|think|><|reasoning|>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
```

## Mathematical Foundations: Discrepancy Calculus (DISC)

The metric attention mechanism is grounded in Discrepancy Calculus β€” a measure-theoretic framework for singularity analysis developed by the author. The core operator:

$$Df(x) := \limsup_{r \downarrow 0} \sup_{0 < d(x,y) < r} \frac{|f(y) - f(x)|}{d(x,y)}$$

On metric-measure spaces (Axiom 11.1), this replaces the classical derivative with a scale-free local Lipschitz seminorm. The triangle inequality regularizer enforces that the learned attention geometry satisfies $d(a,c) \leq d(a,b) + d(b,c)$ across sampled triples β€” ensuring the distance function is a proper metric, not merely a similarity function.
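Combined with the cross-entropy objective, one plausible form of the resulting training loss, using the card's Ξ» = 0.01 and 64 sampled triples per batch (the hinge form of the penalty is an assumption, not stated in the source), is:

$$\mathcal{L} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda \,\mathbb{E}_{(a,b,c)}\!\left[\max\bigl(0,\; d(a,c) - d(a,b) - d(b,c)\bigr)\right], \qquad \lambda = 0.01$$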

The **ball pruning mechanism** (learnable per-head origins and radii) creates adaptive sparse attention patterns that emerge from **gap geometry** (Axiom 11.4): the gap set $\Delta(E) = \{x : \theta^*_E(x) > \theta_{E*}(x)\}$ defines where attention should be sparse. Tokens outside the ball are in the gap β€” structurally irrelevant to the current computation.

**BlackHoleRoPE** extends standard rotary position encoding with learned phase perturbations from a Fourier basis, maintaining the unitary property on Q/K while adding bounded amplitude modulation on V. The boundedness condition is the DISC **discrepancy energy** criterion: $E_{\text{disc}} < \infty$ ensures Lyapunov stability.

**L2-star discrepancy** for thermodynamic governance measures training dynamics as a discrepancy field β€” gradient magnitude distributions analyzed through the same measure-theoretic lens that DISC applies to function spaces.

Full theory: *"On the Formal Analysis of Discrepancy Calculus"* (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165).

## Lineage

This architecture derives from research in metric-native neural computation:

- **DISC** β€” Discrepancy Calculus: measure-theoretic singularity analysis (Colca, 2025)
- **MoA** β€” Mixture-of-Attentions with triangle inequality enforcement
- **BlackHoleRoPE** β€” Learned rotary position encoding with bounded energy gating

## Limitations

- Trained on 262K tokens β€” the architecture works, but this is a proof-of-concept scale. Generalization to unseen distributions is not yet validated.
- No eval split was used; training metrics only.
- 8 epochs over 64 batches means the model has seen each example multiple times. Overfitting is likely at this data scale.
- fp32 training only β€” bf16/fp16 behavior untested.

## Citation

```bibtex
@misc{CILLC2026discoverLM,
  author = {Convergent Intelligence LLC: Research Division},
  title = {DiscoverLM-70M: Metric-Attention Mixture of Attentions with Triangle Inequality Enforcement},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/reaperdoesntknow/DiscoverLM-70M}
}
```

## Author

Roy Colca Jr. β€” [Convergent Intelligence LLC](https://convergentintel.com)

HuggingFace: [reaperdoesntknow](https://huggingface.co/reaperdoesntknow)

---

## Convergent Intelligence Portfolio

*Part of the [Discover Series](https://huggingface.co/reaperdoesntknow) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*


### Related Models

| Model | Downloads | Format |
|-------|-----------|--------|
| [Discovery](https://huggingface.co/reaperdoesntknow/Discovery) | 31 | HF |
| [Discovered](https://huggingface.co/reaperdoesntknow/Discovered) | 55 | HF |

### Top Models from Our Lab

| Model | Downloads |
|-------|-----------|
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 |
| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 |
| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |

**Total Portfolio: 49 models, 22,598 total downloads**


*Last updated: 2026-03-28 12:55 UTC*

<!-- CIX-CROSSLINK-START -->

---

## From the Convergent Intelligence Portfolio

**[DistilQwen Collection](https://huggingface.co/collections/reaperdoesntknow/distilqwen-69bf40ec669117e3f069ef1c)** β€” Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B β†’ 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU. This collection shows what happens when you give the methodology real hardware.

Top model: [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) β€” 508 downloads

Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)

*Convergent Intelligence LLC: Research Division*

<!-- CIX-CROSSLINK-END -->

---
<sub>Part of the [reaperdoesntknow research portfolio](https://huggingface.co/reaperdoesntknow) β€” 49 models, 22,598 total downloads | Last refreshed: 2026-03-30 12:10 UTC</sub>