---
library_name: transformers
tags:
- trl
- sft
- metric-attention
- mixture-of-attentions
- triangle-inequality
- blackhole-rope
- discrepancy-calculus
- discover
license: cc
datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- openbmb/UltraData-Math
- yahma/alpaca-cleaned
language:
- en
pipeline_tag: text-generation
---
# DiscoverLM-70M-Base
A 70M-parameter causal language model built on the **Mixture-of-Attentions (MoA)** architecture: distance-based metric attention that respects the triangle inequality by construction, not approximation.
Every attention head operates in a proper metric space. The geometry is enforced, not hoped for.
## What Makes This Different
Standard transformers compute attention as a dot product: Q·Kᵀ. This has no geometric meaning; it's a bilinear form, not a distance. Two tokens can be "close" by dot product while violating basic metric properties.
MoA replaces this with **negative squared distance** under a learned diagonal Mahalanobis metric, then enforces the triangle inequality through a regularizer over random triples sampled during training. The result: attention weights reflect actual geometric proximity in a space where d(a,c) ≤ d(a,b) + d(b,c) holds.
This isn't a constraint that fights the model. It's structure the model uses.
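A regularizer of this kind can be sketched in a few lines of PyTorch. This is a hypothetical reconstruction, not the model's released training code: sample random triples from a batch and apply a hinge penalty to any excess of d(a,c) over d(a,b) + d(b,c), for whatever learned pairwise score `d_fn` the model uses.

```python
import torch

def triangle_penalty(d_fn, x, n_triples=256):
    """Hinge penalty encouraging a learned pairwise score d_fn to behave
    like a metric: for random triples (a, b, c) drawn from the rows of x,
    penalize any excess of d(a, c) over d(a, b) + d(b, c).
    Illustrative sketch only; names and shapes are assumptions."""
    idx = torch.randint(0, x.shape[0], (3, n_triples))
    a, b, c = x[idx[0]], x[idx[1]], x[idx[2]]
    violation = d_fn(a, c) - (d_fn(a, b) + d_fn(b, c))
    return torch.relu(violation).mean()
```

With a true metric such as Euclidean distance the penalty is identically zero; with a non-metric score such as squared distance it is positive, so gradient descent pushes the learned geometry toward metric behavior.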
## Architecture
```
Input → Token Embedding (48K vocab, custom tokenizer)
        │
        ▼
┌───────────────────────────────────────────────────┐
│                   MoA Block × 4                   │
│                                                   │
│  ┌─────────┐ ┌──────────┐ ┌─────────┐ ┌─────────┐ │
│  │  Local  │ │  Global  │ │ Channel │ │   MQA   │ │
│  │  Conv   │ │  Metric  │ │   Mix   │ │ Metric  │ │
│  │         │ │(64 heads)│ │         │ │ (64 Q)  │ │
│  └────┬────┘ └────┬─────┘ └────┬────┘ └────┬────┘ │
│       └───────────┴─────┬──────┴───────────┘      │
│                         ▼                         │
│       Feature Gates + Token Router (top-2)        │
│                         ▼                         │
│              Residual + DropPath                  │
└────────────────────────┬──────────────────────────┘
                         ▼
         HyperFFN (SwiGLU + CausalConv + LowRank)
                         ▼
                     LayerNorm
                         ▼
┌───────────────────────────────────────────────────┐
│              MoA Language Model Head              │
│    (same 4-path mixture → SwiGLU → tied vocab)    │
└────────────────────────┬──────────────────────────┘
                         ▼
                  Logits (48,000)
```
### Core Components
**Metric Attention.** Queries attend to keys via learned Mahalanobis distance. Each of 64 heads has an 8-dimensional head space with its own diagonal scaling, learnable ball origin, and adaptive radius for sparse pruning. Pairs outside the ball are masked before softmax.
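A minimal single-head sketch of this scoring rule, assuming a diagonal scale vector, a learnable ball origin, and a radius. All names and shapes here are illustrative; the released implementation may differ.

```python
import torch
import torch.nn.functional as F

def metric_attention_weights(q, k, scale, origin, radius):
    """Attention weights from negative squared diagonal-Mahalanobis
    distance between queries and keys, with keys outside a learned ball
    masked before softmax. Shapes: q [T, d], k [S, d]; scale, origin [d];
    radius scalar. Assumes at least one key falls inside the ball."""
    diff = q[:, None, :] - k[None, :, :]                # [T, S, d]
    logits = -((diff ** 2) * scale).sum(-1)             # negative squared distance
    # Ball pruning: drop keys lying outside the ball around `origin`.
    key_dist2 = (((k - origin) ** 2) * scale).sum(-1)   # [S]
    logits = logits.masked_fill((key_dist2 > radius ** 2)[None, :], float("-inf"))
    return F.softmax(logits, dim=-1)
```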
**Mixture-of-Attentions Routing.** Four parallel paths per token: local depthwise convolution, full multi-head metric attention, gated channel mixing, and multi-query metric attention. A learned router selects top-2 paths per token position. Feature gates scale each path's output before mixing.
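The top-2 routing step can be sketched as follows, assuming the four path outputs are already computed (function and tensor names are illustrative, not the model's API):

```python
import torch
import torch.nn.functional as F

def route_top2(router_logits, path_outputs):
    """Mix four parallel path outputs per token, keeping only the top-2
    paths chosen by a learned router (illustrative sketch).
    router_logits: [T, 4]; path_outputs: [4, T, d]."""
    weights = F.softmax(router_logits, dim=-1)          # [T, 4]
    top_w, top_idx = weights.topk(2, dim=-1)            # keep 2 paths per token
    top_w = top_w / top_w.sum(-1, keepdim=True)         # renormalize over kept paths
    mixed = torch.zeros_like(path_outputs[0])
    for slot in range(2):
        idx = top_idx[:, slot]                          # chosen path per token, [T]
        sel = path_outputs[idx, torch.arange(idx.shape[0])]  # gather [T, d]
        mixed = mixed + top_w[:, slot:slot + 1] * sel
    return mixed
```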
**BlackHoleRoPE.** Rotary position encoding with learned phase perturbations from a compact Fourier basis. Q/K rotations stay unitary. V amplitudes get bounded energy gating clamped to [0.5, 2.0] with optional discrepancy-state modulation.
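The bounded energy gate can be sketched as a sigmoid remapped into [0.5, 2.0]. This is a minimal sketch under that assumption; the actual BlackHoleRoPE gating, including discrepancy-state modulation, is not reproduced here.

```python
import torch

def bounded_v_gate(v, gate_logits, lo=0.5, hi=2.0):
    """Bounded energy gating on the value stream (illustrative): a sigmoid
    maps unconstrained gate logits into [lo, hi], so V amplitudes are
    modulated but can never collapse to zero or blow up."""
    gain = lo + (hi - lo) * torch.sigmoid(gate_logits)
    return v * gain
```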
**HyperFFN.** Three-branch feedforward: SwiGLU channel MLP, causal depthwise separable convolution, and gated low-rank bottleneck, routed per token with top-2 sparse selection.
**MoA LM Head.** The vocabulary projection runs its own mixture-of-attentions (32 heads, head_dim=16) before projecting to logits through a SwiGLU transform. Weight-tied to the input embedding.
## Parameter Budget
| Component | Parameters | % |
|---|---|---|
| Token embedding (tied) | 24.6M | 35.5% |
| MoA blocks × 4 | 28.9M | 41.8% |
| HyperFFN (shared) | 4.2M | 6.1% |
| MoA LM head | 10.8M | 15.6% |
| RoPE + norms | 0.6M | 0.9% |
| **Total** | **69.1M** | |
## vs Standard Transformers
| | Transformer | MoA |
|---|---|---|
| Attention scoring | Dot product (Q·Kᵀ) | Negative Mahalanobis distance |
| Geometric guarantee | None | Triangle inequality regularized |
| Position encoding | RoPE | BlackHoleRoPE (learned phase + bounded V energy) |
| Attention sparsity | Causal mask only | Ball pruning + top-k routing |
| Head combination | Concatenation | Per-token routed mixture of 4 path types |
| FFN | Single MLP | 3-branch routed (SwiGLU + CausalConv + LowRank) |
| LM head | Linear projection | Full MoA mixture → SwiGLU → tied projection |
## Training
### Data
| Dataset | Domain |
|---|---|
| [Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | Multi-step reasoning |
| [UltraData-Math](https://huggingface.co/datasets/openbmb/UltraData-Math) | Mathematical problem solving |
| [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | General instruction following |
## Usage
```python
from transformers import AutoTokenizer
from MoA import MoAMetricLM, MoAMetricConfig
tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/DiscoverLM-70M")
model = MoAMetricLM.from_pretrained("reaperdoesntknow/DiscoverLM-70M")
inputs = tokenizer("The triangle inequality guarantees that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Chat Format
The tokenizer includes built-in special tokens for structured generation:
| Token | Role |
|---|---|
| `<\|system\|>` | System prompt boundary |
| `<\|user\|>` | User turn boundary |
| `<\|assistant\|>` | Assistant turn boundary |
| `<\|think\|>` | Internal reasoning start |
| `<\|reasoning\|>` | Reasoning chain marker |
| `<\|bos\|>` | Beginning of sequence |
| `<\|eos\|>` | End of sequence |
| `<\|pad\|>` | Padding |
```python
# Chat-style prompting
prompt = "<|system|>You are DiscoverLM, a small language model with metric attention.<|user|>What is the triangle inequality?<|assistant|><|think|><|reasoning|>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
```
## Mathematical Foundation
The metric attention mechanism is grounded in the Discrepancy Calculus (DISC), a measure-theoretic framework for singularity analysis developed by the author. The triangle inequality regularizer enforces that the learned attention geometry satisfies d(a,c) ≤ d(a,b) + d(b,c) across sampled triples, ensuring the distance function used for attention scoring is a proper metric, not merely a similarity function.
The ball pruning mechanism (learnable per-head origins and radii) creates adaptive sparse attention patterns that emerge from the geometry itself rather than from fixed masking heuristics.
BlackHoleRoPE extends standard rotary position encoding with learned phase perturbations synthesized from a Fourier basis, maintaining the unitary property on Q/K while adding bounded amplitude modulation on V, ensuring position-dependent energy gating stays within Lyapunov-stable bounds.
## Lineage
This architecture derives from research in metric-native neural computation:
- **DISC** – Discrepancy Calculus: measure-theoretic singularity analysis (Colca, 2025)
- **MoA** – Mixture-of-Attentions with triangle inequality enforcement
- **BlackHoleRoPE** – Learned rotary position encoding with bounded energy gating
## Limitations
- Trained on 262K tokens: the architecture works, but this is proof-of-concept scale. Generalization to unseen distributions is not yet validated.
- No eval split was used; training metrics only.
- 8 epochs over 64 batches means the model has seen each example multiple times. Overfitting is likely at this data scale.
- fp32 training only; bf16/fp16 behavior is untested.
## Citation
```bibtex
@misc{CILLC2026discoverLM,
author = {Convergent Intelligence LLC: Research Division},
title = {DiscoverLM-70M: Metric-Attention Mixture of Attentions with Triangle Inequality Enforcement},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/reaperdoesntknow/DiscoverLM-70M}
}
```
## Author
Roy Colca Jr. – [Convergent Intelligence LLC](https://convergentintel.com)
HuggingFace: [reaperdoesntknow](https://huggingface.co/reaperdoesntknow)
---
## Convergent Intelligence Portfolio
*Part of the [Discover Series](https://huggingface.co/reaperdoesntknow) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*
### Related Models
| Model | Downloads | Format |
|-------|-----------|--------|
| [Discovered](https://huggingface.co/reaperdoesntknow/Discovered) | 55 | HF |
| [DiscoverLM-70M](https://huggingface.co/reaperdoesntknow/DiscoverLM-70M) | 107 | HF |
### Top Models from Our Lab
| Model | Downloads |
|-------|-----------|
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 |
| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 |
| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |
**Total Portfolio: 41 models | 2,781 total downloads**
*Last updated: 2026-03-28 12:56 UTC*