---
library_name: transformers
tags:
- trl
- sft
- metric-attention
- mixture-of-attentions
- triangle-inequality
- blackhole-rope
- discrepancy-calculus
- discover
license: cc
datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- openbmb/UltraData-Math
- yahma/alpaca-cleaned
language:
- en
pipeline_tag: text-generation
---

# DiscoverLM-70M-Base

A 70M-parameter causal language model built on the **Mixture-of-Attentions (MoA)** architecture: distance-based metric attention trained, via an explicit regularizer, to respect the triangle inequality rather than merely approximate it.

Every attention head operates in a proper metric space. The geometry is enforced, not hoped for.

## What Makes This Different

Standard transformers compute attention as a dot product, Q·Kᵀ. This has no geometric meaning: it is a bilinear form, not a distance. Two tokens can be "close" by dot product while violating basic metric properties.

MoA replaces this with the **negative squared distance** under a learned diagonal Mahalanobis metric. Because a squared distance does not itself satisfy the triangle inequality, MoA enforces it with a regularizer over random triples sampled during training. The result: attention weights reflect actual geometric proximity in a space where d(a,c) ≤ d(a,b) + d(b,c) holds.
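The scoring rule and the sampled-triple regularizer described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the description, not the model's actual code; the names `metric_scores` and `triangle_penalty` are invented for the sketch, and the learned diagonal metric is frozen at ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def metric_scores(q, k, scale):
    """Attention logits as negative squared distance under a diagonal
    Mahalanobis metric with per-dimension weights `scale`."""
    diff = q[:, None, :] - k[None, :, :]        # (T, T, d) pairwise differences
    return -np.sum(scale * diff**2, axis=-1)    # (T, T) logits

def triangle_penalty(x, scale, n_triples=256):
    """Hinge penalty over random triples: positive only where the scoring
    distance violates d(a, c) <= d(a, b) + d(b, c)."""
    a, b, c = rng.integers(0, x.shape[0], size=(3, n_triples))
    def d(i, j):
        return np.sum(scale * (x[i] - x[j])**2, axis=-1)
    return np.mean(np.maximum(0.0, d(a, c) - d(a, b) - d(b, c)))

tokens = rng.normal(size=(16, 8))               # 16 token vectors, 8 dims
scale = np.ones(8)                              # learned in the real model
logits = metric_scores(tokens, tokens, scale)   # diagonal is exactly 0
penalty = triangle_penalty(tokens, scale)       # feeds the regularizer loss
```

Because the raw scoring quantity is a *squared* distance, the inequality is not automatic; the sampled-triple penalty is what pushes the learned geometry toward satisfying it.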

This isn't a constraint that fights the model. It's structure the model uses.

## Architecture

```
Input → Token Embedding (48K vocab, custom tokenizer)
                       │
                       ▼
┌──────────────────────────────────────────────────┐
│                  MoA Block × 4                   │
│                                                  │
│ ┌─────────┐ ┌──────────┐ ┌────────┐ ┌────────┐   │
│ │ Local   │ │ Global   │ │Channel │ │  MQA   │   │
│ │ Conv    │ │ Metric   │ │  Mix   │ │ Metric │   │
│ │         │ │(64 heads)│ │        │ │ (64 Q) │   │
│ └────┬────┘ └────┬─────┘ └───┬────┘ └───┬────┘   │
│      └──────┬────┴───────────┴──────────┘        │
│             ▼                                    │
│  Feature Gates + Token Router (top-2)            │
│             ▼                                    │
│  Residual + DropPath                             │
└──────────────────────┬───────────────────────────┘
                       ▼
        HyperFFN (SwiGLU + CausalConv + LowRank)
                       ▼
                   LayerNorm
                       ▼
┌──────────────────────────────────────────────────┐
│             MoA Language Model Head              │
│  (same 4-path mixture → SwiGLU → tied vocab)     │
└──────────────────────┬───────────────────────────┘
                       ▼
                Logits (48,000)
```

### Core Components

**Metric Attention.** Queries attend to keys via a learned Mahalanobis distance. Each of the 64 heads has an 8-dimensional head space with its own diagonal scaling, a learnable ball origin, and an adaptive radius for sparse pruning. Pairs outside the ball are masked before the softmax.
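The ball-pruning step admits a compact sketch. This is one simplified reading of the description: query–key pairs are dropped when their metric distance exceeds the head's radius, and the learnable ball origin is omitted for brevity. All names here are invented for illustration, not the model's API.

```python
import numpy as np

def ball_pruned_attention(q, k, v, scale, radius):
    """One head of metric attention with ball pruning (illustrative).

    Pairs whose squared Mahalanobis distance exceeds `radius` are masked
    before the softmax, on top of the usual causal mask."""
    T = q.shape[0]
    diff = q[:, None, :] - k[None, :, :]
    dist2 = np.sum(scale * diff**2, axis=-1)        # (T, T) squared distances
    causal = np.tril(np.ones((T, T), dtype=bool))
    keep = causal & (dist2 <= radius)
    keep[np.arange(T), np.arange(T)] = True         # a token always sees itself
    logits = np.where(keep, -dist2, -np.inf)        # score = negative distance
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 4))
out, w = ball_pruned_attention(q, q, rng.normal(size=(8, 4)), np.ones(4), radius=2.0)
```

Shrinking `radius` makes the attention pattern sparser without any fixed masking heuristic: sparsity falls out of the geometry.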

**Mixture-of-Attentions Routing.** Four parallel paths per token: local depthwise convolution, full multi-head metric attention, gated channel mixing, and multi-query metric attention. A learned router selects the top-2 paths per token position. Feature gates scale each path's output before mixing.
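The routing step can be sketched as follows. Shapes, names, and the exact renormalization (softmax over the two selected scores) are assumptions for illustration; the real router is learned jointly with the paths.

```python
import numpy as np

def route_top2(path_outputs, router_logits, gates):
    """Per-token top-2 routing over path outputs with feature gates.

    path_outputs: (P, T, d), router_logits: (T, P), gates: (P,)."""
    P, T, d = path_outputs.shape
    top2 = np.argsort(router_logits, axis=-1)[:, -2:]   # (T, 2) chosen path ids
    out = np.zeros((T, d))
    for t in range(T):
        sel = top2[t]
        w = np.exp(router_logits[t, sel] - router_logits[t, sel].max())
        w /= w.sum()                                    # softmax over the top-2 only
        for wi, p in zip(w, sel):
            out[t] += wi * gates[p] * path_outputs[p, t]
    return out

rng = np.random.default_rng(2)
paths = rng.normal(size=(4, 6, 8))                  # 4 paths, 6 tokens, 8 dims
logits = np.tile([9.0, 0.0, 0.0, -9.0], (6, 1))     # path 0 dominates everywhere
mixed = route_top2(paths, logits, np.ones(4))       # ≈ paths[0] at every token
```

Only two of the four paths contribute to each token's output, so compute and gradient flow stay sparse even though all paths exist in the layer.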

**BlackHoleRoPE.** Rotary position encoding with learned phase perturbations drawn from a compact Fourier basis. Q/K rotations stay unitary; V amplitudes receive bounded energy gating clamped to [0.5, 2.0], with optional discrepancy-state modulation.
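A minimal sketch of this idea, under stated assumptions: `phase_coeffs` stands in for the learned Fourier coefficients and `v_gain` for the raw energy gate; both names, and the exact form of the basis, are invented here. The two properties the card claims are visible directly: rotations preserve Q/K norms, and the V gain is clamped.

```python
import numpy as np

def blackhole_rope(q, k, v, phase_coeffs, v_gain):
    """Rotary encoding with a learned phase perturbation and bounded V gating."""
    T, d = q.shape
    half = d // 2
    pos = np.arange(T)[:, None]                           # (T, 1)
    inv_freq = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = pos * inv_freq                               # (T, half) base RoPE angles
    # Learned perturbation from a compact Fourier basis over the base angles.
    K = len(phase_coeffs)
    harmonics = np.arange(1, K + 1)[:, None, None]        # (K, 1, 1)
    angles = angles + np.einsum("k,kth->th", phase_coeffs, np.sin(harmonics * angles))
    cos, sin = np.cos(angles), np.sin(angles)
    def rotate(x):                                        # unitary: norms preserved
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    gain = np.clip(v_gain, 0.5, 2.0)                      # bounded energy gating on V
    return rotate(q), rotate(k), v * gain

rng = np.random.default_rng(3)
q, k, v = rng.normal(size=(3, 6, 8))
qr, kr, vr = blackhole_rope(q, k, v, np.array([0.1, 0.05]), v_gain=3.5)
```

Since rotations never change vector length, the perturbed phases cannot blow up Q/K energy; only the clamped V gain can rescale the values, and only within the stated bounds.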

**HyperFFN.** A three-branch feedforward: SwiGLU channel MLP, causal depthwise separable convolution, and gated low-rank bottleneck, routed per token with top-2 sparse selection.
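The three branches can be sketched individually (the top-2 routing over them mirrors the MoA router above). Branch internals here are guesses from the names; weight shapes, gating placement, and function names are all illustrative, not the model's implementation.

```python
import numpy as np

def swiglu_branch(x, w_gate, w_up, w_down):
    """SwiGLU channel MLP: silu-gated expansion, projected back down."""
    g = x @ w_gate
    return ((g / (1.0 + np.exp(-g))) * (x @ w_up)) @ w_down

def causal_conv_branch(x, kernel):
    """Causal depthwise conv: channel c at step t mixes only x[t-K+1 : t+1, c]."""
    T, d = x.shape
    K = kernel.shape[0]
    padded = np.concatenate([np.zeros((K - 1, d)), x], axis=0)  # left-pad: no lookahead
    return np.stack([np.sum(kernel * padded[t:t + K], axis=0) for t in range(T)])

def low_rank_branch(x, down, up):
    """Gated low-rank bottleneck: project to rank r, silu-gate, project up."""
    h = x @ down
    return (h / (1.0 + np.exp(-h))) @ up

rng = np.random.default_rng(4)
x = rng.normal(size=(10, 8))
kernel = rng.normal(size=(3, 8))        # K=3 depthwise kernel, one column per channel
y_conv = causal_conv_branch(x, kernel)

# Perturbing only the last timestep must leave earlier conv outputs unchanged.
x_future = x.copy()
x_future[-1] += 100.0
y_future = causal_conv_branch(x_future, kernel)
```

The left-padding is what makes the convolution branch safe inside a causal language model: no position ever reads a future token.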

**MoA LM Head.** The vocabulary projection runs its own mixture-of-attentions (32 heads, head_dim=16) before projecting to logits through a SwiGLU transform. It is weight-tied to the input embedding.

## Parameter Budget

| Component | Parameters | % |
|---|---|---|
| Token embedding (tied) | 24.6M | 35.5% |
| MoA blocks × 4 | 28.9M | 41.8% |
| HyperFFN (shared) | 4.2M | 6.1% |
| MoA LM head | 10.8M | 15.6% |
| RoPE + norms | 0.6M | 0.9% |
| **Total** | **69.1M** | |

## vs Standard Transformers

|  | Transformer | MoA |
|---|---|---|
| Attention scoring | Dot product (Q·Kᵀ) | Negative Mahalanobis distance |
| Geometric guarantee | None | Triangle inequality regularized |
| Position encoding | RoPE | BlackHoleRoPE (learned phase + bounded V energy) |
| Attention sparsity | Causal mask only | Ball pruning + top-k routing |
| Head combination | Concatenation | Per-token routed mixture of 4 path types |
| FFN | Single MLP | 3-branch routed (SwiGLU + CausalConv + LowRank) |
| LM head | Linear projection | Full MoA mixture → SwiGLU → tied projection |

## Training

### Data

| Dataset | Domain |
|---|---|
| [Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | Multi-step reasoning |
| [UltraData-Math](https://huggingface.co/datasets/openbmb/UltraData-Math) | Mathematical problem solving |
| [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | General instruction following |

## Usage

```python
from transformers import AutoTokenizer
from MoA import MoAMetricLM, MoAMetricConfig

tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/DiscoverLM-70M")
model = MoAMetricLM.from_pretrained("reaperdoesntknow/DiscoverLM-70M")

inputs = tokenizer("The triangle inequality guarantees that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Chat Format

The tokenizer includes built-in special tokens for structured generation:

| Token | Role |
|---|---|
| `<\|system\|>` | System prompt boundary |
| `<\|user\|>` | User turn boundary |
| `<\|assistant\|>` | Assistant turn boundary |
| `<\|think\|>` | Internal reasoning start |
| `<\|reasoning\|>` | Reasoning chain marker |
| `<\|bos\|>` | Beginning of sequence |
| `<\|eos\|>` | End of sequence |
| `<\|pad\|>` | Padding |

```python
# Chat-style prompting
prompt = "<|system|>You are DiscoverLM, a small language model with metric attention.<|user|>What is the triangle inequality?<|assistant|><|think|><|reasoning|>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
```

## Mathematical Foundation

The metric attention mechanism is grounded in the Discrepancy Calculus (DISC), a measure-theoretic framework for singularity analysis developed by the author. The triangle inequality regularizer pushes the learned attention geometry to satisfy d(a,c) ≤ d(a,b) + d(b,c) across sampled triples, so that the distance function used for attention scoring behaves as a proper metric rather than a mere similarity function.

The ball pruning mechanism (learnable per-head origins and radii) creates adaptive sparse attention patterns that emerge from the geometry itself rather than from fixed masking heuristics.

BlackHoleRoPE extends standard rotary position encoding with learned phase perturbations synthesized from a Fourier basis, maintaining the unitary property on Q/K while adding bounded amplitude modulation on V, so that position-dependent energy gating stays within Lyapunov-stable bounds.
## Lineage

This architecture derives from research in metric-native neural computation:

- **DISC**: Discrepancy Calculus, measure-theoretic singularity analysis (Colca, 2025)
- **MoA**: Mixture-of-Attentions with triangle inequality enforcement
- **BlackHoleRoPE**: Learned rotary position encoding with bounded energy gating
## Limitations

- Trained on only 262K tokens. The architecture works, but this is proof-of-concept scale; generalization to unseen distributions is not yet validated.
- No eval split was used; training metrics only.
- 8 epochs over 64 batches means the model saw each example multiple times, so overfitting is likely at this data scale.
- fp32 training only; bf16/fp16 behavior is untested.
## Citation

```bibtex
@misc{CILLC2026discoverLM,
  author = {Convergent Intelligence LLC: Research Division},
  title = {DiscoverLM-70M: Metric-Attention Mixture of Attentions with Triangle Inequality Enforcement},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/reaperdoesntknow/DiscoverLM-70M}
}
```

## Author

Roy Colca Jr., [Convergent Intelligence LLC](https://convergentintel.com)

HuggingFace: [reaperdoesntknow](https://huggingface.co/reaperdoesntknow)

---

## Convergent Intelligence Portfolio

*Part of the [Discover Series](https://huggingface.co/reaperdoesntknow) by [Convergent Intelligence LLC: Research Division](https://huggingface.co/reaperdoesntknow)*

### Related Models

| Model | Downloads | Format |
|-------|-----------|--------|
| [Discovered](https://huggingface.co/reaperdoesntknow/Discovered) | 55 | HF |
| [DiscoverLM-70M](https://huggingface.co/reaperdoesntknow/DiscoverLM-70M) | 107 | HF |

### Top Models from Our Lab

| Model | Downloads |
|-------|-----------|
| [Qwen3-1.7B-Thinking-Distil](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Thinking-Distil) | 501 |
| [LFM2.5-1.2B-Distilled-SFT](https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT) | 342 |
| [Qwen3-1.7B-Coder-Distilled-SFT](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT) | 302 |
| [Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF) | 203 |
| [Qwen3-1.7B-Coder-Distilled-SFT-GGUF](https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF) | 194 |

**Total Portfolio: 41 models | 2,781 total downloads**

*Last updated: 2026-03-28 12:56 UTC*